Data cataloging and visualization: two imperatives in today’s organizations

With the proliferation of data, cataloging becomes a necessity for all large companies. Once this system is in place, using data visualization techniques that tell a story can bring great benefits. Here, Priya Iragavarapu of AArete discusses modern data cataloging systems and three factors in designing good data visualization.

Data cataloging: more than just a “nice to have”

As the volume and variety of data increase exponentially, so does the importance of data catalogs and data visualization. The uncontrolled growth of data with evolving attributes poses a significant challenge: it makes metadata management increasingly difficult.

Enterprise data management is particularly affected by this data glut. With complex nested data attributes, it is difficult for stakeholders to take a snapshot of the data, explore the metadata, and then create a data catalog or business glossary that can be reused as a reference indefinitely.

Data cataloging is therefore not just an unavoidable imperative; it increasingly needs to happen in real time, with datasets mined continuously to identify metadata. These data catalogs do two things: document metadata accurately and efficiently, and flag any anomalous metadata so discrepancies can be investigated.
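As a minimal sketch of those two duties (all function and field names here are illustrative, not from any specific product), a catalog entry can be reduced to a per-column profile, and anomaly flagging to a comparison between the current profile and a cataloged baseline:

```python
from collections import Counter

def profile(rows):
    """Document metadata: per-column observed value types and null counts."""
    prof = {}
    for row in rows:
        for col, val in row.items():
            entry = prof.setdefault(col, {"types": Counter(), "nulls": 0})
            if val is None:
                entry["nulls"] += 1
            else:
                entry["types"][type(val).__name__] += 1
    return prof

def flag_anomalies(baseline, current):
    """Flag columns whose metadata has drifted from the cataloged baseline."""
    flags = []
    for col, entry in current.items():
        if col not in baseline:
            flags.append(f"new column: {col}")
        elif set(entry["types"]) - set(baseline[col]["types"]):
            flags.append(f"type drift in: {col}")
    for col in baseline:
        if col not in current:
            flags.append(f"missing column: {col}")
    return flags

baseline = profile([{"id": 1, "amount": 9.5}])
current = profile([{"id": "A-1", "amount": 3.2, "region": "EU"}])
# Flags the id column's type drift (int -> str) and the new region column.
print(flag_anomalies(baseline, current))
```

A real catalog would persist these profiles and run the comparison on a schedule or on every ingest, but the core loop is the same: profile, compare, flag.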

Another reason for the increased need for data cataloging is the prevalence of cross-collaborative hybrid teams with dotted-line connections within matrix organizations. Each team, throughout the data lifecycle, must understand the data beyond its immediate field of expertise to perform its function effectively. Operated this way, data cataloging also supports data lineage: teams can follow how the catalog evolves and changes at each stage of the data pipeline.

Organizations should look for data cataloging solutions with the following key features. First, the data cataloging solution must be able to explore data automatically and dynamically detect data attributes, data types, and data profiles. Additionally, many leading solutions integrate user input to create a data dictionary or business glossary. Desirable data cataloging programs are also capable of translating statistics into user-friendly visuals. Finally, a robust data catalog solution should not just display metadata, but allow users to take action based on that information.

However, there are tradeoffs when comparing the new augmented data catalog capabilities to more traditional approaches. The traditional approach is to create a custom script to analyze the data and write metadata-relevant data to a table for further analysis.
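A stripped-down sketch of that traditional approach might look like the following, where a script profiles a dataset and writes the metadata to a catalog table (the table and column names are hypothetical, and SQLite stands in for whatever database the organization uses):

```python
import sqlite3

def catalog_to_table(conn, dataset_name, rows):
    """Analyze a dataset and write its metadata to a table for further analysis."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS data_catalog (
               dataset TEXT, column_name TEXT, inferred_type TEXT,
               row_count INTEGER, null_count INTEGER)"""
    )
    columns = {}
    for row in rows:
        for col, val in row.items():
            stats = columns.setdefault(col, {"type": None, "nulls": 0})
            if val is None:
                stats["nulls"] += 1
            else:
                stats["type"] = type(val).__name__
    for col, stats in columns.items():
        conn.execute(
            "INSERT INTO data_catalog VALUES (?, ?, ?, ?, ?)",
            (dataset_name, col, stats["type"], len(rows), stats["nulls"]),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
catalog_to_table(conn, "claims", [{"id": 1, "payer": "ACME"},
                                  {"id": 2, "payer": None}])
for entry in conn.execute("SELECT * FROM data_catalog"):
    print(entry)
```

The limitation the article notes is visible here: someone has to decide when and how often this script runs, which is exactly the batch-processing drawback that real-time crawlers aim to remove.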

Keeping track of when and how often the script runs is also a largely manual process, and it carries the usual drawbacks of batch processing. The most sophisticated custom solutions use real-time crawlers that determine metadata and detect changes as they happen, which is ideal for many low-latency applications. However, these advanced data cataloging solutions pose challenges of resources, computational complexity, and cost.

Complex programs can also pose a security risk. The systems that offer the most opportunities for automated discovery cause the most concern among operational IT professionals, who are asked either to open their firewall so a cloud-based solution can reach internal data or to install a new on-premises system.

If these concerns deter an organization from adopting the modern approach, there are many off-the-shelf products that organizations can leverage for data cataloging. How well these integrate depends on the technology stack and legacy systems present within the organization. Each organization needs to identify where it falls on the spectrum from building a custom solution to using an out-of-the-box product; it all depends on the nature of the data and the needs of the organization.

Data visualization: it should tell a story

Once a data cataloging system has been chosen and implemented, organizations need to determine how to get the most out of that data.

Data visualization technology has advanced significantly over the past decade, producing advanced software such as Tableau, Power BI, Qlik, Looker, and IBM Cognos. Modern tech companies are eager to incorporate data visualization into their practices, but many struggle to choose a program that best suits their needs. Here are several aspects that organizations should consider before deciding on a data visualization tool.

Size and source of data to visualize

The first consideration is the size and the source of the data. These qualities determine which software is appropriate and whether two tools should be combined to meet the organization’s data visualization needs. For example, suppose a company stores its data in cold storage such as Amazon S3 and connects that storage directly to Tableau. Even though Tableau provides this connector, the performance of the visualization task will suffer: Tableau is a great visualization tool, but putting the query load on it affects performance and latency. In this case, Qlik is a much better fit because its built-in query engine can run queries efficiently against large datasets in cold storage. This is not a criticism of Tableau; it simply means that users should assess the strengths and weaknesses of each visualization tool and align them with their organization’s goals.

Organization technology stack

Another factor is the technology stack of the organization. This should be carefully considered before committing to any individual data visualization tool. For example, an organization may already have invested in Azure Cloud, the IBM ecosystem, or a different technology stack of its choice. If a company uses the IBM ecosystem, it would make sense to use IBM Cognos; if the organization uses Azure Cloud, Power BI would be the smarter choice. Mixing and matching tools makes most sense when there is no unified, one-stop tech-stack strategy, and most tools are built with connectors that make such combinations possible.

The extent of data pre-processing required

The final factor to consider is data pre-processing. Ideally, the visualization tool should query data directly and be able to filter, sort, and aggregate it within the tool. If the required pre-processing is complicated, it puts extra load on the data visualization program, which affects performance. Heavy pre-processing engineering work should therefore be done outside of the tool. An assortment of pre-processing tools match their data visualization counterparts; Tableau, for example, pairs with Tableau Prep. By carefully considering the extent of data preparation required, the user can predict the performance of data visualization and the rate at which data is visualized.
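The idea of pushing that work upstream can be sketched as follows: raw records are pre-aggregated outside the visualization layer, so the tool only loads one compact row per group rather than the full dataset (the field names are hypothetical, for illustration only):

```python
from collections import defaultdict

def preaggregate(records, group_key, value_key):
    """Pre-aggregate raw rows outside the visualization tool,
    producing a small extract the tool can load directly."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[group_key]] += rec[value_key]
    return [{"group": g, "total": t} for g, t in sorted(totals.items())]

raw = [
    {"region": "East", "sales": 120.0},
    {"region": "West", "sales": 80.0},
    {"region": "East", "sales": 30.0},
]
# One row per region, ready for the visualization layer to chart.
extract = preaggregate(raw, "region", "sales")
print(extract)
```

In practice this step would live in a dedicated preparation tool or pipeline, but the principle is the same: the visualization program receives an extract sized for charting, not for computation.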

In addition to the considerations above, organizations choosing data visualization initiatives should recognize that color, chart type, and visualization type choices determine the impact that data visualization will have on their business. The most effective data visualization solutions combine art and science.

More importantly, powerful data visualization software doesn’t just create scatter plots, heatmaps, pie charts, or bar charts; it tells a story. Industry leaders rely on these tools because they can create story arcs without sacrificing the ability to experiment with multiple approaches. As data visualization technology advances, these trends will become increasingly apparent, with leading companies using visualization tools to develop data products that are increasingly in line with consumer demand.

With the proliferation of data comes potential benefits and serious liabilities for organizations. To be more effective, they need systems to understand what they have, ensure data is up-to-date and retrievable, and turn data into visualizations that help tell a story. There are many tools that, used wisely, can help organizations achieve all of these goals. They need to know what to use and how to use it.

How up-to-date is your organization’s data visualization and cataloging? Let us know on Facebook, Twitter, and LinkedIn.
