The importance of interactive data visualization

In this blog post, we will see how to make interactive visualizations using the popular JavaScript library D3. We will go through the different visualization techniques before coding an interactive plot called Parallel Coordinates. The goal is to demonstrate how effectively interactive visualization conveys the message hidden in the data, especially in high-dimensional data.

Why interactive visualization is important

We perceive most information visually. We can instantly recognize and localize objects, and we group them into the same category regardless of their shape, size, color, distance, or surrounding clutter. Most of the time, the easiest way to understand a concept is to make a sketch. This is due to the immense sophistication of the human visual cortex: it is in our nature to comfortably consume and understand visuals.

The same principles hold when we interpret tabular and highly structured data. We draw 2D or 3D plots, histograms, scatter plots, heat maps, and so on, because the tabular representation of the data is not straightforward to understand. On top of this, we make the plots interactive for a more immersive experience. However, we can perceive information in at most three spatial dimensions. Representing data with more than three dimensions is a challenging task that calls for different strategies.

For this reason, in this blog post, I will show how to go beyond three dimensions with D3 and how to combine this with interactivity for even better perception. Stay tuned!

The visualization zoo

We can better understand the world and convey a clear message by plotting the data we have, because a plot is the most convenient and intuitive means to do that. In addition, we can easily upgrade the plots and make them interactive. In this way, the human-computer interaction is more immersive and the results are more interpretable. For instance, just take a look at the amazing interactive plots from Our World in Data. Through a simple yet unbiased and profound analysis and interactive visualization of many open-access data sets, they let us clearly comprehend a plethora of the world's phenomena.

Depending on the aspects and the findings to express, we use different types of plots. The most commonly used plots are nicely summarized in the paper A Tour Through the Visualization Zoo. Generally, they are divided into five categories: Time-Series, Statistical Distributions, Maps, Hierarchies, and Networks. To find out the most recent developments and application of these visualization techniques, follow the work of the Interactive Data Lab at the University of Washington.

Figure 1. Credits: D3. From the D3 Gallery of plots.

There are plenty of tools to generate interactive visualizations. They range from very low-level tools that are customizable but hard to automate, to high-level automatic tools. In the open-source domain, we have the well-established low-level JavaScript library D3. The main advantage of D3 is that we can create very specific and custom plots. Furthermore, we have the higher-level visualization tools Vega and Vega-Lite, both open-source. Vega is built on top of D3 and aims to provide a better and faster way to create high-quality graphics using only JSON syntax. Vega-Lite is even more lightweight and enables quick production of the common statistical plots. On top of this ecosystem sits Voyager, a tool that automates the generation of interactive charts. Voyager is inspired by Tableau, a commercial tool widely adopted in industry. This list is of course not exhaustive. There are numerous other tools and libraries, which I include in the Appendix.

Plotting high-dimensional data

Plotting and understanding 2D and 3D data is simple since we live in a three-dimensional world. For this purpose, we use histograms to plot distributions and line charts to draw relations. Going beyond three dimensions is difficult. One option is to encode the extra dimensions with different shapes, sizes, colors, and text, provided that some of the data is discrete and finite. For instance, we can use a Bubble Chart, like the one depicted in the figure below.

Bubble plot showing the CO2 emissions per capita vs GDP per capita

Figure 2. Credits: Interactive data visualization for CO2 emissions from Our World in Data.

In fact, we have five-dimensional data consisting of three numerical dimensions (GDP per capita, CO2 emissions per capita, and population) and two categorical dimensions (country and geo-location). With a simple trick, we represent each country with a circle labeled with the country name, the total population with the size of the circle, and the geo-location with a color scheme.
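The encoding just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the color palette, and the two records are made up and are not taken from Our World in Data. The square-root scaling of the radius is a common design choice that makes the circle area, rather than the diameter, proportional to the population, and it is not necessarily what the original chart does.

```python
import math

# Hypothetical color palette keyed by region (an assumption for illustration).
REGION_COLORS = {"Europe": "blue", "Asia": "red", "Africa": "green"}


def encode_bubble(record, max_population, max_radius=40.0):
    """Map one five-dimensional record onto the visual channels of a bubble."""
    # Scale the radius with the square root of the population so that the
    # circle AREA, not the radius, grows linearly with the population.
    radius = max_radius * math.sqrt(record["population"] / max_population)
    return {
        "x": record["gdp_per_capita"],              # numeric -> horizontal position
        "y": record["co2_per_capita"],              # numeric -> vertical position
        "radius": radius,                           # numeric -> bubble size
        "color": REGION_COLORS[record["region"]],   # categorical -> color
        "label": record["country"],                 # categorical -> text label
    }


# Made-up placeholder records, not real statistics.
records = [
    {"country": "A", "region": "Europe", "population": 1_000_000,
     "gdp_per_capita": 30_000.0, "co2_per_capita": 6.0},
    {"country": "B", "region": "Asia", "population": 4_000_000,
     "gdp_per_capita": 10_000.0, "co2_per_capita": 2.0},
]
max_pop = max(r["population"] for r in records)
bubbles = [encode_bubble(r, max_pop) for r in records]
```

Each record ends up as one circle whose position, size, color, and label together carry all five dimensions.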

So far so good! However, this type of plotting has its limitations. We can't represent a considerably high number of dimensions by using these tricks, especially if they are all continuous. For this reason, we have to use some other means to communicate the data. As mentioned in the paper A Tour Through the Visualization Zoo, we can use a plot called Parallel Coordinates.

The Parallel Coordinates plot enables us to explore and find patterns in high-dimensional data. Each dimension is represented as a vertical axis, parallel to the others, along which the range of its values is laid out. Thus, one point in the high-dimensional space is represented as a poly-line connecting the corresponding values on the parallel axes. Additionally, we can select a subset of values on one or multiple axes to filter the entries. An example is shown in the figure below, which is a visualization of car characteristics.
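To make the geometry concrete, here is a minimal sketch in Python of how one data point turns into a poly-line. It is library-independent and not the D3 code behind the figure. The function name, the column names, and the canvas size are assumptions, while the sample values and axis ranges are taken from the small laptop table shown later in this post.

```python
def polyline(row, axes, width=900.0, height=400.0):
    """Return the (x, y) vertices of the poly-line for one row.

    axes is a list of (column, minimum, maximum) tuples, one per vertical axis.
    """
    step = width / (len(axes) - 1)  # horizontal gap between neighboring axes
    points = []
    for i, (column, lo, hi) in enumerate(axes):
        fraction = (row[column] - lo) / (hi - lo)  # 0 at the minimum, 1 at the maximum
        # Screen y grows downwards, so the maximum is placed at the top (y = 0).
        points.append((i * step, height * (1.0 - fraction)))
    return points


# Three numeric axes with the ranges found in the five-row laptop table.
axes = [("ram_gb", 8, 16), ("weight_kg", 1.37, 2.8), ("price_eur", 745, 2968)]
macbook = {"ram_gb": 8, "weight_kg": 1.37, "price_eur": 1339}
points = polyline(macbook, axes)
```

Drawing a line through `points` from left to right gives the poly-line for that laptop. Repeating this for every row produces the full plot.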

Figure 3. Credits: A Parallel Coordinates plot from Blocks.

Hands-on Parallel Coordinates with D3

The main goal of this post is to demonstrate the effectiveness of the interactive visualization, in particular the Parallel Coordinates plot. For this reason, we will show how to give a visual interpretation of a given problem.

Problem Statement

Suppose we want to buy a laptop with characteristics that suit our needs well. Usually, we do not have a clue about all the possible options. To read the specifications and compare the different offers, we might end up with a huge table that resembles the one below:

Company	Model Name	Operating System	Screen Size	Screen Resolution	RAM Memory	SSD Memory	HDD Memory	CPU Model	CPU Clock Rate	GPU Model	Weight	Price
Apple	MacBook Pro	Mac OS	13.3 in.	2560x1600	8 GB	128 GB	0 GB	Intel Core i5	2.3 GHz	Intel Iris	1.37 kg	1339 Eur.
Dell	Inspiron 3567	Windows 10	15.6 in.	1920x1080	8 GB	256 GB	0 GB	Intel Core i7	2.7 GHz	AMD Radeon	2.2 kg	745 Eur.
Acer	Aspire 7	Linux	15.6 in.	1920x1080	8 GB	0 GB	1024 GB	Intel Core i7	2.8 GHz	Nvidia GeForce	2.4 kg	779 Eur.
MSI	GE63VR 7RF	Windows 10	15.6 in.	1920x1080	16 GB	256 GB	1024 GB	Intel Core i7	2.8 GHz	Nvidia GeForce	2.8 kg	2099 Eur.
Lenovo	ThinkPad P70	Windows 7	17.3 in.	3840x2160	16 GB	512 GB	0 GB	Intel Core i7	2.7 GHz	Nvidia Quadro	2.4 kg	2968 Eur.

Table 1: Tabular representation of the Laptop Prices data set.

Although the table is nicely formatted and summarizes the laptops neatly, we might easily get lost once we scale to a few hundred rows. There are 13 different columns to compare at the same time, which makes it very hard to keep track of everything. Instead, we can visualize all entries in the table using the Parallel Coordinates plot and query them interactively.

Demo

For the purpose of the demo, we use the Laptop Prices data set from Kaggle, which you can find here. It contains 1300 entries with the following 12 columns: Company, Product, Type, Inches, Screen Resolution, CPU, RAM, Memory, GPU, Operating System, Weight, Price.

First, we pre-process the data set in order to clean it and adjust it for plotting, using this Jupyter Notebook. We split the column Memory into two columns, SSD Memory and HDD Memory, and express the quantities only in gigabytes. Similarly, we split the column CPU into two other columns, CPU Model and CPU Clock Rate. The former contains the model of the CPU, while the latter contains its clock rate expressed in GHz. For the GPU column, we only take the part containing the model name. Furthermore, the values in the RAM column contain the suffix GB, which is redundant, so we remove it. Finally, for the Screen Resolution column, we only take the part containing the screen resolution in pixels, discarding the rest of the description. In the end, the data set has the format shown in Table 1.
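The notebook itself is not reproduced here, so the following is only a sketch of what such cleaning steps might look like in plain Python. The raw value formats (for example `256GB SSD +  1TB HDD` or `Intel Core i5 2.3GHz`) and all function names are assumptions made for illustration. The real data set may need extra cases.

```python
import re


def split_memory(memory):
    """Return (ssd_gb, hdd_gb) for a raw Memory string."""
    ssd_gb, hdd_gb = 0, 0
    # Each drive is assumed to look like "<size><GB|TB> <SSD|HDD>".
    pattern = r"(\d+(?:\.\d+)?)\s*(GB|TB)\s+(SSD|HDD)"
    for size, unit, kind in re.findall(pattern, memory):
        gigabytes = int(float(size) * (1024 if unit == "TB" else 1))
        if kind == "SSD":
            ssd_gb += gigabytes
        else:
            hdd_gb += gigabytes
    return ssd_gb, hdd_gb


def split_cpu(cpu):
    """Return (model, clock_rate_ghz); the clock rate is None if it is missing."""
    match = re.fullmatch(r"(.+?)\s+(\d+(?:\.\d+)?)GHz", cpu.strip())
    if match is None:
        return cpu.strip(), None
    return match.group(1), float(match.group(2))


def parse_ram(ram):
    """Drop the redundant GB suffix and return the amount as an integer."""
    return int(ram.strip().upper().replace("GB", ""))


def parse_resolution(description):
    """Keep only the '<width>x<height>' part of the screen description."""
    match = re.search(r"\d+x\d+", description)
    return match.group(0) if match else None
```

For example, `split_memory("256GB SSD +  1TB HDD")` gives `(256, 1024)`, which matches the 1024 GB convention used for the HDD column in Table 1.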

Using the transformed data set, we create the Parallel Coordinates plot. For convenience, we only use 10 columns, assigning each one a vertical axis in the plot. Every axis has a description on top of it and a range of values that depends on the type of data it holds: numeric or string. For numeric data, the range goes from the minimum to the maximum, with equidistant ticks ascending from bottom to top. For strings, the range goes from the first to the last string in alphabetical order, each represented with one tick. Thus, one laptop is fully specified by one poly-line stretching from the leftmost to the rightmost axis. The poly-lines are colored according to the Company value in order to distinguish them easily.
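The two kinds of axes can be sketched as two small scale functions. This is a plain Python illustration of the idea, roughly what `d3.scaleLinear` and `d3.scalePoint` provide in D3, and not the code of the demo. The function names and the axis height are assumptions, and placing the alphabetically first string at the top of the axis is a choice made here, since the text does not fix the direction.

```python
def numeric_scale(values, height=400.0):
    """Linear scale: the minimum maps to the bottom, the maximum to the top."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    # Screen y grows downwards, so the top of the axis is y = 0.
    return lambda v: height * (1.0 - (v - lo) / span)


def string_scale(values, height=400.0):
    """Point scale: one equidistant tick per distinct string, sorted alphabetically."""
    categories = sorted(set(values))
    step = height / max(len(categories) - 1, 1)
    position = {c: i * step for i, c in enumerate(categories)}
    return lambda v: position[v]


def make_scale(values, height=400.0):
    """Pick the scale type from the data held by the column."""
    if all(isinstance(v, (int, float)) for v in values):
        return numeric_scale(values, height)
    return string_scale(values, height)


# Columns taken from Table 1.
price = make_scale([1339, 745, 779, 2099, 2968])
company = make_scale(["Apple", "Dell", "Acer", "MSI", "Lenovo"])
```

With these, `price(745)` lands at the bottom of the axis and `price(2968)` at the top, while each company name gets its own tick.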

The plot is interactive: on each axis, we can select the subset of values we are interested in by dragging the mouse pointer over it. In addition, we can slide this range along the axis. Entries outside the selection are filtered out automatically. To support the search, the table below the plot dynamically updates with the 5 cheapest laptops in the current selection. To reset the selection on an axis, click anywhere on the axis outside the selected range.

Now, the laptop search is much easier and more intuitive. We can set different constraints for every column, see the changes immediately, and compare the few remaining options. Try it out yourself below!
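The logic behind this interaction is simple and can be sketched independently of D3. In the snippet below, a selection is modeled as an inclusive value range per column. That is a simplification, since a real brush works on pixel ranges that are mapped back to values, and string axes are left out. The function and column names are assumptions, and the rows come from Table 1.

```python
def filter_rows(rows, brushes):
    """Keep the rows whose values lie inside every active selection.

    brushes maps a column name to an inclusive (low, high) range.
    An empty dict means that nothing is selected, so every row is kept.
    """
    return [
        row for row in rows
        if all(low <= row[col] <= high for col, (low, high) in brushes.items())
    ]


def cheapest(rows, n=5):
    """Return the n cheapest rows, as shown in the table below the plot."""
    return sorted(rows, key=lambda row: row["price_eur"])[:n]


# Rows taken from Table 1.
laptops = [
    {"company": "Apple", "ram_gb": 8, "weight_kg": 1.37, "price_eur": 1339},
    {"company": "Dell", "ram_gb": 8, "weight_kg": 2.2, "price_eur": 745},
    {"company": "Acer", "ram_gb": 8, "weight_kg": 2.4, "price_eur": 779},
    {"company": "MSI", "ram_gb": 16, "weight_kg": 2.8, "price_eur": 2099},
    {"company": "Lenovo", "ram_gb": 16, "weight_kg": 2.4, "price_eur": 2968},
]

# Select 8 GB of RAM and a weight between 2.0 and 2.5 kg.
selection = {"ram_gb": (8, 8), "weight_kg": (2.0, 2.5)}
result = cheapest(filter_rows(laptops, selection))
```

Here `result` contains the Dell and the Acer laptops, in that order. Resetting a selection corresponds to removing its column from `selection`.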

The main advantage of this plot is that we can plot multi-dimensional data and understand it intuitively. However, if the number of dimensions is very large, we can't plot all of them, since the plot becomes too dense and confusing. Moreover, it is not well suited for categorical data: the number of categories must be small enough to place them all on one axis.

The full code to run this interactive visualization can be found here. For more information please follow me on Twitter.

Conclusion

In this blog post, we saw how important interactive visualization is for understanding our data better and making our searches faster. There are many different types of plots, depending on the nature of the data and the task to perform. One of them is the Parallel Coordinates plot, which enables us to scale easily to more than 3 dimensions. Out of the many existing plotting tools, we used D3 for a hands-on experience: an interactive Parallel Coordinates plot, augmented with a dynamic table, applied to searching for a laptop. This plot has many potential uses, and next time we will see how to apply it in the domain of Machine Learning.

Appendix

JavaScript libraries

The following list includes popular open-source JavaScript libraries and tools for visualization

Chart JS: it offers 8 fully responsive charts for mobile and web developers
Crossfilter: a library optimized for loading and exploring huge data sets, with millions of entries
DC.js: used to create multiple charts on the same data set that update dynamically together
Plotly.js: built on top of D3, it offers more than 40 charts
Cola.js: for plotting graph-based data
Leaflet: for plotting mobile-friendly interactive maps
MetricsGraphics.js: optimized for plotting time-series data

Python Libraries

The following list includes popular open-source Python libraries and tools for visualization

Matplotlib: mainly used for scientific plotting
Seaborn: based on Matplotlib, for drawing more appealing charts
Bokeh: interactive visualizations mainly used for web applications
Altair: a Python variant of the Vega-Lite visualization grammar
Plotly: the Python wrapper of Plotly.js

Learning Resources

Check out how to use the parallel coordinates plot for tracking and visualizing Machine Learning models.

The following list contains an awesome set of resources to learn to plot with D3

Vladimir Ilievski