Technical Report
NWU-CS-05-12
April 20, 2005
Abstract
Graphs are a widely used and understood visualization. However, they quickly break
down when the visualized data is quite complex, requiring hundreds or thousands of
graphs. Our Histographs system builds on the techniques introduced with Information
Murals [7] to enable meaningful visualization of such data. Histographs map the
frequency of data elements at each display location to luminance, revealing data density
and trends. Our improvements include contrast-weighted histogram equalization to
improve the frequency-luminance mapping, splatting to make outliers visible, a second
derivative modulation to reveal changes in data trends, and the use of line integral
convolution to show local data flow. Different data-to-space mappings can be implemented
interactively. A linked correlation matrix display highlights inter-graph relationships.
Users can zoom in on data, as well as select with shape- and correlation-based brushing.
Histographs are a useful way of obtaining data overviews, and revealing hidden structure
in complex data sets.
Keywords: information visualization, data streams, splatting, histogram equalization, image
processing, market visualization, computer system visualization
Histographs: Interactive Visualization of Complex Data with Graphs
Figure 1: NYSE TAQ trading data for May, 2004. On the left, a log(price) histograph. Dark horizontal stripes indicate more trades in the middle
price range; thinner vertical stripes show increased activity as trading days close and open. On the right, trades in each stock are plotted relative to its
mean (SDMean(log(price))), and price trends are illustrated with line integral convolution (LIC). Monthly and daily trends are much clearer.
in the data space appear as connected line segments joining points on these axes.
Dimensional stacking [9] maps N-dimensional data into two dimensions by choosing two
dimensions and creating graph axes for them, subtracting those dimensions from the data
set, and then recursively embedding smaller graphs inside this graph using the same
technique, until graphs with only one or two dimensions can be created. Histographs
address the dimensionality problem by choosing one axis from the data set and mapping it
to the horizontal axis in N-1 2D graphs, each with one of the remaining N-1 dimensions
mapped to its vertical axis. These axes are then stacked. We illustrate this approach
below with New York Stock Exchange (NYSE) trading data containing thousands of stocks,
and a 262-dimension Windows system performance data set.

High-dimensional or large data is not only difficult to map to the display, but also to
visualize clearly. Often, many data points map to the same display location, obscuring
data visibility. Zhai et al. [16] address this occlusion problem with translucence, which
blends the colors of overlapping data objects. This is effective if the number of data
objects at any pixel is limited. Jerding and Stasko's IM system [7] permitted large
graphs to be scaled down to small display sizes by mapping the number of data points at
destination pixels to luminance. Trutschl et al. [13] reposition occluded points using a
smart jittering algorithm. Although none of these techniques was designed to handle a
very large number of occlusions, the frequency-luminance (FL) mapping used by IM is
easily adapted to this purpose. Wegman & Luo [14] and Artero et al. [1] use this approach
to resolve the occlusions generated by large numbers of data records in parallel
coordinates projections. Histographs use this approach to resolve the occlusions
generated when high-dimensional data is projected into 2D.

3 BASIC HISTOGRAPHS
Just as a histogram summarizes the distribution of samples using a frequency-space
mapping, a histograph summarizes a large number (a "stack") of graphs using an FL
mapping. To generate a histograph, the data elements in each 2D graph $G_i$ are organized
and counted into a 2D data frequency image $F_{i,x \times y}$, with resolution
corresponding to display resolution. The images for all the graphs are then summed,
forming a composite frequency array $F_{x \times y} = \sum_i F_{i,x \times y}$. This
array then becomes input for the FL mapping (see Section 4).

The data-space mapping in each graph $G_i$ is formed by choosing one of the N dimensions
in the data set and mapping it to the horizontal axis in all graphs. We call this the
abscissal dimension. In all our work to date, we use time as our abscissal dimension,
meaning that we treat our data as time series [2][15]. We currently use a simple linear
mapping of time to the horizontal axis. We map each of the remaining N-1 ordinal
dimensions to the vertical axis on one of the N-1 graphs in the stack. Users can select
interactively from various vertical mappings, including linear mappings between the
Global minimum and maximum data values over all the N-1 ordinal dimensions, and between
the Local minimum and maximum data values in the single ordinal dimension corresponding
to the current graph $G_i$. Users can also transform the data interactively before it is
mapped. Three particularly useful transforms are log(), deriv() and gain(). The last of
these subtracts the first data value in abscissal order in the current graph $G_i$ from
all subsequent local data values. For time series, this means subtracting the oldest
value from all subsequent values (Figures 3, 5 and 10).

When each of the N-1 ordinal dimensions is measured using the same units (e.g. trades for
thousands of stocks measured in US dollars), these data-space mappings are quite useful.
They are less useful when units differ, as they do in our Tlab Windows performance data
set. For such cases, our system includes an SDMean mapping. This operation finds a local
mean $\mu_i$ and a local standard deviation $\sigma_i$ for each graph $G_i$ and the data
in its matching ordinal dimension. Data is then linearly mapped between $-4\sigma_i$ and
$4\sigma_i$, with $\mu_i$ located at the center of $G_i$'s vertical axis. Data values
outside this range are clamped to the range. This mapping places main trends at the
middle of each graph's vertical dimension, and outliers above and below it (Figures 1, 2,
6-9 and 11-12).

To allow flexible, interactive zooming, we construct a temporal hierarchy on the data.
This hierarchy is built bottom-up during
precomputation from leaves that sample time regularly. Data points need not be so
regularly spaced, so when a leaf contains multiple points, the points are aggregated
using a simple or weighted (e.g. trading volume) average. This process continues
recursively, with children being aggregated into their parents until the entire hierarchy
is filled.

Figure 3: gain (and loss) in stock price for Dec 7, 2004. On the left, CWE defines the luminance mapping; on the right, histogram
equalization does. CWE has higher overall contrast while preserving visibility of outliers and high-frequency data features.

Figure 4: NYSE data for Dec 1, 2004. Left, a histograph with splatting. Right, the same histograph without splatting. Note the
increased visibility of splatted outliers.

Two data sets drive our work and illustrate this paper. The NYSE TAQ data set records
every trade made on that Wall Street market, to a temporal resolution of one second. Each
data record includes the ticker symbol of the traded stock, the time and date of the
trade, as well as trade price and volume. The Tlab Windows system monitoring data
measures the performance of a small cluster of PCs in a departmental student lab. The
data includes 29 different performance measures for each of the nine machines in the
cluster, recorded at a 1 Hz rate for two months in 2001. The measures include processor
user time, memory usage, and the number of sent TCP packets.

4 FREQUENCY-LUMINANCE MAPPINGS
The FL mapping is central to the utility of histographs, and should effectively reveal
variation in data density across the graph stack, even when data in certain regions of
the stack is quite sparse.

4.1 Contrast-Weighted Histogram Equalization
In the attempt to maximize contrast and minimize quantization, visualizations (e.g. [1])
typically map data to luminance by equating the lowest and highest data values to the
lowest and highest luminance values, and using a linear mapping between them.
Unfortunately, if the highest and lowest data values are outliers, this linear mapping
causes uneven distribution of actual data values across luminance, resulting in poor
contrast (Figure 2).

In image processing, histogram equalization (HE) [11] addresses a similar problem by
accepting an input luminance image and producing a mapping of each distinct luminance in
that image to a new luminance. The differences between output luminances are proportional
to the frequency with which each original luminance occurs in the input image. To apply
HE to our FL mapping problem, we replace the input luminance image with an image-sized
array of data frequencies. HE then eliminates sensitivity to data outliers, but can
overemphasize small differences in data frequency, simply because they are common
(Figure 3).

Our solution is to adjust HE's mapping to reflect the contrast between input values,
rather than the frequency with which they occur. We call the resulting algorithm
contrast-weighted equalization (CWE). When applied to FL mapping, CWE accepts an
image-sized array of data frequencies $F_{w \times h}$ and outputs a mapping $FL_{|F|}$,
where |F| is the number of distinct data frequencies in $F_{w \times h}$. At completion,
$FL_{|F|}$ contains an increasing set of device-independent luminances in the range
[0,1]. As a first step, it produces an intermediate array of summed contrasts $C_{|F|}$:

    DistinctF_{|F|} = a list of all distinct frequencies in F_{w×h}
    Set all elements in C_{|F|} to 0
    For each element F in F_{w×h} do
        For each element N of F's 8 neighbors do
            C_{|F|}[F] += |F - N|
        End for
    End for

    FL_{|F|}[0] = 0
    SumC = 0
    TotalC = sum of elements in C_{|F|}
    For i = 1 to |F|-1 do
        SumC += C_{|F|}[DistinctF_{|F|}[i]]
        FL_{|F|}[i] = SumC / TotalC
    End for

The results of CWE can be seen in Figures 2 and 3. Note that if the inner loop in the
first half of this pseudocode is replaced by the simple operation C_{|F|}[F] += 1, then
the above is equivalent to HE.

To ensure that all data frequencies in our histographs are visible, we often find it
useful to transform the FL mapping to a device-dependent luminance range with a non-zero
minimum. When the number of data frequencies |F| in the input is less than the device's
luminance resolution, we ensure that each input frequency maps to a distinct output
luminance.

4.2 Splatting to Increase Visibility of Isolated Data
Even with a good FL mapping based on CWE, isolated data elements can be hard to see
(Figure 4). This occurs when data elements represent outliers, or after user zooming has
reduced data density in the current histograph. Such isolated data elements can be
important precisely because they are unusual or isolated, and should be visible even when
the FL mapping quite appropriately makes them low-contrast features in the visualization.
To address this problem, we add lower spatial frequencies to isolated data elements using
splatting, increasing the visual salience of these elements without distorting the FL
mapping. These goals are quite different from those of van Liere and de Leeuw [10], who
use splatting uniformly throughout their graphs to blur and reveal global structure.

To focus splatting on low-contrast, spatially isolated data elements, our splats are
adaptive both to the number of data elements nearby, and to the luminance of the data
elements themselves. We begin with a simple neighborhood search around each pixel
$P_{xy}$ containing data elements to find the radius $r_k$ that defines the circular
neighborhood around the pixel that contains
exactly k other data elements (we currently use k = 4). The splat is then defined by the
exponential function

$$0.75 \cdot \mathrm{lum}(P_{xy}) \cdot \exp\!\left( \frac{\mathrm{Dist}(P_{xy}, Q_{xy})}{r_k} \cdot \log\bigl(1 - \mathrm{lum}(P_{xy})\bigr) \right)$$

where lum() returns the luminance of a pixel, Dist() the distance between two pixels, and
$Q_{xy}$ is the pixel being shaded. Note that the exponential falls off both as a
function of the local sparseness of data elements $r_k$, and the luminance of the
splatted pixel $P_{xy}$. When two splats overlap, the maximum luminance is applied;
splats are not additive. Splats never affect the luminance of pixels containing data
points.

5 VISUALIZING HIGHER-ORDER TRENDS
By interpolating between data points, line graphs visualize an approximation of slope,
giving viewers a sense of trend and flow. As dense scatterplots, simple histographs do
not visualize these higher-order data characteristics. We have implemented a number of
improvements to address these shortcomings.

Figure 5: gain in stock price in May 2004, colored to reveal local price trends. Red shows falling prices, while green shows rising
prices. High saturation indicates a broader trend. This histograph reveals many events in the middle of the trading day.

Figure 6: trading for Dec 1, 2004. LIC reveals pricing trends in the lower price range between 10 and 11AM.

Figure 7: Tlab Windows monitoring data for May 1-4, 2001, visualized relative to each parameter's mean. Left, data frequency determines
luminance; right, frequency is modulated by 2nd derivative, highlighting several system events.
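
As an illustration of the contrast-weighted equalization (CWE) pseudocode in Section 4.1, the following Python sketch computes the frequency-to-luminance mapping for a small frequency array. It is a minimal sketch rather than the system's implementation, and it rests on several assumptions that are not stated in the text: the frequency array is a list of rows, the distinct frequencies are processed in increasing order, border pixels use only their in-bounds neighbors, and the names `cwe_mapping` and `weight_by_contrast` are hypothetical. Setting `weight_by_contrast=False` applies the `+= 1` variant that Section 4.1 describes as equivalent to HE.

```python
def cwe_mapping(freq, weight_by_contrast=True):
    """Map each distinct frequency in a 2D array to a luminance in [0, 1].

    Follows the Section 4.1 pseudocode; the names and the border handling
    are assumptions made for this sketch.
    """
    height, width = len(freq), len(freq[0])
    # Assumption: distinct frequencies are taken in increasing order.
    distinct = sorted({value for row in freq for value in row})
    contrast = {value: 0 for value in distinct}

    for y in range(height):
        for x in range(width):
            f = freq[y][x]
            if not weight_by_contrast:
                contrast[f] += 1  # plain histogram equalization variant
                continue
            # Sum absolute differences to the (up to) 8 in-bounds neighbors.
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (dy or dx) and 0 <= ny < height and 0 <= nx < width:
                        contrast[f] += abs(f - freq[ny][nx])

    total = sum(contrast.values())
    mapping = {distinct[0]: 0.0}  # FL[0] = 0
    running = 0
    for value in distinct[1:]:
        running += contrast[value]
        mapping[value] = running / total if total else 0.0
    return mapping


if __name__ == "__main__":
    frequencies = [
        [0, 0, 0, 0],
        [0, 1, 2, 0],
        [0, 2, 9, 0],
        [0, 0, 0, 0],
    ]
    print(cwe_mapping(frequencies))
    print(cwe_mapping(frequencies, weight_by_contrast=False))
```

Because the sketch follows the pseudocode literally, the contrast accumulated for the lowest frequency counts toward TotalC but not toward SumC, so the brightest output luminance can fall below 1.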
Figure 9: Tlab Windows monitoring data for May 2001 with a linked correlation matrix view on the right. Green indicates positive correlation,
red negative correlation, with correlation strength mapped to color luminance. Each matrix row or column shows correlations of one system
parameter to all others. Brushing a correlated range in the matrix image selects similar data streams in parameter view.
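
The color encoding described in the Figure 9 caption can be sketched as follows. This is a minimal illustration, not the system's code. The text does not say which correlation measure is used, so Pearson correlation is an assumption here, as are the function names and the RGB encoding, in which channel intensity stands in for luminance.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumed measure)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant stream has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(streams):
    """Each row (or column) holds one stream's correlation to all others."""
    count = len(streams)
    return [[pearson(streams[i], streams[j]) for j in range(count)]
            for i in range(count)]


def correlation_color(r):
    """Green for positive, red for negative; intensity tracks strength."""
    strength = min(1.0, abs(r))
    return (0.0, strength, 0.0) if r >= 0 else (strength, 0.0, 0.0)


if __name__ == "__main__":
    # Hypothetical parameter streams, one list per system parameter.
    streams = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
    for row in correlation_matrix(streams):
        print([correlation_color(r) for r in row])
```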