Multivariate exploration using extended scatter plot

Department of Science and Technology

Linköping University

SE-601 74 Norrköping, Sweden

LiU-ITN-TEK-A--08/127--SE

Nannur Hassan

Master's thesis in Media Technology

at the Institute of Technology, Linköping University

Supervisor: Mikael Jern

Examiner: Mikael Jern

Norrköping 2008-12-17

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

Abstract

The information age has resulted in masses of data with multivariate properties in many domains (e.g., demographic analysis, physical simulations, biochemical data or environmental data analysis). How can this type of data be explored to enable analysis and reveal patterns and features? One way is to visualize data correlations through a scatter plot. The project aims to develop an extended scatter plot in the GAV framework, a component-based class library developed on Microsoft's C# .NET platform using the low-level DirectX graphics library. GAV uses an atomic component approach to increase the customization and scalability of the scatter plot: each atomic component is a layer in which a specific idea or task is implemented, and the scatter plot consists of several such layers.

Some of the tasks implemented to improve a basic scatter plot are: adding support for four numerical dimensions; fitting a linear least-squares regression line that shows a bivariate relationship explicitly; allowing the user to control the focus and context view; revealing internal characteristics and the distribution of the data based on percentile calculation; and incorporating multivariate interpolation, a process of assigning values to unknown points by using values from a (usually scattered) set of known points. Intuitive interaction is also one of the goals of the thesis. Relevant snapshots with implementation details are provided as results, and several ideas for further improvement and development are mentioned.

List of Figures

2.1 The composition of three types of component layers of an application development for NCVA projects
2.2 Layered architecture in GAV
2.3 A typical layout of a layer
2.4 Atomic layer composition of the scatter plot component
3.1 Screen shots of different aspects of glyph rendering
3.2 Brushing, selection and tooltip
3.3 Detail information about selected glyphs
3.4 Regression line plotting from a sample dataset
3.5 Data distribution through percentile background
3.6 Controlling focus and context by attached range sliders
3.7 Results of annotation algorithm
3.8 Uniform gridding phenomenon
3.9 Neighborhood radius illustration
3.10 Interpreting the true inverse distance interpolation algorithm
3.11 Interpreting the mean inverse distance interpolation method
3.12 Interpreting the minimum inverse distance interpolation method
3.13 Tooltip containing interpolated values
4.1 Zoom in and out through an extra rectangular view
4.2 Illustrating the limitations of the algorithm for generating good annotation values
4.3 Interpolation results for a finer grid
5.1 Depicting percentile sliders with edge values

List of Tables

3.1 A sample data set

Acknowledgments

I would like to thank Professor Mikael Jern for the opportunity to work on this project and for his support throughout the work.

My sincere thanks go to my supervisor Tobias Åström for the enormous number of discussions, for his advice and for proofreading.

I would also like to thank my NCVA colleagues and friends for sharing their work experience.

Chapter 1 Introduction

1.1 Preface

This is the report of a master's thesis project carried out at NCVA (National Center for Visual Analytics) at Campus Norrköping, Linköping University, Sweden. The thesis is for a Master of Science degree in Advanced Computer Graphics. It serves as documentation of my work during the study, which was carried out from May 2008 until December 2008.

The GAV framework [8] is a component-based library developed on Microsoft's .NET platform [5] using the low-level DirectX graphics library [2] and C# [6]. The GAV framework has been developed within NVIS¹ for some time and has formed the basis for the visual analytics (VA) effort of its authors. Currently NVIS has the following affiliated research groups [14]:

• Scientific Visualization
• Information & Geo Visualization
• Computer Graphics & Virtual Reality
• Structure & Civil Engineering

In May 2008, Professor Mikael Jern, Linköping University, founded NCVA, an NVIS-affiliated research group that aims to conduct research in the field of information and geo visualization. NCVA implements visual analytics techniques and applications through its partnerships with [9]:

• Industry (Swedish national and international)
• Swedish government agencies
• Academia (Swedish national and international)

Since then, the NCVA team has taken responsibility for the GAV framework and continues its development under a version control system.

¹ Norrköping Visualization and Interaction Studio, Linköping University

1.2 Background

The idea of GAV is to use pre-built components to develop user-specific applications. A two-dimensional scatter plot is such a component and is the focus of this thesis project.

The previously built scatter plot component in GAV was very inflexible, i.e. difficult to extend. This thesis introduces a new extended two-dimensional scatter plot that addresses problems of scalability, usability, effectiveness, uncertainty etc. with innovative, visually driven methods of exploratory data analysis. An atomic layered component architecture has recently been incorporated in some of the components of the GAV framework, and our implementation is based on this layered approach.

1.3 Objective

The analysis of multivariate information is a widespread and important task. Many domains generate and handle data with multiple attributes, e.g. demographic analysis, physical simulations, biochemical data or environmental data analysis. The raw data itself contains a lot of knowledge that is hard to reveal without analysis. Many techniques have been developed to support knowledge discovery, and numerous solutions for the visualization of multivariate data exist. The objective of this thesis is to develop an extended scatter plot in the GAV framework based on the new atomic component architecture. The extended scatter plot component will support the following tasks, implemented as different layers:

• Simple glyphs - Filled and hollow dots, with size and color options.

• Attached range sliders that control focus and context for a two-dimensional scatter plot.

• Regression line

• Background - based on percentile calculation or mean values.

• Gridding sort - A slider selects the grid size; conditional gridding based on other attributes.

Chapter 2 Methods and Tools

This chapter presents the implementation environment and the background necessary to develop the scatter plot component.

2.1 C# and DirectX

C# is a modern object-oriented language for application development. In addition to object-oriented constructs, C# supports component-oriented programming with properties, methods and events. According to Microsoft, .NET is the recommended way to develop GUI applications for Windows. Its managed memory is one of the biggest reductions in development time.

DirectX is a set of multimedia Application Programming Interfaces (APIs) written by Microsoft. It is a collection of Dynamic Link Libraries (DLLs) that contain functions useful to graphics programmers. Since the scatter plot component only draws simple lines, circles and rectangles, the DirectDraw and Direct3D components are enough to provide such primitives.

2.2 Geo Analytics Visualization Framework (GAV)

The GAV framework and class library aims to allow easy use of sophisticated visualization approaches that help bridge the gap between data and mind by taking advantage of human perception capabilities. GAV also addresses the substantial challenges of exploring integrated spatial and temporal multivariate data simultaneously, allowing the analyst to extract complex patterns in large information spaces via visual interaction, inquiry and thinking, and then to communicate the assessment and the knowledge gained for action.

In fact, GAV adopts the following generic tool design and implementation tasks of Geo Visual Analytics [8]:

• Short development time by using already developed and tested components
• Layered component approach for customization, scalability, and re-usability
• Design based on cognitive and perceptual principles
• Easy integration of external user components and layers
• A 3D data model (DataCube) for spatial-temporal and multivariate data
• Analysis with dynamic multiple-linked views
• Visual space-time and multivariate query tools
• Interactions through brushing, picking, highlighting, filtering, range sliders and view coordination
• Integrated mechanism for saving and packaging the result of a VA reasoning process

Figure 2.1. The composition of three types of component layers of an application development for NCVA projects [9].

2.3 Layered Component Architecture in GAV

One major and newly incorporated enhancement in the GAV framework is the layered component way of thinking, shown in Figure 2.1, instead of creating one large application that tries to do everything. In this approach, an application project is divided into a number of functional components (e.g., a scatter plot), and each functional component is assembled from several atomic components or atomic layers, each performing a specific task with embedded interactions. Thus, through an atomic layered component architecture containing several hundred C# classes, GAV offers a wide range of visual representations, from simple to sophisticated (see Figure 2.2). The advantages of the layered approach can be summarized as follows:

• It enables broad applicability, customization, scalability and re-usability of components.
• It increases interoperability, since different developers and partners, working independently, can contribute VA to the GAV components repository.
• It encourages new ideas to be assessed together with standard features without having to write a complete functional visualization component.
• It increases the effectiveness and performance of each layer of the component.
• It simplifies debugging and revision.

Figure 2.2. GAV has an atomic layered component architecture and offers a wide range of visual representations (from simple to sophisticated). With negligible programming effort in C#, comprehensive explorative multiple-linked view prototypes can be assembled rapidly [8].

2.4 The Scatter Plot Rendering Overview

The GAV framework uses the hierarchical layout management of Microsoft's Visual Studio .NET development tool to interactively design a GUI layout with dynamically resizable embedded views in a single coherent window. With this layout management environment, a visual interface can be divided into a number of views separated by interactive splitters. The scatter plot functional component (ScatterPlotExtended class) is attached to a view through a view renderer (ViewManager class) in GAV. The renderer takes a functional component (e.g., the scatter plot) and a C# panel as input parameters. The renderer knows when, where, and how the scatter plot component will be rendered.

The scatter plot component (ScatterPlotExtended class) inherits from its base class, GavComponent. The GavComponent class offers some useful properties and functionalities, such as:

• Adding and removing atomic layers (ComponentLayer class)
• Moving layers (changing their order and positioning them from top to bottom)
• Locking layers (used in the mouse event implementation)
• Manipulating events on those layers - mouse move, mouse down, mouse up etc.
• Controlling layer invalidation (a layer is invalidated only if it is enabled)
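The layer management listed above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in C#, and every name in it (Layer, Component, add_layer, move_layer, invalidate) is hypothetical; it is not the GavComponent or ComponentLayer API, only an illustration of ordered layers where only enabled layers are redrawn.

```python
class Layer:
    """Hypothetical stand-in for an atomic layer (not the GAV ComponentLayer API)."""

    def __init__(self, name, enabled=True):
        self.name = name
        self.enabled = enabled
        self.render_count = 0

    def render(self):
        # A real layer would draw itself here; the sketch only counts calls.
        self.render_count += 1


class Component:
    """Hypothetical stand-in for a functional component that owns ordered layers."""

    def __init__(self):
        self.layers = []  # index 0 is drawn first (bottom), the last index is on top

    def add_layer(self, layer):
        self.layers.append(layer)

    def remove_layer(self, layer):
        self.layers.remove(layer)

    def move_layer(self, layer, new_index):
        # Change the drawing order of an existing layer.
        self.layers.remove(layer)
        self.layers.insert(new_index, layer)

    def invalidate(self):
        # Redraw only the enabled layers and return their names in drawing order.
        drawn = []
        for layer in self.layers:
            if layer.enabled:
                layer.render()
                drawn.append(layer.name)
        return drawn
```

Keeping the drawing order in a plain list makes "moving" a layer a simple reordering, and skipping disabled layers during invalidation is what lets the user switch layers off without removing them.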

Each atomic layer component, as shown in Figure 2.4, inherits from its base class, ComponentLayer. The ComponentLayer class also offers many functionalities (e.g., layer caching and use of the GavControl class). We use the GavControl class functionalities extensively for each layer. Here are some functionalities of the GavControl class:

• Setting up a layer's size and positioning it in a render target (C# panel)
• Setting up a layer's margins (left, right, up, down)
• Handling coordinate transformations among component-absolute, layer-relative and layer-absolute coordinates (see Figure 2.3)

Figure 2.3. Layer layout

The constructor of a layer object (e.g., of the SPAxisAnnotationLayer class) generally takes a model parameter (ScatterPlotModel class).

Some useful properties and functionalities of the SPAxisAnnotationLayer class are as follows:

• It takes a DataCube¹ object as input.
• It stores the values of all attributes, normalizes them and calculates some statistics, e.g. each column's maximum, minimum and average value.
• It holds events, e.g. for when the axis indicator is changed.
• It updates the multivariate attribute mapping.
• It contains some functions that are used by more than one layer. For example, the algorithm for good annotation values is implemented in a function, CalculateGoodValues, that is shared by three layers.
• It provides other helper functions.

¹ The DataCube class stores a three-dimensional float array with some helper methods. Most of the functionality in GAV only supports a DataCube with z = 1, meaning that the depth of the cube is 1. In GAV, parts of the cube are often referred to as rows and columns.

Figure 2.4. Atomic layer composition of the scatter plot component.

Chapter 3 Implementation

As a reminder, the scatter plot functional component is composed of several atomic layer components (shown in Figure 2.4). This chapter is divided into two parts. The first part presents the implementation of those atomic layer components, and the results achieved, using the tools described in Chapter 2. The second part describes a ToolStrip embedded in the scatter plot component for producing user interface elements.

3.1 Atomic Layers

3.1.1 Glyph rendering

Basic scatter plot: Scatter plots are traditionally most appropriate for numeric, real-valued data of low dimensionality. A scatter plot displays the values of two variables at a time using points (glyphs), where the value of one variable determines the relative position of a point along the x-axis and the value of a second variable determines its relative position along the y-axis. Typically, the points or glyphs in a scatter plot are rendered independently of one another. Conventionally, the first column of tabular data is plotted on the horizontal or x-axis and the second column on the vertical or y-axis.

Axis scaling: Each axis represents a single attribute in the data set and corresponds to, for example, a column in an Excel spreadsheet. The scaling of an individual axis would normally range from the attribute's minimum value at the bottom to its maximum value at the top. However, since we want to display meaningful (non-fractional) annotation values, the column's actual minimum and maximum values are adjusted to generate working minimum and maximum values based on the column data itself and the size of the rendering layout.

Plotting glyph: Two types of glyphs can be rendered on this plot: filled and hollow circles. The latter type, shown in Figure 3.1(a), may prevent cluttering of the plot.

Color and size map: Various techniques exist for increasing the visual dimensionality of a scatter plot. A two-dimensional planar scatter plot of variables X and Y can show additional variables by parameterizing more visible characteristics of the plotted points, which then become glyphs. Color is commonly used for this purpose, showing a third dimension via a color map. When the plotted points are colored according to a color map, the corresponding points in another component plot are colored identically (linked). Glyph size can be used similarly, as can various aspects of glyph shape, shown in Figure 3.1(c), which are suitable for categorized data [1] (glyph shape is not implemented here).

Figure 3.1. (a) Plotting hollow glyphs can prevent visual cluttering. (b) A scatter plot containing four numerical dimensions (x-column, y-column, size, color) of a data set. (c) Various aspects of glyph shape for categorized data [1].

For size mapping we can provide a size range (minimum and maximum glyph size in pixels) for a column, and each glyph is plotted in a size that depends on its value. Figure 3.1(b) shows the scatter plot containing four numerical dimensions (x-column, y-column, size, color) of a data set.
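The size mapping described above can be sketched as a linear interpolation between the minimum and maximum glyph size. The linear form and the function name map_size are assumptions made for illustration; the thesis does not state which mapping function the component uses.

```python
def map_size(value, col_min, col_max, size_min, size_max):
    """Map a column value linearly onto a glyph size in pixels.

    Hypothetical helper: col_min/col_max are the column's value range and
    size_min/size_max the user-supplied glyph size range.
    """
    if col_max == col_min:
        # Degenerate column: every glyph gets the middle of the size range.
        return (size_min + size_max) / 2.0
    t = (value - col_min) / (col_max - col_min)  # 0 at col_min, 1 at col_max
    return size_min + t * (size_max - size_min)
```

For a column ranging from 0 to 10 and a size range of 4 to 20 pixels, a value of 5 gives a glyph size of 12 pixels.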

Picking: The picking algorithm generates a list of glyph indices when the mouse hovers over the plot. This list of glyph indices can then be used in various ways:

• Brushing - An interactive method that allows the user to select specific data points or subsets of the data on screen, to identify their common characteristics or for other purposes (shown in Figure 3.2(a)).
• Tooltip - When the user hovers the cursor over the plot, a small "hover box" appears with supplementary information (the user can determine the display text) for the glyphs in the pre-computed list of glyph indices (shown in Figure 3.2(b)).
• Selection details - After selecting data points through brushing, we can get all the attribute information of the selected glyphs by clicking the ToolStripButton 'Selection Details' (shown in Figure 3.3).
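One simple way to produce the list of glyph indices used by brushing and tooltips is a point-in-circle test against every glyph. The sketch below assumes circular glyphs given in pixel coordinates and a brute-force scan; the function name pick and the data layout are hypothetical, and the thesis does not describe the actual picking algorithm in this detail.

```python
def pick(glyphs, mouse_x, mouse_y):
    """Return the indices of all glyphs under the mouse pointer.

    glyphs is a list of (center_x, center_y, radius) tuples in pixels.
    """
    hits = []
    for index, (gx, gy, radius) in enumerate(glyphs):
        # Compare squared distances to avoid a square root per glyph.
        if (gx - mouse_x) ** 2 + (gy - mouse_y) ** 2 <= radius ** 2:
            hits.append(index)
    return hits
```

Returning all hits rather than only the topmost one matches the idea that overlapping glyphs can be brushed or listed in a tooltip together.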

3.1.2 Regression line

Figure 3.2. (a) The points brushed by clicking the left mouse button are colored black. (b) The 'hover box' contents of the tooltip are user-specified.

Figure 3.3. All attribute information of the pre-selected glyphs is shown by clicking the 'Selection Details' button.

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. Say we have a set of data (x_i, y_i), shown in Figure 3.4 (left). If we have reason to believe that there exists a linear relationship between the variables x and y, we can plot the data and draw a 'best-fit' straight line through it. This relationship is governed by the familiar equation y = mx + b. We can then find the slope, m, and the y-intercept, b, for the data, which are shown in Figure 3.4 (right) [12].

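The least-squares slope and intercept are given by

$$m = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} \qquad (3.1)$$

$$b = \frac{\sum y - m \sum x}{n} \qquad (3.2)$$

where n is the number of data points and each sum runs over all of them.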
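Equations (3.1) and (3.2) translate directly into code. The following is a minimal sketch in Python (the component itself is written in C#); the function name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b following Equations (3.1) and (3.2)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        # All x values are identical, so the slope is undefined.
        raise ValueError("cannot fit a line when all x values are equal")
    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y - m * sum_x) / n
    return m, b
```

For the points (0, 1), (1, 3), (2, 5), (3, 7), which lie exactly on a line, the fit returns m = 2 and b = 1.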

Figure 3.4. The table on the left shows a sample data set. The image on the right shows how the regression line is plotted.

3.1.3 Percentile background

We can set a background in the scatter plot component based on percentile calculation, so that the user gets a better depth impression when analysing the data.

Percentile definition: "A percentile is a value at or below which a given percentage or fraction of variable values lie. For a set of measurements arranged in order of magnitude, the p-th percentile is the value that has p% of the measurements below it and (100 - p)% above it" [10]. Thus the 25th percentile is the value such that one fourth of the data lie below it: it is higher than 25% of the data values and lower than 75% of the data values. We have implemented three options (see Figure 3.5):

• 50% - median (2 regions, 1 data break point)
• 33% - 66% (3 regions, 2 data break points)
• 25% - 50% - 75% (4 regions, 3 data break points)

Method: Microsoft Office Excel uses the following method [10] to calculate percentiles. The p-th percentile is defined by

$$y = (1 - g)\, x_{(j+1)} + g\, x_{(j+2)} \qquad (3.3)$$

where

$$(n - 1)\, p = j + g \qquad (3.4)$$

To better understand the method and the notation above, we apply it to a simple example. The data set studied is given in Table 3.1.

variable (x)   x_1   x_2   x_3   x_4
value          2     1     4     3

Table 3.1. A sample data set

Once ordered, Table 3.1 turns into Table 3.2. The ordered values are indexed from 1, as in Equation (3.3).

variable (x)   x_(1)   x_(2)   x_(3)   x_(4)
value          1       2       3       4

Table 3.2. Ordered data set of Table 3.1

The product (n - 1)p from Equation (3.4) is split into j + g, where j is the integer part of (n - 1)p and g is its fractional part; y is the percentile associated with p, and p is expressed as a fraction between 0 and 1.

For the 25th percentile (p = 0.25) of the n = 4 values, (n - 1)p = 3 × 0.25 = 0.75 = 0 + 0.75, so j = 0 and g = 0.75. From Equation (3.3),

y = (1 - 0.75) × x_(1) + 0.75 × x_(2) = 0.25 × 1 + 0.75 × 2 = 1.75

This edge value y is then normalized so that it can be mapped onto the corresponding axis.
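The worked example can be checked with a short sketch of the method in Equations (3.3) and (3.4). It is written in Python rather than C#, and the function name percentile is hypothetical. Because Python lists are indexed from 0, the ordered value x_(j+1) of Equation (3.3) is ordered[j] in the code.

```python
import math


def percentile(values, p):
    """Percentile by linear interpolation, following Equations (3.3) and (3.4).

    p is a fraction between 0 and 1 and values must be non-empty.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = (n - 1) * p          # Equation (3.4): (n - 1) p = j + g
    j = int(math.floor(position))   # integer part
    g = position - j                # fractional part
    if j + 1 >= n:
        # p = 1 (or a single value): there is no upper neighbour to blend with.
        return float(ordered[-1])
    # Equation (3.3); ordered[j] is x_(j+1) because Python indexes from 0.
    return (1 - g) * ordered[j] + g * ordered[j + 1]
```

Applied to the sample data of Table 3.1, the function returns 1.75 for p = 0.25, the value obtained by hand above, and 2.5 for the median.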

Figure 3.5. Screen shots of the percentile background. The left image shows the median percentile background, the middle one the 33% - 66% percentile background and the right one the 25% - 50% - 75% percentile background.

3.1.4 Focus and context

Glyphs are sometimes plotted on top of one another when the amount of data the user works with increases or when data values are very close to each other. This produces visual clutter, often obscures relationships among variables and prevents the user from seeing details. One approach to the problem of presenting data in an insufficient space (the drawing panel) is 'focus and context': some part of the information is presented in detail (the focus) while the rest remains available but is set aside as less significant (the context). An overview and further references can be found in [16].

To achieve this, we place a track bar with two sliders on each axis of the scatter plot component. When the sliders are at the ends of the track bar, the whole set of column values is in focus (100% focus + 0% context). We can move a slider with the mouse to control the focus and context regions, and the glyphs move accordingly. Both sliders display their values on annotation labels, which describe the range they currently cover. By interacting with both the horizontal track bar of the x-axis and the vertical track bar of the y-axis, we can select any region of interest as the focus. Figure 3.6 shows the scatter plot with track bars, each of which has attached range sliders.

Figure 3.6. Controlling focus and context by attached range sliders
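The effect of the range sliders on glyph positions can be sketched as a rescaling of each value to the current focus interval. This is an assumption made for illustration: the thesis does not specify how values in the context region are positioned, so the sketch only reports whether a value lies inside the focus. The function name to_focus_coordinate is hypothetical.

```python
def to_focus_coordinate(value, focus_min, focus_max):
    """Rescale a data value to the focus interval selected by the two sliders.

    Returns (t, in_focus): t is 0 at the lower slider and 1 at the upper
    slider, and in_focus tells whether the value lies between the sliders.
    """
    if focus_max <= focus_min:
        raise ValueError("the upper slider must be above the lower slider")
    t = (value - focus_min) / (focus_max - focus_min)
    return t, 0.0 <= t <= 1.0
```

With the sliders at 10 and 20, the value 15 is placed in the middle of the axis, while 25 falls outside the focus and belongs to the context.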

3.1.5 Axis annotation

X-axis and y-axis annotations consist of text strings attached to locations on the axes. Axes can have both tic marks and tic mark labels (in which case the tic values are also drawn). Tic spacing can be controlled by specifying the grid spacing; in fact, the number of minor tic intervals always equals the number of grid cells. If we specify a narrow grid cell spacing, minor tics are shown automatically and major tics are nicely spaced. Label information is shown only at major tic marks. We need good (non-fractional) values to display on the labels, and a key feature of the axis annotations is that adjacent annotations are prevented from overlapping. I had to address the following challenges:

• Tics are adjusted automatically to the column data (data range and number of digits in the values) to be plotted on the graph.
• Tics are adjusted automatically when the user changes the view size through splitters.
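The thesis does not list the algorithm behind CalculateGoodValues. As an illustration of the general idea only, the sketch below uses the widely known "nice numbers" approach, in which the tic step is rounded up to 1, 2 or 5 times a power of ten and the axis limits are widened to multiples of that step. The function names and the max_ticks parameter (which a caller might derive from the axis length in pixels and the label width) are hypothetical, and the results will not in general match the values produced by the implemented algorithm.

```python
import math


def nice_step(raw_step):
    """Round a positive step up to 1, 2, 5 or 10 times a power of ten."""
    exponent = math.floor(math.log10(raw_step))
    fraction = raw_step / 10 ** exponent
    for nice in (1, 2, 5, 10):
        if fraction <= nice:
            return nice * 10 ** exponent
    return 10 * 10 ** exponent  # guard against floating-point round-off


def good_range(data_min, data_max, max_ticks):
    """Widen [data_min, data_max] to multiples of a readable tic step.

    Assumes data_max > data_min and max_ticks >= 1.
    """
    step = nice_step((data_max - data_min) / max_ticks)
    low = math.floor(data_min / step) * step
    high = math.ceil(data_max / step) * step
    return low, high, step
```

For example, good_range(1755, 4360, 7) widens the range to 1500 to 4500 with a step of 500. Fewer allowed tics (a narrower view) lead to a larger step, which is how such a scheme adapts when the view is resized.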

Figure 3.7. The upper left image shows that the x-axis minor tic spacing is too narrow (i.e., 20 pixels) to fit the labels, whereas the major tics are nicely spaced and labeled (not overlapped). The x-axis attribute values range from 1755 to 4360, but the implemented algorithm generates more readable working values, which range from 1600 to 4400. Similarly, the generated values for the y-axis range between 11 and 25 (nicely rounded) instead of 11.2 and 24.8. The lower right image has negative x-axis annotation. The upper right image demonstrates how annotations change with changing view port dimensions (compared with the upper left image). The lower right image has the same view port as the upper right image but has different data and therefore produces different tics.

3.1.6 Gridding and interpolation

Surface representation by gridding: A grid representation of a flat surface is considered to be a functional surface because, for any given location (x, y), it stores only a single Z value (attribute) rather than multiple Z values. A grid is a spatial data structure that defines a space as an array of cells of equal size arranged in rows and columns. In this grid surface representation, each cell contains an attribute value that represents a change in Z value and is displayed as a color (raster cell). In practice, gridding is the process of creating a regular uniform grid from scattered data. Typically we have a set of arbitrarily scattered points at known locations in the two-dimensional scatter plot region and would like to convert them into a regular grid for further processing. Figure 3.8 illustrates this. Obtaining a sampled value for every cell in a raster is typically not practical; instead, sample points are used to derive the intervening values using an interpolation technique.

Figure 3.8. The left image illustrates how the points are distributed in a two-dimensional scatter plot, while the right image illustrates the cell-centered uniform distribution of those points after gridding.
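The first step of gridding, assigning each scattered point to a cell of a uniform grid, can be sketched as follows. The function name grid_points and the dictionary-based data layout are assumptions made for illustration, not the data structure used in the component.

```python
def grid_points(points, x_min, y_min, cell_width, cell_height):
    """Assign scattered points to the cells of a uniform grid.

    points is a list of (x, y, z) tuples; the result maps each occupied
    cell index (column, row) to the list of z values that fall inside it.
    """
    cells = {}
    for x, y, z in points:
        column = int((x - x_min) // cell_width)
        row = int((y - y_min) // cell_height)
        cells.setdefault((column, row), []).append(z)
    return cells
```

Cells that receive no points are simply absent from the result; these are the cells whose values have to be estimated by interpolation.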

Interpolation: Interpolation is a procedure used to predict the values of cells at locations that lack sampled points. It is based on the principle of spatial dependence or correlation, which measures the degree of dependence between near and distant objects. Similarity of objects within an area can be easily visualized. The method studied here and implemented in the scatter plot is inverse distance weighting (IDW) [3, 4]. In a two-dimensional scatter plot, data points are normally randomly distributed, and some variations of inverse distance interpolation can also be used to predict the values at unobserved locations. Interpolation can be performed over a global or a local region [13]. Global methods use all the known values to evaluate an unknown value, while local methods use only the values in a specified region.

Inverse Distance Weighted Method (IDW): The inverse-distance weighted procedure is versatile, robust, easy to program and understand, and fairly accurate. Using this method, the property at each unknown location for which a solution is sought is given by

$$P_i = \frac{\sum_{j=1}^{G} P_j / D_{ij}^{\,n}}{\sum_{j=1}^{G} 1 / D_{ij}^{\,n}} \qquad (3.5)$$

where P_i is the property at location i; P_j is the property at sampled location j; D_ij is the distance from i to j; G is the number of sampled locations; and n is the inverse-distance weighting power. The value of n, in effect, controls the region of influence of each of the sampled locations. As n increases, the region of influence shrinks until, in the limit, it becomes the area that is closer to that sampled location than to any other. When n is set equal to zero, the method is identical to simply averaging the sampled values. Usually n is set arbitrarily.
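Equation (3.5) can be sketched in a few lines. The version below is a global method in the sense used above, since it uses all sampled locations; the function name idw and the tuple layout are hypothetical. Returning the sample value directly when the target coincides with a sampled location is an added convention to avoid division by zero, not something stated in the thesis.

```python
import math


def idw(target, samples, power):
    """Inverse distance weighted estimate at target, following Equation (3.5).

    target is an (x, y) pair and samples a non-empty list of (x, y, value).
    """
    numerator = 0.0
    denominator = 0.0
    for x, y, value in samples:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0:
            # The target coincides with a sampled location.
            return float(value)
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator
```

With power set to 0 every weight equals 1 and the estimate reduces to the plain average of the sampled values, as noted in the text.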

Neighborhood size: For local interpolation, the neighborhood size is specified by its radius. The radius is expressed in terms of a generic offset matrix. Figure 3.9 illustrates how the matrix expands with increasing radius.

Figure 3.9. The left image has a radius of 1 (1 cell spacing) and a kernel size of 3x3, while the right image has a radius of 2 (2 cell spacings) and a kernel size of 5x5. The indices are stored in a matrix according to the specified radius. Offset matrix index (0,0) always aligns with the interpolated point.
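An offset matrix of this kind can be generated from the radius alone. The sketch below is one possible construction, with a hypothetical function name; each entry is a (column offset, row offset) pair relative to the interpolated cell.

```python
def offset_matrix(radius):
    """Build a (2*radius + 1) x (2*radius + 1) matrix of (dx, dy) cell offsets.

    The central entry is (0, 0), which aligns with the interpolated cell.
    """
    span = range(-radius, radius + 1)
    return [[(dx, dy) for dx in span] for dy in span]
```

A radius of 1 yields the 3x3 kernel and a radius of 2 the 5x5 kernel described in Figure 3.9. Adding an offset to the index of the interpolated cell gives the index of a neighboring cell to include in a local interpolation.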

Figure 3.10. Interpreting the true inverse distance interpolation algorithm.

Sometimes it is not desirable to use true IDW interpolation but rather to compute a simplified form of IDW based on the statistical mean, the maximum or the minimum of the values in each cell. Dense data points severely affect performance during user interaction, and hence these variations of the IDW method are sometimes useful. Four different methods can therefore be set as parameters:

1. True IDW interpolation: As mentioned above, estimated values are a function of the distance to, and the magnitude of, the surrounding points. The process is based on the assumption of positive spatial autocorrelation. Figure 3.10 helps to understand how we calculate the value at the interpolated point. Let x be the value at the interpolated point, produced by convolving with a 3x3 kernel. Then

$$x = \frac{\frac{4}{3^n} + \frac{11}{8^n} + \frac{7}{7^n} + \frac{20}{12^n}}{\frac{1}{3^n} + \frac{1}{8^n} + \frac{1}{7^n} + \frac{1}{12^n}} \qquad (3.6)$$

2. Mean interpolation: Figure 3.11 explains how we determine the interpolated value using the mean method. Each cell has one value, which is calculated by averaging the values of the scatter points belonging to that cell. The equation becomes

$$x = \frac{\frac{6}{4^n} + \frac{4}{5^n} + \frac{6.5}{3^n} + \frac{7}{3^n}}{\frac{1}{4^n} + \frac{1}{5^n} + \frac{1}{3^n} + \frac{1}{3^n}} \qquad (3.7)$$

Figure 3.11. Interpreting the mean inverse distance interpolation method.

3. Minimum: This time we consider the minimum value in each cell instead of the average (shown in Figure 3.12). The interpolated value is

$$x = \frac{\frac{6}{4^n} + \frac{3}{5^n} + \frac{5}{3^n} + \frac{7}{3^n}}{\frac{1}{4^n} + \frac{1}{5^n} + \frac{1}{3^n} + \frac{1}{3^n}} \qquad (3.8)$$

Figure 3.12. Interpreting the minimum inverse distance interpolation method.

4. Maximum: This method is very similar to the previous one, but we consider the maximum value in each cell.
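The simplified variants differ only in how the points of a cell are reduced to one value before the inverse distance weighting is applied. The sketch below illustrates this under stated assumptions: the function names are hypothetical, each neighboring cell is given as a (distance, values) pair, and the point values in the usage note are made up so that the cell means and minima match the numbers appearing in Equations (3.7) and (3.8).

```python
def cell_value(values, method):
    """Reduce the point values of one cell to a single representative value."""
    if method == "mean":
        return sum(values) / len(values)
    if method == "min":
        return min(values)
    if method == "max":
        return max(values)
    raise ValueError("unknown method: " + method)


def interpolate_from_cells(cells, power, method):
    """Inverse distance weighting over per-cell representative values.

    cells is a list of (distance, values) pairs, one per neighboring cell,
    where distance > 0 is measured from the interpolated cell.
    Returns None when no neighboring cell contains any point.
    """
    numerator = 0.0
    denominator = 0.0
    for distance, values in cells:
        if not values:
            continue  # empty cells contribute nothing
        weight = 1.0 / distance ** power
        numerator += weight * cell_value(values, method)
        denominator += weight
    if denominator == 0:
        return None
    return numerator / denominator
```

For instance, with cells = [(4, [6]), (5, [3, 5]), (3, [5, 8]), (3, [7])] the "mean" method weights the values 6, 4, 6.5 and 7 as in Equation (3.7), and the "min" method weights 6, 3, 5 and 7 as in Equation (3.8). Because each cell contributes a single term, the cost depends on the kernel size rather than on the number of data points, which is the reason given above for using these variants with dense data.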

Grid size: Grid can be of any shape, square or rectangular. We can set the

desired grid(each cell) size in pixel along the x & y direction. But actual size depends on the available panel dimensions and data value itself. Actual infor-mation regarding grid size, grid resolution(no of grids along both axes)and other parameters, as shown in Figure: 3.13,are readily displayed.


Figure 3.13. The tooltip contains cell values calculated using different interpolation methods when the user hovers the pointer over grid cells. In the right part of the image, a form control can be seen, through which the user can explore data regarding interpolation.

Tooltip: Cells are colored using the same color map that is used for glyph coloring. Interpolated cell values are always displayed as a tooltip, as shown in Figure 3.13, when the mouse moves over the scatter plot.

3.2 ToolStrip

The ToolStrip control is the base class of many user interface elements for Windows Forms and is embedded in the scatter plot component. This control allows programmers (e.g., users of the component) to build user interfaces for their applications by providing docking menus, toolbars, etc. Application users can interact with the component through common useful behaviors and properties, some of which are:

• getting information, i.e. row values, about the selected glyphs

• selecting the type of glyph, i.e. hollow or filled circle

• controlling glyph transparency

• setting the percentile background

• changing the color map index, i.e. column index

• setting the size map index and controlling glyph size

• color map index setting

• setting grid resolution

• setting weighting power

• choosing interpolation method


Chapter 4 Achievements and Limitations

We already provided screenshots of the implementation results for each layer in Chapter 3, and those are largely self-explanatory. We can control glyph transparency and select a hollow glyph to prevent visual clutter (shown in Figure 3.1(a)). Showing all layers at the same time can cause visual clutter in the component, so layers can be enabled and disabled according to user needs. The focus and context visualization with attached range sliders is useful for reducing the amount of data shown at any one time, and it guides the user to the most interesting areas of the data. However, the regression line, the percentile calculation, etc. can be misleading while glyphs are moved by the range sliders. Compared with the approach of Gapminder World [15], shown in Figure 4.1, the latter approach is far better.

Percentile calculation includes only three options: median, 25%-50%-75% and 33%-66%. It is generic, and more options can be added to this list. The scatter plot component accepts and delivers dynamic linking (e.g., filtering, selection, brushing, color map setting) to and from the other components of an application project.
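The three percentile options can be sketched as lists of cut points applied to one column of data. The Python example below uses linear interpolation between closest ranks, which is one common convention and not necessarily the one used in the component; the names `OPTIONS`, `percentile` and `band_edges` are hypothetical.

```python
def percentile(sorted_values, p):
    """Value at percentile p (0 to 100), interpolating between closest ranks."""
    if not sorted_values:
        raise ValueError("at least one value is required")
    rank = (p / 100.0) * (len(sorted_values) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


# The three options named in the text; more lists could be added here.
OPTIONS = {
    "median": [50],
    "25-50-75": [25, 50, 75],
    "33-66": [33, 66],
}


def band_edges(values, option):
    """Data values at which the percentile background changes band."""
    data = sorted(values)
    return [percentile(data, p) for p in OPTIONS[option]]


column = [9, 1, 8, 2, 7, 3, 6, 4, 5]  # made-up data
quartile_edges = band_edges(column, "25-50-75")
tercile_edges = band_edges(column, "33-66")
```

Keeping the options in a table of cut points means that a further option is a one-line addition.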

Although the algorithm for generating good annotation values for the spatial axes, x and y, is robust, in some cases it generates unwanted fractional values. Suppose that the value range for a column (max - min) is 10. If we set more than 11 tick marks (10 spacings), the annotation breaks into fractional values (illustrated in Figure 4.2).
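The arithmetic behind this limitation can be checked with a few lines of Python. The sketch does not reproduce the annotation algorithm of the component; it only spaces ticks evenly over a range, which is an assumption made for illustration, and the function names are hypothetical.

```python
def tick_values(vmin, vmax, tick_count):
    """Evenly spaced tick values, including both end points."""
    step = (vmax - vmin) / (tick_count - 1)
    return [vmin + i * step for i in range(tick_count)]


def has_fractional_tick(ticks, tolerance=1e-9):
    """True if any tick is not (numerically) a whole number."""
    return any(abs(t - round(t)) > tolerance for t in ticks)


eleven = tick_values(0, 10, 11)   # 10 spacings over a range of 10
twelve = tick_values(0, 10, 12)   # 11 spacings over a range of 10
figure_step = (46.8 - 15.3) / 35  # range and spacing count quoted for Figure 4.2
```

With 11 ticks every annotation is a whole number, with 12 ticks the spacing is 10/11, and the Figure 4.2 numbers give a spacing of about 0.9.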

I believe the ready-made ToolStrip is useful and well positioned (always docked at the upper right corner of the component), and that it lessens the burden on users, who would otherwise have to create the ToolStrip in the application project.

Disabling layers that are not being used reduces rendering overhead, because invalidation occurs only on enabled layers. We tried to avoid unnecessary invalidation of layers but found multiple invalidations per invalidation call in some layers. Some parts of the code could even be excluded from invalidation altogether. For all these reasons, interacting with large data sets is costly.

Figure 4.1. An extra view (Focus) can be used to zoom in and out on an interesting area through mouse drag-move events and two buttons (+ and -).

For uncertainty estimation, we first considered two interpolation methods: Inverse Distance Interpolation [3, 4] and Kriging [11, 7]. According to those papers, Kriging is the best linear unbiased estimator; it is best in the sense that it aims at minimizing the variance of the errors. But it is hard to implement and computationally intensive. Inverse Distance Interpolation, on the other hand, is robust, widely used and easy to implement. Thus, we decided to proceed with the Inverse Distance Interpolation method with four variations. While convolving, we consider a square neighborhood instead of a circular radius, which could give better results. The interpolation results should always be inspected to ensure that the interpolation correctly represents the sampled data. Too few data points or an uneven distribution of scatter points may yield misleading information. Since no interpolation method is guaranteed to work for all data sets, different methods should be tried in order to determine the best method for a given data set under typical conditions. A finer grid, as in the screenshot in Figure 4.3, can produce results with minimal errors.
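The difference between a square and a circular neighborhood can be made concrete by listing which grid cells each one includes around a target cell. The Python sketch below is illustrative only; the function name and the integer cell coordinates are assumptions, not part of the component.

```python
import math


def neighbours(center, radius, shape="square"):
    """Grid cells around `center` that take part in the interpolation."""
    cx, cy = center
    cells = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            if dx == 0 and dy == 0:
                continue  # skip the target cell itself
            if shape == "circle" and math.hypot(dx, dy) > radius:
                continue  # outside the circular radius (corner cells)
            cells.append((cx + dx, cy + dy))
    return cells


square_3x3 = neighbours((0, 0), 1, "square")
circle_r1 = neighbours((0, 0), 1, "circle")
```

For a 3x3 kernel the square neighborhood contains all 8 surrounding cells, while a circular radius of one cell keeps only the 4 edge-adjacent cells and drops the diagonal ones, which lie farther from the target.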

Finally, an atomic layered approach to form a functional scatter plot component brings the concept of scalability which is one of the vital requirements of this thesis.


Figure 4.2. Illustrating the limitation of the algorithm for generating good annotation values. In the x-axis annotation, the value range is 31.5 (46.8 - 15.3) and there are 35 annotation spaces, so the algorithm breaks into fractional values.

Figure 4.3. Interpolation results for a finer grid. The grid resolution is 155x87 and each


Chapter 5 Summary and Future Work

5.1 Summary

We modularized the existing legacy scatter plot component in an existing class library through a layered approach to build our new component. We successfully incorporated some innovative ideas into our scatter plot, and the component is open to the easy addition of more layers. Scalability was a desirable property of the scatter plot implementation, and it is thus ensured.

Our scatter plot depicts up to four numerical dimensions: two orthogonal spatial dimensions, the x-axis and y-axis, and two visual attributes, color and size. We fit a least squares regression line to the scatter plot to investigate the relationship between the two spatial variables. We add a percentile-based background, which gives a visual interpretation of the internal characteristics and distribution of the data. We can also explore interesting data items in the orthogonal space through the focus and context mechanism.
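For reference, a least squares regression line through the (x, y) positions of the glyphs can be computed in closed form. The Python sketch below uses the standard textbook formulas and is not taken from the component's source; the function name and the sample data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimising squared y-errors."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1.
slope, intercept = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9])
```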

Inverse Distance Weighting (IDW) is a method for multivariate interpolation, a process of assigning values to unknown points by using values from a usually scattered set of known points. We successfully added this concept to our scatter plot. However, I found that Kriging, another interpolation method with uncertainty measurement and high precision, cannot easily be implemented.

Chapter 3 describes the details of these implementation techniques along with results as screenshots. In Chapter 4, we also discuss the initial goals and the limitations of the implementation and the results.

5.2 Future Work

There are many possibilities for extending and improving the scatter plot component further, such as:

• Right now, our scatter plot component handles four numerical dimensions. We can increase dimensionality by adding more advanced glyphs to the component. For categorized data, various glyph shapes (e.g., square, circle, triangle) could be used to add a new variable to the component, as shown in Figure 3.1(c). Glyph quality deteriorates if we increase glyph size, so the visual quality of glyphs could be improved. More advanced glyphs, such as pie charts, can also be used to increase the number of multivariate dimensions.

• When the user works with focus and context, the scatter plot glyphs are displaced. It would therefore be convenient to have a reset button, which could be used to bring back the original state of the scatter plot.

• As mentioned in Chapter 4, a different approach to controlling focus and context can be incorporated (shown in Figure 4.1). If we added an extra rectangular view for zooming in and out on interesting areas, the exploration would be more intuitive.

• There are only a few user options for percentiles in our scatter plot. The percentile calculation should be made more generic to allow further options. A slider could be attached to the axes for observing percentile values (shown in Figure 5.1).

Figure 5.1. Depicting percentile (25%-50%-75%) sliders with edge values on the y-axis.

• As mentioned in Chapter 4, the algorithm that aims at producing more legible annotation values sometimes generates undesired fractional values. The algorithm should therefore be improved.

• Currently, the axes of the scatter plot are always fixed. But if we map negative attribute values to an axis, the origin should move from its fixed lower left position. Dynamic behavior of the axes could therefore be added for a more natural look.

• More interface elements or items could be added to the tool strip to lessen the user's burden.

• To investigate the relationship between two spatial variables more precisely, the least squares regression line that we implemented is not enough. The scatter plot could be extended to employ exponential, power, logarithmic and other functions for curve fitting.

• The implementation code could be optimized to avoid unnecessary invalidation.

• We could transfer some of the concepts implemented in the two-dimensional scatter plot component to a three-dimensional scatter plot.

• Kriging interpolation is conventionally regarded as the right choice and as more reliable than inverse distance interpolation. It would be a very nice feature if this interpolation technique were added to the scatter plot component. I believe this would be a promising addition for those engaged in the infovis field.

• We have uniform gridding in our scatter plot. Isoline (or contour line) drawing algorithms require uniform gridding and could be implemented in the scatter plot as a further development.

• Despite the popularity of the scatter plot, it is limited in the number of data items that can be rendered and interacted with simultaneously, which means that performance degrades for large data sets. It would therefore be interesting to study and apply GPU technology to the scatter plot component in order to increase its performance.
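Of the extensions listed above, exponential curve fitting is perhaps the simplest to sketch, because it reduces to the straight-line fit already present in the component. The Python example below fits y = a·exp(bx) by a least squares line through (x, ln y). This is a common linearization and an assumption of this sketch, not a method prescribed by the thesis; it requires positive y values and minimizes the error in log space rather than in the original units. All names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least squares slope and intercept for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) via a line through (x, ln y); needs y > 0."""
    if any(y <= 0 for y in ys):
        raise ValueError("the log transform requires positive y values")
    slope, intercept = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(intercept), slope  # (a, b)


# Made-up samples generated from y = 2 * exp(0.5 * x).
xs = [0, 1, 2, 3]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a, b = fit_exponential(xs, ys)
```

Power and logarithmic models can be handled in the same way by transforming x, y or both before the line fit.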


Bibliography

[1] A 5-Dimensional Scatter Plot. http://informationandvisualization.de/blog/5dimensional-scatter-plot. Retrieved 17 November 2008.

[2] DirectX Graphics Library. http://msdn.microsoft.com/sv-se/xna/aa937781(en-us).aspx. Retrieved 9 December 2008.

[3] Inverse Distance Illustration. http://www.uiowa.edu/ geog/health/interp/inv.html. Retrieved 20 December 2008.

[4] Inverse Distance Interpolation Method. http://age-web.age.uiuc.edu/classes/age357/html/age35731.pdf. Retrieved 22 November 2008.

[5] Microsoft .NET Platform. http://msdn.microsoft.com/sv-se/netframework/default(en-us).aspx. Retrieved 10 November 2008.

[6] Visual C# Development. http://msdn.microsoft.com/sv-se/vcsharp/aa336791(en-us).aspx. Retrieved 10 December 2008.

[7] Randal Barnes. Variogram Tutorial. Golden Software, Inc. http://www.goldensoftware.com/variogramTutorial.pdf. Retrieved 20 November 2008.

[8] Mikael Jern. Geoanalytics visualization framework. http://nvis.itn.liu.se/nvisjoomla/index.php?option=com_content&task=view&id=84&Itemid=10. Retrieved 10 November 2008.

[9] Mikael Jern. NCVA. http://ncva.itn.liu.se/applications/?l=en. Retrieved 19 November 2008.

[10] David Journet. A Way to Calculate Percentile. iTSS Wallingford. http://www.haiweb.org/medicineprices/manual/quartiles_iTSS.pdf. Retrieved 19 November 2008.

[11] K. E. Kerry and K. A. Hawick. Kriging interpolation on high-performance computers. In HPCN Europe 1998: Proceedings of the International Conference and Exhibition on High-Performance Computing and Networking, pages 429–438, London, UK, 1998. Springer-Verlag.

[12] Physics Laboratory. Linear Regression and Excel. http://phoenix.phys.clemson.edu/tutorials/excel/regression.html. Retrieved 18 November 2008.

[13] Nigel M. Waters, University of Calgary. Global and Local Interpolation. http://www.geog.ubc.ca/courses/klink/gis.notes/ncgia/u40.html#SEC40.2.1. Retrieved 28 November 2008.

[14] ITN Linköping University. Affiliated Research Groups of NVIS. http://nvis.itn.liu.se/nvisjoomla/index.php?option=com_content&task=view&id=144&Itemid=167. Retrieved 5 December 2008.

[15] Gapminder World. Mind the Gap. http://graphs.gapminder.org/world/. Retrieved 20 November 2008.

[16] Y. K. Leung and M. D. Apperley. A review and taxonomy of distortion-oriented presentation techniques. ACM Transactions on Computer-Human Interaction, 1(2):126–160, June 1994.