Linköpings universitet
Institutionen för datavetenskap
Department of Computer and Information Science

Final Thesis

Snippet Generation for Provenance Workflows

by

Ayesha Bhatti

LIU-IDA/LITH-EX-A–11/032–SE

2011-08-30

Supervisor: Lena Strömbäck
Examiner: Lena Strömbäck


På svenska

Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under en längre tid från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår. Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns det lösningar av teknisk och administrativ art. Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart. För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/

In English

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/


Acknowledgements

I would like to thank my adviser and examiner, Lena Strömbäck, for valuable insights and advice. She kept me focused on the goal, and I have learned valuable work ethics in addition to technical knowledge. I am also thankful to Valentina Ivanova for her help with the data modeling.

Furthermore, I would like to thank Siv Söderlund for helping with administrative issues.

I also want to thank Ilona, Sirpa, Heidi, Parwin, Hacer, Agneta, Chopi and Katarzyna at Förskolan Svanen, Huddinge for taking good care of my children while I was at work. More than anyone else, I want to thank my beloved husband Sami for encouraging me to start this work and then helping me with the children and the house throughout this period.

I cannot thank my parents enough for their never-ending interest in my success. The knowledge that they take pride in my achievements keeps me going.


Abstract

Scientists often need to know how data was derived in addition to what it is. The detailed tracking of data transformation, or provenance, allows result reproducibility, knowledge reuse and data analysis. Scientific workflows are increasingly being used to represent provenance as they are capable of recording complicated processes at various levels of detail. In the context of knowledge reuse and sharing, search technology is of paramount importance, especially considering the huge and ever increasing amount of scientific data. It is computationally hard to produce a single exact answer to the user's query due to the sheer volume and complicated structure of provenance. One solution to this difficult problem is to produce a list of candidate matches and let the user select the most relevant result. Here search result presentation becomes very important as the user is required to make the final decision by looking at the workflows in the result list. Presentation of these candidate matches needs to be brief, precise, clear and revealing. This is a challenging task in the case of workflows as they contain textual content as well as graphical structure. Current workflow search engines such as Yahoo Pipes! or myExperiment ignore the actual workflow specification and use metadata to create summaries. Workflows which lack metadata do not make good summaries even if they are useful and relevant to the search criteria. This work investigates the possibility of creating meaningful and usable summaries, or snippets, based on the structure and specification of workflows. We shall (1) present relevant published work on snippet building techniques, (2) explain how we mapped current techniques to our work, (3) describe how we identified techniques from interface design theory in order to make a usable graphical interface, (4) present the implementation of two new algorithms for workflow graph compression and their complexity analysis, and (5) identify future work in our implementation and outline open research problems in the snippet building field.


Contents

1 Introduction
  1.1 Background
  1.2 Problem Statement
  1.3 Development Methodology
  1.4 Report Terminology
  1.5 Technological Choices
  1.6 Outline of Work

2 Literature Review
  2.1 Fundamentals
    2.1.1 Scientific Workflows
    2.1.2 Existing Snippet Building Strategies in Industry
  2.2 Related Strategies in Literature
    2.2.1 Information Retrieval Strategies
    2.2.2 Snippets of XML Documents
    2.2.3 Earlier work on Workflow Snippets
    2.2.4 Similarity and difference calculation
    2.2.5 Graph Grammars
    2.2.6 Compressing Graphs
    2.2.7 Graph definitions and Algorithms
  2.3 Interface Design Techniques
    2.3.1 Shneiderman's Mantra
    2.3.2 Card's Advice on Amplifying Cognition
    2.3.3 Marcus directions on use of color
  2.4 Summary

3 Snippets of Provenance
  3.1 A Scientific workflow
  3.2 Snippets of Workflows
  3.3 Database Schema for Querying Provenance
    3.3.1 Property
    3.3.2 Port
    3.3.3 Connection
    3.3.4 Module
    3.3.5 Annotation
    3.3.6 Workflow
    3.3.7 Workflow Collection
  3.4 Information Retrieval concepts
    3.4.1 Module frequency and importance
  3.5 Mapping of Snippet Requirements on Database
    3.5.1 Query and keywords
    3.5.2 Query Biased
    3.5.3 Representative and Distinguishable
    3.5.4 Small But Rich
  3.6 Connectivity Strategy
  3.7 Conclusion

4 Designing Workflow Snippets
  4.1 PACT
    4.1.1 People
    4.1.2 Context and Activities
    4.1.3 Technologies
  4.2 Development Process
    4.2.1 Conceptual Design
    4.2.2 Physical Design, Prototypes, Alternatives
  4.3 Conclusion

5 Snippet Construction
  5.1 Data Structures
    5.1.1 Vertices
    5.1.2 Edges
    5.1.3 Paths
  5.2 Information Retrieval Implementation
  5.3 Workflow Search Query
  5.4 IDF Modules Query
  5.5 Connectivity Implementation
    5.5.2 Solutions and complexity discussion
    5.5.3 Algorithm and correctness discussion
    5.5.4 Example run of BEC
    5.5.5 Problem Description II
    5.5.6 Solution options
    5.5.7 Example run of EEC
    5.5.8 Problem Description III

6 Testing and Evaluation
  6.1 Testing
  6.2 Performance Evaluation
  6.3 User Evaluation

7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work
    7.2.1 Future Implementation Work
    7.2.2 Future Research Questions


Chapter 1

Introduction

1.1 Background

It is always nice to be able to look at a summary of a bigger work before taking a deeper look into it. Book readers prefer to read reviews, while employers want to look at brief, structured, clear and revealing resumes before calling a candidate for an interview or further evaluation. Search engines also use summaries for presenting multiple search results on the same page. The user looks at the search results page for a few seconds before deciding which result is worthy of the first click. Thus the quality of summaries plays a big role: good summaries reduce the number of clicks and the number of pages visited, which of course everybody wants. We call these summaries snippets. Fig 1.1 shows an example from Google.

1.2 Problem Statement

This thesis investigates snippets of provenance. Provenance means "the place of origin or earliest known history of something" [Dic]. Modern science deals with large volumes of complicated data, for example genome data or environmental data. Scientists process and refine data in laboratories, share their data with other scientists and take data from other laboratories during experiments and analysis. During their work raw data is transformed into processed data products, and scientists often need to know through which techniques or procedures a data product was derived. The detailed tracking of data transformation, or provenance, allows result reproducibility, knowledge reuse and data analysis. Scientific workflows are increasingly being used to represent provenance as they are capable of recording complicated analysis at various levels of detail [BKWC01, SPG05, Kep, Peg, Swi, Tav, Vis]. In the context of knowledge reuse and sharing, search technology is of paramount importance, especially considering the huge and ever increasing volume of scientific data. This work has been part of a project 'Querying Provenance' that the Database and Web Information Systems Group at Linköping University is conducting. Project requirements include modeling provenance and provenance queries. In the case of provenance queries, presentation of query responses is an important subquestion. Workflows need to be summarized for presentation purposes, and the presentation needs to be brief, precise, clear and revealing. This is a challenging task in the case of workflows as they contain textual content as well as graphical structure. Current workflow search engines such as Yahoo Pipes! or myExperiment ignore the actual workflow specification and use metadata to create summaries. Workflows lacking metadata do not make good summaries even if they are useful and relevant to the search criteria. This work (1) investigates snippet building from the structure and specification of actual workflows and (2) explores interface design techniques in order to present usable graphical snippets.

Figure 1.1: A Google snippet where links one level deeper are included in the snippet.

1.3 Development Methodology

The design methodology used was prototyping. The requirements were clear but contradictory (small but revealing), and the solution or end product was not clear, so prototyping was used in two-week sprints. A number of prototypes were developed and enhanced during the work. We have studied and tested different graph algorithms and developed two new algorithms during the process. We have also studied interface design techniques and used them to improve prototype usability.


1.4 Report Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [Bra97]. In particular, the term "MUST" refers to a hard requirement, while the term "SHOULD" does not refer to a hard requirement.

1.5 Technological Choices

Work on 'Querying Provenance' is in progress at IISLab. We coordinated our work with previous results in this project. We decided to use Python for two reasons: (1) IISLab is using it already, and (2) it offers list and dictionary data structures, which are very useful for representing graphs. The Django web framework is used to speed up web integration [fra]. jQuery is used to write AJAX routines like asynchronous HTTP GET [jiankoJL]. Scalable Vector Graphics are used to draw graphs in the browser [Gra]. Raphael is used to generate SVG tags [Bar]. NetworkX is a Python graph library; we have used it for general purpose graph operations [Net]. We have used the Firefox browser for presentation and Firebug for interface debugging. Eclipse is used as the development environment as it offers PyDev, which facilitates Python and Django development [iaPIfE]. LaTeX is used to write the report as it is easy to write mathematical equations in LaTeX [dps].
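As a minimal illustration of why Python dictionaries and lists are convenient for graphs, a directed workflow graph can be stored as a dictionary of adjacency lists. The module names below are hypothetical placeholders, not taken from the prototype code.

import pprint

# Minimal sketch: a directed workflow graph as a dict of adjacency lists.
# Keys are modules, values are the lists of successor modules.
workflow = {
    "ReadDataSet": ["FilterNoise"],
    "FilterNoise": ["RenderImage"],
    "RenderImage": [],
}

def successors(graph, module):
    """Return the modules directly reachable from the given module."""
    return graph.get(module, [])

pprint.pprint(successors(workflow, "ReadDataSet"))  # ['FilterNoise']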

1.6 Outline of Work

Chapter 2 details the literature study; this includes both snippet building techniques and graph traversal and compression algorithms. Chapter 3 discusses the conceptual foundations of the thesis: first we define and explain the terminology that we mentioned in the literature review and then describe how we implemented these concepts in our prototype. Chapter 4 discusses interface design in detail; we discuss the significant challenges the task offered and our approach to the solution. Chapter 5 is about algorithm implementation; we present two new algorithms for workflow graph compression and also discuss the computational complexity of the algorithms. Chapter 6 discusses testing and performance evaluation. Chapter 7 concludes the work and lists proposals for future work.


Chapter 2

Literature Review

This chapter presents a summary of the literature regarding snippet building and graph compression techniques. Section 2.1 details fundamental concepts like workflows and provenance and describes snippet building strategies currently used in industry. Section 2.2 discusses literature regarding subproblems such as snippet building strategies and graph compression, and briefly introduces a few well known algorithms that are useful for snippet generation. Section 2.3 presents interface design techniques.

2.1 Fundamentals

In this section we describe fundamental concepts. First we describe scientific workflows and provenance; we then explain the problem of snippet building and motivate the need for our work.

2.1.1 Scientific Workflows

There is no formal classification of scientific workflows. A workflow that is used for a scientific purpose is called a scientific workflow. Due to the large amount of data involved, scientific workflows are usually more data centered than business workflows. Scientists use workflows for two major purposes: one is recording process specifications and the other is representing provenance.

2.1.1.1 Process Specification

One major use of scientific workflows is to document a process specification. A process specification document is a stepwise instruction for performing a routine experiment, e.g. conducting an ultrasound scan for a certain disease. An experiment is a series of procedures that are carried out on various intermediate data products. The experiment eventually ends with a final data product, e.g. a scan image. The specification document may include precise instructions about how to use an ultrasound machine interface for a particular type of scan. It may even include instructions about which image functions to apply in what order. Writing such instructions with clarity and granularity is difficult work. Workflows have the ability to record experiments in granular detail, so their use for this particular purpose is on the rise in the scientific community. Developing a new specification is time consuming and hard, so scientists often desire to reuse and refine existing workflow templates [LWMB09]. The importance of search technology in this context is immense.

Figure 2.1: An example from Yahoo Pipes!, where the user can search already created modules and build a workflow.

Yahoo Pipes! is a good example [Yah]. Here, new workflows are created by refining existing workflows. The user has the ability to search through the workflow repository with keywords, and a list of candidate matches is produced by the system. The user examines the list and clones the one she deems suitable. From there the user refines the cloned workflow specification for her own specific needs. The user can also search through a module repository and add suitable modules to the unfinished specification. Presentation of workflow summaries, or workflow snippets, in the result set helps the user select the most relevant workflow specification. The quality of workflow snippets in the result set is therefore very important.


Figure 2.2: An example from VisTrails. The left pane records provenance as the actual work is done in the right pane.

2.1.1.2 Data Provenance

Another major reason for the popularity of workflow systems in the scientific community is the natural ability of a workflow to represent provenance. The detailed tracking of data transformation, or provenance, allows result reproducibility, knowledge reuse and data analysis. A scenario in an X-Ray lab is a good example. A lab technician starts a procedure by taking a scan of a suspected tumor patient. After taking the scan the technician sees something unusual and decides to apply some additional image functions in order to make the film clearer. The doctor has a better looking scan, but she is interested in what the technician has done to it, just to make sure that she is looking at the right image. The doctor likes what the technician has done and asks the technician to do it to another X-Ray after two months. It is a lot of labor now for the technician. He needs to dig out details of what he did two months ago out of a pile of files, either on paper or on hard disk. It may be impossible to find the actual steps if he didn't document them in detail. Detailed tracking of his actions, or provenance, can save a lot of effort and time for the X-Ray technician. He can reuse his earlier results for his own sake and reproduce them anytime for his doctor for further analysis. Fig 2.2 shows a provenance example from VisTrails.


Figure 2.3: An example snippet from Google Images; when the mouse stays on an image, a bigger snippet with a larger image and some further information is shown.

2.1.2 Existing Snippet Building Strategies in Industry

Workflows are graphical presentations with a lot of information attached. The workflow structure contains modules, connections, and intermediate and eventual data products. Workflows also contain metadata like the name of the workflow, a textual description, the date created, etc. Existing workflow search engines, e.g. Yahoo Pipes! and myExperiment, use the metadata associated with the workflow to generate a summary [Yah, myE]. They do not summarize the graphical structure of the workflow, so they fail to make good snippets for poorly documented workflows. Yahoo Pipes! presents a summary of metadata with a thumbnail of the workflow image. Fig 2.4 shows an example from Yahoo Pipes! where a significantly detailed workflow has failed to make a usable snippet. Nobody has used this workflow due to the poor quality of the snippet, as the cloning statistics show a zero at the bottom of the snippet. This shows that the quality of a snippet plays a vital part in reuse.

We could not find a publicly available work that makes graphical summaries of workflows. The popular technique is to create a thumbnail of the workflow image and present it along with a summary of the metadata associated with it. For example, Pegasus presents a summary of metadata on the right side and a thumbnail of the workflow on the left side [Peg]. An example Pegasus snippet is shown in Fig 2.5; a bigger thumbnail is presented when the user zooms in, as shown in Fig 2.6.


Figure 2.4: Example snippets from Yahoo Pipes!. The first workflow seems to be considerably large, but the snippet shows meshed modules and very little text, which gives the impression that it is a bit complex. The second workflow contains only 3 modules, but the snippet contains considerably more information extracted from metadata. It is reused by others and got cloned 3 times. This shows that the quality of a snippet plays a vital role in workflow reuse.

2.2 Related Strategies in Literature

The field of creating snippets for workflows has largely been ignored. However, we have benefited from the research that has been done in the fields of information retrieval, web snippets and XML snippets. In this section we first describe the techniques and strategies for information retrieval and XML snippet building. We also summarize a first study on creating workflow snippets that was done earlier at IISLab. We then discuss related work done in graph compression.

2.2.1 Information Retrieval Strategies

Information retrieval is the science of retrieving desired information out of a very large repository of unstructured data. It is not directly related to snippet building, but it is possible to reuse information retrieval techniques for searching within a workflow in order to find important modules. We list two techniques that we found useful for snippet building purposes.

Inverse document frequency: The document frequency of a phrase or term is the number of documents containing that term in the database [MRS08]. Common terms like 'a', 'the' etc. always have a high document frequency as they are used in every document. The IDF technique uses this fact to determine meaningful or important terms in a document and marks terms with minimum document frequency as the most important or representative terms. This technique is very useful for the purpose of snippet building, and important or representative modules of a workflow can be selected on the basis of their IDF scores. According to the work done by Ellkvist et al., snippets created through this technique were rated highly in a user feedback study [ESLF09].

Figure 2.5: Pegasus summary. The left pane contains a thumbnail, while the right pane presents a textual summary.

Figure 2.6: Zoomed in image of the Pegasus snippet; it is a bigger thumbnail of the left pane in Fig 2.5.

Searching in query neighborhood: It is preferable to search for important pieces of information in the neighborhood of query related phrases instead of in the full document [TTHW07]. This strategy produces more cohesive pieces of information. It is very useful for our purpose as we need to produce small, condensed and cohesive presentations.
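A minimal sketch of this idea, assuming a workflow is already available as an adjacency list and the query-matched modules are known; the function name, the radius parameter and the data layout are illustrative assumptions, not taken from [TTHW07] or from the prototype.

from collections import deque

def query_neighborhood(adj, matched_modules, radius=1):
    """Collect every module within `radius` hops of a query-matched module."""
    selected = set(matched_modules)
    for start in matched_modules:
        frontier = deque([(start, 0)])
        seen = {start}
        while frontier:
            node, dist = frontier.popleft()
            if dist == radius:
                continue
            for nxt in adj.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    selected.add(nxt)
                    frontier.append((nxt, dist + 1))
    return selected

adj = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
print(query_neighborhood(adj, {"b"}))  # contains 'b' and its neighbor 'd'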

2.2.2 Snippets of XML Documents

XML document structure is comparable to workflows; both can be represented as trees or graphs. Substantial work exists in the field of XML queries [SSC09, HLC08, BEKM06]. Huang et al. have presented an interesting work on snippet creation for XML documents and have outlined requirements for good snippets [HLC08]. These requirements state that snippets should be Self Contained, Representative, Distinguishable and Small. We will now define each of these requirements briefly.

Self Contained: A snippet should contain the query related parts of the original workflow. The user should be able to see how the snippet is related to the query terms.

Representative: A snippet should represent the main function of the original workflow. The user should be able to understand the purpose and function of the workflow by looking at the snippet.

Distinguishable: Two different workflows should not make the same snippet.

Small: The number of modules in a snippet should be smaller than a small integer g; the user should be able to assign a value to g.

We will describe in the next section how workflow snippet building has adapted these requirements to its purpose.

2.2.3 Earlier work on Workflow Snippets

We have built our work on "A First Study on Strategies for Generating Workflow Snippets" [ESLF09] by Ellkvist et al., done earlier at IISLab. First they did a detailed survey in order to find techniques for building snippets, then they implemented three different types of snippets. In the end they presented their implementation to end users and took feedback through questionnaires. They included a Yahoo Pipes! snippet in the questionnaire in order to compare their work with older work. This activity produced valuable user feedback. We have studied their work in detail and used the feedback that they gathered as a guideline for our work. A summary of their conclusions is as follows.

2. Users approve the IDF strategy as a good strategy.

3. Snippets that were different from their original workflows in structure or appearance received negative reviews.

4. The users want to see similarities and differences among snippets clearly highlighted.

5. The users do not like complicated snippets. It is also not liked if the users need to make multiple references to a color legend in order to understand a snippet.

Ellkvist et al. used the requirements formed by Huang et al. and formally defined them for workflows. We list the formal definitions of Ellkvist's system below.1

Workflow: A workflow is a partially ordered collection of modules.

Module: A module is a computation which takes inputs and produces outputs.

Query: A search query consists of a list of query terms Q.

Keyterm: Any term w in a workflow specification such that w ∈ Q is known as a keyterm.

Ellkvist et al. also defined the following strategies.

Self Contained: Each module m is associated with a set of keywords key(m). If a query Q matches a module m, i.e. key(m) ∩ Q ≠ ∅, a self contained strategy should find the most relevant modules MS ⊂ M in the neighborhood of m.

Representative: A snippet should display the main purpose of its workflow. This is analogous to how sentences are selected from a paragraph in order to make a summary. A representative strategy should identify the modules MR ⊂ M that are most representative.

Distinguishable: Two different workflows should make two different snippets. In order to do this, structural differences between workflows should be included in the snippets. A distinguishing strategy should identify the modules MD ⊂ M that are distinguishing.

Small: ∣MQ ∪ MR ∪ MD∣ ≤ g where g is a small integer.

2.2.4 Similarity and difference calculation

"Querying and Creating Visualizations by Analogy" is another very interesting work [SVK+07]. The work explains how to calculate the similarity and differences between two workflows. The authors have provided mathematical definitions of both similarity and difference scores. This method can be used to present multiple workflows in one place and highlight their differences. It is however unclear whether it is feasible to scale this method beyond two workflows.

1 The full list of definitions is not given here. Readers interested in Ellkvist et al.'s system should consult [ESLF09].

2.2.5 Graph Grammars

A graph grammar is a set of rules through which new graphs can be written. Rules in a graph grammar are applied to create new nodes or edges. For making snippets we have to write new, smaller graphs inspired by the original workflow graphs and the user provided query terms. Beeri et al. have written a query language for business processes [BEKM06]. They discuss a novel approach of writing a new XML tree for presentation of a query result with the help of graph grammars. The work does not discuss snippets but explains how to write new graphs. As they deal with structured queries, they view both XML documents and queries as first order graph grammars. This allows them to extract rules from the original document, and they can then add edges to the answer tree in order to convert individual XML nodes into a connected graph. This is directly applicable to our problem. We do not need to create new nodes; we only need rules to create edges which can reconnect modules.

2.2.6 Compressing Graphs

The problem of snippet building maps roughly onto compressing graphs. Gilbert et al. have provided techniques for compressing network graphs [GL04]. They have presented two types of compression schemes: 1) importance compression schemes, and 2) similarity compression schemes. As they have dealt with network graphs, the similarity compression schemes are based on network traits, e.g. geographic clustering or shared medium clustering, and hence are of no use to us. The importance compression schemes, however, are of interest to us. They describe two algorithms, KeepOne and KeepAll, through which they create a new graph induced by important nodes. We now briefly describe these algorithms.

2.2.6.1 KeepOne Algorithm

The KeepOne algorithm reconnects important vertices. That is, if K1 is the set of important vertices, then the goal of the KeepOne algorithm is to find a set K2 of vertices in the original graph so that the graph induced by K1 ∪ K2 is connected. The final set of nodes is reconnected through a minimum spanning tree algorithm.

2.2.6.2 KeepAll Algorithm

Though the KeepOne algorithm preserves the connectivity of the original graph, it does not preserve the distance between nodes. The KeepAll algorithm is designed to additionally add those nodes to the graph that lie on the shortest paths between important nodes.
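A minimal sketch of the KeepAll idea using NetworkX (which this thesis uses for general graph operations): it treats the graph as undirected for connectivity and keeps every node that lies on a shortest path between a pair of important nodes. This is an illustrative reading of the scheme, not Gilbert et al.'s implementation.

import itertools
import networkx as nx

def keep_all(graph, important):
    """Return a KeepAll-style compressed subgraph: the important nodes
    plus all nodes on shortest paths between pairs of important nodes."""
    kept = set(important)
    undirected = graph.to_undirected()
    for u, v in itertools.combinations(important, 2):
        if nx.has_path(undirected, u, v):
            kept.update(nx.shortest_path(undirected, u, v))
    return graph.subgraph(kept)

G = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")])
print(sorted(keep_all(G, {"a", "d"}).nodes()))  # ['a', 'b', 'c', 'd']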

2.2.7 Graph definitions and Algorithms

Here we first define basic graph terminology, and then we explain two general purpose graph algorithms.

2.2.7.1 Graph Definitions

We first give brief definitions of a few basic concepts, but it is recommended that the reader has familiarity with basic graph theory. For more information, we recommend Graph Theory with Applications [BM82].

A directed graph G = (V, E, λ) is a triple where V is a finite set of vertices, E is a set of edges and λ is an incidence function that associates an edge e ∈ E with an ordered pair of vertices (u, v) ∈ V × V. In this case u and v are adjacent vertices, e joins u to v, u is the predecessor and v is the successor. A directed walk in G is a finite non-null sequence W = (v0, e0, v1, e1, ..., ek−1, vk), where e0 joins v0 to v1 and ek−1 joins vk−1 to vk. Walks define paths, i.e. if it is possible to walk from a to b then a path exists between them.

2.2.7.2 Graph Algorithms

There are many graph libraries; we have used NetworkX for graph operations. In particular, we also needed two algorithms, which we describe below.

Breadth First Search: BFS is one of the simplest and most used graph algorithms. It is a graph traversal method which calculates the distance of each reachable node from the source node. It is named breadth first because "it expands the frontier between discovered and undiscovered vertices uniformly across the breadth of the frontier" [CLRS09, p.p. 594-597]. Given a graph G = (V, E) and a distinct source vertex s, breadth first search systematically explores the graph G in order to find every node that is reachable from vertex s. Graph coloring is used in order to keep track of the progress of the search. Initially the search tree only contains the source node and the whole graph is considered white; when a white node is visited for the first time, it is marked gray and added to the search tree; when every node adjacent to a gray node has been discovered, the gray node is marked black. If (u, v) ∈ E and vertex u is black, then v may be either gray or black; that is, all vertices adjacent to a black vertex are discovered. If v is gray, then the vertices adjacent to v may be white or gray. Gray vertices represent the frontier between discovered and undiscovered vertices. The search is finished when all reachable vertices are marked black. The complexity of BFS is O(∣V∣ + ∣E∣).
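A minimal sketch of BFS computing hop distances over an adjacency-list graph; the names are illustrative, and a simple distance map takes the place of the explicit gray/black coloring described above.

from collections import deque

def bfs_distances(adj, source):
    """Return the hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()            # node leaves the frontier
        for nxt in adj.get(node, []):
            if nxt not in dist:           # first discovery of nxt
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

adj = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
print(bfs_distances(adj, "s"))  # {'s': 0, 'a': 1, 'b': 1, 'c': 2}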

Transitive Reduction: The problem of transitive closure is illustrated in Figure 2.7. Transitive reduction removes the edges that are implied by transitivity and produces a minimal representation of the graph, as shown in Figure 2.8. It is a well studied problem [IRW93, GMvdSU89]. The complexity of transitive reduction is O(∣V∣³).

Figure 2.7: The transitive closure

Figure 2.8: Graph after transitive reduction
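As workflows are directed acyclic graphs (see Section 3.3.6), the operation can be illustrated with NetworkX, assuming a release that provides transitive_reduction; this is a library illustration rather than the algorithm used later in the thesis.

import networkx as nx

# A small DAG with a redundant edge a -> c implied by a -> b -> c.
G = nx.DiGraph([("a", "b"), ("b", "c"), ("a", "c")])

# Transitive reduction keeps reachability but drops edges implied by transitivity.
R = nx.transitive_reduction(G)
print(sorted(R.edges()))  # [('a', 'b'), ('b', 'c')]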

2.3 Interface Design Techniques

It is important to note that the problem of snippet building is basically a visualization problem. During the literature study, we felt the need to do proper research into design techniques in addition to information retrieval techniques and graph processing. We now give a brief account of the design techniques which we found useful for designing snippets.

2.3.1 Shneiderman's Mantra

For interactive visualizations, Ben Shneiderman [Ben10, p.p. 349] has a mantra:

Overview first, zoom and filter, then details on demand.

According to this mantra, the interface should at first only contain a summary or overview of the data. The introductory interface should contain zoom and filter technology. The user should be able to zoom into selective segments of the data through a quick response technology, so that she can quickly filter and search for what she actually needs. The heavy details should be loaded on demand at the end.

2.3.2 Card's Advice on Amplifying Cognition

Stuart K. Card argues that visualization is concerned with "amplifying cognition" [Ben10, p.p. 349]. Card achieves this through the following guidelines:

1. "Increasing the memory and processing resources to people,"
2. "Reducing search for information,"
3. "Helping people detect patterns in data,"
4. "Helping people draw inferences from data,"
5. "Encoding data in interactive medium."

According to the above guidelines, an interface should offer memory and processing resources to the user. An interface can achieve this through various methods; for example, a cooking recipe interface may include a unit conversion tool. The goal is to present the data in a self explanatory way so that the user can understand it without having to memorize or consult other details. An interface is meant to help the user process data; the interface itself should not place any processing burden on the user. Interface layers and zooming technology should guide the user to the desired information through the most reduced path and in the shortest possible time. The user should be able to detect patterns in the data and make conclusions on the basis of the data presentation. This can be achieved through developing a design language. A design language makes expressions through different mediums, including colors, shapes, sounds and effects. Appropriate and consistent use of these expressions helps the user see information within the data presentation and hence helps the user draw inferences. The fifth guideline suggests using an interactive technology in the interface design.

2.3.3 Marcus directions on use of color

Marcus [Mar95] explains the importance of simplicity and clarity in interface design. He expresses that we can achieve a usable and expressive design with the appropriate use of colors. As a guideline he advises the use of as few colors as possible in order to keep the interface clean and clutter free. His advised range of colors in one interface is 5 ± 2. He advises against the simultaneous use of high chroma colors but encourages the use of bright colors when the purpose is to capture user attention. The meanings of colors are different in different cultures, but according to some western conventions [Ben10, p.p. 343-344], red conveys hot and green conveys clear or okay.

2.4 Summary

Scientists are increasingly using workflow based provenance systems for managing and representing scientific data and specifications. Search technology for provenance systems is of paramount importance as scientists like to search workflow repositories in order to reuse and analyze already existing data and specifications. One important subquestion is that of snippets, as snippets are the face of a search engine. Current workflow snippet building techniques do not summarize the actual workflow. They instead summarize metadata associated with the workflow and complement that with a thumbnail of the workflow image. As a consequence, workflows that lack metadata fail to make rightful summaries. Very little work has been done in the workflow snippet field, so we have done a literature study of a range of technologies that can be used to make graphical workflow snippets. Information retrieval technology can be used to select modules from workflows in order to make graphical snippets [MRS08]. Huang et al. have developed requirements for good XML snippets [HLC08]. Ellkvist et al. have built on Huang et al.'s work and formally defined those requirements for workflows [ESLF09]. Ellkvist et al. have also gathered valuable feedback from users which can be used as a guideline for further research. Beeri et al. have described a novel approach that uses graph grammars to write query results in graphical form [BEKM06]. Gilbert et al. have described graph compression strategies [GL04]. The problem of making workflow snippets is basically a data visualization problem. Interface design for data visualization should amplify the user's cognition through creating metaphors in a design language [Ben10, p.p. 349]. Color should be used carefully in order to make a clutter free and usable interface [Mar95].


Chapter 3

Snippets of Provenance

In this chapter we first explain the problem of snippet building in detail. We then explain the 'Querying Provenance' project for which the prototype snippets are built.

Section 3.1 explains workflow terminology through an example. Section 3.2 explains the problem of snippet building in the field of workflows and gives the motivation for our work. Section 3.3 introduces the database schema and defines terminology in the database context. Section 3.4 defines the concept of inverse document frequency. Section 3.5 maps the snippet requirements onto the database that the department provided us. Section 3.6 defines precisely how connectivity is designed.

3.1 A Scientific workflow

Here we describe the workflow components and terminology with an example. Fig 3.1 shows an example workflow from VisTrails. As shown in the figure, a workflow is composed of partially connected modules. A module is an atomic computation that converts data items from one form into another. The data items are known as properties or parameters. A property can be a simple data value such as an integer or a float, or it can be a complicated object such as a sound or a picture. In the current example modules are shown as rectangles; they may however be represented by ovals or by any other shape. Every module rectangle further contains small squares near the top and the bottom edge. An example module from Fig 3.1 is vtkDataSetReader. It contains six small squares near the top edge and two small squares near the bottom edge. In a workflow, these are known as ports. Ports are vectors as they are directed entities; they can either have an in direction or an out direction. The contents of ports are parameters or properties. Ports that are placed near the top edge are input ports. Ports that are placed near the bottom edge are output ports. A module accepts a property through an input port, processes it, and outputs the updated property through an output port. This arrangement of input and output ports constructs connections between modules. The modules and the connections together construct a workflow.

A workflow system usually contains template modules. When a scientist creates a new workflow, she chooses a template module and instantiates it. The scientist then configures the module instance through selection of input and output ports. When a second module is added and its ports are configured, a connection is constructed between the two modules. The scientist repeats the process for all the modules she wishes to add, and a workflow is constructed. The scientist may add annotations to a module in order to document it.

As shown in the example, workflows are actually structured documents. They contain both textual content and a directed graphical structure. They may get very large and complicated. They do, however, provide a concrete and unambiguous way to document complex specifications and experiments.

3.2 Snippets of Workflows

The standard way of writing a summary of a document is to write a shorter text that contains the highlights of the original document. The standard way of compressing a picture is to create a thumbnail. Workflows are comprised of both documents and pictures. Writing a textual summary and complementing it with a thumbnail does not serve the purpose sufficiently. We need to write a graphical summary in document style. This means that we need to construct graphical sentences by choosing or selecting fewer units or modules from the original workflow graph. There are obvious restrictions on processing time and presentation space. The presentation should be clear and easily understandable, but the presentation space is limited. Too large snippets limit the number of results presented and eventually force the user to go through several pages. Too small snippets do not convey enough information and force the user to look at the details of too many workflows. The problem is to present a workflow snippet in as small a space as possible and make it as revealing as possible, both in terms of structure and function.

We have shown in the previous chapter how existing workflow search engines use metadata to generate a textual summary and complement it with a thumbnail of the workflow image. We concentrate on the use of the structure and specification of the workflow itself in order to construct graphical summaries, so that the snippet usability becomes relatively independent from metadata and user provided descriptions.

In order to address the problem of small presentation space we have introduced the idea of small but rich snippets. To the best of our knowledge this work is the first effort to make an interactive snippet in the field of workflow snippet building.

We use the structure and specification of the workflow itself for the purpose of making a graphical summary. Additionally, we have made the snippet interactive so that the user can see more targeted and fine grained information with less time consuming mouse actions.

3.3 Database Schema for Querying Provenance

The project 'Querying Provenance' is still in an early development phase. Currently no documentation is available from the department. We have however been provided with an interim schema, which is the current version and subject to change at any time. We now present how the workflow definitions are implemented in the database.

The database schema has a template-instance theme as shown in Fig 3.1. Template tables represent abstract entities; they do not house actual workflows or workflow components like modules or ports. A new workflow or workflow component is created by instantiating a template. The newly created instance is stored in the respective instance table. We now present details of the important entities. We have introduced the workflow and workflow components earlier; we however explain each component again together with its representation in the database.

3.3.1 Property

A property is a data value; it is scalar as it has no direction. It can both be passed into a process and produced out of a process. It can be a simple integer or it can be a complicated object like a picture. In the database we have a template table and an instance table for property.

3.3.2 Port

Ports are vector entities. They are properties but with direction. An input port pi is a property that can be passed into a process. An output port po is a property that comes out of a process. There is no template table for port. A port is instantiated from the property table.

3.3.3 Connection

A connection is formed by two ports of different types. It houses an input port and an output port and sits between two modules in order to connect them. The connection is directed from the output port to the input port. In the database, the connection table imposes the condition (m1, p1) → (m2, p2), where p1 is an output port and p2 is an input port.


3.3.4 Module

A module is a computation with predefined properties at one or both ends. In the database there is a module template table and a module instance table. Formally a module is defined as m : pi → po, both at the instance and the template level.

3.3.5 Annotation

Annotations are used to attach metadata to a workflow. They are stored in a separate table, annotation.

3.3.6 Workflow

A workflow W also has a template and an instance table. Formally a single workflow is a directed, acyclic graph of partially ordered modules. The partial order is defined by the connection relation (m1, po, m2, pi).

3.3.7 Workflow Collection

A workflow collection C is a finite set of workflows, which is basically all records in the workflow instance table.
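To make the entities above concrete, the following is a deliberately simplified sketch of the instance-level entities as Python dataclasses. It is an illustration of the concepts only; the class and field names are assumptions and do not reflect the project's actual interim schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Port:
    name: str
    direction: str          # "in" or "out"

@dataclass
class Module:
    name: str
    ports: List[Port] = field(default_factory=list)
    annotations: List[str] = field(default_factory=list)

@dataclass
class Connection:
    source_module: str      # module providing the output port
    output_port: str
    target_module: str      # module receiving the input port
    input_port: str

@dataclass
class Workflow:
    name: str
    modules: List[Module] = field(default_factory=list)
    connections: List[Connection] = field(default_factory=list)

# A workflow collection C is then simply a list of Workflow instances.
collection: List[Workflow] = []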

3.4 Information Retrieval concepts

In the earlier work presented by Ellkvist et al. it was concluded that inverse document frequency is a useful measure [ESLF09]. We now present our implementation of IDF.

3.4.1 Module frequency and importance

We assume that module frequency is inversely proportional to module importance. In many cases frequently occurring modules are actually utility modules and do not represent the distinguishing function of a workflow.

3.4.1.1 Inverse document frequency

The inverse document frequency for a module m is defined as follows [MRS08]:

idf(m) = log(N / dfm)

where N is the total number of workflows in the collection C, i.e. N = ∣C∣, and dfm is the number of workflows that contain module m, i.e. dfm = ∣{W ∈ C : m ∈ W}∣. It is clear from the equation that the most uncommon module gets the highest IDF score.
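A minimal sketch of this computation, assuming each workflow is represented simply as a set of module names; the function name and data layout are illustrative and not taken from the prototype implementation.

import math

def idf_scores(collection):
    """Compute log(N / df_m) for every module in the collection,
    where collection is a list of sets of module names."""
    n = len(collection)
    df = {}
    for workflow in collection:
        for module in workflow:
            df[module] = df.get(module, 0) + 1
    return {module: math.log(n / count) for module, count in df.items()}

collection = [{"Reader", "Filter"}, {"Reader", "Render"}, {"Reader", "Align"}]
print(idf_scores(collection))  # 'Reader' scores 0.0, the rarer modules score higher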

3.5 Mapping of Snippet Requirements on Database

In the previous chapter we mentioned the requirements developed by Huang et al. for snippets of XML documents [HLC08]. In this work we have renamed Self Contained to Query Biased as this is closer to our goals. We first define the query and result set in the context of the given system. We shall then map those requirements onto the database that we have received from the project.

3.5.1 Query and keywords

We have worked with text queries. A search query consists of a finite set of keyterms Q. The search function filters the workflow collection C for Q. The query result R is a subset of C, R ⊂ C.

3.5.2 Query Biased

A snippet is required to be relevant to a query and result set. In the database each module m has various data strings keyterms(m) associated with it. The set of strings comprises the name of the module, the port names and the annotations. If a query Q matches a module, keyterms(m) ∩ Q ≠ ∅, then the strategy is required to include module m in MQ, i.e. m ∈ MQ, where MQ is the set of query biased modules.
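A minimal sketch of this selection rule, assuming keyterms(m) has already been collected into a set of lowercased strings per module; the data layout is an assumption for illustration and not the project's query implementation.

def query_biased_modules(keyterms, query):
    """Return M_Q: modules whose keyterm set intersects the query terms."""
    query_terms = {term.lower() for term in query}
    return {m for m, terms in keyterms.items() if terms & query_terms}

keyterms = {
    "vtkDataSetReader": {"vtkdatasetreader", "reader", "dataset"},
    "Render": {"render", "image"},
}
print(query_biased_modules(keyterms, ["reader"]))  # {'vtkDataSetReader'}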

3.5.3 Representative and Distinguishable

The snippet is required to show the main function of the workflow; in other words it is required to include representative modules. This requires the ability to pick important modules critical to the function of the workflow.

Two snippets are required to look different from one another. In other words, if two workflows are different from each other, their snippets are required to contain that difference. We do not differentiate between distinguishable and representative modules at the database level. In our system, high IDF value modules are both distinguishable and representative modules. If MIDF represents the high IDF value modules then MIDF = MD ∪ MR.

3.5.4 Small But Rich

There is a restriction of presentation space on the size of a snippet. According to the original requirements, a snippet is required to be reduced to g nodes such that ∣MQ ∪ MIDF∣ ≤ g. We are introducing a new term, Small But Rich. An SBR snippet contains the g nodes but also contains other information like module details and intermediate modules. The initial presentation reveals the g modules on the screen. The user can then interact with the snippet and see parts of the workflow through mouse actions. For example, if the user clicks on an edge, the intermediate modules that form the shortest path between the two end nodes of the clicked edge are presented.
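A minimal sketch of this edge expansion using NetworkX; the function is illustrative only, whereas the prototype renders its snippets as SVG in the browser.

import networkx as nx

def expand_edge(workflow_graph, u, v):
    """Return the intermediate modules on a shortest u -> v path, if any."""
    if not nx.has_path(workflow_graph, u, v):
        return []
    path = nx.shortest_path(workflow_graph, u, v)
    return path[1:-1]  # drop the two end nodes already shown in the snippet

G = nx.DiGraph([("read", "clean"), ("clean", "align"), ("align", "render")])
print(expand_edge(G, "read", "render"))  # ['clean', 'align']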

3.6 Connectivity Strategy

In the literature review we gave basic graph definitions; here we describe our connectivity strategy.

Given a workflow graph G = (V, E) and the corresponding snippet graph SG = (SV, SE), we have viewed the graph G as a first order graph grammar. A vertex v can be written to SV if and only if v ∈ V, and a directed edge e : (u, v) can be written to SE if and only if there exists a path from u to v in G. It means that an edge e goes from u into v in SE provided a walk is possible from u to v in G. This rule ensures that neither the labeling nor the connectivity structure of the snippet can differ from the original graph.
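A minimal sketch of this rule with NetworkX: one simple realization adds an edge for every ordered pair of snippet vertices that is connected by a path in the original graph. The thesis's own BEC and EEC algorithms (Chapter 5) are more selective; this sketch only illustrates the admissibility rule.

import itertools
import networkx as nx

def snippet_graph(G, snippet_vertices):
    """Build SG = (SV, SE): SV are the selected vertices and (u, v) is in SE
    whenever a directed path u -> v exists in the original graph G."""
    SG = nx.DiGraph()
    SG.add_nodes_from(snippet_vertices)
    for u, v in itertools.permutations(snippet_vertices, 2):
        if nx.has_path(G, u, v):
            SG.add_edge(u, v)
    return SG

G = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "d")])
print(sorted(snippet_graph(G, {"a", "c", "d"}).edges()))
# [('a', 'c'), ('a', 'd'), ('c', 'd')]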

3.7 Conclusion

If g is the upper bound on the number of snippet modules MS, where MQ represents the query biased modules and MIDF represents the IDF modules, then the following holds:

∣MQ∣ + ∣MIDF∣ ≤ g

We are introducing a novel approach of adding interactive technology to snippets. The notion of a small snippet is still valid in our solution as we do not extend the permanent space on screen, but our small snippet is rich and can offer the user more information if the user interacts with it. We ensure that the snippet connectivity mirrors the connectivity structure of the original workflow.


Chapter 4

Designing Workflow Snippets

Workflow snippet presentation can be categorized as interactive visualization. As our assignment was to make a usable presentation, we first did a detailed study into the principles of interactive design, and we then conducted a thorough analysis of the problem and existing solutions. We finally used prototyping in order to make a presentable and usable solution. We have described the design principles in the literature review; we will now describe our analysis and design.

PACT stands for People, Activities, Context and Technology [Ben10, p.p. 26-48]. Section 4.1 presents the PACT analysis of the system. Section 4.2.1 outlines the conceptual design. Section 4.2.2 details the physical design and the design language artifacts that we have designed for the solution. Section 4.2.2.6 gives a detailed account of an interactive snippet.

4.1 PACT

In a PACT analysis, the people analysis investigates the target users and their physical and psychological characteristics, their skill level and mental models; the activities analysis includes temporal aspects, e.g. daily vs hourly activity, response time, and single user vs group activity; the context analysis investigates the physical environment and also analyzes the social and organizational context; the technologies analysis includes which input and output technologies are available. This section presents the PACT analysis of the system.

4.1.1 People

Our target users are scientists. It is given that they understand the notion and concept of workflows, so we do not need to be pedagogic in this aspect. We only have to help them adapt their mental model to our solution through interface design principles. We have not considered color blind persons. People may use different languages.

4.1.2 Context and Activities

For contexts we have laboratories. For activities we have two scenarios. The first is a search engine use case, where the user wants to search through a repository of workflows to see if something exists that she can reuse for her new work. The system should provide a list of small, meaningful summaries that are relevant to the search terms and representative of the actual workflows. The second is a database search use case, where the user wants to search whether a record exists. The user does not know the correct identification of the record but has some state of the data that can be used as a query. The requirement on the system is the same: produce a list of small, conveying, query biased snippets that are representative of the actual workflows. We actually have only one activity: the scientist types keywords and the system produces a list of relevant snippets. Our assignment of making snippets carries similar requirements in both activities. There is no specific peak time; scientists all over the world work around the clock. Search is done by a single person, who can do it without any cooperation from others at that point in time (it is true that the user is benefiting from others, but at the time of search nobody is supposed to assist or cooperate personally). The system needs to respond in real time. The response time for web based applications should be less than 1 second [CRM91]. One webpage usually contains 10-12 results, so the requirement is to create and render 12 snippets at one time. We have not considered any safety measures as we are only making a prototype.

4.1.3 Technologies

For technologies, in this work we have used keywords that the user types into a browser from the keyboard, but the solution can easily be expanded to other inputs like a graph or a picture. The output is a graph with shapes as nodes and lines as edges. We do not need anything else like buttons or toolbars. The graph is responsive to mouse actions and reacts with small tooltip windows. This feature can be translated to finger touches if we expand the technology to touch sensitive systems in the future. However, it is worth mentioning that we have provided a solution to the snippet problem; we did not aim to make an in-depth data visualization tool.


4.2 Development Process

4.2.1 Conceptual Design

A clear conceptual design is vital for developing systems that are understandable [Ben10, p.p. 199]. We need to ensure that our understanding of the system is consistent with the user's needs and expectations. The conceptual design is all about 'what', so we have drawn 'rich pictures' for the purpose, as shown in Fig 4.1 and Fig 4.2. We have drawn these diagrams according to the problem scenario that we got from the department. Fig 4.1 shows three scenarios. The first is the X-Ray lab scenario: an X-Ray technician is doing a routine scan but needs to deviate from the standard specification. He needs a way to document what he does differently in order to reuse his newly devised technique and to reproduce the result. In the second, a doctor is looking at the scan done by the technician in the first scenario. The doctor notes that the X-Ray technician did something different and wants to know exactly what. She also wants the X-Ray technician to do it to another scan. In the third scenario a scientist is trying to lay down a new specification. He wants to know if someone else has already done it, as the scientist wants to save his time.

Fig 4.2 shows a scenario where the scientist from the third scenario in Fig 4.1 has got the search engine he was wishing for. He is now looking at the snippets and trying to understand them. He wants to know how different snippets compare to each other. He also wants to know more about module details like input and output parameters. He does not want to click on a snippet that leads to an irrelevant workflow as he has limited time. He would rather investigate the snippet further in a less time consuming way.

4.2.2 Physical Design, Prototypes, Alternatives

We started our work with hand drawn prototypes. We looked at current workflows available on the web [Peg, e.g.] and started sketching. After some survey, brainstorming and drawing we got a typical workflow and its snippet version, as shown in Fig 4.3 and 4.4 respectively. One solution is finding subgroups in workflows and replacing those groups with appropriate labels. According to Ellkvist et al., this reduces to the vertex cover problem, which is NP-complete [CLRS09, p.p. 1089].

The standard software project starts with requirements gathering. We had no opportunity to contact end users during this work, so the only option for us was to study what other researchers have done in this field. We have mentioned in the literature review the work done by Ellkvist et al., who have laid the groundwork in this field.


Figure 4.1: An overview Rich Picture


Figure 4.3: A typical workflow.


conducted a user feedback experiment at the end of their work which produced valuable insights. We have treated this feedback as requirement document for our work. We have presented their work in literature review, here we present critical analysis of their work. This analysis refines requirements for our work and shapes top level design strategies.

Their work presents three different approaches. In the first approach, statistics are done on modules which often occur together. They classified those module groups as atomic functional units. These atomic units were replaced or ’collapsed’ by appropriate labels in order to shrink the workflow. This felt like an alternative to the vertex cover solution. But this technique produced snippets which were different in structure from actual workflows and hence received bad reviews. In the second strategy, similar modules in different workflows are colored with same brush and a group snippet is made as shown in Fig 4.5. This resulted in a single snippet with lots of information but also with lots of colors. Users found it difficult to read as they had to consult color legend multiple times. The third and the only strategy that worked reasonably well was simple first order selection of important modules through IDF and query neighborhood strategy which we have already explained in the literature review.

According to their own analysis, replacing multiple modules with a single label that differed from the original workflow was not a good idea. It confused users, as the snippets became inconsistent with the original workflows and users were not expecting this inconsistency. According to our analysis of the problem, the grouping strategy is also not a suitable solution to this particular problem. Our requirement is to present an easy to understand, smaller view of a bigger workflow. Grouping increases the size of the problem from making a summary for one workflow to making a single summary for multiple workflows. The resulting view becomes complicated to understand and hence contradicts the requirements. In our view, making similar-looking snippets is a better approach. Similar-looking snippets can be achieved through appropriate selection of the modules and improved snippet connectivity. Moreover, it may be beneficial to make the snippet interactive so that the user can investigate it at an upper level with less time-consuming clicks.

We have discussed various techniques for module selection in the literature review. Here we list the two types of modules that we have decided to include in the snippets. We first describe the types of modules and then describe how these types are translated into the design language that we have developed for this work.

4.2.2.1 Query Related Modules:

All workflows in a result set share the property that each of them matched one or more keywords, so there is a possibility that multiple workflows in a result set contain the same query related modules. We have mentioned in the previous chapter that snippets are required to be query biased, so the snippet interface should also convey this important property of modules in a noticeable way. It is important that the user is able to see which modules of the workflow are related to the query.

Figure 4.5: Group snippet from [ESLF09]; the goal is to show the similar and different modules. This snippet received criticism for being too complicated.

4.2.2.2 Distinguishing modules:

The other important requirement for the snippet is to include modules that highlight the main function of the workflow and distinguish it from others. These modules exhibit the differences that the workflows in a result set have among them. The probability that two workflows contain the same distinguishing module is very low. It is important that the user is able to clearly see which modules are the distinguishing modules.

4.2.2.3 Metaphors and Color:

According to the design principles, an interface design should not go against the expectations of the user and should be easy to learn [Ben10, pp. 335-352]. To achieve this objective, patterns or metaphors are used [CMK98] so that end users may apply the skills and knowledge that they already possess. We wanted to make certain modules noticeable and give a different look to different types of modules. A very straightforward approach is to prefix their names with an appropriate tag so that users can read and recognize the difference. This approach is too trivial and a bit verbose, so we decided to use the notion of a design language [Ben10, pp. 69, 71, 77, 335].

Figure 4.6: Our group snippet. Instead of going for involved grouping we have put the snippets in a matrix layout. We have colored all query-biased modules green and all representative modules red. It is now easy to see where to look for similarities and where to look for differences.

We used colors and shapes to convey information and create metaphors. We have described Marcus's directions about the use of color in section 2.3.3. Following those suggestions, we have chosen red for the representative modules, as they are the 'hot' modules that distinguish a workflow from the crowd. We have chosen green for the query related modules, as they are the desired modules that resonate with the query and have a high probability of being present in multiple workflows. This creates a meaningful pattern for the user. If the user wants to compare the differences between workflows, she should look at the red modules; if she is interested in which workflow is most relevant to the query terms, she should look at the workflow with the highest number of green boxes. The user does not have to look at too many colors, and the same colors always convey the same meaning. This offers a simple presentation that does not require heavy analytical skills from the user and does not burden the user's memory. We have used rectangles with a bright outline stroke and a dull fill. This decreases the overall chroma of the snippet, but the difference among the module types remains very clear.
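To make the mapping concrete, the following sketch (hypothetical names and color values, not the actual implementation) shows how the design language could assign a bright outline stroke and a dull fill to the two module roles.

# Dull fill with a bright outline stroke, as described above. Values are illustrative.
STYLE_BY_ROLE = {
    "query":          {"stroke": "#00a000", "fill": "#e8f6e8"},  # green: query-related
    "representative": {"stroke": "#c00000", "fill": "#f6e8e8"},  # red: distinguishing
}

def module_style(is_query_related, is_representative):
    """Pick the style for a snippet module based on its role."""
    if is_query_related:
        return STYLE_BY_ROLE["query"]
    if is_representative:
        return STYLE_BY_ROLE["representative"]
    return {"stroke": "#808080", "fill": "#f0f0f0"}   # neutral grey fallback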

4.2.2.4 Perception of Size:

In order to fulfill the requirements, a snippet has to be revealing in a small space. We are not allowed to increase the total size of the snippet, but we may control the size of the objects within the space allowed. Modules are represented by rectangular boxes. The next visual expression we decided to use is the size of these rectangles. We present bigger rectangles when the respective module has more query terms attached to it or has a higher IDF score. We also increase the width of the outline stroke accordingly, which gives more intensity to the color of the overall rectangle and emphasizes the point a bit more. This gives valuable information to the user at no cost to the overall space that the snippet occupies in the precious presentation territory.

Figure 4.7: Some modules are bigger than others. A bigger size conveys that the box is more important in its respective role. Undirected edges are shown with dashed lines; directed edges are double-lined directed arrows.
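A minimal sketch of how such a size mapping could look; the base dimensions, scaling factor and function names are assumptions for illustration, not the actual implementation.

BASE_W, BASE_H = 60, 24   # smallest rectangle, in pixels (illustrative)
MAX_SCALE = 1.8           # never grow beyond 1.8 times the base size

def module_box(keyscore, idf, max_keyscore, max_idf):
    """Return (width, height, stroke_width) for a snippet module."""
    # Normalise both scores to [0, 1]; guard against zero maxima.
    k = keyscore / max_keyscore if max_keyscore else 0.0
    i = idf / max_idf if max_idf else 0.0
    scale = 1.0 + (MAX_SCALE - 1.0) * max(k, i)
    stroke = 1 + round(2 * max(k, i))   # thicker outline for important modules
    return round(BASE_W * scale), round(BASE_H * scale), stroke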

4.2.2.5 Direction and Connectivity:

Workflows are directed acyclic graphs, so direction is an important piece of information. We show directed edges with an arrow, while undirected edges are dashed lines. To convey connectivity we use a double line when two neighboring snippet modules have intermediate modules between them in the workflow, and a solid line when two neighboring modules in the snippet are also neighbors in the original workflow.
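As a small, hypothetical illustration of these drawing conventions (the names and structure are our own, not the actual implementation):

def edge_style(directed, adjacent_in_workflow):
    """Drawing hints for a snippet edge, following the conventions above."""
    return {
        "arrowhead": directed,               # directed edges get an arrow
        "dashed": not directed,              # undirected edges are dashed
        "double": not adjacent_in_workflow,  # double line: intermediate modules exist
    }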

4.2.2.6 Interactive Snippets

It is evident that the smallness requirement puts serious constraints on the snippet. Card's fifth guideline states that data should be encoded in an interactive environment; according to this guideline, response technology is added to the snippet. We have described the KeepOne and KeepAll algorithms designed by Gilbert et al. in the literature review. We have not used any of their algorithms; we only took inspiration from their idea of expanding the minimal version through the addition of shortest paths. We present a minimal snippet to the user, who is then able to click on edges to see the modules that lie on the clicked path. Additionally, the user can learn more about a module, such as its input and output ports, by clicking on it. The windows carrying this extended information have a small red close button, and the user can open many information windows simultaneously. In this way the user does not have to remember any information at all, as everything she wants to see can appear on screen at the same time. We have avoided complicated grouping, but we are still able to offer an easy way to compare multiple workflows.

Figure 4.8: Responsive snippet; the right info window contains the in and out ports of a module, while the left window contains the modules that lie between two snippet modules.
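As a sketch of the edge-expansion interaction (our own illustration, assuming the original workflow is available as an adjacency dictionary of the kind defined in the next chapter), the intermediate modules for a clicked snippet edge (u, w) can be recovered with a shortest-path search:

from collections import deque

def intermediate_modules(edges, u, w):
    """Modules strictly between u and w on a shortest directed path, or []."""
    parent = {u: None}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        if node == w:
            path = []
            while node is not None:     # walk back to u
                path.append(node)
                node = parent[node]
            return list(reversed(path))[1:-1]   # drop the endpoints u and w
        for nxt in edges.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return []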

4.3 Conclusion

It is easy to see that we have followed Shneiderman's mantra, "Overview first, zoom and filter, then details on demand" [Ben10, p. 349]. By implementing these techniques we are able to conserve the structure of the workflow in the snippet. We have used only two main colors, as advised by Marcus, and it is straightforward to spot similar and different modules in a list of snippets. The design language helps end users spot patterns in the data, as directed by Card's second and third guidelines. Users do not have to relearn the meaning of the color, shape or size of an object for every result set, so the presentation places little demand on the user's memory and effort. We have avoided involved grouping, which is computationally expensive, requires considerably more man hours to implement, is difficult to present visually, and eventually demands more memory and effort from the end user. We are still able to offer functionality to compare multiple workflows in the result set.


Chapter 5

Snippet Construction

This chapter describes the implementation details. We have implemented information retrieval techniques in order to select important modules, and we have developed two new algorithms in order to reconnect the selected modules. Section 5.1 defines and declares the data structures. Section 5.2 describes the implementation of the information retrieval concepts. Section 5.3 describes the search query that produces the result set of workflows R ⊂ C. Section 5.4 details the strategy for the selection of high-IDF modules within a single workflow. Section 5.5 details the connectivity implementation with examples and proofs.

5.1 Data Structures

A workflow is a triple G(V, E, λ), where V is the set of vertices, E is the set of edges, and λ is a partial incidence function. The graph defined by λ contains no cycles.

5.1.1 Vertices

A vertex is a tuple (module id, module version). The properties of a vertex include its name, ports, keyword score, IDF score and degree. A set of vertices is represented as a dictionary with the tuples (module-id, module-version) as keys. Each key points to a dictionary that contains the properties.

V = {
    (m1, v1) : {name : value, idf : value, keyscore : value, degree : value},
    (m2, v2) : {name : value, idf : value, keyscore : value, degree : value},
    ...,
    (mn, vn) : {name : value, idf : value, keyscore : value, degree : value}
}
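In Python terms, the vertex set can be held as an ordinary dictionary of dictionaries. The sketch below is purely illustrative; the module ids, names and scores are made up.

# Illustrative vertex dictionary: keys are (module_id, module_version) tuples,
# values hold the per-module properties listed above. All values are made up.
V = {
    (12, 1): {"name": "ReadDICOM",  "idf": 2.31, "keyscore": 2, "degree": 3},
    (47, 2): {"name": "Threshold",  "idf": 0.85, "keyscore": 0, "degree": 2},
    (63, 1): {"name": "RenderView", "idf": 1.40, "keyscore": 1, "degree": 1},
}

# Properties are reached directly through the key:
print(V[(12, 1)]["idf"])   # 2.31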

5.1.2 Edges

An edge is a pair (source-module, target-module). The edge representation does not include properties. The set of edges E is a dictionary where the source tuples (source-id, source-version) form the keys. Every key points to a list of target modules. This representation is very efficient in terms of space and speed. It is also very flexible and easy to use, as one can loop over the edges and access the properties stored in the vertices, since both vertices and edges share the same keys.

E = {
    (sm1, sv1) : [(tm1, tv1), (tm2, tv2), ..., (tmk, tvk)],
    (sm2, sv2) : [(tm1, tv1), (tm2, tv2), ..., (tmk, tvk)],
    ...,
    (smn, svn) : [(tm1, tv1), (tm2, tv2), ..., (tmk, tvk)]
}
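As a concrete illustration (with made-up module ids, not the real data), the sketch below shows the adjacency dictionary and how looping over the edges gives direct access to the vertex properties through the shared keys.

# Minimal vertex properties, repeated here so the example is self-contained.
V = {
    (12, 1): {"name": "ReadDICOM"},
    (47, 2): {"name": "Threshold"},
    (63, 1): {"name": "RenderView"},
}

# Each source module maps to the list of its target modules.
E = {
    (12, 1): [(47, 2)],
    (47, 2): [(63, 1)],
}

# Vertices and edges share the same keys, so iterating the edges while reading
# vertex properties needs no extra lookup structure.
for source, targets in E.items():
    for target in targets:
        print(V[source]["name"], "->", V[target]["name"])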

5.1.3 Paths

If u, v ∈ G.V ∧ v ∈ G.E[u], then u and v are said to be adjacent vertices, with direction from u to v. We write this as u → v.

If u, v, w ∈ G.V ∧ v ∈ G.E[u] ∧ w ∈ G.E[v], then u and w are said to have a path between them, with direction from u to w. We write this as: if u → v ∧ v → w then u ↝ w. The path property is transitive, i.e. if u ↝ w and w ↝ x then u ↝ x.
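A minimal sketch (our own illustration, not the thesis code) of checking the reachability relation u ↝ w on the edge dictionary E by a depth-first search over the successor lists:

def reaches(E, u, w):
    """Return True if there is a non-empty directed path from u to w."""
    stack = list(E.get(u, []))      # start from the direct successors of u
    seen = set(stack)
    while stack:
        node = stack.pop()
        if node == w:
            return True
        for nxt in E.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

# With the illustrative E dictionary from the previous sketch:
# reaches(E, (12, 1), (63, 1))  ->  True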

5.2 Information Retrieval Implementation

The idea comes from text search. In a text search system we have a collection of documents; in our system, workflows correspond to documents. A text document is a collection of sentences, while a workflow is a collection of modules and connections: the modules are the nouns and the connections the verbs. Port names and annotations are also considered important terms. We have not done any new research in this area. We learned how information is processed and indexed in the text search field [MRS08] and implemented the same techniques for workflows. We will now describe our implementation.

We tokenized all the information present and indexed it separately in the database. Two relations are added to the database. The first is the table keyterm (keyterm_id, keyterm_label), which contains all tokenized terms with their unique ids. The second table is module_keyterm (module_id, module_version, keyterm_id), and it indexes all terms against modules.
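A sketch of this indexing step, using SQLite purely for illustration; the two table names and their columns follow the text above, while the exact schema, tokenizer and function names are assumptions.

import re
import sqlite3

# Assumed schema for the two relations described above (types and keys are guesses).
SCHEMA = """
CREATE TABLE IF NOT EXISTS keyterm (
    keyterm_id    INTEGER PRIMARY KEY,
    keyterm_label TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS module_keyterm (
    module_id      INTEGER,
    module_version INTEGER,
    keyterm_id     INTEGER
);
"""

def tokenize(text):
    """Lower-case and split module names, port names and annotations into terms."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def index_module(conn, module_id, module_version, texts):
    """Record every distinct term of a module in keyterm and module_keyterm."""
    cur = conn.cursor()
    for term in {t for text in texts for t in tokenize(text)}:
        cur.execute("INSERT OR IGNORE INTO keyterm (keyterm_label) VALUES (?)", (term,))
        cur.execute("SELECT keyterm_id FROM keyterm WHERE keyterm_label = ?", (term,))
        keyterm_id = cur.fetchone()[0]
        cur.execute(
            "INSERT INTO module_keyterm (module_id, module_version, keyterm_id) "
            "VALUES (?, ?, ?)",
            (module_id, module_version, keyterm_id),
        )
    conn.commit()

# Usage with made-up module data:
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
index_module(conn, 12, 1, ["ReadDICOM", "dicom file in", "image out"])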
