A database to store and annotate Stem cell gEnetic Abnormalities


SEA database


NICOLAS GIRAULT

Master’s Thesis at the Institute for Research in Biotherapy

Supervisor: Lars Arvestad

Examiner: Jens Lagergren


Abstract

Human pluripotent stem cells (hPSC) are important in medicine because of several distinctive features. However, genomic abnormalities are observed in PSC; they arise either during cell culture or during cell reprogramming. These abnormalities are a serious concern for the use of hPSC. To assess the threat they pose, it is essential to distinguish polymorphisms from bona fide mutational events, and to distinguish incidental abnormalities from those that genuinely alter the biology and/or safety of hPSC. An important aspect of PSC genetic abnormalities is that they are often recurrent, possibly because they confer a strong selective advantage on cells in culture. Such a selective advantage is reminiscent of precancerous or cancerous lesions and should be avoided.

It is therefore essential to carefully catalogue all the genomic alterations found in hPSC and to identify those that are recurrent as well as those that are shared with cancer. To meet these requirements, I set up SEAdb, a database indexing and annotating hPSC genomic abnormalities. The database has been populated with most of the abnormalities reported in the scientific literature. It integrates a tool for focusing on hotspots, as well as information from databases reporting common polymorphisms and genetic abnormalities implicated in cancer.


Referat

Human pluripotent stem cells (hPSC) are of great importance in medical science because of their unique properties. Abnormalities in the genome have been observed in PSC, arising either during cell culture or during cell reprogramming. These raise serious concerns about the use of hPSC. To assess the threat posed by these abnormalities, it is necessary to distinguish polymorphisms from bona fide mutations, and to distinguish incidental abnormalities from those that genuinely alter the biology and/or safety of hPSC. An important aspect of PSC genetic abnormalities is that they are often recurrent because of their strong selective advantage in cell culture. Such a selective advantage is reminiscent of precancerous or cancerous lesions, which should be avoided.

It is therefore essential to carefully catalogue all genetic alterations found in hPSC and to identify those that recur as well as those that are shared with cancer. Under these requirements I set up SEAdb, a database indexing and annotating hPSC genetic abnormalities. This database has been annotated with most of the abnormalities reported in the scientific literature. It includes a tool that makes it easy to highlight what is most essential. In addition, it contains information from databases reporting common polymorphisms and genetic abnormalities found in cancer.


Chapter 1

Why SEAdb?

1.1 About stem cells

Our body is composed of many cells, which are classified into cell types (e.g. muscle, skin, blood, neurons...). This diversity of cell types is possible thanks to the process of differentiation, by which a less specialized cell becomes a more specialized cell. Human pluripotent stem cells (hPSC) can differentiate into all adult cell types. As hPSC are also endlessly self-renewing in vitro, they are viewed as an inexhaustible and physiologically relevant cellular material for experimentation and regenerative medicine.

Figure 1.1. Illustration of "stem" cells. In embryonic development, pluripotent stem cells are the stem of many cell types.

Human PSC can be isolated from the inner cell mass of discarded embryos (human embryonic stem cells, hESC) or derived from differentiated cells by cell reprogramming (induced pluripotent stem cells, iPSC).

Many applications can be envisioned from PSC, including (i) in vitro modeling of human development, (ii) in vitro modeling of human genetic diseases, (iii) an unlimited production of normal or diseased cells for drug testing, (iv) rejuvenation of old and/or senescent cells to revitalize organ functionality and (v) an unlimited supply of normal cells for cell therapies.

1.2 In vitro culture and PSC genetic abnormalities

When stem cells are collected from an embryo or an adult, they are not immediately reintroduced into the organism (for therapeutic application). Between these events, stem cells are cultured in vitro. Cell culture is therefore a critical step in cell therapy and other stem cell applications.

Since stem cells proliferate endlessly, they are passaged every week to limit the number of cells (Fig. 1.2). Passaging consists of regularly splitting the cells (usually weekly) and transferring a fraction into fresh medium. Cell culture is thus subject to strong selection pressure. In the specific case of stem cells, selection favors cell clones that display diminished apoptosis (cell death), increased proliferation or a diminished tendency to differentiate. Consequently, any culture parameter, such as the culture medium or the passaging technique (mechanical with a scalpel versus enzymatic with an enzyme that dissociates cells), might cause recurrent abnormalities that confer a selective advantage in that particular environment. For example, several studies reported that enzymatically passaged cell lines acquired cytogenetic aberrations more easily [6, 7], whereas other studies reported that mechanically passaged cells harbored a recurrent amplification of the segment 20q11.21 [8, 9].

1.3 Other source of abnormalities

Genetic abnormalities in stem cells are often reported after prolonged in vitro culture. However, these abnormalities can also appear before culture: they can be inherited from the zygote (the initial cell formed when two gametes are joined) for hESC or from the parental somatic cells for hiPSC, or generated during the derivation process for hESC or the reprogramming process for hiPSC (indeed, reprogramming that uses integrative vectors is a cause of DNA damage).

1.4 An overview of genomic abnormalities found in stem cells

Figure 1.2. (a) hPSC in vitro - (b) what really happens in vitro (every week, a small part of the culture is kept and put in a new culture) - (c) what happens when a genomic abnormality conferring a phenotypic change occurs.

DNA is a dynamic and adaptable molecule. As such, the nucleotide sequences found within it are subject to change as the result of a phenomenon called mutation. Mutations occur through various mechanisms that will not be explained here. Nevertheless, these mechanisms give rise to almost every sequence rearrangement that one could imagine, ranging from single nucleotide mutations to chromosomal aberrations. The genetic abnormalities listed below are observed in cultured stem cells.

• Small variations (one, sometimes several, nucleotides):

Single Nucleotide Variation (SNV): a single nucleotide position in DNA where different sequence alternatives exist. The most common type of SNV is the Single Nucleotide Polymorphism (SNP, pronounced "snip").

It is important not to confuse these two terms: SNPs are SNVs that are present in more than 1% of a population. They occur in human DNA at a frequency of about 1 every 1,000 bases¹. Several hundred thousand SNP sites are being identified and mapped on the sequence of the genome, providing the densest possible map of genetic differences. Although every SNP is an SNV, the term SNP is often used loosely to designate an SNV that is not necessarily a SNP.

¹ The reference database dbSNP lists 3.5 million verified common polymorphisms, while there are around 3 billion bases in the human genome.

Indel: short for insertion/deletion.

• Structural variations (a structural variation typically affects a sequence of about 1 kb to 3 Mb [1]):

Copy number variation (CNV): an alteration of the DNA of a genome that results in the cell having an abnormal number of copies of one or more sections of the DNA. CNVs correspond to relatively large regions of the genome that have been deleted (fewer copies than normal) or duplicated (more copies than normal) on certain chromosomes.

Inversion: a DNA sequence gets flipped in place.

• Chromosomal abnormalities:

Aneuploidy: abnormal number of chromosomes (e.g. trisomy, quadrisomy, monosomy)

Translocation: rearrangement of parts between non-homologous chromosomes. Translocations can be balanced (an even exchange of material with no genetic information extra or missing, and ideally full functionality) or unbalanced (an unequal exchange of chromosome material resulting in extra or missing genes).

Complex chromosome rearrangement: some mechanisms create complex rearrangements such as an isochromosome (a chromosome that has lost one of its arms and replaced it with an exact copy of the other arm), a translocation involving more than two chromosomes, etc.

1.5 The need of a database

We explained in this introduction that genetic abnormalities arise in hPSC for several possible reasons. Since 2004, dozens of publications have reported thousands of abnormalities of various kinds. These abnormalities are scattered across the scientific literature and there is no common format to report them: karyotype abnormalities often follow the International System for Human Cytogenetic Nomenclature [2] (which can in some cases be difficult for the non-specialist to understand²), whereas CNVs and SNVs are presented in PDF tables (or, in the best case, in an Excel table).

² For example: 44,X,-Y,der(6)t(6;17;17) (6pter->6q15::17q25.1->17q25.3::17q11.2->17qter), del(10)(p11.2),+12,-17, der(18)t(17;17;18)(17qter->17q11.2::17q25.3->17q25.1::18q23->18pter) (Draper et al. 2004).


Useful review publications have summarized these abnormalities [3, 4, 5], but this is no longer feasible given the current number of reported abnormalities. Moreover, with the increasing resolution of genome analysis techniques, the detection of sub-chromosomal abnormalities in hPSC is expanding. An avalanche of small genetic alterations is on its way, and unless a platform integrates all these genetic abnormalities and provides them in a coherent fashion, it will not be possible to mine this rich and rapidly growing dataset. For this purpose, we set up the Stem cell gEnetic Abnormalities database, SEAdb (SEAdb.org).


Chapter 2

Organising biological data in a database

SEAdb consists of a front-end web application and a back-end relational database implemented using the PostgreSQL relational database management system. In this chapter we will focus on the back-end relational database.

When setting up a database from scratch, the first question to consider is how to organize data on the server side. Some biological concepts are complex and thus arduous to represent and store on a server. Furthermore, a set of various annotations commonly accompanies the main data. For example, for a mutation such as a simple SNV, one will of course store its identifier and its coordinates on the reference genome. But one might also store data about the corresponding sample, the protocol that led to the observation of this mutation in this sample, the detection method used, a publication where the mutation is described, a contact to refer to, a description of the corresponding phenotype, cross-references to other databases and possibly many other annotations. A good organization of the data is crucial to avoid redundancy and to answer queries quickly.

In this regard, designing a sound database schema is a key step that deserves close examination. I first reviewed the state of the art in this area by analyzing the database schemas of two genomic variant databases¹.

2.1 Analysis of the structure of two databases of genomic variants

COSMIC [10] and DGV [11] are two databases storing genomic variants. They are presented in more detail in section 4.2. Both are renowned, work well and are widely used by the scientific community, so they can be considered reliable and good models. Their database schemas could not be included in this report since they would not fit on an A4 sheet. However, they are available at these URLs: http://dgv.tcag.ca/v103_20131106/app/erdiagram.html and ftp://ftp.sanger.ac.uk/pub/CGP/cosmic.

¹ An introduction to database design is available at http://www.ncbi.nlm.nih.gov/books/NBK6828/.

I studied these schemas to better understand how one can represent genetic variants in tables. Here is a short comparison between these two schemas:

Similarities

• DGV is composed of 30 tables. Ignoring tables that store data not directly related to genetic variants (genes, tumor, tissue...), COSMIC is also composed of 30 tables. The structure of these tables is centered on genetic variants. Groups of tables are then plugged in to annotate these variants.

• Both schemas use three levels to classify data: study –> sample –> genetic variant. However in DGV some studies do not report sample ID but instead only a variant ID. This would typically be due to the need to protect the privacy of the individual who provided the sample.

Differences

• The variant representations are different. In COSMIC, variants are stored in different tables according to their type (i.e., structural variants or small variants), and these tables have fields describing the genomic position of the feature. In DGV, all the variants are stored in the same table regardless of their type, and a field links the variant table to a variant type table. The mapping information is contained in other tables depending on the type (variant mapping versus translocation mapping).

• DGV captures information about methods, platform used and analysis performed to obtain variants but COSMIC does not.

Overall, these schemas are quite complex, and at such an early stage it is not easy to identify what would be needed in our case. Although these schemas inspired me and helped me understand the different ways to represent genomic variants in a database, I decided not to base my database schema on either of them. Instead, I decided to use Chado, a generic biological database schema that has been developed to standardize biological database schemas.

2.2 Towards a standardization of biological database schemas

Thanks to the revolution in DNA sequencing, the amount of biological data available on the web is growing rapidly. More and more genomic annotation databases are created (Fig. 2.1). Nevertheless, a good organization of the data is not easy to set up. It requires experience (unfortunately, it is often only when facing the data that the designer comes up with possible improvements). Moreover, with all the annotations that go with the central data, biological databases often require tens of tables, and designing a database schema thus becomes time-consuming and labor-intensive. Furthermore, when database applications are built to work with a particular schema, changes to the schema may dictate reciprocal changes to the software. All of this makes schema evolution a costly affair.

Figure 2.1. Number of publications in PubMed with the keyword "online database"

Considering these points, and to avoid redundant software development while potentially increasing interoperability, a small number of stable schemas would be preferable to a plethora of rapidly evolving schemas. However, a schema that aims to be durably and successfully reused must be generic in order to overcome the following challenges:

• biological knowledge is evolving fast, changing the way we consider biological entities

• new experimental techniques are designed

• a wide variety of biological projects may use such a schema

Several biological database schemas are available on the web. For example, Ensembl [12] provides its own core database schema, as do ACeDB [15], arkDB [13], GUS [14] and many others. However, the aim of these organizations is not to provide a reusable schema. All these database schemas are somewhat specialized, and a database designer may find no way to represent some of the data of a specific project.

A database schema named Chado has been designed to answer the need for a generic schema for biology. Since 2007, Chado has become a popular tool for designing biological databases. The Chado database schema is used by tens of databases². Moreover, emerging tools assume that the databases they interact with follow this schema.

² Checked by screening the publications citing the main Chado publication.

2.3 Chado

Chado [16] is a relational database schema developed by the GMOD consortium³. It is capable of representing many of the general classes of data frequently encountered in modern biology such as sequence, sequence comparisons, phenotypes, genotypes, ontologies, publications, and phylogeny. It has been designed to handle complex representations of biological knowledge. Chado’s main page is accessible on gmod.org/wiki/Chado.

In this section, I will present the main features of Chado and how data are represented in Chado tables.

2.3.1 Chado is a generic schema

In many databases, each type of thing one wants to store is given its own table. That is, we would have a table for genes, one for chromosomes, one for tRNAs, etc. The problem with this sort of design is that, as you encounter new types of things, you have to create new tables to store them. Also, many of these tables are going to look very much alike.

Chado is what is called a generic schema, i.e., data are abstracted wherever possible to prevent duplication in both the design and the data itself. So, instead of one table for each type of ’thing’, we just have one table to hold ’things’, regardless of their types. In the Chado world these are known as features.

The main characteristic of Chado is that it is highly generic. To handle this genericity, Chado uses ontologies. In contrast to other database schemas, where data types are enforced at the relational layer, data types and relationships in Chado are driven by ontologies, which are stored in controlled vocabulary tables (Fig. 2.2).

An ontology is a representation of the different types of entity that exist in the world, and of the relationships that hold between these entities. Several groups work on defining precise biological ontologies. The Sequence Ontology Project (SO), which is part of the Gene Ontology Project and the Open Biomedical Ontologies (OBO), defines a controlled vocabulary and the relationships between its terms. This set of terms is used in Chado. For example, SO defines a subgroup of genomic features called sequence alterations, which is used in SEAdb and is also extended to include cases such as complex chromosomal rearrangements.

³ Generic Model Organism Database project: a collection of open source software tools for creating and managing genome-scale biological databases.

Figure 2.2. Chado is a generic database. Let us consider a transcript of a gene. A classical database could represent it with two tables named ’gene’ and ’transcript’ and link them with a foreign key relationship (data type is enforced in the relational layer). In Chado, both the transcript and the gene are stored in the ’feature’ table. An entry in the ’feature_relationship’ table links the gene and the transcript, and a table called ’cvterm’ (controlled vocabulary) stores the necessary vocabulary, which is linked to ’feature’ and ’feature_relationship’ via foreign key relationships.

This way of representing data is very flexible and answers the challenges listed above. However, it also increases the complexity of SQL queries (see the following example) and thus the response time of the database. This is the price to pay for genericity, and a compromise has to be made between genericity and SQL query complexity.

Example of SQL complexity due to genericity (using the tables of Fig. 2.2), selecting the list of genes in the database:

    SELECT * FROM gene;

versus:

    SELECT * FROM feature
    INNER JOIN cvterm ON feature.type_id = cvterm.cvterm_id
    WHERE cvterm.name = 'gene';
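The generic query can be tried on a toy database. The following is a minimal sketch, assuming Python with its built-in sqlite3 module rather than the PostgreSQL server used by SEAdb; the tables keep only the columns needed for the join, and the feature names (geneA, geneB) are hypothetical.

```python
import sqlite3

# In-memory toy database; only a reduced subset of the Chado columns is kept.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    type_id    INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
INSERT INTO cvterm VALUES (1, 'gene'), (2, 'mRNA');
-- Hypothetical features: two genes and one transcript share one table.
INSERT INTO feature VALUES (10, 'geneA', 1),
                           (11, 'geneA-transcript1', 2),
                           (12, 'geneB', 1);
""")

# The type of a feature is resolved through the controlled vocabulary.
gene_names = [row[0] for row in conn.execute("""
    SELECT feature.name
    FROM feature
    INNER JOIN cvterm ON feature.type_id = cvterm.cvterm_id
    WHERE cvterm.name = ?
    ORDER BY feature.name
""", ("gene",))]
print(gene_names)
```

Adding a new kind of feature only requires a new row in cvterm, not a new table, which is the flexibility that the extra join pays for.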

2.3.2 Representation of biological data in Chado

Chado stores the main information in a ’feature’ table. A feature is considered to be a region of a biological macromolecule (i.e., a DNA, RNA or polypeptide molecule) or an aggregate of regions on this polymer. Features may be localized relative to other features (Fig. 2.3). In Chado, all genetically encoded or transmitted entities, including chromosomes, genes, transcripts, genetic abnormalities, etc., are modeled as entries in the feature table. Chado uses a relative localization model: all feature localizations must be relative to another feature. A feature (e.g. an SNV) holds a relationship to a location, i.e., coordinates, which itself holds a relationship to a source feature (e.g. a chromosome). Locations are encoded in a separate table called ’featureloc’.

Figure 2.3. Feature localization is relative to another feature.

A few tables are linked to the ’feature’ table: featureprop to add a property to a feature, featureloc to add localization information, and feature_relationship to add relations between features.
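How these three satellite tables work together can be sketched as follows. This is only an illustration, assuming Python with sqlite3: the columns are a reduced subset, the feature type is stored as plain text instead of a foreign key to cvterm, and the names (geneA, geneA-t1) and the 'part_of' relationship type are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
-- Location of a feature relative to a source feature (srcfeature_id).
CREATE TABLE featureloc (
    feature_id    INTEGER REFERENCES feature(feature_id),
    srcfeature_id INTEGER REFERENCES feature(feature_id),
    fmin INTEGER, fmax INTEGER
);
-- Relation between two features (subject -> object).
CREATE TABLE feature_relationship (
    subject_id INTEGER REFERENCES feature(feature_id),
    object_id  INTEGER REFERENCES feature(feature_id),
    type TEXT
);
-- Free property attached to a feature.
CREATE TABLE featureprop (
    feature_id INTEGER REFERENCES feature(feature_id),
    type TEXT, value TEXT
);
INSERT INTO feature VALUES (1, 'chr1', 'chromosome'),
                           (2, 'geneA', 'gene'),
                           (3, 'geneA-t1', 'mRNA');
INSERT INTO featureloc VALUES (2, 1, 5000, 9000), (3, 1, 5000, 8500);
INSERT INTO feature_relationship VALUES (3, 2, 'part_of');
INSERT INTO featureprop VALUES (2, 'note', 'hypothetical example gene');
""")

# Transcripts of geneA, with their coordinates on the source feature.
rows = conn.execute("""
    SELECT child.name, src.name, loc.fmin, loc.fmax
    FROM feature_relationship rel
    JOIN feature child  ON child.feature_id  = rel.subject_id
    JOIN feature parent ON parent.feature_id = rel.object_id
    JOIN featureloc loc ON loc.feature_id    = child.feature_id
    JOIN feature src    ON src.feature_id    = loc.srcfeature_id
    WHERE parent.name = 'geneA' AND rel.type = 'part_of'
""").fetchall()
print(rows)
```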

2.3.3 Modularity in Chado

The whole set of Chado tables is large (more than 150 tables), but not all of these tables are necessarily used. For the sake of clarity, the tables are grouped into modules. The main module, containing the central ’feature’ table, is called the ’sequence’ module. Another mandatory module is the ’cv’ (controlled vocabulary) module. One may then decide to use other modules such as the ’contact’ module, the ’publication’ module, the ’cell line’ module, etc.

2.4 Using Chado for SEAdb

SEAdb uses the Chado schema to store abnormalities in the database. In this section, I describe the SEAdb data model and how it has been implemented within Chado.

2.4.1 SEAdb data model

Studies: All submitted abnormalities are part of a study. Each study typically represents a coherent set of methods and analyses that were performed at around the same time, by the same authors, in the same laboratory (or laboratories). Because these parameters determine to a large extent the variability that exists between datasets, all data in SEAdb are organized by study. Typically, a study corresponds to a single publication or community resource.

Samples: Each abnormality in SEAdb is associated with a unique sample in which the abnormality has been detected. Samples are associated with studies and cell lines. The same sample can be assayed by several technologies (NGS, microarray, cytogenetics). The reprogramming protocol (for iPS) and the culture protocol of the sample can be described. So far, SEAdb does not store possible relationships between samples.

Abnormalities: Abnormalities have a type (SNV, translocation, etc.) and coordinates on the reference genome (sometimes several genomic positions, cf. translocations). It is also possible to add residue information and a strand. Annotations can be associated with abnormalities.
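The three levels of this data model can be summarized with a few plain data structures. The sketch below is an illustration in Python, not the SEAdb implementation (which stores these entities in Chado tables); all class names, field names and values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Location:
    """One genomic position; a translocation carries several of these."""
    chromosome: str
    fmin: int
    fmax: int
    strand: Optional[int] = None
    residues: Optional[str] = None


@dataclass
class Abnormality:
    kind: str  # e.g. "SNV", "translocation"
    locations: List[Location]
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Sample:
    name: str
    cell_line: str
    technologies: List[str] = field(default_factory=list)
    abnormalities: List[Abnormality] = field(default_factory=list)


@dataclass
class Study:
    title: str
    samples: List[Sample] = field(default_factory=list)

    def abnormalities(self) -> List[Abnormality]:
        """Collect the abnormalities of every sample in the study."""
        return [a for s in self.samples for a in s.abnormalities]


snv = Abnormality("SNV", [Location("chr1", 1000, 1001, residues="A")])
translocation = Abnormality(
    "translocation",
    [Location("chr7", 30000, 40000), Location("chr11", 2000, 2000)],
)
sample = Sample("sample-1", "hypothetical-line", ["microarray"],
                [snv, translocation])
study = Study("Hypothetical study", [sample])
kinds = [a.kind for a in study.abnormalities()]
print(kinds)
```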

2.4.2 Representation of genomic abnormalities in Chado

As previously explained, a DNA sequence abnormality such as an SNV corresponds to an entry in the feature table. The type_id field (see Appendix B.1) refers to the vocabulary table entry ’SNV’. The abnormality is localized using entries in the featureloc table. When needed, a pair of entries is used. One is relative to the initial position of the abnormality (wild position). The other refers to the possibly new position (variant position) of the nucleotides due to the variation (this is the case with translocations, for example). Although a large share of sequence variations have identical wild and variant positions, the pair of featurelocs remains useful since the featureloc table holds a field storing information about residues. For example, for a SNP A->G, A will be stored in the residue information field of the wild featureloc while G will be stored in that of the variant featureloc. To distinguish the wild featureloc from the variant featureloc, the rank field is set to 0 (wild) or 1 (variant) (see the description of the featureloc fields in table B.2).

Figure 2.4. Representation of a SNV using Chado feature and featureloc tables.
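The wild/variant pair of featurelocs can be sketched on a toy database. This is a minimal illustration, assuming Python with sqlite3 and a reduced subset of the feature and featureloc columns; the feature name snv-example-1 is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT,
    type_id    INTEGER REFERENCES cvterm(cvterm_id)
);
CREATE TABLE featureloc (
    featureloc_id INTEGER PRIMARY KEY,
    feature_id    INTEGER REFERENCES feature(feature_id),
    srcfeature_id INTEGER REFERENCES feature(feature_id),
    fmin INTEGER, fmax INTEGER,
    residue_info TEXT,
    rank INTEGER            -- 0 = wild position, 1 = variant position
);
INSERT INTO cvterm VALUES (1, 'chromosome'), (2, 'SNV');
INSERT INTO feature VALUES (1, 'chr1', 1), (2, 'snv-example-1', 2);
-- SNV A->G: same coordinates, different residues, distinguished by rank.
INSERT INTO featureloc (feature_id, srcfeature_id, fmin, fmax, residue_info, rank)
VALUES (2, 1, 1000, 1001, 'A', 0),
       (2, 1, 1000, 1001, 'G', 1);
""")

row = conn.execute("""
    SELECT src.uniquename, wild.fmin, wild.fmax,
           wild.residue_info, variant.residue_info
    FROM feature f
    JOIN featureloc wild    ON wild.feature_id = f.feature_id
                           AND wild.rank = 0
    JOIN featureloc variant ON variant.feature_id = f.feature_id
                           AND variant.rank = 1
    JOIN feature src        ON src.feature_id = wild.srcfeature_id
    WHERE f.uniquename = 'snv-example-1'
""").fetchone()
description = "{}:{}-{} {}>{}".format(*row)
print(description)
```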

Note: according to these rules, one can also represent more complex DNA sequence variations. For example a translocation with a SNV on the translocated sequence can be represented as a pair of features. The first one would be the translocation as indicated in table 2.1. The other one would be a SNV where the pair of featurelocs are relative to the translocation feature.

2.4.3 Annotation of genetic abnormalities

The web interface allows abnormalities to be queried by cell line, cell type (iPS/ES), study, culture conditions, etc. To do so, the feature and featureloc tables are included within a more complex structure of tables, which is represented in appendix A.1 and briefly described below.

The publication module

The publication module stores information about publications such as title, authors, publisher, etc. It also links to an important Chado table that holds cross-references to external databases. This makes it possible, for example, to store the PMID of a publication, a unique number assigned to each PubMed publication. The publication is then accessible through the URL http://www.ncbi.nlm.nih.gov/pubmed/[PMID].


feature type          | fmin  | fmax        | position relative to | residue info | rank | remark
----------------------|-------|-------------|----------------------|--------------|------|-------
SNV                   | 1000  | 1001        | chr1                 | A            | 0    | Chado uses interbase coordinates so fmin ≠ fmax. Base A is between interbase 1000 and 1001
                      | 1000  | 1001        | chr1                 | G            | 1    |
insertion             | 1000  | 1000        | chr2                 | GAGAGA       | 0    | insertion occurs between base 1000 and 1001. It does not change for the variant featureloc
deletion              | 1000  | 1006        | chr2                 | TACCGA       | 0    |
translocation         | 30000 | 40000       | chr7                 | +1           | 0    | For structural abnormalities, residue info is used to inform about the gain/loss of genetic material. Here a part of chromosome 7 has been duplicated (+1) and translocated on chromosome 11 which has lost a part (-1)
                      | 2000  | 2000        | chr11                | -1           | 1    |
trisomy               | 0     | 159 138 663 | chr7                 | +1           | 0    |
copy number variation | 30000 | 40000       | chr7                 | +2           | 0    |

Table 2.1. How the main abnormalities are represented in Chado. In the feature table, a field holds the type of the abnormality. Position and residue information are held by entries in the featureloc table. When needed, a pair of featureloc entries is used to represent the abnormality (rank 0 for the wild position and rank 1 for the variant position).
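The interbase coordinates used in Table 2.1 number the gaps between bases rather than the bases themselves. The two helper functions below are a hypothetical Python sketch (they are not part of Chado or SEAdb) showing how the usual 1-based inclusive coordinates map to (fmin, fmax) and why an insertion has fmin equal to fmax.

```python
def to_interbase(start, end):
    """Convert 1-based inclusive base coordinates to interbase (fmin, fmax).

    Base number n lies between interbase positions n - 1 and n.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def span_length(fmin, fmax):
    """Number of bases covered by an interbase interval (0 for an insertion point)."""
    if fmax < fmin:
        raise ValueError("expected fmin <= fmax")
    return fmax - fmin


# A single base (the SNV row of Table 2.1) spans one interbase unit.
snv_fmin, snv_fmax = to_interbase(1001, 1001)
print(snv_fmin, snv_fmax, span_length(snv_fmin, snv_fmax))

# The deletion row covers as many bases as its residue info contains.
print(span_length(1000, 1006) == len("TACCGA"))

# An insertion sits at a single interbase position and covers no reference base.
print(span_length(1000, 1000))
```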

Publications are mainly associated with SEAdb studies but can also be associated with cell lines.
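The cross-reference mechanism can be sketched as follows, assuming Python with sqlite3 and a reduced subset of the columns of the Chado db, dbxref, pub and pub_dbxref tables. The PMID 12345678 and the publication title are placeholders, and storing the PubMed address as a URL prefix in the db table is an assumption made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE db (db_id INTEGER PRIMARY KEY, name TEXT, urlprefix TEXT);
CREATE TABLE dbxref (
    dbxref_id INTEGER PRIMARY KEY,
    db_id     INTEGER REFERENCES db(db_id),
    accession TEXT
);
CREATE TABLE pub (pub_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE pub_dbxref (
    pub_id    INTEGER REFERENCES pub(pub_id),
    dbxref_id INTEGER REFERENCES dbxref(dbxref_id)
);
INSERT INTO db VALUES (1, 'PMID', 'http://www.ncbi.nlm.nih.gov/pubmed/');
INSERT INTO dbxref VALUES (1, 1, '12345678');   -- placeholder PMID
INSERT INTO pub VALUES (1, 'Placeholder publication title');
INSERT INTO pub_dbxref VALUES (1, 1);
""")

# Build the PubMed URL of a publication from its cross-reference.
url = conn.execute("""
    SELECT db.urlprefix || dbxref.accession
    FROM pub
    JOIN pub_dbxref ON pub_dbxref.pub_id = pub.pub_id
    JOIN dbxref     ON dbxref.dbxref_id  = pub_dbxref.dbxref_id
    JOIN db         ON db.db_id          = dbxref.db_id
    WHERE pub.pub_id = 1
""").fetchone()[0]
print(url)
```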

The study module

A study is the highest level used to group genetic abnormalities. Each study is stored using two tables of the study module: study and studyprop. The study table stores information such as a title and a description, and has foreign key relations to a contact and to a possible publication. A study property is systematically associated with each study using the studyprop table. This property records the privacy of the study (either public or private). This makes it possible to keep a set of abnormalities private to a group of logged-in users.

The contact module

Each study must be associated with a contact. Usually this contact is the corresponding author of the associated publication. However, the main use of the contact module is to handle private studies (i.e., studies which are not published). Indeed, the contact table is linked to a table of registered users. This table is external to Chado and is handled by the CMS (see next chapters). In this way, if a study is private, only a logged-in user associated with the study contact can access the abnormalities associated with the study. Note that a contact entry can be a person but also a research group. In this case, all the persons associated with the research group (through the contact_relationship table) are authorized to access the study data.
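The access rule described above can be written as a small decision function. The sketch below is an illustration in Python, not the code running in SEAdb (where the rule is enforced by the CMS and the Chado tables); the function name, the user names and the group name are hypothetical.

```python
from typing import Dict, Optional, Set


def can_view_study(privacy: str, contact: str, user: Optional[str],
                   group_members: Dict[str, Set[str]]) -> bool:
    """Return True if `user` may see the abnormalities of a study.

    `privacy` is the study property ("public" or "private"), `contact` is
    the study contact (a person or a research group), `user` is the logged-in
    user (None for an anonymous visitor) and `group_members` maps a research
    group to its members, as contact_relationship does.
    """
    if privacy == "public":
        return True
    if user is None:  # anonymous visitors never see private studies
        return False
    return user == contact or user in group_members.get(contact, set())


groups = {"stem-cell-lab": {"alice", "bob"}}
print(can_view_study("private", "stem-cell-lab", "bob", groups))
print(can_view_study("private", "stem-cell-lab", "carol", groups))
```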

The custom_sample table

Although Chado offers a large range of tables, no table matched our need to link a cell line, a study, protocols and features. I therefore added a table called custom_sample (the custom_ prefix clearly separates it from native Chado tables). The custom_sample table has five columns: sample_id, study_id, cell_line_id, name and description. Thanks to the study_id and cell_line_id fields, a unique cell line and a unique study are associated with each sample. Two link tables called custom_sample_feature and custom_sample_protocol make it possible to associate several features or several protocols with a sample (note that this could be handled by adding a field to the feature table and the protocol table, but it is preferable to create custom tables rather than modify the native Chado schema, to avoid compatibility issues with other programs interacting with Chado).
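A possible definition of these three custom tables is sketched below, assuming Python with sqlite3. The column names of custom_sample are those listed above; the column types, the keys of the two link tables and the sample data are assumptions made for the example, and the foreign keys to the native Chado tables are omitted to keep it self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE custom_sample (
    sample_id    INTEGER PRIMARY KEY,
    study_id     INTEGER NOT NULL,   -- one study per sample
    cell_line_id INTEGER NOT NULL,   -- one cell line per sample
    name         TEXT NOT NULL,
    description  TEXT
);
-- Link tables: several features / protocols per sample.
CREATE TABLE custom_sample_feature (
    sample_id  INTEGER NOT NULL REFERENCES custom_sample(sample_id),
    feature_id INTEGER NOT NULL,
    PRIMARY KEY (sample_id, feature_id)
);
CREATE TABLE custom_sample_protocol (
    sample_id   INTEGER NOT NULL REFERENCES custom_sample(sample_id),
    protocol_id INTEGER NOT NULL,
    PRIMARY KEY (sample_id, protocol_id)
);
INSERT INTO custom_sample VALUES (1, 1, 1, 'sample-A', 'hypothetical sample'),
                                 (2, 1, 2, 'sample-B', NULL);
INSERT INTO custom_sample_feature VALUES (1, 100), (1, 101);
INSERT INTO custom_sample_protocol VALUES (1, 5), (2, 5);
""")

# Number of abnormalities (features) recorded for each sample.
counts = dict(conn.execute("""
    SELECT s.name, COUNT(sf.feature_id)
    FROM custom_sample s
    LEFT JOIN custom_sample_feature sf ON sf.sample_id = s.sample_id
    GROUP BY s.sample_id, s.name
    ORDER BY s.name
"""))
print(counts)
```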

The cell_line module

Cell lines are stored in the cell_line module. The cell_line table stores the name and the id of each cell line. A cell_line_relationship table exists but is not used so far. The cell_lineprop table is systematically used to store the cell type (either ES or iPS) of the cell line. A cell line is systematically associated with a sample.

The protocol module

Each sample may be associated with several protocols. I defined three types of protocols: culture protocol, reprogramming protocol and detection protocol. Protocols are stored using two tables: protocol and protocolparam. The protocol table stores the name and the type of the protocol, while protocol parameters are stored in the protocolparam table. For instance, a culture protocol could be composed of three parameters: a "medium" (the liquid in which cells grow) parameter with value "DMEM", a "passage type" parameter with value "enzymatic" and an "enzyme" parameter with value "trypsin".
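The culture protocol of this example can be stored and queried as sketched below, assuming Python with sqlite3. The two tables are reduced to the columns mentioned above, the protocol type is stored as plain text rather than through the cv module, and the protocol names culture-A and culture-B are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protocol (
    protocol_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    type        TEXT NOT NULL
);
CREATE TABLE protocolparam (
    protocol_id INTEGER NOT NULL REFERENCES protocol(protocol_id),
    name        TEXT NOT NULL,
    value       TEXT NOT NULL
);
INSERT INTO protocol VALUES (1, 'culture-A', 'culture protocol'),
                            (2, 'culture-B', 'culture protocol');
INSERT INTO protocolparam VALUES
    (1, 'medium', 'DMEM'),
    (1, 'passage type', 'enzymatic'),
    (1, 'enzyme', 'trypsin'),
    (2, 'medium', 'DMEM'),
    (2, 'passage type', 'mechanical');
""")

# All parameters of the first culture protocol.
params = dict(conn.execute(
    "SELECT name, value FROM protocolparam WHERE protocol_id = 1"))

# Protocols that use enzymatic passaging.
enzymatic = [row[0] for row in conn.execute("""
    SELECT p.name
    FROM protocol p
    JOIN protocolparam pp ON pp.protocol_id = p.protocol_id
    WHERE pp.name = 'passage type' AND pp.value = 'enzymatic'
""")]
print(params, enzymatic)
```

Storing parameters as name/value rows keeps the schema unchanged when a new culture parameter has to be recorded.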

The cv module

The cv (controlled vocabulary) module is closely related to all the tables described above. Each parameter or property type is linked to this module where terms are stored.

2.5 Conclusion

In this chapter I have presented how data are organized on the server side. I used as a template a well-known biological database schema called Chado, which I extended for our needs. Genomic abnormalities are stored in the feature and featureloc tables. They refer to a sample, which refers to a study. Culture and reprogramming protocols are attached to samples, as are detection methods. This structure is the core of SEAdb. I will now present how I used this structured data to build a tool that allows a researcher to handle pluripotent stem cell abnormalities.


Chapter 3

Building a web interface to display biological data

SEAdb is composed of a front-end web application implemented using the Drupal content management system. The web application allows a researcher to query the database and to visualize PSC abnormalities in a genome browser. In this chapter I will present how I used Drupal to build such a website and the tools I used to visualize data.

3.1 Drupal

Drupal is a free and open source content management system (CMS)¹. Choosing a CMS to develop a large website is a wise choice. Nowadays web CMSs are mature and efficient. A CMS saves a lot of time and lets the developer focus on more advanced problems rather than reinventing the wheel. Among the main CMSs that share the market are Wordpress, Joomla and Drupal. We chose Drupal for several reasons.

3.1.1 Why Drupal?

Around 2% of websites use Drupal as a back-end framework. Although this is not a large share of the market, it corresponds to almost one million websites². Thus, there is a wide and active community developing Drupal and helping when needed. On the one hand, Wordpress and Joomla are widely used and are well suited to building most of the websites found on the web, such as blogs or showcase websites. However, when one needs more advanced functionality and has to develop a complex application on the server side, Wordpress and Joomla might not be the best tools. On the other hand, Drupal seems to be more developer-friendly. Drupal is very flexible and gives access to a powerful API.

¹ A CMS allows publishing, editing and modifying content as well as maintenance from a central interface. CMSs aim to avoid rebuilding a website from scratch.

² https://drupal.org/project/usage/drupal.

Drupal pairs well with Chado for building a web interface that queries the database and provides a graphical or tabular view of its content. This is probably why the GMOD consortium developed Tripal [18], a collection of open-source and freely available Drupal modules that serves as a web interface for the Chado database and is designed to allow anyone with genomic data to quickly create an online genomic database using community-supported tools. It also provides a complete API to interact with Chado.

3.1.2 Drupal modules

There are more than 1,000 modules available on the Drupal platform. A Drupal module is a collection of files, written in PHP, that provides some functionality.

Developers pick the modules they need from the Drupal platform to build their Drupal website. If one needs to extend or customize Drupal's functionality, building a module rather than modifying the Drupal source code preserves upgradability, portability and reliability. Because the module code executes within the context of the site, it can use all the functions and access all the variables and structures of Drupal core.

I built several Drupal modules to provide the specific functionality I needed: building pages with specific content or layout (home page, advanced search form, abnormality detail page, JBrowse page, all the Chado administration pages), handling data confidentiality (see below) and exporting data in specific file formats.

3.1.3 Example of a module I developed

During my project, I had to handle data confidentiality. One requirement was to label genomic abnormalities as either public or private. A private abnormality can be displayed only to logged-in users to whom it belongs (an abnormality can belong to a group of users).

Since I am using an additional database schema (Chado), no existing module fitted my needs for handling confidentiality. I therefore developed a module implementing a function that is called just after the normal interaction with Chado. Each queried abnormality has a field containing all the authorized users. If the abnormality is labelled as private, the module compares the current user id with the authorized user ids. If a match is found, the abnormality is displayed; otherwise it is removed from the content to display.
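The filtering step described above can be sketched as follows. This is only a minimal Python illustration of the logic; the actual module is written in PHP as a Drupal hook, and the record fields used here (is_private, authorized_uids) are hypothetical names rather than SEAdb or Chado columns.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Abnormality:
    # Field names are illustrative assumptions, not the SEAdb schema.
    name: str
    is_private: bool = False
    authorized_uids: Set[int] = field(default_factory=set)


def filter_visible(rows: List[Abnormality], current_uid: Optional[int]) -> List[Abnormality]:
    """Keep public rows, plus private rows the current user is authorized to see.

    current_uid is None for an anonymous (not logged-in) visitor.
    """
    visible = []
    for row in rows:
        if not row.is_private or current_uid in row.authorized_uids:
            visible.append(row)
    return visible
```

Filtering after the query, rather than inside it, mirrors the description above: the rows are fetched normally and the private ones are dropped just before rendering.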

This is a very simple module composed of only two files, which are stored in a directory bearing the name of the module, itself located in the "module" directory. I added the prefix "custom" to all the modules I developed, so this module is called "custom_confidentiality". Its two files are:

• custom_confidentiality.info: a small file with information such as the name and description of the module, the compatible versions of Drupal, a list of the files present in the module directory and a list of dependencies.

• custom_confidentiality.module: contains the main code of the module. In this case, there is only one hook function³: custom_confidentiality_views_pre_render.

3.1.4 Tripal

Tripal is a Drupal module that provides several features for integrating Chado into a Drupal website. It makes it easy to add, update and display Chado content. Currently, Tripal is limited because not all the Chado tables are integrated into the default templates. Nevertheless, Tripal offers a complete API to fully interact with Chado.

3.2 SEAdb main functionalities

3.2.1 Advanced Search

Alongside the visualization of data in a genome browser, the main feature of SEAdb is data querying. An advanced search page has been designed to allow as many kinds of query as possible. Abnormalities can be queried by location (gene, cytoband, or genomic position on the latest assembly, GRCh37), by size and by type. They can also be queried by features of their sample, such as the cell type (ES/iPS), the cell line and the detection methods used. More options, such as culture and reprogramming protocol parameters, will be available in the future. Finally, abnormalities can also be queried by the study to which they belong.

3.2.2 Recap cards

For each abnormality displayed in the table resulting from the advanced search form, the user can click on three different recap cards: the abnormality recap card, the sample recap card or the study recap card. These recap cards display all the information about the corresponding entity (see an example of an abnormality recap card in appendix C.1). Sample recap cards include a Circos graph (see next section) summarizing all the abnormalities found in the sample (Fig. 3.1).

3Hook functions are special functions recognized by Drupal core that are called at very specific moments of the page-building process.


Figure 3.1. Circos graph displayed on the recap card of sample SEA_sample_465. It summarizes all chromosomal (translocation and >1Mb) and subchromosomal abnormalities.

3.2.3 Data export

When the database is queried, the resulting abnormalities are displayed in a table that can be downloaded in the standard GVF format. The Genome Variation Format (GVF) is a very simple file format for describing sequence alteration features at nucleotide resolution relative to a reference genome. GVF is a type of GFF3 file with additional pragmas and attributes specified, and it has the same nine-column tab-delimited layout as GFF3. All the requirements and restrictions specified for GFF3 apply to GVF as well, so a GVF file should be fully compatible with code used for processing and displaying GFF3 files. In addition, GVF places further constraints on some of these columns.
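To make the nine-column layout concrete, here is a minimal Python sketch that formats one record and a small file header. It is not the SEAdb export code: the source name, the record identifier, the coordinates and the GVF version number are illustrative assumptions, and a real GVF file generally carries further pragmas and attributes.

```python
def gvf_line(seqid, source, so_type, start, end, attributes,
             score=".", strand=".", phase="."):
    """Format one nine-column, tab-delimited GFF3/GVF record.

    Coordinates are 1-based and inclusive, as in GFF3.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    # Column 9 is a semicolon-separated list of key=value pairs.
    attr = ";".join(f"{key}={value}" for key, value in attributes.items())
    columns = [seqid, source, so_type, str(start), str(end),
               score, strand, phase, attr]
    return "\t".join(columns)


def gvf_document(records, gvf_version="1.06"):
    """Prefix the records with the GFF3 and GVF version pragmas.

    The default version number is an assumption made for this sketch.
    """
    header = ["##gff-version 3", f"##gvf-version {gvf_version}"]
    return "\n".join(header + [gvf_line(**r) for r in records]) + "\n"
```

For example, a hypothetical amplification could be passed as dict(seqid="chr20", source="SEAdb", so_type="copy_number_gain", start=29800000, end=31500000, attributes={"ID": "SEA_ab_1"}).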

A possible future SEAdb feature is data export in VCF format. The Variant Call Format (VCF) is another widely used format, introduced during the 1000 Genomes project. An important property of VCF files is that they can be used compressed, paired with a .tbi index file.

At present, exporting large tables runs into a memory issue, but a download page allows the full SEAdb dataset to be exported.


3.3 Visualizing biological data

Visualisation is increasingly important and challenging in life science as data grow rapidly in volume and complexity. The challenge is worth the trouble, since a pattern can be obvious when visualized with relevant tools whereas it would be almost impossible to detect in raw data. For instance, a recent publication shows a summary of a large body of published karyotype abnormalities (Fig. 3.2). This figure is valuable because it takes several days to gather all these data from many different publications. Patterns such as trisomies of chromosomes 12, 17, 20 or X are thus made immediately apparent.

Figure 3.2. Example of a figure that shows at a glance the pluripotent stem cell abnormalities reported in the literature.

Given this observation, it is essential to choose a good way to visualize the data we store in SEAdb. We chose to use the popular Circos graphs and a recent genome browser called JBrowse.


3.3.1 Circos

Circos [17] is a software package for visualizing data and information. It visualizes data in a circular layout, which makes it ideal for exploring relationships between objects or positions. There are other reasons why a circular layout is advantageous, not the least being the fact that it is attractive.

The creation of Circos was motivated by a need to visualize structural variation within a genome but it is now used to show many other features. Support exists for a variety of plot types, such as paired-location, scatter, line, histogram, heat map, tiles, glyph and text elements plots. Plots may be combined in a single track and multiple tracks are supported. Colours and positions of individual elements can be tuned to suit your application.

Circos is the tool we needed to visualize the whole genome and identify hotspots on it at a glance. A Circos graph is displayed on the SEAdb home page (Fig. 3.3); it is clickable and links to the corresponding positions in the genome browser. On this graph each color corresponds to a chromosome. The external histogram shows karyotypic abnormalities whereas the internal heatmap shows hotspots considering all SEAdb abnormalities.
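One simple way to obtain heatmap values of this kind is to count how many abnormalities overlap each fixed-size genomic bin. The Python sketch below illustrates the idea; the 1 Mb bin size and the function name are assumptions made for the example, and this is not necessarily how the SEAdb heatmap is computed.

```python
from collections import Counter


def bin_counts(intervals, bin_size=1_000_000):
    """Count, per (chromosome, bin index), the intervals overlapping that bin.

    intervals: iterable of (chrom, start, end) with 1-based inclusive
    coordinates. Bin 0 covers positions 1..bin_size, bin 1 the next
    bin_size positions, and so on.
    """
    counts = Counter()
    for chrom, start, end in intervals:
        first = (start - 1) // bin_size
        last = (end - 1) // bin_size
        for index in range(first, last + 1):
            counts[(chrom, index)] += 1
    return counts
```

Bins with the highest counts are candidate hotspots; the counts could then be written to a plain text file with one line per bin for plotting.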

We can notice the close agreement between the Circos graph and figure 3.2: chromosomes 12, 17 and X are involved in many karyotype abnormalities, as is a particular part of chromosome 20 called 20q11.21.

The main asset of this Circos graph (Fig. 3.3) compared to the summary figure 3.2 is that it is automatically updated when data are imported in SEAdb. It is also interactive: a researcher can click on a hotspot and visualize with a higher resolution what happens at this part of the genome thanks to the genome browser.

More Circos graphs are displayed on a page showing SEAdb translocation and other data (http://SEAdb.org/?q=circos). Each sample also has its own Circos graph, which lets the user visualize and browse its abnormalities through hyperlinks to JBrowse.

3.3.2 JBrowse

JBrowse [19] is a recent genome browser. Genome browsers are graphical interfaces that display genomic data from a biological database. They enable researchers to visualize and browse entire genomes with annotated data including gene prediction and structure, proteins, expression, regulation, variation, comparative analysis, etc. Annotated data usually come from multiple diverse sources.


Figure 3.3. Circos graph displayed on SEAdb home page.

Why JBrowse?

Wikipedia lists almost 40 genome browsers. Most of them are specialized and/or charge for use, and only a few are web applications. In my view two of them stand out from the crowd: the UCSC [20] and Ensembl [12] genome browsers. They are widely used by the scientific community because of the colossal amount of data available online.

Although instances of the UCSC and Ensembl genome browsers can be installed locally, their main asset is the online data that comes with them. To embed a genome browser in a website, packaged tools are available and better suited.

We focused on two genome browsers, GBrowse and JBrowse, because they are developed by the GMOD consortium, which also developed Chado and Tripal. Compatibility with Chado would thus be easier to handle.


JBrowse is a genome browser with a fully dynamic HTML5 user interface, being developed as the successor to GBrowse. It is very fast and scales well to large datasets. JBrowse differs from most genome browsers in that it distributes work between the server and the client, and therefore incurs significantly less server overhead than previous genome browsers. This inherent feature is why I chose JBrowse. Here are two examples, originally presented in the JBrowse publication⁴, that explain the importance of this feature:

• In most genome browsers, to scroll the displayed region, the user presses the ’pan left’ button or another navigation control; the browser transmits the changed coordinates to the server, and the process repeats itself. With this design, every action (such as moving to a different part of a chromosome or changing how the data are displayed) reloads the entire genome browser page, which incurs a delay and makes the user experience ’choppy’. This manner of progressing through a series of static pages disrupts the user's attention. Since navigating through large volumes of information requires these actions to be done frequently, the disruptions add up.

• Another common implementation drawback is that the server generally does most of the work involved in showing genomic data to the user. Typically, a program running on the server has to query a database for genomic information in the region the user is viewing, and then render a static pictorial representation of that region, which the web browser passively displays. In this type of system, the server incurs the majority of the computational expense involved, which increases with the number of users and with the amount of genomic data. As that computational expense increases, so does the amount of time the user has to wait for each new page.

JBrowse lets the user interact with the application without having to wait for the server; communication between the web browser and the server takes place asynchronously in the background. This strategy, inspired by Google Maps, makes the genome browser fast and smooth. However, it also limits users to web browsers that support JavaScript and requires the user's machine to be powerful enough to compute what was originally computed on the server (the web browser has occasionally crashed). Another drawback, and not the least, is that JBrowse is very young and thus lacks documentation.

A detailed presentation of JBrowse functionalities is available at http://SEAdb.org/getstarted.

4JBrowse: A next-generation genome browser, Genome Research, 2009.


How is JBrowse used in SEAdb?

Although I focused on JBrowse because of its possible compatibility with the Chado schema, I did not make the most of this possibility. JBrowse accepts a configuration file in JSON format that specifies how to connect to a Chado database, which data to extract from it, and which JBrowse feature tracks to create to display the data. The script linking Chado to JBrowse uses the DAS protocol (see the frame on page 30), which allows interaction between different databases.

I struggled with this configuration and did not manage to link JBrowse to Chado in this way. I therefore decided to export Chado data to flat files (GFF) that are easily read by JBrowse. This way of linking Chado to JBrowse is not optimal, since data are duplicated on the server and synchronization between Chado and JBrowse is needed. This is one of the points that could be improved in the future.

Conclusion:

In this chapter, I presented the tools I used to develop the SEAdb website and to visualize data. Drupal is a developer-friendly CMS that is very flexible and powerful thanks to its module system. One of these modules is Tripal, which provides a complete API to interact with Chado. Circos and JBrowse are cutting-edge tools that greatly improve the user experience on the website and help to detect interesting spots on the genome. However, subchromosomal abnormalities cannot be interpreted without annotations coming from external databases. Our core visualization tool, JBrowse, therefore includes data from outside databases, which allows SEAdb data to be compared with other data. In the following chapter I will detail how SEAdb uses external databases to annotate its data.


Chapter 4

Enriching SEAdb with external databases

While it is clear that karyotype abnormalities render cell lines inappropriate for research or medical applications, the significance of subchromosomal abnormalities is uncertain. It is therefore becoming increasingly important to assess the threat of each individual genomic abnormality, in order to address how representative hPSC and their progeny are of normal human development and disease, and to address safety issues for clinical uses. To this end, it is essential to distinguish polymorphisms¹ from bona fide mutational events (Fig. 4.1), and to distinguish incidental abnormalities from those that genuinely alter the biology and/or safety of hPSC.

For this purpose, annotating each genetic abnormality stored in the database as recurrent in hPSC, found in cancer, or reported as a polymorphism becomes crucial. These annotations were made possible by the recent development of biological databases reporting polymorphisms or relating genetic abnormalities to phenotypes. Nevertheless, computing these annotations requires intersecting hPSC genetic abnormalities with other sets of abnormalities. It is therefore necessary to define a similarity relationship between abnormalities reported from various sources, and this is not a trivial question.

Once this is done, we will be able to compute a ’genomic passport’ for each sample that summarizes the load of genomic abnormalities and their potential impact.

1A polymorphism is usually taken to imply a minor allele frequency greater than 1%.


Figure 4.1. Example of hPSC abnormalities probably corresponding to polymorphisms: upper tracks correspond to hiPSC and hEPSC abnormalities while the bottom track corresponds to control polymorphisms.

4.1 Communication between biological databases

With the significant increase in the number of online biological databases (Fig. 2.1), inter-database communication protocols seem essential. A good example of such a protocol is the Distributed Annotation System (BioDAS).

BioDAS defines a communication protocol used to exchange annotations on genomic or protein sequences. It is motivated by the idea that such annotations should not be provided by single centralized databases, but should instead be spread over multiple sites. The advantages of this system are that control over the data is retained by data providers, data are freed from the constraints of specific organisations, and the normal issues of release cycles, API updates and data duplication are avoided.

DAS is a client-server system in which a single client integrates information from multiple servers. It allows a single machine to gather up sequence annotation information from multiple distant web sites, collate the information, and display it to the user in a single view. Little coordination is needed among the various information providers. DAS is heavily used in the genome bioinformatics community.

A DAS-enabled website or application can aggregate complex and high-volume data from external providers in an efficient manner. For the biologist, this means the ability to "plug in" the latest data, possibly including a user’s own data. For the application developer, this means protection from data format changes and the ability to add new data with minimal development cost.

As explained above, such a tool keeps aggregated data up to date and avoids data duplication. It seems to me to be a sustainable way to communicate between databases. However, this protocol is not used by all biological databases, and setting up this communication system takes time (at least the first time!).

Most databases provide their data as flat files available on FTP servers. Although this approach is more constraining in terms of release cycles and generates data duplication, it is easier to set up and data are obtained far faster (all data are stored locally). It also allows an external database to be queried whether or not its server is available. For these reasons, this is the way that has been chosen to communicate with external databases. The main drawback is that updates must be performed manually each time a new external release is available.

4.2 Presentation of relevant databases

The number of databases storing genomic variants is large and growing, and it would be an arduous job to list them all here. However, a small set of them stand out from the crowd, either because they are developed by central biotechnology organizations (e.g. NCBI, Sanger Institute) and/or because they are perfectly relevant to our needs.

These databases tell us what a given genomic variant is, but it is also important to consider where a genomic variant occurs. For example, an abnormality found in a functional site of the genome should not be considered in the same way as an abnormality found in a non-coding part of the genome. This is why the following list also includes Pfam, a database referencing the functional parts of the genome.

4.2.1 Pfam (Sanger Institute)

Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. Pfam allows the identification of domains that occur within proteins. It can therefore provide insights into their functions.

Originally, Pfam lets users analyze their own protein sequences for Pfam matches. However, the UCSC genome browser maps the Pfam-A² domains found in transcripts of the UCSC Genes track onto the genome. This track is highly significant for SEAdb since it corresponds to functional regions of the genome, which are usually highly conserved. It is available in the SEAdb genome browser, and abnormalities found in Pfam domains are highlighted as such.

4.2.2 COSMIC - The Catalogue Of Somatic Mutations In Cancer (Sanger Institute)

All cancers arise as a result of the acquisition of a series of fixed DNA sequence abnormalities, mutations, many of which ultimately confer a growth advantage upon the cells in which they have occurred. There is a vast amount of information available in the published scientific literature about these changes. COSMIC is designed to store and display somatic mutation information and related details and contains information relating to human cancers.

COSMIC provides a wide list of abnormalities including (in version 67 oct-2013) 1,592,109 mutations, 422,314 copy number variations, 9,190 fusions and 7,584 structural rearrangements. It also lists 513 genes known to be involved in cancer.

With regard to our goal of assessing the threat of each individual genomic abnormality, it is relevant to look for hPSC abnormalities that have a counterpart in the COSMIC dataset or reside in one of the 513 genes known to be involved in cancer. Nonetheless, what counts as a counterpart is a debatable question that will be examined further below.

4.2.3 dbSNP - The Single Nucleotide Polymorphism Database (NCBI)

The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic polymorphisms. Although the name of the database implies a collection of one class of polymorphisms only (i.e., single nucleotide polymorphisms (SNPs)), it in fact contains a range of molecular variation: SNPs, small-scale multi-base deletions or insertions (also called deletion insertion polymorphisms or DIPs), retroposable element insertions and microsatellite repeat variations (also called short tandem repeats or STRs). Each dbSNP entry includes the sequence context of the polymorphism (i.e., the surrounding sequence), the occurrence frequency of the polymorphism (by population or individual), and the experimental method(s), protocols, and conditions used to assay the variation.

2Pfam-A is the manually curated portion of the database that contains over 10,000 entries.


dbSNP is needed because of the simplified way in which the reference genome is represented. The current reference genome is an average genome, but a single tiling path is insufficient to represent a genome in regions with complex allelic diversity.

Indeed, allelic diversity is indistinguishable from a rare genetic mutation: both are seen as genetic variants relative to the reference genome. To overcome this problem, a list of referenced polymorphisms is needed to determine whether a genetic variant is a simple polymorphism or an unknown genetic abnormality.

The quality of the data found on dbSNP has been questioned by many research groups, including a high false positive rate due to genotyping and base-calling errors. At least two studies deserve to be cited:

1. Musumeci et al. [2010] showed how some SNPs listed in dbSNP arise through amplification and sequencing artifacts attributable to paralogous (duplicated) genes. They define these as single nucleotide differences (SNDs) and show that as many as 8.32% of the SNPs in dbSNP could be SNDs.

2. A meta-analysis of four studies designed to estimate general dbSNP error rates [Mitchell et al., 2004] estimated a 15 to 17% false-positive rate.

NGS data will now rapidly come to dominate dbSNP content, bringing with them errors from the read and assembly processes of the various NGS platforms. Fortunately, genomic abnormalities can be filtered according to the validation tag associated with each of them, which comprises:

(i) Validated by multiple, independent submissions to the refSNP cluster

(ii) Validated by frequency or genotype data: minor alleles observed in at least two chromosomes

(iii) Validated by submitter confirmation

(iv) All alleles have been observed in at least two chromosomes apiece

(v) Genotyped by HapMap project

(vi) SNP has been sequenced in 1000Genome project

(vii) Suspect SNPs: snp suspected from paralogous region.

4.2.4 ClinVar (NCBI)

Recently released (2012), ClinVar (Clinical Variant database) reports relationships among human variations and phenotypes, with supporting evidence. In ClinVar, the description of the genotype/phenotype relationship consists in part of a clinical significance attribute that can take nine different values: Not provided, Benign, Likely benign, Unknown, Likely pathogenic, Pathogenic, Protective, Drug response, Susceptibility.

The significance of any particular variant of this dataset should be carefully interpreted since inclusion of a variant in this dataset is not necessarily an indicator of risk.

4.2.5 Database of Genomic Variants (The Centre for Applied Genomics) and dbVar (NCBI)

It is well known that variations at the single nucleotide level are abundant across the genomes of all species. However, it is becoming clear that genomic structural variation (that is, variation ranging from tens to millions of base pairs in size, including insertions, deletions, inversions, translocations and locus copy number changes) accounts for more of the individual differences at the base pair level in humans and is likely to play a major role in disease.

dbVar operates in close cooperation with the Database of Genomic Variants Archive (DGVa)³, a sister database at the European Bioinformatics Institute (EBI).

dbVar and DGVa both accept data submissions, and use similar data models and submission templates. After regular monthly syncing, dbVar and DGVa contain the same data. After the data have been made public at dbVar and DGVa, they may also be imported by the Database of Genomic Variants (DGV) at the Centre for Applied Genomics in Toronto (Fig. 4.2).

Figure 4.2. Organizations curating, annotating or displaying genomic structural variations.

DGV only contains structural variation identified in healthy control samples. It provides a useful catalog of control data for studies aiming to correlate genomic variation with phenotypic data. The database is continuously updated with new data from peer-reviewed research studies.

3http://www.ebi.ac.uk/dgva/.

Conclusion of section 4.2:

Overall, with the completion of projects such as the HapMap⁴ project [21] and the 1000 Genomes⁵ project [22], the study of genomic variants is a topical field in which changes occur very fast. The field must therefore be monitored continuously so that relevant new genome variant databases and new releases of existing databases can be integrated.

For instance, at the end of October 2013, COSMIC released version 67, a major release that includes for the first time structural variants and mutations found in cancer that are located in non-coding areas of the genome.

4.3 Using external databases to classify hPSC genetic abnormalities

The databases presented in the previous section help to assess the threat of hPSC abnormalities and thus assess the genetic integrity of stem cells which is needed for the development of stem cell based therapy. Thus, with these data, we could try to quantify how bad a variation is.

Although this quantification is not easy to implement, it is clear that one should not pay the same attention to a silent coding substitution as to a substitution creating a stop codon in the middle of the tumor suppressor p53! With regard to the available databases described above, we can attempt to classify genomic abnormalities into four classes, which could be:

1. Critical genomic abnormalities: abnormalities at karyotype scale, > 10 Mb (i.e. variants large enough that they would almost certainly be visible on a karyotype), and abnormalities that are also found in COSMIC (mutations involved in cancer) AND for which there is clear indication in the medical literature that they can induce cancer, e.g. a point mutation in p53 that is known to inactivate its tumor suppressor function. Dominant mutations as well as "recessive" abnormalities that must be present on both alleles to lead to cancer will be considered "critical".

2. Severe genomic abnormalities: abnormalities that are present in COSMIC but NOT in the Database of Genomic Variants or dbSNP, and for which there is no clear indication in the medical literature that they can induce cancer. Of note, a frequency threshold will have to be defined for data coming from dbSNP.

4The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings.

5The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation.


3. Potentially not dangerous genomic abnormalities: genomic variants present in genetic polymorphism databases but NOT in COSMIC nor Decipher. These genomic variants are those that are not classified as "1" nor "2" AND are present in either the Database of Genomic Variants or dbSNP.

4. Genomic abnormalities of unknown significance: abnormalities that are classified as neither "1", "2" nor "3". This category will include genomic alterations that induce a stop codon or alter the reading frame of known genes but that are not in the above-mentioned databases.
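The four classes above can be read as an ordered decision rule. The following Python sketch is one possible reading, not a finished implementation: the field names are hypothetical, the dbSNP frequency threshold and the Decipher lookup are left out, and class 3 follows the operational definition (not "1" nor "2", and present in DGV or dbSNP).

```python
from dataclasses import dataclass
from enum import Enum


class ThreatClass(Enum):
    CRITICAL = 1
    SEVERE = 2
    POTENTIALLY_NOT_DANGEROUS = 3
    UNKNOWN = 4


KARYOTYPE_SCALE_BP = 10_000_000  # the 10 Mb threshold of class 1


@dataclass
class Annotated:
    # Hypothetical flags, assumed to be filled in by earlier annotation steps.
    size_bp: int
    in_cosmic: bool = False
    known_cancer_driver: bool = False   # clear indication in the literature
    in_polymorphism_db: bool = False    # present in DGV or dbSNP


def classify(ab: Annotated) -> ThreatClass:
    """Apply the rules in order; the first matching class wins."""
    if ab.size_bp > KARYOTYPE_SCALE_BP or (ab.in_cosmic and ab.known_cancer_driver):
        return ThreatClass.CRITICAL
    if ab.in_cosmic and not ab.in_polymorphism_db:
        return ThreatClass.SEVERE
    if ab.in_polymorphism_db:
        return ThreatClass.POTENTIALLY_NOT_DANGEROUS
    return ThreatClass.UNKNOWN
```

Evaluating the rules in a fixed order keeps the classes mutually exclusive, which the wording "neither 1, 2 nor 3" of the last class implies.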

It is important to keep in mind that this classification is a first attempt and will have to prove its relevance on test data. Although I set up the first steps to perform this classification, the work remains to be finished in the future. Moreover, the fine tuning of the classification proposed here will be subject to further research.

This classification requires intersecting hPSC genetic abnormalities with other sets of genetic variants, such as those stored in dbSNP, DGV and COSMIC. This intersection is not obvious. When should two genetic variants be considered similar? Should a substitution A->G be considered similar to a substitution A->T if they happen at the same position? What if the corresponding codons are similar? What if we do not have any information about the corresponding codons? And what about CNVs? What should be decided if there is only a partial overlap?

This raises the question of how to intersect two sets of genome variants and what would be a biologically meaningful way to define a similarity relationship between abnormalities from SEAdb and other sources.
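For interval-like variants such as CNVs, one commonly used criterion for the partial-overlap case is reciprocal overlap: two intervals are treated as similar when their shared region covers at least a given fraction of each of them. The Python sketch below illustrates this criterion only; it is not the similarity relationship adopted in SEAdb, and the 50% threshold is an assumption.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of shared positions between two 1-based inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def reciprocal_overlap(a, b):
    """Smaller of the two covered fractions; a and b are (chrom, start, end)."""
    if a[0] != b[0]:
        return 0.0
    shared = overlap_bp(a[1], a[2], b[1], b[2])
    len_a = a[2] - a[1] + 1
    len_b = b[2] - b[1] + 1
    return min(shared / len_a, shared / len_b)


def similar(a, b, threshold=0.5):
    """True when each interval is covered by the other for at least `threshold`."""
    return reciprocal_overlap(a, b) >= threshold
```

Taking the smaller of the two fractions prevents a small variant nested inside a very large one from being counted as a match, which a one-sided overlap test would allow.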

4.4 Using DGV to compute a polymorphism confidence score

To classify abnormalities as proposed in the previous section, it would be convenient to label each SEAdb abnormality according to whether or not it is a well-known polymorphism. Since we might not be able to give a clear answer to this question, we decided to label abnormalities with a score ranging from 0 to 1 that corresponds to our confidence that the abnormality is a polymorphism. I will now explain how we decided to compute this score.

4.4.1 Some background about the data we are looking at

Fact 1: Depending on the method used for detection of the CNV, the boundaries reported may be quite different from the actual underlying variant. This is obvious when looking at regions where a large number of different studies have reported the same variant. The data must therefore be interpreted with this in mind.