8. Life Sciences and Molecular Medicine

8.1 Genomics

Genome sequencing is now used in a variety of fields and generates extremely large data sets. Sequencing has evolved dramatically since the publication of Sanger's method in 1977. The early sequencing methods, which were based on electrophoretic separation of polynucleotides on gels, were laborious and costly and could only be used for reading sequences from small genomes such as those of viruses. In 1987 an American company, Applied Biosystems, introduced an automated version of the Sanger method, enabling larger sequencing tasks. In the late 1980s a debate started as to whether the time was ripe to sequence the whole human genome. Although it was anticipated that superior sequencing technologies would be introduced during the course of the Human Genome Project (HGP), the two drafts of the human genome that were announced in 2000 were obtained with automated Sanger technology. Since the completion of the HGP, several dramatic improvements in sequencing have been made, all of which are non-gel-based. Several competing instruments are currently on the market, each with different advantages. A common feature is that they are capable of producing gigabases of sequence daily. The improvement in speed has been accompanied by a dramatic reduction in cost. The current goal is to sequence a complete human genome with high accuracy within a week and at a cost of less than 1000 US dollars. This goal will be reached in the near future. The introduction of vastly more powerful and cost-efficient sequencing technologies makes it possible to address new types of biological questions, such as studies of population structures, identification of mutations in cancer cells, identification of pathogens in crude samples, and comparisons of entire microbiota.

The new molecular technologies have changed the face of genetics, and a key challenge is to identify genetic variants in patient genomes that are associated with common diseases such as hypertension, heart disease, asthma, and schizophrenia. The information contained in genomes is not restricted to the nucleotide sequence but also includes epigenetic modifications of the nucleotides, which can now be studied.

The use of sequencing methodology has increased drastically in Sweden in recent years. With the development of the genomics platforms at the National Genomics Infrastructure, hosted by the Science for Life Laboratory (SciLifeLab), and at a few smaller facilities, the data output has increased dramatically, and with it the need for both computing and data storage. In the last three years, the number of projects run at the Stockholm platform has increased from a few hundred to several thousand, and the number of computer core hours has increased accordingly.

While it was previously possible to carry out the computational tasks associated with the bioinformatics analyses in-house, this is now impossible. The datasets are so large that computing clusters with vast amounts of memory are needed, together with large amounts of storage space.

Internationally, several genome centres have built up their own computing platforms, but in Sweden this has not been possible. Researchers are thus now dependent on the available national infrastructure provided by SNIC (the Swedish National Infrastructure for Computing). The general idea is that SNIC supplies and supports the hardware, while different subject-specific organizations provide software and assistance for users. Of the SNIC platforms, the genomics field currently mostly uses UPPMAX, owing to its close association with SciLifeLab and the transfer of specific competence in bioinformatics from SciLifeLab to UPPMAX. The expansion of the use of genomics to many more fields and users makes it necessary to expand organizations such as BILS (Bioinformatics Infrastructure for Life Sciences) and WABI (Wallenberg Bioinformatics Infrastructure).

Figure 8.1: Sequencing has progressed much faster than computer performance. Image: National Human Genome Research Institute.

There is general agreement that the sequence data output will continue to increase exponentially, both through technology improvements and through the addition of more users and projects. The existing platforms are already beginning to be insufficient to cope with all the projects, as shown by the lengthening waiting times.

8.1.1 Potential breakthroughs

Genomic as well as epigenetic information will be collected from thousands of individuals, from different tissues and in some cases from single cells.

Mechanisms behind so-called complex diseases, which depend on more than one gene variant, will be unravelled, opening new possibilities for diagnosis and treatment of many disabling diseases in the western world.

The role of microbial populations for human health will be understood.

Genome information will become a routine clinical tool guiding physicians in diagnosis as well as in therapeutic choices.

8.1.2 e-Science requirements

The infrastructure platforms that are in place at the moment seem to be working well in most cases, although there are concerns as to whether the necessary increases in capacity can be obtained without completely replacing the platforms. The close interaction between UPPMAX and SciLifeLab has proven successful, and it is clear that closer communication between everyone involved, including SNIC facilities, software infrastructure platforms and users, will be needed in the future. Possibly all these facilities should be organized under a common higher structure, in order to ensure that they are able to meet the needs of the users.

A continuous and rapid increase in the capacity for both computing and storage is needed, in both the short and the long term, based on the projections for the increase in data output. At present, the technology for sequence data production is developing at a more rapid pace than computing power. Possible solutions include improved computer technology, providing both superior computational power and larger storage, and there are indications that such developments will come. Another possibility is the use of cloud computing, although its cost is currently prohibitive. This will most likely change, and it is a very promising area which could completely change the hardware needs in the future.
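The widening gap between data production and computing power can be illustrated with simple compound-growth arithmetic. The following is a minimal sketch and not a forecast: the function names and the two doubling times are hypothetical values chosen only for illustration. They are not figures from this report or from SNIC. The sketch shows that, if both quantities grow exponentially but with different doubling times, the ratio between them also grows exponentially.

```python
def growth_factor(years: float, doubling_time_years: float) -> float:
    """Factor by which an exponentially growing quantity increases over `years`."""
    return 2.0 ** (years / doubling_time_years)


def capacity_gap(years: float,
                 data_doubling_years: float,
                 compute_doubling_years: float) -> float:
    """Ratio of data growth to compute growth over the same period."""
    return (growth_factor(years, data_doubling_years)
            / growth_factor(years, compute_doubling_years))


if __name__ == "__main__":
    # Hypothetical doubling times, assumed for illustration only.
    DATA_DOUBLING_YEARS = 1.0      # assumed: sequence output doubles yearly
    COMPUTE_DOUBLING_YEARS = 2.0   # assumed: compute capacity doubles every 2 years

    for years in (2, 4, 6):
        gap = capacity_gap(years, DATA_DOUBLING_YEARS, COMPUTE_DOUBLING_YEARS)
        print(f"after {years} years, data per unit of compute "
              f"has grown by a factor of {gap:.1f}")
```

Under these assumed values the gap itself doubles every two years, which is why incremental hardware upgrades alone may not keep pace. Replacing the placeholders with measured growth rates would turn the sketch into a rough planning aid.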