BIOINFORMATICS
By Assistant Prof. Yeong Foong May,
Dept of Biochemistry, National University of Singapore

Introduction
The need for Bioinformatics
The use of Bioinformatics


Introduction:

Living organisms are complex systems organised in a hierarchical manner, from organic macromolecules and genes to cells, and from cells to tissues, organs and whole organisms. The field of biology attempts to provide, through the study of the diverse array of organisms such as bacteria, fungi, plants and animals, a scientific approach to the understanding of living systems, or life. Although the roots of biology can be traced back to the ancient Greeks, biology as we know it today came about during the nineteenth century through developments in several major fields of study. Broadly speaking, modern biology is based on the concepts of cell theory, embryology, homeostasis, evolutionary theory and genetics.

The late 1980s saw the convergence of different fields of study in biology, together with information technology, into what is known as Bioinformatics. These areas include molecular biology, genetics, genomics, biochemistry, evolutionary biology and computational technologies. It is hoped that, with the help of Bioinformatics, an understanding of biology can be attained faster and with a more global perspective as we look for unifying principles governing life across the evolutionary divide. The beginning of Bioinformatics saw the development of algorithms that enabled protein and nucleic acid sequences to be stored in annotated databases, in a manner that allows researchers to exchange information about gene and protein sequences easily and quickly. Though many people are of the opinion that the field of Bioinformatics was driven by the advent of DNA sequencing technologies, it in fact started with the development of protein sequencing technologies. As computing power improved in the wake of the Second World War, the growth of the protein databases was further helped along by protein crystallisation methodologies and the storage of protein structures in the databases.

The need for Bioinformatics:

Beneath the diversity among organisms lie certain similarities, since different organisms evolved from pre-existing ones. Databases containing genetic or protein sequences therefore allow researchers to compare genes and gene functions across organisms for which information is available. Also, with integrated databases such as Entrez (see below), it is possible to search for literature, DNA sequences and protein sequences at one site, with hyperlinks across the databases. Tools are provided that enable researchers to query sequences and perform comparisons to identify homologous gene or protein sequences. Researchers working on a model organism for a particular disease can thus study the relevant genes in that organism and use the information for further studies and analyses in humans.

Examples of sequence databases:
  • GenBank – database of DNA sequences collected from GenBank at NIH, the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ).
  • Genomic databases at NIH
  • SWISS-PROT – database of proteins with descriptions of protein function, domain structures, post-translational modifications etc.
Sequence similarity search engines:
  • NCBI BLAST similarity search servers

    Tool      Query     Database   Comparison level   Applications
    blastn    DNA       DNA        DNA                Search for similarities or identities in DNA sequences
    blastp    Protein   Protein    Protein            Search for homologous proteins
    blastx    DNA       Protein    Protein            New DNA sequence searched against database to look for genes and homologous proteins
    tblastn   Protein   DNA        Protein            Look for genes in DNA
    tblastx   DNA       DNA        Protein            Look for gene structure

    Others: Pairwise blast, Genomic blast, Specialised blast

     
  • EMBL similarity search engines
     
  • Pfam protein family search engines
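
The BLAST table above amounts to a small lookup from query type, database type and comparison level to a program name. A minimal Python sketch of that lookup is given here. The function name choose_blast_program and its parameters are hypothetical, made up for illustration, and are not part of any NCBI software.

```python
def choose_blast_program(query_type, database_type, compare_as_protein=False):
    """Pick one of the five basic BLAST programs listed in the table.

    query_type and database_type are "DNA" or "protein" (case-insensitive).
    compare_as_protein only matters for DNA against DNA: it selects tblastx,
    where both sides are translated and compared at the protein level.
    """
    query = query_type.lower()
    database = database_type.lower()

    if (query, database) == ("dna", "dna"):
        return "tblastx" if compare_as_protein else "blastn"

    programs = {
        ("protein", "protein"): "blastp",
        ("dna", "protein"): "blastx",
        ("protein", "dna"): "tblastn",
    }
    if (query, database) not in programs:
        raise ValueError("query and database must each be 'DNA' or 'protein'")
    return programs[(query, database)]


if __name__ == "__main__":
    # A newly sequenced piece of DNA searched against a protein database.
    print(choose_blast_program("DNA", "protein"))
```

The sketch only selects a program name; running an actual search still requires one of the BLAST servers listed above.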
     
Integrated database:

Entrez, developed by the National Center for Biotechnology Information (NCBI), is an integrated system of databases. It gives access to:
  • Nucleotide sequences
  • Protein sequences
  • Molecular modelling 3-D structures
  • Genomic database
  • Maps database
  • Taxonomy database
  • PubMed – a literature database where abstracts of articles in the field of biomedical sciences can be accessed

The use of Bioinformatics:

Over the years, Bioinformatics has grown to include other areas of computational capabilities such as protein modelling for structure prediction, genome mapping, gene prediction and determining evolutionary relationships between organisms, among others. In the field of protein structure and function studies, X-ray crystallography has been a mainstay for understanding how protein folding translates into function given a specific amino acid sequence. However, not all protein structures can be solved using X-ray crystallography, as not all proteins can be crystallised. In other cases, proteins that can be crystallised may fail to give good diffraction patterns from which their structures can be deduced. Using the database of proteins whose structures have been elucidated, scientists have been able to use algorithms to predict the structures of homologous proteins, that is, proteins with very similar sequences. This is known as homology modelling.



Sequence and structure homology – a protein whose crystal structure has been solved can be used to predict the structure of another protein with a similar amino acid sequence.
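
Whether a solved structure is a plausible template depends on how similar the two sequences are. As a rough sketch, percent identity over two already-aligned sequences can be computed as follows. The function names, the example sequences and the 30% cut-off are assumptions for illustration (a commonly quoted rule of thumb, not a figure from this article); real modelling servers use far more elaborate scoring.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over aligned, non-gap positions.

    Both inputs must already be aligned to the same length, with '-'
    marking gaps. Positions where either sequence has a gap are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def is_plausible_template(aligned_target, aligned_template, cutoff=30.0):
    """Assumed rule of thumb: accept templates at or above the cut-off."""
    return percent_identity(aligned_target, aligned_template) >= cutoff


if __name__ == "__main__":
    # Hypothetical aligned fragments, invented for this example.
    target = "MKT-AYIAKQR"
    template = "MKTLAYLAKQR"
    print(percent_identity(target, template))
    print(is_plausible_template(target, template))
```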


Using the database of proteins which have been crystallised, researchers are attempting to understand how local and global interactions between amino acids in the protein sequence affect protein folding. This is useful for predicting the structures of proteins for which no structures of homologous proteins are known. In some instances, proteins with different sequences are able to assume similar structures; methods known as threading are available for predicting the structures of such proteins. Bioinformatics has thus provided a means of studying how proteins function through structure analyses. Using information from such studies, it is possible to develop rational designs for drugs which could target and inactivate certain enzymes from infectious organisms or mutated proteins. Presently, several research institutes host websites that provide on-line structure prediction to assist researchers in predicting the structures of their proteins-of-interest.

Examples of Protein structure databases:

Examples of Protein analysis tools:
  • SWISS-MODEL - automated knowledge-based protein modelling server.
  • Rasmol - software for structure display and analysis.


The genome of the unicellular model organism, the budding yeast, was completely sequenced in 1996. The first multicellular organism whose genome was sequenced to near completion was the nematode Caenorhabditis elegans, in 1998. The success of this sequencing project was in large part due to the sheer determination and courageous efforts of those involved from the Wellcome Trust Sanger Centre at Hinxton, UK and Washington University at St Louis, USA, who began at a time when sequencing technology was still limited. The C. elegans genome sequencing project also served as a model in which DNA sequencing technologies were tested and improved upon for the subsequent Human Genome Project. The Human Genome Project was driven mainly by improved sequencing strategies and automated sequencing technologies used both by the International Human Genome Sequencing Consortium and Celera, USA. Automation allowed the rapid sequencing of the human genome and, at the same time, improvements in computing power enabled large annotated databases for the human genome to be built, along with tools for analysing the sequences.

Databases of several genome sequences:

The NCBI site below lists all the genomes available for easy access. NCBI is hoping to consolidate a genomic database of the sequenced organisms by working closely with the sequencing centres.

http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/euk_g.html

Complete genomes:
  • Anopheles gambiae
  • Arabidopsis thaliana
  • Caenorhabditis elegans
  • Drosophila melanogaster
  • Encephalitozoon cuniculi genome
  • Guillardia theta nucleomorph genome
  • Saccharomyces cerevisiae
  • Schizosaccharomyces pombe
Maps:

Vertebrates                 Plants
Homo sapiens                Avena sativa (oat)
Mus musculus                Hordeum vulgare (barley)
Rattus norvegicus (rat)     Oryza sativa (rice)
Zebrafish (Danio rerio)     Triticum aestivum (bread wheat)
                            Zea mays (corn)


While various efforts are going on to sequence the genomes of different organisms, it is useful to have a platform on which one can orientate the sequences. This platform, known as a genomic map, allows researchers to pin down the location of their gene-of-interest to a particular region on a chromosome simply by searching the database websites and using the tools available for displaying such maps. For example, the 'MapViewer' from the National Center for Biotechnology Information is such a tool: it graphically displays the data collected on the different human chromosomes, with links to information on regions of the chromosomes where available. Bioinformatics has enabled the integration of different sources of data such that it is easy to obtain all that is known from one website. The NCBI provides access on the internet to the genomes of about 800 different organisms.

The sequence of the human genome contains sequences of genes with known functions as well as many others whose functions are as yet unknown. Tools are available to facilitate the search for regions within the genome sequence where genes could be present, i.e. computational genefinding. For example, the Gene Finder software is useful for predicting whether a particular sequence-of-interest codes for a putative protein. A DNA sequence is predicted to be a gene sequence according to certain criteria. Algorithms are programmed to look for specific sequences such as promoter sequences and transcription factor binding sites. These local sequences are known as signals, and their recognition by the software is called signal sensing. Another factor examined in computational genefinding is content, that is, the variable-length exons and introns; the methods for detecting these are known as content sensing. However, there are limitations to using either signal or content sensors alone to predict the presence of a gene sequence. Integrated systems have therefore been developed that combine both methods, and yet others make use of knowledge from DNA and protein databases in multiple statistical methods for gene prediction. These algorithms for computational genefinding are useful for determining whether the sequence one has at hand could potentially function as a protein-coding sequence in the organism. In addition, such programs are also useful for determining the mechanisms involved in gene expression.
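
The distinction between signal sensing and content sensing can be illustrated with a deliberately simplified Python sketch. It is not how Gene Finder or any real genefinder works. As stated assumptions, the signal sensor looks only for the exact motif TATAAA (a stand-in for a promoter signal), and the content sensor looks only for open reading frames on the forward strand, ignoring introns entirely. All function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_signal(dna, motif="TATAAA"):
    """Signal sensing (toy version): start positions of an exact motif."""
    dna = dna.upper()
    positions = []
    start = dna.find(motif)
    while start != -1:
        positions.append(start)
        start = dna.find(motif, start + 1)
    return positions


def find_orfs(dna, min_codons=3):
    """Content sensing (toy version): forward-strand open reading frames.

    Returns (start, end) pairs, where start is the index of the ATG and
    end is one past the stop codon. min_codons counts codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def candidate_genes(dna, max_gap=50):
    """Integrated sensing (toy version): ORFs with a signal shortly upstream."""
    signals = find_signal(dna)
    return [
        (start, end)
        for start, end in find_orfs(dna)
        if any(0 < start - s <= max_gap for s in signals)
    ]


if __name__ == "__main__":
    # Made-up sequence: a TATAAA motif, a spacer, then a short ORF.
    sequence = "GGTATAAA" + "GGCC" + "ATGGCTGCAAAATGA" + "CC"
    print(find_signal(sequence))
    print(find_orfs(sequence))
    print(candidate_genes(sequence))
```

Combining the two sensors in candidate_genes mirrors, in miniature, the integrated systems described above: neither a motif alone nor an open reading frame alone is taken as sufficient evidence.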

Bioinformatics also includes many other tools for the analysis and data-mining of the wealth of information now available on the web or obtained from high-throughput experiments. From the large collection of experimental data accumulated over the years in areas such as metabolism and cellular signalling, researchers are hoping to build a network of metabolic or signalling pathways so as to obtain a more global picture of how cellular processes are integrated in the cell.

An example of an interesting website to check out:

http://e-cell.org/

E-CELL consists of a set of software tools to allow researchers to perform an ‘in silico’ experiment to tweak the metabolic pathways of a cell. Basically, the investigator can specify a cell's genes, proteins, and other molecules of interest and describe their individual interactions. By using the tools provided, they can then model how the pathways interact together as a system. This may be useful for researchers interested in studying metabolic pathways by introducing perturbations to the system ‘in silico’ and examining resulting effects.
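
The idea of perturbing a pathway 'in silico' can be illustrated with a toy model. The Python sketch below does not use E-CELL or its interface. It assumes a hypothetical two-step pathway A → B → C with made-up first-order rate constants, integrated with simple Euler steps, and compares the outcome before and after one step is slowed down.

```python
def simulate_pathway(k1, k2, a0=1.0, dt=0.01, steps=1000):
    """Euler integration of a toy pathway A -> B -> C.

    k1 is the rate constant for A -> B and k2 for B -> C (arbitrary units).
    Returns the final amounts (a, b, c) after steps * dt time units.
    """
    a, b, c = a0, 0.0, 0.0
    for _ in range(steps):
        flux_ab = k1 * a
        flux_bc = k2 * b
        a -= flux_ab * dt
        b += (flux_ab - flux_bc) * dt
        c += flux_bc * dt
    return a, b, c


if __name__ == "__main__":
    normal = simulate_pathway(k1=1.0, k2=1.0)
    # Perturbation: the second step is slowed, e.g. by a partially inhibited enzyme.
    perturbed = simulate_pathway(k1=1.0, k2=0.5)
    print("normal   :", normal)
    print("perturbed:", perturbed)
```

In this toy model, slowing the second step leaves more of the intermediate B and less of the end product C at the end of the run, which is the kind of system-level effect such simulations are meant to expose.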


In addition, large-scale experiments such as genome-wide studies using the budding yeast have been performed. Examples include genome-wide gene deletion studies, two-hybrid interaction studies, and studies of gene expression patterns in different mutant backgrounds, in the presence of specific drugs and so on. These have been published and some of the data made available on the web (e.g. http://genome-www4.stanford.edu/cgi-bin/SGD/reference/geneinfo.pl?topic=Genome-wide+Analysis is a page in the Saccharomyces cerevisiae genome database where a list of such studies and data has been put up). All these large-scale experiments have in one way or another required automation for data collection as well as the computational power of the computer for analysis, that is, Bioinformatics.

Now that the genomes of a variety of organisms have been sequenced, there is great interest in finding out what the functions of the gene products are. The analysis of genome-wide gene expression using DNA microarray technology (see below) gives a good indication of which genes are turned on or off in a particular instance. However, this may not correlate fully with the behaviour of the gene products, that is, the proteins, which are the active units of the genes. As such, there is interest in understanding what all the proteins encoded by the human genome do in a cell and how they are regulated. The term proteome is used to describe the collection of all the proteins encoded by the genes, and the study of the regulation and functions of the proteome is broadly known as proteomics. With this interest, traditional methods of protein analysis such as 2-D gel electrophoresis and mass spectrometry have been improved, and new methods with greater sensitivity are being developed. In addition, they have been automated for the collection of large sets of data. Databases of peptide masses and fragmentation patterns are available for searches, so that proteins of interest can be identified without sequencing the full length of the proteins. One can then identify the proteins which are present in the cell. There are also algorithms which one can use to predict the functions, cellular localisation and properties, such as structural domains, of the identified proteins.
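
Identification by peptide mass, as described above, rests on comparing measured masses with masses computed from a candidate sequence. The following Python sketch shows the principle only. As stated assumptions, it digests with the usual trypsin rule (cleave after K or R unless the next residue is P), uses standard monoisotopic residue masses rounded to five decimals, and matches within a fixed tolerance. The function names and the example protein are hypothetical, and no real search engine's scoring is reproduced.

```python
# Monoisotopic residue masses in daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056  # added once per peptide for the free termini


def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides = []
    current = ""
    for i, residue in enumerate(protein):
        current += residue
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER_MASS


def count_matches(observed_masses, protein, tolerance=0.5):
    """Number of observed masses explained by the protein's tryptic peptides."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(
        1
        for mass in observed_masses
        if any(abs(mass - t) <= tolerance for t in theoretical)
    )


if __name__ == "__main__":
    candidate = "AGKRPLMRGG"  # invented sequence for illustration
    for peptide in tryptic_digest(candidate):
        print(peptide, round(peptide_mass(peptide), 4))
```

A database search repeats this comparison against every candidate protein and reports the candidates that explain the most observed masses.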

Examples of databases for protein identification: Examples of protein property prediction tools:


Overview of DNA microarray:

Computational tools in biology make large-scale, high-throughput studies possible. One example of a high-throughput application currently in use in many studies is genome-wide gene expression analysis using DNA microarray technology. This technique involves obtaining biological samples as well as the use of automated robotics and computing capabilities.

How the technology works:

Messenger RNA, which is used as an indication of gene expression levels, is extracted from a control sample and a test sample and reverse transcribed into complementary DNA (cDNA). The cDNAs from the control sample are labelled with a green dye and those from the test sample with a red dye. They are then hybridised to a chip spotted with an array of short DNA fragments representing 6000 genes from the organism of interest. The hybridisation of the cDNA to the chip results in a pattern which can be analysed by specialised software. For example, a gene (A) on the chip that is hybridised only to cDNA from the control sample will show up as a green spot, and another gene (B) that is hybridised only to cDNA from the test sample will be red. This would mean that gene A is expressed in the control but not in the test, that is, gene A is turned off when subjected to the experimental condition. Gene B, on the other hand, is turned on in the experimental condition. In a third instance, where a particular gene (C) is hybridised by both the control and test cDNA, a yellow spot will show up. Using the software, one can then determine the difference in expression levels between the control and test samples.
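
The colour reading described above can be expressed numerically as a ratio of the two dye intensities. The Python sketch below is a simplified illustration, not the method of any particular analysis package. The function name, the two-fold threshold and the background floor are assumptions chosen for the example.

```python
import math


def classify_spot(green, red, fold_change=2.0, background=1.0):
    """Classify one spot from its control (green) and test (red) intensities.

    Returns (colour, log2_ratio), where log2_ratio is log2(red / green).
    Intensities below the assumed background floor are raised to it so
    that the ratio is always defined.
    """
    green = max(green, background)
    red = max(red, background)
    log_ratio = math.log2(red / green)
    threshold = math.log2(fold_change)

    if log_ratio >= threshold:
        return "red", log_ratio      # higher in the test sample
    if log_ratio <= -threshold:
        return "green", log_ratio    # higher in the control sample
    return "yellow", log_ratio       # similar in both samples


if __name__ == "__main__":
    # Made-up intensities for genes A, B and C from the text.
    spots = {"A": (1000.0, 50.0), "B": (50.0, 1000.0), "C": (800.0, 900.0)}
    for gene, (green, red) in spots.items():
        print(gene, classify_spot(green, red))
```

Working with the logarithm of the ratio treats a two-fold increase and a two-fold decrease symmetrically, which is why it is a common choice for this kind of comparison.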

An example of a useful application of DNA microarray technology:

Differential gene expression patterns of a large number of genes (e.g. 5000) can be obtained by comparing normal and diseased tissues, and the information can be used as a prognostic tool where the treatment given to patients depends on their gene expression profiles.

The DNA microarray technology will be elaborated upon in the following section, which details what is involved in DNA microarray data analysis: what the problems are, and how signal can be differentiated from noise to enable accurate analysis.

Last but not least, the availability of complete genomes from various organisms allows the complements of genes from different species to be compared. This is of interest for the classification of organisms as well as for understanding the phylogenetic relationships between organisms. Such comparisons provide a means for researchers to try to reconstruct how evolutionary pathways emerged by looking at the differences in DNA and protein sequences. The information obtained can then be used to plot a phylogenetic tree, which is a graphical representation of the data. Knowing that life is a hierarchical organisation of different levels, researchers can then hope to gain a more unified picture of the workings of life by examining the relationships between these different organisms.
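
The first step in building such a tree is usually to turn sequence differences into pairwise distances. A minimal Python sketch follows. It assumes the sequences are already aligned and of equal length, uses the simple proportion of differing sites as the distance, and stops at finding the closest pair rather than building a full tree. The function names and the toy sequences are invented for illustration.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        return 0.0
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def distance_matrix(sequences):
    """Pairwise distances for a dict of name -> aligned sequence."""
    return {
        (a, b): p_distance(sequences[a], sequences[b])
        for a, b in combinations(sorted(sequences), 2)
    }


def closest_pair(sequences):
    """The two names with the smallest distance: the first pair a
    distance-based tree-building method would join."""
    matrix = distance_matrix(sequences)
    return min(matrix, key=matrix.get)


if __name__ == "__main__":
    # Toy aligned fragments, not real gene sequences.
    toy = {
        "species_1": "ACGTACGT",
        "species_2": "ACGTACGA",
        "species_3": "TCGAACCA",
    }
    for pair, distance in distance_matrix(toy).items():
        print(pair, distance)
    print(closest_pair(toy))
```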

Comparative genomics:



This section serves to provide an overview of the broad-ranging applications of Bioinformatics and is by no means a complete survey of all that is available currently. In short, Bioinformatics is a range of computing tools which are employed for the organisation and understanding of biological data usually in a large-scale manner.
