
November 15 to 18, 2010 Centro de Convenções e Artes da UFOP, Ouro Preto, Minas Gerais, Brazil

november 16 to 18, 2010

Abstracts Book

Contents
1 Genomics, Evolution and Phylogeny
2 Structural Bioinformatics and Molecular Dynamics
3 Transcriptomics and Proteomics
4 Databases and Bioinformatics Tools
5 Text Mining and Information Extraction
6 Systems Biology and Networks
7 Sequence Analysis
Authors index

TOPIC 1 GENOMICS, EVOLUTION AND PHYLOGENY


6th INTERNATIONAL CONFERENCE OF THE BRAZILIAN ASSOCIATION FOR BIOINFORMATICS AND COMPUTATIONAL BIOLOGY

Topic: Genomics, Evolution and Phylogeny PI: GEP01

MAPPING AND ASSEMBLING THE GENOME OF A STREPTOCOCCUS AGALACTIAE STRAIN ISOLATED FROM FISH
Pereira U P1 , Araujo F M G2 , Drumond B2 , Zerlotini A3 , Coimbra R S2,3 , Oliveira G C2,3 , Figueiredo H C P1
1 Laboratory of Aquatic Animal Diseases, Veterinary School - UFMG
2 Genomics and Computational Biology Group - FIOCRUZ-MG
3 Center for Excellence in Bioinformatics - FIOCRUZ-MG

Streptococcus agalactiae (Lancefield group B; GBS) is a pathogen that causes meningoencephalitis in fish, mastitis in cows, and neonatal sepsis and meningitis in humans. In aquaculture, GBS is an emerging pathogen that has been associated with considerable morbidity and mortality in fish farms worldwide. Three complete and five draft genomes are available in public databases, all from human isolates. The pangenome of the species is considered open, and each newly sequenced GBS genome is expected to reveal an average of 33 new strain-specific genes. The aim of this work is therefore to assemble the genome of a GBS strain isolated from fish and compare it with the available genomes of the species. We present here the preliminary results of mapping and de novo assembly of data generated by a SOLiD sequencer (Life Technologies) using different approaches. Genomic DNA of S. agalactiae isolated from Nile tilapia was extracted, and mate-paired libraries with 1-2 kb inserts were constructed. From this library, 50 million 50 bp-long reads were generated with SOLiD v.3.0 and mapped with the software SHRiMP 1.3.2 onto three reference genomes of S. agalactiae, namely NEM316 (NC_004368), 2603 (NC_004116), and A909 (NC_007432). In this step all reads were used, with no prior quality filtering. Additionally, reads were filtered for quality using csfasta_quality_filter_1.0 and assembled de novo with the SOLiD System De Novo Assembly Tools (SDNAT), versions v1 and v2. BLAT was used to align the contigs to the three reference genomes. Using SHRiMP, 58% of the reads were mapped to the three reference genomes, with an observed coverage ranging from 651X to 689X. The number of SNPs was higher in strain 2603 (11,052) than in A909 (10,089) and NEM316 (7,188). The uncovered fraction of the genomes A909, 2603, and NEM316 was 17, 20, and 22%, respectively. In the de novo assembly, SDNAT v1 and v2 generated 8,682 and 818 contigs, respectively. With SDNAT v2, the average contig size (2,205 bp) and N50 (4,536 bp) were ten times higher than with v1. The largest contigs were 20,591 bp (v2) and 479 bp (v1), and the number of contigs > 1 kb was 278 (v1) and 425 (v2). The total number of nucleotides in all contigs was 1,864,009 and 1,804,357, and the percentage of nucleotides in contigs > 1 kb was 21% and 91% in v1 and v2, respectively. The number of contigs not aligned to the reference genomes ranged from 16 to 27, and their average size ranged from approximately 581 to 1,400 bp. Mapping with BioScope and de novo assembly with Edena (Exact De Novo Assembly) are under way for further comparative analysis and possibly to complement these results. The broader habitat and host range of GBS may favor lateral gene transfer, increasing intraspecies genomic diversity. Thus, mapping and de novo assembly are essential and complementary analytical approaches. Version 2 of SDNAT outperformed version 1 in all relevant criteria. Combining different bioinformatics tools usually helps to obtain more reliable results in genomic analysis; the currently available computational tools will therefore be evaluated and integrated. Supported by: FAPEMIG (CBB-1181/08), NIH-FIC (TW007012), CAPES/CDTS-FIOCRUZ, FIOCRUZ-MG, CNPq and Furnas Centrais Elétricas
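The assembly metrics quoted above (N50 and the share of nucleotides in contigs longer than 1 kb) can be illustrated with a short sketch. This is not the authors' code; the contig lengths are hypothetical and the function names are chosen only for this example.

```python
def n50(lengths):
    """Return the N50: the length of the contig at which the running total,
    taken from the longest contig downwards, reaches half the assembly size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


def fraction_in_long_contigs(lengths, min_len=1000):
    """Fraction of all assembled nucleotides that sit in contigs > min_len."""
    return sum(l for l in lengths if l > min_len) / sum(lengths)


# Hypothetical contig lengths in bp (not data from the abstract).
contigs = [9000, 8000, 5000, 3000, 1500, 900, 600]
print("N50:", n50(contigs))
print("fraction in contigs > 1 kb:", round(fraction_in_long_contigs(contigs), 3))
```

Because N50 weights contigs by length, it is less sensitive to many very short contigs than the plain average contig size.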


Topic: Genomics, Evolution and Phylogeny PI: GEP05

SNPS ANALYSIS AND VIRULENCE CORRELATION IN KLEBSIELLA PNEUMONIAE KP13 DRAFT GENOME SEQUENCE
Lima N C B1 , Nicolas M F1 , Cantao M E1 , Almeida L G P D1 , Vespero E C2
1 Laboratório Nacional de Computação Científica - LNCC/MCT
2 Departamento de Patologia e Análises Clínicas e Toxicológicas - Centro de Ciências da Saúde - Universidade Estadual de Londrina

Bacteria of the genus Klebsiella are important opportunistic pathogens associated with hospital-acquired infections, causing pneumonia, urinary tract infections and septicaemia. These bacteria are found in soil, sewage, surface water and on mucosal surfaces of mammals. Most strains of K. pneumoniae can be present in the human nasopharynx and intestinal tract. In general, resistance to ampicillins, carbenicillin and first-generation cephalosporins is due to the production of beta-lactamases. A new clone of K. pneumoniae, called KP13, was isolated from a hospital infection in Londrina, Paraná State, Brazil. The genome of this isolate was sequenced using the GS-FLX System (454 Roche, Inc). De novo assembly with the Mira software generated 161 contigs larger than 500 bp with 23x coverage. These sequences were aligned with three complete K. pneumoniae genomes (strains MGH 78578, NTUH-K2044 and 342) using the NUCmer algorithm of the MUMmer v.3.0 package. The alignments were processed with delta-filter, and primary SNP calls were made with show-snps. The primary SNP set was filtered to exclude repeats, indels, homopolymeric tracts resulting from pyrosequencing, low-quality SNPs and SNPs in low-quality neighborhoods [2], using the appropriate show-snps parameters and custom Perl scripts. ORFs were predicted with Glimmer [3], and automatic functional annotation was performed with the SABIA package [5]. The final set of SNPs was then mapped to ORFs or intergenic regions. When a SNP fell within a CDS, we checked whether the gene is related to Klebsiella pneumoniae virulence. We also calculated the ratio of synonymous to nonsynonymous substitutions in order to estimate the evolutionary rate of the SNPs.
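The kind of post-filtering described above can be sketched as follows. The authors used show-snps parameters and Perl scripts that are not given in the abstract; the Python below is only an illustration, and the thresholds (`max_run`, `min_spacing`), the reference string and the SNP positions are all assumptions.

```python
def longest_run_through(seq, pos):
    """Length of the homopolymer run in seq that contains 0-based index pos."""
    base = seq[pos]
    left = pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = pos
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1


def filter_snps(seq, positions, max_run=3, min_spacing=5):
    """Keep SNPs that lie outside long homopolymer runs and that have no
    other candidate SNP closer than min_spacing bases."""
    ordered = sorted(positions)
    kept = []
    for i, pos in enumerate(ordered):
        if longest_run_through(seq, pos) > max_run:
            continue  # likely pyrosequencing homopolymer artifact
        near_prev = i > 0 and pos - ordered[i - 1] < min_spacing
        near_next = i + 1 < len(ordered) and ordered[i + 1] - pos < min_spacing
        if near_prev or near_next:
            continue  # crowded neighborhood, treated as unreliable
        kept.append(pos)
    return kept


# Hypothetical reference fragment and candidate SNP positions (0-based).
reference = "ACGTAAAAAGCTAGCTTGCATCGATCGGATC"
snps = [1, 6, 12, 20, 22]
print(filter_snps(reference, snps))
```

In this toy input, position 6 falls inside a run of five A's and positions 20 and 22 are only two bases apart, so only positions 1 and 12 are retained.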


Topic: Genomics, Evolution and Phylogeny PI: GEP04

GENOMIC ECOLOGY: GENES COMPETING FOR A METABOLIC NICHE?


Catanho M1 , Guimarães A C R1 , Otto T D2 , Alvarez-Valin F3 , Degrave W1 , Miranda A B D4
1 Laboratório de Genômica Funcional e Bioinformática, Instituto Oswaldo Cruz, Rio de Janeiro, Brasil
2 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, United Kingdom
3 Sección Biomatemática, Facultad de Ciencias, Universidad de la República, Montevideo, Uruguay
4 Laboratório de Biologia Computacional e Sistemas, Instituto Oswaldo Cruz, Rio de Janeiro, Brasil

Abstract: Microbial comparative genomics has undoubtedly contributed to the elucidation of fundamental aspects of the genetics, biochemistry and evolution of numerous species. A particularly important contribution concerns the study of metabolic variability and the conservation of enzymatic functions. Since enzymes control almost all biochemical reactions in the metabolism of living organisms, it is extremely important to characterize the genes encoding enzymatic activities. The most successful approaches to this task are based on sequence similarity searches, using computational tools such as BLAST and FASTA. Comparisons of metabolic pathways computationally predicted from completely sequenced genomes of diverse organisms have revealed incomplete or even absent pathways in several of the species analyzed. In some of these cases, the missing enzymes were replaced by functionally equivalent molecules, able to catalyze the same reaction but exhibiting virtually no sequence similarity at the primary level, thus escaping identification by methods based on sequence similarity. These alternative forms, known as analogous enzymes, arise from independent evolutionary events that converge on the same biological function; they may be associated with separate phylogenetic lineages and/or possess different catalytic mechanisms, as well as distinct folding topologies. Previous work by other groups suggests that the fraction of enzymatic activities in which multiple events of independent origin have occurred may be substantial. However, a comprehensive survey of the occurrence, distribution, and implications of these events across fully sequenced genomes has not yet been carried out. In addition, an intriguing question can be raised: given that multiple alternative forms (in terms of their origins) exist for many enzymes, and that lateral gene transfer between bacterial genomes is pervasive, might one consider these enzymes as individuals and their clusters as populations, all in competition for a particular metabolic niche? And in that case, would it be plausible that more competitive enzymes have a selective advantage over less competitive alternative forms and consequently spread across diverse bacterial genomes over time? In this work, we investigate the variability and evolution of metabolic pathways in prokaryotes, analyzing the frequency, distribution and extinction patterns of putative analogous enzymes in the glycolysis/gluconeogenesis pathway. Acknowledgments: We wish to thank Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Programa Estratégico de Apoio à Pesquisa em Saúde (PAPES), Programa AMSUD-PASTEUR, Red Iberoamericana de Bioinformática (RIB), Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Rede Fiocruz, Plataforma de Bioinformática PDTIS, and Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ) for their support.


Topic: Genomics, Evolution and Phylogeny PI: GEP06

FIRST STEPS ON THE PIPE-LINE FOR PROBABILISTIC ANNOTATION OF SUGAR CANE METABOLOME
Silva R R1 , Vêncio R Z N1
1 Laboratório de Processamento de Informação Biológica (LabPIB) - Faculdade de Medicina de Ribeirão Preto (Universidade de São Paulo) - Departamento de Genética

The pressing need to decrease the consumption of fossil fuels and reduce emissions of greenhouse gases has increased interest in the production of biofuels, especially ethanol. Understanding the mechanisms of sucrose accumulation in sugar cane is central to the rationalization of breeding programs, whether they use traditional selection methods or genetic engineering, in order to reach a sustainable production of ethanol. The increasing capacity to generate metabolic profiles makes data screening and reliable comparative analysis a necessary condition for the success of metabolomics in sugar cane. New tools are therefore needed for the effective in silico handling of this information. In this work, we present the first steps towards accurate annotation of metabolic measurements from high-throughput mass spectrometry. To this end, we first implemented a pipeline for the preprocessing steps needed to translate mass spectra into reliable molecular masses and thereby identify metabolites. We present a local implementation of xcms (Smith et al., Analytical Chemistry 2006), a free R package for preprocessing high-throughput mass spectrometry data. We reproduced the faahKO data analyses and generated a list of molecular masses. We then wrote a Perl script that uses the KEGG API web service to retrieve molecular formulas within a given mass window, 0.15 m/z units in the example case. These are the first steps towards the probabilistic annotation of the sugar cane metabolome. With the masses and candidate formulas, we can implement a probabilistic tool that assigns empirical formulas to mass peaks based not only on mass comparisons but also on the metabolic relationships between metabolites found together in the same sample, similarly to Rogers et al. (Bioinformatics 2009). This method allows not only a more reliable annotation of the metabolites but also the identification of metabolic pathways, and was therefore chosen as our starting point for improvement. Supported by: CAPES
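The mass-window lookup described above can be sketched without any web service. The abstract's script was written in Perl against the KEGG API; the Python below instead searches a small, hypothetical local table of monoisotopic masses, so the table contents and the function name are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical local table: (monoisotopic mass in Da, molecular formula).
# A real pipeline would fetch candidates from a compound database instead.
CANDIDATES = sorted([
    (180.0634, "C6H12O6"),
    (180.0423, "C9H8O4"),
    (192.0270, "C6H8O7"),
    (342.1162, "C12H22O11"),
])
MASSES = [mass for mass, _ in CANDIDATES]


def formulas_in_window(measured_mass, window=0.15):
    """Return all formulas whose mass lies within +/- window of the measurement."""
    lo = bisect_left(MASSES, measured_mass - window)
    hi = bisect_right(MASSES, measured_mass + window)
    return [formula for _, formula in CANDIDATES[lo:hi]]


for mass in (180.05, 342.0, 250.0):
    print(mass, formulas_in_window(mass))
```

A window of 0.15 units returns two candidate formulas for a peak near 180.05, which is why a second source of evidence, such as the metabolic relationships mentioned in the abstract, is useful for ranking candidates.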


Topic: Genomics, Evolution and Phylogeny PI: GEP03

NATURAL SELECTION IN DIFFERENT REGIONS OF MEMBRANE MOLECULES


Mendes F H K1 , Souza J E S D2 , Ramalho R F1 , Souza S J D2 , Meyer D1
1 Departamento de Genética e Biologia Evolutiva, Universidade de São Paulo, Brazil
2 Ludwig Institute for Cancer Research, at Hospital Alemão Oswaldo Cruz, São Paulo, Brazil

Genome-wide scans have listed positively selected genes (PSGs) in different phylogenies and investigated whether (i) certain functional categories are enriched with PSGs and (ii) PSGs have higher expression levels than non-PSGs. Comparatively less attention has been given to establishing which regions of a molecule are under natural selection. Such an analysis can now be carried out using information on the molecular regions of gene products (e.g., the results of the Surfaceome Project). The present study uses both the results of a genome-wide scan and the descriptions of gene products from the Surfaceome Project to compare the ratios of nonsynonymous to synonymous substitution rates (dN/dS) of the extracellular, intracellular and transmembrane portions of positively selected gene products. This analysis will allow us to test whether one region is significantly more affected by natural selection than the others. Specifically, we test the hypothesis that the extracellular region, being more exposed to molecules that act as selective pressures (e.g., pathogen peptides), has a higher dN/dS. Supported by: FAPESP
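A per-region dN/dS comparison of the kind proposed above can be illustrated with a simple counting approach. The abstract does not say how dN/dS is estimated; the sketch below assumes pre-computed counts of differences and sites and applies the Jukes-Cantor correction, and all numbers are hypothetical.

```python
import math


def jukes_cantor(p):
    """Correct an observed proportion of differences p for multiple hits."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """dN/dS from counts of differences and of sites of each class."""
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    return dn / ds


# Hypothetical counts per protein region:
# (nonsynonymous diffs, nonsynonymous sites, synonymous diffs, synonymous sites)
regions = {
    "extracellular": (30, 600.0, 12, 200.0),
    "transmembrane": (8, 600.0, 15, 200.0),
    "intracellular": (14, 600.0, 13, 200.0),
}
for name, counts in regions.items():
    print(f"{name}: dN/dS = {dn_ds(*counts):.2f}")
```

Likelihood-based codon models are the more common choice for formal tests of selection; the counting version is shown only because it makes the ratio itself easy to follow.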


Topic: Genomics, Evolution and Phylogeny PI: GEP04

PHYLOGENETIC SUPERTREE FOR FALCONIDAE (AVES): SPECIES-LEVEL METAPHYLOGENY WITH MORPHOLOGICAL AND MOLECULAR DATA
Almeida R B D1 , Schühli G S E2
1 Universidade Estadual de Ponta Grossa - UEPG - Ponta Grossa, PR, Brasil
2 Empresa Brasileira de Pesquisa Agropecuária - EMBRAPA Florestas - Colombo, PR, Brasil

A large and well-supported phylogeny is an important tool for taxonomic and systematic classification and for inferring large-scale hypotheses in nature. Because research effort is unevenly distributed across taxa, the available data are usually insufficient to build large, fully comprehensive trees. Falconidae is one of four families of Falconiformes, with 63 species in 11 genera and 2 subfamilies according to the most recently accepted systematic classification. Several types of data have been analyzed to infer the phylogenetic relationships of the family, including osteological, morphological and molecular data. The supertree construction approach is one method for generating more comprehensive phylogenies by combining existing phylogenetic information from published data or from databases. Studies have shown that supertrees built from sufficiently numerous and large source trees can accurately represent the information provided by the source trees. In the Matrix Representation using Parsimony (MRP) method, the matrix representations of different source trees are combined into a single matrix that can be analyzed under a parsimony criterion. The aim of this work is to collect different source trees inferred for Falconidae, apply the supertree construction method under the MRP criterion, and analyze the phylogenetic and systematic relationships within the family. To this end, we searched for papers containing potential source trees inferred for Falconidae. Several source trees were collected and their topologies were reproduced in dendrogram-editing software. For each cladogram considered, a digital file in NEXUS format was obtained, containing the tree as a character string in Newick format. Because the methodology of the source trees varied, the number of taxa considered differed among trees, but there was no conflict in the overlap analysis: all 63 taxa corresponding to the species of Falconidae were included in all reproduced trees, as was the taxon Accipitridae, added as outgroup. A total of 23 trees had been edited by the end of the review of the publications considered. The Newick strings were combined and run in the program CLANN under the MRP criterion. The analysis yielded 544 most-parsimonious trees, and a strict consensus was computed, resulting in one supertree for Falconidae. The bootstrap analysis was run in PAUP* with 100 replicates. The supertree agrees with the most recent systematic classification, which recognizes two subfamilies, Herpetotherinae and Falconinae. The worldwide biogeographical distribution of Falconidae taxa is compatible with the topological distribution of taxa in the supertree obtained. The basal branches are exclusively Neotropical (America). This distribution pattern indicates, parsimoniously, that the family Falconidae originated in the Neotropics, more specifically in South America. The supertree construction approach gave satisfactory results in generating a large and comprehensive phylogeny of Falconidae.
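The matrix step of MRP described above can be sketched as follows, assuming the standard Baum-Ragan coding. To keep the example short, source trees are written as nested Python tuples rather than parsed from Newick, and the two small topologies are hypothetical; the study itself used CLANN.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def clades(tree, is_root=True):
    """Yield the leaf set of every internal node except the root."""
    if isinstance(tree, str):
        return
    if not is_root:
        yield leaves(tree)
    for child in tree:
        yield from clades(child, is_root=False)


def mrp_matrix(source_trees):
    """One column per clade: '1' = taxon in the clade, '0' = taxon in that
    source tree but outside the clade, '?' = taxon absent from that tree."""
    all_taxa = sorted(set().union(*(leaves(t) for t in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in source_trees:
        tree_taxa = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon in clade:
                    rows[taxon].append("1")
                elif taxon in tree_taxa:
                    rows[taxon].append("0")
                else:
                    rows[taxon].append("?")
    return {taxon: "".join(column) for taxon, column in rows.items()}


# Two hypothetical source trees with partially overlapping taxa.
tree_a = ("Accipitridae", (("Herpetotheres", "Micrastur"), ("Caracara", "Falco")))
tree_b = ("Accipitridae", ("Micrastur", ("Caracara", "Falco")))
for taxon, row in mrp_matrix([tree_a, tree_b]).items():
    print(f"{taxon:<14}{row}")
```

The resulting binary matrix, with the outgroup coded as all zeros, is what a parsimony program would then analyze to produce the supertree.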


Topic: Genomics, Evolution and Phylogeny PI: GEP04

GENETIC VARIABILITY OF TRYPANOSOMA CRUZI II BY CORRELATIONS BETWEEN NUCLEAR AND MITOCHONDRIAL GENOME (KDNA) ASSOCIATED WITH CLINICAL MANIFESTATIONS OF CHAGAS DISEASE
Baptista R D P1 , Dvila D A1 , Valadares H M S1 , Galvão L M D C2 , Gontijo E D1 , Macedo A1 , Machado C R1 , Chiari E1
1 Universidade Federal de Minas Gerais - UFMG
2 Universidade Federal do Rio Grande do Norte - UFRN

Chagas disease, caused by the protozoan Trypanosoma cruzi, presents a wide spectrum of clinical manifestations that vary between individuals and geographical regions. In the chronic phase, around 70% of individuals are asymptomatic (IF), whereas 30% develop the cardiac (CF) or digestive forms of the disease. The factors that determine the outcome of the infection are unknown but certainly depend on complex interactions among the genetic make-up of the parasite, the host immunogenetic background and the environment. In this study, we analyzed nuclear and mitochondrial genes as well as microsatellite markers, together with the structure of the cytochrome oxidase subunit I (COI), II (COII) and III (COIII), Cyb, and NADH dehydrogenase subunit 4 (ND4) and 7 (ND7) genes, by PCR assays and sequencing in Trypanosoma cruzi isolates obtained from 61 chronic patients with well-characterized clinical forms of the disease, all belonging to the T. cruzi II genotype. To analyze our results, we grouped these markers into nuclear and mitochondrial haplotypes in order to compare them and search for correlations between them and the clinical manifestations. For microsatellite analysis we used PHASE, a program for haplotype reconstruction and recombination rate estimation from population data, and Network, which generates evolutionary trees and networks. These analyses showed no correlation between mitochondrial and nuclear markers, and our microsatellite network shows that different T. cruzi II isolates are separated by different numbers of mutational steps. We conclude that there is no correlation between these markers and the clinical manifestation, but we can demonstrate that there are subgroups within the T. cruzi II genotype, which are probably related to hybridization events. Supported by CAPES, FAPEMIG and CNPq.


Topic: Genomics, Evolution and Phylogeny PI: GEP04

IN SILICO ANALYSIS OF GENES INVOLVED IN SMALL RNA PATHWAYS IN SCHISTOSOMA MANSONI


Batista T M2 , Lobo F P1 , Franco G R1 , Marques J T1
1 Universidade Federal de Minas Gerais
2 Núcleo de Bioinformática da UFMG - Centro de Excelência em Bioinformática - Fiocruz MG

Small non-coding RNAs, including microRNAs (miRNA) and small interfering RNAs (siRNA), play important roles in many biological processes. Argonaute proteins are a key component of the RNA-induced silencing complex that mediates miRNA/siRNA functions. In this study, we investigated the presence of Argonaute (Ago) proteins in the genome of Schistosoma mansoni. So far, we have identified five candidates: Smp_102690.2, Smp_102690.3, Smp_140010, Smp_179320 and Smp_188610. Interestingly, based on their protein sequences, these putative Ago proteins fall into distinct and well-characterized subgroups. These subgroups of Ago proteins are associated with distinct classes of small RNAs, suggesting the existence of different small RNA pathways in Schistosoma mansoni. Supported by FAPEMIG/CEBio


Topic: Genomics, Evolution and Phylogeny PI: GEP04

UNRAVELING THE EVOLUTION OF SUBSTRATE SPECIFICITY IN GERMIN AND GERMIN-LIKE PROTEINS


Scherer N D M1 , Bitar M2
1 Bioinformatics Department, Heinrich-Heine-Universität, Düsseldorf, Germany
2 Instituto de Biofísica Carlos Chagas Filho, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brasil

Germin and germin-like proteins (GLPs) belong to a multigene family found in plants, which is part of the cupin superfamily. They have been described as pathogenesis-related and classified as PR-15 and PR-16. The PR-15 family comprises the true germin proteins (hereafter germins), which are found only in true cereals (barley, wheat, rice, oat, rye and corn). These proteins have oxalate oxidase (OXO) activity, which is responsible for their pathogenesis-related properties. Oxalate oxidase restricts fungal cell growth through the cross-linking of cell wall components by hydrogen peroxide (H2O2). Oxalic acid is a fungal toxin, and OXO may thus exert a dual action, simultaneously destroying the toxin and generating H2O2, which is also used as a substrate by another pathogenesis-related protein, peroxidase (PR-9). This constitutes an especially potent defense mechanism. All other family members, which lack OXO activity, are referred to as germin-like proteins. Accordingly, every newly described pathogenesis-related protein that fits this description is included in the PR-16 family. Therefore, the PR-16 family is not necessarily a monophyletic group. The germin-like proteins of the PR-16 family can also restrict pathogen invasion by promoting cell wall reinforcement in the plant. It has been suggested that H2O2 generation is also the mechanism by which GLPs promote resistance to penetration. In this case, hydrogen peroxide is produced by superoxide dismutase (SOD) activity. In this work, the molecular differences that may account for the specific actions of PR-15 and PR-16 are investigated using branch-site models of codon substitution. To construct the datasets, representative sequences were aligned to serve as seeds for HMM profiles, which were used to search for homologous sequences in UniProt with hmmsearch. Phylogenetic trees were inferred with PhyML, and codon substitution models were analyzed with PAML4. Phylogenetic analysis showed that PR-15 groups as a subtree of PR-16. PR-15 is indeed a monophyletic group, which makes the PR-16 family paraphyletic. A preliminary analysis with branch-site models indicated 10 sites under positive selection that may be involved in the change of function on the branch leading to germins. Supported by: CAPES


Topic: Genomics, Evolution and Phylogeny PI: GEP04

ZEBU GENOME SEQUENCING AND ANALYSIS USING SECOND GENERATION SEQUENCERS


Araujo F1 , Drumond B1 , Zerlotini A2 , Rosse I C3 , Lopes B C4 , Guedes E5 , Arbex W A5 , Machado M A5 , Peixoto M G C D5 , Verneque R S5 , Guimarães M F M5 , Coimbra R S1,2 , Carvalho M R S3 , Silva M V G B5 , Oliveira G C1,2
1 Genomics and Computational Biology Group - FIOCRUZ-MG
2 Center for Excellence in Bioinformatics - FIOCRUZ-MG
3 Institute of Biological Sciences - UFMG
4 EPAMIG - MG
5 EMBRAPA Dairy Cattle - MG

INTRODUCTION: The Brazilian cattle population is composed mainly of animals of zebu breeds and their crossbreeds with taurine breeds. In Brazil, over the last decades, traditional genetic techniques have yielded considerable gains in dairy production and resistance to infectious diseases. Although effective, these methods do not shed light on the biological processes underlying the observed results. The inclusion of genetic markers in the selection process has increased genetic gains and lowered the costs of traditional progeny tests. OBJECTIVE: To identify single nucleotide polymorphism (SNP) markers in the genome of the zebu Gir breed. METHODOLOGY: Mate-paired libraries with 1-2 kb inserts were constructed. The reads generated (50 bp long) were mapped onto the publicly available reference genome of a female Bos taurus (NCBI Project ID: 10708) using SHRiMP. SAMtools was used to generate the consensus sequence, and VARiD to identify SNPs. The Y chromosome will be mapped onto sequences from phylogenetically more distant species such as human, mouse and rat. RESULTS: The first two runs on the SOLiD V3 Plus platform yielded 203 million reads, representing a theoretical 3.5X coverage of the reference genome. Parameters such as observed coverage, gap sizes, number of mapped reads and number of SNPs will be used to evaluate the sequencing and the analysis process. PERSPECTIVES: The identification of SNPs in the Brazilian Gir breed will improve the efficiency of the next version of genotyping chips for dairy zebu genetic selection in Brazil. Funding: FAPEMIG (CBB-1181/08 and TCT 12.093/10), NIH-USA (TW007012), CAPES/CDTS-FIOCRUZ, FIOCRUZ-MG
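The theoretical coverage quoted in the results follows from the usual relation: number of reads times read length, divided by genome size. The sketch below reproduces the arithmetic; the genome size of about 2.9 Gb is an assumption made here, since the abstract does not state the value the authors used.

```python
def theoretical_coverage(n_reads, read_length, genome_size):
    """Expected mean depth if every read mapped uniquely and uniformly."""
    return n_reads * read_length / genome_size


# Assumption: approximate bovine reference genome size, not given in the abstract.
ASSUMED_BOVINE_GENOME_BP = 2.9e9

depth = theoretical_coverage(203e6, 50, ASSUMED_BOVINE_GENOME_BP)
print(f"theoretical coverage: {depth:.1f}X")
```

The observed coverage is normally lower than this figure, because not all reads map and mapped reads are not distributed uniformly.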


Topic: Genomics, Evolution and Phylogeny PI: GEP07

RETRIEVING PROTEIN SEQUENCES FOR PHYLOGENETIC STUDIES


Razante H L1 , Braz A S K1 , Scott L P B1
1 Universidade Federal do ABC

Introduction: In order to investigate a hypothesis, a researcher must deal with thousands of sequences from a variety of taxa. The work includes the classification of sequences and the selection and linkage of sequence names from different sources. We developed a tool to assist in this task, automating the computation of labels and the retrieval of protein coding sequences.

Methods and Results: Most sequence databases are not organized with phylogenetic studies in mind, which makes it difficult to select the sequences needed for a study. This paper presents a tool for retrieving sequences for this purpose. The tool is being developed in C++ and can retrieve data from NCBI and convert it to formats suitable for phylogenetic software, mainly FASTA files. Protein phylogenetic studies may involve thousands of sequences, and retrieving them may be an arduous task. One issue that arises with phylogenetic software is the limit of 10 characters for the sequence name in the FASTA file. These strings are then displayed in the tree visualization and should be easily understood by the analyst. The GenBank file format contains numeric keys that could be used as sequence labels, such as the GI or the taxon ID. Instead, we developed an algorithm that converts the organism name into a string suitable for visualization, allowing the analyst to recognize the organism directly. Naive generation of such strings may lead to duplicates, so a database containing a data dictionary was implemented. The algorithm guarantees distinct labels among the sequences being studied. The tool also helps in retrieving the coding sequences, a task that may take a very long time if done manually. Their phylogenetic tree may also be computed and visualized using the same naming convention, which solves the naming issue.

Conclusions: Our tool allows the transformation and retrieval of sequences for phylogenetic studies. All these tasks are done automatically in batch. The software is under development and is freely available for download at http://professor.ufabc.edu.br/humberto.razente/phylogeny/. Suggestions are welcome. Future work includes integrating multiple alignment algorithms and the visualization of the phylogenetic tree.

Acknowledgements: This work was supported by UFABC, CNPq and FAPESP.
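The labeling problem described above, readable names of at most 10 characters that stay unique, can be sketched as follows. The abstract does not describe the authors' algorithm and their tool is written in C++; the abbreviation rule, the numeric suffix and the in-memory set standing in for the data dictionary are all assumptions made for this Python illustration.

```python
def make_label(organism, used, width=10):
    """Build a unique label of at most `width` characters from a binomial name.

    `used` is a set of labels already assigned; it plays the role of the
    data dictionary and is updated in place.
    """
    parts = organism.split()
    genus = parts[0]
    species = parts[1] if len(parts) > 1 else ""
    base = (genus[:3].capitalize() + "_" + species.lower())[:width]
    label = base
    counter = 1
    while label in used:
        # On a clash, shorten the base so a numeric suffix still fits.
        suffix = str(counter)
        label = base[: width - len(suffix)] + suffix
        counter += 1
    used.add(label)
    return label


assigned = set()
for name in ["Homo sapiens", "Escherichia coli", "Escherichia coli", "Homo sapiens"]:
    print(f"{name} -> {make_label(name, assigned)}")
```

Keeping the genus prefix at the start of the label means that truncation and suffixes affect only the end of the string, so the organism remains recognizable in the tree.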


Topic: Genomics, Evolution and Phylogeny PI: GEP03

DATA MINING AND PHYLOGENY TO SELECT CANDIDATE GENES RESPONSIBLE FOR ALUMINUM TOLERANCE IN MAIZE
Tinoco C F D S1,2 , Magalhães J V D2 , Lana U G D P2,3 , Noda R W2 , Simões C C2,3 , Guimarães C T2
1 Biology Department, University Center of Sete Lagoas - UNIFEMM
2 Embrapa Maize and Sorghum - CNPMS
3 Genetics Department, Federal University of Minas Gerais - UFMG

Bioinformatics is an essential tool for handling biological data and has been useful in many plant breeding applications. Aluminum toxicity on acid soils is a major factor limiting the yield of many important crops such as maize. In acidic conditions (pH 5.0), ionic forms of Al are released into the soil solution, inhibiting root growth and plant development. Recently, one Al tolerance gene, SbMATE, was isolated in sorghum; it encodes a membrane transporter responsible for Al-activated citrate release and is a member of the multidrug and toxic compound extrusion (MATE) family. MATE genes associated with aluminum tolerance have also been identified in other species, such as arabidopsis (AtMATE), rice (Os01g69010), barley (HvMATE), soybean (GmAlTsb1) and wheat (TaMATE). Considering that MATE is a multigene family with a wide range of biological functions, our major goal was to use bioinformatics strategies to select the MATE genes most likely to be associated with Al tolerance in maize. A search for MATE homologs in maize was carried out by sequence similarity analysis against the maize genome sequence, using SbMATE as query. BLAST searches were performed in Phytozome v. 5.0. Using the identified maize homologs as queries, new searches were carried out with the Peptide Homologs tool in Phytozome in order to expand the set of MATE homologs in maize. A total of 44 MATE homologs were identified, and a phylogenetic study based on the predicted protein sequences was performed, including protein sequences of genes described in the literature as responsible for Al tolerance in other crops. The protein sequences were subjected to multiple alignment with the program T-Coffee, and a neighbor-joining tree was generated from the genetic distance matrix using the Phylemon website (http://phylemon.bioinfo.cipf.es). Five maize homologs clustered with MATE genes responsible for aluminum tolerance in other species and showed 24 to 61% similarity to SbMATE. To refine the selection criteria, the five predicted proteins were submitted to topological analysis with the HMMTOP program (http://www.enzim.hu/hmmtop). Of these, only one member presented a single large intracellular loop between the second and third transmembrane domains, a feature shared among plant citrate transporters. This member is the second-best hit to SbMATE, sharing 50% similarity, and, according to the literature, is a citrate transporter activated by aluminum in the maize root tip. The proposed approach thus identified precisely one putative Al tolerance gene in maize, named ZmMATE1. Additionally, ZmMATE1 co-localizes with a major Al tolerance QTL in maize. We therefore showed that combining bioinformatics tools is an important strategy for selecting candidate genes for further validation. Supported by: FAPEMIG, Generation Challenge Programme, CCRP McKnight Foundation, Embrapa


Topic: Genomics, Evolution and Phylogeny PI: GEP05

SELECTION OF SINGLE NUCLEOTIDE POLYMORPHISMS FOR USING IN PATERNITY ANALYSIS IN ZEBU BREEDS
Guedes E1 , Andrade L G D1 , Muniz M N M1,2 , Tagliatti R F1 , Arbex W A1 , Caetano A R3 , Paiva S R3 , Silva M V G B D1
1 Laboratório de Bioinformática e Genômica Animal - EMBRAPA Gado de Leite - MG - Brazil
2 Núcleo de Bioinformática - Universidade Federal de Juiz de Fora/EMBRAPA Gado de Leite - MG - Brazil
3 Laboratório de Genética Animal - EMBRAPA Recursos Genéticos e Biotecnologia - DF - Brazil

DNA marker technology represents a promising means for determining the genetic identity and kinship of livestock. Compared with other types of DNA markers, single nucleotide polymorphisms (SNPs) are appealing because they are abundant, genetically stable, and amenable to high-throughput automated analysis. Recent advances in DNA sequencing, computing tools and bioinformatics have facilitated and improved the identification of SNPs from amplified segments of genomic DNA. SNPs have already been employed in animal identification and paternity analysis in American and European beef and dairy breeds, and in analyses of genetic distance. In cattle, the challenge has been to identify a minimal set of SNPs with sufficient power for use in a variety of popular breeds and crossbred populations. A total of about 58,000 SNPs genotyped with the Illumina BovineSNP50 Bead Chip (Illumina Inc., San Diego, CA) were investigated to determine their usefulness for paternity analysis. The informativeness of these SNPs was estimated from the distribution of minor allele and genotype frequencies in two zebu breeds: (a) Gyr (319 animals) and (b) Nellore (959 animals). SNPs with a minor allele frequency between 0.44 and 0.56 and genomic distance 3 cM were selected. Hardy-Weinberg equilibrium was assessed with a probability test. Paternity tests were performed using CERVUS 3.0 for 53 offspring and 39 candidate sires for each offspring. For paternity analysis, the proportion of loci typed was 0.9691 and the simulated genotyping error rate was set at 0.01. Critical values of LOD were determined for 90% and 99% confidence levels based on simulations of 100,000 offspring. This report describes a set of 109 highly informative bovine SNP markers distributed among 28 autosomes and both sex chromosomes for use in paternity analysis in zebu breeds.
Financial support: CAPES, EMBRAPA, CNPq
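
The allele-frequency filter described in this abstract can be illustrated with a minimal sketch. It assumes genotypes are stored in Illumina-style A/B coding ("AA", "AB", "BB", with None for missing calls) and that the 0.44 to 0.56 window is applied to the frequency of the A allele, which is equivalent to requiring a minor allele frequency of at least 0.44. The function names and the toy data are hypothetical and are not taken from the authors' pipeline.

```python
def allele_a_frequency(genotypes):
    """Frequency of allele A among called genotypes ('AA', 'AB', 'BB').

    Missing calls are given as None and ignored. Returns None when
    no genotype was called for the SNP.
    """
    calls = [g for g in genotypes if g is not None]
    if not calls:
        return None
    a_count = sum(g.count("A") for g in calls)
    return a_count / (2 * len(calls))


def select_informative(snp_genotypes, low=0.44, high=0.56):
    """Return the SNP ids whose A-allele frequency lies in [low, high]."""
    selected = []
    for snp_id, genotypes in snp_genotypes.items():
        freq = allele_a_frequency(genotypes)
        if freq is not None and low <= freq <= high:
            selected.append(snp_id)
    return selected


# Toy data (illustrative only, not from the study).
toy = {
    "snp1": ["AA", "AB", "BB", "AB"],   # A frequency 0.5
    "snp2": ["AA", "AA", "AB", "AA"],   # A frequency 0.875
    "snp3": [None, None],               # no calls
}
print(select_informative(toy))
```

A spacing criterion between selected markers, as mentioned in the abstract, would be applied as a second pass over the SNPs that survive this filter.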


Topic: Genomics, Evolution and Phylogeny PI: GEP05

IDENTIFYING GENES WITH HIGH DIFFERENTIATION AMONG HUMAN POPULATIONS: EVOLUTIVE INFERENCES AND BIOMEDICAL APPLICATIONS.
Soares-Souza G B1 , Chevitarese J1 , Magalhães W1 , Rodrigues M1 , Gilman R3,4 , Yeager M5 , Chanock S2,6 , Tarazona-Santos E1,2
1 Universidade Federal de Minas Gerais
2 Section of Genomic Variation - National Institute of Health
3 Asociación Benéfica PRISMA
4 Bloomberg School of Public Health
5 Core Genotyping Facility - National Institute of Health
6 Laboratory of Translational Genomics - National Institute of Health

Genetic diversity is related to phenotypic differentiation among human populations. Even though most of this variation occurs among individuals, and interaction with the environment plays a key role in phenotypic determination, the study of variants that are genetically structured in human populations is crucial in two major fields of biology: evolution and medicine. Investigating loci with differentiated allelic frequencies in human populations allows the improvement of case-control association studies in admixed populations and the discovery of genetic polymorphisms that have experienced natural selection. In this context, we developed a relational database using MySQL and scripts using the programming environment R (http://www.r-project.org/). We studied the hierarchical genetic structure of 1442 SNPs located in 411 genes related to immune response, carcinogenesis and pharmacogenetics. This genetic characterization was made for 1198 individuals from 60 worldwide populations belonging to the HGDP, SNP500Cancer and Native American populations of Ecuador and Peru. The following results emerge from the Analysis of Molecular Variance approach: we identified 196 polymorphisms located in 111 genes that can lead to spurious association in case-control studies performed in tri-hybrid populations such as the Brazilian population. In this context, many genetic markers were recognized in alternative models of bi-hybrid populations formed from European, Native American and West African populations. We show 36 SNPs highly differentiated between East Asians and Amerindians that could have undergone positive selection or allele surfing.
Using data on locus heterozygosity, we pointed out some loci possibly under one of three types of natural selection: positive, purifying and balancing. The combined use of distinct genetic statistics in different populations can not only improve the design of epidemiological studies but also clarify aspects of the prevalence of human diseases and tell us a little about human history. Supported by: FAPEMIG, CNPq, CAPES, NIH


Topic: Genomics, Evolution and Phylogeny PI: GEP08

LINKAGE DISEQUILIBRIUM PATTERNS IN DAIRY GYR BREED


Muniz M N M1,3 , Andrade L G D3 , Guedes E3 , Tagliatti R F3 , Machado M A2 , Verneque R D S2 , Guimarães M F M2 , Arbex W A1,2 , Silva M V G B D1,2
1 Núcleo de Bioinformática de Juiz de Fora - Universidade Federal de Juiz de Fora/Embrapa Gado de Leite - MG - Brazil
2 Embrapa Gado de Leite - MG - Brazil
3 Laboratório de Bioinformática e Genômica Animal - Embrapa Gado de Leite - MG - Brazil

For single nucleotide polymorphism (SNP) markers to be used in a livestock breeding program, they should be in linkage disequilibrium (LD) with the target regions. Haploview software, which performs its calculations on genotypic data, was used to compute LD (r²) between SNPs located on the same chromosome. This study presents first-generation LD map statistics for the genome of indicine (Bos indicus) cattle of the Dairy Gyr breed. DNA samples from 379 Gyr animals were genotyped with the Illumina BovineSNP50 Bead Chip (Illumina Inc., San Diego, CA). Only SNPs with a minor allele frequency (MAF) > 0.03 were included in the LD analysis, totaling 26,094 SNPs (46.4% of the total SNPs available). The pairwise r² statistics of SNPs up to 5 Mb apart across the genome were estimated. For pairwise distances of < 25 kb, a mean value of r² = 0.18 ± 0.26 was observed, and it dropped to 0.16 ± 0.24 at 50-75 kb. The proportion of SNPs in useful LD (r² ≥ 0.25) was 20.2% for distances between 50 and 75 kb. LD structure is best described using a haplotype block model, in which a block is defined as a region where 95% of the combinations of SNPs are in very high LD, reflecting historical recombination. A total of 440 haplo-blocks spanning 71,949.092 kb (2.8%) of the genome and containing 1,577 SNPs (6%) were detected. The mean and median block lengths were estimated as 2,481 ± 1,429.06 kb and 1,962.572 kb, respectively. A set of tag SNPs has been identified, which will be useful for further fine-mapping studies. The number of haplo-blocks in Gyr cattle is lower than in Holstein cattle. There are two hypotheses that can explain these findings. First, taurine breeds have a higher level of LD.
Second, modern breeding programs increased the extent of LD in Europe and caused differences in LD between genomic regions. Financial support: FAPEMIG, EMBRAPA, CNPq, CAPES
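
As a rough illustration of the r² statistic used in this abstract, the sketch below computes it for a pair of biallelic SNPs from phased haplotypes. This is a simplifying assumption: Haploview works from unphased genotypes and estimates haplotype frequencies internally, which is not reproduced here. The function name and the 0/1 allele coding are hypothetical.

```python
def r_squared(haplotypes):
    """Pairwise LD (r^2) between two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) tuples,
    with alleles coded 0 or 1 (assumed phased).
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n          # frequency of allele 1 at SNP 1
    p_b = sum(h[1] for h in haplotypes) / n          # frequency of allele 1 at SNP 2
    p_ab = sum(1 for h in haplotypes if h == (1, 1)) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0                                   # a monomorphic SNP carries no LD information
    return d * d / denom


# Complete LD: the two SNPs always carry matching alleles.
print(r_squared([(1, 1), (1, 1), (0, 0), (0, 0)]))
# Linkage equilibrium: all four haplotypes are equally frequent.
print(r_squared([(1, 1), (1, 0), (0, 1), (0, 0)]))
```

With r² in hand, the "useful LD" proportion reported above corresponds to the fraction of SNP pairs in a distance bin whose r² is at least 0.25.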


Topic: Genomics, Evolution and Phylogeny PI: GEP05

EFFECT OF COVERAGE DEPTH ON DE NOVO BACTERIAL GENOME ASSEMBLY USING SHORT READ
Tieppo E1 , Tibes J H1 , Gehlen M A C1 , Guizelini D1 , Marchaukoski J N1 , Pedrosa F D O1,2 , Oliveira L F D1 , Souza E M D1,2
1 Programa de Pós-graduação em Bioinformática - UFPR
2 Departamento de Bioquímica e Biologia Molecular - UFPR

Recent developments in sequencing technologies have dramatically decreased costs and increased sequence output. Among these new technologies is the Illumina-Solexa platform, a deep sequencing approach capable of delivering several gigabases of sequence in reads between 36 bp and 100 bp long. The sequences are produced rapidly, allowing high coverage of the genome. However, the short read length brings new challenges for assembly programs, leading to the development of a new generation of assemblers and strategies for de novo genome assembly. The new assemblers, for example, cannot rely on the traditional metrics and strategies of conventional algorithms, since short reads impose constraints on the usual overlap-based algorithms. In this work we simulated different sequencing coverage depths of a bacterial genome to analyze the efficiency of the short read assembler Velvet. Random error-free sequences of 36 bp were generated from the 2 Mbp genome of Streptococcus suis P1/7 and used for de novo genome assembly at coverage depths ranging from 10x to 100x. The reads were assembled with the Velvet package, and the parameters of the resulting assemblies were compared across coverage depths. Little or no variation in the assembly parameters was observed at coverage depths above 30x. A coverage depth of 10x did not provide sufficient data to overcome the overlap threshold needed to confirm the data, whereas coverage depths from 20x to 30x produced the best results; no improvement of the final assembled genome was obtained with coverage above 30x. The results suggest that genome projects could benefit from careful selection of input sequences for optimum performance at minimum cost. Financial support: INCT/CNPq/MCT, CAPES.
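
The read simulation described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' simulator: it assumes a linear genome string, uniformly random read start positions, and that the number of reads is chosen as coverage × genome length / read length. The function name, the seed parameter and the toy genome are hypothetical.

```python
import math
import random


def simulate_reads(genome, read_length=36, coverage=30, seed=0):
    """Sample error-free reads uniformly from a linear genome.

    The number of reads is chosen so that
    n_reads * read_length / len(genome) is at least `coverage`.
    """
    if len(genome) < read_length:
        raise ValueError("genome is shorter than the read length")
    rng = random.Random(seed)                    # seeded for reproducibility
    n_reads = math.ceil(coverage * len(genome) / read_length)
    last_start = len(genome) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)       # inclusive on both ends
        reads.append(genome[start:start + read_length])
    return reads


# Toy genome (illustrative only); a real run would load a FASTA sequence.
toy_genome = "ACGT" * 900                        # 3,600 bp
reads = simulate_reads(toy_genome, read_length=36, coverage=10)
print(len(reads))
```

For a 2 Mbp genome and 36 bp reads, the same arithmetic gives roughly 556,000 reads at 10x and about 1.67 million reads at 30x. A circular bacterial chromosome would additionally need reads that wrap around the origin, which this sketch ignores.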


Topic: Genomics, Evolution and Phylogeny PI: GEP04

THE NITROGEN METABOLISM UNDER A SYSTEM BIOLOGY PERSPECTIVE: THE USAGE OF ENRICHMENT ANALYSIS TO DETECT THEIR CONSERVED AND LINEAGE-SPECIFIC MODULES IN CYANOBACTERIA, RHIZOBIALES AND BURKHOLDERIALES
Souza R A D2 , Lobo F P1
1 Department of Parasitology, Federal University of Minas Gerais
2 Ezequiel Dias Foundation

Nitrogen is an essential element for all organisms and on average accounts for up to 6.25% of their dry mass, being a constituent of nucleic acids, proteins and other biomolecules. Although nitrogen is one of the most abundant elements on Earth, it is also one of the most limiting for biological growth, since it is largely found in the virtually non-reactive form dinitrogen (N2). Consequently, most organisms must obtain their nitrogen in an accessible form such as proteins, ammonia, or nitrate. Dinitrogen is made available to living organisms through nitrogen fixation, which may occur by biological, industrial or climatic processes. Biological nitrogen fixation, responsible for most of the nitrogen fixation that occurs nowadays, is an ancient metabolic pathway that evolved early in the history of life and is widely distributed in several clades of microorganisms. This process is not monophyletic, its origin and distribution across taxa having been modified as a function of selective pressures and of processes such as gene duplication, gene loss and gene transfer. In this work we compared three distinct nitrogen-fixing monophyletic groups to detect groups of homologous genes significantly enriched or depleted in one lineage when compared with another, in order to identify the conserved and variable functional modules of this pathway in these groups. The nitrogen-fixing groups chosen were the Phylum Cyanobacteria and the Orders Rhizobiales and Burkholderiales, which were compared with the entire Bacteria Domain in order to evaluate the KEGG Orthology (KO) groups significantly represented in these lineages when compared with all bacterial genomes already present in KEGG. Since each KO cluster contains information about a given group of homologous genes (such as their corresponding genomes and metabolic roles), it is possible to objectively evaluate whether it is differentially represented in a given set of complete genomes T when compared to another given set of complete genomes t by means of statistical testing.
For that, four numbers were defined for each KO: N (genomes in T), n (genomes in t), X (fraction of T that possess the KO) and x (fraction of t that possess the KO). With these numbers, a pipeline combining a chi-square test with FDR correction was used to evaluate whether the given KO was significantly more or less enriched in T when compared with t. Results are displayed in a graphical pathway-like form by using the KEGG::API module, allowing a deeper understanding of their distribution patterns. The phenotypic trait known to occur in each of those groups (nitrogen fixation) was readily detected, with all groups presenting the genes responsible for nitrogen reduction significantly overrepresented, as well as several other lineage-specific modules. We also detected possible ancient modules, with local groups of conserved KOs shared among all three taxa with frequencies closer to those of typical bacteria. We believe these findings suggest that the enrichment analysis procedure, when applied to analyze the frequency of shared groups of enzymes in distinct taxa, is a suitable procedure to add valuable information to the understanding of the evolution of molecular pathways.
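
The per-KO test described in this abstract can be sketched with the standard library alone. The sketch below assumes a 2x2 Pearson chi-square test without continuity correction on genome counts (genomes with and without the KO in each set) followed by Benjamini-Hochberg FDR adjustment; the abstract does not state which chi-square variant or FDR procedure was used, so both are assumptions, and the function names are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 d.f., no continuity correction) for a 2x2 table.

    a: genomes in T with the KO      b: genomes in T without it
    c: genomes in t with the KO      d: genomes in t without it
    Returns (chi2, p_value).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0                          # degenerate table, nothing to test
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-square with 1 d.f.: erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                 # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy counts (illustrative only): one KO present in every genome of T and in none of t.
print(chi_square_2x2(20, 0, 0, 20))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

In terms of the abstract's notation, the table entries would be a = X·N, b = (1 − X)·N, c = x·n and d = (1 − x)·n, computed once per KO before the adjustment is applied across all KOs.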


Topic: Genomics, Evolution and Phylogeny PI: GEP04

PASSATEMPO: SEQUENCING THE GENOME OF A BRAZILIAN VACCINIA VIRUS STRAIN


Drumond B P1,2 , Araújo F M G1 , Zerlotini A3 , Aguiar E R G R3 , Ferreira J M S4 , Campos R K4 , Abrahão J4 , Coimbra R S1,3 , Kroon E G4 , Oliveira G C1,3
1 Genomics and Computational Biology Group, FIOCRUZ-MG
2 Laboratório de Virologia - Departamento de Parasitologia, Microbiologia e Imunologia - UNIVERSIDADE FEDERAL DE JUIZ DE FORA
3 Center for Excellence in Bioinformatics (CEBio), FIOCRUZ-MG
4 Laboratório de Vírus - Departamento Microbiologia - UNIVERSIDADE FEDERAL DE MINAS GERAIS

Since the 1960s, several Vaccinia virus (VACV) strains (genus Orthopoxvirus, family Poxviridae) have been frequently isolated from different hosts (rodents, cattle, horses and humans) in Brazil. Phylogenetic analysis and biological characteristics regarding virus pathogenesis in a murine model confirmed the existence of at least two viral populations circulating in the country. Moreover, previous studies indicated that a Brazilian Vaccinia virus strain named Passatempo could be a natural recombinant virus. The aim of this study was to sequence the genome of Vaccinia virus strain Passatempo. The virus was grown in Vero cells and purified using a sucrose cushion, and total DNA was extracted. A fragment genomic library was constructed using the methodology of the SOLiD system 3.0. A total of 8,324,453 reads (50 bp) were generated, but only 471,617 reads were mapped to the reference genome Vaccinia virus Western Reserve, with an estimated coverage of 121X, using the software SHRiMP 1.3.2 (http://compbio.cs.toronto.edu/shrimp/). BlastN analysis indicated that this was due to cellular DNA contamination (hits with the Macaca mulatta genome). The software SOLiD System De Novo Assembly Tools (SDNAT, http://solidsoftwaretools.com/gf/project/de version 1 and version 2 were used to perform de novo assembly. Using SDNATv1, 721 contigs were assembled with an average size of 163 bp, while SDNATv2 generated 45 contigs with an average size of 666 bp. All the contigs were compared to the M. mulatta genome (BlastN parameters: > 80% identity, > 80% of read extension and E-value < 1e-5) to eliminate contaminant sequences from the Vero cell genome. Of the 721 contigs (SDNATv1), 378 did not present similarity with the M. mulatta genome and were filtered, while no contig assembled by SDNATv2 was filtered. The 378 contigs were mapped to the reference genome and used to perform the genome assembly of Passatempo using the software Velvet, resulting in the assembly of 88 contigs.
Mapping reads with BioScope and different strategies to eliminate contamination sequences will be carried out in order to improve those preliminary results. Supported by: CNPq, FAPEMIG (CBB-1181/08), NIH-FIC (TW007012), CAPES/CDTS-FIOCRUZ, FIOCRUZ-MG.
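
The coverage figure quoted in this abstract can be checked with simple arithmetic. The sketch below assumes that mean coverage is computed as mapped reads × read length / reference length, and it assumes a reference length of 194,711 bp for Vaccinia virus Western Reserve; that length is not stated in the abstract and is used here only for illustration. The function name is hypothetical.

```python
def mean_coverage(n_reads, read_length, genome_length):
    """Average depth of coverage: total mapped bases over reference length."""
    return n_reads * read_length / genome_length


TOTAL_READS = 8_324_453        # reads generated (from the abstract)
MAPPED_READS = 471_617         # reads mapped to the reference (from the abstract)
READ_LENGTH = 50               # bp (from the abstract)
WR_GENOME_LENGTH = 194_711     # bp, ASSUMED reference length, not given in the abstract

coverage = mean_coverage(MAPPED_READS, READ_LENGTH, WR_GENOME_LENGTH)
mapped_fraction = MAPPED_READS / TOTAL_READS

print(f"estimated coverage: {coverage:.1f}X")
print(f"fraction of reads mapped: {mapped_fraction:.2%}")
```

Under these assumptions the estimate is consistent with the reported 121X, and it makes explicit that fewer than 6% of the reads mapped to the viral reference, which fits the contamination with host cell DNA described above.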


Topic: Genomics, Evolution and Phylogeny PI: GEP06

THE GENOME PROJECTS OF THE CLOSELY RELATED CACAO PATHOGENS MONILIOPHTHORA RORERI AND MONILIOPHTHORA PERNICIOSA
Carazzolle M F1,2 , Costa G G L1 , Herai R H1,3 , Junior O R1 , Nascimento L C1 , Teixeira P J1 , Tiburcio R A1 , Mondego J M C4 , Pereira G A G1
1 Laboratório de Genômica e Expressão, Instituto de Biologia, UNICAMP
2 Centro Nacional de Processamento de Alto Desempenho em São Paulo (CENAPAD-SP), UNICAMP
3 Laboratório de Bioinformática Aplicada (LBA), Embrapa Informática Agropecuária - EMBRAPA
4 Instituto Agronômico de Campinas (IAC)

The basidiomycetes Moniliophthora roreri and Moniliophthora perniciosa are the etiologic agents of the two most devastating diseases of cacao (Theobroma cacao): frosty pod rot and witches' broom, respectively. The species are very closely related, and even hybrid cells have been previously obtained. In cacao, both species infect pods, causing necrosis, and M. perniciosa is also able to invade other tissues, causing changes in plant metabolism and development, such as hyperplasia and hypertrophy. In order to understand the molecular basis of these organisms, we sequenced the genomes of these two species. We also obtained transcriptomic data (RNA-seq) under several different conditions, including the interaction between cacao and both pathogens. This work reports a pipeline used to assemble and compare these genomes and to identify differential gene expression between libraries. The genomic and transcriptomic sequences for both organisms were obtained using high-throughput sequencing technology (454/Roche and Solexa/Illumina). The Solexa and 454 reads were assembled into longer contigs using the de novo assemblers Velvet and Newbler, respectively. The hybrid assembly was performed by combining Solexa contigs and 454 contigs into final hybrid contigs using a pipeline developed in our group. Ab initio and comparative gene predictions were obtained with the Augustus and Exonerate programs, using a previous training set (including RNA-seq data) and closely related organisms, respectively. The final set of gene models was obtained as the union of the ab initio and comparative approaches.
The genes expressed in both pathogens were identified, for each library, by mapping the RNA-seq reads onto the predicted gene models. In order to identify the genes expressed in T. cacao, the RNA-seq reads were mapped onto a transcriptome assembly of T. cacao. The transcriptome assembly was obtained from around 160,000 T. cacao ESTs available at NCBI, assembled with the CAP3 program. The analysis of differentially expressed genes between libraries was performed using the DEG-seq package. The M. perniciosa genome project (www.lge.ibi.unicamp.br/vassoura) and the M. roreri genome project (www.lge.ibi.unicamp.br/roreri) involve several Brazilian and international laboratories. Comparative genomics at the level of structure, gene content and orthologous gene sequence can provide new insights that can help this community to increase efforts in the investigation of cacao diseases.


Topic: Genomics, Evolution and Phylogeny PI: GEP06

DIVERSITY AT THE COPY NUMBER VARIABLE BETA-DEFENSIN LOCUS IN NATIVE AMERICAN POPULATIONS: EVOLUTIONARY AND BIOMEDICAL IMPLICATIONS
Zuccherato L W1 , Hardwick R J2 , Silva M C F D1,3 , Zamudio R1 , Gilman R H4 , Hollox E J2 , Tarazona-Santos E1
1 Departamento de Biologia Geral, Universidade Federal de Minas Gerais/ Brazil
2 Department of Genetics, University of Leicester/ UK
3 Fundação Hemominas/ Brazil
4 Universidade Peruana Cayetano Heredia/ Peru
5 Johns Hopkins School of Public Health, Johns Hopkins University/ USA

Copy number variable regions are segments of DNA greater than 1 kilobase in length that vary in their number of copies relative to a reference genome. It is estimated that up to 12% of the human genome shows copy number variation (CNV), which therefore represents a major component of human genetic variation. CNVs can also have clinical effects and may account for susceptibility to some common and rare diseases. Beta-defensins are small cationic peptides that have a multifunctional role in the innate immune system. A cluster of at least seven beta-defensin genes located on chromosome 8p23.1 shows CNV, with 2-12 copies per genome. Although recent studies testing the association of beta-defensin copy number with Crohn's disease have produced conflicting results, a higher copy number of the beta-defensin genes is associated with an increased risk of psoriasis. The genomic structure of beta-defensins has been studied in European and African populations, but Native Americans have not been analyzed for these loci. We measured the copy number of the beta-defensin cluster on chromosome 8p23.1 in individuals from three non-admixed Native American populations settled between the Andean and the Amazonian regions of Peru: Shimaa (n=89) and Monte Carmelo (n=23) from the Matsiguenga ethnic group, and Ashaninka (n=150) from the Ashaninka ethnic group, using the Paralogue Ratio Test (PRT) technique. PRT is a development of quantitative PCR that uses a single primer pair to amplify both a test locus (for which the number of copies needs to be determined) and a reference locus (that has a known number of copies), leading to an accurate copy number determination.
For the beta-defensin locus, the Ashaninka and Monte Carmelo populations show a distribution of diplotypes more similar to that observed in African and European populations, with 3 and 4 copies as modal diplotypes ( 26%). The Shimaa also showed 4 copies as the modal diplotype (33.7%), but the 7-copy diplotype, rare elsewhere, was common (12.4%). β-defensin genes have an important role in innate immunity and are an interesting focus for epidemiological and evolutionary studies in Native Americans in the context of global genome diversity. The complex genomic structure of this region poses a challenge for CNV genotyping, and PRT is an accurate technique for β-defensin CNV determination. We are extending this study, quantifying CNV loci for the CCL3, CCL4, FCGR3 and FCGR2 genes, and investigating methodological issues related to CNV quantification. Financial Support: FAPEMIG, EMBO, CAPES and Medical Research Council (UK).


Topic: Genomics, Evolution and Phylogeny PI: GEP07

EVOLUTIVE PATTERNS OF TRANSPOSABLE ELEMENTS IN DROSOPHILA GENOMES


Carvalho M O D1 , Loreto E L D S2
1 Universidade Federal do Rio Grande do Sul, Programa de Pós-Graduação em Genética e Biologia Molecular
2 Universidade Federal de Santa Maria, Centro de Ciências Naturais e Exatas, Departamento de Biologia

The vast availability of completely sequenced genomes opens a new perspective in the analysis of evolutionary phenomena that otherwise could not be completely understood. One interesting case is the complex pattern of evolution of transposable elements and, more specifically, the occurrence of horizontal transfer of this kind of genome inhabitant. To identify the evolutionary patterns that are characteristic of transposable element sequences in 12 Drosophila genomes, a comparative analysis between groups of orthologous genes and groups of transposon sequences was performed, both inter- and intra-specifically. From the groups of orthologous genes and transposon sequences, a number of evolutionary variables were extracted, such as Ka, Ks, Ka/Ks, LogDet distance, Kimura distance, and codon bias values determined by the indexes CAI, Enc, Fop and effective number of codons (Nc). Furthermore, variables describing nucleotide composition, such as GC, GC3 and the count of each nucleotide at the third codon position, were also computed. These variables were determined for a total of 66 groups of orthologous gene sequences and 66 groups of transposable element sequences, representing all the possible combinations between two genomes in the studied dataset. With the generated data it was possible to test the hypothesis that transposable element sequences that were subject to horizontal transfer have consistently lower values of Ks than the genes of the host genome, taking the Ks variable as an indicator of neutral evolution. Of the 66 groups of transposable element sequences, 16 genome pairs showed strong evidence of horizontal transfer. This was tested by comparing the Ks values of transposable and non-transposable orthologous genes with the non-parametric Mann-Whitney-Wilcoxon statistical test.
Retrieving the associated similarity values of the identified cases, it was verified that these were also the most conserved transposable sequences for each corresponding genome pair, indicating that the sequences were recently acquired. It is therefore possible to infer that the horizontal transfer of transposable elements is a present-day phenomenon, acting as one of the mechanisms for transposon maintenance in the species of the genus Drosophila. Supported by: CAPES, FAPERGS
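
The Ks comparison described in this abstract can be illustrated with a small sketch of a one-sided Mann-Whitney test. It assumes the U statistic is converted to a p-value through the normal approximation without tie correction, which is only reasonable for moderately large samples; the abstract does not say which implementation was used. The function name and the Ks values below are made up for illustration.

```python
import math


def mann_whitney_lower(te_ks, gene_ks):
    """One-sided Mann-Whitney test that TE Ks values tend to be lower.

    Returns (U, p) where U counts the (TE, gene) pairs in which the
    TE value is smaller (ties count 0.5) and p comes from the normal
    approximation without tie correction.
    """
    n1, n2 = len(te_ks), len(gene_ks)
    u = 0.0
    for a in te_ks:
        for b in gene_ks:
            if a < b:
                u += 1.0
            elif a == b:
                u += 0.5
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z)
    return u, p_one_sided


# Toy Ks values (illustrative only, not from the study).
te_ks = [0.01, 0.02, 0.03]
gene_ks = [0.10, 0.20, 0.30, 0.40]
u, p = mann_whitney_lower(te_ks, gene_ks)
print(u, round(p, 4))
```

A small p-value for a genome pair would be read, under the hypothesis stated above, as transposable element copies being less divergent at synonymous sites than host genes, which is the pattern taken as evidence of horizontal transfer.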


Topic: Genomics, Evolution and Phylogeny PI: GEP04

EVOLUTIONARY PLASTICITY INDEX: A STRAIGHTFORWARD METHOD TO EVALUATE EVOLUTIONARY PLASTICITY BASED ON ORTHOLOGS DISTRIBUTION
Dalmolin R J S1 , Castro M A A1 , Rybarczyk-Filho J L2 , Souza L H T1 , Almeida R M C D2 , Moreira J C F1
1 Departamento de Bioquímica - Universidade Federal do Rio Grande do Sul
2 Departamento de Física - Universidade Federal do Rio Grande do Sul

Genetic plasticity may be understood as the ability of a functional gene network to tolerate alterations in its components or structure. The majority of studies involving gene modifications in the course of evolution are concerned with nucleotide sequence alterations in closely related species. However, the analysis of large-scale data on the distribution of gene families in species that are not exclusively closely related can provide insights into how plastic or how conserved a given gene family is. Here, we analyze the duplicability (abundance) and distribution (diversity) of all eukaryotic orthologous groups (KOG) present in the STRING database, resulting in a total of 4,850 KOGs. This dataset comprises 481,749 proteins distributed among 55 eukaryotes. We describe the Evolutionary Plasticity Index (EPI) to evaluate the evolutionary plasticity and conservation of orthologous groups based on their abundance and diversity across eukaryotes. To further the KOG plasticity analysis, we estimate the average evolutionary distance among all proteins that take part in the same orthologous group. As a result, we found a strong correlation between the average evolutionary distance and the EPI. Additionally, we evaluate the EPI of mouse and yeast genes that cause a severe fitness impact when knocked out. We found low EPI in Saccharomyces cerevisiae genes associated with inviability and in Mus musculus genes associated with early lethality. Finally, we plot the evolutionary plasticity value in different gene networks from yeast and humans. As a result, it was possible to discriminate between more and less plastic areas of the gene networks analyzed. According to our results, EPI represents one step in understanding evolutionary relationships by identifying which gene networks have been more or less stable over the course of evolution.


Topic: Genomics, Evolution and Phylogeny PI: GEP04

COMPARATIVE ANALYSIS OF INTER-SPECIES DISTANCES MEASURED USING INTERNUCLEOTIDE DISTANCES AND STANDARD EVOLUTIONARY MODELS
Iserte J1 , Goni S1 , Stephan B1 , Borio C1 , Lozano M1
1 Depto. Ciencia y Tecnología, Universidad Nacional de Quilmes, Buenos Aires, Argentina.

In recent years, new massive sequencing technologies have allowed the generation of immense amounts of new sequences. The analysis of these data, particularly those that stem from complete genomes, has been very difficult with traditional tools. Therefore, alternative methods have been developed. Afreixo et al. (Bioinformatics, 2009) proposed the inter-nucleotide distance (IND) measure as a useful tool for genome analysis. In particular, they propose that this method can be used to construct non-hierarchical trees that can be interpreted as phylogenetic trees. In this work, we performed an in-depth analysis comparing the inter-species distances obtained and the trees reconstructed by IND with those obtained by traditional methods based on evolutionary models applied to sequence alignments of bacterial 16S RNAs. For the comparisons we used several collections of whole genomes from diverse bacterial genera. As a general result, we found that the trees reconstructed by means of a modified version of the IND method are compatible with those obtained by analyzing the 16S RNA with more traditional phylogenetic methods.
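
To make the inter-nucleotide distance idea concrete, the sketch below counts, for each nucleotide, the distances between consecutive occurrences of that same nucleotide along a sequence. It assumes this basic definition of IND; the modified version mentioned in the abstract and the step that turns distance distributions into inter-species distances are not described there and are not reproduced. The function name is hypothetical.

```python
from collections import Counter


def inter_nucleotide_distances(sequence):
    """Distribution of distances between consecutive equal nucleotides.

    Returns a dict mapping each of A, C, G, T to a Counter of
    {distance: number of occurrences}. Symbols other than A, C, G, T
    are skipped but still advance the position.
    """
    last_seen = {}                                   # base -> last position seen
    distances = {base: Counter() for base in "ACGT"}
    for position, base in enumerate(sequence.upper()):
        if base not in distances:
            continue
        if base in last_seen:
            distances[base][position - last_seen[base]] += 1
        last_seen[base] = position
    return distances


# Toy sequence (illustrative only).
profile = inter_nucleotide_distances("AACA")
print(dict(profile["A"]))
```

Two genomes could then be compared through some distance between their normalized distance distributions; which measure to use is left open here, since the abstract does not specify it.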


Topic: Genomics, Evolution and Phylogeny PI: GEP03

VARIATIONS IN THE HIV-1 POL GENE FROM DRUG-SENSITIVE AND RESISTANT HIV-1 STRAINS
Lopes L R1,2 , Lima J1 , Alcalde R2 , Duarte A2 , Casseb J2 , Paiva P1
1 Universidade Federal de São Paulo
2 Universidade de São Paulo

Background: Eliminating or controlling Human Immunodeficiency Virus type 1 (HIV-1) infection has been one of the most considerable efforts of the scientific community. However, the ability to produce viral variants protects HIV from the immune system and promotes resistance to antiretroviral therapy (ART). The lack of corrective (proofreading) ability of reverse transcriptase induces genetic mutations. This enzyme, encoded by the pol gene, has been a major target of ART. Thus, some treated HIV-infected patients carry HIV strains with important mutations in the pol gene that confer resistance to ART. This study aimed to investigate the influence of ART use on HIV and to evaluate the evolution of resistant and non-resistant HIV strains. Material and methods: Blood samples were collected from HIV patients attending the Hospital das Clínicas of the University of São Paulo, São Paulo, Brazil, totaling 26 samples from 11 patients, all treated. Patient samples were classified according to their drug resistance as S (sensitive), R (resistant, < 3 drugs) and MR (multi-resistant, resistant to all drugs). Genotype tests and PCR reactions targeting the HIV pol gene were performed on all samples. Sequencing reactions were performed in both directions using the ABI Prism BigDye Terminator Cycle Kit version 3.1 on an ABI 3130 DNA Sequencer (Applied Biosystems). Sequences were aligned using the ClustalW software. Phylogenetic trees were constructed by the neighbor-joining method using the HYPHY software. Results: The circular neighbor-joining phylogenetic tree of the pol region, with evolutionary distances determined by full-likelihood methods for samples with variability associated with genotype B, presents a marked division between sensitive and resistant (R and MR) HIV. An upper branch contains exclusively R and MR viruses, while the lower branch contains sensitive strains, except for two samples from the same patient. Additional tests are being performed.
Conclusions: Our results showed that pol sequences from resistant (R) and multi-resistant (MR) HIV-1 strains were susceptible to selective pressure and remained distinct from sensitive strains in the phylogenetic tree.


Topic: Genomics, Evolution and Phylogeny PI: GEP04

ORIGIN OF HUMAN CHROMOSOMAL DOMAINS


Quadros L C D2 , Melo H V2 , Ortega J M2
1 Universidade Federal de Minas Gerais
2 Departamento de Bioquímica e Imunologia - ICB, UFMG

Presently the position of genes in human chromosomes is given by physical mapping, since the sequencing of the Homo sapiens genome is complete. Moreover, human genes have been clustered with homologues by several different projects, including KEGG Orthology (KO). Knowing the clusters, we can determine the set of organisms that share homologues with Homo sapiens and therefore determine the clade in the human lineage (e.g. Bilateria, Chordata or Mammalia) that comprises all organisms sharing homologues with man, namely the clade that contains the Last Common Ancestor (LCA). Here we report the analysis of the distribution of human genes over two dimensions: chromosome positioning and evolutionary level in the human lineage. In a global view, the origin of human genes is concentrated in a small number of clades: 15% arose together with cellular structure (cellular organisms), 21% with the origin of the nucleus (Eukaryota), several genes appeared between the origins of Metazoa and Chordata (a total of 18%), and the following fractions arose at more recent clades: 19% at the origin of the pulmonary organisms (Euteleostomi), 9% at the Tetrapoda/Amniota origin, and 14% together with the development of the placenta (Eutheria/Euarchontoglires). Regarding chromosomal positioning, the global pattern is remarkably reproduced over all chromosomes, suggesting that the origin of genes is distributed along the genome. However, slightly distinct quantitative patterns are observed for some chromosomes, e.g. genes of Chordata origin are more frequent in Chromosomes 7 and 21, comprising 5% and 3.5% of the total, respectively. Thus, a hierarchical clustering of gene origin by chromosome could be built, generating five main groups. Remarkably, when we analyzed the distribution of paralogs, we found evidence that the expansion of gene lineages is an ancient event. For instance, a great number of genes dated to the Bilateria origin are concentrated in three regions of Chromosome 11, and a detailed analysis showed that they belong to the family of olfactory receptors. Lineage expansions starting in Bilateria also occur in seven other chromosomes (1, 9, 12, 14, 15, 18, 19). Chromosome 12 has an expansion starting at Chordata and Chromosome 1 at Eukaryota. Almost no lineage expansions are detectable in the more recent waves of gene origin (after the origin of Vertebrata). We conclude that the origin of human genes follows a similar pattern across chromosomes, with slight differences that allow groups of chromosomes to be identified, and that lineage expansion is concentrated in genes that originated before the origin of Vertebrata. Supported by: FAPEMIG
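
The LCA assignment described in this abstract can be illustrated with a short sketch. It is a simplified reading of the text rather than the authors' pipeline: the abbreviated `HUMAN_LINEAGE` list, the representation of each organism as a set of clade names, and the function name are all assumptions made for illustration.

```python
# Abbreviated human lineage, ordered from the most ancient clade to the
# most recent one (illustrative subset, not the full taxonomy).
HUMAN_LINEAGE = [
    "cellular organisms", "Eukaryota", "Metazoa", "Bilateria",
    "Chordata", "Vertebrata", "Euteleostomi", "Mammalia", "Eutheria",
]


def lca_clade(organism_lineages):
    """Return the clade of the human lineage that contains the LCA.

    organism_lineages: one set of clade names per organism that has a
    homologue of the human gene. With no other organism, the most
    recent clade in HUMAN_LINEAGE is returned.
    """
    oldest = len(HUMAN_LINEAGE) - 1
    for clades in organism_lineages:
        shared = [i for i, c in enumerate(HUMAN_LINEAGE) if c in clades]
        # Most recent clade this organism shares with human.
        most_recent_shared = max(shared, default=0)
        # The gene is at least as old as the most distant homologue.
        oldest = min(oldest, most_recent_shared)
    return HUMAN_LINEAGE[oldest]
```

For example, a gene with homologues in a mammal and in an insect would be dated to Bilateria under this scheme, because the insect shares no clade with human more recent than Bilateria.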


Topic: Genomics, Evolution and Phylogeny PI: GEP04

ENRICHED ANNOTATION AND FUNCTIONAL ASSIGNMENT OF PROTEINS OF EIMERIA SPP.: GUILTY BY ORTHOLOGY
Rangel L T L D1 , Durham A M2 , Gruber A1
1 Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo
2 Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo

Avian coccidiosis is an economically important disease of poultry worldwide and is caused by seven species of the genus Eimeria. The development of novel control measures requires a better understanding of the molecular mechanisms involved in the parasite life cycle and the relationship with the host. Our group has recently sequenced ORESTES reads from three different Eimeria species: E. acervulina, E. maxima and E. tenella. These sequences were assembled and submitted to a comprehensive annotation pipeline using the EGene 2 platform. The most common approach to assigning a putative function to newly characterized protein sequences is by similarity, using BLAST or other pairwise alignment tools. However, given the fragmented and incomplete nature of cDNA sequences reconstructed from ESTs, conserved domains are often not covered, hampering functional assignment. In addition, Eimeria is a member of the phylum Apicomplexa, and this group of parasites still presents a large fraction of proteins of unknown function. Nevertheless, the identification of conserved hypothetical proteins in Apicomplexa is important, since conservation provides good evidence that these proteins may play important physiological roles. To enrich our annotation of Eimeria proteins, we undertook a cross comparison using available proteome data of several apicomplexan organisms: T. gondii, P. falciparum, N. caninum, C. parvum, B. bovis and T. annulata. Using the Eimeria sequencing data, we generated sets of conceptually translated proteins longer than 50 amino acid residues, and used InParanoid to compare these sets of sequences to one another and against sets of proteins predicted from the apicomplexan parasite genomes. Next, we merged all pairwise ortholog clusters identified by InParanoid into multi-species clusters using MultiParanoid. This analysis allowed us to identify proteins that are evolutionarily conserved across different apicomplexan taxa and that may exert common functions in the members of the phylum. We also developed a new EGene component for orthology annotation, and used it with KOG (a database of eukaryotic orthologous groups) to perform the functional characterization of the proteins. After merging the orthologous group data with the functional characterization data, we were able to identify proteins that have no ascribed KOG classification but do have at least one annotated ortholog, besides the information on which proteins compose each cluster. Using this methodology, we managed to characterize 1,195 proteins from all sets by expanding the annotations of the orthologs, and to establish a relationship for 25,542 apicomplexan sequences as orthologs/inparalogs. We observed 142 orthologous groups containing proteins from the three Eimeria sets, and 49 of these groups were specific to this taxon. The total number of proteins included in common groups for the three eimerian species was 604. We also identified 47 orthologous groups that are shared by all analyzed apicomplexans, comprising a total of 1,085 proteins. This approach allowed us to improve our original functional annotation and to establish relationships across a very large set of apicomplexan proteins. These results may contribute to a better understanding of proteins that exert common and fundamental roles in the biology of this important group of parasites. Supported by: FAPESP
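
The step of merging pairwise ortholog clusters into multi-species clusters can be sketched as follows. This is only an illustration of the general idea using single-linkage merging with a union-find structure; it is not MultiParanoid's actual algorithm, and the function name and protein identifiers are hypothetical.

```python
def merge_pairwise_clusters(pairwise_clusters):
    """Merge pairwise clusters that share at least one protein.

    pairwise_clusters: iterable of sets of protein identifiers.
    Returns a sorted list of sorted multi-species clusters.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for cluster in pairwise_clusters:
        members = list(cluster)
        if members:
            find(members[0])  # register singletons as well
        for member in members[1:]:
            union(members[0], member)

    groups = {}
    for protein in list(parent):
        groups.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(group) for group in groups.values())
```

Two pairwise clusters that share a protein end up in the same multi-species cluster, which is the behavior the abstract describes at a high level.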


Topic: Genomics, Evolution and Phylogeny PI: GEP04

ORIGIN OF HUMAN PATHWAYS


Velloso H1 , Ortega J M1
1 Departamento de Bioquímica e Imunologia - ICB, UFMG

Human pathways can be obtained from the KEGG database, which also contains homologous groups (KEGG Orthology, KO) for each human gene. Using KO groups of homologues, we were able to estimate the origins of Homo sapiens genes by finding the clade (e.g. Bilateria, Chordata or Mammalia) that comprises the Last Common Ancestor (LCA) for each gene. With this information in hand, we sought to determine the evolution of human pathways. We analyzed 156 pathways comprising 4289 genes (1644 distinct) distributed amongst the pathways. Of the 156, a total of 48, 38, 27 and 13 pathways are formed by 1, 2, 3 and 4 sub-pathways, respectively. We selected the 48 pathways with 1 sub-pathway for further analysis due to their simplicity. Of these, 25 (52%) have acquired all their genes at the origin of Vertebrata, while 7 (15%) have done so at Eukaryota and 5 (10%) are shared with all cellular organisms. The process of pathway building was also analyzed. For example, the Phosphatidylinositol Signaling System, comprising 26 reactions, originates with one gene at the clade Cellular Organisms, forms two sub-pathways at Eukaryota (7 genes), evolves to three sub-pathways at the Fungi/Metazoa clade (10 genes), returns to two sub-pathways at Euteleostomi (20 genes) and becomes a single unit at Eutheria (25 out of the total 26 genes). However, a total of 29 (60%) pathways evolve over a single core (lonely pathway). For example, the Jak-STAT signaling pathway contains 22 genes and evolves with the following pattern of genes/clades: 3/Eukaryota, 7/Fungi-Metazoa, 9/Metazoa, 14/Eumetazoa, 16/Coelomata, 22/Euteleostomi. In conclusion, around 50% of the catalogued human pathways originated before the origin of Vertebrata. Also, the evolution of pathways surrounds a single core in only 60% of cases, where genes are acquired and aggregated during several periods of human evolution.


Topic: Genomics, Evolution and Phylogeny PI: GEP03

COMPUTATIONAL APPROACH TO STUDY THE GENOME REDUCTION AND LIFESTYLE OF CORYNEBACTERIUM PSEUDOTUBERCULOSIS STRAINS
D'Afonseca V1 , Ali A1 , Santos A R D1 , Pinto A C1 , Magalhães A A C1 , Faria C D J1 , Barbosa E1 , Dorella F A1 , Pacheco L G D C2 , Almeida S S D1 , Soares S D C1 , Abreu V A C D1 , Silva A3 , Moore R4 , Miyoshi A1 , Azevedo V1
1 Universidade Federal de Minas Gerais - UFMG
2 Universidade Federal da Bahia - UFBA
3 Universidade Federal do Pará - UFPA
4 CSIRO Livestock Industries - Australia

Corynebacterium pseudotuberculosis, a Gram-positive, facultative intracellular pathogen, is the etiologic agent of the disease known as caseous lymphadenitis (CL). CL affects mainly small ruminants such as goats and sheep and, on rare occasions, causes infections in humans. It is found throughout the world, but it has the most serious economic impact in Oceania, Africa and South America. Although C. pseudotuberculosis can cause major health and productivity issues in livestock, little is known about the molecular basis of its pathogenicity. In order to increase knowledge about C. pseudotuberculosis, the present work characterized two C. pseudotuberculosis genomes (Cp1002, isolated from goat; and CpC231, isolated from sheep). To achieve this goal, we have been involved in the assembly and the structural and functional annotation of the two Corynebacterium genomes. As a multi-step process, the whole annotation procedure involved the use of several algorithms. For structural annotation, the following were used: FgenesB (gene predictor: incorporation of all the gene information corresponding to the DNA content; this program was standardized using the C. diphtheriae genome); RNAmmer (rRNA predictor: identification of bacterial rRNAs, namely 5S, 16S and 23S rRNA, at a high accuracy rate); tRNA-scan (tRNA predictor: tRNA identification through the characterization of secondary structure and anticodon); and Tandem Repeat Finder (repetitive DNA predictor: recognition of repetitive DNA regions by similarity search against a database of DNA repeats). Functional annotation was performed through similarity analyses using several public databases, including NCBI and InterProScan. Manual annotation was performed in Artemis, and similarity searches were carried out at the amino acid level (BLASTp). As a result of analyzing the predicted genomes, the two bacteria showed high similarity with regard to genomic architecture, gene content and gene order. Analysis comparing C. pseudotuberculosis to other Corynebacterium species demonstrated that these pathogens present remarkable gene losses and have among the smallest genomes in the genus. Furthermore, C. pseudotuberculosis showed a lower GC content, at around 52%, and a minimal gene repertoire. C. pseudotuberculosis also has seven putative pathogenicity islands, which contain several classical virulence factors, such as fimbrial subunits, adhesion factors, iron uptake genes and secreted toxins. Additionally, all factors in the islands presented characteristics of horizontal transfer. These particular genome characteristics of C. pseudotuberculosis, as well as its acquired virulence factors in pathogenicity regions, provide evidence of its lifestyle and of the pathogenicity pathways used by this pathogen in the infection process. FINANCIAL SUPPORT: This work received financial support from CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico), CAPES (Coordenação de Aperfeiçoamento de Pessoal de Ensino Superior), FAPEMIG (Fundação de Amparo à Pesquisa do Estado de Minas Gerais) and the post-graduation program in Genetics of UFMG.


Topic: Genomics, Evolution and Phylogeny PI: GEP06

IDENTIFICATION, CHARACTERIZATION AND IN SILICO COMPARISONS OF CORYNEBACTERIUM PSEUDOTUBERCULOSIS PATHOGENICITY ISLANDS (PAIS) USING PIPS
Ali A1 , Soares S D C1 , D'Afonseca V1 , Santos A R D1 , Almeida S S D1 , Pinto A C1 , Magalhães A A C1 , Faria C J1 , Barbosa E1 , Dorella F A1 , Pacheco L G D C2 , Abreu V A C D1 , Silva A3 , Moore R4 , Miyoshi A1 , Azevedo V1
1 Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
2 Universidade Federal da Bahia, Salvador, Bahia, Brazil
3 Universidade Federal do Pará, Belém, Pará, Brazil
4 CSIRO Livestock Industries, Australia

C. pseudotuberculosis is a Gram-positive, non-sporulating, pleomorphic pathogen that belongs to the Actinobacteria group and is the causative agent of Caseous Lymphadenitis (CLA) [5]. CLA affects the lymphatic glands and/or causes abscesses in superficial lymph nodes and subcutaneous tissues of sheep and goats. The treatment of this disease with antibiotics is not effective, and the commercial vaccines licensed for use in sheep do not confer protection on goats. Therefore, it is very important to characterize the virulence factors of C. pseudotuberculosis in order to create more efficient vaccines. Virulence genes are frequently harbored by Pathogenicity Islands (PAIs), which are responsible for the high virulence and genomic plasticity of several pathogenic bacteria. Understanding bacterial heterogeneity and elucidating the processes that lead bacteria to this high plasticity are very important for evolutionary, epidemiologic and phylogenetic studies. PAIs of C. pseudotuberculosis were identified by means of PIPS, a software tool that performs analyses using different strategies based on the classical features of PAIs, such as: codon usage (Colombo SIGI_HMM) and G+C content deviations (Artemis: annotation tool); presence of virulence factors (mVIRDB), flanking tRNAs (tRNAscan_SE) and transposases (ENTREZ database); and absence from Corynebacterium glutamicum, a closely related non-pathogenic bacterium (ACT: Artemis Comparison Tool). Manual curation was performed on the 14 putative PAIs automatically identified by PIPS to refine the results. A comparative analysis of the 7 manually curated PAIs will be presented, harboring genes such as the fagA, B, C and D operon (iron acquisition), the pld gene (phospholipase D, with sphingomyelinase action) and PAI 3 of Corynebacterium diphtheriae, shedding new light that could assist directed strategies for the production of more efficient vaccines against CLA. Financial support: CNPq, CAPES, FAPEMIG-RGMG, FAPESPA-RPGP.
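
One of the PAI features listed in this abstract, G+C content deviation, can be illustrated with a sliding-window sketch. This is not how PIPS or Artemis compute it; the window size, step, threshold and function names are assumptions chosen only to show the idea of flagging regions whose G+C content departs from the genome-wide value.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_deviant_windows(genome, window=1000, step=500, threshold=0.05):
    """Return (start, end, deviation) for windows with atypical G+C.

    deviation is the window G+C content minus the genome-wide G+C
    content; a window is reported when |deviation| >= threshold.
    """
    genome_gc = gc_content(genome)
    hits = []
    last_start = max(len(genome) - window + 1, 1)
    for start in range(0, last_start, step):
        chunk = genome[start:start + window]
        deviation = gc_content(chunk) - genome_gc
        if abs(deviation) >= threshold:
            hits.append((start, start + len(chunk), round(deviation, 3)))
    return hits
```

Windows flagged in this way would only be candidates; as the abstract notes, other evidence such as codon usage, flanking tRNAs and virulence factors is combined before manual curation.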


TOPIC 2 STRUCTURAL BIOINFORMATICS AND MOLECULAR DYNAMICS


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD01

PROFRAGER: AN INTERACTIVE WEB PORTAL FOR CREATING FRAGMENTS LIBRARIES OF PROTEIN STRUCTURES
Santos K B1,2 , Custódio F L1,2 , Dardenne L E1,2
1 Laboratório Nacional de Computação Científica - LNCC/MCT
2 Grupo de Modelagem Molecular de Sistemas Biológicos - GMMSB/LNCC

The use of fragment libraries in protein structure prediction methods has demonstrated great potential for obtaining results with a good degree of accuracy. This comes from the fact that the fragments, small segments extracted from proteins with known structures, implicitly contain important structural relations. Their use can reduce the configuration space to be explored, allowing longer sequences to have their structures predicted and overcoming one of the biggest limitations of pure ab initio methods. In general, a good fragment library is constructed from local similarity between the amino acid sequences of the fragments and the target sequence. Additionally, the secondary structure of the fragments and the predicted secondary structure of the target sequence may be used. A program that generates fragment libraries should be customizable, i.e., have options that make it flexible, and should be automated in order to provide greater convenience for its users. A graphical interface for such a program in the form of an interactive portal is an attractive option because it does not require costly and time-consuming local software installations or the creation and maintenance of the database containing the geometries used in the construction of the libraries. The goal of this project was to create a program that generates fragment libraries, with an interactive internet portal as an interface, called ProFrager, which offers greater flexibility in the choice of options for fragment generation compared to other portals offering similar services. ProFrager creates fragment libraries for given target protein sequences from a non-redundant database of proteins with known structures extracted from the PDB (Protein Data Bank). The libraries may contain, for each position in the target sequence, any number of fragments of any length chosen by the user. ProFrager has the distinction of offering the user advanced options for the creation of libraries, such as specifying the minimum score that a fragment must obtain to be included in the library, creating fragments exclusively from structures homologous (or non-homologous) to the target sequence, selecting the amino acid similarity matrix (e.g., BLOSUM62, PAM80, etc.) and choosing between different databases: one with a maximum of 20% identity between sequences, resolution up to 2.0 angstroms and a total of 4,365 entries, or another with at most 50% identity between sequences, resolution up to 2.5 angstroms and a total of 9,485 entries. Fragment selection can be further enhanced by comparing the secondary structure predicted for the target sequence with that found in the protein database. In this case, when choosing the fragments that will be part of the library, it is possible either to combine the two scores (sequence similarity and secondary structure) or to use a Pareto strategy. Supported by: CNPq
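
The two selection modes mentioned at the end of this abstract, combining the two scores or using a Pareto strategy, can be sketched as below. The abstract does not give ProFrager's scoring details, so the weighted sum, the tuple layout `(fragment_id, seq_score, ss_score)` and the function names are assumptions for illustration; both scores are assumed to be "higher is better".

```python
def combined_score(seq_score, ss_score, weight=0.5):
    """Hypothetical weighted sum of the two fragment scores."""
    return weight * seq_score + (1.0 - weight) * ss_score


def pareto_front(candidates):
    """Return ids of fragments not dominated by any other candidate.

    candidates: list of (fragment_id, seq_score, ss_score).
    A fragment is dominated when another one is at least as good on
    both scores and strictly better on at least one.
    """
    front = []
    for frag_id, seq, ss in candidates:
        dominated = any(
            (seq2 >= seq and ss2 >= ss) and (seq2 > seq or ss2 > ss)
            for _, seq2, ss2 in candidates
        )
        if not dominated:
            front.append(frag_id)
    return front
```

A Pareto strategy avoids choosing a weight between sequence similarity and secondary-structure agreement, at the cost of returning a set of trade-off fragments rather than a single ranking.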


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD04

APPROACH APPLIED IN THE COMPUTATIONAL DISCOVERY OF NEW PHYTOCHEMICALS PRESENT IN MEDICINAL PLANTS, WITH THERAPEUTIC POTENTIAL IN NECK AND HEAD CANCER
Carvalho P V S D D1 , Silveira N J F1 , Omote D D Q1 , Gouvêa C M C P1
1 Universidade Federal de Alfenas - MG

This research project applies a computational approach to the discovery of new substances in medicinal plants with therapeutic potential in head and neck cancer. The improvement of these computational tools and the use of knowledge from other areas, such as bioinformatics, will be extremely relevant to the development of drugs with fewer side effects and greater specificity for head and neck cancer. The selected molecular marker is highly expressed in cancer cells of this cancer type and is related to the inhibition of cell death in neoplastic cells. However, nothing is known about the interactions of the selected phytochemicals with the active site of our molecular marker or their possible interference in the signaling pathway of this protein. The results of these interactions are presented. We predicted the structure of our molecular marker by homology using the program Modeller 9v8 and selected ligands with greater affinity and specificity for the active site of the protein. For this purpose, the substances were selected by molecular docking using the Maestro program. With these partial results, it will be possible, in a next step, to study the signaling pathway of our molecular marker in this type of cancer. Therefore, our study may serve as a basis for the rational design of lower-cost drugs with fewer side effects for patients with head and neck tumors. Furthermore, our results might be extrapolated to the treatment of other types of cancer. Financial support: CNPq.


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD03

THEORETICAL STUDY OF THE PUGNAC ANALOGUES IN COMPLEX WITH HOGA.


Alencar N A N D1 , Sousa P R M D1 , Santos A M D1 , Carmo M C L1 , Silva J L1 , Alves C N1
1 Laboratório de Planejamento e Desenvolvimento de Fármacos, Instituto de Ciências Exatas e Naturais, Universidade Federal do Pará, CP 11101, 66075-110, Belém, PA, Brazil.

The post-translational attachment of a single N-acetylglucosamine residue, via a β-O-glycosidic linkage, to serine and threonine residues of cytoplasmic proteins is called O-GlcNAc modification. This modification is commonly found in many proteins with a variety of functions. O-glycoprotein 2-acetamido-2-deoxy-β-D-glucopyranosidase (O-GlcNAcase) is a member of glycoside hydrolase family 84 (GH84) that is able to cleave O-GlcNAc from post-translationally modified serine and threonine residues of proteins. Human O-GlcNAcase (hOGA) has been used as a target for therapeutic agents because dysregulation of cellular O-GlcNAc levels has been implicated in several diseases, including type II diabetes, cardiovascular diseases and neurodegenerative diseases such as Alzheimer's and Parkinson's disease. The synthesis and development of new inhibitors is of great value to help in treating these diseases and to obtain models for the study of the biochemical structure and biological function of glycoproteins. Studies suggest that PUGNAc inhibits hOGA by mimicking the transition state of the O-GlcNAcase-catalyzed hydrolysis of N-acetylglucosaminide by virtue of its sp2 anomeric C1, similar to the oxocarbenium ion-like transition state. The three-dimensional structure of the N-terminal domain of hOGA has been determined by homology modeling from the bacterial homologue CpNagJ (PDB code: 2CBJ) and reported in previous work. In this study, we investigate the binding of PUGNAc derivatives to hOGA using computational modeling. We employed hybrid Quantum Mechanics / Molecular Mechanics (QM/MM) and molecular dynamics (MD) simulations to study the details of the hOGA-PUGNAc interaction, in order to explain the differences in inhibitory potency. Hybrid QM/MM molecular dynamics simulations of 1.5 ns were carried out to refine the target (hOGA). PUGNAc has a similar interaction with the N-terminal domain of hOGA: D174 forms a hydrogen bond with the amide nitrogen (1.93 Å), which is responsible for stabilizing the conformation of the acetamide group. W490 is not conserved between CpNagJ and hOGA; this residue projects from the CpNagJ C-terminal domain towards the catalytic domain and stacks with the PUGNAc phenyl ring, whereas in hOGA a tyrosine residue (Y286) stacks with the PUGNAc phenyl ring. For the series of PUGNAc derivatives, a good linear correlation (R2 = 0.72) was observed in a plot of log(1/KI) values of the inhibitors versus the QM/MM total ligand-protein interaction energies of the compounds. This result indicates that incremental changes to the N-acyl chain of the parent compound decrease the interaction energy between the PUGNAc analogues and hOGA. Free energies were obtained from FEP calculations at the AM1/MM level for PUGNAc derivatives in complex with hOGA and analyzed together with experimental kinetic and biological activity values for all systems under study. A direct relationship between the binding free energies and KI can be observed for the hOGA-PUGNAc complexes, which could be used as a guide in the design of new inhibitors. The correlation coefficient obtained for the linear fit of log(KI) versus the free energy contributions is 0.65. Thus, we conclude that the model of hOGA obtained by homology modeling seems to be a better model for the active site of this enzyme. Supported by: CAPES, PARD-UFPA, LPDF.


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD03

INTERACTION NETWORKS IN PROTEIN-LIGAND COMPLEXES


Oliveira E L M D1 , Almeida V M G2 , Meira J W1,2 , Silveira C H3 , Minard R C M1,2 , Pires D E V1 , Santoro M M2 , Lopes J C D4
1 Department of Computer Science - Federal University of Minas Gerais (UFMG)
2 Bioinformatics PhD program - Federal University of Minas Gerais (UFMG)
3 Federal University of Itajubá (UNIFEI), at Itabira-MG
4 Department of Chemistry - Federal University of Minas Gerais (UFMG)

Introduction: Proteins form a dense network of atomic interactions. Atoms are the nodes of this network, and the interactions between them are the edges. The architecture or spatial distribution of these atomic networks can reveal important principles of organization and function of a protein class or family. In this study we analyzed the networks of atomic interactions at the contact surface, formed by the interface-forming residues (IFR), in a group of proteins in complexes with protein ligands. Methods: Initially, we used a cutoff distance of 7 Å between alpha-carbons as the criterion for defining a contact (edge) and identified all fully connected subgraphs (cliques) of size n=3. The nodes are distinguished by residue name and position in the polypeptide chain. The alphabet of twenty amino acids was reduced to six disjoint groups according to their physicochemical properties. A database containing several protein complexes was carefully annotated in an attempt to identify networks of common interactions between proteins of the same class. We compared the networks formed by inter- and intra-chain interactions. Results: The results showed that the networks of atomic interactions formed by contacts in the interface region of protein-ligand complexes are well conserved within the same family. Furthermore, proteins of different classes or families that form complexes with the same ligand have similar inter-chain networks, suggesting that the ligand may determine or influence the residues involved in the interaction region of the proteins, even if these differ in sequence and structure. Corresponding author: ELM Oliveira (e-mail: emaynart@dcc.ufmg.br) Supported by: CNPq, CAPES and FAPEMIG
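
The contact definition and clique search in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: residues are keyed by a hypothetical integer identifier, coordinates are alpha-carbon positions in Å, and the function name is invented. The 7 Å cutoff and clique size of 3 come from the abstract.

```python
from itertools import combinations
from math import dist


def contact_triangles(ca_coords, cutoff=7.0):
    """Find all cliques of size 3 in an alpha-carbon contact graph.

    ca_coords: {residue_id: (x, y, z)} with coordinates in angstroms.
    Two residues are in contact when their alpha-carbons are at most
    `cutoff` apart. Returns a sorted list of (a, b, c) with a < b < c.
    """
    ids = sorted(ca_coords)
    neighbors = {i: set() for i in ids}
    for a, b in combinations(ids, 2):
        if dist(ca_coords[a], ca_coords[b]) <= cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)

    triangles = []
    for a, b in combinations(ids, 2):
        if b in neighbors[a]:
            # Any common neighbor c > b closes a triangle exactly once.
            for c in neighbors[a] & neighbors[b]:
                if c > b:
                    triangles.append((a, b, c))
    return sorted(triangles)
```

Labeling each node with one of the six reduced amino acid groups, as the abstract describes, would then allow triangles from different complexes to be compared by composition.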


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD06

IDENTIFICATION AND CHARACTERIZATION OF CAVITIES IN SPECIFIC REGIONS OF THREE-DIMENSIONAL STRUCTURES OF PROTEINS


Oliveira S H P D1 , Sobreira T J P1 , Neto J X2 , Oliveira P S L D1
1 Laboratório de Genética e Cardiologia Molecular - INCOR - HCFM/USP
2 Laboratório Nacional de Biociências

Understanding the spatial and physicochemical properties of protein cavities, and therefore of their active sites, is a crucial step for driving and improving rational drug design and the functional characterization of binding and catalytic sites. Our poster presents KFinder, an improved tool for mapping cavities located in specific regions of the three-dimensional (3D) structure of proteins. Our software employs a matrix-based model that relies on user-driven placement of a 3D grid. This matrix approach allows the extraction and identification of cavities in protein topologies. The application offers several features, such as cavity volume quantification and depiction of surface shape and extent. The exclusive user-driven placement feature boosts the program's capabilities, enabling complex searches for cavities occurring in specific regions within the protein, such as sub-sites of proteases and specific anchoring sites of kinases. Further, this steered approach allows the partition of complex sites such as NAD/ATP binding sites or the mapping of voids formed at protein-protein or protein-DNA interfaces. In addition, the software can also map the electrostatic properties of the surface, leading to a better physical and chemical characterization of binding and catalytic sites. The routine was integrated into a graphical modeling package, providing easy and powerful user interaction with the tool. The algorithm was validated using a set of distinct proteins. The software accurately described a diversity of cavity types representing binding sites of enzymes and receptors. The volume evaluation of enzyme cavities was also carried out on a large scale, following the pocket size evolution of the ALDH superfamily. Our algorithm has several advantages over existing software, providing shorter execution time, increased accuracy, greater accessibility and ease of integration in comparison to similar programs.


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD04

GENETIC ALGORITHMS WITH A COARSE-GRAINED MODEL FOR PROTEIN STRUCTURE PREDICTION


Oliveira L L D1 , Ishivatari L H U1 , Silva F L B D2 , Tinós R1
1 Department of Physics and Mathematics - FFCLRP, USP
2 Department of Physics and Chemistry - FCFRP, USP

Protein structure prediction (PSP) by computational methods remains an important problem in molecular biology. Combinations of physics-based and knowledge-based approaches to the PSP problem have obtained the best results according to CASP (Critical Assessment of Techniques for Protein Structure Prediction); all editions of CASP can be seen at http://predictioncenter.org/. However, there is still no robust computational tool that solves the structure prediction problem for a wide range of proteins. Considering that the native conformation of a protein is usually the thermodynamically most stable configuration, i.e., the one with the lowest free energy, PSP can be viewed as an optimization problem in which the structure with the lowest energy must be found among all possible structures. Nevertheless, this is an NP-hard problem, for which traditional optimization methods generally do not perform well. In recent years, some researchers have applied genetic algorithms (GAs) to NP-hard problems because of their intrinsic characteristics. The objective of this work is to test a coarse-grained (CG) force field, described by Yap et al. in 2003, which uses the α-carbon to represent the protein backbone. This model has been used in folding investigations, but its effectiveness in PSP via GAs had not yet been studied. The coarse-grained force field is composed of the following terms: bond angle, dihedral angle, Van der Waals and hydrogen bond. In GAs, each individual represents a solution to the optimization problem, in this case a possible conformation that is evaluated by the force field function. Thus, an individual is encoded by the set of torsion angles (φ and ψ) of each amino acid, restricted by the Ramachandran plot. To reduce the search over conformational space, we used a database of angles determined by crystallography and NMR. Initially, we used Protein G (PDB code: 2GB1) and an Acetylcholine Receptor (PDB code: 1A11) as test cases.
The smallest RMSD obtained was 1.3 Å for 1A11 and 10.1 Å for 2GB1. The tests performed indicate that the CG force field is able to identify topologies near the native state, although it has difficulty evaluating more distant topologies. Also, the model was not able to build β-sheets adequately, which can be attributed mainly to the non-bonded terms, which possibly need more information from the side chains. The structures created have lower energy than the native structure when evaluated by the CG model, which does not occur with a full-atom model. The results agree with the conjecture presented by Tozzini in 2005. Evaluating the structures generated by the CG model with a full-atom force field, we observe a high Van der Waals energy, because the CG force field does not directly use side-chain information. On the other hand, the use of a full-atom force field causes loss of secondary structure in short simulations, which was better maintained by the CG model. Supported by: FAPESP.
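The encoding described above (one individual = one list of torsion-angle pairs drawn from an angle database) can be sketched as follows. This is only an illustration under stated assumptions: `ANGLE_LIBRARY` holds placeholder values rather than the crystallography/NMR database, `toy_energy` is a stand-in and not the Yap et al. force field, and the selection, crossover and mutation operators are generic choices that the abstract does not specify.

```python
import random

# Hypothetical (phi, psi) pairs in degrees; placeholders for the angle database.
ANGLE_LIBRARY = [(-60.0, -45.0), (-120.0, 130.0), (-75.0, 150.0), (60.0, 45.0)]

def toy_energy(individual):
    """Placeholder fitness, NOT the CG force field: lowest for helical angles."""
    return sum((phi + 60.0) ** 2 + (psi + 45.0) ** 2 for phi, psi in individual)

def evolve(n_residues, pop_size=30, generations=50, mutation_rate=0.1, seed=0):
    """Minimise toy_energy over torsion-angle lists (needs n_residues >= 2)."""
    rng = random.Random(seed)
    population = [[rng.choice(ANGLE_LIBRARY) for _ in range(n_residues)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=toy_energy)
        parents = population[: pop_size // 2]      # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_residues)      # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: replace a residue's angles with another library entry.
            child = [rng.choice(ANGLE_LIBRARY) if rng.random() < mutation_rate
                     else gene for gene in child]
            children.append(child)
        population = parents + children
    return min(population, key=toy_energy)
```

Drawing mutated angles from the library rather than from the full range is what restricts the search to observed regions of the Ramachandran plot.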


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD04

MOLECULAR DOCKING OF ANTILEISHMANIAL NEOLIGNAN ANALOGUES


Carmo M C L1 , Nascimento J P1 , Silva N D F1 , Alencar N A N D1 , Monteiro P R S1 , Silva J L1 , Alves C N1
1 Laboratório de Planejamento e Desenvolvimento de Fármacos, Instituto de Ciências Exatas e Naturais, Universidade Federal do Pará, CP 11101, 66075-110, Belém, PA, Brazil

The leishmaniases are parasitic diseases caused by protozoa of the genus Leishmania and remain a severe public health problem, particularly in many tropical and subtropical regions. An estimated 1.5 to 2 million new cases of leishmaniasis occur worldwide each year, and the visceral manifestation is often fatal if untreated. Unfortunately, the lack of satisfactory chemotherapeutic agents and the failure to develop an effective vaccine are considered two stumbling blocks in combating this disease. The search for new drugs against leishmaniasis has pointed to neolignans and their derivatives as potential compounds. Neolignans are usually organic dimers derived from the oxidative coupling of allyl and propenyl phenols. Previous studies have evaluated the antileishmanial activities of synthetic sulfur and oxygen analogues of neolignans against parasite species that cause cutaneous and visceral leishmaniasis. Molecular docking is a key tool in structural molecular biology and computer-aided drug design. The goal of ligand-protein docking is to predict the predominant binding mode(s) of a ligand with a protein of known three-dimensional structure. In the present study, neolignans and their derivatives were docked into the structure of cyclophilin from Leishmania donovani (LdCyp), obtained from the Protein Data Bank (code: 2HAQ), using the AutoDock Vina program, version 1.2, with a standard set of parameters. The binding site of the enzyme comprises the important residues Arg78, Phe83, Gln86, Gly95, Ala123, Asn124, Gln133, Phe135, Trp143, Leu144 and His148. Residues His148, Asn124, Trp143 and Arg78 showed similar interactions with the neolignan compounds in enzyme 2HAQ. The affinity energies obtained from the docking study showed low correlation with biological activity (R² of 0.31).
However, the results are sufficiently promising, since the compounds are structurally similar to a set of neolignans with antileishmanial activity reported in the literature. Thus, these compounds may be subjected to biological activity and toxicological analyses against leishmaniasis. As a next step, we suggest molecular dynamics analysis to verify whether the ligands remain in the active site of the enzyme. Supported by: CNPq, LPDF, UFPA
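For readers unfamiliar with the statistic quoted above, the following sketch shows how an R² between docking energies and measured activities could be computed for a simple linear fit. The function name `r_squared` is hypothetical and no data from the study are reproduced here.

```python
def r_squared(x, y):
    """Coefficient of determination of a simple linear fit of y on x.

    For one predictor this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)
```

Here `x` would hold the docking energies and `y` the measured activities of the same compounds, in the same order; a value near 0.31 means that under a third of the variance in activity is explained by the docking score.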


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD03

STRUCTURAL STUDIES OF A SURE PROTEIN FROM XYLELLA FASTIDIOSA


Reis M A D1 , Saraiva A M2 , Santos M L D1 , Souza A P D2 , Aparicio R1
1 Institute of Chemistry - UNICAMP
2 Institute of Biology/CBMEG - UNICAMP

The bacterium Xylella fastidiosa is a phytopathogenic organism that infects a wide range of hosts, including grape, almond, peach and citrus. In Brazil, it causes citrus variegated chlorosis (CVC), a disease that attacks sweet orange trees, with losses reaching US$ 300 million per year. Genome data from Xylella fastidiosa strain 9a5c have identified several ORFs related to its phytopathogenic adaptation and survival. The SurE gene encodes a survival protein E (XfSurE) whose function is not well understood, with nucleotidase and exopolyphosphatase activities being reported for the homologous protein from Escherichia coli. Functional and solution structural studies indicated that the enzyme is a tetramer in solution and exhibits highly positive cooperative behavior. Extensive computational modeling of different molecular arrangements suggested that a probable mechanism for the observed allosteric behavior is torsional movement of the protein in solution. To better understand the mechanism of XfSurE, crystallographic studies were initiated, and the main results are reported in the present work. The protein was crystallized in two crystal forms, both belonging to space group C2 and differing in unit-cell parameters, solvent content and internal molecular arrangement. The first form (1.93 Å resolution) has a = 172.36 Å, b = 84.18 Å, c = 87.24 Å, β = 96.59°, with a tetramer in the asymmetric unit (ASU), while the second has a = 88.05 Å, b = 81.26 Å, c = 72.84 Å, β = 72.84° (2.9 Å resolution), with a dimer in the ASU. The structures were solved by molecular replacement with automated software packages and are currently under refinement. Torsion angles calculated from the refined structures may confirm whether different arrangements are present in the two crystal forms, thus reinforcing the hypothesis that SurE proteins employ an allosteric mechanism to carry out their biological functions. Acknowledgements: this work was supported by FAPESP and CNPq.
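As a small worked example of how the unit-cell parameters quoted above relate to cell size, the volume of a monoclinic cell (such as space group C2) follows from V = a·b·c·sin β. The sketch below applies this standard formula to the first crystal form; the function name is an assumption, and the calculation is not part of the reported work.

```python
import math

def monoclinic_cell_volume(a, b, c, beta_deg):
    """Volume of a monoclinic unit cell: V = a * b * c * sin(beta)."""
    return a * b * c * math.sin(math.radians(beta_deg))

# First crystal form, parameters as quoted in the abstract (angstroms, degrees).
v_form1 = monoclinic_cell_volume(172.36, 84.18, 87.24, 96.59)
print(f"Form 1 cell volume: {v_form1:.0f} cubic angstroms")
```

Dividing such a volume by the number of protein copies per cell and their molecular mass gives the Matthews coefficient, which is one common way to estimate the solvent content mentioned in the abstract.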


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD04

STUDY OF THE CATALYTIC MECHANISM OF THE ENZYME 5-ENOLPYRUVYLSHIKIMATE 3-PHOSPHATE SYNTHASE USING METHODS OF MOLECULAR DYNAMICS AND QM / MM
Santos A M1,2 , Lima A H1 , Alencar N A N D1 , Silva J L1,2 , Alves C N1
1 Laboratório de Planejamento e Desenvolvimento de Fármacos, Universidade Federal do Pará, CP 11101, 66075-110, Belém, PA, Brazil
2 Instituto de Ciências Biológicas, Faculdade de Biotecnologia, Universidade Federal do Pará, Belém, PA, Brazil

The enzyme 5-enolpyruvylshikimate 3-phosphate synthase (EPSPS) is the sixth enzyme of the shikimate pathway and catalyzes the reaction between phosphoenolpyruvate and shikimate-3-phosphate to form 5-enolpyruvylshikimate 3-phosphate (EPSP) and inorganic phosphate. EPSPS is a promising target for developing nontoxic antimicrobial agents, herbicides and antiparasitic drugs, because this pathway is essential in microorganisms, plants and parasites but absent in mammals. The catalytic reaction of EPSPS occurs in two steps: an addition, whose product is a tetrahedral intermediate (THI), and an elimination, which forms the products EPSP and inorganic phosphate. The objective of this study is to propose an energetically favorable mechanism that helps explain the second step of the EPSPS mechanism. In this report, the computational model for the QM/MM MD calculations was based on the crystal structure of the EPSPS-THI complex (PDB 1Q36), after performing the mutation Ala313Asp to study the wild-type enzyme. The semiempirical AM1 Hamiltonian was employed to describe the QM part (THI), while the rest of the system was described using the OPLS-AA and TIP3P force fields for protein and water molecules, respectively, as implemented in the DYNAMO library. Before starting the MD simulations, the protonation states of the residues at pH 7 were assigned using the empirical propKa program. After the hydrogen atoms were added, the structure was placed in a cubic box of pre-equilibrated water molecules (80 Å). Then, a series of optimization algorithms (steepest descent, conjugate gradient and L-BFGS-B) was applied.
Afterward, 100 ps of Langevin-Verlet MD at 300 K in the canonical thermodynamic ensemble were used to equilibrate the model. With the system pre-equilibrated, 1.3 ns of MD were run at 300 K. QM/MM optimizations were then carried out from structures derived from the AM1/MM MD simulations to generate potential energy surfaces. In the mechanism, THI and residues Glu341 and Lys22 were treated with the AM1 method, while the protein and the crystallization water molecules were treated with molecular mechanics. The potential energy surface of the reaction was constructed by varying the coordinate d1 over 3 Å and d2 over 2 Å. The coordinate d1 is related to the abstraction of a proton from the methyl group by Glu341. The coordinate d2 is the distance between the PO₄³⁻ group and atom C2. The results of these calculations show the formation of the products EPSP and inorganic phosphate. On the surface, the minimum-energy path represents a concerted step in which H leaves C3, followed by a spontaneous transfer of H from Lys22 to PO₄³⁻ and the consequent release of the phosphate group and formation of the product EPSP, after overcoming a barrier of 34.8 kcal/mol. The result shows that the deprotonation of C3 occurs simultaneously with the departure of the phosphate group, after the transfer of a proton from Lys22. Other proposed mechanisms are being evaluated. Supported by: CNPq, LPDF


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD04

MOLECULAR DOCKING OF INHIBITORS TO THE THREE-DIMENSIONAL STRUCTURE OF THE PFATP6 ENZYME FROM PLASMODIUM FALCIPARUM
Pinheiro S D S1 , Silva N D F1 , Nascimento S B1 , Alencar N A N D1 , Sousa P R M D1 , Silva J L1 , Alves C N1
1 Laboratório de Planejamento e Desenvolvimento de Fármacos, Instituto de Ciências Exatas e Naturais, Universidade Federal do Pará, CP 11101, 66075-110, Belém, PA, Brazil

Malaria is an infection caused by protozoa of the genus Plasmodium that affects more than 102 countries in tropical areas of the world. The disease can develop into a severe form, known as cerebral malaria, and be lethal if not treated early. The primary challenge in infection control is the constant emergence of parasite resistance to conventional antimalarials. Thus, the search for new antimalarial drugs is urgently needed. The theoretical design of biologically active substances is a widely exploited tool that aids the search for new substances with pronounced activity and/or lower toxicity, for developing alternative therapies for malaria. Several Ca²⁺ pumps were identified in the genome of the parasite, one of them being PfATP6, an orthologue of sarco/endoplasmic reticulum Ca²⁺-ATPase, which has been reported to be the target of artemisinin. This P-type ATPase is a membrane protein responsible for the active transport of ions across cell membranes. In this context, we carried out a molecular docking study of a series of compounds with antimalarial activity against the PfATP6 enzyme. The three-dimensional structure of the enzyme was determined by homology modeling, using as template the sarcoplasmic/endoplasmic reticulum calcium ATPase 1 (PDB code: 2ZBD), and reported in previous work. Docking simulations were performed using the AutoDock Vina program, version 1.2, with all possible degrees of freedom left free to rotate, with the aim of obtaining good results at low computational cost.
Some preliminary steps were carried out before docking, such as adding only the polar hydrogens to the ligands and increasing the rotational degrees of freedom of the bonds, to allow greater mobility and more interaction possibilities in the enzyme-inhibitor complex; one hundred (100) conformations were searched in ten (10) possible interactions. The results showed that, among the compounds used in the docking study, only three were positioned in the active site of the PfATP6 enzyme. The five best conformations in the active site were selected for each of the three ligands, and only one showed a lower energy value (-8.1 kJ mol⁻¹) and greater similarity to the crystallographic structure of 2ZBD. Docking simulation with the AutoDock Vina program gave a satisfactory description of the binding modes of the inhibitor in the active site of the PfATP6 enzyme, and therefore proved adequate for describing the enzyme-inhibitor interactions. Supported by: CNPq, LPDF.


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD03

EFFECT OF THE EXPLICIT FLEXIBILITY OF THE INHA ENZYME FROM MYCOBACTERIUM TUBERCULOSIS IN MOLECULAR DOCKING SIMULATIONS
Cohen E M L1 , Machado K S1 , Cohen M1 , Norberto de Souza O1
1 Pontifícia Universidade Católica do Rio Grande do Sul - PUCRS

Explicit protein/receptor flexibility has recently become an important feature of molecular docking simulations. Taking flexibility into account brings the docking simulation closer to the receptor's real behaviour in its natural environment. Several approaches have been developed to address this problem. Among them, modelling full flexibility as an ensemble of snapshots derived from a molecular dynamics (MD) simulation of the receptor has proved very promising. Despite its potential, however, only a few studies have employed this method to probe its effect in molecular docking simulations. Here we use ensembles of snapshots obtained from three different MD simulations, of the wild-type (WT) InhA enzyme from M. tuberculosis and of its I21V and I16T mutants, to model their explicit flexibility and to systematically explore its effect in docking simulations with three different InhA inhibitors, namely ethionamide (ETH), triclosan (TCL) and pentacyano(isoniazid)ferrate(II) (PIF). The use of fully-flexible receptor (FFR) models of WT InhA and of the I21V and I16T mutants in docking simulations with the inhibitors ETH, TCL and PIF revealed significant differences in the way they interact compared with the rigid InhA crystal structure (PDB ID: 1ENY). In the latter, only up to five receptor residues interact with the three different ligands. Conversely, in the FFR models this number grows to an astonishing 80 different residues. The comparison between the rigid crystal structure and the FFR models showed that the inclusion of explicit flexibility, despite the limitations of the FFR models employed in this study, accounts substantially for the induced fit expected when a protein/receptor and a ligand approach each other to interact in the most favorable manner.
Explicit protein/receptor flexibility, or FFR models, represented as an ensemble of MD simulation snapshots, can lead to a more realistic representation of the induced-fit effect expected in the encounter and proper docking of ligands to receptors. The FFR models of InhA explicitly characterize the overall movements of the amino acid residues in helices, strands, loops and turns, allowing the ligand to properly accommodate itself in the receptor's binding site. Using the intrinsic flexibility of MTB's InhA enzyme and its mutants in virtual screening via molecular docking simulation may provide a novel platform to guide the rational or dynamical-structure-based drug design of novel M. tuberculosis InhA inhibitors.
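The residue counts contrasted above (up to five for the rigid structure, up to 80 for the FFR models) correspond to a union of contacts taken over all docked snapshots. A minimal sketch of that bookkeeping follows; the function names, the 4 Å cutoff and the data layout are assumptions for illustration and do not describe the authors' analysis pipeline.

```python
import math

def contact_residues(ligand, receptor, cutoff=4.0):
    """Residue ids with at least one atom within `cutoff` angstroms of the ligand.

    ligand   : list of (x, y, z) ligand atom coordinates
    receptor : list of (residue_id, (x, y, z)) receptor atoms
    """
    return {res for res, xyz in receptor
            if any(math.dist(xyz, atom) <= cutoff for atom in ligand)}

def ensemble_contacts(poses, cutoff=4.0):
    """Union of contact residues over (ligand, receptor) pairs, one per snapshot."""
    seen = set()
    for ligand, receptor in poses:
        seen |= contact_residues(ligand, receptor, cutoff)
    return seen
```

A rigid-receptor run is the special case of a single (ligand, receptor) pair, so the same function gives both numbers being compared.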


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD03

ANALYSIS OF SOLVATION MODELS DURING THE THERMAL UNFOLDING OF PROTEINS.


Rocha G K1 , Custódio F L1 , Dardenne L E1
1 Laboratório Nacional de Computação Científica

The function of a protein is determined by its structure, and the ability to predict the native structure of a protein would help unleash the potential of the large amount of biological sequence information being generated by genome projects. Furthermore, determining the three-dimensional structure of a protein using first-principles computational methods is one of the most important problems in molecular biology. The inclusion and correct description of solvation effects is essential for the success of protein structure prediction methods. Despite recent progress, modeling protein-solvent interactions in computer simulations remains a challenge. Explicit modeling of the solvent, although very useful in molecular dynamics simulations, is not suitable for protein structure prediction because of the large amount of computer time required to survey a range of conformations. Alternatively, several implicit models have been proposed to model the solvation free energy of these systems. This work aimed to compare four implicit solvation models during the thermal unfolding of proteins, seeking a better understanding of the behavior of the solvation free energy during this process. The models analyzed were EAS, I-SOLV, EFF1 and the Generalized Born (GB) model. These models were implemented in the GAPF protein structure prediction suite (Genetic Algorithms for Protein Folding), developed in our research group, GMMSB/LNCC. The GB model was used as the reference, since it is an approximation to the Poisson-Boltzmann (PB) equation, which gives accurate solutions but at a prohibitive computational cost. To calculate the solvent-accessible surface area (SASA), required by the EAS and I-SOLV models, we used the POPS method, which is based on an empirical formula. The AMBER force field was used in the GB model. Twelve proteins, ranging from 20 to 187 residues, were analyzed during their unfolding.
Explicit-solvent NVT simulations of 1 ns were performed using the PME method and the GROMOS force field in the GROMACS simulation package. Each protein was submitted to the following protocol: (I) 50 ps at 300 K; (II) 450 ps of linear temperature increase up to 600 K; (III) 500 ps of linear temperature increase up to 850 K. Snapshots of the protein trajectory were saved every 5 ps (200 in total) and further analyzed using the different implicit solvation models. Our results showed that the solvation free energy decreased during unfolding, showing a close relationship with the SASA, except for the EAS model. The GB model, used as the reference, showed the highest correlation with the SASA; however, its computational cost is too high for protein structure prediction. The I-SOLV, EAS and EFF1 models are much less expensive than the GB model. The I-SOLV model, despite its simplicity, yielded good results, showing a high correlation with the GB model. The EFF1 model showed results similar to those of the I-SOLV model. The EAS model showed the worst correlations. These models will be further evaluated in the context of first-principles protein structure prediction using the GAPF program, seeking to increase their performance and predictive capacity. Key words: solvation models; unfolding; molecular dynamics; protein structure prediction
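The three-stage heating protocol above is a piecewise-linear temperature schedule, which the sketch below writes out explicitly. The function name and the choice of saving the first snapshot at 5 ps (rather than at 0 ps) are assumptions; the abstract states only the interval and the total of 200 snapshots.

```python
def temperature_at(t_ps):
    """Target temperature (K) at time t_ps for the unfolding protocol.

    (I)   0-50 ps     : constant 300 K
    (II)  50-500 ps   : linear increase from 300 K to 600 K
    (III) 500-1000 ps : linear increase from 600 K to 850 K
    """
    if not 0 <= t_ps <= 1000:
        raise ValueError("time must lie within the 1 ns simulation")
    if t_ps <= 50:
        return 300.0
    if t_ps <= 500:
        return 300.0 + (t_ps - 50) * (600.0 - 300.0) / 450.0
    return 600.0 + (t_ps - 500) * (850.0 - 600.0) / 500.0

# One snapshot every 5 ps over 1000 ps gives 200 frames (assuming none at t = 0).
snapshot_times = [5 * i for i in range(1, 201)]
snapshot_temperatures = [temperature_at(t) for t in snapshot_times]
```

Pairing each snapshot with its temperature makes it straightforward to plot the solvation free energy of each implicit model against temperature as well as against SASA.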


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD03

THE USE OF MOLECULAR DYNAMICS TO INFER GOOD AND BAD HLA LIGANDS
Rigo M M1 , Antunes D A1 , Flber C C1 , Chies J A B1 , Sinigaglia M1 , Vieira G F1
1 Department of Genetics - Universidade Federal do Rio Grande do Sul

The human organism is constantly challenged by pathogens and must defend itself but, at the same time, it must be able to differentiate self from non-self molecules. This difficult and complex task is carried out by a set of molecules and cells that together constitute the human immune system. One molecule of pivotal importance is the Major Histocompatibility Complex (MHC), called Human Leukocyte Antigen (HLA) in humans. These molecules are responsible for presenting small peptides from viruses, parasites, bacteria or even human proteins to T lymphocytes, triggering (or not) an immune response through interaction with the T Cell Receptor (TCR). It is known that the topology and electrostatic patterns of the peptide:MHC (pMHC) interacting region are important for TCR recognition, although the affinity and stability between the peptide (ligand) and the MHC (receptor) also play a crucial role. Bioinformatics tools can be used to analyze the molecular basis of peptide binding to the MHC. An important human MHC allele is the HLA-A*02:01 subtype which, according to recent data, is the most frequent allele worldwide. In the present work, our goal was to characterize the molecular basis for the binding of good and bad ligands in the context of the HLA-A*02:01 allele, through a molecular dynamics (MD) approach currently being assessed by our group. Five optimal ligands of nine amino acids (9-mers) each were obtained from the literature, while five matching bad ligands, with alternative anchor residues, were inferred in silico using immunoinformatics prediction tools (T Cell Epitope Prediction Tools from IEDB, NetMHC and SYFPEITHI). The pMHC complexes were constructed using a technique (D1-EM-D2) standardized by our group.
Molecular dynamics starting from the 3D peptide:HLA structure was performed for each good and bad ligand with the GROMACS v4.0.7 program, in a periodic box with the SPC water model and the GROMOS56a6 force field. Briefly, each complex was submitted to one energy minimization step using the steepest descent algorithm, one molecular dynamics step with position restraints on all heavy atoms, five molecular dynamics steps to gradually heat the system from 50 K to 300 K over 0.03 ns, and a 30 ns molecular dynamics run. The first results indicated that most complexes with bad ligands destabilized before those with good ligands, especially in the peptide-binding region. Besides partial detachment of the peptide from the HLA cleft, dissociation of the alpha-helix, an essential feature for TCR binding, was also observed before the end of the simulation in all complexes. However, alpha-helix dissociation occurred earlier for bad ligands than for good ones. RMSD and RMSF were also analyzed and were in agreement with our observations. These preliminary results point to the efficiency of MD in differentiating good from bad HLA peptide binders. Supported by: CNPq, CAPES and a grant from the Bill & Melinda Gates Foundation through the Grand Challenges Explorations Initiative.
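As an illustration of the RMSF analysis mentioned above, the per-atom root-mean-square fluctuation can be computed from a set of trajectory frames as sketched below. The function name and data layout are assumptions, the frames are assumed to be already superimposed on a common reference, and in practice the GROMACS analysis tools would normally be used instead.

```python
import math

def rmsf(frames):
    """Per-atom root-mean-square fluctuation about the mean position.

    frames : list of frames, each a list of (x, y, z) tuples, one per atom.
             Frames are assumed to be pre-aligned to a common reference.
    """
    n_frames = len(frames)
    n_atoms = len(frames[0])
    result = []
    for i in range(n_atoms):
        # Mean position of atom i over the trajectory.
        mean = [sum(frame[i][d] for frame in frames) / n_frames for d in range(3)]
        # Mean squared displacement from that mean position.
        msd = sum(math.dist(frame[i], mean) ** 2 for frame in frames) / n_frames
        result.append(math.sqrt(msd))
    return result
```

Higher RMSF values for peptide atoms would be consistent with the partial detachment from the HLA cleft described for the bad ligands.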


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD03

STRUCTURAL ANALYSIS OF THE ZINC FINGER PROTEIN SMZF1 FROM SCHISTOSOMA MANSONI
Bitar M1 , Drummond M G2 , Costa M G S1 , Lobo F P3 , Bisch P M1 , Pierce R4 , Franco G R2
1 Laboratório de Física Biológica - Universidade Federal do Rio de Janeiro
2 Laboratório de Genética Bioquímica - Universidade Federal de Minas Gerais
3 Laboratório de Imunologia e Genômica de Parasitos - Universidade Federal de Minas Gerais
4 Institut Pasteur de Lille - France

Zinc finger-containing proteins are found throughout eukaryotes and represent an important class of DNA-binding proteins frequently involved in transcriptional regulation. Structurally, zinc finger motifs are characterized by two antiparallel β-strands and one α-helix, stabilized by a zinc ion coordinated by conserved histidine and cysteine residues. In Schistosoma mansoni, the causative agent of schistosomiasis, regulatory proteins can modulate morphological and physiological changes, playing crucial roles in parasite development. A previously described C2H2 zinc finger-containing protein, named SmZF1, has been localized in different life cycle stages of S. mansoni and shown to activate gene transcription in a heterologous system. A recently identified inconsistency in the previously published SmZF1 coding sequence led to a reanalysis of the SmZF1 cDNA, as well as of the protein's primary and tertiary structures. A high-quality three-dimensional structure was generated using molecular modeling techniques, allowing a consistent assessment of protein domains. Molecular dynamics simulations were performed to evaluate the structural stability and native-like properties of SmZF1. Four C2H2 zinc finger motifs were characterized based on the protein's three-dimensional features and amino acid composition. A DNA-binding domain comprising three of the protein's zinc finger motifs was determined, and putative binding sites for protein interaction were identified through in silico approaches. Molecular docking calculations were carried out to assess protein-DNA interaction profiles. Taken together, these results provide a consistent basis for the structural characterization of SmZF1.
Further experimental analysis remains to be performed to fully establish protein function and structure.


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD06

ELECTROSTATIC REQUIREMENTS FOR PPAR GAMMA BINDING


Nascimento A S1
1 ABC Federal University

PPAR gamma is a ligand-regulated transcription factor that modulates the transcription of several genes involved in fat and sugar metabolism. Owing to its easy bacterial expression and crystallization, several crystal structures of holo-PPARgamma have been reported and deposited in the PDB. Here, we investigated the three-dimensional electrostatic properties of 55 PPAR gamma ligands and used this information to cluster them through principal component analysis. We found that, according to their electrostatic potential, these ligands can be separated into three groups with different binding features. We also observed that non-selective and selective ligands show different 3D electrostatic properties and fall into different clusters. The first cluster groups the ligands that place a negatively charged group close to Arm I of the binding pocket (close to helix H12), such as fatty acids and oxo- and nitro-fatty acids. The second cluster groups ligands that place a negatively charged group close to H3 and the β-hairpin, such as PLB, OCR, NZA and 241. The third cluster groups uncharged ligands, including the TZD ligand rosiglitazone. Through this analysis, we defined three electrostatically important sites within the PPARgamma binding pocket. The first seems to be relevant to both PPAR gamma and PPAR alpha, since most of the ligands in the first cluster are dual agonists. On the other hand, specific PPARgamma ligands tend to be polar but not charged. We propose that this analysis could be used to guide the development of new specific PPARgamma agonists. Supported by: FAPESP.


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD03

HOMOLOGY MODELING AND MOLECULAR DYNAMICS OF THE REGULATORY PROTEIN PHOP FROM CORYNEBACTERIUM PSEUDOTUBERCULOSIS
Moraes G1,2 , Costa M4 , Azevedo V3 , Silva A2 , Lameira J1,2 , Alves C1
1 Laboratório de Planejamento e Desenvolvimento de Fármacos, Instituto de Ciências Exatas e Naturais, Universidade Federal do Pará, CP 11101, 66075-110, Belém, PA, Brazil
2 Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, PA, Brazil
3 Laboratório de Genética Celular e Molecular, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, 31270-090, Belo Horizonte, MG, Brazil
4 Núcleo de Genômica e Bioinformática, Universidade Estadual do Ceará, 60740-000, Fortaleza, CE, Brazil

The animal pathogen Corynebacterium pseudotuberculosis is a facultative intracellular Gram-positive bacterium of the class Actinobacteria, which causes caseous lymphadenitis (CLA) in sheep, goats and occasionally other species. Economic losses due to CLA are caused by decreased production of wool, meat and milk, reproductive disorders, premature culling, and condemnation of carcasses and skins in abattoirs. This bacterium has a PhoPR two-component system that consists of a sensor histidine kinase (PhoR) and an effector response regulator (PhoP). This system is important because it is involved in the regulation of proteins with various functions, including virulence. As a cytoplasmic regulatory protein, PhoP is important because it also activates several virulence determinants. The N-terminal region of this protein (the receiver domain) contains the active site, where the phosphorylation reaction occurs, which activates the protein in an event that depends on magnesium ion (Mg²⁺) binding. In the work reported here, we determined the 3D structure of the response regulator protein (PhoP) of C. pseudotuberculosis by homology modeling. The interaction energy between the Mg²⁺ ion and the residues of the PhoP structure after 20 ns of molecular dynamics (MD) was investigated by Quantum Mechanics/Molecular Mechanics (QM/MM).
The per-residue interaction energy indicates attractive interactions of residues Asp16, Asp59, Asp66 and Met61 with magnesium. Furthermore, two water molecules interact with the ion, stabilizing the complex in an octahedral geometry with the magnesium atom at the geometric center. The same conformation is observed in the homologous PhoB (PDB 2IYN). Asp16 (-1536.2 kJ/mol) and Asp59 (-1230.4 kJ/mol) are the residues with the strongest attractive interactions. The residue with the strongest repulsive contribution is Lys109. The model generated provides the first full-length structural model of PhoP. It was found that the PhoP protein is highly flexible, which caused significant conformational changes during MD, but the presence of magnesium stabilizes the receptor cavity, reducing the deviation of residues important for protein activation. The results can serve as a basis for pursuing attenuation through genetic inactivation of PhoP. Supported by: CAPES, CNPq.

Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD06

RNA SECONDARY STRUCTURE SEARCH BASED ON RFAM DATA


Paschoal A R1 , Simões Z L P2 , Durham A M1
1 Bioinformatics Program, Institute of Mathematics and Statistics, University of São Paulo - USP, Matão Street, 1010, SP, Brazil
2 School of Philosophy, Science and Literature, USP, Ribeirão Preto; School of Medicine, USP, Ribeirão Preto, São Paulo, SP, Brazil

The annotation of non-coding RNA (ncRNA) is a hard and complex process. Much of this complexity comes from the diversity and variability in type, shape and size of RNAs. Unlike proteins, ncRNAs lack signals such as domain regions, which aid the annotation process. Moreover, compensatory changes in base pairs preserve secondary structure but not necessarily the primary sequence, rendering the exclusive use of similarity search insufficient for ncRNA annotation. To help fill this gap, we conducted a study of structural search that does not require sequence similarity. We based this research on a dataset obtained from the RNA families database (Rfam). We hypothesized that if the structural search method worked with this dataset, it could be extended to putative candidates lacking functional annotation. To test this hypothesis, we collected a set of accurate data from Rfam, based on the parameters described in a CONTRAFOLD paper. From this dataset, we selected 19 families with 758 sequences. Our approach was divided into two steps: (i) predicting the secondary structure of each sequence in each family; and (ii) comparing those secondary structures between families. To make and test the secondary structure predictions, we used different tools and strategies. The tools applied were RNAfold, CONTRAFOLD, CentroidFold, Sfold and RNAShapes. The reason for such diversity is that we believe that applying different predictive tools may help establish the best methodology for structural search. To compare the secondary structures of all families, we followed two sub-steps: (a) we computed the distance between the structures in each family; and (b) we applied a threshold to them, in order to define which family each one belonged to. Step (a) was performed using the RNAdistance software, which provides a dissimilarity score between structures. Step (b) was performed using two different metrics: a centroid for each family and the cluster density of structure ensembles. Establishing the centroid of each family supported the family comparison. The same idea was applied to cluster density, now based on the ensemble of structures. Although we were able to compare the families, the comparison of individual secondary structures between families showed a high number of false positives. Most of the results revealed two limitations: (i) variability in the predicted secondary structure, regardless of the software chosen; and (ii) incoherent results for distinct secondary structures with regard to RNA distance. Finally, two recently published works confirm our findings, and one of them discussed some discriminatory problems in the models of the Rfam database. Thus, variability in structure represents a major open problem for structural RNA search.
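
The centroid-and-threshold idea of sub-steps (a) and (b) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: a simple base-pair distance on dot-bracket strings stands in for the RNAdistance score, the centroid is taken to be the family medoid, and the family names, structures and thresholds are invented toy values.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) base pairs encoded in a dot-bracket string."""
    stack, pairs = [], set()
    for pos, char in enumerate(dot_bracket):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def bp_distance(a, b):
    """Base-pair distance: number of pairs found in only one of the two structures.

    Assumption: used here as a stand-in for the RNAdistance dissimilarity score.
    """
    return len(base_pairs(a) ^ base_pairs(b))


def medoid(structures):
    """Family 'centroid': the member with the smallest summed distance to the others."""
    return min(structures, key=lambda s: sum(bp_distance(s, t) for t in structures))


def assign(query, families, threshold):
    """Assign the query to the family with the nearest medoid, or None if too far."""
    best = min(families, key=lambda name: bp_distance(query, medoid(families[name])))
    distance = bp_distance(query, medoid(families[best]))
    return best if distance <= threshold else None


# Hypothetical toy families (not Rfam data).
families = {
    "hairpin": ["((((....))))", "(((......)))", ".(((....)))."],
    "two_stem": ["((..))((..))", "((..)).((.))", "((..))(....)"],
}

print(assign(".((......)).", families, threshold=3))
print(assign("............", families, threshold=2))
```

A stricter threshold reduces false assignments at the cost of leaving more structures unassigned, which is the trade-off behind the false-positive rate discussed in the abstract.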

Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD04

THEORETICAL STUDIES OF THE NAGZ ENZYME COMPLEXED WITH PUGNAC DERIVATIVES


Nascimento S B1 , Alencar N A N D1 , Sousa P R M D1 , Pinheiro S D S1,2 , Silva N D F1 , Costa K M1 , Moreira E C D O2 , Gordo S M C2 , Silva J L1,2 , Alves C N1
1 Laboratório de Planejamento e Desenvolvimento de Fármacos, Instituto de Ciências Exatas e Naturais
2 Instituto de Ciências Biológicas, UNIVERSIDADE FEDERAL DO PARÁ, CP 11101, 66075-110, Belém, PA, Brazil

N-acetyl-β-glucosaminidase, known as NagZ, is a member of glycoside hydrolase family 3 (GH3). This enzyme is found in the cytoplasm and the cell wall of Fusarium solani. This fungus causes root rot disease in black pepper (Piper nigrum L.). Thus, NagZ has received attention as a new target for selective inhibitors aimed at controlling fungal diseases in plants. On the other hand, the structural determination of GH3 by X-ray techniques has not been possible in fungi. Homology modeling is a technique used when the structure of a protein cannot be determined by X-ray techniques. One of the important facts that allows the use of this technique is that proteins are grouped into a limited number of families. In this study, we employed homology modeling and docking in order to explain differences in inhibitory potency. In addition, we investigated the binding of PUGNAc derivatives in complex with NagZ by computational modeling studies. The three-dimensional structure of NagZ from F. solani (GH3NHa) was determined by homology modeling, using as template PDB 2X42 from Thermotoga neapolitana (TnBgl3B). The model was built with Modeller9v7 and the docking simulation was performed with the AutoDock Vina program. The primary amino acid sequence of GH3NHa shows 42% identity with 2X42. After building, the model was validated using the Ramachandran plot, with 85.1% of amino acid residues within the most favorable regions, and a Root Mean Square Deviation (RMSD) of 0.49 Å. The active site remained conserved when compared with 2X42, with the key catalytic residues D53, H64, R130, K163, Y210, W243 and E458. The docking results show that incremental changes to the N-acyl chain of the parent compound decrease the affinity energy between the series of PUGNAc derivatives and NagZ, with a good linear correlation (R2 = 0.76). Finally, the model obtained by homology modeling seems to be a good model for generating selective inhibitors of NagZ, and further clear comparisons of this enzyme at a structural level would greatly accelerate these efforts at attenuating pathogenesis in plants. Supported by: CAPES, PARD-UFPA, LPDF.

Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD04

NEW DIRECTIONS TO THE PROTEIN SECONDARY STRUCTURE PREDICTION PROBLEM


Melo J C B D1 , Cavalcanti G D D C2 , Cordeiro G C B2
1 Statistics and Informatics Department - Federal Rural University of Pernambuco
2 Center of Informatics - Federal University of Pernambuco

A recent proposal for the protein secondary structure prediction problem, the GMC predictor, achieved significant results using the established databases RS126 and CB396 for training and testing neural networks with a simple architecture, using the raw data and introducing a preprocessing phase that compresses the input data through one of two methods: principal component analysis (PCA) or independent component analysis (ICA). For all measures used, the GMC predictor showed excellent performance for helices. This was not observed for the Strand class, however, motivating new research directions to address the problem. We observed that multi-layered feed-forward neural networks present some drawbacks when applied to problems like this, because of the complexity of the training. Our proposal is to use one-class classifiers (OCC), in which only the genuine patterns are used to train the system. OCC has advantages: it is robust to outliers and it deals properly with unbalanced data sets, since each class is treated separately.

Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD04

HOW TO FIND A PARABOLIC COMPETITIVE INHIBITION SITE OF RAT TISSUE KALLIKREIN BY APROTININ - SECOND STEP: VALIDATE AND REFINE A MODEL
Oliveira A B N1 , Alves N R1 , Monteiro M1 , Veloso W N P2 , Xavier L P3 , Santoro M M1
1 Universidade Federal de Minas Gerais
2 Instituto Educacional Santo Agostinho
3 Universidade Federal do Pará

Kallikreins are enzymes constituting a subgroup of the serine protease family known to have several physiological functions. A paper by Diniz, Carolina M. et al. indicated that the inhibition of rat tissue kallikrein (rK1) (3.15 nM) by aprotinin (10.4-34.6 nM), in the hydrolysis of D-valyl-L-leucyl-L-arginine 4-nitroanilide (120-640 µM), is a parabolic competitive inhibition, with two inhibitor molecules binding to one enzyme molecule. Parabolic competitive inhibition was also reported for the inhibition of human tissue kallikrein (hK1) by aprotinin. In this kind of inhibition, the first binding site for aprotinin is the kallikrein active site, but the second binding site is unknown; in this study, we try to find the second binding site for aprotinin in rK1. We built models of rK1, since its structure is not known. With such structures, we intend to find the second binding site by appropriate means (molecular dynamics, docking, or other). We used the ModWeb server and the Modeller software as tools for comparative modeling. They returned models based on structures in the PDB, and we chose three models based on the following structures: rat submaxillary gland serine protease, tonin (1TON), with 77% identity; a complex of mouse nerve growth factor with four binding proteins (1SGF, chain G), with 72% identity; and mouse glandular kallikrein-13 (1AO5), with 66% identity. Since tonin (1TON) shows the highest degree of similarity, we selected it and the model derived from it for the second step of validation and refinement. For validation, the model and the template protein were analyzed with PROCHECK. To compare the structures, we performed their structural alignment and adjusted some loops for greater structural similarity between them before energy minimization by Molecular Dynamics (MD). MD is used to refine a model, namely to include solution aspects and flexibility. Here, we finish the construction of this model, and we will study the interaction of one aprotinin molecule docked at the active site. Subsequently, we will study the interaction of the second aprotinin molecule and find the second binding site for aprotinin in rK1.

Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD05

AN ARTIFICIAL NEURAL NETWORK BASED METHOD FOR THE PREDICTION OF APPROXIMATED 3-D STRUCTURES OF MINI-GLOBULAR PROTEINS
Dorn M1 , Buriol L S1 , Lamb L D C1
1 Universidade Federal do Rio Grande do Sul

One of the main research problems in Structural Bioinformatics is the prediction of three-dimensional (3-D) protein structures. Proteins are long sequences of 20 different amino acid residues that, under physiological conditions, adopt a unique 3-D structure. This structure determines the function of the protein in the cell: structural functions, catalysis of chemical reactions, transport and storage, regulation, gene transcription control, and recognition (antibodies and other immune system proteins). Knowledge of the protein structure allows biological processes to be investigated more directly, with higher resolution and finer detail. The determination of protein structures is both experimentally expensive (due to the costs associated with crystallography, electron microscopy or nuclear magnetic resonance (NMR)) and time consuming. This might explain why scientists currently depend largely on computational methods that can predict the correct 3-D protein structure only from extended and full amino acid sequences. Over the last 20 years several computational methodologies and algorithms have been proposed as solutions to the protein structure prediction (PSP) problem. Floudas et al. divide these computational strategies into four classes: (1) first principle methods without database information, (2) first principle methods with database information, (3) Fold Recognition (FR) methods and (4) Comparative Modeling (CM) methods. Predicting the correct 3-D structure of a protein molecule is an intricate and often arduous task. The PSP and Protein Folding (PF) problems are classified in computational complexity theory as NP-complete problems. Given the computational complexity of the PSP problem, current 3-D protein structure prediction methods make use of a wide range of optimization algorithms. Meta-heuristics are used in order to provide a near-optimal solution. In addition, considering the limitations of current prediction methods (the four classes of protein structure prediction methods), researchers have recently developed hybrid methods that combine principles of these four classes. In this article we present a statistical, fragment-based method to acquire structural information from small protein samples. Structural data obtained from protein templates were used to train an artificial neural network, and a sequence-structure mapping function is used to build the approximated 3-D structure of the target amino acid sequence. We tested the developed method on four mini-globular proteins whose sizes vary from 19 to 34 amino acid residues. The RMSD and Ramachandran plot values show that the approximated structures adopt a fold similar to the experimental structure. These approximated structures can be refined through molecular mechanics methods such as Molecular Dynamics (MD) simulation. Approximated 3-D structures of proteins can greatly reduce the conformational search space, so that ab initio methods demand much less computational time to achieve native-like structures.
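
The RMSD used to compare an approximated structure with the experimental one can be computed as in the short Python sketch below. It assumes that the two coordinate lists contain the same atoms in the same order and have already been superimposed; the coordinates are invented values for illustration, not data from this work.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally ordered sets of 3-D points.

    Assumption: the structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical C-alpha coordinates in angstroms.
experimental = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.5, 0.0, -1.0), (3.0, 1.0, 0.0), (4.5, -1.0, 0.0)]

print(rmsd(experimental, predicted))
```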

Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD03

STRUCTURE-BASED FUNCTIONAL INFERENCE OF HYPOTHETICAL PROTEINS FROM MYCOPLASMA HYOPNEUMONIAE


Fonseca M M B D1 , Zaha A3 , Caffarena E R4 , Vasconcelos A T R2
1 Universidade Federal do Rio de Janeiro
2 Laboratório Nacional de Computação Científica
3 Universidade Federal do Rio Grande do Sul
4 Fundação Oswaldo Cruz

Enzootic pneumonia caused by Mycoplasma hyopneumoniae is a major constraint to efficient pork production throughout the world. This pathogen has a small genome with 716 CDS, of which almost 42% are annotated as hypothetical or conserved hypothetical proteins. There are now more than 1200 complete genomes, and about half of all predicted protein-coding genes have no inferable function. Alternative methodologies, such as molecular modeling (MM) and molecular dynamics (MD), are used to fill this gap. They can be used to answer specific questions about protein properties, often more readily than experiments. We analyzed by MM eight protein structures (MHP0263, MHP0618, MHP0500, MHP0402, MHP0285, MHP0674, MHP0167, MHP0684 - GeneSul numbering scheme) with no assigned function from M. hyopneumoniae 7448. One of them was selected for MD analysis. The sequences with the best function-prediction threading scores (unpublished results) were selected to have their 3D structures built. Sequences were retrieved from GeneSul and locally aligned with their templates using MAFFT and manual editing. The alignment served as input to Modeller (v. 9v7), whose outputs were assessed by structure evaluation software. The final MHP0263 model was submitted to MD simulations. First, the MD simulations consisted of two 1000-step minimizations, with and without constraints, and an equilibration process of 500 ps. Equilibrated structures were used as starting points for 34 ns trajectories, performed at NTP. The force field and water model used were GROMOS and SPC, respectively. The electrostatic surface potential distribution of MHP0263 shows a negatively charged region believed to mimic the surface of DNA, where DNA-binding proteins would bind. On the other side, positively charged amino acids, particularly lysines, and a large number of glutamines, amino acids known to bind DNA, were found. The dimer interface consists of a leucine zipper, characteristic of eukaryotic DNA-binding proteins. This region stays unchanged for 30 ns of MD but influences regions beyond the zipper, which form a Y-shaped conformation that grips the DNA. The 3-D structure of MHP0618 revealed the presence of a large positively charged region, a typical feature of nucleic acid-binding proteins. M. hyopneumoniae 7448 has only one protein assigned to the synthesis of NAD. The 3-D structures of MHP0500 and MHP0402 predict roles in this metabolic pathway, completing the synthesis of this compound. The MHP0285 structure has features found in proteins implicated in FAD synthesis. As with MHP0500 and MHP0402, the annotation of MHP0285 makes a great contribution to further studies concerning the metabolism of this bacterium. The MHP0674 3-D structure resembles a sigma factor with extracytoplasmic function, whose members control expression based on cellular or environmental signals, an important characteristic for a pathogenic bacterium like M. hyopneumoniae 7448, in which only one sigma factor is annotated. Also predicted to participate in transcription, MHP0167 has structural features of NusB, a transcription antitermination factor. This process requires proteins such as NusA and NusG, which are annotated in M. hyopneumoniae. The structure of MHP0684 was built partially. Only the C-terminal region showed homology with YrdC, a putative ribosome maturation factor. The yrdC family codes for proteins that occur independently or as a domain in other proteins, which might be the case here.

Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD06

USING INTERACTION FINGERPRINTS AND POCKETS INFORMATION TO CONSTRUCT A BINDING SITE MODEL OF PROTEINS FOR DOCKING STUDIES
Martins-José A1 , Lopes J C D1
1 NEQUIM - Núcleo de Estudos de Quimioinformática - Departamento de Química - Universidade Federal de Minas Gerais

Interaction fingerprints are binary vectors that represent the binding modes and intermolecular interactions between ligands and amino acid residues of a protein. A protein can bind different ligands, which may produce similar but distinct fingerprints. These fingerprints can be used to construct a model representing the most common interactions in the set. At the same time, all proteins have pockets and cavities on their surfaces that are subject to dynamic motions, and each crystallographic structure differs from the others, even when collected under the same conditions. The pockets and cavities of a protein can be represented as binary vectors in the same format as the interaction fingerprints. The analysis of the pocket fingerprints and the interaction fingerprints can be used to determine the binding site of proteins, even when there is no ligand bound to the protein. In this work the binding site model represents the residues in the pocket or pockets to which the ligands are bound in the crystallographic structures. Because many structures are used, the residue lists are more trustworthy than residue lists generated from only one structure. One advantage is the possibility of transposing the residue lists to crystallographic structures without a ligand. Another advantage is that the residue list takes into account the dynamic aspects of the protein structures, thus incorporating the motions of the proteins in the analysis. When many 3D structures are available for one protein target, usually just one is selected for the docking studies. Our strategy has been used in molecular docking calculations to select PDB files that represent different conformations of the active site, discarding those that will not introduce new information about the structure. Thus, the dynamics of protein structures are incorporated in the simulation at a low computational cost. The use of target mobility through multiple 3D structures of the target makes our docking simulations perform better. The choice of the set of 3D structures is an important step in theoretical studies. The interaction fingerprints and the pocket information can be used to make the best choice and get better results. Supported by: CNPq
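
As an illustration of how binary fingerprints can yield a consensus binding-site model and a non-redundant set of structures, consider the Python sketch below. It is not the authors' implementation: the 50% consensus rule, the Tanimoto similarity, the 0.8 redundancy threshold, the greedy selection and the structure names are all assumptions chosen for the example.

```python
def consensus(fingerprints, min_fraction=0.5):
    """Keep a bit (residue/interaction) if it is set in at least min_fraction of vectors."""
    n = len(fingerprints)
    return [1 if sum(column) / n >= min_fraction else 0 for column in zip(*fingerprints)]


def tanimoto(a, b):
    """Tanimoto similarity between two binary vectors of equal length."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 1.0


def select_diverse(fingerprints, max_similarity=0.8):
    """Greedy selection: skip structures too similar to one already selected."""
    selected = []
    for name, vector in fingerprints.items():
        if all(tanimoto(vector, fingerprints[s]) < max_similarity for s in selected):
            selected.append(name)
    return selected


# Hypothetical fingerprints: one bit per binding-site residue, one vector per structure.
fingerprints = {
    "pdbA": [1, 1, 0, 1, 0, 0],
    "pdbB": [1, 1, 0, 1, 0, 0],  # same binding mode as pdbA, adds no information
    "pdbC": [1, 0, 1, 1, 0, 1],
    "pdbD": [0, 1, 0, 1, 1, 0],
}

print(consensus(list(fingerprints.values())))
print(select_diverse(fingerprints))
```

Pocket fingerprints in the same bit layout could be passed through the same functions, which is what makes a shared binary format convenient.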

Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD03

MOLECULAR MODELING OF THIOESTER-CONTAINING PROTEIN (TEP) IN BIVALVE MOLLUSK (MUSSEL)


Gordo S M D C1,2,3 , Sampaio M I C1,2,3 , Moreira E C D O1,2,3 , Nascimento S B1,2,4 , Silva J L1,4
1 Universidade Federal do Pará
2 Programa de Pós-Graduação em Genética e Biologia Molecular
3 Laboratório de Genética e Biologia Molecular - Campus de Bragança
4 Laboratório de Planejamento e Desenvolvimento de Fármacos

Thioester-containing proteins (TEPs) are characterized by a single intra-chain β-cysteinyl-γ-glutamyl thioester and play an important role in the innate immune response. This protein was detected in the hepatopancreas and gonads of bivalves such as Pecten maximus (scallop). In vertebrates, TEPs play a significant role in the immune response because they are components of factors C3, C4 and C5 of the complement system or may act as protease inhibitors. TEPs are highly reactive and are hydrolysed by water. TEPs act as mediators in binding to the antigen of the pathogen. Therefore, defining the three-dimensional structure by molecular modeling is indispensable for determining parts of the mechanism of infection at the molecular level. The nucleotide sequence for the modeling was obtained from a cDNA library generated from the mussel M. guyanensis, which yielded a transcript with 79% homology to the TEP of Chlamys farreri (scallop). Homology modeling was used to determine the three-dimensional structure of the TEP of M. guyanensis. The primary amino acid sequence of the M. guyanensis protein shows 43% identity with the template 1HZF, obtained from the PDB database with the BLAST program and aligned with Modeller9v8. This template was selected based on its resolution of 2.3 Å. The model was built and validated through the Ramachandran plot, with 90.7% of amino acid residues in the most favorable regions. The model can be considered of good quality, although it should be refined using computational methods of molecular dynamics.

Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD06

SECONDARY-STRUCTURE OVERLAPPING GROUP (SOG) COMBINES HOMOLOGOUS GROUPS WITH SAME ONTOLOGY
Coelho-Jr O1 , Ortega J M1
1 Laboratório de Biodados - ICB - Universidade Federal de Minas Gerais

The arrangement of secondary structural elements such as alpha helices and beta strands is often used to select templates for structural modeling and also to cluster structural neighbors. To examine the biological meaning of the secondary structure arrangement, we investigated secondary structure overlap using the SOV metric. First, we downloaded the complete proteome of Escherichia coli (Txid 316385) and submitted it to secondary structure prediction using SSPro4, one of the most accurate programs for this task. Subsequently, we compared the entire proteome in a pairwise way, obtaining a normal curve with a mean overlap of 61.5% and a frequency tending to zero as SOV surpasses a 90% overlap threshold. Our first approach was to investigate the distribution of SOV values when comparing only groups of homologous genes obtained from the KEGG Orthology (KO) database. All E. coli KO-containing clusters were investigated. The results appear to suggest that a secondary-structure overlapping group (SOG) clusters homologous genes with a restriction to clade, being more accurate as the clade is more evolutionarily restrained. Thus, when a single proteome such as that of E. coli is examined, SOGs should tend to cluster paralogs. To investigate this, we built hierarchical clusters varying the secondary-structure overlapping cutoff from 80% up to 95%. We observed that SOG80, SOG85, SOG90 and SOG95 grouped members from a single KO cluster with fractions of, respectively, 0.4, 0.47, 0.49 and 0.60, and from two different KO clusters with lower fractions of, respectively, 0.36, 0.36, 0.37 and 0.30. However, when analyzing the large SOG80 clusters, we observed that they had grouped members from different KOs but with the same Brite classification (a KEGG equivalent of Gene Ontology). Indeed, SOG80, SOG85, SOG90 and SOG95 grouped members with a unique Brite designation with fractions of, respectively, 0.78, 0.85, 0.86 and 0.88. In conclusion, the SOG metric reported here is able to associate proteins according to their functional classification.
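
The cutoff-based grouping and the "fraction of groups with a single KO" can be illustrated with the Python sketch below. It is a simplified illustration only: single-linkage grouping via union-find is an assumption (the abstract does not state the linkage used), singleton groups are ignored when computing the fraction, and the overlap scores and KO labels are invented.

```python
def cluster_by_cutoff(items, overlap, cutoff):
    """Single-linkage grouping: join two proteins when their overlap reaches the cutoff."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in overlap.items():
        if score >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return list(groups.values())


def single_label_fraction(clusters, label):
    """Fraction of multi-member clusters whose members all carry the same label."""
    multi = [c for c in clusters if len(c) > 1]
    if not multi:
        return 0.0
    pure = sum(1 for c in multi if len({label[m] for m in c}) == 1)
    return pure / len(multi)


# Hypothetical pairwise overlap percentages and KO labels.
proteins = ["p1", "p2", "p3", "p4", "p5"]
overlap = {("p1", "p2"): 92, ("p2", "p3"): 83, ("p4", "p5"): 96, ("p1", "p4"): 60}
ko = {"p1": "K1", "p2": "K1", "p3": "K2", "p4": "K3", "p5": "K3"}

for cutoff in (80, 90):
    clusters = cluster_by_cutoff(proteins, overlap, cutoff)
    print(cutoff, single_label_fraction(clusters, ko))
```

In this toy data, raising the cutoff splits the mixed group and increases the single-KO fraction, the same direction of change as reported for SOG80 through SOG95.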

Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD03

CUTOFF SCANNING MATRIX (CSM): FUNCTION PREDICTION AND FOLD RECOGNITION BY PROTEIN INTER-RESIDUE DISTANCE PATTERNS
Pires D E V1,2 , Melo-Minardi R C D2 , Santos M A D2 , Silveira C H D3 , Santoro M M1 , Meira-Jr W2
1 Department of Biochemistry and Immunology - Universidade Federal de Minas Gerais
2 Department of Computer Science - Universidade Federal de Minas Gerais
3 Advanced Campus at Itabira - Universidade Federal de Itajubá

Background: The unforgiving pace of growth of biological data production and storage in public repositories has been generating an increasing demand for efficient and scalable paradigms, models and methodologies for automatic annotation. In this paper we present a novel structure-based protein function prediction and fold recognition method: Cutoff Scanning Matrix (CSM). CSM generates feature vectors that represent distance patterns between the protein residues. These feature vectors are then used as classification evidence. Singular Value Decomposition (SVD) is used as a preprocessing step to reduce dimensionality and noise. A series of experiments was performed on datasets of mechanistically different enzyme superfamilies and others derived from SCOP release 1.75.

Results: CSM was able to achieve a precision of up to 99% after SVD preprocessing for a database derived from manually curated protein superfamilies classified in terms of families with similar functions. Moreover, we carried out experiments that aimed to verify our ability to assign SCOP class, superfamily, family and fold to protein domains. An experiment using the whole set of domains found in the latest SCOP version was performed, obtaining very high levels of precision and recall (up to 95%). Finally, we compared our fold recognition results with a previous study, in order to put this work in the context of the literature. Our method significantly surpassed the recall of the previous study while preserving a compatible precision level.

Conclusions: We show that the patterns derived from CSMs can effectively be used to predict protein function and thus help with function annotation. We also demonstrate that our method is effective in fold recognition tasks. These facts reinforce the idea that the pattern of inter-residue distances is an important component of family structural signatures. Furthermore, the Singular Value Decomposition provided a consistent increase in precision and recall, which makes it an important preprocessing step when dealing with noisy data. This work was supported by the Brazilian agencies: CAPES, CNPq, FAPEMIG and FINEP.
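
One way to picture a feature vector of inter-residue distance patterns is sketched below in Python. The abstract does not spell out how the vector is built, so the scheme shown here is an assumption suggested by the method's name: for each distance cutoff in a scan, count the residue pairs whose representative atoms lie within that cutoff. The coordinates and cutoffs are invented, and the SVD step is omitted.

```python
import math
from itertools import combinations


def csm_features(coords, cutoffs):
    """For each cutoff, count residue pairs at a distance less than or equal to it.

    Assumption: each residue is reduced to one representative point (e.g. C-alpha).
    """
    distances = [math.dist(a, b) for a, b in combinations(coords, 2)]
    return [sum(1 for d in distances if d <= cutoff) for cutoff in cutoffs]


# Hypothetical residue coordinates in angstroms (corners of a 3 x 4 rectangle).
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (3.0, 4.0, 0.0)]
cutoffs = [2.0, 3.0, 4.0, 5.0]

print(csm_features(coords, cutoffs))
```

One such vector per protein, stacked row by row, gives a matrix to which a dimensionality reduction such as SVD could then be applied before classification.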

Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD05

THE MOLECULAR DOCKING STUDY OF TRYPANOTHIONE REDUCTASE USING VINA PROGRAM


Rocha J A P D1 , Medeiros I G D1 , Ramos F C2 , Silva N D F2 , Nascimento J P2 , Molfetta F A D1 , Alves C N2
1 Laboratório de Modelagem Molecular, Instituto de Ciências Exatas e Naturais, UNIVERSIDADE FEDERAL DO PARÁ, CP 11101, 66075-110, Belém, PA, Brazil
2 Laboratório de Planejamento e Desenvolvimento de Fármacos, Instituto de Ciências Exatas e Naturais, UNIVERSIDADE FEDERAL DO PARÁ, CP 11101, 66075-110, Belém, PA, Brazil

Chagas disease, or American trypanosomiasis, is an endemic problem in Latin America. It is considered the sixth largest neglected tropical disease worldwide, and current data estimate that between 18 and 21 million people are infected every year. The disease is caused by the flagellated protozoan Trypanosoma cruzi (T. cruzi). There have been many efforts to control this disease, but until now therapy has been based on only two drugs, nifurtimox and benznidazole. The enzyme Trypanothione Reductase (TR) is a validated target and can be used to design new inhibitors for Chagas disease treatment. The enzyme is NADPH-dependent and catalyzes the reduction of trypanothione disulfide T(S)2 to trypanothione dithiol T(SH)2, triggering a series of events responsible for neutralizing reactive oxygen species. In addition, TR maintains a reducing environment inside the parasite, preventing oxidative stress. In this work a docking study was performed with the objective of verifying the interactions of different compounds with reported inhibition constants against the TR enzyme. The protein structure used in the docking study was obtained from the Protein Data Bank (1AOG); its resolution is 2.30 Å. The Chimera program was used to remove the ligand and all water molecules from the protein structure. The Vina program, version 1.1, was used for the docking study. We retrieved 61 structures with inhibition constants (Ki values) from the literature, and these molecules were docked into the TR enzyme. Of the 61 structures studied, only 3 were selected. These three structures presented the best docking energies and interactions with amino acid residues important for the activity of the enzyme. Structures 25, 31 and 45 were selected as the best candidates because they presented the best interactions in the active site. In addition, structure 31 presented the best docking energy and interacts with the amino acid residues Thr335, Cys53, Ala365, Met333, Ser15, Gly16 and Gly17. These interactions are important and could be explored to design selective inhibitors of TR. Thus, since TR is essential in the parasite life cycle, the results indicate that virtual screening with the AutoDock Vina program can be employed to study TR inhibitors and to design new drugs against Chagas disease.


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD04

PREDICTING PROTEIN FUNCTION BASED ON 3D RESIDUE MOTIFS HOMOLOGY


Fonseca L A C1 , Minardi R C D M1 , Jr W M1
1 Universidade Federal de Minas Gerais

In this work we investigate whether it is possible to predict protein function based on 3D residue motif homology. We also want to identify which models and algorithms are most appropriate to achieve this goal. Related studies are based on graph modeling and subgraph isomorphism, geometric hashing and constraint-based 3D template matching. We propose a novel approach to protein 3D motif search based on linear programming, together with a comparison against other paradigms already used for this purpose. Linear programming is a technique for optimizing a linear objective function subject to a list of requirements represented as linear equations or inequalities. We model the residues of the 3D motif query as points represented by the last heavy atoms (LHA) of the side chains and create a clique in which each edge is labeled by the distance between its adjacent nodes. We want to match the edges of the query graph to those of the search space in order to:

$$\min \sum_{i=1}^{n} \sum_{j=1}^{m} d_{ij}\, x_{ij}$$

where n is the number of amino acid residues in the query motif and m is the number of residues in the search space, i.e., the residues of the protein in which we are searching for the query. The $x_{ij}$ are binary variables that encode possible matches between edge i of the query graph and edge j of the search space graph. We optimize this objective subject to the following constraints:

$$\sum_{j=1}^{m} x_{ij} = 1, \quad i = 1, \dots, n \qquad (1)$$

$$\sum_{i=1}^{n} x_{ij} \le 1, \quad j = 1, \dots, m \qquad (2)$$

which means that (1) every edge of the query graph must be matched with an edge of the search space graph and (2) each edge of the search space graph must be matched to at most one edge of the query graph. We selected 2000 random PDB files, separated their chains, and used five different families from SCOP (Concanavalin-A, Cupredoxin, Cysteine proteases, Trypsins and Zinc-finger proteins).
We verified that we are able to retrieve the members of each family from the whole database by analyzing the area under the ROC curve, which is commonly used to measure the performance of a retrieval system. Any score used to compare biological entities must be considered alongside the likelihood that it could have been observed by chance (i.e., its statistical significance). It needs to be able to separate true similarities from noise and to allow different searches to be compared. As future work, we are going to implement metrics to guarantee the statistical significance of the patterns. Supported by: CAPES, CNPq, FAPEMIG, FINEP.
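
The matching objective and constraints (1) and (2) can be illustrated on a toy instance. The following is a minimal sketch, not the authors' implementation: instead of a linear programming solver it enumerates every admissible assignment by brute force, which is feasible only for very small inputs. The cost definition (absolute difference between edge lengths), the function names and the edge lengths are assumptions made for illustration.

```python
from itertools import permutations

def edge_cost(query_len, target_len):
    # Assumed cost d_ij: absolute difference between two edge lengths.
    return abs(query_len - target_len)

def best_assignment(query_edges, target_edges):
    """Match every query edge to a distinct target edge at minimum total cost.

    Mirrors constraints (1) and (2): each query edge is matched exactly once
    and each target edge at most once. Brute force, for illustration only.
    """
    n, m = len(query_edges), len(target_edges)
    if n > m:
        raise ValueError("need at least as many target edges as query edges")
    best_cost, best_match = float("inf"), None
    # permutations(range(m), n) yields every injective map query -> target.
    for match in permutations(range(m), n):
        cost = sum(edge_cost(query_edges[i], target_edges[j])
                   for i, j in enumerate(match))
        if cost < best_cost:
            best_cost, best_match = cost, match
    return best_cost, best_match

# Toy edge lengths in angstroms (made-up values).
query = [3.8, 5.1, 6.4]
target = [6.5, 9.0, 3.9, 5.0, 7.7]
cost, match = best_assignment(query, target)
```

Here `match[i]` is the index of the search-space edge assigned to query edge `i`. A real search space would need a proper solver, since the number of candidate assignments grows factorially.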


Topic: Structural Bioinformatics and Molecular Dynamics PI: SBMD03

MOLECULAR DYNAMICS STUDY OF THE ARCHAEAL AQUAPORIN AQPM


Araya-Secchi R R1,2 , Garate J A1,3 , Holmes D S4 , Perez-Acle T1
1 Computational Biology Laboratory (DLab), Centro de Modelamiento Matemático, Facultad de Ciencias Físicas y Matemáticas, Universidad de Chile; Santiago, Chile
2 Programa de Doctorado en Biotecnología, Facultad de Ciencias Biológicas, Universidad Andrés Bello; Santiago, Chile
3 School of Chemical and Bioprocess Engineering, University College Dublin, Dublin 4, Ireland
4 Center for Bioinformatics and Genome Biology (CBBG), Fundación Ciencia para la Vida and Depto. de Ciencias Biológicas, Facultad de Ciencias Biológicas, Universidad Andrés Bello; Santiago, Chile

Aquaporins are a large family of trans-membrane channel proteins that are present throughout all domains of life and are implicated in human disorders. These channels allow the passive but selective movement of water and other small neutral solutes across cell membranes. Aquaporins have been classified into two sub-families: i) strict aquaporins, which only allow the passage of water, and ii) the less selective aquaglyceroporins, which transport water and other neutral solutes such as glycerol, CO2 or urea. Recently, the identification and characterization of a number of archaeal and bacterial aquaporins suggested the existence of a third sub-family, one that is neither a strict aquaporin nor an aquaglyceroporin. The function and phylogeny of this third family are still a matter of debate. A twenty-nanosecond molecular dynamics (MD) simulation of a fully hydrated tetramer of AqpM embedded in a lipid bilayer permitted predictions of key biophysical parameters, including the single-channel osmotic permeability constant (pf), the single-channel diffusive permeability constant (pd), channel radius, potential water occupancy of the channel and water orientation inside the pore. These properties were compared with those of well-characterized representatives of the two main aquaporin sub-families.
Results show that changes in the amino acid composition of the aromatic/arginine region affect the size and polarity of the selectivity filter (SF) and could help explain the difference in water permeability between aquaporins. In addition, MD simulation results suggest that AqpM combines characteristics of strict aquaporins, such as the narrow SF and channel radius, with those of aquaglyceroporins, such as a more hydrophobic and less polar SF. These results extend previous evidence that AqpM exhibits hybrid features intermediate between the two known aquaporin sub-families, supporting the idea that it may constitute a member of a novel class of aquaporins. Supported by CONICYT (PFB16 and PFB3), Fondecyt 1090451 and a Microsoft Sponsored Research Award. The presenting author is a CONICYT Fellow.


TOPIC 3 TRANSCRIPTOMICS AND PROTEOMICS


Topic: Transcriptomics and Proteomics PI: TP01

COMPARISON OF TRANSCRIPTIONAL RESPONSE TO LOW PH IN S. BOULARDII AND S. CEREVISIAE W303 USING MICROARRAY
Dias L L C2 , de L D M2 , Laat D M D3 , Grynberg P2 , Franco G R2 , Castro I M4
1 Departamento de Bioquímica e Imunologia, ICB, UFMG, Belo Horizonte, MG - Brasil
2 Departamento de Bioquímica e Imunologia, ICB, UFMG, Belo Horizonte, MG - Brasil
3 Centro de P & D em Recursos Genéticos Vegetais, IAC, Campinas, SP - Brasil
4 Laboratório de Biologia Celular e Molecular, Núcleo de Pesquisa em Ciências Biológicas, ICEB, UFOP, Ouro Preto, MG - Brasil

Key words: yeast, acid stress, probiotic, gene expression analysis

Recent studies have shown that S. boulardii and S. cerevisiae are members of the same species, despite their metabolic and physiological differences, especially concerning growth and survival at 37 °C and under acid stress. Tolerance to heat and to low pH are the most important microbial characteristics to consider when choosing a probiotic. Microarray analyses of S. boulardii and S. cerevisiae W303 strains grown at pH 2.0 with 85 mM NaCl for 10 min and 30 min were carried out to find differentially expressed genes that confer resistance to this condition. Intra-strain comparison did not reveal genes with differential expression after 10 and 30 min of treatment (B value > 3 used as the cutoff). In contrast, inter-strain comparison revealed many differentially expressed genes, reflecting variations in viability between the two strains when cells were treated with pH 2.0 and 85 mM NaCl. Heatmaps showed similarity in global gene expression between non-treated cells and between cells treated for 30 min when comparing S. cerevisiae with S. boulardii. However, substantial differences in global gene expression were observed between the strains after treatment for 10 min. Treatment for 10 min with pH 2.0 and 85 mM NaCl may be more appropriate for addressing questions concerning the response to low pH, since Saccharomyces has sets of mRNAs with half-lives as short as 3 min. In addition, it is plausible that cells treated for 30 min are nearly recovered and adapted to the low pH condition.
An increase in the expression of genes related to amino acid metabolism and pH homeostasis in S. boulardii may point to a mechanism for resistance to acid stress in this strain. Further qPCR analysis will be performed to validate genes involved in the differential acid response in S. boulardii. Supported by FAPEMIG
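
The B value cutoff used in the abstract above can be read as a probability. Assuming B is the log posterior odds that a gene is differentially expressed (as in commonly used empirical Bayes microarray analyses; the abstract does not name the software), a small sketch of the conversion follows. The function name is hypothetical.

```python
import math

def b_to_probability(b):
    # Assumes B = ln(p / (1 - p)), the log posterior odds that a gene is
    # differentially expressed; inverting the logit recovers p.
    return 1.0 / (1.0 + math.exp(-b))

cutoff_probability = b_to_probability(3.0)  # cutoff quoted in the abstract
```

Under that assumption, B > 3 corresponds to a posterior probability of differential expression above roughly 0.95.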


Topic: Transcriptomics and Proteomics PI: TP06

TILING ARRAY OF PLASMODIUM VIVAX: BIOINFORMATICS CONTRIBUTIONS FOR THE STUDY OF MALARIA
Corrêa B R S1 , Fernández-Becerra C2 , Portillo H A D2 , Vêncio R Z N1
1 LabPIB - Laboratório de Processamento de Informação Biológica - Universidade de São Paulo, Brasil
2 CRESIB - Centro de Investigación en Salud Internacional de Barcelona - Universidade de Barcelona, Espanha

Plasmodium vivax is the most widely distributed of the parasites that cause human malaria, and it reduces the quality of life of millions of people around the world. Nevertheless, its study is often overlooked in favor of P. falciparum, which causes lethal malaria, perhaps because of a misperception of its relative severity. Systems biology methodologies are expected to make it possible to elucidate the mechanisms used by P. vivax to escape elimination by the spleen and thereby develop a chronic infection. Tiling arrays are crucial tools for achieving this goal because they enable, among other types of study, the observation of the expression level of genes not yet identified, in addition to the study of the structure of known or predicted gene transcripts. The construction of a tiling array requires intense computational effort in its planning, especially when one cannot print probes for the organism's whole genome. In this project, a tiling array of P. vivax is being designed using the Agilent eArray platform. Probes of 60 bases covering the entire transcribed portion of the parasite genome are being designed, with an average overlap of 15 bases between probes. So far, 280,277 probes have been designed, covering 5500 genes of P. vivax, available at: labpib.fmrp.usp.br/-bcorrea/Tiling-tdt-v01. After all probes of interest in the genome have been selected and filtered, a glass slide formatted with one high-definition 1M array, with a capacity of 974,016 features, will be commissioned from Agilent. Then, experimental studies of the transcriptome of parasites isolated from patients in the Amazon region will be undertaken, coordinated by molecular parasitologists of the Barcelona Centre for International Health Research (CRESIB), Spain, who are international partners in this project.
Later, statistical analysis of the data from this experiment will be performed using appropriate available software and methods implemented in the R programming language. We therefore expect to contribute, through bioinformatics, to clarifying the strategies used by P. vivax to establish chronic infections, knowledge that may be used to combat malaria caused by this parasite. Supported by: FAPESP
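
The probe layout described in the tiling array abstract above (60-base probes overlapping by 15 bases) can be sketched as follows. This is a simplified illustration, not the eArray design procedure: it uses a fixed overlap, whereas the abstract reports 15 bases as an average, and it ignores probe quality filtering. The function name and the example sequence are made up.

```python
def tile_probes(sequence, probe_length=60, overlap=15):
    """Return (start, probe) pairs tiled along `sequence`.

    Consecutive probes share `overlap` bases, so each start advances by
    probe_length - overlap. A trailing fragment shorter than probe_length
    is not emitted.
    """
    if not 0 <= overlap < probe_length:
        raise ValueError("overlap must be in [0, probe_length)")
    step = probe_length - overlap
    last_start = len(sequence) - probe_length
    return [(start, sequence[start:start + probe_length])
            for start in range(0, last_start + 1, step)]

# Made-up 150-base transcript: probes start at positions 0, 45 and 90.
transcript = "ACGTTGCAAG" * 15
probes = tile_probes(transcript)
```

With these defaults each new probe begins 45 bases after the previous one, so a transcript of length L yields about L / 45 probes.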


Topic: Transcriptomics and Proteomics PI: TP04

PRINCIPAL COMPONENT ANALYSIS AS AN APPROACH TO FIND SIMILARITIES AND DIFFERENCES IN A GENE EXPRESSION DATA
Garcia J C O1 , Vêncio R Z N1
1 Universidade de São Paulo, Faculdade de Medicina de Ribeirão Preto - Dept. de Genética, Laboratório de Processamento de Informação Biológica - LabPIB

Knowledge extraction from microarray data is a classical bioinformatics application. This kind of data is multivariate and hard to interpret and analyze for those with a traditional biology background. Overcoming this difficulty requires mathematical techniques widely used in the Pattern Recognition area, such as Principal Component Analysis (PCA), among others. This statistical technique may be applied to microarray data in order to reduce the dimensionality of the analysis. It may thus provide, in some cases, a better interpretation of the biological phenomenon studied. Recently, PCA was employed in a comparative analysis of the transcriptomes of tumor cells and normal cells of human prostate tissue (Pascal et al., BMC Cancer 2009). According to Pascal et al., there are relations between the normal and cancer transcriptomes that can be observed more clearly in graphical form with the help of PCA. The dimensionality reduction sought with this technique looks for redundant factors that influence the distribution of the set as a whole. The technique works as follows: first we seek the directions of highest variation and then we rotate the data according to these directions. Each of these directions thus becomes a coordinate axis in a Cartesian space. The rotation of the data gives us a new form of visualization. The data are now distributed according to their differences and similarities, taking into account the whole set of features. A PCA plot does not have an intuitive meaning. However, it serves to raise questions such as: What is particular to the samples that separates them into distinct groups? What special features make them different? How important are the genes analyzed for the pattern of distribution of the samples? In this work we applied PCA to the Pascal et al. dataset and to others, such as ES (embryonic stem H1 cell line culture) and EC (embryonic carcinoma - NCCIT culture cells). PCA showed the relatedness of the ES and H1 transcriptomes to each other and to those of the other major cell types of the prostate: stromal (S), luminal (L), basal (B), and endothelial (E). The analysis suggested that, according to the major differences in gene expression between cell types, H1 was more similar to NCCIT than to any of the histologically normal prostate cell types. Supported by: Capes
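
The two steps described in the PCA abstract above (find the direction of highest variation, then express the data along it) can be sketched in a few lines. This is a minimal pure-Python illustration that recovers only the first component by power iteration; it is not the analysis pipeline used by the authors, and the expression values are made up.

```python
import math

def first_principal_component(rows, iterations=200):
    """Estimate the direction of greatest variance by power iteration.

    `rows` is a list of samples, each a list of feature values.
    Returns a unit vector as a list of floats. A numerical library would
    normally be used instead of this hand-written loop.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

def project(rows, direction):
    # Coordinate of each mean-centered sample along the given direction.
    d = len(direction)
    means = [sum(r[k] for r in rows) / len(rows) for k in range(d)]
    return [sum((r[k] - means[k]) * direction[k] for k in range(d))
            for r in rows]

# Made-up values: two genes that vary together across four samples.
samples = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
pc1 = first_principal_component(samples)
scores = project(samples, pc1)
```

The `scores` list is the one-dimensional view of the samples along the first principal axis, which is the kind of coordinate a PCA plot displays.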


Topic: Transcriptomics and Proteomics PI: TP03

MOLECULAR, BEHAVIORAL AND ANATOMICAL SOPHISTICATION IN SPIDER WEBS: INSIGHTS FROM SPINNING GLAND RNA-SEQ EXPERIMENTS IN PRIMITIVE AND MODERN SPIDERS
Prosdocimi F1 , Bittencourt D3 , Silva F R D2 , Motta P C4 , Rech E2
1 Universidade Católica de Brasília
2 EMBRAPA-CENARGEN
3 EMBRAPA-Amazônia Ocidental
4 Universidade de Brasília

RNA-seq experiments on 454 sequencers were carried out to produce 87,000 short reads representing the transcriptomes of the spinning glands of two spiders. We produced sequences from (i) Actinopus sp., a spider from the Mygalomorphae clade, and (ii) Gasteracantha cancriformis, an Orbicularia spider. Mygalomorphae spiders are known to retain a number of primitive morphological and behavioral characters. They use mixtures of a primitive web, soil and plants only to cover a burrow that they dig in the ground for shelter and predation. Orbicularia spiders, on the other hand, show a number of derived characters and are capable of producing different and complex silks used in a variety of situations. It is interesting to note that the complexity of web production, usage and behavior in these spiders is reflected both by (i) the variety of the repertoire of protein molecules (spidroins) they use to make their webs and (ii) the complexity of the anatomical spinning gland apparatus they use to produce silk. Here we first conducted a broad analysis of the spinning gland transcriptome in both spiders, producing unigenes and categorizing annotated genes by biological function. We then started to analyze the number and variety of spider silk proteins and families found in different spider clades. We have shown that spiders that use webs in only a limited number of situations have a less sophisticated morphological spinning apparatus and produce a small repertoire of spidroin molecules. Phylogenetic analyses were conducted on the 3' region of spidroins, and we attempt to relate (i) the evolution of silk protein families, (ii) the increasing evolutionary complexity of silk production behavior and web usage, and (iii) the appearance of new specialized spinning glands along the evolution of specimens and clades in the order Araneae. Supported by: CNPq, FAP-DF.


Topic: Transcriptomics and Proteomics PI: TP06

IN SILICO EVALUATION OF COFFEE TRANSCRIPTOME: INFERENCE OF NEW MECHANISMS OF GENE EXPRESSION REGULATION AND IDENTIFICATION OF SNPS
Vidal R O1 , Mondego J M C2 , Tokuda E K1 , Parizzi L P1 , Costa G G L1 , Pot D3 , Ambrosio A B1 , Andrade A C4 , Pereira L F P5 , Colombo C A C2 , Vieira L G E6 , Carazzolle M F1,7 , Pereira G A G1
1 Laboratório de Genômica e Expressão, UNICAMP
2 Instituto Agronômico de Campinas (IAC)
3 CIRAD, UMR DAP, France
4 LGM-NTBio - Embrapa Recursos Genéticos e Biotecnológicos
5 Embrapa Café - Instituto Agronômico do Paraná (IAPAR)
6 Instituto Agronômico do Paraná (IAPAR), Laboratório de Biotecnologia Vegetal
7 CENAPAD-SP - Centro Nacional de Processamento de Alto Desempenho em São Paulo, UNICAMP

Coffee is one of the most important crops in the world, being consumed worldwide and playing a significant role in developing economies. Coffea arabica and Coffea canephora are responsible for 70% and 30% of commercial production, respectively. Cytogenetic analysis established that C. arabica is an autogamous allotetraploid formed by a recent (1 mya) hybridization between the diploids C. canephora and Coffea eugenioides. C. eugenioides is a wild species that grows at higher altitudes near forest edges and produces few berries with small beans of low caffeine content. C. canephora, on the other hand, is allogamous and grows better in lowlands. It is also characterized by higher productivity, greater tolerance to pests and higher caffeine content, but it yields an inferior beverage compared with C. arabica. During the last decade, research initiatives have been launched to produce genomic and transcriptomic data about Coffea spp. This EST collection represents a good overview of the C. arabica and C. canephora transcriptomes and is an appropriate resource for Coffea molecular analysis. This work aimed to obtain further information about Coffea spp. gene structure and expression and to identify genes that are specific to or expanded in coffee plants. It also set out to study the regulation of homeologous gene expression in the allotetraploid C. arabica. To investigate these data, two different EST assemblies were performed: (i) with each species individually, aiming at a comparative analysis between C. arabica, C. canephora and other crops; and (ii) with both coffee species together, allowing the identification of SNPs between C. arabica and one of its direct ancestors, C. canephora, and the examination of evolutionary questions in C. arabica. The identification of differentially expressed transcripts and new gene families offered a starting point for correlating gene expression profiles with Coffea sp. developmental traits. We detected different GC3 profiles between C. arabica and C. canephora that could be related to the genome structure and mating systems of these species. Protein domain and Gene Ontology analyses suggested significant differences between the coffee species analyzed, mainly in relation to complex sugar synthases, nucleotide binding proteins and retrotransposons. The OrthoMCL tool identified protein families that are specific to or prevalent in coffee when compared with five other plant species. In addition, two-dimensional hierarchical clustering was used to independently group C. arabica and C. canephora clusters according to expression data extracted from EST libraries. Using the high-quality discrepancies found in overlapping ESTs from C. arabica and C. canephora, sequence diversity profiles were evaluated within both species and used to deduce the transcript contributions of the C. canephora and C. eugenioides ancestors in C. arabica. The assignment of the C. arabica homeologous genes to the ancestral genomes allowed us to analyze the gene expression contribution of each subgenome. We suggest that this phenomenon plays an important role in Coffea gene expression and physiology.


Supported by: FAPESP (Grant 07/51031-2 to R Vidal); Consórcio Pesquisa Café, CNPq and Agronomical and Environmental Genomes (AEG)/FAPESP (Grant 00/10154-5)
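
The GC3 profile mentioned in the coffee transcriptome abstract above refers to the G+C content at third codon positions. A minimal sketch of that calculation for a single coding sequence follows; it assumes the sequence is already in frame, and the function name and example sequence are made up for illustration.

```python
def gc3(cds):
    """Fraction of G or C at third codon positions of a coding sequence.

    `cds` is assumed to be in frame, starting at the first codon position;
    an incomplete trailing codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    thirds = cds[2:usable:3]
    if not thirds:
        raise ValueError("sequence shorter than one codon")
    return sum(base in "GC" for base in thirds) / len(thirds)

# Made-up in-frame sequence: third positions are G, C, A, T.
value = gc3("ATGGCCAAATTT")
```

Computing this value for every predicted coding sequence of each species would give the per-species distributions that a GC3 comparison relies on.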


Topic: Transcriptomics and Proteomics PI: TP09

EMERGING ROLES OF DESUMOYLATING ENZYMES IN SCHISTOSOMA MANSONI


Pereira R1 , Queiroz K1 , Carvalho S1 , Barboza N1 , Passos L2 , Borges W1 , 1
1 Departamento de Ciências Biológicas/Núcleo de Bioinformática/Núcleo de Pesquisa em Ciências Biológicas - UNIVERSIDADE FEDERAL DE OURO PRETO
2 CENTRO DE PESQUISAS RENÉ RACHOU

Schistosoma mansoni undergoes intense and complex structural reorganization of the tegument and several important metabolic transformations during its development. This suggests that signaling pathways may provide the molecular signals that define the route the parasite must follow. Post-translational modification by small ubiquitin-like modifier (SUMO) controls diverse cellular processes, including transcriptional regulation, nuclear transport, cell-cycle progression, DNA repair and signal transduction. Sumoylation is a highly dynamic process that is reversed by a family of Sentrin/SUMO-specific proteases (SENPs). Desumoylation must therefore be as important as sumoylation for regulating the fate and function of SUMO-conjugated proteins. SENPs catalyze the removal of SUMO from SUMO-conjugated target proteins as well as the cleavage of SUMO from its precursor proteins. Since the first report of yeast desumoylating enzymes, many studies have revealed the structural and cell biological properties of the SENP family. Recently, we found two paralogues in S. mansoni, SMT3B/C, that may play a role in the sumoylation of target substrates. The primary goal of this work was to identify the set of SUMO-specific processing proteases in genome and transcriptome databases of the parasite S. mansoni, using amino acid sequences of putative orthologous proteins from Homo sapiens as queries. Our in silico analyses retrieved 2 putative sequences (SmSENPs) that contain the Peptidase_C48 domain (PF02902) in the parasite database (GeneDB). Moreover, multiple alignments of the SmSENP enzymes were performed with ClustalX 2.0 and phylogenetic analyses were conducted in MEGA 4.1.
The analyses revealed that the SmSENPs are well conserved at the amino acid level when compared to their orthologs. In addition, the transcript levels of these genes were analyzed by qRT-PCR using total RNA from cercariae, adult worms and schistosomula cultivated in vitro for 3.5 h (MTS-3.5), 24 h (MTS-24), 48 h (MTS-48), 72 h (MTS-72), 5 days (MTS-5d) and 7 days (MTS-7d), and were normalized to an endogenous transcript (Sm-tubulin). We observed that transcripts of the SmSENP gene family were expressed in all stages investigated and that their expression levels differed significantly (p<0.05). SmSENP1/7 transcripts were expressed at high levels in the initial stages, such as MTS-3.5. SENP1 expression was 2-fold higher than that of SENP7 in cercariae and MTS-3.5. We conclude that the SmSENP genes may be novel regulators of the SUMO system and of several cell functions in S. mansoni. Our group has now begun to address experimentally how SmSENP functions may contribute to parasite development.
Keywords: Sumo, Schistosoma mansoni, gene expression
Supported by: CNPq, FAPEMIG- TCT 12.009/09.
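
The abstract above reports transcript levels normalized to an endogenous transcript and a 2-fold difference between two genes, but it does not state how relative expression was computed. The sketch below assumes the common 2^-ΔCt approach with perfect amplification efficiency; the function names and Ct values are hypothetical.

```python
def relative_expression(ct_target, ct_reference):
    # Assumes ~100% amplification efficiency, so one cycle is a 2-fold change.
    return 2.0 ** -(ct_target - ct_reference)

def fold_change(ct_target_a, ct_ref_a, ct_target_b, ct_ref_b):
    # Ratio of normalized expression of target A relative to target B.
    return (relative_expression(ct_target_a, ct_ref_a)
            / relative_expression(ct_target_b, ct_ref_b))

# Made-up Ct values: target A crosses the threshold one cycle earlier than
# target B, with the same reference Ct, which gives a 2-fold difference.
fc = fold_change(24.0, 18.0, 25.0, 18.0)
```

If amplification efficiencies differ between primer pairs, an efficiency-corrected formula would be needed instead.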


Topic: Transcriptomics and Proteomics PI: TP04

CHANGE POINT PROBLEMS IN RNASEQ DATA: A BAYESIAN APPROACH FOR TRANSCRIPT SEGMENTATION
Vêncio R1 , Baliga N3 , Koide T2
1 LabPIB, Lab. Processamento de Informação Biológica, Dept Genética FMRP-USP
2 LabSisMi, Lab. Biologia Sistêmica de Microorganismos, Dept Bioquímica FMRP-USP
3 Institute for Systems Biology

The specialized RNAseq literature insists that this emerging experimental technique can achieve an amazing 1 (one) base resolution when studying transcriptomes from the point of view of transcript structure. Although this is theoretically possible, simple simulations can readily dismantle such a strong claim if one considers all transcripts over their whole length, not only their start and end. This happens because, in a reasonable sequencing setup, the desired resolution would be achieved only with an unrealistically large number of runs. This is a problem for determining transcript extremities, where one can compare a zero-count base position with its non-zero-count neighbor to define the transcript start/end; however, it is even worse for determining internal structure, where there is position-specific differential expression (non-zero vs non-zero). Recent RNAseq and tiling array data have shown that transcripts have internal structure and that some of the sub-sequences inside them may have their own functional role in the regulation of gene expression. These observations pushed us to resort to statistical methods in order to estimate such start/end positions and to segment sub-sequences inside the transcript according to their expression level signals, as measured by next-generation sequencing platforms. Our data, generated by a collaboration between the Laboratório de Biologia Sistêmica de Microorganismos (LabSisMi) and the Institute for Systems Biology (ISB), are a triplicate measurement of the transcriptome of the extremophile archaeon Halobacterium salinarum. The data were obtained using the Illumina platform and pre-processed (read-genome mapping, base counting, etc.) as usual in the field. The mathematical problem underlying our bioinformatics challenge is known generically as the Change Point Problem (also known as Detection of Jumps or Regime Shift Analysis, among others). Several methods and software packages are available for it but, to the best of our knowledge, none is fine-tuned for RNAseq analysis.
Our starting point for attacking this problem is the Bayesian method proposed by Barry and Hartigan (Erdman and Emerson, Journal of Statistical Software 2007). Our preliminary results show that the Bayesian method, as implemented in the bcp R package, yields good results and is able to segment internal signals that may indicate sub-sequences with biological importance distinct from that of the whole messenger sequence that hosts them. Supported by: FAPESP, CNPq, FAEPA
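
To make the change point idea concrete, the sketch below locates a single jump in a vector of per-base read counts. It is a deliberately simplified stand-in: a least-squares scan over one split position, not the Barry and Hartigan Bayesian procedure or the bcp package used in the abstract, and it cannot find more than one change point. The function name and the coverage values are made up.

```python
def best_single_change_point(counts):
    """Return (index, sse) of the split that minimizes squared error.

    Positions [0, index) and [index, len(counts)) are each modeled by their
    own mean; the split with the smallest total squared error wins.
    """
    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    if len(counts) < 2:
        raise ValueError("need at least two positions")
    candidates = ((sse(counts[:k]) + sse(counts[k:]), k)
                  for k in range(1, len(counts)))
    error, index = min(candidates)
    return index, error

# Made-up per-base read counts with a jump in expression at position 6.
coverage = [10, 12, 11, 9, 10, 11, 30, 29, 31, 32, 30, 28]
split, error = best_single_change_point(coverage)
```

A Bayesian treatment additionally yields a posterior probability of a change at every position, which is what makes it attractive for noisy count data.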


Topic: Transcriptomics and Proteomics PI: TP05

META-ANALYSIS OF GENE EXPRESSION IN SCHIZOPHRENIA BY A MIXED LINEAR MODEL


Cancherini D V1 , Brentani H2 , Pereira C A D B1
1 Departamento de Estatística - INSTITUTO DE MATEMÁTICA E ESTATÍSTICA DA UNIVERSIDADE DE SÃO PAULO
2 Departamento de Psiquiatria - FACULDADE DE MEDICINA DA UNIVERSIDADE DE SÃO PAULO

Schizophrenia is a psychiatric disease whose importance stems both from its prevalence of approximately 1% in the general population and from its usually devastating consequences for the lives of affected individuals. In an effort to understand its pathophysiological mechanisms, several studies have used microarray technology to evaluate, in deceased patients, gene expression in the prefrontal cortex, where functional anomalies are regularly seen in this disease. However, this type of evaluation has been plagued by a series of difficulties, the major ones being: variability of the microarray technique, biological variability inherent to this complex disease, RNA quality limitations arising from the long minimum post-mortem interval for human cortical samples, and difficulties in obtaining an adequately large bank of human brains. As a consequence, no single study has been able to analyze data from enough patients to reach consistent conclusions about gene expression in schizophrenia. This work tries to circumvent this problem with meta-analytic methodologies. We searched for publicly available data from these studies and selected only studies that used two popular and recent Affymetrix chips (HG-U133A and HG-U133P). We found seven studies that examined RNA samples from a total of 90 schizophrenia patients and 94 healthy controls. However, four of these studies shared a subset of 35 patients and 35 controls, which required a specially tailored statistical approach.
In order to uncover differences between patients and controls in the pooled data from all seven studies, we used a classical mixed linear model that treated study as a random factor and considered as fixed factors/covariates both diagnosis and a series of potentially confounding variables: sex, age, post-mortem interval, brain pH, and microarray chip. We compared the identities and functional profiles of differentially expressed genes in the individual studies and in our meta-analysis. Supported by: CAPES.
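
The mixed linear model described above can be written schematically for a single gene as follows. This is a sketch of one plausible formulation, not the authors' exact specification; all symbols are assumptions introduced here, with $y_{si}$ the expression of the gene in subject $i$ of study $s$, $\mathrm{dx}_{si}$ an indicator of diagnosis, $\mathbf{c}_{si}$ the vector of covariates (sex, age, post-mortem interval, brain pH, chip), and $u_s$ the random effect of study.

```latex
y_{si} = \mu + \beta_{\mathrm{dx}}\,\mathrm{dx}_{si}
       + \boldsymbol{\beta}_c^{\top}\mathbf{c}_{si} + u_s + \varepsilon_{si},
\qquad u_s \sim \mathcal{N}(0, \sigma_u^2),
\quad \varepsilon_{si} \sim \mathcal{N}(0, \sigma^2)
```

In this notation a gene is differentially expressed between patients and controls when $\beta_{\mathrm{dx}}$ differs from zero. The sketch does not cover the subset of subjects shared by four studies, which the abstract says required a specially tailored approach.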


Topic: Transcriptomics and Proteomics PI: TP04

UBIQUITIN-LIKE MODIFIERS AND THEIR DECONJUGATING ENZYMES IN PROTOZOAN PARASITES


Barboza N R1 , Olmo R P1 , Pereira R V1 , Carvalho S1 , Queiroz K B D1 , Leal D A1 , Renata G S1
1 Departamento

de Cincias Biolgicas/Ncleo de Bioinformtica/Ncleo de Pesquisa em Cincias Biolgicas- UNIVERSIDADE FEDERAL DE OURO PRETO The genome sequences of Leishmania major , Trypanosoma brucei and T. cruzi revealed that each genome contains 8300-12000 protein-coding genes, of which approximately 6500 are common to their genomes. In this study we focused the mining the trypanosomatid databases looking for proteins involved in ubiquitin metabolism. There are basically nine distinct subfamilies of enzymes involved in this pathway, denominated DUBs: ubiquitin C-terminal hydrolases (UCHs), ubiquitin-specic processing proteases (USPs), Machado-Joseph disease proteases (MJDs), ovarian tumour proteases (OTUs), JAMM motif proteases, sentrins, autophagins, Permuted Papain fold Peptidases of DsRNA viruses and Eukaryotes (PPPDEs), and Wss1p-like metalloproteases (WLMs). We used bioinformatic approaches to identity these set of enzymes in L. infantum , L. major , L. braziliensis , T. brucei and T. cruzi public genome databases. Our in silico search retrieved 163 putative entries coding to DUBs in these species among databases when compared to known GenBank ortholog proteins through BLASTx. Of these, 84 belong to USP subfamily, 23 to JAMM, 10 to UCH, 15 to OTU, 5 to sentrin, 24 to PPPDE, 10 to autophagin, 5 to WLM, and where not found entries to MJD. In T. cruzi genome was also observed the presence of duplicated genes, where the 63 entries might be resumed, with at least 95% of similarity, in 34 sequences. However, in each Leishmania genome studied were found on average 32 entries for DUBs, and most of them have its orthologues into the same genus. Comparative analysis of genes responsible for ubiquitin removal showed signicant difference in their respective nucleotide sequence lengths as well as amino acid composition, especially with regards to regions outer conserved domains, suggesting maintaining of function, but different substrate specicity. 
In addition, the relative gene expression of TcUSP7, -10, -14, -15 and TcUCH-L3, determined by qRT-PCR, showed a similar profile in epimastigote forms of the Berenice-62, Berenice-78, Colombian and Y strains of T. cruzi. However, the levels of TcUSP15 were the lowest across the strains analyzed in this study, while the levels of TcUCH-L3 were the highest. Finally, we determined DUB activity using the substrate CBZ-Gly-Gly-Arg-7-amido-4-methylcoumarin in nuclear and cytosolic extracts, and assessed ubiquitin and SUMO protein conjugation in the same extracts by western blot with specific antibodies to ubiquitin and SUMO. Our results showed that the activity was predominantly cytosolic. Western blotting showed that conjugated ubiquitin was found preferentially in the cytoplasm and sumoylation in the nucleus. These results evidence the complexity and diversity of DUBs in trypanosomatids and open the possibility of exploring the relevance of their interactions in the regulation of ubiquitin-mediated pathways in these parasites.

Keywords: Deubiquitinating enzymes, Trypanosoma cruzi, differential expression

Supported by: CNPq, FAPEMIG - TCT 12.009/09

Topic: Transcriptomics and Proteomics PI: TP03

SIGNALP PREDICTION: HOW GOOD IS IT FOR DIFFERENT PLASMODIUM SPECIES AND OTHER ORGANISMS?
Ribeiro R D S1 , Neto A D M1 , Rezende A M1 , Brito C F A D1
1 Rene Rachou Research Center

Malaria is a major public health problem in many countries, and more than 2.5 billion people are exposed to infection by the four main human-infecting species of the genus Plasmodium. The disease affects an estimated 300-500 million people in tropical and subtropical areas of the planet, resulting in more than one million deaths each year, mostly among children under 5 years of age in sub-Saharan Africa. An effective malaria vaccine is a priority, and many efforts have been made to identify promising candidates. Despite these efforts, few proteins are being tested as antigens for the formulation of an effective vaccine. An important feature of a vaccine candidate is the presence of a signal peptide, a sequence of 15-40 amino acids at the N-terminal region that directs the protein to export pathways. There are many software tools for predicting these signal sequences. The most popular is SignalP, which is based on neural network (NN) and hidden Markov model (HMM) predictions of signal peptides and their cleavage sites. The eukaryotic dataset used to train this program is dominated by mammalian sequences, predominantly human. It is not known whether this training biases the results when other species are analyzed. Thus, in this study, we investigated SignalP scores for proteins from different species of Plasmodium and from model organisms such as Arabidopsis thaliana, Homo sapiens and others. For this, the complete sets of protein sequences for the studied species were obtained from the RefSeq or PlasmoDB databases and submitted to the program. The analysis of D scores (NN) showed a bimodal distribution for the majority of organisms, with one mode corresponding to signal peptide positive (SP+) sequences and the other to signal peptide negative (SP-) sequences. We submitted the data to a Kruskal-Wallis test, and the D score medians of SP+ sequences from different organisms varied significantly (P<0.0001). Similar results were obtained when comparing SP- sequences.
Concerning Plasmodium species, only three pairs out of 15 did not differ significantly in the D score median of SP+ sequences (Mann-Whitney test, P>0.05): P. vivax vs. P. knowlesi, P. vivax vs. P. chabaudi and P. knowlesi vs. P. chabaudi. All studied species differed significantly from H. sapiens. The bimodal distribution is not clear for Plasmodium berghei and P. yoelii, which have the lowest D score medians. HMM scores will be analyzed as well. More analyses are necessary to verify whether the species-specific peculiarities shown here compromise signal peptide prediction.
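
The rank-based comparison used in this abstract can be illustrated with a minimal sketch. It computes the Kruskal-Wallis H statistic from scratch (no tie correction, no P value) and counts the pairwise comparisons; the figure of 15 pairs quoted above corresponds to six species (6 choose 2). The species labels and D score values below are hypothetical placeholders, not data from the study, and the function names are ours.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups: Dict[str, Sequence[float]]) -> float:
    """H = 12 / (N (N + 1)) * sum(R_g^2 / n_g) - 3 (N + 1), without tie correction."""
    pooled = [value for group in groups.values() for value in group]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted = 0.0
    start = 0
    for group in groups.values():
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Hypothetical D scores for three placeholder species (illustration only).
d_scores = {
    "species_a": [0.1, 0.2, 0.3],
    "species_b": [0.4, 0.5, 0.6],
    "species_c": [0.7, 0.8, 0.9],
}
h_statistic = kruskal_wallis_h(d_scores)

# Six species give 15 unordered pairs for the pairwise Mann-Whitney tests.
n_pairs = len(list(combinations(range(6), 2)))
```

In practice a statistics package would also supply the P value and the tie correction; the sketch only shows how the statistic is built from pooled ranks.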

Topic: Transcriptomics and Proteomics PI: TP03

NEW COMPUTATIONAL STRATEGY BASED ON DATA MINING TECHNIQUES FOR ASSESSING PEPTIDE AND PROTEIN IDENTIFICATIONS BY SHOTGUN PROTEOMICS
Cerqueira F R1
1 Department of Informatics (DPI) and Group for Bioinformatics Research (NUBIO), University of Viçosa, Minas Gerais, Brazil

The shotgun strategy, by means of liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) and database (DB) search algorithms (e.g. Mascot and Sequest), has been the method of choice for the identification of proteins in complex mixtures. A single MS/MS experiment, however, may generate thousands of spectra, so the resulting identifications have to be assessed automatically. There are currently two widely used techniques for assessing the peptide-spectrum matches (PSMs) produced by DB search methods: PeptideProphet and the target-decoy search strategy. In the PeptideProphet approach, standard statistical distributions are fitted to the observed positive and negative score distributions. However, certain datasets (phosphodata, for example) might present score distributions that are completely different from those assumed. The target-decoy search strategy, in turn, works without any a priori assumption about the data. In this strategy, decoy (false) sequences are included in the search alongside the target proteins. The number of decoy PSMs obtained is then an excellent estimate of the number of wrong hits among target PSMs. A desired false discovery rate (FDR) can be achieved by varying score thresholds and counting decoy results until suitable cutoff values are reached. Cerqueira et al. (2010) proposed a new procedure called MUDE to extend the target-decoy method. Using Sequest in their experiments, the authors showed that a much higher sensitivity (number of PSMs) can be achieved. First, the authors consider many more quality parameters than usual. Second, the problem of finding threshold values leading to the desired FDR is treated as an optimization problem. Note, nevertheless, that the MUDE approach provides linear decision boundaries to separate false from true positives. Furthermore, the heuristic used in the optimization has to be executed several times. In their work, Cerqueira et al. performed 45 runs of the proposed procedure. Each run takes on average 10 s, for a total time of approximately 7.5 minutes. We propose here an improvement to the MUDE method that uses powerful machine learning algorithms to find better decision boundaries (higher sensitivity) in shorter running times. Experiments were performed on the same PSMs used to test MUDE, which come from phosphorylated and nonphosphorylated proteins and were generated by Sequest. Among the several algorithms tested, the neural network (NN) was the most successful technique. The features used in all runs were the six scores applied by MUDE. We used the resulting ROC curves to determine thresholds leading to the highest sensitivities for FDRs varying from 0 to 0.05, so that we could compare our results with MUDE output for the same error rates. Sensitivity was ca. 16% and 11% better for phosphodata and nonphosphodata, respectively. Furthermore, the running time of our procedure was strikingly shorter. The NN models took on average 8.5 s to build, which makes our method approximately 53 times faster than the MUDE approach (450 s versus 8.5 s). In conclusion, MS-based proteomics experiments can now achieve better performance in terms of both time and proteome coverage.

Acknowledgment: This work is supported by Fundação Arthur Bernardes (Funarbe), Viçosa, Minas Gerais.
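
The basic target-decoy thresholding described in this abstract can be sketched as follows. This is a minimal single-score illustration, not MUDE and not the neural network approach proposed here: it assumes each PSM carries one score (higher is better) and a decoy flag, and estimates the FDR at a cutoff as decoys divided by targets among PSMs at or above that cutoff. The function name and the example scores are hypothetical.

```python
from typing import List, Optional, Tuple


def score_threshold_for_fdr(
    psms: List[Tuple[float, bool]], max_fdr: float
) -> Optional[float]:
    """Return the lowest score cutoff whose estimated FDR is within max_fdr.

    psms holds (score, is_decoy) pairs. Returns None if no cutoff qualifies.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    best_cutoff = None
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Count every PSM tied at this score before evaluating the cutoff.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1]:
                decoys += 1
            else:
                targets += 1
            i += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best_cutoff = score  # accepting everything down to here is still OK
    return best_cutoff


# Hypothetical PSMs: (score, is_decoy).
example_psms = [
    (9.1, False), (8.7, False), (8.2, False), (7.9, True),
    (7.5, False), (7.0, False), (6.4, True), (6.0, False),
]
strict_cutoff = score_threshold_for_fdr(example_psms, 0.0)
relaxed_cutoff = score_threshold_for_fdr(example_psms, 0.25)
```

With these toy values the strict setting stops just above the first decoy, while the relaxed setting admits two more target PSMs, which is the sensitivity versus error trade-off the abstract refers to.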

Topic: Transcriptomics and Proteomics PI: TP03

GENOME-WIDE ANALYSIS OF TISSUE ENGINEERED ENDOTHELIAL CELLS


Porto L M1 , Hess S2 , Edelman E R2
1 Integrated Technologies Laboratory, Federal University of Santa Catarina, InteLab/UFSC, Florianópolis, SC, Brazil
2 Biomedical Engineering Center, Harvard-MIT, Cambridge, MA, USA

Tissue engineering encompasses a broad range of techniques that fill the gaps between biological sciences and engineering methods. It is being used to advance our understanding of complex cell-material interactions and to develop novel therapeutic approaches that repair, regenerate, enhance or substitute deficient or pathologic tissues. It has been found that matrix-embedded endothelial cells (MEEC) are stronger inhibitors of local and systemic immune responses than cells injected as pellets or adherent to material surfaces. MEEC have been successfully tested as an implantable biomaterial to repair arterial lesions in animals. Here we investigated genome-wide mRNA expression levels of endothelial cells seeded on a gelatin-coated tissue culture polystyrene plate (TCPS) and in a Gelfoam-embedded construct. Affymetrix HG-U133A_2 oligonucleotide microarray chips containing 22,277 genes were subjected to expression analysis and clustering using dChip software. Samples were from endothelial cells cultured on TCPS and Gelfoam, either alone or stimulated with TNF-α or FGF-2 for 1 h and 6 h. The dynamics of mRNA expression were correlated with immune responses and growth on 2D (TCPS) versus 3D (MEEC) scaffolds.

Supported by: CNPq

Topic: Transcriptomics and Proteomics PI: TP04

PYBIOSIG: OPTIMIZING GROUP DISCRIMINATION USING GENETIC ALGORITHMS FOR BIOSIGNATURE DISCOVERY
GCArnoldi F1 , Rodrigues R F1 , Silva C L1
1 Department of Biochemistry and Immunology, Ribeirão Preto School of Medicine, University of São Paulo, Ribeirão Preto, SP, 14049-900, Brazil.

In medical sciences, a biomarker is a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention. Molecular experiments are providing rapid, efficient, and systematic approaches to search for biomarkers, but because single-molecule biomarkers have shown a disappointing lack of robustness for clinical diagnosis, researchers have begun searching for distinctive sets of molecules, called biosignatures. However, the most popular statistics are not appropriate for their identification, and the number of possible biosignatures to be tested is frequently intractable. In the present work, we developed a wrapper method using genetic algorithms (GA) as a feature (gene) selector to optimize a measure of intra-group cohesion and inter-group dispersion. This method was implemented using Python and R (pyBioSig, available at http://code.google.com/p/pybiosig/ under LGPL) and can be manipulated via a graphical interface or Python scripts. Using it, we were able to identify putative biosignatures that include few genes and are capable of recovering multiple groups, even ones that were not recovered using the whole transcriptome, within a feasible length of time on a personal computer. Our results allowed us to conclude that using GA to optimize our new intra-group cohesion and inter-group dispersion measure is a clear, efficient, and computationally feasible strategy for the identification of putative "omical" biosignatures that discriminate among multiple groups simultaneously.

Supported by: CNPq
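
The wrapper idea in this abstract, a genetic algorithm searching over gene subsets to optimize a cohesion/dispersion score, can be sketched in a few lines. This is not pyBioSig's code: the abstract does not define its measure, so the `separation` score below (mean distance between group centroids divided by mean distance of samples to their own centroid) is a stand-in assumption, and the GA settings, function names and expression values are all hypothetical.

```python
import random
from statistics import mean
from typing import Dict, List, Sequence

Groups = Dict[str, List[Sequence[float]]]


def separation(groups: Groups, genes: List[int]) -> float:
    """Assumed score: inter-group dispersion over intra-group spread."""
    def dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((a[g] - b[g]) ** 2 for g in genes) ** 0.5

    centroids = {
        name: [mean(s[g] for s in samples) for g in range(len(samples[0]))]
        for name, samples in groups.items()
    }
    within = mean(
        dist(s, centroids[name]) for name, samples in groups.items() for s in samples
    )
    names = list(groups)
    between = mean(
        dist(centroids[a], centroids[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    )
    return between / (within + 1e-9)


def evolve(groups: Groups, n_genes: int, pop_size: int = 20,
           generations: int = 30, mutation_rate: float = 0.1,
           seed: int = 0) -> List[bool]:
    """Search gene masks with truncation selection, one-point crossover, bit flips."""
    rng = random.Random(seed)

    def fitness(mask: List[bool]) -> float:
        genes = [i for i, on in enumerate(mask) if on]
        return separation(groups, genes) if genes else float("-inf")

    population = [[rng.random() < 0.5 for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # the fitter half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)
            child = mum[:cut] + dad[cut:]
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Hypothetical expression values: gene 0 separates the groups, genes 1-3 are noise.
expression = {
    "control": [[1.0, 5.0, 3.0, 2.0], [1.2, 2.0, 6.0, 4.0], [0.9, 4.0, 1.0, 7.0]],
    "treated": [[5.0, 4.5, 2.0, 3.0], [5.1, 1.5, 5.5, 6.0], [4.8, 3.0, 3.5, 1.0]],
}
best_mask = evolve(expression, n_genes=4)
```

The point of the sketch is the design choice the abstract argues for: the score is evaluated on whole gene subsets rather than gene by gene, so a small subset can outscore the full transcriptome when most genes only add noise.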

TOPIC 4 DATABASES AND BIOINFORMATICS TOOLS

Topic: Databases and Bioinformatics Tools PI: DBT01

SNP-ARRAY CHIPS FOR ILLUMINA PLATFORM BASED ON CONSTRAINTS.


Salvanha D M1 , Andrade E S2 , Simões A L2 , Vêncio R Z N1
1 Laboratório de Processamento de Informação Biológica (LabPIB) - Faculdade de Medicina de Ribeirão Preto (Universidade de São Paulo) - Departamento de Genética / Programa Interunidades de Bioinformática.
2 Laboratório de Genética Bioquímica - Faculdade de Medicina de Ribeirão Preto (Universidade de São Paulo) - Departamento de Genética

High-throughput SNP (single nucleotide polymorphism) genotyping has been widely used in population genetics studies. Illumina's BeadChip is one of the most widely used platforms and offers great flexibility and customization for high-throughput SNP genotyping. Designing a SNPArray chip is one of the most important steps in the SNP genotyping process, and choosing the right SNPs to be spotted on a given array slide is, essentially, a data-mining effort. We have therefore been developing an application to identify the best SNPs under the constraints and within the genomic range imposed by the biologist team. As a first use of the application, we are assisting the Biochemical Genetics group at the University of São Paulo's Medical School at Ribeirão Preto in designing a SNPArray chip that will be used to study linkage disequilibrium. This collaboration aims to describe the evolutionary events experienced by Native Brazilian populations. To achieve this goal, a SNPArray chip is being designed with features around the human leukocyte antigen (HLA) complex. The HLA complex is of interest because it is a highly variable genomic region.
The biologist team imposed the following constraints on suitable SNPs: (i) within the HLA complex region, known SNPs 10-10 kb apart in groups of 4 SNPs each, with groups separated by 100-100 kb, for a total of 10 clusters; (ii) on each side of the HLA region (left and right), 2 groups of 4 SNPs 10-10 kb apart separated by 250-250 kb and two groups separated by 500-500 kb, for a total of 4 groups per side; (iii) the chosen SNPs must have a minor allele frequency (MAF) of at least 5%. These constraints maximize the informativeness of the SNPs and reduce the number of SNPs needed to successfully carry out this kind of population study. The application satisfies all constraints imposed by the biologists and was developed using shell script and Java. It runs on any Unix-like platform and can be modified as needed. The source code and more information can be freely obtained at: http://labpib.fmrp.usp.br/dmartinez/range2snp/

Supported by: CAPES
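
A constraint-driven selection of this kind can be illustrated with a small greedy sketch. It is written in Python for brevity and is not the range2snp implementation (which the abstract says uses shell script and Java). It assumes that "10 kb apart" means a minimum spacing of 10,000 bp between consecutive chosen SNPs, applies the 5% MAF filter, and builds a single group of four; the `Snp` type, the `pick_group` function and the candidate SNPs are hypothetical.

```python
from typing import List, NamedTuple


class Snp(NamedTuple):
    rsid: str
    position: int  # base-pair coordinate on the chromosome
    maf: float     # minor allele frequency, 0.0 to 0.5


def pick_group(snps: List[Snp], start: int, min_maf: float = 0.05,
               spacing: int = 10_000, size: int = 4) -> List[Snp]:
    """Greedily pick up to `size` SNPs at or after `start`, each passing the
    MAF filter and lying at least `spacing` bp after the previous pick."""
    chosen: List[Snp] = []
    next_allowed = start
    for snp in sorted(snps, key=lambda s: s.position):
        if snp.position < next_allowed or snp.maf < min_maf:
            continue
        chosen.append(snp)
        if len(chosen) == size:
            break
        next_allowed = snp.position + spacing
    return chosen


# Hypothetical candidate SNPs (identifiers, positions and MAFs are made up).
candidates = [
    Snp("snp_a", 1_000, 0.20),
    Snp("snp_b", 5_000, 0.30),   # too close to snp_a
    Snp("snp_c", 11_000, 0.01),  # fails the MAF filter
    Snp("snp_d", 12_000, 0.10),
    Snp("snp_e", 22_000, 0.40),
    Snp("snp_f", 31_000, 0.25),  # too close to snp_e
    Snp("snp_g", 33_000, 0.06),
]
group = pick_group(candidates, start=0)
selected_ids = [snp.rsid for snp in group]
```

Building the full layout would repeat this step with a new `start` for each group, offset by the inter-group distance given in the constraints.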

Topic: Databases and Bioinformatics Tools PI: DBT04

CNVIEWER, A BROWSER-BASED APPLICATION FOR ANALYSIS OF HUMAN GENOME COPY NUMBER VARIATION
Palu C C1,2 , Vasconcelos A T R1,3 , Almeida J S2
1 Bioinformatics Laboratory, NATIONAL LABORATORY OF SCIENTIFIC COMPUTATION - LNCC
2 Department of Bioinformatics and Computational Biology, THE UNIVERSITY OF TEXAS M D ANDERSON CANCER CENTER - UT MDACC
3 NATIONAL INSTITUTE OF METROLOGY, STANDARDIZATION AND INDUSTRIAL QUALITY - INMETRO

Copy number variations (CNVs) of genomic regions alter the DNA sequence in different ways and can cause changes in gene expression. They have been associated with a variety of pathologies, such as tumor formation. Recently, the emergence of high-throughput techniques has allowed a better understanding of the distribution and occurrence of CNVs, increasing interest in their study and enhancing their potential use in diagnosis and prognosis. The CNV research described here focuses on the computational tools needed to handle the large amount of data and the complex analysis involved. Although different kinds of programs are available for CNV analysis, key features still need to be developed and improved to achieve wide availability. CNViewer is a computational application for exploratory analysis of human CNVs that provides new features and uses a different implementation approach, aiming to overcome barriers to accessibility. The CNViewer application described here can graphically represent several samples simultaneously and compare molecular profiles. Through a dynamic interface, the user can define any combination of genomic regions for analysis and can also display additional data on CNVs. It is therefore possible to query different genomic targets and their relation to complementary information, such as clinical data. All features are available in a friendly interface and explained in the user guide, making CNViewer accessible to users who are not familiar with bioinformatics. CNViewer is free, platform-independent and does not require installation; the project is hosted at http://cnviewer.googlecode.com.
Using only features native to Web browsers (JavaScript), CNViewer explores the novel browser-centric platform to process data and perform tasks quickly, independently of any server: it manages data in memory during use, which enhances user interactivity. In addition, an export module was created that allows users to save and retrieve their analyses in a portable form that can be used for data sharing. CNViewer overcomes the limitations of traditional Web-based programs by running directly in the Web browser, in effect behaving like a desktop application, with the added advantage of not requiring installation or depending on upgrades. The increasing use of Web browsers as a work environment, and more recently even as an operating system, suggests that applications native to this environment could become the norm in biomedical informatics.

Supported by: CAPES, CNPq, FAPERJ, CTSA/NIH and NCI.

Topic: Databases and Bioinformatics Tools PI: DBT05

HUMAN TRANSCRIPTIONAL REGULATION INTERACTION DATABASE (HTRIDB): 2010 UPDATE.


Bovolenta L A1 , Acencio M L1 , Lemke N1
1 Instituto de Biociências de Botucatu, UNESP - Universidade Estadual Paulista

Understanding gene regulation demands accurate and comprehensive knowledge of transcriptional regulation interactions. To respond to the increasing need for an integrated repository of transcription factors (TFs) and their respective regulated genes, various organism-specific and general TF databases have been created, such as the C. elegans differential gene expression database (EDGEdb), the Yeast Search for Transcriptional Regulators And Consensus Tracking (YEASTRACT), the TRANSFAC database and the transcriptional regulatory element database (TRED). The latter two databases contain data on human transcriptional regulation interactions, but both suffer from limitations: TRANSFAC is the leading TF database, but it is marketed as a commercial resource with only a portion of its contents available in the public version; TRED is an open-access database, but it contains a limited number of human TFs (113 TFs from a total of ~1500 known TFs). In an effort to overcome these limitations, we have constructed the Human Transcriptional Regulation Interactions database (HTRIdb), an open-access PostgreSQL-based database of TF-regulated gene interactions with PubMed references to the experimental evidence of regulation. So far, HTRIdb has been populated with ~110 TFs that regulate ~2000 genes. The HTRIdb content can be accessed via a web interface (http://tinyurl.com/lbbc-htridb) where users can retrieve data on either a specific regulatory interaction of interest or all regulatory interactions present in the database. Users can also download the retrieved data in either a spreadsheet or a text format. This work has been supported by CNPq and Fapesp.

Topic: Databases and Bioinformatics Tools PI: DBT03

PESTADTOX, A PIPELINE FOR EST ANNOTATION AND DISCOVERY OF TOXINS.


Carvalho V1 , Mudado M D A1
1 Laboratório de Bioinformática, SBMB/DCB/DPD - Fundação Ezequiel Dias (Funed), Rua Conde Pereira Carneiro, 80, B. Gameleira, Belo Horizonte, MG, Brasil

Small transcriptome projects on venomous animals that still use Sanger sequencing technology are commonly run in research institutes in Brazil, usually with the objective of toxin discovery. In this work we describe a web tool (PESTADTox, Pipeline for EST Annotation and Discovery of Toxins) that has special features for toxin discovery in EST transcriptome projects, such as a search for signal peptides in predicted ORFs, verification of the presence of Koslov's PQM sites for propeptide cleavage, and inspection for cysteine patterns commonly found in neurotoxins. It also performs the basic steps of a common transcriptome pipeline: EST editing, clustering/assembly and annotation. PESTADTox runs on a Linux machine, is written entirely in PHP and Perl, and uses a MySQL database for data storage and retrieval. It works with several freely available software packages for EST editing (PHRED/Crossmatch), EST clustering and assembly (TGICL), sequence alignment (BLAST), ORF finding (getorf from EMBOSS) and signal peptide prediction (SignalP). It also uses public databases and Perl scripts to assign UniProt and GO annotations to transcripts annotated as toxins and KOG classifications to transcripts annotated with cellular functions. We are currently developing several new features for PESTADTox to make toxin discovery and research easier, such as a protein multiple alignment tool, CDD alignment for conserved domain annotation and automated phylogenetic tree construction. PESTADTox is designed to be accessed through an ordinary web browser from an intranet or the internet.

Supported by: Fapemig (NuBio - TCT 12.009/09 and CBB APQ-00163-08)

Topic: Databases and Bioinformatics Tools PI: DBT03

ATLAST4SS: A HIERARCHICAL DATABASE OF TYPE IV SECRETION SYSTEMS


Costa M O C1 , Souza R C1 , Netto D S1 , Lima N C B1 , Klein C C1 , Saji G R Q1 , Vasconcelos A T R1 , Nicolás M F1
1 Laboratório Nacional de Computação Científica

The type IV secretion system (T4SS) can be classified as a large family of macromolecule transporters divided into three recognized sub-families according to their well-known functions. The major sub-family is the conjugation system, which allows the transfer of genetic material, as a nucleoprotein, via cell contact among bacteria. The effector protein transport system constitutes the second sub-family and is indispensable for the infection processes of several mammalian and plant pathogens. The third sub-family corresponds to the DNA uptake/release system involved in genetic transformation competence, independently of cell contact. Several essential features of the T4SSs are well known, but the proper classification and annotation of their system subunits is still not well established. The purpose of this work was therefore to organize, classify and integrate the knowledge about T4SSs into a database, called AtlasT4SS, the first public database devoted exclusively to this bacterial secretion system. AtlasT4SS is a manually curated database that describes a large number of proteins related to the type IV secretion system reported in gram-negative and gram-positive bacteria as well as in archaea. The database was created using the MySQL database management system and the Perl programming language, with a web interface (HTML/CGI). The current version holds a comprehensive collection of 1,235 T4SS proteins from 61 bacteria (56 gram-negative and 5 gram-positive), one archaeon and 12 plasmids. Each T4SS protein record contains the sequence (DNA and protein), predicted topology, cross-references, access to the corresponding phylogenetic tree, and manual annotation data (e.g. function, subcellular location and protein-protein interactions) captured from topology prediction tools and the public scientific literature. In our database we present one way of classifying ortholog groups of T4SSs in a hierarchical classification scheme with three levels.
The first level comprises four classes based on the organization of genetic determinants, shared homologies and evolutionary relationships: (i) F-T4SSs, (ii) P-T4SSs, (iii) I-T4SSs, (iv) GI-T4SSs. The second level designates both specific well-known protein families and uncharacterized protein families, with a total of 86 families currently described in the database. Finally, at the third level, each protein of an ortholog cluster is classified according to its involvement in a specific cellular process, such as conjugation, effector translocation or DNA uptake/release, or as a bifunctional protein. AtlasT4SS will be an open-access database.

Supported by: FAPERJ (fellowship process number E 26/102.214/2009), CNPq (fellowship process number 473707/2010-1) and CAPES.

Topic: Databases and Bioinformatics Tools PI: DBT03

EXPLORING COMPLEX ALTERNATIVE SPLICING EVENTS THROUGH A NEW METHOD BASED ON TERNARY MATRICES
Kroll J E1,2 , Navarro F C P2,3 , Ohara D T2 , Souza S J D2 , Galante P A F2
1 PhD Program of Bioinformatics, IME - University of São Paulo
2 Laboratory of Computational Biology - Ludwig Institute for Cancer Research
3 PhD Program of Biochemistry, IQ - University of São Paulo

Alternative splicing is known to affect more than 90% of all human coding genes and has been proposed as a primary driver of the evolution of phenotypic complexity in mammals. Although the identification of alternative splicing events (ASEs) is one of the most important issues in transcriptome analysis, it is currently limited mostly to simple events, of which the most common are exon skipping, alternative 3' and 5' splice sites, and intron retention. However, many of these simple events occur together in the transcripts of a gene (complex alternative splicing events, CASEs). Until now, CASEs have not been well studied because of their high complexity and the lack of efficient methods for exploring them. Considering these current problems, we developed a new method for the analysis of CASEs based on ternary matrices, which can be easily interpreted and translated to genomic positions. With this method it is possible to identify all kinds of ASEs, including, for example, the uncommon and rarely studied dual-specificity splice sites. Moreover, for the analysis of these matrices we developed a web-based tool named Splooce, which allows users to search for CASEs using simple, human-readable expressions. Splooce shows its results in a very clean interface without hiding complex data and options. Compared to other tools, Splooce has many advantages, such as a user-friendly interface, complete and exportable output, and the option of using the most recent public next-generation sequencing data. Splooce can also be used as a general tool for the analysis of simple and complex alternative splicing events and of ASEs occurring specifically in certain tissues or pathologies. We believe that Splooce and its new methodology will make it possible to explore and understand alternative splicing more deeply, including the influence of ASEs on diseases such as cancer.

Supported by: CAPES

Topic: Databases and Bioinformatics Tools PI: DBT05

PREDIP: AN IN SILICO APPROACH FOR AUTOMATED PROTEIN INTERACTION PREDICTION IN GENOMIC AND TRANSCRIPTOMIC SEQUENCE DATABANKS
Alvarez J C1 , Herai R H1,2 , Carazolle M F1 , Pereira G G A1
1 Laboratório de Genômica e Expressão, Universidade Estadual de Campinas - Unicamp
2 Laboratório de Bioinformática Aplicada, Embrapa Informática Agropecuária - EMBRAPA

The fast-growing number of protein interaction databanks has made it possible to analyze in depth several metabolic and signaling pathways in different organisms. The information available in such databanks can be used to predict new protein interactions in transcriptome or proteome sequences; most of the time this is done by sequence homology or structural similarity. Several in silico systems have already been proposed along these lines, such as Peimap, which bases its predictions on experimentally validated protein-protein interaction databanks from different organisms. A limitation of such tools is that they accept only proteomic sequences as input. Given the importance of protein complex prediction, and the fact that several genomes still lack a transcriptome or an annotated proteome, we propose an integrative Web-based system that automatically performs protein-protein interaction predictions from genomic sequences as well. Our prediction approach is organized in two steps. In the first, the system analyzes the input dataset and uses FrameDP and Augustus, two free software tools, to predict proteins. In the second step, the predicted proteins are aligned against available databases of protein-protein interactions (DIP, Mint, BioGrid, Bind, HPRD, StringInts, IntAct) using PSI-BLAST. Finally, the aligned sequences are analyzed by our pipeline, called PredIP, which creates an interactive chart representing all relationships between each protein and its respective interactors. Our in silico strategy for predicting pairs of interacting proteins can be applied to genomic, transcriptomic or proteomic data; prior knowledge of the transcriptome or proteome of the organism is not necessary. PredIP has already been tested on the prediction of interactions between proteins of the Moniliophthora perniciosa fungus genome. Hundreds of possible protein complexes have been predicted, and they have already been described in the literature as possible.
The PredIP system structure can be incrementally updated with newly discovered protein-protein interactions to perform new predictions. Predicted protein complexes in M. perniciosa have already been shown to exist, and newly predicted complexes will be experimentally validated by our group, enabling a better understanding of the relationships among the fungus proteins. Furthermore, PredIP can be a useful tool for predicting interactions between two distinct organisms. According to the specialized literature, PredIP is the first automated pipeline to predict protein complexes from genomic or transcriptomic sequences.

Topic: Databases and Bioinformatics Tools PI: DBT04

DEVELOPMENT OF A STRUCTURAL DATABASE OF LEISHMANIASIS TARGET PROTEINS AND SPECIFIC BIOACTIVE COMPOUNDS.
Gouveia N M D1 , Amaral L R2 , Lopes J C D3 , Espindola F S1
1 Instituto de Genética e Bioquímica - INGEB, Universidade Federal de Uberlândia, Uberlândia, Minas Gerais
2 Universidade Federal de Goiás, Jataí, Goiás.
3 Núcleo de Estudos de Quimioinformática - NEQUIM, Departamento de Química da Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais

Leishmaniasis is a group of diseases caused by various species of protozoa of the genus Leishmania and is included among the six most serious parasitic infectious diseases in the world, which makes it a burden on the health-care system. Treatments for this disease are toxic, expensive and not always efficient. The identification of new molecules to treat this disease by rapid, efficient and affordable methods is therefore clearly relevant. The analysis of bioactive compounds by computational techniques can be burdened by the large amount of available information. Among the biggest obstacles for virtual screening is the lack of databases of small molecules and target protein structures that are associated with data mining tools with appropriate algorithms. Thus, this study aims to develop a database with information on the structure and function of Leishmania target proteins and their interaction with active compounds. To do so, we constructed a data model that best represents the sequences in FASTA format from several databases (UniProt - 178, TDR Targets - 8,273, PSI TargetsDB - 9,482, SRS - 25,957, GeneDB - 16,249 and NCBI Structure - 19,391) and used this model to build a database system using PostgreSQL. We created software that receives FASTA files as input and inserts all records into the database. Each sequence is read from the FASTA file, and the software then checks whether it has already been entered into the database. If it has not, the software stores the header, the source and the sequence itself. If the sequence has already been inserted into the database, the software stores only the source and the header.
With all the inserted sequences into the database, we generate a report containing all sequences, regardless the source, and then BlastClust (http://toolkit.tuebingen.mpg.de/blastclust) is used to eliminate redundancies among these sequences. The software generated a report with 34,000 non-redundant sequences. Moreover, these sequences are compared to PSI-BLAST to dene the sequences with less similarity to human and canine proteins, so we can build models and use them to study molecular docking. Supported by: CAPES, UFMG, UFU.
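The insertion logic described above (store a new sequence once, but record every source and header under which it appears) can be sketched as follows. This is a minimal illustration, not the authors' software: it uses Python's built-in sqlite3 in place of PostgreSQL so that it runs without a server, and the table names `sequence` and `occurrence` and the function names are hypothetical.

```python
import sqlite3

def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def create_schema(conn):
    # Hypothetical two-table model: one row per distinct sequence,
    # one row per (source, header) under which that sequence was seen.
    conn.executescript("""
        CREATE TABLE sequence (id INTEGER PRIMARY KEY, residues TEXT UNIQUE);
        CREATE TABLE occurrence (
            sequence_id INTEGER REFERENCES sequence(id),
            source TEXT,
            header TEXT);
    """)

def load_fasta(conn, text, source):
    for header, residues in parse_fasta(text):
        row = conn.execute(
            "SELECT id FROM sequence WHERE residues = ?", (residues,)).fetchone()
        if row is None:
            # New sequence: store the residues themselves.
            cur = conn.execute(
                "INSERT INTO sequence (residues) VALUES (?)", (residues,))
            seq_id = cur.lastrowid
        else:
            # Known sequence: only the source and header are added.
            seq_id = row[0]
        conn.execute(
            "INSERT INTO occurrence VALUES (?, ?, ?)", (seq_id, source, header))
    conn.commit()

conn = sqlite3.connect(":memory:")
create_schema(conn)
load_fasta(conn, ">P1 demo\nMKT\nAAL\n>P2 demo\nMSSV\n", "UniProt")
load_fasta(conn, ">G1 demo\nMKTAAL\n", "GeneDB")
n_sequences = conn.execute("SELECT COUNT(*) FROM sequence").fetchone()[0]
n_occurrences = conn.execute("SELECT COUNT(*) FROM occurrence").fetchone()[0]
```

In this toy run the GeneDB record duplicates a UniProt sequence, so only two sequences are stored while three occurrences are recorded.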


Topic: Databases and Bioinformatics Tools PI: DBT05

MINING THE LEISHMANIA GENOME FOR NEW TARGETS TO BIOTECHNOLOGICAL APPLICATIONS


Maciel T E F1 , Sousa R F D1 , Fernandes G D R2 , Ortega J M2 , Fietto J L R1
1 Universidade Federal de Viçosa
2 Universidade Federal de Minas Gerais

Leishmaniasis is a group of diseases with visceral and cutaneous manifestations caused by over 15 different species of the protozoan parasite Leishmania. These are neglected zoonotic diseases, with an estimated 12 million infected people worldwide. The complete genome sequences of L. major, L. braziliensis and L. infantum opened new opportunities for the identification of new targets that could be used in diagnosis and immunization. The main aim of this research is to use genome mining to select proteins potentially exclusive to Leishmania that have a high probability of being useful as diagnostic targets and vaccine candidates. To do this we used the UEKO database (Enriched UniRef KO) [unpublished] to select Leishmania-exclusive ortholog groups that are absent in the main hosts, dog and human. UEKO is an enriched database of UniRef groups with the same structure as KO, but with more sequences than the original databases (KEGG Orthology (KO) and UniProtKB). The database search resulted in 143 ortholog groups containing at least one protein common to the three Leishmania genomes analyzed (all of them absent from the dog and human genomes). Secreted and ecto-localized proteins are more likely to be recognized in a diagnostic test or to work as vaccine candidates. For this reason, the resulting protein sequences were extracted from the ortholog clusters and submitted to the SignalP 3.0 and Phobius programs. With this approach we found 104 sequences containing signal peptides, but only 22 proteins were common to all Leishmania genomes. These proteins were then analyzed with WoLF PSORT to predict their subcellular location. As a result we obtained five secreted proteins. Two other proteins were classified as ecto-membrane proteins after analysis with the GPI-SOM program, which detects signal peptides and GPI anchors.
The next steps in this work are the analysis of expression levels in Leishmania amastigote forms using microarray data from the GEO databank and immunogenicity predictions. Supported by: FAPEMIG, CAPES.


Topic: Databases and Bioinformatics Tools PI: DBT04

VISUALIZATION TOOL FOR SEQUENCE ALIGNMENTS


Samora A D4 , Oliveira D R4 , Monteiro-Vitorello C B3 , Alves W A L2,4 , Paschoal A R1,4
1 Bioinformatics Program, Institute of Mathematics and Statistics, University of São Paulo
2 Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo
3 Department of Genetics, Luiz de Queiroz College of Agriculture, University of São Paulo
4 Department of Computer Science, Nove de Julho University

The cross_match program, which is part of the Phred/Phrap package, is used for comparing any two DNA sequence sets. It is slower but more sensitive than BLAST and is used in a number of ways during genome sequence assembly. Given two sequences of interest, the result is a list of coordinates of the aligned regions between them (pairwise alignments). Cross_match is relatively easy to use, even for biologists lacking extensive bioinformatics training. Many genome sequencing projects around the world use the Phred/Phrap package along with cross_match, even now with the new generation of sequencing machines. The basic limitation of cross_match in comparing any two sequences is the lack of a graphical overview. To help fill this gap, we have developed a standalone program, called GraphicsGS, that allows users to interactively visualize the matching regions, even in complex matching patterns. Using a very simple and friendly control panel window, the user first needs to upload the cross_match output file. Based on this input file, the software reads and extracts all the alignment information to obtain the coordinates of the matching regions. The coordinates are used to plot each matching region as colored lines and also to identify putative specific (unaligned) regions. It is possible to open more than one cross_match alignment result. In this case, each file opens in a separate window, so the user can easily manage each plotted matching region and explore it in detail.
The program allows the user to perform a series of tasks, including: zooming in or out of the alignment graphic; exporting the image in several file formats; exporting the FASTA sequences of the matching regions only; including annotation features for both compared sequences; and producing two different reports (one with all the alignment information and another with only the FASTA sequences of the matching regions). In the graphic view the user can also choose which region type (syntenic or repeat) to display and highlight the region of interest. The website has a simple explanation and screenshots of the program. GraphicGS provides a user-friendly interface for visualizing standard cross_match output when investigating wide-ranging sequence comparison problems, including: (i) syntenic blocks; (ii) repeated regions; and (iii) regions specific to each sequence. This software will help in studies of sequence organization by identifying events such as duplication, translocation, and inversion. GraphicGS is a stand-alone interface accessible at http://www.graphicsgs.com through Java Web Start (the software is freely available).
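One step mentioned above, deriving the putative specific (unaligned) regions of a sequence from the coordinates of its matching regions, can be sketched independently of the cross_match file format. The sketch below assumes the match coordinates have already been parsed into 1-based inclusive (start, end) pairs; the function names are hypothetical and this is not the tool's actual code.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]

def specific_regions(matches, seq_length):
    """Return the stretches of a sequence not covered by any match."""
    gaps, cursor = [], 1
    for start, end in merge_intervals(matches):
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= seq_length:
        gaps.append((cursor, seq_length))
    return gaps

# Toy matches on a 200-bp sequence; the first two overlap.
matches = [(10, 50), (40, 80), (120, 150)]
gaps = specific_regions(matches, 200)
print(gaps)  # [(1, 9), (81, 119), (151, 200)]
```

Merging first keeps overlapping matches (for example repeats) from producing spurious gaps.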


Topic: Databases and Bioinformatics Tools PI: DBT06

EVALUATION OF THE TOPOLOGY OF MIRNA ASSOCIATIVE NETWORKS OF HISTORICAL VERSIONS DEPOSITED IN MIRBASE
Godinho C P D S3 , Silva A C E2 , Weber G1
1 Department of Physics - UFMG
2 Department of Physics - UFOP
3 Department of Biological Sciences - UFOP

MicroRNAs (miRNAs) are a class of non-coding RNAs (ncRNAs), 21 to 25 nucleotides long when mature, which are involved in the regulation of gene expression. This regulation is achieved post-transcriptionally by the process of RNA interference (RNAi), in which miRNA molecules are bound to a protein complex that includes proteins of the Argonaute family, resulting in the RNA-induced silencing complex (RISC). The discovery of new miRNAs is currently a major challenge, and many of the recently identified miRNAs are homologous to those already present in databases. After being discovered and validated, miRNA sequences are typically deposited in miRBase, which is, to our knowledge, the most complete and up-to-date miRNA database available. This database stores miRNA sequences in historical versions, which makes it possible to access them in the order in which they were discovered. In other words, it is possible to retrieve the complete miRBase deposits in the state they were in at certain periods in the past. In the present work we used all the historical versions available at miRBase to study and compare the network structure of known mature miRNAs. The mature sequences were compared to each other using the global Needleman-Wunsch alignment algorithm. For each comparison we obtained a score between 0 (totally different) and 1 (identical). We then built an associative network of mature miRNAs in which two miRNAs were considered linked if their comparison reached a certain minimum score. Typically, for a minimum score starting at 0.7 we obtained a scale-free network, that is, a network driven by a preferential attachment mechanism. Scale-free networks are found in many biological and non-biological networked structures, of which the internet is probably the best known example.
A scale-free network is easily identified by a power law, which appears as a straight line in the log-log plot of the number of nodes with k connections, that is, the connectivity distribution P(k) vs k. The question which arises for miRNAs is whether the scale-free network we observe results from some unknown biological mechanism that promotes preferential attachment or from the way in which new miRNAs are discovered. To find out, we repeated our analysis for all historical versions in miRBase and calculated the linear regression of the straight line which appears in the log-log plot for each database version. We then monitored the R parameter, which indicates how well the data points fit a straight line, with R=1 representing a perfect linear relation. We discovered that over time R approaches 1 monotonically, which is an indication that the scale-free nature of the network is a result of the homology-based methods used to identify new miRNAs. Supported by: Fapemig and CNPq.
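The analysis chain described in this abstract (pairwise global alignment, thresholded links, connectivity distribution, straight-line fit in the log-log plot) can be sketched as follows. Several choices here are assumptions, since the abstract does not give them: the alignment scoring (match 1, mismatch 0, gap 0), the normalization by the longer sequence length to obtain a score in [0, 1], and the use of the Pearson correlation coefficient as the R parameter. All function names are hypothetical.

```python
import math
from collections import Counter

def nw_similarity(a, b, match=1.0, mismatch=0.0, gap=0.0):
    """Needleman-Wunsch global score, scaled by the longer length (assumed)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1] / (match * max(len(a), len(b)))

def degree_distribution(seqs, min_score):
    """Link sequences scoring >= min_score; return {k: number of nodes with k links}."""
    degree = Counter()
    names = list(seqs)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if nw_similarity(seqs[u], seqs[v]) >= min_score:
                degree[u] += 1
                degree[v] += 1
    return Counter(degree.values())

def loglog_fit(dist):
    """Least-squares line through (log10 k, log10 count); returns (slope, r)."""
    xs = [math.log10(k) for k in dist]
    ys = [math.log10(n) for n in dist.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Synthetic power-law distribution, used only to exercise the fit.
slope, r = loglog_fit({1: 100, 10: 10, 100: 1})
toy = degree_distribution({"x": "AAAA", "y": "AAAA", "z": "CCCC"}, 0.7)
```

For a decaying power law the slope and the correlation coefficient are negative, so a comparison with "R approaches 1" would use the absolute value or R squared; which of these the authors monitored is not stated.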


Topic: Databases and Bioinformatics Tools PI: DBT05

FREDOWS: A METHOD TO AUTOMATE MOLECULAR DOCKING SIMULATIONS WITH EXPLICIT RECEPTOR FLEXIBILITY AND SNAPSHOTS SELECTION
Machado K S1 , Schroeder E K2 , Ruiz D D1 , Cohen E M L1 , Norberto de Souza O1
1 Pontifícia Universidade Católica do Rio Grande do Sul - PUCRS
2 Universidade Federal do Rio Grande do Sul - UFRGS

In silico molecular docking is an essential step in modern drug discovery when it is driven by a well-defined macromolecular target. Hence, the process is called structure-based or rational drug design (RDD). In the docking step of RDD the macromolecule or receptor is usually considered a rigid body. However, we know from biology that macromolecules such as enzymes and membrane receptors are inherently flexible. Accounting for this flexibility in molecular docking experiments is not trivial. One possibility, which we call a fully-flexible receptor model, is to use a molecular dynamics simulation trajectory of the receptor to simulate its explicit flexibility. To benefit from this concept, which has been known since 2000, it is essential to develop and improve tools that enable molecular docking simulations of fully-flexible receptor models. We have developed the Flexible-Receptor Docking Workflow System (FReDoWS) to automate molecular docking simulations using a fully-flexible receptor model. In addition, it includes a snapshot selection feature to accelerate the virtual screening of ligands for well-defined disease targets. The usefulness of FReDoWS is demonstrated by investigating the docking of four different ligands to flexible models of the Mycobacterium tuberculosis wild-type InhA enzyme and its mutants I21V and I16T. We find that all four ligands bind effectively to this receptor, as expected from the literature on similar wet-lab experiments. Work that would usually require the manual execution of many computer programs, and the manipulation of thousands of files, was performed efficiently and automatically by FReDoWS. Its friendly interface allows the user to change the docking and execution parameters. In addition, the snapshot selection feature accelerated the docking simulations. We expect FReDoWS to help us explore further the role flexibility plays in receptor-ligand interactions.


Topic: Databases and Bioinformatics Tools PI: DBT04

BUSINESS INTELLIGENCE APPLIED TO THE VERIFICATION OF PDB DATABASE


Freitas J S M1 , Dias S R2
1 Depto Sistemas de Informação, Faculdade Fabrai Anhanguera
2 Depto Sistemas de Informação, Faculdade Fabrai Anhanguera

The growing volume of biological data produced by advanced scientific research has created needs that depend on computational solutions, such as data integration, extraction and handling. Business Intelligence (BI), a methodology originally applied in the business world, uses ETL (Extract, Transform and Load) tools to meet such needs, optimizing the processing of huge amounts of business information through data transformation. In this way, and together with other technologies and tools, BI reduces the time needed to search a database and supports an organization in its ongoing strategic decisions. Kettle is an open-source ETL tool that can assist bioinformatics by offering a friendly interface that facilitates and speeds up data integration, extraction, flow management and handling, without requiring advanced programming knowledge. In this study, the data source is the PDB (Protein Data Bank), a public-domain database that contains three-dimensional structural information on proteins and nucleic acids and is updated by biologists and biochemists from around the world. The structure determination method chosen is X-ray diffraction, which measures atomic coordinates with precision and accounts for about 90% of the protein structures available in the PDB. To verify the integrity of these structures, this study aims to validate them, by means of a free software solution, against the latest version of the specification Atomic Coordinate Entry Format Description v3.2, published in 2008 by the wwPDB, whose role is to establish a standard data format and provide a single PDB file. The first step of this study consists of extracting all the information into Kettle, in accordance with the structure of the database. Then, amino acid sequences and atoms are processed and analyzed in search of possible errors introduced during the annotation process.
Finally, a file is generated listing the identified errors, and their locations, that could invalidate a protein structure for research purposes. Financial Support: Anhanguera Educacional S.A.
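As an illustration of the kind of format check described above, the sketch below validates a few fixed-width fields of ATOM/HETATM records. It is written in plain Python rather than as a Kettle transformation, it covers only a handful of fields, and the function names are hypothetical; the column positions follow the ATOM record layout of the PDB file format (serial in columns 7-11, residue name in 18-20, residue number in 23-26, x, y, z in 31-38, 39-46 and 47-54).

```python
# (field name, start, end, converter) using 0-based Python slice indices.
ATOM_FIELDS = [
    ("serial", 6, 11, int),
    ("resSeq", 22, 26, int),
    ("x", 30, 38, float),
    ("y", 38, 46, float),
    ("z", 46, 54, float),
]

def check_atom_line(line, line_no):
    """Return a list of (line number, field, raw text) problems for one record."""
    if line[:6].strip() not in ("ATOM", "HETATM"):
        return []  # other record types are not checked in this sketch
    line = line.rstrip("\n")
    if len(line) < 54:
        return [(line_no, "length", line)]
    errors = []
    for name, start, end, convert in ATOM_FIELDS:
        raw = line[start:end]
        try:
            convert(raw)
        except ValueError:
            errors.append((line_no, name, raw))
    if not line[17:20].strip():
        errors.append((line_no, "resName", line[17:20]))
    return errors

def demo_line(x_text):
    """Assemble a toy ATOM record field by field so the widths are explicit."""
    return ("ATOM".ljust(6) + "1".rjust(5) + " " + " N".ljust(4) + " "
            + "MET" + " " + "A" + "1".rjust(4) + " " + " " * 3
            + x_text.rjust(8) + "24.430".rjust(8) + "2.614".rjust(8))

ok_errors = check_atom_line(demo_line("27.340"), 1)
bad_errors = check_atom_line(demo_line("27.3x0"), 2)
```

A full validator would cover every record type in the format description; this sketch only shows how a fixed-column check reports the location of an error.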


Topic: Databases and Bioinformatics Tools PI: DBT04

COMMUNITY STRUCTURE ANALYSIS OF CORRELATED MUTATIONS: A TOOL TO IDENTIFY FUNCTIONAL CLASSES IN PROTEIN FAMILIES
Bleicher L1 , Lemke N3 , Garratt R C2
1 Universidade Federal de Minas Gerais
2 USP-São Carlos
3 UNESP-Botucatu

Correlated mutations in multiple sequence alignments have been discussed since the eighties in the context of the "compensatory mutation" effect. They gained considerably more attention in the nineties as a way to search for contact pairs and, later, were believed to report on energetic coupling and allostery (a notion which was eventually disproved). Recently, we observed that correlated amino acid sets in this kind of analysis can be caused by the existence of functional classes: in the Fe/Mn-superoxide dismutase protein family, the correlated sets of amino acids were able to predict the metal uptake (either iron or manganese) and oligomeric state (dimeric or tetrameric) of a given superoxide dismutase, with a prediction rate similar to that of a method based on dataset training using already characterized SODs. This result can be explained by the following argument: if protein families have different functional classes, these must be determined by a set of functional residues. Therefore, if the multiple sequence alignment is sufficiently populated by members of different classes, then communities of correlated residues (related to those classes) should arise from a network analysis of a graph defined by correlated residues. In order to explore this property, we developed a methodology to extract determinants of functional classes by constructing a correlation network, decomposing it into communities and analyzing the results. A software implementation capable of providing a large amount of useful information based on conservation and correlation measures is described. Supported by: FAPESP and FAPEMIG.
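The idea of building a graph of correlated residues and decomposing it into communities can be illustrated with a toy sketch. It is not the authors' method: the correlation measure (Jaccard overlap between the sets of sequences carrying each residue), the frequency and overlap thresholds, and the use of connected components as a stand-in for a real community-detection algorithm are all assumptions made for illustration, and the function names are hypothetical.

```python
from itertools import combinations

def residue_sets(alignment, min_freq=0.2, max_freq=0.8):
    """Map (column, residue) to the set of sequences carrying that residue."""
    n = len(alignment)
    sets = {}
    for idx, seq in enumerate(alignment):
        for col, res in enumerate(seq):
            if res != "-":
                sets.setdefault((col, res), set()).add(idx)
    # Fully conserved or very rare residues carry no class information.
    return {k: v for k, v in sets.items() if min_freq <= len(v) / n <= max_freq}

def correlation_edges(sets, min_jaccard=0.9):
    """Link residues at different columns that occur in nearly the same sequences."""
    edges = []
    for (a, sa), (b, sb) in combinations(sets.items(), 2):
        if a[0] != b[0] and len(sa & sb) / len(sa | sb) >= min_jaccard:
            edges.append((a, b))
    return edges

def communities(edges):
    """Connected components via union-find (stand-in for community detection)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=sorted)

# Toy alignment with two "classes": H/D in columns 0-1 versus Q/E.
alignment = ["HDAK", "HDAR", "QEAK", "QEAR"]
found = communities(correlation_edges(residue_sets(alignment)))
```

In the toy alignment, column 2 is fully conserved and column 3 varies independently of the classes, so only the two class-specific residue pairs end up in communities.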


Topic: Databases and Bioinformatics Tools PI: DBT05

REVERSE VACCINOLOGY: IN SILICO PREDICTION OF EPITOPES AND SUBCELLULAR LOCALIZATION OF PROTEINS IN THE PROTEOME OF LEISHMANIA INFANTUM TO DEVELOP A VACCINE AGAINST CANINE VISCERAL LEISHMANIASIS (CVL)
Brito R C F D2 , Oliveira M M2 , Reis A B3,4 , Corrêa-Oliveira R4 , Ruiz J C5 , Resende D D M2,4
1 Universidade Federal de Ouro Preto
2 Laboratório de Pesquisas Clínicas, Programa de Pós-graduação em Ciências Farmacêuticas, Universidade Federal de Ouro Preto, Ouro Preto, MG, Brazil
3 Laboratório de Imunopatologia, Núcleo de Pesquisas em Ciências Biológicas, Universidade Federal de Ouro Preto, Ouro Preto, MG, Brazil
4 Laboratório de Imunologia Celular e Molecular, Centro de Pesquisas René Rachou - Fiocruz Minas, Belo Horizonte, MG, Brazil
5 Laboratório de Parasitologia Celular e Molecular, Centro de Pesquisas René Rachou - Fiocruz Minas, Belo Horizonte, MG, Brazil

Canine visceral leishmaniasis (CVL) is a zoonosis in Latin America, and dogs have a central role as reservoirs of the parasite and in the transmission of the infection to the vector in the urban cycle of Leishmania infantum. Reverse vaccinology allows in silico prediction of T-cell and B-cell epitopes, which are important for protective immune responses, enabling the design of vaccines with reduced cost and time. Previous studies showed the feasibility of epitope prediction in protozoan proteins using open-source algorithms. Immunoinformatics also uses prediction of the subcellular localization of proteins in order to select as vaccine targets those proteins that are in contact with cells of the immune system, that is, secreted proteins or proteins exposed on the cell membrane. The objective of this work is to select genes from the protozoan L. infantum as candidates for a vaccine against CVL. To this end, the 36 chromosomes of the L. infantum genome were downloaded. The predicted proteome, with 8,214 proteins, was then extracted using the Artemis software and used in all subsequent analyses. Epitope prediction was performed using the following algorithms: (a) for MHC-I, NetCTL and NetMHC; (b) for MHC-II, NetMHCII; (c) for B cells, BepiPred and BCPreds (AAP12 model). Copies of each algorithm were installed on local servers.
The algorithms analyzed 12 human MHC-I alleles and six mouse MHC-I alleles and, in the context of MHC-II, 14 human alleles and three mouse alleles. NetMHC predicted 21,696 strong binders and 195,127 weak binders, for a total of 216,823 predictions. Also for MHC-I, NetCTL totaled 1,511,866 predicted epitopes. NetMHCII predicted 210,849 strong epitopes and 1,243,357 weak epitopes. For B-cell epitopes, BepiPred made 46,986 predictions and AAP12 made 2,212,212 predictions, for a total of 2,259,198 predicted B-cell epitopes. Taken together, the predictions reached 5,442,093; these numbers were obtained after processing the algorithms' reports with specific parsers developed in Perl. The predictions of subcellular localization were made with the following algorithms: (a) Sigcleave, for signal peptides of excretion; (b) TargetP, for secreted and mitochondrial proteins; (c) WoLF PSORT, for protein localization in multiple subcellular compartments. Sigcleave predicted 1,909 signal peptides that indicate secretion of these proteins. TargetP predicted 1,722 secreted proteins and 2,025 mitochondrial proteins. Regarding the subcellular localization of proteins, WoLF PSORT made 784 predictions of extracellular proteins, 1,136 of plasma membrane proteins, 3,159 of nuclear proteins, 1,833 of cytoplasmic proteins and 1,302 of mitochondrial proteins, for a total of 8,214 proteins. The next step in this work is to create a relational database, in a database management system, that will integrate the results of all predictions made for subcellular localization and for B-cell and T-cell epitopes. This database will be populated with all the data obtained, and will make it possible to define vaccine targets in the L. infantum proteome. Supported by: CAPES, FAPEMIG, CNPq, UFOP, FIOCRUZ.


Topic: Databases and Bioinformatics Tools PI: DBT07

COMMUNITY STRUCTURE ANALYSIS OF CORRELATED MUTATIONS IN NUCLEAR HORMONE RECEPTORS.


Bleicher L1 , Afonso M1
1 Departamento de Bioquímica e Imunologia - ICB/UFMG

Nuclear receptors (NRs) are a family of transcription factors which bind small hormones such as thyroxine, estrogen and androgen. They are involved in a wide variety of physiological roles and are represented by 48 members in humans. They share a modular structure in which the main domains are the DBD (DNA binding domain) and the LBD (ligand binding domain), linked by a hinge that allows the two domains to move upon the formation of a superstructure during transcription, which involves dimerization and binding of a co-factor protein. NRs are usually subdivided into three classes: the steroid receptors, such as ER (estrogen receptor) and AR (androgen receptor); the non-steroid receptors, such as TR (thyroid receptor) and VDR (vitamin D receptor); and the orphan receptors, such as NGFI-B, which seem to be unable to bind ligands. This division into classes has received more interest since the discovery of nuclear receptors which seemed similar to liganded homologs in humans but do not bind the expected ligands. In this work, we review the current set of sequenced nuclear receptors using community structure analysis of correlated mutations, analyzing the distribution of functional residues across the different classes of the family in order to identify possible functional class determinants. Supported by: FAPEMIG


Topic: Databases and Bioinformatics Tools PI: DBT03

SEQUENCE RECONSTRUCTION BASED ON PROFILE HMM SEEDS: IMPLEMENTATION AND PROOF OF PRINCIPLE FOR THE DIAGNOSIS OF NOVEL VIRUSES
Oliveira A L D1 , Sobreira T J P2 , Toledo M A F D1 , Zanotto P M D A3 , Gruber A1
1 Department of Parasitology, Institute of Biomedical Sciences, USP, São Paulo SP, Brazil.
2 Laboratory of Cardiology and Molecular Genetics, Heart Institute, USP, São Paulo SP, Brazil.
3 Department of Microbiology, Institute of Biomedical Sciences, USP, São Paulo, Brazil.

Our group has reported the development of GenSeed, a program that implements a seed-driven progressive assembly using nucleotide or protein seed sequences. The method is particularly useful for target-specific reconstruction of DNA sequences using unassembled databases. The original method was severely restricted when using heterologous seed sequences and databases, especially for phylogenetically distant organisms. To circumvent this limitation, we present here GenSeed-HMM, a new version extended with three main features: (1) it can now use profile HMMs as seeds; (2) multiple reconstructions can be performed simultaneously; and (3) it is able to deal with short reads derived from next-generation sequencers. One of the most exciting applications envisaged for this program is the detection of novel viruses. In fact, Palacios et al. reported an interesting clinical case where three patients transplanted with organs from a common donor died from a febrile illness of unknown etiology. A mix of culture, serological and nucleic acid-based assays was unable to define the pathogen. However, high-throughput sequencing of organs from the recipient patients revealed the presence of a novel Old World arenavirus. This case is a paradigm for the potential use of bioinformatics tools to diagnose novel and emergent infectious diseases. A primary assumption of any diagnostic assay is that the target molecule must be known a priori. One way to break this paradigm is to use methods that allow the detection of more distant organisms without compromising the specificity of detection. In fact, even novel viruses possess protein domains that are conserved in known viruses.
Our premise is that profile HMMs derived from known viruses could be used as seeds to reconstruct novel virus genomes from fragmentary data. To test the feasibility of this approach, we initiated a proof-of-principle study. First, we tested the ability of a profile HMM to identify products conceptually translated from the typical short reads of next-generation sequencers. We observed that sequences as short as 33 bp (11 aa) could be detected by profile HMMs. Next, we developed a fully functional beta version of GenSeed-HMM and tested it using profile HMMs derived from the polymerase and envelope proteins of dengue virus types 1 to 4. Using these models as seeds, we were able to successfully reconstruct the genomes of the 39 remaining viruses of the genus Flavivirus, which had been artificially split into short fragments. Finally, we assessed the ability of profile HMMs to detect and specifically reconstruct a particular viral sequence contained in a heterogeneous database (a common real-life situation). For this experiment, we used the TBE virus genome and human chromosome 22 to compose a database, and then split the sequences into short fragments. Using a Flavivirus-derived profile HMM as a seed, we succeeded in specifically reconstructing the TBE genome. We are now working on the optimization of the program code, and we then intend to start studies on real-life metagenomic data to screen for the presence of variant/novel viruses. We envisage that GenSeed-HMM will make possible a de novo diagnosis of novel viruses, with no need for prior information on viral molecules such as antigens or DNA targets. Supported by: FAPESP and CNPq.
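The first test mentioned above relies on conceptually translating short reads before they can be compared with a protein profile HMM. A minimal sketch of six-frame translation with the standard genetic code is given below; it only illustrates why a 33-bp read yields an 11-residue peptide, and it does not reproduce GenSeed-HMM or any HMM search. The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(read):
    """Return the three forward and three reverse-complement translations."""
    forward = read.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (forward, reverse)
            for offset in range(3)]

read = "ATG" * 11          # toy 33-bp read
frames = six_frames(read)  # frame 0 holds 11 residues
```

Each of the six peptides would then be a candidate query for a profile HMM search; frames 0 and 3 of a 33-bp read carry 11 residues, the others 10.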


Topic: Databases and Bioinformatics Tools PI: DBT05

FEATURE SELECTION FOR PROTEIN SEQUENCE CLASSIFICATION BY USING LOGISTIC REGRESSION MODELS AND SINGULAR VALUE DECOMPOSITION
Couto B R G M1,2 , Santoro M M3 , Santos M A4
1 Programa de Doutorado em Bioinformática, Universidade Federal de Minas Gerais / UFMG.
2 Departamento de Ciências Exatas e Tecnologia, Centro Universitário de Belo Horizonte / UNI-BH.
3 Departamento de Bioquímica e Imunologia, UFMG.
4 Departamento de Ciência da Computação, UFMG, Av. Antonio Carlos 6627, Belo Horizonte, Minas Gerais, 31270-010, Brasil.

Background: searching for relevant patterns in protein sequences is a critical goal of bioinformatics. Here we present a computational tool to support genomic research that uses logistic regression models and singular value decomposition for feature selection and protein sequence classification. First, we treat a biomolecular sequence as a complex written language that is recoded as a p-peptide frequency vector using all possible overlapping p-peptide windows. With 20 amino acids, this generates a high-dimensional vector with 20^p entries, where p is the word size. Each row vector corresponds to a peptide, which is analyzed by logistic regression to select features for protein sequence classification. With p=2 we can identify bipeptides associated with a specific group of sequences. Besides peptides, we include sequence length as another candidate feature. The model-building strategy for feature selection was an automatic forward stepwise logistic regression. After the feature selection step, proteins are recoded using only the p-peptides selected as important for each group of sequences. The protein frequency matrix produced for each target group is reduced by singular value decomposition (SVD), in a latent semantic indexing (LSI) information retrieval system. A database with 516,081 sequences from the Swiss-Prot section of the Universal Protein Resource (UniProt) was the protein collection used in all analyses. We tested the method on seven target groups: insulin, globin, keratin, cytochrome, and proteins related to cystic fibrosis, Alzheimer disease and schizophrenia. A case-control study was carried out for each target group. In this approach, sequences from the target group (the cases) are selected from the database for comparison with a series of random sequences in which the protein is absent (the controls). For all groups, the number of cases available in the database is fixed and restricted, and much smaller than the number of controls.
To approach an optimal allocation of cases and controls in each feature selection analysis, we used a 1:4 case:control ratio. Using four random controls for each case compensates for the small number of cases and is enough to detect the features related to each protein group. Results: the combined method was able to identify the amino acids and bipeptides important to each protein group. The sensitivity in classifying unknown sequences using the SVD system based on the initial matrix ranged from 76% for proteins related to Alzheimer disease to more than 90% for the other six groups. After reconstruction of the frequency matrix using only the bipeptides identified by logistic regression, decomposition by SVD and subsequent rank reduction, query retrieval had a sensitivity ranging from 74% for cytochrome to more than 90% for globin, keratin and proteins related to cystic fibrosis and schizophrenia. All specificities were over 90%. Conclusions: besides providing feature selection, combining logistic regression models with singular value decomposition allows better classification of unknown sequences than using SVD alone. The matrices used by the combined method are much smaller than the original one, which leads to optimized oracles. The tool is perfectly scalable and adaptable to huge problems because it is independent of the reference database size and depends much less on the length of the sequences involved.
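The recoding step described in the Background, in which a protein becomes a vector of overlapping p-peptide counts with 20^p entries, can be sketched as follows. Only this encoding is shown; the stepwise logistic regression and the SVD/LSI retrieval are not reproduced. The function names and the choice to skip windows containing non-standard residues are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def peptide_index(p):
    """Map every possible p-peptide to a position in the 20**p vector."""
    return {"".join(t): i for i, t in enumerate(product(AMINO_ACIDS, repeat=p))}

def peptide_frequencies(sequence, p=2):
    """Count all overlapping p-peptides of a protein sequence."""
    index = peptide_index(p)
    vector = [0] * len(index)
    for start in range(len(sequence) - p + 1):
        word = sequence[start:start + p]
        if word in index:  # windows with non-standard residues are skipped
            vector[index[word]] += 1
    return vector

index = peptide_index(2)
vector = peptide_frequencies("MKMKA", p=2)
# The windows are MK, KM, MK, KA, so MK is counted twice.
```

With p=2 each protein is described by 400 bipeptide counts, to which the sequence length can be appended as one more candidate feature, as the abstract describes.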


Topic: Databases and Bioinformatics Tools PI: DBT06

PIDIP - A PIPELINE FOR THE DISCOVERY OF IMMUNOGEN PROTEIN CANDIDATES FOR A STREPTOCOCCUS PNEUMONIAE VACCINE
Carvalho V1 , Leclerqc S1 , Mudado M A1
1 Fundao

Ezequiel Dias (Funed), Rua Conde Pereira Carneiro, 80, B. Gameleira, Belo Horizonte,

MG, Brazil Streptococcus pneumoniae is a Gram-positive bacteria and common etiologic agent of several diseases such as pneumonia, meningitis, otitis and sepsis. Pneumonia kills an estimated 1.8 million children every year more than AIDS, malaria and measles combined. There are already 11 completed genome projects of Streptococcus pneumoniae available in GenBank, with 22.970 annotated proteins (CDS). The objective of this work is to develop a pipeline to identify novel conserved immunogen protein candidates for a Streptococcus pneumoniae vaccine, using free bioinformatics tools. The 11 complete genomes of Streptococcus pneumoniae were downloaded from GenBank (http://www.ncbi.nlm.nih.gov ). All 22.970 annotated proteins were obtained by parsing the GenBank le using the CDS tag. Several softwares were used in order to predict proteins that could be used as potential targets for a future general vaccine (orthologs from all genomes that could be exposed to the immune system of the host). First, a computer program for the prediction of protein localization sites in cells (PSORTb) was used to predict cytoplasmic membrane and cell wall candidates. After this step 4.547 putative membrane proteins were selected. Then, the Signalp software was used to predict the presence of a signal peptide in the selected proteins, revealing 587 proteins to have potential to be exported to cytoplasmic membrane. These sequences were aligned against each other with BLAST to obtain the level of proteins conservation among each genome. Perl scripts and the MySQL database were used to select the core genome of S. pneumoniae (orthologs with 100% of identity and perfect alignment among each other) and to eliminate redundancy, ending with 546 proteins. Protein sequences were selected by groups of equal or more than 6 (from different genomes). 
A total of 19 groups of sequences with 152 proteins were found (5 groups with 6 proteins, 1 group with 7 proteins, 6 groups with 8 proteins, 4 groups with 9 proteins, 2 groups with 10 proteins and 1 group with 11 proteins). All groups were used as input to VirulentPred to predict the virulence potential of the proteins. As a final result, 4 virulent groups with 34 proteins were selected as good vaccine candidates. These proteins belong to groups of 6, 8, 9 and 11 different genomes, respectively. As future work, we intend to develop a web server to make this pipeline publicly available. Supported by: Fapemig (NuBio - TCT 12.009/09 and CBB APQ-00163-08)
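
The grouping step described above (identical orthologs present in at least 6 genomes) could be sketched as follows. This is only an illustration in Python, not the authors' Perl/MySQL code: it uses exact sequence identity as a stand-in for the BLAST criterion of 100% identity with perfect alignment, and the function name and input layout are hypothetical.

```python
from collections import defaultdict

def conserved_groups(proteins, min_genomes=6):
    """Group identical protein sequences and keep groups spanning enough genomes.

    proteins: iterable of (genome_id, protein_id, sequence) tuples (hypothetical layout).
    Returns a list of groups, each a sorted list of (genome_id, protein_id) pairs.
    """
    by_sequence = defaultdict(dict)
    for genome_id, protein_id, sequence in proteins:
        # Keep a single protein per genome for each sequence (removes redundancy).
        by_sequence[sequence].setdefault(genome_id, protein_id)
    groups = [sorted(members.items())
              for members in by_sequence.values()
              if len(members) >= min_genomes]
    # Largest groups first, mirroring the "groups of 6 to 11" summary.
    return sorted(groups, key=len, reverse=True)
```

Each returned group would then be passed to a virulence predictor; that call is left out because it depends on an external tool.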


Topic: Databases and Bioinformatics Tools PI: DBT03

AN ALGORITHM TO SEARCH AND REPAIR ERRORS AND NONCONFORMITIES IN A PDB DATABASE


Lopes K P1 , Silva F G D1 , Dias S R1
1 Faculdade Anhanguera Belo Horizonte

A great event in modern molecular biology was not only the discovery of the DNA structure, but also the conclusion that DNA is the substance that carries genetic information, which flows from nucleic acids to proteins. Since then, with advances in technology and scientific research, as well as in the methods used to sequence biological information, the volume of generated information has increased exponentially, making it a challenge to find the best way of storing and handling it. Biological databases were created to deal with these data. This work uses the Protein Data Bank (PDB), which holds three-dimensional information about protein structures, nucleic acids and complexes, the primary and secondary structure of proteins, as well as angles and distances between atoms. These data are stored in a set of files with a standardized nomenclature: flat files. These files are submitted to periodic review in order to maintain trust in the stored data, and they are organized so that the researcher can extract the necessary information. However, not all files follow the proposed pattern, and some errors go unnoticed by the researcher, which can hinder the manipulation and analysis of the data by other researchers. This work therefore aims to develop and present an algorithm, implemented in Perl, that scans the PDB for errors and nonconformities, based on the entry format description document available at the PDB site. We thus expect to contribute positively to subsequent research based on this database. After reviewing some PDB files against the document above, we noted errors and nonconformities in the PDB database and exported the incorrect entries to a local server to run the scripts.
We already generate a log and suggest corrections for some of the situations listed below (not an exhaustive list): 1) The atoms are not numbered consecutively, indicating that not all atoms were resolved, that is, not all the atoms of the protein are present in the file. 2) The residue numbers are not consecutive, indicating the same as above. 3) The residue number contains a letter in addition to the number. 4) Several residues have the same number. 5) The high-resolution and low-resolution limit values are exchanged. Financial Support: Anhanguera Educacional S.A., FAPEMIG.
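
Checks 1 to 3 above can be illustrated with a short scan of ATOM/HETATM records. The sketch below is written in Python rather than the authors' Perl and is not their implementation; it assumes the fixed-column layout of the PDB format (atom serial in columns 7-11, chain in column 22, residue number in columns 23-26, insertion code in column 27). The function names and the helper that builds test records are hypothetical.

```python
def atom_line(serial, name, res_name, chain, res_seq, i_code=" "):
    """Build a minimal ATOM record (identifier columns only) for demonstration."""
    return f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}{i_code}"

def check_pdb_lines(lines):
    """Return (line_number, message) pairs for simple numbering nonconformities."""
    problems = []
    prev_serial = None
    last_residue = {}  # chain ID -> last residue number seen
    for number, line in enumerate(lines, start=1):
        if line.startswith("TER") and line[6:11].strip():
            prev_serial = int(line[6:11])  # TER records also consume a serial number
            continue
        if not line.startswith(("ATOM  ", "HETATM")):
            continue
        serial = int(line[6:11])
        chain = line[21]
        res_seq = int(line[22:26])
        i_code = line[26]
        # Check 1: atom serial numbers should be consecutive.
        if prev_serial is not None and serial != prev_serial + 1:
            problems.append((number, f"atom serial jumps from {prev_serial} to {serial}"))
        prev_serial = serial
        # Check 2: residue numbers within a chain should not skip values.
        previous = last_residue.get(chain)
        if previous is not None and res_seq - previous > 1:
            problems.append((number, f"residue gap in chain {chain}: {previous} -> {res_seq}"))
        last_residue[chain] = res_seq
        # Check 3: residue number carries a letter (insertion code).
        if i_code != " ":
            problems.append((number, f"residue {res_seq} has insertion code '{i_code}'"))
    return problems

records = [
    atom_line(1, "N", "MET", "A", 1),
    atom_line(2, "CA", "MET", "A", 1),
    atom_line(4, "N", "GLY", "A", 3),        # serial gap and residue gap
    atom_line(5, "CA", "GLY", "A", 3, "A"),  # insertion code
]
for line_number, message in check_pdb_lines(records):
    print(line_number, message)
```

A real scanner would write these messages to the log and also cover the remaining situations, such as duplicated residue numbers and exchanged resolution limits.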


Topic: Databases and Bioinformatics Tools PI: DBT03

RESIDUE-RESIDUE INTERACTION DATABASE: DISULPHIDE BRIDGE DATABASE TO PROPOSE POSSIBLE SITES TO THE MODIFICATION OF PROTEINS
Dias S R1 , Garratt R C2 , Nagem R A P1
1 Instituto de Ciências Biológicas / Departamento de Bioquímica e Imunologia, Universidade Federal de Minas Gerais, Brazil
2 Instituto de Física de São Carlos / Departamento de Física e Informática, Universidade de São Paulo, Brazil

There are algorithms that predict amino acid residue pairs that could form disulphide bridges in a target protein if they were mutated to cysteine. Disulphide bonds have been studied because they can be used to improve protein stability, as reported in several works. For example, some authors studied the increase in the autolytic stability of the enzyme subtilisin BPN' after the introduction of disulphide bridges into its tertiary structure. Others mutated two cysteine residues of aqualysin I to serine and observed the disruption of the disulphide bridges and the loss of the protein's thermostability. The literature contains several studies on protein stability. Some authors developed a function to predict the effects of single or multiple mutations on a protein's stability or reactivity. This function and a number of experimental results in the literature indicate the need for an efficient mechanism to identify acceptable mutations in target proteins, so that an in silico mutant is stereochemically able to exist in vitro. To address this problem, we have created a PDB-based database composed of pairs of interacting amino acid residues. This database can be used to infer two concomitant mutations in a target protein that introduce a novel residue-residue interaction in the mutant and, therefore, increase its thermal and conformational stability. The mutations are proposed so as to maintain the protein's fold (and function), since the main chain conformation of the mutated residues is expected to be conserved.
In this work we present the main aspects of this database and some results for the specific case of disulphide bridges as the residue-residue interaction. Another point is the use of distances between atoms to describe an interaction and to search for two sites in the target protein where a novel interaction can be introduced. These distances were extracted from several disulphide bridges in existing PDB files and analyzed in order to obtain: the average distances between N, Cα, C and O of one residue and those of the other residue in the pair; the average Cα-Cβ-S and Cβ-S-S angles; the number of cysteine residues per sequence; the number of residues between two cysteines that form a bridge; and other parameters that define the disulphide bridge and its neighborhood. A complete search of the target protein structure must be performed to identify each residue pair that could be mutated according to some of the pairs in the database. We intend to use this procedure to evaluate a number of possible mutations in different enzymes with potential application in bioremediation processes, where aggressive environmental conditions are expected. Financial Support: FAPEMIG
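
The search over the target structure can be pictured as an exhaustive loop over residue pairs, keeping those whose backbone distances are close to reference averages. The Python sketch below is an assumption-laden illustration, not the authors' procedure: it compares only same-named backbone atoms (N, CA, C, O), and the reference values, tolerance and data layout are hypothetical placeholders to be supplied by the caller.

```python
import math
from itertools import combinations

BACKBONE = ("N", "CA", "C", "O")

def backbone_distances(res_a, res_b):
    """Distance in angstroms between same-named backbone atoms of two residues."""
    return {name: math.dist(res_a[name], res_b[name]) for name in BACKBONE}

def candidate_pairs(residues, reference, tolerance=1.0, min_separation=2):
    """Return residue-number pairs whose backbone distances match the reference.

    residues: list of dicts with an integer 'id' and (x, y, z) tuples per atom name.
    reference: dict of average distances per atom name (hypothetical values).
    """
    hits = []
    for a, b in combinations(residues, 2):
        if abs(a["id"] - b["id"]) < min_separation:
            continue  # skip neighbours in sequence
        distances = backbone_distances(a, b)
        if all(abs(distances[n] - reference[n]) <= tolerance for n in BACKBONE):
            hits.append((a["id"], b["id"]))
    return hits
```

A fuller version would also test the side-chain angles mentioned above before proposing a pair for mutation to cysteine.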


Topic: Databases and Bioinformatics Tools PI: DBT04

MOLECULAR SEROTYPING TOOL FOR STREPTOCOCCUS PNEUMONIAE


Pais F S1 , Oliveira F S1 , Oliveira M A A2 , Volpini A C1,3 , Oliveira G C1,3 , Coimbra R S1,3
1 Center for Excellence in Bioinformatics - FIOCRUZ-MG
2 Service for Bacterial and Fungal Diseases - LACEN-MG - FUNED
3 Genomics and Computational Biology Group - FIOCRUZ-MG

Streptococcus pneumoniae is a lancet-shaped Gram-positive diplococcus responsible for several infections worldwide, such as otitis, pneumonia, meningitis, and sepsis. The conjugated vaccine produced by Biomanguinhos-FIOCRUZ and distributed by the Brazilian public health system (SUS) covers the ten most prevalent serotypes in this country, but 91 have been described worldwide. Since serotype shifting is a common phenomenon after mass immunization, epidemiological surveillance is crucial for developing new adapted vaccines. Specific immunochemical differences between pneumococcal capsular polysaccharides (CPS) are responsible for this diversity. In classical serotyping, sera from immunized rabbits are used to divide these organisms into serotypes, as well as serogroups. However, the process is expensive, labor-intensive and susceptible to errors due to cross-reactivity with similar CPS antigens. Moreover, classical serotyping is centralized in a few reference laboratories which hold all serotype-specific antisera. We propose a new tool for molecular serotyping of S. pneumoniae. The method is based on restriction fragment length polymorphisms (RFLP) of the PCR-amplified cps loci, which encode the enzymes of the Wzx/Wzy-dependent pathway, responsible for CPS synthesis. Fragments are separated by agarose gel electrophoresis, producing a serotype-specific cps-RFLP pattern. All publicly available cps sequences, representative of the 91 serotypes, were searched for internal endonuclease cleavage sites. A database was produced with the cps-RFLP patterns for each serotype predicted by in silico digestion with the enzyme Xho II. This enzyme cuts each cps locus in at least four different positions. Fragments smaller than 250 bp or larger than 4300 bp were excluded, based on our previous observation that these outliers are more prone to errors in fragment sizing during the experimental procedure. In the end, patterns had 3 to 17 fragments ranging from 254 bp to 4274 bp.
Using our previously published software named Molecular Serotyping Tool (MST), restriction patterns represented by ordered fragment sizes can be aligned and their similarity calculated as the sum of the penalties for the edit operations that transform one pattern into the other. The edit operations are band insertions, band deletions or errors in fragment sizing. MST easily distinguished each serotype-specific cps-RFLP pattern from all others in the database. The rare exceptions correspond to some of the serotypes with reported cross-reactivity in classical serotyping. We are validating our method by comparing the predicted cps-RFLP patterns in our database with those experimentally obtained from a collection of clinical S. pneumoniae isolates from different regions of Minas Gerais State (Brazil) previously characterized by classical serotyping. Real-time epidemiological surveillance may support continuous improvement of pneumococcal vaccines. Supported by: FAPEMIG (CBB-1181/08), NIH-FIC (TW007012), CAPES/CDTS-FIOCRUZ, FIOCRUZ-MG.
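
The two computational steps above, in silico digestion and pattern alignment, can be sketched in Python as follows. This is not the MST code: the recognition site R^GATCY for Xho II is taken from general knowledge of the enzyme rather than from this abstract, and the gap penalty, sizing tolerance and cost function are illustrative assumptions.

```python
import re

# Assumed Xho II site: R^GATCY (R = A/G, Y = C/T), cut after the first base.
XHOII_SITE = re.compile(r"[AG]GATC[CT]")

def digest(sequence, min_bp=250, max_bp=4300):
    """Ordered fragment sizes (bp) kept after digesting a linear amplicon."""
    cuts = [match.start() + 1 for match in XHOII_SITE.finditer(sequence.upper())]
    edges = [0] + cuts + [len(sequence)]
    sizes = [end - start for start, end in zip(edges, edges[1:])]
    return sorted(size for size in sizes if min_bp <= size <= max_bp)

def pattern_distance(p, q, gap=1.0, tolerance=0.10):
    """Sum of edit penalties turning pattern p into q (lists of sizes in bp).

    Band insertion or deletion costs `gap`; pairing two bands costs their
    relative size difference scaled by `tolerance`, capped at 2 * gap.
    """
    def sizing_cost(a, b):
        return min(abs(a - b) / max(a, b) / tolerance, 2 * gap)

    rows, cols = len(p) + 1, len(q) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * gap
    for j in range(1, cols):
        d[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(d[i - 1][j] + gap,      # band deleted
                          d[i][j - 1] + gap,      # band inserted
                          d[i - 1][j - 1] + sizing_cost(p[i - 1], q[j - 1]))
    return d[-1][-1]
```

Under these assumptions, an experimental pattern would be assigned to the database serotype with the smallest distance.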


Topic: Databases and Bioinformatics Tools PI: DBT05

USING DATA MINING TO IDENTIFY HYDROPHOBICITY PATTERNS AND SEQUENCE MOTIFS IN PROTEIN CLASSES
Stelle D1 , Scott L P B1 , Barioni M C1
1 Centro de Matemática, Computação e Cognição (CMCC) - Universidade Federal do ABC

Data mining is the process of extracting patterns from data. Basically, it consists of analyzing data from different perspectives and summarizing it into useful information. Proteins are among the macromolecules that perform the most important tasks in an organism. A protein is composed of a sequence of amino acid residues which folds into a particular three-dimensional structure. This three-dimensional structure is very specific and determines the protein's function. The number of primary structures deposited in databases is growing faster than our ability to solve tertiary structures using experimental methods. Efficient computational techniques can help to predict protein structures and to understand the folding process. Several works have explored this subject. Materials and Methods: Our first tests were carried out with a subset of our local database (DB), consisting of around 20,000 proteins from four folding classes (α-helix, β-sheet, α/β and α+β). The amino acids that compose these proteins were organized in the DB in windows of different sizes (7, 11, 15 and 21), and each window is associated with its structural motif (this classification was obtained from DSSP). Results: The results obtained so far indicate that this technique is efficient for investigating hydrophobicity patterns to be used in 2D and 3D prediction and for identifying sequence motif candidates. Some patterns for β-sheets are present in all windows (7, 11, 15 and 21). We also observe several highly hydrophobic patterns for β-sheets, in agreement with the literature. An interesting and different result is the presence of some highly polar patterns for β-sheets, which needs further investigation. The patterns found for α-helices are also consistent with data from the literature. The results indicate that the algorithm (the Apriori algorithm) works well to identify these hydrophobicity patterns.
Conclusions: This work presents a methodology and software to extract association rules from the secondary structures of proteins. Considering these initial results, the implemented algorithm worked well. We obtained several interesting results, for example that short α-helices have more specific hydrophobic patterns and that β-sheets present high hydrophobicity. We conclude that association rules can be an efficient way to study this kind of problem and can be used to identify possible sequence motifs and hydrophobicity patterns. Some of the identified patterns are being verified through 2D prediction and molecular dynamics. As future work, we will test the algorithm with a database of transmembrane proteins and viral enzymes (proteases). We intend to identify hydrophobicity patterns and sequence motif candidates for these special classes of proteins.
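
The windowing and rule extraction described above can be illustrated with a deliberately simplified Python sketch. It does not implement the full Apriori algorithm: it only counts support and confidence for rules of the form "hydrophobicity pattern -> structural class". The binary hydrophobic alphabet, thresholds and function names are assumptions made for illustration.

```python
from collections import Counter

HYDROPHOBIC = set("AVLIMFWC")  # assumed binary hydrophobic/polar split

def encode(window):
    """Map residues to H (hydrophobic) or P (polar)."""
    return "".join("H" if residue in HYDROPHOBIC else "P" for residue in window)

def windows(sequence, labels, size):
    """Yield (encoded window, label of the central residue) for an odd window size."""
    half = size // 2
    for i in range(half, len(sequence) - half):
        yield encode(sequence[i - half:i + half + 1]), labels[i]

def pattern_rules(examples, size, min_support=2, min_confidence=0.8):
    """Return (pattern, label, support, confidence) rules above both thresholds.

    examples: iterable of (sequence, per-residue structure labels) pairs.
    """
    pattern_counts, pair_counts = Counter(), Counter()
    for sequence, labels in examples:
        for pattern, label in windows(sequence, labels, size):
            pattern_counts[pattern] += 1
            pair_counts[(pattern, label)] += 1
    rules = []
    for (pattern, label), support in pair_counts.items():
        confidence = support / pattern_counts[pattern]
        if support >= min_support and confidence >= min_confidence:
            rules.append((pattern, label, support, confidence))
    return sorted(rules, key=lambda rule: (-rule[2], rule[0]))
```

Running this for each window size (7, 11, 15 and 21) would show which patterns persist across sizes.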


Topic: Databases and Bioinformatics Tools PI: DBT03

ORGANIZING OF DATA WAREHOUSE USING PERL AND DBMS FOR GENOTYPING DATA
Andrade L G D1 , Muniz M N M1,2 , Tagliatti R F1 , Guedes E1 , Silva M V G B D1 , Arbex W A1
1 Laboratório de Bioinformática e Genômica Animal - Embrapa Gado de Leite - MG - Brazil
2 Núcleo de Bioinformática - Universidade Federal de Juiz de Fora/Embrapa Gado de Leite - MG - Brazil

The process of genotyping SNPs in domestic animals for use in animal breeding programs generates large data sets, all of which need to be collected and stored for further analysis. An extract-transform-load (ETL) process can be very useful for selecting the desired data from a raw database. It is then possible to add relations between data sets in a data warehouse and use it to support decisions through the analysis of minor allele frequency or any other parameter that needs to be implemented, for instance, in a stored procedure. For this study, three cattle breeds were used: (a) Gyr (319 animals); (b) Nellore (959 animals); and (c) Girolando (54 animals), together with a database management system (DBMS). A total of 77,256,000 genotype records were generated, from 58,000 SNP markers for each of the 1,332 animals (Illumina Bovine SNP50 BeadChip). MySQL was chosen as the DBMS for its high performance in query operations. The DBMS showed higher performance than text files, in addition to reducing data corruption. This study showed that integrating a DBMS and ETL applications to create a data warehouse for SNP genotyping data in domestic animals is more efficient than the traditional approach based on text files. Financial support: FAPEMIG, Embrapa, CNPq
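
A minimal sketch of the load step and a minor allele frequency (MAF) query is given below. It is not the authors' implementation: Python's built-in SQLite is swapped in for MySQL so that the example runs without a server, and the table layout, column names and no-call filter are hypothetical.

```python
import sqlite3

VALID_ALLELES = {"A", "C", "G", "T"}

def load_genotypes(conn, rows):
    """Load (animal_id, breed, snp_id, allele1, allele2) rows, dropping no-calls."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS genotype ("
        "animal_id TEXT, breed TEXT, snp_id TEXT, allele1 TEXT, allele2 TEXT)"
    )
    # Transform step: keep only rows where both alleles were called.
    clean = [r for r in rows if r[3] in VALID_ALLELES and r[4] in VALID_ALLELES]
    conn.executemany("INSERT INTO genotype VALUES (?, ?, ?, ?, ?)", clean)
    conn.commit()
    return len(clean)

MAF_QUERY = """
WITH alleles AS (
    SELECT snp_id, allele1 AS allele FROM genotype
    UNION ALL
    SELECT snp_id, allele2 AS allele FROM genotype
),
counts AS (
    SELECT snp_id, allele, COUNT(*) AS n FROM alleles GROUP BY snp_id, allele
)
SELECT snp_id,
       CASE WHEN COUNT(*) < 2 THEN 0.0
            ELSE 1.0 * MIN(n) / SUM(n) END AS maf
FROM counts
GROUP BY snp_id
ORDER BY snp_id
"""

def minor_allele_frequencies(conn):
    """Return {snp_id: MAF}; monomorphic SNPs get 0.0."""
    return dict(conn.execute(MAF_QUERY).fetchall())
```

In a MySQL deployment the same query could live in a stored procedure or view, as the abstract suggests.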


Topic: Databases and Bioinformatics Tools PI: DBT04

PROTR-3D: A WEB TOOL FOR PAIRWISE AND MULTIPLE PROTEIN STRUCTURE ALIGNMENT
Souza R O1,2 , Dardenne L E2 , Goliatt P V Z C2
1 Instituto Superior de Tecnologia em Computação Científica de Petrópolis - ISTCCP
2 Grupo de Modelagem Molecular de Sistemas Biológicos - LNCC/MCT

Due to the growing interest in the comparative analysis of three-dimensional (3D) structures of proteins from different organisms, we need flexible tools to measure how similar these structures are to each other, indicating the degree of evolutionary relationship between them. This comparison can be performed by superposing the 3D protein structures and calculating the distances between corresponding atoms, which yields the root mean square deviation (RMSD). The smaller the RMSD, the greater the structural similarity between the proteins involved in the analysis. To calculate the optimal RMSD value, we used the Kabsch method, which returns the best rotation of the atoms of the protein to be aligned. In this work, we developed ProtR-3D, a program written in C++ that uses C libraries for the calculation and minimization of the RMSD. A web-based system was developed in PHP so that any remote user can connect and access all the functionality of this program. ProtR-3D is freely available at http://www.gmmsb.lncc.br/index.php?pg=16 . The ProtR-3D program works with a single input file in PDB (Protein Data Bank) format containing a set of protein structures, or with multiple files, each containing a single protein structure. The user can choose to perform a global or local alignment (the region is determined by the user) of the backbone atoms (C-O-N-C alpha) or of only the alpha carbons of the proteins. Moreover, if there are missing atoms in some of the submitted proteins, the program automatically disregards them in the other ones in order to meet the requirements for a successful RMSD calculation. After processing the input files, ProtR-3D returns the RMSD value of each submitted protein relative to the first one. A resulting PDB file containing the 3D alignment is also returned. This structural alignment can be downloaded or visualized at the ProtR-3D site using the Jmol applet.
To validate the ProtR-3D program, we performed different tests of protein structure alignment using three calmodulin structures (PDB 1CLL, PDB 1CLM and PDB 1CFD). The results showed that the ProtR-3D program presents similar performance (time and RMSD values) to the web servers DaliLite, SuperPose and MAMMOTH. Keywords: Bioinformatics Tools, Pairwise Protein Structure Alignment, Multiple Protein Structure Alignment Acknowledgments: This work was supported by CNPq.
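
For readers unfamiliar with the Kabsch method, the following Python sketch shows the core calculation: center both coordinate sets, obtain the optimal rotation from a singular value decomposition, and report the RMSD after superposition. It relies on NumPy, which is an assumption of this example; ProtR-3D itself is written in C++ and its internals are not reproduced here.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """Minimal RMSD between paired coordinate sets P and Q (N x 3 each)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Covariance matrix and its singular value decomposition.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that the result is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate P onto Q and measure the remaining deviation.
    P_rot = Pc @ R.T
    return float(np.sqrt(((P_rot - Qc) ** 2).sum() / len(P)))
```

When atoms are missing in one structure, the same atoms must be dropped from the other before calling the function, because the method requires a one-to-one pairing.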


Topic: Databases and Bioinformatics Tools PI: DBT04

PHARMASITE: BINDING SITE COMPARISON BY 3D FINGERPRINTS


Santos F M D2 , Domingues B F1,2 , Lopes J C D1,2
1 Universidade Federal de Minas Gerais
2 NEQUIM Chemoinformatics Group - UFMG

The binding sites of pharmaceutical targets are of crucial importance in understanding and predicting the biological function of a protein. They are also important in designing safer drugs to modulate protein activity [1]. By associating proteins with their functions, bioinformatics can and should make major contributions to health in the coming decades. The main objective of this work is to compare the active sites of proteins from a data set by generating pharmacophore models and transforming them into fingerprints. We developed a set of tools that enable the comparison of protein binding sites by encoding them as fingerprints that can be easily and efficiently compared and ranked. For the proteins in the data set of Aung [2], the PDB files were first converted into pharmacophore models by the THINK software (Treweren Consultants) and by PharmaSite (NEQUIM, which uses alpha carbons), with the pharmacophoric points calculated for each PDB-Ligand. According to the very recent IUPAC definition, a pharmacophore model is an ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target and to trigger (or block) its biological response. These pharmacophore models were then converted into fingerprints and compared all against all, using computational algorithms developed at NEQUIM-UFMG (Pharmagen and Pharmacompare, respectively). The resulting similarity matrices can be visualized with the MEV software (TM4 Microarray Software Suite), allowing the active sites to be clustered by their similarity. For validation, several datasets containing classes of binding sites and decoy sites were used. We compare all binding sites of a given dataset against each other and expect the comparison tool to successfully separate the active set from the decoys.
From these results, we expect to gain knowledge for the development of a tool that facilitates the comparison of sites and the identification of potentially active ligands and targets, and for the development of the pharmacophore model that best represents the characteristics of the targets' binding sites in the different conformations available in data banks for the same protein. Supported by: CNPq, CAPES
References:
[1] Xie, L.; Li, J.; Bourne, P.E. PLoS Comput. Biol. 2009
[2] Aung, Z.; Tong, J. C. Genome Inf. 2008
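
An all-against-all fingerprint comparison of the kind described in this abstract can be sketched in Python as below. The abstract does not state which similarity measure Pharmacompare uses; the Tanimoto coefficient shown here is a common choice for bit fingerprints and is an assumption, as are the function names and the representation of a fingerprint as a set of on-bit positions.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def similarity_matrix(fingerprints):
    """Return (names, matrix) for an all-against-all comparison.

    fingerprints: dict mapping a binding-site name to its set of on-bits.
    """
    names = sorted(fingerprints)
    matrix = [[tanimoto(fingerprints[x], fingerprints[y]) for y in names]
              for x in names]
    return names, matrix
```

The resulting matrix could be exported as a tab-delimited table for clustering in a viewer such as MEV.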


Topic: Databases and Bioinformatics Tools PI: DBT04

DEVELOPING A WORKFLOW FOR MANAGEMENT OF INFORMATION FROM A PLANT TRANSFORMATION FACILITY


Botelho C1 , Folgueras-Flatschart A V2 , Faria-Campos A C1 , Fernandes-Rausch H1 , Noda R W2 , Carneiro A A2 , Campos S V A1
1 Laboratório de Universalização de Acesso - DCC - UFMG
2 Núcleo de Biologia Aplicada - EMBRAPA Milho e Sorgo

Good Laboratory Practices (GLP) are a set of management principles for research laboratory activities that provide a framework within which studies are planned and performed in order to ensure the consistency and reliability of results. To enforce the adoption of GLP, it is important to demonstrate that the information produced and the processes applied to obtain it are correct and traceable. Therefore, automated systems that follow each step of the research in a laboratory can be of significant assistance. These systems are called Laboratory Information Management Systems (LIMS) and are now often used in research laboratories. In this work we describe the use of a workflow-based LIMS to manage the data from the Plant Genetic Transformation Laboratory at EMBRAPA Milho e Sorgo, which works on the analysis and production of transgenic maize and sorghum plants. Transgenics are organisms whose DNA is modified by genetic engineering in order to introduce a new characteristic that does not occur naturally in that species. We have developed a workflow, to be used in the LIMS Flux (Satya Sistemas), to manage the information from the processes used to obtain genetically modified plants. A workflow in XPDL format was constructed using the Together Workflow Editor tool and imported into the system, where data are stored in a MySQL database. This workflow models the data and the activities of maize transformation mediated by Agrobacterium tumefaciens, in which the user defines the genetic construct specifications, the plant genotypes and the conditions for transformation, transgenic callus selection and plant regeneration. Finally, from workbench to glasshouse, the system stores experimental protocols and data from the whole process up to obtaining and storing modified seeds, thus ensuring that GLP are followed.
This work presents a solution for managing the information produced by a laboratory working with transgenic plants, and we hope it will contribute to improving the quality of the data produced. Supported by: FAPEMIG, CNPq.


Topic: Databases and Bioinformatics Tools PI: DBT04

SCHISTODB 2.5 - AN UPDATED SCHISTOSOMA MANSONI BIOINFORMATICS RESOURCE


Zerlotini A1 , Pais F S1 , Simões M C1 , Oliveira G C1,2
1 Center for Excellence in Bioinformatics - FIOCRUZ-MG
2 Genomics and Computational Biology Group - FIOCRUZ-MG

The database SchistoDB (http://schistodb.net) is a community bioinformatics resource for parasitic organisms of the genus Schistosoma, the causative agents of schistosomiasis worldwide. The database currently incorporates all available Schistosoma mansoni sequence data in a single user-friendly location. This new version of SchistoDB integrates recently published expression data from platforms such as microarray, SAGE, and ChIP-Seq. Microarray data were extracted from an article published by Corrêa Soares et al., and this data set enables us to identify genes that are differentially expressed in the presence of quinoline methanols. SAGE data were extracted from two articles, published by Taft et al. and Ojopi et al. These SAGE tags provide evidence of gene expression in different developmental stages: 6- and 20-day sporocysts, miracidia and adult worms. ChIP-Seq data were loaded from a dataset published by Cosseau et al., containing more than 20 million sequences. This experiment delivered sequences from the adult, miracidia and cercaria developmental stages of strain GH2 and from miracidia of strain BRE. This information sheds light on the regulation of gene transcription by histones. SchistoDB 2.5 is undergoing final tuning to efficiently display this amount of data, and there are still microarray data sets to be loaded. We are also working on the visualization of the expression data from a metabolic pathway perspective. We plan to provide annual releases, and the next one will include new data, analyses and tools, provided through an improved user interface. Keywords: Schistosoma mansoni, database, GUS, bioinformatics, genomics Supported by: FAPEMIG (CBB-1181/08), NIH-FIC (TW007012), FIOCRUZ-MG


Topic: Databases and Bioinformatics Tools PI: DBT04

CONSTRUCTION OF A WORKFLOW FOR MANAGEMENT AND STORAGE OF MEDICAL RECORDS FROM NEUROMYELITIS OPTICA PATIENTS
Faria-Campos A C1 , Hanke L A1 , Fernandes-Rausch H1 , Cysne M B4 , Talim L E C4 , Rocha C F4 , Lana-Peixoto M A4 , Oliveira G C2 , Coimbra R S2 , Melo A3 , Campos S V A1
1 Laboratório de Universalização de Acesso - UFMG
2 Centro de Excelência em Bioinformática - FIOCRUZ-MG
3 Satya Sistemas
4 Centro de Investigação em Esclerose Múltipla - UFMG
5 Grupo de Estudo de Genômica e Biologia Computacional - FIOCRUZ-MG

The management of large amounts of data is a clear demand nowadays, and the use of Laboratory Information Management Systems (LIMS) to achieve this is growing each day. The need for such systems in medical enterprises also keeps increasing. A LIMS is a complex computational system used to manage laboratory data with an emphasis on quality assurance. Several LIMS are currently available, and many of them use workflows as an approach to manage data. LIMS are usually designed to meet the needs of one kind of laboratory. However, the needs for management and storage of medical data on several diseases have not yet been properly addressed. Neuromyelitis optica (NMO) is an idiopathic inflammatory demyelinating disease of the central nervous system that predominantly affects the optic nerves and spinal cord. Clinical, radiologic, and immunologic features distinguish neuromyelitis optica from other severe cases of multiple sclerosis, and the amount of data from interviews with patients is increasing every day. So far, all the information regarding these aspects has been kept in medical charts, with no automatic system for patient follow-up, statistical analysis or data mining. In this work we propose the development of a workflow for the analysis of NMO patient data, to be used in the LIMS Flux (Satya Sistemas). This system will enable more efficient storage and retrieval of patient data, assisting the analysis performed by physicians. Furthermore, our tool will enable mining of the data pertaining to the disease, making it possible to identify patterns in the data and increase the accuracy of diagnosis. This is particularly important for NMO because it is a rare disease that is often diagnosed incorrectly as multiple sclerosis.
This workflow will be implemented in the Flux system, a flexible LIMS that can use different workflows to express the properties of the different types of experiments and data being analyzed. Through the construction of the NMO workflow we intend to simplify data collection and retrieval from medical interviews. We thereby aim to contribute to the management of information in NMO study centers and clinics, providing an automatic system for the management and analysis of important data that can assist in distinguishing NMO from multiple sclerosis and in identifying patients at high risk of recurrent myelitis and optic neuritis. Supported by: CNPq (PROTCOMP-483628/2009-3, 306879/2009-3), CAPES (PNPD-0156086), FAPEMIG (SIGCINAPQ02226-09, 485/2009), NIH-Fogarty (D43-TW007012), CBB-1181/08 and FIOCRUZ-MG.


Topic: Databases and Bioinformatics Tools PI: DBT07

FILE FINGERPRINTING: A NEW METHODOLOGY TO REDUCE THE TIME AND COMPLEXITY OF UPDATING AND SYNCHRONIZING A LOCAL DATABASE WITH THE NCBI GENBANK
Guizelini D1 , Pedrosa F D O1 , Gehlen M A C1 , Tieppo E1 , Marchaukoski J N1 , Raittz R T1
1 Federal University of Paraná

Current DNA sequencing technologies are causing exponential growth in the number of sequences in public databases. The number of sequence records in the NCBI database doubles approximately every 35 months. NCBI updates its information daily and releases a full GenBank update bi-monthly. Updating and synchronizing local databases with the information provided by NCBI demands heavy data traffic (downloads) and strategies to identify changes in data files. File fingerprinting reduces both the time and the amount of data copied, parsed and loaded from the international database. Digital signatures (fingerprints) of files can be generated using the MD5 hash algorithm. Each signature is recorded in a new file named with the original name plus the suffix md5. This method can be applied to each file available in public databases. In each scan cycle, upgrades and updates are detected through signature comparisons. Using the proposed method, computing the signature of the file NC_003070.gbk (51,646 KB) takes 3 seconds. The previous signature can be retrieved from the local database in approximately 3 seconds. In less than 6 seconds it is therefore possible to identify whether a file has been modified. This approach avoids the longer period (about 9 minutes) required to parse and load a file of the size of NC_003070.gbk into the local database. These results are even more significant when the local database holds more than the 2100 files available in GenBank for Bacteria and Archaea. Associating signature files with the data loaded into local databases enabled the identification of variations in file contents. The use of digital signatures reduces the complexity of identifying files that were modified between update and synchronization cycles of the database. The use and availability of signatures for the files containing gene information, provided by NCBI GenBank and other public databases, would facilitate and significantly reduce the download of files. Supported by: CNPq, CAPES, UFPR, INCT - FBN
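
The signature comparison described above can be sketched with Python's standard hashlib module. This is an illustration rather than the authors' code: here the previous signature is read from a sidecar file named after the original file plus ".md5", whereas the abstract also mentions retrieving it from the local database; the function names are hypothetical.

```python
import hashlib
from pathlib import Path

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def needs_reload(path):
    """Return True if the file is new or changed since the last recorded signature.

    The signature is stored next to the file as '<name>.md5' and is updated
    whenever a change is detected.
    """
    path = Path(path)
    sidecar = path.with_name(path.name + ".md5")
    current = md5_of_file(path)
    previous = sidecar.read_text().strip() if sidecar.exists() else None
    if current == previous:
        return False  # unchanged: skip parsing and loading
    sidecar.write_text(current + "\n")
    return True
```

Only files for which `needs_reload` returns True would go through the expensive parse-and-load step.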

Topic: Databases and Bioinformatics Tools PI: DBT03

CNBI: THE NEW BRAZILIAN NATIONAL CONSORTIUM FOR BIOINFORMATICS


Herai R H1,2 , Vidal R O2,3 , Carazzolle M F2 , Costa G G L2 , Falcao P R K1 , Yamagishi M E B1 , Franchini K G3 , Pereira G G A2
1 Laboratório de Bioinformática Aplicada (LBA), Embrapa Informática Agropecuária - EMBRAPA
2 Laboratório de Genômica e Expressão (LGE), Departamento de Genética, Evolução e Bioagentes - UNICAMP
3 Laboratório Nacional de Biociências (LNBio), Associação Brasileira de Tecnologia de Luz Síncrotron - ABTLuS

PRESENTATION: The National Consortium for Bioinformatics (CNBi) is a multidisciplinary center dedicated to research, development, management and technical support on questions arising from bioinformatics. It was created by a joint effort of three laboratories from São Paulo state in Brazil: the Genomic and Expression Laboratory (UNICAMP), the Applied Bioinformatics Laboratory (EMBRAPA/CNPTIA) and the National Laboratory of Biosciences (LNBio), and is organized as a distributed center inside these laboratories.

MISSION: As a national center for bioinformatics, CNBi's mission is the generation of new biotechnological information and of advanced methods of computer-based information processing. It also intends to act in the fields of genomics, proteomics (including protein structure and mass cryptography) and systems biology. To this end, CNBi brings together scientists from many knowledge areas, including computer science, molecular biology, genetics, mathematics, and physics, all of whom share a common interest.

STRUCTURE: CNBi provides an advanced infrastructure with powerful machines. Currently, it comprises several high-performance computational systems capable of analyzing and simulating massive data sets; very large storage devices to house major data collections; high-speed networking services to facilitate location-independent access and collaboration among investigators; software applications supporting research in bioinformatics; and technical support staff.

CONCLUSIONS: CNBi plays an important role in producing innovative research with high impact in the scientific community, helping in the understanding of the fundamental molecular and genetic processes that control any synthetic biological system or living organism, and of how to change them. Moreover, this collaborative consortium intends to contribute to developing Brazil's skill base in bioinformatics through the training of undergraduate and postgraduate students and postdoctoral fellows.

ACKNOWLEDGEMENTS: Members of the National Consortium for Bioinformatics (CNBi) are financially supported by several governmental research foundations, and are grateful to FAPESP, CNPq, EMBRAPA, UNICAMP and ABTLuS.

Topic: Databases and Bioinformatics Tools PI: DBT05

DIVERGENOME: A BIOINFORMATICS PLATFORM TO ASSIST THE ANALYSIS OF GENETIC VARIATION


Magalhaes W C S1 , Silva D1 , Rodrigues M1 , Faria-Campos A1 , Tarazona-Santos E1
1 Universidade Federal de Minas Gerais

High-throughput data production has revolutionized molecular biology, a fact that may be mainly attributed to the development of next-generation sequencing (NGS) technologies. However, a massive increase in data generation requires new approaches to store these data and efficient ways to retrieve and analyze them. Although complex systems exist to keep track of the flow of laboratory samples for generating genotype data, there are few tools that manage and set up analyses on the generated data. Here we developed a bioinformatics platform, DIVERGENOME, to assist population genetics and genetic epidemiology studies. DIVERGENOME is a web-accessible, open-source platform that helps investigators store and analyze genetic and epidemiologic datasets. The platform contains two components. The first component, DIVERGENOMEdb, is a relational database developed using MySQL. It safely stores individual genotypes from three different types of data: contigs (resulting from resequencing projects), SNPs/INDELs, and microsatellites. Genotype data can be linked to a description of the protocols used to generate them. Individuals can be linked to populations, as well as to individual phenotypic information collected in genetic epidemiology studies using different kinds of variables. The cross-link between genomic and phenotypic information allows access to a large body of information to find answers to several biological questions. The database structure also permits easy integration with other data types and opens up prospects for future implementations. The second component, DIVERGENOMEtools, is a dynamic pipeline composed of a set of scripts, developed in the Perl programming language, that converts both the results of queries submitted to the database and independent files into many popular file formats required by well-known software in population genetics and genetic epidemiology. DIVERGENOME is accessed through a web-based interface offering users simple interaction and friendly navigation. The PHP language (http://php.net) was used to design the dynamic web interface and to implement the scripts that perform requests to the MySQL database and the Apache web server (http://www.apache.org). To guarantee portability and accessibility, the system was fully tested on different operating systems and web browsers. Supported by: CAPES, CNPq, FAPEMIG, NHI
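The links between individuals, populations, genotypes, protocols and phenotypes described for DIVERGENOMEdb can be pictured with a small relational sketch. Everything below is an assumption made for illustration: the table and column names are invented, the sample rows are placeholders, and SQLite stands in for the MySQL server used by the authors.

```python
import sqlite3

# Hypothetical schema; it is not the published DIVERGENOMEdb schema.
SCHEMA = """
CREATE TABLE population (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE individual (
    id INTEGER PRIMARY KEY,
    label TEXT NOT NULL,
    population_id INTEGER NOT NULL REFERENCES population(id)
);
CREATE TABLE protocol (id INTEGER PRIMARY KEY, description TEXT NOT NULL);
CREATE TABLE genotype (
    id INTEGER PRIMARY KEY,
    individual_id INTEGER NOT NULL REFERENCES individual(id),
    protocol_id INTEGER REFERENCES protocol(id),
    marker TEXT NOT NULL,
    marker_type TEXT NOT NULL
        CHECK (marker_type IN ('contig', 'snp_indel', 'microsatellite')),
    value TEXT NOT NULL
);
CREATE TABLE phenotype (
    id INTEGER PRIMARY KEY,
    individual_id INTEGER NOT NULL REFERENCES individual(id),
    variable TEXT NOT NULL,
    value TEXT NOT NULL
);
"""


def build_demo():
    """Create an in-memory database with a few placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO population VALUES (1, 'PopA')")
    conn.executemany("INSERT INTO individual VALUES (?, ?, ?)",
                     [(1, "ind1", 1), (2, "ind2", 1)])
    conn.execute("INSERT INTO protocol VALUES (1, 'placeholder genotyping protocol')")
    conn.executemany("INSERT INTO genotype VALUES (?, ?, ?, ?, ?, ?)",
                     [(1, 1, 1, "marker_1", "snp_indel", "A/G"),
                      (2, 2, 1, "marker_1", "snp_indel", "G/G")])
    conn.executemany("INSERT INTO phenotype VALUES (?, ?, ?, ?)",
                     [(1, 1, "height_cm", "170"), (2, 2, "height_cm", "165")])
    return conn


def genotypes_with_phenotype(conn, marker, variable):
    """Join genotype and phenotype records through the individual they share."""
    rows = conn.execute(
        """
        SELECT i.label, p.name, g.value, ph.value
        FROM genotype AS g
        JOIN individual AS i ON i.id = g.individual_id
        JOIN population AS p ON p.id = i.population_id
        JOIN phenotype AS ph ON ph.individual_id = i.id
        WHERE g.marker = ? AND ph.variable = ?
        ORDER BY i.label
        """,
        (marker, variable),
    )
    return rows.fetchall()
```

A query such as `genotypes_with_phenotype(build_demo(), "marker_1", "height_cm")` returns one row per individual combining population, genotype and phenotype, which is the kind of cross-link the abstract refers to.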

Topic: Databases and Bioinformatics Tools PI: DBT03

A GRAPH-BASED APPROACH TO DYNAMIC AND EXTENSIBLE PIPELINES


Rodrigues M R1 , Magalhães W C S1 , Corgozinho P M1 , Machado M1 , Tarazona-Santos E1
1 Universidade Federal de Minas Gerais

In silico experiments are generally executed through a set of computer-processing steps that follow a specific order. Integration is needed so that the heterogeneous tools executing at each step can be combined in the same experiment automatically. What is needed for integration in most cases is a set of independent tools that perform specific tasks, such as format-specific conversions, and a pipeline that combines these different tools in a coordinated way to achieve the desired outcome of the in silico experiment. That is, a pipeline makes sure that the output of one tool is passed on as the input of another tool, and so forth. Therefore, developing effective pipelines is an essential task for the majority of research in bioinformatics. What we see today is a large number of available pipelines designed in a static way, with the execution steps hardcoded into programs and scripts. This approach has a major drawback when the pipeline is expanding: the inclusion of new tools is costly in terms of manual work and is error-prone. More specifically, it requires an experienced programmer to change the hardcoded steps to include new tools in the pipeline while guaranteeing that it keeps working correctly. This is a big concern not only for bioinformatics laboratories that want to continuously update their pipelines with new software developments, but also for the consolidation of open and cooperative systems in bioinformatics. If we want to develop bioinformatics systems (in our case, pipelines) that are open to external collaborators, we need easily extendable systems that automatically incorporate third-party tools. In this case, pipelines must be arranged on-the-fly in a dynamic way, instead of being statically programmed. Here we present a graph-based approach towards this view of dynamic and extensible pipelines. The idea is to represent the connectivity of tools with a directed graph in which data or file formats are the graph vertices and the programs or scripts that process them are the graph edges. Therefore, if there is an edge (E) connecting two vertices (A) and (B), with (E) being the outgoing edge of (A) and the incoming edge of (B), it means that script (E) receives data or file format (A) as input and generates format (B) as output. To implement that, our approach comprises four elements: (i) a tool Registry containing the scripts and their accepted input and output data formats, (ii) a graph representation of the registry, (iii) a graph-traversing algorithm, and (iv) the dynamic pipeline algorithm. The latter algorithm works generally as follows: (1) it receives as input the Registry file and the start and end points of the pipeline; (2) it builds a graph based on the registry file; (3) it applies the graph-traversing algorithm to find a path through the graph connecting the start and end points received as input; (4) it executes the path returned in step 3. With this approach, to incorporate a new tool in the pipeline, one needs only to update the tool Registry, and the dynamic pipeline algorithm is responsible for generating the new pipeline on-the-fly, during execution. Therefore, our proposed graph-based approach enables extensible pipelines and contributes towards consolidating openness and collaboration in bioinformatics systems. We also present a case study of a dynamic pipeline applied to in silico experiments in the field of Population Genetics. Supported by: CAPES, CNPq, FAPEMIG, NHI.
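The four steps of the dynamic pipeline algorithm can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the registry entries and tool names are invented, the registry is an in-memory list instead of a file, and breadth-first search is used as the graph-traversing algorithm because the abstract does not say which traversal is applied.

```python
from collections import deque

# Hypothetical registry: (tool name, input format, output format).
REGISTRY = [
    ("fasta2phylip", "fasta", "phylip"),
    ("phylip2nexus", "phylip", "nexus"),
    ("fasta2arlequin", "fasta", "arlequin"),
]


def build_graph(registry):
    """Formats become vertices; each tool becomes a directed edge."""
    graph = {}
    for tool, source, target in registry:
        graph.setdefault(source, []).append((tool, target))
        graph.setdefault(target, [])
    return graph


def find_pipeline(graph, start, end):
    """Breadth-first search for a chain of tools converting start into end.

    Returns the list of tool names in execution order, or None when no
    conversion path exists.
    """
    if start not in graph or end not in graph:
        return None
    came_from = {start: None}
    queue = deque([start])
    while queue:
        fmt = queue.popleft()
        if fmt == end:
            break
        for tool, nxt in graph[fmt]:
            if nxt not in came_from:
                came_from[nxt] = (fmt, tool)
                queue.append(nxt)
    if end not in came_from:
        return None
    tools = []
    node = end
    while came_from[node] is not None:
        node, tool = came_from[node]
        tools.append(tool)
    tools.reverse()
    return tools


def run_pipeline(tools, data, implementations):
    """Execute the path: the output of each tool feeds the next one."""
    for tool in tools:
        data = implementations[tool](data)
    return data
```

Adding a tool then amounts to appending one tuple to the registry; the path is recomputed on the next call to `find_pipeline`, with no change to the pipeline code itself.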

Topic: Databases and Bioinformatics Tools PI: DBT03

SEMANTIC-BASED SIMILARITY OF DRUG TARGET PROTEINS


Santos E C D1,2 , Santos M A D3 , Lopes J C D1
1 NEQUIM - Núcleo de Estudos de Quimioinformática, Departamento de Química - Universidade Federal de Minas Gerais
2 Programa de Pós-Graduação em Bioinformática, Instituto de Ciências Biológicas - Universidade Federal de Minas Gerais
3 Departamento de Ciências da Computação - Universidade Federal de Minas Gerais

In general, semantics is the study of meaning. Semantic similarity is a concept whereby a metric is assigned to terms or documents in a set according to the likeness of their meaning, in a pragmatic approach (i.e. considering how the context contributes to meaning). Broadly speaking, two objects are semantically similar if they are related to similar objects. A semantic similarity measure may reveal new correlations that cannot be found by strictly direct queries in relational databases. Furthermore, semantic-based similarities may be determined over data held in the form of annotations, which are more suitable for humans. Therefore, they can be used to explore knowledge in scientific data resources. Indeed, the use of semantic-based similarities across the Gene Ontology (GO) has been evaluated in the literature. In a study of the druggable genome, a list of 130 InterPro entries was identified as sufficient for selecting druggable proteins. This suggests that it may be possible to establish a semantic similarity measure based on InterPro annotation capable of identifying new potential protein drug targets. In addition, Singular Value Decomposition (SVD) is known to be useful for retrieving latent information. We developed a semantic-based similarity measure based on annotation related to proteins validated as drug targets. Using SVD, we developed a new computational tool to identify and classify potential protein drug targets. We applied clustering algorithms to our measure matrix, and the results were highly compatible with those produced by sequence alignments (blastall). Our tool allows the retrieval of latent information across the annotation terms and may suggest new correspondences that cannot be obtained by direct queries in relational databases or by sequence alignment. We are now implementing statistical analyses of the clusters obtained with this measure. We hope this will make it possible to identify the annotation entries of interest for potential druggable targets and their associations with the ensemble of validated drug targets by exploring their semantic similarities. It will also permit new mappings to other classifiers and ontologies such as OMIM and MeSH. Supported by: FAPEMIG and CNPq
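The idea of comparing proteins through their annotations, rather than their sequences, can be shown with a very small sketch. It computes cosine similarity between proteins represented as sets of annotation identifiers. It is only an illustration: the SVD projection that the abstract relies on to expose latent relations is deliberately left out, and the protein names and InterPro-style identifiers are placeholders, not data from the study.

```python
import math

# Placeholder annotations: protein name -> set of annotation identifiers.
ANNOTATIONS = {
    "protein_A": {"IPR000001", "IPR000002", "IPR000003"},
    "protein_B": {"IPR000001", "IPR000002"},
    "protein_C": {"IPR000004"},
}


def cosine_similarity(terms_a, terms_b):
    """Cosine similarity of two binary annotation vectors given as sets."""
    if not terms_a or not terms_b:
        return 0.0
    shared = len(terms_a & terms_b)
    return shared / math.sqrt(len(terms_a) * len(terms_b))


def similarity_matrix(annotations):
    """Return the sorted protein names and their pairwise similarity matrix."""
    names = sorted(annotations)
    matrix = [
        [cosine_similarity(annotations[a], annotations[b]) for b in names]
        for a in names
    ]
    return names, matrix
```

A matrix of this kind is what a clustering algorithm would take as input; in the approach described above, the annotation vectors would first be reduced with SVD so that proteins sharing no identical term can still be found similar.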

Topic: Databases and Bioinformatics Tools PI: DBT05

ANALYSIS OF AMINO ACID SUBSTITUTIONS BY USE OF ERROR-CORRECTING CODES


Faria L C B1 , Rocha A S L1 , Kleinschmidt J H3 , Silva-Filho M C2 , Jr R P1
1 Unicamp 2 USP 3 UFABC

Although the genetic information in the cell is often exposed to every type of interference, this information is passed on from one generation to the next with high fidelity. This suggests the existence of error-correction mechanisms similar to the ones employed in coded digital communication systems. Error-correcting codes (ECC) are used with the aim of storing and/or conveying information with high accuracy and fidelity. We have proposed a biological coding system which generates DNA sequences of length $2^n - 1$, with $n$ a positive integer, by use of BCH codes. Here we show that the proposed biological coding system, in addition to generating certain DNA sequences, reproduces enzyme kinetics. The case studied is the recognition of the malate dehydrogenase (MDH1-21) synthetic peptide by MPP. The selected sequence is from Rattus norvegicus MDH (MDH1-21 Rn). A codeword generated by the BCH code with generator polynomial $g(x) = x^6 + x^5 + x^4 + 2x^2 + 3x + 1$ successfully reproduced the MDH1-21 Rn sequence, differing in one nucleotide but otherwise identical with respect to the amino acid sequence. We term this sequence an MDH1-21 code. In addition, we analyzed the effects of replacing the arginine residues in positions 7, 14, and 15 with lysine and alanine residues and compared them to biochemical assays. All possible codons corresponding to lysine and alanine residues were considered, and it was verified that only the MDHKR sequence was reproduced by the very same BCH code, as expected. The results indicate that the substitutions in the peptides MDHRK and MDHKK were more dramatic than the replacements carried out in peptide MDHKR. These results are remarkable, considering that the kinetic parameters can be accurately reproduced by the BCH code.

Topic: Databases and Bioinformatics Tools PI: DBT05

DEVELOPMENT AND VALIDATION OF AN EPITOPE PREDICTION PIPELINE


Lobo F P1 , Bueno L L1 , Rodrigues-Luiz G1 , Mendes T A D O1 , Miranda R R C D1 , Ávila R A M D2 , Freitas L1 , Braga M1 , Fujiwara R T1 , Bartholomeu D C1
1 Department of Parasitology, Federal University of Minas Gerais
2 Department of Biochemistry, Federal University of Minas Gerais

Several bioinformatics tools have been designed to characterize proteins based solely on their primary sequence. Some of these tools can predict properties that are useful to infer possible immunogenic regions within a given protein, a topic of great interest in parasitological/immunological studies. Examples of such properties are the predicted location of the protein within a cell/organism, possible B and/or T cell epitopes, intrinsically unstructured regions, and similarity with host sequences or with other pathogens. The co-occurrence of several such properties in a given protein is, consequently, expected to increase the likelihood of finding true immunogenic regions. Therefore, such bioinformatics tools could, theoretically, select a small number of very good vaccine/diagnostic candidates from a large pool of protein sequences. Nevertheless, the programs that calculate these individual properties are not immediately compatible, since they were independently developed by distinct groups and for distinct purposes. Therefore, the individual outputs of these programs must be parsed and integrated in a clear and intuitive way that allows researchers to easily select potential immunogenic proteins. However, there is a surprising lack of initiatives to integrate the aforementioned properties into a final and coherent metric that allows one to successfully distinguish between potential immunogenic and non-immunogenic regions. In this study we developed and evaluated an epitope prediction pipeline that integrates the individual prediction scores of several programs useful for identifying potential immunogenic regions into a final immunogenicity score. We used, among others, 1) BepiPred for B cell epitope prediction; 2) NetCTL for cytotoxic T-lymphocyte epitope prediction; 3) IUPred for intrinsically unstructured regions; 4) SignalP for signal peptide presence; and 5) TMHMM for transmembrane regions. The user is able to customize all steps in any individual analysis, such as the significance cutoffs for each feature evaluated, distinct feature weights and data formats (EST, CDS or proteins), among others. The pipeline is executed sequentially using user-chosen parameters, and the prediction results are automatically parsed and dynamically shown as distinct protein features in a webpage using Bio::Graphics-based scripts. In vitro results obtained using peptides predicted by our methodology demonstrated that our pipeline is able to successfully detect immunogenic regions in Plasmodium vivax proteins. With the complete genome sequences of several parasites currently available, and many more expected to be released shortly, our method has the potential to open new avenues for large-scale epitope prediction. Supported by CAPES, FAPEMIG, CNPq
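One way to picture the integration of per-feature predictions into a single immunogenicity score, with user-defined cutoffs and weights, is the sketch below. The scoring rule (a weighted fraction of the features that pass their cutoff at each residue) is an assumption made for illustration; the abstract does not state the formula used by the pipeline, and the feature names and values are invented.

```python
def immunogenicity_scores(features, weights, cutoffs):
    """Combine per-residue feature scores into one score per residue.

    features: dict mapping a feature name to a list of per-residue scores.
    weights:  dict mapping a feature name to its user-chosen weight.
    cutoffs:  dict mapping a feature name to its user-chosen cutoff.
    The result at each position is the weighted fraction of features whose
    score reaches the cutoff, so it always lies between 0 and 1.
    """
    lengths = {len(values) for values in features.values()}
    if len(lengths) != 1:
        raise ValueError("all features must cover the same number of residues")
    n_residues = lengths.pop()
    total_weight = sum(weights[name] for name in features)
    scores = []
    for i in range(n_residues):
        passed = sum(
            weights[name]
            for name, values in features.items()
            if values[i] >= cutoffs[name]
        )
        scores.append(passed / total_weight)
    return scores


def candidate_regions(scores, threshold, min_length):
    """Return (start, end) half-open intervals of consecutive high-scoring residues."""
    regions = []
    start = None
    # A sentinel below any threshold closes a region that reaches the last residue.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    return regions
```

Changing a weight or a cutoff changes the ranking of candidate regions without touching the underlying predictors, which is the kind of customization the abstract describes.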

Topic: Databases and Bioinformatics Tools PI: DBT04

ADAPTATION OF RSCU ALGORITHM FOR DETECTION OF GENOMIC ISLANDS


Inácio S F C1 , Alves F I A1 , Farias S T D1 , Rego T G D1,2 , Cavalcanti D D1 , Rocha P K L1
1 Laboratório de Bioinformática - Universidade Federal da Paraíba
2 Unidade Acadêmica do Desenvolvimento Tecnológico - Universidade Federal de Campina Grande

Genomic islands (GIs) are relatively large segments of DNA, usually ranging in size from 10 kb to 200 kb, acquired through horizontal transfer. They can be identified through comparison of phylogenetically close strains, or through statistical analysis of nucleotide composition, such as GC content, codon usage and dinucleotide frequency, among others [1]. The identification of genomic islands by codon usage is based on the fact that each genome has a preference in its use of codons; therefore, genes acquired by horizontal transfer will have a codon and amino acid usage different from that of the host genome. RSCU (Relative Synonymous Codon Usage) is a measure used to estimate the usage bias of synonymous codons: the observed frequency of a codon j that codes for an amino acid z in a gene is divided by its expected frequency (E), which is obtained by dividing the count of amino acid z in the sequence by the number of synonymous codons that code for that amino acid [2]. In this work, we present an adaptation of the algorithm for calculating RSCU, to use it as a tool for the identification of genomic islands. For this, we searched the genomes of the organisms for groups of genes whose combined length was equal to or larger than 10 kb and in which the codon usage bias of each gene differed from that of the genome as a whole. To evaluate the algorithm we used a database of genomic islands described in the literature for five organisms: E. coli O157:H7 EDL933, E. coli O157:H7 str. Sakai, S. pyogenes MGAS315, S. typhimurium LT2 and S. enterica subsp. enterica serovar Typhi str. CT18 [3], with which we calculated the accuracy of the method. Our results show that the adaptation of the algorithm for calculating RSCU is very efficient in detecting genomic islands, being able to detect islands that had previously been identified only by homology-dependent methods. The accuracy of the new methodology was 85.9%, similar to that of programs such as IslandPath, which reached an accuracy of 86.2% [3].

References:
[1] Juhas M., van der Meer J. R., Gaillard M., Harding R. M., Hood D. W., Crook D. W. 2008. Genomic islands: tools of bacterial horizontal gene transfer and evolution. FEMS Microbiol. Rev. 33: 376-393.
[2] Liu Y. S., Zhou J. H., Chen H. T., Ma L. N., Ding Y. Z., Wang J. 2010. Analysis of synonymous codon usage in porcine reproductive and respiratory syndrome virus. Infect Genet Evol. (6): 797-803.
[3] Langille M. G. I., Hsiao W. W. L., Brinkman F. S. L. 2008. Evaluation of genomic island predictors using a comparative genomics approach. BMC Bioinformatics 9: 329.
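For reference, the standard RSCU value of codon j of amino acid z is $\mathrm{RSCU}_{zj} = x_{zj} / \left(\frac{1}{n_z}\sum_{k=1}^{n_z} x_{zk}\right)$, where $x_{zj}$ is the observed count of the codon and $n_z$ is the number of synonymous codons of z. The sketch below computes only this basic quantity for a single coding sequence, assuming the standard genetic code; it does not reproduce the authors' adaptation for scanning groups of genes.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons into synonymous families, one family per amino acid.
SYNONYMS = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS[_aa].append(_codon)


def rscu(sequence):
    """Return a dict codon -> RSCU for an in-frame coding DNA sequence.

    Stop codons, codons with ambiguous bases and amino acids absent from
    the sequence are skipped.
    """
    sequence = sequence.upper()
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    values = {}
    for aa, family in SYNONYMS.items():
        if aa == "*":
            continue
        total = sum(counts[c] for c in family)
        if total == 0:
            continue
        expected = total / len(family)  # count if all synonyms were used equally
        for codon in family:
            values[codon] = counts[codon] / expected
    return values
```

A value of 1 means a codon is used as often as expected under equal usage of its synonyms; a gene whose RSCU profile departs strongly from the genome-wide profile is the kind of signal a codon-usage island detector looks for.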

Topic: Databases and Bioinformatics Tools PI: DBT04

THE SURFACEOME EXPRESSION IN BREAST CANCER CELL LINES USING NEXT-GENERATION SEQUENCING DATA
Virgili N S1,2 , Galante P A F2 , Souza J E D2 , Old L J3 , Souza S J D2
1 Departamento de Bioquímica - IQ - Universidade de São Paulo
2 Laboratório de Biologia Computacional - Instituto Ludwig de Pesquisa sobre o Câncer
3 Ludwig Institute For Cancer Research - New York

High-throughput gene expression analyses in tumors versus normal samples have yielded useful results, accelerating the rate of discoveries in cancer biology and the identification of possible therapeutic targets. The development of RNA sequencing (RNAseq) protocols using next-generation sequencing (NGS) platforms has been revolutionizing the field of gene expression analysis. Although some recent studies have used RNAseq to study differential gene expression in tumors, none of them focused on a specific class of genes, such as genes encoding cell surface proteins, which are ideally the best targets for cancer therapy and diagnosis. In a previous work, we identified a set of 3700 human genes encoding cell surface transmembrane proteins (called the surfaceome). Here, we used a bioinformatics approach and publicly available RNAseq data to study surfaceome gene expression in five human breast cancer cell lines and one normal breast tissue. Computational analysis was carried out to identify sets of surfaceome genes differentially expressed in the tumor cell lines and to better understand the role of these genes in breast tumors. We also used different data sources, such as protein-protein interaction databases, KEGG pathways, GO classifications and public somatic mutation information, to explore the differentially expressed surfaceome genes. We found 289 genes exclusively expressed and 309 additional genes overexpressed in cancer cell lines. Among these, 43 genes contain at least one somatic mutation already identified in breast tumors. Furthermore, a pair-wise comparison was performed between the five breast cancer cell lines, and a set of genes showing cell-line-specific expression was identified. We believe that the gene expression profiling generated in this study, together with an integrative bioinformatics analysis, may be an initial step towards finding new therapeutic and diagnostic molecular targets for breast cancer.

Topic: Databases and Bioinformatics Tools PI: DBT05

COMPUTATIONAL PLATFORM FOR THE INTEGRATION OF CLINICAL AND BIOMOLECULAR DATA


Miyoshi N S B2 , Pinheiro D G1 , Junior W A D S1 , Felipe J C2
1 Blood Center of Ribeirão Preto - Faculty of Medicine of Ribeirão Preto - University of São Paulo
2 Department of Physics and Mathematics - Faculty of Philosophy, Sciences and Letters of Ribeirão Preto - University of São Paulo

Translational medicine is the application of basic research results, especially those coming from "omics" technologies, to health and disease processes. This new area of research seeks to reduce the gap that exists between the bench and the bedside. This is a great challenge with many barriers to overcome, one of the most important of which is related to the nature of the data. Clinical and molecular data are very different in nature, although they are often closely related. A global analysis across different levels of information is necessary when studying the complex mechanisms responsible for the onset of pathological processes. To make this possible, two aspects of data handling must be well defined: storage and analysis. It is necessary to provide a computational platform and a data model that can store and represent clinical and biomolecular information consistently. From a well-formalized and structured model it will be possible to design novel methods of computational analysis. In the area of genomics there are several models of biological databases, such as ACeDB, Ensembl and Chado. These models serve as the basis for the construction of computational tools for genomic analysis in an organism-independent way. ACeDB (A C. elegans DataBase) was one of the first models of biological databases. It used a hierarchical DBMS schema, was initially built to support research on C. elegans, and was subsequently adapted for other organisms. Ensembl was initially developed to support human genome research and currently has several computational tools associated with it, such as BioMart and EnsMart. A biological database model that has gained popularity among research groups working on different organisms is Chado. Chado was proposed by the FlyBase group and is currently a component of the GMOD (Generic Model Organism Database) project. Chado is a modular relational database schema. It is composed of eighteen modules, each comprising a set of tables, triggers and functions responsible for managing information from a subdomain of genomics. Chado is extensible because it allows the incorporation of new modules and, if necessary, changes to existing ones. What distinguishes Chado from other generic biological database models is its intensive use of ontologies. Ontology plays a central role because all stored information should be related to some ontology or controlled vocabulary. Chado, like other biological databases, does not have a module to store clinical and sociodemographic information. It is in this context that the present work aims at defining a computing platform that aggregates clinical and molecular data in a consistent way, enabling the development of computational analyses to be applied in the field of translational medicine. In this project, we consider using the Chado model as the basic genomic data model and propose the creation of a new module to store clinical information and allow research in translational medicine. As a use case we will implement our platform to support the project "Oncogenomics Applied to Therapy of Head and Neck Carcinoma" from the GENOPROT Network (CNPq). Through this project it will be possible to integrate sequence data, gene expression data from microarrays, microRNA data and disease association data with the clinical and sociodemographic features of the patients who provided samples for laboratory testing.

Topic: Databases and Bioinformatics Tools PI: DBT04

TORNADO: AN AUTOMATED PIPELINE FOR DE NOVO HYBRID GENOME ASSEMBLY BASED ON FREE SOFTWARE PACKAGES FOR SANGER AND NEXT GENERATION SEQUENCING TECHNOLOGIES (NGS).
Herai R H1,2 , Costa G G L D1 , Júnior O R1 , Vidal R O1,4 , Nascimento L C1 , Parizzi L P1 , Pereira G G A1,4 , Carazzolle M F1,3
1 Genomics and Expression Laboratory (LGE), Genetics, Evolution and Bioagents Department, Biology Institute (IB), State University of Campinas (UNICAMP)
2 Applied Bioinformatics Laboratory (LBA), Informatics and Agropecuary Subdivision (CNPTIA), Brazilian Agricultural Research Corporation (EMBRAPA)
3 National Center for High Performance Computing (CENAPAD)
4 National Laboratory of Biosciences (LNBio), Brazilian Association for Synchrotron Light Technology

Next-generation sequencing (NGS) technologies have made it possible to sequence entire genomes quickly and at low cost, from unicellular to complex organisms such as plants and mammals. These sequences can be assembled (i) using a reference genome or (ii) by a de novo bioinformatics method, such as Velvet, SOAPdenovo, Edena, ABySS, GS Assembler 454, Mira and ZORRO. These methods are mainly based on de Bruijn graphs or, in a few programs, on read overlaps to form contigs and scaffolds. The filtering and assembly steps involved are very sensitive to the type of tool, and can be a key factor in generating the best assembly results. Thus, when a set of sequences from distinct technologies exists, from Sanger to NGS, it is necessary to use a distinct assembly strategy for each type of data. Currently, to our knowledge, there are no automated hybrid strategies based on the combined use of distinct assembly programs that can be applied to assemble hybrid data generated by NGS or Sanger platforms. This work presents TORNADO, an automated pipeline for hybrid genome assembly based on free software packages. TORNADO does not propose new methods for genome assembly; it simply uses the best described software strategy for each type of genomic data to perform the hybrid assembly. It is organized in two main modules that are configured by an XML file. In the first module, input data are filtered by trimming and by clipping experimental artifacts. In the second module, based on sequence type, TORNADO automatically performs the assembly using Mira for 454 and Sanger reads, and Velvet for Illumina/Solexa or SOLiD/Life Tech reads. Finally, the individual assemblies are merged into a single assembly using ZORRO. If there are paired-end (mate-pair) reads, an additional step uses the CloseGaps software, which closes the gaps between assembled scaffolds. TORNADO has already been applied to assemble hybrid genomic reads from the fungus Moniliophthora perniciosa, the causal agent of witches' broom disease in cacao. Results showed that our strategy can work as a useful method to automatically assemble hybrid genome data. TORNADO was implemented using the Java and Perl programming languages.
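The routing logic described for the two TORNADO modules can be summarized as a small planning function. This is a sketch of the decision flow only: it returns the ordered list of steps and does not invoke any assembler, the input structure and step names are invented for illustration, and TORNADO itself is configured through an XML file and implemented in Java and Perl.

```python
# Assembler chosen per sequencing platform, following the abstract.
ASSEMBLER_BY_PLATFORM = {
    "sanger": "Mira",
    "454": "Mira",
    "illumina": "Velvet",
    "solid": "Velvet",
}


def plan_assembly(read_sets):
    """Return the ordered steps for a hybrid assembly.

    read_sets is a list of dicts with two hypothetical keys:
    'platform' (str) and 'paired' (bool, True for mate-pair/paired-end data).
    """
    steps = []
    by_assembler = {}
    for read_set in read_sets:
        platform = read_set["platform"].lower()
        if platform not in ASSEMBLER_BY_PLATFORM:
            raise ValueError("unsupported platform: " + read_set["platform"])
        # Module 1: every read set is filtered (trimming, artifact clipping).
        steps.append(("filter", platform))
        by_assembler.setdefault(ASSEMBLER_BY_PLATFORM[platform], []).append(platform)
    # Module 2: one assembly per assembler, then a merge of the assemblies.
    for assembler, platforms in by_assembler.items():
        steps.append(("assemble", assembler, platforms))
    steps.append(("merge", "ZORRO"))
    # Optional gap closing when paired reads are available.
    if any(read_set["paired"] for read_set in read_sets):
        steps.append(("close_gaps", "CloseGaps"))
    return steps
```

Keeping the plan separate from its execution makes the routing rules easy to inspect and to extend with a new platform entry.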

Topic: Databases and Bioinformatics Tools PI: DBT06

A BRAZILIAN SOYBEAN DATABASE


Nascimento L C1 , Costa G G L1 , Meyer L1 , Binneck E2 , Rodrigues F2 , Kulcheski F R5 , Margis R5 , Kido A3 , Marcelino F C2 , Nepomuceno A L2 , Abdelnoor R V2 , Pereira G A G1 , Carazzolle M F1,4
1 Laboratório de Genômica e Expressão - Departamento de Genética, Evolução e Bioagentes, Instituto de Biologia, Universidade Estadual de Campinas, Campinas - SP, Brazil.
2 Embrapa Soja, Londrina - PR, Brazil.
3 Universidade Federal de Pernambuco, Recife - PE, Brazil.
4 Centro Nacional de Processamento de Alto Desempenho em São Paulo, Universidade Estadual de Campinas, Campinas - SP, Brazil.
5 Universidade Federal do Rio Grande do Sul, Porto Alegre - RS, Brazil.

Soybean is a legume of great economic importance in the international market, with a world production of almost two hundred and ten million tons in the 2008/2009 harvest. Brazil is the second largest producer, with about twenty-five percent of the world production. In 2007, the Brazilian Soybean Genome Consortium (GENOSOJA) was established with the goal of integrating several institutions currently working with soybean genomics in Brazil. The project searches for new traits to improve the soybean production process, with emphasis on stresses that affect national production, such as droughts, pest attacks and Asian rust disease. Among the objectives of GENOSOJA is the creation of a relational database integrating the results achieved by the different methodologies used in the project. In the GENOSOJA context, we created a Brazilian soybean database, integrating: (1) public data consisting of the genome and predicted genes from JGI, an assembly of 1,276,813 NCBI ESTs from several cultivars and 4,712 full-length cDNA sequences from one Japanese cultivar; and (2) private data consisting of (i) three cDNA libraries explored by the SuperSAGE methodology, resulting in 4,373,053 tags of 26 bp, (ii) 22 subtractive cDNA libraries from several Brazilian cultivars under different stresses and (iii) several libraries of soybean microRNAs under eight conditions, with sizes between 19 and 24 bp. All these data were sequenced using Solexa/Illumina sequencing technology. The database offers users several features, including keyword searches, statistical comparisons, automatic annotation, gene ontology classification and the expression of genes under certain conditions. All data are stored on a Fedora Linux machine running the MySQL database server. The web interface is based on a combination of CGI scripts written in Perl (including the BioPerl module) and the Apache web server.


Topic: Databases and Bioinformatics Tools PI: DBT07

REVISING SIGNAL PEPTIDE PREDICTION BY APPLYING AN ORTHOLOGY-BASED APPROACH: PLASMODIUM SPECIES AS AN EXAMPLE.
Neto A D M1 , Ribeiro R S1 , Rezende A M1 , Brito C F A D1
1 Centro de Pesquisas René Rachou

Protein function is profoundly related to its localization, since the possible interactions and metabolic pathways in which a protein might participate are all ultimately defined by the microcellular environment it inhabits. Moreover, signal peptides are essential for protein sorting. Signal peptides are hydrophobic amino acid stretches, usually located at the N-terminal region of proteins that are targeted to the secretory pathways. Therefore, orthologous genes, especially syntenic ones, are not expected to differ much in the presence or absence of signal peptides, since any divergence would entail strikingly distinct biological outcomes. Signal peptide prediction based on the primary structure of proteins is an important topic of computational research, and several prediction algorithms are available. SignalP, which is based on Neural Networks (NN) and Hidden Markov Models (HMM), is amongst the most widely used predictors. We believe that signal peptide predictions can be combined with orthology information and used as a filter to 1) identify inaccurately annotated protein sequences at a supra-genome level; 2) refine signal peptide prediction itself. To test our hypothesis, we selected ortholog groups, formed by proteins from two or more Plasmodium species, which show diverging signal peptide predictions among their Plasmodium members. These proteins were then aligned to their orthologs and manually inspected for possible differences. Whenever feasible, proteins were reannotated (or marked for reannotation) and signal peptides were predicted again. A total of 397 orthologous groups were inspected and 338 (out of 2113) proteins were revised. After revision, the prediction status changed for over 250 proteins, resulting in a significant alteration in the list of exported/secreted proteins of Plasmodium. Although the selection criteria were biased towards differing sequences from P. vivax and P. falciparum, the species with the largest number (163) of revised (or marked for revision) proteins was P. berghei, followed by P. vivax (145). Approximately 87% of groups had at least one protein that needed revision, showing that combining orthology and signal peptide prediction appears to be a powerful, yet simple, approach to refine the curation of proteomes from closely related organisms. Groups that still showed differing predictions among their members (whether or not changes were made) are currently being analyzed in an effort to define new parameters that could improve signal peptide prediction for Plasmodium proteins. Supported by: FAPEMIG, CNPq, CPqRR, CAPES
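The selection step described in this abstract, keeping only ortholog groups whose members disagree on signal peptide presence, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout, the group and protein identifiers and the function name `diverging_groups` are all hypothetical, and the boolean flag stands in for whatever a predictor such as SignalP reports.

```python
from collections import defaultdict


def diverging_groups(predictions):
    """Return ortholog groups whose members disagree on signal peptide presence.

    predictions: iterable of (group_id, protein_id, has_signal_peptide) tuples.
    The tuple layout is an assumption made for this sketch.
    """
    by_group = defaultdict(dict)
    for group_id, protein_id, has_sp in predictions:
        by_group[group_id][protein_id] = has_sp

    flagged = {}
    for group_id, members in by_group.items():
        # A group needs two or more proteins, and both outcomes must occur.
        if len(members) >= 2 and len(set(members.values())) > 1:
            flagged[group_id] = members
    return flagged


# Hypothetical records: identifiers are invented for illustration only.
example = [
    ("OG1", "vivax_0001", True),
    ("OG1", "falciparum_0001", True),
    ("OG2", "vivax_0002", True),
    ("OG2", "falciparum_0002", False),
    ("OG2", "berghei_0002", True),
    ("OG3", "vivax_0003", False),
]
flagged = diverging_groups(example)
print(sorted(flagged))
```

Groups returned by such a filter would then go to alignment and manual inspection, which is the part of the work that cannot be reduced to a few lines of code.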


Topic: Databases and Bioinformatics Tools PI: DBT03

INSTITUTIONALIZED ELDERLY PATIENTS DRUG RISK ASSESSMENT USING SINGULAR VALUE DECOMPOSITION AND K-MEANS CLUSTERING
Ferre F1 , Matos F M S B D2 , Pinto M C X4 , Pinheiro M L P4 , Santos M A D1 , Acrcio F D A3 , MeiraJr W1
1 PhD program in Bioinformatics, FEDERAL UNIVERSITY OF MINAS GERAIS
2 Dep. of Computer Science, FEDERAL UNIVERSITY OF MINAS GERAIS
3 Fac. of Pharmacy, FEDERAL UNIVERSITY OF MINAS GERAIS
4 Fac. of Pharmacy, FEDERAL UNIVERSITY OF JEQUITINHONHA AND MUCURI VALLEYS

Institutionalized elderly patients usually take a larger number of medicines. In many cases comorbidities are found in the same individual, requiring various medications and different treatments; this characterizes polypharmacy and can lead to a large number of drug interactions or toxic effects. Polypharmacy users must be followed by a pharmacist, and the present work proposes a way of clustering patients into priority risk groups considering the multidimensional data involved.

Materials and Methods: Prescriptions of 150 elderly patients were evaluated. The descriptors used in the binary matrix were gender, age, pharmacological subgroup of continuous-use medicines classified according to ATC (WHO), overdosage, Beers criteria for inappropriate medication use in older adults, drug interactions and substances present in the National Essential Medicines List (RENAME 2008). Singular Value Decomposition (SVD) was used to represent the elements in orthonormal bases, which enables a geometric visualization of the data. To obtain better results, three different metrics were considered: (I) SVD of the binary matrix (BD), (II) SVD optimized by the Euclidean Distance (ED) method and (III) SVD optimized by the Cosine Distance (CD) method. The number of clusters for each metric was obtained by analyzing the sharp falls in the chart of the diagonal elements of S. Scilab was used for the SVD and to generate the coordinates required by the clustering algorithm. The K-means algorithm was used to separate the data into clusters for each metric, taking as input the coordinates obtained by SVD and the number of clusters estimated from the singular value charts. The output was the correspondence between patients and clusters. A standard criterion was developed considering the pharmacist's hypothetical decision involving the adopted risk factors.

Results: Considering the number of singular values, three clusters were obtained for the BD and CD methods and two clusters for ED. The clusters were classified as high, moderate and low risk. Regarding the risk criteria, the clusters showed an agreement between 86.0% and 76.7% with the adopted standard (moderate risk in the BD and CD clusters was considered equal to low risk when compared with the ED clusters). Age greater than 80 and the absence of a medication from the RENAME list did not influence the classification of these patients into higher risk categories. The position in the follow-up priority list was compared to the respective scores derived from each algorithm. ED gave the response closest to the standard criterion. Cardiac therapy and antianemic preparation drugs were allocated to the high risk group by all algorithms (in 81.3% and 70.0% of prescriptions, respectively). The SVD representation was able to separate the elderly patients into risk groups, helping the pharmacist to build a priority list of patients for follow-up. This tool can be combined with biological databases to improve the information available to support decision making. Supported by: CAPES-CNPq.
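The clustering step, K-means applied to low-dimensional coordinates, can be illustrated with a plain Lloyd's algorithm. This is only a sketch under stated assumptions: the abstract used Scilab, whereas this example is in Python; the six 2D points are invented stand-ins for SVD coordinates; and seeding the centroids with the first k points is a simplification chosen to keep the run deterministic.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Hypothetical 2D coordinates, standing in for patients projected by SVD.
coords = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centroids = kmeans(coords, k=2)
print(labels)
```

The returned `labels` list plays the role of the "correspondence between patients and clusters" mentioned in the text; ranking the clusters by risk would still require the pharmacist's criterion.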


Topic: Databases and Bioinformatics Tools PI: DBT06

STORAGE AS A SERVICE AND CLOUD COMPUTING FOR BIOINFORMATICS COMPUTING ENVIRONMENT


Arbex W1,2 , Tagliatti R F1 , Andrade L G D1 , Muniz M N M2 , Guedes E1 , Silva M V B D1,2
1 Laboratório de Bioinformática e Genômica Animal - Embrapa Gado de Leite
2 Núcleo de Bioinformática de Juiz de Fora - Universidade Federal de Juiz de Fora/Embrapa Gado de Leite

Cloud computing is a new IT business model, but it is also a new way of organizing computing resources: software, platforms, infrastructure and storage, among others, as well as end-user applications, can be made available as services, that is, provided to users and clients through simple access procedures. From a technical standpoint, cloud computing is built on a subset of the concepts and tools of distributed systems and operating systems, e.g., distribution transparency and virtualization. Storage systems are excellent solutions for bioinformatics computing environments, where the nature of the activity generally makes it difficult to predict the volume of data to be handled. However, storage systems are expensive solutions that require changes in the computing environment before they can be used. This paper describes a storage solution built exclusively with free software which, within the concept of cloud computing, provides storage as a service (StaaS), implementing access, location, migration, replication and concurrency transparency. The storage system was developed using four CPUs and eight HDs under the Linux operating system, creating a disk array with a capacity of over 4 TB of raw space. The disk array was implemented with the MHDDFS and NFS applications, available for Linux and other operating systems. MHDDFS joins several mount points into a single space, assembling the disk array, and NFS makes the disk array available over the network. A backup/restore system was integrated into the storage system; it copies the entire contents of the storage, four times a day, to external hard drives with network access. The backup/restore system was also implemented with native Linux tools, such as CRON and RSYNC, and operates without the users' intervention or even their knowledge. This storage system was implemented with free software and low-cost hardware; nevertheless, the logical organization and working structure of cloud computing provide scalability and security in storage for bioinformatics computing environments and scientific computing in general.


Topic: Databases and Bioinformatics Tools PI: DBT04

CLUSTERING AND ANALYSIS OF COMPOUNDS THROUGH FINGERPRINTS DATA


Elias H C B1 , Gusmo R F S1 , Alves E C D O1 , Santos M A D1
1 Computer Science Department - UFMG

In biochemistry and pharmacology, a ligand is a substance that is able to bind to and form a complex with a biomolecule in order to trigger some biological effect. Biological activity is an expression that describes the beneficial or adverse effects of a drug on living matter. Biological targets are most commonly proteins such as enzymes, ion channels, and receptors. Many databases provide biological target data; one of these is the DUD database. DUD is a database of bioactive molecules and decoys (inactives) for the validation of virtual screening methods, especially molecular docking. In the field of molecular modeling, docking is a method which predicts the preferred orientation of one molecule to a second when bound to each other to form a stable complex. Knowledge of the preferred orientation may in turn be used to predict the strength of association or binding affinity between two molecules using scoring functions. This database holds information about 3800 bioactive compounds related to 40 biological activities. In this work each compound is represented as a binary vector, the molecular fingerprint, which encodes molecular structure in a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule. By comparing fingerprints one can determine the similarity between two molecules, search molecular databases, etc. The molecular fingerprint does not include full structural data (such as coordinates), only the presence or absence of some structural features, such as substructures. Singular Value Decomposition (SVD) is an important tool in data mining. This technique performs a factorization of a rectangular matrix and has many applications in signal processing, optimization and statistics. The method receives a matrix and tries to reduce the matrix's rank according to its singular values (low-rank matrix approximation). In this work we use SVD to cluster the matrix of DUD compound fingerprints. We build a 1024x3800 matrix (fingerprints x compounds), where each row represents a position in the molecular fingerprint vector and each column represents a compound. We used the singular values to determine the number of clusters. Comparing the singular values obtained by SVD, we observed that from the sixth singular value onwards the values declined rapidly. According to the singular values of the matrix, the compounds can be separated into, at most, 5 groups. The distribution of points in the space generated by SVD confirms this value. We are currently analyzing the groups of compounds generated, and the next stage is to understand why the compounds were separated into those groups, that is, how a set of compounds with 40 biological activities can be grouped into only 5 clusters.
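Reading the number of groups off the singular values can be made explicit with a small helper. The sketch below is an assumption-laden illustration: the abstract does not say how the drop was measured, so the ratio between consecutive values is used here as one plausible criterion, and the singular values in the example are invented, chosen only so that the sharp fall happens at the sixth value as reported.

```python
def clusters_from_singular_values(singular_values):
    """Return how many leading singular values precede the sharpest relative drop.

    The ratio s[i] / s[i + 1] is an assumed criterion, not the authors' rule.
    """
    s = sorted(singular_values, reverse=True)
    if len(s) < 2:
        return len(s)
    best_k, best_ratio = 1, 0.0
    for i in range(len(s) - 1):
        if s[i + 1] == 0:
            return i + 1  # everything after position i is zero
        ratio = s[i] / s[i + 1]
        if ratio > best_ratio:
            best_k, best_ratio = i + 1, ratio
    return best_k


# Hypothetical singular values: the sixth one falls sharply.
example_values = [50.0, 41.0, 35.0, 30.0, 26.0, 6.0, 4.0, 3.0]
n_groups = clusters_from_singular_values(example_values)
print(n_groups)
```

In practice the values would come from an SVD routine applied to the 1024x3800 fingerprint matrix; a visual check of the plotted values, as the authors describe, remains a useful safeguard against a misleading ratio.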


Topic: Databases and Bioinformatics Tools PI: DBT03

IDENTIFICATION OF NON-CODING RNA FROM MYCOBACTERIUM PATHOGENIC STRAINS


Oliveira L S1 , Paschoal A R2 , Durham A M1,2
1 Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo - USP, Matão Street, 1010, SP, Brazil
2 Bioinformatics Program, Institute of Mathematics and Statistics, University of São Paulo - USP, Matão Street, 1010, SP, Brazil

Tuberculosis is a disease caused by the bacterium Mycobacterium tuberculosis. Its relevance has increased due to its wide distribution, its association with immunosuppressed patients, and the appearance of thousands of cases caused by strains resistant to first-line (MDR) and second-line (XDR) drugs. The development of specific bioinformatics tools to better support the joint analysis of multiple genomes, focusing on finding unique and conserved regions of a given group of related bacteria, represents a new opportunity to identify and validate new potential drug targets. Our goal was to isolate the genomic regions exclusive to pathogenic strains and, within them, to analyze the regions exclusive to virulent strains in order to identify possible non-coding RNA (ncRNA) genes and thereby characterize potential drug targets. For this analysis we used seventeen bacterial genomes. Six of these are from bacteria of the M. tuberculosis complex, the species that cause tuberculosis. The other eleven genomes do not belong to this group, but some are responsible for causing diseases in humans. These genomes were used as input to the multiple alignment program Mauve. We developed a Perl script to isolate the conserved regions exclusive to virulent strains based on the output of Mauve. Each common region larger than 100 nt was isolated in a multi-FASTA file, aligned using the ClustalX program and submitted to the RNAz program. The latter analyzes multiple alignments, searching for the sequence and structure variations characteristic of non-coding RNAs. The regions of the alignments classified as ncRNA were isolated (M. tuberculosis). The candidate sequences of M. tuberculosis were submitted to a pipeline for the annotation of RNA candidates (developed by PhD student Alexandre Rossi Paschoal). The results for the annotation of 3714 candidates were as follows: 1220 (32.85%) showed no evidence (agreeing or not), 548 (14.76%) were ncRNA (meaning that the same candidate has more than one evidence of a different type of RNA), 904 (24.34%) were false positives, 287 (7.73%) were miRNA, 236 (6.35%) were CRISPR RNA and another 236 (6.35%) were cis-regulatory RNA, 180 (4.85%) were snoRNA, 7 (0.19%) were tRNA, 5 (0.13%) were Hammerhead and 91 were other specific types of RNA.
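The annotation breakdown can be rechecked with a few lines of arithmetic. The category labels below are paraphrased from the abstract and the counts are copied from it; nothing else is assumed. The recomputed shares agree with the reported ones to within 0.01 percentage points, and the 91 candidates of other specific RNA types, for which no percentage is given, come to about 2.45%.

```python
# Counts as reported in the abstract; labels are paraphrased.
counts = {
    "no evidence": 1220,
    "ncRNA (more than one evidence type)": 548,
    "false positive": 904,
    "miRNA": 287,
    "CRISPR RNA": 236,
    "cis-regulatory RNA": 236,
    "snoRNA": 180,
    "tRNA": 7,
    "Hammerhead": 5,
    "other specific RNA types": 91,
}

total = sum(counts.values())
percentages = {name: round(100 * n / total, 2) for name, n in counts.items()}

for name, pct in percentages.items():
    print(f"{name}: {counts[name]} ({pct}%)")
print("total candidates:", total)
```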


Topic: Databases and Bioinformatics Tools PI: DBT04

SANDFLY DATABASE: IN SILICO EXPLOITATION OF EST DATA TO EXPLORE EXPRESSION PATTERNS IN SANDFLY SPECIES
Silva G G Z1 , Medeiros S M O1 , Gomes R C A1 , Santos M A O1 , Batista M V A1 , Gomes L1 , Lima T L D1 , Ferreira T A E2 , Balbino V Q1
1 Departamento de Genética, Universidade Federal de Pernambuco, Recife - PE, Brazil
2 Departamento de Estatística e Informática, Universidade Federal de Pernambuco, Recife - PE, Brazil

In recent years there has been a notable expansion in the number of high-throughput molecular studies of several species of insect vectors of tropical diseases, particularly those involved in the transmission of malaria (Anopheles gambiae) and of dengue and yellow fever (Aedes aegypti). These studies have resulted in the identification of several genes involved in key aspects of these organisms' evolutionary success. In the case of sandflies (Diptera: Psychodidae) involved in the transmission of leishmaniasis, such studies remain rather incipient and consist mainly of transcriptome analyses of some species. In this context, the best represented species in public biological databases are Lutzomyia longipalpis (vector of Leishmania infantum chagasi) and Phlebotomus papatasi (vector of Leishmania major). Owing to the need to obtain information on the biology of the sandflies' interactions with their hosts, the cDNA libraries evaluated in these studies were built from organs such as salivary glands and digestive systems. These analyses have enabled the identification of important molecules, opening the prospect of their future use in the development of more effective methods of leishmaniasis control. The information produced in these investigations demands computational methods that automate the functional annotation and interpretation of the resulting data. In an effort to contribute in this regard, we present the SandFly Database (SandFlyDB), a secondary database integrating expressed sequence tag (EST) information for the most studied sandfly species. The main focus of the proposed tool was the computational analysis of all ESTs of these organisms available in GenBank (42,778 ESTs of Ph. papatasi, 33,123 of Lu. longipalpis and 7,727 of other sandfly species), with a view to establishing a protocol for the automatic identification and annotation of protein-coding genes. Sequences were retrieved from GenBank together with the following information: designation of the cDNA libraries, and organs and/or physiological stages of the organisms used in the construction of the libraries. This information was stored in a MySQL database, enabling the classification of ESTs according to tissue type and physiological stage (e.g. sandflies fed with blood or sugar; infected with Leishmania sp. or not). The files were clustered using the CAP3 software (http://seq.cs.iastate.edu/download.html), and the contigs and singlets were compared against a local copy of the NR database from GenBank using the BLASTX software. Of the 8,032 contigs and 37,859 singlets obtained, the majority of contigs (80.1%) and a significant number of singlets (45.3%) exhibited similarities with gene sequences deposited in GenBank. The next step consisted of retrieving information from secondary databases in order to obtain additional information on the functions of the identified proteins. Searches were carried out against local copies of the following databases: KEGG, Gene Ontology and COG. This information will be freely available online on our web site (http://www.bioinfo.ufpe.br) for those interested in the structural, functional and comparative genomics of sandflies. Supported by CNPq, UFPE, CAPES and FACEPE


Topic: Databases and Bioinformatics Tools PI: DBT04

VISUALIZING HIGH DIMENSIONAL AND MULTIVARIATE DATA APPLYING SINGULAR VALUE DECOMPOSITION FOLLOWED BY OPTIMIZATION
Couto B R G M1,2 , Boaventura M A C3 , Marcolino L S3 , Santos M A3
1 Programa de Doutorado em Bioinformática, Universidade Federal de Minas Gerais, UFMG, Belo Horizonte, Minas Gerais, Brasil
2 Departamento de Ciências Exatas e Tecnologia, Centro Universitário de Belo Horizonte, UNI-BH, Belo Horizonte, Minas Gerais, Brasil
3 Departamento de Ciência da Computação, UFMG, Av. Antonio Carlos 6627, Belo Horizonte, Minas Gerais, 31270-010, Brasil

Background: Genomics experiments have produced massive amounts of multivariate data that are being collected into public databases. In this scenario, visualizing non-visual high-dimensional data plays an important role. This paper presents an approach, the SVD/optimization method, to map multivariate data such as protein sequences from their high-dimensional representation into 2D or 3D space. The high-dimensional visualization problem in R^m is formulated as a distance-geometry problem, i.e., to find n points in a low-dimensional space (2D or 3D) so that their interpoint distances match the corresponding values from R^m as closely as possible. First, protein sequences are recoded as tripeptide frequency vectors using all possible overlapping tripeptide windows. After describing protein sequences as vectors in a high-dimensional space, we applied a rank reduction using singular value decomposition (SVD), followed by optimization, to visualize proteins and genomes in low-dimensional space. To validate the SVD/optimization method we compared all results with Principal Components Analysis (PCA).

Results: The proposed method was successfully tested in three instances: a set of geographic coordinates, a set of whole mitochondrial genomes and a database of proteins from five families. The SVD/optimization method gave better visualization results than PCA.

Conclusion: The method was able to correctly visualize high-dimensional and multivariate data in low-dimensional space. Predefined groups of proteins and biologically similar species were represented as nearby points in space and correctly discriminated.
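The recoding of a protein sequence as a tripeptide frequency vector with overlapping windows can be sketched as below. Several details are assumptions made for illustration, since the abstract does not specify them: the 20-letter standard alphabet (giving 8000 dimensions), normalization of counts to relative frequencies, skipping windows that contain non-standard residues, and the two toy sequences.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard 20-letter alphabet (assumption)
TRIPEPTIDES = ["".join(t) for t in product(AMINO_ACIDS, repeat=3)]  # 8000 entries
INDEX = {tri: i for i, tri in enumerate(TRIPEPTIDES)}


def tripeptide_vector(sequence):
    """Count every overlapping window of three residues, then normalize."""
    counts = [0] * len(TRIPEPTIDES)
    windows = 0
    for i in range(len(sequence) - 2):
        tri = sequence[i:i + 3]
        if tri in INDEX:  # windows with non-standard residues are skipped
            counts[INDEX[tri]] += 1
            windows += 1
    return [c / windows for c in counts] if windows else counts


# Toy sequences, invented for this example.
v1 = tripeptide_vector("MKVLAAGMKV")
v2 = tripeptide_vector("MKVLAAGMKL")

# Interpoint distance in the high-dimensional space; these are the values a
# low-dimensional embedding would try to reproduce.
distance = math.dist(v1, v2)
print(len(v1), round(distance, 4))
```

The later steps of the method (rank reduction by SVD and the optimization that places points in 2D or 3D) are not shown here; they would take the matrix of such vectors as input.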


Topic: Databases and Bioinformatics Tools PI: DBT05

UEKO UPDATE: 2.4 MILLION SEQUENCES FROM 25 THOUSAND ORGANISMS


Fernandes G D R1 , Ortega J M1
1 Universidade Federal de Minas Gerais

The KEGG Orthology database currently contains more than 1 million sequences from nearly 1,000 genomes. It was enriched by a procedure developed by our group to reach 2,442,384 sequences from 25,024 organisms, constituting the UEKO database. UEKO, the UniRef50 Enriched KEGG Orthology database, is produced by a database integration procedure to be published elsewhere and further distributed. The new UEKO database also provides data on EC numbers and protein existence evidence. All this information allows a more reliable enrichment procedure. UEKO is available at the website biodados.icb.ufmg.br/ueko, and its information is used in a range of applications in evolution, development, pathway studies and more. In this work we will show some snapshots and information about what can be done using the new UEKO database.


Topic: Databases and Bioinformatics Tools PI: DBT03

COMPUTATIONAL ANALYSIS OF THE SCHISTOSOMA MANSONI GENOME FOR THE IDENTIFICATION OF VACCINE CANDIDATES.
Moraes R1,2 , Zerlotini A1 , Oliveira G1
1 Parasitology Lab - RENÉ RACHOU INSTITUTE, FIOCRUZ, BELO HORIZONTE, BRAZIL
2 Bioinformatics PhD Program - FEDERAL UNIVERSITY OF MINAS GERAIS - UFMG, BELO HORIZONTE, BRAZIL

It is widely recognized that new approaches towards the development of vaccine candidates for schistosomiasis are needed. The genome sequence, together with transcriptomic and proteomic data, opens possibilities for vaccine development using computational methods for antigen prediction. We will use these data to predict cellular localization and epitopes in order to identify candidate antigens. The information will be mined via a relational database. Selected candidates will be tested for immunoreactivity with sera from infection-resistant humans and in vaccination challenge tests using animals. T cell epitopes are pathogen-derived antigenic peptide fragments that, if bound to an MHC molecule of an antigen-presenting cell, interact with the T cell receptor, triggering an adaptive immune response to the pathogen. We applied an integrative two-step approach to the prediction of these ligands in the S. mansoni genome. First, we predicted all proteins that could be exposed to the host immune system. We tested several algorithms for the prediction of secreted proteins and proteins with transmembrane domains, and chose SignalP and TMHMM as classical methods and the Sherloc framework and SPRED as non-classical methods. With the exposed proteins we were then able to predict epitopes. For this we used a framework called FRED, which allowed us to combine various algorithms and metrics into a single consensus analysis to obtain the best binders. The genome of S. mansoni has 13,273 genes, 2,469 of which were predicted to be exposed to the host. Among the exposed proteins, 1,883 were identified with immunogenic ligands, and 90 were classified as candidates containing more than 10 binders. All predictions were integrated into a pipeline to automate the process, and the results were stored in a relational database. We are currently integrating experimental data from proteomics and expression studies to select antigens for experimental validation. We expect that by the end of the proposed work we will have selected at least 30 different antigens tested up to the in vitro reactivity assays and 10 tested in challenge assays. The results will form the basis of subsequent experiments to develop vaccines for further testing, initially in animal models. The vaccines may consist of single or multiple recombinant proteins or multiple epitopes from one or more proteins. This work was supported by WHO/TDR (ID No. A70310), FAPEMIG and CNPq.


Topic: Databases and Bioinformatics Tools PI: DBT04

A WEB INTERFACE FOR LSQKAB TOOL


Araujo U1 , Dias S R1
1 Faculdade Anhanguera

From the early days of biological research until today, biologists have had restricted access to software that lets them carry out their research in a precise and reliable way. In 1979 researchers of the UK Science and Engineering Research Council started a project called CCP4 (The Collaborative Computational Project Number 4), which comprises several tools, particularly for macromolecular crystallography. Although the project is still active, the complexity of its tools' interfaces is a major limitation for researchers. With that in mind, we propose the creation of a web interface for the LSQKAB tool, part of CCP4. The LSQKAB tool compares and improves atomic coordinates of PDB files. The interface was developed with open-source resources: PHP, Perl, JavaScript and CSS for the interface, and the Apache web server to run the application, all in a Linux environment. Although the LSQKAB tool must still be installed and configured so that the scripts generated by the application can be executed, the two work together, giving researchers a more pleasant web interface that is simpler and easier to use. Financial Support: FAPEMIG


Topic: Databases and Bioinformatics Tools PI: DBT03

SEARCH FOR STRUCTURAL PATTERNS IN RNAS USING GRAPHS


Onuchic V F1 , Machado-Lima A2 , Durham A M1
1 Universidade de São Paulo
2 Escola de Artes, Ciências e Humanidades - Universidade de São Paulo

In the past few years, researchers have found evidence that the importance of ncRNAs to cell activity is much greater than was initially believed. For this reason, a large number of algorithms and programs have been developed to try to solve the problem of recognizing new ncRNA genes. The majority of the programs developed to date depend, to some degree, on similarity search, on ab initio structural prediction, or on probabilistic profiling of known ncRNAs. Structural prediction is highly inaccurate. Similarity search and probabilistic profiling generate classifiers that depend, to different degrees, on the primary sequence of known ncRNAs. However, ncRNA functionality is mostly defined by secondary structure, meaning that there is a lower positive selection pressure for sequence conservation in ncRNA genes than in protein-coding ones. Another problem is that some ncRNA families also show secondary structure variation, with only some conserved structural patterns. A good example of this phenomenon is the Telomerase RNA Component (TERC). We present here a tool that takes into account only the presence of conserved secondary structure patterns in order to identify new ncRNA genes. In particular, no sequence similarity is required. The tool can also be used to represent and search partial structural patterns and to represent pseudo-knots. The basic principle behind the tool is the representation of secondary structures and structural patterns as graphs. In these graphs, nodes represent sub-sequences that are part of a pairing in the ncRNA, and directed edges represent both sequence order and pairing. We can use the same idea to represent every possible pairing in a sequence: we create nodes and edges to represent each pair of sub-sequences that are complementary and can, in theory, form a helix. These complementary sub-sequences can be found by aligning the sequence with its reverse, in such a way that matches are defined by complementary bases instead of identical ones. To find whether a given structure is possible for a sequence, we then need only check whether the graph representing the desired structure is a sub-graph of the graph that represents all possible pairings. This is the sub-graph isomorphism problem of graph theory, which has a known solution. This algorithm, even though impractical in the general case (the problem is NP-complete), has shown acceptable performance in the real-world scenarios we tested. We used our tool in three different scenarios. First, we checked whether it could be used to detect structural patterns, that is, common elements that are part of a larger secondary structure. Second, we used the tool to characterize microRNAs. Third, we used the tool to characterize the family of the RNA component of the Telomerase complex, a problem of particular interest because of the presence of pseudo-knots (crossing helices). The first two experiments were highly successful: the tool was able to identify all structural patterns tested, and the microRNA characterization, applied to a dataset of 5020 curated microRNAs, correctly identified 73% of the sequences as microRNA, with an expected false positive rate of only 12%. We expect this tool to have broad application in the future characterization of ncRNA families.
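The first step described in this abstract, finding pairs of sub-sequences that could in theory form a helix by matching the sequence against its reverse under complementarity, can be sketched as follows. This is a simplified illustration rather than the authors' tool: only Watson-Crick pairs are allowed (G-U wobble pairs are left out), the minimum stem length and minimum loop size are arbitrary parameters, and the function name and toy sequence are hypothetical. The graph construction and the sub-graph search are not shown.

```python
# Watson-Crick pairs only; leaving out G-U wobble pairs is a simplification.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def candidate_helices(seq, min_len=3, min_loop=3):
    """List maximal stems as (five_prime_start, three_prime_end, length).

    Position i + k pairs with position j - k for k in range(length), which is
    the same as matching the sequence against its own reverse with
    complementary bases counted as matches.
    """
    n = len(seq)
    helices = []
    for i in range(n):
        for j in range(n - 1, i, -1):
            # Skip starts that merely continue a stem beginning further out.
            if i > 0 and j < n - 1 and (seq[i - 1], seq[j + 1]) in PAIRS:
                continue
            length = 0
            # Keep at least min_loop unpaired bases between the two strands.
            while (i + length < j - length - min_loop
                   and (seq[i + length], seq[j - length]) in PAIRS):
                length += 1
            if length >= min_len:
                helices.append((i, j, length))
    return helices


# Toy hairpin: GGG pairs with CCC around an AAAU loop.
stems = candidate_helices("GGGAAAUCCC")
print(stems)
```

Each returned stem would correspond to a pair of nodes (the two strands) in the graph of all possible pairings, against which a structural pattern graph can then be searched.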


Topic: Databases and Bioinformatics Tools PI: DBT04

BIONETCAD: DESIGN, SIMULATION AND EXPERIMENTAL VALIDATION OF SYNTHETIC BIOCHEMICAL NETWORKS.


Rialle S1 , Felicori L1,2 , Dias-Lopes C1,2 , Peres S1 , Atia S E1 , Thierry A1 , Amar P1 , Molina F1
1 SysDiag UMR 3145 CNRS/Bio-Rad
2 Departamento de Bioquímica e Imunologia, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais

Synthetic biology studies how to design and construct biological systems with functions that do not exist in nature. Biochemical networks, although easier to control, have been used less frequently than genetic networks as a base to build a synthetic system. To date, no clear engineering principles exist to design such cell-free biochemical networks. We describe a methodology for the construction of synthetic biochemical networks based on three main steps: design, simulation and experimental validation. We developed BioNetCAD to help users go through these steps. BioNetCAD allows the design of abstract networks that can be implemented thanks to CompuBioTicDB, a database of parts for synthetic biology. BioNetCAD also enables simulations with the HSim software and with classical Ordinary Differential Equations (ODE). We demonstrate with a case study that BioNetCAD can rationalize and reduce further experimental validation during the construction of a biochemical network. Availability and implementation: BioNetCAD is freely available at http://www.sysdiag.cnrs.fr/BioNetCAD. It is implemented in Java and supported on MS Windows. CompuBioTicDB is freely accessible at http://compubiotic.sysdiag.cnrs.fr/.
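
To give an idea of what an ODE simulation of a biochemical network involves, the sketch below integrates a single enzymatic conversion of a substrate S into a product P. It is not BioNetCAD or HSim code: the Michaelis-Menten rate law, the explicit Euler scheme, the function name and all parameter values are assumptions chosen for illustration.

```python
def simulate_enzymatic_conversion(s0, vmax, km, dt=0.01, t_end=10.0):
    """Integrate dS/dt = -vmax*S/(km+S) and dP/dt = +vmax*S/(km+S)
    with the explicit Euler method; returns a list of (t, S, P)."""
    s, p = s0, 0.0
    trajectory = [(0.0, s, p)]
    steps = int(round(t_end / dt))
    for step in range(1, steps + 1):
        rate = vmax * s / (km + s)  # hypothetical Michaelis-Menten rate law
        s -= rate * dt
        p += rate * dt
        trajectory.append((step * dt, s, p))
    return trajectory


if __name__ == "__main__":
    t, s, p = simulate_enzymatic_conversion(10.0, vmax=1.0, km=0.5, t_end=5.0)[-1]
    print(f"t={t:.2f}  S={s:.3f}  P={p:.3f}")
```

Because every step moves the same amount from S to P, the total S + P stays constant, which is a convenient sanity check for this kind of model.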


Topic: Databases and Bioinformatics Tools PI: DBT04

PIPS: PATHOGENICITY ISLAND PREDICTION SOFTWARE


Soares S C1 , Abreu V A C1 , McCulloch J A2 , D'Afonseca V1 , Ramos R T J2 , Ali A1 , Santos A R1 , Pinto A C1 , Almeida S S1 , Silva A L2 , Miyoshi A1 , Azevedo V1
1 Universidade Federal de Minas Gerais
2 Universidade Federal do Pará

The adaptability of pathogenic bacteria to new hosts is influenced by their genomic plasticity, which in turn is generated by mechanisms like horizontal gene transfer. Here, Pathogenicity Islands play a major role, since they are large horizontally acquired regions that harbour clusters of virulence genes mediating the adhesion, colonization, invasion, immune system evasion, and toxigenesis of the acceptor organism to the host. Nowadays, Pathogenicity Islands are mainly identified in silico by their classical features: (1) deviations in codon usage, G+C content or dinucleotide frequency; or (2) the presence of insertion sequences and/or tRNA gene flanking regions jointly with transposase coding genes. Several computational techniques for identifying Pathogenicity Islands exist. Most of them, however, only aim at detecting horizontally transferred genes and/or the absence of genomic regions of the pathogenic bacterium in closely related but non-pathogenic species. Here we present PIPS, a novel software suite designed for the prediction of Pathogenicity Islands. In contrast to existing tools, our approach is capable of utilizing multiple features for Pathogenicity Island detection in an integrative manner. We show that PIPS provides better accuracy rates than other typical software packages. Finally, we used PIPS to study the veterinary pathogen Corynebacterium pseudotuberculosis and identified seven putative Pathogenicity Islands. PIPS is a new computational tool for the in silico prediction of Pathogenicity Islands. It is open source, freely available online and outperforms existing approaches in terms of accuracy. The web interface, the source code, as well as the required databases may be found at http://www.genoma.ufpa.br/lgcm/pips.
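
One of the classical features listed above, deviation in G+C content, can be illustrated with a sliding-window scan. The sketch below is not part of PIPS: the window size, step, z-score cutoff and function names are assumptions, and a real predictor would combine this signal with the other features.

```python
from statistics import mean, pstdev


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def deviant_windows(genome, window=1000, step=500, z_cutoff=2.0):
    """Return (start, gc) for windows whose G+C content lies at least
    z_cutoff standard deviations away from the mean over all windows."""
    starts = list(range(0, len(genome) - window + 1, step))
    gc = [gc_fraction(genome[s:s + window]) for s in starts]
    if not gc:
        return []
    mu, sigma = mean(gc), pstdev(gc)
    if sigma == 0:
        return []
    return [(s, g) for s, g in zip(starts, gc) if abs(g - mu) / sigma >= z_cutoff]


if __name__ == "__main__":
    # Toy genome: balanced background followed by one G-only window.
    toy = "ATGC" * 9 + "GGGG"
    print(deviant_windows(toy, window=4, step=4))
```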


TOPIC 5 TEXT MINING AND INFORMATION EXTRACTION


Topic: Text Mining and Information Extraction PI: TMIE01

CLASSIFICATION OF METAGENOMIC FRAGMENTS THROUGH MACHINE LEARNING METHODS


Higashi S1 , Barreto A D M S1 , Canto M E1 , Vasconcelos A T R D1
1 Laboratório Nacional de Computação Científica (LNCC)

Microorganisms are essential for life on Earth due to the role they play in chemical reactions. One example is the conversion of key elements (carbon, nitrogen, oxygen, and sulfur) into biologically accessible forms. Moreover, microorganisms hosted in the human body, in the intestine and mouth, help in digesting food and in protecting against disease-causing agents. In the last years the study of microorganisms has been centered on single species in pure culture. In this scenario, any conclusion regarding communities of microorganisms depends largely on inferences based on their individual behavior. In contrast to single-genome studies, metagenomic investigation is applied to entire communities of microorganisms instead of only a few isolated ones. Therefore, the important roles microorganisms play, conducted by these complex communities, can be better understood. Metagenomics is a research field which attempts to understand biology in a collective way, bypassing the individual focus and centering on the community as a whole. This requires the development of new computational methods that are able to extract knowledge from the genetic composition of complex communities. In this context, the aim of our work is to analyse synthetic metagenomic data and hence develop a taxonomic classifier, based on a machine learning approach, for metagenomic fragments. Binning (classifying) fragments into taxonomic ranks is extremely important since it is the first step of any metagenomic analysis. So far, identifying the taxonomic composition of a sample from a community of microorganisms remains an unresolved problem. Our objective was to study the topology of the space of genome sequences induced by n-mer frequency vectors. In order to do that, we used n-mer counting to encode each DNA sequence as a vector and the k-nearest-neighbor algorithm, with k=1, to classify it. 
This choice is justified by the fact that this algorithm makes very few assumptions regarding the distribution of the data (that is, it keeps the bias introduced by the choice of classification method at a minimum). The goal of our work was to answer three questions: (i) What is the best value for n, the length of the nucleotide subsequences? (ii) What happens when we change the norm of the vector space induced by these sequences? (iii) Is it a good idea to perform a hierarchical classification, in which sequences are sequentially classified from higher-order taxa to lower-order ones? We first investigated the questions above for complete genomes. Then, we simulated synthetic metagenomic fragments, generated from the genomes, and verified whether the same behavior emerged. Through this analysis, we confirmed some intuitive results and also found some unexpected ones. A summary of our findings follows:
(i) For complete genomes, the best value for n is approximately 7, whereas for fragments it seems to decrease to around n = 3.
(ii) The choice of distance measure does not seem to have a major impact on the classification accuracy. However, the 1-norm provided slightly better results than the most commonly used Euclidean distance, which is somewhat surprising.
(iii) In contrast with other results in the literature, our investigation strongly suggests that a standard classification scheme should be preferred over a hierarchical one.
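
The encoding and classification scheme described in this abstract (n-mer counting followed by a 1-nearest-neighbor search under a chosen norm) can be sketched as below. This is not the authors' code: the function names, the use of relative frequencies and the toy reference sequences are assumptions for illustration.

```python
from collections import Counter
from itertools import product


def nmer_profile(seq, n):
    """Relative frequency of every n-mer over ACGT, in a fixed order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=n)]
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts[k] for k in kmers) or 1
    return [counts[k] / total for k in kmers]


def minkowski(u, v, p=1):
    """p-norm distance; p=1 is the 1-norm, p=2 the Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def nearest_label(query, references, n=3, p=1):
    """1-nearest-neighbor: label of the reference closest to the query.
    references is a list of (label, sequence) pairs."""
    q = nmer_profile(query, n)
    best = min(references, key=lambda item: minkowski(q, nmer_profile(item[1], n), p))
    return best[0]


if __name__ == "__main__":
    # Hypothetical toy references, not real genomes.
    refs = [("AT-rich", "ATATATATATATTTAAAT"), ("GC-rich", "GCGCGGCCGCGCGGGCCC")]
    print(nearest_label("ATTATAATATTA", refs, n=2))
```

Changing `p` switches the norm of the vector space, and changing `n` switches the n-mer length, which are the two parameters examined in questions (i) and (ii).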


Topic: Text Mining and Information Extraction PI: TMIE03

AUTOMATIC PREDICTION OF GENE FUNCTION USING LITERATURE PROFILING


Oliveira F S D1 , Oliveira G C1,2 , Coimbra R S1,2
1 Center for Excellence in Bioinformatics, FIOCRUZ-MG
2 Genomics and Computational Biology Group, FIOCRUZ-MG

In recent years, there has been an exponential increase in the number of publicly available genomes. Once finished, most genome sequencing and annotation projects lack financial support to review annotations. As a result, a large number of predicted genes remain unassigned to a functional category. We present herein a new tool for the automatic prediction of bacterial gene function based on literature profiles extracted from gene-specific collections of PubMed abstracts. We used the software LitProf, an implementation of the Chaussabel & Sher (Genome Research, 2002) algorithm written by us, to identify the minimum vocabulary required to describe the function of a given gene from a collection of its gene-specific abstracts in PubMed (www.pubmed.org). To compose the initial data set, 50 K canonical gene names in the 20 functional categories of the J. Craig Venter Institute (JCVI, www.jcvi.org) ontology were randomly picked from 117 genomes representing all bacterial phylogenetic branches. This initial gene set was screened to eliminate homonymous orthologs and genes assigned to more than one functional category, which could bias the gene set. Genes assigned to the JCVI categories "Disrupted reading frame", "Hypothetical proteins", "Unclassified" or "Unknown function" were also excluded. In order to further reduce redundancy in the gene set, genes were grouped by hierarchical clustering, using as input the word-frequency vectors generated by LitProf from their text corpora of PubMed abstracts. For each cluster of highly related genes (0.99 correlation), one representative gene was randomly chosen. Querying PubMed with the canonical names of the remaining 3,781 genes produced a text corpus of 126,990 abstracts (average = 33.5 abstracts/gene; min = 5; max = 50). From this text corpus, LitProf disclosed the minimum informative vocabulary of 889 words and their relative frequencies in the text corpus of each gene. 
These word-frequency vectors, together with the respective functional categories previously assigned to each gene by JCVI, were used to train a gene classifier with the Support Vector Machine (SVM) program implemented in GenePattern (www.genepattern.org). To assess the performance of the classifier, we used 100 replicates of cross-validation with 90% of the gene set used for training and 10% for testing. The average precision was estimated at 80 ± 3% and the average recall at 60 ± 3% (confidence 0.4); for equally weighted recall and precision, the F-measure was 0.7. Furthermore, the classifier achieved up to 81% precision with 73% recall (confidence 0.1), F-measure 0.77, in an independent set of 4,000 genes from 13 species previously classified into unambiguous categories of the JCVI ontology. We then used the classifier to propose functional categories for 2,192 randomly chosen genes previously assigned to the categories "Unknown function" or "Unclassified". For confidence thresholds of 0.7, 0.8 and 0.9, 30% (669), 22% (475) and 13% (276) of the genes, respectively, were classified into one of 16 functional categories of the ontology. In the next step, the classifier will be used to re-annotate approximately 69,000 genes assigned to the categories "Unknown function" or "Unclassified" at JCVI. Supported by: FAPEMIG (CBB-1181/08), NIH-FIC (TW007012), CAPES/CDTS-FIOCRUZ, FIOCRUZ-MG.
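
The F-measure values quoted above follow from the usual harmonic-mean definition, F = 2PR / (P + R) when precision and recall are equally weighted. The short sketch below (the function name and the `beta` parameter are illustrative, not part of LitProf or GenePattern) reproduces the arithmetic: precision 0.80 with recall 0.60 gives about 0.69, and precision 0.81 with recall 0.73 gives about 0.77.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    beta=1 weights both equally (the F1 score)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    print(round(f_measure(0.80, 0.60), 2))  # cross-validation figures
    print(round(f_measure(0.81, 0.73), 2))  # independent test set figures
```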


Topic: Text Mining and Information Extraction PI: TMIE04

CHRONIC DISEASE PREVENTION: AN APPROACH BASED ON GENETIC AND EPIGENETIC SCIENTIFIC PAPERS
Pollettini J T1 , Macedo A A1
1 Department of Physics and Mathematics, FFCLRP - USP

Contextualization: Genomic medicine has suggested that risk in human development begins with the conception of the child and continues throughout adolescence. These risk factors may influence gene expression and consequently induce the development of chronic diseases in adulthood. Scientific papers reporting discoveries and innovative studies indicate that epigenetics must be explored in order to prevent high-prevalence diseases (cardiovascular diseases, diabetes and obesity), improving patients' quality of life. The large amount of scientific information burdens health professionals who want to stay up to date. Computational techniques may support the management of large biomedical information repositories and the discovery of patterns and knowledge. These techniques and methodologies should therefore be used to support the retrieval and management of biomedical knowledge, such as the presence of premature risk factors. Proposal: We intend to carry out a theoretical and practical scientific investigation exploiting natural language processing and linguistic resources to semantically relate biomedical and health textual information. We propose a surveillance system to alert health professionals about human development problems, automatically discovering scientific papers about related chronic diseases and risk factors from patients' clinical records. As a result, the healthcare professional will be able to create a routine with the family in order to set up the best growing conditions. We are currently focusing on the definition and processing of a collection of scientific papers, which must always be up to date to be used by the proposed surveillance system. Materials and Methods: The surveillance system exploits the Entrez Programming Utilities (eUtils) and Biopython to routinely search for related papers and retrieve them. We composed a query to filter the PubMed repository and define a collection of papers of interest. 
The query consists of concepts from the Chronic Disease Ontology (CDO) and the Unified Medical Language System (UMLS). Python, PostgreSQL, stopword removal and n-gram processing are used for textual processing. Results: The collection is updated by processing only new documents. The current collection has 961 documents associated with cardiovascular diseases, diabetes and obesity: 579 are papers related to genes and mutations from the CDO, 393 are papers associated with epigenetics, and 11 belong to the intersection of the previous groups. After processing, 135,353 terms were found, comprising 12994 unigrams, 33587 bigrams, 43782 trigrams and 44990 4-grams. The terms with the highest frequencies are: diabetes (2457 occurrences); patients (1895); gene (1667); disease (1646); and study (1148). The terms which appear in the largest number of documents are: gene (809 documents); associated (582); diabetes (532); study (471); and patients (471). Final Remarks: This work will probably enable results from bioinformatics research to benefit public health. It may also help to consolidate Translational Bioinformatics, defined as the effective transformation of results from biomedical research into knowledge that is able to improve public health. The development of a mechanism to continuously create and process a collection of papers is the first step. Supported by: FAPESP
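
The stopword removal and n-gram processing mentioned in Materials and Methods can be sketched in a few lines of Python. This is an illustration only: the tokenizer, the tiny stopword list and the function name are assumptions, not the system's actual code.

```python
import re

# Hypothetical, deliberately small stopword list for the example.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "with"}


def ngrams(text, n, stopwords=STOPWORDS):
    """Lower-case the text, drop stopwords and return its word n-grams."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stopwords]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    sentence = "Obesity is a risk factor for diabetes"
    for n in (1, 2, 3):
        print(n, ngrams(sentence, n))
```

Counting the resulting strings over the whole collection (for example with `collections.Counter`) would give term frequencies and document frequencies of the kind reported in Results.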


Topic: Text Mining and Information Extraction PI: TMIE03

NETWORK ANALYSES OF PHARMACEUTICAL RELEVANT PATENTS


Elias H C B1,2 , Lopes J C D2
1 Computer Science Department - UFMG
2 NEQUIM - Núcleo de Estudos de Quimioinformática - Chemistry Department - UFMG

Patent documents present a great amount of knowledge stored in a standardized way, with well-defined fields. Large patent databases present their textual data without any kind of analysis. Some authors have created complex networks from patent information through analyses of patent citations (patents cited by other patents) and classes (areas of knowledge that the patents are related to). One example of such analysis is the computation of generality and originality, which relate the classes of a patent to the classes of the patents cited. If a patent cites previous patents that belong to a narrow set of technologies, the originality score will be low, whereas citing patents in a wide range of fields renders a higher score. The generality is calculated in the same way but considering forward citations, that is, the classes of a patent and of the patents that cite it. In this work, the patents of pharmaceutical importance were obtained from the SciClips website (www.sciclips.com) and include 554 patents and 1288 recent patent applications (2009 and 2010). Only patents from the USPTO were considered. The data include patent applicant (owner), therapeutic targets, diseases and drug types. The original data from SciClips were manually curated in order to remove typing errors and redundancies. The patents analyzed cover 1279 drug targets, 50 patent applicants and 5636 diseases. From the data available about the pairs of entities of each kind, we constructed networks involving direct citation of target-owner, owner-disease and target-disease pairs. From the primary data we derived relationships between entities of the same kind; thus, from target or disease citations we could infer relationships between owners and create an owner-owner network. In the same fashion, target-target and disease-disease networks were obtained. However, one problem of this analysis is that the relationships between drug targets and diseases are not entirely trustworthy. 
The patents analyzed are related to compounds whose biological/pharmaceutical activity was determined by means of high throughput screening (HTS) using specific targets. Thus, in general the cited target is reliable, but many patents cite a high number of diseases that are not necessarily related to the target cited. Each patent cites a mean of 1.5 targets and 26 diseases. In order to overcome this drawback, we built an index that can be used to weight the target-disease relationship. When a specific patent cites many diseases, the weight of the relationship between each of these diseases and the target is divided by the total number of diseases cited. When several patents cite the same target-disease pair, their weights are summed. In this fashion, only the more trustworthy relationships will be used to construct the networks. At the moment we are trying to determine the best cutoff to discard less reliable relationships.
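
The weighting index described above can be written down directly: each patent contributes 1 divided by the number of diseases it cites to every target-disease pair it mentions, and contributions from different patents are added. The sketch below assumes a simple input format (a list of target lists and disease lists per patent) and uses made-up entity names; neither comes from the original data.

```python
from collections import defaultdict


def target_disease_weights(patents):
    """patents: iterable of (targets, diseases) pairs, one per patent.
    Returns {(target, disease): summed weight}."""
    weights = defaultdict(float)
    for targets, diseases in patents:
        if not diseases:
            continue
        share = 1.0 / len(diseases)  # a patent citing many diseases counts less
        for target in targets:
            for disease in diseases:
                weights[(target, disease)] += share
    return dict(weights)


if __name__ == "__main__":
    # Hypothetical example: two patents citing the same target.
    patents = [
        (["target A"], ["disease X", "disease Y"]),
        (["target A"], ["disease X"]),
    ]
    for pair, weight in sorted(target_disease_weights(patents).items()):
        print(pair, weight)
```

Pairs whose summed weight falls below a chosen cutoff would then be discarded before building the network; the cutoff itself is left open here, as it is in the abstract.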


Topic: Text Mining and Information Extraction PI: TMIE04

COMPUTATIONAL METHODS FOR PROCESSING AND ANALYZING HISTOPATHOLOGICAL IMAGES TO AID RESEARCH AND CLINICAL DIAGNOSIS
Melo M P1 , Felipe J C1
1 Department of Physics and Mathematics - Faculty of Philosophy, Sciences and Letters of Ribeirão Preto - University of São Paulo - Brazil

Introduction. Bioimage Informatics is a subarea of Bioinformatics that consists in the development and use of computational techniques for image analysis, through extraction, comparison, retrieval and management of biological knowledge related to the images. One category of medical images whose volume in digital format is beginning to grow is that of microscopic images, taken from slides containing tissue samples and independent cells. In the medical area of Pathology, images obtained from histological examination (extracted from biopsies) and cytological examination (cells extracted from a surface, without removal of tissue samples) are a valuable resource for detecting and monitoring diseases. Traditionally, the analysis of such images has been done visually, which makes the work hard and highly susceptible to variation from one expert to another. To assist experts in clinical diagnosis and also in research, there is a need to develop new techniques and adapt existing ones (mainly used in radiological images), due to the difficulty of applying conventional computational models to bioimages. The high density of interconnected cells and variations in the morphology and contrast of the objects of interest are some examples of these difficulties. Objectives. Within this scope, this work consists in the implementation and evaluation of appropriate image processing and analysis techniques, in order to improve them, along with the development of new approaches aiming to solve a set of problems involving the analysis of different tissues (histological images). We will need to map the perceptual parameters used by experts in the process of visual analysis through the microscope, for each proposed problem. Materials and Methods. 
Image databases related to the problems will be built, containing images of certain categories: liver, nerve fibers, muscle tissue, among others. All implementations will follow the object-oriented programming paradigm, using the C++ or Java programming languages and image processing libraries such as M-morph (containing routines for morphological image processing) and OpenCV, in computing environments such as ImageJ and Matlab. Different techniques will be evaluated in the context of this project in order to determine which methods produce the best results in the identification and classification of patterns. These include segmentation (which separates objects or significant regions of the image, such as shapes, lines and curves, aiming at a more precise identification of what is to be analyzed in the image), feature extraction (which depends on the application domain and is usually associated with intrinsic image descriptors such as texture, shape, color and distribution of structural elements) and registration (the transformation of different data sets, obtained by sampling the same scene or by acquiring objects at different moments, into a single coordinate system), among others. Conclusion. Experts and researchers increasingly request the automation of the processing and analysis of histopathological images for tasks such as locating abnormal patterns in a scanned image and classifying the patterns found. This proposal will thus contribute to these demands and also to the area of Bioimage Informatics.


Topic: Text Mining and Information Extraction PI: TMIE03

TEXTUAL RELATIONSHIPS BETWEEN MEDICAL REPORTS AND RADIOLOGY PAPERS


Dutra M B1 , Barbosa F2 , Macedo A A3
1 Inter-institutional Grad Program on Bioinformatics - USP
2 Center of Information and Analysis (CIA) - HC - FMRP - USP
3 Department of Physics and Mathematics - FFCLRP - USP

Introduction. Considering the large amount of medical data, information and knowledge, such as general patient information, clinical reports, gene expression data and others, techniques for processing and retrieving information from textual documents have been tackled by professionals from different areas. Relationships between different types of medical information may be exploited to support medical decisions. For instance, the scientific literature may be a potential source of knowledge for health professionals during patient care. Purpose. We intend to develop mechanisms to create relationships between medical reports and scientific articles. We are using documents from Radiology as a case study. Materials and Methods. We are relating scientific papers from SciELO (Scientific Electronic Library Online), especially from the Radiology area, to 2,000 medical reports. The papers report studies of three anatomical regions: abdomen (274 articles), breast (557 articles) and chest (461 articles). The medical reports were created by health professionals from the Hospital das Clínicas de Ribeirão Preto. Data preparation and processing (i) identified bigrams and word frequencies, (ii) removed stopwords, punctuation and special characters, (iii) used the vector model for the representation of the collection and (iv) created an index. Similarity between documents was calculated with the cosine similarity measure. Relevance feedback was exploited using the Rocchio algorithm. Discussion. We have generated 3163, 4101 and 2593 bigrams, respectively, for the abdomen, breast and chest collections. We are currently evaluating the resulting relationships considering relevance feedback from a physician. Precision and recall will be used to measure the accuracy and coverage of the proposal. 
Next, we intend to use linguistic resources such as MedDRA, MeSH and DeCS to semantically approximate our collections and tackle language barriers. We are handling medical information in Portuguese and SciELO papers in Portuguese, English and Spanish.
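
The two retrieval ingredients named in Materials and Methods, cosine similarity in the vector model and Rocchio relevance feedback, can be sketched as follows. The representation of documents as term-weight dictionaries, the function names and the coefficients alpha, beta and gamma (common textbook defaults) are assumptions for this illustration and do not come from the abstract.

```python
from collections import defaultdict
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query towards the centroid of relevant documents and away
    from the centroid of non-relevant ones; negative weights are dropped."""
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                updated[term] += coeff * weight / len(docs)
    return {term: w for term, w in updated.items() if w > 0}


if __name__ == "__main__":
    report = {"chest": 1.0}
    paper = {"chest": 1.0, "nodule": 1.0}
    print(cosine_similarity(report, paper))
    print(rocchio(report, relevant=[paper], non_relevant=[]))
```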


Topic: Text Mining and Information Extraction PI: TMIE05

VALIDATING CLUSTER OF GENE EXPRESSION BY MEANS OF DISCOVERY OF GENE INTERACTION FROM BIOGRID
Paula D S D1 , Macedo A A2
1 Interunit in Bioinformatics - UNIVERSITY OF SÃO PAULO
2 Department of Physics and Mathematics - FFCLRP, UNIVERSITY OF SÃO PAULO

Modern techniques for obtaining biological data have created opportunities for the development of scientific investigations in the biological sciences. Trying to understand complex biological mechanisms, scientists all over the world have generated a large amount of gene expression data. However, there is a gap between the ability to produce data and the ability to extract knowledge from that data. The use of clustering techniques may support the discovery of biological mechanisms from gene expression data. In clustering, data are usually separated into groups according to some measure of similarity. Nevertheless, clustering algorithms are sensitive to many factors, for example, the data set and the metrics used. These factors may influence the quality of the resulting clusters. Thus, it is necessary to evaluate whether a cluster has arisen by chance. Specialists therefore need to analyze and validate whether the clusters reflect biological significance. Considering the large amount of knowledge in scientific papers, it is hard to find information of interest. There are many initiatives focused on organizing information from huge repositories. BioGRID (Biological General Repository for Interaction Datasets) was developed to store and distribute collections of gene interactions. The vast amount of literature available also stimulates interest in automatic text summarization from one or more sources of information. We are developing mechanisms to support the validation of clustering algorithms when they are applied to gene expression data. First, we created a database to store information from BioGRID. Afterwards, we exploited the gene expression datasets in the Gene Expression Omnibus (GEO), available in SOFT file format. To run clustering algorithms in WEKA, we converted the GEO datasets to ARFF (Attribute-Relation File Format) files. 
Next, we queried the database with information from BioGRID in order to search for genetics research documents pertaining to genes of the same cluster. We have also applied shallow approaches to the automatic summarization of the retrieved articles. We intend to support cluster validation and the definition of the meaning of the identified gene expression clusters. When the scientific literature indicates relationships between genes in a cluster according to some biological significance, it is possible to infer that the cluster was not randomly defined. Consequently, we are able to add information to clusters. For a cluster with biological significance, we also intend to investigate its other relationships not reported in the literature. Invalid or undiscovered relationships may indicate problems with the clustering algorithms. Consequently, inferences of relationships need to be analyzed very carefully. Our work may also contribute to the definition of genetic networks. Supported by: CNPq.
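
The idea of checking a cluster against known interactions can be illustrated with a small function that counts how many gene pairs of a cluster are recorded as interacting. This is a sketch under assumptions: the interaction list stands in for what would be read from the BioGRID-derived database, and the gene names and function name are hypothetical.

```python
from itertools import combinations


def supported_pairs(cluster, interactions):
    """Return the gene pairs of a cluster that appear in a list of known
    interactions, plus the fraction of all cluster pairs they represent."""
    known = {frozenset(pair) for pair in interactions}
    pairs = [frozenset(p) for p in combinations(sorted(set(cluster)), 2)]
    hits = [tuple(sorted(p)) for p in pairs if p in known]
    fraction = len(hits) / len(pairs) if pairs else 0.0
    return hits, fraction


if __name__ == "__main__":
    # Hypothetical cluster and interaction list.
    cluster = ["geneA", "geneB", "geneC"]
    interactions = [("geneB", "geneA"), ("geneC", "geneD")]
    print(supported_pairs(cluster, interactions))
```

A high fraction suggests that the cluster is unlikely to be random; how high it must be, and how to compare it with chance, is not specified in the abstract and is left open here.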


Topic: Text Mining and Information Extraction PI: TMIE04

ENRICHING KEGG PATHWAY WITH PESCADOR TEXT MINING TOOL - COLORECTAL CANCER PATHWAY AS A MODEL
Stussi F1 , Barbosa-Silva A2 , Donnard E R1 , Fernandes G R1 , Guedes R L M1 , Ortega J M1
1 Departamento de Bioquímica e Imunologia - ICB, UFMG
2 Biology and Data Mining Group - Max-Delbrück Center for Mol. Medicine

The KEGG Pathway database was initiated in 1995 with biochemical pathways. Today, the database covers several biological phenomena. The propagation of information to all organisms comprised by the database allows its use in automated analyses of newly sequenced genomes. To address the possibility of enriching the database with relationships to additional gene products and with sequences from more species, we set out to enrich the Colorectal Cancer pathway, comprised of 40 gene entries which have been mapped in 299 species, adding up to a total of 3012 amino acid sequences. Using the text mining tool PESCADOR, we processed a total of 29208 abstracts sorted by significance with the software PubMed Ranker. PESCADOR labeled 3492 genes involved in 91635 biointeractions. Manual inspection allowed for the addition of 42 genes to the original KEGG Pathway, establishing 92 new biointeractions. The amino acid sequences comprising the database were captured with UEKO, an enrichment of the KEGG Orthology database conducted by our group. After using UEKO, the sequence information related to this pathway reached 5622 sequences, a 1.9-fold enlargement. They belong to 944 species. Automated analysis of 81814 ESTs from public colorectal tumor libraries produced 2394 hits, as compared to 2055 hits using the original KEGG Pathway. Thus, the text mining tool PESCADOR and the UEKO database result in a significant enrichment of information on the adenocarcinoma pathway, as present in KEGG Pathway. Supported by: FAPEMIG


Topic: Text Mining and Information Extraction PI: TMIE04

REDUCING SEMANTIC GAP BETWEEN HISTOLOGY IMAGES AND PATHOLOGY DIAGNOSIS IN THYROID CANCER USING SECONDARY DATA
Pessotti H C1 , Murta-Junior L O1 , Soares E G2 , Moriel A R3 , Macedo A A1
1 Department of Physics and Mathematics - FFCLRP, USP
2 Department of Pathology - FMRP, USP
3 Institute of Anatomy Pathology and Cytopathology - São José do Rio Preto, SP, Brazil

The development of computer systems for microscopic imaging is a challenge for medical informatics professionals. Leading researchers have developed new techniques of image processing, visualization and text mining aiming to extract knowledge from medical documents and data. Computational technologies may give some support to physicians, for example, when they are establishing a diagnosis through image analysis. Computer systems are able to provide additional information and consequently reduce the time and effort of health professionals. Traditionally, Computer Aided Diagnosis (CAD) systems are based on image processing and Content-Based Image Retrieval (CBIR) methods that focus on image attributes. During microscopic image analysis by healthcare professionals, the specification of image attributes may not offer substantial help for the retrieval process because the user may not be familiar with the chosen attributes, especially the low-level ones. The use of semantic-level information may improve the accuracy of image retrieval, helping users to specify their queries and understand results. We propose a theoretical and practical study of Information Retrieval (IR) and linguistic resources (such as medical ontologies and thesauri) to define a conceptual mapping between the contents of medical images and textual information from pathologic exams. Usually, pathologic exams contain the description of the associated microscopic image in terms of its cellular components and a final diagnosis, which synthesizes the pathologist's impressions after image analysis. The cellular description can be used as auxiliary data to create a mapping between the latent semantic content of pathologic images and the diagnosis. 
After identifying and labeling each cell of a microscopic image, the labels can be expanded by a thesaurus and specialized in the thyroid domain using an ontology to dene the query terms. These terms can be used to lter a set of pathologic exams describing similar cellular components. IR techniques were applied to weight representative strings found from the collection. As a result, a mapping can be established using the exam description to provide relevant information to give support to pathologists. Initial experiments were conducted with 25 pathologic exams, divided into two groups: a training set (with 20 exams) and a validation set (with 5 exams). Five validation exams were submitted to the system, returning seven exams (threshold = 0.5) with an average of 1.4 exams retrieved per query. The most frequent terms were colloid nodule, medullary carcinoma, and papillary microcarcinoma. Our current collection is being expanded to allow more precise evaluation. Our proposal expects to reduce the semantic gap between the computerized medical image retrieval and its human interpretation. A semantic mapping may generate knowledge and assist pathologists. Other cases of study can be performed in the context of electronic learning tools, surveillance mechanisms and CAD systems. Supported by FAPESP.
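
The retrieval step described above (weighting terms found in exam descriptions and returning the exams whose score passes a threshold) can be illustrated with a minimal sketch. It assumes TF-IDF weighting with cosine similarity, which the abstract does not specify. The exam texts, the query and all function names are hypothetical; only the threshold of 0.5 is taken from the abstract.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document, plus the IDF table.

    Smoothed TF-IDF is an assumed weighting scheme, not the authors' stated method.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_terms, exams, threshold=0.5):
    """Return indices of exams whose similarity to the query is >= threshold."""
    docs = [text.lower().split() for text in exams]
    vectors, idf = tfidf_vectors(docs)
    query = {t: c * idf.get(t, 0.0) for t, c in Counter(query_terms).items()}
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors)]
    return [i for score, i in sorted(scored, reverse=True) if score >= threshold]

# Hypothetical exam descriptions, for illustration only.
EXAMS = [
    "colloid nodule with follicular cells",
    "papillary microcarcinoma with nuclear grooves",
    "medullary carcinoma with amyloid stroma",
]
hits = retrieve(["papillary", "microcarcinoma"], EXAMS)
```

In this toy collection only the second exam shares terms with the query, so it is the only one returned. In the setting of the abstract, the query terms would come from the cell labels after thesaurus expansion and ontology-based specialization.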


Topic: Text Mining and Information Extraction PI: TMIE05

GO-SIEVE - A METHOD TO AID THE ASSIGNMENT OF EVIDENCE CODES IN GENOME ANNOTATIONS


Folador E L1 , Maluceli A2 , Cruz L M3 , Madeira H M F2
1 Centro de Pesquisa (CPQ), Laboratório de Bioinformática e Biologia Computacional, Instituto Nacional do Câncer (INCA), Rio de Janeiro, Brazil
2 Health Technology Graduate Program (PPGTS), Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba, Brazil
3 Biochemistry Dept., Universidade Federal do Estado do Paraná (UFPR), Curitiba, Brazil

Automatic annotation has made an essential contribution to the processing of the huge amount of data generated by genome sequencing projects. In the last few years, the use of GO terms and evidence codes has greatly contributed to standardizing genome annotations. However, many older annotations were made available before the widespread use of ontologies, when the use of an evidence code establishing how the annotation was obtained was rare. With that in mind, we developed a method that aids the annotator in the assignment of evidence codes based on the results obtained from sequence similarity. To achieve that, keywords arbitrarily associated with evidence codes were searched for in the titles and abstracts of PubMed files linked to BLAST results. The previously annotated genome of Chromobacterium violaceum, available in the SABIA annotation pipeline, was used to validate the evidence code assignment. From the 4,431 annotated genes of C. violaceum, a total of 181,098 files linked to sequence alignment BLAST results were found. Based on the presence of keywords in those files, 121,115 evidence codes were assigned. Most of the files were redundant, i.e., were associated with more than one gene. Only 2,517 distinct PubMed research papers (title and abstract) were retrieved by GO-SIEVe, revealing that current functional annotation is essentially based on the expansion of the relatively small number of experimentally determined functions to large collections of proteins. The assignments of evidence codes to the unique PubMed files were validated by curators, and 82.1% of the assignments were considered appropriate.
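
The keyword search described above can be sketched as a simple lookup from keywords to GO evidence codes. The abstract does not list the keywords used by GO-SIEVe, so the table below is entirely hypothetical, as are the function name and the example title and abstract; only the evidence code abbreviations (IDA, IMP, IEP, ISS) are standard GO codes.

```python
# Hypothetical keyword table: the real keyword lists are not given in the abstract.
KEYWORDS = {
    "IDA": ["enzyme assay", "purified protein"],     # Inferred from Direct Assay
    "IMP": ["mutant", "knockout"],                    # Inferred from Mutant Phenotype
    "IEP": ["expression pattern", "northern blot"],   # Inferred from Expression Pattern
    "ISS": ["sequence similarity", "homolog"],        # Inferred from Sequence Similarity
}

def suggest_evidence_codes(title, abstract):
    """Return the evidence codes whose keywords occur in the title or abstract."""
    text = f"{title} {abstract}".lower()
    return sorted(
        code for code, words in KEYWORDS.items()
        if any(word in text for word in words)
    )

# Made-up PubMed record, for illustration only.
codes = suggest_evidence_codes(
    "Characterization of a pigment biosynthesis mutant",
    "Activity was confirmed by enzyme assay of the purified protein.",
)
```

A suggestion produced this way is only a hint for the annotator. As the abstract notes, the assignments still need to be validated by curators.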


Topic: Text Mining and Information Extraction PI: TMIE05

PANATI AND WEBPANATI - INFORMATION SYSTEMS FOR SNPS


Wright M H2 , Narciso M G1 , Declerck G2 , McCouch S2
1 Embrapa Arroz e Feijão
2 Cornell University

This paper describes two software tools: Panati and webPanati. Panati reads data from the Illumina Genome Analyzer (Solexa) and reports SNPs in a Unix environment. These data are displayed on the web by webPanati, and the user can take the webPanati output for further SNP analysis as needed. The current data are from rice, but Panati and webPanati can be used to obtain SNPs from other crops. Panati and webPanati are free software and are easy to install and use, and webPanati produces files that can be used as input to the Flapjack system for SNP analysis.


Topic: Text Mining and Information Extraction PI: TMIE04

CAN THE VECTOR SPACE MODEL BE USED TO IDENTIFY BIOLOGICAL ENTITY ACTIVITIES?
Maciel W D1 , Faria-Campos A C1 , Gonçalves M A1 , Campos S V A1
1 Universidade Federal de Minas Gerais

Biological systems are commonly described as networks of entity interactions. In these networks, entities of different types or categories, such as diseases and drugs, interact with each other and give rise to important activities. Some activities are already known and form part of the current knowledge in the life sciences. Other activities remain unknown for long periods of time and are frequently discovered by chance. In this work we present a model to predict these unknown activities from a textual collection using the vector space model (VSM), a well-known and established information retrieval model. We have extended the VSM's ability to retrieve information using a transitive closure approach that is able to infer new interactions from known ones. Our objective is to identify the known biological entity activities from the literature and construct a network of entity interactions. Based on the interactions established in the network, our model applies this transitive closure in order to predict and rank new entity interactions. We have tested and validated our model using a collection of patent claims. From 266,528 possible interactions in our network, the model found 1,027 known ones throughout the textual collection and predicted 3,195 new interactions. Iterating the model according to patent issue dates, we have observed that new interactions found in a given past year were confirmed by patent claims not in the collection and issued in more recent years. More importantly, our model has provided a means of ranking the interactions based on the similarity values derived from the VSM. We have observed that most confirmation patents were found among the top 100 new interactions obtained. We have also found papers on the Web which confirm new inferred interactions. For instance, the best new interaction inferred by our model relates the adrenaline neurotransmitter to the androgen receptor gene.
We have found a paper that confirms this interaction and reports the partial dependence of the antiapoptotic effect of adrenaline on the androgen receptor. The VSM extended with a transitive closure approach provides a good way to identify biological activities from textual collections. The extended VSM helps to identify and rank relevant new interactions even if these interactions occur in only a few documents of the collection. Consequently, we have found a good way of restricting the best potential results to consider in order to foster new advances in the life sciences, even when indications of these results are not easily extracted from a mass of documents. Supported by: CAPES
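
The inference step described above can be sketched as a single pass of transitive closure over scored interactions. This is a minimal illustration, not the authors' implementation: it assumes that each known interaction already carries a VSM similarity value, and that a new pair (a, c) is scored by the best product of the two known similarities along a path a - b - c. The scoring rule, the entity names and the function name are all assumptions.

```python
from itertools import combinations

def infer_interactions(known):
    """known: {(a, b): similarity}. Returns new pairs ranked by inferred score.

    Assumption: a new pair (a, c) is scored by the highest product of the
    similarities of two known interactions a - b and b - c that share b.
    """
    neighbours = {}
    for (a, b), sim in known.items():
        neighbours.setdefault(a, {})[b] = sim
        neighbours.setdefault(b, {})[a] = sim
    known_pairs = {frozenset(pair) for pair in known}

    predicted = {}
    for adj in neighbours.values():
        # Every two neighbours of the same entity form a candidate interaction.
        for a, c in combinations(sorted(adj), 2):
            if frozenset((a, c)) in known_pairs:
                continue
            score = adj[a] * adj[c]
            predicted[(a, c)] = max(score, predicted.get((a, c), 0.0))
    return sorted(predicted.items(), key=lambda item: item[1], reverse=True)

# Hypothetical entities and similarity values.
ranked = infer_interactions({("A", "B"): 0.9, ("B", "C"): 0.8, ("C", "D"): 0.5})
```

Here A - C is inferred through B and ranks above B - D, which is inferred through C. Repeating the pass with the predicted pairs added to the known ones would extend the closure to longer paths.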


TOPIC 6 SYSTEMS BIOLOGY AND NETWORKS


Topic: Systems Biology and Networks PI: SBN01

A CELL-BASED MODEL FRAMEWORK FOR GRAPHICS PROCESSING UNITS


Tamulonis C1 , Postma M2 , Kaandorp J1
1 Section for Computational Science - Universiteit van Amsterdam
2 Swammerdam Institute for Life Sciences - Universiteit van Amsterdam

Cell-based models represent biological systems on the level of individual cells. We have developed a simple cell-based model framework that allows us to represent complex cell shapes and simulate contact interactions between cells, such as collisions between adjacent cells and cell-cell adhesion. Cells are also allowed to move freely and are not restricted to dense aggregates or lattices. The framework is appropriate for running simulations on graphics processing units (GPUs), allowing for fast running times and automatic scalability. We have successfully applied the framework to two different biological systems: gliding cyanobacteria and embryogenesis. Cyanobacteria are a very large and diverse phylum of prokaryotes that perform oxygen-releasing photosynthesis. They require sufficient light to perform photosynthesis, but are also easily damaged by intense radiation, such as direct sunlight. To optimize their light exposure, many cyanobacteria species employ some form of photomovement that guides the microorganisms to optimal light conditions. For example, in a lab, cyanobacteria can be induced to spread very accurately over a slide projection such that their exposure is increased, leaving the photograph imprinted onto the Petri dish. Using our framework, we created a detailed model of cyanobacteria to perform large-scale simulations of photomovement. Using the model, we were able to recreate the photograph experiment described above and demonstrate how a simple individual strategy can effectively redistribute a large population. We found that typical characteristics of cyanobacteria trichomes, such as large length, fast gliding speed and high photosensitivity, all contribute to optimizing light exposure, suggesting that these organisms are highly tuned for this task. We have also used the framework to create a model of eukaryotic cells that allows for arbitrary cell shapes represented as large polygons.
The model was applied to the embryogenesis of Nematostella vectensis, specifically the gastrulation process, during which the embryo is transformed from a simple mono-layered sphere into a bi-layered gastrula with an irregular shape. Using our model, we were able to simulate gastrulation and associate emergent macroscopic changes in embryo shape with individual cell behaviors. We have developed a number of testable hypotheses based on the model. First, we hypothesize that the cells need to be stiffer at their apical ends, relative to the rest of the cell perimeter, in order to be able to hold the dimensions of the blastula, regardless of whether the blastula is sealed or leaky. We also postulate that bottle cells are a consequence of cell strain and low cell-cell adhesion, and can be produced within an epithelium even without apical constriction. Finally, based on parameter variation studies, we postulate that apical constriction, filopodia and de-epithelialization are necessary and sufficient for gastrulation. Supported by: the Foundation for Science and Technology (FCT, Portugal), the MORPHEX project (EU) and the Netherlands Organisation for Scientific Research (NWO, Netherlands).


Topic: Systems Biology and Networks PI: SBN04

SIMULATING RNA POLYMERASE MOBILITY USING A STOCHASTIC KINETIC MODEL AND COMPARISON WITH AGAROSE GEL ELECTROPHORESIS
Costa P R1 , Castriota M1 , Lemke N1
1 Instituto de Biociências de Botucatu - Unesp

The transcription process by RNA polymerase II (RNAPII) is one of the most exquisitely controlled processes in the cell. Several techniques have been used to study the entire process, such as atomic force microscopy, single-molecule fluorescence, and methods that track the motions of tiny particles attached to RNAPII and DNA molecules. The data obtained from these experiments reveal the complexity inherent in the process, and theoretical models have been proposed to simulate it. In this work we propose the creation and simulation, using the Gillespie algorithm, of a stochastic kinetic model for the movement of RNAPII along a DNA strand. For this purpose, we have developed a package for Mathematica that generates the expected banding pattern of electrophoretic mobility of RNA in agarose gel (EMRAG) corresponding to RNA fragments yielded from a DNA strand bearing intrinsic RNAPII pauses. This package can also process the image of the banding pattern obtained from a real EMRAG to create a dwell-time histogram of the RNAPII elongation. In this way, by comparing the real dwell-time histogram with the simulated one, we can determine the parameter values needed to refine our model. In the future, we intend to compare our results using these parameters with other results in the literature.
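
As an illustration of the kind of simulation described above, the following minimal sketch applies the Gillespie algorithm to a single polymerase that steps along a template and may enter a paused state at designated sites. The authors' package is written for Mathematica and its kinetic scheme is not given in the abstract, so the three-reaction scheme (step, pause, escape), the rate constants and all names are assumptions.

```python
import random

def simulate_rnap(length, pause_sites, k_forward=10.0, k_pause=2.0,
                  k_escape=0.5, rng=None):
    """Gillespie simulation of one polymerase; returns the dwell time per position.

    All rate constants (in 1/s) are illustrative assumptions.
    """
    rng = rng or random.Random()
    dwell = [0.0] * length
    pos, paused = 0, False
    while pos < length:
        # List the reactions available in the current state with their rates.
        if paused:
            reactions = [("escape", k_escape)]
        else:
            reactions = [("step", k_forward)]
            if pos in pause_sites:
                reactions.append(("pause", k_pause))
        total = sum(rate for _, rate in reactions)
        # Waiting time is exponential with the total propensity as its rate.
        dwell[pos] += rng.expovariate(total)
        # Choose one reaction with probability proportional to its rate.
        pick = rng.random() * total
        for name, rate in reactions:
            if pick < rate:
                break
            pick -= rate
        if name == "step":
            pos += 1
        elif name == "pause":
            paused = True
        else:
            paused = False
    return dwell

dwell_times = simulate_rnap(50, pause_sites={10, 30}, rng=random.Random(1))
```

Averaging the dwell times over many runs gives a simulated dwell-time histogram, which is the quantity the abstract proposes to compare with the histogram derived from a real gel image.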


Topic: Systems Biology and Networks PI: SBN03

THE QUANTUM-COHERENT HYPOTHESIS FOR SIGNALING AND REPAIR OF DNA DAMAGE


Martínez A1 , Arruda-Neto J D T1,2 , Rodrigues T E1
1 Linear Accelerator Laboratory, Physics Institute, University of São Paulo, Brazil
2 FESP, São Paulo Engineering College, São Paulo, Brazil

The mechanisms by which eukaryotic cells sense DNA strand breaks remain to be elucidated. The fast induction of ATM kinase activity immediately after exposure to ionizing radiation suggests that it acts at an early stage of signal transduction. Published data (Nature 421 (2003), p. 499) indicate that ATM activation is not dependent on direct binding to strand breaks, but may result from changes in the chromatin structure. However, many fundamental questions were left unanswered (Nature 421 (2003), p. 486), in particular: what is so specific about DNA breaks vis-à-vis chromatin changes? How does ATM directly sense structural disruption in relaxed chromatin? What factors determine the impressive speed and extent of the ATM response? Here we propose a signaling mechanism for the activation of the ATM kinase and its recognition of a few DNA breaks within the entire genome, which also addresses the questions mentioned above. The model allows the calculation of an electromagnetic transient, produced by a coherent electron hopping current generated at the damage site, which is able to induce long-range activation of ATM almost instantly. Another quantum-coherent effect, known as the "free water dipole laser", is incorporated into the model and explains how ATM finds its navigation route to sense and reach the DNA damage site.


Topic: Systems Biology and Networks PI: SBN04

A SCENARIO SIMULATION SYSTEM OF FOOT-AND-MOUTH DISEASE SPREAD IN MINAS GERAIS BRAZIL


Silva M M R D1 , Veloso C J M1
1 Pontifícia Universidade Católica de Minas Gerais

Foot-and-mouth disease (FMD) is considered one of the most important infectious diseases of livestock because of the devastating economic consequences that it inflicts on affected regions. The values of critical parameters that affect the transmission rate of the FMD virus (FMDV), such as the duration of the latency and infectious periods, are believed to be influenced by characteristics of the host and the virus. Disease control and surveillance strategies, as well as FMD simulation models, will benefit from improved parameter estimation. The objective of this study was to simulate the impact of a potential incursion of FMDV on the livestock of counties in Minas Gerais, Brazil. The study took into account the distributions of variables associated with the duration of the latency, subclinical, incubation, and infectiousness periods of FMDV transmission, and the livestock density of each area, with an estimated number of cattle on feed based on IMA and MAPA data. A double independent, systematic review of retrieved publications reporting results from experimental trials was performed to extract individual values related to FMDV transmission. Probability density functions were fitted to the data, and a set of regression models was used to identify factors associated with the assessed parameters. A stochastic, spatial simulation model was adopted to simulate scenarios for planning and decision-making. Our scenarios simulated different herd types in an integrated way using the current mitigation strategies. Under our assumptions about the availability of resources to manage an outbreak, median epidemic lengths in the scenarios with the estimated herd ranged from 1 to 5 years.
These results will contribute to the improvement of disease control and surveillance strategies and of the stochastic models used to simulate FMD spread and, ultimately, to the development of cost-effective plans to prevent and control the potential spread of the disease in Minas Gerais. Supported by FIP-PUC Minas and FAPEMIG


Topic: Systems Biology and Networks PI: SBN03

GENOME MAINTENANCE PATHWAYS ALTERATIONS IN HUMAN GENETIC DISEASES AVAILABLE IN THE NCBI/GEO
Simão E M2 , Mombach J C M2
2 Department of Physics - Universidade Federal de Santa Maria (UFSM)

Genome instability originates from either somatic mutations, observed in the majority of sporadic cancers, or germline mutations, associated with rare hereditary cancer syndromes. Genome integrity, cell proliferation and survival are regulated by an intricate network of pathways that include DNA repair and recombination, cell cycle and programmed cell death. The loss of genome maintenance mechanisms (GMM) is one of the most important aspects of human carcinogenesis [1]. We developed the Ontocancro database to be a source of information on transcriptomics and interactomics data involved in comprehensive GMM pathways, including apoptosis, repair, chromosome stability, cell cycle and many others [1, 2]. With the purpose of differentiating and characterizing the functionality of GMM pathways in cancer, its development and other diseases, we analyzed 32 samples of cancers, genetic syndromes and other genetically related diseases. The transcriptomic data were obtained from the Gene Expression Omnibus (GEO) [3], studied using the tool ViaComplex [4], and related through a graph using a statistical test that identifies significant alterations in pathway activities. The analysis identified several altered pathways. For most of the cancers we found increased expression activity in the cell cycle pathway, very few alterations in the apoptosis pathway and random alterations in the repair pathways, implying that these tissues have highly altered cell cycles, scattered alterations in repair and less commonly altered programmed cell death. In precancerous tissues, such as colon and adrenal adenoma, we found a general increase in all repair and cell cycle pathways. The results indicate that these tissues have increased activity in GMM pathways, especially in samples with a high index of chromosome instability [5]. We found no evidence of alterations in apoptosis, suggesting that this pathway is unaltered in this type of disease.
In genetic syndromes we found, in general, decreased expression in some pathways and, in the case of syndromes accompanied by chromosomal instability, such as Werner, Bloom and Fanconi, a substantially higher number of pathways with decreased expression. The results strongly suggest that the GMM pathways are affected and can be a useful tool for differentiating between adenoma, cancer, syndromes and other genetically related diseases. Supported by: CAPES, FAPERGS and CNPq

REFERENCES
[1] Éder M. Simão, et al., Modeling the Human Genome Maintenance Network, Physica A 389 (2010) 4188-4194.
[2] Giovane R. Librelotto, et al., An Ontology to Integrate Transcriptomics and Interatomics Data involved in Gene Pathways of Genome Stability, BSB 2009, LNBI 5676 (2009) 164-167.
[3] Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
[4] Mauro A. A. Castro, et al., ViaComplex: Software for Landscape Analysis of Gene Expression Networks in Genomic Context, Bioinformatics 11 (2009) 1468-1479.
[5] Charles E. Jefford, et al., Mechanisms of Chromosome Instability in Cancer, Oncology Hematology 56 (2006) 1-14.


Topic: Systems Biology and Networks PI: SBN04

A TOOLBOX FOR FLUX BALANCEABILITY AND CALCULABILITY IN METABOLIC FLUX ANALYSIS


Campos J D O3 , Oliveira I L1,2,3
1 FISIOCOMP - Laboratório de Fisiologia Computacional - UFJF
2 NUBIO - Núcleo de Bioinformática - UFJF-EGL (EMBRAPA Gado de Leite)
3 DCC - Departamento de Ciência da Computação - UFJF

Metabolic flux analysis (MFA) has many biotechnological applications and has turned out to be a powerful computational tool. MFA allows one to determine some of the reaction rates that are not accessible to measurement. A problem in MFA is to find a set of rates that have to be measured in order to uniquely calculate the other rates. This is also useful for finding rates that must be measured or given additionally to make a currently non-observable rate uniquely calculable. Besides calculability (determinacy), redundancy is a second important criterion for classification. A metabolic system is redundant if the non-measured flux matrix contains dependent rows. In such cases, balanceable rates exist, which may be balanced by the redundancy matrix. The redundant information can also be used for a consistency check of the measurements and/or of the model. We developed a set of MATLAB routines (a toolbox) that carry out the metabolic system classification described above. Supported by: FAPEMIG
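
The two classification criteria described above can be checked with a rank computation on the columns of the stoichiometric matrix that belong to the non-measured rates. The sketch below is written in Python rather than MATLAB and is not the authors' toolbox. It assumes the usual steady-state formulation: the system is determined when the rank of the non-measured matrix equals the number of non-measured rates, and redundant when that rank is smaller than the number of rows (metabolites). The example network and all names are hypothetical.

```python
from fractions import Fraction

def rank(matrix):
    """Matrix rank by Gaussian elimination over exact fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    r = 0
    for col in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def classify(stoich, measured):
    """stoich: metabolites x reactions matrix; measured: set of reaction indices."""
    unmeasured = [j for j in range(len(stoich[0])) if j not in measured]
    n_u = [[row[j] for j in unmeasured] for row in stoich]
    rk = rank(n_u)
    return {
        "determined": rk == len(unmeasured),  # every non-measured rate is calculable
        "redundant": rk < len(stoich),        # dependent rows, so balanceable rates exist
    }

# Hypothetical linear pathway: r0 produces A, r1 converts A to B, r2 consumes B.
S = [
    [1, -1, 0],   # metabolite A
    [0, 1, -1],   # metabolite B
]
only_uptake = classify(S, measured={0})
uptake_and_output = classify(S, measured={0, 2})
```

Measuring only r0 makes the system determined but not redundant. Measuring r0 and r2 leaves a single unknown rate constrained by two balances, so the system is determined and redundant, and the two measurements can be checked against each other.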


Topic: Systems Biology and Networks PI: SBN05

TRANSCRIPTIONAL NETWORKS RECONSTRUCTION: IDENTIFICATION OF GENES INVOLVED ON CATTLE RESPONSE TO TICK RHIPICEPHALUS (BOOPHILUS) MICROPLUS INFESTATION
Giachetto P F1 , Yamagishi M E B1 , Santos E H D1 , Ibelli A M G2,3 , Regitano L C A2
1 Laboratório de Bioinformática Aplicada - Embrapa Informática Agropecuária
2 Laboratório de Biotecnologia Animal - Embrapa Pecuária Sudeste
3 Departamento de Genética e Evolução - UFSCAR

In tropical countries, losses caused by tick infestation in cattle have a great impact on animal production systems. Reductions in weight and feed conversion, together with diseases transmitted by the parasite, are some of the problems that lead to economic losses of billions of dollars a year. In general, Bos taurus indicus cattle are less susceptible to infestation with Rhipicephalus (Boophilus) microplus than Bos taurus taurus cattle, but the immunological basis of this difference is not understood. Since we are interested in finding genes that may be involved in the mechanism of bovine response to ticks for use in animal breeding, we investigated transcriptional networks in the response of these different cattle genotypes to tick infestation. Recent studies show that co-expression networks can be used to identify a set of candidate genes underlying specific phenotypes, and some gene co-expression network methods have been successfully applied in a variety of studies. In this study, Weighted Gene Co-expression Network Analysis (WGCNA) was applied to microarray expression data. This systems biology analysis method starts with the identification of modules of genes based on patterns of gene co-expression, defined as sets of highly correlated (connected) genes, which may represent molecular networks involved in a common biological pathway. Genes highly connected within these modules are thought to drive the group and are considered to be hub genes. Skin samples were collected from bovines of different genotypes before (BI) and after (AI) artificial tick infestation, and mRNA was used for GeneChip Bovine Genome Array hybridization. Microarray data were processed using the affy/Bioconductor software package. We followed a general framework for constructing gene co-expression networks and used the WGCNA R package.
The power adjacency function was applied to the co-expression measurement, the absolute Pearson correlation coefficient, to derive the adjacency matrix; we used a soft thresholding approach by raising each correlation to a fixed power (β = 6). Modules were defined using the dynamic hybrid tree cutting algorithm of the dynamicTreeCut R package. Our analysis identified 8 modules. Each of the modules was labeled with a unique color as an identifier and characterized for enrichment of functionally related genes. Interesting modules were defined as those enriched with genes involved in immune response and containing differentially expressed genes (DEG), which were identified separately in each module. The blue module (n = 220 genes) was enriched for genes belonging to the Chemokine signaling pathway, Focal adhesion and Cell adhesion molecules pathways, and had the greatest number of DEG. These DEG, together with the hub genes inside the blue module, are candidate genes selected for further studies aimed at understanding the mechanisms involved in tick tolerance in cattle. Supported by: Embrapa, CNPq.
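
The soft-thresholding step described above amounts to computing the adjacency a_ij = |cor(x_i, x_j)|^β with β = 6. The sketch below shows that calculation in plain Python rather than with the WGCNA R package used in the study. The expression values and gene names are hypothetical, and the connectivity function (the sum of a gene's adjacencies) is included only as one simple way of ranking hub candidates.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def adjacency(expr, beta=6):
    """expr: {gene: [expression per sample]}. Returns {(g1, g2): |cor|^beta}."""
    genes = sorted(expr)
    return {
        (g, h): abs(pearson(expr[g], expr[h])) ** beta
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }

def connectivity(adj, gene):
    """Sum of a gene's adjacencies to all other genes."""
    return sum(a for pair, a in adj.items() if gene in pair)

# Hypothetical expression profiles over four samples.
EXPR = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with g1
    "g3": [1.0, 3.0, 2.0, 4.0],   # correlation 0.8 with g1
}
adj = adjacency(EXPR)
```

Raising the correlation to the sixth power leaves the perfectly correlated pair at 1 but reduces a correlation of 0.8 to about 0.26. This shows how soft thresholding down-weights moderate correlations without discarding them outright.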


Topic: Systems Biology and Networks PI: SBN05

SKN-1 KNOCKDOWN REDUCES POLYGLUTAMINE AGGREGATION IN THE MODEL ORGANISM CAENORHABDITIS ELEGANS
Pereira V S1 , Paula I T B R2 , Oliveira R P1,2
1 Núcleo de Pesquisa em Ciências Biológicas - UFOP
2 Departamento de Biodiversidade, Evolução e Meio Ambiente - UFOP

Proteins with abnormal expansions of polyglutamine (PolyQ) repeats are inclined to misfold and deposit as aggregates and inclusions, causing neurodegenerative disorders such as Huntington's disease. The molecular pathways that regulate longevity also play a major role in the protein aggregation and pathogenesis of polyglutamine diseases. Recently, SKN-1 has been described as a novel transcription factor required in C. elegans for oxidative stress resistance and for promoting longevity under normal, reduced insulin signaling, and caloric restriction conditions. The analysis of the transcriptome regulated by SKN-1 under normal conditions showed that it regulates numerous detoxification genes, along with other genes related to stress defenses. We identified 40 upregulated and 6 downregulated genes involved in protein folding or degradation, some of which have lysosomal and proteasomal functions. The inspection of 3 Kb within their putative promoters showed that 56% of the SKN-1-regulated genes (26/46) contained three to ten copies of the canonical SKN-1 binding site (WWTRTCAT), indicating that SKN-1 may directly control the expression of many of the SKN-1-regulated genes involved in protein degradation. Interestingly, 13 of the SKN-1-upregulated genes have been previously identified as able to prevent protein aggregate formation in a transgenic line carrying 35 polyglutamine repeats (Q35) fused to Yellow Fluorescent Protein (YFP). Next, we examined the number of aggregates in transgenic animals carrying 44, 64 and 82 PolyQ repeats under the control of the intestinal promoter vha-6 (vha-6p::Q44::YFP, vha-6p::Q64::YFP and vha-6p::Q82::YFP) treated with RNA interference (RNAi) against skn-1. Our results revealed that the number of aggregates in the intestine of Q44::YFP, Q64::YFP and Q82::YFP animals is significantly reduced when skn-1 is suppressed in adult animals compared to animals on control RNAi.
Although the number of SKN-1-upregulated genes found to be involved in protein folding or degradation is higher than the number downregulated, our results suggest that skn-1 knockdown turns on a set of genes involved in protein degradation. One possible model is that these genes might protect against protein degradation under normal conditions, and were upregulated after skn-1 RNAi as a secondary defensive response to stress resulting from SKN-1 loss. Alternatively, another pathway is being activated in order to compensate for the absence of SKN-1. Support: CNPq, Fapemig and NUPEB
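
The promoter inspection described above can be sketched as a motif scan for the consensus WWTRTCAT, where the IUPAC codes W and R stand for A/T and A/G. The abstract does not say how sites were counted, so scanning both strands and counting overlapping matches are assumptions here, and the sequences and function name are hypothetical.

```python
import re

# IUPAC codes needed for the consensus WWTRTCAT (W = A or T, R = A or G).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "[AT]", "R": "[AG]"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def count_sites(promoter, motif="WWTRTCAT"):
    """Count motif occurrences on both strands, including overlapping matches."""
    # A lookahead pattern lets findall report overlapping occurrences.
    pattern = re.compile("(?=" + "".join(IUPAC[base] for base in motif) + ")")
    forward = promoter.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]   # reverse complement
    return len(pattern.findall(forward)) + len(pattern.findall(reverse))

# Hypothetical sequences: one forward-strand site, then one site on each strand.
one_site = count_sites("GGATTGTCATCC")
two_sites = count_sites("GGATTGTCATCCATGACAATGG")
```

Applied to the 3 Kb upstream region of each gene, a count of this kind would separate promoters with three to ten copies from the rest, which is the criterion reported in the abstract.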


Topic: Systems Biology and Networks PI: SBN04

MODELING AND ANALYSIS OF SODIUM-POTASSIUM EXCHANGE PUMP WITH PROBABILISTIC MODEL CHECKING
Crepalde M A1 , Faria-Campos A1 , Campos S V A1
1 Department of Computer Science - Federal University of Minas Gerais

Recently there has been growing interest in the application of Probabilistic Model Checking (PMC) for the formal specification of biological systems. PMC is a formal verification technique to describe and analyze systems that exhibit stochastic behavior. This approach is able to explore all traces of a modeled system and, therefore, can identify events and conditions that can be overlooked by simulation methods and average analysis. In this work we propose discrete and stochastic modeling along with PMC for a quantitative description and analysis of the sodium-potassium exchange pump. Our biological model is based on the known Albers-Post cycle, which summarizes the complex pump mechanism as a set of elementary and reversible reactions. The sodium-potassium pump is found in the plasma membrane of virtually all animal cells and is responsible for keeping the sodium and potassium concentrations inside the cell, respectively, low and high. This cytoplasmic condition is essential for basic cellular functions such as excitability, secondary active transport and volume regulation. Some works have already developed a formal specification of this biological system using the π-calculus process algebra language. Those works have also used model checking to verify some computational properties concerning deadlock and bisimilarity, which is an equivalence relation between state transition systems that associates systems behaving in the same way. However, they do not provide a quantitative description of the sodium-potassium exchange pump, nor do they deal with quantitative properties of the biological system. We will present a quantitative formal specification of the pump mechanism in the PRISM tool, a probabilistic model checker, taking into consideration a discrete chemistry approach and aspects of the Law of Mass Action.
We will also present some significant questions about the reversibility of the sodium-potassium pump that can be addressed directly using model checking, whereas with the other traditional approaches, such as simulation and the Ordinary Differential Equation (ODE) methodology, this can be difficult or even impossible. The ODE approach takes account of the average behavior, hiding some important traces, while simulation might not capture some events, given the long periods of time required for them to happen with considerable probability. With model checking we can quickly verify in our pump model that, for example, the potassium outside the cell ends in all traces. The probability that this event happens in the first 10 seconds is 3.34 × 10⁻⁴ and, therefore, it might not be seen directly with simulations. Then, once we know through model checking that the event "potassium outside the cell ends" happens, we can focus the other approaches on identifying and understanding it better. Moreover, with model checking we can know whether this event and others, such as the pump reversibility, will exhibit recurrence in the long term, which is difficult to address with the other approaches. To sum up, we have shown that model checking can be used along with traditional approaches based on simulation and differential equations for the analysis of the sodium-potassium pump, in order to extend our knowledge of the pump's behavior.
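
The kind of query discussed above, the probability that an event happens within a given time bound, can be illustrated on a toy discrete-time Markov chain. This is not the authors' PRISM model and it does not use PRISM: the states, the transition probabilities and the function name are hypothetical, and discrete steps stand in for continuous time. The point is that the probability is computed exactly by propagating the state distribution, so a rare event is quantified without relying on a simulation run happening to observe it.

```python
def bounded_reachability(transitions, start, target, steps):
    """Probability of reaching `target` from `start` within `steps` transitions.

    transitions: {state: {next_state: probability}}. The target is treated as
    absorbing, so probability mass that reaches it stays there.
    """
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for state, p in dist.items():
            if state == target:
                nxt[state] = nxt.get(state, 0.0) + p
                continue
            for succ, q in transitions[state].items():
                nxt[succ] = nxt.get(succ, 0.0) + p * q
        dist = nxt
    return dist.get(target, 0.0)

# Hypothetical three-state chain; "goal" plays the role of the event of interest.
CHAIN = {
    "s0": {"s0": 0.5, "s1": 0.5},
    "s1": {"s0": 0.5, "goal": 0.5},
}
p_within_2 = bounded_reachability(CHAIN, "s0", "goal", 2)
p_within_3 = bounded_reachability(CHAIN, "s0", "goal", 3)
```

In this toy chain the goal cannot be reached in one step, is reached within two steps with probability 0.25, and within three steps with probability 0.375.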


Topic: Systems Biology and Networks PI: SBN03

BIOIMAGE INFORMATICS TECHNIQUES APPLIED TO THE PROCESSING AND THE ANALYSIS OF CERVICAL HISTOPATHOLOGICAL IMAGES
Miranda G H B1 , Barrera J1 , Felipe J C1
1 Faculty of Philosophy, Sciences and Languages of Ribeirão Preto (Department of Physics and Mathematics) - University of São Paulo

The study of biological processes at the cellular level has been supported by the development and application of different methodologies that can be combined to help understand the dynamic processes of living organisms. Bioimage Informatics is a sub-area of Bioinformatics that studies biological problems that have images as their primary data source. In recent decades, many image processing techniques have been developed to serve studies ranging from small organisms to molecular structures. Large databases have become available containing two- and three-dimensional images as well as biological signals representing the expression levels of different genes. In addition, techniques already widely used for general image and signal analysis can be appropriately applied to these images. Histopathology is the study of cellular structure and of the changes in its organization caused by disease. It is one of the most important diagnostic tools in medical practice. Lesions in histological samples are identified by the morphological changes in the structures of the cells. The objective of this work is to develop a methodology for the automatic image analysis of histological samples of cervical epithelial tissue, supported by a study of the processes involved in cellular structural organization, through the application of computer image processing and pattern recognition techniques. To achieve this objective, meetings were held with pathologists to elicit the perceptual parameters used in the identification and classification of cervical lesions, such as nuclear size and cell shape. This work is being developed in collaboration with the Cytopathology Laboratory team of the Department of Pathology at the Ribeirão Preto School of Medicine, which provides material from cervical cytological examinations, used for the creation of an image database. The digitized microscopic images are acquired from previously stained slides containing Pap test samples, using a camera connected to the microscope. A pipeline of morphological operators was implemented to segment the cell nuclei. After segmentation, the tissue will be represented by a complex network from which attributes will be extracted and measures calculated. From these attributes a classifier will be implemented, based on supervised learning. The labels of the cervical tissues follow the Bethesda international classification system, which defines sample adequacy criteria and also categorizes the findings and the interpretation of results. The application of segmentation techniques such as Watershed, based on the growth of connected regions by grouping neighboring pixels with similar intensities, provided the nuclear contours, one of the core parameters of pathological analysis. Other processing techniques and image segmentation algorithms are being tested and compared for the segmentation of the cell cytoplasm. Supported by: FAPESP.
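The idea of growing connected regions from neighboring pixels can be sketched with a much simpler technique than Watershed: threshold the image and label 4-connected groups of dark pixels. This is only an illustration, not the authors' pipeline; the function `label_regions`, the threshold and the tiny grayscale image are hypothetical, and it assumes that stained nuclei appear darker than the background.

```python
def label_regions(image, threshold):
    """Label 4-connected regions of pixels darker than `threshold`.

    Returns (labels, count), where labels has the same shape as image and
    holds 0 for background and 1..count for each region found.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] >= threshold or labels[r][c]:
                continue  # background pixel or already labeled
            count += 1
            labels[r][c] = count
            stack = [(r, c)]
            while stack:  # grow the region from its seed pixel
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not labels[ny][nx]
                            and image[ny][nx] < threshold):
                        labels[ny][nx] = count
                        stack.append((ny, nx))
    return labels, count


# Hypothetical 4x5 grayscale image with two dark blobs.
image = [
    [200, 200, 200, 200, 200],
    [200,  40,  50, 200, 200],
    [200,  45, 200, 200,  60],
    [200, 200, 200,  55,  50],
]
labels, count = label_regions(image, threshold=100)
```

Each labeled region could then be measured (area, shape) to obtain parameters such as nuclear size.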


Topic: Systems Biology and Networks PI: SBN03

GENE CONNECTION PROBABILITY PATTERN ANALYSIS IN YEAST DIAUXIC SHIFT PROCESS


Noronha M F1,2 , Carazzolle M F2 , Pereira G A G2 , Hashimoto R F1
1 Programa de Interunidades de Pós-graduação em Bioinformática, Instituto de Matemática e Estatística - USP
2 Laboratório de Genômica e Expressão, Instituto de Biologia - UNICAMP

Over the years, yeasts have been commonly used as models for studying eukaryotic organisms because their DNA is well studied and easily manipulated. Bioethanol, a renewable energy source with high potential, is produced mainly by fermentation of glucose media by the yeast Saccharomyces cerevisiae. Therefore, understanding the yeast genes and their interactions may improve the production of this energy source. To discover the gene regulation occurring during the diauxic shift (when the yeast changes from fermentation to respiration), we decided to analyze the behavior over time of genes involved in the respiration and fermentation pathways during this process. In this work, we used a public experimental dataset available at NCBI, in which gene expression during the transition from fermentation to respiration was measured by microarrays at 7 time points. Based on the metabolic pathways described in the literature, we selected 12 of the 27 genes involved in the glycolysis and fermentation pathways: ADH1, ALD6, PDC1, ENO1, ENO2, GPM1, PGK1, TDH1, PFK1, PGL1, HXK1 and PYK2. These genes were chosen based on literature studies and on data described in resources such as KEGG and SGD. We selected the most influential genes known so far from each gene family. Then, we binarized the experimental data for those 12 genes using the following criterion: for each gene, we calculated the average of its 7 time-series values; if the gene expression value is less than the average, it is set to 0; otherwise, it is set to 1. Because diauxic shift data are dynamic over time, we decided to build the gene networks assuming a time progression. To model this time progression, we used a window of length 4 that slides through the 7 time points. In this way, we formed 4 progression datasets corresponding to the time points from 1 to 4, from 2 to 5, from 3 to 6, and finally from 4 to 7. We developed a program to build a Boolean Network for each progression dataset. This program generated a list of all possible interactions for each gene in each progression based on the input data. We normalized these output data to the range between 0 and 1 and grouped them by the frequency of 0, -1 and 1 for each gene in each progression. Then, analyzing those frequencies in each progression, we found some interesting results: (i) the ENO2, PDC1, GPM1 and PFK1 genes had the same connection pattern in relation to the other genes, with an increased probability of being repressed by the ALD6, ENO1 and PYK2 genes in the last two progressions; (ii) ADH1, PGL1 and PGK1 also shared a connection pattern, being induced by PDC1 and ENO2 throughout the progressions and repressed by ENO1; (iii) ALD6 represses most of the genes in the last progression, except for ENO1 and PYK2; (iv) HXK1 behaves in exactly the opposite way to ALD6. Other interesting conclusions were: (i) ALD6 and HXK1 are the most influential genes in this group; (ii) TDH1 and ADH1 (NAD+ dependent) had different connection patterns in the first two progressions (fermentation) but the same pattern in the last two progressions (respiration). This is a work in progress, so these are preliminary results that serve as incentives for future research.
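The binarization rule and the sliding window described above can be sketched as follows. The function names and the expression values are hypothetical placeholders, not the NCBI dataset; only the rule (below the gene's mean gives 0, otherwise 1) and the window length of 4 over 7 time points come from the abstract.

```python
from statistics import mean


def binarize(series):
    """Set each value to 0 if it is below the series mean, otherwise to 1."""
    average = mean(series)
    return [0 if value < average else 1 for value in series]


def sliding_windows(series, length=4):
    """Return every run of `length` consecutive time points."""
    return [series[start:start + length]
            for start in range(len(series) - length + 1)]


# Hypothetical expression values of one gene at the 7 time points.
expression = [2.0, 2.5, 3.0, 1.0, 0.5, 0.4, 0.3]
binary = binarize(expression)
progressions = sliding_windows(binary)
```

With 7 time points and a window of length 4, `progressions` holds the four progression datasets (time points 1 to 4, 2 to 5, 3 to 6 and 4 to 7) for this gene.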


Topic: Systems Biology and Networks PI: SBN04

A TOOL FOR STOCHASTIC SIMULATION OF BIOCHEMICAL REACTION SYSTEMS


Resende A C M D1,2,4 , Oliveira I L D1,2,3,4 , Silva A P C D1,2,4
1 UFJF - Universidade Federal de Juiz de Fora
2 FISIOCOMP - Laboratório de Fisiologia Computacional - UFJF
3 NUBIO - Núcleo de Bioinformática - UFJF-EGL (EMBRAPA Gado de Leite)
4 DCC - Departamento de Ciência da Computação - UFJF

The temporal evolution of homogeneous (bio)chemical systems is traditionally calculated by solving a system of ordinary differential equations. Although the deterministic formulation is adequate in most cases, it does not reflect the natural stochasticity that is essential in many biological systems. An alternative that overcomes this limitation is the Stochastic Simulation Algorithm (SSA). Based on the study and implementation of some of Gillespie's algorithmic approaches, we developed the Stochastic Simulation Tool (SST). Implemented in the C programming language, SST comprises six different versions of the SSA. The SST input consists of a set of reactions and the initial numbers of molecules. The reactions should follow a specific language: the Biochemical Reactions Language (LRB). The LRB was implemented as a separate module in the SST, and the main output of the LRB analyzer is the stoichiometric matrix, which represents the biochemical network to be simulated by any of the algorithms. With the SST, it will be possible to simulate a set of complex stochastic biochemical reactions and to compare the results with those obtained by deterministic simulations. As another contribution, we have also implemented the Dependency Graph (DG) and the Priority Queue (PQ). The DG is used by the Next Reaction Method (NRM), the Optimized Direct Method (ODM), the Sorting Direct Method (SDM) and the Logarithmic Direct Method (LDM). The PQ is used only by the NRM. As case studies, we used a hypothetical gene transcription model and a signaling network (MAPK, mitogen-activated protein kinase). Supported by: BIC/PROPESQ-UFJF-MG
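As an illustration of the simplest SSA variant, the following is a minimal sketch of Gillespie's direct method. It is written in Python rather than C, it is not the SST code, and it does not use the LRB input language: the reaction format, the function `gillespie_direct` and the example reaction A -> B are assumptions made for this sketch. Propensities are computed as the rate constant times the reactant counts, which assumes that no reaction consumes two molecules of the same species.

```python
import math
import random


def gillespie_direct(state, reactions, t_end, rng):
    """Simulate one trajectory with Gillespie's direct method.

    state     -- dict mapping species name to molecule count
    reactions -- list of (rate_constant, reactants, change), where reactants
                 is a list of distinct species names and change maps species
                 to the stoichiometric change applied when the reaction fires
    """
    state = dict(state)
    t = 0.0
    trajectory = [(t, dict(state))]
    while True:
        # Propensity of each reaction and their running total.
        propensities = []
        total = 0.0
        for rate, reactants, _change in reactions:
            a = rate
            for species in reactants:
                a *= state[species]
            propensities.append(a)
            total += a
        if total == 0.0:
            break  # no reaction can fire any more
        # The waiting time to the next reaction is exponentially distributed.
        t += -math.log(1.0 - rng.random()) / total
        if t > t_end:
            break
        # Pick the reaction that fires, proportionally to its propensity.
        threshold = rng.random() * total
        cumulative = 0.0
        for a, (_rate, _reactants, change) in zip(propensities, reactions):
            cumulative += a
            if threshold < cumulative:
                for species, delta in change.items():
                    state[species] += delta
                break
        trajectory.append((t, dict(state)))
    return trajectory


# Hypothetical irreversible conversion A -> B with rate constant 0.5.
reactions = [(0.5, ["A"], {"A": -1, "B": +1})]
trajectory = gillespie_direct({"A": 10, "B": 0}, reactions,
                              t_end=100.0, rng=random.Random(1))
```

Variants such as the NRM or ODM change how the next reaction is found (priority queue, reordering, dependency graph) but sample the same stochastic process.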


Topic: Systems Biology and Networks PI: SBN06

USING SMALL-WORLD NETWORK TOPOLOGY TO GUIDE GRN INFERENCE PROCESSES


Lima L D A1 , Martins-Jr D C2 , Lopes F M3,4
1 Laboratório de Genômica Médica e Bioinformática - Hospital AC Camargo
2 Centro de Matemática, Computação e Cognição - Universidade Federal do ABC
3 Instituto de Matemática e Estatística - Universidade de São Paulo
4 Universidade Tecnológica Federal do Paraná

Knowledge about biological processes and the interrelationships among genes remains a major challenge in systems biology research. It is very important to understand how biological processes happen and, in the case of diseases, how to prevent them from happening. Thus, the inference of gene regulatory networks (GRNs) from gene expression data is crucial to understand how biological entities interact as a system, and how this system works to create an organism. However, the reverse engineering of GRNs is a very difficult task due to insufficient information about biological organisms, the complexity of the interrelationships among genes, and the noise in gene expression measurements. One way to help the inference process and improve its performance is to apply an approach based on complex network theory, mainly exploring local and global topological properties. In particular, the current literature indicates that several biological networks present the scale-free property, in which new nodes tend to connect to nodes that already have many connections. However, other biological networks are best described by the small-world network model, in which a path between any two network genes (nodes) can be found by traversing only a few connections (popularly, a maximum of six). In addition, transitivity is highly probable in small-world networks: if node A is connected to node B and node B is connected to node C, the probability of a connection between A and C is high, which allows a large number of triangles in the graph. In order to study networks related to specific phenomena or genes, we can apply a method for growing networks around these genes. In general, in this task some criterion function is used to estimate the strength of the connection between every pair of genes from the expression profiles, e.g., entropy, coefficient of determination (CoD) or correlation, to cite but a few. Instead of using only this function, we propose to include a new function that improves the small-world properties of the network. For each candidate gene to be added to the network, we first link it to the neighbors with which it has the strongest connections according to the criterion function adopted. Thereafter, we give a higher weight to nodes that are distant from the new node in order to maintain the characteristic path length of the network and its small-world properties. In summary, this work presents a new approach to GRN inference that adds a small-world complex network topology as prior knowledge in order to guide the inference process in the face of the known limitations of this task. In this way, better accuracy is expected for the inferred GRNs through a reduction of false positives. The inferred GRNs are also expected to be closer to the small-world model, making them more biologically significant, which can contribute to a better understanding of biological networks and their processes.
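The two quantities that characterize a small-world network, local clustering (transitivity) and characteristic path length, can be computed as in the sketch below. This is not the authors' inference method; the function names and the four-node example graph are hypothetical, and the graph is assumed to be undirected and connected.

```python
from collections import deque
from itertools import combinations


def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = graph[node]
    if len(neighbours) < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    possible = len(neighbours) * (len(neighbours) - 1) / 2
    return linked / possible


def characteristic_path_length(graph):
    """Average shortest-path length over all ordered pairs of nodes."""
    total, pairs = 0, 0
    for source in graph:
        # Breadth-first search gives shortest distances in unweighted graphs.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for nxt in graph[current]:
                if nxt not in distance:
                    distance[nxt] = distance[current] + 1
                    queue.append(nxt)
        total += sum(distance.values())
        pairs += len(distance) - 1
    return total / pairs


# Hypothetical network: a triangle A-B-C with gene D attached to C.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
clustering_of_a = clustering_coefficient(graph, "A")
path_length = characteristic_path_length(graph)
```

A growth procedure of the kind described could evaluate both quantities before and after adding a candidate gene and prefer additions that keep clustering high and paths short.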


Topic: Systems Biology and Networks PI: SBN06

A MODULE-BASED ANALYSIS OF GENE EXPRESSION NETWORKS BASED ON CLIQUES IDENTIFICATION


Campiteli M G2 , Jr R M C2 , Costa L D F1
1 Instituto de Física - Universidade de São Paulo
2 Instituto de Matemática e Estatística - Universidade de São Paulo

In recent decades, molecular biology technologies have undergone astonishing improvement. Concurrently, the amount of biological data has increased exponentially, and the development of algorithms to analyze the information produced and gain insight into biological systems has become imperative. Among these, the complex networks approach is gaining increasing attention due to its potential to reveal mechanistic processes not revealed by traditional techniques. Gene expression networks are usually built from microarray data, and genes are connected when there is a significant level of correlation between their expression patterns. These structures have proved to be a rich source of abstraction and information. However, because of their statistical nature, these networks cannot be analyzed on a gene-by-gene basis. Their analysis relies instead on modules and their correlation with biological function. In this work we present an algorithm for the analysis of gene expression networks based on the identification of maximal cliques. A clique is defined as a set of completely interconnected nodes and can be interpreted in this context as a module. The result of this algorithm is another network in which the nodes represent what we call super-genes: groups of genes whose expression patterns are highly correlated with each other while weakly correlated with the patterns of genes in other cliques. We successfully correlate the super-genes with a defined biological function. We show results for a model organism and for a well-known human disease dataset. Support: FAPESP
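The abstract does not say which clique-finding procedure is used, so the sketch below swaps in the classic Bron-Kerbosch enumeration (without pivoting) to show what identifying maximal cliques means in practice. The function `maximal_cliques` and the small correlation graph with genes g1 to g4 are hypothetical.

```python
def maximal_cliques(graph):
    """Enumerate all maximal cliques of an undirected graph (Bron-Kerbosch).

    graph -- dict mapping each node to the set of its neighbours
    """
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(clique)  # nothing can extend this clique
            return
        for node in list(candidates):
            expand(clique | {node},
                   candidates & graph[node],
                   excluded & graph[node])
            # The node has been fully explored: move it to the excluded set.
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(frozenset(), frozenset(graph), frozenset())
    return cliques


# Hypothetical co-expression graph: g1, g2, g3 are mutually correlated
# and g4 is correlated only with g3.
graph = {
    "g1": {"g2", "g3"},
    "g2": {"g1", "g3"},
    "g3": {"g1", "g2", "g4"},
    "g4": {"g3"},
}
cliques = maximal_cliques(graph)
```

Each clique returned would correspond to one candidate module; how cliques are merged into super-genes is specific to the authors' algorithm and is not reproduced here.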


Topic: Systems Biology and Networks PI: SBN04

ESTIMATION OF PROBABILISTIC GENE NETWORKS USING THE UCS FEATURE SELECTION ALGORITHM
Reis M S1 , Ferreira C E1 , Barrera J1,2
1 Institute of Mathematics and Statistics - UNIVERSITY OF SÃO PAULO
2 Faculty of Philosophy, Sciences and Letters - UNIVERSITY OF SÃO PAULO

Probabilistic gene networks (PGNs) are a particular family of conditionally independent Markov chains, proposed for modeling gene regulatory networks. These models have been successfully applied to the design of regulatory networks for the malaria parasite and the yeast cell cycle. PGNs mimic the properties of a gene as a non-linear stochastic gate and of the systems built by coupling these gates. The way the gates are coupled, called the architecture of the network, defines for each gene the group of genes that inhibit or activate it, that is, which genes are its predictors. The estimation of a PGN consists of choosing the best set of predictors for each gene. Such a choice may be seen as a feature selection problem in the context of pattern recognition. For the estimation of a PGN from temporal data (e.g., temporal microarray data, normalized and discretized), a cost function used in the feature selection procedure is the Mean Mutual Information (MMI). The MMI measures the expectation of the mutual information between a target gene and a set of predictors. There are several algorithms in the literature for the feature selection problem, for instance sequential floating forward selection (SFFS). Another method is the U-Curve algorithm, which takes into account the fact that the search space is a Boolean lattice (the power set of the feature set) and that the cost function forms a U-shaped curve when applied to any chain of the lattice. Recently, the U-curve search (UCS) algorithm was proposed as an improved version of the U-Curve algorithm: unlike the latter, the former guarantees that all the global minima of an instance are found. The objective of this work is to assess the performance of the UCS algorithm when used as the feature selection algorithm in PGN estimation. We conducted an experiment using simulated temporal data from PGNs of different sizes. The architecture of each network was recovered using MMI as the cost function and different feature selection algorithms, namely the original U-Curve algorithm, UCS, SFFS, and an exhaustive search. We evaluated different trade-offs between computational time and correctness of the estimated PGNs, and in most of them the UCS algorithm had the best results. Supported by: CNPq
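To make the cost function more concrete, the sketch below computes the plain empirical mutual information between the joint states of a candidate predictor set and a target gene. It is a simplification and not necessarily the exact MMI estimator used in this work, and it does not implement UCS, U-Curve or SFFS; the function name and the binary samples are hypothetical.

```python
import math
from collections import Counter


def mutual_information(predictor_states, target_states):
    """Empirical mutual information (in bits) between two aligned samples.

    predictor_states -- list of hashable joint states of the predictor set
    target_states    -- list of target gene states, one per sample
    """
    n = len(target_states)
    joint = Counter(zip(predictor_states, target_states))
    p_count = Counter(predictor_states)
    t_count = Counter(target_states)
    return sum(
        (count / n) * math.log2(count * n / (p_count[x] * t_count[y]))
        for (x, y), count in joint.items()
    )


# Hypothetical discretized samples: the target at time t+1 is the XOR of
# two predictor genes at time t.
pair = [(0, 0), (0, 1), (1, 0), (1, 1)]
target = [0, 1, 1, 0]
mi_pair = mutual_information(pair, target)
mi_single = mutual_information([state[0] for state in pair], target)
```

In this toy case the pair of predictors carries one full bit about the target while either gene alone carries none, which is why the search has to consider subsets of predictors rather than genes one at a time.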


Topic: Systems Biology and Networks PI: SBN04

PREDICTING NOVEL EXPRESSION QUANTITATIVE TRAIT LOCI (EQTL): THE USE OF THE FUZZY LOGIC THROUGH CUDA ARCHITECTURE
Simoes S N1 , Hashimoto R F1
1 Instituto de Matemática e Estatística da Universidade de São Paulo

eQTL (expression Quantitative Trait Loci) data analysis is a growing field in bioinformatics and a promising approach to assist the discovery of genome regions responsible for certain phenotypic traits. It can be used for the following purposes: (i) locating one or multiple interacting QTLs associated with phenotypic characteristics; (ii) mapping the association of complex traits and diseases with global gene expression; and (iii) discovering regulatory mechanisms from variations in gene expression. eQTL data can be obtained, for example, through the microarray technique, a powerful tool to systematically quantify gene expression by determining the amount of RNA transcribed in cells and tissues subjected to numerous experimental conditions. However, biological input data usually involve considerable uncertainty. Among the factors causing uncertainty are incorrect calibration of equipment, operator errors in handling samples or equipment, and variations in experimental conditions. Thus, the biological data used in bioinformatics have several sources of uncertainty, which require appropriate treatment methods. Since most approaches to eQTL analysis are essentially statistical (e.g., hypothesis testing and Bayesian inference), they are sometimes not the most appropriate way to deal with the uncertainty of biological data. We therefore assume that more sophisticated methods can yield more informative results. Among the methods for treating uncertainty we highlight fuzzy logic, a multivalued logic for approximate reasoning derived from the theory of fuzzy sets. There are many applications of fuzzy logic in bioinformatics, e.g., microarray data analysis, structural bioinformatics and fuzzy similarity measures for ontologies. Since eQTL data have considerable uncertainty, this work proposes the use of fuzzy logic in an expert system to deal with it. After receiving the expression data and the marker positions as input, the expert system will search for loci related to certain characteristics by employing fuzzy logic, as follows: (i) fuzzification of gene expression values into five categories: very low, low, average, high and very high; (ii) application of fuzzy operators to identify the degree of membership of genes in the fuzzy sets; (iii) application of implication rules, so that the system can select loci linked to a specific phenotype; (iv) combination of all possible fuzzy outputs; and (v) defuzzification, so that the system returns phenotype markers with a certainty factor or degree of confidence in the outcome. However, the analysis of eQTL data involves many markers along the entire genome and large volumes of data. This, in addition to the use of fuzzy logic for the treatment of uncertainty, will demand more computational performance to run the expert system. To meet this demand, the system will run in parallel, using the CUDA architecture to parallelize the methods. We expect to be able to predict potential loci related to specific phenotypic traits more accurately. This work is currently in progress, and preliminary results suggest that the approach is feasible.
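Step (i), fuzzification into five categories, can be sketched with evenly spaced triangular membership functions. The shape, centers and width of the membership functions are assumptions made for this example (the abstract does not specify them), the expression value is assumed to be already normalized to the range 0 to 1, and the sketch runs on the CPU rather than on CUDA.

```python
LABELS = ["very low", "low", "average", "high", "very high"]
CENTERS = [0.0, 0.25, 0.5, 0.75, 1.0]  # assumed peak of each category
WIDTH = 0.25                           # assumed half-width of each triangle


def fuzzify(value):
    """Return the degree of membership of `value` in each of the five sets."""
    return {
        label: max(0.0, 1.0 - abs(value - center) / WIDTH)
        for label, center in zip(LABELS, CENTERS)
    }


def fuzzy_and(a, b):
    """Minimum operator, a common choice for fuzzy conjunction."""
    return min(a, b)


# Hypothetical normalized expression value of one gene.
membership = fuzzify(0.6)
```

With these assumptions a value of 0.6 belongs mostly to "average" and partly to "high", and the memberships of any value in the range sum to 1. Because every gene and marker is fuzzified independently, this step is a natural candidate for data-parallel execution.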


Topic: Systems Biology and Networks PI: SBN03

SYNCHRONIZATION IN A BOOLEAN NETWORK OF SIGNALING IN LARGE GRANULAR LYMPHOCYTE LEUKEMIA


Bugs C A1,2 , Mombach J C M1
1 Universidade Federal de Santa Maria
2 Universidade Federal do Pampa

We study synchronization in a Boolean network developed by Zhang and coworkers. A Boolean network consists of a group of N nodes or elements {N_1, N_2, ..., N_N}, where σ_i is the state (True or False) of node N_i. The state of each node is determined by the initial conditions along with the Boolean rules that determine the state of N_i from the states of its regulators {N_i1, N_i2, ..., N_ik}. The Boolean operations are used as follows: if two or more elements can each induce the activation of a node independently, we combine them with the logical operator OR; if two or more components cannot induce the activation independently, we combine them with the logical operator AND; and finally, the operator NOT inverts the state of an element. Our method of analysis consists in finding the states of the network that are most frequently visited and identifying the variable elements, since they show the activated or deactivated network sub-pathways and their biological influence on the global network behavior. Denoting by Σ_t = {σ_1(t), σ_2(t), ..., σ_N(t)} the state of the nodes at update t, the set Σ_Total = {Σ_1, Σ_2, Σ_3, ..., Σ_T} represents the states visited by the network over T updates for a given initial condition and defines a possible trajectory of the network in its Space of States (SS). We developed a simulation using the software Mathematica 7.0, considering a hybrid synchronous/asynchronous dynamics to describe the SS of the T-LGL (T cell large granular lymphocyte leukemia) survival signaling network, which contains 58 nodes and 123 interactions. The nodes of this network represent proteins and biomolecules, plus a single node representing the apoptosis pathway. The regulatory interactions in this network are translated to Boolean rules. For any initial condition with Apoptosis held in the False state we find that the cardinality of SS is small; however, with Apoptosis in the True state the cardinality of SS increases a great deal. In both situations the network breaks apart into two groups of nodes for all initial conditions: in the first group the nodes are always in a frozen state and in the second the nodes are always in a variable state. This feature determines the cardinality of Σ_Total: the different behavior of the network with Apoptosis on or off is associated with synchronization among several network nodes. For Apoptosis on we find on average only 15 different states for the SS, but for Apoptosis off we find on average 48. The results show that for Apoptosis on in this network the nodes and the changes that constitute the apoptotic program are synchronized and operate in accordance with a precisely coordinated scheme.
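The notions of trajectory, number of distinct states visited, and frozen versus variable nodes can be illustrated on a toy network. The sketch below uses purely synchronous updates, not the hybrid synchronous/asynchronous dynamics of the study, and the four nodes and their rules are hypothetical, not taken from the T-LGL network.

```python
def step(state, rules):
    """Apply every Boolean rule to the current state (synchronous update)."""
    return {node: rule(state) for node, rule in rules.items()}


def trajectory(initial, rules, updates):
    """Return the list of states visited over the given number of updates."""
    states = [initial]
    for _ in range(updates):
        states.append(step(states[-1], rules))
    return states


def frozen_nodes(states):
    """Nodes whose value never changes along the trajectory."""
    return {node for node in states[0]
            if len({state[node] for state in states}) == 1}


# Hypothetical rules built from OR, AND and NOT; D simply holds its value,
# in the way a node can be held fixed in the True or False state.
rules = {
    "A": lambda s: s["B"] or s["C"],
    "B": lambda s: s["A"] and not s["C"],
    "C": lambda s: not s["A"],
    "D": lambda s: s["D"],
}
initial = {"A": True, "B": False, "C": False, "D": True}
states = trajectory(initial, rules, updates=6)
distinct = {tuple(sorted(state.items())) for state in states}
frozen = frozen_nodes(states)
```

Here the network cycles through three states, so the set of visited states has cardinality 3, node D is frozen and nodes A, B and C are variable.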


Topic: Systems Biology and Networks PI: SBN04

MODELLING CANCER TREATMENT THROUGH CHIMERIC POLYMERS NANOPARTICLES


Paiva L R1 , Martins M L1
1 Universidade Federal de Viçosa

A major challenge for cancer chemotherapy is to deliver adequate doses of drugs to the affected areas of the body. On the one hand, the rapid clearance of these drugs from the circulation means that large doses are required for them to be effective. On the other hand, systemic toxicity limits their dose. Packaging chemotherapeutic drugs into nanoscale delivery vehicles is a promising strategy, since some studies have shown that nanoparticles accumulate within solid tumors due to the enhanced permeability and retention effect, which results from abnormalities of the tumor blood and lymphatic vasculature [1]. They also have reduced toxicity [1]. Polymeric nanomaterials offer a promising solution by encapsulating chemotherapeutic drugs. The chimeric polymers (CPs) consist of a hydrophilic, biodegradable elastin-like polypeptide and a short segment for the attachment of drugs, such as the anti-cancer drug Doxorubicin (Dox), through a pH-labile linker [1]. We propose a hybrid, multiscale model to investigate the use of CP-Dox nanoparticles against a solid tumor. The CP-Dox is administered in injections of equal dose at a fixed period τ. At the capillary, the CP-Dox concentration falls exponentially after each injection. The nanoparticles diffuse in the tissue and are absorbed by normal and cancer cells. Inside the cells, the CP-Dox is degraded and a fraction of the drug is released. The intracellular accumulation of this drug affects the cell dynamics: normal cells can only die, while cancer cells are able to divide, migrate and die. Our results show that if the tumor is too close to the capillary, the therapy fails, even for nanoparticle injections near the MTD (maximum tolerated dose) given at τ = 12 h. However, our results indicate that the CP-Dox therapy can be efficient against micrometastases far from the blood supply. The selectivity of the nanoparticles for cancer cells is an important issue, as drug uptake by normal cells significantly reduces the amount of drug available to the tumor. Also, the death of normal tissue increases the amount of nutrients available to the cancer. Thus, while a selective nanoparticle eradicates the tumor in about 8 days (doses of 80% of the MTD given every 12 h), non-selective ones need about 40 days to eradicate the tumor using the same protocol. Moreover, as the drug affects the normal tissue, most of the normal cells are dead by the time the tumor is eradicated. If absolute selectivity is achieved, the least aggressive treatment able to eradicate an avascular micrometastasis with 100% probability consists of 4 injections of 50% of the MTD of CP-Dox, given once every 8 days. Supported by: CNPq, CAPES, Fapemig
[1] Nat Mat 2009 8(12) 993
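The boundary condition at the capillary (equal doses injected periodically, each followed by an exponential fall in concentration) can be sketched as a sum of decaying contributions. This covers only that one ingredient of the multiscale model; the function name, the decay rate (a half-life of 6 h) and the unit dose are hypothetical values chosen for the example, and superposition of the injections is an assumption.

```python
import math


def capillary_concentration(t, dose, period, decay_rate):
    """Concentration at time t >= 0 from injections at 0, period, 2*period, ...

    Each injection adds `dose` and then decays as exp(-decay_rate * elapsed).
    """
    injections_given = int(t // period) + 1
    return sum(
        dose * math.exp(-decay_rate * (t - k * period))
        for k in range(injections_given)
    )


# Hypothetical parameters: unit dose every 12 h, half-life of 6 h.
DECAY_RATE = math.log(2) / 6.0
at_first_injection = capillary_concentration(0.0, 1.0, 12.0, DECAY_RATE)
at_second_injection = capillary_concentration(12.0, 1.0, 12.0, DECAY_RATE)
```

With a half-life of 6 h, a quarter of the first dose remains when the second one is injected 12 h later, so the concentration right after the second injection is 1.25 times the dose.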


Topic: Systems Biology and Networks PI: SBN03

GENE EXPRESSION ANALYSIS OF AN ORDERED NETWORK: A SEARCH FOR TUMORAL PATTERNS


Rybarczyk-Filho J L1 , Benetti F P D C D1 , Dalmolin R J S2 , Moreira J C F2 , Brunnet L G1 , Almeida R M C D1
1 Instituto de Física - Universidade Federal do Rio Grande do Sul
2 Departamento de Bioquímica - Universidade Federal do Rio Grande do Sul

A modular classification of the genome that associates cellular processes with modules could lead to a method for quantifying the variations in gene expression levels caused by different cellular stages or conditions: the transcriptogram, a powerful tool for assessing cell performance, would then be at hand. We present here a computational method that orders genes on a line and clusters them by the probability that their products interact. Protein-protein association information can be obtained from large databases such as STRING or KEGG. Once the interactome has been obtained, it is possible to generate a graph representing the network, in which each node corresponds to a protein and interacting nodes are linked by an edge. It is also possible to represent the interactome by its interaction matrix. The computational method is applied to this matrix to order the genes of a genome by clustering strongly interacting genes together on a line and to recognize the gene ontology terms associated with each module. We applied the method to the Homo sapiens genome and, considering genome expression, we produced a succession of plots of gene transcription levels (cancer vs. normal). These may be regarded as the first versions of a transcriptogram. This method is useful for extracting information from cell stimulus/response experiments and may be applied for diagnostic purposes. Supported by: CNPq, CAPES, FAPERGS


Topic: Systems Biology and Networks PI: SBN04

THE MIGRATION OF MELANOMA CELLS IN CULTURE: A QUANTITATIVE CHARACTERIZATION


Silva P C A D1 , Rosembach T V1 , Maral L N2 , Mendes R L2 , Silva H S1 , Rocha M S1 , Martins M L1
1 Departamento de Física - UNIVERSIDADE FEDERAL DE VIÇOSA
2 Departamento de Biologia Animal - UNIVERSIDADE FEDERAL DE VIÇOSA

Studies of cellular motion are important for understanding fundamental processes such as embryonic development, wound healing, and invasion and metastasis in cancer. In this work we report initial results on the trajectories followed by murine melanoma cells in vitro; our goal is to characterize the dynamics of growth, migration and aggregation of these cells in culture. The experimental technique used was videomicroscopy, and the results were obtained by digital analysis of the images. The melanoma cells were maintained under ideal conditions and photographed at regular intervals of 15 minutes. Using the coordinates of each cell, we calculated quantities such as velocities, turning angles and mean square displacement, which were used to determine the nature of the migratory process employed by these cells in monolayer cultures at low cell density. Supported by: Capes and CNPq
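The quantities mentioned above can be computed from a list of cell coordinates as in the sketch below. The function names and the four-point track are hypothetical; only the 15-minute sampling interval comes from the abstract, and the coordinates are assumed to be in arbitrary length units.

```python
import math


def speeds(track, dt):
    """Speed between consecutive positions sampled every `dt` time units."""
    return [math.dist(p, q) / dt for p, q in zip(track, track[1:])]


def turning_angles(track):
    """Signed angle (radians) between consecutive displacement vectors."""
    angles = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(track, track[1:], track[2:]):
        before = math.atan2(y1 - y0, x1 - x0)
        after = math.atan2(y2 - y1, x2 - x1)
        difference = after - before
        # Wrap the difference into the interval from -pi to pi.
        angles.append(math.atan2(math.sin(difference), math.cos(difference)))
    return angles


def mean_square_displacement(track, lag):
    """Average squared displacement over all pairs of points `lag` steps apart."""
    squared = [math.dist(p, q) ** 2 for p, q in zip(track, track[lag:])]
    return sum(squared) / len(squared)


# Hypothetical track of one cell photographed every 15 minutes.
track = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
cell_speeds = speeds(track, dt=15.0)
cell_angles = turning_angles(track)
msd_lag_1 = mean_square_displacement(track, lag=1)
msd_lag_2 = mean_square_displacement(track, lag=2)
```

How the mean square displacement grows with the lag is what distinguishes migration regimes: for example, linear growth is the signature of ordinary diffusion, while faster growth indicates persistent, directed motion.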


Topic: Systems Biology and Networks PI: SBN04

A GENOME-SCALE INVESTIGATION FOR THE ONE STEP PRODUCTION OF POLYLACTIC ACID IN A GENETICALLY ENGINEERED LACTOCOCCUS LACTIS
Pereira F C1 , Bagnariolli B1 , Castro J V1 , Porto L M1
1 Integrated Technologies Laboratory - InteLab, Chemical and Food Engineering Department, FEDERAL UNIVERSITY OF SANTA CATARINA - UFSC

Genome-scale models of the metabolism of Lactococcus lactis have recently been described in the literature. The use of genomic models is usually challenging due to the lack of physiologically meaningful correlations that link metabolic and signaling networks to workable mathematical descriptions. L. lactis is a bacterium that naturally produces lactic acid, the monomer used for the production of lactate-based polyesters, particularly polylactic acid (PLA) and its co-polymers. PLA has been extensively studied for tissue engineering applications as well as an alternative to petrochemical-based polymers. In this work we demonstrate, using in silico metabolic engineering techniques, that L. lactis IL 1403 is able to produce PLA in one step through genetic manipulation, i.e., internal polymerization is achieved by cloning appropriate synthase and transferase genes. In particular, lactyl-CoA is polymerized by a suitable PHA (polyhydroxyalkanoate) synthase enzyme. The simplified genome-scale model is based on 458 biochemical reactions relating 321 metabolites. This already complex network summarizes the bacterial metabolism that transforms glucose into final products, the biopolymer PLA being the target one. The Metabolic Flux Analysis (MFA) approach was used to investigate the corresponding stoichiometric model in order to predict the metabolic flux distribution. MFA was performed with the Genomic Engineering System Toolbox, GEnSys, developed in our laboratory. Compared with an engineered PLA-producing Escherichia coli strain, our model predicts an increased accumulation of polylactic acid inside the bacterial cell. This is in agreement with what has been published in the literature for the production of the monomer by the Lactococcus strain. It is expected that for highly active synthases the biopolymer production would be proportional to the lactic acid flux. Based on our in silico results, we propose that this genome-scale model could be useful for the development of highly productive metabolically engineered Lactococcus lactis for the one-step production of PLA. Supported by: CNPq
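
The steady-state assumption behind a stoichiometric flux model can be shown on a toy network. The sketch below is not the 458-reaction model and does not use GEnSys; the four lumped reactions, their stoichiometry and the flux values are assumptions chosen only to show that internal metabolites must satisfy S v = 0.

```python
# Toy network (hypothetical, for illustration only):
#   v1: glucose uptake  -> 2 lactate   (lumped glycolysis)
#   v2: lactate         -> exported lactate
#   v3: lactate         -> lactyl-CoA
#   v4: lactyl-CoA      -> PLA         (polymerization by a PHA synthase)
# Rows are internal metabolites, columns are reactions v1..v4.
S = [
    [2, -1, -1, 0],   # lactate: produced by v1 (x2), consumed by v2 and v3
    [0, 0, 1, -1],    # lactyl-CoA: produced by v3, consumed by v4
]

def residuals(S, v):
    """Net production rate of each internal metabolite (zero at steady state)."""
    return [sum(s * x for s, x in zip(row, v)) for row in S]

def solve_toy_fluxes(glucose_uptake, lactate_export):
    """Fix two fluxes and obtain the other two from the steady-state balances."""
    v3 = 2 * glucose_uptake - lactate_export   # lactate balance
    v4 = v3                                    # lactyl-CoA balance
    return [glucose_uptake, lactate_export, v3, v4]

fluxes = solve_toy_fluxes(10.0, 12.0)   # arbitrary illustrative values
balance = residuals(S, fluxes)
```

In this toy case the PLA flux equals whatever lactate is not exported, which mirrors the expectation stated in the abstract that polymer production follows the lactic acid flux when the synthase is highly active.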

Topic: Systems Biology and Networks PI: SBN03

SIMULATION OF CARDIAC MYOCYTE MODELS WITH EXCITATION-CONTRACTION COUPLED TO MITOCHONDRIAL ENERGETICS


Carvalho R R D1,2 , Santos R W D1,2 , Rocha P A F1,2
1 Universidade Federal de Juiz de Fora
2 Mestrado em Modelagem Computacional

Cardiovascular diseases are the leading cause of death in developed countries. A better knowledge of the underlying function of the heart is key to further advancing health care. The development and simulation of mathematical models of the heart at the cell, tissue and organ levels have proven to be a very efficient and attractive approach. At the cell level, many models are available both for excitation-contraction and for mitochondrial energetic activity, but few integrate these cell processes. We present a numerical simulation of the action potential (AP) and its propagation in cardiac tissue, using a cell model that couples excitation-contraction to mitochondrial energy metabolism.

Topic: Systems Biology and Networks PI: SBN04

COMPUTATIONAL METHODS FOR THE INVERSE PROBLEM IN CARDIAC ELECTROPHYSIOLOGY


Novaes G M1 , Brugiolo A1 , Santos R W D1 , Borges C C H1
1 Universidade Federal de Juiz de Fora

The modeling of the electrical activity of the heart is of great medical and scientific interest, because it provides a better understanding of the related biophysical phenomena and allows the development of new diagnostic techniques as well as new drugs. Current mathematical models are usually based on experimental data obtained from a small collection of cells. However, the electrical characteristics of cells vary along the heart. This heterogeneity plays a key role, but creates difficulties for computational modeling. Currently, there are no global models capable of reproducing the electrical activity of different cells, even neighboring ones. The objective of this work is to evaluate a methodology based on Genetic Algorithms that automatically adjusts existing models of cellular electrophysiology to experimental data obtained from a single cell or a collection of cardiac cells. The proposed methodology is implemented and evaluated through different numerical experiments. To increase the efficiency and speed of the numerical solution, the Genetic Algorithm was implemented using parallel programming and makes use of metamodels.

Topic: Systems Biology and Networks PI: SBN03

SOLVING MARKOV CHAINS IN CARDIAC MODELS THROUGH THE UNIFORMIZATION METHOD


Secundino A A1 , Couto A P1 , Weber R1
1 Universidade Federal de Juiz de Fora

The main purpose of this work is to compare different numerical methods for solving the Markov chains used to model the behavior of ion channels in cardiac action potential models. The uniformization technique, used in several areas of performance evaluation of communication systems, is applied to calculate the steady-state probability distribution of these chains. As shown in the literature, the uniformization technique is robust and overcomes several problems that arise when ordinary differential equations (ODEs) are solved to calculate the same measure. To compare the uniformization technique with traditional methods for solving ODEs, we chose the well-known model of the action potential of mouse ventricular myocytes proposed by Bondarenko et al. (2004). The results show that the uniformization technique has a smaller computational cost, considering the total number of multiplications performed, and is more stable than the traditional methods for solving ODEs.
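
The abstract gives no implementation details, so the following is only a minimal sketch of the textbook uniformization scheme for the state distribution of a small continuous-time Markov chain at time t. The two-state closed/open channel and its rates are hypothetical and are not taken from the Bondarenko model.

```python
import math

def uniformization(Q, p0, t, tol=1e-12, max_terms=10_000):
    """Distribution p(t) = p0 * exp(Q t) of a CTMC, computed by uniformization.

    Q[i][j] is the transition rate from state i to state j (rows sum to zero).
    """
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n))        # uniformization rate
    if lam == 0.0:
        return list(p0)                          # no transitions at all
    # Embedded discrete-time chain: P = I + Q / lam
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    weight = math.exp(-lam * t)                  # Poisson weight for k = 0
    total = weight
    v = list(p0)                                 # holds p0 * P^k
    result = [weight * x for x in v]
    for k in range(1, max_terms):
        if 1.0 - total <= tol:                   # remaining Poisson mass bounds the error
            break
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= lam * t / k
        total += weight
        result = [r + weight * x for r, x in zip(result, v)]
    return result

# Hypothetical two-state channel: state 0 = closed, state 1 = open.
alpha, beta = 2.0, 1.0                           # opening and closing rates, illustrative
Q = [[-alpha, alpha], [beta, -beta]]
p = uniformization(Q, [1.0, 0.0], t=0.5)
analytic_open = alpha / (alpha + beta) * (1.0 - math.exp(-(alpha + beta) * 0.5))
```

Every term in the series is a product of non-negative numbers, which is one reason the method is usually regarded as numerically stable; the sketch does not handle very large values of lam * t, where the initial Poisson weight underflows.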

Topic: Systems Biology and Networks PI: SBN03

A COMPUTATIONAL MODEL OF THE INNATE IMMUNE SYSTEM


Pigozzo A B1 , Lobosco M1 , Santos R W D1
1 Universidade Federal de Juiz de Fora

The majority of infectious diseases are caused when foreign pathogenic agents enter the body. When a foreign pathogenic agent is detected, it triggers an immune response: some body molecules signal the presence of the pathogenic agent in the body or even tissue damage. The first phase of this immune response is called the innate immune response. In this scenario, our work aims to develop and implement a mathematical and computational model of the innate immune response in a microscopic section of tissue. For this purpose, a set of partial differential equations is employed to simulate the spatial and temporal behavior of antigens, chemokines and neutrophils during the initial stage of the innate immune response.

TOPIC 7 SEQUENCE ANALYSIS

Topic: Sequence Analysis PI: SA01

IDENTIFICATION OF SPECIES-SPECIFIC MICROSATELLITES IN THREE SPECIES OF LEISHMANIA


Rodrigues-Luiz G F1 , Cristo G S P1 , Lobo F P1 , Lourdes R D A1 , Rodrigues T D S2 , Fujiwara R T1 , Bartholomeu D C1
1 Department of Parasitology - Universidade Federal de Minas Gerais
2 Department of Computer Science - Centro Federal de Educação Tecnológica de Minas Gerais

Microsatellites, or Simple Sequence Repeats (SSRs), are tandemly repeated stretches of short nucleotide motifs, usually ranging from 1 to 6 bp, ubiquitously distributed in the genomes of eukaryotic organisms. Length variation of individual SSR loci can be easily screened by PCR, and this technique has been useful in several studies of strain typing and population genetics. Multiplex PCR is a variant of PCR that simultaneously amplifies many loci of interest in a single reaction by using more than one pair of primers. Multiplex PCR is therefore commonly used for genotyping applications and microsatellite analyses. Leishmania is a genus of flagellate protozoa that cause a broad spectrum of diseases, ranging from self-limiting localized cutaneous lesions to visceral leishmaniasis. Multilocus enzyme electrophoresis (MLEE) has been the gold standard for taxonomy and strain typing of Leishmania, although this method has several limitations, including the relatively small number of characterized loci and alleles and the requirement for parasite culture. Here we performed an in silico analysis searching for species-specific SSRs in the genomes of three species of Leishmania (L. infantum, L. braziliensis and L. major) in order to design sets of primers for genotyping by multiplex PCR. To identify such repeats we developed the program NT-RepeatFinder, written in Pascal. To design the primers we wrote a script using Perl and commands of the EMBOSS package, as well as the third-party software e-PCR (Electronic PCR) and FastPCR, which were used to identify the best primers. By applying this protocol, we generated 15 sets of primers for the L. major genome, 9 for L. braziliensis and 4 for L. infantum. We are currently validating the selected primers using artificial DNA mixtures of the three Leishmania species, as well as testing them in infected tissues.
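
NT-RepeatFinder itself is not described in the abstract, so the sketch below only illustrates the general idea of scanning a sequence for perfect tandem repeats with motifs of 1 to 6 bp. It uses a regular expression with a backreference; the function name, the thresholds and the example sequence are hypothetical.

```python
import re

def find_ssrs(seq, min_repeats=3, max_unit=6):
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    The motif is matched non-greedily, so the shortest repeating unit is
    reported (for example 'AT' rather than 'ATAT'). Assumes min_repeats >= 2.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits

# Hypothetical sequence containing an (AT)5 and a (CAG)4 repeat.
example = "GGCATATATATATCCAGCAGCAGCAGTT"
ssrs = find_ssrs(example)
```

Finding species-specific loci would additionally require comparing the repeats and their flanking regions across the three genomes, which this sketch does not attempt.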

Topic: Sequence Analysis PI: SA04

ORIGIN OF MULTIPLE PERIODICITIES IN THE FOURIER POWER SPECTRA OF THE PLASMODIUM FALCIPARUM GENOME
Nunes M C D S1 , Wanner E F3 , Weber G2
1 Federal University of Ouro Preto
2 Federal University of Minas Gerais
3 Federal Center for Technological Education of Minas Gerais

The DNA molecule has regions with periodic repetition patterns of nucleotides, such as tandem repeats, generally present in chromosomal regions. These repetitions have several structural functions, such as the folding of coiled-coil proteins, the helical turn of the DNA double helix, or the binding of DNA to histones. Some periodicities are also involved in human disorders or diseases, such as Huntington's disease and cancer. Signal processing tools such as Fourier transforms are routinely used to detect and analyze genomic periodicities. For instance, there is a well-documented periodicity of 3 nucleotides related to codon usage in protein synthesis, which appears as a strong peak at frequency f = 1/3 in the Fourier power spectrum. Using this technique, Sharma et al. (Bioinformatics 20: 1405-1412, 2004) reported a curious finding in the genome of Plasmodium falciparum: a comb of frequency harmonics at multiples of k/21 in a section of 2000 base pairs of chromosome 2. Given that no such frequency multiples had been reported for any other genome, we were motivated to perform a closer inspection of the periodicities in all chromosomes of P. falciparum. The technique used in this work for detecting DNA periodicities is the Fourier transform, a mathematical tool that has been applied successfully to the genomic analysis of periodic repeats. We applied the binary indicator power spectrum to all chromosomes of P. falciparum and found that the frequency overtones, or multiples of k/21, are present only in the non-coding regions of all but one chromosome. Only chromosome 14 has no k/21 frequency multiples. Chromosome 5, on the other hand, also has k/18 multiples in the coding region in addition to the k/21 multiples in the non-coding region. However, we also determined that these frequency multiples are frequency aliases caused by the way the symbolic genomic sequence is converted into a numeric series. For example, when a different encoding of the sequences is chosen, such as dinucleotide mapping, the frequency overtones essentially disappear. In view of these results, the frequency overtones were identified as an artifact of the way the genome is encoded into a numerical sequence. We then revisited early applications of the Fourier transform technique to protein sequences, where frequency overtones were repeatedly reported, and showed that all these overtones in proteins are simple frequency aliases. In the case of P. falciparum the frequency aliases are particularly strong and can mask the 1/3 frequency that is used for gene detection. This shows that, although this is a well-known technique with a long history of application to proteins, there seems to be a general unawareness of the potential problems posed by frequency aliases.
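
The binary indicator power spectrum mentioned above can be sketched as follows: each nucleotide is turned into a 0/1 indicator series, each series is Fourier transformed, and the squared magnitudes are summed. This is a plain O(N^2) discrete Fourier transform written for clarity, not the authors' code, and the period-3 test sequence is hypothetical.

```python
import cmath

def indicator_power_spectrum(seq):
    """Sum over A, C, G, T of |DFT(indicator series)|^2 for k = 0 .. N//2."""
    n = len(seq)
    power = [0.0] * (n // 2 + 1)
    for base in "ACGT":
        u = [1.0 if c == base else 0.0 for c in seq]      # binary indicator series
        for k in range(n // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                        for i, x in enumerate(u))
            power[k] += abs(coeff) ** 2
    return power

# Hypothetical sequence with an exact period of 3 nucleotides (N = 60).
seq = "ATG" * 20
power = indicator_power_spectrum(seq)
# Index of the strongest non-zero frequency; the frequency is k / N.
peak_k = max(range(1, len(power)), key=power.__getitem__)
```

Here the peak falls at k = 20, that is f = 20/60 = 1/3, the codon-related frequency discussed in the abstract. For chromosome-sized sequences an FFT implementation would be needed instead of this direct sum.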

Topic: Sequence Analysis PI: SA05

DETECTION OF GENOME-WIDE STRUCTURAL VARIATION USING NEXT GENERATION SEQUENCING DATA


Navarro F C P1,2 , Galante P A F2 , Parmigiani R B3 , Camargo A A3 , Souza S J D2
1 Departamento de Bioquímica - IQ - Universidade de São Paulo
2 Laboratório de Biologia Computacional - Instituto Ludwig de Pesquisa sobre o Câncer
3 Laboratório de Biologia Molecular e Genômica - Instituto Ludwig de Pesquisa sobre o Câncer

Cancer results from the accumulation of mutations and genomic structural variations involving specific genes. Many somatic alterations have been described for tumor genomes; however, little is known about how they affect normal genomes. A comparison of the two would be crucial to better understand the genetic changes driving tumorigenesis. The current study aims to explore somatic variation in a normal (HCC1954BL) and a tumoral (HCC1954) genome from the same patient, clarifying which events are involved in tumorigenesis. Using two next-generation sequencing platforms, Roche 454 with long reads captured from exonic regions and Illumina-Solexa with genome shotgun paired-end libraries, we sequenced both genomes at a coverage of approximately 22x each. Long reads were used for the detection of somatic small insertions and deletions (indels), while short reads were used for the detection of somatic interchromosomal rearrangements, inversions, deletions and tandem duplications. Using computational methods for the analysis of sequencing data, we detected a total of 94 structural variations in HCC1954, including 49 interchromosomal rearrangements, 30 large deletions, 11 inversions and 4 duplications. Of the 49 interchromosomal events, 38 were within genic regions. In contrast, no interchromosomal rearrangements and only two large deletions and two inversions, both within intergenic or intronic regions, were detected in the HCC1954BL genome. Since the frequency and class of structural variations differ greatly between the two genomes, these results indicate that structural variations play a major role in tumorigenesis. Further work is needed to clarify how structural variations drive tumorigenesis and tumor progression. Supported by CAPES.

Topic: Sequence Analysis PI: SA05

EVALUATION OF ENDONUCLEASES TO BETTER PERFORM T-RFLP OF THE DSRAB GENE


Freitas L A1 , Grativol A D1 , Gatts C E N1
1 Universidade Estadual do Norte Fluminense Darcy Ribeiro

All sulfate-reducing bacteria share the dissimilatory sulfite reductase gene (dsrAB), whose product catalyzes the last step in the sulfate reduction pathway. They are characterized by the use of sulfate as the final electron acceptor and are found in anoxic environments such as landfills, mines and estuaries. They are important in the sulfur and carbon cycles: about 50% of the degradation of organic matter in sulfate-rich environments has been reported to be due to sulfate reduction. The study of microbial diversity has recently been enhanced by the Terminal Restriction Fragment Length Polymorphism (T-RFLP) technique, in which the DNA of environmental samples is amplified with specific primers and subsequently digested with endonucleases to reveal the diversity profile of the community in that environment. Each fragment is considered a phylotype. A critical step in this technique is choosing the best set of endonucleases in order to increase the resolution of the community profile. For this purpose, an in silico digestion of published sequences is recommended. We therefore created a database containing 490 published sequences of the dsrAB gene, which were digested in silico with 190 endonucleases from the REBASE database. The sequences were first aligned and edited so that they were the same size. Low-quality sequences were eliminated from the database. To determine the best set of endonucleases, an algorithmic approach was used to sweep all combinations of endonucleases and select the best one. The set that best distinguished the sequences of the dsrAB gene was composed of the endonucleases MnlI, HinfI and HpyAV. The endonucleases that showed the best resolving power were those with 4 bp and 5 bp recognition sites.
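
The in silico digestion and the sweep over enzyme combinations can be sketched as below. This is a simplified illustration and not the authors' algorithm: it considers only the top strand and a labeled 5' end, scores a combination by how many sequences receive distinct terminal-fragment profiles, and uses three short made-up sequences. HinfI appears in the abstract; HaeIII is added here only as a second, well-known example enzyme.

```python
import re
from itertools import combinations

# name -> (recognition site as a regex, cut offset from the start of the site)
ENZYMES = {
    "HinfI": (r"GA[ACGT]TC", 1),   # G^ANTC
    "HaeIII": (r"GGCC", 2),        # GG^CC
}

def terminal_fragment_length(seq, enzyme):
    """Length of the 5' terminal fragment; the full length if there is no site."""
    site, offset = ENZYMES[enzyme]
    m = re.search(site, seq)
    return m.start() + offset if m else len(seq)

def resolution(seqs, enzyme):
    """Number of distinct terminal fragment lengths produced by one enzyme."""
    return len({terminal_fragment_length(s, enzyme) for s in seqs})

def best_enzyme_set(seqs, size):
    """Exhaustively sweep enzyme combinations of a given size."""
    def score(combo):
        return len({tuple(terminal_fragment_length(s, e) for e in combo)
                    for s in seqs})
    return max(combinations(sorted(ENZYMES), size), key=score)

# Hypothetical aligned amplicons (not real dsrAB sequences).
seqs = ["ATGCGATTCAAGGCCTT", "ATGGCCGACTCAATT", "ATGCGAATCTTTTTT"]
best_single = best_enzyme_set(seqs, 1)
```

With 190 enzymes the number of combinations grows quickly with the set size, so an exhaustive sweep like this one is practical only for small sets such as the three-enzyme set reported in the abstract.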

Topic: Sequence Analysis PI: SA03

IN SILICO IDENTIFICATION AND EXPERIMENTAL VALIDATION OF CONSERVED AND POLYMORPHIC EPITOPES IN DISTINCT TRYPANOSOMA CRUZI STRAINS
Mendes T A D O1 , Rodrigues-Luiz G F1 , Giusta M D S2 , Lobo F P1 , Fujiwara R T1 , Gazzinelli R T2 , Bartholomeu D C1
1 Departamento de Parasitologia - UNIVERSIDADE FEDERAL DE MINAS GERAIS
2 Departamento de Bioquímica e Imunologia - UNIVERSIDADE FEDERAL DE MINAS GERAIS

The factors influencing the variation in the clinical manifestations of Chagas disease have not been elucidated, but it is likely that genetic variation of both the host and the parasite is involved. In fact, the T. cruzi taxon is highly polymorphic. Several studies trying to correlate the T. cruzi strain involved in the infection with the clinical manifestation of the disease have used hemoculture and/or PCR-based genotyping of parasites from infected human tissues. Both techniques have limitations. Hemoculture requires parasite isolation from patient blood and growth in animals or in vitro cultures, and therefore offers opportunity for subpopulation selection. Furthermore, parasitemia in the chronic phase is very low, hindering the detection of the parasites. On the other hand, parasite genotyping directly from infected tissues is a very invasive procedure that requires medical care and hampers studies with large numbers of samples. The goal of this work is to identify conserved and polymorphic T. cruzi linear B-cell epitopes that could be used for serodiagnosis and serotyping by ELISA. To this end, we performed B-cell epitope prediction on the CL-Brener predicted proteome. Proteins derived from single-copy genes represented by pairs of alleles in the CL-Brener genome were targeted. The rationale behind this strategy is that, because CL-Brener is a recent hybrid between the TcII and TcIII lineages, polymorphic epitopes in pairs of alleles of CL-Brener are likely to be polymorphic in the parental genotypes as well. In silico analyses of the pairs of alleles were performed to select conserved and allele-specific epitopes. We excluded CL-Brener predicted epitopes that are also present in the predicted proteomes of Leishmania major, L. infantum, L. braziliensis and T. rangeli, parasites often reported to show serological cross-reactivity with T. cruzi. A peptide array containing 180 predicted linear B-cell epitopes was covalently linked to a cellulose membrane. The reactivity of the peptides is being tested using sera from C57BL/6 mice chronically infected with T. cruzi strains representative of distinct phylogenetic lineages, with non-infected mice as controls. So far, sera from mice infected with the CL-Brener and Y strains have been tested. A total of 66 peptides were considered reactive with sera from animals infected with the CL-Brener and/or Y strain and not reactive with sera from non-infected mice. Nineteen peptides were considered reactive both with sera from animals infected with CL-Brener and with sera from animals infected with the Y strain. Fourteen and nineteen peptides were recognized by sera from mice infected with the CL-Brener and Y strains, respectively. Sera from mice infected with the Colombiana and Dm28 strains will also be tested. Conserved and polymorphic linear B-cell epitopes will be synthesized as soluble peptides and tested by ELISA against a panel of sera from chagasic patients. Financial support: INCTV, FAPEMIG, CNPq, CAPES

Topic: Sequence Analysis PI: SA04

STUDY OF GENETIC VARIABILITY OF GENE FRAGMENTS S1 AND S2 OF AVIAN INFECTIOUS BRONCHITIS VIRUS IN THE STATE OF MINAS GERAIS
Santos C E F1 , Dias M C1 , Resende J S D2 , Mouro M D M3 , Abreu J T4 , Franco G R1
1 Departamento de Bioquímica e Imunologia - UFMG
2 Departamento de Medicina Veterinária Preventiva - UFMG
3 Centro de Pesquisas René Rachou - FIOCRUZ
4 Departamento de Medicina Veterinária - PUC-MG

The coronavirus avian infectious bronchitis virus (IBV) is responsible for a highly contagious disease that affects the poultry industry globally, causing large economic losses. The IBV genome is a positive-sense single-stranded RNA of approximately 27.6 kb which encodes four structural proteins: spike glycoprotein (S), nucleocapsid protein (N), integral membrane glycoprotein (M) and membrane-associated protein (E). The IBV spike glycoprotein is present in a cleaved form composed of two subunits (two to three copies each): the amino-terminal S1 and the carboxy-terminal S2, with 500 and 600 amino acid residues, respectively. The S1 subunit is anchored in the virus envelope through noncovalent interactions with the S2 subunit and presents antigenic determinants that induce virus-neutralizing antibodies and cellular immunity. Serotype classification has been based on studies of the S1 subunit. The development of inactivated or live vaccines against avian infectious bronchitis is based on previous studies of the serotypes circulating in a region, in order to determine the ideal strains for a vaccination program. To investigate which strains are circulating in Minas Gerais, tissue samples from outbreaks of infectious bronchitis in the region were processed and propagated in the allantoic cavity of specific-pathogen-free eggs for isolation and amplification of the viral titer. After extraction of total RNA from allantoic fluid and reverse transcription, gene fragments S1 and S2 were amplified for sequencing on a MegaBACE-1000 (GE Healthcare) instrument. The sequences were translated in silico and aligned. From the alignments, phylogenetic trees were constructed with the ClustalW and TreePuzzle programs using the Neighbor Joining algorithm and the Maximum Likelihood method, respectively. The protein analysis detected isolates both similar to and different from the reference strains, including vaccine strains, regardless of when they were obtained. A high identity was observed among some recent and older isolates, suggesting a possible ancestral relationship between them. Different sequences of the S2 gene fragment were observed in different avian tissues from the same batch, and the phylogenetic trees showed a separation between sequences obtained from tissues of the respiratory and digestive/excretory systems. These data suggest that some strains of IBV may have tissue tropism. Supported by: CNPq, CAPES, FAPEMIG, BIOVET.

Topic: Sequence Analysis PI: SA06

NAIVE BAYES CLASSIFIER APPLIED TO DNA BARCODING


Costa R L2 , Barreto A D M S2
1 Laboratório Nacional de Computação Científica
2 National Laboratory of Scientific Computing - LNCC/MCT
3 National Laboratory of Scientific Computing - LNCC/MCT

Molecular taxonomy is one of the greatest challenges of modern biology. One way to address this problem is to classify an individual based on information contained in multiple regions of its genome. Classification techniques that use this strategy are in general effective, but have a high computational cost. On the other hand, methods that analyze a single genomic region (DNA barcoding) are simpler and more computationally efficient. Studies in DNA barcoding are useful in species inventory and handling, vector control, environmental monitoring and the identification of cryptic species. In most animals, classification is based on the cytochrome oxidase (COI) gene, present in mitochondrial DNA. The classification algorithms used by the Barcode of Life Initiative are based on the combination of the k-nearest neighbor method with a neighbor-joining tree analysis. Recently, some alternatives to these techniques have been adopted, such as the statistical analysis of evolutionary distances and machine learning algorithms, for example neural networks and support vector machines. In sequence analysis, the number of attributes (nucleotides or amino acids) can be huge, depending on the gene or genomic region being studied. This increases the dimension of the problem and can severely degrade the performance of the classifier. One way to handle high-dimensional problems is to use the Naive Bayes (NB) classifier. NB is a classification algorithm that assumes the attributes are conditionally independent given the class, which greatly reduces the complexity of the classification problem. In this study we used the NB algorithm to classify two sets of COI sequences extracted from GenBank and aligned with ClustalW. The goal of the experiment was to identify the species to which the sequences belong. The first dataset was composed of 281 COI sequences from six species of the genus Artibeus (Chiroptera). The second set consisted of 248 sequences belonging to seven species of the genus Litoria (Hylidae). The nucleotide sequences were encoded as numerical vectors and each dataset was partitioned into a training set and a test set. The test set was composed of three individuals randomly selected from each species and was used to estimate the classification accuracy of the NB algorithm. The NB classifier performed well in the experiments, with success rates of 95% and 96% on the Artibeus and Litoria genera, respectively. Interestingly, the two misclassified species, A. obscurus and L. serrata, were mistaken for species that are close in the evolutionary tree (namely, A. lituratus and L. genimaculata). The good performance of the NB algorithm on taxonomic classification is important for several reasons. Besides the qualities mentioned above, this classifier has the great advantage of being able to classify individuals using only fragments of DNA. This property is particularly useful in metagenomic analysis, which has recently been attracting a lot of attention from the scientific community. In addition, the structure of the NB classifier provides information on the frequency of nucleotides associated with each species, which is potentially useful in population studies of inter- and intra-specific molecular variation.
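
A per-position Naive Bayes classifier for aligned sequences can be sketched as follows. The abstract does not specify the encoding or smoothing used, so the categorical model, the Laplace smoothing constant, the treatment of unknown characters and the tiny training set are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT-"

def train(seqs, labels, alpha=1.0):
    """Estimate log priors and per-position log likelihoods (Laplace smoothed)."""
    by_class = defaultdict(list)
    for s, y in zip(seqs, labels):
        by_class[y].append(s)
    model = {}
    for y, members in by_class.items():
        log_prior = math.log(len(members) / len(seqs))
        denom = len(members) + alpha * len(ALPHABET)
        columns = []
        for pos in range(len(seqs[0])):          # assumes equal-length aligned sequences
            counts = Counter(s[pos] for s in members)
            columns.append({c: math.log((counts[c] + alpha) / denom)
                            for c in ALPHABET})
        model[y] = (log_prior, columns)
    return model

def classify(model, seq):
    """Pick the class with the highest posterior; unknown symbols are skipped."""
    def score(y):
        log_prior, columns = model[y]
        return log_prior + sum(col.get(c, 0.0) for col, c in zip(columns, seq))
    return max(model, key=score)

# Hypothetical aligned fragments for two made-up species labels.
train_seqs = ["ACGTAC", "ACGTAA", "TCGAAC", "TCGAAG"]
train_labels = ["species_A", "species_A", "species_B", "species_B"]
model = train(train_seqs, train_labels)
prediction = classify(model, "ACGTAG")
```

Because each position contributes an independent term to the score, positions with missing data can simply be skipped, which is consistent with the remark in the abstract that the classifier can work with fragments of DNA. The per-position tables in the model are the nucleotide frequencies associated with each species.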

Topic: Sequence Analysis PI: SA05

AN EFFICIENT COMPUTATIONAL APPROACH TO IDENTIFY NEW SNVS


Sousa R G M A D1,2 , Silva I T D1,2 , Oliveira T Y K1,2 , Pinheiro D G1,2 , Silva-Jr W A D1,2
1 Departamento de Genética - Faculdade de Medicina de Ribeirão Preto - Universidade de São Paulo
2 Centro Regional de Hemoterapia do Hospital das Clínicas da Faculdade de Medicina de Ribeirão Preto - Universidade de São Paulo

Single Nucleotide Variations (SNVs) are the most abundant form of variation in the genome. SNVs are partially responsible for individual differences associated with the phenotypic heterogeneity of diseases and with drug response and tolerance. At the same time, the identification of SNVs may represent a valuable source for understanding the molecular basis of various genetic diseases such as cancer. In order to identify and characterize SNVs in coding regions of genes expressed in tumor cells, we established a computational approach that uses a set of Expressed Sequence Tags (ESTs) together with a strict filter that eliminates paralogous regions and low-quality sequences. Additionally, we included information about the number of distinct libraries supporting each SNV to increase the chance of validation. We obtained 274 SNVs (179 synonymous and 95 non-synonymous), which were validated in silico against dbSNP v130, a database of single nucleotide polymorphisms. A total of 231 SNVs (84.31%) were already cataloged. The high validation rate observed suggests that the proposed methodology efficiently identifies SNVs when ESTs are used as a source of data. Therefore, this approach can be useful for identifying new SNVs in tumors and probably in other kinds of cells.

Topic: Sequence Analysis PI: SA04

RICINUS COMMUNIS AS A SUITABLE REFERENCE GENOME FOR H. BRASILIENSIS TRANSCRIPT ANNOTATION


Salgado L R1,2 , Vencio R Z N1
1 LabPIB - Laboratório de Processamento de Dados Biológicos
2 Depto. Genética - FMRP/USP

Hevea brasiliensis, often simply called the rubber tree, belongs to the family Euphorbiaceae and is the most economically important member of the genus Hevea. Its importance comes from its sap-like extract (known as latex), which can be collected and is the primary source of natural rubber. With the development of new sequencing technologies, obtaining an entire transcriptome has become a matter of hours of sequencing machine run time. The amount of data generated is on the order of several Gb, and analyzing these data is not an easy task. One of the first steps of a transcriptome analysis is the annotation of the sequencing data against a reference genome, and the choice of a suitable organism is not trivial when the genome of the organism under study is not available. The reference organism can be highly divergent, depending on the phylogenetic distance between the query data and the reference. The objective of the present work is to show the feasibility of using the Ricinus communis (Euphorbiaceae) genome for the annotation of Hevea brasiliensis (Euphorbiaceae) transcripts, in comparison with other plant genome models. Using the stand-alone version of NCBI BLAST (Altschul, 1990), we performed a local tBLASTx search against an R. communis database to annotate H. brasiliensis (Hb) ESTs extracted from NRESTdb (http://genome.ukm.my/nrestdb/), an Hb EST database created by a Malaysian consortium. Aligning 9896 Hb EST sequences to the complete genome assembled by the Vader Institute, we obtained a median score of 98.4 with a median absolute deviation (mad) of 61.67. The median and mad of the E-value were 3.0x10^-32 and 4.4x10^-32, respectively, and the median identity was 75% with a mad of 20%. Approximately 1.4% of the ESTs found no hits at all against R. communis. The appropriateness of this approach can be seen when the same analysis is made using the best-known plant genome model, Arabidopsis thaliana. When the same annotation procedure was run against the Arabidopsis database, the values obtained were 104 (mad 100), 3.0x10^-28 (mad 4.4x10^-28) and 64% (mad 23%) for score, E-value and identity, respectively, indicating a lower-quality annotation when a phylogenetically distant or otherwise inappropriate reference genome is used. The data generated by annotation have a large influence on future work that makes use of this kind of information, so the choice of a suitable reference genome is a serious consideration to be taken into account.
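
The summary statistics reported above (median and median absolute deviation of score, E-value and identity) can be computed as in the minimal sketch below. The function name `median_and_mad`, the `scale` parameter and the toy bit scores are illustrative assumptions; the abstract does not say whether a consistency constant (such as the 1.4826 factor some statistics packages apply by default) was used.

```python
from statistics import median


def median_and_mad(values, scale=1.0):
    """Return (median, median absolute deviation) of a list of numbers.

    `scale` is an assumption: pass 1.4826 to mimic packages that rescale
    the MAD to be comparable with a standard deviation.
    """
    center = median(values)
    mad = scale * median(abs(v - center) for v in values)
    return center, mad


# Hypothetical bit scores of the best tBLASTx hit for five ESTs.
toy_scores = [98.4, 36.7, 160.1, 75.0, 120.3]
score_median, score_mad = median_and_mad(toy_scores)
print(f"median score = {score_median}, mad = {score_mad:.2f}")
```

Running the same function on the per-EST E-values and identities for each reference database would give the pairs of values being compared between R. communis and A. thaliana.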


Topic: Sequence Analysis PI: SA04

EVALUATION OF GENE FINDERS USING SPLICING GRAPHS


Kashiwabara A Y1 , Durham A M1
1 Department of Computer Science, Institute of Mathematics and Statistics, University of Sao Paulo

Gene finders play an important role in any genome project, and the task of choosing the best programs is far from trivial for the following reasons: (i) using biased benchmarks makes the evaluation unfair; (ii) gene finders have a set of arbitrary parameters whose values change the predictions; and (iii) the available evaluations rely on accuracy metrics that are insufficient because they omit important systematic errors. Despite limitation (i), it is important to have tools that assess the accuracy of different methods and parameter settings when annotating a new genome sequence. There are two programs that calculate the metrics described by Burset et al (1996): (i) Eval (Keibler et al, 2003); (ii) GFPE (Wang and Kraemer, 2003). These metrics have become standard, but they assume the absence of alternative splicing events, in spite of the fact that recent research has shown that 95%-100% of multi-exon genes present alternative splicing events (Nielsen and Graveley, 2010). Moreover, these metrics represent an average of the accuracy, which omits important systematic errors. We have generalized the standard accuracy metrics by assuming the presence of alternative splicing. Our approach consists of using splicing graphs (Sammeth, 2009), which represent annotated gene variants and predicted gene structures in the same graph. We have implemented a Perl program called SGEval (Splicing Graph Evaluation) that compares a list of predictions against a reference annotation. SGEval provides individual visualization of each predicted gene, facilitating the identification of systematic errors. For example, we have identified that small exons and large exons are systematically mispredicted. Future work consists of using SGEval to implement more accurate gene finders through the identification of this kind of systematic error. Supported by: CNPq (141069/2007)

References
Burset et al (1996). Evaluation of gene structure prediction programs. Genomics. 34(3):353-67.
Keibler et al (2003). Eval: a software package for analysis of genome annotation. BMC Bioinformatics. 4:50.
Wang and Kraemer (2003). GFPE: gene-finding program evaluation. Bioinformatics. 19(13):1712-1713.
Nielsen and Graveley (2010). Expansion of the eukaryotic proteome by alternative splicing. Nature. 463(7280):457-63.
Sammeth (2009). Complete Alternative Splicing Events Are Bubbles in Splicing Graphs. Journal of Computational Biology. 16(8):1117-1140.
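
To make the idea of evaluating predictions on a splicing graph concrete, the sketch below is a simplified illustration and not the SGEval implementation (which is written in Perl and whose internals are not described here). It assumes that exons are `(start, end)` tuples, that graph nodes are exons and graph edges are introns, and that sensitivity and specificity are computed as the shared fraction of reference and predicted elements, respectively. All function names are hypothetical.

```python
def splicing_graph(transcripts):
    """Merge all transcripts of a gene into one graph: exon nodes, intron edges."""
    exons, introns = set(), set()
    for exon_list in transcripts:
        ordered = sorted(exon_list)
        exons.update(ordered)
        # Each pair of consecutive exons defines an intron (donor, acceptor).
        for (_, donor), (acceptor, _) in zip(ordered, ordered[1:]):
            introns.add((donor, acceptor))
    return exons, introns


def sn_sp(reference, predicted):
    """Sensitivity = shared/reference; specificity (gene-prediction sense) = shared/predicted."""
    shared = len(reference & predicted)
    sn = shared / len(reference) if reference else 0.0
    sp = shared / len(predicted) if predicted else 0.0
    return sn, sp


def evaluate(ref_transcripts, pred_transcripts):
    ref_exons, ref_introns = splicing_graph(ref_transcripts)
    pred_exons, pred_introns = splicing_graph(pred_transcripts)
    return {"exon": sn_sp(ref_exons, pred_exons),
            "intron": sn_sp(ref_introns, pred_introns)}


# Toy gene: two annotated variants (one skips the middle exon), one prediction
# with a wrong 3' end on the last exon.
reference = [[(1, 100), (201, 300), (401, 500)], [(1, 100), (401, 500)]]
prediction = [[(1, 100), (201, 300), (401, 520)]]
print(evaluate(reference, prediction))
```

Because both variants live in the same graph, the exon-skipping intron counts as a missed reference element instead of being averaged away, which is the kind of systematic error the abstract argues that per-transcript averages hide.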


Topic: Sequence Analysis PI: SA03

BRINGING BIOINFORMATICS TO NON COMPUTER-LITERATE BIOLOGISTS: A PILOT STUDY AT FIOCRUZ-MG


Aguiar E R G R1 , Oliveira G C1,2 , Coimbra R S1,2
1 Center for Excellence in Bioinformatics - FIOCRUZ-MG
2 Genomics and Computational Biology Group - FIOCRUZ-MG

The development of friendly interfaces for accessing bioinformatics programs in an integrated framework may be of great assistance to non computer-literate biologists. Galaxy (http://usegalaxy.org) is a web-based framework developed to facilitate access to bioinformatics tools. Besides offering a simple interface to many well-known programs, the framework supports complex pipelines, storage of results and integration of new programs. Additionally, Galaxy can be integrated with job schedulers to obtain the highest efficiency on computer clusters. All resources and programs are available through a web interface, without the need for local installation. The main goal of this project is to stimulate and promote efficiency in data analysis with bioinformatics tools by shortening the learning curve. For this purpose, we are implementing a customized Galaxy framework at the Center for Excellence in Bioinformatics (http://www.cebio.org) that will include the most popular bioinformatics tools, most of which are not available in the standard distribution. Galaxy core modules were modified to handle new input file types such as chromatograms and zip files. Three pipelines (immunoinformatics, phylogeny and Sanger sequence analysis), 19 new input file types and 26 program interfaces were implemented. New programs will be continuously integrated and their interfaces developed. Demand will be identified by surveying the needs of FIOCRUZ-MG researchers. Usability tests will be applied to volunteers from the FIOCRUZ-MG community to fine-tune interfaces and tools. Features such as automatic data validation and format conversion, together with the reuse of results and parameters from previous executions, will directly benefit the analyses and also stimulate the use of bioinformatics tools by the scientific community. Supported by:


Topic: Sequence Analysis PI: SA04

ALUMINUM-INDUCED GENES IN GRASS SPECIES


Noda R W1 , Guimarães C T1 , Gomes E A1 , Carneiro N P1 , Lana U G P1 , Magalhães J V D1 , Costa M M C2 , Silva F R D3 , Brammer S P4 , Lângaro N C5 , Carvalho L J C B2 , Bevitori R6 , Purcino A A C1
1 EMBRAPA MILHO E SORGO, Sete Lagoas/MG
2 EMBRAPA RECURSOS GENÉTICOS E BIOTECNOLOGIA, Brasília/DF
3 EMBRAPA INFORMÁTICA AGROPECUÁRIA, Campinas/SP
4 EMBRAPA TRIGO, Passo Fundo/RS
5 UNIVERSIDADE DE PASSO FUNDO - UPF, Passo Fundo/RS
6 EMBRAPA ARROZ E FEIJÃO, Goiânia/GO

The toxicity caused by aluminum (Al), intrinsic to acid soils, negatively influences the stability of crop production. Under toxic levels of Al, plant roots stop developing and become unable to explore the deeper layers of the soil, which affects nutrient and water acquisition and reduces crop yield. This study aimed to identify genes associated with Al tolerance mechanisms in grasses, using cDNA and subtractive libraries derived from roots of tolerant genotypes of maize, rice, sorghum, oat, barley, wheat and brachiaria subjected to critical levels of Al in nutrient solution. We analyzed 5,304 sequences, of which 3,869 were considered of good quality (Phred score: 13 and minimum number of bases: 70). Additionally, 391 sequences without quality scores that showed similarity in BLASTn searches against public database sequences were added to the clustering process. A total of 4,260 quality sequences were clustered with CAP3, generating 567 contigs and 1,009 singletons. The contigs contained from 2 to 92 sequences. BLAST2GO was used to determine the putative roles and ontologies of the sequences by combining results from BLAST, InterProScan, Gene Ontology (GO) and KEGG metabolic pathways. Out of the 1,576 unique sequences (contigs + singletons), 953 received GO terms from the BLAST2GO annotation. Al tolerance mechanisms are divided into two main types: (1) exclusion, which prevents Al uptake into the cell; and (2) symplastic, which immobilizes or neutralizes Al in specific locations inside the cell. Therefore, we focused our initial search on sequences associated with transport, biotic and abiotic stress, and membrane components. Next, we searched for sequences related to organic acid compounds such as malate, citrate and oxalate. The term transport appears under GO terms in 113 unique sequences, membrane in 335 and stress in 45, while the combination of transport and membrane was found in 82, membrane and stress in 19, transport and stress in nine, and stress, transport and membrane in four. For organic compounds, malate appears under GO terms in four unique sequences and citrate in three, while oxalate does not appear. Using KEGG, 12 sequences showed similarity with nine enzymes of the TCA cycle. A large number of genes were induced under Al stress in grass roots, including genes commonly found in other abiotic stresses. This strategy will allow us to identify Al tolerance mechanisms common to several grass species. Supported by: FAPEMIG (NuBio - TCT 12.009/09), McKnight, Embrapa, GCP, CNPq
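
The keyword counts above (single terms and their combinations) amount to set intersections over the annotated sequences. The sketch below shows one way to tabulate them; it assumes annotations are available as a mapping from sequence identifier to a list of GO term descriptions, and the identifiers, descriptions and the function name `keyword_overlaps` are invented for illustration. It is not the BLAST2GO workflow itself.

```python
from itertools import combinations


def keyword_overlaps(annotations, keywords):
    """Count sequences whose GO descriptions mention every keyword in each combination."""
    hits = {
        kw: {seq for seq, terms in annotations.items()
             if any(kw in term.lower() for term in terms)}
        for kw in keywords
    }
    counts = {}
    for size in range(1, len(keywords) + 1):
        for combo in combinations(keywords, size):
            shared = set.intersection(*(hits[kw] for kw in combo))
            counts[combo] = len(shared)
    return counts


# Toy annotation table (hypothetical contigs and GO descriptions).
toy_annotations = {
    "contig1": ["malate transport", "integral to membrane"],
    "contig2": ["response to stress"],
    "contig3": ["membrane", "response to oxidative stress"],
    "singleton1": ["citrate synthase activity"],
}
for combo, n in keyword_overlaps(toy_annotations, ["transport", "membrane", "stress"]).items():
    print(" + ".join(combo), n)
```

Note that with this definition a sequence counted under "transport + membrane" is also counted under each single keyword, so the categories are not mutually exclusive.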


Topic: Sequence Analysis PI: SA08

PREDICTING TARGETS OF MIRNAS IN SCHIZOPHRENIA


Piovezani A R1 , Brentani H P2 , Machado-Lima A3
1 Interunidades em Bioinformática - Universidade de São Paulo
2 Faculdade de Medicina - Universidade de São Paulo
3 Escola de Artes, Ciências e Humanidades - Universidade de São Paulo

Recent genetic discoveries have identified new genes involved in many psychiatric disorders. However, how these genes are regulated and how they relate to behavioral disorders is far from being completely elucidated. MicroRNAs (miRNAs) are a family of non-coding RNAs (ncRNAs) of 22 nt that participate in gene expression through negative regulation. miRNAs can hybridize to the 3' UTR of mRNAs to direct their post-transcriptional repression through translation inhibition. It is estimated that miRNAs regulate 60% of protein-coding genes through this mechanism and that 1-2% of all human genes are miRNA genes. Current studies show that approximately one-third of the miRNAs are expressed in the brain, where they have been shown to be involved in maintaining brain function and to affect neuronal differentiation, the synaptosomal complex and other processes. These are exactly the processes disrupted in schizophrenia. There are many computational tools for miRNA target prediction in animals. Studies of miRNAs and their targets can help to understand the involvement of miRNAs in schizophrenia. We are performing a computational prediction of miRNA targets in genes related to this disorder. The predictions will be analysed together with gene expression data from brains of individuals affected by schizophrenia and of controls.
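
The abstract does not name the prediction tools used. As a hedged illustration of the simplest signal most animal target predictors start from, the sketch below scans a 3' UTR for perfect matches to the reverse complement of the miRNA seed (nucleotides 2-8). The function names, the example sequences and the choice of a 7-nt seed are assumptions for illustration; real tools add conservation, site context and free-energy criteria.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_sites(mirna, utr, seed_start=1, seed_len=7):
    """Return 0-based UTR positions that pair perfectly with the miRNA seed.

    With the defaults the seed is miRNA nucleotides 2-8 (slice 1:8).
    """
    seed = mirna[seed_start:seed_start + seed_len]
    target = reverse_complement(seed)
    positions = []
    pos = utr.find(target)
    while pos != -1:
        positions.append(pos)
        pos = utr.find(target, pos + 1)
    return positions


# Toy example: a 22-nt miRNA and an invented 3' UTR fragment.
example_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
example_utr = "AAACUACCUCAAAGGGCUACCUCUU"
print(seed_sites(example_mirna, example_utr))
```

Candidate sites found this way could then be cross-referenced with expression data, in the spirit of the joint analysis the abstract proposes.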


Topic: Sequence Analysis PI: SA05

CLUSTERING OF SEQUENCES WITHIN THE 3' UNTRANSLATED REGIONS OF CO-EXPRESSED TRANSCRIPTS IN TRYPANOSOMA CRUZI
Neto R P D M1 , Pontes A1 , Santos M A D1
1 Universidade Federal de Minas Gerais

Trypanosoma cruzi, the etiologic agent of Chagas disease in humans, is a protozoan parasite that assumes four morphological stages during its cycle in insect and mammalian hosts. This parasite belongs to the Trypanosomatidae family, a group of organisms with unusual mechanisms of gene expression. This is of particular importance for T. cruzi and other kinetoplastids, given the generally accepted view that regulation of gene expression in the kinetoplastids is almost entirely post-transcriptional. There is no identifiable RNA polymerase II promoter, the protein-coding sequences are transcribed as large polycistronic units that are processed through trans-splicing, cis-splicing is a rare event because only four genes have introns, and most mitochondrial mRNAs undergo extensive RNA editing. Because the genes are transcribed constitutively, most control of gene expression in these organisms occurs post-transcriptionally. Other studies have implicated mRNA processing, translational repression, polysome recruitment and codon adaptation in the regulation of gene expression in the kinetoplastids, all processes that would be predicted to mitigate the role of mRNA abundance regulation in determining protein expression levels. Processing of polycistronic transcripts to generate monocistronic mRNAs involves two coupled co-transcriptional RNA-processing reactions: SL trans-splicing, which results in the addition of the spliced leader (SL) sequence at the 5' UTR, and polyadenylation at the 3' UTR of each mRNA. Besides this, mRNA stabilization and translational control are important steps that modulate gene expression in these parasites. It is known that mRNA degradation and translation efficiency may be mediated by regulatory elements within the 3' UTRs of the transcripts. Therefore, the aim of our work is to cluster conserved regions within T. cruzi stage-specific mRNA sequences that may regulate gene expression in this organism. ESTs from each life stage were downloaded from GenBank and parsed with SeqClean and Perl scripts to pick our target strings. The parsed ESTs were aligned against automatically predicted ORFs with Megablast. As a result, we kept only real 3' UTR sequences. These sequences were parsed by .NET scripts to produce a distance matrix. This matrix was analyzed using SVD clustering methods and compared with data derived from the expression levels of the corresponding mRNAs. Three clusters were observed, suggesting that genes from each cluster have similar expression.
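
The abstract does not state which distance measure was used to build the matrix. As one plausible, alignment-free stand-in, the sketch below computes the Euclidean distance between normalized k-mer frequency profiles of the 3' UTR sequences. The choice of k = 3, the function names and the toy sequences are assumptions; the resulting matrix is the kind of input that an SVD-based clustering step would take.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=3):
    """Relative frequency of each overlapping k-mer in the sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:  # sequence shorter than k
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def profile_distance(p, q):
    """Euclidean distance between two k-mer profiles."""
    keys = sorted(set(p) | set(q))
    return sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))


def distance_matrix(seqs, k=3):
    profiles = [kmer_profile(s, k) for s in seqs]
    return [[profile_distance(a, b) for b in profiles] for a in profiles]


# Toy 3' UTR fragments: two AU-rich sequences and one GC-rich sequence.
toy_utrs = ["AUUUAUUUAUUUA", "AUUUAUUUAUUAA", "GGGCCCGGGCCCG"]
for row in distance_matrix(toy_utrs):
    print(["%.3f" % d for d in row])
```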


Topic: Sequence Analysis PI: SA03

DESIGN OF A QUINTILLION OF GENES THAT CODIFY FOR SYNTHETIC MINI-ANTIBODIES (SCFV)


Ribeiro C1 , Guedes R L M1 , Ribeiro H A L1 , Coelho-Junior O1 , Ortega J M1
1 Laboratório de Biodados, UFMG, Belo Horizonte, Brasil

One approach to obtain an intracellular phenotypic knockout of protein activity is the expression of a gene designed to block the target. Uses of this approach range from blocking viral replication, e.g. in transgenic plants, to serving as an alternative to the popular RNA interference method. A strong candidate for this role is a mini-antibody containing the variable fragments of both the heavy and the light chain, known as a single-chain variable fragment (ScFv). Engineered or synthetic ScFvs in which the CDR loops are encoded by degenerate codons such as NNS (any nucleotide at positions one and two and either G or C at the third position) have been constructed and selected with the phage-display technique. However, the stop codon TAG may occur in the random loops, which prevents selection with the yeast two-hybrid system, for instance. Aiming to encode the six CDRs of a synthetic ScFv with random nucleotides, we aligned all available sequences present either in PDB or in the nr database to a prototype ScFv and determined the amino acid usage for all CDR positions. Results did not differ significantly between the PDB and nr datasets, so we used all sequences from nr for the processing. Moreover, we determined the amino acid usage for the translation of every degenerate codon that did not produce any of the three stop codons, and selected the best codon for each synthetic CDR position using two criteria: (i) the degenerate codon must be highly used in S. cerevisiae, to favor its use in the two-hybrid selection system, and (ii) the degenerate codon must simulate usage in natural antibodies. As an example, the sequence for the smallest CDR, CDR1, resulted in RRT, TAT, NNT, ATK, RVT. Some positions ended up fixed, such as position 2 of CDR1 above, since TAT codes for tyrosine (most used codon in S. cerevisiae), which presents 48% of usage in the natural antibodies. A total of 2e26 distinct amino acid combinations can be generated with the final design. Grouping amino acids with similar properties, the total number of combinations reaches 1e17, thus allowing for a putatively efficient selection. Thus, with bioinformatics analyses we were able to design a synthetic ScFv that favors codon usage in yeast and avoids stop codons while simulating the usage in natural antibodies. Next, a library will be constructed and screened with a liquid yeast two-hybrid approach. Supported by: FAPEMIG
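
The screening of degenerate codons described above can be sketched as follows: expand each IUPAC degenerate codon into its concrete codons, translate them with the standard genetic code, and reject any degenerate codon that can yield a stop. The helper names are illustrative and the yeast codon-usage and antibody-usage scoring criteria are not reproduced here; only the stop-codon check and the count of encodable amino acids are shown.

```python
from itertools import product

# IUPAC nucleotide codes (only those needed for common degenerate codons).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
         "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def expand(degenerate_codon):
    """All concrete codons represented by a degenerate codon such as 'NNS'."""
    return ["".join(c) for c in product(*(IUPAC[b] for b in degenerate_codon))]


def amino_acids(degenerate_codon):
    """Set of residues ('*' = stop) encoded by a degenerate codon."""
    return {CODON_TABLE[c] for c in expand(degenerate_codon)}


def is_stop_free(degenerate_codon):
    return "*" not in amino_acids(degenerate_codon)


# CDR1 design quoted in the abstract.
cdr1 = ["RRT", "TAT", "NNT", "ATK", "RVT"]
combinations = 1
for codon in cdr1:
    residues = amino_acids(codon)
    combinations *= len(residues)
    print(codon, "".join(sorted(residues)), "stop-free:", is_stop_free(codon))
print("distinct CDR1 peptides:", combinations)
```

The same check shows why NNS is problematic for two-hybrid selection: its expansion includes TAG.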


Topic: Sequence Analysis PI: SA03

IDENTIFICATION AND CLASSIFICATION OF NCRNAS IN TRYPANOSOMA CRUZI: A MULTISTEP APPROACH


Grynberg P1 , Bitar M2 , Paschoal A R3 , Durham A M3 , Franco G R1
1 Universidade Federal de Minas Gerais
2 Universidade Federal do Rio de Janeiro
3 Universidade de Sao Paulo

Non-coding RNA (ncRNA) prediction has become a vast field of research, and several classes of ncRNAs with different regulatory, catalytic and structural functions have been discovered. To enable these new molecular characterizations, computer-based techniques were developed and coupled with experimental designs. Trypanosomatids such as Trypanosoma cruzi, Trypanosoma brucei and Leishmania sp. are the etiologic agents of high-incidence tropical diseases. Considering that genes in trypanosomatids are transcribed as polycistronic units, ncRNAs are likely to have fundamental roles in gene expression mechanisms. In recent years, three kinetoplastid genomes have been finished, and a recent study predicting ncRNAs in Leishmania braziliensis and T. brucei has been published. Similarly, we propose to predict and classify ncRNAs for the complete genome of T. cruzi. For this purpose, we used eQRNA, an algorithm for comparative analysis of biological sequences that performs probabilistic inference on genomic alignments. eQRNA identifies conserved sequences that show mutational patterns consistent with a preserved RNA secondary structure, as opposed to conserved coding frames and/or other genomic features. The entire genomes of T. brucei and T. cruzi were used to generate the initial alignments submitted to eQRNA, and 4195 ncRNA candidate sequences of 30 nucleotides or longer were found. The candidate sequences were used for a blastx search (e-value = 10e-05) against T. cruzi annotated proteins. 2813 candidates matched protein-coding sequences, and the remaining 1382 candidates were submitted to a pipeline that included searches against 25 different ncRNA databases, ab initio RNA tools and structural analysis. 1301 candidates had no evidence supporting their classification as ncRNAs, and 49 candidates were tRNAs or rRNAs. Twenty-nine candidates presented similarity with ncRNAs from several databases. Three were considered false positives. For cases where no functional prediction was achieved, an energy-based approach to characterize native-like structures of ncRNAs was applied. This new methodological tool takes into account that native-like structures of macromolecules are likely to present a lower energy than structures generated at random with no biological function. To test this statement in the context of ncRNAs, we developed a software tool coupled with the Vienna RNA package for energetic assessment of RNA molecules. An additional strategy aims to classify ncRNAs according to their RNA family. This computer-based tool consists of an energetic map that takes into account the length of the RNA sequences and their RNA family. The map is generated from a 2D graph of length vs. energy, with points colored by RNA family type. All sequences used in this methodology were retrieved from RNAstrand. Preliminary tests point to an appropriate performance of these new computer-based methodologies. Other in silico approaches that make use of energy parameters will be employed to test the validity of the identified ncRNAs and to group these sequences according to family. Our next goal is to identify putative regulatory ncRNAs that may be directed to UTR elements by matching them to a catalog of 5' and 3' UTR sequences of T. cruzi transcripts retrieved from dbEST.
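
The energy-based idea (a native-like ncRNA should fold to a lower energy than random sequences of the same composition) can be illustrated with a z-score against shuffled sequences. The sketch below is not the authors' tool: to stay self-contained it replaces the Vienna RNA package with a toy energy, minus the maximum number of nested base pairs found by a Nussinov-style dynamic program. The function names, the minimum loop length of 3 and the number of shuffles are assumptions; in practice the `energy` argument would wrap a real minimum-free-energy calculation.

```python
import random
from statistics import mean, pstdev

PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def toy_energy(seq, min_loop=3):
    """Toy stand-in for a folding energy: -(maximum number of nested base pairs)."""
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]  # position i left unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (seq[i], seq[k]) in PAIRS:  # pair i with k
                    inside = dp[i + 1][k - 1]
                    outside = dp[k + 1][j] if k < j else 0
                    best = max(best, 1 + inside + outside)
            dp[i][j] = best
    return -dp[0][n - 1]


def energy_z_score(seq, energy=toy_energy, shuffles=100, seed=0):
    """Z-score of the sequence energy against composition-preserving shuffles."""
    rng = random.Random(seed)
    native = energy(seq)
    background = []
    for _ in range(shuffles):
        letters = list(seq)
        rng.shuffle(letters)
        background.append(energy("".join(letters)))
    sigma = pstdev(background)
    if sigma == 0:
        return 0.0
    return (native - mean(background)) / sigma


hairpin = "GGGGGGAAAACCCCCC"
print(toy_energy(hairpin), energy_z_score(hairpin))
```

A strongly negative z-score marks a candidate whose structure is more stable than expected from its composition alone; plotting energy against sequence length for sequences of known family would give a map of the kind the abstract describes.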


Topic: Sequence Analysis PI: SA05

USE OF ARTIFICIAL IMMUNE SYSTEMS IN A PROTEIN CLASSIFICATION SYSTEM


Shimo H K1 , Oliveira L L D1 , Tinós R1
1 Department of Physics and Mathematics, FFCLRP, University of São Paulo

Proteins have important functions in biological organisms. These essential macromolecules can be classified in different ways according to their characteristics and their roles in the most diverse processes. The classification of proteins according to their structures is generally too difficult and expensive. The development of computational tools to help researchers classify proteins in different processes can provide guidelines for starting wet-lab experiments. In a previous work, a framework was developed integrating tools for classification, alignment and visualization of the three-dimensional structure of proteins. The classification tool employed an Artificial Neural Network (ANN) with weights optimized by the Backpropagation algorithm. In recent years, some researchers have proposed the use of population metaheuristics to train ANNs as an alternative to Backpropagation, in order to reduce local optima problems. In this work, an Artificial Immune System (AIS), a metaheuristic inspired by theories of the adaptive immune response, is proposed to optimize the ANN weights in the protein classification system previously presented. The immune system works by recognizing antigens (membrane proteins of pathogens) and inhibiting the action of the pathogens through antibodies, proteins produced by the cells of the immune system. The AIS considers a set of immune system cells as candidate solutions for the optimization problem. The optimization is inspired by three principles of immunology: clonal selection, where the best-fit cells are selected and reproduced more intensively; affinity maturation, which states that cells undergo mutation at rates inversely proportional to their affinity to the pathogen; and immune network theory, where cells of the immune system can interact in an inhibitory way. In the protein classification system, the direct use of the protein sequence as input to the classification engine is not suitable because the sequences have different lengths. Hence, a 2-gram representation, which consists of computing the frequency of every possible transition between any 2 amino acids in a protein sequence, is used to encode the protein. Here, the fitness function is based on the classification error rate of the ANN using the weights defined by the cells of the AIS. Supported by: FAPESP
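
The 2-gram encoding described above maps a protein of any length to a fixed-length vector, one entry per ordered pair of amino acids. A minimal sketch, assuming the 20 standard residues (so 400 features) and relative frequencies as the feature values; the function and constant names are illustrative, and pairs containing non-standard residues are simply skipped.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def two_gram_vector(protein):
    """Relative frequency of every amino acid transition, in PAIRS order."""
    counts = dict.fromkeys(PAIRS, 0)
    for a, b in zip(protein, protein[1:]):
        if a + b in counts:
            counts[a + b] += 1
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in PAIRS]


# Sequences of different lengths yield vectors of the same size,
# which is what a fixed-input neural network needs.
short = two_gram_vector("MKV")
longer = two_gram_vector("MKVLAAGIVGLLLAQ")
print(len(short), len(longer))
```

In the setting of the abstract, each candidate solution (cell) of the AIS would hold one set of ANN weights, and its fitness would be the classification error obtained when vectors like these are fed to the network.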


Topic: Sequence Analysis PI: SA03

FINDING CODING SEQUENCES WITH CONFLICTING ATTRIBUTES DURING THE PROCESS OF PROKARYOTIC GENOME ANNOTATION
Saji G D R Q1 , Canto M E1 , Nicolas M F1
1 The National Laboratory for Scientific Computing - LNCC

The advent of new sequencing technologies and the development of new computational tools for genomic analysis have led to the exponential growth of genome databases, in particular of draft data. These databases, in turn, are used for in silico genomic comparisons, so the correct description of the coding sequences (CDS) being deposited is very important. Accordingly, the scientific communities behind biological databases, especially the International Nucleotide Sequence Databases (INSD), are establishing new standards for genome sequence submissions in the new era of sequencing. Within this context, particular emphasis should be put on the identification and correction of frameshift errors in genome sequences. Here, we describe a novel tool that uses two comparative methods to identify coding sequences with conflicting attributes. The term conflict refers to a sequence discrepancy identified during genome comparisons, such as: i) frameshifts, ii) large insertions or deletions and iii) truncations. The results of the comparisons can be viewed graphically. Access to other online tools for additional analysis is also provided, allowing quick and user-friendly genomic analysis. Using this tool, we analyzed several regions of the genome of Klebsiella pneumoniae KP13. Notably, we identified frameshifts, 3' truncations and displacement of the start codon. This analysis was essential for identifying sequence conflicts that had not been revealed by the automatic annotation methods. Supported by: FAPERJ (Process number E-26/102.214/2009), CNPq (Process number 473707/20101) and CAPES.


Topic: Sequence Analysis PI: SA03

UNCOVERING TAXONOMICALLY RESTRICTED GENES USING THE PROTEIN WORLD DATABASE


Engelke F1 , Catanho M2 , Miranda A B D1
1 Laboratório de Biologia Computacional e Sistemas, Instituto Oswaldo Cruz - FIOCRUZ
2 Laboratório de Genômica Funcional e Bioinformática, Instituto Oswaldo Cruz - FIOCRUZ

Some characteristic genes, known as gene markers, are able to distinguish groups of organisms, providing relevant information about their classification and typing, and are potential sources of epitopes for immunization or for antigen/antibody-based detection methodologies. Each species is known to have its own set of distinctive genes, which gives it its unique characteristics and provides researchers with a basis for classifying these organisms. Regarding pathogenic microorganisms in general and mycobacteria in particular, a number of potential applications of comparative genome analysis have been reported, aimed especially at the prevention (development of more efficient vaccines), diagnosis (development of faster and more accurate methods) and treatment (development of new drugs) of tuberculosis and other mycobacterial diseases. Some of these applications include: identification of unique genes and virulence factors, and metabolism reconstruction; characterization of pathogens and identification of new diagnostic and therapeutic targets; investigation of the molecular basis of pathogenesis and host range, and of differences in phenotypes between clinical isolates and natural populations of pathogens; and investigation of the genetic basis of virulence and drug resistance in tuberculosis-causing bacteria. In this work, we use ProteinWorldDB (http://www.proteinworlddb.org), a relational database that contains the coordinates of pairwise protein alignments and their respective scores, obtained from a stringent comparison (applying the Smith-Waterman algorithm) among all predicted proteins encoded in the completely sequenced genomes of more than 500 organisms (Archaea, Bacteria and Eukarya), to identify and characterize species-, genus- and lineage-specific genes that could possibly be used as markers, based on their patterns of co-occurrence among these organisms. The comparative dataset used in this study was provided by the Genome Comparison Project (http://www.dbbm.fiocruz.br/GenomeComparison), a research project involving Fiocruz and PUC-Rio, sponsored by IBM and the World Community Grid. Supported by: IBM, World Community Grid, Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Programa Estratégico de Apoio à Pesquisa em Saúde (PAPES), Programa AMSUD-PASTEUR, Rede Iberoamericana de Bioinformática (RIB), Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Rede Fiocruz, Plataforma de Bioinformática PDTIS, Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ).
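
Identifying taxonomically restricted genes from all-against-all alignments reduces to asking, for each gene, whether every organism in which it has a significant hit falls within a single taxon at the chosen rank. The sketch below illustrates that test on invented data; the dictionaries stand in for query results and do not reflect the actual ProteinWorldDB schema, and the function name is hypothetical.

```python
def restricted_genes(hits, taxonomy, rank):
    """Return {gene: taxon} for genes whose hits all fall within one taxon at `rank`.

    hits:     gene -> set of organisms with a significant alignment
              (including the organism that encodes the gene)
    taxonomy: organism -> {rank name: taxon name}
    """
    result = {}
    for gene, organisms in hits.items():
        taxa = {taxonomy[org][rank] for org in organisms}
        if len(taxa) == 1:
            result[gene] = next(iter(taxa))
    return result


# Toy data (organisms are real names, hits are invented).
taxonomy = {
    "M. tuberculosis H37Rv": {"genus": "Mycobacterium", "species": "M. tuberculosis"},
    "M. tuberculosis CDC1551": {"genus": "Mycobacterium", "species": "M. tuberculosis"},
    "M. leprae TN": {"genus": "Mycobacterium", "species": "M. leprae"},
    "E. coli K-12": {"genus": "Escherichia", "species": "E. coli"},
}
hits = {
    "geneA": {"M. tuberculosis H37Rv", "M. tuberculosis CDC1551"},
    "geneB": {"M. tuberculosis H37Rv", "M. leprae TN"},
    "geneC": {"M. tuberculosis H37Rv", "E. coli K-12"},
}
print(restricted_genes(hits, taxonomy, "species"))
print(restricted_genes(hits, taxonomy, "genus"))
```

How stringent the alignment score cutoff is determines which hits enter the `hits` sets, and therefore how many genes appear restricted.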


Topic: Sequence Analysis PI: SA04

ANALYSIS OF HOMOLOGY-INDEPENDENT METHODS FOR DETECTION OF GENOMIC ISLAND IN PROKARYOTES


Rocha P K L1 , Alves F I A1 , Inácio S F C1 , Cavalcanti D D1 , Farias S T1 , Rêgo T G1,2
1 Departamento de Biologia Molecular - UNIVERSIDADE FEDERAL DA PARAÍBA
2 Unidade Acadêmica do Desenvolvimento Tecnológico - UNIVERSIDADE FEDERAL DE CAMPINA GRANDE

Genomic islands are regions of the genome that are potentially acquired by horizontal transfer and may offer new features to the organisms that receive them. The methods for detecting genomic islands in prokaryotes can be classified into two groups: homology-dependent and homology-independent methods. The latter group assumes that laterally transferred genes can be identified by features that distinguish them from the rest of the genome, such as guanine (G) and cytosine (C) content, dinucleotide frequency, and codon and amino acid usage. This study aimed to standardize the parameters of seven homology-independent methods based on three authors, [1], [2] and [3], and to make different combinations of methodologies and compare them. To standardize the window parameter, twenty organisms with different characteristics and lifestyles were used, while for the step of combining homology-independent methods, five organisms whose genomic islands have been described in the literature by homology-dependent methods [4] were used. The results were validated by calculating precision, recall, specificity and accuracy against data from the literature. Homology-independent methods divide the genome into windows to perform the analysis, and a poor choice of window can lead to inaccurate results. The window standardization step identified 10 kb as the most appropriate window size. The combinations of homology-independent methods did not reach optimal levels of precision and recall, but the majority showed excellent levels of specificity and accuracy. The methodologies with the best accuracy were dinucleotide [1], with 86.4%, and the combination of GC content [2] and dinucleotide [1], with 86.2%. The methodology with the best recall was codon usage [1], with 52.5%. There are many methods for detecting genomic islands, and standardizing them is necessary to support the analysis of genomic islands.
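
As an illustration of the window-based, homology-independent approach, the sketch below flags non-overlapping windows whose GC content deviates from the genome-wide window mean, and computes the four validation metrics named above from a confusion matrix. It is a generic sketch, not the implementation of methods [1]-[3]: the z-score cutoff, the function names and the toy genome are assumptions, and only the 10 kb default window comes from the abstract.

```python
from statistics import mean, pstdev


def gc_content(seq):
    return sum(seq.count(base) for base in "GC") / len(seq)


def atypical_windows(genome, window=10_000, z_cut=2.0):
    """Start positions of non-overlapping windows with atypical GC content."""
    starts = range(0, len(genome) - window + 1, window)
    gcs = [gc_content(genome[s:s + window]) for s in starts]
    if not gcs:
        return []
    mu, sigma = mean(gcs), pstdev(gcs)
    if sigma == 0:
        return []
    return [s for s, gc in zip(starts, gcs) if abs(gc - mu) / sigma > z_cut]


def metrics(tp, fp, tn, fn):
    """Validation metrics computed per window against literature annotations."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Toy genome: ten 20-bp windows, one of them GC-rich (a mock "island").
toy_genome = ("ATGC" * 5) * 4 + "GGCC" * 5 + ("ATGC" * 5) * 5
print(atypical_windows(toy_genome, window=20))
print(metrics(tp=5, fp=5, tn=85, fn=5))
```

The second call shows how accuracy and specificity can look excellent while precision and recall stay modest when islands cover only a small fraction of the windows, which matches the pattern reported above.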


Topic: Sequence Analysis PI: SA04

RNA-SEQ: THE NEED FOR BIOLOGICAL REPLICATES


Reis O1 , Costa G G L1 , Herai R H1,2 , Carazzolle M F1,3 , Pereira G A G1
1 Laboratório de Genômica e Expressão (LGE) - UNICAMP
2 Bioinformática Aplicada (LBA), Embrapa Informática Agropecuária (CNPTIA) - EMBRAPA
3 Centro Nacional de Processamento de Alto Desempenho (CENAPAD-SP) - UNICAMP

RNA-seq provides a way to analyze entire transcriptomes with deep coverage and base-level resolution, and it is gradually replacing microarrays for gene expression analyses. In the past few years, many statistical methods have been developed to improve the detection of differentially expressed genes from RNA-seq data. However, the replacement of microarrays by RNA-seq has often been accompanied by a decline in the quality of experimental design, as many groups working with RNA-seq are not using biological replicates. Any statistical inference requires replication. For example, if a couple of genes are found to be upregulated in the treatment compared with the control, without replication it is impossible to know whether that is an effect of random variability or an actual effect of the treatment. Furthermore, without biological replicates it is very difficult to estimate the sample variation. The authors of edgeR and DESeq propose a similar method for working without replicates: an approach that infers an upper limit for the sample variance by using treatment and control as if they were biological replicates. If most genes are expected not to be differentially regulated, the calculated sample variance should be close to the real sample variance. In this work, we evaluate the effect of working with and without biological replicates on the list of differentially expressed genes obtained. We show that the use of proper biological replicates is important to achieve good sensitivity in detecting differentially expressed genes, and that the replacement of microarrays by RNA-seq does not justify the use of an inferior experimental design.
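
The effect of treating control and treatment as if they were replicates can be seen with plain sample variances. The sketch below is a deliberately simplified illustration and not the edgeR or DESeq procedure (those packages model counts with a negative binomial distribution and share information across genes); the counts and function names are invented.

```python
from statistics import variance


def within_condition_variance(replicates):
    """Sample variance estimated from true biological replicates of one condition."""
    return variance(replicates)


def pseudo_replicate_variance(control_value, treatment_value):
    """Variance obtained when the two conditions are treated as replicates."""
    return variance([control_value, treatment_value])


# Toy normalized counts for two genes (hypothetical values).
unchanged_gene = {"control": [100, 110], "treatment": [105, 95]}
induced_gene = {"control": [100, 110], "treatment": [400, 420]}

for name, gene in [("unchanged", unchanged_gene), ("induced", induced_gene)]:
    real = within_condition_variance(gene["control"])
    pseudo = pseudo_replicate_variance(gene["control"][0], gene["treatment"][0])
    print(name, "replicate variance:", real, "pseudo-replicate variance:", pseudo)
```

For the unchanged gene the pseudo-replicate estimate is of the same order as the replicate-based one, but for the induced gene it absorbs the treatment effect and becomes a very loose upper bound, which is one way to see why sensitivity drops without replicates.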


Topic: Sequence Analysis PI: SA05

A SIMPLE PROCEDURE FOR ANNOTATION OF UNCHARACTERIZED PROTEINS


Ribeiro H A L1, Ortega J M1
1 Departamento de Bioquímica e Imunologia - ICB, UFMG

Annotation of uncharacterized proteins in a given proteome is an ongoing process, since the outcome is continuously modified by the availability of characterized proteins in the database used for annotation. For instance, of the 4189 amino acid sequences encoded by Mycobacterium tuberculosis strain CDC1551 (txid 83331), 1371 are uncharacterized. We verified their similarity to a total of 10,217,234 full-length proteins present in the most recent download of the UniProt database and found that 809 sequences hit entries that are not uncharacterized; thus 562 remained without any useful annotation hint. We developed a routine to run PSI-BLAST searches and applied it to these 562 uncharacterized proteins. Initially, each sequence was independently formatted as a database, and a PSI-BLAST alignment determined an alignment score that we refer to as the self-score. Later, these sequences were used as queries in PSI-BLAST searches against the UniProt database mentioned above. Since PSI-BLAST operates in consecutive iterations, building a Position-Specific Scoring Matrix (PSSM) at each run and including an increasing number of proteins as hits at each iteration, the PSSM often starts under-scoring the query sequence itself. Our implementation includes a simple procedure to counteract this: iterations are stopped whenever the score of the query falls to a certain percentage of the previously determined self-score. We assessed the results obtained by stopping iterations at 90%, 80%, 70%, 60% or 50% of the self-score. Respectively, a total of 14, 23, 42, 46 and 48 searches returned hits that are not uncharacterized. Among them, 2, 5, 10, 12 and 17 searches, respectively, returned hits with KEGG Orthology (KO) annotation; however, hits with a unique, distinct KO were obtained only for thresholds down to 70% of the self-score. We then verified whether a KO-centered search using PSI-BLAST would reciprocally hit the Mycobacterium protein. For the 10 KO hits obtained with a self-score threshold of 70%, all reciprocal searches annotated the expected target Mycobacterium protein. These data show that PSSMs built with our simple routine using the self-score allow a direct search of the UniProt database that is equivalent to triggering a myriad of KO-centered PSI-BLAST searches. Supported by: FAPEMIG
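The self-score stopping rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' routine: the function name is hypothetical, the per-iteration scores are assumed to be already available as a list (extracting them from PSI-BLAST output is outside the sketch), and the numbers are invented.

```python
def last_usable_iteration(self_score, query_scores, threshold=0.70):
    """Return the index of the last iteration whose query score is still
    at or above `threshold * self_score`, or None if even the first fails.

    `query_scores[i]` is the score that the PSSM from iteration i assigns
    to the query itself (hypothetical input).
    """
    cutoff = threshold * self_score
    last_ok = None
    for i, score in enumerate(query_scores):
        if score < cutoff:
            break  # the PSSM now under-scores the query: stop iterating
        last_ok = i
    return last_ok


# Invented scores for a single query, one value per PSI-BLAST iteration.
scores = [500, 470, 410, 340, 260]
for threshold in (0.9, 0.8, 0.7, 0.6, 0.5):
    print(threshold, last_usable_iteration(500, scores, threshold))
```

Lower thresholds let more iterations run, which matches the trend in the abstract: more searches return informative hits as the threshold is relaxed from 90% to 50%.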


Topic: Sequence Analysis PI: SA03

INFORMATION TECHNOLOGY APPLIED TO BIOENERGY GENOMICS: PROBABILISTIC ANNOTATION USING ARTIFICIAL INTELLIGENCE.
Silva D C D A E1, Waldemarin R C1, Vêncio R Z N1
1 LabPIB - Faculdade de Medicina de Ribeirão Preto - Universidade de São Paulo

One alternative to the depletion of fossil fuels is the use of renewable energy. In Brazil, sugarcane (Saccharum officinarum) has been used for years as an alternative energy source, and Brazil has therefore become a key player in the development of alternative energy. This project aims to develop new Information Technology (Artificial Intelligence) methods and tools to address some of the bioinformatics issues raised in sugarcane genomics research. To achieve this objective we opted for SIFTER (Engelhardt et al., 2006), a powerful method based on Bayesian networks. Our major aim in the present work is to establish a local implementation of the SIFTER methodology for application to bioenergy-related problems. A second goal is to improve the original source code for better performance, potentially allowing it to be used on a genome-wide scale. If this project succeeds, the tools developed could point to new biological functions for several sugarcane genes and may also reveal potential targets for crop improvement. The SIFTER pipeline takes as input protein family IDs, following the PFAM database (http://pfam.janelia.org/) standard. Retrieving information from the PFAM database and gene associations from the GOA database (http://www.ebi.ac.uk/GOA/), SIFTER propagates information, according to the Bayesian model, from annotated families to families of interest. At the end of the process, the pipeline returns, for each protein family ID, a list of annotations together with the probability of each annotation. Specifically, the input is an ASCII file containing the list of protein families to be annotated. The first script uses the relational structure of the families, taken from the Pfam-A.full file (PFAM database), and gene associations from the gene_association.goa_uniprot file (GOA database), and creates a .pli file in XML format. The second script uses the generated .pli file and information from the Pfam-A.fasta file (PFAM database) to generate an intermediate alignment file in .stockholm format, converts it into FASTA format and generates a tree in .nex format with the FastTree software. The third script converts the .nex files into .nhx. At this point in the pipeline, all SIFTER input files are available. Studying the Python scripts that perform the preprocessing steps above, we found room for improvement. Using native Python modules, we stored as files some variables that are time-consuming to calculate and do not depend on the input data. Additionally, we incorporated the use of multiple processes. The performance improvements achieved were 72.5% (1st script, quad-core machine), 63.6% (2nd script, quad-core), 67.7% (1st script, dual-core) and 41.8% (2nd script, dual-core). Supported by: FAPESP/Microsoft Research
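The two optimizations described above, storing input-independent values on disk and using multiple processes, can be sketched with the Python standard library. The abstract does not name the modules used, so `pickle` and `multiprocessing` are assumptions here, as are all function names and the placeholder per-family work.

```python
import os
import pickle
from multiprocessing import Pool


def load_or_compute(cache_path, compute):
    """Return a cached value from `cache_path`, computing and storing it on first use."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    value = compute()
    with open(cache_path, "wb") as handle:
        pickle.dump(value, handle)
    return value


def preprocess_family(family_id):
    """Placeholder for the per-family work (hypothetical; returns a dummy result)."""
    return family_id, len(family_id)


def preprocess_all(family_ids, processes=4):
    """Process the families in parallel worker processes."""
    with Pool(processes=processes) as pool:
        return dict(pool.map(preprocess_family, family_ids))
```

In a script, `preprocess_all` would be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. The cache helps only for values that do not depend on the list of input families, which is the case singled out in the abstract.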


Topic: Sequence Analysis PI: SA03

HIGH-THROUGHPUT DATA ANALYSIS TO FIND TARGETS FOR CRYSTALLOGRAPHY STUDIES IN THE LABORATÓRIO NACIONAL DE BIOCIÊNCIAS - LNBIO
Vidal R O1,2, Costa G L1,2,4, Carazzolle M1,2,3, Murakami M1, Franchini K1, Pereira G A G2
1 Laboratório Nacional de Biociências (LNBio), Associação Brasileira de Tecnologia de Luz Síncrotron - ABTLuS
2 Laboratório de Genômica e Expressão, UNICAMP
3 CENAPAD-SP - Centro Nacional de Processamento de Alto Desempenho em São Paulo, UNICAMP
4 Laboratório Central de Tecnologias de Alto Desempenho (LCTAD), UNICAMP

The LNBio is a multidisciplinary, multiuser, open institution dedicated to research, development and innovation in biotechnology. It currently incorporates a bioinformatics laboratory that studies complex biological systems through high-throughput data analysis, supporting research in genomics, proteomics, protein structure and systems biology. We present here the LNBio platform for analyzing high-throughput gene expression data from next-generation sequencing (RNA-seq) or microarray experiments. The platform allows genes to be analyzed and organized in the search for new molecular targets, in two main steps: (i) processing expression data to find molecular targets and (ii) finding the best way to obtain the structure of the corresponding proteins. In the first step we use in-house Perl scripts and free open-source software such as Bioconductor for microarray data or DEGseq for RNA-seq data. The expression data are processed and analyzed to find differentially expressed genes across many conditions. These genes can be grouped into subclasses of the Gene Ontology database and into protein interaction and metabolic networks. In the second step, the final list of proteins of interest is evaluated against several protein parameters to decide the best way to try to obtain the 3D structure. For this purpose we developed the Protein Crystallizability Classifier (PCC), which indicates, based on many characteristics of the protein primary structure, whether a protein is likely to crystallize, with 80% accuracy. The PCC gives information about secondary structure, domains, disordered regions, globular regions, searches for homologs and PDB entries, and other parameters, using in-house algorithms and public tools. The beta version of PCC is publicly available at http://lge.ibi.unicamp.br/lnbio/protoss.


Topic: Sequence Analysis PI: SA06

WHOLE GENOME CONSERVED SEQUENCE ANALYSIS OF GRAM-VARIABLE MICROORGANISMS


Londero A A1, Castro J D V1, Porto L M1
1 Integrated Technologies Laboratory - InteLab, Chemical and Food Engineering Department - Federal University of Santa Catarina

The Gram staining technique is one of the first steps in identifying a bacterial strain. It is used in the diagnosis of clinical samples to search for potentially dangerous microorganisms that may threaten human health. Although the staining mechanism is not completely understood, it is known to depend heavily on the structure of the bacterial cell wall. Clostridium acetobutylicum is an anaerobic bacterium involved in acetone, butanol and ethanol (ABE) fermentation and in hydrogen production. C. acetobutylicum has usually been reported as a Gram-positive bacterium, but a few publications have reported its Gram variability. Over the course of C. acetobutylicum metabolism, important morphological and compositional changes occur in the bacterial cell wall. Identifying these changes can be of practical interest for determining the fermentation stage and its corresponding physiology. Some other bacteria of interest in environmental health are also described as Gram-variable. We chose nine species reported as Gram-variable and compared their whole-genome sequences with that of Clostridium acetobutylicum ATCC 824, using the wgVISTA tool (genome.lbl.gov/vista) to align the genomes and identify conserved cell wall-related regions. Four regions were identified as potentially involved in the regulation of changes in cell wall structure and in Gram variability. We propose that these findings may guide genetic modification to increase resistance to solvents in the culture medium and consequently increase the production of target metabolites. Supported by: CNPq


Topic: Sequence Analysis PI: SA03

GERMINAL AND SOMATIC MUTATIONS SIGNATURE IN BREAST TUMOR CELL LINE


Cardenas R G C C L1, Pinheiro D G2, Tojal I2, Pedigoni R G2, Sousa R G M A D1,2, Oliveira T Y K1,2, Paula M G D1,2, Silva J W A1,2
1 National Institute of Science and Technology in Stem Cell and Cell Therapy/CNPq/FAPESP
2 Medical School of Ribeirão Preto, Department of Genetics, University of São Paulo

Next-generation sequencing technology is rapidly adding new sequences for analysis while the cost of sequencing a genome continues to fall. This is increasing the amount of available data and demands more powerful tools and more people to accomplish the analysis. In this project we applied a computational approach to analyze next-generation sequencing data and identify mutations in genes associated with breast cancer. The results revealed 3,874 novel non-synonymous mutations in 2,161 genes in a breast tumor cell line. Comparison against 2,496 mutations (in 1,772 genes) previously described as somatic mutations in breast cancer identified 609 mutations in 309 common genes. Of these, 41 mutations (in 40 genes) were at exactly the same positions, and 568 mutations (in 269 genes) were in genes previously noted as targets of somatic mutation in other breast cancer studies. Using the CaMP score, previously defined by Ding et al. (2007), we ranked 45 mutations in 27 genes with a score higher than 1.0. The highest scores were for the genes TP53 (>10), SIX4 (2.46), CUBN (2.45), GRIN2D (2.43) and LRBA (2.13). To distinguish genes likely to contribute to tumorigenesis from those in which passenger mutations occurred by chance, we compared the results with a lymphoblastoid cell line derived from the same patient. In this second analysis we found 4 mutations in 3 genes, LRBA (2.13), VEPH1 (2.09) and COL7A1 (1.46), that were common to the normal and tumor samples. These genes could be good candidates for germline mutations. The present study describes an in silico approach to identifying mutations that can be useful in the search for genes affected by mutation that could be used as targets for pharmacogenomics. Supported by: CNPq/FAPESP
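The comparisons described above amount to set operations over mutation records. The sketch below is an illustration only: the `Mutation` record, its fields and the function name are hypothetical, and the abstract does not state how mutations were encoded or matched. It separates tumor mutations that coincide in position with known somatic mutations, those that only fall in a known gene, and those shared with the matched normal sample (germline candidates).

```python
from typing import NamedTuple


class Mutation(NamedTuple):
    gene: str
    position: int  # coordinate within the gene or genome (hypothetical encoding)
    change: str    # e.g. "A>G"


def compare_mutations(tumor, known_somatic, normal):
    """Classify tumor mutations against known somatic and matched-normal calls."""
    tumor, known_somatic, normal = set(tumor), set(known_somatic), set(normal)
    known_positions = {(m.gene, m.position) for m in known_somatic}
    known_genes = {m.gene for m in known_somatic}

    # Tumor mutations at exactly a previously reported position.
    same_position = {m for m in tumor if (m.gene, m.position) in known_positions}
    # Tumor mutations in a previously reported gene, but at a different position.
    same_gene_only = {m for m in tumor if m.gene in known_genes} - same_position
    # Mutations also present in the normal sample: candidate germline variants.
    germline_candidates = tumor & normal
    return same_position, same_gene_only, germline_candidates
```

Matching here requires an exact gene and position for the first class and an identical record for the germline class; a real pipeline would also need to agree on coordinates and transcript models, which this sketch leaves out.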


Topic: Sequence Analysis PI: SA04

IN SILICO IDENTIFICATION OF CRISPR IN FOUR CORYNEBACTERIUM PSEUDOTUBERCULOSIS STRAINS


Almeida S S D1, Abreu V A C D1, Cassiano A A M1, Soares S C1, D'Afonseca V1, Pinto A C1, Miyoshi A1, Azevedo V1
1 Federal University of Minas Gerais

Caseous lymphadenitis (CL) is a chronic, contagious infectious disease caused by the Gram-positive bacterium Corynebacterium pseudotuberculosis. The illness mainly affects sheep and goats and is responsible for significant losses to agribusiness worldwide, including in Brazil. The genome characterization of four different C. pseudotuberculosis strains (Cp1002, CpC231, CpI19 and CpFRC41), isolated from distinct hosts, has enabled post-genomic and functional genomics studies. In this work, in silico analysis was used to study CRISPR (clustered regularly interspaced short palindromic repeats), an adaptive microbial immune system. CRISPR regions are found in prokaryotes (Archaea and Bacteria) but are absent from eukaryotes and viruses, and consist of 20-40 bp units. CRISPR arrays are typically composed of direct repeats (DR) with intervening unique non-repeat sequences, termed spacers. The arrays, as defined by their repeat sequence, are often flanked by CRISPR-associated (cas) genes, and together they have been shown to form a functional pair capable of providing acquired immunity against phages, plasmids and other invading genetic elements. Knowledge of CRISPR therefore has useful applications, including strain typing based on hypervariability, predicting and modulating phage resistance, and controlling the dissemination of genetic elements. Several software applications are available for identifying various forms of repeats; the tool used for data analysis in this study was the CRISPRs web service. This site does not focus on the genetic environment often associated with CRISPR structures, such as the CAS (CRISPR-associated sequences) genes. The reason is that most CRISPR structures are not associated with CAS genes, so CAS cannot be used as a CRISPR-finding tool. A second important aspect is that the CRISPRs web service aims not to miss even the smallest CRISPR structures or CRISPRs that differ slightly from canonical structures. In silico CRISPR analysis showed no variation in the DR locus among the C. pseudotuberculosis strains. However, comparison with other Corynebacterium species showed striking points of similarity, which may be due to horizontal gene transfer. In this context, these data may be used in industrial applications as a defense mechanism against phage infections and plasmids, either via selection of variants with efficient spacers or via genetic engineering to preclude the transfer of particular genetic elements, such as antibiotic resistance markers or other potentially harmful genes. Supported by: CNPq, Fapemig, CAPES, PPGEN-UFMG
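The repeat-spacer layout of a CRISPR array described above can be made concrete with a small sketch. It is not the algorithm of the CRISPRs web service: it assumes the direct repeat is already known and occurs as exact copies, and the function name and sequences are invented for illustration.

```python
def extract_spacers(array_seq, repeat):
    """Return the sequences lying between consecutive exact copies of `repeat`."""
    starts = []
    pos = array_seq.find(repeat)
    while pos != -1:
        starts.append(pos)
        # Continue searching after the end of the copy just found.
        pos = array_seq.find(repeat, pos + len(repeat))
    return [
        array_seq[a + len(repeat):b]
        for a, b in zip(starts, starts[1:])
    ]


# Invented toy array: repeat, spacer, repeat, spacer, repeat.
repeat = "GTTTCAGA"
array_seq = repeat + "ACGTAC" + repeat + "TTGGCC" + repeat
print(extract_spacers(array_seq, repeat))
```

Comparing the spacer lists of different strains is one simple way to look at the hypervariability used for strain typing, whereas identical repeats across strains correspond to the lack of DR variation reported in the abstract.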


Topic: Sequence Analysis PI: SA03

A SINGULAR VALUE DECOMPOSITION APPROACH FOR IMPROVED TAXONOMY CLASSIFICATION OF BIOLOGICAL SEQUENCES
Santos A R1, Santos M A2, McCulloch J A5, Baumbach J3, Oliveira G C4, Silva A5, Miyoshi A1, Azevedo V1
1 Biochemistry Department, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Av. Antonio Carlos, 6627, Belo Horizonte, Minas Gerais, Brazil
2 Computer Science Department, Instituto de Ciências Exatas, Universidade Federal de Minas Gerais, Av. Antonio Carlos, 6627, Belo Horizonte, Minas Gerais, Brazil
3 Max Planck Institute for Informatics, Campus E2 1, Saarbrücken, Germany
4 CEBio and Laboratory of Cellular and Molecular Parasitology, Instituto René Rachou, Oswaldo Cruz Foundation, Av. Augusto de Lima 1715, Belo Horizonte, MG, 30190-002, Brazil
5 DNA Polymorphism Laboratory, Universidade Federal do Pará, Campus do Guamá - Belém, PA, Brazil

Background: Singular Value Decomposition (SVD) is a powerful technique for information retrieval that uncovers relationships between elements that are not prima facie related. SVD-based retrieval was initially developed for analyzing very large data sets in the complex internet environment with reasonable response times. Since information retrieval from large-scale genome and proteome data sets is a task of similar complexity, SVD-based methods may lead to improved data analysis in this research area as well.

Results: We found that SVD applied to amino acid sequences uncovers hidden relationships and helps in drawing up clusters and cladograms, exhibiting the evolutionary relatedness of species in excellent correlation with the Linnaean taxonomy. Crucial for all SVD-based approaches is the choice of a reasonable number of singular values. Here, we demonstrate that a comparatively small number of singular values is sufficient to represent biologically significant clusters. Subsequently, we suggest a method to determine the lowest number of singular values and the fewest clusters that still guarantee biological significance at the level of Linnaean taxonomic classification.

Conclusions: By using this strategy, we reduce the uncertainty regarding the appropriate rank value necessary to perform accurate information retrieval analyses. As an example, we create clusters that perfectly match the taxonomy of Linnaeus.
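The role of the number of retained singular values can be made concrete with the standard rank-k truncation. The notation is assumed for illustration and is not taken from the abstract: A is an m x n feature matrix whose columns encode the sequences (the abstract does not specify the encoding), r is its rank, and k is the number of singular values kept.

```latex
A = U \Sigma V^{\mathsf{T}} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\mathsf{T}},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\mathsf{T}},
\qquad
\lVert A - A_k \rVert_F^2 = \sum_{i=k+1}^{r} \sigma_i^2

\rho(k) = \frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{r} \sigma_i^2}
```

The retained fraction \rho(k) is one generic heuristic for choosing k. The abstract's own criterion, biological significance at the level of the Linnaean classification, is not described in enough detail to reproduce here.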


Topic: Sequence Analysis PI: SA07

IMPROVEMENT IN THE PREDICTION OF THE TRANSLATION INITIATION SITE THROUGH BALANCING METHODS, INCLUSION OF ACQUIRED KNOWLEDGE AND ADDITION OF FEATURES TO SEQUENCES OF MRNA
Silva L M1, Nobre C N2, Zárate L E1
1 Pontifícia Universidade Católica de Minas Gerais
2 Universidade Federal de São João del-Rei

Accurate prediction of the translation initiation site in mRNA sequences is an important task in genome annotation. However, obtaining an accurate prediction is not always simple. The task can be modelled as a classification problem between positive sequences (protein-coding) and negative sequences (non-coding). The problem is highly imbalanced because each mRNA molecule has a single translation initiation site and many other sites that are not initiators. This study therefore focuses on balancing methods that address the problem effectively and efficiently. We propose M-Clus, an undersampling balancing method based on clustering, together with a new methodology that adds features to the sequences and improves the performance of the classifier by including knowledge obtained by the model. With this methodology, the performance measures used (accuracy, sensitivity, specificity and adjusted accuracy) are all greater than 93%, and precision increases significantly, from 43.05% to 82.05%, when the knowledge obtained by the model is included. To address the problem, it is necessary to invest in class balancing techniques in addition to a judicious methodology that visibly improves the results. Using the M-Clus balancing method significantly increases sensitivity, from 51.39% to 91.55%. Including certain features during training, for example the presence of ATG in the region upstream of the translation initiation site, improves sensitivity by approximately 7%. Precision also increases by 39 percentage points when the knowledge acquired by the classifier is included in the new training set.
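The performance measures named above can be computed from a confusion matrix, which also shows why precision suffers under class imbalance. The sketch below is illustrative only: the function name and counts are invented, and "adjusted accuracy" is assumed to mean the average of sensitivity and specificity, which the abstract does not define.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common binary-classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": tp / (tp + fp),
        # Assumption: adjusted accuracy taken as the mean of sensitivity and specificity.
        "adjusted_accuracy": (sensitivity + specificity) / 2,
    }


# Invented counts: 100 true initiation sites against 10,000 non-initiator sites.
for name, value in classification_metrics(tp=90, fp=1000, tn=9000, fn=10).items():
    print(f"{name}: {value:.4f}")
```

With these toy counts, accuracy, sensitivity and specificity are all 0.90, yet precision is below 0.10 because false positives from the large negative class outnumber the true positives. This is the kind of gap that balancing methods and the added knowledge step are meant to close.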


Authors index
Ávila R A M D, 115; Abdelnoor R V, 120; Abrahão J, 21; Abreu J T, 179; Abreu V A C, 133; Abreu V A C D, 31, 32, 200; Acúrcio F D A, 122; Acencio M L, 81; Afonso M, 95; Aguiar E R G R, 21, 184; Alcalde R, 27; Alencar N A N D, 36, 40, 42, 43, 51; Ali A, 31, 32, 133; Almeida J S, 80; Almeida L G P D, 5; Almeida R B D, 9; Almeida R M C D, 25, 166; Almeida S S, 133; Almeida S S D, 31, 32, 200; Almeida V M G, 37; Alvarez J C, 85; Alvarez-Valin F, 6; Alves C, 49; Alves C N, 36, 40, 42, 43, 51, 60; Alves E C D O, 124; Alves F I A, 116, 193; Alves N R, 53; Alves W A L, 88; Amar P, 132; Amaral L R, 86; Ambrosio A B, 68; Andrade A C, 68; Andrade E S, 79; Andrade L G D, 16, 18, 103, 123; Antunes D A, 46; Aparicio R, 41; Araújo F M G, 21; Araujo F, 13; Araujo F M G, 4; Araujo U, 130; Araya-Secchi R R, 62; Arbex W, 123; Arbex W A, 13, 16, 18, 103; Arruda-Neto J D T, 150; Atia S E, 132; Azevedo V, 31, 32, 49, 133, 200, 201; Bagnariolli B, 168; Balbino V Q, 126; Baliga N, 71; Baptista R D P, 10; Barbosa E, 31, 32; Barbosa F, 140; Barbosa-Silva A, 142; Barboza N, 70; Barboza N R, 73; Barioni M C, 102; Barrera J, 157, 162; Barreto A D M S, 135, 180; Bartholomeu D C, 115, 174, 178; Batista M V A, 126; Batista T M, 11; Baumbach J, 201; Benetti F P D C D, 166; Bevitori R, 185; Binneck E, 120; Bisch P M, 47; Bitar M, 12, 47, 189; Bittencourt D, 67; Bleicher L, 92, 95; Boaventura M A C, 127; Borges C C H, 170; Borges W, 70; Borio C, 26; Botelho C, 106; Bovolenta L A, 81; Braga M, 115; Brammer S P, 185; Braz A S K, 14; Brentani H, 72; Brentani H P, 186; Brito C F A D, 74, 121; Brito R C F D, 93; Brugiolo A, 170; Brunnet L G, 166; Bueno L L, 115; Bugs C A, 164; Buriol L S, 54; Caetano A R, 16; Caffarena E R, 55; Camargo A A, 176; Campiteli M G, 161; Campos J D O, 153; Campos R K, 21; Campos S V A, 106, 108, 146, 156; Cancherini D V, 72


Cantão M E, 135, 191; Cantao M E, 5; Carazolle M F, 85; Carazzolle M, 197; Carazzolle M F, 22, 68, 110, 119, 120, 158, 194; Cardenas R G C C L, 199; Carmo M C L, 36, 40; Carneiro A A, 106; Carneiro N P, 185; Carvalho L J C B, 185; Carvalho M O D, 24; Carvalho M R S, 13; Carvalho P V S D D, 35; Carvalho R R D, 169; Carvalho S, 70, 73; Carvalho V, 82, 98; Casseb J, 27; Cassiano A A M, 200; Castriota M, 149; Castro I M, 64; Castro J D V, 198; Castro J V, 168; Castro M A A, 25; Catanho M, 6, 192; Cavalcanti D D, 116, 193; Cavalcanti G D D C, 52; Cerqueira F R, 75; Chanock S, 17; Chevitarese J, 17; Chiari E, 10; Chies J A B, 46; Coelho-Jr O, 58; Coelho-Junior O, 188; Cohen E M L, 44, 90; Cohen M, 44; Coimbra R S, 4, 13, 21, 101, 108, 136, 184; Colombo C A C, 68; Cordeiro G C B, 52; Corgozinho P M, 112; Corrêa B R S, 65; Corrêa-Oliveira R, 93; Costa G G L, 22, 68, 110, 120, 194; Costa G G L D, 119; Costa G L, 197; Costa K M, 51; Costa L D F, 161; Costa M, 49; Costa M G S, 47; Costa M M C, 185; Costa M O C, 83; Costa P R, 149

Costa R L, 180; Couto A P, 171; Couto B R G M, 97, 127; Crepalde M A, 156; Cristo G S P, 174; Cruz L M, 144; Custódio F L, 34, 45; Cysne M B, 108; D'Ávila D A, 10; D'Afonseca V, 31, 32, 133, 200; Dalmolin R J S, 25, 166; Dardenne L E, 34, 45, 104; de L D M, 64; Declerck G, 145; Degrave W, 6; Dias L L C, 64; Dias M C, 179; Dias S R, 91, 99, 100, 130; Dias-Lopes C, 132; Domingues B F, 105; Donnard E R, 142; Dorella F A, 31, 32; Dorn M, 54; Drummond M G, 47; Drumond B, 4, 13; Drumond B P, 21; Duarte A, 27; Durham A M, 29, 50, 125, 131, 183, 189; Dutra M B, 140; Edelman E R, 76; Elias H C B, 124, 138; Engelke F, 192; Espindola F S, 86; Fülber C C, 46; Falcao P R K, 110; Faria C D J, 31; Faria C J, 32; Faria L C B, 114; Faria-Campos A, 111, 156; Faria-Campos A C, 106, 108, 146; Farias S T, 193; Farias S T D, 116; Felicori L, 132; Felipe J C, 118, 139, 157; Fernández-Becerra C, 65; Fernandes G D R, 87, 128; Fernandes G R, 142; Fernandes-Rausch H, 106, 108


Ferre F, 122; Ferreira C E, 162; Ferreira J M S, 21; Ferreira T A E, 126; Fietto J L R, 87; Figueiredo H C P, 4; Folador E L, 144; Folgueras-Flatschart A V, 106; Fonseca L A C, 61; Fonseca M M B D, 55; Franchini K, 197; Franchini K G, 110; Franco G R, 11, 47, 64, 179, 189; Freitas J S M, 91; Freitas L, 115; Freitas L A, 177; Fujiwara R T, 115, 174, 178; Galante P A F, 84, 117, 176; Galvão L M D C, 10; Garate J A, 62; Garcia J C O, 66; Garratt R C, 92, 100; Gatts C E N, 177; Gazzinelli R T, 178; GCArnoldi F, 77; Gehlen M A C, 19, 109; Giachetto P F, 154; Gilman R, 17; Gilman R H, 23; Giusta M D S, 178; Godinho C P D S, 89; Goliatt P V Z C, 104; Gomes E A, 185; Gomes L, 126; Gomes R C A, 126; Gonçalves M A, 146; Goni S, 26; Gontijo E D, 10; Gordo S M C, 51; Gordo S M D C, 57; Gouvêa C M C P, 35; Gouveia N M D, 86; Grativol A D, 177; Gruber A, 29, 96; Grynberg P, 64, 189; Guedes E, 13, 16, 18, 103, 123; Guedes R L M, 142, 188; Guimarães A C R, 6; Guimarães C T, 15, 185; Guimarães M F M, 13, 18; Guizelini D, 19, 109

Gusmão R F S, 124; Hanke L A, 108; Hardwick R J, 23; Hashimoto R F, 158, 163; Herai R H, 22, 85, 110, 119, 194; Hess S, 76; Higashi S, 135; Hollox E J, 23; Holmes D S, 62; Ibelli A M G, 154; Inácio S F C, 116, 193; Iserte J, 26; Ishivatari L H U, 39; Júnior O R, 119; Jr R M C, 161; Jr R P, 114; Jr W M, 61; Junior O R, 22; Junior W A D S, 118; Kaandorp J, 148; Kashiwabara A Y, 183; Kido A, 120; Klein C C, 83; Kleinschmidt J H, 114; Koide T, 71; Kroll J E, 84; Kroon E G, 21; Kulcheski F R, 120; Lângaro N C, 185; Laat D M D, 64; Lamb L D C, 54; Lameira J, 49; Lana U G D P, 15; Lana U G P, 185; Lana-Peixoto M A, 108; Leal D A, 73; Leclerqc S, 98; Lemke N, 81, 92, 149; Lima A H, 42; Lima J, 27; Lima L D A, 160; Lima N C B, 5, 83; Lima T L D, 126; Lobo F P, 11, 20, 47, 115, 174, 178; Lobosco M, 172; Londero A A, 198; Lopes B C, 13; Lopes F M, 160


Lopes J C D, 37, 56, 86, 105, 113, 138; Lopes K P, 99; Lopes L R, 27; Loreto E L D S, 24; Lourdes R D A, 174; Lozano M, 26; Macedo A, 10; Macedo A A, 137, 140, 141, 143; Machado C R, 10; Machado K S, 44, 90; Machado M, 112; Machado M A, 13, 18; Machado-Lima A, 131, 186; Maciel T E F, 87; Maciel W D, 146; Madeira H M F, 144; Magalhães A A C, 31, 32; Magalhães J V D, 15, 185; Magalhães W, 17; Magalhães W C S, 112; Magalhaes W C S, 111; Maluceli A, 144; Marçal L N, 167; Marcelino F C, 120; Marchaukoski J N, 19, 109; Marcolino L S, 127; Margis R, 120; Marques J T, 11; Martínez A, 150; Martins M L, 165, 167; Martins-José A, 56; Martins-Jr D C, 160; Matos F M S B D, 122; Mccouch S, 145; McCulloch J A, 133, 201; Medeiros I G D, 60; Medeiros S M O, 126; Meira J W, 37; Meira-Jr W, 59; MeiraJr W, 122; Melo A, 108; Melo H V, 28; Melo J C B D, 52; Melo M P, 139; Melo-Minardi R C D, 59; Mendes F H K, 8; Mendes R L, 167; Mendes T A D O, 115, 178; Meyer D, 8; Meyer L, 120; Minard R C M, 37

Minardi R C D M, 61; Miranda A B D, 6, 192; Miranda G H B, 157; Miranda R R C D, 115; Miyoshi A, 31, 32, 133, 200, 201; Miyoshi N S B, 118; Molfetta F A D, 60; Molina F, 132; Mombach J C M, 152, 164; Mondego J M C, 22, 68; Monteiro M, 53; Monteiro P R S, 40; Monteiro-Vitorello C B, 88; Moore R, 31, 32; Moraes G, 49; Moraes R, 129; Moreira E C D O, 51, 57; Moreira J C F, 25, 166; Moriel A R, 143; Motta P C, 67; Mourão M D M, 179; Mudado M A, 98; Mudado M D A, 82; Muniz M N M, 16, 18, 103, 123; Murakami M, 197; Murta-Junior L O, 143; Nagem R A P, 100; Narciso M G, 145; Nascimento A S, 48; Nascimento J P, 40, 60; Nascimento L C, 22, 119, 120; Nascimento S B, 43, 51, 57; Navarro F C P, 84, 176; Nepomuceno A L, 120; Neto A D M, 74, 121; Neto J X, 38; Neto R P D M, 187; Netto D S, 83; Nicolás M F, 83; Nicolas M F, 5, 191; Nobre C N, 202; Noda R W, 15, 106, 185; NorbertodeSouza O, 44, 90; Noronha M F, 158; Novaes G M, 170; Nunes M C D S, 175; Ohara D T, 84; Old L J, 117; Oliveira A B N, 53; Oliveira A L D, 96


Oliveira D R, 88; Oliveira E L M D, 37; Oliveira F S, 101; Oliveira F S D, 136; Oliveira G, 129; Oliveira G C, 4, 13, 21, 101, 107, 108, 136, 184, 201; Oliveira I L, 153; Oliveira I L D, 159; Oliveira L F D, 19; Oliveira L L D, 39, 190; Oliveira L S, 125; Oliveira M A A, 101; Oliveira M M, 93; Oliveira P S L D, 38; Oliveira R P, 155; Oliveira S H P D, 38; Oliveira T Y K, 181, 199; Olmo R P, 73; Omote D D Q, 35; Onuchic V F, 131; Ortega J M, 28, 30, 58, 87, 128, 142, 188, 195; Otto T D, 6; Pacheco L G D C, 31, 32; Pais F S, 101, 107; Paiva L R, 165; Paiva P, 27; Paiva S R, 16; Palu C C, 80; Parizzi L P, 68, 119; Parmigiani R B, 176; Paschoal A R, 50, 88, 125, 189; Passos L, 70; Paula D S D, 141; Paula I T B R, 155; Paula M G D, 199; Pedigoni R G, 199; Pedrosa F D O, 19, 109; Peixoto M G C D, 13; Pereira C A D B, 72; Pereira F C, 168; Pereira G A G, 22, 68, 120, 158, 194, 197; Pereira G G A, 85, 110, 119; Pereira L F P, 68; Pereira R, 70; Pereira R V, 73; Pereira U P, 4; Pereira V S, 155; Peres S, 132; Perez-Acle T, 62; Pessotti H C, 143

Pierce R, 47; Pigozzo A B, 172; Pinheiro D G, 118, 181, 199; Pinheiro M L P, 122; Pinheiro S D S, 43, 51; Pinto A C, 31, 32, 133, 200; Pinto M C X, 122; Piovezani A R, 186; Pires D E V, 37, 59; Pollettini J T, 137; Pontes A, 187; Portillo H A D, 65; Porto L M, 76, 168, 198; Postma M, 148; Pot D, 68; Prosdocimi F, 67; Purcino A A C, 185; Quadros L C D, 28; Queiroz K, 70; Queiroz K B D, 73; Rêgo T G, 193; Raittz R T, 109; Ramalho R F, 8; Ramos F C, 60; Ramos R T J, 133; Rangel L T L D, 29; Razante H L, 14; Rech E, 67; Regitano L C A, 154; Rego T G D, 116; Reis A B, 93; Reis M A D, 41; Reis M S, 162; Reis O, 194; Renata G S, 73; Resende A C M D, 159; Resende D D M, 93; Resende J S D, 179; Rezende A M, 74, 121; Rialle S, 132; Ribeiro C, 188; Ribeiro H A L, 188, 195; Ribeiro R D S, 74; Ribeiro R S, 121; Rigo M M, 46; Rocha A S L, 114; Rocha C F, 108; Rocha G K, 45; Rocha J A P D, 60; Rocha M S, 167


Rocha P A F, 169; Rocha P K L, 116, 193; Rodrigues F, 120; Rodrigues M, 17, 111; Rodrigues M R, 112; Rodrigues R F, 77; Rodrigues T D S, 174; Rodrigues T E, 150; Rodrigues-Luiz G, 115; Rodrigues-Luiz G F, 174, 178; Rosembach T V, 167; Rosse I C, 13; Ruiz D D, 90; Ruiz J C, 93; Rybarczyk-Filho J L, 25, 166; Saji G D R Q, 191; Saji G R Q, 83; Salgado L R, 182; Salvanha D M, 79; Samora A D, 88; Sampaio M I C, 57; Santoro M M, 37, 53, 59, 97; Santos A M, 42; Santos A M D, 36; Santos A R, 133, 201; Santos A R D, 31, 32; Santos C E F, 179; Santos E C D, 113; Santos E H D, 154; Santos F M D, 105; Santos K B, 34; Santos M A, 97, 127, 201; Santos M A D, 59, 113, 122, 124, 187; Santos M A O, 126; Santos M L D, 41; Santos R W D, 169, 170, 172; Saraiva A M, 41; Schühli G S E, 9; Scherer N D M, 12; Schroeder E K, 90; Scott L P B, 14, 102; Secundino A A, 171; Shimo H K, 190; Silva A, 31, 32, 49, 201; Silva A C E, 89; Silva A L, 133; Silva A P C D, 159; Silva C L, 77; Silva D, 111; Silva D C D A E, 196; Silva F G D, 99

Silva F L B D, 39; Silva F R D, 67, 185; Silva G G Z, 126; Silva H S, 167; Silva I T D, 181; Silva J L, 36, 40, 42, 43, 51, 57; Silva J W A, 199; Silva L M, 202; Silva M C F D, 23; Silva M M R D, 151; Silva M V B D, 123; Silva M V G B, 13; Silva M V G B D, 16, 18, 103; Silva N D F, 40, 43, 51, 60; Silva P C A D, 167; Silva R R, 7; Silva-Filho M C, 114; Silva-Jr W A D, 181; Silveira C H, 37; Silveira C H D, 59; Silveira N J F, 35; Simões A L, 79; Simões C C, 15; Simões M C, 107; Simões Z L P, 50; Simao E M, 152; Simoes S N, 163; Sinigaglia M, 46; Soares E G, 143; Soares S C, 133, 200; Soares S D C, 31, 32; Soares-Souza G B, 17; Sobreira T J P, 38, 96; Sousa P R M D, 36, 43, 51; Sousa R F D, 87; Sousa R G M A D, 181, 199; Souza A P D, 41; Souza E M D, 19; Souza J E D, 117; Souza J E S D, 8; Souza L H T, 25; Souza R A D, 20; Souza R C, 83; Souza R O, 104; Souza S J D, 8, 84, 117, 176; Stelle D, 102; Stephan B, 26; Stussi F, 142; Tagliatti R F, 16, 18, 103, 123; Talim L E C, 108; Tamulonis C, 148


Tarazona-Santos E, 17, 23, 111, 112; Teixeira P J, 22; Thierry A, 132; Tibes J H, 19; Tiburcio R A, 22; Tieppo E, 19, 109; Tinós R, 39, 190; Tinoco C F D S, 15; Tojal I, 199; Tokuda E K, 68; Toledo M A F D, 96; Vêncio R, 71; Vêncio R Z N, 7, 65, 66, 79, 196; Valadares H M S, 10; Vasconcelos A T R, 55, 80, 83; Vasconcelos A T R D, 135; Velloso H, 30; Veloso C J M, 151; Veloso W N P, 53; Vencio R Z N, 182; Verneque R D S, 18; Verneque R S, 13; Vespero E C, 5; Vidal R O, 68, 110, 119, 197; Vieira G F, 46; Vieira L G E, 68; Virgili N S, 117; Volpini A C, 101; Waldemarin R C, 196; Wanner E F, 175; Weber G, 89, 175; Weber R, 171; Wright M H, 145; Xavier L P, 53; Yamagishi M E B, 110, 154; Yeager M, 17; Zárate L E, 202; Zaha A, 55; Zamudio R, 23; Zanotto P M D A, 96; Zerlotini A, 4, 13, 21, 107, 129; Zuccherato L W, 23
