Usman Adeyemi Lamidi: Motif discovery in DNA sequences using an improved Gibbs sampling algorithm
Host genetic polymorphisms associated with malaria resistance in HIV-infected children: a retrospective study from Sub-Saharan Africa
Gerald Mboowa, Ivan Sserwadda, Moses Joloba
Background: Studies estimate that malaria kills a child every 30 seconds, amounting to about 3000 deaths per day. In total, 9 million deaths are reported to occur every year, with the burden falling mainly on children under 5 years of age and over 90% of cases occurring in sub-Saharan Africa. In Uganda, malaria is one of the leading causes of morbidity and mortality in HIV-infected children. Recent heritability studies indicate that approximately 25% of the risk of severe malaria is determined by human host genetic factors. Many studies have reported host genetic polymorphisms that confer resistance to Plasmodium infection, for example ovalocytosis and G6PD and pyruvate kinase deficiencies. Malaria resistance alleles/variants may therefore affect HIV/AIDS progression.
Significance: We seek to describe and gain a deeper understanding of the host genetic polymorphisms conferring resistance to malaria in HIV-infected children in sub-Saharan Africa.
Hypothesis: Long-term HIV/AIDS non-progressors have, over the years, accumulated malaria resistance-conferring polymorphisms that lower their malaria-associated morbidity and mortality relative to rapid HIV/AIDS progressors.
Objectives: To describe the host genetic exomic polymorphisms that confer malaria resistance in HIV-infected children in sub-Saharan Africa.
Scientific aims of the proposed project:
1. To compare and describe the distribution of malaria resistance conferring polymorphisms between the long-term HIV/AIDS non-progressor and rapid progressor pediatric individuals in sub-Saharan Africa.
2. To identify, using CAfGEN Electronic Medical Records (EMR), factors that enhance the malaria resistance associated with these polymorphisms.
3. To identify novel polymorphisms in the known host malaria resistance conferring loci/genes that may play a role in the rate of HIV/AIDS progression.
Methodology: This project will be nested within the CAfGEN study. CAfGEN is an H3Africa Consortium group studying host genetic factors important in pediatric HIV/AIDS and TB progression in sub-Saharan Africa. Using bioinformatics tools, we will interrogate the CAfGEN exome dataset of both long-term non-progressors and rapid progressors for known host malaria resistance-conferring polymorphisms, describe their distributions between the two pediatric populations, and assess the factors that may allow these polymorphisms to persist within the groups.
Furthermore, we will probe the exome sequences for other candidate host genetic polymorphisms that could have an effect on HIV/AIDS progression.
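As an illustration of how the distribution of one known resistance allele could be compared between the two groups, the sketch below computes allele frequencies and a two-sided Fisher's exact test on a 2x2 table of allele counts. The abstract does not name a statistical test, so the choice of Fisher's exact test, the function names and the counts are all assumptions made for illustration only.

```python
from math import comb


def allele_frequency(variant_alleles, reference_alleles):
    """Fraction of sampled alleles that carry the variant."""
    total = variant_alleles + reference_alleles
    return variant_alleles / total if total else 0.0


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups (e.g. non-progressors, rapid progressors);
    columns are variant and reference allele counts.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x variant alleles in the first group.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum over all tables with the same margins that are no more likely
    # than the observed one (small tolerance for floating-point ties).
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Hypothetical allele counts, NOT data from the study.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(allele_frequency(8, 2), allele_frequency(1, 5), round(p_value, 4))
```

An exact test is used here because allele counts for rare variants in a nested cohort can be small; with larger counts a chi-square test would be a reasonable alternative.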
In Silico Identification of Protein-Coding and Non-Coding Regions in Next-Generation Technology Transcriptome Sequence Data: A Machine Learning Approach
Olaitan Awe, Angela Makolo, Segun Fatumo
The development of high-throughput transcriptome sequencing techniques has led to the identification of numerous multispecies transcriptome sequences. Whole transcriptome sequencing therefore promises rapid discovery of novel transcripts and genes. Ribonucleic acid (RNA) molecules fall into different classes and contribute to a number of biological processes, such as cell cycle regulation, gene expression regulation, dosage compensation, X chromosome inactivation in female placental mammals, imprinting, translational control, and the establishment of cell identity during embryonic development; long non-coding RNA molecules have also been linked to cancers. Although the biological mechanism of action of the vast majority of non-coding RNA molecules is unknown, scientists want to find out whether some of them actually encode short functional peptides and function as messenger RNAs. With the rapid increase in the volume of sequence data and multispecies transcripts generated by next-generation sequencing technologies, designing algorithms that process these data efficiently and yield biological insight is a growing challenge: there is no known effective method to discriminate between non-coding and protein-coding regions in human transcriptomes, because RNA molecules show similar features to each other. The few existing techniques mostly rely on intensive computation or multi-threading, on small and large datasets alike, to achieve a small gain in performance at the cost of a long execution time. There is an increasing need for efficient algorithms for analyzing and working with these molecules. One approach to this problem is to develop machine learning algorithms for the accurate detection and characterization of non-coding RNA patterns in transcriptome sequence datasets.
These algorithms are then improved over time as more biological properties are discovered through biochemical and molecular experiments. To solve this problem, we developed a fast, accurate and robust alignment-free predictor, based on multiple feature groups and using logistic regression, for the discrimination of protein-coding regions in multispecies transcriptome sequence data. Its predictive performance is influenced by the Open Reading Frame (ORF)-related and ORF-unrelated features used in the model rather than by the training datasets, which gives relatively high performance and computational speed on small and large datasets of full-length and partial-length protein-coding and non-coding transcripts derived from transcriptome sequencing. We chose logistic regression because our problem is a binary classification decision. We also chose predictor variables that can be calculated directly from the sequence data without first performing sequence alignments; this reduces the computational time required to run our algorithm compared with alignment-dependent variables, which take much longer to compute. We used statistical measures of performance to evaluate our model. We describe a series of experiments on human RNA-Seq datasets of full-length and partial-length transcripts, with the goal of performing better overall than competing techniques. Our tool identified coding and non-coding regions in the human RNA-Seq dataset with 97% accuracy, 97% F1-score, 97% sensitivity and 97% specificity, and thus generalized better than competing techniques in many cases. We expect this new approach to lower the computational cost of analyzing transcripts, thereby contributing to the annotation of genomes and making transcriptome analysis easier.
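To make the idea of alignment-free features concrete, the following minimal sketch computes two ORF-related features (longest ORF length and ORF coverage) and one ORF-unrelated feature (GC content) directly from a transcript, and derives the four reported performance measures from predicted labels. The abstract does not list the exact features or their definitions, so these three features, the forward-strand-only ORF scan and all function names are assumptions for illustration, not the authors' implementation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq):
    """Length (nt, stop codon included) of the longest ATG..stop ORF
    found in the three forward reading frames."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def sequence_features(seq):
    """Alignment-free features computed from the sequence alone."""
    seq = seq.upper().replace("U", "T")
    orf_len = longest_orf_length(seq)
    n = len(seq)
    return {
        "orf_length": orf_len,                                      # ORF-related
        "orf_coverage": orf_len / n if n else 0.0,                  # ORF-related
        "gc_content": sum(seq.count(b) for b in "GC") / n if n else 0.0,  # ORF-unrelated
    }


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 for labels 1 (coding) / 0 (non-coding)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


print(sequence_features("CCATGAAATTTTAGGG"))
```

Feature dictionaries like these could then be passed to any logistic regression implementation; because each feature is a single pass over the sequence, the cost grows linearly with transcript length, which is the motivation the abstract gives for avoiding alignment-dependent variables.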
Comparative Structural Analysis of 3-D Predicted Matrix Protein of Influenza A H1N2/Ibadan/2014 and H5N1/Ogbomoso/2014 in Nigeria
Oladipo Elijah Kolawole, Oloke Julius Kola
Introduction: Genome variation is very high in influenza A viruses due to antigenic shift and drift. Comparative structural analysis of protein structures can, however, provide functional insight into antigenicity, pathogenicity and virulence.
Methods: Nucleotide sequences obtained from influenza A/H1N2/Ibadan and A/H5N1/Ogbomoso were translated to their corresponding peptide sequences using EMBOSS Transeq. A homology model of the 3D structure of the matrix protein of each virus was constructed using CPHmodels, validated using PROCHECK and viewed with the PyMOL Molecular Graphics System. Extensive structural comparison was performed on their domains and sub-regions to investigate domain-specific variations.
Results: The predicted 3D protein models show the residues in the most favoured regions for the two influenza isolates. Comparison of the predicted 3D matrix protein structures of A/H1N2/Ibadan/2014 and A/H5N1/Ogbomoso/2014 shows the same positions of chains and segments but different elements, atoms and numbers of residues.
Conclusion: Evidence from this study suggests that extensive structural comparison can help in understanding the biological characteristics of these viruses. In particular, the observed variations can inform drug and vaccine development. The predicted 3D protein structures also helped to extract important information about the genes and protein structure.
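The translation step can be illustrated with a short sketch that converts a nucleotide sequence to a peptide using the standard genetic code, which is the kind of operation EMBOSS Transeq performs. This is a simplified stand-in rather than Transeq itself: the function name, the frame numbering, the use of "X" for unrecognised codons and the dropping of a trailing partial codon are assumptions made here.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame=1):
    """Translate a nucleotide sequence in forward frame 1, 2 or 3.

    Stop codons are written as '*', codons containing ambiguous bases
    as 'X', and a trailing partial codon is ignored.
    """
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    seq = seq.upper().replace("U", "T")
    peptide = []
    for i in range(frame - 1, len(seq) - 2, 3):
        peptide.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(peptide)


print(translate("ATGGCCTAA"))
```

The resulting peptide string is what would be submitted to a homology modelling server in the next step of the workflow.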
Whole genome assembly and functional significance of genetic variants of Mycobacterium tuberculosis isolated from Kampala, Uganda
Marion Amujal, Daudi Jjingo
Background: Tuberculosis exerts a tremendous burden on global health, with 9 million new infections and 2 million deaths annually. The Mycobacterium tuberculosis complex (MTC) was initially regarded as a highly homogeneous population; however, recent data suggest that the causative agents of tuberculosis are more genetically and functionally diverse than previously appreciated. This genetic diversity may render some species of the MTC intrinsically resistant to one or more antibiotics and may affect the spectrum and consequences of the resistance mutations selected for during treatment. Moreover, neutral or silent changes within genes responsible for drug resistance can cause false-positive results in hybridization-based assays, which have recently been introduced to replace slower phenotypic methods. Hershberg et al. (2008), using sequence data from 89 genes in 108 MTC strains, observed that 58% of the non-synonymous mutations fell in positions that were highly conserved in other mycobacteria, suggesting that most of these mutations in the MTC might have functional consequences. In addition, Stucki et al. (2016), using a global collection of tuberculosis isolates, reported that Mycobacterium tuberculosis genetic diversity comprises both globally distributed and geographically restricted sublineages. Whole genome sequencing can therefore address a broad range of topics, from the transmission and fitness of clinical strains to how Mtb evolves over long and short time scales. This study will combine mapping and de novo assembly of whole genome paired-end reads to find missing or novel genes and resolve complex repetitive regions, and will then determine the functional significance of the genetic variants of Mycobacterium tuberculosis isolated from Kampala, Uganda.
Objectives:
• To accurately assemble the genome of Mycobacterium tuberculosis
• To determine the functional significance of novel genetic variants in the genome of Mycobacterium tuberculosis
• To determine the association between the predominant variants and the genotypes
Significance: There is an urgent need for better treatments and vaccines, which in turn require a deeper understanding of the biology of Mycobacterium tuberculosis. Knowledge of the genomic variability among Mycobacterium tuberculosis isolates could provide such biological insights, given the increasing evidence that strain genetics may play a role in disease outcome, transmission, variation in vaccine efficacy and the emergence of drug resistance.
Research design and methods for achieving the stated goals: Sequencing of 102 Mycobacterium tuberculosis isolates will be done at Makerere University on a MiSeq platform. The FASTQ files generated from the samples will undergo quality control to ascertain the quality of the reads. Downstream analysis will include genome assembly, variant calling and annotation. Whole genome sequence data will be deposited at the European Nucleotide Archive under a specific accession number.
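The read quality check can be sketched as follows: parse four-line FASTQ records, convert each quality string to Phred scores, and keep reads whose mean quality passes a threshold. The abstract does not specify the QC tool or cutoff, so the function names and the threshold of 20 are hypothetical; the Phred+33 offset is the encoding used by current Illumina instruments.

```python
def read_fastq(lines):
    """Yield (read_id, sequence, quality_string) from FASTQ lines.

    Assumes the simple four-lines-per-record layout (no wrapped sequences).
    """
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        sequence = next(it).strip()
        next(it)  # '+' separator line
        quality = next(it).strip()
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 by default)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(records, min_mean_quality=20):
    """Keep records whose mean Phred score is at least the threshold.

    The default of 20 is an illustrative cutoff, not one taken from the study.
    """
    return [rec for rec in records if mean_phred(rec[2]) >= min_mean_quality]


# Two made-up reads: one high quality ('I' = Q40), one low quality ('!' = Q0).
example = ["@read1", "ACGT", "+", "IIII", "@read2", "ACGT", "+", "!!!!"]
kept = filter_reads(read_fastq(example))
print([read_id for read_id, _, _ in kept])
```

In practice the same function would be fed an open file handle for each of the paired-end FASTQ files, and both mates of a pair would be dropped together if either fails.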
Motif discovery in DNA sequences using an improved Gibbs sampling algorithm
Angela Uche Makolo, Usman Adeyemi Lamidi
Motifs are repeated patterns of short sequences, usually between 6 and 20 bases long. Within deoxyribonucleic acid (DNA) sequences, these motifs constitute the conserved regions that serve as the most common signatures for recognizing protein domains relevant to evolution, function and interaction. Gibbs sampling is a Markov Chain Monte Carlo (MCMC) algorithm that has been applied in the past to discover motifs in DNA sequences. A problem with this technique is the large number of iterative operations in the sampling process, because it progressively chooses new candidate motif positions through continuous randomized sampling of the DNA sequences. We applied an Improved Gibbs (IGibbs) sampling algorithm to human breast cancer DNA sequences to overcome this unwieldy iteration, altering the process to reduce runtime while still producing a satisfactorily accurate motif result. The IGibbs algorithm takes a .gbk or .fasta DNA file as input and creates a list of all nucleotides from which it predicts a random sampling start position. It applies the motif length and a smaller iteration count, and then computes the probability and position ranking scores using a Position Weight Matrix (PWM). The algorithm was implemented using Python, Python(x,y) and Biopython. IGibbs was evaluated using motif lengths of 12, 18 and 24 on sequences of 5,000, 10,000 and 15,000 bases. IGibbs returned better average runtimes of 7, 10 and 23 seconds respectively, compared with 12, 32 and 60 seconds respectively for the existing Gibbs sampling algorithm found at http://ccmbweb.ccv.brown.edu/gibbs/gibbs.html.
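For readers unfamiliar with the baseline, the sketch below shows a textbook Gibbs sampler for motif discovery that scores candidate positions with a Position Weight Matrix. It is not the IGibbs algorithm, whose specific modifications are not detailed in the abstract; the function names, the pseudocount, the iteration count and the toy sequences are assumptions for illustration. It expects at least two sequences containing only A, C, G and T.

```python
import random

BASES = "ACGT"


def build_pwm(motifs, pseudocount=1.0):
    """Position Weight Matrix: one dict of base probabilities per motif column."""
    pwm = []
    for col in range(len(motifs[0])):
        counts = {b: pseudocount for b in BASES}
        for motif in motifs:
            counts[motif[col]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm


def kmer_probability(kmer, pwm):
    """Probability of a k-mer under the PWM (product of column probabilities)."""
    p = 1.0
    for col, base in enumerate(kmer):
        p *= pwm[col][base]
    return p


def gibbs_motif_search(seqs, k, iterations=200, seed=0):
    """Return (start positions, motifs) after a fixed number of Gibbs updates."""
    rng = random.Random(seed)
    # Start from a random motif position in every sequence.
    positions = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(iterations):
        i = rng.randrange(len(seqs))  # sequence whose position is resampled
        others = [s[p:p + k]
                  for j, (s, p) in enumerate(zip(seqs, positions)) if j != i]
        pwm = build_pwm(others)
        # Weight every candidate start in sequence i by its PWM probability,
        # then sample the new start position in proportion to those weights.
        weights = [kmer_probability(seqs[i][p:p + k], pwm)
                   for p in range(len(seqs[i]) - k + 1)]
        positions[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions, [s[p:p + k] for s, p in zip(seqs, positions)]


# Toy sequences that share the made-up motif ACGTAC.
toy = ["TTACGTACGTTT", "GGGACGTACGGG", "CCACGTACGCCA"]
print(gibbs_motif_search(toy, k=6))
```

Each update rebuilds the PWM and rescores every candidate position in one sequence, so runtime grows with both the iteration count and the sequence length; this is the cost that an approach with fewer iterations aims to reduce.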