Exploring Genetic Diversity and Bioinformatic Strategies for Complex Data in the Genomic Revolution

University of Alabama Libraries

Over the past twenty-five years, we have gone from completing the first eukaryotic genome assembly to the new goal of sequencing and assembling genomes representing all known taxa on Earth. Genomic data has improved research across fields of biology, from evolution and genetics to medicine and conservation. As genomic technologies rapidly improve and the amount of genomic data explodes, the need for complex genomic analysis tools has given rise to the field of bioinformatics. Strategies are needed to address sequence read quality, sequence alignment, estimation of genomic variation from sequence data, de novo genome assembly, post-assembly quality assessment and quantification, and genomic comparisons across species.
Here we explore complexities in genomic analyses and how large-scale genomic data can help us better understand the genomic variation that gives rise to the diversity of life. We show that pooled-sequencing data can be used to assess recovery from inbreeding depression in Caenorhabditis remanei and the genomic changes that take place during inbreeding. We find that C. remanei is unable to recover from inbreeding even after 300 generations of recovery at large population size. Despite 23 generations of inbreeding, large portions of the genome remained heterozygous, suggesting that pseudo-overdominance may be preventing the purging of deleterious mutations. Our results indicate that recovery from inbreeding in the presence of high genetic load is unlikely. One improvement in genomics is the ability to use long-read sequencing to address complex genomic architecture. Using long-read, third-generation sequencing (TGS) data generated with Oxford Nanopore Technologies (ONT), we assembled genomes of Caenorhabditis elegans mutants used in studies of human disease phenotypes and identified novel insertions in them. While genome assembly has rapidly improved, identifying contamination post-assembly has remained challenging. We show that supervised machine-learning methods based on ensembles of decision trees can quickly and accurately identify contamination in genome assemblies. As genomic resources continue to improve and databases grow, developing new tools and analysis methods that can harness this large-scale genomic data will deepen our understanding of organisms across the tree of life.

Electronic Thesis or Dissertation