2016 Projects


Research Abstracts

Lead REU investigators are shown in bold.

Comparative analysis of genomic structure and function among ecologically divergent turtle species

Bailey Lester, Christina Mulch, Julie Chow, and Aliya Dumas

In this research, the sequenced genomes of four ecologically divergent turtles are compared through whole-genome analysis. Apalone spinifera, Pelodiscus sinensis, and Chelonia mydas are each compared to Chrysemys picta in separate pairwise whole-genome alignments using a local alignment search tool (Kiełbasa, Wan, Sato, Horton, & Frith, 2011). Immunology-related genes from C. picta were selected from NCBI, and each gene was extracted from the alignments, along with the corresponding sequence from each of the three other species, using grep on the command line.

CDPOnline: Complementary Domain Prioritization Made Simple

Hassam Solano-Morel and Julie Chow

Advancements in sequencing technology and methods have allowed investigators to gather an increasing amount of information from all genomics domains. While this has made data collection easier, it has also created a bottleneck in deriving meaning from the data. Reliable filtering methods can narrow the search for significant findings within this large amount of data. Complementary Domain Prioritization (CDP) is a sophisticated filtering method, implemented as an R package, that leverages protein enrichment information to prioritize transcriptomic data. To simplify the use of the CDP package, a user-friendly web tool (CDPOnline) was created to give investigators without programming experience access to the package's filtering capabilities.

Functional Clustering and Biochemical Pathway Analysis of RNA-seq gene ontology in Loggerhead Sea Turtle (Caretta caretta)

Aliya Dumas, Bailey Lester, Christina Mulch, and Julie Chow

Analysis of RNA-sequencing data, specifically of differential gene expression, allows the prediction of molecular responses to treatment or tissue type. This analysis can be useful in understanding the genotype-phenotype relationship in non-model organisms such as the loggerhead sea turtle (Caretta caretta). Loggerheads exhibit unique phenotypic characteristics, such as temperature-dependent sex determination and the ability to travel thousands of miles between their feeding grounds and original nesting beach. Understanding the molecular basis for these characteristics can have major implications for the study of evolutionary patterns in turtles and for human medicine. Differential gene expression allows a better understanding of treatment effects and tissue-type effects at the molecular level. By analyzing tissue samples from the brain and gonad of loggerhead hatchlings under two temperature conditions, we were able to predict the functions of the differentially expressed genes (DEGs) identified due to treatment in each tissue. Further enrichment analysis of these data predicted which processes, functions, or components were most or least affected by our treatment. Gaining a deeper understanding of the biochemical pathways affected by our treatment will require network analysis in the future.
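The abstract does not say which statistical test the enrichment analysis used. As a minimal sketch, assuming the common hypergeometric (one-sided Fisher) over-representation test, the p-value for one gene ontology term could be computed as follows. The function name and all gene counts are hypothetical and are not taken from the study.

```python
from math import comb

def enrichment_p_value(population, annotated, sample, overlap):
    """Probability of seeing at least `overlap` annotated genes in the sample.

    population: number of background genes (assumed value, not from the study)
    annotated:  background genes annotated to the term
    sample:     number of differentially expressed genes (DEGs)
    overlap:    DEGs annotated to the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical example: 10,000 background genes, 200 annotated to a term,
# 150 DEGs, 12 of which carry the annotation (about 3 expected by chance).
p = enrichment_p_value(10_000, 200, 150, 12)
print(f"enrichment p-value: {p:.2e}")
```

In practice the p-values for all tested terms would also need a multiple-testing correction, which this sketch leaves out.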

Hidden Markov Models for Repetitive Element Detection

Geena Glenn and Sonia Kopel

Repetitive elements are an essential part of the genomes of all organisms, helping to shape genomes as well as playing a role in gene function [1]. These elements are capable of transposition through either a “copy and paste” or a “cut and paste” mechanism, which allows them to reinsert themselves into the genome at a new location [5]. Often the host will put the element to a different use in its new location, a process known as exaptation. A large number of genes in the human genome have been identified as exapted from mobile elements, and this number is increasing as more research is done [2]. However, the ability to identify repeats is hindered by the tools available today. Many current tools use homology-based repeat identification that is heavily mammalian-biased. This becomes a serious problem when taxonomically under-represented clades are studied, because very few of their repetitive elements can be identified with homology-based methods. To overcome this obstacle, hidden Markov models (HMMs) can be used for de novo repeat identification. HMMs are probabilistic models that capture position-specific conservation information to improve detection of remote homologs without becoming too computationally expensive [3,4]. A profile HMM can be created from a multiple sequence alignment of a repetitive family found in the genome. By concatenating the profile HMMs created for each of the repetitive families identified, we can create a species-specific database of profile HMMs. Using this profile HMM database greatly increases our ability to identify repetitive elements in the genome compared with homology-based identification.
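To illustrate what "position-specific conservation information" means, the sketch below estimates per-column emission probabilities from a small multiple sequence alignment and scores sequences against them. It is a deliberately simplified stand-in, not a full profile HMM: it models match states only, with no insert or delete states and no transition probabilities. The toy alignment, the pseudocount, and the uniform background are all assumptions.

```python
from collections import Counter
from math import log2

ALPHABET = "ACGT"

def match_emissions(alignment, pseudocount=1.0):
    """Per-column emission probabilities; gap characters are ignored."""
    length = len(alignment[0])
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append(
            {base: (counts[base] + pseudocount) / total for base in ALPHABET}
        )
    return profile

def log_odds_score(profile, sequence, background=0.25):
    """Sum of per-position log-odds against a uniform background.

    Assumes the sequence is ungapped and as long as the profile.
    """
    return sum(
        log2(profile[i][base] / background) for i, base in enumerate(sequence)
    )

# Toy alignment of one hypothetical repeat family.
alignment = ["ACGTA", "ACGTT", "ACCTA", "A-GTA"]
profile = match_emissions(alignment)

family_like = log_odds_score(profile, "ACGTA")
unrelated = log_odds_score(profile, "TTTTT")
print(family_like > unrelated)
```

A sequence resembling the family scores higher than an unrelated one because conserved columns contribute large positive log-odds, which is the property that lets profile-based searches find diverged copies.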

Sea turtle population genomic discovery I: Global and locus-specific signatures of selection and adaptive potential

Julie Chow, Bailey Lester, Aliya Dumas, and Christina Mulch

Molecular markers such as Single Nucleotide Polymorphisms (SNPs) inform the study of evolutionary processes, gene-disease association, and population structure, enable the identification of functionally important genomic regions, and reveal signatures of locus-specific natural selection and adaptive potential. Knowledge of adaptive potential contributes to the improvement of biodiversity conservation practices and is particularly valuable for vulnerable populations, such as the loggerhead and Kemp’s ridley sea turtles, which are respectively classified as Endangered and Critically Endangered (WWF 2016). In this study, we develop a SNP calling pipeline for Genotyping by Sequencing (GBS) Illumina paired-end sequence reads of 48 loggerhead and Kemp’s ridley sea turtles. We use the program STACKS (Catchen et al. 2011, Catchen et al. 2013) to form loci, call SNPs, and generate various population genomics statistics. To measure linkage disequilibrium, we use the programs Genepop (Raymond and Rousset 1995, Rousset 2008) and GPLINK (Purcell et al. 2007) with Haploview (Barrett et al. 2004) to calculate and visualize nonrandom association of identified loci. To link signatures of selection to genomic regions, we functionally annotate SNPs using SnpEff (Cingolani et al. 2012) and ANNOVAR (Wang et al. 2010) with the Chrysemys picta bellii annotated genome as a reference (Shaffer et al. 2013). Approximately 7.4 million loci and 1.15 million SNPs were identified. The percentage of locus comparisons in linkage disequilibrium varied from 0.13% to 7.12%. Functional annotation classified all SNPs as intergenic. Loci in linkage disequilibrium may indicate regions of natural selection and adaptive evolution.
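Linkage disequilibrium is the nonrandom association of alleles at two loci. As a rough illustration of the quantity being measured, the sketch below computes the standard r² statistic for a pair of biallelic loci. It assumes phased haplotypes coded 0/1 and does not reproduce how Genepop, GPLINK, or Haploview compute their statistics; the function name and example data are hypothetical.

```python
def r_squared(haplotypes):
    """r^2 between two biallelic loci.

    haplotypes: list of (allele_at_locus_a, allele_at_locus_b) pairs,
    each allele coded 0 or 1 (assumed phased).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium, D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom

# Alleles always travel together: complete association.
linked = r_squared([(0, 0), (1, 1), (0, 0), (1, 1)])
# Every allele combination equally common: no association.
unlinked = r_squared([(0, 0), (0, 1), (1, 0), (1, 1)])
print(linked, unlinked)
```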

SNP-based Genome Structure Discovery of Loggerhead Turtles (Caretta caretta)

Christina Mulch, Julie Chow, Bailey Lester, and Aliya Dumas

Samples from 48 loggerhead turtles (Caretta caretta) from the South Carolina Atlantic coast and Florida coast populations were sequenced via RAD-seq, producing FASTQ files for each individual. Given these data, we sought to address uncertainties and clarify the current understanding of loggerheads in the southeastern United States and to predict trends in population structure and adaptive evolution. The area traditionally hosts one of the largest breeding populations of this threatened species (Baldwin et al., 2003), which makes conservation efforts and management practices in the area increasingly crucial to the success of the species. Genome-wide single nucleotide polymorphisms (SNPs) and emerging high-performance computational tools have yet to be used to explore the loggerhead genome thoroughly. We propose an analysis of population structure using the software tools STACKS, STRUCTURE, and ADMIXTURE to infer subpopulations and F-statistics, in an attempt both to infer genomic structure and to analyze incongruence between the results of each tool.
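For readers unfamiliar with F-statistics, the sketch below computes the fixation index F_ST for a single biallelic locus using the textbook definition F_ST = (H_T - H_S) / H_T. It assumes equally sized subpopulations and known allele frequencies, and it is not the estimator implemented in STACKS, STRUCTURE, or ADMIXTURE; the function name and frequencies are hypothetical.

```python
def fst(subpop_freqs):
    """F_ST for one biallelic locus.

    subpop_freqs: frequency of one allele in each subpopulation
    (equal subpopulation sizes are assumed).
    """
    k = len(subpop_freqs)
    p_bar = sum(subpop_freqs) / k
    # Expected heterozygosity of the pooled population.
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        raise ValueError("locus is monomorphic across all subpopulations")
    # Mean expected heterozygosity within subpopulations.
    h_s = sum(2 * p * (1 - p) for p in subpop_freqs) / k
    return (h_t - h_s) / h_t

print(fst([0.5, 0.5]))  # identical frequencies: no differentiation
print(fst([0.2, 0.8]))  # moderate differentiation
print(fst([0.0, 1.0]))  # alternate alleles fixed: complete differentiation
```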

Towards Integrating Alternative Splicing and Expression: Univariate vs. Machine Learning Multivariate Approaches to RNA-seq Analysis

Amanda Tursi and Brianna Richardson

Lung cancer is extremely prevalent and deadly, accounting for the highest number of cancer deaths in the world. Despite advances in treatment, the response rate remains low and the relapse rate high. An in-depth analysis of differentially expressed genes and isoforms between eradicated lung cancer cells and relapse cells offers insight into important biomarkers that play a role in cancer eradication or persistence. Through a combination of traditional univariate and machine learning multivariate approaches, differential expression can be visualized and cross-examined between methods in order to find significant genes and isoforms. These results reiterate some previous research on lung cancer, while also offering deeper insight into the role alternative splicing plays in the disease. Furthermore, the bioinformatics and computational statistical methods employed in this study offer a more universal approach that can be used for a variety of RNA-Seq data analyses.

Big Data Genomics: A Population Predictive Analysis and Online Tool for Cluster Computing

Makenzie R. Whitener, Leonardo De Melo Joao, Luca Carvalho De Oliveira

The rapid rise of available genomic data has been accompanied by a rise in computational tools and programs to analyze these data. These tools often require a large amount of computing power that many labs do not have access to. Many also require more than a fundamental understanding of computer science. NAME, a web application, was created to facilitate access to a large computing cluster through an easy-to-use web interface. NAME is built on the R (R Core Team 2015) packages R Shiny (Chang 2016) and SparkR (Apache 2015); integrating the two allows interaction with an Apache Spark (Apache 2016) cluster. NAME takes a user file, usually in the VCF format, harnesses the functionality of ADAM to run conversions, and provides access to a deep learning class from which a predictive analysis can be run. The user is guided through the creation of a command line, which is sent to a cluster and then run on an instance of a Spark context. Popstrat (Ferguson 2015) was replicated from previous studies (Big Data Genomics 2015). NAME has begun to simplify the task of installing, configuring, and maintaining a computing cluster while still allowing almost full functionality of all programs wrapped within the application.

Fastfood Elastic Net

Sonia Kopel and Geena Glenn

As the complexity of a prediction problem grows, simple linear approaches tend to fail, which has led to the development of algorithms that make complicated, nonlinear problems solvable both quickly and inexpensively. Fastfood, one such algorithm, has been shown to generate reliable models, but in its current state it does not offer feature selection, which is useful in solving a wide array of complex real-world problems ranging from cancer prediction to financial analysis.

The aim of this research is to extend Fastfood with variable importance by integrating it with the Elastic net. The Elastic net offers feature selection, but is only capable of producing linear models. We show that by combining the two, it is possible to retain both the feature selection offered by the Elastic net and the nonlinearity produced by Fastfood. Models constructed with the Fastfood-enhanced Elastic net are relatively quick and inexpensive to compute and are also quite powerful in their ability to make accurate predictions.
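The abstract does not describe how the two methods were integrated. The sketch below shows one simple way to chain them, under stated assumptions: a single Fastfood block (the structured product S H G Π H B followed by a cosine nonlinearity with random phases) maps inputs to nonlinear features, and an elastic net is then fitted on those features by coordinate descent without an intercept. All function names, parameter values, and the toy data are hypothetical, and the input dimension must be a power of two.

```python
import math
import random

def fwht(vec):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out

def make_fastfood_block(d, sigma, rng):
    """Draw the random parts of one d x d Fastfood block."""
    signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]   # diagonal B
    perm = list(range(d))                                  # permutation Pi
    rng.shuffle(perm)
    gauss = [rng.gauss(0.0, 1.0) for _ in range(d)]        # diagonal G
    g_norm = math.sqrt(sum(g * g for g in gauss))
    # Diagonal S: chi-distributed row lengths, normalized by the norm of G.
    scale = [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))) / g_norm
        for _ in range(d)
    ]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(d)]
    return signs, perm, gauss, scale, phases, sigma

def fastfood_features(x, block):
    """Approximate Gaussian-kernel features using two Hadamard transforms."""
    signs, perm, gauss, scale, phases, sigma = block
    d = len(x)
    v = fwht([s * xi for s, xi in zip(signs, x)])      # H B x
    v = [v[j] for j in perm]                           # Pi H B x
    v = fwht([g * vi for g, vi in zip(gauss, v)])      # H G Pi H B x
    norm = 1.0 / (sigma * math.sqrt(d))
    return [
        math.sqrt(2.0 / d) * math.cos(norm * s * vi + b)
        for s, vi, b in zip(scale, v, phases)
    ]

def soft_threshold(value, threshold):
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def elastic_net(X, y, lam, alpha, sweeps=200):
    """Coordinate descent for (1/2n)||y - Xw||^2 + lam*(alpha*|w|_1 + (1-alpha)/2*|w|_2^2)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)
    for _ in range(sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            denom = sum(c * c for c in col) / n + lam * (1.0 - alpha)
            if denom == 0.0:
                continue  # all-zero column with a pure L1 penalty
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, lam * alpha) / denom
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - c * delta for r, c in zip(residual, col)]
                w[j] = new_w
    return w

# Toy data: a nonlinear target of the first input coordinate.
rng = random.Random(0)
block = make_fastfood_block(d=8, sigma=1.0, rng=rng)
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(40)]
targets = [math.sin(3.0 * x[0]) for x in inputs]
features = [fastfood_features(x, block) for x in inputs]
weights = elastic_net(features, targets, lam=0.01, alpha=0.5)
print(sum(1 for w in weights if w != 0.0), "of", len(weights), "features kept")
```

Note that in this naive chaining the L1 penalty selects among the random nonlinear features rather than among the original input variables, so recovering variable importance for the inputs would need an additional step that the sketch does not attempt.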