2015 Projects


Research Abstracts

Lead REU investigators are shown in bold.

Digital transcriptomics of temperature- and brain-specific differential gene expression in the Loggerhead Sea Turtle, Caretta caretta

Katherine King, Connie Truong, Francisca Donkor

The era of genomics has produced a flood of valuable new DNA sequence information, yet organismal diversity and the genotype-phenotype relationship are not adequately explained by the relatively small number of genes shared across the major branches of the tree of life. Turtles present a particularly interesting challenge for genome biology and for our hypothesis-driven transcriptomics goals. Despite their popularity, their common ancestry has remained contentious because of conflicts among morphological, molecular, and fossil evidence. Their gross anatomical development is unique in vertebrate evolution, and their observed genome sizes and chromosomal architectures vary widely among terrestrial, freshwater, and marine forms. Many turtle species are now threatened by local habitat loss and by excessive predation and mortality due to human activities, and are legally protected. Despite this, we know virtually nothing about the genetic basis of many of their unique adaptations, especially in species that have evolved fully pelagic marine life histories and astonishing natal-beach homing capabilities, exemplified by the loggerhead turtle, Caretta caretta.

This project focused primarily on the following goals:

  • Trimming and Quality Control Assessment of Reads
  • Transcriptome Assembly of C. caretta Brain cDNA Reads
  • Transcriptome Assembly Annotation
  • Differential expression analysis
  • Cluster analysis of expression data
  • Application of WGCNA to RNA-seq co-expression analysis of turtle gonad and brain data (see the sketch after this list)
  • Pathway analysis of co-expression modules
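
The co-expression goal lends itself to a compact illustration. Below is a minimal Python sketch of the soft-thresholding and module-detection idea at the heart of WGCNA; WGCNA itself is an R package, and the toy data, the power beta, and the module count here are illustrative assumptions, not values from the actual turtle analysis.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    # Toy expression matrix: 200 genes x 12 samples (illustrative only).
    rng = np.random.default_rng(0)
    expr = rng.normal(size=(200, 12))

    beta = 6                          # soft-thresholding power
    cor = np.corrcoef(expr)           # gene-gene correlation matrix
    adj = np.abs(cor) ** beta         # weighted adjacency
    dist = 1.0 - adj                  # co-expression dissimilarity
    np.fill_diagonal(dist, 0.0)

    # Average-linkage clustering of the dissimilarity yields candidate modules.
    tree = linkage(squareform(dist, checks=False), method="average")
    modules = fcluster(tree, t=4, criterion="maxclust")
    print(np.bincount(modules))       # genes per module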

At the end of the summer the group presented their results at a Synthesis Seminar on the expression analysis, assembly, and cluster analysis. The results will be written up in a manuscript and submitted for publication.

ccKOPLS: Confounder-Correcting Kernel-based Orthogonal Projections to Latent Structures

David Moore, Kellan Fluette, and Heather Milne

Building accurate predictive models for biological data sets from next-generation high-throughput data sources is essential to bioinformatics. However, confounding variables such as sex, age, and habitat can skew such models, leading to biased and inaccurate results. Because of a lack of machine learning algorithms suited to high-dimensional data sets with small sample sizes, David extended an existing confounder-correcting (cc) algorithm (ccSVM) to allow Kernel Orthogonal Projections to Latent Structures (KOPLS) to explicitly account for confounding factors.
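
To make the confounding problem concrete, the sketch below shows the simplest form of explicit confounder correction: projecting the confounder subspace out of the features before fitting a kernel model. This is a generic residualization baseline on invented toy data, not the ccKOPLS or ccSVM algorithm itself.

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    rng = np.random.default_rng(1)
    n, p = 60, 500                         # small-sample, high-dimensional
    C = rng.normal(size=(n, 2))            # confounders, e.g. age and sex
    X = rng.normal(size=(n, p)) + C @ rng.normal(size=(2, p))
    y = X[:, 0] + 0.5 * C[:, 0] + rng.normal(scale=0.1, size=n)

    # Residualize: multiply by the annihilator of the confounder column
    # space, so no linear trace of C remains in the features the kernel sees.
    P = np.eye(n) - C @ np.linalg.pinv(C)
    X_adj = P @ X

    model = KernelRidge(kernel="rbf", alpha=1.0).fit(X_adj, y)
    print(round(model.score(X_adj, y), 3))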

This project focused on the following goals:

  • Develop in silico pipeline and datasets for comparing confounding correcting algorithms
  • Compare confounder-correcting KOPLS to confounder-correcting SVM
  • Develop manuscript and R package based on results

At the end of the summer David presented his results at a Synthesis Seminar on the construction and application of ccKOPLS. The results were written up in a manuscript that was accepted by the 2015 IEEE International Conference on Bioinformatics and Biomedicine.

GitHub: https://github.com/Anderson-Lab/CCPredict

Effects of HERVK and HIV1 Sequence Variation on Viral Control using GWAS and Bayesian Network Analysis

Alex Brown and Kellan Fluette

Recent improvements in sequencing technology ("next-gen" sequencing platforms) have sharply reduced the cost of sequencing. The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, providing a comprehensive resource on human genetic variation.

The goal of the 1000 Genomes Project is to find most genetic variants with frequencies of at least 1% in the populations studied. This goal can be attained by sequencing many individuals lightly. To sequence a person's genome, many copies of the DNA are broken into short pieces and each piece is sequenced. Because many copies are sequenced, the pieces are more or less randomly distributed across the genome. The pieces are then aligned to the reference sequence and joined together. Determining the complete genomic sequence of one person with current sequencing platforms requires sequencing that person's DNA the equivalent of about 28 times (called 28X). If the average depth is only one read per position (1X), much of the sequence will be missed, because some genomic locations will be covered by several pieces while others will have none. The deeper the sequencing coverage, the more of the genome will be covered at least once. Furthermore, because people are diploid, deeper coverage also makes it more likely that both chromosomes at a location are captured. Deeper coverage is particularly useful for detecting structural variants, and it allows sequencing errors to be corrected.
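
The coverage intuition in this paragraph follows from the idealized Lander-Waterman (Poisson) model: at mean depth c, the probability that a given base is sequenced at least once is 1 - e^(-c). A quick calculation makes the 1X-versus-28X contrast explicit:

    import math

    # Fraction of bases covered at least once under the Poisson model.
    for c in (1, 4, 28):
        print(f"{c}X: {1 - math.exp(-c):.4%} of bases hit at least once")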

This project focused on the following:

  • Literature review to identify relevant papers that make coherent use of the 1000 Genomes Project data
  • Establishing a testable hypothesis, with a rationale and justification for the specific target data to be drawn from the global database for building and analysis
  • Generation of relevant preliminary data and results
  • Proposal of a follow-on study that builds on prior work using preliminary data and results

At the end of the summer Alex presented his preliminary data and results at a Synthesis Seminar. These results will be written up in a manuscript and submitted for publication.

Complementary Feature Selection from Gene Expression and Alternative Splicing Events for Phenotype Prediction

Charlie Labuzzetta, Margaret Antonio, and Chris Snyder

The analysis of differential gene expression data has been the primary method for expression-based biomarker discovery. Recent research has shown that alternative splicing events (ASE) are critical for regulating biological phenotypes and that they are a complementary source of data for predictive modeling. RNA-Seq has become a key technology in transcriptome studies because it can simultaneously quantify overall expression levels and the degree of alternative splicing for each gene. Functional enrichment analysis is critical for interpreting high-throughput transcriptome profiling data, yet existing functional analysis methods account only for differential expression, leaving differential splicing out altogether.
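
As a concrete illustration of ASE quantification, the fraction below, known as percent spliced-in (PSI), is the standard per-event summary that splicing pipelines estimate from read counts; the counts in the example are made up.

    # Percent spliced-in (PSI) for one exon-skipping event.
    def psi(inclusion_reads: int, exclusion_reads: int) -> float:
        """Fraction of transcripts that include the alternative exon."""
        total = inclusion_reads + exclusion_reads
        return inclusion_reads / total if total else float("nan")

    print(psi(inclusion_reads=45, exclusion_reads=15))   # 0.75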

This project focused on the following goals:

  • Identify datasets (one is the lung cancer dataset from Dennis Watson at MUSC; find at least one more in a public repository) and develop a framework for evaluating multiple AS quantification methods
  • Evaluate competing methods for alternative splicing (AS) quantification (e.g., the Tuxedo suite vs. DiffSplice vs. EBSeq)
  • Extend Gene Set Enrichment Analysis (SeqGSEA) using new fractional alternative splicing and additional novel alternative splicing representations, showing how alternative methods for detecting alternative splicing events and modules affect gene set enrichment

At the end of the summer Charlie presented his results at a Synthesis Seminar on the novel improvements and evaluation of alternative splicing quantification methods for gene set enrichment. The results will be written up in a manuscript and submitted for publication.

De novo Transcriptome Assembly and Characterization of Polyketide Synthase Domains in the Toxic Dinoflagellate Gambierdiscus polynesiensis

Heather Milne and David Moore

The era of genomics has produced a flood of valuable new DNA sequence information, yet organismal diversity and the genotype-phenotype relationship are not adequately explained by the relatively small number of genes shared across the major branches of the tree of life. Despite much investment in gene annotation and genome assembly, we do not yet know where our transcriptome begins and ends, or to what extent non-coding DNA controls the translational apparatus in different cell types and in different species with complex traits of evolutionary, ecological, and/or medical interest. Presently the genomic view of the world is still very flat. To make it round, we need to interrogate transcriptome sequence assemblies based on mRNA transcripts of tissue-specific samples from non-model species and relate them to model species for which complete genomic information has been assembled and mapped.

The focus of the project was on the following:

  • Trimming and Quality Control Assessment of Reads (see the sketch after this list)
  • Transcriptome Assembly of Gambierdiscus Reads
  • Transcriptome Assembly Annotation
  • Differential expression analysis
  • Cluster analysis of expression data
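
The trimming step (first goal above) can be illustrated with a toy version of the sliding-window rule used by common trimmers such as Trimmomatic: scan the read from the 5' end and cut where the windowed mean Phred quality drops below a threshold. The window size and threshold below are illustrative defaults, not the project's actual settings.

    # Toy sliding-window quality trimming (cf. Trimmomatic's SLIDINGWINDOW).
    def trim_read(seq: str, quals: list[int],
                  window: int = 4, min_q: float = 20.0) -> str:
        """Cut the read at the first window whose mean Phred score < min_q."""
        for i in range(len(seq) - window + 1):
            if sum(quals[i:i + window]) / window < min_q:
                return seq[:i]
        return seq

    print(trim_read("ACGTACGTACGT",
                    [38, 38, 37, 36, 35, 30, 12, 10, 9, 8, 7, 6]))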

At the end of the summer Heather presented her results at a Synthesis Seminar on the expression analysis, assembly, and cluster analysis. The results will be written up in a manuscript and submitted for publication.

Gene Prioritization using Reactome Pathway Analysis

Christopher Snyder, Margaret Antonio, Charlie Labuzzetta

Individually, the fields of proteomics, genomics, and transcriptomics continue to receive significant research attention as their utility as novel discovery platforms increases; however, there is growing interest in algorithms and tools that leverage two or more of these heterogeneous data streams. This is for two main reasons: (i) it is becoming reasonable, both experimentally and in terms of cost, to run two methods simultaneously (e.g., both proteomics and transcriptomics), and (ii) it is believed that combining data sources will give rise to a deeper understanding of the system being interrogated. Thus, integrating the data in a non-trivial manner supports improved systems biology approaches to high-throughput biological analysis.

This project focused on the following goals:

  • Extend the GeneListPrioritization methodology to new datasets and new domains, including but not limited to incorporating the Reactome pathway database and in-package enrichment analysis (via the hypergeometric test; see the sketch after this list).
  • Use isoform-isoform interactions and/or isoform network modules to prioritize differential alternative splicing and multivariate models.
  • Generate preliminary results and synthesis on isoform-isoform driven prioritization for end of the program seminar, paper, and for future journal publication.
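
The in-package enrichment analysis mentioned in the first goal reduces to a hypergeometric tail test: given N background genes, K in a pathway, and n prioritized genes of which k fall in the pathway, how surprising is the overlap? The counts below are invented for illustration.

    from scipy.stats import hypergeom

    N, K, n, k = 20000, 150, 300, 12         # background, pathway, list, overlap
    p_value = hypergeom.sf(k - 1, N, K, n)   # P(overlap >= k)
    print(f"enrichment p = {p_value:.3g}")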

At the end of the summer Christopher presented his results at a Synthesis Seminar on pathway-based gene prioritization. The results will be written up in a manuscript and submitted for publication.

Fast Food Kernel Expansions for Supervised Learning

Kellan Fluette and Alex Brown

Deep learning models that capture high-level abstractions in data often outperform standard models for classification problems. On large datasets, significant gains in classification accuracy can be achieved by using computationally efficient non-linear transforms, such as deep neural networks (DNNs) or stacked denoising autoencoders (SDAEs), to model higher-level abstractions in the data before applying standard models for classification on the transformed dataset. Le et al. developed Fastfood, a method for approximating kernel expansions in log-linear time; kernel expansions, which must otherwise be calculated for every pair of training samples, quickly become costly for large datasets, a cost Fastfood partially resolves. Because the existing paper describes Fastfood-optimized neural networks (FONNs) only for binary classification problems, we extend the algorithm so that it can be applied to classification problems with more than two classes using a logistic classifier.
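
For orientation, the sketch below implements the plain Rahimi-Recht random Fourier feature map that Fastfood accelerates; Fastfood obtains the same kind of map in log-linear time by replacing the dense Gaussian matrix with products of Hadamard and diagonal matrices. This is an illustrative baseline, not the project's FFNet code.

    import numpy as np

    def random_fourier_features(X, n_features=256, gamma=1.0, seed=0):
        """Random features whose inner products approximate an RBF kernel.

        Fastfood replaces the dense Gaussian matrix W with structured
        (Hadamard/diagonal) matrices to cut the cost to log-linear time.
        """
        rng = np.random.default_rng(seed)
        W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], n_features))
        b = rng.uniform(0, 2 * np.pi, size=n_features)
        return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

    X = np.random.default_rng(1).normal(size=(10, 5))
    Z = random_fourier_features(X)
    print(Z.shape)   # Z @ Z.T approximates exp(-gamma * ||x_i - x_j||^2)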

This project focused on the following:

  • Extend FFNET to work with different activation functions
  • Develop evaluation framework for deep learning neural networks using both the standard data from the literature and the 1,000 genome data
  • Evaluate performance of FFNet
  • Apply Fastfood kernels to SVMs

At the end of the summer Kellan presented his preliminary data and results at a Synthesis Seminar. The results will be written up in a manuscript and submitted for publication.

Digital transcriptomics and Pathway Analysis for FLI1 in Human Breast Cancer Cells

Margaret Antonio, Charlie Labuzzetta, and Chris Snyder

Lung cancer is the leading cause of cancer-related deaths worldwide, with the subtype non-small cell lung carcinoma (NSCLC) comprising approximately 87% of lung cancer cases in the United States and causing an estimated 500,000 deaths per year worldwide. Many factors contribute to lung cancer etiology, including age, race, and environmental and genetic factors. Collectively, genetic and epigenetic alterations contribute to the multiple events leading to the development of lung cancer.

Despite advances in diagnosis and clinical treatment, lung cancer continues to present at advanced stages with a high risk of relapse. In addition, it is critical that new methods for distinguishing aggressive from indolent cancers be identified. In lung cancer, this question will likely become even more relevant in the near future as the medical community reacts to the recent multi-institutional prospective randomized study showing a survival benefit for patients who underwent CT screening to detect early lung cancers. As a direct result, we will likely be identifying smaller, indolent lung cancers of uncertain clinical significance. Thus, a better understanding of the molecular events that drive indolent lung cancers toward more aggressive tumors will help guide clinical patient management. Previous studies have shown that alternative splicing (AS) can distinguish lung cancer from matched non-tumor tissue and can discriminate between lung cancer subtypes.

This project focused on the following:

  • Perform differential expression analysis of FLI1 gain of function and loss of function, including pathway analysis and synthesis of results
  • Compare different alternative splicing pipelines
  • Develop a wrapper and pipeline for command-line execution of the optimal pipeline for biomarker discovery
  • Create a Galaxy wrapper for pipeline
  • Evaluate and compare feature selection techniques for heterogeneous data streams (expression + alternative splicing + SNP), including multiple-objective genetic algorithms and the elastic net (see the sketch after this list)
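
For the elastic net comparison in the last goal, a minimal sketch of sparse feature selection over concatenated heterogeneous blocks might look like the following; the toy data dimensions and the simulated signal are assumptions for illustration only.

    import numpy as np
    from sklearn.linear_model import ElasticNetCV

    rng = np.random.default_rng(2)
    # Toy heterogeneous blocks: expression, splicing (PSI), SNP dosages.
    expr = rng.normal(size=(80, 40))
    psi = rng.uniform(size=(80, 20))
    snp = rng.integers(0, 3, size=(80, 30)).astype(float)
    X = np.hstack([expr, psi, snp])
    y = X[:, 0] - 2.0 * X[:, 45] + rng.normal(scale=0.5, size=80)

    model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.9], cv=5).fit(X, y)
    print(np.flatnonzero(model.coef_))   # features kept by the sparse fit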

At the end of the summer Margaret presented her results at a Synthesis Seminar on feature selection techniques for heterogeneous data streams. The results will be written up in a manuscript and submitted for publication.

Genomics Wet Lab Experience: ApeKI-indexed Illumina genomic library construction and de novo SNP discovery in pooled Loggerhead (C. caretta) and Kemp's Ridley (L. kempii) Atlantic sea turtle populations

Katherine King, Connie Truong, Francisca Donkor, Heather Milne