Computational model for detecting DNA gene types
1. Introduction and Background
DNA analysis is now a data-intensive discipline. New technology has transformed biomedical research by making a wealth of molecular data available quickly and at reduced cost. Large consortia and many individual laboratories have already generated vast datasets; for example, one such database, the Gene Expression Omnibus (GEO), contains more than 1.8 million samples. These data are publicly available, but analyzing them requires computational and statistical resources.
This study will use statistical models to identify the genes that cause one cell type, such as a skin cell, to differ from another, such as a muscle cell. The factors that distinguish one cell type from another have been shown to be epigenetic: they influence gene activity rather than the DNA sequence itself.
Gene analysis, and epigenetics in particular, increasingly relies on numerical analysis:
- Scientists are now able to identify epigenetic mechanisms that affect the behavior of a gene.
- These mechanisms can now be mapped, and the patterns they produce can be visualized.
- These patterns have been shown to differ from one gene to another (Figure 1 attached).
- These patterns are numerical and can be analyzed with standard statistical and computational tools.
- By analyzing these patterns, we will be able to differentiate between cell types.
2. Project Objectives
We will continue previous work and develop methods to find a comprehensive set of epigenetic features that uniquely identify a cell type.
The key objectives of this project include:
- Collect data and convert it into a usable format. Epigenetic data is abundantly available online, but additional steps are needed to clean, filter, and format it.
- Develop a regression model to distinguish cell identity genes from other genes (Figure 2 attached). Regression is a well-established and well-validated method that suits this task. Each gene has a binary label: it either is a cell identity gene or is not. The model uses a combination of histone modification patterns and RNA expression levels as predictors. It will produce a ranking that indicates how likely each gene is to be a cell identity gene. Regression models have the advantage of describing the underlying process: they produce not only a classification but also measures of fit, parameter estimates, and significance values. The regression model therefore gives us a deeper understanding of the overall relationships among the predictors.
- We have intriguing preliminary data suggesting that the transcription factor MECOM shows epigenetic signatures and expression patterns characteristic of cell identity genes of the endothelial lineage. Our work will be the first to systematically study MECOM function in skin cells.
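The regression objective above can be illustrated with a minimal sketch. It assumes logistic regression, which is one common choice for a binary outcome; the proposal itself specifies only "regression". The feature names (a histone modification score and an RNA expression level), the gene labels, and all values are hypothetical toy data chosen for illustration, not results from this project.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and an intercept by batch gradient descent on the log-loss."""
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this gene: predicted probability minus label.
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def score(x, weights, bias):
    """Estimated probability that a gene is a cell identity gene."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical, standardized toy predictors:
# (histone modification score, RNA expression level), label 1 = cell identity gene.
genes = {
    "gene_a": ([1.5, 1.2], 1),
    "gene_b": ([1.1, 0.9], 1),
    "gene_c": ([0.8, 1.4], 1),
    "gene_d": ([-1.0, -0.7], 0),
    "gene_e": ([-1.3, -1.1], 0),
    "gene_f": ([-0.6, -1.5], 0),
}

rows = [features for features, _ in genes.values()]
labels = [label for _, label in genes.values()]
weights, bias = fit_logistic(rows, labels)

# Rank genes from most to least likely to be a cell identity gene.
ranking = sorted(genes, key=lambda g: score(genes[g][0], weights, bias), reverse=True)
for name in ranking:
    print(name, round(score(genes[name][0], weights, bias), 3))
```

In practice, an established statistics package would be preferable to a hand-written fit, because it also reports the measures of fit and significance values mentioned above; the sketch only shows how predictors are turned into a per-gene ranking.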
3. Significance of Project Results
Our proposed work is significant because it addresses some of the most challenging and fundamental aspects of cell identity research.
Many important questions about the body's inner mechanisms can be addressed using the results we obtain. Genomic research has made major advances over the past twenty years, and medicine will increasingly rely on genomics; this project is part of a major wave of innovation that will affect us and future generations.