Thursday, 15 February 2018 01:21

Computational model for detecting DNA gene types

1. Introduction and Background


DNA analysis is now a data-intensive discipline. New technology has transformed biomedical research by making a wealth of molecular data available quickly and at reduced cost. Large consortia and many individual laboratories have already generated vast datasets: for example, one such database, the Gene Expression Omnibus (GEO), contains more than 1.8 million samples. These data are readily and publicly available, but analyzing them requires computational and statistical resources.

This study will use statistical models to identify the genes that cause a skin cell to differ from a muscle cell. The factors that make one cell type different from another have been shown to be epigenetic: they influence gene activity rather than the DNA sequence itself.

Gene analysis, and epigenetics in particular, is increasingly reliant on numerical analysis:

  • Scientists are now able to identify epigenetic mechanisms that affect the behavior of a gene
  • We can now map these mechanisms and visualize the patterns they produce
  • These patterns have been shown to differ from one gene to another (Figure 1 attached)
  • These patterns are numerical and can be analyzed with regular statistical and computational tools
  • By analyzing these patterns, we will be able to distinguish between different types of cells
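As a minimal illustration of treating such patterns as numerical data, the sketch below reduces a per-bin signal profile to a few summary features. The feature names (`breadth`, `height`, `mean_signal`), the threshold, and the two example profiles are assumptions made for illustration; they are not taken from the project's data or methods.

```python
from statistics import mean


def summarize_profile(signal, threshold=1.0):
    """Reduce a per-bin signal profile to a few numeric features.

    `signal` is a list of non-negative values, one per genomic bin
    (a hypothetical input, e.g. histone-mark coverage across a gene).
    """
    if not signal:
        raise ValueError("signal must contain at least one bin")
    above = [v for v in signal if v > threshold]
    return {
        "breadth": len(above),        # number of bins above the threshold
        "height": max(signal),        # value of the tallest bin
        "mean_signal": mean(signal),  # average over all bins
    }


# Two made-up profiles: one broad and moderate, one narrow and sharp.
broad = [0.5, 2.0, 2.5, 3.0, 2.5, 2.0, 0.5]
sharp = [0.0, 0.0, 0.5, 6.0, 0.5, 0.0, 0.0]

broad_features = summarize_profile(broad)
sharp_features = summarize_profile(sharp)
print(broad_features)
print(sharp_features)
```

Once each gene is described by a handful of numbers like these, standard statistical tools can compare genes and cell types directly.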

2. Project Objectives

We will build on previous work and develop methods to find a comprehensive set of epigenetic features that uniquely identify a cell type.

The key objectives of this project include:

  • Collect data and convert it into a usable format; epigenetic data is abundantly available online, but additional steps are needed to clean, filter and format it.


  • Develop a regression model to distinguish cell identity genes from other genes (Figure 2 attached). Regression is a well-established and verified method that suits this task. Each gene has a binary label: it either is a cell identity gene or it is not. The model uses a combination of histone modification patterns and RNA expression level as predictors. It will produce a ranking that indicates how likely each gene is to be a cell identity gene. Regression models have the advantage of describing the underlying process: they produce not only a classification but also measures of fit, parameter estimates and significance values. The regression model therefore shows how each predictor relates to the outcome.



  • Study the transcription factor MECOM. Our preliminary data suggest that MECOM shows epigenetic signatures and expression patterns that are distinct for cell identity genes of the endothelial lineage. Our work will be the first to systematically study MECOM function in skin cells.
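The regression objective above can be sketched as follows. The text does not name the regression family; because the outcome is binary, this sketch assumes logistic regression, fitted here by plain gradient descent so that the example has no external dependencies. The gene names, the two predictors (a histone-mark feature and an RNA expression level, both scaled to the range 0 to 1) and the labels are invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this gene under the current parameters.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that a gene is a cell identity gene."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Made-up data: [histone-mark feature, RNA expression], both scaled to [0, 1].
genes = ["geneA", "geneB", "geneC", "geneD", "geneE", "geneF"]
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.6], [0.2, 0.3], [0.1, 0.2], [0.3, 0.1]]
y = [1, 1, 1, 0, 0, 0]  # 1 = cell identity gene (hypothetical labels)

w, b = fit_logistic(X, y)
scores = {g: predict_proba(w, b, x) for g, x in zip(genes, X)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

This sketch yields only parameter estimates and a ranking. Measures of fit and significance values would come from a statistics package rather than from hand-written gradient descent.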


3. Significance of Project Results

Our proposed work is significant because it addresses the most challenging and fundamental aspects of cell identity research.

Many important questions about our body’s inner mechanisms can be addressed with the results we obtain. Genomic research has made major advances over the past twenty years, and medicine will increasingly rely on genomics; this project is part of a major wave of innovation that will affect us and future generations.


My research employs a synthetic biology technique that can site-specifically encode unnatural amino acids into proteins (see Young, TS and Schultz, PG. 2010. J. Biol. Chem. 285 (15). pp 11039-44). While my own research targets histone proteins (using crosslinkers), the system can be applied to any protein of interest (using whichever of the 100+ different unnatural amino acids makes sense). I would like to collaborate on a research project that would benefit from this technology, regardless of the protein of interest. Protein targets that are encoded with an unnatural amino acid can be studied in vivo or be expressed recombinantly and isolated for solution studies (ideal for enzymatic, structural and binding assays).

Additionally, I would be interested in developing new unnatural amino acids and expanding the synthetic biology molecular toolbox. If there is a novel chemistry of interest that can be synthesized as an amino acid, I would like to pursue the directed evolution of the enzymes in this system so that they specifically recognize and install these new amino acids.



I am interested in developing one or more highly interdisciplinary courses focused on Synthetic Biology. The aim is to dissolve some of the barriers that tend to isolate the classic scientific majors within narrowly defined parameters. My intention is to blend topics in chemistry, biology, math and engineering to create a set of comprehensive courses that meld the disciplines into a single identity. This initiative would benefit both Manhattan College students and any institution interested in a collaborative online course. However, this effort would work best as a summer semester on location at either institution. The courses being developed are meant to familiarize students with interdisciplinary topics that are not typically introduced until graduate-level study.

The drive for such an initiative stems from the fact that at most schools (particularly in the U.S.) the biology and chemistry departments are guilty of manoeuvring students through a traditional set of courses that are compartmentalized as either chemistry or biology related. The two “parent” subjects only appear to overlap significantly within biochemistry courses, and even then, the topics remain within a very classical definition of the field. Moreover, the typical science education curriculum includes mathematics courses that often lose their connection to the students’ chosen discipline. In an era where such extraordinary emphasis is being placed on science, technology, engineering, and mathematics (STEM) programs, it is puzzling that there are so few courses that truly tie the four disciplines together. In fact, the National Research Council (NRC) released a report calling for a renovation of the biological teaching curriculum, stating a need for greater integration of physical sciences, mathematics, and interdisciplinary laboratory experiences.

Synthetic biology is a branch of science that unites the components of the STEM acronym into a cohesive unit. This field uses engineering and mathematical modeling to design ways to genetically manipulate biological systems in order to alter, or create de novo, unique physiological pathways. Cellular biochemistry can be redesigned and “tuned” for highly specific outputs, essentially treating cellular stimuli, genes and promoters as circuit timers, gates and switches. Remarkably, nearly 20 years after the first synthetic biology research was published, it remains missing from teaching curricula at most undergraduate institutions.

The scope of synthetic biology at Manhattan College would be developed to include new courses that emphasize the history, evolution and ingenuity of the field. Initial courses will be designed to help students discover the complex, yet intriguing, possibilities of biological pathway restructuring. A laboratory portion of the course would allow students to explore these systems firsthand, with exposure to bacterial photography systems, cells that grow to smell like bananas, and the ability to produce color changes in cellular systems (Biobuilder). These exercises provide multilayered STEM applications and add an immediate “wow” factor to the students’ learning experience. More broadly, these courses will show students that teams of scientists with diverse skill sets are often required to achieve large project goals.

Innovations in biotechnology are far outpacing the textbooks, so it is difficult to merge exciting new discoveries and technologies into lectures alongside classical biochemistry topics. This course would help span that gap as a supplement to classical biology and biochemistry classes.