Machine Learning Analysis of Polymerase-II Pause Locations

This project predicts RNA polymerase II (RNAPII) pause locations in NET-seq data from the yeast Saccharomyces cerevisiae, using ChIP-seq and ChIP-exo signal as predictors. The work begins with cleaning the large genomic files, followed by a correlation analysis of the ChIP-seq data. Pause prediction is framed as binary classification with Random Forests, and BORUTA, RFE, and RFECV are applied for feature reduction. Random Forests are then compared against Logistic Regression and Gradient Boosting Machines. Finally, the project identifies the five most informative histone modifications from the ChIP-seq data and examines how adding ChIP-exo data changes predictive performance.

Polymerase-II details.

Pause Site: The visual below shows the NET-seq signal for Saccharomyces cerevisiae. High-intensity positions mark potential pause locations.

Correlation: The visual below shows the correlation matrix of the twenty-six ChIP-seq files used to predict pause locations in the NET-seq data with the Random Forest algorithm. The ChIP-seq files measure histone modifications: chemical alterations to histone proteins, which are core components of chromatin, the material that makes up chromosomes in the cell nucleus.
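A minimal sketch of how such a correlation matrix could be computed, assuming each ChIP-seq file has already been reduced to one signal value per genomic position. The track names and values are made-up placeholders, not project data, and Pearson correlation is an assumption about the measure used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(tracks):
    """Return (names, matrix) where matrix[i][j] is Pearson r of tracks i and j."""
    names = list(tracks)
    matrix = [[pearson(tracks[a], tracks[b]) for b in names] for a in names]
    return names, matrix

# Hypothetical per-position signal for three tracks (placeholder values).
tracks = {
    "mark_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mark_B": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with mark_A
    "mark_C": [5.0, 4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with mark_A
}

names, matrix = correlation_matrix(tracks)
for name, row in zip(names, matrix):
    print(name, [round(r, 2) for r in row])
```

Strongly correlated tracks carry overlapping information, which is one reason a feature reduction step can be worthwhile before or after training.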

Pause Condition: The equation specifies whether a location is paused or not.

Results: The Random Forest binary classification results are given below.

Confusion Matrix: Confusion matrix for a single strand of Chromosome I. I used only one strand to get an overview, because the full data runs to millions of positions.
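For readers unfamiliar with the layout, here is a small sketch of how a binary confusion matrix is tallied. The labels are invented for illustration (1 = pause, 0 = no pause), and the row/column order follows the common convention of true classes as rows and predicted classes as columns.

```python
def confusion_counts(y_true, y_pred):
    """Tally outcomes for binary labels (1 = pause, 0 = no pause)."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    # Rows are true classes, columns are predicted classes.
    return [[tn, fp], [fn, tp]]

# Made-up labels for eight positions, not project data.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

matrix = confusion_counts(y_true, y_pred)
(tn, fp), (fn, tp) = matrix
precision = tp / (tp + fp)  # share of predicted pauses that are real
recall = tp / (tp + fn)     # share of real pauses that were found

print(matrix)  # [[4, 1], [1, 2]]
print(round(precision, 3), round(recall, 3))  # 0.667 0.667
```

Pause sites are typically much rarer than non-pause positions, so precision and recall are usually more informative here than raw accuracy.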

Comparison: Spider (radar) chart comparing Random Forest with other binary classification algorithms: Logistic Regression and Gradient Boosting Machine (GBM).
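The comparison can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the project's pipeline: the dataset, hyperparameters, and choice of metrics are all assumptions. The 26 synthetic features simply mirror the number of ChIP-seq files.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix (26 columns) and pause labels.
X, y = make_classification(
    n_samples=600, n_features=26, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters here are illustrative defaults, not tuned values.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "GBM": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    }

for name, metrics in scores.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Each metric in `scores` could serve as one axis of a radar chart, with one polygon per model.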

Biological Importance: The five most important histone modifications identified by the Random Forest binary classifier.
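One common way to obtain such a ranking is the impurity-based feature importance that a fitted Random Forest exposes. The sketch below assumes that approach, which may differ from the ranking method actually used, and runs on synthetic data with placeholder feature names (`track_00`, `track_01`, ...).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 26 features, of which 5 carry signal.
X, y = make_classification(
    n_samples=500, n_features=26, n_informative=5, n_redundant=0, random_state=0
)
feature_names = [f"track_{i:02d}" for i in range(X.shape[1])]  # placeholder names

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Pair each feature with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top_five = ranked[:5]

for name, importance in top_five:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances can be split across correlated features, so a ranking like this is best read alongside the correlation matrix.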

RFE: Comparison of Random Forest with feature reduction algorithms such as RFE (Recursive Feature Elimination).
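A minimal sketch of RFE wrapped around a Random Forest, using scikit-learn on synthetic data. The target of five features, the step size, and the dataset are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=500, n_features=26, n_informative=5, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: Random Forest trained on all 26 features.
full_model = RandomForestClassifier(n_estimators=100, random_state=0)
full_model.fit(X_train, y_train)
acc_full = accuracy_score(y_test, full_model.predict(X_test))

# RFE repeatedly refits the estimator and drops the least important features
# (3 per round here) until only 5 remain.
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=5,
    step=3,
)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

reduced_model = RandomForestClassifier(n_estimators=100, random_state=0)
reduced_model.fit(X_train_sel, y_train)
acc_reduced = accuracy_score(y_test, reduced_model.predict(X_test_sel))

print("kept feature indices:", [i for i, keep in enumerate(selector.support_) if keep])
print("accuracy, all features:", round(acc_full, 3))
print("accuracy, 5 features:", round(acc_reduced, 3))
```

scikit-learn's `RFECV` follows the same idea but chooses the number of features by cross-validation instead of taking it as a fixed argument.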

BORUTA: Comparison of Random Forest with another feature reduction algorithm, BORUTA.
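The core idea behind BORUTA is to compare each real feature against "shadow" features, which are shuffled copies that cannot carry real signal. The sketch below shows a single round of that comparison with scikit-learn and NumPy. It is a deliberate simplification: the full algorithm repeats the procedure many times and applies a statistical test, and the data here is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(
    n_samples=500, n_features=26, n_informative=5, n_redundant=0, random_state=0
)
n_features = X.shape[1]

# Shadow features: each column permuted independently, which breaks any
# relationship with the labels while keeping the column's distribution.
shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(n_features)])
X_aug = np.hstack([X, shadows])

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_aug, y)
real_imp = forest.feature_importances_[:n_features]
shadow_imp = forest.feature_importances_[n_features:]

# A real feature scores a "hit" in this round if it beats the best shadow.
threshold = shadow_imp.max()
hits = np.flatnonzero(real_imp > threshold)

print("best shadow importance:", round(float(threshold), 4))
print("features beating every shadow:", hits.tolist())
```

A single round is noisy, which is why the full algorithm accumulates hits over many iterations before confirming or rejecting a feature.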

Addition of ChIP-exo: Performance improvement when ChIP-exo data (DNA-binding proteins) is added to the ChIP-seq data (histone modifications) for finding pause locations in the NET-seq data of Saccharomyces cerevisiae.

Log-Loss: Comparison of log-loss values when DNA-binding proteins are added to the histone modifications.
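Log-loss (binary cross-entropy) penalizes confident wrong probabilities, and lower values are better. The sketch below computes it by hand for two sets of predicted probabilities. The numbers are invented purely to show how the metric behaves and do not reflect the project's results.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; y_prob holds predicted P(label = 1)."""
    total = 0.0
    for truth, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += truth * math.log(p) + (1 - truth) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 0]

# Made-up probabilities: a less confident model and a more confident one,
# both leaning toward the correct class on every position.
prob_less_confident = [0.6, 0.4, 0.6, 0.4]
prob_more_confident = [0.9, 0.1, 0.9, 0.1]

loss_less = log_loss(y_true, prob_less_confident)
loss_more = log_loss(y_true, prob_more_confident)

print(round(loss_less, 4))  # 0.5108
print(round(loss_more, 4))  # 0.1054
```

In a comparison like the one pictured, a drop in log-loss after adding features indicates that the predicted probabilities moved closer to the true labels.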

Supplementary information: Please refer to the supplementary information for any background you may need to understand this project.

Definitions of Biological Terms.

DNA and RNA: The image below shows where DNA and RNA sit in the hierarchy of living organisms.

Transcription: The image below depicts how transcription proceeds: RNA polymerase II (RNAPII) attaches at the transcription start site of the DNA and copies it into mRNA.

Distribution: The image below shows how pause scores are distributed in the NET-seq data for Saccharomyces cerevisiae.

Contact

I'm always looking for new and exciting opportunities. Let's connect.

anishsac@gmail.com

+44 7436628501