Bioinformatics analysis of epigenomic
Motif analysis of genetics
Molecular modeling of protein structures
Statistical learning of genetic networks
Biophysical modeling of epigenetic landscape
Sequence analysis frequently requires an intuitive understanding and a convenient representation of motifs. Traditionally, motifs are stored as position weight matrices (PWMs) [1] and visualized as sequence logos [2]. However, PWMs are bulky and hard to read at a glance, and logos depend on graphical display support. Here, we propose to represent motifs by wildcard-style consensus sequences, and we show how this conversion can be compact, informative and intuitive. Based on mutual information theory and the Jensen-Shannon divergence [3], we propose a mathematical framework to optimize the proposed conversion, and we implement an efficient algorithm to perform it. We show how such a representation can be a significant improvement over current alternatives [4]. In summary, we believe our package will find its niche in the textual representation of motifs, where visualization support is often lacking.
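To make the idea of a wildcard-style consensus concrete, here is a minimal Python sketch. It is not the project's algorithm, only one plausible reading of it under stated assumptions: each PWM column is a probability vector over A, C, G, T, and each column is mapped to the standard IUPAC nucleotide code whose uniform distribution has the smallest Jensen-Shannon divergence from that column. The function names (`jsd`, `column_to_code`, `pwm_to_consensus`) and the "uniform over the allowed bases" reference distribution are illustrative choices, not part of the original proposal.

```python
import math
from itertools import combinations

BASES = "ACGT"

# Standard IUPAC nucleotide codes, keyed by the sorted set of allowed bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}

def kl(p, q):
    """Kullback-Leibler divergence in bits, with 0 * log(0) taken as 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence in bits between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def column_to_code(col):
    """Pick the IUPAC code whose uniform distribution is closest to col.

    Assumption: "closest" means minimum Jensen-Shannon divergence.
    """
    best_code, best_d = None, float("inf")
    for k in range(1, len(BASES) + 1):
        for subset in combinations(BASES, k):
            # Uniform distribution over the bases allowed by this code.
            q = [1 / k if b in subset else 0.0 for b in BASES]
            d = jsd(col, q)
            if d < best_d:
                best_code, best_d = IUPAC["".join(subset)], d
    return best_code

def pwm_to_consensus(pwm):
    """Convert a PWM (list of [pA, pC, pG, pT] columns) to a consensus string."""
    return "".join(column_to_code(col) for col in pwm)

# Made-up example PWM with four positions (values are illustrative only).
example_pwm = [
    [0.97, 0.01, 0.01, 0.01],  # almost always A
    [0.48, 0.02, 0.48, 0.02],  # A or G
    [0.25, 0.25, 0.25, 0.25],  # no preference
    [0.02, 0.32, 0.33, 0.33],  # anything but A
]
print(pwm_to_consensus(example_pwm))  # ARNB
```

The resulting string fits on one line of plain text, which is the setting the abstract targets. A real implementation would also need to decide how to handle pseudocounts and background base frequencies, which this sketch ignores.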
Students will learn
-- How to design efficient algorithms and implement them in Python/R/Bash.
-- How to design and test statistical hypotheses, specifically with information theory.
-- What the principal bioinformatics tools for DNA sequence manipulation and motif analysis are.
-- How to visualize DNA sequences in a browser and with customized scripting tools.
-- How to frame biological problems so that they can be solved with machine learning algorithms, including logistic regression, random forests, SVMs, neural networks, and deep learning.
By the end of the summer, students should have a working knowledge of statistics and programming, along with domain knowledge in human genomics. The other aim of the project is to better prepare students to apply for, and transition into, PhD-level bioinformatics or master's-level data science-related disciplines.
-- Programming: Students need to be able to successfully run "hello world" in Python/R/Bash on their own laptop.
-- Statistics: Students should be able to articulate what a binomial distribution is, preferably to a 5-year-old.
-- Biology: Students need to know that genomic information is encoded in DNA, and that DNA sequences are composed of A, T, C and G.