Bioinformatics 101
Welcome to the world of bioinformatics!
Having mentored quite a number of students and emailed this introduction to dozens of them throughout my postgraduate study, I figured it would be good to share this piece somewhere for a broader audience. As it was originally meant as a catalog of resources to supplement a verbal introduction, it’s mainly in point form, and I will keep it this way (lazily, haha).
So, here we go! Don’t worry if you have not learnt any programming or bioinformatics yet. We all need a starting point.
Basic skills and knowledge
To work in bioinformatics, here is the basic knowledge required:
1. Sequencing technologies:
- What is it? Why and when to do sequencing?
- What current technologies are there?
2. Sequencing analysis workflow
- What are the common file formats?
- Standard procedures: e.g., quality control, read trimming, alignment
- What are the common tools?
It is sufficient to first understand the big picture and be familiar with the most common tools. The fine details are specific to each project and come with experience.
3. Basic Linux commands and programming skills
Not all bioinformaticians will develop new software, but it is necessary to become familiar with writing simple scripts to automate tasks and handle data, e.g., table manipulation and string extraction.
4. Statistics
To get better at bioinformatics or any data analysis, it’s good to know at least the principles of common statistical tests, clustering, and dimension reduction methods.
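To make the "simple scripts" under point 3 a little more concrete, here is a minimal sketch of table manipulation and string extraction using only the Python standard library. The table contents, the column names (`gene`, `log2fc`, `padj`) and the `<sample>_R1.fastq.gz` file naming pattern are all made up for illustration; real data will look different.

```python
import csv
import io
import re

# Hypothetical tab-separated table; column names and values are invented.
TABLE = """gene\tlog2fc\tpadj
geneA\t2.5\t0.001
geneB\t-0.3\t0.60
geneC\t-1.8\t0.02
"""


def significant_genes(text, max_padj=0.05):
    """Return the names of genes whose adjusted p-value is below max_padj."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row["gene"] for row in reader if float(row["padj"]) < max_padj]


def sample_id(filename):
    """Extract the sample name from an assumed '<sample>_R1.fastq.gz' file name.

    Returns None when the file name does not follow that pattern.
    """
    match = re.match(r"(.+)_R[12]\.fastq\.gz$", filename)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(significant_genes(TABLE))
    print(sample_id("liver_rep1_R1.fastq.gz"))
```

In practice you would read the table from a file rather than a string, and many people would reach for a library such as pandas, but the idea of filtering rows by a condition and pulling a piece out of a string stays the same.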
Advanced knowledge
5. More advanced computer skills
System administration, data structures, algorithms, software engineering, etc.
6. Bioinformatics algorithms and data structures
To better understand and evaluate existing tools and research. This is important when designing your own programs, and especially when developing software.
7. Current advances
As in any field, read the current literature, though of course it will only make sense after grasping the basic concepts.
First IT knowledge to acquire
There is much to say about each of the above areas. We will start with the one you can begin anywhere, anytime.
- Linux: a family of open-source operating systems (OS), just as Windows and macOS are OSes, but common on servers for its good performance.
- BASH shell: a “typing” way to get around without a mouse on servers without graphics, and to handle repetitive things that would otherwise take you tonnes of time. Google and learn these basic commands: ls, cd, less, head, tail, cp, mv, cat, wc, wget, gzip, tar, vi, rm, …. Beware when using rm (removing) and mv (moving, which may replace existing files xd). sed and awk are useful commands for data manipulation.
- Python: a more sophisticated language, and for years one of the most widely learnt, for handling more complicated operations, while remaining intuitive for programming beginners. Many libraries (ready-to-use functions written by others) are available.
- R: while more and more packages are published in Python, many bioinformatics packages are written in R and distributed through the bioinformatics software repository Bioconductor; a common example is DESeq. You need basic knowledge of R to run them. Being familiar with the tidyverse family of data analytics packages will make life easier.
- Concepts: these are the most important, as they apply regardless of language syntax
  - Variables: Boolean (True or False), integer, float(ing point number), string, list/array
  - Logic: IF-THEN-ELSE, FOR-loop, WHILE-loop, functions, etc.
- Data manipulation & visualization
  - Table manipulation: extract columns/rows by condition, transformation, etc.
  - Graph plotting: from basic chart types like line, scatter and box plots to volcano plots and heatmaps
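As a small illustration of the concepts listed above, here is a sketch in Python that touches variables of each basic type, an IF-THEN-ELSE, a FOR-loop and a function. The function name `gc_content`, the 0.5 threshold and the example sequences are invented for this illustration.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence (0.0 if empty)."""
    sequence = sequence.upper()          # string
    if len(sequence) == 0:               # IF-THEN-ELSE
        return 0.0
    gc_count = 0                         # integer
    for base in sequence:                # FOR-loop over the characters
        if base in "GC":
            gc_count += 1
    return gc_count / len(sequence)      # float


reads = ["ATGC", "GGCC", "ATAT"]         # list of strings (made-up sequences)
gc_rich = []
for read in reads:
    is_gc_rich = gc_content(read) > 0.5  # Boolean
    if is_gc_rich:
        gc_rich.append(read)

print(gc_rich)
```

The same building blocks exist in R and BASH with different syntax, which is why the concepts are worth learning first.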
Resources
Recommended Reviews
Sequencing and bioinformatics
- Sequencing technologies — the next generation (Metzker, 2009)
- The sequence of sequencers: The history of sequencing DNA (Heather, et al., 2015)
- DNA sequencing at 40: past, present and future (Shendure, et al., 2017)
- The Third Revolution in Sequencing Technology (Dijk, et al., 2018)
- Bioinformatics applied to biotechnology: A review towards bioenergy research (Carvalho, et al., 2019)
- Review of applications of high-throughput sequencing in personalized medicine: barriers and facilitators of future progress in research and clinical application (Lightbody, et al., 2019)
Genomics
- New advances in sequence assembly (Phillippy, 2017)
- Current challenges and solutions of de novo assembly (Liao, et al., 2019)
- Long-read sequencing in deciphering human genetics to a greater depth (Midha, et al., 2019)
- Long walk to genomics: History and current approaches to genome sequencing and assembly (Giani, et al., 2020)
- The road ahead in genetics and genomics (McGuire, et al., 2020)
- Advances in optical mapping for genomic research (Yuan, et al., 2020)
RNA-seq
- RNA-Seq: a revolutionary tool for transcriptomics (2009)
- Coming of age: ten years of next-generation sequencing technologies (Goodwin, et al., 2016)
- RNA sequencing: the teenage years (Stark, et al., 2019)
Single-cell transcriptomics
- An Introduction to the Analysis of Single-Cell RNA-Sequencing Data (Aljanahi, et al., 2018)
- Single-Cell RNA-Seq Technologies and Related Computational Data Analysis (Chen, et al., 2019)
- Integrative single-cell analysis (Stuart, et al., 2019)
- Current best practices in single‐cell RNA‐seq analysis: a tutorial (Luecken, et al., 2019)
- The single-cell sequencing: new developments and medical applications (Andersson, et al., 2020)
- Single-cell and spatial transcriptomics enables probabilistic inference of cell type topography (Andersson, et al., 2020)
- Eleven grand challenges in single-cell data science (Lähnemann, et al., 2020)
- Power in Numbers: Single-Cell RNA-Seq Strategies to Dissect Complex Tissues (Birnbaum, 2018)
Spatial transcriptomics
Integrated omics
- A Review of Recent Advancement in Integrating Omics Data with Literature Mining towards Biomedical Discoveries (Raja, et al., 2017)
- Multi-omics approaches to disease
- Integrated omics: tools, advances and future approaches (Misra, et al., 2018)
- Integrative omics for health and disease (Karczewski, et al., 2018)
- Relevance of Multi-Omics Studies in Cardiovascular Diseases (Leon-Mimila, et al., 2019)
- The Progress of Multi-Omics Technologies: Determining Function in Lactic Acid Bacteria Using a Systems Level Approach (O’Donnell, et al., 2020)
- Integrating imaging and omics data: A review (Antonelli, et al., 2019)
- Single-Cell (Multi)omics Technologies (Chappell, et al., 2017)
Deep learning
- Recent Advances of Deep Learning in Bioinformatics and Computational Biology (Tang, et al., 2019)
- A review of deep learning applications for genomic selection (Montesinos-López, et al., 2021)
Recommended Books
Many useful IT books and resources from O’Reilly and Packt are accessible via an O’Reilly online learning subscription. If you are a university student, you very likely have free educational access through the library. Do check it out!
Recommended online courses
- Any beginner Python course, e.g., on YouTube, MIT OpenCourseWare, Khan Academy (free), or Udemy (partly free)
- Stepik (free)
https://stepik.org
Interactive learning platform
  - Bioinformatics Algorithms
  https://stepik.org/course/Bioinformatics-Algorithms-2
  - Introduction to Python
  https://stepik.org/course/238/syllabus
  - Adaptive Python
  https://stepik.org/course/568/info
- Rosalind (free)
http://rosalind.info/
A fun place to challenge your programming skills and bioinformatics concepts
- Coursera (mostly paid now)
https://coursera.org
Among the first MOOC platforms. Many useful courses. Provides certificates and microdegrees
  - Data analysis track by Johns Hopkins University
  - Bioinformatics courses by University of Toronto
- DataCamp (mostly paid)
https://datacamp.com
Systematic interactive online learning platform focusing on data analytics skills, e.g., Python, R, SQL
Recommended Sites
- BASH
http://www.tldp.org/LDP/Bash-Beginners-Guide/html
Systematic beginner guide to UNIX BASH
- Tutorialspoint
https://www.tutorialspoint.com
Place to learn various programming languages and to test them out
- StackOverflow
https://stackoverflow.com
Q&A forum for coding guidance, debugging, etc. Paste your error message or question into Google, and you will likely end up finding the answer here anyway lol. Also check out the other forums on StackExchange for other fields of knowledge.
- Biostar Forum
https://www.biostars.org
Q&A forum for bioinformatics questions
- SEQanswers
http://seqanswers.com
Q&A forum for bioinformatics questions
- GitHub
https://github.com
Where you can find (and hopefully one day share) useful source code
- Towards Data Science
https://towardsdatascience.com
Great place to pick up and keep up to date on data science skills and tips. You may also find various related topics such as productivity and career advice
Hope this can serve as a starter for anyone who may be interested. It’s always a nice challenge to pick up some knowledge and skills at your own pace in your free time.
And of course, we learn best by having use cases. While there is a myriad of data and problems out there, you may want some guidance and exposure to actual practice in a research lab. If you are really interested, just look for a lab and ask the professor about an internship, just like my mentees did. Applying well before summer usually gives you a better chance of being admitted.
Enjoy the journey!