GENOMIC DATA ANALYSIS
Next-generation sequencing (NGS) technologies have driven exponential growth in the production of high-throughput omics data. These data are widely analyzed to identify genomic variants, including single nucleotide polymorphisms (SNPs) and DNA insertions and deletions (indels), across a spectrum of genetic disorders, and they provide new insights into how genetic polymorphisms affect disease phenotypes. To facilitate this research, the T-Bioinfo platform hosts a variant calling pipeline developed to help researchers identify and annotate sequence variants accurately and rapidly. Variant calling is the task of identifying possible variations in genome or transcriptome sequences relative to a chosen reference sequence. In germline variant calling, the reference is the standard reference sequence for the species of interest. In somatic variant calling, the reference is the genome of a chosen control somatic cell sample. “The variant calling pipeline identifies single nucleotide variants present within the whole genome and exome data. The variants are identified by comparing the datasets of an individual with a reference sequence.”
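The idea of comparing a sample with a reference can be illustrated with a deliberately simplified sketch. It assumes the two sequences are already aligned and of equal length, so it can only report single-base differences; the function name `call_snvs` and the example sequences are made up for illustration. Real variant callers work from many overlapping reads per position and use statistical models, which this toy example does not attempt.

```python
def call_snvs(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Toy illustration only: assumes the sequences are pre-aligned and contain
    no insertions or deletions.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length (no indels supported)")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Made-up sequences, not real genomic data.
reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"

snvs = call_snvs(reference, sample)
for pos, ref_base, alt_base in snvs:
    print(f"position {pos}: {ref_base} -> {alt_base}")
```

Positions are reported as 1-based here, which matches the convention used by SAM and VCF files.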
The Genomic Data Analysis pipeline runs a series of algorithms to report the variants observed in the samples:
- Bowtie2: takes reads in .fq/.fa files, aligns them to the reference genome, and writes the mapping results in SAM format. Each row of the SAM file contains the input read name, the name of the reference sequence, the position on the reference sequence, and the number of mapped, skipped, inserted, and deleted positions.
- Visualization: in JBrowse, users can view genome annotations against a reference “ruler,” with an overhead bar giving a visual indication of chromosome position.
- Variant calling with Freebayes: uses short-read alignments (BAM files) for any number of individuals from a population, together with a reference genome (in FASTA format), to determine the most likely combination of genotypes for the population at each position in the reference. It reports the positions it finds putatively polymorphic in variant call format (VCF).
- Variant calling with Mutect2: developed at the Broad Institute for the reliable and accurate identification of somatic point mutations in next-generation sequencing data of cancer genomes. In addition to calling mutations, MuTect generates a coverage file in wiggle format, which indicates for every base whether it is sufficiently covered in the tumor and the normal sample to be sensitive enough to call mutations. We currently use cutoffs of at least 14 reads in the tumor and at least 8 in the normal; these cutoffs are applied after noisy reads are removed in the preprocessing step.
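The coverage cutoffs quoted above amount to a simple per-base rule, sketched below. The constants come from the text (14 reads in the tumor, 8 in the normal); the function names and the depth values are hypothetical, and this is not the tool's own implementation. It only mirrors the stated thresholds on depths that are assumed to have been counted after noisy reads were removed.

```python
TUMOR_MIN_READS = 14   # cutoff quoted in the text
NORMAL_MIN_READS = 8   # cutoff quoted in the text


def is_sufficiently_covered(tumor_depth: int, normal_depth: int) -> bool:
    """True when a base meets both coverage cutoffs."""
    return tumor_depth >= TUMOR_MIN_READS and normal_depth >= NORMAL_MIN_READS


def coverage_mask(tumor_depths: list[int], normal_depths: list[int]) -> list[bool]:
    """Per-base mask, analogous in spirit to the coverage file described above."""
    if len(tumor_depths) != len(normal_depths):
        raise ValueError("depth lists must cover the same positions")
    return [is_sufficiently_covered(t, n) for t, n in zip(tumor_depths, normal_depths)]


# Made-up read depths for four consecutive bases.
tumor = [20, 14, 13, 30]
normal = [10, 8, 9, 7]

mask = coverage_mask(tumor, normal)
print(mask)
```

A base that fails either cutoff is treated as not callable, even if the other sample is deeply covered.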
The Genomic Data Analysis pipeline on the T-Bioinfo Server integrates JBrowse for visualizing the variants observed in patient samples. JBrowse 2 is a pluggable open-source platform for visualizing and integrating biological data. At its core it is a genome browser, but it is also built as an extensible platform that can visualize many kinds of biological data.
The results of a pipeline run also include a “Mapping Statistics” table that reports the “Overall alignment rate” of the reads against the reference genome.
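As a rough illustration of what an alignment rate measures, the sketch below parses SAM rows and computes the percentage of reads that aligned. It relies on the SAM column order (read name, flag, reference name, position, mapping quality, CIGAR) and on the SAM flag bits for unmapped, secondary, and supplementary records. The helper names and the example rows are made up, and the calculation is a simplified single-end approximation; the figure reported in the pipeline's table may be computed differently, particularly for paired-end data.

```python
import re

UNMAPPED = 0x4          # SAM flag bit: read is unmapped
SECONDARY = 0x100       # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM flag bit: supplementary alignment
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_row(line: str) -> dict:
    """Extract read name, flag, reference name, position and CIGAR operation totals."""
    fields = line.rstrip("\n").split("\t")
    ops: dict[str, int] = {}
    if fields[5] != "*":
        for length, op in CIGAR_OP.findall(fields[5]):
            ops[op] = ops.get(op, 0) + int(length)
    return {
        "read": fields[0],
        "flag": int(fields[1]),
        "reference": fields[2],
        "position": int(fields[3]),
        "cigar_ops": ops,  # e.g. M = aligned, I = inserted, D = deleted, N = skipped
    }


def overall_alignment_rate(sam_lines: list[str]) -> float:
    """Percentage of reads with a primary alignment (header lines are skipped)."""
    total = aligned = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue
        row = parse_sam_row(line)
        if row["flag"] & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once
        total += 1
        if not (row["flag"] & UNMAPPED):
            aligned += 1
    return 100.0 * aligned / total if total else 0.0


# Made-up SAM content for illustration.
sam_lines = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t42\t5M1I4M\t*\t0\t0\tACGTAACGTA\tIIIIIIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr1\t250\t40\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tchr1\t300\t30\t2M2D2M\t*\t0\t0\tACGT\tIIII",
]

rate = overall_alignment_rate(sam_lines)
print(f"Overall alignment rate: {rate:.2f}%")
```

The `cigar_ops` totals correspond to the counts of mapped, skipped, inserted, and deleted positions that each SAM row carries.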
JBrowse displays three tracks:
- Reference sequence (GRCh38NoPatch): the GRCh38 reference sequence to which the reads are aligned for variant calling
- GRCh38NoPatch.gtf.sorted.gff: genes and transcripts in GFF format
- Mutect.vcf: variant calls in VCF format. A VCF file contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome.
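The three-part VCF layout described above (meta-information lines, a header line, data lines) can be read with a few lines of code. This is a minimal sketch, assuming a tab-separated file in which meta-information lines start with `##` and the header line starts with a single `#`. The function name `parse_vcf` and the example records are made up; in practice an established VCF library would be a safer choice than a hand-written parser.

```python
def parse_vcf(lines: list[str]) -> tuple[list[str], list[str], list[dict]]:
    """Split VCF text into meta-information lines, header columns and data records."""
    meta: list[str] = []
    header: list[str] = []
    records: list[dict] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line[2:])             # e.g. "fileformat=VCFv4.2"
        elif line.startswith("#"):
            header = line[1:].split("\t")     # column names, starting with CHROM
        else:
            record = dict(zip(header, line.split("\t")))
            record["POS"] = int(record["POS"])  # 1-based position on the chromosome
            records.append(record)
    return meta, header, records


# Made-up VCF content for illustration; not output from the pipeline.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "##source=example",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t.\tPASS\tDP=42",
    "chr2\t67890\t.\tCT\tC\t.\tPASS\tDP=30",
]

meta, header, records = parse_vcf(vcf_lines)
for rec in records:
    print(rec["CHROM"], rec["POS"], rec["REF"], "->", rec["ALT"])
```

In the example, the first record is a single-base substitution and the second is a one-base deletion, since its REF allele is longer than its ALT allele.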