University of New South Wales
Project Description
Research and/or policy focus
This project concerns genomics, transcriptomics and proteomics analyses and addresses the issue of how they can be integrated. For genomics and transcriptomics, while it is routine to achieve good coverage of the genome or transcriptome with next-generation sequencing, it is difficult to assess the accuracy of the assembled sequences. Proteomics analysis has complementary strengths: advanced mass spectrometers routinely yield accurate data, but comprehensive coverage of the proteins is difficult to achieve. This project will integrate genomics, transcriptomics and proteomics data to cross-validate the assembled gene sequences against the identified proteins and peptides, and will co-visualize the results in a specialized viewer to confirm the expression and processing of genes.
Project impetus and drivers
To address the challenge of integration and cross-validation, we pioneered a method in which proteomic analysis is used to validate the quality of the assembled sequences. This involves identifying peptides that match novel genes in order to validate their existence. The peptides are also used to validate proteins that are encoded by differentially spliced mRNA. A major innovation of the pipeline is that it enables biologists to co-visualize proteomics data with next-generation sequencing data to assist with the validation of alternatively spliced mRNA. This is important because alternatively spliced mRNAs encode a variety of proteins with diverse biological functions, for example, the development of cells into different types of tissues.
Before the Proteomic-Genomic Nexus (PG Nexus) project, there was no efficient means of integrating genomics data generated from next-generation sequencing with proteomics data generated from protein mass spectrometry. A set of tools has now been developed that allows users to:
- Co-visualize genomic and proteomic data using the Integrative Genomics Viewer (IGV).
- Analyse mass spectrometry results using customizable filters, which can be used to validate alternatively spliced isoforms of proteins.
- Analyse transcriptomics data from bacterial, archaeal, and viral genomes by generating novel open reading frames for genes predicted by Glimmer or from simple overlapping intervals. In the latter case, the intervals can be re-arranged to enable the prediction of novel transcript sites.
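To make the idea of generating open reading frames concrete, the following is a minimal sketch of a six-frame ORF scan over a nucleotide sequence. It is not Glimmer and not the PG Nexus code: the names `Orf`, `find_orfs` and `min_codons` are hypothetical, only `ATG` is treated as a start codon, and the standard stop codons are assumed.

```python
from typing import Iterator, List, NamedTuple, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code (assumption)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


class Orf(NamedTuple):
    start: int   # 0-based start on the forward strand
    end: int     # exclusive end on the forward strand (stop codon included)
    strand: str  # "+" or "-"


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def _scan_strand(seq: str, min_codons: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) of ATG..stop ORFs in the three frames of one strand."""
    for frame in range(3):
        orf_start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if orf_start is None and codon == "ATG":
                orf_start = pos
            elif orf_start is not None and codon in STOP_CODONS:
                # Count coding codons only; the stop codon is excluded.
                if (pos - orf_start) // 3 >= min_codons:
                    yield orf_start, pos + 3
                orf_start = None


def find_orfs(seq: str, min_codons: int = 30) -> List[Orf]:
    """Scan both strands and report ORFs in forward-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    orfs = [Orf(s, e, "+") for s, e in _scan_strand(seq, min_codons)]
    for s, e in _scan_strand(reverse_complement(seq), min_codons):
        # Map reverse-strand offsets back onto the forward strand.
        orfs.append(Orf(n - e, n - s, "-"))
    return sorted(orfs)
```

A real prokaryotic gene finder would also consider alternative start codons and coding potential; this sketch only illustrates the coordinate bookkeeping needed before ORFs can be translated and searched against mass spectrometry data.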
Research Champion:
Prof. Marc Wilkins - UNSW, School of Biotechnology and Biomolecular Sciences (marc.wilkins@unsw.edu.au).
Dr Paul Chambers - Australian Wine Research Institute (paul.chambers@awri.com.au)
Data Type:
Input Data:
- Peptide and nucleotide data will be integrated
- Nucleotide sequences (e.g. DNA and RNA)
  - Gene sequences, and messenger RNA and their alternatively spliced forms (represented as contigs or isotigs)
  - Generated and assembled from next-generation (deep) sequencing machines
  - FASTA format is required for downstream compatibility with the Mascot protein identification engine
- Peptide sequences (fragments of proteins)
  - Peak lists representing fragmented peptides
  - Generated from tandem mass spectrometers
  - Validate the existence of proteins and their alternatively spliced forms
  - mzML format (or other equivalent format) for use in the Mascot identification engine
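As an illustration of the FASTA requirement above, here is a minimal sketch of writing and reading FASTA records with the Python standard library. The function names and the 60-character line width are assumptions made for this example; any header conventions that a particular Mascot database configuration expects are not covered.

```python
import io
from typing import Iterable, Iterator, List, Optional, TextIO, Tuple


def write_fasta(records: Iterable[Tuple[str, str]], handle: TextIO,
                width: int = 60) -> None:
    """Write (header, sequence) pairs as FASTA, wrapping sequence lines."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Round-trip two made-up contigs through an in-memory buffer.
    buffer = io.StringIO()
    write_fasta([("contig1", "ACGT" * 20), ("contig2", "GGCC")], buffer)
    print(list(read_fasta(io.StringIO(buffer.getvalue()))))
```

Keeping the reader and writer symmetric makes it easy to check that assembled contigs survive a round trip unchanged before they are handed to the search engine.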
Output Data:
- Mascot output and nucleotide sequences (contigs) will be modified to make them compatible with the Integrative Genomics Viewer
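One way to picture this conversion is to map each identified peptide back to nucleotide coordinates on its contig and emit a track line that IGV can load. The sketch below uses the BED format purely because it is simple and supported by IGV; the project's own conversion may target a different format. The `PeptideHit` fields are hypothetical, parsing of Mascot result files is not shown, and only forward-strand ORFs without introns are handled.

```python
from typing import NamedTuple


class PeptideHit(NamedTuple):
    contig: str         # contig or isotig name, used as the BED "chrom" field
    orf_start: int      # 0-based nucleotide offset of the ORF's first codon
    protein_start: int  # 0-based index of the peptide's first residue in the ORF translation
    peptide: str        # peptide sequence reported by the search engine
    score: float        # identification score (hypothetical scale)


def peptide_to_bed(hit: PeptideHit) -> str:
    """Return one tab-separated BED6 line for a forward-strand peptide hit."""
    # Each amino acid spans three nucleotides; BED is 0-based, end-exclusive.
    start = hit.orf_start + 3 * hit.protein_start
    end = start + 3 * len(hit.peptide)
    # BED scores are integers in the range 0..1000.
    bed_score = max(0, min(1000, int(round(hit.score))))
    return "\t".join(
        [hit.contig, str(start), str(end), hit.peptide, str(bed_score), "+"]
    )
```

For example, a two-residue peptide starting at residue index 1 of an ORF that begins at nucleotide offset 2 covers nucleotides 5 to 11 of the contig. Reverse-strand ORFs and peptides that span splice junctions would need additional coordinate handling.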
High Level Software Functionality:
- The Integrative Genomics Viewer (IGV) (http://www.broadinstitute.org/igv/)
- Open-source software implemented in Java
- Custom scripts for data parsing and integration