Project promotion materials:

Project Homepage:

Data collections can be seen on:

Software is available at:

Project Members:

Georgina Edwards

Prof. Marc Wilkins

Carlos Aya

Schemek Pochopien

Dr. Ignatius Pang

Aidan Tay

ANDS Contact:

Mingfang Wu

Project Status:


Validation of genomes and transcriptomes with proteomic data (The Proteomic-Genomic Nexus)

University of New South Wales

Project Description

Research and/or policy focus

This project concerns genomics, transcriptomics and proteomics analyses and addresses the issue of how they can be integrated. For genomics and transcriptomics, while it is routine to achieve good coverage of the genome or transcriptome with next-generation sequencing, it is difficult to assess the accuracy of these assembled sequences. Proteomics analysis offers complementary results as it is routine to gain accurate data from advanced mass spectrometers, but it is difficult to get comprehensive coverage of the proteins. This project will integrate genomics, transcriptomics and proteomics data to cross-validate the assembled gene sequences and proteins / peptides, and will co-visualize the results in a specialized viewer to confirm the expression and processing of genes.

Project impetus and drivers

To address the challenge of integration and cross-validation, we pioneered a method in which proteomic analysis is used to validate the quality of the assembled sequences. This involves identifying peptides that match novel genes in order to validate their existence. The peptides are also used to validate proteins that are encoded by differentially spliced mRNA. A major innovation of the pipeline is that it enables biologists to co-visualize proteomics data with next-generation sequencing data to assist with the validation of alternatively spliced mRNA. This is important because alternatively spliced mRNAs encode a variety of proteins that serve diverse biological functions, for example, the development of cells into different types of tissues.

Before the Proteomic-Genomic Nexus (PG Nexus) project, there was no efficient means of integrating genomics data generated from next-generation sequencing with proteomics data generated from protein mass spectrometry. A set of tools has now been developed that allows users to:

Data Type:

Input Data:
- Peptide and nucleotide data will be integrated
- Nucleotide sequences (e.g. DNA and RNA)
  - Gene sequences, and messenger RNA and their alternatively spliced forms (represented as contigs or isotigs)
  - Generated and assembled from next-generation (deep) sequencing machines
  - FASTA format is required for downstream compatibility with the Mascot protein identification engine
- Peptide sequences (fragments of proteins)
  - Peak lists representing fragmented peptides
  - Generated from tandem mass spectrometers
  - Validate the existence of proteins and their alternatively spliced forms
  - mzML format (or other equivalent format) for use in the Mascot identification engine

Output Data:
- Mascot output and nucleotide sequences (contigs) will be modified to make them compatible with the Integrative Genomics Viewer
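
To illustrate the kind of integration described above, the following is a minimal sketch of how an identified peptide might be placed on a contig and written as a BED feature that a genome viewer such as IGV can display. It is not the PG Nexus implementation: the function names (`locate_peptide`, `to_bed_lines`) and the contig name are hypothetical, and an exact string match against a six-frame translation stands in for the peptide identification that the pipeline obtains from Mascot.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
_BASES = "TCAG"
_AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): _AMINO[i]
    for i, codon in enumerate(product(_BASES, repeat=3))
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def locate_peptide(contig, peptide):
    """Find exact peptide matches in all six reading frames.

    Returns a list of (start, end, strand) tuples in 0-based, half-open
    coordinates on the forward strand of the contig.
    """
    hits = []
    n = len(contig)
    span = 3 * len(peptide)
    for strand, seq in (("+", contig), ("-", reverse_complement(contig))):
        for frame in range(3):
            protein = translate(seq[frame:])
            idx = protein.find(peptide)
            while idx != -1:
                start = frame + 3 * idx
                end = start + span
                if strand == "-":
                    # Convert reverse-strand offsets to forward coordinates.
                    start, end = n - end, n - start
                hits.append((start, end, strand))
                idx = protein.find(peptide, idx + 1)
    return hits


def to_bed_lines(contig_name, peptide, hits):
    """Format hits as six-column BED lines (score fixed at 0)."""
    return [
        f"{contig_name}\t{start}\t{end}\t{peptide}\t0\t{strand}"
        for start, end, strand in hits
    ]


if __name__ == "__main__":
    # Hypothetical contig: two leading bases, then codons for M-K-P-G-F, then a stop.
    contig = "CCATGAAACCCGGGTTTTAA"
    for line in to_bed_lines("contig_1", "MKPGF", locate_peptide(contig, "MKPGF")):
        print(line)
```

A peptide that falls in any of the six frames supports the existence of the encoded protein at that location. A peptide spanning a splice junction would not be found by this simple contiguous search, so such cases need matching against the spliced transcript (isotig) sequence instead.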

High Level Software Functionality:

- The Integrative Genomics Viewer (IGV)
- Open-source software implemented in Java
- Custom scripts for data parsing and integration