Data collections can be seen on:

http://services.ands.org.au/home/orca/rda/search#!/group=RMIT%20University/tab=collection

Instruments where data are captured/transferred from:

From high-performance computing facilities

Software is available at:

http://hpctardis.googlecode.com

Software categories:

Metadata Store Solutions

Metadata Feed/Harvest/Publish

Data Repository Solutions

Data Access Control

Project Members:

Ravi Sreenivasamurthy (Project Manager, ravi.sreenivasamurthy@rmit.edu.au)

Heinz Schmidt (Project Manager, heinrich.schmidt@rmit.edu.au)

Venki Balasubramanian (Lead Developer, venki.balasubramanian@rmit.edu.au)

Ian Edward Thomas (Lead Developer, Ian.Edward.Thomas@rmit.edu.au)

Salvy Russo (Research Officer, salvy.russo@rmit.edu.au)

ANDS Contact:

Mingfang Wu (Mingfang.Wu@ands.org.au)

Project Status:

Completed!

Data Capture from High Performance Computing Multi-User Environments

RMIT University

Project Description:

RMIT is a very large user of state and national High Performance Computing (HPC) facilities. It is currently the third largest user of the National Computational Infrastructure (NCI) Supercomputer Facility (based on 2008/2009 NCI Merit Allocations) and the third largest user of the Victorian Partnership for Advanced Computing (VPAC) Supercomputer facility (based on the latest usage statistics). A significant proportion of the data generated using these facilities is used to support the aims and objectives of National Competitive Grants from bodies such as the Australian Research Council (ARC) and the National Health and Medical Research Council (NHMRC). In addition, the researchers generating this data are expected to conform to the principles outlined in the NHMRC/ARC/UA Australian Code for the Responsible Conduct of Research (RCR) 2007. In broader terms, this ANDS data curation project is located in the research fields of condensed matter physics and theoretical and computational chemistry.



Problem Statement



RMIT is a heavy user of physics and chemistry simulation packages such as VASP, CRYSTAL, SIESTA, and GULP, running on a number of HPC facilities. The datasets generated by these packages range from a few GB to tens of GB each. The community that uses these packages is expected to produce over 100 datasets per year. Each dataset typically consists of input and output files. In addition, a simulation run takes from a few hours to several hundred hours on a large supercomputing cluster. This project consolidates the offsite and onsite support so that these datasets are curated in institutional (RMIT) research repositories and satisfy the data retention responsibilities of the university under the RCR Code.



Proposed Solution



The software developed in this project would interface with the HPC facilities and assist researchers in collecting, managing and storing their data. It would also facilitate subsequent retrieval and reuse of the generated data, enabling researchers to curate the data from their simulation runs. The curation process creates a domain-specific metadata registry and also provides connector software for publishing ANDS-specific metadata (RIF-CS) for the generated datasets to the ANDS research repositories.



Developed Solution



The developed software consists of two major modules: the data curation scripts installed at the HPC facilities and the data curation web portal installed at RMIT University. Researchers run the installed script at a facility to automatically transfer the selected datasets for a particular simulation package. The transferred datasets are stored in the RMIT repository through the web portal. With minimal user interaction, the web portal creates the necessary information for the researchers and generates metadata both for the researchers and for ANDS harvesting.



The developed software is expected to be well received by researchers because both modules give the researcher full control over curating the datasets and over making them public. Once initiated, the two modules also work together seamlessly with minimal user intervention.
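
As an illustration of the selection step that a facility-side curation script performs before transfer, the following is a minimal sketch only, not the HPCTardis scripts themselves. The function name `build_manifest`, the `PACKAGE_PATTERNS` table and the manifest layout are hypothetical; the file names listed for each package are common conventions and are assumed here rather than taken from the project.

```python
import hashlib
from pathlib import Path

# Hypothetical mapping from simulation package to the files worth capturing.
# These patterns are illustrative assumptions, not the project's actual rules.
PACKAGE_PATTERNS = {
    "VASP": ["INCAR", "POSCAR", "KPOINTS", "OUTCAR"],
    "SIESTA": ["*.fdf", "*.out"],
}


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that multi-GB outputs are never fully loaded."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(run_dir, package):
    """List the files of one simulation run that match the package's patterns."""
    run_dir = Path(run_dir)
    entries = []
    for pattern in PACKAGE_PATTERNS[package]:
        for path in sorted(run_dir.glob(pattern)):
            if path.is_file():
                entries.append({
                    "name": path.name,
                    "size": path.stat().st_size,
                    "sha256": sha256_of(path),
                })
    return entries
```

A manifest of this kind lets the receiving side verify each file after transfer by recomputing its checksum; how the actual scripts verify transfers is not described in this document.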



Extended Functionalities in HPCTardis



HPCTardis, the modified and extended version of myTardis, has the following features:

- HPCTardis is equipped with a newly developed protocol that creates experiments automatically in the HPCTardis web portal. The protocol also transfers datasets from the HPC facilities to the institutional repository (the HPCTardis store).

- The extended functionalities in HPCTardis communicate seamlessly with Unix servers using Unix scripts.

- HPCTardis can extract metadata from four simulation packages: SIESTA, CRYSTAL, VASP and GULP.

- HPCTardis includes functions that produce ANDS-specific metadata. These functions can also generate RIF-CS dynamically, with related party, activity and collection records that can be harvested automatically using OAI-PMH.
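
To make the RIF-CS feature concrete, here is a minimal sketch of building a single collection record that points to a related party record. It is not the HPCTardis implementation: the function `build_collection_record`, its parameters and the example keys are hypothetical, the record is deliberately simplified, and the output is not validated against the RIF-CS schema.

```python
import xml.etree.ElementTree as ET

# RIF-CS registry objects namespace.
RIFCS_NS = "http://ands.org.au/standards/rif-cs/registryObjects"


def build_collection_record(key, group, source, title, description, party_key):
    """Return a simplified RIF-CS collection record as an XML string."""
    ET.register_namespace("", RIFCS_NS)

    def q(tag):
        # Qualify a tag name with the RIF-CS namespace.
        return "{%s}%s" % (RIFCS_NS, tag)

    root = ET.Element(q("registryObjects"))
    obj = ET.SubElement(root, q("registryObject"), {"group": group})
    ET.SubElement(obj, q("key")).text = key
    ET.SubElement(obj, q("originatingSource")).text = source

    collection = ET.SubElement(obj, q("collection"), {"type": "dataset"})
    name = ET.SubElement(collection, q("name"), {"type": "primary"})
    ET.SubElement(name, q("namePart")).text = title
    ET.SubElement(collection, q("description"), {"type": "brief"}).text = description

    # Link the collection to a party record (for example, the researcher).
    related = ET.SubElement(collection, q("relatedObject"))
    ET.SubElement(related, q("key")).text = party_key
    ET.SubElement(related, q("relation"), {"type": "isManagedBy"})

    return ET.tostring(root, encoding="unicode")


# Example with placeholder values only.
record_xml = build_collection_record(
    key="example.org/collection/1",
    group="RMIT University",
    source="http://example.org/hpctardis",
    title="Example simulation dataset",
    description="Input and output files of one simulation run.",
    party_key="example.org/party/1",
)
```

In a harvesting setup, records like this would be served through an OAI-PMH endpoint so that the registry can collect them on a schedule; that endpoint is outside the scope of this sketch.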

High Level Software Functionality:

a. Captures metadata from microscopy images generated on various microscopy instruments (see instrument list attached)
b. Imports microscopy images in a standard format (OME-XML and/or OME-TIFF) and maintains provenance of metadata and image data
c. Allows users to browse and manage their image collections using a web-based interface
d. Allows users to share their image collections
e. Allows users to annotate their images and image collections
f. Allows users to search image collections on metadata attributes
g. Allows users to flag images and/or image collections for publication
h. Generates and sends “published” image collection metadata (RIF-CS) to Griffith's Metadata Exchange hub for publication to RDA
i. Automates backup and archiving of image data and metadata
j. Serves published images and image collections for external (open) access either independently or consistent with image collection URIs provided to RDA

ANZSRC-FOR code:

02 PHYSICAL SCIENCES