METABOLOMICS CONSORTIUM COORDINATING CENTER (M3C)
Contact PI: Richard A. Yost, University of Florida
The Metabolomics Consortium Coordinating Center (M3C) is proposed as the Stakeholder Engagement and Program Coordinating Center (SEPCC) for stage 2 of the Common Fund Metabolomics Program of the National Institutes of Health. The M3C will operate in conjunction with the Southeast Center for Integrated Metabolomics (SECIM), one of six stage 1 Regional Comprehensive Metabolomics Resource Cores (RCMRCs) founded in 2013, and now a self-supporting metabolomics service center. The overarching goal of the M3C will be the promotion of metabolomics as a key component of biomedical research (basic, clinical, and translational) and clinical care. The mission of the M3C is to serve as a catalyst for the advancement of metabolomics in biomedical research and clinical care by engaging the diverse range of stakeholders, organizing the consortium, and promoting its work. Stakeholder engagement will include bi-annual symposia to identify roadblocks in the use of metabolomics and suggest approaches leading to remediation. Pilot and Feasibility awards will be focused on biomedical research projects new to the use of metabolomics. Consortium work will be organized through a web portal providing access to all consortium resources, including access to datasets in the National Metabolomics Data Repository, and tools developed by the Metabolomics Data Analysis and Interpretation Tools awardees. M3C further organizes the work of the consortium by facilitating on-line and in-person meetings of governance groups and workgroups. M3C promotes the work of the consortium through social media, its web portal, presentations at scientific meetings, and distribution of consortium standards, policies, procedures, protocols, best practices, and guidelines. M3C develops a strategic plan for the consortium, generates metrics regarding the activities of the consortium, participates in evaluation, and creates the consortium annual report. 
The work of the coordinating center, and of the stage 2 metabolomics consortium, will lead to improved human health through basic research findings, improved laboratory practices, and translation to clinical research and care.
NATIONAL METABOLOMICS DATA REPOSITORY NEXTGEN METABOLOMICS WORKBENCH
Contact PI: Shankar Subramaniam, University of California, San Diego
The primary objective of this proposal is the establishment of a robust, cloud-based National Metabolomics Data Repository founded on the FAIR principles (findable, accessible, interoperable, and reusable), with scalability, extensibility, and portability. Over the past six years, we have developed the first version of such a repository, known as the Metabolomics Workbench (MW), and our overarching effort through this proposal will be to develop the next-generation MW as follows:
– Support a state-of-the-art database infrastructure based on an open-source relational database, PostgreSQL, which will house all metadata and data pertaining to metabolomics.
– Establish community-acceptable metadata and data standards and develop newer standards, where lacking, with the aid of the expert community and other stakeholders.
– Provide multiple easy-to-use interfaces with multiple format options for researchers to enter all metadata and data associated with a metabolomics study.
– Develop interfaces and tools for easy querying and analysis of the data, along with Application Programming Interfaces (APIs) to add to and extend existing tools and interfaces.
– Generate, in consultation with the SEPCC, best practice protocols and APIs for tool integration.
– Coordinate with the Metabolomics Steering Committee, the Governing Board, the SEPCC, and other stakeholders to ensure that the broad goals of the NIH Metabolomics Consortium are achieved.
– Formulate mechanisms for large-scale community participation in burgeoning metabolomics resources and communicate to the larger biomedical community the value of using metabolomics data and tools for research.
The proposed NMDR will have four cores: a) the Admin Core, which will be responsible for overall coordination, including the administration of all the NMDR cores and the establishment of guidelines for coordination with all other stakeholders; b) the Data Repository Core, which will house all metabolomics data, provide a large suite of tools and interfaces for querying, analyzing, and displaying the data on a web portal known as the nextgen MW, and encourage the broader community to deposit data; c) the Governance Core, which will, along with the SEPCC, establish a body of eight experts (four recommended by the PI) to provide in-depth guidance to the NMDR on the choice of formats, protocols, and tools; and d) an additional core, called the Data Services Core, which will lay the groundwork for long-term stability. This forward-looking Core will equip the NMDR for the future by porting the nextgen MW into a hybrid cloud environment and developing a community-supported container technology for cloud-based metabolomics analysis tools.
Compound Identification Cores (CIDCs)
GENETICS AND QUANTUM CHEMISTRY AS TOOLS FOR UNKNOWN METABOLITE IDENTIFICATION
Contact PI: Arthur S. Edison, University of Georgia
Overall: Our project combines the significant advantages of a genetic model organism, sophisticated pathway mapping tools, high-throughput and accurate quantum chemistry (QM), and state-of-the-art experimental measurements. The result will be an efficient and cost-effective approach for unknown compound identification in metabolomics, which is one of the major limitations facing this growing field of medical science. Caenorhabditis elegans has several advantages for this study, including over 10,000 available genetic mutants, well-developed CRISPR/Cas9 technology, and a panel of over 500 wild C. elegans isolates with complete genomes. Half of C. elegans genes have homologs to human disease genes, making this model organism an outstanding choice to improve our understanding of metabolic pathways in human disease. We will develop an automated pipeline for sample preparation to reproducibly measure tens of thousands of unknown features by UHPLC-MS/MS. We will use the wild isolates to conduct metabolome-wide genetic association studies (m-GWAS), and SEM-path to locate unknowns in pathways using partial correlations. The relevance of the unknown metabolites to specific pathways will be tested by measuring UHPLC-MS/MS data from genetic mutants of those pathways. Molecular formula and pathway information will be the inputs for automated quantum mechanical calculations of all possible structures, which will be used to accurately calculate NMR chemical shifts that will be matched to experimental data. The correct structures will be validated by comparing them with 2D NMR data of the same compound. The validated computed structures will then be used to improve QM-based MS/MS fragment prediction, using the experimental UHPLC-MS/MS data. This project will enhance many areas of science beyond worms and model organisms. First, C. elegans is the simplest animal model available with significant homology to other animals and humans. 
The discoveries we make in metabolic pathways will have a direct impact on studies of several human diseases. Second, our approach is highly transferable to other genetic systems and, with little modification, can be adapted to many other applications. Perhaps most important is the relevance to large-scale human precision medicine studies. The wild C. elegans isolates are “individuals” with diverse genomes that are a model for natural populations such as humans. It is true that we are using mutant animals that would not be available in a human precision medicine study, but the mutants are used primarily to validate pathways that are constructed entirely from wild isolate data. Once the approaches are fully developed and validated, the mutants will not be necessary. C. elegans and other genetic model organisms were instrumental in the development of modern genomics and DNA sequencing technologies. Our premise is that the worm will have a comparable impact in metabolomics.
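The chemical-shift matching step in this abstract (computed NMR shifts for candidate structures compared against experimental data) can be illustrated with a minimal sketch. This is not the project's pipeline: the function names, the candidate labels, and the choice of mean absolute error over sorted, unassigned shift lists are all assumptions made here for illustration.

```python
# Illustrative sketch only; scoring rule and names are hypothetical.

def shift_mae(predicted, experimental):
    """Mean absolute error (ppm) between two shift lists of equal length.

    Peaks are unassigned, so both lists are sorted before pairing."""
    if len(predicted) != len(experimental):
        raise ValueError("shift lists must have the same length")
    pairs = zip(sorted(predicted), sorted(experimental))
    return sum(abs(p - e) for p, e in pairs) / len(predicted)


def rank_candidates(candidates, experimental):
    """Return candidate names ordered from best to worst agreement.

    candidates: dict mapping a candidate name to its computed shifts."""
    return sorted(candidates, key=lambda name: shift_mae(candidates[name], experimental))
```

With experimental 13C shifts of `[20.0, 60.0, 170.0]`, a candidate predicted at `[21.0, 61.0, 171.0]` would rank ahead of one predicted at `[25.0, 70.0, 150.0]`.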
MICHIGAN COMPOUND IDENTIFICATION DEVELOPMENT CORES (MCIDC)
Contact PI: Alexey Nesvizhskii, University of Michigan
Overall – Project Summary As a member of the NIH Common Fund Metabolomics Consortium, the Michigan Compound Identification Development Core (MCIDC) will use cutting-edge computational and experimental methods to systematically identify metabolites among the high proportion of features in untargeted metabolomics data that are presently considered unknown. In so doing, we will address a long-standing challenge in the field of metabolomics and enhance biological insights from extant and future metabolomics data. Our data will greatly contribute to platform-agnostic, rapidly searchable metabolite databases, and the methods we develop will facilitate future compound identification efforts. We will achieve these goals by carrying out the following aims: Through the computational core of MCIDC, we will refine software currently operational in our lab that aids in annotation of features in untargeted metabolomics data as either primary features or as artifacts or degenerate features (e.g., isotopes, fragments, adducts, contaminants). This software will help prioritize identification efforts on primary features, while allowing artifacts and degenerate features to be indexed and rapidly removed from future data sets. We will implement a ‘hybrid search’ approach that will allow unknown metabolite spectra to be searched against both in-silico and experimentally derived spectra of compounds with similar structural motifs. We expect this approach will improve the certainty of metabolite identification compared to in-silico spectra alone. We will contribute our data output to the National Metabolomics Data Repository and other databases. Through the experimental core of MCIDC, we will develop and implement novel and cutting-edge analytical technologies to aid in compound identification, and will systematically apply these techniques to unknown primary features in metabolomics data determined to be of high priority based on a survey of public metabolomics databases. 
Techniques we will use to identify metabolites include high-resolution tandem mass spectrometry (MSn), ion mobility spectrometry, high-resolution chromatographic methods including ultra-high pressure liquid chromatography, sample pre-fractionation and multidimensional separations, in-vivo stable isotope labeling for structural elucidation, chemical derivatization, pre-concentration followed by NMR analysis, and (when necessary) synthesis and characterization of novel metabolite standards. Finally, through our administrative core, we will ensure coordinated operation between our own experimental and computational cores, and with other members of the NIH Common Fund Metabolomics Consortium. By coordinating between CIDC sites and prioritizing compound identification tasks as a group, we will maximize productivity and improve the outcomes of the metabolomics consortium efforts. By carrying out these aims, we anticipate that our CIDC will yield a lasting, unifying impact on interpretation of biological findings from the rich and growing datasets yielded by untargeted metabolomics.
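The distinction between primary and degenerate features described in this abstract can be illustrated with a small sketch. It is not the MCIDC software. It flags only one kind of degenerate feature, a likely singly charged 13C isotope peak, by looking for a co-eluting, more intense feature about 1.00336 Da lower in m/z. The function name, the tolerances, and the tuple layout are hypothetical.

```python
# Illustrative sketch only; not the MCIDC annotation software.

C13_DELTA = 1.003355  # mass difference between 13C and 12C, in Da


def flag_c13_isotopes(features, mz_tol=0.002, rt_tol=0.05):
    """Return indices of features that look like M+1 isotope peaks (z = 1).

    features: list of (mz, retention_time_min, intensity) tuples.
    A feature is flagged when another feature co-elutes with it, lies one
    13C-12C mass difference lower in m/z, and is more intense (the usual
    case for small molecules)."""
    flagged = set()
    for i, (mz_i, rt_i, int_i) in enumerate(features):
        for j, (mz_j, rt_j, int_j) in enumerate(features):
            if i == j:
                continue
            co_eluting = abs(rt_i - rt_j) <= rt_tol
            mass_match = abs((mz_i - mz_j) - C13_DELTA) <= mz_tol
            if co_eluting and mass_match and int_i < int_j:
                flagged.add(i)
    return flagged
```

A fuller annotator would apply the same pairwise-difference idea to adducts and in-source fragments, each with its own expected mass shift.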
WEST COAST METABOLOMICS CENTER FOR COMPOUND IDENTIFICATION
Contact PI: Oliver Fiehn, University of California at Davis
Project Summary – Overall West Coast Metabolomics Center for Compound Identification (WCMC) The West Coast Metabolomics Center for Compound Identification (WCMC) is committed to the overall goals of the NIH Common Fund Metabolomics Initiative and specifically aims to substantially improve small molecule identification. Understanding metabolism is important for gaining insight into biochemical processes and relevant to battling diseases such as cancer, obesity, and diabetes. Compound identification in metabolomics is still a daunting task with many unknown compounds and false positive identifications. The major goal of the WCMC is therefore to develop processes and resources that accelerate and improve the accuracy of the compound identification workflow for experts and medical professionals. The WCMC for Compound Identification is structured as three entities: the Administrative Core, the Computational Core, and the Experimental Core. The Center is led by the Director, Prof. Fiehn, in close collaboration with quantum chemistry experts Prof. Wang and Prof. Tantillo, and metabolomics experts Dr. Barupal and Dr. Kind, with broad support from mass spectrometry, computational metabolomics, and programming experts. The Administrative Core will assist the Computational and Experimental Cores in developing and validating large in-silico mass spectral libraries, retention time prediction models, and innovative methods for constraining and ranking lists of isomers in an integrated process of cheminformatics tools and databases. The developed tools and databases will be made available to all Common Fund Metabolomics Consortium (CF-MC) members and professional working groups. The WCMC will also provide guidance for compound identification to the National Metabolomics Data Repository. 
The broad dissemination of compound identification protocols, training for compound identification workflows, databases, and internal reference standard kits for metabolomic standardization will broadly support the metabolomics community.
MEGA-SCALE IDENTIFICATION TOOLS FOR XENOBIOTIC METABOLISM
Contact PI: Dean Paul Jones, Emory University
Project Summary Human evolution has created complex metabolism systems to transform and eliminate potentially harmful chemicals to which we are exposed. Available evidence indicates that these systems generate a million or more different chemical metabolites, most of which are completely uncharacterized. Widespread use of mass spectrometry-based metabolomics methods shows that many unidentified mass spectral features are significantly associated with human diseases. Substantial epidemiological research implicates environmental contributions to many disease processes, and we believe that many of the unidentified mass spectral features are metabolites of environmental chemicals. We have an established and successful human exposome research center focused on improving the understanding of environmental contributions to disease. The present proposal is to build upon this foundation to develop powerful new chemical identification tools that can be scaled to identify hundreds of thousands of foreign chemical metabolites in the human body. We have assembled an exposome research team of analytical scientists with expertise in mass spectrometry, xenobiotic metabolism, computational chemistry and robotic methods, to develop and test new chemical identification tools to identify hundreds of thousands of foreign chemical metabolites. Our approach relies upon expertise in 1) computational chemistry to predict possible xenobiotic metabolites, respective adduct forms and ion dissociation patterns in mass spectrometry, 2) use of enzymatic and cellular xenobiotic biotransformation systems, which allows creation of multi-well panels containing specific biotransformation systems to generate xenobiotic metabolites, 3) ion fragmentation mass spectrometry and NMR spectroscopy methods to confirm chemical identities and 4) expertise with robotic systems which can be used to scale the approach to identify hundreds of thousands of metabolites of environmental chemicals. 
An Administrative Core will maintain an organizational structure and coordinate activities between the Experimental Core and the Computational Core, NIH, and the Stakeholder Engagement and Program Coordinating Center (SEPCC). The Experimental Core will develop and provide compound identification capability with ultra-high-resolution mass spectrometry support. The Computational Core will develop a predicted xenobiotic metabolite database to support metabolite identification. The Administrative Core will maintain interactions with the HERCULES Exposome Research Center and support interactions with prospective Core users. Milestones are established to monitor progress toward the goal of establishing tools for compound identification that can be scaled to identify hundreds of thousands of foreign chemical metabolites. The results will catalyze metabolomics research by providing new ways to identify unknown metabolites of environmental chemicals, and will also support identification of a broader range of metabolites of drugs, food, microbiome, dietary supplements, and commercial products.
PACIFIC NORTHWEST ADVANCED COMPOUND IDENTIFICATION CORE
Contact PI: Thomas O. Metz, Pacific Northwest National Laboratory
OVERALL SUMMARY The capability to chemically identify thousands of metabolites and other chemicals in clinical samples will revolutionize the search for environmental, dietary, and metabolic determinants of disease. In contrast to near-comprehensive genetic information, comparatively little is understood of the totality of the human metabolome, largely due to insufficiencies in molecular identification methods. Through innovations in computational chemistry and advanced ion mobility separations coupled with mass spectrometry, we propose to overcome a significant, long-standing obstacle in the field of metabolomics: the absence of methods for accurate and comprehensive identification of metabolites without relying on data from analysis of authentic chemical standards. In a paradigm shift for metabolomics, we will use gas-phase molecular properties that can be both accurately predicted computationally and consistently measured experimentally, and which can thus be used for comprehensive identification of the metabolome without the need for authentic chemical standards. The outcomes of this proposal directly advance the mission and goals of the NIH Common Fund by: (i) transforming metabolomics science by enabling consideration of the totality of the human metabolome through optimized identification of currently unidentifiable molecules, eventually reaching hundreds of thousands of molecules, and (ii) developing standardized computational tools and analytical methods to increase the national capacity for biomedical researchers to identify metabolites quickly and accurately. This work is significant because it enables comprehensive and confident chemical measurement of the metabolome. This work is innovative because it utilizes an integrated quantum-chemistry and machine learning computational pipeline to accurately predict physical-chemical properties of metabolites coupled to measurements.
Data and Tools Cores (DTCs)
COMPUTATIONAL TOOLS FOR ANALYSIS AND VISUALIZATION OF QUALITY CONTROL ISSUES IN METABOLOMIC DATA
Contact PI: John Weinstein, University of Texas MD Anderson Cancer Center
Abstract In omic studies of all types (e.g., genomic, transcriptomic, proteomic, metabolomic), technical batch effects pose a fundamental challenge to quality control and reproducibility. The possibilities for serious error are greatly magnified in metabolomics, however, due to a range of possible platform, operator, instrument, and environmental factors that can cause batch (or trend) effects. Hence, there is a need for routine surveillance and correction of batch effects within and across metabolomics laboratories and technological platforms. Accordingly, we propose here to develop the MetaBatch algorithms, computational tool, and web portal. For development of MetaBatch, we will leverage our experience in developing MBatch, a tool that became indispensable for quality control of data in all 33 projects of The Cancer Genome Atlas (TCGA) program. Our first aim is to translate the successful quality control model from TCGA to metabolomics by customizing and extending the MBatch pipeline for detection, quantitation, diagnosis, interpretation, and correction of batch and trend effects. The second aim is to develop and incorporate innovative metabolomics-specific algorithms, including major visualization resources such as our interactive Next-Generation Clustered Heat Maps. The third aim is to distribute MetaBatch to the research community as open-source software and in cloud-based and Galaxy versions. The fourth aim is to provide plug-in capability for integration of MetaBatch with other metabolomic resources, prominently including Metabolomics Workbench (in collaboration with Dr. Shankar Subramaniam) and others developed within the Common Fund Metabolomics Program. Our fifth aim is to promote MetaBatch actively and interact extensively with other Consortium members and the metabolomics research community. 
With active support from MD Anderson Faculty and Academic Development, we will provide documentation, tutorials, videos, demonstrations, and training to accelerate use and to solicit feedback on limitations, possible improvements, and additional modules that would be useful in real-world workflows. We bring a variety of assets to the project, including: the MBatch resource as a starting point for software development; multidisciplinary expertise in bioinformatics, biostatistics, software engineering, biology, and clinical medicine; PIs with a combined 21 years of experience in molecular profiling studies of clinical disease (in a consortial context); international leadership in batch effects analysis; a software engineering team with a track record of producing high-end, highly visual bioinformatics packages and websites; a team of 20 Analysts whose expertise can be called on; extensive computing resources, including one of the most powerful academically based machines in the world; strong institutional support; and close working relationships with first-class basic, translational, and clinical researchers throughout MD Anderson, one of the foremost cancer centers in the country. Our bottom-line mission will be to aid the research community’s effort to improve rigor and reproducibility in metabolomics for scientific understanding and to alleviate disease.
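To make the idea of batch-effect correction concrete, the sketch below applies per-batch median centering to one metabolite. This is a generic textbook technique chosen here for illustration, not the MBatch or MetaBatch algorithm, and it assumes the batches have comparable biological composition. The function name and input layout are hypothetical.

```python
# Illustrative sketch only; median centering is a stand-in technique,
# not the MBatch/MetaBatch method.
from collections import defaultdict
from statistics import median


def median_center_by_batch(values, batches):
    """Shift each batch so its median equals the overall median.

    values:  log-scale intensities of one metabolite, one per sample.
    batches: batch label for each sample, in the same order."""
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    overall = median(values)
    batch_median = {b: median(vs) for b, vs in by_batch.items()}
    return [v - batch_median[b] + overall for v, b in zip(values, batches)]
```

After correction, every batch shares the same median, so any remaining between-batch difference in spread or trend would need a different method.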
TOOLS FOR LEVERAGING HIGH-RESOLUTION MS DETECTION OF STABLE ISOTOPE ENRICHMENTS TO UPGRADE THE INFORMATION CONTENT OF METABOLOMICS DATASETS
Contact PI: Jamey Young, Vanderbilt University
PROJECT SUMMARY/ABSTRACT Recent advances in high-resolution mass spectrometry (HRMS) instrumentation have not been fully leveraged to upgrade the information content of metabolomics datasets obtained from stable isotope labeling studies. This is primarily due to lack of validated software tools for extracting and interpreting isotope enrichments from HRMS datasets. The overall objective of the current application is to develop tools that enable the metabolomics community to fully leverage stable isotopes to profile metabolic network dynamics. Two new tools will be implemented within the open-source OpenMS software library, which provides an infrastructure for rapid development and dissemination of mass spectrometry software. The first tool will automate tasks required for extracting isotope enrichment information from HRMS datasets, and the second tool will use this information to group ion peaks into interaction networks based on similar patterns of isotope labeling. The tools will be validated using in-house datasets derived from metabolic flux studies of animal and plant systems, as well as through feedback from the metabolomics community. The rationale for the research is that the software tools will enable metabolomics investigators to address important questions about pathway dynamics and regulation that cannot be answered without the use of stable isotopes. The first aim is to develop a software tool to automate data extraction and quantification of isotopologue distributions from HRMS datasets. The software will provide several key features not included in currently available metabolomics software: i) a graphical, interactive user interface that is appropriate for non-expert users, ii) support for native instrument file formats, iii) support for samples that are labeled with multiple stable isotopes, iv) support for tandem mass spectra, and v) support for multi-group or time-series comparisons. 
The second aim is to develop a companion software tool that applies machine learning and correlation-based algorithms to group unknown metabolites into modules and pathways based on similarities in isotope labeling. The third aim is to validate the tools through comparative analysis of stable isotope labeling in test standards and samples from animal and plant tissues, including time-series and dual-tracer experiments. A variety of collaborators and professional working groups will be engaged to test and validate the software, and the tools will be refined based on their feedback. The proposed research is exceptionally innovative because it will provide the advanced software capabilities required for both targeted and untargeted analysis of isotopically labeled metabolites in a flexible and user-friendly environment. The research is significant because it will contribute software tools that automate and standardize the data processing steps required to extract and utilize isotope enrichment information from large-scale metabolomics datasets. This work will have an important positive impact on the ability of metabolomics investigators to leverage information from stable isotopes to identify unknown metabolic interactions and quantify flux within metabolic networks. In addition, it will enable entirely new approaches to study metabolic dynamics within biological systems.
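The quantity the first tool would extract, an isotopologue distribution, can be shown with a short sketch. It normalizes the peak intensities of M+0 through M+n into fractions and derives a mean enrichment. It deliberately omits natural-abundance correction and does not use the OpenMS API. The function names are hypothetical.

```python
# Illustrative sketch only; no natural-abundance correction is applied.

def isotopologue_distribution(intensities):
    """Convert intensities of M+0, M+1, ..., M+n into fractional abundances."""
    total = sum(intensities)
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return [x / total for x in intensities]


def mean_enrichment(intensities):
    """Average fraction of labeled positions, given n = len(intensities) - 1
    labelable atoms."""
    fractions = isotopologue_distribution(intensities)
    n = len(fractions) - 1
    if n == 0:
        return 0.0
    return sum(i * f for i, f in enumerate(fractions)) / n
```

For intensities `[50, 30, 20]` the fractions are 0.5, 0.3, and 0.2, and the mean enrichment is (0.3 + 2 × 0.2) / 2 = 0.35.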
CROSS-PLATFORM AND GRAPHICAL SOFTWARE TOOL FOR ADAPTIVE LC/MS AND GC/MS METABOLOMICS DATA PREPROCESSING
Contact PI: Xiuxia Du, University of North Carolina Charlotte
Project Summary / Abstract Data preprocessing is critical for the success of any MS-based untargeted metabolomics study, as it is the first informatics step for making sense of the data. Despite the enormous contributions that existing software tools have made to metabolomics, errors in compound identification and relative quantitation still plague the field. This issue is becoming more serious as the sensitivity of LC/MS and GC/MS platforms continues to increase. Preprocessing involves peak detection, peak grouping and annotation for LC/MS or spectral deconvolution for GC/MS data, and peak alignment. Existing software tools invariably yield an immense number of false positive and false negative peaks, produce inaccurate peak groups, misalign detected peaks, and extract inaccurate relative quantitation information for metabolites. These errors can translate downstream into spurious or missing compound identifications and cause misleading interpretations of the metabolome. Furthermore, users need to specify a large number of parameters for existing software tools to work. Unfortunately, general users usually do not understand how to optimize these parameters, and maximizing one aspect (e.g., sensitivity) often has deleterious effects on another (e.g., specificity). We will address these challenges by developing more accurate algorithms for improving the rigor and reproducibility of data preprocessing. The proposed algorithms will be implemented in Java and integrated with the widely used MZmine 2, making the software cross-platform and user-friendly with rich visualization capabilities. In addition, the implementation will be optimized for memory efficiency and computing speed, allowing large-scale data preprocessing. Extensive testing of the software will be conducted in close collaboration with metabolomics core facilities and users around the world.
MUMMICHOG 3, ALIGNING MASS SPECTROMETRY DATA TO BIOLOGICAL NETWORKS
Contact PI: Shuzhao Li, Emory University
Abstract The mummichog software was initially published in 2013 as a computational approach to match patterns in metabolomics data to known biochemical networks, without the requirement of upfront metabolite identification. This approach enables rapid generation of biological hypotheses from untargeted data and has gained considerable popularity, which also creates an urgent need to upgrade the software itself. This proposal aims to add a rich user interface and better support for LC-MS, LC-MS/MS, IMS/MS, and GC-MS. Furthermore, this work will make a conceptual leap to establish a framework of network alignment as a vehicle to interpret metabolomics data by integrating multiple layers of information. The new development will be integrated into XCMS Online and MetaboAnalyst, and will be made freely available as modular software tools.
METHODS AND TOOLS FOR INTEGRATIVE FUNCTIONAL ENRICHMENT ANALYSIS OF METABOLOMICS DATA
Contact PI: Alla Karnovsky, University of Michigan
Abstract Modern analytical methods allow simultaneous detection of hundreds of metabolites, generating increasingly large and complex data sets. Analysis of metabolomics data is a multi-step process that requires a variety of bioinformatics and statistical tools. One of the biggest challenges in metabolomics is how alterations in metabolite levels can be linked to specific biological processes that are disrupted, contributing to the development of disease or reflecting the disease state. To address this challenge, we propose to develop methods and build computational tools to help researchers interrogate their metabolomics data and integrate them with other molecular phenotypes to build testable hypotheses and derive biological knowledge. Our team has extensive collaborative experience working together in the Phase I Common Fund-supported Regional Comprehensive Metabolomics Resource Core (MRC2) and in building computational methods and tools for the analysis of multi-dimensional omics data. We propose to build on our past efforts to develop a novel functional enrichment testing (FET) approach that will not be limited to compounds found in canonical metabolic pathways and will include both known and unknown metabolites in the analysis. We will leverage our previously developed methodology for building partial correlation networks, which allows identification of commonalities and differences in network structures derived from different experimental conditions. Exploring relationships between key metabolic changes and alterations in transcripts, proteins, and other molecular components can provide additional levels of information and help build biological insights from experimental data. The overarching goal of this proposal is to develop FET methods that would enable analysis of multi-condition, multi-layer omics data sets. 
To that end, we propose a network-based data integration strategy that will help uncover relationships both within and between different molecular layers, identify subnetworks containing metabolites, transcripts, etc., and test their significance. We anticipate that the application of our methods will lead to better insights into molecular networks affected by many complex diseases.
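The partial correlation networks mentioned in this abstract rest on a simple idea: two metabolites are linked only if they remain correlated after the influence of other variables is removed. The sketch below computes a first-order partial correlation (one conditioning variable) from Pearson correlations. It is a textbook formula used for illustration, not the authors' methodology, which would need regularized estimation for many variables. The function names are hypothetical.

```python
# Illustrative sketch only; first-order partial correlation, textbook formula.
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

If x and y are each driven by z plus independent noise, their raw correlation can be substantial while their partial correlation given z is near zero, so no edge would be drawn between them.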
ADDRESSING SPARSITY IN METABOLOMICS DATA ANALYSIS
Contact PI: Katerina Kechris, University of Colorado, Denver
Project Summary Comprehensive profiling of the small molecule repertoire in a sample is referred to as metabolomics, and is being used to address a variety of scientific questions in biomedical studies. Metabolomics offers more immediate measures of the physiology of an individual, and more direct examination of the effects of exposures such as nutrition, smoking, and bacterial infections. For human health, metabolomics studies are being used to investigate disease mechanisms, discover biomarkers, diagnose disease, and monitor treatment responses. Metabolomics is increasingly recognized as an important component of precision medicine initiatives to complement and enhance collected genomic data. This is critical because the metabolome cannot be predicted from knowledge of the genome, transcriptome, or proteome, but provides important information on the phenotype. Recent technological advances in mass spectrometry-based metabolomics have allowed for more comprehensive and sensitive measurements of metabolites. We focus on untargeted ultra-high pressure liquid chromatography coupled to mass spectrometry, which is one of the more commonly used methods. Despite the technological advances, the bottleneck for taking full advantage of metabolomics data is often the paucity and incompleteness of analytical tools and databases. Our goal is to develop novel statistical methods and software for the research community to improve the utilization of metabolomics data. There are many steps in a metabolomics data analysis pipeline, and we will focus on the downstream steps of normalization and univariate, multivariate, and pathway analyses. In particular, we will address the high levels of sparsity, which is one of the more distinctive aspects of metabolomics data compared to other -omics data sets. 
For metabolomics data, there is sparsity in individual metabolites due to a large percentage of missing data for biological or technical reasons, and sparsity in connections between metabolites due to high collinearity and sparsely connected networks in metabolic pathways. The methods and software we develop will maximize the potential of metabolomics to provide new discoveries in disease etiology, diagnosis, and drug development.
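As a rough illustration of the first kind of sparsity (missing values within individual metabolites), the sketch below computes the fraction of samples in which each metabolite has no recorded intensity and flags those above a cutoff. It is not the method proposed here: the function names, the toy data, the use of None to mark a missing measurement, and the 0.5 threshold are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def missing_fraction(
    intensities: Dict[str, List[Optional[float]]]
) -> Dict[str, float]:
    """Return, per metabolite, the fraction of samples with no recorded intensity.

    A missing measurement is represented by None (an assumption of this sketch).
    """
    fractions = {}
    for metabolite, values in intensities.items():
        if not values:
            raise ValueError(f"no samples recorded for {metabolite}")
        missing = sum(1 for value in values if value is None)
        fractions[metabolite] = missing / len(values)
    return fractions


def sparse_metabolites(
    intensities: Dict[str, List[Optional[float]]], threshold: float = 0.5
) -> List[str]:
    """List metabolites whose missing fraction is strictly above the threshold."""
    fractions = missing_fraction(intensities)
    return sorted(name for name, frac in fractions.items() if frac > threshold)


# Hypothetical toy data: three metabolites measured across four samples.
toy = {
    "metabolite_A": [10.2, 11.0, 9.8, 10.5],
    "metabolite_B": [None, 3.1, None, None],
    "metabolite_C": [None, None, 7.4, 7.9],
}

if __name__ == "__main__":
    for name, frac in missing_fraction(toy).items():
        print(f"{name}: {frac:.2f} missing")
    print("flagged as sparse:", sparse_metabolites(toy))
```

A summary like this only describes how much data is missing. It does not distinguish biological from technical causes of missingness, which is the harder problem the proposed methods are meant to address.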
A COMPREHENSIVE PLATFORM FOR HIGH-THROUGHPUT PROFILING OF THE HUMAN REFERENCE METABOLOME
Contact PI: Gary Patti, Washington University in St. Louis
Project Summary The last decade has seen two complementary trends: (i) technology to perform untargeted metabolomics with liquid chromatography/mass spectrometry (LC/MS) has become readily available to most investigators, and (ii) interest in metabolism has continued to heighten in many disparate research fields ranging from cancer and immunology to neuroscience and aging. Accordingly, the number of investigators who are acquiring untargeted metabolomic data with LC/MS is dramatically increasing. Yet, informatic tools to analyze the acquired data have lagged far behind and interpretation of the results remains a serious challenge, even for experienced users. Thus, there is a substantial number of investigators performing untargeted metabolomics with LC/MS who either cannot interpret the data generated or, even worse, are interpreting it incorrectly. When untargeted metabolomics is performed on a typical biological sample, it is common to detect thousands to tens of thousands of signals (aka features). Translating these signals into metabolite names is the biggest informatic barrier limiting biomedical applications of the technology. The process is arduous, particularly for inexperienced investigators, because the majority of signals detected do not correspond to non-redundant metabolites originating from the biological sample. Rather, most signals (up to 95% in some of our experiments) are due to complicating factors such as contaminants, artifacts, fragments, etc. Because many of these complicating signals are not currently in metabolomic databases such as METLIN, they can be challenging to annotate for inexperienced users. While there are software programs available to annotate the signals within the data, these tools are beyond the reach of most clinical and biological investigators because (i) they are not automated with a graphical user interface, and (ii) they rely on a costly experimental design involving isotopes to find contaminants and artifacts. 
We propose to develop an automated solution to name and quantify most of the metabolites detected in untargeted metabolomic LC/MS experiments. Our strategy is to assume the computational burden of completely annotating all detected metabolites in untargeted metabolomic data, which only has to be performed once for a given sample type, so that less-experienced investigators do not have to do so in their future experiments. We will completely annotate untargeted metabolomic data sets from different biological samples using the mz.unity software and credentialing technology developed by the Patti lab. Based on experiments that we have already performed, we expect to find ~5,000 unique bona fide metabolites per sample. We will then use these endogenous signals to develop targeted LC/MS methods that enable automated analysis of all detectable metabolites (i.e., the “reference metabolome”). This will allow investigators with minimal expertise in metabolomics to profile the unique and bona fide metabolites in their samples at an untargeted scale, but without the informatic barriers that have historically limited progress in the field.