Title: | Separate Metabolites into Likely Measurement Artifacts and True Metabolites |
---|---|
Description: | Split an untargeted metabolomics data set into a set of likely true metabolites and a set of likely measurement artifacts. This process involves comparing missing rates of pooled plasma samples and biological samples. The functions assume a fixed injection order of samples where biological samples are randomized and processed between intermittent pooled plasma samples. By comparing patterns of missing data across injection order, metabolites that appear in blocks and are likely artifacts can be separated from metabolites that seem to have random dispersion of missing data. The two main metrics used are: 1. the number of consecutive blocks of samples with present data and 2. the correlation of missing rates between biological samples and flanking pooled plasma samples. |
Authors: | Mark Chaffin |
Maintainer: | Mark Chaffin <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.1 |
Built: | 2025-03-13 04:01:17 UTC |
Source: | https://github.com/cran/MetProc |
Package to separate metabolites from an untargeted metabolomics experiment into likely artifacts versus likely true metabolites. The general strategy is to compare missing rates of pooled plasma samples and missing rates of biological samples across an injection order. With a randomized injection order for biological samples, metabolites that are present for only certain sections of the entire run (exhibiting a block structure) are likely artifacts, whereas metabolites with random patterns of missingness are likely true metabolites. The package uses two main metrics to separate metabolites and provides tools to plot patterns of missing data across injection order to visualize differences between likely artifacts and true metabolites. Details of the separation process and applied metrics can be found in the details section of met_proc.
Package: | MetProc |
Type: | Package |
Version: | 1.0 |
Date: | 2016-05-18 |
License: | GPL (>= 2) |
If the data is formatted appropriately (see sampledata for an example), you generally only need to call the read.met function followed by the met_proc function to output separate dataframes for likely true metabolites and likely measurement artifacts.
Mark Chaffin
Maintainer: Mark Chaffin <[email protected]>
library(MetProc)

#Read in metabolomics dataset
metdata <- read.met(system.file("extdata/sampledata.csv", package="MetProc"),
                    headrow = 3, metidcol = 1, fvalue = 8, sep = ",",
                    ppkey = "PPP", ippkey = "BPP")

#Separate likely artifacts from true signal using default settings
results <- met_proc(metdata, plot = FALSE)

#Separate likely artifacts from true signal using custom cutoffs and criteria
#Uses 5 groups of metabolites based on the pooled plasma missing rate, applies
#custom metric thresholds, sets the minimum pooled plasma missing rate to 0.05,
#sets the maximum pooled plasma missing rate to 0.95, sets the missing rate
#to consider a block of samples present at 0.6
results <- met_proc(metdata, numsplit = 5, cor_rates = c(0.4,.7,.75,.7,.4),
                    runlengths = c(80, 10, 12, 10, 80), mincut = 0.05,
                    maxcut = 0.95, scut = 0.6, ppkey = 'PPP', sidkey = 'X',
                    plot = FALSE)

#Uses default criteria for running met_proc, but plots the results
#and saves them in a PDF in the current directory. Adding plots
#may substantially increase running time if many samples are
#included
results <- met_proc(metdata, plot = TRUE, missratecut = 0.001,
                    histcolors = c('red','yellow','green','blue','purple'))

#Write the retained metabolites to current directory
write.met(results, 'sample_retained.csv',
          system.file("extdata/sampledata.csv", package="MetProc"),
          headrow=3, metidcol=1, fvalue=8, sep=",", type='keep')

#Write the removed metabolites to current directory
write.met(results, 'sample_removed.csv',
          system.file("extdata/sampledata.csv", package="MetProc"),
          headrow=3, metidcol=1, fvalue=8, sep=",", type='remove')
Calculates the correlation of missing rates between the two flanking pooled plasma samples and the intervening biological samples for each block in the injection order. A block is defined as a set of biological samples and their flanking pooled plasma samples. See sampledata for an example of the data format and block structure. Requires 2 arguments as input: 1. the metabolomics dataset formatted by the read.met function and 2. a list of 2 elements output by the get_group function containing the column indices of pooled plasma samples and biological samples, respectively. If either the pooled plasma or the biological samples are entirely absent or entirely present for a metabolite, the function returns NA for that metabolite, because the standard deviation of one of the vectors is 0 and the correlation is undefined.
corr_metric(df, grps)
df |
The metabolomics dataset, ideally read from the read.met function. |
grps |
A list of 2 elements from the get_group function. |
Returns a vector of equal length to the number of rows in df (representing metabolites) with the correlation of missing rates between flanking pooled plasma and intervening biological samples across all blocks.
See MetProc-package for examples of running the full process.
library(MetProc)

#Read metabolomics data
metdata <- read.met(system.file("extdata/sampledata.csv", package="MetProc"),
                    headrow=3, metidcol=1, fvalue=8, sep=",",
                    ppkey="PPP", ippkey="BPP")

#Get indices of samples and pooled plasma
grps <- get_group(metdata,'PPP','X')

#Get correlation metrics of metabolites
corrs <- corr_metric(metdata,grps)
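The correlation metric described above can be illustrated for a single metabolite in base R. This is a minimal sketch and not MetProc's implementation: the helper name block_missing_cor is hypothetical, and it assumes at least one biological sample sits between every pair of consecutive pooled plasma columns.

```r
# Hypothetical helper (not part of MetProc): correlation between the
# missing rate of the two flanking pooled plasma samples and the missing
# rate of the biological samples between them, taken across blocks.
block_missing_cor <- function(x, pp_idx) {
  n_blocks <- length(pp_idx) - 1
  if (n_blocks < 2) return(NA_real_)
  pp_rate <- numeric(n_blocks)
  samp_rate <- numeric(n_blocks)
  for (b in seq_len(n_blocks)) {
    left <- pp_idx[b]
    right <- pp_idx[b + 1]
    # Assumption: right > left + 1, i.e. the block is not empty
    pp_rate[b] <- mean(is.na(x[c(left, right)]))
    samp_rate[b] <- mean(is.na(x[(left + 1):(right - 1)]))
  }
  # Zero variance makes the correlation undefined, so return NA
  if (sd(pp_rate) == 0 || sd(samp_rate) == 0) return(NA_real_)
  cor(pp_rate, samp_rate)
}

pp_idx <- c(1, 4, 7, 10)                       # pooled plasma positions
blocky <- c(NA, NA, NA, NA, 1, NA, 1, 1, 1, 1) # missing early, present late
complete <- rep(1, 10)                         # never missing

r_blocky <- block_missing_cor(blocky, pp_idx)
r_complete <- block_missing_cor(complete, pp_idx)
```

In this toy input the blocky metabolite has pooled plasma and sample missing rates that fall together across blocks, which is the pattern the text associates with artifacts, while the fully present metabolite yields NA.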
Takes a metabolomics data matrix and retrieves the column indices of biological samples and pooled plasma samples. Columns must be ordered by injection order and each column ID should have a unique prefix designating its sample type, for example “PPP” to designate pooled plasma samples and “X” to designate biological samples. If data is read into R using the read.met function, columns will generally be labeled appropriately.
get_group(df, ppkey = "PPP", sidkey = "X")
df |
The metabolomics dataset, ideally read from the read.met function. |
ppkey |
The unique prefix of pooled plasma samples. Default is "PPP". |
sidkey |
The unique prefix of biological samples. Default is "X". |
A list of length 2 with the following keys:
pp |
A vector with column indices of pooled plasma |
sid |
A vector with column indices of samples |
See MetProc-package for examples of running the full process.
library(MetProc)

#Read metabolomics data
metdata <- read.met(system.file("extdata/sampledata.csv", package="MetProc"),
                    headrow=3, metidcol=1, fvalue=8, sep=",",
                    ppkey="PPP", ippkey="BPP")

#Get groups based on samples and pooled plasma
grps <- get_group(metdata,'PPP','X')
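To make the prefix convention concrete, the following base R sketch looks up column indices by prefix on a toy matrix. The helper prefix_indices is hypothetical and only mimics the documented output shape (a list with keys pp and sid); it is not the package's code.

```r
# Toy matrix: columns in injection order, prefixed as described above
toy <- matrix(1, nrow = 2, ncol = 6,
              dimnames = list(c("M1", "M2"),
                              c("PPP1", "X1", "X2", "PPP2", "X3", "X4")))

# Hypothetical helper (not part of MetProc): match column names by prefix
prefix_indices <- function(mat, ppkey = "PPP", sidkey = "X") {
  list(pp  = grep(paste0("^", ppkey), colnames(mat)),
       sid = grep(paste0("^", sidkey), colnames(mat)))
}

idx <- prefix_indices(toy)
```

Anchoring the pattern with "^" is what makes the prefix requirement matter: a key that also occurs at the start of another sample type's column names would put those columns in the wrong group.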
Computes two missing rates per metabolite: 1. the missing rate of biological samples and 2. the missing rate of pooled plasma samples. Requires a metabolomics data matrix from the read.met function as well as the indices of pooled plasma and biological samples from get_group. Returns a list with the two missing rates across all metabolites.
get_missing(df, ppind, sampind)
df |
The metabolomics dataset, ideally read from the read.met function. |
ppind |
The indices of the pooled plasma samples. |
sampind |
The indices of the biological samples. |
A list with the missing rates of the pooled plasma samples and biological samples for all metabolites in dataframe. The keys are:
ppmiss |
The pooled plasma missing rate for each metabolite |
sampmiss |
The biological sample missing rate for each metabolite |
See MetProc-package for examples of running the full process.
library(MetProc)

#Read metabolomics data
metdata <- read.met(system.file("extdata/sampledata.csv", package="MetProc"),
                    headrow=3, metidcol=1, fvalue=8, sep=",",
                    ppkey="PPP", ippkey="BPP")

#Get groups based on samples and pooled plasma
grps <- get_group(metdata,'PPP','X')

#Get the missing rates of each category for all metabolites
missrate <- get_missing(metdata,grps[['pp']],grps[['sid']])
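The two missing rates are simply per-row proportions of NA values over each set of columns. A minimal base R sketch on toy data, assuming missing values are coded as NA as the file format requires (the object names are illustrative and the code is not taken from the package):

```r
# Toy data: 2 metabolites, columns in injection order
toy <- matrix(c( 1, NA,  2, 3, NA, 4,
                NA, NA, NA, 5,  6, 7),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("M1", "M2"),
                              c("PPP1", "X1", "X2", "PPP2", "X3", "X4")))
pp_idx   <- c(1, 4)          # pooled plasma columns
samp_idx <- c(2, 3, 5, 6)    # biological sample columns

# Proportion of NA values per metabolite within each column set
miss <- list(
  ppmiss   = rowMeans(is.na(toy[, pp_idx, drop = FALSE])),
  sampmiss = rowMeans(is.na(toy[, samp_idx, drop = FALSE]))
)
```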
Generates a heatmap to show patterns of missing data for metabolites. Useful to visualize the block structure of data to compare differences between removed metabolites and retained metabolites.
heatmap_res(df, ppkey = "PPP", sidkey = "X", missratecut = .01,title)
df |
The metabolomics dataset, ideally read from the read.met function. |
ppkey |
Unique prefix of pooled plasma columns. Default is "PPP". |
sidkey |
Unique prefix of biological sample columns. Default is "X". |
missratecut |
The missing rate limit for displaying a metabolite. Only metabolites with overall missing rates equal to or greater than this cutoff will be plotted. Useful for avoiding plotting too many metabolites, as heatmap generation can be computationally expensive. If a metabolite has a very small missing rate, plotting it is uninformative because nearly all data is present. Default is 0.01. |
title |
The title of the heatmap plotted |
Returns a heatmap illustrating the patterns of missing data for metabolites.
See MetProc-package for examples of running the full process.
library(MetProc)

#Read in metabolomics data
metdata <- read.met(system.file("extdata/sampledata.csv", package="MetProc"),
                    headrow=3, metidcol=1, fvalue=8, sep=",",
                    ppkey="PPP", ippkey="BPP")

#Get the good versus bad metabolites
results <- met_proc(metdata)

#Plot removed metabolites
#Similarly run for retained metabolites but
#replacing 'remove' with 'keep'
heatmap_res(results[['remove']],missratecut=.02,title='Removed Metabolites')
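The role of missratecut can be sketched without any plotting: metabolites whose overall missing rate is below the cutoff are dropped, and the rest are turned into a present/missing matrix. This is an illustrative base R sketch under those assumptions, not the code heatmap_res runs.

```r
toy <- matrix(c( 1,  2,  3,  4,
                 1, NA,  3,  4,
                NA, NA, NA, NA),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("M1", "M2", "M3"),
                              c("PPP1", "X1", "X2", "PPP2")))
missratecut <- 0.01

# Overall missing rate per metabolite, then keep rows at or above the cutoff
overall_miss <- rowMeans(is.na(toy))
shown <- toy[overall_miss >= missratecut, , drop = FALSE]

# 1 = data present, 0 = data missing; this is what a heatmap would color
presence <- (!is.na(shown)) + 0
# image(t(presence)) would draw a basic version of such a heatmap
```

M1 has no missing data, so it is filtered out, which matches the remark that plotting metabolites with very low missing rates is uninformative.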
Takes a metabolomics data matrix and separates metabolites into likely artifacts and likely true metabolites. Biological samples should follow a randomized injection order with pooled plasma samples interspersed. Columns of the data should be samples and rows should be metabolites. Columns must be ordered by injection order. Metabolites are first grouped by missing rate of pooled plasma and then processed based on metrics of blocky structure to identify likely artifacts. Specifically, corr_metric and run_metric are used to quantify the degree to which structure is present in the patterns of missing data. A metabolite must pass all thresholds to be considered a true metabolite.
met_proc(df, numsplit = 5, cor_rates = c(0.6, 0.65, 0.65, 0.65, 0.6), runlengths = c(NA, 15, 15, 15, NA), mincut = 0.02, maxcut = 0.95, scut = 0.5, ppkey = "PPP", sidkey = "X", missratecut=0.01, histcolors=c('white'), plot=TRUE, outfile='MetProc_output')
df |
The metabolomics dataset, ideally read from the |
numsplit |
The number of equal sized sections to divide metabolites into based on missing rate of pooled plasma columns. Divides the range of missing rates between |
cor_rates |
A vector of length equal to |
runlengths |
A vector of length equal to |
mincut |
A cutoff to specify that any metabolite with pooled plasma missing rate less than or equal to this value should be retained. Default is |
maxcut |
A cutoff to specify that any metabolite with pooled plasma missing rate greater than this value should be removed. Default is |
scut |
The cutoff of missingness to consider a metabolite as having data present in a given biological sample block. Relevant only to |
ppkey |
The unique prefix of pooled plasma columns. Default is |
sidkey |
The unique prefix of biological samples columns. Default is |
missratecut |
A parameter for heatmap plots when |
plot |
Indicate whether you would like to obtain plots of missingness patterns and distributions of calculated metrics. Plots will be output as a PDF. Default is |
histcolors |
A vector of length equal to |
outfile |
Name and path of the file to store images if |
The function uses a four-step process:
1. Retain all metabolites with a pooled plasma missing rate at or below mincut and remove all metabolites with a pooled plasma missing rate above maxcut.
2. Split the remaining metabolites into numsplit groups defined by pooled plasma missing rate. The numsplit groups divide the range of pooled plasma missing rates between mincut and maxcut evenly.
3. For each group of metabolites from step 2, calculate the correlation metric with corr_metric. Any metabolite below the cutoff for that group, defined by cor_rates, will be retained and any metabolite above will be removed.
4. For each group of metabolites from step 2, calculate the longest run metric with run_metric. Any metabolite below the cutoff for that group, defined by runlengths, will be retained and any metabolite above will be removed.
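Steps 1 and 2 can be sketched in base R on a toy vector of pooled plasma missing rates. This is an illustration only: how MetProc assigns values that fall exactly on a group boundary is not stated here, so the use of cut() with right-closed intervals is an assumption, and all object names are made up for the example.

```r
# Toy pooled plasma missing rates for six metabolites
ppmiss <- c(M1 = 0.00, M2 = 0.10, M3 = 0.30, M4 = 0.50, M5 = 0.90, M6 = 0.99)
mincut <- 0.02
maxcut <- 0.95
numsplit <- 5
cor_rates <- c(0.6, 0.65, 0.65, 0.65, 0.6)

# Step 1: retain at or below mincut, remove above maxcut
status <- ifelse(ppmiss <= mincut, "keep",
                 ifelse(ppmiss > maxcut, "remove", "evaluate"))

# Step 2: divide [mincut, maxcut] into numsplit equal-width groups
breaks <- seq(mincut, maxcut, length.out = numsplit + 1)
grp <- cut(ppmiss[status == "evaluate"], breaks = breaks,
           labels = FALSE, include.lowest = TRUE)

# Steps 3 and 4 then compare each metric with the cutoff of its group,
# for example the correlation cutoff for each evaluated metabolite:
cor_cut <- cor_rates[grp]
```

Giving each group its own cutoff is why cor_rates and runlengths need one entry per group.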
keep |
A dataframe of the retained metabolites |
remove |
A dataframe of the removed metabolites |
If plot = TRUE, a PDF file will be saved containing the correspondence between pooled plasma missing rate and sample missing rate, the distribution of the correlation metric and longest run metric in each of the groups based on pooled plasma missing rates, and heatmaps displaying the patterns of present/missing data for both the removed and retained metabolites.
See run_metric for details on the longest run metric.
See corr_metric for details on the correlation metric.
See MetProc-package for examples of running the full process.
library(MetProc)

#Read in metabolomics data
metdata <- read.met(system.file("extdata/sampledata.csv", package="MetProc"),
                    headrow=3, metidcol=1, fvalue=8, sep=",",
                    ppkey="PPP", ippkey="BPP")

#Separate likely artifacts from true signal using default settings
results <- met_proc(metdata,plot=FALSE)

#Separate likely artifacts from true signal using custom cutoffs and criteria
#Uses 5 groups of metabolites based on the pooled plasma missing rate, applies
#custom metric thresholds, sets the minimum pooled plasma missing rate to 0.05,
#sets the maximum pooled plasma missing rate to 0.95, sets the missing rate
#to consider a block of samples present at 0.6
results <- met_proc(metdata, numsplit = 5, cor_rates = c(0.4,.7,.75,.7,.4),
                    runlengths = c(80, 10, 12, 10, 80), mincut = 0.05,
                    maxcut = 0.95, scut = 0.6, ppkey = 'PPP', sidkey = 'X',
                    plot = FALSE)

#Uses default criteria for running met_proc, but plots the results
#and saves them in a PDF in the current directory.
#Colors of the histograms set by histcolors.
#Adding plots may substantially increase running time if many
#samples are included
results <- met_proc(metdata, plot = TRUE, missratecut = 0.001,
                    histcolors = c('red','yellow','green','blue','purple'))
For a given number of splits of the data based on pooled plasma missing rate, calculates the longest run metric (run_metric) and the correlation metric (corr_metric) for the metabolites in each group. Plots the distribution of these metrics for each group, color coding those that exceed thresholds.
plot_metric(df,ppkey='PPP',sidkey='X',numsplit=5,mincut=.02,maxcut=0.95, scut=0.5,cor_rates=c(.6,.65,.65,.65,.6),runlengths=c(NA,15,15,15,NA), histcolors=c('white'))
df |
The metabolomics dataset, ideally read from the |
ppkey |
The unique prefix of pooled plasma samples. Default is |
sidkey |
The unique prefix of biological samples. Default is |
numsplit |
The number of equal sized sections to divide metabolites into based on missing rate of pooled plasma columns. Divides the range of missing rates between |
mincut |
A cutoff to specify that any metabolite with pooled plasma missing rate less than or equal to this value should be retained. Default is |
maxcut |
A cutoff to specify that any metabolite with pooled plasma missing rate greater than this value should be removed. Default is |
scut |
The cutoff of missingness to consider a metabolite as having data present in a given biological sample block. Relevant only to |
cor_rates |
A vector of length equal to |
runlengths |
A vector of length equal to |
histcolors |
A vector of length equal to |
Returns histograms showing the correlation metric and longest run metric distributions for each group of the metabolites based on pooled plasma missing rate.
See MetProc-package for examples of running the full process.
library(MetProc)

#Read in metabolomics data
metdata <- read.met(system.file("extdata/sampledata.csv", package="MetProc"),
                    headrow=3, metidcol=1, fvalue=8, sep=",",
                    ppkey="PPP", ippkey="BPP")

#Plot distributions of the two metrics for each group
plot_metric(metdata, ppkey='PPP', sidkey='X', numsplit=5, mincut=0.02,
            maxcut=0.95, scut=0.5, cor_rates=c(.6,.65,.65,.65,.6),
            runlengths=c(NA,15,15,15,NA),
            histcolors=c('red','yellow','green','blue','purple'))
Calculates the missing rate of the pooled plasma columns and biological sample columns for each metabolite. Plots a scatterplot showing the correspondence between the two.
plot_pp_sample_missing(df, ppkey = "PPP", sidkey = "X")
df |
The metabolomics dataset, ideally read from the read.met function. |
ppkey |
The unique prefix of pooled plasma samples. Default is "PPP". |
sidkey |
The unique prefix of biological samples. Default is "X". |
Returns a scatterplot comparing the pooled plasma missing rate to the sample missing rate.
See MetProc-package for examples of running the full process.
library(MetProc)

#Read in metabolomics data
metdata <- read.met(system.file("extdata/sampledata.csv", package="MetProc"),
                    headrow=3, metidcol=1, fvalue=8, sep=",",
                    ppkey="PPP", ippkey="BPP")

#Plot the pooled plasma missing rate against the sample missing rate
plot_pp_sample_missing(metdata,ppkey='PPP',sidkey='X')
Read a metabolomics file. The file must be structured in a specific way. The columns of the file designate samples and the rows designate metabolites. The first n rows may contain any information. However, starting at row n+1 there must be a header line with column labels. The remaining rows are designated as one per metabolite. One column should contain the ID of each metabolite. Other columns can be included, but starting at some column, and continuously after this point, each sample or pooled plasma sample should be given its own column sorted by injection order. All pooled plasma columns should have a unique prefix differentiating them from biological samples. Up to 2 types of pooled plasma samples can be included in the file – each with a unique prefix. This may be useful when both a pooled plasma control generated from biological samples and a commercially available pooled plasma standard are used. All biological samples may have a designating prefix or simply lack a prefix designating pooled plasma samples. If no prefix designates the biological samples, a prefix of “X” will be used for biological samples in subsequent analysis. Missing data must be coded as NA.
read.met(data, headrow = 3, metidcol=1, fvalue=8, sep=",", ppkey='PPP', ippkey = 'BPP', sidkey="none")
data |
The metabolomics dataset file. The columns of the file designate samples and the rows designate metabolites. The first n rows can contain any information. However, starting at row n+1 there must be a header line with column labels. The remaining rows are designated as one per metabolite. One column should contain the ID of each metabolite. Other columns can be included, but starting at some column, and continuously after this point, each biological sample or pooled plasma sample should be given its own column sorted by injection order. All pooled plasma columns should have a unique prefix differentiating them from samples. Up to 2 types of pooled plasma samples can be included in the file, each with a unique prefix. All biological samples may have a designated prefix or simply lack the prefix designating pooled plasma samples. If no prefix designates the biological samples, a prefix of “X” will be used for biological samples in subsequent analysis. Missing data must be coded as NA. See file |
headrow |
The row number that contains the header line. Default is 3. |
metidcol |
The column number that contains the metabolite ID. Default is 1. |
fvalue |
The column number where data begins. Default is 8. |
sep |
File delimiter. Default is ",". |
ppkey |
The unique prefix of biological sample-based pooled plasma columns. Default is 'PPP'. |
ippkey |
The unique prefix of standard pooled plasma columns. Default is 'BPP'. |
sidkey |
The unique prefix of biological samples in the csv file. If ‘none’ provided as value, any column that does not contain the prefix of |
A matrix with the metabolomics data fully loaded. Should have the number of rows equal to the number of metabolites and columns equal to the number of samples + pooled plasma samples.
See MetProc-package for examples of running the full process.
library(MetProc)

#Read in metabolomics data
metdata <- read.met(system.file("extdata/sampledata.csv", package="MetProc"),
                    headrow=3, metidcol=1, fvalue=8, sep=",",
                    ppkey="PPP", ippkey="BPP")
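The meaning of headrow, metidcol and fvalue can be seen by parsing a tiny file of the same layout with base R alone. The sketch below is not what read.met does internally; it only shows how a header on row 3, IDs in column 1 and data starting at column 8 map onto a values matrix. The file contents are made up for the illustration.

```r
# A made-up file in the described layout: two free-form rows, a header
# on row 3, metabolite IDs in column 1 and sample data from column 8 on
csv_lines <- c(
  "Date,,,,,,,415,415,415,415",
  "Inject,,,,,,,1,2,3,4",
  "Metab,Meth,HMDB,m/z,rt,Com,ProcID,PPP1,1,2,PPP2",
  "M1,Lipid,H1,304,8.7,,1,6.7,6.7,NA,6.7",
  "M2,Lipid,H2,309,7.6,,2,1.0,NA,1.1,1.0"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

headrow <- 3
metidcol <- 1
fvalue <- 8

# Skip the rows above the header; keep column names such as "1" unchanged
raw <- read.csv(path, skip = headrow - 1, header = TRUE,
                check.names = FALSE, stringsAsFactors = FALSE)

# Values matrix: rows are metabolites, columns are injections
vals <- as.matrix(raw[, fvalue:ncol(raw)])
rownames(vals) <- raw[[metidcol]]
```

Because the biological sample columns here carry no prefix, a real workflow would still need them labeled (the text says read.met applies “X” in that case); this sketch leaves the names as they appear in the file.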
For each metabolite, data is split into blocks that consist of the preceding pooled plasma sample and the following biological samples in the injection order. For each block, data is deemed present in the biological samples if their missing rate is less than scut. An entire block is deemed to have data present if both the preceding pooled plasma sample and the following biological samples are considered to have data present. The length of the longest consecutive run of blocks with data present is returned for each metabolite.
run_metric(df, grps, scut = 0.5)
df |
The metabolomics dataset, ideally read from the read.met function. |
grps |
A group list from the get_group function. |
scut |
The cutoff missing rate to determine if data is present in a group of biological samples. If the missing rate of the biological samples is greater than or equal to this missing rate threshold, data will be considered absent from the block of biological samples. Default is 0.5. |
Returns a vector containing the longest consecutive run of blocks with data present for each metabolite.
See MetProc-package for examples of running the full process.
library(MetProc)

#Read in metabolomics data
metdata <- read.met(system.file("extdata/sampledata.csv", package="MetProc"),
                    headrow=3, metidcol=1, fvalue=8, sep=",",
                    ppkey="PPP", ippkey="BPP")

#Get indices of pooled plasma and samples
grps <- get_group(metdata,'PPP','X')

#Get the longest run metric for each metabolite
runs <- run_metric(metdata,grps,scut=.5)
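The longest run metric can be sketched for one metabolite with base R's rle(). This is a hedged illustration rather than the package's code: the helper longest_present_run is hypothetical, and it assumes that a pooled plasma column with no biological samples after it does not form a block.

```r
# Hypothetical helper (not part of MetProc): longest consecutive run of
# blocks in which both the pooled plasma sample and the following
# biological samples are considered present.
longest_present_run <- function(x, pp_idx, scut = 0.5) {
  starts <- pp_idx + 1
  ends <- c(pp_idx[-1] - 1, length(x))
  keep <- starts <= ends   # assumption: skip blocks without samples
  present <- mapply(function(p, s, e) {
    # Present if the pooled plasma value exists and the sample
    # missing rate is strictly below scut
    !is.na(x[p]) && mean(is.na(x[s:e])) < scut
  }, pp_idx[keep], starts[keep], ends[keep])
  runs <- rle(as.logical(present))
  if (any(runs$values)) max(runs$lengths[runs$values]) else 0L
}

pp_idx <- c(1, 4, 7, 10, 13)   # pooled plasma positions
x <- c(1, 1, 1,   1, 1, 1,   1, NA, 1,   NA, 1, 1,   1, 1, 1)

run_default <- longest_present_run(x, pp_idx)              # scut = 0.5
run_lenient <- longest_present_run(x, pp_idx, scut = 0.6)
```

With scut = 0.5 the third block has a sample missing rate of exactly 0.5 and counts as absent, so the longest run is 2; raising scut to 0.6 makes that block present and extends the run to 3.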
This is a simulated dataset to show the format of the metabolomics data; patterns of missing data are generated roughly from a real metabolomics experiment. Rows represent metabolites and columns represent samples. The file contains 100 metabolites (rows) and 505 samples (480 biological sample columns and 25 pooled plasma columns) sorted by injection order. There are 20 biological samples between pooled plasma runs. Pooled plasma columns have prefix ‘PPP’ and biological samples are simple integers with no prefix.
sampledata
The first row (Date) contains the date of processing. The second row (Inject) contains the injection number and is ordered from 1 to 505. The third row contains the column headers:
Metab is the metabolite ID.
Meth is the type of metabolite.
HMDB is the HMDB ID of the metabolite, if it exists.
m/z is the mass-to-charge ratio of the metabolite.
rt is the retention time.
Com contains any comments.
ProcID is the processing ID of the metabolite.
The remaining columns are either pooled plasma samples (prefix: ‘PPP’) or biological samples (no prefix). The basic structure of the csv file is as follows:
Date | | | | | | | 415 | 415 | .. | 415 | 415 | 415 | .. |
Inject | | | | | | | 1 | 2 | .. | 21 | 22 | 23 | .. |
Metab | Meth | HMDB | m/z | rt | Com | ProcID | PPP1 | 1 | .. | 20 | PPP2 | 21 | .. |
M1 | Lipid | H1 | 304 | 8.7 | | 1 | 6.7 | 6.7 | .. | 5.0 | 6.7 | 4.6 | .. |
M2 | Lipid | H2 | 309 | 7.6 | | 2 | 1.0 | 1.1 | .. | 1.1 | 1.0 | 1.2 | .. |
.. | .. | .. | .. | .. | .. | .. | .. | .. | .. | .. | .. | .. | .. |
M100 | Lipid | H100 | 249 | 6.2 | | 100 | 2.4 | 1.9 | .. | 2.2 | 2.4 | 1.6 | .. |
See read.met for an example of reading this csv file for use.
See MetProc-package for examples of running the full process.
Separates metabolites into groups based on pooled plasma missing rates so that different thresholds of metrics can be applied to each group.
subset_met(df, miss, numsplit = 5, mincut = 0.02, maxcut = 0.95)
df |
The metabolomics dataset, ideally read from the |
miss |
Vector of missing rates of equal length to number of rows in |
numsplit |
The number of equal sized sections to divide metabolites into based on missing rate of pooled plasma columns. Divides the range of missing rates between |
mincut |
A cutoff to specify that any metabolite with pooled plasma missing rate less than or equal to this value should be retained. Default is |
maxcut |
A cutoff to specify that any metabolite with pooled plasma missing rate greater than this value should be removed. Default is 0.95. |
A list consisting of a number of elements equal to numsplit. Each element contains a matrix of the given metabolite group based on the pooled plasma missing rate. The list keys are simple integers corresponding to the split number.
See MetProc-package for examples of running the full process.
library(MetProc)

#Read in metabolomics data
metdata <- read.met(system.file("extdata/sampledata.csv", package="MetProc"),
                    headrow=3, metidcol=1, fvalue=8, sep=",",
                    ppkey="PPP", ippkey="BPP")

#Get indices of pooled plasma and samples
groups <- get_group(metdata,"PPP","X")

#Calculate a pooled plasma missing rate and sample missing rate
#for each metabolite in data
missrate <- get_missing(metdata,groups[['pp']],groups[['sid']])

#Group metabolites into 5 groups based on pooled plasma
#missing rate
subsets <- subset_met(metdata,missrate[['ppmiss']],5,.02,.95)
Writes a metabolomics file containing the metabolites that met_proc identified as retained or removed. Requires the file path of the original metabolomics file in order to extract row and column information. Takes this original file and the results of the met_proc function and writes a file that contains only the retained or only the removed metabolites.
write.met(res, filename, origfile, headrow = 3, metidcol=1, fvalue=8, sep=",", type="keep")
res |
The result output from met_proc. |
filename |
The name and path for the new metabolomics file. |
origfile |
The name and path for the original metabolomics file. |
headrow |
The row number that contains the header line in the original metabolomics file. Default is 3. |
metidcol |
The column number that contains the metabolite ID in the original metabolomics file. Default is 1. |
fvalue |
The column number where data begins in the original metabolomics file. Default is 8. |
sep |
File delimiter for both the original metabolomics file and the new file. Default is ",". |
type |
Either ‘keep’ or ‘remove’ to determine whether the retained metabolites or removed metabolites should be written to the file. Default is "keep". |
Writes a file to filename that has the same structure as the original metabolomics file but contains only the retained or only the removed metabolites.
See MetProc-package for examples of running the full process.
library(MetProc)

#Read in metabolomics data
metdata <- read.met(system.file("extdata/sampledata.csv", package="MetProc"),
                    headrow=3, metidcol=1, fvalue=8, sep=",",
                    ppkey="PPP", ippkey="BPP")

#Separate likely artifacts from true signal using default settings
results <- met_proc(metdata,plot=FALSE)

#Write the retained metabolites to current directory
write.met(results, 'sample_retained.csv',
          system.file("extdata/sampledata.csv", package="MetProc"),
          headrow=3, metidcol=1, fvalue=8, sep=",", type='keep')