Jan Stanstrup
@JanStanstrup
stanstrup@gmail.com
2017/06/25
Steno Diabetes Center Copenhagen, Denmark
The data used in this presentation was kindly provided by Drs. Richard and Nichole Reisdorph and their Mass Spectrometry Laboratory
This presentation and sample data can be found at github.com/stanstrup/XCMS-course
Why are we here and why do we need to do this?
Simply to do. Difficult to do well. Impossible to do perfectly.
Is this a peak or just noise?
First we locate the files. They need to be in an open format:
Lets make a list of the files we want to pre-process.
files <- list.files("_data", recursive = TRUE, full.names = TRUE, pattern=".mzXML")
head(files,5)
[1] "_data/A/POOL_1_A_1.mzXML" "_data/A/POOL_1_A_2.mzXML"
[3] "_data/A/POOL_1_A_3.mzXML" "_data/A/POOL_2_A_1.mzXML"
[5] "_data/A/POOL_2_A_2.mzXML"
Blanks interfer with peak alignment since there are so few peaks.
So, better remove them before starting. Same goes for compound mixtures.
Now we can do peak picking. We need to load some packages first.
suppressMessages(library(xcms)) # suppress avoids a flood of messages
suppressMessages(library(BiocParallel))
Now lets do the peak picking.
There are many parameters to set that we will explain in a sec.
There are a few more settings available but these are the most important.
xset <- xcmsSet(files,
BPPARAM = SerialParam(),
method = 'centWave',
prefilter = c(3,1E3),
ppm = 30,
snthr = 1E2,
profparam = list(step=0.01),
peakwidth = c(0.05*60,0.20*60),
verbose.columns = TRUE,
fitgauss = TRUE
)
By default the newest XCMS uses all available cores.
Before we set it to use a single core (SerialParam) just because it works better for generating this presentation.
To control the number of cores you could for example do:
library(parallel)
and set:
BPPARAM = SnowParam(detectCores()-1, progressbar = TRUE)
SnowParam()
on windowsMulticoreParam()
on linuxmatchedfilter[1]
centWave[2]
massifquant[3]
Here we will go through peakpicking with the centwave algorithm[2].
prefilter = c(3,1E3)
3
scans with intensity above 1000
.Centwave asks you to set the minimum and maxmimum peak widths you have in your data.
You set it in seconds (always seconds in XCMS) and this is what we did before.
peakwidth = c(0.05*60,0.20*60)
It is a vector of length 2 with the min and max length.
To determine reasonable values we need to look at the raw data (you'd probably use something interactive such as MzMine (or future XCMS!)).
Here is a TIC:
xraw <- xcmsRaw(xset@filepaths[1])
plotEIC(xraw, mzrange=c(0, 2000), , rtrange=c(0, 12*60))
Lets zoom in on a peak.
plotEIC(xraw,
mzrange=c(84.9612-0.01, 84.9612+0.01),
rtrange=c(0.7*60, 0.9*60),
type="o", cex=3, pch=19, lwd=3
)
You can also use a 2D plot to try to find short and long peaks.
Here I plotted with MzMine (remember to use the continuous/profile mode toggle otherwise it looks very wrong).
ppm: Relative deviation in the m/z dimension
In the centWave context it is the maximum allowed deviation between scans when locating regions of interest (ROIs).
Do not use the vendors number. Like the mileage of a car these numbers are far from real world scenarios.
We need a range that is true also for the tails of the peaks.
What we choose before was:
ppm = 30
A 2D plot is a reasonable way to look at this.
Check the peak bounderies with an EIC.
Check what are actually reasonably sized peaks in the spectrum.
((189.074-189.0690)/189.0690)*1e6
[1] 26.44537
First we need to clear up the confusion about the difference between profile mode data and the profile matrix.
Profile mode: The data is continuous in the m/z dimension.
As opposed to centroid mode data where you have discrete (sticks) in the m/z dimension.
Profile matrix: A rt (or scan rather) x m/z matrix of m/z slices.
XCMS uses this for some procedures (fillPeaks, but not peakpicking) instead of the full raw data. Think of it as binned data with less m/z resolution.
profparam = list(step=0.01)
Signal to noise ratio. It really depends on your data. And it is defined differently for centWave and matchedfilter.
It appears to be very unstable even for very similar peaks. So, I suggest to set it low and filter on intensity afterwards if needed.
For centWave I would start with:
snthr = 1E2
For matchedfilter I would start with:
snthr = 5
If low peaks that you would like to include are missing then lower the value.
Is this peak the same as that peak?
This step tries to group/match features between samples.
This is probably the most important step!
It is a 3 step procedure:
There are three methods for retention time correction/alignment.
xset_g <- group(xset,
bw = 0.2*60,
minfrac = 0.5,
minsamp = 5,
mzwid = 0.01,
max = 20,
sleep = 0
)
xset_g
An "xcmsSet" object with 27 samples
Time range: 0.6-720 seconds (0-12 minutes)
Mass range: 60.5133-1418.4605 m/z
Peaks: 97334 (about 3605 per sample)
Peak Groups: 2978
Sample classes: A, B, C
Feature detection:
o Peak picking performed on MS1.
Profile settings: method = bin
step = 0.01
Memory usage: 18.4 MB
Example of grouping density.
bw: bandwidth (standard deviation or half width at half maximum) of gaussian smoothing kernel to apply to the peak density chromatogram.
Perhaps a bit obscure definition but it roughly translates to:
The maximum expected RT deviation across samples.
Overlay the BPI and find the peak with the highest deviation.
Here we have zoomed. Compare peak apexes.
Calculate the deviation.
4.90-4.78
[1] 0.12
Since there could be small peaks with larger variation than what we found we bump it up to 0.2 min.
minfrac = 0.5
minsamp = 5
minsamp: minimum number of samples a peak should be found in.
minfrac: minimum fraction of samples a peak should be found in.
These are used per sample group.
Files in different subfolders are considered different groups.
The groups can be manipulated by changing:
head(xset@phenoData)
class
POOL_1_A_1 A
POOL_1_A_2 A
POOL_1_A_3 A
POOL_2_A_1 A
POOL_2_A_2 A
POOL_2_A_3 A
In this dataset we had 3 groups with 9 samples in each. The first time we do the grouping we are pretty liberal with how often it should be found.
Consider your experimental design and group VERY carefully when you set these parameters!
Example:
You set:
minfrac = 0.7
minsamp = 30
You have now removed everything that is unique to one group!
So possibly the most important features have been removed.
mzwid = 0.01
How close the m/z need to be the same compound.
max: maximum number of groups to identify in a single m/z slice.
Meaning how many isomers do you have across a chromatogram.
max = 20
If you are processing GC data remember that many fragments will be isomers. So, set this high.
xset_g_r <- retcor(xset_g,
method="peakgroups",
missing = 3,
extra = 3,
smooth = "loess",
span = 0.6,
plottype = "mdevden",
col=xset_g@phenoData[,"class"]
)
span: degree of smoothing for local polynomial regression fitting.
The higher the less flexible the fitting is.
0 == overfitting
1 == only captures very overall trends
0.4-0.7 is typical in my opinion. 0.2 is default.
We set:
span = 0.6
missing = 3
extra = 3
missing: number of missing samples to allow in retention time correction groups
extra: number of extra peaks to allow in retention time correction correction groups
plottype = "mdevden",
col=xset_g@phenoData[,"class"]
plottype:
col: We can tell it how to color the points/lines in the deviations plot.
Now we do grouping again with the corrected retention times.
Notice that we give more strict bw
, minfrac
and minsamp
since we expect better matching now that we have corrected the retention times.
xset_g_r_g <- group(xset_g_r,
bw = 5,
minfrac = 0.75,
minsamp = 7,
mzwid = 0.01,
max = 20,
sleep = 0
)
xset_g_r_g
An "xcmsSet" object with 27 samples
Time range: 0.5-720.5 seconds (0-12 minutes)
Mass range: 60.5133-1418.4605 m/z
Peaks: 97334 (about 3605 per sample)
Peak Groups: 1627
Sample classes: A, B, C
Feature detection:
o Peak picking performed on MS1.
Profile settings: method = bin
step = 0.01
Memory usage: 20.2 MB
We can redo the retcor
to get the plot using corrected RTs. We won't use the returned object.
xset_g_r_g_r <- retcor(xset_g_r_g,
method="peakgroups",
missing = 3,
extra = 3,
smooth = "loess",
span = 0.6,
plottype = "mdevden",
col=xset_g@phenoData[,"class"]
)