XCMS workshop 2017

Jan Stanstrup

@JanStanstrup
stanstrup@gmail.com

2017/06/25

Steno Diabetes Center Copenhagen, Denmark

Outline


  1. What does the data look like? Why pre-processing?
  2. Peak picking
  3. Grouping peaks
  4. Retention time correction
  5. Filling missing values

  6. Checking the results and troubleshooting
  7. Getting quick stats

The data used in this presentation was kindly provided by Drs. Richard and Nichole Reisdorph and their Mass Spectrometry Laboratory

This presentation and sample data can be found at github.com/stanstrup/XCMS-course

What does the data look like? Why pre-processing?


Why are we here and why do we need to do this?

What does the data look like? Why pre-processing?


What does the data look like? Why pre-processing?






Simple to do. Difficult to do well. Impossible to do perfectly.

XCMS


  • R package
  • First published in 2006[1] by Colin A. Smith et al. at The Scripps Research Institute (Patti Lab).
  • Open source: https://github.com/sneumann/xcms
  • The 'X' in XCMS denotes that it can be used for any chromatography (GC/LC)
  • Maintained for many years primarily by Steffen Neumann and his group
  • New major version being developed by Johannes Rainer (EURAC Research, Bolzano, Italy)

Peak picking


Is this a peak or just noise?

Find the files


First we locate the files. They need to be in an open format:

  • mzML (newest)
  • mzData
  • mzXML (most widely supported)
  • netCDF (obsolete, last resort)


Let's make a list of the files we want to pre-process.

files <- list.files("_data", recursive = TRUE, full.names = TRUE, pattern = "\\.mzXML$")
head(files,5)
[1] "_data/A/POOL_1_A_1.mzXML" "_data/A/POOL_1_A_2.mzXML"
[3] "_data/A/POOL_1_A_3.mzXML" "_data/A/POOL_2_A_1.mzXML"
[5] "_data/A/POOL_2_A_2.mzXML"


Blanks interfere with peak alignment since they contain so few peaks.
So it is better to remove them before starting. The same goes for compound mixtures.
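The blank/mixture removal can be done with simple pattern matching on the file names — a minimal sketch, assuming (hypothetically) that such files carry "BLANK" or "MIX" in their names; adapt the pattern to your own naming scheme:

```r
# Hypothetical file list; the BLANK/MIX naming is an assumption.
files <- c("_data/A/POOL_1_A_1.mzXML",
           "_data/A/BLANK_1.mzXML",
           "_data/A/MIX_1.mzXML")

# Keep only files whose name does not contain BLANK or MIX.
files <- files[!grepl("BLANK|MIX", basename(files))]
files
# [1] "_data/A/POOL_1_A_1.mzXML"
```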

Peak picking



Now we can do peak picking. We need to load some packages first.

suppressMessages(library(xcms)) # suppress avoids a flood of messages
suppressMessages(library(BiocParallel))


Now let's do the peak picking.
There are many parameters to set that we will explain in a sec.
There are a few more settings available but these are the most important.

xset <- xcmsSet(files, 
                BPPARAM  = SerialParam(),
                method = 'centWave',
                prefilter = c(3,1E3),
                ppm = 30,
                snthr = 1E2,
                profparam = list(step=0.01),
                peakwidth = c(0.05*60,0.20*60),
                verbose.columns = TRUE,
                fitgauss = TRUE
                )

Peak picking - BPPARAM


By default the newest XCMS uses all available cores.
Here we set it to use a single core (SerialParam) simply because that works better for generating this presentation.

To control the number of cores you could for example do:

library(parallel)

and set:

BPPARAM  = SnowParam(detectCores()-1, progressbar = TRUE)



  • SnowParam() on Windows
  • MulticoreParam() on Linux
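Putting the pieces together, backend selection could look like this — a sketch assuming BiocParallel is installed; adjust the worker count to your machine:

```r
library(parallel)       # detectCores()
library(BiocParallel)   # SerialParam/SnowParam/MulticoreParam

# Leave one core free for the rest of the system.
n_workers <- max(1, detectCores() - 1)

# SnowParam works on all platforms; MulticoreParam (fork-based)
# is only available on Linux/macOS.
BPPARAM <- if (.Platform$OS.type == "windows") {
  SnowParam(n_workers, progressbar = TRUE)
} else {
  MulticoreParam(n_workers, progressbar = TRUE)
}
```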

Peak picking - which method?


matchedFilter[1]

  • Original algorithm
  • Developed for nominal mass instruments
    • Therefore good for low accuracy data
    • Also works for accurate mass data
  • Good for chromatographically noisy data
  • Fixed width peak fitting
  • Bins data in m/z dimension


centWave[2]

  • Developed for accurate mass instruments
  • Handles different peak widths better
  • prefilter makes it faster
  • Usually the best if the data is “nice”
  • Bin-less approach


massifquant[3]

  • Developed (among other reasons) to better handle variable mass accuracy and peak shapes
  • Bin-less approach
  • No personal experience but from the paper it appears that:
    • Good at finding low intensity and low “quality” peaks
    • More prone to false positives (picking noise) than centWave

Peak picking - What peak picking does

Here we will go through peak picking with the centWave algorithm[2].

  • First it finds regions of interest (ROIs).
    Those are where there could be a peak: Stable mass for a certain time.
  • Next the peak shape is evaluated.
  • The peak is expanded beyond the ROI if needed.

Figure from Tautenhahn, Böttcher and Neumann, 2008 (centWave paper)

Peak picking - What peak picking does


Figure from Tautenhahn, Böttcher and Neumann, 2008 (centWave paper)

Peak picking - Prefilter


prefilter = c(3,1E3)



  • Says to only consider regions where there are at least 3 scans with intensity above 1000.
  • Check your peak widths to see how many scans per peak you are sure to have.
  • For the intensity, set it to approximately the intensity of the random noise you see in a spectrum.
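A back-of-the-envelope check for the first prefilter element — how many scans a narrow peak spans — could look like this (the peak width and scan rate are assumed values for illustration):

```r
# Assumed values: narrowest peak ~6 s wide, instrument scanning at ~2.5 Hz.
narrowest_peak_s <- 6
scans_per_sec    <- 2.5

scans_per_peak <- narrowest_peak_s * scans_per_sec
scans_per_peak
# [1] 15

# prefilter = c(k, I): pick k comfortably below scans_per_peak and
# I around the random-noise intensity seen in a spectrum.
prefilter <- c(3, 1e3)
```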

Peak picking - Peak width

centWave asks you to set the minimum and maximum peak widths you have in your data.

You set it in seconds (always seconds in XCMS) and this is what we did before.

peakwidth = c(0.05*60,0.20*60)

It is a vector of length 2 with the min and max length.

To determine reasonable values we need to look at the raw data (you'd probably use something interactive such as MzMine (or future XCMS!)).
Here is a TIC:

xraw <- xcmsRaw(xset@filepaths[1])
plotEIC(xraw, mzrange = c(0, 2000), rtrange = c(0, 12*60))

(TIC plot of the first file)

Peak picking - Peak width

Lets zoom in on a peak.

plotEIC(xraw, 
        mzrange=c(84.9612-0.01, 84.9612+0.01), 
        rtrange=c(0.7*60,  0.9*60), 
        type="o", cex=3, pch=19, lwd=3
        )

(EIC plot, zoomed in on the peak)

  • So, we can see that this peak has ~15 scans and is about 7 s (~0.1 min) long.
  • We could do the same looking at one of the longer peaks at the end of the run.
  • If the shortest peak we can find is about 0.1 min, I'd go for 0.05 min to be on the safe side. Likewise, multiply the longest peak width you can find by 2 to be safe.
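The safety margins can be written out explicitly — a sketch using the peak widths observed in this data (the longest peak width here is an assumption, chosen so the result matches the peakwidth used earlier):

```r
# Observed peak widths in minutes (longest is assumed for illustration).
shortest_min <- 0.10
longest_min  <- 0.10

# Halve the shortest, double the longest, convert to seconds
# (XCMS always works in seconds).
peakwidth <- c(shortest_min / 2, longest_min * 2) * 60
peakwidth
# [1]  3 12
```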

Peak picking - Peak width

You can also use a 2D plot to try to find short and long peaks. Here I plotted with MzMine (remember to use the continuous/profile mode toggle otherwise it looks very wrong).

Peak picking - ppm


ppm: Relative deviation in the m/z dimension
    In the centWave context it is the maximum allowed deviation between scans when locating regions of interest (ROIs).


Do not use the vendor's number. Like the mileage figures for a car, these numbers are far from real-world scenarios.
We need a range that also holds for the tails of the peaks.


What we choose before was:

ppm = 30


A 2D plot is a reasonable way to look at this.

Peak picking - ppm


Peak picking - ppm

Check the peak boundaries with an EIC.

Peak picking - ppm

Check which peaks in the spectrum are actually of reasonable size.

Peak picking - ppm


((189.074-189.0690)/189.0690)*1e6
[1] 26.44537
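The calculation above generalizes into a tiny helper:

```r
# Relative m/z deviation in ppm between a reference and an observed mass.
ppm_dev <- function(mz_ref, mz_obs) (mz_obs - mz_ref) / mz_ref * 1e6

ppm_dev(189.0690, 189.074)
# [1] 26.44537
```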

Peak picking - profparam


First we need to clear up the confusion about the difference between profile mode data and the profile matrix.

Profile mode: The data is continuous in the m/z dimension.
As opposed to centroid mode data, where you have discrete peaks (sticks) in the m/z dimension.

Profile matrix: A rt (or scan rather) x m/z matrix of m/z slices.
XCMS uses this for some procedures (fillPeaks, but not peakpicking) instead of the full raw data. Think of it as binned data with less m/z resolution.


  • The profparam parameter sets how fine the binning in the profile matrix is going to be.
  • It cannot be set too low, since slices will be combined as needed; only memory usage will increase.
  • For high resolution data the default setting can cause problems. Setting it like this will use 0.01 Da slices.


profparam = list(step=0.01)

Peak picking - snthr

Signal to noise ratio threshold. It really depends on your data, and it is defined differently for centWave and matchedFilter.
It appears to be very unstable even for very similar peaks, so I suggest setting it low and filtering on intensity afterwards if needed.

For centWave I would start with:

snthr = 1E2

For matchedFilter I would start with:

snthr = 5


If low-intensity peaks that you would like to include are missing, lower the value.

Grouping and retention time correction


Is this peak the same as that peak?

Grouping and retention time correction


This step tries to group/match features between samples.
This is probably the most important step!


It is a 3 step procedure:

  1. Features are grouped according to RT, using loose matching parameters.
  2. The matched features are used to model RT shifts → retention time correction/alignment.
  3. Using the corrected RTs, features are matched again with stricter criteria.


There are three methods for retention time correction/alignment.

  1. LOESS: Fits a LOESS curve between retention times. Works on the peak table (therefore fast). The default, and usually there is no reason to look further.
  2. Obiwarp: Warps the raw data (and is hence slow) to a reference sample. Can handle dramatic shifts but often overfits. Use only if needed, and with great attention.
  3. Linear: When all else fails.

Grouping and retention time correction - group

xset_g <- group(xset, 
                bw = 0.2*60, 
                minfrac = 0.5, 
                minsamp = 5, 
                mzwid = 0.01, 
                max = 20, 
                sleep = 0
                )

xset_g
An "xcmsSet" object with 27 samples

Time range: 0.6-720 seconds (0-12 minutes)
Mass range: 60.5133-1418.4605 m/z
Peaks: 97334 (about 3605 per sample)
Peak Groups: 2978 
Sample classes: A, B, C 

Feature detection:
 o Peak picking performed on MS1.
Profile settings: method = bin
                  step = 0.01

Memory usage: 18.4 MB

Grouping and retention time correction - group

Example of grouping density.

Grouping and retention time correction - group bw


bw: bandwidth (standard deviation or half width at half maximum) of gaussian smoothing kernel to apply to the peak density chromatogram.

Perhaps a bit of an obscure definition, but it roughly translates to:
the maximum expected RT deviation across samples.


Overlay the BPI and find the peak with the highest deviation.

Grouping and retention time correction - group bw


Here we have zoomed. Compare peak apexes.


Calculate the deviation.

4.90-4.78
[1] 0.12

Since there could be small peaks with larger variation than what we found we bump it up to 0.2 min.
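The same arithmetic as a few lines of R, with the safety margin made explicit:

```r
apex_sample1_min <- 4.90   # apex RT of the most-shifted peak (minutes)
apex_sample2_min <- 4.78   # the same peak in another sample

deviation_min <- apex_sample1_min - apex_sample2_min   # 0.12 min
bw <- 0.2 * 60   # bumped up to 0.2 min as a margin, given in seconds
```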

Grouping and retention time correction - minfrac/minsamp

minfrac = 0.5
minsamp = 5


minsamp: minimum number of samples a peak should be found in.

minfrac: minimum fraction of samples a peak should be found in.


These are used per sample group.

Files in different subfolders are considered different groups.
The groups can be manipulated by changing:

head(xset@phenoData)
           class
POOL_1_A_1     A
POOL_1_A_2     A
POOL_1_A_3     A
POOL_2_A_1     A
POOL_2_A_2     A
POOL_2_A_3     A

In this dataset we had 3 groups with 9 samples in each. The first time we do the grouping we are pretty liberal with how often it should be found.

Grouping and retention time correction - minfrac/minsamp


Consider your experimental design and group VERY carefully when you set these parameters!

Example:

  1. You put all files in one folder.
  2. You have 2 study groups with 20 samples each, so 40 samples in total.
  3. You set:

    minfrac = 0.7
    minsamp = 30
    
  4. You have now removed everything that is unique to one group!
    So possibly the most important features have been removed.
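The pitfall can be verified with a quick calculation — a sketch assuming the 2 × 20 sample design above:

```r
n_per_group <- 20
n_total     <- 2 * n_per_group   # all files in one folder => one XCMS group

minfrac <- 0.7
minsamp <- 30

# Best case for a feature unique to one study group:
found_in <- n_per_group

survives <- (found_in / n_total >= minfrac) && (found_in >= minsamp)
survives
# [1] FALSE -- the group-unique feature is discarded
```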

Grouping and retention time correction - mzwid


mzwid = 0.01


How close the m/z values need to be to be considered the same compound.

Grouping and retention time correction - max


max: maximum number of groups to identify in a single m/z slice.
Meaning how many isomers do you have across a chromatogram.


max = 20


If you are processing GC data, remember that many fragments will be isomers, so set this high.

Grouping and retention time correction

xset_g_r <- retcor(xset_g,
                   method="peakgroups",
                   missing = 3, 
                   extra = 3, 
                   smooth = "loess",
                   span = 0.6, 
                   plottype = "mdevden",
                   col=xset_g@phenoData[,"class"]
                  )

(RT deviation and density plot)

Grouping and retention time correction - span


span: degree of smoothing for the local polynomial regression fit.
The higher the value, the less flexible the fit.
0 ≈ overfitting
1 ≈ captures only very broad trends

0.4-0.7 is typical in my opinion. 0.2 is default.

We set:

span = 0.6
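The effect of span can be illustrated with base R's loess() on synthetic data — a toy sketch, not XCMS itself:

```r
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)   # smooth trend plus noise

fit_flexible <- loess(y ~ x, span = 0.1)   # follows the noise (overfitting)
fit_stiff    <- loess(y ~ x, span = 0.9)   # captures only the broad trend

# The flexible fit always leaves smaller residuals on its own training
# data -- which is exactly the overfitting risk of a small span.
sum(residuals(fit_flexible)^2) < sum(residuals(fit_stiff)^2)
# [1] TRUE
```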

Grouping and retention time correction - span

Span: 0.01


Span: 0.2
Span: 0.6


Span: 1

Grouping and retention time correction - missing & extra


missing = 3
extra = 3


missing: number of missing samples to allow in retention time correction groups

extra: number of extra peaks to allow in retention time correction groups

Grouping and retention time correction - plotting


plottype = "mdevden",
col=xset_g@phenoData[,"class"]


plottype:

  • “deviation”: Plots only the RT deviations
  • “mdevden”: As “deviation” but adds density plot below



col: We can tell it how to color the points/lines in the deviations plot.

  • Can be a factor, numeric or a character vector of colors.
  • Useful to color by batch.

Grouping and retention time correction - grouping one more time

Now we do grouping again with the corrected retention times.

Notice that we use stricter bw, minfrac and minsamp, since we expect better matching now that the retention times are corrected.

xset_g_r_g <- group(xset_g_r,
                    bw = 5, 
                    minfrac = 0.75, 
                    minsamp = 7, 
                    mzwid = 0.01, 
                    max = 20, 
                    sleep = 0
                   )

xset_g_r_g
An "xcmsSet" object with 27 samples

Time range: 0.5-720.5 seconds (0-12 minutes)
Mass range: 60.5133-1418.4605 m/z
Peaks: 97334 (about 3605 per sample)
Peak Groups: 1627 
Sample classes: A, B, C 

Feature detection:
 o Peak picking performed on MS1.
Profile settings: method = bin
                  step = 0.01

Memory usage: 20.2 MB

Grouping and retention time correction

We can redo the retcor to get the plot using corrected RTs. We won't use the returned object.

xset_g_r_g_r <- retcor(xset_g_r_g,
                       method="peakgroups",
                       missing = 3, 
                       extra = 3, 
                       smooth = "loess",
                       span = 0.6, 
                       plottype = "mdevden",
                       col=xset_g@phenoData[,"class"]
                      )