XCMS workshop 2017

Jan Stanstrup

@JanStanstrup
stanstrup@gmail.com

2017/06/25

Steno Diabetes Center Copenhagen, Denmark

Outline


  1. What does the data look like? Why pre-processing?
  2. Peak picking
  3. Grouping peaks
  4. Retention time correction
  5. Filling missing values

  6. Checking the results and troubleshooting
  7. Getting quick stats

The data used in this presentation was kindly provided by Drs. Richard and Nichole Reisdorph and their Mass Spectrometry Laboratory

This presentation and sample data can be found at github.com/stanstrup/XCMS-course

What does the data look like? Why pre-processing?


Why are we here and why do we need to do this?

What does the data look like? Why pre-processing?


What does the data look like? Why pre-processing?






Simple to do. Difficult to do well. Impossible to do perfectly.

XCMS


  • R package
  • First published in 2006[1] by Colin A. Smith et al. at The Scripps Research Institute (Patti Lab).
  • Open source: https://github.com/sneumann/xcms
  • The 'X' in XCMS denotes that it can be used for any chromatography (GC/LC)
  • Maintained for many years primarily by Steffen Neumann and his group
  • New major version being developed by Johannes Rainer (EURAC Research, Bolzano, Italy)

Peak picking


Is this a peak or just noise?

Find the files


First we locate the files. They need to be in an open format:

  • mzML (newest)
  • mzData
  • mzXML (most widely supported)
  • netCDF (obsolete, last resort)


Let's make a list of the files we want to pre-process.

files <- list.files("_data", recursive = TRUE, full.names = TRUE, pattern = "\\.mzXML$")
head(files,5)
[1] "_data/A/POOL_1_A_1.mzXML" "_data/A/POOL_1_A_2.mzXML"
[3] "_data/A/POOL_1_A_3.mzXML" "_data/A/POOL_2_A_1.mzXML"
[5] "_data/A/POOL_2_A_2.mzXML"


Blanks interfere with peak alignment since they contain so few peaks.
So it is better to remove them before starting. The same goes for compound mixtures.
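The blank/mixture removal can be done with simple pattern matching on the file names — a minimal sketch, assuming (hypothetically) that such files carry "BLANK" or "MIX" in their names; adapt the pattern to your own naming scheme:

```r
# Hypothetical file list; the BLANK/MIX naming is an assumption.
files <- c("_data/A/POOL_1_A_1.mzXML",
           "_data/A/BLANK_1.mzXML",
           "_data/A/MIX_1.mzXML")

# Keep only files whose name does not contain BLANK or MIX.
files <- files[!grepl("BLANK|MIX", basename(files))]
files
# [1] "_data/A/POOL_1_A_1.mzXML"
```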

Peak picking



Now we can do peak picking. We need to load some packages first.

suppressMessages(library(xcms)) # suppress avoids a flood of messages
suppressMessages(library(BiocParallel))


Now let's do the peak picking.
There are many parameters to set that we will explain in a sec.
There are a few more settings available but these are the most important.

xset <- xcmsSet(files, 
                BPPARAM  = SerialParam(),
                method = 'centWave',
                prefilter = c(3,1E3),
                ppm = 30,
                snthr = 1E2,
                profparam = list(step=0.01),
                peakwidth = c(0.05*60,0.20*60),
                verbose.columns = TRUE,
                fitgauss = TRUE
                )

Peak picking - BPPARAM


By default the newest XCMS uses all available cores.
Here we set it to use a single core (SerialParam) simply because that works better for generating this presentation.

To control the number of cores you could for example do:

library(parallel)

and set:

BPPARAM  = SnowParam(detectCores()-1, progressbar = TRUE)



  • SnowParam() on Windows
  • MulticoreParam() on Linux
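Putting the pieces together, backend selection could look like this — a sketch assuming BiocParallel is installed; adjust the worker count to your machine:

```r
library(parallel)       # detectCores()
library(BiocParallel)   # SerialParam/SnowParam/MulticoreParam

# Leave one core free for the rest of the system.
n_workers <- max(1, detectCores() - 1)

# SnowParam works on all platforms; MulticoreParam (fork-based)
# is only available on Linux/macOS.
BPPARAM <- if (.Platform$OS.type == "windows") {
  SnowParam(n_workers, progressbar = TRUE)
} else {
  MulticoreParam(n_workers, progressbar = TRUE)
}
```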

Peak picking - which method?


matchedFilter[1]

  • Original algorithm
  • Developed for nominal mass instruments
    • Therefore good for low accuracy data
    • Also works for accurate mass data
  • Good for chromatographically noisy data
  • Fixed width peak fitting
  • Bins data in m/z dimension


centWave[2]

  • Developed for accurate mass instruments
  • Handles different peak widths better
  • prefilter makes it faster
  • Usually the best if the data is “nice”
  • Bin-less approach


massifquant[3]

  • Developed (among other reasons) to better handle variable mass accuracy and peak shapes
  • Bin-less approach
  • No personal experience but from the paper it appears that:
    • Good at finding low intensity and low “quality” peaks
    • More prone to false positives (picking noise) than centWave

Peak picking - What peak picking does

Here we will go through peak picking with the centWave algorithm[2].

  • First it finds regions of interest (ROIs).
    Those are where there could be a peak: Stable mass for a certain time.
  • Next the peak shape is evaluated.
  • The peak is expanded beyond the ROI if needed.

Figure from Tautenhahn, Böttcher and Neumann, 2008 (centWave paper)

Peak picking - What peak picking does


Figure from Tautenhahn, Böttcher and Neumann, 2008 (centWave paper)

Peak picking - Prefilter


prefilter = c(3,1E3)



  • Says to only consider regions where there are at least 3 scans with intensity above 1000.
  • Check your peak widths to see how many scans per peak you are sure to have.
  • For the intensity, set it to approximately the intensity of the random noise you see in a spectrum.
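A back-of-the-envelope check for the first prefilter element — how many scans a narrow peak spans — could look like this (the peak width and scan rate are assumed values for illustration):

```r
# Assumed values: narrowest peak ~6 s wide, instrument scanning at ~2.5 Hz.
narrowest_peak_s <- 6
scans_per_sec    <- 2.5

scans_per_peak <- narrowest_peak_s * scans_per_sec
scans_per_peak
# [1] 15

# prefilter = c(k, I): pick k comfortably below scans_per_peak and
# I around the random-noise intensity seen in a spectrum.
prefilter <- c(3, 1e3)
```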

Peak picking - Peak width

centWave asks you to set the minimum and maximum peak widths you have in your data.

You set it in seconds (always seconds in XCMS) and this is what we did before.

peakwidth = c(0.05*60,0.20*60)

It is a vector of length 2 with the min and max length.

To determine reasonable values we need to look at the raw data (you'd probably use something interactive such as MzMine (or future XCMS!)).
Here is a TIC:

xraw <- xcmsRaw(xset@filepaths[1])
plotEIC(xraw, mzrange = c(0, 2000), rtrange = c(0, 12*60))

(TIC plot of the first file)

Peak picking - Peak width

Lets zoom in on a peak.

plotEIC(xraw, 
        mzrange=c(84.9612-0.01, 84.9612+0.01), 
        rtrange=c(0.7*60,  0.9*60), 
        type="o", cex=3, pch=19, lwd=3
        )

(EIC plot, zoomed in on the peak)

  • So, we can see that this peak has ~15 scans and is about 7 s (~0.1 min) long.
  • We could do the same looking at one of the longer peaks at the end of the run.
  • If the shortest peak we can find is about 0.1 min, I'd go for 0.05 min to be on the safe side. Likewise, multiply the longest peak width you can find by 2 to be safe.
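The safety margins can be written out explicitly — a sketch using the peak widths observed in this data (the longest peak width here is an assumption, chosen so the result matches the peakwidth used earlier):

```r
# Observed peak widths in minutes (longest is assumed for illustration).
shortest_min <- 0.10
longest_min  <- 0.10

# Halve the shortest, double the longest, convert to seconds
# (XCMS always works in seconds).
peakwidth <- c(shortest_min / 2, longest_min * 2) * 60
peakwidth
# [1]  3 12
```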

Peak picking - Peak width

You can also use a 2D plot to try to find short and long peaks. Here I plotted with MzMine (remember to use the continuous/profile mode toggle otherwise it looks very wrong).

Peak picking - ppm


ppm: Relative deviation in the m/z dimension
    In the centWave context it is the maximum allowed deviation between scans when locating regions of interest (ROIs).


Do not use the vendor's number. Like the mileage figures for a car, these numbers are far from real-world scenarios.
We need a range that also holds for the tails of the peaks.


What we choose before was:

ppm = 30


A 2D plot is a reasonable way to look at this.

Peak picking - ppm


Peak picking - ppm

Check the peak boundaries with an EIC.

Peak picking - ppm

Check which peaks in the spectrum are actually of reasonable size.

Peak picking - ppm


((189.074-189.0690)/189.0690)*1e6
[1] 26.44537
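The calculation above generalizes into a tiny helper:

```r
# Relative m/z deviation in ppm between a reference and an observed mass.
ppm_dev <- function(mz_ref, mz_obs) (mz_obs - mz_ref) / mz_ref * 1e6

ppm_dev(189.0690, 189.074)
# [1] 26.44537
```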

Peak picking - profparam


First we need to clear up the confusion about the difference between profile mode data and the profile matrix.

Profile mode: The data is continuous in the m/z dimension.
As opposed to centroid mode data, where you have discrete peaks (sticks) in the m/z dimension.

Profile matrix: A rt (or scan rather) x m/z matrix of m/z slices.
XCMS uses this for some procedures (fillPeaks, but not peakpicking) instead of the full raw data. Think of it as binned data with less m/z resolution.


  • The profparam parameter sets how fine the binning in the profile matrix is going to be.
  • It cannot be set too low, since slices will be combined as needed; only memory usage will increase.
  • For high resolution data the default setting can cause problems. Setting it like this will use 0.01 Da slices.


profparam = list(step=0.01)

Peak picking - snthr

Signal to noise ratio threshold. It really depends on your data, and it is defined differently for centWave and matchedFilter.
It appears to be very unstable even for very similar peaks, so I suggest setting it low and filtering on intensity afterwards if needed.

For centWave I would start with:

snthr = 1E2

For matchedFilter I would start with:

snthr = 5


If low-intensity peaks that you would like to include are missing, lower the value.

Grouping and retention time correction


Is this peak the same as that peak?

Grouping and retention time correction


This step tries to group/match features between samples.
This is probably the most important step!


It is a 3 step procedure:

  1. Features are grouped according to RT, using loose matching parameters.
  2. The matched features are used to model RT shifts → retention time correction/alignment.
  3. Using the corrected RTs, features are matched again with stricter criteria.


There are three methods for retention time correction/alignment.

  1. LOESS: Fits a LOESS curve between retention times. Works on the peak table (therefore fast). The default, and usually there is no reason to look further.
  2. Obiwarp: Warps the raw data (and is hence slow) to a reference sample. Can handle dramatic shifts but often overfits. Use only if needed, and with great attention.
  3. Linear: When all else fails.

Grouping and retention time correction - group

xset_g <- group(xset, 
                bw = 0.2*60, 
                minfrac = 0.5, 
                minsamp = 5, 
                mzwid = 0.01, 
                max = 20, 
                sleep = 0
                )

xset_g
An "xcmsSet" object with 27 samples

Time range: 0.6-720 seconds (0-12 minutes)
Mass range: 60.5133-1418.4605 m/z
Peaks: 97334 (about 3605 per sample)
Peak Groups: 2978 
Sample classes: A, B, C 

Feature detection:
 o Peak picking performed on MS1.
Profile settings: method = bin
                  step = 0.01

Memory usage: 18.4 MB

Grouping and retention time correction - group

Example of grouping density.

Grouping and retention time correction - group bw


bw: bandwidth (standard deviation or half width at half maximum) of gaussian smoothing kernel to apply to the peak density chromatogram.

Perhaps a bit of an obscure definition, but it roughly translates to:
the maximum expected RT deviation across samples.


Overlay the BPI and find the peak with the highest deviation.

Grouping and retention time correction - group bw


Here we have zoomed. Compare peak apexes.


Calculate the deviation.

4.90-4.78
[1] 0.12

Since there could be small peaks with larger variation than what we found we bump it up to 0.2 min.
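The same arithmetic as a few lines of R, with the safety margin made explicit:

```r
apex_sample1_min <- 4.90   # apex RT of the most-shifted peak (minutes)
apex_sample2_min <- 4.78   # the same peak in another sample

deviation_min <- apex_sample1_min - apex_sample2_min   # 0.12 min
bw <- 0.2 * 60   # bumped up to 0.2 min as a margin, given in seconds
```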

Grouping and retention time correction - minfrac/minsamp

minfrac = 0.5
minsamp = 5


minsamp: minimum number of samples a peak should be found in.

minfrac: minimum fraction of samples a peak should be found in.


These are used per sample group.

Files in different subfolders are considered different groups.
The groups can be manipulated by changing:

head(xset@phenoData)
           class
POOL_1_A_1     A
POOL_1_A_2     A
POOL_1_A_3     A
POOL_2_A_1     A
POOL_2_A_2     A
POOL_2_A_3     A

In this dataset we had 3 groups with 9 samples in each. The first time we do the grouping we are pretty liberal with how often it should be found.

Grouping and retention time correction - minfrac/minsamp


Consider your experimental design and group VERY carefully when you set these parameters!

Example:

  1. You put all files in one folder.
  2. You have 2 study groups with 20 samples each, so 40 samples in total.
  3. You set:

    minfrac = 0.7
    minsamp = 30
    
  4. You have now removed everything that is unique to one group!
    So possibly the most important features have been removed.
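The pitfall can be verified with a quick calculation — a sketch assuming the 2 × 20 sample design above:

```r
n_per_group <- 20
n_total     <- 2 * n_per_group   # all files in one folder => one XCMS group

minfrac <- 0.7
minsamp <- 30

# Best case for a feature unique to one study group:
found_in <- n_per_group

survives <- (found_in / n_total >= minfrac) && (found_in >= minsamp)
survives
# [1] FALSE -- the group-unique feature is discarded
```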

Grouping and retention time correction - mzwid


mzwid = 0.01


How close the m/z values need to be to be considered the same compound.

Grouping and retention time correction - max


max: maximum number of groups to identify in a single m/z slice.
Meaning how many isomers do you have across a chromatogram.


max = 20


If you are processing GC data, remember that many fragments will be isomers, so set this high.

Grouping and retention time correction

xset_g_r <- retcor(xset_g,
                   method="peakgroups",
                   missing = 3, 
                   extra = 3, 
                   smooth = "loess",
                   span = 0.6, 
                   plottype = "mdevden",
                   col=xset_g@phenoData[,"class"]
                  )

(RT deviation and density plot)

Grouping and retention time correction - span


span: degree of smoothing for the local polynomial regression fit.
The higher the value, the less flexible the fit.
0 ≈ overfitting
1 ≈ captures only very broad trends

0.4-0.7 is typical in my opinion. 0.2 is default.

We set:

span = 0.6
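The effect of span can be illustrated with base R's loess() on synthetic data — a toy sketch, not XCMS itself:

```r
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)   # smooth trend plus noise

fit_flexible <- loess(y ~ x, span = 0.1)   # follows the noise (overfitting)
fit_stiff    <- loess(y ~ x, span = 0.9)   # captures only the broad trend

# The flexible fit always leaves smaller residuals on its own training
# data -- which is exactly the overfitting risk of a small span.
sum(residuals(fit_flexible)^2) < sum(residuals(fit_stiff)^2)
# [1] TRUE
```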

Grouping and retention time correction - span

Span: 0.01


Span: 0.2
Span: 0.6


Span: 1

Grouping and retention time correction - missing & extra


missing = 3
extra = 3


missing: number of missing samples to allow in retention time correction groups

extra: number of extra peaks to allow in retention time correction groups

Grouping and retention time correction - plotting


plottype = "mdevden",
col=xset_g@phenoData[,"class"]


plottype:

  • “deviation”: Plots only the RT deviations
  • “mdevden”: As “deviation” but adds density plot below



col: We can tell it how to color the points/lines in the deviations plot.

  • Can be a factor, numeric or a character vector of colors.
  • Useful to color by batch.

Grouping and retention time correction - grouping one more time

Now we do grouping again with the corrected retention times.

Notice that we use stricter bw, minfrac and minsamp, since we expect better matching now that the retention times are corrected.

xset_g_r_g <- group(xset_g_r,
                    bw = 5, 
                    minfrac = 0.75, 
                    minsamp = 7, 
                    mzwid = 0.01, 
                    max = 20, 
                    sleep = 0
                   )

xset_g_r_g
An "xcmsSet" object with 27 samples

Time range: 0.5-720.5 seconds (0-12 minutes)
Mass range: 60.5133-1418.4605 m/z
Peaks: 97334 (about 3605 per sample)
Peak Groups: 1627 
Sample classes: A, B, C 

Feature detection:
 o Peak picking performed on MS1.
Profile settings: method = bin
                  step = 0.01

Memory usage: 20.2 MB

Grouping and retention time correction

We can redo the retcor to get the plot using corrected RTs. We won't use the returned object.

xset_g_r_g_r <- retcor(xset_g_r_g,
                       method="peakgroups",
                       missing = 3, 
                       extra = 3, 
                       smooth = "loess",
                       span = 0.6, 
                       plottype = "mdevden",
                       col=xset_g@phenoData[,"class"]
                      )