Combines features in an MSnSet object
combineFeatures.Rd
This function combines the features in an "MSnSet" instance by applying a summarisation function (see the method argument) to sets of features defined by a factor (see the fcol and groupBy arguments). Note that the feature names are automatically updated based on the groupBy parameter.
The coefficients of variation are automatically computed and added to the featureData slot. See the cv and cv.norm arguments for details.
If NA values are present, a message will be shown. Details on how missing values affect data aggregation are provided below.
Arguments
- object
An instance of class "MSnSet" whose features will be summarised.
- groupBy
A factor, character, numeric or a list of the above defining how to summarise the features. The list must be of length nrow(object). Each element of the list is a vector describing the feature mapping. If the list is named, its names must match featureNames(object). See redundancy.handler for details about the latter.
- fun
Deprecated; use method instead.
- method
The summarising function. Currently, mean, median, weighted mean, sum, median polish, robust summarisation (using MASS::rlm, implemented in MsCoreUtils::robustSummary()), iPQF (see iPQF for details) and NTR (see NTR for details) are implemented, but user-defined functions can also be supplied. Note that the robust methods assume that the data are already log-transformed.
- fcol
Feature meta-data label (fData column name) defining how to summarise the features. It must be present in fvarLabels(object) and, if present, will be used to define groupBy as fData(object)[, fcol]. Note that fcol is ignored if groupBy is present.
- redundancy.handler
If groupBy is a list, one of "unique" (default) or "multiple" (ignored otherwise) defining how to handle peptides that can be associated with multiple higher-level features (proteins) upon combination. Using "unique" will only consider uniquely matching features (features matching multiple proteins will be discarded). "multiple" will allow matching to multiple proteins, and each feature will be repeatedly tallied for each possible matching protein.
- cv
A logical defining if feature coefficients of variation should be computed and stored as feature meta-data. Default is TRUE.
- cv.norm
A character defining how to normalise the feature intensities prior to CV calculation. Default is sum. Use none to keep intensities as is. See featureCV for more details.
- verbose
A logical indicating whether verbose output is to be printed out.
- ...
Additional arguments passed to the summarising function (see method).
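The difference between the two redundancy.handler settings can be pictured with a small base R sketch. It does not call combineFeatures and is not the package's implementation; the feature names f1 to f5 and the derived objects uniq and long are made up for illustration.

```r
## Hypothetical feature-to-group mapping: f1 and f5 map to two groups each
grpl <- list(f1 = c("A", "B"), f2 = "A", f3 = "A", f4 = "C", f5 = c("C", "B"))

## "unique": keep only the features that map to exactly one group
uniq <- grpl[lengths(grpl) == 1]
unlist(uniq)

## "multiple": one entry per feature/group pair, so shared features
## are counted once for every group they belong to
long <- data.frame(feature = rep(names(grpl), lengths(grpl)),
                   group = unlist(grpl, use.names = FALSE))
table(long$group)
```

Under "unique", group B receives no features at all in this toy mapping, which is why a group can disappear from the combined result.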
Value
A new "MSnSet" instance is returned in which ncol (i.e. the number of samples) is unchanged, but nrow (i.e. the number of features) is now equal to the number of levels in groupBy. The feature metadata (featureData slot) is updated accordingly and only the first occurrence of a feature in the original feature meta-data is kept.
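As a rough illustration of what keeping the first occurrence means for the feature metadata, the following base R sketch uses duplicated() on a toy data frame. The column names and values are invented, and this is only an assumed picture of the behaviour described above, not the code used by the package.

```r
## Toy feature metadata (hypothetical columns and values)
fd <- data.frame(accession = c("P1", "P1", "P2", "P2", "P2"),
                 sequence = c("AAK", "CCR", "DDK", "EER", "FFK"),
                 row.names = paste0("f", 1:5))
grp <- factor(fd$accession)

## Keep the first metadata row encountered for each group
keep <- !duplicated(grp)
first <- fd[keep, ]
rownames(first) <- as.character(grp[keep])
first

## One metadata row per level of the grouping factor
nrow(first) == nlevels(grp)
```

Feature-level annotations that differ within a group (here, the sequence column) are therefore reduced to the value of the first feature in that group.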
Details
Missing values have different effects depending on the aggregation method employed, as detailed below. See also the examples below.
- When using either "sum", "mean", "weighted.mean" or "median", any missing value will be propagated at the higher level. If na.rm = TRUE is used, then the missing value will be ignored.
- Missing values will result in an error when using "medpolish", unless na.rm = TRUE is used.
- When using robust summarisation ("robust"), individual missing values are excluded prior to fitting the linear model by robust regression. To remove all values in the feature containing the missing values, use filterNA.
- The "iPQF" method will fail with an error if missing values are present, which will have to be handled explicitly. See below.
More generally, missing values often need dedicated handling such as filtering (see filterNA) or imputation (see impute).
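The propagation rule for "sum" (and similarly for the mean and median) mirrors how the underlying base R functions treat NA. The sketch below reproduces it on a toy matrix with plain tapply(). The helper agg() and the values are hypothetical and stand in for the package's own aggregation code.

```r
## Toy intensity matrix: 5 features x 2 samples, one missing value
x <- matrix(c(1, 2, 3, 4, 5,
              10, NA, 30, 40, 50),
            ncol = 2,
            dimnames = list(paste0("f", 1:5), c("s1", "s2")))
grp <- factor(c(1, 1, 2, 2, 2))

## Hypothetical helper: aggregate each sample (column) by group
agg <- function(x, grp, fun, ...)
    apply(x, 2, function(col) tapply(col, grp, fun, ...))

## The NA in f2/s2 propagates to group 1 in sample s2
agg(x, grp, sum)

## With na.rm = TRUE the missing value is dropped before summing
agg(x, grp, sum, na.rm = TRUE)
```

Note that with na.rm = TRUE the group total in s2 is computed from fewer features than in s1, so the two columns are no longer strictly comparable. This is one reason to consider filtering or imputation first.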
Author
Laurent Gatto with contributions from Martina Fischer for iPQF and Ludger Goeminne, Adriaan Sticker and Lieven Clement for robust.
References
iPQF: a new peptide-to-protein summarization method using peptide spectra characteristics to improve protein quantification. Fischer M, Renard BY. Bioinformatics. 2016 Apr 1;32(7):1040-7. doi:10.1093/bioinformatics/btv675. Epub 2015 Nov 20. PubMed PMID:26589272.
Examples
data(msnset)
msnset <- msnset[11:15, ]
exprs(msnset)
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> X19 32838.044 37066.058 41429.627 39700.475
#> X2 3715.089 4254.323 4748.462 5249.904
#> X20 34509.686 34928.747 41911.032 42843.839
#> X21 21262.148 23168.729 25407.068 25949.954
#> X22 8635.316 10036.529 9254.432 7769.749
## arbitrary grouping into two groups
grp <- as.factor(c(1, 1, 2, 2, 2))
msnset.comb <- combineFeatures(msnset, groupBy = grp, method = "sum")
dim(msnset.comb)
#> [1] 2 4
exprs(msnset.comb)
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> 1 36553.13 41320.38 46178.09 44950.38
#> 2 64407.15 68134.01 76572.53 76563.54
fvarLabels(msnset.comb)
#> [1] "spectrum" "ProteinAccession" "ProteinDescription"
#> [4] "PeptideSequence" "file" "retention.time"
#> [7] "precursor.mz" "precursor.intensity" "charge"
#> [10] "peaks.count" "tic" "ionCount"
#> [13] "ms.level" "acquisition.number" "collision.energy"
#> [16] "CV.iTRAQ4.114" "CV.iTRAQ4.115" "CV.iTRAQ4.116"
#> [19] "CV.iTRAQ4.117"
## grouping with a list
grpl <- list(c("A", "B"), "A", "A", "C", c("C", "B"))
## optional naming
names(grpl) <- featureNames(msnset)
exprs(combineFeatures(msnset, groupBy = grpl, method = "sum", redundancy.handler = "unique"))
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> A 38224.78 39183.07 46659.49 48093.74
#> C 21262.15 23168.73 25407.07 25949.95
exprs(combineFeatures(msnset, groupBy = grpl, method = "sum", redundancy.handler = "multiple"))
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> A 71062.82 76249.13 88089.12 87794.22
#> B 41473.36 47102.59 50684.06 47470.22
#> C 29897.46 33205.26 34661.50 33719.70
## missing data
exprs(msnset)[4, 4] <-
exprs(msnset)[2, 2] <- NA
exprs(msnset)
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> X19 32838.044 37066.06 41429.627 39700.475
#> X2 3715.089 NA 4748.462 5249.904
#> X20 34509.686 34928.75 41911.032 42843.839
#> X21 21262.148 23168.73 25407.068 NA
#> X22 8635.316 10036.53 9254.432 7769.749
## NAs propagate in the 115 and 117 channels
exprs(combineFeatures(msnset, grp, "sum"))
#> Your data contains missing values. Please read the relevant section in
#> the combineFeatures manual page for details on the effects of missing
#> values on data aggregation.
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> 1 36553.13 NA 46178.09 44950.38
#> 2 64407.15 68134.01 76572.53 NA
## NAs are removed before summing
exprs(combineFeatures(msnset, grp, "sum", na.rm = TRUE))
#> Your data contains missing values. Please read the relevant section in
#> the combineFeatures manual page for details on the effects of missing
#> values on data aggregation.
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> 1 36553.13 37066.06 46178.09 44950.38
#> 2 64407.15 68134.01 76572.53 50613.59
## using iPQF
data(msnset2)
anyNA(msnset2)
#> [1] FALSE
res <- combineFeatures(msnset2,
                       groupBy = fData(msnset2)$accession,
                       redundancy.handler = "unique",
                       method = "iPQF",
                       low.support.filter = FALSE,
                       ratio.calc = "sum",
                       method.combine = FALSE)
#> The following 1 proteins are only supported by 1 or 2 peptide spectra,
#> hence, protein quantification is not reliable and can only be calculated
#> by the 'mean' in these cases, corresponding protein accessions are:
#> O95678
head(exprs(res))
#> X114.ions X115.ions X116.ions X117.ions
#> O95678 0.2404726 0.2682764 0.2584247 0.2328263
#> P01766 0.2610278 0.2467206 0.2544715 0.2377801
#> P01776 0.2678859 0.2591250 0.2423396 0.2306495
#> P02749 0.2640340 0.2523566 0.2510357 0.2325736
#> P02763 0.2503318 0.2524583 0.2501628 0.2470472
#> P07225 0.2533961 0.2506013 0.2504353 0.2455673
## using robust summarisation
data(msnset) ## reset data
msnset <- log(msnset, 2) ## log2 transform
## Feature X46, in the ENO protein, has one missing value
which(is.na(msnset), arr.ind = TRUE)
#> row col
#> X2 2 2
#> X21 4 4
exprs(msnset["X46", ])
#> Error in (function (cond) .Internal(C_tryCatchHelper(addr, 1L, cond)))(...): error in evaluating the argument 'object' in selecting a method for function 'exprs': subscript out of bounds
## Only the missing value in X46 and iTRAQ4.116 will be ignored
res <- combineFeatures(msnset,
                       fcol = "ProteinAccession",
                       method = "robust")
#> Your data contains missing values. Please read the relevant section in
#> the combineFeatures manual page for details on the effects of missing
#> values on data aggregation.
tail(exprs(res))
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> ECA1032 15.00308 15.17781 15.33838 15.27687
#> ECA1104 14.37600 14.49989 14.63294 NA
#> ECA1294 11.85918 NA 12.21324 12.35808
#> ECA3356 13.07603 13.29297 13.17593 12.92365
#> ECA4514 15.07471 15.09213 15.35504 15.38680
msnset2 <- filterNA(msnset) ## remove features with missing value(s)
res2 <- combineFeatures(msnset2,
                        fcol = "ProteinAccession",
                        method = "robust")
## Here, the values for ENO are different because the whole feature
## X46 that contained the missing value was removed prior to fitting.
tail(exprs(res2))
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> ECA1032 15.00308 15.17781 15.33838 15.27687
#> ECA3356 13.07603 13.29297 13.17593 12.92365
#> ECA4514 15.07471 15.09213 15.35504 15.38680