combineFeatures: combines features in an MSnSet object
This function combines the features in an "MSnSet" instance by applying a summarisation function (see the method argument) to sets of features as defined by a factor (see the groupBy and fcol arguments). Note that the feature names are automatically updated based on the groupBy parameter.
The coefficients of variation are automatically computed and added to the featureData slot. See the cv and cv.norm arguments for details.
If NA values are present, a message will be shown. Details on how missing values affect data aggregation are provided below.
object: An instance of class "MSnSet" whose features will be summarised.
groupBy: A factor, character, numeric or a list of the above defining how to summarise the features. The list must be of length nrow(object). Each element of the list is a vector describing the feature mapping. If the list is named, its names must match featureNames(object). See redundancy.handler for details about the latter.
fun: Deprecated; use method instead.
method: The summarising function. Currently, mean, median, weighted mean, sum, median polish, robust summarisation (using MASS::rlm, implemented in MsCoreUtils::robustSummary()), iPQF (see iPQF for details) and NTR (see NTR for details) are implemented, but user-defined functions can also be supplied. Note that the robust method assumes that the data are already log-transformed.
fcol: Feature meta-data label (fData column name) defining how to summarise the features. It must be present in fvarLabels(object) and, if present, will be used to define groupBy as fData(object)[, fcol]. Note that fcol is ignored if groupBy is present.
redundancy.handler: If groupBy is a list, one of "unique" (default) or "multiple" (ignored otherwise), defining how to handle peptides that can be associated with multiple higher-level features (proteins) upon combination. Using "unique" will only consider uniquely matching features (features matching multiple proteins will be discarded). "multiple" will allow matching to multiple proteins and each feature will be repeatedly tallied for each possible matching protein.
cv: A logical defining if feature coefficients of variation should be computed and stored as feature meta-data. Default is TRUE.
cv.norm: A character defining how to normalise the feature intensities prior to CV calculation. Default is "sum". Use "none" to keep intensities as is. See featureCV for more details.
verbose: A logical indicating whether verbose output is to be printed out.
...: Additional arguments passed to the summarising function.
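The two redundancy.handler modes can be illustrated without MSnbase. The following is a minimal base R sketch, not the package's implementation: it assumes a single sample (the iTRAQ4.114 intensities used in the examples further down), a named list mapping peptides to proteins, and sum as the summarising function. The variable names (x, grpl, uniq, mult) are illustrative only.

```r
## Intensities of five peptides in one sample
x <- c(X19 = 32838.044, X2 = 3715.089, X20 = 34509.686,
       X21 = 21262.148, X22 = 8635.316)
## Peptide-to-protein mapping; X19 and X22 each match two proteins
grpl <- list(X19 = c("A", "B"), X2 = "A", X20 = "A",
             X21 = "C", X22 = c("C", "B"))

## "unique": discard peptides that match more than one protein
keep <- lengths(grpl) == 1
uniq <- tapply(x[keep], unlist(grpl[keep], use.names = FALSE), sum)

## "multiple": repeat each peptide once per matching protein
idx <- rep(seq_along(grpl), lengths(grpl))
mult <- tapply(x[idx], unlist(grpl, use.names = FALSE), sum)

uniq  ## proteins A and C only
mult  ## proteins A, B and C
```

With "unique", protein B disappears entirely because both of its peptides are shared; with "multiple", the shared peptides X19 and X22 contribute to every protein they match.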
A new "MSnSet" instance is returned in which ncol (i.e. the number of samples) is unchanged, but nrow (i.e. the number of features) is now equal to the number of levels in groupBy. The feature metadata (featureData slot) is updated accordingly and only the first occurrence of a feature in the original feature meta-data is kept.
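The "first occurrence" rule for feature metadata can be pictured with a plain data frame. This is a hedged base R sketch under assumed inputs, not the code used by combineFeatures: fd is a hypothetical metadata table, grp an arbitrary grouping factor, and fd.comb an illustrative name for the result.

```r
## Hypothetical feature metadata for five peptides
fd <- data.frame(peptide = c("p1", "p2", "p3", "p4", "p5"),
                 charge = c(2, 3, 2, 2, 3),
                 row.names = c("X19", "X2", "X20", "X21", "X22"),
                 stringsAsFactors = FALSE)
grp <- factor(c(1, 1, 2, 2, 2))

## Keep the metadata row of the first feature seen in each group
first <- !duplicated(grp)
fd.comb <- fd[first, ]
## Rows are now named after the groups rather than the original features
rownames(fd.comb) <- as.character(grp[first])
fd.comb
```

The result has one row per level of grp, so metadata that differs within a group (here, charge) is not summarised; only the first value survives.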
Missing values have different effects depending on the aggregation method employed, as detailed below. See also the examples below.
When using either "sum", "mean", "weighted.mean" or "median", any missing value will be propagated at the higher level. If na.rm = TRUE is used, then the missing value will be ignored.
Missing values will result in an error when using "medpolish", unless na.rm = TRUE is used.
When using robust summarisation ("robust"), individual missing values are excluded prior to fitting the linear model by robust regression. To remove all values in the feature containing the missing values, use filterNA.
The "iPQF" method will fail with an error if missing values are present, which will have to be handled explicitly. See below.
More generally, missing values often need dedicated handling such as filtering (see filterNA) or imputation (see impute).
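The propagation behaviour described for the simple summaries follows from how the underlying base R functions treat NA. As a minimal sketch (base R only, not the package's code; x, s and s.rm are illustrative names), consider two peptides of one protein measured in two samples:

```r
## Two peptides (rows) from one protein, two samples (columns)
x <- matrix(c(32838.044, 3715.089, 37066.058, NA),
            nrow = 2,
            dimnames = list(c("X19", "X2"),
                            c("iTRAQ4.114", "iTRAQ4.115")))

## Default: the NA propagates to the protein-level value
s <- apply(x, 2, sum)
## With na.rm = TRUE, the NA is dropped before summing
s.rm <- apply(x, 2, sum, na.rm = TRUE)

s     ## iTRAQ4.115 is NA
s.rm  ## iTRAQ4.115 is based on X19 alone
```

Note that with na.rm = TRUE the protein-level value in iTRAQ4.115 is computed from fewer peptides than in iTRAQ4.114, which may or may not be appropriate for a given analysis.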
iPQF: a new peptide-to-protein summarization method using peptide spectra characteristics to improve protein quantification. Fischer M, Renard BY. Bioinformatics. 2016 Apr 1;32(7):1040-7. doi:10.1093/bioinformatics/btv675. Epub 2015 Nov 20. PubMed PMID:26589272.
data(msnset)
msnset <- msnset[11:15, ]
exprs(msnset)
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> X19 32838.044 37066.058 41429.627 39700.475
#> X2 3715.089 4254.323 4748.462 5249.904
#> X20 34509.686 34928.747 41911.032 42843.839
#> X21 21262.148 23168.729 25407.068 25949.954
#> X22 8635.316 10036.529 9254.432 7769.749
## arbitrary grouping into two groups
grp <- as.factor(c(1, 1, 2, 2, 2))
msnset.comb <- combineFeatures(msnset, groupBy = grp, method = "sum")
dim(msnset.comb)
#> [1] 2 4
exprs(msnset.comb)
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> 1 36553.13 41320.38 46178.09 44950.38
#> 2 64407.15 68134.01 76572.53 76563.54
fvarLabels(msnset.comb)
#> [1] "spectrum" "ProteinAccession" "ProteinDescription"
#> [4] "PeptideSequence" "file" "retention.time"
#> [7] "precursor.mz" "precursor.intensity" "charge"
#> [10] "peaks.count" "tic" "ionCount"
#> [13] "ms.level" "acquisition.number" "collision.energy"
#> [16] "CV.iTRAQ4.114" "CV.iTRAQ4.115" "CV.iTRAQ4.116"
#> [19] "CV.iTRAQ4.117"
## grouping with a list
grpl <- list(c("A", "B"), "A", "A", "C", c("C", "B"))
## optional naming
names(grpl) <- featureNames(msnset)
exprs(combineFeatures(msnset, groupBy = grpl, method = "sum", redundancy.handler = "unique"))
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> A 38224.78 39183.07 46659.49 48093.74
#> C 21262.15 23168.73 25407.07 25949.95
exprs(combineFeatures(msnset, groupBy = grpl, method = "sum", redundancy.handler = "multiple"))
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> A 71062.82 76249.13 88089.12 87794.22
#> B 41473.36 47102.59 50684.06 47470.22
#> C 29897.46 33205.26 34661.50 33719.70
## missing data
exprs(msnset)[4, 4] <-
exprs(msnset)[2, 2] <- NA
exprs(msnset)
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> X19 32838.044 37066.06 41429.627 39700.475
#> X2 3715.089 NA 4748.462 5249.904
#> X20 34509.686 34928.75 41911.032 42843.839
#> X21 21262.148 23168.73 25407.068 NA
#> X22 8635.316 10036.53 9254.432 7769.749
## NAs propagate in the 115 and 117 channels
exprs(combineFeatures(msnset, grp, "sum"))
#> Your data contains missing values. Please read the relevant section in
#> the combineFeatures manual page for details on the effects of missing
#> values on data aggregation.
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> 1 36553.13 NA 46178.09 44950.38
#> 2 64407.15 68134.01 76572.53 NA
## NAs are removed before summing
exprs(combineFeatures(msnset, grp, "sum", na.rm = TRUE))
#> Your data contains missing values. Please read the relevant section in
#> the combineFeatures manual page for details on the effects of missing
#> values on data aggregation.
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> 1 36553.13 37066.06 46178.09 44950.38
#> 2 64407.15 68134.01 76572.53 50613.59
## using iPQF
data(msnset2)
anyNA(msnset2)
#> [1] FALSE
res <- combineFeatures(msnset2,
                       groupBy = fData(msnset2)$accession,
                       redundancy.handler = "unique",
                       method = "iPQF",
                       low.support.filter = FALSE,
                       ratio.calc = "sum",
                       method.combine = FALSE)
#> The following 1 proteins are only supported by 1 or 2 peptide spectra,
#> hence, protein quantification is not reliable and can only be calculated
#> by the 'mean' in these cases, corresponding protein accessions are:
#> O95678
head(exprs(res))
#> X114.ions X115.ions X116.ions X117.ions
#> O95678 0.2404726 0.2682764 0.2584247 0.2328263
#> P01766 0.2610278 0.2467206 0.2544715 0.2377801
#> P01776 0.2678859 0.2591250 0.2423396 0.2306495
#> P02749 0.2640340 0.2523566 0.2510357 0.2325736
#> P02763 0.2503318 0.2524583 0.2501628 0.2470472
#> P07225 0.2533961 0.2506013 0.2504353 0.2455673
## using robust summarisation
data(msnset) ## reset data
msnset <- log(msnset, 2) ## log2 transform
## Feature X46, in the ENO protein, has one missing value
which(is.na(msnset), arr.ind = TRUE)
#> row col
#> X2 2 2
#> X21 4 4
exprs(msnset["X46", ])
#> Error: error in evaluating the argument 'object' in selecting a method for function 'exprs': subscript out of bounds
## (the object indexed here only contains features X19, X2, X20, X21 and X22)
## Only the missing value in X46 and iTRAQ4.116 will be ignored
res <- combineFeatures(msnset,
                       fcol = "ProteinAccession",
                       method = "robust")
#> Your data contains missing values. Please read the relevant section in
#> the combineFeatures manual page for details on the effects of missing
#> values on data aggregation.
tail(exprs(res))
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> ECA1032 15.00308 15.17781 15.33838 15.27687
#> ECA1104 14.37600 14.49989 14.63294 NA
#> ECA1294 11.85918 NA 12.21324 12.35808
#> ECA3356 13.07603 13.29297 13.17593 12.92365
#> ECA4514 15.07471 15.09213 15.35504 15.38680
msnset2 <- filterNA(msnset) ## remove features with missing value(s)
res2 <- combineFeatures(msnset2,
                        fcol = "ProteinAccession",
                        method = "robust")
## Here, the values for ENO are different because the whole feature
## X46 that contained the missing value was removed prior to fitting.
tail(exprs(res2))
#> iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
#> ECA1032 15.00308 15.17781 15.33838 15.27687
#> ECA3356 13.07603 13.29297 13.17593 12.92365
#> ECA4514 15.07471 15.09213 15.35504 15.38680