Adds Identification Data
addIdentificationData-methods.Rd
These methods add identification data to a raw MS experiment (an "MSnExp" object) or to quantitative data (an "MSnSet" object). The identification data must be available as an mzIdentML file (passed as filenames or directly as an identification object) or, alternatively, can be passed as an arbitrary data.frame. See the Methods section for details.
Details
The featureData slots in an "MSnExp" or an "MSnSet" instance provide only one row per MS2 spectrum, but the identification is not always bijective: a single spectrum can be matched by several PSMs. Prior to addition, the identification data is therefore filtered as documented in the filterIdentificationDataFrame function: (1) only PSMs matching the regular (non-decoy) database are retained; (2) PSMs of rank greater than 1 are discarded; and (3) only proteotypic peptides are kept.
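The three filtering steps can be sketched in base R on a made-up PSM table. This is only an illustration under stated assumptions: the data are invented, the column names simply reuse the mzR-style defaults listed on this page, "proteotypic" is taken to mean a peptide sequence that maps to exactly one accession, and the order of the steps is a guess. It is not the implementation of filterIdentificationDataFrame.

```r
## Toy PSM table (invented data, column names borrowed from the defaults).
psms <- data.frame(
    spectrumID     = c("s1", "s1", "s2", "s3", "s4", "s4"),
    sequence       = c("PEPA", "PEPB", "PEPC", "PEPD", "PEPE", "PEPE"),
    isDecoy        = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
    rank           = c(1, 2, 1, 1, 1, 1),
    DatabaseAccess = c("P1", "P1", "XXX_P2", "P3", "P4", "P5"),
    stringsAsFactors = FALSE)

## (1) retain only PSMs matched in the regular (non-decoy) database
keep <- psms[!psms$isDecoy, ]

## (2) discard PSMs of rank greater than 1
keep <- keep[keep$rank <= 1, ]

## (3) keep only peptides that map to a single accession
##     (assumed meaning of "proteotypic" for this sketch)
nacc <- tapply(keep$DatabaseAccess, keep$sequence,
               function(x) length(unique(x)))
proteotypic <- as.vector(nacc[keep$sequence] == 1)
keep <- keep[proteotypic, ]

keep$sequence
```

In this toy table only the PSMs for PEPA and PEPD survive: PEPB is rank 2, PEPC is a decoy hit, and PEPE maps to two accessions.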
If, after filtering, more than one PSM per spectrum is still present, these are combined (reduced, see reduce,data.frame-method) into a single row, with their values separated by a semicolon. As a side effect, feature variables that are being reduced are converted to characters. See the reduce manual page for examples.
See also the section about identification data in the MSnbase-demo vignette for details and additional examples.
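The reduction step and its side effect on column types can be illustrated with a small base R sketch. The helper collapse_by_key below is hypothetical and written only for this example; it is not the reduce,data.frame-method from MSnbase, and details such as pasting only distinct values are assumptions.

```r
## Toy table with two PSMs for spectrum "s1" (invented data).
psms <- data.frame(
    spectrumID = c("s1", "s1", "s2"),
    sequence   = c("PEPA", "PEPB", "PEPC"),
    score      = c(10.5, 8.2, 12),
    stringsAsFactors = FALSE)

## Hypothetical helper: collapse all rows that share a key into a single
## row, pasting the distinct values of each column together with ";".
collapse_by_key <- function(x, key) {
    groups <- split(x, x[[key]])
    rows <- lapply(groups, function(g) {
        vals <- lapply(g, function(col) paste(unique(col), collapse = ";"))
        as.data.frame(vals, stringsAsFactors = FALSE)
    })
    do.call(rbind, rows)
}

reduced <- collapse_by_key(psms, "spectrumID")
reduced
## The numeric score column is now a character column.
class(reduced$score)
```

Because the values are pasted into one string per spectrum, a numeric column such as score can no longer stay numeric, which is the conversion to characters described above.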
After addition of the identification data, new feature variables are created. The column nprot contains the number of members in the protein group; the columns accession and description contain a semicolon-separated list of all matches. The columns npsm.prot and npep.prot represent the number of PSMs and peptides that were matched to a particular protein group. The column npsm.pep indicates how many PSMs were attributed to a peptide (as defined by its sequence pepseq). All these values are re-calculated after filtering and reduction.
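As a rough illustration of what the count variables express, the sketch below derives per-peptide and per-protein counts on an invented table with base R. The data and the use of ave are assumptions made for this example; MSnbase computes these variables internally and may do so differently.

```r
## Toy table: one row per PSM (invented data).
psms <- data.frame(
    sequence  = c("PEPA", "PEPA", "PEPB", "PEPC"),
    accession = c("P1", "P1", "P1", "P2"),
    stringsAsFactors = FALSE)
rows <- seq_len(nrow(psms))

## number of PSMs attributed to each peptide sequence
psms$npsm.pep <- ave(rows, psms$sequence, FUN = length)

## number of PSMs matched to each protein
psms$npsm.prot <- ave(rows, psms$accession, FUN = length)

## number of distinct peptides matched to each protein
psms$npep.prot <- ave(as.integer(factor(psms$sequence)), psms$accession,
                      FUN = function(i) length(unique(i)))

psms
```

Here P1 is supported by three PSMs from two distinct peptides, while P2 is supported by a single PSM from a single peptide.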
Methods
signature(object = "MSnExp", id = "character", ...)
Adds the identification data stored in mzIdentML files to an "MSnExp" instance. The method handles one or multiple mzIdentML files provided via id. id has to be a character vector of valid filenames. See below for additional arguments.
signature(object = "MSnExp", id = "mzID", ...)
Same as above but id is an mzID object generated by mzID::mzID. See below for additional arguments.
signature(object = "MSnExp", id = "mzIDCollection", ...)
Same as above but id is an mzIDCollection object. See below for additional arguments.
signature(object = "MSnExp", id = "mzRident", ...)
Same as above but id is an mzRident object generated by mzR::openIdfile. See below for additional arguments.
signature(object = "MSnExp", id = "data.frame", ...)
Same as above but id is a data.frame. See below for additional arguments.
signature(object = "MSnSet", id = "character", ...)
Adds the identification data stored in mzIdentML files to an "MSnSet" instance. The method handles one or multiple mzIdentML files provided via id. id has to be a character vector of valid filenames. See below for additional arguments.
signature(object = "MSnSet", id = "mzID", ...)
Same as above but id is an mzID object. See below for additional arguments.
signature(object = "MSnSet", id = "mzIDCollection", ...)
Same as above but id is an mzIDCollection object. See below for additional arguments.
signature(object = "MSnSet", id = "data.frame", ...)
Same as above but id is a data.frame. See below for additional arguments.
The methods above take the following additional arguments. These need to be set when adding identification data as a data.frame. In all other cases, the defaults are set automatically.
- fcol
  The matching between the features (raw spectra or quantitative features) and identification results is done by matching columns in the feature data (the featureData slot) and the identification data. These values are the spectrum file index and the acquisition number, passed as a character of length 2. The default values for these variables in the object's feature data are "spectrum.file" and "acquisition.num". Values need to be provided when id is a data.frame.
- icol
  The default values for the spectrum file and acquisition numbers in the identification data (the id argument) are "spectrumFile" and "acquisitionNum". Values need to be provided when id is a data.frame.
- acc
  The protein (group) accession number or identifier. Defaults are "DatabaseAccess" when passing filenames or mzRident objects and "accession" when passing mzID or mzIDCollection objects. A value needs to be provided when id is a data.frame.
- desc
  The protein (group) description. Defaults are "DatabaseDescription" when passing filenames or mzRident objects and "description" when passing mzID or mzIDCollection objects. A value needs to be provided when id is a data.frame.
- pepseq
  The peptide sequence variable name. Defaults are "sequence" when passing filenames or mzRident objects and "pepseq" when passing mzID or mzIDCollection objects. A value needs to be provided when id is a data.frame.
- key
  The key to be used when the identification data need to be reduced (see the Details section). Defaults are "spectrumID" when passing filenames or mzRident objects and "spectrumid" when passing mzID or mzIDCollection objects. A value needs to be provided when id is a data.frame.
- decoy
  The feature variable used to define whether the PSM was matched in the decoy or the regular fasta database, for PSM filtering. Defaults are "isDecoy" when passing filenames or mzRident objects and "isdecoy" when passing mzID or mzIDCollection objects. A value needs to be provided when id is a data.frame. See filterIdentificationDataFrame for details.
- rank
  The feature variable used to define the rank of the PSM for filtering. Default is "rank". A value needs to be provided when id is a data.frame. See filterIdentificationDataFrame for details.
- accession
  The feature variable used to define the protein (group) accession or identifier for PSM filtering. Default is to use the same value as acc. A value needs to be provided when id is a data.frame. See filterIdentificationDataFrame for details.
- verbose
  A logical defining whether to print out messages or not. Default is to use the session-wide option from isMSnbaseVerbose.
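The role of fcol and icol can be pictured as a two-column lookup between the feature data and the identification data. The base R sketch below uses invented tables and a simple composite key; it only illustrates the idea of matching on file and acquisition number, assumes both tables encode the file in the same way, and is not how the package performs the matching internally.

```r
## Toy feature data and identification data (invented values).
fd <- data.frame(
    spectrum.file   = c(1, 1, 1),
    acquisition.num = c(1, 2, 3),
    row.names = c("F1.S1", "F1.S2", "F1.S3"))
id <- data.frame(
    spectrumFile   = c(1, 1),
    acquisitionNum = c(1, 3),
    sequence       = c("PEPA", "PEPC"),
    stringsAsFactors = FALSE)

## column pairs used for the matching, named as in the defaults above
fcol <- c("spectrum.file", "acquisition.num")
icol <- c("spectrumFile", "acquisitionNum")

## build a composite key from both columns on each side ...
fkey <- paste(fd[[fcol[1]]], fd[[fcol[2]]], sep = ":")
ikey <- paste(id[[icol[1]]], id[[icol[2]]], sep = ":")

## ... and copy the identification onto the matching features;
## features without an identification are left as NA
fd$sequence <- id$sequence[match(fkey, ikey)]
fd
```

The second feature has no counterpart in the identification table and therefore gets NA, which is also how unidentified spectra appear in the feature data shown in the Examples section.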
See also
filterIdentificationDataFrame for the function that filters identification data, readMzIdData to read the identification data as an unfiltered data.frame and reduce,data.frame-method to reduce it to a data.frame that contains only unique PSMs per row.
Examples
## find path to an mzXML file
quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                 full.names = TRUE, pattern = "mzXML$")
## find path to an mzIdentML file
identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                 full.names = TRUE, pattern = "dummyiTRAQ.mzid")
## create basic MSnExp
msexp <- readMSData(quantFile)
## add identification information
msexp <- addIdentificationData(msexp, identFile)
## access featureData
fData(msexp)
#> spectrum acquisition.number sequence chargeState rank
#> F1.S1 1 1 VESITARHGEVLQLRPK 3 1
#> F1.S2 2 2 IDGQWVTHQWLKK 3 1
#> F1.S3 3 3 <NA> NA NA
#> F1.S4 4 4 <NA> NA NA
#> F1.S5 5 5 LVILLFR 2 1
#> passThreshold experimentalMassToCharge calculatedMassToCharge peptideRef
#> F1.S1 TRUE 645.3741 645.0375 Pep2
#> F1.S2 TRUE 546.9586 546.9633 Pep1
#> F1.S3 NA NA NA <NA>
#> F1.S4 NA NA NA <NA>
#> F1.S5 TRUE 437.8040 437.2997 Pep4
#> modNum isDecoy post pre start end DatabaseAccess DBseqLength DatabaseSeq
#> F1.S1 0 FALSE A R 170 186 ECA0984 231
#> F1.S2 0 FALSE A K 50 62 ECA1028 275
#> F1.S3 NA NA <NA> <NA> NA NA <NA> NA <NA>
#> F1.S4 NA NA <NA> <NA> NA NA <NA> NA <NA>
#> F1.S5 0 FALSE L K 22 28 ECA0510 166
#> DatabaseDescription
#> F1.S1 ECA0984 DNA mismatch repair protein
#> F1.S2 ECA1028 2,3,4,5-tetrahydropyridine-2,6-dicarboxylate N-succinyltransferase
#> F1.S3 <NA>
#> F1.S4 <NA>
#> F1.S5 ECA0510 putative capsular polysacharide biosynthesis transferase
#> scan.number.s. idFile MS.GF.RawScore MS.GF.DeNovoScore
#> F1.S1 1 dummyiTRAQ.mzid -39 77
#> F1.S2 2 dummyiTRAQ.mzid -30 39
#> F1.S3 NA <NA> NA NA
#> F1.S4 NA <NA> NA NA
#> F1.S5 5 dummyiTRAQ.mzid -42 5
#> MS.GF.SpecEValue MS.GF.EValue modPeptideRef modName modMass modLocation
#> F1.S1 5.527468e-05 79.36958 <NA> <NA> NA NA
#> F1.S2 9.399048e-06 13.46615 <NA> <NA> NA NA
#> F1.S3 NA NA <NA> <NA> NA NA
#> F1.S4 NA NA <NA> <NA> NA NA
#> F1.S5 2.577830e-04 366.38422 <NA> <NA> NA NA
#> subOriginalResidue subReplacementResidue subLocation nprot npep.prot
#> F1.S1 <NA> <NA> NA 1 1
#> F1.S2 <NA> <NA> NA 1 1
#> F1.S3 <NA> <NA> NA NA NA
#> F1.S4 <NA> <NA> NA NA NA
#> F1.S5 <NA> <NA> NA 1 1
#> npsm.prot npsm.pep
#> F1.S1 1 1
#> F1.S2 1 1
#> F1.S3 NA NA
#> F1.S4 NA NA
#> F1.S5 1 1
idSummary(msexp)
#> spectrumFile idFile coverage
#> 1 dummyiTRAQ.mzXML dummyiTRAQ.mzid 0.6