1 Introduction

In this vignette, we will reanalyse single-cell proteomics data from (Ai et al. 2025).

1.1 Motivation of the original paper

iPSC-derived cardiomyocytes (iCMs) have been proposed as a model to study cardiovascular diseases. However, this requires that iCMs display the same biological features as adult cardiomyocytes (aCMs). The authors of the manuscript used MS-based single-cell proteomics to analyze metabolic changes in iPSCs during differentiation process and to compare the resulting iCMs to freshly-isolated aCMs.

Here, we will focus on the analysis of the aCMs data.

1.2 Data origin

Single cardiomyocytes were isolated from 3 human hearts using enzymatic digestion isolation technique.

Cells were isolated from 4 regions of the heart: - LVepi: left ventricle, epi-myocardium - LVendo: left ventricle, endo-myocardium - LVmid: left ventricle, mid-myocardium - RV: right ventricle

Single cell proteomes were analyzed in a label-free approach. MS data was acquired in data-independent mode (DIA-MS). Data were analyzed with DIA-NN 1.8.1 and DIA-NN 1.8.2. The DIA-NN reports were downloaded from MassIVE (MSV000094438).

1.2.0.1 Packages

The following packages were used to compile this vignette.

library(QFeatures)
library(scp)
library(tidyverse)
library(patchwork)

2 Data prepartion

We can now read in the results file from DIA-NN. We use the file report.tsv file generate by DIA-NN, that can be downloaded via the MsDataHub package1 Version 1.7.1 or later. and read it as a tibble with read_tsv().

(acmsTab <- read_tsv(MsDataHub::Ai2025_aCMs_report.tsv()))
## # A tibble: 1,031,054 × 56
##    File.Name     Run   Protein.Group Protein.Ids Protein.Names Genes PG.Quantity
##    <chr>         <chr> <chr>         <chr>       <chr>         <chr>       <dbl>
##  1 "X:\\TimsTOF… CMs_… Q9NQP4        Q9NQP4      PFD4_HUMAN    PFDN4      26834.
##  2 "X:\\TimsTOF… CMs_… Q9NQP4        Q9NQP4      PFD4_HUMAN    PFDN4      23905.
##  3 "X:\\TimsTOF… CMs_… Q9NQP4        Q9NQP4      PFD4_HUMAN    PFDN4      19715.
##  4 "X:\\TimsTOF… CMs_… Q9NQP4        Q9NQP4      PFD4_HUMAN    PFDN4       3046.
##  5 "X:\\TimsTOF… CMs_… Q9NQP4        Q9NQP4      PFD4_HUMAN    PFDN4      22406.
##  6 "X:\\TimsTOF… CMs_… Q9NQP4        Q9NQP4      PFD4_HUMAN    PFDN4      17476.
##  7 "X:\\TimsTOF… CMs_… Q9NQP4        Q9NQP4      PFD4_HUMAN    PFDN4      18305.
##  8 "X:\\TimsTOF… CMs_… Q9NQP4        Q9NQP4      PFD4_HUMAN    PFDN4      11097.
##  9 "X:\\TimsTOF… CMs_… Q9NQP4        Q9NQP4      PFD4_HUMAN    PFDN4      20368.
## 10 "X:\\TimsTOF… CMs_… Q9NQP4        Q9NQP4      PFD4_HUMAN    PFDN4      17047.
## # ℹ 1,031,044 more rows
## # ℹ 49 more variables: PG.Normalised <dbl>, PG.MaxLFQ <dbl>,
## #   Genes.Quantity <dbl>, Genes.Normalised <dbl>, Genes.MaxLFQ <dbl>,
## #   Genes.MaxLFQ.Unique <dbl>, Modified.Sequence <chr>,
## #   Stripped.Sequence <chr>, Precursor.Id <chr>, Precursor.Charge <dbl>,
## #   Q.Value <dbl>, PEP <dbl>, Global.Q.Value <dbl>, Protein.Q.Value <dbl>,
## #   PG.Q.Value <dbl>, Global.PG.Q.Value <dbl>, GG.Q.Value <dbl>, …

We need to extract metadata from the file names, in particular: - File name - Sample name - Acquisition date - Subject number - Location in the heart from which each cell was extracted (see above) - Position in the 384 well plate - Position (BA18, …)

Our goal is then to create a data frame with 299 rows, one for each cell, with the information defined above as the columns.

## Extracting file name, sample name, date, subject, well position and
## position
tab <- tibble(File.Name = unique(acmsTab[[1]])) |>
    mutate(Sample = sub("^.+CM_PROJECT\\\\", "", File.Name)) |>
    mutate(Sample = sub("\\\\", "_", Sample)) |>
    mutate(Date = ymd(as.integer(substring(Sample, 1, 6)))) |>
    mutate(Subject = sub("^.+_(Subject[0-9])_.+$", "\\1", Sample)) |>
    mutate(PlateWell = sub("^.+_([A-Z][0-9]+)_.+$", "\\1", Sample)) |>
    mutate(PlateRow = gsub("[0-9]*", "", PlateWell)) |>
    mutate(PlateColumn = gsub("[A-Z]*", "", PlateWell)) |>
    mutate(PlateColumn = factor(PlateColumn, levels = 1:24)) |>
    mutate(Position = sub("^.+_([A-Z]+[0-9]+)_1_[0-9]+\\.d$", "\\1", Sample)) |>
    mutate(FileIndex = sub("^.+_1_([0-9]+)\\.d$", "\\1", Sample)) |>
    mutate(FileIndex = factor(FileIndex, levels = sort(unique(FileIndex))))

## Extracting the heart locations
tab$HeartLocation <- NA
ExpectedLocations <- c("Lvendo", "Lvepi", "Lvmid", "RV", "sytox")
for (i in 1:5) {
  loc <- ExpectedLocations[i]
  tab$HeartLocation[grep(loc, tab$File.Name, ignore.case = TRUE)] <- loc
}
tab
## # A tibble: 299 × 10
##    File.Name   Sample Date       Subject PlateWell PlateRow PlateColumn Position
##    <chr>       <chr>  <date>     <chr>   <chr>     <chr>    <fct>       <chr>   
##  1 "X:\\TimsT… 22101… 2022-10-19 Subjec… B15       B        15          BB17    
##  2 "X:\\TimsT… 22101… 2022-10-19 Subjec… F13       F        13          BF14    
##  3 "X:\\TimsT… 22101… 2022-10-19 Subjec… I14       I        14          BI17    
##  4 "X:\\TimsT… 22101… 2022-10-19 Subjec… A19       A        19          BA20    
##  5 "X:\\TimsT… 22101… 2022-10-19 Subjec… B24       B        24          BB3     
##  6 "X:\\TimsT… 22101… 2022-10-19 Subjec… C22       C        22          BC23    
##  7 "X:\\TimsT… 22101… 2022-10-19 Subjec… D21       D        21          BD22    
##  8 "X:\\TimsT… 22101… 2022-10-19 Subjec… D24       D        24          BD3     
##  9 "X:\\TimsT… 22101… 2022-10-19 Subjec… E24       E        24          BE3     
## 10 "X:\\TimsT… 22101… 2022-10-19 Subjec… F19       F        19          BF2     
## # ℹ 289 more rows
## # ℹ 2 more variables: FileIndex <fct>, HeartLocation <chr>

After creating the data frame we can check that the values in the different columns.

table(tab[, c("Date","Subject")])
##             Subject
## Date         Subject3 Subject4 Subject5
##   2022-10-19      156        0        0
##   2022-10-22        0        0       21
##   2022-10-24        0      122        0
table(tab[, c("HeartLocation","Subject")])
##              Subject
## HeartLocation Subject3 Subject4 Subject5
##        Lvendo       35       34        0
##        Lvepi        62       33        0
##        Lvmid        29       23        0
##        RV           30       32        0
##        sytox         0        0       21

We can see that there are 21 cells from subject 5, annotated as “sytox”, acquired on one specific date. Sytox is a stain used for assessing cell viability. We will therefore remove these cells before we start doing statistical analysis.

To create the QFeatures object, we need to add a column runCol to properly associate the colData to each sample.

tab$runCol <- tab$File.Name

Now we can create the QFeatures object, and include the metadata that we extracted. We also use the fnames argument2 Note that the fnames requires QFeatures version 1.17.2 or later. to use percursor identifiers (i.e. peptides sequence and precursor charge) as assay rownames (this is important to correctly join the assays below).

acms <- readSCPfromDIANN(acmsTab,
                         colData = DataFrame(tab),
                         fnames = "Precursor.Id")
## Checking arguments.
## Loading data as a 'SummarizedExperiment' object.
## Splitting data in runs.
## Formatting sample annotations (colData).
## Formatting data as a 'QFeatures' object.
## Setting assay rownames.
# Setting the names of the QFeatures objects
names(acms) <- acms$Sample
acms
## An instance of class QFeatures containing 299 set(s):
##  [1] 221019_CM_SUBJECT-03_CMs_Subject3_Lvendo_A13_BA14_1_7445.d: SingleCellExperiment with 6634 rows and 1 columns 
##  [2] 221019_CM_SUBJECT-03_CMs_Subject3_Lvendo_A14_BA15_1_7446.d: SingleCellExperiment with 4326 rows and 1 columns 
##  [3] 221019_CM_SUBJECT-03_CMs_Subject3_Lvendo_A15_BA16_1_7447.d: SingleCellExperiment with 6263 rows and 1 columns 
##  ...
##  [297] 221024_Subject-04_CMs_Subject4_D_RV_G6_BG7_1_7841.d: SingleCellExperiment with 2108 rows and 1 columns 
##  [298] 221024_Subject-04_CMs_Subject4_D_RV_H4_BH7_1_7858.d: SingleCellExperiment with 2491 rows and 1 columns 
##  [299] 221024_Subject-04_CMs_Subject4_D_RV_I1_BI11_1_7861.d: SingleCellExperiment with 858 rows and 1 columns

The QFeatures object contains 299 assays, on per single cell.

We immediately replace 0 by NA for downstream analysis.

acms <- zeroIsNA(acms, names(acms))

3 Exploratory data analysis and quality control

The objective of quality control is to remove low-quality data. These data can be filtered based on PSM-level annotations and on cell-level annotations.

3.1 PSM filtering

We first explore the features annotation. We can retrieve all the data annotations along the quantified values in a single long table using longFormat()

lf <- longFormat(acms,
                 colvars = c("HeartLocation", "Subject", "FileIndex"),
                 rowvars = c("PEP", "RT", "Predicted.RT", "IM", "Predicted.IM")) |>
    data.frame()

We first explore the distributions of the posterior error probabilities (PEP) for precursor identification for each cell separately, stratified by subject, a potential variable for batch effects. We sort the cells according to the sequence of MS acquisition (available in the FileIndex column).

ggplot(lf) +
    aes(x = log10(PEP),
        y = FileIndex,
        colour = Subject) +
    geom_boxplot() +
    theme(axis.text.y = element_blank(),
          axis.ticks.y = element_blank())

The plot indicates that the identification errors of every precursor are stable across the experiment. We will therefore not use PEP for PSM filtering.

We now generate a similar plot, but focusing on the difference between the observed retention time (RT) and the predicted RT, where large differences may indicate wrong identification or issues during liquid chromatography. We hence plot the absolute difference between observed RT.

ggplot(lf,
       aes(x = abs(RT - Predicted.RT),
           y = FileIndex,
           colour = Subject)) +
    geom_boxplot() +
    geom_hline(yintercept = 0.18, linetype = "dashed") +
    scale_x_log10() +
    theme(axis.text.y = element_blank(),
          axis.ticks.y = element_blank())
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 1961 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Again, the plot shows stable metrics across cells, with a few outliers above 0.18 spread across the experiment. We will remove those outliers using filterFeatures() that takes the QFeatures object and an arbitrary formula to filter the PSM. Note that you can directly filter on variables from the rowData by including them in the formula.

acms <- filterFeatures(acms, ~ abs(RT - Predicted.RT) < 0.18)
## 'RT' found in 299 out of 299 assay(s)
## 'Predicted.RT' found in 299 out of 299 assay(s)

Next, we remove precursors that map on multiple proteins, as these are of limited use when interpreting downstream, we will remove them. DIA-NN generated the isProteotypic, however the table below shows that some precursors considered as proteotypic may still be part of a protein group where the Protein.Ids contain multiple proteins separated by a ;.

rd <- rbindRowData(acms, names(acms))
table(isProteotypic = rd$Proteotypic,
       isProteinGroup = grepl(";", rd$Protein.Ids))
##              isProteinGroup
## isProteotypic  FALSE   TRUE
##             0      0 100377
##             1 929019    690

We here keep all precursor that are both proteotypic and not part of a protein group.

acms <- filterFeatures(acms,
                       ~ !grepl(";", Protein.Names) &
                           Proteotypic == 1)
## 'Protein.Names' found in 299 out of 299 assay(s)
## 'Proteotypic' found in 299 out of 299 assay(s)

Finally, we add the information whether a precursor belongs to a contaminant protein or not. We retrieve this information using the cRAP database, through the camprotR package.

head(contaminants <- camprotR::get_ccp_crap())
## [1] "P00330" "P02768" "P00883" "P0DUB6" "P0DTE7" "P0DTE8"

We next retrieve all the rowData table in a single table using rbindRowData(), we add a new column isContaminant and inject it back into the QFeatures object after splitting it by assay.

rd <- rbindRowData(acms, names(acms))
rd$isContaminant <- rd$Protein.Ids %in% contaminants
rowData(acms) <- split(rd, rd$assay)

We decide not to remove contaminants as they may be used later for model exploration and diagnostics.

3.2 Sample filtering

We now remove low-quality cells. We will compute a few metrics, explore the experimental design and assess how it impacts the computed metrics and we will conclude with removing cells identified as low-quality, if any.

3.2.1 Compute QC metrics

First, we compute the median intensity within each cell. This is meant as a proxy for the amount of material that has been injected in the MS. Note that in single-cell experiments, difference in the amount of material injected are subject to technical variability, but also to biological variability (cell size, proteome complexity, cell density, …). Hence, different median intensities may not be a problematic artifact if different across heart location.

acms$MedianIntensity <- sapply(names(acms), function(i) {
    median(log2(assay(acms[[i]])), na.rm = TRUE)
})

We also compute the number of identified precursors within each cell. This relates to the number of rows in each set of the QFeatures object.

acms$TotalIds <- nrows(acms)

3.2.2 Explore the experimental design

We now explore the experimental design. First, the sample annotation contain subjects and different dates. The table below shows that each subject has been processed in a different day.

table(acms$Date, acms$Subject)
##             
##              Subject3 Subject4 Subject5
##   2022-10-19      156        0        0
##   2022-10-22        0        0       21
##   2022-10-24        0      122        0

This is not a problematic design as soon we are not interested in distinguishing between subject effects and date effects, which is the case for this experiment.

Next we will explore the plate design. We first retrieve all the available sample annotations from the colData and perform a little data cleaning for improved visualisation.

We will first explore how the data acquisition strategy, by plotting the order of the the MS runs on the plate layout.

data.frame(colData(acms)) |>
    ggplot(aes(y = PlateRow,
               x = PlateColumn,
               fill = as.numeric(FileIndex))) +
    geom_tile() +
    facet_grid(~ Date) +
    scale_fill_continuous(type = "viridis") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          legend.position = "bottom")

We can see that each plate as been acquired sequentially (as already suggested by the date annotation), in order of the rows. Next, we now plot the plate layout with respect to the heart location of the cells, for each data/subject separately.

data.frame(colData(acms)) |>
    ggplot(aes(y = PlateRow,
               x = PlateColumn,
               fill = HeartLocation)) +
    geom_tile() +
    facet_grid(~ Date) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          legend.position = "bottom")

Two observations can be drawn from the plot. First, all sytox cells are from one date and hence their effect cannot be estimated. According to the original study, this sample preparation for this subject failed. Hence we will remove the corresponding data later. Also cells have been assigned in block on the plate, one block for each heart location. While this is probably the consequence of practical constraints, this means we can no longer distinguish between well column effect (if any) and heart location effects. Hence, we will ignore cell column effects.

We plot the same layout, this time colouring by the number of precursors identified.

data.frame(colData(acms)) |>
    ggplot(aes(y = PlateRow,
               x = PlateColumn,
               fill = TotalIds)) +
    geom_tile() +
    facet_grid(~ Date) +
    scale_fill_continuous(type = "viridis") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          legend.position = "bottom")

Columns 10-24 in the first plate seem to have more identified precursors than the other columns, potentially indicating some batch effect during cell sorting. A cell type effect may also be an explanation, but we do not see the effect in the last plate.

data.frame(colData(acms)) |>
    ggplot(aes(y = PlateRow,
               x = PlateColumn,
               fill = MedianIntensity)) +
    geom_tile() +
    facet_grid(~ Date) +
    scale_fill_continuous(type = "viridis") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          legend.position = "bottom")

Interestingly, this plot shows that columns 1-9 in the first plate have lower median intensity. We now combine the information

data.frame(colData(acms)) |>
ggplot(aes(y = MedianIntensity,
           x = TotalIds,
           colour = as.numeric(FileIndex),
           shape = Subject)) +
    geom_point(size = 2) +
    scale_colour_continuous(type = "viridis")

3.2.3 Filter samples

Upon data exploration, we identified that one of the subjects should be removed.

4 Data processing

We now join the 299 single-cell assays into one containing all precursors across all cells.

(acms <- joinAssays(acms, i = names(acms), name = "precursors"))
## An instance of class QFeatures containing 300 set(s):
##  [1] 221019_CM_SUBJECT-03_CMs_Subject3_Lvendo_A13_BA14_1_7445.d: SingleCellExperiment with 6082 rows and 1 columns 
##  [2] 221019_CM_SUBJECT-03_CMs_Subject3_Lvendo_A14_BA15_1_7446.d: SingleCellExperiment with 3958 rows and 1 columns 
##  [3] 221019_CM_SUBJECT-03_CMs_Subject3_Lvendo_A15_BA16_1_7447.d: SingleCellExperiment with 5729 rows and 1 columns 
##  ...
##  [298] 221024_Subject-04_CMs_Subject4_D_RV_H4_BH7_1_7858.d: SingleCellExperiment with 2206 rows and 1 columns 
##  [299] 221024_Subject-04_CMs_Subject4_D_RV_I1_BI11_1_7861.d: SingleCellExperiment with 689 rows and 1 columns 
##  [300] precursors: SingleCellExperiment with 11731 rows and 299 columns

We then log-transforming our data for downstream statistical analyses. The logTransfrorm() function will create a new assay, named precursors_log.

(acms <- logTransform(acms, "precursors", "precursors_log"))
## An instance of class QFeatures containing 301 set(s):
##  [1] 221019_CM_SUBJECT-03_CMs_Subject3_Lvendo_A13_BA14_1_7445.d: SingleCellExperiment with 6082 rows and 1 columns 
##  [2] 221019_CM_SUBJECT-03_CMs_Subject3_Lvendo_A14_BA15_1_7446.d: SingleCellExperiment with 3958 rows and 1 columns 
##  [3] 221019_CM_SUBJECT-03_CMs_Subject3_Lvendo_A15_BA16_1_7447.d: SingleCellExperiment with 5729 rows and 1 columns 
##  ...
##  [299] 221024_Subject-04_CMs_Subject4_D_RV_I1_BI11_1_7861.d: SingleCellExperiment with 689 rows and 1 columns 
##  [300] precursors: SingleCellExperiment with 11731 rows and 299 columns 
##  [301] precursors_log: SingleCellExperiment with 11731 rows and 299 columns

While we will be performing our analyses with precursor data, below we show how to aggregate these data into peptide and protein data.

acms <- aggregateFeatures(acms,
                          i = "precursors_log",
                          name = "peptides",
                          fcol = "Modified.Sequence",
                          fun = colMedians,
                          na.rm = TRUE)
## Your quantitative data contain missing values. Please read the relevant
## section(s) in the aggregateFeatures manual page regarding the effects
## of missing values on data aggregation.
acms <- aggregateFeatures(acms,
                          i = "peptides",
                          name = "proteins",
                          fcol = "Protein.Ids",
                          fun = colMedians,
                          na.rm = TRUE)
## Your quantitative data contain missing values. Please read the relevant
## section(s) in the aggregateFeatures manual page regarding the effects
## of missing values on data aggregation.

We now have two additional assays names peptides and proteins:

acms
## An instance of class QFeatures containing 303 set(s):
##  [1] 221019_CM_SUBJECT-03_CMs_Subject3_Lvendo_A13_BA14_1_7445.d: SingleCellExperiment with 6082 rows and 1 columns 
##  [2] 221019_CM_SUBJECT-03_CMs_Subject3_Lvendo_A14_BA15_1_7446.d: SingleCellExperiment with 3958 rows and 1 columns 
##  [3] 221019_CM_SUBJECT-03_CMs_Subject3_Lvendo_A15_BA16_1_7447.d: SingleCellExperiment with 5729 rows and 1 columns 
##  ...
##  [301] precursors_log: SingleCellExperiment with 11731 rows and 299 columns 
##  [302] peptides: SingleCellExperiment with 10277 rows and 299 columns 
##  [303] proteins: SingleCellExperiment with 2044 rows and 299 columns

5 Statistical modelling

We will model the data following the scplainer pipeline (Vanderaa and Gatto 2024).

5.1 Modelling

We extract the log-transformed precursor assez and remove sytox cells (all from subject 5):

sce <- getWithColData(acms, "precursors_log")
## Warning: 'experiments' dropped; see 'drops()'
sce <- sce[, sce$HeartLocation != "sytox"]

We can now model the precursor quantities considering median intensity in each cell, plate position (column), subject and heart location as factors that influence the measure abundances.

sce <- scpModelWorkflow(
    sce,
    formula = ~ 1 + ## intercept
        ## normalisation
        MedianIntensity +
        ## batch effects
        PlateRow +
        Subject +
        ## biological variability
        HeartLocation,
    verbose = FALSE)

Note that above, we set the verbose argument to FALSE to switch the progress bar out and avoid unnecessary output in this vignette. For interactive use, we recommend to keep the default (verbose = TRUE) to monitor progress.

Here, we define the missing data filtering threshold: we want at least 3 measurement for each coefficient estimation.

scpModelFilterThreshold(sce) <- 3
scpModelFilterPlot(sce)
## To change the threshold, use:
## scpModelFilterThreshold(object, name) <- threshold

5.2 Analysis of variance

(vaRes <- scpVarianceAnalysis(sce))
## DataFrameList of length 5
## names(5): Residuals MedianIntensity PlateRow Subject HeartLocation
## Add annotations from the rowData()
vaRes <- scpAnnotateResults(
    vaRes, rowData(sce), by = "feature", by2 = "Precursor.Id"
)
scpVariancePlot(vaRes)

scpVariancePlot(
    vaRes, top = 20, by = "percentExplainedVar", effect = "Subject",
    decreasing = TRUE, combined = FALSE, fcol = "Protein.Names"
) + scpVariancePlot(
        vaRes, top = 10, by = "percentExplainedVar", effect = "HeartLocation",
        decreasing = TRUE, combined = FALSE, fcol = "Protein.Names"
    ) + plot_layout(ncol = 1, guides = "collect")

5.3 Differential abundance analysis

locations <- unique(sce$HeartLocation)
combinations <- combn(locations, 2)
contrasts <- lapply(1:ncol(combinations),
                    function(i) c("HeartLocation",
                                  combinations[, i]))

(daRes <- scpDifferentialAnalysis(
     sce, contrast = contrasts
 ))
## DataFrameList of length 6
## names(6): HeartLocation_Lvendo_vs_Lvepi ... HeartLocation_Lvmid_vs_RV
daRes[[1]]
## DataFrame with 6051 rows and 7 columns
##            feature   Estimate        SE        Df tstatistic      pvalue
##        <character>  <numeric> <numeric> <numeric>  <numeric>   <numeric>
## 1    ILNPAAIPEG...   0.133536  0.288787       256   0.462402 6.44186e-01
## 2    VFDEFKPLVE...   1.099511  0.288538       256   3.810624 1.73664e-04
## 3    VLSIGDGIAR...  -1.083659  0.262455       255  -4.128928 4.94204e-05
## 4    ELTTEIDNNI...  -0.417190  0.270743       254  -1.540908 1.24584e-01
## 5    KVPQVSTPTL...   0.174581  0.244313       254   0.714580 4.75525e-01
## ...            ...        ...       ...       ...        ...         ...
## 6047 ADAPEEEDHV...  0.8504744  0.581988        65   1.461327  0.14874334
## 6048 GGVSDHSAFL...  0.7455283  0.280467        63   2.658166  0.00994636
## 6049 VVNISSMLGR... -0.3850981  0.192371        30  -2.001847  0.05441535
## 6050 ELSGLGSALK... -0.1667748  0.194061       146  -0.859392  0.39153328
## 6051 ELSTYYQVEP... -0.0600048  0.239444        50  -0.250601  0.80314989
##            padj
##       <numeric>
## 1    0.80546425
## 2    0.00294238
## 3    0.00134100
## 4    0.28573059
## 5    0.67930494
## ...         ...
## 6047  0.3419627
## 6048  0.0419948
## 6049  0.1800259
## 6050  0.7845657
## 6051  1.0000000
daRes <-
    scpAnnotateResults(
        daRes, rowData(sce),
        by = "feature", by2 = "Precursor.Id")

scpVolcanoPlot(daRes,
               textBy = "Protein.Names",
               pointParams = list(aes(colour = Lib.Q.Value)))[[1]]

5.4 Component analysis

(caRes <- scpComponentAnalysis(
     sce, ncomp = 15, method = "APCA"))
## List of length 2
## names(2): bySample byFeature
caResCells <- caRes$bySample
sce$cell <- colnames(sce)
caResCells <- scpAnnotateResults(caResCells,
                                 colData(sce), by = "cell")


scpComponentPlot(
    caResCells,
    pointParams = list(aes(colour = HeartLocation,
                           shape = Subject),
                       size = 3)) |>
    wrap_plots() +
    plot_layout(guides = "collect")

library(scater)
## Loading required package: SingleCellExperiment
## Loading required package: scuttle
sce <-addReducedDims(sce, caRes$bySample)
sce <- runTSNE(sce, dimred = "APCA_HeartLocation")

plotTSNE(sce, colour_by = "HeartLocation") +
    plotTSNE(sce, colour_by = "Subject") +
    plotTSNE(sce, colour_by = "PlateRow") +
    plotTSNE(sce, colour_by = "MedianIntensity")

We can not extract the hear location effect matrix, removing the subject, plate position and median intensity effects for downstream analysis.

scebr <- scpRemoveBatchEffect(
  sce, effects = c("Subject", "PlateRow", "MedianIntensity"),
  intercept = TRUE)

6 Downstream analyses

Coming soon.

7 Session information

This vignette was prepared as part of the EuBIC 2025 developer hackathon.

using the follwing software

## R Under development (unstable) (2025-01-25 r87633)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /opt/Rdevel/lib/R/lib/libRblas.so 
## LAPACK: /opt/Rdevel/lib/R/lib/libRlapack.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Brussels
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scater_1.35.1               scuttle_1.17.0             
##  [3] SingleCellExperiment_1.29.1 MsDataHub_1.7.1            
##  [5] patchwork_1.3.0             lubridate_1.9.4            
##  [7] forcats_1.0.0               stringr_1.5.1              
##  [9] dplyr_1.1.4                 purrr_1.0.2                
## [11] readr_2.1.5                 tidyr_1.3.1                
## [13] tibble_3.2.1                ggplot2_3.5.1              
## [15] tidyverse_2.0.0             scp_1.17.0                 
## [17] QFeatures_1.17.2            MultiAssayExperiment_1.33.8
## [19] SummarizedExperiment_1.37.0 Biobase_2.67.0             
## [21] GenomicRanges_1.59.1        GenomeInfoDb_1.43.4        
## [23] IRanges_2.41.2              S4Vectors_0.45.2           
## [25] BiocGenerics_0.53.6         generics_0.1.3             
## [27] MatrixGenerics_1.19.1       matrixStats_1.5.0          
## [29] BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      jsonlite_1.9.0          magrittr_2.0.3         
##   [4] ggbeeswarm_0.7.2        magick_2.8.5            farver_2.1.2           
##   [7] MALDIquant_1.22.3       rmarkdown_2.29          vctrs_0.6.5            
##  [10] memoise_2.0.1           tinytex_0.54            htmltools_0.5.8.1      
##  [13] S4Arrays_1.7.1          BiocBaseUtils_1.9.0     AnnotationHub_3.15.0   
##  [16] curl_6.2.1              BiocNeighbors_2.1.2     SparseArray_1.7.4      
##  [19] mzID_1.45.0             sass_0.4.9              bslib_0.8.0            
##  [22] plyr_1.8.9              impute_1.81.0           cachem_1.1.0           
##  [25] igraph_2.1.4            iterators_1.0.14        mime_0.12              
##  [28] lifecycle_1.0.4         pkgconfig_2.0.3         rsvd_1.0.5             
##  [31] Matrix_1.7-2            R6_2.6.1                fastmap_1.2.0          
##  [34] GenomeInfoDbData_1.2.13 clue_0.3-66             digest_0.6.37          
##  [37] fdrtool_1.2.18          pcaMethods_1.99.0       colorspace_2.1-1       
##  [40] AnnotationDbi_1.69.0    irlba_2.3.5.1           ExperimentHub_2.15.0   
##  [43] camprotR_0.0.0.9000     lpsymphony_1.35.0       RSQLite_2.3.9          
##  [46] beachmat_2.23.6         filelock_1.0.3          labeling_0.4.3         
##  [49] timechange_0.3.0        httr_1.4.7              abind_1.4-8            
##  [52] compiler_4.5.0          doParallel_1.0.17       bit64_4.6.0-1          
##  [55] withr_3.0.2             BiocParallel_1.41.2     viridis_0.6.5          
##  [58] DBI_1.2.3               MASS_7.3-64             rappdirs_0.3.3         
##  [61] DelayedArray_0.33.4     mzR_2.41.1              tools_4.5.0            
##  [64] vipor_0.4.7             PSMatch_1.11.0          beeswarm_0.4.0         
##  [67] glue_1.8.0              grid_4.5.0              Rtsne_0.17             
##  [70] cluster_2.1.8           reshape2_1.4.4          gtable_0.3.6           
##  [73] tzdb_0.4.0              preprocessCore_1.69.0   hms_1.1.3              
##  [76] ScaledMatrix_1.15.0     BiocSingular_1.23.0     metapod_1.15.0         
##  [79] utf8_1.2.4              XVector_0.47.2          foreach_1.5.2          
##  [82] ggrepel_0.9.6           BiocVersion_3.21.1      pillar_1.10.1          
##  [85] limma_3.63.3            vroom_1.6.5             robustbase_0.99-4-1    
##  [88] BiocFileCache_2.15.1    lattice_0.22-6          bit_4.5.0.1            
##  [91] tidyselect_1.2.1        Biostrings_2.75.3       knitr_1.49             
##  [94] gridExtra_2.3           bookdown_0.42           ProtGenerics_1.39.2    
##  [97] IHW_1.35.0              xfun_0.50               statmod_1.5.0          
## [100] MSnbase_2.33.2         
##  [ reached 'max' / getOption("max.print") -- omitted 30 entries ]

References

Ai, Lizhuo, Aleksandra Binek, Vladimir Zhemkov, Jae Hyung Cho, Ali Haghani, Simion Kreimer, Edo Israely, et al. 2025. “Single Cell Proteomics Reveals Specific Cellular Subtypes in Cardiomyocytes Derived from Human iPSCs and Adult Hearts.” Mol. Cell. Proteomics, no. 100910 (January): 100910. https://doi.org/10.1016/j.mcpro.2025.100910.
Vanderaa, Christophe, and Laurent Gatto. 2024. scplainer: Using Linear Models to Understand Mass Spectrometry-Based Single-Cell Proteomics Data.” bioRxiv. https://doi.org/10.1101/2023.12.14.571792.