Chapter 3 Introductory statistics with R

Objectives

  • Randomization and basic statistics
  • Statistical hypothesis testing: t-test
  • Sample size calculation
  • Analysis for categorical data
  • Linear regression and correlation

3.1 Basic statistics

3.1.1 Randomization

3.1.1.1 Random selection of samples from a larger set

Let’s assume that we have a population with a total of 10 subjects, labelled from 1 to 10. If we would like to randomly select 3 of them, we can do this using the sample function. Each time we run sample, a different set of subjects is selected. Try this a couple of times.

sample(10, 3)
## [1]  7 10  8
sample(10, 3)
## [1] 8 2 4

Now suppose we would like to obtain the same random selection every time. We can do this by setting a random seed first.

set.seed(3728)
sample(10, 3)
## [1]  1  9 10
set.seed(3728)
sample(10, 3)
## [1]  1  9 10

Let’s practice with a fun example: select two of our group members to come in early next Monday.

group.member <- c('Cyril', 'Dan', 'Kylie', 'Meena', 'Sara', 'Ting', 'Tsung-Heng', 'Tyler')
sample(group.member, 2)
## [1] "Ting" "Sara"

3.1.1.2 Completely randomized order of MS runs

Let’s load iprg data first.

load('./data/iprg.rda')

We can also create a random order of all elements of the iPRG dataset. Again, we can achieve this using sample, asking for exactly as many samples as there are elements in the set. Each repetition gives us a different order of the complete set.

msrun <- unique(iprg$Run)
msrun
##  [1] "JD_06232014_sample1_B.raw" "JD_06232014_sample1_C.raw"
##  [3] "JD_06232014_sample1-A.raw" "JD_06232014_sample2_A.raw"
##  [5] "JD_06232014_sample2_B.raw" "JD_06232014_sample2_C.raw"
##  [7] "JD_06232014_sample3_A.raw" "JD_06232014_sample3_B.raw"
##  [9] "JD_06232014_sample3_C.raw" "JD_06232014_sample4_B.raw"
## [11] "JD_06232014_sample4_C.raw" "JD_06232014_sample4-A.raw"
## randomize order among all 12 MS runs
sample(msrun, length(msrun))
##  [1] "JD_06232014_sample4_C.raw" "JD_06232014_sample3_A.raw"
##  [3] "JD_06232014_sample4-A.raw" "JD_06232014_sample4_B.raw"
##  [5] "JD_06232014_sample2_C.raw" "JD_06232014_sample1_C.raw"
##  [7] "JD_06232014_sample2_A.raw" "JD_06232014_sample3_C.raw"
##  [9] "JD_06232014_sample3_B.raw" "JD_06232014_sample1-A.raw"
## [11] "JD_06232014_sample1_B.raw" "JD_06232014_sample2_B.raw"
## different order will be shown.
sample(msrun, length(msrun))
##  [1] "JD_06232014_sample1_B.raw" "JD_06232014_sample3_A.raw"
##  [3] "JD_06232014_sample1-A.raw" "JD_06232014_sample2_A.raw"
##  [5] "JD_06232014_sample3_B.raw" "JD_06232014_sample1_C.raw"
##  [7] "JD_06232014_sample2_B.raw" "JD_06232014_sample3_C.raw"
##  [9] "JD_06232014_sample4_C.raw" "JD_06232014_sample4-A.raw"
## [11] "JD_06232014_sample2_C.raw" "JD_06232014_sample4_B.raw"

3.1.1.3 Randomized block design

  • Allows removing known sources of variability that you are not interested in.

  • Group conditions into blocks such that the conditions in a block are as similar as possible.

  • Randomly assign samples within each block.

This particular dataset contains a total of 12 MS runs across 4 conditions, with 3 technical replicates per condition. Using the block.random function from the psych package, we can create randomized block designs. block.random randomly assigns n subjects, with an equal number in each of the N conditions.

library("psych") ## load the psych package

msrun <- unique(iprg[, c('Condition','Run')])
msrun
## # A tibble: 12 x 2
##    Condition  Run                      
##    <chr>      <chr>                    
##  1 Condition1 JD_06232014_sample1_B.raw
##  2 Condition1 JD_06232014_sample1_C.raw
##  3 Condition1 JD_06232014_sample1-A.raw
##  4 Condition2 JD_06232014_sample2_A.raw
##  5 Condition2 JD_06232014_sample2_B.raw
##  6 Condition2 JD_06232014_sample2_C.raw
##  7 Condition3 JD_06232014_sample3_A.raw
##  8 Condition3 JD_06232014_sample3_B.raw
##  9 Condition3 JD_06232014_sample3_C.raw
## 10 Condition4 JD_06232014_sample4_B.raw
## 11 Condition4 JD_06232014_sample4_C.raw
## 12 Condition4 JD_06232014_sample4-A.raw
## 4 Conditions of 12 MS runs randomly ordered
block.random(n = 12, c(Condition = 4))
##     blocks Condition
## S1       1         4
## S2       1         1
## S3       1         2
## S4       1         3
## S5       2         2
## S6       2         4
## S7       2         1
## S8       2         3
## S9       3         4
## S10      3         1
## S11      3         3
## S12      3         2
block.random(n = 12, c(Condition = 4, BioReplicate=3))
##     blocks Condition BioReplicate
## S1       1         3            3
## S2       1         1            3
## S3       1         1            2
## S4       1         4            2
## S5       1         1            1
## S6       1         3            2
## S7       1         2            2
## S8       1         2            3
## S9       1         4            3
## S10      1         2            1
## S11      1         3            1
## S12      1         4            1
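
Note that, like sample, block.random returns a different assignment on each call. If the design needs to be reproducible, set a seed first (a minimal sketch, reusing the seed from earlier):

## reproducible block randomization (the seed value is arbitrary)
set.seed(3728)
block.random(n = 12, c(Condition = 4))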

3.1.2 Basic statistical summaries

library(dplyr)

3.1.2.1 Calculate simple statistics

Let’s start with one protein as an example and calculate the mean, standard deviation and standard error of the mean across all replicates per condition. We then store all the computed statistics in a single summary data frame for easy access.

We can use the aggregate function to compute summary statistics. aggregate splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.

# check which proteins are in the dataset; show the first protein names
head(unique(iprg$Protein))
## [1] "sp|D6VTK4|STE2_YEAST"  "sp|O13297|CET1_YEAST"  "sp|O13329|FOB1_YEAST" 
## [4] "sp|O13539|THP2_YEAST"  "sp|O13547|CCW14_YEAST" "sp|O13563|RPN13_YEAST"
length(unique(iprg$Protein))
## [1] 3027
#distinct(iprg, Protein)
n_distinct(iprg$Protein)
## [1] 3027
# Let's start with one protein, named "sp|P44015|VAC2_YEAST"
oneproteindata <- iprg[iprg$Protein == "sp|P44015|VAC2_YEAST", ]

# there are 12 rows in oneproteindata
oneproteindata
## # A tibble: 12 x 7
##    Protein Log2Intensity Run   Condition BioReplicate Intensity
##    <chr>           <dbl> <chr> <chr>            <dbl>     <dbl>
##  1 sp|P44…          26.3 JD_0… Conditio…            1 82714388.
##  2 sp|P44…          26.1 JD_0… Conditio…            1 72749239.
##  3 sp|P44…          26.3 JD_0… Conditio…            1 82100518.
##  4 sp|P44…          25.8 JD_0… Conditio…            2 59219741.
##  5 sp|P44…          26.1 JD_0… Conditio…            2 72690802.
##  6 sp|P44…          26.1 JD_0… Conditio…            2 71180513.
##  7 sp|P44…          23.1 JD_0… Conditio…            3  9295260.
##  8 sp|P44…          23.3 JD_0… Conditio…            3 10505591.
##  9 sp|P44…          23.3 JD_0… Conditio…            3 10295788.
## 10 sp|P44…          20.9 JD_0… Conditio…            4  2019205.
## 11 sp|P44…          21.7 JD_0… Conditio…            4  3440629.
## 12 sp|P44…          20.3 JD_0… Conditio…            4  1248781.
## # … with 1 more variable: TechReplicate <chr>
# with dplyr
oneproteindata.bcp <- filter(iprg, Protein == "sp|P44015|VAC2_YEAST")
oneproteindata.bcp
## # A tibble: 12 x 7
##    Protein Log2Intensity Run   Condition BioReplicate Intensity
##    <chr>           <dbl> <chr> <chr>            <dbl>     <dbl>
##  1 sp|P44…          26.3 JD_0… Conditio…            1 82714388.
##  2 sp|P44…          26.1 JD_0… Conditio…            1 72749239.
##  3 sp|P44…          26.3 JD_0… Conditio…            1 82100518.
##  4 sp|P44…          25.8 JD_0… Conditio…            2 59219741.
##  5 sp|P44…          26.1 JD_0… Conditio…            2 72690802.
##  6 sp|P44…          26.1 JD_0… Conditio…            2 71180513.
##  7 sp|P44…          23.1 JD_0… Conditio…            3  9295260.
##  8 sp|P44…          23.3 JD_0… Conditio…            3 10505591.
##  9 sp|P44…          23.3 JD_0… Conditio…            3 10295788.
## 10 sp|P44…          20.9 JD_0… Conditio…            4  2019205.
## 11 sp|P44…          21.7 JD_0… Conditio…            4  3440629.
## 12 sp|P44…          20.3 JD_0… Conditio…            4  1248781.
## # … with 1 more variable: TechReplicate <chr>
# If you want to see more details,
?aggregate

3.1.2.2 Calculate mean per group

## splits 'oneproteindata' into subsets by 'Condition',
## then computes 'FUN = mean' of 'Log2Intensity'
sub.mean <- aggregate(Log2Intensity ~ Condition,
                      data = oneproteindata,
                      FUN = mean)
sub.mean
##    Condition Log2Intensity
## 1 Condition1      26.23632
## 2 Condition2      26.00661
## 3 Condition3      23.25609
## 4 Condition4      20.97056
# with dplyr
sub.mean.bcp <- oneproteindata %>%
    group_by(Condition) %>%
    summarise(mean=mean(Log2Intensity))

sub.mean.bcp
## # A tibble: 4 x 2
##   Condition   mean
##   <chr>      <dbl>
## 1 Condition1  26.2
## 2 Condition2  26.0
## 3 Condition3  23.3
## 4 Condition4  21.0

3.1.2.3 Calculate SD (standard deviation) per group

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2} \]
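
As a quick check, we can apply this formula directly to the Condition1 values and compare the result with R’s built-in sd function:

## compute the SD 'by hand' for Condition1, then with sd()
x <- oneproteindata$Log2Intensity[oneproteindata$Condition == "Condition1"]
sqrt(sum((x - mean(x))^2) / (length(x) - 1))
## [1] 0.1039654
sd(x)
## [1] 0.1039654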

Challenge

Using the aggregate function above, calculate the median, by applying the median function.

## The same as the mean calculation above. 'FUN' is changed to 'median'.
sub.median <- aggregate(Log2Intensity ~ Condition,
                    data = oneproteindata, FUN = median)
sub.median
##    Condition Log2Intensity
## 1 Condition1      26.29089
## 2 Condition2      26.08498
## 3 Condition3      23.29555
## 4 Condition4      20.94536
# with dplyr
sub.median.bcp <- oneproteindata %>%
    group_by(Condition) %>%
    summarise(median=median(Log2Intensity))

sub.median.bcp
## # A tibble: 4 x 2
##   Condition  median
##   <chr>       <dbl>
## 1 Condition1   26.3
## 2 Condition2   26.1
## 3 Condition3   23.3
## 4 Condition4   20.9

Now, using the aggregate function above, calculate the standard deviation by applying the sd function.

## The same as mean calculation above. 'FUN' is changed to 'sd'.
sub.sd <- aggregate(Log2Intensity ~ Condition,
                    data = oneproteindata, FUN = sd)
sub.sd
##    Condition Log2Intensity
## 1 Condition1    0.10396539
## 2 Condition2    0.16268179
## 3 Condition3    0.09467798
## 4 Condition4    0.73140174
# with dplyr
sub.sd.bcp <- oneproteindata %>%
    group_by(Condition) %>%
    summarise(sd = sd(Log2Intensity))

sub.sd.bcp
## # A tibble: 4 x 2
##   Condition      sd
##   <chr>       <dbl>
## 1 Condition1 0.104 
## 2 Condition2 0.163 
## 3 Condition3 0.0947
## 4 Condition4 0.731

3.1.2.4 Count the number of observations per group

Challenge

Using the aggregate function above, count the number of observations per group with the length function.

## The same as the mean calculation. 'FUN' is changed to 'length'.
sub.len <- aggregate(Log2Intensity ~ Condition,
                     data = oneproteindata,
                     FUN = length)
sub.len
##    Condition Log2Intensity
## 1 Condition1             3
## 2 Condition2             3
## 3 Condition3             3
## 4 Condition4             3
# with dplyr
sub.len.bcp <- oneproteindata %>%
    group_by(Condition) %>%
    summarise(count = n())

sub.len.bcp
## # A tibble: 4 x 2
##   Condition  count
##   <chr>      <int>
## 1 Condition1     3
## 2 Condition2     3
## 3 Condition3     3
## 4 Condition4     3

3.1.2.5 Calculate SE (standard error of the mean) per group

\[ SE = \sqrt{\frac{s^2}{n}} \]

sub.se <- sqrt(sub.sd$Log2Intensity^2 / sub.len$Log2Intensity)
sub.se
## [1] 0.06002444 0.09392438 0.05466236 0.42227499

We can now make the summary table including the results above (mean, sd, se and length).

## paste0 : concatenate vectors after converting to character.
(grp <- paste0("Condition", 1:4))
## [1] "Condition1" "Condition2" "Condition3" "Condition4"
## It is equivalent to paste("Condition", 1:4, sep="")
summaryresult <- data.frame(Group = grp,
                            mean = sub.mean$Log2Intensity,
                            sd = sub.sd$Log2Intensity,
                            se = sub.se,
                            length = sub.len$Log2Intensity)
summaryresult
##        Group     mean         sd         se length
## 1 Condition1 26.23632 0.10396539 0.06002444      3
## 2 Condition2 26.00661 0.16268179 0.09392438      3
## 3 Condition3 23.25609 0.09467798 0.05466236      3
## 4 Condition4 20.97056 0.73140174 0.42227499      3

Challenge

Make the same table as summaryresult with dplyr package.

summaryresult.dplyr <- oneproteindata %>%
    group_by(Condition) %>%
    summarise(mean = mean(Log2Intensity),
              sd = sd(Log2Intensity),
              length = n())
summaryresult.dplyr <- mutate(summaryresult.dplyr, se=sqrt(sd^2 / length))

summaryresult.dplyr
## # A tibble: 4 x 5
##   Condition   mean     sd length     se
##   <chr>      <dbl>  <dbl>  <int>  <dbl>
## 1 Condition1  26.2 0.104       3 0.0600
## 2 Condition2  26.0 0.163       3 0.0939
## 3 Condition3  23.3 0.0947      3 0.0547
## 4 Condition4  21.0 0.731       3 0.422

3.1.3 Visualization with error bars for descriptive purpose

Error bars can have a variety of meanings if what they represent is not precisely specified. Below we provide some examples of the types of error bars that are commonly used. We’re using the summary of protein sp|P44015|VAC2_YEAST from the previous section and the ggplot2 package, as it provides a convenient way to make easily adaptable plots.

library(ggplot2)
# means without any errorbar
p <- ggplot(aes(x = Group, y = mean, colour = Group),
            data = summaryresult)+
    geom_point(size = 3)
p

Let’s change a number of visual properties to make the plot more attractive.

  • Let’s change the x-axis and y-axis labels and the title: labs(title="Mean", x="Condition", y='Log2(Intensity)')
  • Let’s change the background colour to white: theme_bw()
  • Let’s change the size or colour of the axis labels, the title and the x-axis text, using a theme
  • Let’s change the position of the legend (use 'none' to remove it)
  • Let’s add a box around the legend
  • Let’s remove the boxes around the legend keys.

See also this post for theme options and this post for complete themes.

p2 <- p + labs(title = "Mean", x = "Group", y = 'Log2(Intensity)') +
    theme_bw() +
    theme(plot.title = element_text(size = 25, colour = "darkblue"),
          axis.title.x = element_text(size = 15),
          axis.title.y = element_text(size = 15),
          axis.text.x = element_text(size = 13),
          legend.position = 'bottom',
          legend.background = element_rect(colour = 'black'),
          legend.key = element_rect(colour = 'white'))
p2

Let’s now add the standard deviation:

# mean with SD
p2 + geom_errorbar(aes(ymax = mean + sd, ymin = mean - sd), width = 0.1) +
      labs(title="Mean with SD")

Challenge

Add the standard error of the mean. Which one is smaller?

# mean with SE
p2 + geom_errorbar(aes(ymax = mean + se, ymin=mean - se), width = 0.1) +
    labs(title="Mean with SE")

## The SE is narrower than the SD!

Challenge

Add the standard error of the mean with black color.

# mean with SE
p2 + geom_errorbar(aes(ymax = mean + se, ymin=mean - se), width = 0.1, color='black') +
    labs(title="Mean with SE")

3.1.4 Working with statistical distributions

For each statistical distribution, we have functions for

  • density
  • distribution function
  • quantile function
  • random generation

For the normal distribution norm, these are respectively

  • dnorm
  • pnorm
  • qnorm
  • rnorm

Let’s start by sampling 1000000 values from a normal distribution \(N(0, 1)\):

xn <- rnorm(1e6)
hist(xn, freq = FALSE)
rug(xn)
lines(density(xn), lwd = 2)

By definition, the area under the density curve is 1. The areas to the left of 0, 1, and 2 are respectively:

pnorm(0)
## [1] 0.5
pnorm(1)
## [1] 0.8413447
pnorm(2)
## [1] 0.9772499

To ask the inverse question, we use the quantile function. To obtain the values corresponding to the cumulative probabilities 0.5, 0.8413447 and 0.9772499 of our distribution, we use:

qnorm(0.5)
## [1] 0
qnorm(pnorm(1))
## [1] 1
qnorm(pnorm(2))
## [1] 2
qnorm(0.05)
## [1] -1.644854

Finally, the density function gives us the height of the density curve at a given value:

hist(xn, freq = FALSE)
lines(density(xn), lwd = 2)
points(0, dnorm(0), pch = 19, col = "red")
points(1, dnorm(1), pch = 1, col = "blue")
points(2, dnorm(2), pch = 4, col = "orange")

3.1.5 Calculate the confidence interval

Now that we’ve covered the standard error of the mean and the standard deviation, let’s investigate how we can add custom confidence intervals (CI) for our measurement of the mean. We’ll add these CIs to the summary results we previously stored for protein sp|P44015|VAC2_YEAST.

Confidence interval:

\[\mbox{mean} \pm t_{1-\alpha/2,\, n-1} \times SE\]

To calculate the 95% confidence interval, we need to be careful to use the two-sided quantile: the error probability \(\alpha = 0.05\) is split equally between the two tails, so the right tail is 2.5% and the left tail is 2.5%, and we use the 0.975 quantile of the t distribution.

summaryresult$ciw.lower.95 <- summaryresult$mean -
    qt(0.975, summaryresult$length - 1) * summaryresult$se
summaryresult$ciw.upper.95 <- summaryresult$mean +
    qt(0.975, summaryresult$length - 1) * summaryresult$se
summaryresult
##        Group     mean         sd         se length ciw.lower.95
## 1 Condition1 26.23632 0.10396539 0.06002444      3     25.97805
## 2 Condition2 26.00661 0.16268179 0.09392438      3     25.60248
## 3 Condition3 23.25609 0.09467798 0.05466236      3     23.02090
## 4 Condition4 20.97056 0.73140174 0.42227499      3     19.15366
##   ciw.upper.95
## 1     26.49458
## 2     26.41073
## 3     23.49128
## 4     22.78746

summaryresult.dplyr %>% mutate(ciw.lower.95 = mean - qt(0.975, length-1)*se,
                               ciw.upper.95 = mean + qt(0.975, length-1)*se)
## # A tibble: 4 x 7
##   Condition   mean     sd length     se ciw.lower.95 ciw.upper.95
##   <chr>      <dbl>  <dbl>  <int>  <dbl>        <dbl>        <dbl>
## 1 Condition1  26.2 0.104       3 0.0600         26.0         26.5
## 2 Condition2  26.0 0.163       3 0.0939         25.6         26.4
## 3 Condition3  23.3 0.0947      3 0.0547         23.0         23.5
## 4 Condition4  21.0 0.731       3 0.422          19.2         22.8
summaryresult.dplyr
## # A tibble: 4 x 5
##   Condition   mean     sd length     se
##   <chr>      <dbl>  <dbl>  <int>  <dbl>
## 1 Condition1  26.2 0.104       3 0.0600
## 2 Condition2  26.0 0.163       3 0.0939
## 3 Condition3  23.3 0.0947      3 0.0547
## 4 Condition4  21.0 0.731       3 0.422

# mean with 95% two-sided confidence interval
ggplot(aes(x = Group, y = mean, colour = Group),
       data = summaryresult) +
    geom_point() +
    geom_errorbar(aes(ymax = ciw.upper.95, ymin = ciw.lower.95), width = 0.1) +
    labs(title="Mean with 95% confidence interval", x="Condition", y='Log2(Intensity)') +
    theme_bw() +
    theme(plot.title = element_text(size=25, colour="darkblue"),
          axis.title.x = element_text(size=15),
          axis.title.y = element_text(size=15),
          axis.text.x = element_text(size=13),
          legend.position = 'bottom',
          legend.background = element_rect(colour = 'black'),
          legend.key = element_rect(colour='white'))

Challenge

Replicate the above for the 99% two-sided confidence interval.

# mean with 99% two-sided confidence interval
summaryresult$ciw.lower.99 <- summaryresult$mean - qt(0.995, summaryresult$length - 1) * summaryresult$se
summaryresult$ciw.upper.99 <- summaryresult$mean + qt(0.995, summaryresult$length - 1) * summaryresult$se
summaryresult
##        Group     mean         sd         se length ciw.lower.95
## 1 Condition1 26.23632 0.10396539 0.06002444      3     25.97805
## 2 Condition2 26.00661 0.16268179 0.09392438      3     25.60248
## 3 Condition3 23.25609 0.09467798 0.05466236      3     23.02090
## 4 Condition4 20.97056 0.73140174 0.42227499      3     19.15366
##   ciw.upper.95 ciw.lower.99 ciw.upper.99
## 1     26.49458     25.64058     26.83205
## 2     26.41073     25.07442     26.93879
## 3     23.49128     22.71357     23.79860
## 4     22.78746     16.77955     25.16157
ggplot(aes(x = Group, y = mean, colour = Group),
       data = summaryresult) +
    geom_point() +
    geom_errorbar(aes(ymax = ciw.upper.99, ymin=ciw.lower.99), width=0.1) +
    labs(title="Mean with 99% confidence interval", x="Condition", y='Log2(Intensity)') +
    theme_bw()+
    theme(plot.title = element_text(size=25, colour="darkblue"),
          axis.title.x = element_text(size=15),
          axis.title.y = element_text(size=15),
          axis.text.x = element_text(size=13),
          legend.position = 'bottom',
          legend.background = element_rect(colour='black'),
          legend.key = element_rect(colour='white'))

Some comments

  • Error bars based on the SD and on the CI overlap between some of the groups!

  • Error bars for the SD show the spread of the population, while error bars based on the SE reflect the uncertainty in the mean and depend on the sample size.

  • An n% confidence interval, on the other hand, means that intervals constructed this way capture the population mean n% of the time.

  • As the sample size increases, the t quantile decreases, so the CI and the SE get closer to each other (see the quick check below).
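
We can verify the last point directly: the 95% CI half-width is the SE multiplied by a t quantile, and this multiplier shrinks towards qnorm(0.975), about 1.96, as the sample size grows.

## multiplier of the SE in the 95% CI half-width, for growing sample sizes
n <- c(3, 10, 30, 100)
qt(0.975, n - 1)
## [1] 4.302653 2.262157 2.045230 1.984217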

3.1.6 Saving our results

We have two objects that contain all the information that we have generated so far:

  • The summaryresult and summaryresult.dplyr objects, which contain all the summary statistics.
save(summaryresult, file = "./data/summaryresults.rda")
save(summaryresult.dplyr, file = "./data/summaryresults.dplyr.rda")

We can also save the summary result as a csv file using the write.csv function:

write.csv(summaryresult, file = "./data/summary.csv")

Tip: Exporting to csv is useful for sharing your work with collaborators who do not use R, but for continued work in R, the best format is rda, as it preserves the data structures across platforms.

3.2 Statistical hypothesis test

First, we are going to prepare the session for further analyses.

load("./data/summaryresults.rda")
load("./data/iprg.rda")

3.2.1 Two sample t-test for one protein with one feature

Now, we’ll perform a t-test whether protein sp|P44015|VAC2_YEAST has a change in abundance between Condition 1 and Condition 2.

Hypothesis

  • \(H_0\): no change in abundance, mean(Condition1) - mean(Condition2) = 0
  • \(H_a\): change in abundance, mean(Condition1) - mean(Condition2) \(\neq\) 0

Statistics

  • Observed \(t = \frac{\mbox{difference of group means}}{\mbox{estimate of variation}} = \frac{(mean_{1} - mean_{2})}{SE} \sim t_{\alpha/2, df}\)
  • Standard error, \(SE=\sqrt{\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}}\)

with

  • \(n_{i}\): number of replicates
  • \(s_{i}^2 = \frac{1}{n_{i}-1} \sum (Y_{ij} - \bar{Y_{i \cdot}})^2\): sample variance

Data preparation

## Let's start with one protein, named "sp|P44015|VAC2_YEAST"
oneproteindata <- iprg[iprg$Protein == "sp|P44015|VAC2_YEAST", ]

## Then, get two conditions only, because t.test only works for two groups (conditions).
oneproteindata.condition12 <- oneproteindata[oneproteindata$Condition %in%
                                             c('Condition1', 'Condition2'), ]
oneproteindata.condition12
## # A tibble: 6 x 7
##   Protein Log2Intensity Run   Condition BioReplicate Intensity
##   <chr>           <dbl> <chr> <chr>            <dbl>     <dbl>
## 1 sp|P44…          26.3 JD_0… Conditio…            1 82714388.
## 2 sp|P44…          26.1 JD_0… Conditio…            1 72749239.
## 3 sp|P44…          26.3 JD_0… Conditio…            1 82100518.
## 4 sp|P44…          25.8 JD_0… Conditio…            2 59219741.
## 5 sp|P44…          26.1 JD_0… Conditio…            2 72690802.
## 6 sp|P44…          26.1 JD_0… Conditio…            2 71180513.
## # … with 1 more variable: TechReplicate <chr>
table(oneproteindata.condition12[, c("Condition", "BioReplicate")])
##             BioReplicate
## Condition    1 2
##   Condition1 3 0
##   Condition2 0 3

## with dplyr
## Let's start with one protein, named "sp|P44015|VAC2_YEAST"
oneproteindata <- filter(iprg, Protein == "sp|P44015|VAC2_YEAST")

## Then, get two conditions only, because t.test only works for two groups (conditions).
oneproteindata.subset <- filter(oneproteindata,
                                Condition %in% c('Condition1', 'Condition2'))
oneproteindata.subset
## # A tibble: 6 x 7
##   Protein Log2Intensity Run   Condition BioReplicate Intensity
##   <chr>           <dbl> <chr> <chr>            <dbl>     <dbl>
## 1 sp|P44…          26.3 JD_0… Conditio…            1 82714388.
## 2 sp|P44…          26.1 JD_0… Conditio…            1 72749239.
## 3 sp|P44…          26.3 JD_0… Conditio…            1 82100518.
## 4 sp|P44…          25.8 JD_0… Conditio…            2 59219741.
## 5 sp|P44…          26.1 JD_0… Conditio…            2 72690802.
## 6 sp|P44…          26.1 JD_0… Conditio…            2 71180513.
## # … with 1 more variable: TechReplicate <chr>
table(oneproteindata.subset[, c("Condition", "BioReplicate")])
##             BioReplicate
## Condition    1 2
##   Condition1 3 0
##   Condition2 0 3

If we want to remove the levels that are not relevant anymore, we can use droplevels:

oneproteindata.subset <- droplevels(oneproteindata.subset)
table(oneproteindata.subset[, c("Condition", "BioReplicate")])
##             BioReplicate
## Condition    1 2
##   Condition1 3 0
##   Condition2 0 3

To perform the t-test, we use the t.test function. Let’s first familiarise ourselves with it by looking at the manual:

?t.test

And now apply it to our data:

# t test for different abundance (log2Int) between Groups (Condition)
result <- t.test(Log2Intensity ~ Condition,
                 data = oneproteindata.subset,
                 var.equal = FALSE)

result
## 
##  Welch Two Sample t-test
## 
## data:  Log2Intensity by Condition
## t = 2.0608, df = 3.4001, p-value = 0.1206
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1025408  0.5619598
## sample estimates:
## mean in group Condition1 mean in group Condition2 
##                 26.23632                 26.00661

Challenge

Repeat the t-test above but with calculating a 90% confidence interval for the log2 fold change.
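
A possible solution, using the conf.level argument of t.test:

t.test(Log2Intensity ~ Condition,
       data = oneproteindata.subset,
       var.equal = FALSE,
       conf.level = 0.9)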

3.2.2 The htest class

The t.test function, like other hypothesis testing functions, returns a result of a type we haven’t encountered yet, the htest class:

class(result)
## [1] "htest"

which stores typical results from such tests. Let’s have a more detailed look at what information we can learn from the results of our t-test. When we type the name of our result object, we get a short textual summary, but the object contains more details:

names(result)
##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"

and we can access each of these by using the $ operator, like we used to access a single column from a data.frame, but the htest class is not a data.frame (it’s actually a list). For example, to access the group means, we would use

result$estimate
## mean in group Condition1 mean in group Condition2 
##                 26.23632                 26.00661

Challenge

  • Calculate the (log2-transformed) fold change between groups
  • Extract the value of the t-statistics
  • Calculate the standard error (fold-change/t-statistics)
  • Extract the degrees of freedom (parameter)
  • Extract the p values
  • Extract the 95% confidence intervals
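
A possible solution, using the accessors listed in names(result) above:

## log2 fold change between the two groups
fc <- result$estimate[1] - result$estimate[2]
## t statistic
result$statistic
## standard error: fold change divided by the t statistic
fc / result$statistic
## degrees of freedom
result$parameter
## p value
result$p.value
## 95% confidence interval
result$conf.int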

We can also manually compute our t-test statistic using the formulas we described above and compare it with the summaryresult.

Recall the summaryresult we generated in the last section.

summaryresult
##        Group     mean         sd         se length ciw.lower.95
## 1 Condition1 26.23632 0.10396539 0.06002444      3     25.97805
## 2 Condition2 26.00661 0.16268179 0.09392438      3     25.60248
## 3 Condition3 23.25609 0.09467798 0.05466236      3     23.02090
## 4 Condition4 20.97056 0.73140174 0.42227499      3     19.15366
##   ciw.upper.95 ciw.lower.99 ciw.upper.99
## 1     26.49458     25.64058     26.83205
## 2     26.41073     25.07442     26.93879
## 3     23.49128     22.71357     23.79860
## 4     22.78746     16.77955     25.16157
summaryresult12 <- summaryresult[1:2, ]

## the test statistic: the same as 'result$statistic' above
diff(summaryresult12$mean) ## different sign, but the same absolute value as result$estimate[1] - result$estimate[2]
## [1] -0.2297095
sqrt(sum(summaryresult12$sd^2 / summaryresult12$length)) ## the same as the standard error
## [1] 0.1114662
## the t statistic: only the sign differs
diff(summaryresult12$mean) / sqrt(sum(summaryresult12$sd^2 / summaryresult12$length))
## [1] -2.060799

3.2.3 Re-calculating the p values

Referring back to our t-test results above, we can manually calculate the one- and two-sided test p-values using the t statistic and the test parameter (with the pt function).

Our t statistic was 2.0607988 (accessible with result$statistic). Let’s start by visualising it along a t distribution. We create data from such a distribution, making sure we set the appropriate degrees of freedom.

## generate 10^5 values from a t distribution with the same degrees of freedom
xt <- rt(1e5, result$parameter)
plot(density(xt), xlim = c(-10, 10))
abline(v = result$statistic, col = "red") ## where t statistics are located.
abline(h = 0, col = "gray") ## horizontal line at 0

The area to the left of that point is given by pt(result$statistic, result$parameter), which is 0.939693. The p-value for a one-sided test, which is the area to the right of the red line, is thus given by

1 - pt(result$statistic, result$parameter)
##          t 
## 0.06030697

And the p-value for a two-sided test is

2 * (1 - pt(result$statistic, result$parameter))
##         t 
## 0.1206139

which is the same as the one calculated by the t-test.
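
Equivalently, pt can compute the right-tail area directly through its lower.tail argument:

2 * pt(result$statistic, result$parameter, lower.tail = FALSE)
##         t 
## 0.1206139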

3.2.4 Choosing a model

The decision of which statistical model is appropriate for a given set of observations depends on the type of data that have been collected.

  • Quantitative response with quantitative predictors : regression model

  • Categorical response with quantitative predictors : logistic regression model for a binary categorical response (e.g. Yes/No, dead/alive), multinomial logistic regression model when the response variable has more than two possible values.

  • Quantitative response with categorical predictors : ANOVA model (quantitative response across several populations defined by one or more categorical predictor variables)

  • Categorical response with categorical predictors : contingency table that can be used to draw conclusions about the relationships between variables.

See also Bremer & Doerge, Using R at the Bench: Step-by-Step Data Analytics for Biologists, Cold Spring Harbor Laboratory Press, 2015.
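
To make this mapping concrete, here is a hypothetical sketch of the corresponding base R calls, on simulated data (all variable names are made up for illustration):

set.seed(123)
dat <- data.frame(x = rnorm(40), group = rep(c("A", "B"), 20))
dat$y <- 2 * dat$x + rnorm(40)       ## quantitative response
dat$high <- dat$y > median(dat$y)    ## binary response

lm(y ~ x, data = dat)                          ## regression model
glm(high ~ x, family = binomial, data = dat)   ## logistic regression
summary(aov(y ~ group, data = dat))            ## ANOVA model
chisq.test(table(dat$high, dat$group))         ## contingency table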

3.3 Sample size calculation

To calculate the required sample size, you’ll need to know four things:

  • \(\alpha\): significance level
  • \(power\): \(1 - \beta\), where \(\beta\) is the probability of a type II error; the power is thus the probability of a true positive discovery
  • \(\Delta\): anticipated fold change
  • \(\sigma\): anticipated variability (standard deviation)

R code

Assuming equal variance and equal numbers of samples across groups, the following formula is used for sample size estimation:

\[\frac{2{\sigma}^2}{n}\leq(\frac{\Delta}{z_{1-\beta}+z_{1-\alpha/2}})^2\]

library("pwr")

## ?pwr.t.test

# Significance level alpha
alpha <- 0.05

# Power = 1 - beta
power <- 0.95

# anticipated log2 fold change
delta <- 1

# anticipated variability
sigma <- 0.9

# Effect size
# It quantifies the size of the difference between two groups
d <- delta/sigma

#Sample size estimation
pwr.t.test(d = d, sig.level = alpha, power = power, type = 'two.sample')
## 
##      Two-sample t test power calculation 
## 
##               n = 22.06036
##               d = 1.111111
##       sig.level = 0.05
##           power = 0.95
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
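
As a sanity check, we can solve the formula above for n using normal quantiles directly; the result (about 21.05) is close to, but slightly smaller than, the t-based estimate returned by pwr.t.test:

## n >= 2 * sigma^2 * (z_{1 - beta} + z_{1 - alpha/2})^2 / Delta^2
2 * sigma^2 * (qnorm(power) + qnorm(1 - alpha/2))^2 / delta^2
## [1] 21.05143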

Challenge

  • Calculate power with 10 samples and the same parameters as above.
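
A possible solution: fix n and leave power unspecified, so that pwr.t.test solves for the power instead of the sample size. With only 10 samples per group, the reported power will be noticeably below 0.95.

pwr.t.test(d = d, sig.level = alpha, n = 10, type = 'two.sample')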

Let’s investigate the effect of required fold change and variance on the sample size estimation.

# anticipated log2 fold change
delta <- seq(0.1, 0.7, .1)
nd <- length(delta)

# anticipated variability
sigma <- seq(0.1,0.5,.1)
ns <- length(sigma)

# obtain sample sizes
samsize <- matrix(0, nrow=ns*nd, ncol = 3)
counter <- 0
for (i in 1:nd){
  for (j in 1:ns){
    result <- pwr.t.test(d = delta[i] / sigma[j],
                         sig.level = alpha,
                         power = power,
                         type = "two.sample")
    counter <- counter + 1
    samsize[counter, 1] <- delta[i]
    samsize[counter, 2] <- sigma[j]
    samsize[counter, 3] <- ceiling(result$n)
  }
}
colnames(samsize) <- c("desiredlog2FC", "variability", "samplesize")

## visualization
samsize <- as.data.frame(samsize)
samsize$variability <- as.factor(samsize$variability)
ggplot(data=samsize, aes(x=desiredlog2FC, y=samplesize, group = variability, colour = variability)) +
  geom_line() +
  geom_point(size=2, shape=21, fill="white") +
  labs(title="Significance level=0.05, Power=0.95", x="Anticipated log2 fold change", y='Sample Size (n)') +
  theme(plot.title = element_text(size=20, colour="darkblue"),
        axis.title.x = element_text(size=15),
        axis.title.y = element_text(size=15),
        axis.text.x = element_text(size=13))

3.4 Linear models and correlation

When considering correlations and modelling data, visualization is key.

Let’s use the famous Anscombe’s quartet data as a motivating example. This data is composed of 4 pairs of values, \((x_1, y_1)\) to \((x_4, y_4)\):

x1 x2 x3 x4    y1   y2    y3    y4
10 10 10  8  8.04 9.14  7.46  6.58
 8  8  8  8  6.95 8.14  6.77  5.76
13 13 13  8  7.58 8.74 12.74  7.71
 9  9  9  8  8.81 8.77  7.11  8.84
11 11 11  8  8.33 9.26  7.81  8.47
14 14 14  8  9.96 8.10  8.84  7.04
 6  6  6  8  7.24 6.13  6.08  5.25
 4  4  4 19  4.26 3.10  5.39 12.50
12 12 12  8 10.84 9.13  8.15  5.56
 7  7  7  8  4.82 7.26  6.42  7.91
 5  5  5  8  5.68 4.74  5.73  6.89

Each of these \(x\) and \(y\) sets has the same variance, mean and correlation:

                  1          2          3          4
var(x)   11.0000000 11.0000000 11.0000000 11.0000000
mean(x)   9.0000000  9.0000000  9.0000000  9.0000000
var(y)    4.1272691  4.1276291  4.1226200  4.1232491
mean(y)   7.5009091  7.5009091  7.5000000  7.5009091
cor(x,y)  0.8164205  0.8162365  0.8162867  0.8165214

But…

While the residuals of a linear regression clearly indicate fundamental differences between these datasets, the simplest and most straightforward approach is visualisation, which immediately highlights how different they are.
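
For instance, plotting the four datasets, available in R as the built-in anscombe data frame, immediately reveals their differences:

## plot the four Anscombe datasets with their (near-identical) regression lines
par(mfrow = c(2, 2))
for (i in 1:4) {
    xi <- anscombe[[paste0("x", i)]]
    yi <- anscombe[[paste0("y", i)]]
    plot(xi, yi, main = paste0("dataset ", i), pch = 19)
    abline(lm(yi ~ xi), col = "red")
}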

See also another, more recent example: The Datasaurus Dozen dataset.

3.4.1 Correlation

Here is an example where a wide format comes very handy. We are going to convert our iPRG data using the spread function from the tidyr package:

library("tidyr")
iprg2 <- spread(iprg[, 1:3], Run, Log2Intensity)
rownames(iprg2) <- iprg2$Protein
iprg2 <- iprg2[, -1]

Let’s focus on 3 runs, i.e. 2 replicates from condition 1 and 1 replicate from condition 4.

x <- iprg2[, c(1, 2, 10)]
head(x)
## # A tibble: 6 x 3
##   JD_06232014_sample1_B.r… JD_06232014_sample1_C.r… JD_06232014_sample4_B.…
##                      <dbl>                    <dbl>                   <dbl>
## 1                     26.8                     26.6                    26.6
## 2                     24.7                     24.7                    24.6
## 3                     23.4                     24.0                    23.2
## 4                     27.5                     27.4                    26.7
## 5                     27.2                     26.8                    27.0
## 6                     26.1                     26.3                    26.1
pairs(x)

We can use the cor function to calculate the Pearson correlation between two vectors of the same length (making sure the order is correct), or between the columns of a data frame. But we have missing values in the data, which will stop us from calculating the correlation:

cor(x)
##                           JD_06232014_sample1_B.raw
## JD_06232014_sample1_B.raw                         1
## JD_06232014_sample1_C.raw                        NA
## JD_06232014_sample4_B.raw                        NA
##                           JD_06232014_sample1_C.raw
## JD_06232014_sample1_B.raw                        NA
## JD_06232014_sample1_C.raw                         1
## JD_06232014_sample4_B.raw                        NA
##                           JD_06232014_sample4_B.raw
## JD_06232014_sample1_B.raw                        NA
## JD_06232014_sample1_C.raw                        NA
## JD_06232014_sample4_B.raw                         1

We first need to omit the proteins/rows that contain missing values:

x2 <- na.omit(x)
cor(x2)
##                           JD_06232014_sample1_B.raw
## JD_06232014_sample1_B.raw                 1.0000000
## JD_06232014_sample1_C.raw                 0.9722642
## JD_06232014_sample4_B.raw                 0.9702758
##                           JD_06232014_sample1_C.raw
## JD_06232014_sample1_B.raw                 0.9722642
## JD_06232014_sample1_C.raw                 1.0000000
## JD_06232014_sample4_B.raw                 0.9585676
##                           JD_06232014_sample4_B.raw
## JD_06232014_sample1_B.raw                 0.9702758
## JD_06232014_sample1_C.raw                 0.9585676
## JD_06232014_sample4_B.raw                 1.0000000
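
Alternatively, cor can handle missing values directly through its use argument; with "pairwise.complete.obs", each correlation uses all rows that are complete for that pair of runs, so the values can differ slightly from those above:

cor(x, use = "pairwise.complete.obs")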

3.4.2 A note on correlation and replication

It is often assumed that high correlation is a hallmark of good replication. Rather than focusing on the correlation of the data, a better measurement is to look at the log2 fold-changes, i.e. the distances between our repeated measurements. The ideal way to visualise this is on an MA-plot:

par(mfrow = c(1, 2))
r1 <- x2[[1]]
r2 <- x2[[2]]
M <- r1 - r2
A <- (r1 + r2)/2
plot(A, M); grid()
suppressPackageStartupMessages(library("affy"))
affy::ma.plot(A, M)

See also this post on the Simply Statistics blog.

3.4.3 Linear modelling

abline(0, 1) can be used to add a line with intercept 0 and slope 1. If we want to add the line that models the data linearly, we can calculate its parameters using the lm function:

lmod <- lm(r2 ~ r1)
summary(lmod)
## 
## Call:
## lm(formula = r2 ~ r1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2943 -0.1321 -0.0111  0.1103  3.4976 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.834552   0.105239    7.93 3.05e-15 ***
## r1          0.965778   0.004225  228.56  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3758 on 3023 degrees of freedom
## Multiple R-squared:  0.9453, Adjusted R-squared:  0.9453 
## F-statistic: 5.224e+04 on 1 and 3023 DF,  p-value: < 2.2e-16

which can be used to add the line that reflects the (linear) relationship between the two sets of data:

plot(r1, r2)
abline(lmod, col = "red")

As we have seen at the beginning of this section, it is essential not to rely solely on the correlation value, but to look at the data. This also holds true for linear (or any) modelling, which can be checked by plotting the model:

par(mfrow = c(2, 2))
plot(lmod)

  • Cook’s distance is a commonly used estimate of the influence of a data point when performing a least-squares regression analysis and can be used to highlight points that particularly influence the regression.

  • Leverage quantifies the influence of a given observation on the regression due to its location in the space of the inputs.

See also ?influence.measures.

Challenge

  1. Take any of the iprg2 replicates, model and plot their linear relationship. The iprg2 data is available as an rda file, or regenerate it as shown above.
  2. The Anscombe quartet is available as anscombe. Load it, create a linear model for one \((x_i, y_i)\) pair of your choice and visualise/check the model.
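
For the first challenge, a possible sketch, arbitrarily taking the third and fourth runs of iprg2 (the second challenge is worked below):

## model and plot the linear relationship between two runs
ra <- iprg2[[3]]
rb <- iprg2[[4]]
lmod.rep <- lm(rb ~ ra)  ## rows with missing values are dropped by default
plot(ra, rb)
abline(lmod.rep, col = "red")
par(mfrow = c(2, 2))
plot(lmod.rep)           ## diagnostic plots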

x3 <- anscombe[, 3]
y3 <- anscombe[, 7]
lmod <- lm(y3 ~ x3)
summary(lmod)
## 
## Call:
## lm(formula = y3 ~ x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## x3            0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
par(mfrow = c(2, 2))
plot(lmod)

Finally, let’s conclude by illustrating how ggplot2 can very elegantly be used to produce similar plots, with useful annotations:

library("ggplot2")
dfr <- data.frame(r1, r2, M, A)
p <- ggplot(aes(x = r1, y = r2), data = dfr) + geom_point()
p + geom_smooth(method = "lm") +
    geom_quantile(colour = "red")
## Warning: Computation failed in `stat_quantile()`:
## Package `quantreg` required for `stat_quantile`.
## Please install and try again.

Challenge

Replicate the MA plot above using ggplot2. Then add a non-parametric lowess regression using geom_smooth().

p <- ggplot(aes(x = A, y = M), data = dfr) + geom_point()
p + geom_smooth() + geom_quantile(colour = "red")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Computation failed in `stat_quantile()`:
## Package `quantreg` required for `stat_quantile`.
## Please install and try again.
