The count data were processed using scPipe and are available in a SingleCellExperiment object, along with their metadata (see data/SCEs/C075_Grant_Coultas.scPipe.SCE.rds).
Cell-based annotations are included in the colData of the SingleCellExperiment. For our analysis, we will only retain relevant metadata fields.
In particular, we keep the plate number (plate_number), well position (well_position), sample type (sample_type), genotype (genotype), mouse (mouse), sex of the mouse (sex), and sequencing run (sequencing_run).
We also include the FACS data used for gating and index sorting.
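A minimal sketch of how this might be done, assuming the file path above and the column names listed (the exact code used in the report may differ):

```r
library(SingleCellExperiment)

# Load the SingleCellExperiment produced by scPipe.
sce <- readRDS("data/SCEs/C075_Grant_Coultas.scPipe.SCE.rds")

# Retain only the metadata fields used in this analysis
# (the FACS columns used for gating and index sorting would be kept similarly).
keep_cols <- c("plate_number", "well_position", "sample_type",
               "genotype", "mouse", "sex", "sequencing_run")
colData(sce) <- colData(sce)[, keep_cols]
```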
This dataset contains samples from seven 384-well plates. Figure 1 shows that most plates have cells from a knockout and a wildtype mouse, with each pair of mice on a plate coming from the same sibship. Mouse #1824 acts as a control sample on two plates.
Figure 1: Breakdown of the samples by plate.
The complete metadata table (excluding FACS data) is shown below.
 | plate_number <fctr> | well_position <fctr> | sample_type <fctr> | genotype <fctr> |
---|---|---|---|---|
LC279_A10 | LC279 | A10 | single cell | Control |
LC279_A11 | LC279 | A11 | single cell | Control |
LC279_A12 | LC279 | A12 | single cell | Control |
LC279_A13 | LC279 | A13 | single cell | Control |
LC279_A14 | LC279 | A14 | single cell | Control |
LC279_A15 | LC279 | A15 | single cell | Control |
LC279_A16 | LC279 | A16 | single cell | Control |
LC279_A17 | LC279 | A17 | single cell | Control |
LC279_A18 | LC279 | A18 | single cell | Control |
LC279_A19 | LC279 | A19 | single cell | Control |
We used the Mus.musculus and EnsDb.Mmusculus.v79 packages, which respectively cover the NCBI/RefSeq and Ensembl databases, to obtain gene-based annotations, such as the chromosome and gene symbol.
Having quantified gene expression against the GENCODE gene annotation, we have Ensembl-style identifiers for the genes. These identifiers are used as they are unambiguous and highly stable. However, they are difficult to interpret compared to the gene symbols, which are more commonly used in the literature. Henceforth, we will use gene symbols (where available) to refer to genes in our analysis and otherwise use the Ensembl-style gene identifiers.
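As an illustrative sketch of the annotation step (only the Ensembl-based lookup is shown; GENCODE identifiers with version suffixes are assumed as row names, and the rowData column names added here are our own):

```r
library(AnnotationDbi)
library(EnsDb.Mmusculus.v79)

# Strip GENCODE version suffixes to obtain plain Ensembl gene IDs.
ens_ids <- sub("\\..*", "", rownames(sce))

# Look up gene symbols and chromosomes from the Ensembl-based annotation.
symbols <- mapIds(EnsDb.Mmusculus.v79, keys = ens_ids,
                  keytype = "GENEID", column = "SYMBOL")
chroms <- mapIds(EnsDb.Mmusculus.v79, keys = ens_ids,
                 keytype = "GENEID", column = "SEQNAME")

# Use the symbol where available, otherwise fall back to the Ensembl ID.
rowData(sce)$SYMBOL <- ifelse(is.na(symbols) | symbols == "", ens_ids, symbols)
rowData(sce)$CHR <- chroms
```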
Low-quality cells need to be removed to ensure that technical effects do not distort downstream analysis results. We use several quality control (QC) metrics to measure the quality of the cells:
- sum: This measures the library size of each cell, which is the total sum of counts across both genes and spike-in transcripts. We want cells to have high library sizes, as this means more RNA has been successfully captured during library preparation.
- detected: This is the number of expressed features in each cell.
- altexps_ERCC_percent: This measures the proportion of reads mapped to spike-in transcripts relative to the library size of each cell. High proportions are indicative of poor-quality cells, where endogenous RNA has been lost during processing (e.g., due to cell lysis or RNA degradation). The same amount of spike-in RNA is added to each cell, so an enrichment in spike-in counts is symptomatic of loss of endogenous RNA.
- subsets_Mt_percent: This measures the proportion of reads mapped to mitochondrial RNA. A higher than expected proportion of mitochondrial reads is often symptomatic of a cell under stress; such cells are of low quality and will not be used for the analysis.

For CEL-Seq2 data, we typically observe library sizes in the tens of thousands of counts.
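A sketch of how these metrics might be computed with scater, assuming the ERCC spike-ins are stored as an alternative experiment named "ERCC" and mitochondrial genes are identified from the chromosome annotation added above:

```r
library(scater)

# Flag mitochondrial genes from the chromosome annotation.
is_mito <- which(rowData(sce)$CHR == "MT")

# Add per-cell QC metrics; spike-in metrics are computed automatically for
# any alternative experiments (e.g., altexps_ERCC_percent).
sce <- addPerCellQC(sce, subsets = list(Mt = is_mito))

# The metrics described above are now columns of colData(sce):
# sum, detected, altexps_ERCC_percent, subsets_Mt_percent.
```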
The aim is to remove putative low-quality cells that have low library sizes, low numbers of expressed features, and high spike-in (or mitochondrial) proportions. Such cells can interfere with downstream analyses, e.g., by forming distinct clusters that complicate interpretation of the results.
The plate-level summaries of selected QC metrics are given in the table below.
plate_number | median_sum | mean_sum | median_detected | mean_detected |
---|---|---|---|---|
LC279 | 8710.0 | 10087.41 | 2976.0 | 3032.666 |
LC280 | 12534.0 | 15580.59 | 3693.0 | 3760.372 |
LC294 | 11148.5 | 12731.91 | 3519.0 | 3603.109 |
LC358 | 15565.5 | 16878.27 | 4105.0 | 4095.272 |
LC392 | 12384.5 | 13682.66 | 3919.0 | 3956.351 |
LC396 | 11935.0 | 14663.24 | 3682.5 | 3760.493 |
LC398 | 8642.0 | 12098.46 | 3060.0 | 3169.820 |
overall | 11787.0 | 13805.72 | 3628.5 | 3658.871 |
The distributions of these metrics are shown in Figure 2, stratified by plate and mouse.
Figure 2: Distributions of various QC metrics for all cells in the data set. This includes the library sizes, number of expressed genes, and proportion of reads mapped to spike-in transcripts or mitochondrial genes.
Figure 3 breaks the data down by whether a sample is a ‘20 cells’ or a ‘single cell’ sample.
Figure 3: Distributions of various QC metrics for all cells in the data set. This includes the library sizes, number of expressed genes, and proportion of reads mapped to spike-in transcripts or mitochondrial genes.
It is also valuable to examine how the QC metrics behave with respect to each other (Figure 4). Generally, they will be in rough agreement, i.e., cells with low total counts will also have low numbers of expressed features and high ERCC/mitochondrial proportions. Clear discrepancies may correspond to technical differences between batches of cells or genuine biological differences in RNA content.
Figure 4: Behaviour of each QC metric compared to the total number of expressed features. Each point represents a cell in the data set.
Overall, most single-cell samples have performed similarly in terms of library size, number of genes detected, and percentage of reads coming from the ERCC spike-ins and mitochondrial transcripts. As expected, the 20-cell samples typically have larger library sizes, more genes detected, and a lower percentage of reads coming from ERCC spike-ins and mitochondrial transcripts. For now, however, we focus on the single-cell data and remove the 20-cell samples.
There are some plates with a higher average percentage of reads coming from the ERCC spike-ins (LC294 and LC358) or with a subset of samples with notably larger values (LC398). Similarly, there are some plates with a higher average percentage of reads coming from the mitochondrial transcripts (LC294 and LC358). We will investigate these further in [Investigating plates with large ERCC percentages].
Figure 5 shows the percentage of reads coming from the ERCC spike-ins by plate position, which can sometimes help identify whether a plate has come out of alignment during the sort.
Figure 5: Percentage of reads coming from ERCC spike-ins for all single-cell samples in the dataset plotted by plate position. The colour scale runs from dark blue (0% of reads coming from ERCC spike-ins) to bright yellow (50% of reads coming from ERCC spike-ins) with mutant cells drawn as triangles and control cells as circles.
For plate LC398, we see a systematic trend of mutant cells on the right hand side of the plate having a higher percentage of reads coming from the ERCC spike-ins. This may be due to the plate coming out of alignment and means that caution is warranted when interpreting mutant cells from this plate.
Figure 6 shows the percentage of reads coming from mitochondrial transcripts by plate position.
Figure 6: Percentage of reads coming from mitochondrial transcripts for all single-cell samples in the dataset plotted by plate position. The colour scale runs from dark blue (0% of reads coming from mitochondrial transcripts) to bright yellow (50% of reads coming from mitochondrial transcripts) with mutant cells drawn as triangles and control cells as circles.
There is no systematic trend by plate position for the percentage of reads coming from mitochondrial transcripts.
Outliers are defined based on the number of median absolute deviations (MADs) from the median value of each metric across all cells. We remove small and large outliers for the library size and the number of expressed features, and large outliers for the spike-in proportions. Removal of low-quality cells is then performed by combining the filters for all of the metrics.
Due to the differences in the QC metrics by plate, we will compute our outlier thresholds at the plate-level. However, if an entire plate failed, outlier detection will not be able to act as an appropriate QC filter for that plate. In this case, it is generally better to compute a shared median and MAD from the other plates and use those estimates to obtain an appropriate filter threshold for cells in the problematic plates.
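A sketch of this plate-level outlier detection using scater's isOutlier(), with the problematic plate supplied via subset= so that its thresholds are borrowed from the other plates (the excluded plates are listed below; the parameters shown here are illustrative):

```r
library(scater)

qc <- colData(sce)
plate <- qc$plate_number

# Library size and number of detected features: flag small and large outliers,
# excluding plate LC398 when estimating the per-plate medians and MADs.
low_lib <- isOutlier(qc$sum, log = TRUE, type = "both", batch = plate,
                     subset = !(plate %in% "LC398"))
low_feat <- isOutlier(qc$detected, log = TRUE, type = "both", batch = plate,
                      subset = !(plate %in% "LC398"))

# Spike-in and mitochondrial percentages: flag large outliers only.
high_ercc <- isOutlier(qc$altexps_ERCC_percent, type = "higher", batch = plate)
high_mito <- isOutlier(qc$subsets_Mt_percent, type = "higher", batch = plate)

# Combine the filters across all metrics.
discard <- low_lib | low_feat | high_ercc | high_mito
```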
The following plates are excluded when computing the relevant QC metric thresholds:
- sum: LC398
- detected: LC398
- altexps_ERCC_percent: None
- subsets_Mt_percent: None

The following table summarises the QC cutoffs:
batch | total counts | total features | %ERCC | %mito |
---|---|---|---|---|
LC279 | 1732.4 | 1099.3 | 15.3 | 10.3 |
LC280 | 2094.3 | 1263.5 | 9.3 | 7.9 |
LC294 | 3100.2 | 1538.7 | 22.1 | 17.1 |
LC358 | 4036.9 | 1730.3 | 23.4 | 28.4 |
LC392 | 2991.7 | 1699.9 | 15.3 | 8.7 |
LC396 | 1763.7 | 1220.7 | 21.7 | 10.2 |
LC398 | 2617.8 | 1450.4 | 42.1 | 9.3 |
The table below summarises the number of cells per plate remaining following removal of outliers based on the QC metrics. The vast majority of cells are retained for most plates, with the exception of plate LC398. More cells are removed based on high percentages of ERCC transcripts or mitochondrial RNA than on low library size and number of expressed genes. In total, we remove 250 cells based on these QC metrics, and retain 2086 cells.
batch | ByLibSize | ByFeature | BySpike | ByMito | Remaining | PercRemaining |
---|---|---|---|---|---|---|
LC358 | 3 | 7 | 4 | 12 | 346 | 94.5 |
LC294 | 4 | 6 | 7 | 16 | 347 | 93.8 |
LC279 | 6 | 8 | 13 | 18 | 297 | 92.0 |
LC392 | 2 | 4 | 12 | 21 | 339 | 91.6 |
LC280 | 4 | 7 | 25 | 28 | 329 | 88.9 |
LC396 | 5 | 10 | 20 | 22 | 258 | 86.9 |
LC398 | 59 | 63 | 2 | 11 | 170 | 70.8 |
The biggest practical concern during QC is whether an entire cell type is inadvertently discarded. There is always some risk of this occurring as the QC metrics are never fully independent of biological state. We can diagnose cell type loss by looking for systematic differences in gene expression between the discarded and retained cells.
Figure 7 shows the result of this analysis, highlighting that there are few genes with a large logFC between ‘lost’ and ‘kept’ cells (those few genes with larger logFC have low average expression). This suggests that the QC step did not inadvertently filter out an entire biologically relevant subpopulation.
Figure 7: Log-fold change in expression in the discarded cells compared to the retained cells. Each point represents a gene, with mitochondrial transcripts in blue and ribosomal protein genes in orange. Dashed red lines indicate |logFC| = 1.
If the discarded pool is enriched for a certain cell type, we should observe increased expression of the corresponding marker genes.
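A sketch of how this lost-versus-kept comparison might be computed, using the discard vector from the outlier-filtering sketch above:

```r
library(scater)

# Average expression of the discarded and retained cells.
lost <- calculateAverage(counts(sce)[, discard])
kept <- calculateAverage(counts(sce)[, !discard])

# Log-fold change between the two pools, on the log2-CPM scale.
logged <- edgeR::cpm(cbind(lost, kept), log = TRUE, prior.count = 2)
logFC <- logged[, 1] - logged[, 2]
abundance <- rowMeans(logged)
plot(abundance, logFC, xlab = "Average abundance (log2-CPM)",
     ylab = "logFC (lost vs kept)")
```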
Another concern is whether the cells removed during QC preferentially derive from particular experimental groups. Reassuringly, Figure 8 shows that this is mostly not the case. The exception is plate LC398, which has systematically fewer cells to start with and for which the quality is lower in the part of the plate containing the Mutant cells.
Figure 8: Cells removed during QC, stratified by plate_number and genotype.
Figure 9 compares the QC metrics of the discarded and retained cells.
Figure 9: Distribution of QC metrics for each plate in the dataset. Each point represents a cell and is colored according to whether it was discarded during the QC process. Note that a cell will only be kept if it passes the relevant threshold for all four QC metrics.
To conclude, Figure 10 shows that, post-QC, most samples have similar QC metrics, as is to be expected, and Figure 11 summarises the experimental design following QC.
Figure 10: Distributions of various QC metrics for all cells that passed quality control in the data set. This includes the library sizes, number of expressed genes, and proportion of reads mapped to spike-in transcripts or mitochondrial genes.
Figure 11: Breakdown of the samples by plate following QC.
Figure 12 shows that Kat7 is not detected in most samples (89.6% of cells have zero counts) and that there is no striking difference between Control and Mutant samples.
Figure 12: Raw counts of Kat7 in each sample.
Figure 13 shows the most highly expressed genes in the dataset. Many of these are mitochondrial genes, ribosomal protein genes, and pseudogenes.
Figure 13: Percentage of total counts assigned to the top 50 most highly-abundant features in the data set. For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell, while circles are coloured according to whether the feature is labelled as a control feature.
Figure 14 shows the most highly expressed genes after excluding the mitochondrial genes, ribosomal protein genes, and pseudogenes.
Figure 14: Percentage of total counts assigned to the top 50 most highly-abundant features (after excluding mitochondrial genes, ribosomal protein genes, and pseudogenes) in the data set. For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell, while circles are coloured according to whether the feature is labelled as a control feature.
Low-abundance genes are problematic as zero or near-zero counts do not contain much information for reliable statistical inference.
The ‘optimal’ choice of filtering strategy depends on the downstream application. A more aggressive filter is usually required to remove discreteness (e.g., for normalization) compared to that required for removing underpowered tests. For hypothesis testing, the filter statistic should also be independent of the test statistic under the null hypothesis. Thus, we (or the relevant function) will filter at each step as needed, rather than applying a single filter for the entire analysis.
Several metrics can be used to define low-abundance genes. The most obvious is the average count for each gene, computed across all cells in the data set. We typically observe a peak of moderately expressed genes following a plateau of lowly expressed genes (Figure 15).
Figure 15: Histogram of log-average counts for all genes in the combined data set.
We remove 607 genes that are not expressed in any cell. Such genes provide no information and would be removed by any filtering strategy. We retain 33655 genes for downstream analysis.
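A minimal sketch of removing the genes with zero counts across all cells:

```r
# Keep only genes detected in at least one cell.
keep_genes <- rowSums(counts(sce)) > 0
sce <- sce[keep_genes, ]
```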
Systematic differences in sequencing coverage between libraries are often observed in single-cell RNA sequencing data.
We focus our attention on scaling normalization, which is the simplest and most commonly used class of normalization strategies. This involves dividing all counts for each cell by a cell-specific scaling factor, often called a “size factor”.
For this analysis we consider two forms of scaling normalization:

- normalization by deconvolution
- normalization by spike-ins
For most scRNA-seq datasets we use normalization by deconvolution. However, for this dataset there are good reasons for considering normalization by spike-ins.
Practically, spike-in normalization should be used if differences in the total RNA content of individual cells are of interest and must be preserved in downstream analyses. In this case, the mutants have defective Kat7 (aka Hbo1), which may lead to the mutant cells having lower total RNA content than the control cells. According to Zoe and Anne:

HBO1 is required for histone H3 lysine 14 acetylation throughout the genome, in genic and intergenic regions. It is possible that HBO1 is required broadly for gene activity and that very many genes are not expressed at normal levels in its absence. Therefore, the assumption that most genes are equally expressed in mutants and controls may be false. It is possible that the upregulated genes are actually either not differentially expressed or even downregulated - just not as much as the genes already identified as downregulated.
For this dataset we have ERCC spike-ins, so it is both worthwhile and possible to consider spike-in normalization.
Composition biases will be present when any unbalanced differential expression exists between samples.
In both cases, the net effect is that non-DE genes in A will incorrectly appear to be downregulated compared to B.
The removal of composition biases is a well-studied problem for bulk RNA sequencing data analysis. Normalization can be performed with the estimateSizeFactorsForMatrix function in the DESeq2 package or with the calcNormFactors function in the edgeR package.
However, single-cell data can be problematic for these bulk normalization methods due to the dominance of low and zero counts. To overcome this, we pool counts from many cells to increase the size of the counts for accurate size factor estimation; the pool-based size factors are then deconvolved into cell-based size factors using the calculateSumFactors() function from the scran package.
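A minimal sketch of the deconvolution approach, with illustrative parameters (the exact call used in the report may differ):

```r
library(scran)

# Pre-cluster the cells so that pooling occurs within groups of similar cells,
# then deconvolve pool-based size factors into per-cell size factors.
set.seed(100)
clusters <- quickCluster(sce)
sizeFactors(sce) <- calculateSumFactors(sce, clusters = clusters)
summary(sizeFactors(sce))
```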
We check that the size factors are roughly aligned with the total library sizes (Figure 16). Strong deviations from the diagonal would correspond to composition biases due to differential expression between cell subpopulations.
Figure 16: Size factors from deconvolution, plotted against library sizes for all cells in each data set. Axes are shown on a log-scale.
The assumptions of the deconvolution size factors are satisfied and so we may use deconvolution size factors for normalization.
Spike-in normalization is based on the assumption that the same amount of spike-in RNA was added to each cell.
Compared to Normalization by deconvolution, spike-in normalization requires no assumption about the biology of the system (i.e. the absence of many DE genes). Instead, it assumes that:

1. the spike-in transcripts were added at a constant level to each cell; and
2. the spike-in transcripts respond to biases (e.g., capture efficiency, sequencing depth) in the same relative manner as the endogenous genes.
We can empirically investigate (1), which we do in the next section, whereas (2) is difficult to verify in practice.
We can use the percentage of counts coming from the spike-ins to investigate if assumption (1) is met (i.e. were the spike-in transcripts added at a constant level to each cell).
To help understand whether spike-in normalization is appropriate for the current dataset, we make use of two datasets for which the assumptions of spike-in normalization are met
Figure 17 plots the spike-in counts as a percentage of the library size for each cell in the datasets from Lun (2017) and Richard (2018), alongside the current dataset. We see that for both Lun (2017) and Richard (2018) the assumption that spike-in transcripts were added at a constant level appears to be valid. In contrast, for the current dataset there is considerable inter-plate variability in the percentage of counts deriving from the spike-in transcripts.
Figure 17: Percentage of counts coming from ERCC spike-in transcripts per sample. On the top is data from Lun (2017) and Richard (2018), two datasets where spike-in normalization was successfully applied, and on the bottom is data from the current experiment. Lun (2017) samples are stratified by block, Richard (2018) samples are stratified by stimulus, and C075 samples by plate_number.
There is considerable inter-plate variability in the percentage of counts deriving from the spike-in transcripts in the current dataset. This violates assumption (1) (i.e. that the spike-in transcripts were added at a constant level to each cell) and so we may not use spike-in size factors for normalization.
Regardless of the type of size factors we use for the endogenous genes (e.g., deconvolution size factors or spike-in size factors), it is critical that the spike-in transcripts themselves are normalized using the spike-in size factors. Size factors computed from the counts for endogenous genes are usually not appropriate for normalizing the counts for spike-in transcripts. To ensure normalization is performed correctly, we compute a separate set of size factors for the spike-in set. For each cell, the spike-in-specific size factor is defined as the total count across all transcripts in the spike-in set. This assumes that none of the spike-in transcripts are differentially expressed, which is reasonable given that the same amount and composition of spike-in RNA should have been added to each cell
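A minimal sketch of computing these spike-in-specific size factors, assuming the spike-ins are stored as altExp(sce, "ERCC"):

```r
library(scater)

# Spike-in size factors are (scaled) total spike-in counts per cell,
# stored on the spike-in alternative experiment.
sizeFactors(altExp(sce, "ERCC")) <- librarySizeFactors(altExp(sce, "ERCC"))
```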
We can now use the size factors to compute normalized expression values for each cell. This is done by dividing the count for each gene/spike-in transcript by the appropriate size factor for that cell and then taking the logarithm; the resulting values are stored in the logcounts assay.
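A minimal sketch of this step using scater:

```r
library(scater)

# Divide each count by the appropriate size factor and log-transform,
# storing the result in the "logcounts" assay.
sce <- logNormCounts(sce)
assayNames(sce)
```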
Zoe requested a violin plot of Pglyrp1 expression in all cells, which is available as output/violin_plots/Pglyrp1.violin_plot.logcounts.pdf.
We often use scRNA-seq data in exploratory analyses to characterize heterogeneity across cells. Procedures like dimensionality reduction and clustering compare cells based on their gene expression profiles, which involves aggregating per-gene differences into a single (dis)similarity metric between a pair of cells. The choice of genes to use in this calculation has a major impact on the behaviour of the metric and the performance of downstream methods. We want to select genes that contain useful information about the biology of the system while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure. It also reduces the size of the dataset to improve computational efficiency of later steps.
The simplest approach to quantifying per-gene variation is to compute the variance of the log-normalized expression values (referred to as “log-counts” for simplicity) for each gene across all cells in the population.
Calculation of the per-gene variance is simple but feature selection requires modelling of the mean-variance relationship.
To account for the mean-variance relationship, we fit a trend to the variance with respect to abundance across the ERCC spike-in transcripts. The premise here is that spike-ins should not be affected by biological variation, so the fitted value of the spike-in trend should represent a better estimate of the technical component for each gene.
Data containing multiple batches will often exhibit batch effects. We are usually not interested in highly variable genes (HVGs) that are driven by batch effects. Rather, we want to focus on genes that are highly variable within each batch. This is naturally achieved by performing trend fitting and variance decomposition separately for each batch.
The use of a batch-specific trend fit is useful as it accommodates differences in the mean-variance trends between batches. This is especially important if batches exhibit systematic technical differences, e.g., differences in coverage or in the amount of spike-in RNA added.
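A sketch of this per-plate variance modelling with a spike-in trend, assuming the plate is used as the blocking factor:

```r
library(scran)

# Fit a mean-variance trend to the ERCC spike-ins within each plate and
# decompose each gene's variance into technical and biological components.
dec <- modelGeneVarWithSpikes(sce, spikes = "ERCC", block = sce$plate_number)

# Genes with the largest average biological components across plates.
head(dec[order(dec$bio, decreasing = TRUE), ])
```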
Figure 18 visualizes the quality of the batch-specific trend fits and Figure 19 highlights the need for batch-specific estimates of these fits. The analysis of each plate yields estimates of the biological and technical components for each gene, which are averaged across plates to take advantage of information from multiple batches.
Figure 18: Variance of normalized log-expression values for each gene in each plate, plotted against the mean log-expression. The coloured line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).
Figure 19: An overlay of the trend fits from the previous figure, highlighting the need for batch-specific trend fits. Each line is the trend fit for a particular batch (with colours matching the previous plot).
Once we have quantified the per-gene variation, the next step is to select the subset of HVGs to use in downstream analyses. A larger subset will reduce the risk of discarding interesting biological signal by retaining more potentially relevant genes, at the cost of increasing noise from irrelevant genes that might obscure said signal. It is difficult to determine the optimal trade-off for any given application as noise in one context may be useful signal in another. For example, heterogeneity in T cell activation responses is an interesting phenomenon but may be irrelevant noise in studies that only care about distinguishing the major immunophenotypes.
We opt to only remove the obviously uninteresting genes with variances below the trend. By doing so, we avoid the need to make any judgement calls regarding what level of variation is interesting enough to retain. This approach represents one extreme of the bias-variance trade-off where bias is minimized at the cost of maximizing noise.
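A minimal sketch of this selection, keeping every gene whose biological component is positive (i.e., variance above the trend):

```r
library(scran)

# All genes with positive biological components.
hvg <- getTopHVGs(dec, var.threshold = 0)
length(hvg)
```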
We find that the most highly variable genes in this dataset are somewhat enriched for pseudogenes (n = 1034), with Figure 20 showing that more than 20% of the top-250 HVGs are pseudogenes. Figure 20 also shows the percentage of the top-K HVGs that are ribosomal protein genes.
Figure 20: Percentage of top-K HVGs that are genes of a given class.
Zoe and Anne advised that pseudogenes, ribosomal protein genes, genes on the sex chromosomes, and mitochondrial genes are of lesser biological relevance to this study, so it was decided to exclude them from the HVGs. This means that these genes can no longer directly influence the subsequent HVG-based steps of the analysis, such as dimensionality reduction and clustering.
Although the exclusion of these genes from the HVGs prevents them from directly influencing these analyses, they may still be indirectly associated with these steps or their outcomes. For example, if there is a set of (non ribosomal protein) genes that are strongly associated with ribosomal protein gene expression, then we may still see a cluster associated with ribosomal protein gene expression.
Finally, and to emphasise, we have only excluded these gene sets from the HVGs, i.e. we have not excluded them entirely from the dataset. In particular, this means that these genes may appear in downstream results (e.g., gene lists resulting from a differential expression analysis between cells in different clusters or with different genotypes).
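A sketch of how these gene classes might be excluded from the HVG set; the rowData flags used here (is_pseudogene, is_ribo) are illustrative and not the report's actual column names:

```r
# Genes to exclude from the HVGs (but not from the dataset itself).
exclude <- rownames(sce)[which(
  rowData(sce)$is_pseudogene |
  rowData(sce)$is_ribo |
  rowData(sce)$CHR %in% c("X", "Y", "MT"))]
hvg <- setdiff(hvg, exclude)
```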
We are left with 16927 HVGs by this approach.
Figure 21: Violin plots of normalized log-expression values for the top-10 HVGs. Each point represents the log-expression value in a single cell.
Many scRNA-seq analysis procedures involve comparing cells based on their expression values across multiple genes. In these applications, each individual gene represents a dimension of the data. If we had a scRNA-seq data set with two genes, we could make a two-dimensional plot where each axis represents the expression of one gene and each point in the plot represents a cell. Dimensionality reduction extends this idea to data sets with thousands of genes where each cell’s expression profile defines its location in the high-dimensional expression space.
As the name suggests, dimensionality reduction aims to reduce the number of separate dimensions in the data. This is possible because different genes are correlated if they are affected by the same biological process. Thus, we do not need to store separate information for individual genes, but can instead compress multiple features into a single dimension. This reduces computational work in downstream analyses, as calculations only need to be performed for a few dimensions rather than thousands of genes; reduces noise by averaging across multiple genes to obtain a more precise representation of the patterns in the data; and enables effective plotting of the data, for those of us who are not capable of visualizing more than 3 dimensions.
Principal components analysis (PCA) discovers axes in high-dimensional space that capture the largest amount of variation. In PCA, the first axis (or “principal component”, PC) is chosen such that it captures the greatest variance across cells. The next PC is chosen such that it is orthogonal to the first and captures the greatest remaining amount of variation, and so on.
By definition, the top PCs capture the dominant factors of heterogeneity in the data set. Thus, we can perform dimensionality reduction by restricting downstream analyses to the top PCs. This strategy is simple, highly effective and widely used throughout the data sciences.
When applying PCA to scRNA-seq data, our assumption is that biological processes affect multiple genes in a coordinated manner. This means that the earlier PCs are likely to represent biological structure as more variation can be captured by considering the correlated behaviour of many genes. By comparison, random technical or biological noise is expected to affect each gene independently. There is unlikely to be an axis that can capture random variation across many genes, meaning that noise should mostly be concentrated in the later PCs. This motivates the use of the earlier PCs in our downstream analyses, which concentrates the biological signal to simultaneously reduce computational work and remove noise.
We perform the PCA on the log-normalized expression values. PCA is generally robust to random noise but an excess of it may cause the earlier PCs to capture noise instead of biological structure. This effect can be avoided - or at least mitigated - by restricting the PCA to a subset of HVGs, as done in Feature selection.
The choice of the number of PCs is a decision that is analogous to the choice of the number of HVGs to use. Using more PCs will avoid discarding biological signal in later PCs, at the cost of retaining more noise.
We use the strategy of retaining all PCs until the percentage of total variation explained reaches some threshold. We derive a suitable value for this threshold by calculating the proportion of variance in the data that is attributed to the biological component. This is done using the the variance modelling results from Quantifying per-gene variation.
This retains 27 dimensions, which represents the lower bound on the number of PCs required to retain all biological variation. Any fewer PCs will definitely discard some aspect of biological signal.
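A sketch of this step using scran's denoisePCA(), which keeps enough PCs to explain the estimated biological variance (using the variance decomposition and HVGs from above):

```r
library(scran)

# Restrict the PCA to the HVGs and retain PCs up to the estimated
# proportion of biological variance.
set.seed(1000)
sce <- denoisePCA(sce, technical = dec, subset.row = hvg)
ncol(reducedDim(sce, "PCA"))
```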
We use the uniform manifold approximation and projection (UMAP) method to visualize the cells in two dimensions.
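A minimal sketch of computing the UMAP coordinates from the retained PCs:

```r
library(scater)

# Run UMAP on the top PCs rather than the full expression matrix.
set.seed(1000)
sce <- runUMAP(sce, dimred = "PCA")
```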
Figure 22 visualizes the cells using the UMAP co-ordinates.
Figure 22: UMAP plot of the dataset. Each point represents a cell and is coloured by plate_number. Each panel highlights cells from a particular combination of genotype and sex.
Although Figure 22 is only a preliminary summary of the data, there are a few points worth highlighting:

- Plates LC294 and LC358 are very different from the majority of plates.

We will seek to mitigate the plate-specific differences in downstream analyses so that we might better investigate any genotype-specific differences.
The processed SingleCellExperiment object is available (see data/SCEs/C075_Grant_Coultas.preprocessed.SCE.rds). This will be used in downstream analyses, e.g., selecting biologically relevant cells.
The following are available on request:
─ Session info ─────────────────────────────────────────────────────
setting  value
version  R version 4.0.0 (2020-04-24)
os       CentOS Linux 7 (Core)
system   x86_64, linux-gnu
ui       X11
language (EN)
collate  en_US.UTF-8
ctype    en_US.UTF-8
tz       Australia/Melbourne
date     2020-10-01

─ Attached packages ─────────────────────────────────────────────────
AnnotationDbi 1.50.1, AnnotationFilter 1.12.0, Biobase 2.48.0, BiocGenerics 0.34.0, BiocParallel 1.22.0, BiocStyle 2.16.0, cowplot 1.0.0, DelayedArray 0.14.1, edgeR 3.30.3, EnsDb.Mmusculus.v79 2.99.0, ensembldb 2.12.1, GenomeInfoDb 1.24.2, GenomicFeatures 1.40.1, GenomicRanges 1.40.0, ggplot2 3.3.2, Glimma 1.16.0, GO.db 3.11.4, here 0.1, IRanges 2.22.2, janitor 2.0.1, limma 3.44.3, Matrix 1.2-18, matrixStats 0.56.0, msigdbr 7.1.1, Mus.musculus 1.3.1, org.Mm.eg.db 3.11.4, OrganismDbi 1.30.0, patchwork 1.0.1, rmarkdown 2.3, S4Vectors 0.26.1, scater 1.16.2, scran 1.16.0, scRNAseq 2.2.0, SingleCellExperiment 1.10.1, SummarizedExperiment 1.18.2, TxDb.Mmusculus.UCSC.mm10.knownGene 3.10.0, uwot 0.1.8 (all installed from Bioconductor or CRAN for R 4.0.0), plus their dependencies.

─ Library paths ─────────────────────────────────────────────────────
[1] /stornext/General/data/user_managed/grpu_mritchie_1/hickey/SCORE/C075_Grant_Coultas/renv/library/R-4.0/x86_64-pc-linux-gnu
[2] /tmp/RtmpImljK7/renv-system-library
[3] /stornext/System/data/apps/R/R-4.0.0/lib64/R/library