Preparing the data

We start from the preprocessed SingleCellExperiment object created in ‘Selection of biologically relevant cells for the Grant (C075) retinal epithelial cells data set’.

Motivation

The most challenging task in scRNA-seq data analysis is arguably the annotation of cells with meaningful labels. We might attempt to label cells individually or to label clusters (with cells assigned to those clusters ‘inheriting’ the cluster-level label).

In ‘Selection of biologically relevant cells for the Grant (C075) retinal epithelial cells data set’, we obtained one set of labels:

‘cluster’ labels using clustering. Obtaining clusters of cells is fairly straightforward, but it is more difficult to determine what biological state is represented by each of those clusters.

In this section we use Pre-defined marker genes and perform Cluster marker gene detection to construct de novo marker gene lists that can be used to aid the interpretation of the clustering results. The de novo marker gene lists take a data-driven approach, identifying the genes that drive separation between clusters, whereas the pre-defined marker genes offer a more targeted approach to cluster annotation. These two approaches complement one another.

As a reminder, Figure 1 shows the clustering results overlaid on the UMAP plot and Figure 2 breaks down the clusters by key experimental factors.

Figure 1: UMAP plot, where each point represents a cell and is coloured according to the legend.

Figure 2: Breakdown of clusters by experimental factors.

We conclude by performing Cell cycle assignment and by describing the choice of Visualization options for the plots used in this report, as well as instructions for how to make your own plots.

Pre-defined marker genes

We examine Zoe’s gene list and also perform a simple visualization of the FACS markers.

Zoe’s gene list

Motivation

Zoe sent me a list of marker genes to help characterise the cells¹.

Analysis

Figure 3 is a heatmap of Zoe’s marker genes. It is challenging to immediately associate clusters with a cell type on the basis of these marker genes, but some patterns are clear, such as:

Artery marker genes are highly expressed in cluster 7.
Artery, tip marker genes are highly expressed in clusters 5 and 7.
Tip marker genes are highly expressed in cluster 5.

The expression patterns for the other marker gene sets are more complex: a subset of a gene set may be not expressed, restricted to particular cluster(s), or ubiquitously expressed across clusters.

Violin plots of per-gene expression measurements and heatmaps (logcounts and reconstructed) can be found in the output/zoes_marker_genes/ directory.


Figure 3: Heatmap of row-normalized logcounts for Zoe’s marker genes. Each column is a sample, each row a gene. Samples are ordered by cluster then by genotype.

FACS markers

Several FACS markers were collected for these cells: FSC_A, FSC_W, FSC_H, SSC_A, SSC_W, SSC_H, R660_20_A_APC, B530_30_A_GFP, Y582_15_A_td_Tomato, V450_50_A_DAPI, and Y780_60_A_PE_Cy7.

Absolute FACS measurements are generally not comparable across samples for a number of reasons² and normalization methods are underdeveloped. We apply a simple ‘scaled rank’ normalization method: within each plate, the FACS measurements are ranked and the ranks are rescaled to lie between 0 and 1 (0 being the lowest and 1 being the highest). This is a very drastic normalization that removes much of the quantitative information potentially available in the FACS data.
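The ‘scaled rank’ idea can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce this report: the function names are hypothetical, and the handling of ties (average ranks) is an assumption, since the report does not say how ties are treated.

```python
from collections import defaultdict

def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank (assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def scaled_rank(values, plates):
    """Rank values within each plate and rescale the ranks to [0, 1]."""
    by_plate = defaultdict(list)
    for idx, plate in enumerate(plates):
        by_plate[plate].append(idx)
    out = [None] * len(values)
    for idxs in by_plate.values():
        r = average_ranks([values[i] for i in idxs])
        lo, hi = min(r), max(r)
        for i, ri in zip(idxs, r):
            # A plate with a single distinct value gets 0 by convention here.
            out[i] = 0.0 if hi == lo else (ri - lo) / (hi - lo)
    return out

# Two hypothetical plates whose raw intensities sit on very different scales.
intensities = [10, 200, 30, 5, 7, 6]
plate_number = ["P1", "P1", "P1", "P2", "P2", "P2"]
print(scaled_rank(intensities, plate_number))
```

Because only the within-plate ordering is kept, the two plates end up on the same 0 to 1 scale even though their raw intensities differ by an order of magnitude, which is exactly the quantitative information this normalization discards.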

Figure 4 presents a simple summary of the FACS data by overlaying it on the UMAP plot.


Figure 4: Overlay of index sorting data on UMAP plot. For each marker, the left-hand plot shows the ‘raw’ or ‘pseudo-logged’ fluorescence intensity and the right-hand plot shows the ‘scaled rank’ of the raw intensity. The pseudo-log transformation maps numbers to a signed logarithmic scale with a smooth transition to a linear scale around 0. This transformation is commonly used when plotting fluorescence intensities from FACS. The scaled rank is applied within each plate_number and assigns the maximum fluorescence intensity a value of one and the minimum fluorescence intensity a value of zero. It can be thought of as a crude normalization of the FACS data that allows us to compare fluorescence intensities from different plates.
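For readers unfamiliar with the pseudo-log transformation, the sketch below shows one common definition based on the inverse hyperbolic sine. The report does not state which implementation was used, so the formula and the `sigma` and `base` parameters are assumptions, chosen to match the behaviour described in the caption (signed, roughly linear near 0, logarithmic for large values).

```python
import math

def pseudo_log(x, sigma=1.0, base=math.e):
    """Signed log-like transform: ~x / (2 * sigma) near 0, ~log(x / sigma) for large x."""
    return math.asinh(x / (2 * sigma)) / math.log(base)

# Negative, zero, and large positive intensities are all handled smoothly.
for value in (-1000, -1, 0, 1, 1000):
    print(value, round(pseudo_log(value), 3))
```

Unlike a plain logarithm, this transform is defined at zero and for negative values, which matters for compensated fluorescence intensities.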

Cluster marker gene detection

Motivation

To interpret our clustering results, we identify the genes that drive separation between clusters. These marker genes allow us to assign biological meaning to each cluster based on their functional annotation. In the most obvious case, the marker genes for each cluster are a priori associated with particular cell types, allowing us to treat the clustering as a proxy for cell type identity. The same principle can be applied to more subtle differences in activation status or differentiation state.

Identification of marker genes is usually based around the retrospective detection of differential expression between clusters³. Genes that are more strongly DE are more likely to have driven cluster separation in the first place. The top DE genes are likely to be good candidate markers as they can effectively distinguish between cells in different clusters.

Statistical methodology and considerations

It is important to have some understanding of the statistical methodology used for marker gene detection. This section gives an overview.

Gene lists

For each cluster, the DE results of the relevant comparisons are consolidated into a single output table. This allows a set of marker genes to be easily defined by taking the top DE genes from each pairwise comparison between clusters. Other statistics are also reported for each gene, including the adjusted p-values⁴ and the log-fold changes relative to every other cluster.

Use of pairwise comparisons

We intentionally use pairwise comparisons between clusters rather than comparing each cluster to the average of all other cells. The latter approach is sensitive to the population composition, potentially resulting in wildly different sets of markers when cell type abundances change in different contexts. In the worst case, the presence of a single dominant subpopulation will drive the selection of top markers for every other cluster, pushing out useful genes that can resolve the various minor subpopulations. Moreover, pairwise comparisons naturally provide more information with which to interpret the utility of a marker, e.g., by providing log-fold changes to indicate which clusters are distinguished by this gene.

Blocking

We perform intra-plate comparisons by blocking on the plate_number to avoid confounding effects from differential expression between plates⁵.

Choice of test statistic

The Welch \(t\)-test is an obvious choice of statistical method to test for differences in expression between clusters. It is quickly computed and has good statistical properties for large numbers of cells (Soneson and Robinson 2018).
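As a reminder of what the Welch \(t\)-test computes for a single gene in a single pairwise comparison, here is a minimal Python sketch. It returns the test statistic and the Welch-Satterthwaite degrees of freedom only; the function name and the toy expression values are illustrative and are not taken from this analysis.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch t-statistic and degrees of freedom for two groups of values."""
    # Squared standard errors of each group mean (unequal variances allowed).
    vx = variance(x) / len(x)
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Toy log-expression values for one gene in two clusters.
cluster_x = [1.0, 2.0, 3.0]
cluster_y = [2.0, 4.0, 6.0]
print(welch_t(cluster_x, cluster_y))
```

The statistic depends on the clusters only through their means, variances and sizes, which is why it is cheap to compute for every gene and every pair of clusters.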

Alternatively, we could consider the Wilcoxon rank sum test (also known as the Wilcoxon-Mann-Whitney test, or WMW test). Its strength lies in the fact that it directly assesses separation between the expression distributions of different clusters. The WMW test statistic is proportional to the area-under-the-curve (AUC), i.e., the concordance probability, which is the probability of a random cell from one cluster having higher expression than a random cell from another cluster. In a pairwise comparison, AUCs of 1 or 0 indicate that the two clusters have perfectly separated expression distributions. Thus, the WMW test directly addresses the most desirable property of a candidate marker gene, while the \(t\)-test only does so indirectly via the difference in the means and the intra-group variance.
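The concordance probability can be computed directly from its definition. The following Python sketch does so by brute force over all pairs of cells, counting ties as half a win (a common convention, assumed here); the function name and toy values are illustrative.

```python
def auc(x, y):
    """Probability that a random value from x exceeds a random value from y.

    Ties contribute 0.5, so identical distributions give an AUC of 0.5.
    """
    wins = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    return wins / (len(x) * len(y))

# Perfectly separated expression distributions.
print(auc([3, 4, 5], [0, 1, 2]))
# Overlapping distributions with one tie.
print(auc([1, 2], [1, 3]))
```

An AUC near 0.5 means the gene does not separate the two clusters, whereas values near 1 (or 0) mean that almost every cell in one cluster has higher (or lower) expression than almost every cell in the other.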

Another alternative, the binomial test, identifies genes that differ in the proportion of expressing cells between clusters⁶. This represents a much more stringent definition of marker genes compared to the other methods, as differences in expression between clusters are effectively ignored if both distributions of expression values are not near zero. The premise is that genes are more likely to contribute to important biological decisions if they were active in one cluster and silent in another, compared to more subtle ‘tuning’ effects from changing the expression of an active gene. From a practical perspective, a binary measure of presence/absence is easier to validate.

The Welch \(t\)-test is our default method for identifying marker genes.

Direction and magnitude of the log-fold change

There are 3 ways of specifying the direction of the log-fold change used in the statistical test when comparing cluster \(X\) to cluster \(Y\):

direction = "any": Identify genes that are upregulated or downregulated in \(X\) compared to \(Y\).
direction = "up": Identify genes that are upregulated in \(X\) compared to \(Y\).
direction = "down": Identify genes that are downregulated in \(X\) compared to \(Y\).

Generally speaking, downregulated genes are less appealing as markers as it is more difficult to interpret and experimentally validate an absence of expression. We therefore mostly focus on identifying genes that are upregulated (i.e. direction = "up") in the chosen cluster relative to any/some/all other clusters. Of course, this increased stringency is not without cost. If only upregulated genes are requested then any cluster defined by downregulation of a marker gene will not contain that gene among the top set of features in its gene list. This is occasionally relevant for subtypes or other states that are distinguished by high versus low expression of particular genes.
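To make the effect of the direction setting concrete, the sketch below converts a two-sided p-value into a one-sided one. It assumes a symmetric test statistic (such as the \(t\)-statistic), for which the two-sided p-value is twice the smaller one-sided p-value; the function name is hypothetical and this is not necessarily how the software used for this report implements it.

```python
def directional_p(two_sided_p, log_fc, direction="any"):
    """One-sided p-value from a two-sided p-value and the sign of the log-fold change."""
    if direction == "any":
        return two_sided_p
    half = two_sided_p / 2
    # Does the observed change point in the requested direction?
    agrees = log_fc > 0 if direction == "up" else log_fc < 0
    return half if agrees else 1 - half

# A gene that is strongly upregulated in X relative to Y.
for direction in ("any", "up", "down"):
    print(direction, directional_p(0.02, 1.5, direction))
```

A gene that is strongly downregulated in \(X\) therefore receives a p-value close to 1 under direction = "up", which is why it drops out of the top of the gene list.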

Choice of combining pairwise DE results into a marker list

There are 3 ways of combining the pairwise DE results to obtain the per-cluster marker gene lists:

pval.type = "any": Consolidating with DE against any other cluster
pval.type = "all": Consolidating with DE against all other clusters
pval.type = "some": Consolidating with DE against some other clusters

It is important to understand the differences between these strategies, which are described below in some detail.

`pval.type = "any"`: Consolidating with DE against any other cluster

If pval.type = "any", the null hypothesis is that the gene is not DE in any pairwise comparison. The genes in each cluster’s gene list are sorted by the minimum rank (by significance) across all pairwise comparisons (called the Top value). Taking all rows with Top values less than or equal to \(T\) yields a marker set containing the top \(T\) genes from each pairwise comparison.

This strategy guarantees the inclusion of genes that can distinguish between any two clusters.

To demonstrate, let us define a marker set with a \(T\) of 1 for a given cluster. The set of genes with Top\(\leq1\) will contain the top gene from each pairwise comparison to every other cluster. If \(T\) is instead, say, \(5\), the set will consist of the union of the top 5 genes from each pairwise comparison. Obviously, multiple genes can have the same Top as different genes may have the same rank across different pairwise comparisons. Conversely, the marker set may be smaller than the product of Top and the number of other clusters, as the same gene may be shared across different comparisons.
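The Top value can be illustrated with a small Python sketch. The gene names, cluster labels and p-values are invented for illustration, the function names are hypothetical, and ties in p-values are broken arbitrarily, which may differ from the software used for this report.

```python
def top_values(pvals_by_comparison):
    """Minimum rank of each gene across all pairwise comparisons.

    pvals_by_comparison maps a comparison label to a {gene: p-value} dict.
    """
    top = {}
    for pvals in pvals_by_comparison.values():
        ranked = sorted(pvals, key=pvals.get)  # most significant first
        for rank, gene in enumerate(ranked, start=1):
            top[gene] = min(rank, top.get(gene, rank))
    return top

def marker_set(top, T):
    """Genes with a Top value of at most T."""
    return sorted(gene for gene, value in top.items() if value <= T)

# Cluster 1 compared against clusters 2 and 3 (made-up p-values).
pvals = {
    "1 vs 2": {"GeneA": 0.001, "GeneB": 0.01, "GeneC": 0.2, "GeneD": 0.5},
    "1 vs 3": {"GeneA": 0.3, "GeneB": 0.02, "GeneC": 0.0001, "GeneD": 0.6},
}
top = top_values(pvals)
print(top)
print(marker_set(top, T=1))
```

Here GeneA and GeneC both have a Top value of 1 because each is the best gene in one comparison, even though neither is strongly DE in the other comparison. GeneB is second in both comparisons and only enters the marker set once \(T\) is raised to 2.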

This approach does not explicitly favour genes that are uniquely expressed in a cluster. Rather, it focuses on combinations of genes that - together - drive separation of a cluster from the others. This is more general and robust but tends to yield a less focused marker set compared to the other pval.type settings.

For each gene and cluster, the summary effect size is defined as the effect size from the pairwise comparison with the lowest p-value. The combined p-value is computed by applying Simes’ method to all p-values. Neither of these values is directly used for ranking; both are reported only for the sake of the user.
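Simes’ method is simple enough to sketch in Python: sort the \(n\) p-values, scale the \(i\)-th smallest by \(n/i\), and take the minimum. The function name and the example p-values below are illustrative only.

```python
def simes(pvalues):
    """Simes' combined p-value for the global null (no DE in any comparison)."""
    n = len(pvalues)
    return min(n * p / i for i, p in enumerate(sorted(pvalues), start=1))

# One gene's p-values from three pairwise comparisons (made-up numbers).
print(simes([0.01, 0.04, 0.03]))
```

A single very small p-value is enough to make the combined p-value small, which matches the null hypothesis for pval.type = "any".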

`pval.type = "all"`: Consolidating with DE against all other clusters

If pval.type = "all", the null hypothesis is that the gene is not DE in at least one pairwise comparison; it is rejected only when the gene is DE in all of them. A combined p-value for each gene is computed using Berger’s intersection union test (IUT). Ranking based on the IUT p-value will focus on genes that are DE in that cluster compared to all other clusters.

This strategy is particularly effective when dealing with distinct clusters that have a unique expression profile. In such cases, it yields a highly focused marker set that concisely captures the differences between clusters.

However, it can be too stringent if the cluster’s separation is driven by combinations of gene expression. For example, consider a situation involving four clusters expressing each combination of two marker genes A and B. With pval.type = "all", neither A nor B would be detected as a marker because neither is uniquely expressed in any one cluster. This is especially detrimental with overclustering, where an otherwise acceptable marker is discarded if it is not DE between two adjacent clusters.

For each gene and cluster, the summary effect size is defined as the effect size from the pairwise comparison with the largest p-value. This reflects the fact that, with this approach, a gene is only as significant as its weakest DE. Again, this value is not directly used for ranking and is only reported for the sake of the user.
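The intersection union test takes the largest pairwise p-value as the combined p-value, so the whole strategy fits in a short Python sketch. The function name, cluster labels, p-values and log-fold changes are invented to mirror the four-cluster example with marker genes A and B.

```python
def iut(pvalues, effects):
    """Combined p-value (largest pairwise p-value) and its matching effect size."""
    weakest = max(range(len(pvalues)), key=lambda i: pvalues[i])
    return pvalues[weakest], effects[weakest]

# Gene A in the A+B+ cluster, compared with the A+B-, A-B+ and A-B- clusters.
pvalues = [0.9, 1e-10, 1e-10]
log_fcs = [0.05, 4.2, 4.5]
print(iut(pvalues, log_fcs))
```

Gene A is strongly DE against two of the three other clusters, but the comparison against the A+B- cluster (where A is also expressed) dominates, so the gene is ranked poorly under this strategy.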

`pval.type = "some"`: Consolidating with DE against some other clusters

If pval.type = "some", the null hypothesis is that the gene is not DE in some of the pairwise comparisons. Thus, pval.type = "some" serves as a compromise between the pval.type = "all" and pval.type = "any" strategies. The definition of ‘some’ is formalized by the minimum proportion (min.prop) of significant comparisons per gene and can be tuned to the specifics of the dataset.

For example, suppose we require that the gene is significant in at least min.prop = 0.5 (i.e. 50%) of comparisons. A combined p-value is calculated by taking the middlemost value of the Holm-corrected p-values for each gene. Here, the null hypothesis is that the gene is not DE in at least half of the contrasts.

Genes are then ranked by the combined p-value. The aim is to provide a more focused marker set without being overly stringent. However, a downside is that it loses the theoretical guarantees of the pval.type = "all" and pval.type = "any" strategies. For example, there is no guarantee that the top set contains genes that can distinguish a cluster from any other cluster, which would have been possible with pval.type = "any".

For each gene and cluster, the summary effect size is defined as the effect size from the pairwise comparison with the min.prop-smallest p-value. This mirrors the p-value calculation but, again, is reported only for the benefit of the user.
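The combined p-value for this strategy can be sketched in Python as follows. This is an illustration under stated assumptions rather than the exact implementation used for this report: the function names are hypothetical, and the choice of the \(\lceil\)min.prop \(\times\) \(n\rceil\)-th smallest Holm-corrected p-value is our reading of ‘middlemost’ when min.prop = 0.5.

```python
import math

def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running = 0.0
    for pos, i in enumerate(order):
        # Scale by the number of remaining hypotheses and enforce monotonicity.
        running = max(running, min(1.0, (n - pos) * pvalues[i]))
        adjusted[i] = running
    return adjusted

def combine_some(pvalues, min_prop=0.5):
    """k-th smallest Holm-adjusted p-value, with k = ceil(min_prop * n) (assumed)."""
    k = max(1, math.ceil(min_prop * len(pvalues)))
    return sorted(holm_adjust(pvalues))[k - 1]

# One gene's p-values from four pairwise comparisons (made-up numbers).
pvalues = [0.01, 0.02, 0.5, 0.9]
print(combine_some(pvalues, min_prop=0.5))
print(combine_some(pvalues, min_prop=1.0))
```

With min_prop = 0.5 the gene is judged on its second-best comparison out of four, so two clearly non-significant comparisons do not penalise it. Raising min_prop towards 1 approaches the stringency of pval.type = "all".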

Analysis

CSV files of the cluster marker gene lists and PDF files of heatmaps for selected genes from these cluster marker gene lists can be found in the output/cluster_marker_genes/ directory.

These files are organised as follows:

prefix	`test.type`	`direction`	`pval.type`
`t_up_all`	t	up	all
`t_up_some`	t	up	some
`t_any_any`	t	any	any