Reproduces the Cytocipher pancreas analysis

Environment setup

The below was tested using python 3.8.12, with a new conda environment in a new empty folder, as shown below

mkdir cytocipher_test
cd cytocipher_test
conda create -n cytocipher_test python=3.8.12
conda activate cytocipher_test

Cytocipher installation

!pip install scvelo==0.2.4
!pip install cytocipher

import scanpy as sc
import scvelo as scv
import matplotlib.pyplot as plt

import cytocipher as cc

Downloading data & preprocessing

Following the magnificent scvelo tutorial here: https://scvelo.readthedocs.io/VelocityBasics/

data = scv.datasets.pancreas()

Preprocessing

scv.pp.filter_genes(data, min_shared_counts=4)
scv.pp.normalize_per_cell(data)
scv.pp.log1p(data)

print(data.shape)

Filtered out 18138 genes that are detected 4 counts (shared).
Normalized count data: X, spliced, unspliced.
(3696, 9860)

Reproduce scvelo velocity embedding for reference

scv.pp.moments(data, n_pcs=30, n_neighbors=30)
scv.tl.velocity(data)
scv.tl.velocity_graph(data)

scv.pl.velocity_embedding_stream(data, basis='umap', figsize=(4,3))

computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)

Over-clustering

sc.tl.leiden(data, resolution=3.5)

_, ax = plt.subplots(figsize=(4,3))
sc.pl.umap(data, color='leiden', ax=ax)

Running Cytocipher to determine significantly different clusters

First confirm over-clustering with code-scoring

NOTE on determining if your data is over-clustered

Practically, Leiden clustering at a resolution between 2 and 4 produces over-clusters.
UMAP and the Cytocipher enrichment heatmap can be utilised to assess over-clustering.
In the UMAP, over-clusters will display as subclusters of larger populations with no apparent separation.
In the enrichment heatmap, cross-scoring is indicated by scores along each column split amongst multiple rows (indicating cells in separate clusters having similar scores).
The below heatmap is a good example, though still has some cross-scoring after significance analysis due to the continuous nature of developmental scRNA-seq data.

NOTE on cluster pairs with fully overlapping markers genes

Full overlap of marker genes between a pair of clusters (e.g. cluster i and cluster j) results in no set difference between them, and therefore negative gene set subtraction with respect to cluster j is not performed for cluster i.
Clusters with full overlap are scored/merged/kept separate the same as clusters with no overlaps (albeit scored for the same set of genes in the bi-directional test).
Optionally, the user can run with squash_exception=False when running cc.tl.merge_clusters or cc.tl.code_score, to receive an error message if full gene set overlap occurs between two cluster pairs. This prompts the user to increase the number of marker genes per cluster, in-case additional genes discriminate the cluster pair.

### Can check documentation with:
?cc.tl.code_enrich

sc.pp.highly_variable_genes(data, min_disp=.2)

cc.tl.get_markers(data, 'leiden', var_groups='highly_variable', n_top=6)
cc.tl.code_enrich(data, 'leiden', n_cpus=2)

WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'
Added data.uns['leiden_markers']
Added data.obsm['leiden_enrich_scores']

cc.pl.enrich_heatmap(data, 'leiden', figsize=(4,4))

Determine significantly different clusters

?cc.tl.merge_clusters

cc.tl.merge_clusters(data, 'leiden', n_cpus=2, p_cut=.045)

Initial merge.
WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'
WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'
Added data.obs[f'leiden_merged']
Exiting due to reaching max_iter 0

#### Asthetic colours
colors = ['#f7b6d2','#d62728','#aec7e8','#9edae5','#2ca02c','#ff9896','#ffbb78',
          '#8c564b','#1f77b4','#c5b0d5','#bcbd22','#c7c7c7','#17becf','#e377c2']
orig_colors = {str(i): color for i, color in enumerate(colors)}
clust_map = {'0': '5', '1': '0', '2': '2', '3': '1', '4': '3', '5': '4', '6': '10',
            '7': '6', '8': '8', '9': '9', '10': '11', '11': '12', '12': '14', '13': '15'}
new_colors = {clust_map[clust]: orig_colors[clust] for clust in orig_colors}
new_colors['7'] = 'wheat'
new_colors['13'] = 'salmon'

data.uns['leiden_merged_colors'] = [new_colors[clust] for clust in list(data.obs['leiden_merged'].cat.categories)]

Visualise the merged clusters

Important to make sure the cluster merging makes sense; for instance, we should not see clusters merged that are highly distint in the UMAP space, and the cluster code scores should look distinct for each cluster.

Below, we can see cytocipher cluster merge has identified interesting intermediate states in pancreas differentiation.

_, ax = plt.subplots(figsize=(4,3))
sc.pl.umap(data, color=f'leiden_merged', ax=ax)

cc.pl.enrich_heatmap(data, 'leiden_merged', figsize=(4,4), scale_cols=True)

Can visualise which clusters were merged.

?cc.pl.merge_sankey

cc.pl.merge_sankey(data, 'leiden')

Checking the p-value cutoff

Too relaxed (high p_cut), then clearly different clusters will be merged.

Too stringent (low p_cut), then clusters which are very similar may not be merged.

In the volcano plot below, the p-value cutoff of .05 appears to create a good separation of significant vs non-significant cluster pairs such that most of the non-significant cluster pairs also have a much lower log-FC for the enrichment scores.

cc.pl.volcano(data, 'leiden', p_cut=.045, p_adjust=True, figsize=(4,4))

If we want to adjust this, can do so quickly by running:

?cc.tl.merge

cc.tl.merge(data, 'leiden', p_cut=0.045, use_p_adjust=True)

Added data.uns['leiden_mutualpairs']
Added data.obs['leiden_merged']

Need to update the marker genes and the scores to ensure these are in-sync with the updated merged cluster labels:

cc.tl.get_markers(data, 'leiden_merged', var_groups='highly_variable', n_top=6)
cc.tl.code_enrich(data, 'leiden_merged', n_cpus=2)

WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'
Added data.uns['leiden_merged_markers']
Added data.obsm['leiden_merged_enrich_scores']

This is also evident when examining the distributions for the comparisons made when testing for cluster significant difference.

'Bottom significant' has a low p-value for one direction of cluster comparion (p=0.008), while the 'Bottom non-significant' has high p-values when comparing clusters in both directions (p=0.1089 & p=0.1297).

Thus indicating a good separaton between significant versus non-significant pairs.

cc.pl.sig_cluster_diagnostics(data, 'leiden')

Printing top and bottom most significant/non-significant clusters.

Top significant
p=2.373953787511573e-21 (9 cells; 9 scores) vs (1 cells; 9 scores)
p=1.6886623129604666e-10 (9 cells; 1 scores) vs (1 cells; 1 scores)

Bottom significant
p=0.9330751553043308 (26 cells; 26 scores) vs (20 cells; 26 scores)
p=1.786090242867995e-09 (26 cells; 20 scores) vs (20 cells; 20 scores)

Top non-significant
p=0.9959347222234254 (17 cells; 17 scores) vs (4 cells; 17 scores)
p=0.5147088387085947 (17 cells; 4 scores) vs (4 cells; 4 scores)

Bottom non-significnat
p=0.043690278748269896 (24 cells; 24 scores) vs (25 cells; 24 scores)
p=0.04461398211148797 (24 cells; 25 scores) vs (25 cells; 25 scores)

Checking for cluster size bias

Ideally, we do not want the number of cells in each cluster to effect the significance.

Below we plot the log-FC between the number of cells in the clusters being compared on the x-axis, with the -log10(p-values) for the pair significance comparison as the y-axis.

If a correlation is observed, this could indicate something has gone wrong in the cluster merging function above.

ρ indicates the Spearman correlation on a scale of -1 to 1, with 0 indicating no bias/correlation. In this case, ρ=0.02 indicates negligible correlation between cluster pair significance & cell abundance, & therefore no bias.

fig, ax = plt.subplots(ncols=2, figsize=(10,3))
cc.pl.check_abundance_bias(data, 'leiden', p_cut=.045, p_adjust=True, show_legend=False, ax=ax[0], show=False)
cc.pl.check_total_abundance_bias(data, 'leiden', p_cut=.045, p_adjust=True, show_legend=False, ax=ax[1], show=False)
plt.show()

Examining the expression of the top marker genes for the merged clusters

We can also look at the top marker genes for each of the merged clusters.

When visualising these & reading literature on the role of these genes in pancreas development, it's clear the new states identified correspond to important intermediate states and branch points in pancreatic cell fate determination.

markers = data.uns['leiden_merged_markers']
markers

{'0': array(['Chgb', 'Fev', 'Hmgn3', 'Chga', 'Cpe', 'Bex2'], dtype=object),
 '1': array(['Spp1', 'Sparc', 'Clu', 'Dbi', 'Mgst1', 'Anxa2'], dtype=object),
 '2': array(['1700086L19Rik', 'Rbp4', 'Cpe', 'Rap1b', 'Pcsk1n', 'Chgb'],
       dtype=object),
 '3': array(['Btbd17', 'Mdk', 'Tmsb4x', 'Neurog3', 'Gadd45a', 'Btg2'],
       dtype=object),
 '4': array(['Tuba1b', 'Hmgb2', 'Tubb5', 'H2afz', '2810417H13Rik', 'Spp1'],
       dtype=object),
 '5': array(['Cpe', 'Pcsk1n', 'Tmem27', 'Pcsk2', 'Gpx3', 'Slc38a5'],
       dtype=object),
 '6': array(['Neurog3', 'Cd24a', 'Btbd17', 'Gadd45a', 'Mdk', 'Serpinh1'],
       dtype=object),
 '7': array(['Isl1', 'Fam183b', 'Aplp1', 'Rbp4', 'Bex2', 'Gch1'], dtype=object),
 '8': array(['Pebp1', 'Gapdh', 'Wfdc2', 'Spp1', 'Rpl12', 'Mgst1'], dtype=object),
 '9': array(['Rbp4', 'Isl1', 'Hmgn3', 'Hhex', 'Fam183b', 'Hadh'], dtype=object),
 '10': array(['Rbp4', 'Isl1', 'Aplp1', 'Ghrl', 'Bex2', 'Tmed10'], dtype=object),
 '11': array(['Cdkn1a', 'Rgs17', 'Isl1', 'Krt7', 'Cck', 'Btg2'], dtype=object),
 '12': array(['Hmgb2', 'H2afv', 'Mdk', 'Tubb5', 'Cdk1', 'Birc5'], dtype=object),
 '13': array(['Cpe', 'Pcsk1n', 'Tmem27', 'Fam183b', 'Aplp1', 'Meis2'],
       dtype=object),
 '14': array(['Krt7', 'Cck', 'Pax4', 'Map1b', 'Aplp1', 'Btg2'], dtype=object),
 '15': array(['Ins2', 'Nnat', 'Sec61b', 'Ins1', 'Ppp1r1a', 'Sdf2l1'],
       dtype=object)}

interesting_markers = ['Tuba1b', 'Anxa2', 'Cd24a', 'Btbd17', 'Isl1', 'Pax4', 'Hhex', 'Ins2', 'Gcg']
clusters = ['4', '1', '6', '6', '11', '14', '9', '15', '5']
for i in range(len(clusters)):
    sc.pl.umap(data, color=['leiden_merged', interesting_markers[i]], groups=[clusters[i]])

Thanks for taking a look, please feel free to add an issue to the github page if you find problems!

https://github.com/BradBalderson/Cytocipher