Reproduces the Cytocipher pancreas analysis
Environment setup
The below was tested using python 3.8.12, with a new conda environment in a new empty folder, as shown below
mkdir cytocipher_test
cd cytocipher_test
conda create -n cytocipher_test python=3.8.12
conda activate cytocipher_test
Cytocipher installation
!pip install scvelo==0.2.4
!pip install cytocipher
import scanpy as sc
import scvelo as scv
import matplotlib.pyplot as plt
import cytocipher as cc
Downloading data & preprocessing
Following the magnificent scvelo tutorial here: https://scvelo.readthedocs.io/VelocityBasics/
data = scv.datasets.pancreas()
Preprocessing
scv.pp.filter_genes(data, min_shared_counts=4)
scv.pp.normalize_per_cell(data)
scv.pp.log1p(data)
print(data.shape)
Reproduce scvelo velocity embedding for reference
scv.pp.moments(data, n_pcs=30, n_neighbors=30)
scv.tl.velocity(data)
scv.tl.velocity_graph(data)
scv.pl.velocity_embedding_stream(data, basis='umap', figsize=(4,3))
Over-clustering
sc.tl.leiden(data, resolution=3.5)
_, ax = plt.subplots(figsize=(4,3))
sc.pl.umap(data, color='leiden', ax=ax)
Running Cytocipher to determine significantly different clusters
First confirm over-clustering with code-scoring
NOTE on determining if your data is over-clustered
- Practically, Leiden clustering at a resolution between 2 and 4 produces over-clusters.
- UMAP and the Cytocipher enrichment heatmap can be utilised to assess over-clustering.
- In the UMAP, over-clusters will display as subclusters of larger populations with no apparent separation.
- In the enrichment heatmap, cross-scoring is indicated by scores along each column split amongst multiple rows (indicating cells in separate clusters having similar scores).
- The below heatmap is a good example, though still has some cross-scoring after significance analysis due to the continuous nature of developmental scRNA-seq data.
NOTE on cluster pairs with fully overlapping markers genes
-
Full overlap of marker genes between a pair of clusters (e.g. cluster i and cluster j) results in no set difference between them, and therefore negative gene set subtraction with respect to cluster j is not performed for cluster i.
-
Clusters with full overlap are scored/merged/kept separate the same as clusters with no overlaps (albeit scored for the same set of genes in the bi-directional test).
-
Optionally, the user can run with squash_exception=False when running cc.tl.merge_clusters or cc.tl.code_score, to receive an error message if full gene set overlap occurs between two cluster pairs. This prompts the user to increase the number of marker genes per cluster, in-case additional genes discriminate the cluster pair.
### Can check documentation with:
?cc.tl.code_enrich
sc.pp.highly_variable_genes(data, min_disp=.2)
cc.tl.get_markers(data, 'leiden', var_groups='highly_variable', n_top=6)
cc.tl.code_enrich(data, 'leiden', n_cpus=2)
cc.pl.enrich_heatmap(data, 'leiden', figsize=(4,4))
Determine significantly different clusters
?cc.tl.merge_clusters
cc.tl.merge_clusters(data, 'leiden', n_cpus=2, p_cut=.045)
#### Asthetic colours
colors = ['#f7b6d2','#d62728','#aec7e8','#9edae5','#2ca02c','#ff9896','#ffbb78',
'#8c564b','#1f77b4','#c5b0d5','#bcbd22','#c7c7c7','#17becf','#e377c2']
orig_colors = {str(i): color for i, color in enumerate(colors)}
clust_map = {'0': '5', '1': '0', '2': '2', '3': '1', '4': '3', '5': '4', '6': '10',
'7': '6', '8': '8', '9': '9', '10': '11', '11': '12', '12': '14', '13': '15'}
new_colors = {clust_map[clust]: orig_colors[clust] for clust in orig_colors}
new_colors['7'] = 'wheat'
new_colors['13'] = 'salmon'
data.uns['leiden_merged_colors'] = [new_colors[clust] for clust in list(data.obs['leiden_merged'].cat.categories)]
Visualise the merged clusters
Important to make sure the cluster merging makes sense; for instance, we should not see clusters merged that are highly distint in the UMAP space, and the cluster code scores should look distinct for each cluster.
Below, we can see cytocipher cluster merge has identified interesting intermediate states in pancreas differentiation.
_, ax = plt.subplots(figsize=(4,3))
sc.pl.umap(data, color=f'leiden_merged', ax=ax)
cc.pl.enrich_heatmap(data, 'leiden_merged', figsize=(4,4), scale_cols=True)
Can visualise which clusters were merged.
?cc.pl.merge_sankey
cc.pl.merge_sankey(data, 'leiden')
Checking the p-value cutoff
Too relaxed (high p_cut), then clearly different clusters will be merged.
Too stringent (low p_cut), then clusters which are very similar may not be merged.
In the volcano plot below, the p-value cutoff of .05 appears to create a good separation of significant vs non-significant cluster pairs such that most of the non-significant cluster pairs also have a much lower log-FC for the enrichment scores.
cc.pl.volcano(data, 'leiden', p_cut=.045, p_adjust=True, figsize=(4,4))
If we want to adjust this, can do so quickly by running:
?cc.tl.merge
cc.tl.merge(data, 'leiden', p_cut=0.045, use_p_adjust=True)
Need to update the marker genes and the scores to ensure these are in-sync with the updated merged cluster labels:
cc.tl.get_markers(data, 'leiden_merged', var_groups='highly_variable', n_top=6)
cc.tl.code_enrich(data, 'leiden_merged', n_cpus=2)
This is also evident when examining the distributions for the comparisons made when testing for cluster significant difference.
'Bottom significant' has a low p-value for one direction of cluster comparion (p=0.008), while the 'Bottom non-significant' has high p-values when comparing clusters in both directions (p=0.1089 & p=0.1297).
Thus indicating a good separaton between significant versus non-significant pairs.
cc.pl.sig_cluster_diagnostics(data, 'leiden')
Checking for cluster size bias
Ideally, we do not want the number of cells in each cluster to effect the significance.
Below we plot the log-FC between the number of cells in the clusters being compared on the x-axis, with the -log10(p-values) for the pair significance comparison as the y-axis.
If a correlation is observed, this could indicate something has gone wrong in the cluster merging function above.
ρ indicates the Spearman correlation on a scale of -1 to 1, with 0 indicating no bias/correlation. In this case, ρ=0.02 indicates negligible correlation between cluster pair significance & cell abundance, & therefore no bias.
fig, ax = plt.subplots(ncols=2, figsize=(10,3))
cc.pl.check_abundance_bias(data, 'leiden', p_cut=.045, p_adjust=True, show_legend=False, ax=ax[0], show=False)
cc.pl.check_total_abundance_bias(data, 'leiden', p_cut=.045, p_adjust=True, show_legend=False, ax=ax[1], show=False)
plt.show()
Examining the expression of the top marker genes for the merged clusters
We can also look at the top marker genes for each of the merged clusters.
When visualising these & reading literature on the role of these genes in pancreas development, it's clear the new states identified correspond to important intermediate states and branch points in pancreatic cell fate determination.
markers = data.uns['leiden_merged_markers']
markers
interesting_markers = ['Tuba1b', 'Anxa2', 'Cd24a', 'Btbd17', 'Isl1', 'Pax4', 'Hhex', 'Ins2', 'Gcg']
clusters = ['4', '1', '6', '6', '11', '14', '9', '15', '5']
for i in range(len(clusters)):
sc.pl.umap(data, color=['leiden_merged', interesting_markers[i]], groups=[clusters[i]])
Thanks for taking a look, please feel free to add an issue to the github page if you find problems!
https://github.com/BradBalderson/Cytocipher