0.1 Single-Cell RNA-Seq Analysis with XYomics

Authors: Sophie Le Bars, Mohamed Soudy, and Enrico Glaab

1 Introduction

This vignette provides a guide for using the XYomics package to analyze sex-related patterns in single-cell RNA-sequencing (scRNA-seq) data. We will walk through a complete workflow using simulated data demonstrating how to identify and interpret sex-specific gene expression changes at the cell-type level. To ensure fast execution, all heavy computations are precomputed and loaded from the package.

The tutorial covers: 1. load a simulated scRNA-seq dataset. 2. Standard preprocessing using the Seurat package. 3. Performing differential expression analysis using different strategies: sex-stratified analysis and interaction analysis. 4. Detecting and categorizing sex-specific DEGs for each cell type. 5. Visualizing gene expression using dot plots. 6. Performing pathway enrichment analysis to uncover biological functions. 7. Constructing and visualizing protein-protein networks.

2 Data Simulation and Preprocessing

2.1 Simulating Single-Cell Data

We start by load a simulated scRNA-seq dataset containing 10 simulated samples. The code for generating this simulated dataset can be found in the Gitlab in the data_vignette folder.

library(Seurat)
library(XYomics)
library(dplyr)

sim.seurat <- readRDS(
  system.file("extdata", "sc_toy_seurat.rds", package = "XYomics")
)

2.2 Preprocessing and Clustering

We performed standard scRNA-seq preprocessing, including normalization, feature selection, and scaling. We then apply dimensionality reduction (PCA and UMAP) and clustering to identify cell populations.

# Precomputed during data generation

# sim.seurat <- NormalizeData(sim.seurat)
# sim.seurat <- FindVariableFeatures(sim.seurat)
# sim.seurat <- ScaleData(sim.seurat)
# sim.seurat <- RunPCA(sim.seurat)
# sim.seurat <- FindNeighbors(sim.seurat)
# sim.seurat <- FindClusters(sim.seurat)
# sim.seurat <- RunUMAP(sim.seurat)
DimPlot(sim.seurat)

2.3 Annotating Cell Types and Conditions

For the analysis, the Seurat object must contain metadata columns for cell type (cell_type), sex (sex), and condition (status). Here, we added mock annotations to our simulated data.

table(sim.seurat$cell_type)
## 
## cell type 1 cell type 2 cell type 3 cell type 4 cell type 5 
##         202         206         184         211         197
table(sim.seurat$sex)
## 
##   F   M 
## 500 500
table(sim.seurat$status)
## 
##  TG  WT 
## 600 400

3 Differential Expression Analysis Strategies

When analyzing sex differences, researchers must be aware of the pitfalls of associated statistical analyses, including the limitations of sex-stratified analyses and the challenges of analyzing interactions between sex and disease state.

Sex-stratified analyses use standard statistical tests for differential molecular abundance analysis to test for disease-associated changes in each sex separately. However, a pure sex-stratified analysis may misclassify a change as sex-specific if it uses a standard significance threshold to assess both the presence and absence of an effect. Stochastic variation in significance scores around a chosen threshold may lead to the erroneous detection of significance specific to only one sex, especially if the p-value in the other sex marginally exceeds the chosen threshold. In addition, such an analysis may miss sex-modulated changes, where significant changes in both sexes share the same direction but differ significantly in magnitude; these changes require cross-sex comparisons for accurate detection.

Interaction analysis formally tests whether the relationship between disease and molecular changes differs significantly between males and females. Not only can such interaction terms reveal complexities in disease mechanisms that might otherwise be obscured in analyses that do not consider SABV, but they also have the potential to detect changes that are limited to the magnitude of an effect, an aspect that sex-stratified analyses do not capture. Nevertheless, robust estimation of interaction effects requires large sample sizes, which are often not available due to the costs associated with advanced molecular profiling techniques such as single-cell RNA sequencing.

The XYomics package provides functions for both types of analyses.

3.1 Method 1: Sex-Stratified Analysis

We use sex_stratified_analysis_sc() to identify DEGs between conditions separately for males and females within each cell type. We use sex_stratified_pipeline_sc() to obtain the categorized DEGs for each cell type in our seurat object.

# Run for all cell types
# if you want to have the female and male degs 

results_degs <- sex_stratified_analysis_sc(sim.seurat) 
# The differential expression method can be specified using the `test.use` parameter.
# Available options include "MAST", "bimod", "poisson", "LR", and any other
# method supported by Seurat::FindMarkers(). 
#
# The `min_samples` parameter defines the minimum number of cells required
# to perform the analysis. The default value is 3.

#
# Internal QC results generated by XYomics can be accessed via:
# results_degs$validation
#
# This QC output includes assessment of sex and phenotype balance, per-group cell
# counts, detection of under-represented groups below the minimum cell threshold,
# and overall design imbalance warnings, as well as dataset sparsity metrics.

# if you want the categorized DEGs by cell type more lenient than the default thesholds as it is simulated data 

results <- sex_stratified_pipeline_sc(sim.seurat, min_abs_logfc = 0.1,  alpha = 0.1) 
# By default, `target_cell_type_flag = "all"`, meaning that all cell types are plotted.
# You can also run on one or multiple cell types, for example:
# target_cell_type_flag = "cell type 1", "cell type 2"
results_degs <- readRDS(
  system.file("extdata", "sc_results_degs_small.rds", package = "XYomics")
)

results <- readRDS(
  system.file("extdata", "sc_results_small.rds", package = "XYomics")
)

head(results$`cell type 1`)
##                     DEG_Type Gene_Symbols Male_avg_logFC     Male_FDR
## male-specific1 male-specific       EXOSC1      2.8642113 6.875386e-09
## male-specific2 male-specific        NINJ1      2.0378847 1.905962e-06
## male-specific3 male-specific         NT5C      2.1459254 3.396873e-04
## male-specific4 male-specific       RAB22A      1.8442810 1.534290e-03
## male-specific5 male-specific       NIF3L1      1.2654157 4.691819e-02
## male-specific6 male-specific        NLRP3      0.9819638 7.825030e-02
##                Female_avg_logFC Female_FDR
## male-specific1       -0.2228460          1
## male-specific2        0.1816786          1
## male-specific3        0.2135438          1
## male-specific4        0.3144854          1
## male-specific5       -0.0471756          1
## male-specific6        0.2874025          1

3.2 Method 2: Sex-Phenotype Interaction Analysis

We use sex_interaction_pipeline_sc() to perform a formal interaction analysis, directly comparing the condition effect between sexes for all cell types.

# Example for one cell type (e.g., "cell type 1")

interaction_results <- sex_interaction_pipeline_sc(sim.seurat)
# By default, `target_cell_type_flag = "all"`, meaning that all cell types are plotted.
# You can also specify one or multiple cell types, for example:
# target_cell_type_flag = c("cell type 1", "cell type 2")
#
# The `min_samples` parameter defines the minimum number of cells required
# to perform the interaction analysis. The default value is 20.
interaction_results <- readRDS(
  system.file("extdata", "sc_interaction_results_small.rds", package = "XYomics")
)

head(interaction_results$`cell type 1`$summary_stats)# signficant sex-modulated DEGs: interaction_results$`cell type 1`$sig_results 
##     cell_type n_total_genes n_sig_genes
## 1 cell type 1           100          29

4 Visualization of Gene Expression

Dot plots are an effective way to visualize gene expression in scRNA-seq data. Here, we show the expression of the top male-specific, female-specific, sex-neutral and sex-dimorphic genes across all cell types, split by sex and condition.

# Get top gene from each category for 'cell type 1'
top_male_gene <- results$`cell type 1` %>% filter(DEG_Type == "male-specific") %>% top_n(1, -Male_FDR) %>% pull(Gene_Symbols)
top_female_gene <- results$`cell type 1`  %>% filter(DEG_Type == "female-specific") %>% top_n(1, -Female_FDR) %>% pull(Gene_Symbols)
top_dimorphic_gene <- results$`cell type 1`  %>% filter(DEG_Type == "sex-dimorphic") %>% top_n(1, -Male_FDR) %>% pull(Gene_Symbols)
top_neutral_gene <- results$`cell type 1`  %>% filter(DEG_Type == "sex-neutral") %>% top_n(1, -Male_FDR) %>% pull(Gene_Symbols)

top_genes <- c(top_male_gene, top_female_gene, top_dimorphic_gene, top_neutral_gene)

generate_dotplot_sc(sim.seurat, top_genes, target_cell_type_flag = "cell type 1")

# By default, `target_cell_type_flag = "all"`, meaning that all cell types are plotted.
# You can also specify one or multiple cell types, for example:
# target_cell_type_flag = c("cell type 1", "cell type 2")

5 Pathway Enrichment Analysis

We perform pathway enrichment analysis to identify biological pathways over-represented in our DEG categories. The categorized_enrich() function automates this for all categories.

# Run for all cell types
pathway_category <- lapply(
  results,
  function(cell) categorized_enrich(cell, enrichment_db = "GO")
)
# The `enrichment_db` parameter can be set to "GO", "KEGG", "REACTOME".
# If the user wants to use a custom database, a TERM2GENE data frame must be
# provided via the `custom_db` parameter.
#
# Gene identifiers can be either any Gene identifier type supported by OrgDb
#' (e.g., "SYMBOL", "ENTREZID", "ENSEMBL", "UNIPROT")
# using the `gene_type` parameter.
#
# The `return_df` parameter controls the output format:
# if set to TRUE, results are returned as data frames;
# by default (FALSE), enrichResult objects are returned.
# Run for one specific cell type
# pathway_category_one <- categorized_enrich(results$`cell type 1`, return_df =T, enrichment_db = "GO")
pathway_category <- readRDS(
  system.file("extdata", "sc_pathway_single_small.rds", package = "XYomics")
)

pathway_category_one <- readRDS(
  system.file("extdata", "sc_pathway_single_small_df.rds", package = "XYomics")
)

head(pathway_category_one$`female-specific`)
##                    ID
## GO:1904507 GO:1904507
## GO:0060382 GO:0060382
## GO:1904505 GO:1904505
## GO:0033617 GO:0033617
## GO:2000819 GO:2000819
## GO:0008535 GO:0008535
##                                                                      Description
## GO:1904507 positive regulation of telomere maintenance in response to DNA damage
## GO:0060382                                   regulation of DNA strand elongation
## GO:1904505          regulation of telomere maintenance in response to DNA damage
## GO:0033617                           mitochondrial cytochrome c oxidase assembly
## GO:2000819                              regulation of nucleotide-excision repair
## GO:0008535                                 respiratory chain complex IV assembly
##            GeneRatio  BgRatio RichFactor FoldEnrichment   zScore      pvalue
## GO:1904507       1/3 15/18860 0.06666667       419.1111 20.43257 0.002384231
## GO:0060382       1/3 17/18860 0.05882353       369.8039 19.18795 0.002701842
## GO:1904505       1/3 19/18860 0.05263158       330.8772 18.14516 0.003019386
## GO:0033617       1/3 27/18860 0.03703704       232.8395 15.20524 0.004288885
## GO:2000819       1/3 29/18860 0.03448276       216.7816 14.66765 0.004606092
## GO:0008535       1/3 31/18860 0.03225806       202.7957 14.18283 0.004923231
##              p.adjust     qvalue geneID Count
## GO:1904507 0.04561299 0.01356908 ACTL6A     1
## GO:0060382 0.04561299 0.01356908 ACTL6A     1
## GO:1904505 0.04561299 0.01356908 ACTL6A     1
## GO:0033617 0.04561299 0.01356908  COX20     1
## GO:2000819 0.04561299 0.01356908 ACTL6A     1
## GO:0008535 0.04561299 0.01356908  COX20     1

You can then visualize them using plot_enrichment_dotplots() function.

# visualize via dotplot
plot <- plot_enrichment_dotplots(pathway_category)
print(plot)

$all

6 Protein-protein interaction Network Analysis

Finally, we construct a protein-protein interaction network to explore the interactions between our DEGs.

6.1 Building the Network

We first fetch a protein-protein interaction network from the STRING database or directly without the package (hormonal_network_metacore.rds). Then, we use the PCSF algorithm via ppi_pipeline() to extract a relevant subnetwork based on “prizes” derived from our DEG p-values for all cell types.

# Fetch STRING network (can be replaced with a custom network)
# g <- get_string_network(organism = "9606", score_threshold = 900)

# Load a pre-existing network from a file
g <- readRDS(system.file("extdata", "string_example_network.rds", package = "XYomics")) # You can also download the hormonal network "hormonal_network_metacore.rds" instead of "string_example_network.rds".
### Run for all cell type
network_results <- ppi_pipeline(results, g)

# By default, `target_cell_type_flag = "all"`, meaning that all cell types are plotted.
# You can also specify one or multiple cell types, for example:
# target_cell_type_flag = c("cell type 1", "cell type 2")

# Example for one specific cell type


network_results_one <- ppi_pipeline(results,g , target_cell_type_flag = "cell type 2")
network_results <- readRDS(
  system.file("extdata", "sc_network_results_small.rds", package = "XYomics")
)

6.2 Visualizing the Network

We use plot_network() to visualize the network, highlighting nodes based on their degree (i.e., significance). Each node also includes a barplot showing logFC values, blue for males and pink for females.

female_network <- network_results$`cell type 2`$female_network

#Generate network visualization
plot_network(female_network, "female-specific", results$`cell type 2`, "Cell type 2", show_barplot = T) 

#Generate all_plots for all categories and all cell types

all_plots <- plot_network_pipeline(network_results, results)

7 Generating a Report

The generate_cat_report function can be used to compile all results into a single HTML report.

# This command generates a comprehensive HTML report for single-cell dataset
generate_cat_report(results, pathway_category, network_results)

8 Notes