Influenza A Bulk RNA-seq Data
There is a tutorial for the CRISGITime package using the GSE30550 dataset, which contains bulk RNA-seq data from patients infected with the H3N2 influenza virus. The dataset includes samples collected at different time points and clinical symptoms.
Installation
pip install git+https://github.com/compbioclub/CRISGI.git@main
Importing Required Libraries
from crisgi import CRISGITime
import pandas as pd
import anndata as ad
from scipy.sparse import csr_matrix
Loading Data & Preprocessing
First, we need to load the raw data and preprocess it.
The basic information like the number of samples, genes, and time points can be obtained from the raw data.
The meta_data contains the sample information, including the time points and detailed clinical symptoms.
# Load raw data and meta data
adata = ad.read_h5ad("./data/GSE30550_raw.adata")
meta_data = pd.read_csv("./data/GSE30550_meta.csv", index_col=0)
# There is a default background network in the raw data, which is not needed for the analysis
del adata.varm["bg_net"]
# Preprocess the data
adata.layers["log1p"] = adata.X
adata.obs["symptom"] = adata.obs["clinic_pheno:ch1"]
adata.obs["subject"] = adata.obs["person"]
# The baseline time point samples are regarded as the reference samples
adata.obs["type"] = adata.obs["time"].apply(
lambda x: "Ref" if x == "Baseline" else "Test"
)
adata.obs["timepoint"] = adata.obs.time
# Extract the time number from the timepoint column
adata.obs["time"] = adata.obs.timepoint.apply(
lambda x: -24 if x == "Baseline" else x.replace("Hour ", " ")
).astype(int)
# Combine the subject and time columns to create a unique identifier for each sample
adata.obs["test"] = (
adata.obs["subject"].astype(str) + " " + adata.obs["time"].astype(str)
)
adata.obs["clinical_symptoms"] = meta_data.loc[
adata.obs.index, "clinical_symptoms"
].values
# Remove the original columns that are no longer needed
del adata.obs["clinic_pheno:ch1"]
We can save the processed data in a hdf5 file for later use.
adata.write_h5ad("./data/GSE30550_H3N2.h5ad")
Or load the preprocessed data from the hdf5 file.
adata = ad.read_h5ad("./data/GSE30550_H3N2.h5ad")
Creating CRISGITIME Object
Next, we can create a CRISGITIME object using the preprocessed data. This object will be used for further analysis.
# Define the calculating interaction methods (list format)
interaction_methods = ["prod"]
# Create CRISGITime object
crisgi_obj = CRISGITime(
adata=adata,
interaction_methods=interaction_methods,
n_hvg=5000,
organism="human",
n_threads=1,
dataset="GSE30550_H3N2",
class_type="time",
out_dir="./out",
)
Reference and Test Populations
We filter the samples to create reference and test populations.
The baseline samples are used as the reference population ref_obs
, while the test population test_obss
consists of samples from different time points (all samples).
ref_obs = adata.obs[adata.obs.type == "Ref"].index.to_list()
test_obss = [[x] for x in adata.obs.index.to_list()]
Calculating Entropy
Parameter groupby
is used to specify the grouping of the test samples.
In crisgi_obj.adata.obs
, we can find the symptom
column, which contains the label (Symptomatic
/Asymptomatic
) for each sample.
Filling the proper parameters in crisgi_obj.calculate_entropy()
, we can calculate the entropy for the test population.
# Set the groupby column
groupby = "symptom"
# Calculate the entropy
crisgi_obj.calculate_entropy(
ref_obs=ref_obs,
test_obss=test_obss,
groupby=groupby,
ref_time="Ref",
layer="log1p",
)
Setting DER/TER Parameters
Here we set the parameters for the DER/TER analysis.
n_top_interactions
is the number of top interactions to be considered. If set to None
, all interactions will be considered.
test_type
is used to specify the type of test to be performed. The options are DER
(differentially expressed interactions) and TER
(trend-expressed interactions). DER
and TER
are two types of interactions that can be adopted in finding key interactions in the CRISGITime package. TER
is a subset of DER
, and it is used to find the interactions that are differentially expressed in a specific trend (up or down) across time points.
target_group
is used to specify the target group for the analysis, which should be an element in the groupby
column. In this case, we are interested in the Symptomatic
group, which contains samples from patients with symptoms.
n_top_interactions = 100
test_type = "TER"
target_group = "Symptomatic"
Calulating DER and TER
There are three steps to calculate DER and TER. First, we use test_DER
to calculate DER score for every interaction. Then, we use get_DER
with specific parameters to get the DER results. Finally, based on the DER results, we use test_TER
to get the TER results.
The results of DER and TER will be saved in the crisgi_obj
object and output as CSV files.
crisgi_obj.test_DER(
groupby="symptom",
target_group=target_group,
)
crisgi_obj.get_DER(
target_group=target_group,
n_top_interactions=n_top_interactions,
p_adjust=True,
p_cutoff=0.05,
fc_cutoff=1,
sortby="pvals_adj",
)
crisgi_obj.test_TER(target_group=target_group)
DER/TER Results Output
Here we can use pl.interaction_score_line
to get a general overview of the DER/TER results for the target_group
. This function will plot the average interaction scores for the target_group
and ref_group
, allowing us to visualize the differences between the symptomatic and asymptomatic groups.
The pl.get_interaction_score
and pl.generate_interaction_score_images
functions can generate every sample's DER/TER result, and it will be output as a CSV file and a heatmap.
The method
parameter should be an element in the interaction_methods
list.
# Import required libraries
import crisgi.plotting_crisgi_time as pl
interaction_method = "prod"
# Plot general DER/TER results
pl.interaction_score_line(
crisgi_obj,
target_group=target_group,
method=interaction_method,
test_type=test_type,
unit_header=None,
)
# Generate interaction score matrix for each subject
pl.get_interaction_score(
crisgi_obj,
target_group=target_group,
groupby=groupby,
method=interaction_method,
test_type=test_type,
subject_header="subject",
out_dir=None,
)
# Generate interaction score images
pl.generate_interaction_score_images(
folder_path=crisgi_obj.out_dir,
output_path=crisgi_obj.out_dir,
rep_n=1,
robust=True,
scale=False,
)
Startpoint Detection
The startpoint_detection
function is used to detect the start points (early-warning signals) in the time series data. The results will be saved in the crisgi_obj.edata.obs.CT_time
.
crisgi_obj.detect_startpoint(symptom_types = ["Symptomatic"])
print(crisgi_obj.edata.obs.CT_time)
Training and Predicting with CRISGITime
We can use the train
& predict
functions to train and predict the CRISGITime model by using the TER/DER results as the feature matrix.
The train
function will train the model using the training data, and the predict
function will use the trained/loaded model to predict the test data.
# Import required libraries
import torch
import os
from torchvision import transforms
from torch.utils.data import DataLoader
from crisgi.util import ImageDataset
import crisgi.plotting_crisgi_time as pl
# Check if GPU is available and set device
crisgi_obj.device = "cuda" if torch.cuda.is_available() else "cpu"
# Transform for image preprocessing
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
image_dir = os.path.join(crisgi_obj.out_dir, "images")
if not os.path.exists(image_dir):
os.makedirs(image_dir)
# Generate interaction score images
pl.plot_interaction_score(
crisgi_obj,
output_path=image_dir,
robust=True,
scale=False,
rep_n=2,
label=True,
)
# Create a dataset and dataloader
dataset = ImageDataset(
image_dir=image_dir,
label_csv=os.path.join(image_dir, "labels.csv"),
transform=transform,
return_label=True,
)
loader = DataLoader(dataset, batch_size=16, shuffle=True)
Choosing/Loading the Model
There are three different models that can be used in set_model_type
: cnn
(3L-CNN), simple_cnn
(1L-CNN), and logistic
(logistic regression).
The model_type
parameter is used to specify the model type. You can load the model from a file in ae_path
and mlp_path
if you want to use a pre-trained model, or train a new model with None
.
# 3L-CNN model
crisgi_obj.set_model_type(
model_type="cnn",
# ae_path="./data/model/3L-CNN/GSE30550_H3N2_ae_model.pth",
# mlp_path="./data/model/3L-CNN/GSE30550_H3N2_mlp_model.pth",
device=crisgi_obj.device,
)
# 1L-CNN model
crisgi_obj.set_model_type(
model_type="simple_cnn",
# ae_path="./data/model/1L-CNN/GSE30550_H3N2_ae_single_model.pth",
# mlp_path="./data/model/1L-CNN/GSE30550_H3N2_mlp_single_model.pth",
device=crisgi_obj.device,
)
# Logistic regression model
crisgi_obj.set_model_type(model_type='logistic',
# model_path="./data/model/logistic/GSE30550_H3N2_log_model.pth",
device = crisgi_obj.device,
)
Training the Model
crisgi_obj.train(train_loader=loader, epochs=20)
Predicting the Model
predictions = crisgi_obj.predict(loader)
print(predictions)
Saving the Model
torch.save(crisgi_obj.model.ae, "./data/GSE30550_H3N2_ae_model.pth")
torch.save(crisgi_obj.model.mlp, "./data/GSE30550_H3N2_mlp_model.pth")
Enrichment Analysis
We have four different levels of enrichment analysis for you to choose: cohort_level_top_n_ORA
, Phenotype-level Accumulated Top n ORA
, Phenotype-level CT-rank enrichment
, and Observation-level CT-rank enrichment
.
Cohort-level Top n ORA
To capture CT interactions that are consistently strong across a cohort, we sum up the CRISGI scores for each interaction across all observations. The top-n interactions from this aggregate ranking were then used for over-representation analysis (ORA), identifying pathways enriched among globally dominant CT interactions across the population.
# The result will be generated in ./{out_dir}/{interaction_method}_cohort_enrich.csv
crisgi_obj.cohort_level_top_n_ORA(
n_top_interactions=30,
method="prod",
gene_sets=[
"KEGG_2021_Human",
"GO_Molecular_Function_2023",
"GO_Cellular_Component_2023",
"GO_Biological_Process_2023",
"MSigDB_Hallmark_2020",
],
background=None,
organism="human",
plot=True,
)
# Also, the results can be accessed through the edata object
crisgi_obj.edata.uns[f"{interaction_method}_cohort_enrich_df"]
Phenotype-level Accumulated Top n ORA
We can perform ORA on incremental top-n CT interactions, ranked by CRISGI z-scores from differential testing, using a predefined step size. For each significantly enriched pathway, we can compute a normalized Top Ratio (nTopRatio
), defined as the fraction of pathway-associated CT interactions within each top-n subset.
Pathways are ranked by cumulative nTopRatio
across all increments to highlight consistently enriched pathways across high-ranking CTs.
crisgi_obj.n_threads = 1
interaction_method = "prod"
# The result will be generated in {interaction_method}_{groupby}_{target_group}_{test_type}_enrich.csv in out_dir
crisgi_obj.pheno_level_accumulated_top_n_ORA(
target_group=target_group,
n_top_interactions=30,
n_space=10,
method=interaction_method,
test_type=test_type,
)
# Plot the phenotype-level accumulated top n ORA results
pl.pheno_level_accumulated_top_n_ORA(
crisgi_obj,
target_group=target_group,
p_adjust=True,
p_cutoff=0.05,
n_top_interactions=30,
method=interaction_method,
test_type=test_type,
)
# Also, the results can be accessed through the edata object
crisgi_obj.edata.uns[f"{interaction_method}_{groupby}_{target_group}_{test_type}_enrich_df"]
Phenotype-level CT-rank enrichment
Using Kolmogorov-Smirnov (KS)-like statistic, we can compute enrichment scores over the full ranking of CT genes based on their phenotype-specific CRISGI z-scores derived from differential test (phenotype A vs. phenotype B). This reveals pathways with directional fluctuation or stability trends across the ranked list.
Positive enrichment scores indicate pathway in phenotype A exhibits elevated fluctuations (i.e., more prominent CT signals); negative scores imply relative stability compared to phenotype B.
# Set the reference and target groups for the CT rank test
ref_group = "Asymptomatic"
target_group = "Symptomatic"
# Results will be saved in the folder ./out_dir/{prefix}_{ref_group}_{target_group}
crisgi_obj.pheno_level_CT_rank(
ref_group=ref_group,
target_group=target_group,
n_top_interactions=n_top_interactions,
sortby="pvals_adj",
gene_sets=[
"KEGG_2021_Human",
"GO_Molecular_Function_2023",
"GO_Cellular_Component_2023",
"GO_Biological_Process_2023",
"MSigDB_Hallmark_2020",
],
prefix="pheno_level_CT_rank",
min_size=5,
max_size=1000,
permutation_num=1000,
seed=0,
)
# The results can be accessed through the gp_res object
crisgi_obj.gp_res.results
Observation-level CT-rank enrichment
We extend the KS-like enrichment analysis to the observation level using CRISGI scores specific to each observation.
A positive enrichment score reflects pathway actively fluctuating in that observation, while a negative score indicates stability.
# Parameters for the observation-level CT rank test
enrich_dataset = "GSE30550_H3N2"
interaction_method = "prod"
# Set the
interactions = crisgi_obj.edata.uns[
f"{interaction_method}_{groupby}_{target_group}_{test_type}"
][0:n_top_interactions]
gene_sets = {f"{enrich_dataset} {test_type}{n_top_interactions}": interactions}
# The results will be generated in ./out_dir/{prefix} folder
crisgi_obj.obs_level_CT_rank(
gene_sets=gene_sets,
prefix="obs_level_CT_rank",
min_size=5,
)
# The results can be accessed through the gp_es object
crisgi_obj.gp_es.results