pheno_prerank_enrich() Enrichment: Gene List with Rank Scores for Phenotypes
Step 1: Prepare rank_df
Begin by preparing a list of genes along with their preranked scores. Store this information in a DataFrame called rank_df.
- The index of
rank_dfshould be the gene symbols. - The columns of
rank_dfshould correspond to one target phenotype (e.g., Case vs. Control).
This format allows pheno_prerank_enrich to perform enrichment analysis using phenotype-level ranked gene-level statistics.
%load_ext autoreload
import pandas as pd
rank_df = pd.read_csv("grea/db/ageing_muscle_gtex.tsv", index_col=0)
rank_df.columns = ['Case vs. Control']
rank_df.head()
Step 2: Preparing Gene Set Libraries
There are several ways to prepare gene set libraries for use in GREA:
Option 1: Use Built-in Libraries
Simply specify the libraries you're interested in as a list. For example:
libraries = ['KEGG_2021_Human', 'MSigDB_Hallmark_2020']
You can use the grea.library.list_libraries() function to view all available pathway libraries included in GREA.
Option 2: Load from GMT File
You can load external gene set libraries from .gmt files using:
libraries = read_gmt('your_library_file.gmt')
Option 3: Define a Custom Library
Create your own gene set library using a Python dictionary, where each key is a pathway name and the corresponding value is a list of genes:
libraries = {
'term1': ['GOLGA4', 'GID4'],
'term2': ['ADO', 'CHUK']
}
%autoreload
from grea.library import list_libraries
print(list_libraries())
Step 3: Run Enrichment
To perform enrichment analysis, call the grea.pheno_prerank_enrich(rank_df, libraries) function. You can customize the analysis using the following arguments:
prob_method: Method for p-value calculation. Currently supports'perm'for permutation-based testing.n_perm: Number of permutations to use for estimating the null distribution.
The function returns a GREA object containing all enrichment results, including enrichment scores and statistical significance for each library term.
%autoreload
from grea import grea
libraries = ['MSigDB_Hallmark_2020']
n_perm = 10
prob_method = 'perm'
obj = grea.pheno_prerank_enrich(rank_df, libraries,n_perm=n_perm, prob_method=prob_method)
obj
Step 4: Check Enrichment Results
The GREA object stores all enrichment results, including enrichment scores and statistical significance for each library term. GREA supports three types of enrichment scores, each reflecting a different scoring strategy:
'KS-ES': Kolmogorov–Smirnov-based Enrichment Score, capturing the peak deviation between hit and miss distributions.'KS-ESD': KS-based enrichment Score Difference, the sum of the maximum positive and negative deviations from the running score.'RC-AUC': Area Under the Recovery Curve, summarizing early enrichment of target genes along the ranking.'RC-nAUC': The normalized Area Under the Recovery Curve, summarizing early enrichment of target genes along the ranking.'nRC-AUC': Area Under the normalized Recovery Curve, summarizing early enrichment of target genes along the ranking, ranges from 0 to 1.
You can select the appropriate metric depending on your analysis goal or data characteristics.
To retrieve the enrichment results as a long DataFrame, use the get_enrich_results(metric) function.
%autoreload
df = obj.get_enrich_results(metric='KS-ES')
df.head()
%autoreload
df = obj.get_enrich_results(metric='KS-ESD')
df.head()
%autoreload
df = obj.get_enrich_results(metric='RC-AUC')
df.head()
%autoreload
df = obj.get_enrich_results(metric='RC-nAUC')
df.head()
%autoreload
df = obj.get_enrich_results(metric='nRC-AUC')
df.head()
Step 5: Visualize Enrichment Results
To visualize the enrichment results, use the pl_running_sum(metric, term, pheno_id) function by specifying the desired metric, term, and target phenotype.
%autoreload
term = 'MSigDB_Hallmark_2020|Hedgehog Signaling'
pheno_id = 'Case vs. Control'
fig = obj.pl_running_sum('KS-ES', term, pheno_id)
fig
%autoreload
term = 'MSigDB_Hallmark_2020|Hedgehog Signaling'
pheno_id = 'Case vs. Control'
fig = obj.pl_running_sum('KS-ESD', term, pheno_id)
fig
%autoreload
term = 'MSigDB_Hallmark_2020|Hedgehog Signaling'
pheno_id = 'Case vs. Control'
fig = obj.pl_running_sum('RC-AUC', term, pheno_id)
fig
%autoreload
term = 'MSigDB_Hallmark_2020|Hedgehog Signaling'
pheno_id = 'Case vs. Control'
fig = obj.pl_running_sum('RC-nAUC', term, pheno_id)
fig
%autoreload
term = 'MSigDB_Hallmark_2020|Hedgehog Signaling'
pheno_id = 'Case vs. Control'
fig = obj.pl_running_sum('nRC-AUC', term, pheno_id)
fig