pheno_prerank_enrich()
Enrichment: Gene List with Rank Scores for Phenotypes
Step 1: Prepare rank_df
Begin by preparing a list of genes along with their preranked scores. Store this information in a DataFrame called rank_df
.
- The index of
rank_df
should be the gene symbols. - The columns of
rank_df
should correspond to one target phenotype (e.g., Case vs. Control).
This format allows pheno_prerank_enrich
to perform enrichment analysis using phenotype-level ranked gene-level statistics.
%load_ext autoreload
import pandas as pd
rank_df = pd.read_csv("grea/data/ageing_muscle_gtex.tsv", index_col=0)
rank_df.columns = ['Case vs. Control']
rank_df.head()
Step 2: Preparing Gene Set Libraries
There are several ways to prepare gene set libraries for use in GREA:
Option 1: Use Built-in Libraries
Simply specify the libraries
you're interested in as a list. For example:
libraries = ['KEGG_2021_Human', 'MSigDB_Hallmark_2020']
You can use the grea.library.list_libraries()
function to view all available pathway libraries included in GREA.
Option 2: Load from GMT File
You can load external gene set libraries from .gmt
files using:
libraries = read_gmt('your_library_file.gmt')
Option 3: Define a Custom Library
Create your own gene set library using a Python dictionary, where each key is a pathway name and the corresponding value is a list of genes:
libraries = {
'term1': ['GOLGA4', 'GID4'],
'term2': ['ADO', 'CHUK']
}
from grea.library import list_libraries
print(list_libraries())
Step 3: Run Enrichment
To perform enrichment analysis, call the grea.pheno_prerank_enrich(rank_df, libraries)
function. You can customize the analysis using the following arguments:
prob_method
: Method for p-value calculation. Currently supports'perm'
for permutation-based testing.n_perm
: Number of permutations to use for estimating the null distribution.
The function returns a GREA object containing all enrichment results, including enrichment scores and statistical significance for each library term.
%autoreload
from grea import grea
libraries = ['MSigDB_Hallmark_2020']
n_perm = 10
prob_method = 'perm'
obj = grea.pheno_prerank_enrich(rank_df, libraries,n_perm=n_perm, prob_method=prob_method)
obj
Step 4: Check Enrichment Results
The GREA
object stores all enrichment results, including enrichment scores and statistical significance for each library term. GREA supports three types of enrichment scores, each reflecting a different scoring strategy:
'KS-ES'
: Kolmogorov–Smirnov-based Enrichment Score, capturing the peak deviation between hit and miss distributions.'KS-ESD'
: KS-based enrichment Score Difference, the sum of the maximum positive and negative deviations from the running score.'RC-AUC'
: Area Under the Recovery Curve, summarizing early enrichment of target genes along the ranking.
You can select the appropriate metric depending on your analysis goal or data characteristics.
To retrieve the enrichment results as a long DataFrame, use the get_enrich_results(metric)
function.
%autoreload
df = obj.get_enrich_results(metric='KS-ES')
df.head()
%autoreload
df = obj.get_enrich_results(metric='KS-ESD')
df.head()
%autoreload
df = obj.get_enrich_results(metric='RC-AUC')
df.head()
Step 5: Visualize Enrichment Results
To visualize the enrichment results, use the pl_running_sum(metric, term, pheno_id)
function by specifying the desired metric, term, and target phenotype.
%autoreload
term = 'MSigDB_Hallmark_2020|Hedgehog Signaling'
pheno_id = 'Case vs. Control'
fig = obj.pl_running_sum('KS-ES', term, pheno_id)
fig
%autoreload
term = 'MSigDB_Hallmark_2020|Hedgehog Signaling'
pheno_id = 'Case vs. Control'
fig = obj.pl_running_sum('KS-ESD', term, pheno_id)
fig
%autoreload
term = 'MSigDB_Hallmark_2020|Hedgehog Signaling'
pheno_id = 'Case vs. Control'
fig = obj.pl_running_sum('RC-AUC', term, pheno_id)
fig