pheno_prerank_enrich()
Enrichment: Gene List with Rank Scores for Phenotypes
Step 1: Prepare rank_df
Begin by preparing a list of genes along with their preranked scores. Store this information in a DataFrame called rank_df.
- The index of rank_df should be the gene symbols.
- Each column of rank_df should correspond to one target phenotype (e.g., Case vs. Control).
This format allows pheno_prerank_enrich to perform enrichment analysis using gene-level statistics ranked for each phenotype.
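For illustration only, a minimal rank_df with this shape can be constructed by hand (the gene symbols and scores below are made up); the tutorial itself loads an example dataset in the next cell:
import pandas as pd
# Illustrative only: made-up genes and rank scores.
rank_df = pd.DataFrame(
    {'Case vs. Control': [2.3, 1.1, -0.4, -1.8]},
    index=['TP53', 'MYC', 'GAPDH', 'ACTB'],
)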
%load_ext autoreload
import pandas as pd
# The .tsv extension suggests tab-separated values; adjust sep if your file is comma-delimited.
rank_df = pd.read_csv("grea/db/ageing_muscle_gtex.tsv", sep='\t', index_col=0)
rank_df.columns = ['Case vs. Control']
rank_df.head()
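Optionally, a quick sanity check with plain pandas (not part of GREA) confirms that the gene symbols are unique and that no rank scores are missing:
# Optional checks, not part of GREA itself.
assert rank_df.index.is_unique, "duplicate gene symbols in the index"
assert rank_df.notna().all().all(), "missing rank scores"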
Step 2: Prepare Gene Set Libraries
There are several ways to prepare gene set libraries for use in GREA:
Option 1: Use Built-in Libraries
Simply specify the libraries you're interested in as a list. For example:
libraries = ['KEGG_2021_Human', 'MSigDB_Hallmark_2020']
You can use the grea.library.list_libraries() function to view all available pathway libraries included in GREA.
Option 2: Load from GMT File
You can load external gene set libraries from .gmt files using:
libraries = read_gmt('your_library_file.gmt')
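For reference, each line of a .gmt file describes one gene set as tab-separated fields: the term name, a description (often just 'na'), and then the member genes. A two-line illustrative example (tabs rendered as spaces here):
term1    na    GOLGA4    GID4
term2    na    ADO    CHUK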
Option 3: Define a Custom Library
Create your own gene set library using a Python dictionary, where each key is a pathway name and the corresponding value is a list of genes:
libraries = {
    'term1': ['GOLGA4', 'GID4'],
    'term2': ['ADO', 'CHUK']
}
%autoreload
from grea.library import list_libraries
print(list_libraries())
Step 3: Run Enrichment
To perform enrichment analysis, call the grea.pheno_prerank_enrich(rank_df, libraries) function. You can customize the analysis using the following arguments:
- prob_method: Method for p-value calculation. Currently supports 'perm' for permutation-based testing.
- n_perm: Number of permutations to use for estimating the null distribution.
The function returns a GREA object containing all enrichment results, including enrichment scores and statistical significance for each library term.
%autoreload
from grea import grea
libraries = ['MSigDB_Hallmark_2020']
n_perm = 10  # small value for a quick demo; use more permutations for stable p-values
prob_method = 'perm'
obj = grea.pheno_prerank_enrich(rank_df, libraries, n_perm=n_perm, prob_method=prob_method)
obj
Step 4: Check Enrichment Results
The GREA object stores all enrichment results, including enrichment scores and statistical significance for each library term. GREA supports five enrichment score metrics, each reflecting a different scoring strategy:
- 'KS-ES': Kolmogorov–Smirnov-based Enrichment Score, capturing the peak deviation between hit and miss distributions.
- 'KS-ESD': KS-based Enrichment Score Difference, the sum of the maximum positive and negative deviations of the running score.
- 'RC-AUC': Area Under the Recovery Curve, summarizing early enrichment of target genes along the ranking.
- 'RC-nAUC': Normalized Area Under the Recovery Curve, summarizing early enrichment of target genes along the ranking.
- 'nRC-AUC': Area Under the normalized Recovery Curve, summarizing early enrichment of target genes along the ranking; ranges from 0 to 1.
You can select the appropriate metric depending on your analysis goal or data characteristics.
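To build intuition for how the KS-style and recovery-curve metrics summarize the same ranked list differently, here is a toy sketch in plain NumPy. It is purely illustrative and is not GREA's implementation; the gene names and gene set are made up.
import numpy as np
# Toy ranked gene list (best score first) and a made-up gene set.
ranked_genes = ['G1', 'G2', 'G3', 'G4', 'G5', 'G6']
gene_set = {'G1', 'G3', 'G4'}
hits = np.array([g in gene_set for g in ranked_genes], dtype=float)
n, n_hit = len(hits), hits.sum()
# KS-style running sum: step up at hits, step down at misses.
running = np.cumsum(np.where(hits == 1, 1.0 / n_hit, -1.0 / (n - n_hit)))
ks_es = running[np.argmax(np.abs(running))]   # peak deviation (KS-ES idea)
ks_esd = running.max() + running.min()        # sum of extreme deviations (KS-ESD idea)
# Recovery curve: fraction of the gene set recovered at each rank position.
recovery = np.cumsum(hits) / n_hit
rc_auc = recovery.mean()                      # area under the recovery curve (RC-AUC idea)
print(ks_es, ks_esd, rc_auc)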
To retrieve the enrichment results as a long DataFrame, use the get_enrich_results(metric) function.
%autoreload
df = obj.get_enrich_results(metric='KS-ES')
df.head()
%autoreload
df = obj.get_enrich_results(metric='KS-ESD')
df.head()
%autoreload
df = obj.get_enrich_results(metric='RC-AUC')
df.head()
%autoreload
df = obj.get_enrich_results(metric='RC-nAUC')
df.head()
%autoreload
df = obj.get_enrich_results(metric='nRC-AUC')
df.head()
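The exact columns of the returned DataFrame depend on GREA's output schema, so a practical way to rank terms by significance is to inspect the columns first and then sort by the relevant one; the column name 'pval' below is only a placeholder:
%autoreload
print(df.columns.tolist())
# e.g., if a p-value column exists (placeholder name 'pval'):
# df.sort_values('pval').head()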
Step 5: Visualize Enrichment Results
To visualize the enrichment results, use the pl_running_sum(metric, term, pheno_id) function by specifying the desired metric, term, and target phenotype.
%autoreload
term = 'MSigDB_Hallmark_2020|Hedgehog Signaling'
pheno_id = 'Case vs. Control'
fig = obj.pl_running_sum('KS-ES', term, pheno_id)
fig
%autoreload
term = 'MSigDB_Hallmark_2020|Hedgehog Signaling'
pheno_id = 'Case vs. Control'
fig = obj.pl_running_sum('KS-ESD', term, pheno_id)
fig
%autoreload
term = 'MSigDB_Hallmark_2020|Hedgehog Signaling'
pheno_id = 'Case vs. Control'
fig = obj.pl_running_sum('RC-AUC', term, pheno_id)
fig
%autoreload
term = 'MSigDB_Hallmark_2020|Hedgehog Signaling'
pheno_id = 'Case vs. Control'
fig = obj.pl_running_sum('RC-nAUC', term, pheno_id)
fig
%autoreload
term = 'MSigDB_Hallmark_2020|Hedgehog Signaling'
pheno_id = 'Case vs. Control'
fig = obj.pl_running_sum('nRC-AUC', term, pheno_id)
fig