Mirrorsubclone
Function
mirrorsubclone(
    self,
    tool_cna_files: List[str],
    tool_names: List[str],
    changes_file: str,
    profile_bin_size=100000,
    outfile: str = "mirror_subclone_result.csv",
) -> pd.DataFrame
This function evaluates tool performance on Mirrored-Subclonal CNA (MSCNA) events by comparing predicted haplotype CNAs for two paired clones against a ground-truth MSCNA table.
Given:
- a ground-truth MSCNA event table (`changes_file`) describing two related subclones (Clone1/Clone2) and their haplotype CN states, and
- each tool’s inferred CNA matrix,
it:
- Converts MSCNA event intervals into a `region` key and re-bins them to `profile_bin_size`.
- Re-bins tool predictions to the same resolution.
- Merges GT MSCNA events with tool predictions using `_mirror_merge`.
- Computes:
  - ACC: fraction of events where all four haplotype CN states (Clone1 hap1/hap2 + Clone2 hap1/hap2) match exactly.
  - RMSE: mean per-event haplotype error aggregated across the four haplotype states.
Results are summarized per tool and written to disk.
Parameters
| Name | Type | Description |
|---|---|---|
| `tool_cna_files` | `List[str]` | List of tool CNA profile CSV files. Must align with `tool_names`. |
| `tool_names` | `List[str]` | Tool names used in result rows. |
| `changes_file` | `str` | Path to the mirrored-subclone ground-truth table (MSCNA events). |
| `profile_bin_size` | `int` | Bin size used to split/standardize both GT events and predictions. Default: `100000` (100 kb). |
| `outfile` | `str` | Output filename for the summary table. Default: `"mirror_subclone_result.csv"`. |
Input File Format
changes_file (MSCNA ground-truth table)
Loaded via:
change_df_r = read_and_drop_empty(changes_file)
Expected columns include at least:
- `Chromosome`
- `Start`
- `End`
These are combined into a genomic interval string:
change_df_r["region"] = f"{Chromosome}:{Start}-{End}"
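As a minimal sketch of the region key format, the snippet below builds `Chromosome:Start-End` strings from plain dictionaries and parses them back. The helper names `make_region` and `parse_region` are hypothetical and are not part of the library; the real function operates on a DataFrame rather than a list of rows.

```python
import re
from typing import Tuple


def make_region(chromosome: str, start: int, end: int) -> str:
    """Format one interval as 'Chromosome:Start-End' (hypothetical helper)."""
    return f"{chromosome}:{start}-{end}"


# Assumed key layout: everything before the colon is the chromosome name.
_REGION_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region: str) -> Tuple[str, int, int]:
    """Split a region key back into (chromosome, start, end)."""
    match = _REGION_RE.match(region)
    if match is None:
        raise ValueError(f"not a region key: {region!r}")
    return match["chrom"], int(match["start"]), int(match["end"])


# Illustrative rows only; not taken from any real ground-truth table.
rows = [
    {"Chromosome": "chr1", "Start": 0, "End": 250000},
    {"Chromosome": "chrX", "Start": 100000, "End": 300000},
]
regions = [make_region(r["Chromosome"], r["Start"], r["End"]) for r in rows]
print(regions)
```

Keeping the key as a single string makes it usable as an index for the re-binning and merge steps, at the cost of parsing it again whenever coordinates are needed.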
The table must also contain the columns required by `_mirror_merge(...)` and the downstream computations.
Judging from the downstream code, the merged table (after `_mirror_merge`) is expected to contain fields such as:
- `Clone1_CNA`, `Clone2_CNA` (GT haplotype CNA encoded as `"hap1|hap2"`)
- `Clone1_predict_CNA`, `Clone2_predict_CNA` (predicted haplotype CNA encoded as `"hap1|hap2"`)
GT events are re-binned:
change_truth_r = split_all_regions(change_df_r.set_index("region"), profile_bin_size)
change_truth_r = change_truth_r.reset_index().rename(columns={"index": "region"})
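The internals of `split_all_regions` are not shown here, so the following is only a rough illustration of what re-binning a single region key to a fixed bin size could look like. The function `split_region` is hypothetical, and it assumes that bins start at the region's own start coordinate and that the last bin is truncated at the region end; the real helper may align bins differently.

```python
from typing import List


def split_region(region: str, bin_size: int) -> List[str]:
    """Cut 'chrom:start-end' into consecutive bins of at most bin_size bases.

    Assumption: bins begin at the region start (not at a genome-wide grid)
    and the final bin is shortened to stop at the region end.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    chrom, span = region.rsplit(":", 1)
    start, end = (int(value) for value in span.split("-"))
    bins = []
    for lo in range(start, end, bin_size):
        hi = min(lo + bin_size, end)
        bins.append(f"{chrom}:{lo}-{hi}")
    return bins


print(split_region("chr1:0-250000", 100000))
# ['chr1:0-100000', 'chr1:100000-200000', 'chr1:200000-250000']
```

In such a scheme each sub-bin would inherit the CNA values of its parent row, which is what puts GT events and tool predictions on the same set of region keys before merging.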
Tool CNA files (tool_cna_files)
Expected to be CNA matrices with:
- a `region` column
- cell/clone columns, depending on your pipeline
- CNA values encoded as haplotype strings like `"a|b"` (required for the haplotype split)
Tool predictions are re-binned similarly:
pred = split_all_regions(pred.set_index("region"), profile_bin_size).reset_index()
Processing Steps
For each tool:
- Merge GT MSCNA table with predictions:
combined = self._mirror_merge(change_df, pred)
If combined is empty, RMSE and ACC are set to NaN.
- Split haplotype CNA strings for both clones:
For each side in ["Clone1", "Clone2"], the function derives the following columns by splitting "hap1|hap2":
- `{side}_hap1_CNA`, `{side}_hap2_CNA` (GT)
- `{side}_predict_hap1_CNA`, `{side}_predict_hap2_CNA` (prediction)
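The haplotype split can be sketched in plain Python as follows. The helper `split_hap` is hypothetical, and it assumes that malformed or non-numeric values are coerced to NaN, in the spirit of the numeric coercion described under Metrics; the actual function works on DataFrame columns.

```python
import math
from typing import Tuple


def _to_number(text: str) -> float:
    """Convert text to float, returning NaN when it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return math.nan


def split_hap(cna: str) -> Tuple[float, float]:
    """Split 'hap1|hap2' into two numbers (hypothetical helper)."""
    parts = str(cna).split("|")
    if len(parts) != 2:
        return (math.nan, math.nan)
    return (_to_number(parts[0]), _to_number(parts[1]))


# One merged event with illustrative values, keyed like the merged table.
event = {
    "Clone1_CNA": "2|1",
    "Clone2_CNA": "1|2",
    "Clone1_predict_CNA": "2|1",
    "Clone2_predict_CNA": "1|3",
}
for side in ["Clone1", "Clone2"]:
    gt_h1, gt_h2 = split_hap(event[f"{side}_CNA"])
    pr_h1, pr_h2 = split_hap(event[f"{side}_predict_CNA"])
    event[f"{side}_hap1_CNA"], event[f"{side}_hap2_CNA"] = gt_h1, gt_h2
    event[f"{side}_predict_hap1_CNA"] = pr_h1
    event[f"{side}_predict_hap2_CNA"] = pr_h2

print(event["Clone2_hap2_CNA"], event["Clone2_predict_hap2_CNA"])  # 2.0 3.0
```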
Metrics
1) ACC (Exact match accuracy)
An event is counted as correct only if all four haplotype values match exactly:
- Clone1 hap1 and hap2
- Clone2 hap1 and hap2
acc = mean(
Clone1_pred_h1 == Clone1_gt_h1 AND
Clone1_pred_h2 == Clone1_gt_h2 AND
Clone2_pred_h1 == Clone2_gt_h1 AND
Clone2_pred_h2 == Clone2_gt_h2
)
So ACC is the fraction of MSCNA events fully matched across both clones and both haplotypes.
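A dependency-free sketch of the exact-match rule is given below. The names `HAP_KEYS`, `make_event`, and `exact_match_acc` are hypothetical, the events are illustrative rather than real data, and the empty-input case returns NaN to mirror the behavior described for an empty merge.

```python
import math
from typing import Dict, List, Tuple

# (prediction column, ground-truth column) for the four haplotype states.
HAP_KEYS = [
    (f"{side}_predict_{hap}_CNA", f"{side}_{hap}_CNA")
    for side in ("Clone1", "Clone2")
    for hap in ("hap1", "hap2")
]


def make_event(c1_gt: Tuple, c2_gt: Tuple, c1_pred: Tuple, c2_pred: Tuple) -> Dict[str, float]:
    """Build one merged event from (hap1, hap2) pairs (illustrative helper)."""
    event = {}
    for side, gt, pred in (("Clone1", c1_gt, c1_pred), ("Clone2", c2_gt, c2_pred)):
        event[f"{side}_hap1_CNA"], event[f"{side}_hap2_CNA"] = gt
        event[f"{side}_predict_hap1_CNA"], event[f"{side}_predict_hap2_CNA"] = pred
    return event


def exact_match_acc(events: List[Dict[str, float]]) -> float:
    """Fraction of events whose four haplotype values all match exactly."""
    if not events:
        return math.nan
    hits = sum(all(ev[p] == ev[g] for p, g in HAP_KEYS) for ev in events)
    return hits / len(events)


events = [
    make_event((2, 1), (1, 2), (2, 1), (1, 2)),  # all four values match
    make_event((2, 1), (1, 2), (2, 1), (1, 3)),  # one haplotype is off
    make_event((3, 0), (0, 3), (3, 0), (0, 3)),  # all four values match
    make_event((2, 0), (0, 2), (0, 2), (2, 0)),  # haplotypes swapped
]
print(exact_match_acc(events))  # 0.5
```

Note that in this sketch a prediction with swapped haplotype labels counts as a miss, because the comparison is position-wise.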
2) RMSE (Aggregated per-event haplotype error)
Per haplotype value, squared error is computed after coercing to numeric:
se = (pred - gt)^2
Then a per-event RMSE-like aggregate is computed as:
RMSE_result =
sqrt(Clone1_hap1_error) +
sqrt(Clone1_hap2_error) +
sqrt(Clone2_hap1_error) +
sqrt(Clone2_hap2_error)
Finally, the reported RMSE is the mean of RMSE_result across events.
Note: This is not a conventional single RMSE over all values; it is a sum of per-haplotype absolute errors (since sqrt((a-b)^2) = |a-b|) averaged across events.
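To make the distinction concrete, the sketch below computes the reported value (mean over events of the summed absolute haplotype errors) next to a conventional RMSE over all values. The function names and the three events are hypothetical illustrations; each event is a pair of 4-tuples ordered as Clone1 hap1, Clone1 hap2, Clone2 hap1, Clone2 hap2.

```python
import math
from typing import List, Sequence, Tuple

# (predicted values, ground-truth values), four haplotype states each.
Event = Tuple[Sequence[float], Sequence[float]]


def per_event_error(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Sum of sqrt(squared error) over the four states, i.e. sum of |pred - gt|."""
    return sum(math.sqrt((p - g) ** 2) for p, g in zip(pred, gt))


def reported_rmse(events: List[Event]) -> float:
    """Mean of the per-event aggregate; NaN when there are no events."""
    if not events:
        return math.nan
    return sum(per_event_error(pred, gt) for pred, gt in events) / len(events)


def conventional_rmse(events: List[Event]) -> float:
    """Textbook RMSE over every haplotype value, shown only for contrast."""
    squared = [(p - g) ** 2 for pred, gt in events for p, g in zip(pred, gt)]
    return math.sqrt(sum(squared) / len(squared)) if squared else math.nan


events = [
    ((2, 1, 1, 2), (2, 1, 1, 2)),  # per-event error 0
    ((2, 1, 1, 3), (2, 1, 1, 2)),  # per-event error 1
    ((0, 2, 2, 0), (2, 0, 0, 2)),  # per-event error 8
]
print(reported_rmse(events))      # 3.0
print(conventional_rmse(events))  # sqrt(17 / 12), a different quantity
```

Because the reported value sums four absolute errors per event, it can exceed the largest single haplotype error, which a conventional RMSE never does.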
Output
A single CSV summary is written to self.output_dir:
os.path.join(self.output_dir, outfile)
Default:
{self.output_dir}/mirror_subclone_result.csv
Output Table Schema
One row per tool:
| Column | Meaning |
|---|---|
| `Tool` | Tool name |
| `RMSE` | Mean aggregated haplotype absolute error across Clone1/Clone2 (as defined above) |
| `ACC` | Exact-match accuracy requiring all four haplotype values to match |
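The shape of the summary file can be illustrated with the standard `csv` module. This is only a sketch of the schema: `write_summary` is a hypothetical helper, the tool names and metric values are placeholders rather than results, and the library itself may well write the file through pandas instead.

```python
import csv
import math
import os
import tempfile
from typing import Dict, List


def write_summary(rows: List[Dict], output_dir: str,
                  outfile: str = "mirror_subclone_result.csv") -> str:
    """Write one row per tool with Tool, RMSE, and ACC columns."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, outfile)
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["Tool", "RMSE", "ACC"])
        writer.writeheader()
        writer.writerows(rows)
    return path


# Placeholder rows; the second mimics a tool whose merge came back empty.
rows = [
    {"Tool": "ToolA", "RMSE": 0.5, "ACC": 0.75},
    {"Tool": "ToolB", "RMSE": math.nan, "ACC": math.nan},
]
output_dir = tempfile.mkdtemp()  # stand-in for self.output_dir
path = write_summary(rows, output_dir)

with open(path, newline="") as handle:
    header = next(csv.reader(handle))
print(header)  # ['Tool', 'RMSE', 'ACC']
```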
Return Value
Returns a pd.DataFrame with columns:
- `Tool`
- `RMSE`
- `ACC`
Example
from hcbench.gtbench.gtbench import GTBench

bench = GTBench(output_dir="out/gt_output")
df = bench.mirrorsubclone(
    tool_cna_files=[
        "/path/to/chisel/haplotype_combined.csv",
        "/path/to/signals/haplotype_combined.csv",
    ],
    tool_names=["CHISEL", "SIGNALS"],
    changes_file="/path/to/gt/mirrored_subclone_events.csv",
    profile_bin_size=100000,
    outfile="mirror_subclone_result.csv",
)
print(df)