Mirrorsubclone

Function

mirrorsubclone(
    self,
    tool_cna_files: List[str],
    tool_names: List[str],
    changes_file: str,
    profile_bin_size=100000,
    outfile: str = "mirror_subclone_result.csv",
) -> pd.DataFrame

This function evaluates tool performance on Mirrored-Subclonal CNA (MSCNA) events by comparing predicted haplotype CNAs for two paired clones against a ground-truth MSCNA table.

Given:

a ground-truth MSCNA event table (changes_file) describing two related subclones (Clone1/Clone2) and their haplotype CN states,
and each tool’s inferred CNA matrix,

it:

Converts MSCNA event intervals into a region key and re-bins to profile_bin_size.
Re-bins tool predictions to the same resolution.
Merges GT MSCNA events with tool predictions using _mirror_merge.
Computes:
ACC: fraction of events where all four haplotype CN states (Clone1 hap1/hap2 + Clone2 hap1/hap2) match exactly.
RMSE: the mean over events of the summed absolute errors across the four haplotype states (a per-event aggregate, not a conventional RMSE).

Results are summarized per tool and written to disk.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| `tool_cna_files` | `List[str]` | List of tool CNA profile CSV files. Must align with `tool_names`. |
| `tool_names` | `List[str]` | Tool names used in result rows. |
| `changes_file` | `str` | Path to the mirrored-subclone ground-truth table (MSCNA events). |
| `profile_bin_size` | `int` | Bin size used to split/standardize both GT events and predictions. Default: `100000` (100 kb). |
| `outfile` | `str` | Output filename for the summary table. Default: `"mirror_subclone_result.csv"`. |
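
Because `tool_cna_files` and `tool_names` are matched by position, a length mismatch would silently mislabel results. The sketch below shows one way to check the pairing before calling the function. `pair_tools` is a hypothetical helper for illustration only, not part of the documented API.

```python
from typing import List, Tuple


def pair_tools(tool_cna_files: List[str], tool_names: List[str]) -> List[Tuple[str, str]]:
    """Hypothetical pre-flight check: pair each tool name with its CNA file."""
    if len(tool_cna_files) != len(tool_names):
        raise ValueError(
            f"Got {len(tool_cna_files)} files but {len(tool_names)} tool names"
        )
    # Position i in tool_names labels position i in tool_cna_files.
    return list(zip(tool_names, tool_cna_files))


pairs = pair_tools(
    ["/path/to/chisel/haplotype_combined.csv", "/path/to/signals/haplotype_combined.csv"],
    ["CHISEL", "SIGNALS"],
)
for name, path in pairs:
    print(name, "->", path)
```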

Input File Format

`changes_file` (MSCNA ground-truth table)

Loaded via:

change_df_r = read_and_drop_empty(changes_file)

Expected columns include at least:

Chromosome
Start
End

These are combined, row by row, into a genomic interval string of the form:

change_df_r["region"] = f"{Chromosome}:{Start}-{End}"
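
The assignment above is schematic. A minimal pandas sketch of building the same `chrom:start-end` key for every row might look like the following. The toy values are illustrative, and the actual implementation may construct the key differently.

```python
import pandas as pd

# Toy ground-truth table; the values are illustrative, not from a real dataset.
change_df_r = pd.DataFrame(
    {
        "Chromosome": ["chr1", "chr2"],
        "Start": [1, 500001],
        "End": [200000, 700000],
    }
)

# Build the "chrom:start-end" key for all rows with vectorized string concatenation.
change_df_r["region"] = (
    change_df_r["Chromosome"].astype(str)
    + ":"
    + change_df_r["Start"].astype(str)
    + "-"
    + change_df_r["End"].astype(str)
)

print(change_df_r["region"].tolist())
```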

_mirror_merge(...) and the downstream computations require additional columns. Judging from the downstream code, the merged table (after _mirror_merge) is expected to contain fields such as:

Clone1_CNA, Clone2_CNA (GT haplotype CNA encoded as "hap1|hap2")
Clone1_predict_CNA, Clone2_predict_CNA (predicted haplotype CNA encoded as "hap1|hap2")

GT events are re-binned:

change_truth_r = split_all_regions(change_df_r.set_index("region"), profile_bin_size)
change_truth_r = change_truth_r.reset_index().rename(columns={"index": "region"})

Tool CNA files (`tool_cna_files`)

Expected to be CNA matrices with:

a region column
cell or clone columns, depending on your pipeline
CNA values encoded as haplotype strings such as "a|b" (required for the haplotype split)

Tool predictions are re-binned similarly:

pred = split_all_regions(pred.set_index("region"), profile_bin_size).reset_index()
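
To make the re-binning idea concrete, here is a rough sketch of splitting one region key into fixed-size bins. `split_region` is a hypothetical stand-in, not the real `split_all_regions`, and it assumes 1-based inclusive coordinates. The actual helper may use a different coordinate convention and also carries the CNA columns along with each bin.

```python
from typing import List


def split_region(region: str, bin_size: int) -> List[str]:
    """Hypothetical helper: split 'chrom:start-end' into consecutive bins.

    Assumes 1-based inclusive coordinates; the last bin may be shorter.
    """
    chrom, span = region.rsplit(":", 1)
    start, end = (int(x) for x in span.split("-"))
    bins = []
    pos = start
    while pos <= end:
        bin_end = min(pos + bin_size - 1, end)
        bins.append(f"{chrom}:{pos}-{bin_end}")
        pos = bin_end + 1
    return bins


print(split_region("chr1:1-250000", 100000))
# ['chr1:1-100000', 'chr1:100001-200000', 'chr1:200001-250000']
```

Bringing both the GT events and the predictions to the same bin keys is what lets the later merge line them up region by region.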

Processing Steps

For each tool:

Merge GT MSCNA table with predictions:

combined = self._mirror_merge(change_df, pred)

If combined is empty, RMSE and ACC are set to NaN.

Split haplotype CNA strings for both clones:

For each side in ["Clone1", "Clone2"], the function derives:

{side}_hap1_CNA, {side}_hap2_CNA (GT)
{side}_predict_hap1_CNA, {side}_predict_hap2_CNA (prediction)

by splitting "hap1|hap2".
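
The haplotype split can be sketched with pandas string methods. The column names follow the fields listed in this document; the toy values and the exact splitting call are assumptions rather than the library's code.

```python
import pandas as pd

# Toy merged table; column names follow the fields described above, values are made up.
combined = pd.DataFrame(
    {
        "Clone1_CNA": ["2|1", "1|1"],
        "Clone1_predict_CNA": ["2|1", "1|2"],
        "Clone2_CNA": ["1|2", "1|1"],
        "Clone2_predict_CNA": ["1|2", "1|1"],
    }
)

for side in ["Clone1", "Clone2"]:
    # pandas treats a single-character separator such as "|" as a literal string.
    gt = combined[f"{side}_CNA"].str.split("|", expand=True)
    pred = combined[f"{side}_predict_CNA"].str.split("|", expand=True)
    combined[f"{side}_hap1_CNA"] = gt[0]
    combined[f"{side}_hap2_CNA"] = gt[1]
    combined[f"{side}_predict_hap1_CNA"] = pred[0]
    combined[f"{side}_predict_hap2_CNA"] = pred[1]

print(combined.filter(like="hap"))
```

The split values are still strings at this point, so a numeric conversion is needed before any arithmetic on them.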

Metrics

1) ACC (Exact match accuracy)

An event is counted as correct only if all four haplotype values match exactly:

Clone1 hap1 and hap2
Clone2 hap1 and hap2

acc = mean(
    Clone1_pred_h1 == Clone1_gt_h1 AND
    Clone1_pred_h2 == Clone1_gt_h2 AND
    Clone2_pred_h1 == Clone2_gt_h1 AND
    Clone2_pred_h2 == Clone2_gt_h2
)

So ACC is the fraction of MSCNA events fully matched across both clones and both haplotypes.
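
As a worked illustration of the definition above, the following sketch computes ACC on a toy table of four events, one of which has a single mismatching haplotype. The data are made up and the code is an approximation of the described logic, not the library's implementation.

```python
import pandas as pd

# Four toy events; only the last one has a mismatch (Clone2 hap2: GT 3, predicted 2).
combined = pd.DataFrame(
    {
        "Clone1_hap1_CNA": [2, 1, 1, 3],
        "Clone1_predict_hap1_CNA": [2, 1, 1, 3],
        "Clone1_hap2_CNA": [1, 1, 0, 1],
        "Clone1_predict_hap2_CNA": [1, 1, 0, 1],
        "Clone2_hap1_CNA": [1, 1, 1, 1],
        "Clone2_predict_hap1_CNA": [1, 1, 1, 1],
        "Clone2_hap2_CNA": [2, 1, 2, 3],
        "Clone2_predict_hap2_CNA": [2, 1, 2, 2],
    }
)

# An event counts as correct only if all four (GT, prediction) pairs agree.
all_match = pd.Series(True, index=combined.index)
for side in ["Clone1", "Clone2"]:
    for hap in (1, 2):
        gt_col = f"{side}_hap{hap}_CNA"
        pred_col = f"{side}_predict_hap{hap}_CNA"
        all_match &= combined[gt_col] == combined[pred_col]

acc = float(all_match.mean())
print(acc)  # 0.75
```

A single wrong haplotype value makes the whole event incorrect, so ACC is a strict metric compared with per-value accuracy.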

2) RMSE (Aggregated per-event haplotype error)

For each haplotype value, the squared error is computed after coercing both values to numeric:

se = (pred - gt) ** 2

Then a per-event RMSE-like aggregate is computed as:

RMSE_result =
    sqrt(Clone1_hap1_error) +
    sqrt(Clone1_hap2_error) +
    sqrt(Clone2_hap1_error) +
    sqrt(Clone2_hap2_error)

Finally, the reported RMSE is the mean of RMSE_result across events.

Note: This is not a conventional single RMSE over all values; it is a sum of per-haplotype absolute errors (since sqrt((a-b)^2) = |a-b|) averaged across events.
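
The difference between the reported metric and a conventional RMSE is easiest to see on a small example. The sketch below uses made-up data, and `errors="coerce"` is an assumption about how the numeric coercion is done. It computes the per-event aggregate as described and, for contrast, a conventional RMSE over all eight values.

```python
import numpy as np
import pandas as pd

# Two toy events. Event 0: Clone1 hap1 off by 1, Clone2 hap2 off by 2. Event 1: exact.
combined = pd.DataFrame(
    {
        "Clone1_hap1_CNA": ["2", "1"],
        "Clone1_predict_hap1_CNA": ["3", "1"],
        "Clone1_hap2_CNA": ["1", "1"],
        "Clone1_predict_hap2_CNA": ["1", "1"],
        "Clone2_hap1_CNA": ["1", "1"],
        "Clone2_predict_hap1_CNA": ["1", "1"],
        "Clone2_hap2_CNA": ["2", "1"],
        "Clone2_predict_hap2_CNA": ["0", "1"],
    }
)

per_event = pd.Series(0.0, index=combined.index)
squared_errors = []
for side in ["Clone1", "Clone2"]:
    for hap in (1, 2):
        # Assumption: non-numeric entries become NaN instead of raising.
        gt = pd.to_numeric(combined[f"{side}_hap{hap}_CNA"], errors="coerce")
        pred = pd.to_numeric(combined[f"{side}_predict_hap{hap}_CNA"], errors="coerce")
        se = (pred - gt) ** 2
        squared_errors.append(se)
        per_event += np.sqrt(se)  # sqrt of a squared error is the absolute error

reported = float(per_event.mean())  # metric as described in this section
conventional = float(np.sqrt(pd.concat(squared_errors).mean()))  # for contrast only

print(per_event.tolist(), reported)  # [3.0, 0.0] 1.5
print(conventional)
```

Here the reported value is 1.5 (absolute errors of 1 and 2 in the first event, averaged over two events), whereas a conventional RMSE over the same eight values would be the square root of 0.625.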

Output

A single CSV summary is written to self.output_dir:

os.path.join(self.output_dir, outfile)

Default:

{self.output_dir}/mirror_subclone_result.csv
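
For orientation, the following sketch writes a summary table of the documented shape and reads it back. The tool names and metric values are placeholders rather than benchmark results, a temporary directory stands in for `self.output_dir`, and `index=False` is an assumption about how the file is written.

```python
import os
import tempfile

import pandas as pd

# Placeholder rows; "ToolB" mimics a tool whose merge was empty (NaN metrics).
rows = [
    {"Tool": "ToolA", "RMSE": 0.5, "ACC": 0.75},
    {"Tool": "ToolB", "RMSE": float("nan"), "ACC": float("nan")},
]
summary = pd.DataFrame(rows, columns=["Tool", "RMSE", "ACC"])

with tempfile.TemporaryDirectory() as output_dir:
    path = os.path.join(output_dir, "mirror_subclone_result.csv")
    summary.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded)
```

NaN metrics are written as empty cells and read back as NaN, so downstream code should handle missing values for tools with no overlapping events.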

Output Table Schema

One row per tool:

| Column | Meaning |
| --- | --- |
| `Tool` | Tool name |
| `RMSE` | Mean aggregated haplotype absolute error across Clone1/Clone2 (as defined above) |
| `ACC` | Exact-match accuracy requiring all four haplotype values to match |

Return Value

Returns a pd.DataFrame with columns:

Tool
RMSE
ACC

Example

from hcbench.gtbench.gtbench import GTBench

bench = GTBench(output_dir="out/gt_output")

df = bench.mirrorsubclone(
    tool_cna_files=[
        "/path/to/chisel/haplotype_combined.csv",
        "/path/to/signals/haplotype_combined.csv",
    ],
    tool_names=["CHISEL", "SIGNALS"],
    changes_file="/path/to/gt/mirrored_subclone_events.csv",
    profile_bin_size=100000,
    outfile="mirror_subclone_result.csv",
)

print(df)