hcPhasing

Function

hcPhasing(
    self,
    tool_hap1_cna_files: List[str],
    tool_hap2_cna_files: List[str],
    tool_names: List[str],
    ground_truth_hap1_file: str,
    ground_truth_hap2_file: str,
    outprefix="hcPhasing",
    profile_bin_size=100000,
    mode="heterozygous-only",
    is_clone=False,
) -> pd.DataFrame

This function evaluates the haplotype phasing accuracy of inferred haplotype-specific CNAs (hap1/hap2) against ground-truth haplotype CNAs.

It converts both the ground-truth (GT) and tool haplotype copy-number (CN) profiles into a binary phasing representation, then computes:

mismatch error (per cell or per clone)
switch error (per cell or per clone)

It supports evaluation at cell-level (default) or clone-level (is_clone=True), and allows different evaluation modes (e.g., heterozygous-only).

Parameters

| Name | Type | Description |
| --- | --- | --- |
| `tool_hap1_cna_files` | `List[str]` | List of tool CNA profile CSVs for inferred haplotype 1. Must align with `tool_names`. |
| `tool_hap2_cna_files` | `List[str]` | List of tool CNA profile CSVs for inferred haplotype 2. Must align with `tool_names`. |
| `tool_names` | `List[str]` | Tool names used in output rows. |
| `ground_truth_hap1_file` | `str` | Path to ground-truth haplotype-1 CNA matrix CSV. |
| `ground_truth_hap2_file` | `str` | Path to ground-truth haplotype-2 CNA matrix CSV. |
| `outprefix` | `str` | Prefix for the output CSV filename. Default: `"hcPhasing"`. |
| `profile_bin_size` | `int` | Bin size used to split/standardize regions before alignment. Default: `100000` (100 kb). |
| `mode` | `str` | Evaluation mode passed to mismatch/switch error functions. Default: `"heterozygous-only"`. |
| `is_clone` | `bool` | If `True`, compute errors per clone label; otherwise per cell. Default: `False`. |
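
Because the three list parameters are matched by position, a length check before the call can catch misaligned inputs early. The helper below is a hypothetical pre-flight check written for illustration; `pair_tool_inputs` is not part of `hcPhasing` or its package.

```python
from typing import List, Tuple


def pair_tool_inputs(
    hap1_files: List[str], hap2_files: List[str], names: List[str]
) -> List[Tuple[str, str, str]]:
    """Hypothetical helper: pair each tool name with its hap1/hap2 file by position."""
    if not (len(hap1_files) == len(hap2_files) == len(names)):
        raise ValueError(
            "tool_hap1_cna_files, tool_hap2_cna_files and tool_names "
            "must have the same length"
        )
    return list(zip(names, hap1_files, hap2_files))


pairs = pair_tool_inputs(
    ["/path/to/chisel/hap1.csv", "/path/to/signals/hap1.csv"],
    ["/path/to/chisel/hap2.csv", "/path/to/signals/hap2.csv"],
    ["CHISEL", "SIGNALS"],
)
for name, f_h1, f_h2 in pairs:
    print(name, f_h1, f_h2)
```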

Input File Format

Ground truth files (`ground_truth_hap1_file`, `ground_truth_hap2_file`)

Loaded via read_and_drop_empty(...) and indexed by region:

g1_r = read_and_drop_empty(ground_truth_hap1_file)
g2_r = read_and_drop_empty(ground_truth_hap2_file)
g1_r.set_index("region", inplace=True)
g2_r.set_index("region", inplace=True)

Expected structure:

region column
remaining columns: cell IDs (or clone labels when the evaluation is done at clone level)
values: haplotype-specific CNA states (numeric or string convertible, depending on _phase_to_binary)

Tool prediction files (`tool_hap1_cna_files`, `tool_hap2_cna_files`)

Expected to follow the same structure as the ground-truth files:

region column
cell/clone columns
haplotype-specific CNA values
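
As a rough illustration of this layout, the snippet below builds a toy haplotype-1 matrix in memory and indexes it by region. It uses plain `pd.read_csv` as a stand-in for `read_and_drop_empty`, and the `chrom:start-end` region format and the cell names are assumptions made for the example, not a documented requirement.

```python
import io

import pandas as pd

# Assumption: regions are written as "chrom:start-end"; the real format may differ.
toy_csv = io.StringIO(
    "region,cell1,cell2\n"
    "chr1:0-100000,1,2\n"
    "chr1:100000-200000,1,1\n"
    "chr1:200000-300000,0,1\n"
)

# One row per region, one column per cell (or clone); values are per-haplotype copy numbers.
hap1 = pd.read_csv(toy_csv).set_index("region")
print(hap1)
```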

Processing Steps

For each tool:

Read predicted haplotype profiles:

t1 = read_and_drop_empty(f_h1)
t2 = read_and_drop_empty(f_h2)

Re-bin regions to profile_bin_size:

t1 = split_all_regions(t1.set_index("region"), profile_bin_size)
t2 = split_all_regions(t2.set_index("region"), profile_bin_size)

Align predictions with GT on common regions and samples:

g1, t1 = align(g1_r, t1)
g2, t2 = align(g2_r, t2)

Convert haplotype CNAs to a binary phasing representation:

g1_bin, _ = self._phase_to_binary(g1, g2)
t1_bin, _ = self._phase_to_binary(t1, t2)

_phase_to_binary is expected to encode phasing comparisons bin-wise (e.g., indicating whether hap1 > hap2, hap1 < hap2, or equal), producing matrices suitable for mismatch/switch evaluation.
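
The re-binning, alignment, and binary-encoding steps above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the package's implementation: `split_region`, `rebin`, `align_common`, and `phase_to_binary_sketch` are hypothetical stand-ins for `split_all_regions`, `align`, and `_phase_to_binary`; regions are assumed to look like `chrom:start-end`; and the encoding (1 where hap1 > hap2, 0 where hap1 < hap2, NaN where equal) is one plausible choice.

```python
import numpy as np
import pandas as pd


def split_region(region: str, bin_size: int) -> list:
    """Split 'chrom:start-end' into consecutive bins of at most bin_size bases."""
    chrom, span = region.split(":")
    start, end = (int(x) for x in span.split("-"))
    return [f"{chrom}:{s}-{min(s + bin_size, end)}" for s in range(start, end, bin_size)]


def rebin(df: pd.DataFrame, bin_size: int) -> pd.DataFrame:
    """Repeat each row once for every bin its region is split into."""
    new_index, source = [], []
    for region in df.index:
        subs = split_region(region, bin_size)
        new_index.extend(subs)
        source.extend([region] * len(subs))
    out = df.loc[source].copy()
    out.index = pd.Index(new_index, name="region")
    return out


def align_common(gt: pd.DataFrame, pred: pd.DataFrame):
    """Keep only the regions (rows) and samples (columns) present in both tables."""
    rows = gt.index.intersection(pred.index)
    cols = gt.columns.intersection(pred.columns)
    return gt.loc[rows, cols], pred.loc[rows, cols]


def phase_to_binary_sketch(h1: pd.DataFrame, h2: pd.DataFrame) -> pd.DataFrame:
    """Assumed encoding: 1 if hap1 > hap2, 0 if hap1 < hap2, NaN if equal."""
    diff = h1.to_numpy(dtype=float) - h2.to_numpy(dtype=float)
    code = np.where(diff > 0, 1.0, np.where(diff < 0, 0.0, np.nan))
    return pd.DataFrame(code, index=h1.index, columns=h1.columns)


bins = ["chr1:0-100000", "chr1:100000-200000"]
gt_h1 = pd.DataFrame({"cell1": [2, 1], "cell2": [1, 2]}, index=bins)
gt_h2 = pd.DataFrame({"cell1": [1, 2], "cell2": [1, 1]}, index=bins)

# The toy tool reports a single 200 kb segment, which is split into two 100 kb bins.
tool_h1 = rebin(pd.DataFrame({"cell1": [2], "cell2": [1]}, index=["chr1:0-200000"]), 100000)
tool_h2 = rebin(pd.DataFrame({"cell1": [1], "cell2": [1]}, index=["chr1:0-200000"]), 100000)

g1, t1 = align_common(gt_h1, tool_h1)
g2, t2 = align_common(gt_h2, tool_h2)
g_bin = phase_to_binary_sketch(g1, g2)
t_bin = phase_to_binary_sketch(t1, t2)
print(g_bin)
print(t_bin)
```

In this toy case the tool calls `cell1` as hap1-dominant in both bins, while the ground truth flips between bins, which is the kind of disagreement the error metrics are meant to quantify.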

Error Metrics

The function chooses evaluation routines based on is_clone:

is_clone=False (default): simu_cell_mismatch_error and simu_cell_switch_error
is_clone=True: simu_clone_mismatch_error and simu_clone_switch_error

Both mismatch and switch error functions are called with:

predicted binary matrix t1_bin
GT binary matrix g1_bin
mode (e.g., "heterozygous-only")

mismatch_error_result = eval_mismatch_fn(t1_bin, g1_bin, mode)
switch_error_result   = eval_switch_fn(t1_bin, g1_bin, mode)
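
To make the two metrics concrete, here is a minimal sketch of how mismatch and switch error rates could be computed for one cell from binary phase vectors. It is not the code behind `simu_cell_mismatch_error` or `simu_cell_switch_error`; the function names are hypothetical, and it assumes that bins uninformative for phasing are encoded as NaN and skipped, which is one possible reading of a heterozygous-only mode.

```python
import numpy as np


def mismatch_rate(pred, truth) -> float:
    """Fraction of informative bins whose predicted phase differs from the truth."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    keep = ~np.isnan(pred) & ~np.isnan(truth)  # assumption: NaN marks uninformative bins
    if not keep.any():
        return float("nan")
    return float(np.mean(pred[keep] != truth[keep]))


def switch_rate(pred, truth) -> float:
    """Fraction of adjacent informative bin pairs where agreement with the truth flips."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    keep = ~np.isnan(pred) & ~np.isnan(truth)
    agree = pred[keep] == truth[keep]
    if agree.size < 2:
        return float("nan")
    return float(np.mean(agree[1:] != agree[:-1]))


truth = [1, 1, 0, 0, 1, np.nan]  # last bin is uninformative in the ground truth
pred = [1, 0, 1, 1, 1, 1]

mm = mismatch_rate(pred, truth)
sw = switch_rate(pred, truth)
print(mm, sw)
```

Here three of the five informative bins disagree, and the agreement pattern flips twice across the four adjacent pairs. A mismatch rate counts every wrong bin, whereas a switch rate only counts the points where the prediction changes from agreeing to disagreeing (or back), so a single long flipped block inflates the former but not the latter.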

The two result tables are merged on:

cell (cell-level), or
clone_label (clone-level)

label = "clone_label" if is_clone else "cell"
result = pd.merge(mismatch_error_result, switch_error_result, on=label, how="outer")
result["tool_name"] = name
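
The effect of the outer merge can be seen on two small stand-in tables. The column names `mismatch_ratio` and `switch_error_ratio` are hypothetical, chosen only for this example; the real names depend on what the evaluation routines return.

```python
import pandas as pd

# Stand-in results; in hcPhasing these come from the mismatch/switch routines.
mismatch = pd.DataFrame({"cell": ["c1", "c2"], "mismatch_ratio": [0.10, 0.25]})
switch = pd.DataFrame({"cell": ["c2", "c3"], "switch_error_ratio": [0.05, 0.20]})

# how="outer" keeps cells that appear in only one table and fills the gaps with NaN.
merged = pd.merge(mismatch, switch, on="cell", how="outer")
merged["tool_name"] = "CHISEL"
print(merged)
```

With an outer join, a cell that has a mismatch result but no switch result (or the reverse) still appears in the output, so missing values are visible rather than silently dropped.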

Output

A single CSV is written to self.output_dir:

os.path.join(self.output_dir, f"{outprefix}_{mode}.csv")

Default example:

{self.output_dir}/hcPhasing_heterozygous-only.csv

The output is long-format, with one row per cell/clone (depending on is_clone) per tool.
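
A small sketch of the output naming and the long-format layout follows. How the per-tool tables are combined internally is not shown here, so the use of `pd.concat` is an assumption, and the metric column name is a placeholder.

```python
import os

import pandas as pd

output_dir, outprefix, mode = "out/gt_output", "hcPhasing", "heterozygous-only"
out_path = os.path.join(output_dir, f"{outprefix}_{mode}.csv")

# Assumption: one result table per tool, stacked row-wise into a long-format table.
per_tool = [
    pd.DataFrame({"cell": ["c1", "c2"], "mismatch_ratio": [0.1, 0.3], "tool_name": "CHISEL"}),
    pd.DataFrame({"cell": ["c1", "c2"], "mismatch_ratio": [0.2, 0.4], "tool_name": "SIGNALS"}),
]
long_df = pd.concat(per_tool, ignore_index=True)

print(out_path)
print(long_df)
```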

Output Table Schema

The exact columns depend on what the mismatch and switch routines return, but they typically include:

| Column | Meaning |
| --- | --- |
| `cell` or `clone_label` | Identifier used for per-sample evaluation |
| mismatch-related columns | e.g., mismatch count / mismatch ratio |
| switch-related columns | e.g., switch error count / switch error ratio |
| `tool_name` | Tool name |

Return Value

Returns a pd.DataFrame containing the merged mismatch and switch error results for all tools.

Example

from hcbench.gtbench.gtbench import GTBench

bench = GTBench(output_dir="out/gt_output")

df = bench.hcPhasing(
    tool_hap1_cna_files=[
        "/path/to/chisel/hap1.csv",
        "/path/to/signals/hap1.csv",
    ],
    tool_hap2_cna_files=[
        "/path/to/chisel/hap2.csv",
        "/path/to/signals/hap2.csv",
    ],
    tool_names=["CHISEL", "SIGNALS"],
    ground_truth_hap1_file="/path/to/gt/hap1.csv",
    ground_truth_hap2_file="/path/to/gt/hap2.csv",
    outprefix="hcPhasing",
    profile_bin_size=100000,
    mode="heterozygous-only",
    is_clone=False,
)

print(df.head())
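
Since the returned table holds one row per cell (or clone) per tool, a per-tool summary is a simple group-by. The sketch below uses a stand-in DataFrame so that it runs without the package, and the metric column names are hypothetical; substitute whatever columns the returned table actually contains.

```python
import pandas as pd

# Stand-in for the DataFrame returned by bench.hcPhasing(...); column names are assumed.
df = pd.DataFrame(
    {
        "cell": ["c1", "c2", "c1", "c2"],
        "mismatch_ratio": [0.1, 0.3, 0.2, 0.4],
        "switch_error_ratio": [0.02, 0.04, 0.06, 0.10],
        "tool_name": ["CHISEL", "CHISEL", "SIGNALS", "SIGNALS"],
    }
)

# Average each metric over cells to get one row per tool.
summary = df.groupby("tool_name")[["mismatch_ratio", "switch_error_ratio"]].mean()
print(summary)
```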