Cnclass

Function

cnclass(
    self,
    tool_hap1_cna_files: List[str],
    tool_hap2_cna_files: List[str],
    tool_names: List[str],
    profile_hap1_cna_file: str,
    profile_hap2_cna_file: str,
    type: str = "hcCNA",
    profile_bin_size=100000,
    outfile: str = "cnclass_results.csv",
) -> pd.DataFrame

This function performs CNA class–specific evaluation by stratifying genomic bins into predefined CNA categories (e.g., gain/neutral/loss, exact CN states) and then computing clone-level metrics for each category, haplotype, and tool.

It supports both:

hcCNA: haplotype-specific copy numbers (hap1, hap2)
acCNA: allele-specific copy numbers (minor, major)

For each tool, the function:

Loads the ground-truth (GT) CNA profiles for both haplotypes (hap1/hap2 or minor/major).
Loads the tool's predictions for the same two haplotypes (hap1/hap2 or minor/major).
Re-bins regions to a uniform resolution (profile_bin_size) and aligns GT vs predictions.
For each CNA category (condition), saves category-specific bin tables into an output folder.
Computes clone-level metrics within each category and aggregates results into one summary table.
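
The re-binning step above can be illustrated in isolation. This is a minimal sketch, not the library's `split_all_regions`: the helper `split_region` is hypothetical, and it assumes half-open `[start, end)` coordinates and a shortened final bin when the region length is not a multiple of the bin size. The documentation does not confirm either assumption.

```python
from typing import List, Tuple


def split_region(chrom: str, start: int, end: int, bin_size: int) -> List[Tuple[str, int, int]]:
    """Hypothetical helper: cut one region into consecutive bins of at most bin_size."""
    if bin_size <= 0:
        raise ValueError("bin_size must be a positive integer")
    bins = []
    pos = start
    while pos < end:
        # Assumption: the last bin is truncated at the region end.
        bins.append((chrom, pos, min(pos + bin_size, end)))
        pos += bin_size
    return bins


# A 250 kb region at the default 100 kb resolution yields three bins.
for chrom, s, e in split_region("chr1", 0, 250000, 100000):
    print(chrom, s, e)
```

Once GT and predictions share the same bins, they can be compared row by row.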

Parameters

Name	Type	Description
`tool_hap1_cna_files`	`List[str]`	List of tool prediction CSVs for haplotype 1 (or minor allele if `type="acCNA"`). One file per tool, in the same order as `tool_names`.
`tool_hap2_cna_files`	`List[str]`	List of tool prediction CSVs for haplotype 2 (or major allele if `type="acCNA"`). One file per tool, in the same order as `tool_names`.
`tool_names`	`List[str]`	Tool names used to create output directories and label result rows.
`profile_hap1_cna_file`	`str`	Path to GT CNA profile CSV for haplotype 1 (or minor).
`profile_hap2_cna_file`	`str`	Path to GT CNA profile CSV for haplotype 2 (or major).
`type`	`str`	CNA type. Must be `"hcCNA"` or `"acCNA"`. Default: `"hcCNA"`.
`profile_bin_size`	`int`	Bin size, in base pairs, used to split regions for tool predictions before alignment. Default: `100000` (100 kb).
`outfile`	`str`	Output filename for the final aggregated result table. Default: `"cnclass_results.csv"`.
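
Since the two file lists are matched to `tool_names` by position, it may be worth checking the arguments before a long run. The helper below is a hypothetical pre-flight check, not part of the library, and only enforces the constraints stated in the table above.

```python
from typing import List


def check_cnclass_inputs(
    tool_hap1_cna_files: List[str],
    tool_hap2_cna_files: List[str],
    tool_names: List[str],
    type: str = "hcCNA",
) -> None:
    """Hypothetical helper: validate cnclass arguments before calling it."""
    n = len(tool_names)
    if len(tool_hap1_cna_files) != n or len(tool_hap2_cna_files) != n:
        raise ValueError(
            f"expected {n} files per haplotype, got "
            f"{len(tool_hap1_cna_files)} and {len(tool_hap2_cna_files)}"
        )
    if type not in ("hcCNA", "acCNA"):
        raise ValueError(f'type must be "hcCNA" or "acCNA", got {type!r}')


# Passes silently when the lists line up.
check_cnclass_inputs(["chisel_h1.csv"], ["chisel_h2.csv"], ["CHISEL"])
```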

Input File Format

GT files: `profile_hap1_cna_file`, `profile_hap2_cna_file`

Expected to be CNA matrices with:

one column named region
remaining columns as cell IDs
values as integer CNA states, or strings that can be converted to numbers (which of the two depends on your pipeline)

Loaded via read_and_drop_empty(...) and then indexed by "region".
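
As an illustration of this layout, the sketch below builds a toy matrix with plain pandas. It does not reproduce `read_and_drop_empty`; the region label format (`chr1:0-100000`) and the cell IDs are assumptions made up for the example.

```python
import io

import pandas as pd

# Toy CSV: one "region" column, then one column per cell.
# The region label format and the cell IDs are illustrative assumptions.
csv_text = """region,cell_1,cell_2
chr1:0-100000,1,2
chr1:100000-200000,1,0
"""

gt = pd.read_csv(io.StringIO(csv_text)).set_index("region")
print(gt)
```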

Tool prediction files: `tool_hap1_cna_files`, `tool_hap2_cna_files`

Expected to be CSVs with:

region column
cell columns

Implementation details:

tool predictions are loaded with pd.read_csv(...).fillna(-1), so missing values become -1
regions are re-binned using:

split_all_regions(df.set_index("region"), profile_bin_size)

GT and predictions are aligned by:

gt_h1, p_h1 = align(gt_h1_r, p_h1)
gt_h2, p_h2 = align(gt_h2_r, p_h2)
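
The internals of `align` are not shown here. One plausible reading is that it restricts both matrices to the regions and cells they have in common; the sketch below implements that reading with a hypothetical stand-in, `align_on_shared`, and should not be taken as the library's behavior.

```python
import pandas as pd


def align_on_shared(gt: pd.DataFrame, pred: pd.DataFrame):
    """Hypothetical stand-in for align: keep only shared rows (regions) and columns (cells)."""
    rows = gt.index.intersection(pred.index)
    cols = gt.columns.intersection(pred.columns)
    return gt.loc[rows, cols], pred.loc[rows, cols]


# Toy matrices that overlap on one region ("r2") and one cell ("c2").
gt = pd.DataFrame({"c1": [1, 2], "c2": [1, 1]}, index=["r1", "r2"])
pred = pd.DataFrame({"c2": [1, 0], "c3": [2, 2]}, index=["r2", "r3"])

gt_a, pred_a = align_on_shared(gt, pred)
print(gt_a.shape, pred_a.shape)
```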

CNA Categories (Conditions)

The function evaluates each tool under multiple CNA “classes”, defined by conditions:

CN_Gain: >=2
CN_Neutral: =1
CN_Loss: =0
exact CN states: =2, =3, ..., =10 (folders CN_equal_2 … CN_equal_10)

Each class corresponds to a subfolder name (used in output directory structure) and a condition string (used by downstream categorization).
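
The class list above can be written out programmatically. In this sketch the tuple representation (folder name, comparison, threshold) is an assumption chosen for illustration; the library itself uses condition strings whose exact format is not shown here. `classes_for` is a hypothetical helper.

```python
import operator

# (folder name, comparison, threshold), mirroring the classes listed above.
CN_CLASSES = [
    ("CN_Gain", operator.ge, 2),
    ("CN_Neutral", operator.eq, 1),
    ("CN_Loss", operator.eq, 0),
] + [(f"CN_equal_{k}", operator.eq, k) for k in range(2, 11)]


def classes_for(cn: int) -> list:
    """Hypothetical helper: names of every class whose condition a copy number satisfies."""
    return [name for name, op, threshold in CN_CLASSES if op(cn, threshold)]


print(len(CN_CLASSES))
print(classes_for(3))
print(classes_for(-1))
```

Two properties follow directly from the listed conditions: the classes overlap (a copy number of 3 satisfies both `CN_Gain` and `CN_equal_3`), and a value of -1 satisfies none of them.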

Haplotype Labels

Haplotype naming depends on type:

if type="hcCNA" → hap_list = ["hap1", "hap2"]
if type="acCNA" → hap_list = ["minor", "major"]

These names are passed into downstream functions and appear in the final output table.

Output

Directory structure (per tool × CNA class)

For each tool and each CNA class, the function creates:

{self.output_dir}/{tool}/{folder}/

Examples:

.../CHISEL/CN_Gain/
.../CHISEL/CN_equal_3/

Each folder holds class-specific intermediate files, which are written and then consumed by:

categorize_and_save(...) (writes the bin tables for hap1/minor and hap2/major)
process_folder_for_metrics_clone(...) (reads folder contents and computes metrics)
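
The per-tool, per-class directory layout described above can be reproduced with the standard library. This sketch only creates empty folders in a temporary directory that stands in for `self.output_dir`; it does not write the intermediate files themselves.

```python
import os
import tempfile

tools = ["CHISEL", "SIGNALS"]
folders = ["CN_Gain", "CN_Neutral", "CN_Loss"] + [f"CN_equal_{k}" for k in range(2, 11)]

# Temporary directory used as a stand-in for self.output_dir.
output_dir = tempfile.mkdtemp()

for tool in tools:
    for folder in folders:
        os.makedirs(os.path.join(output_dir, tool, folder), exist_ok=True)

created = sorted(os.listdir(os.path.join(output_dir, "CHISEL")))
print(created)
```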

Final aggregated CSV (`outfile`)

Saved to:

os.path.join(self.output_dir, outfile)

Default:

{self.output_dir}/cnclass_results.csv

Output Table Schema

The returned dataframe (and saved CSV) contains one row per:

(CNA class × haplotype × clone × tool)

Columns include:

Column	Meaning
`Type`	CNA class folder name (e.g., `CN_Gain`, `CN_equal_4`)
`Haplotype`	`hap1/hap2` or `minor/major` depending on `type`
`Clone`	Clone identifier returned by `process_folder_for_metrics_clone`
`Tool`	Tool name
(metrics...)	Additional metric fields returned in `metrics` dict (depends on your implementation of `process_folder_for_metrics_clone`)
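
To show how a table with this schema might be queried, the sketch below uses made-up toy rows. The metric column `accuracy`, the clone ID `clone1`, and all values are hypothetical; the real metric columns depend on `process_folder_for_metrics_clone`.

```python
import pandas as pd

# Toy rows following the schema above; "accuracy" and its values are invented.
df = pd.DataFrame(
    [
        ("CN_Gain", "hap1", "clone1", "CHISEL", 1.0),
        ("CN_Gain", "hap2", "clone1", "CHISEL", 0.5),
        ("CN_Gain", "hap1", "clone1", "SIGNALS", 0.5),
        ("CN_Loss", "hap1", "clone1", "SIGNALS", 1.0),
    ],
    columns=["Type", "Haplotype", "Clone", "Tool", "accuracy"],
)

# Restrict to one CNA class, then average the metric per tool.
gain = df[df["Type"] == "CN_Gain"]
per_tool = gain.groupby("Tool")["accuracy"].mean()
print(per_tool)
```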

Return Value

Returns a pd.DataFrame with the aggregated per-class, per-haplotype, per-clone metrics for all tools.

Example

from hcbench.gtbench.gtbench import GTBench

bench = GTBench(output_dir="out/gt_output")

df = bench.cnclass(
    tool_hap1_cna_files=[
        "/path/to/chisel/hap1.csv",
        "/path/to/signals/hap1.csv",
    ],
    tool_hap2_cna_files=[
        "/path/to/chisel/hap2.csv",
        "/path/to/signals/hap2.csv",
    ],
    tool_names=["CHISEL", "SIGNALS"],
    profile_hap1_cna_file="/path/to/gt/hap1.csv",
    profile_hap2_cna_file="/path/to/gt/hap2.csv",
    type="hcCNA",
    profile_bin_size=100000,
    outfile="cnclass_results.csv",
)

print(df.head())