Segmentation

Function

segmentation(
    self,
    gt_cna_file: str,
    tool_cna_files: List[str],
    tool_names: List[str],
    profile_bin_size=100000,
    threshold=0.95,
    outprefix: str = "segmentation_",
    level="cell",
    changes_file: Optional[str] = ""
)

This function evaluates segmentation quality of CNA calls by comparing tool predictions against GT-derived CNA segments.

It first derives CNA segments from the ground-truth CNA matrix using annotate_segments(...) (with optional CNV type detection). Then, for each tool, it converts the tool CNA matrix into a long format (region × cell), re-bins regions to a uniform resolution, and computes segmentation metrics using get_segment_metric(...).

A single metrics CSV is written to disk.

Parameters

Name	Type	Description
`gt_cna_file`	`str`	Path to GT CNA matrix CSV (regions × cells). The first column is treated as index when reading.
`tool_cna_files`	`List[str]`	List of tool CNA profile CSVs. Must align with `tool_names`.
`tool_names`	`List[str]`	Tool names used to label metric outputs.
`profile_bin_size`	`int`	Bin size used to split/standardize tool regions before metric computation. Default: `100000` (100kb).
`threshold`	`float`	Threshold passed into `annotate_segments(...)` (used for segmentation/CNV detection sensitivity). Default: `0.95`.
`outprefix`	`str`	Prefix used for the output metrics filename. Default: `"segmentation_"`.
`level`	`str`	Placeholder for evaluation level (`"cell"` or `"clone"`). Currently only `"cell"` path is active.
`changes_file`	`Optional[str]`	Reserved for clone-level evaluation (commented out in current implementation). Default: `""`.

Input File Format

`gt_cna_file` (GT CNA matrix)

Loaded as:

gt_profile = pd.read_csv(gt_cna_file, index_col=0)

So the CSV is expected to be:

rows: genomic regions (in the index)
columns: cell IDs
values: CNA states

Example:

region,cell_001,cell_002
chr1:1-100000,1|1,1|1
chr1:100001-200000,2|1,2|1

The GT matrix is transposed before segmentation annotation:

gt_annotated_df = annotate_segments(gt_profile.T, detect_cnv_type=True, threshold=threshold)

annotate_segments(...) is expected to return a dataframe that includes (at minimum):

Chrom
Start
End

A region string is then constructed:

gt_annotated_df["region"] = (
    gt_annotated_df["Chrom"] + ":" +
    gt_annotated_df["Start"].astype(str) + "-" +
    gt_annotated_df["End"].astype(str)
)

Tool CNA files (`tool_cna_files`)

Loaded via:

tool_cna_df = read_and_drop_empty(f)

Expected to include a region column plus one column per cell (wide CNA matrix).

Example:

region,cell_001,cell_002
chr1:1-100000,1|1,1|1
chr1:100001-200000,2|1,2|1

Processing Steps

For each tool:

Convert wide CNA matrix to long format (one row per region × cell):

tool_cna_df_long = (
    pd.DataFrame(tool_cna_df)
      .melt(id_vars="region", var_name="cell", value_name="value")
      .dropna(subset=["value"])
)

Re-bin/split regions to profile_bin_size resolution:

tool_cna_df_long = split_all_regions(tool_cna_df_long.set_index("region"), profile_bin_size)
tool_cna_df_long = tool_cna_df_long.reset_index().rename(columns={"index": "region"})

Compute segmentation metrics comparing tool calls to GT segments:

result2 = get_segment_metric(gt_annotated_df, tool_cna_df_long)
result2["Tool"] = name

Collect per-tool metric outputs and concatenate:

metric_df = pd.concat(metric_results, ignore_index=True)

Output

A single CSV is written to self.output_dir:

os.path.join(self.output_dir, f"{outprefix}metrics.csv")

Default:

{self.output_dir}/segmentation_metrics.csv

The output rows depend on what get_segment_metric(...) returns (e.g., overlap scores, breakpoint precision/recall, CNV-type metrics, etc.), with an added column:

Tool

Return Value

This function does not explicitly return a value (runtime return is None).

Example

from hcbench.gtbench.gtbench import GTBench

bench = GTBench(output_dir="out/gt_output")

bench.segmentation(
    gt_cna_file="/path/to/gt/haplotype_combined.csv",
    tool_cna_files=[
        "/path/to/chisel/haplotype_combined.csv",
        "/path/to/signals/haplotype_combined.csv",
    ],
    tool_names=["CHISEL", "SIGNALS"],
    profile_bin_size=100000,
    threshold=0.8,
    outprefix="segmentation_",
)