cloneSizebycluster

Function

cloneSizebycluster(
    self,
    gt_cluster_file: str,
    tool_cluster_files: List[str],
    tool_names: List[str],
    outfile: str = "clone_size_by_cluster.csv",
) -> pd.DataFrame

This function compares the clone sizes predicted by each tool against a Ground Truth (GT) reference, to evaluate how consistently the tools reproduce the true clone sizes.

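Result: Groups the comparison by the GT cluster_size and computes, for each tool, the mean predicted size within each group. Comparing that mean against cluster_size reveals scaling trends or biases, such as a tool that consistently under-sizes large clones.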
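The grouping step can be illustrated with a minimal, standard-library sketch. It is not the hcbench implementation. It assumes that cells are matched by cell_id, that each cell is assigned the size of the clone it belongs to, and that predicted sizes are then averaged per GT size. The helper name mean_pred_size_by_gt_size and the toy assignments are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean


def mean_pred_size_by_gt_size(gt, pred):
    """Average predicted clone size per GT clone size.

    gt and pred map cell_id -> clone_id (hypothetical in-memory inputs;
    the real function reads these assignments from CSV files).
    """
    gt_sizes = Counter(gt.values())      # GT clone_id -> number of cells
    pred_sizes = Counter(pred.values())  # tool clone_id -> number of cells

    # For every cell present in both assignments, record the size of the
    # clone the tool placed it in, keyed by the size of its GT clone.
    by_gt_size = defaultdict(list)
    for cell, gt_clone in gt.items():
        if cell in pred:
            by_gt_size[gt_sizes[gt_clone]].append(pred_sizes[pred[cell]])

    return {size: mean(vals) for size, vals in sorted(by_gt_size.items())}


# Toy data: GT has a clone of 4 cells (A) and a clone of 2 cells (B);
# the tool merges five cells into "x" and leaves one cell alone in "y".
gt = {"c1": "A", "c2": "A", "c3": "A", "c4": "A", "c5": "B", "c6": "B"}
tool = {"c1": "x", "c2": "x", "c3": "x", "c4": "x", "c5": "x", "c6": "y"}

result = mean_pred_size_by_gt_size(gt, tool)
print(result)
```

In this toy case the cells of the 4-cell GT clone all land in a predicted clone of 5 cells, while the cells of the 2-cell GT clone land in predicted clones of 5 and 1 cells. The per-group means therefore show the over-merging that the summary table is meant to expose.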

Parameters

Name	Type	Description
`gt_cluster_file`	`str`	Path to the Ground Truth CSV file containing cluster assignments.
`tool_cluster_files`	`List[str]`	A list of file paths to the `clusters.csv` files generated by each tool.
`tool_names`	`List[str]`	A list of tool names corresponding to the `tool_cluster_files` (must match in length and order).
`outfile`	`str`	The name of the detailed output CSV file. Defaults to `"clone_size_by_cluster.csv"`.

Input File Format

Cluster Files

Both gt_cluster_file and each entry in tool_cluster_files are expected to be CSV files with a cell_id column and a clone_id column, one row per cell:

cell_id,clone_id
cell_001,A
cell_002,A
cell_003,B
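Before running the comparison, it can help to confirm that a cluster file has this layout. The sketch below is a hypothetical pre-check written with the standard library only, not part of hcbench. It parses the sample above from an in-memory string, verifies the two expected columns, and counts the cells per clone. The helper name load_assignments is an assumption made for illustration.

```python
import csv
import io
from collections import Counter

# The sample cluster file shown above, held in memory for the demonstration.
SAMPLE = """cell_id,clone_id
cell_001,A
cell_002,A
cell_003,B
"""


def load_assignments(handle):
    """Read a cluster CSV and return a dict mapping cell_id -> clone_id."""
    reader = csv.DictReader(handle)
    missing = {"cell_id", "clone_id"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return {row["cell_id"]: row["clone_id"] for row in reader}


assignments = load_assignments(io.StringIO(SAMPLE))
clone_sizes = Counter(assignments.values())  # clone_id -> number of cells
print(dict(clone_sizes))
```

For a file on disk, the same helper can be called on an open file handle, for example inside a with open(path, newline="") block.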

Output

The function writes two CSV files to self.output_dir:

Detailed Table: {self.output_dir}/{outfile}

Contains the raw comparison of sizes for every individual across all tools.

Summary Table: {self.output_dir}/mean_{outfile}

Contains the mean predicted sizes, grouped by GT cluster size.

Column	Meaning
`cluster_size`	The actual size of the clones in the Ground Truth.
`{Tool}_pred_size`	The average size predicted by the specific tool for clones of that GT size.
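One way to read the summary table is as a relative bias per tool, (predicted - actual) / actual. The following sketch assumes the file contains a cluster_size column plus one {Tool}_pred_size column per tool, as described above, and ignores any other column. The helper name relative_bias is hypothetical, and the sample rows reuse the illustrative values from the Example section rather than real results.

```python
import csv
import io

# Illustrative summary table; the values are examples, not measured results.
SUMMARY = """cluster_size,CHISEL_pred_size,SIGNALS_pred_size
50,48.2,51.5
200,185.0,205.2
"""

SUFFIX = "_pred_size"


def relative_bias(handle):
    """Return {tool: {cluster_size: (pred - actual) / actual}}."""
    bias = {}
    for row in csv.DictReader(handle):
        actual = float(row["cluster_size"])
        for column, value in row.items():
            if column.endswith(SUFFIX):
                tool = column[: -len(SUFFIX)]  # strip the suffix to get the tool name
                bias.setdefault(tool, {})[int(actual)] = (float(value) - actual) / actual
    return bias


bias = relative_bias(io.StringIO(SUMMARY))
for tool, per_size in bias.items():
    for size, value in per_size.items():
        print(f"{tool} at GT size {size}: {value:+.1%}")
```

Negative values mean the tool under-estimates clone size for that GT size, and positive values mean it over-estimates.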

Example

from hcbench.gtbench import gtbench

# Initialize the runner
gtbench_runner = gtbench.GTBench(output_dir="out/gt_output/")

# Define inputs
gt_path = "/home/jianganna/workspace/HCDSIM/data/new-gt/clusters.csv"
tool_files = [
    "out/chisel/clusters.csv",
    "out/signals/clusters.csv"
]
tool_names = ["CHISEL", "SIGNALS"]

# Execute analysis
gtbench_runner.cloneSizebycluster(
    gt_cluster_file=gt_path,
    tool_cluster_files=tool_files,
    tool_names=tool_names,
    outfile="comparison_results.csv"
)

# Example summary table (mean_comparison_results.csv), illustrative values:
#    cluster_size  CHISEL_pred_size  SIGNALS_pred_size
# 0            50              48.2               51.5
# 1           200             185.0              205.2