
cloneSizebycluster

Function

cloneSizebycluster(
    self,
    gt_cluster_file: str,
    tool_cluster_files: List[str],
    tool_names: List[str],
    outfile: str = "clone_size_by_cluster.csv",
) -> pd.DataFrame

This function compares the clone sizes predicted by each tool against a Ground Truth (GT) reference, showing how consistently the tools recover clone size.

  • Result: Rows are grouped by the GT cluster_size, and the mean predicted size is computed for each group, which exposes scaling trends or biases.
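The grouping step can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: it assumes the comparison is made per cell, pairing the size of the cell's GT clone with the size of the clone the tool assigned it to, and it assumes cells missing from the tool output are skipped. The helper name `mean_pred_size_by_gt_size` is hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean


def mean_pred_size_by_gt_size(gt_assign, tool_assign):
    """Average predicted clone size for each GT clone size.

    Both arguments map cell_id -> clone_id. Hypothetical helper, not part
    of hcbench; the per-cell pairing is an assumption.
    """
    gt_sizes = Counter(gt_assign.values())      # GT clone_id -> size
    tool_sizes = Counter(tool_assign.values())  # tool clone_id -> size

    grouped = defaultdict(list)
    for cell, gt_clone in gt_assign.items():
        if cell not in tool_assign:
            continue  # assumption: cells absent from the tool output are skipped
        grouped[gt_sizes[gt_clone]].append(tool_sizes[tool_assign[cell]])

    # One mean per GT cluster_size, ordered by size.
    return {size: mean(preds) for size, preds in sorted(grouped.items())}


gt = {"c1": "A", "c2": "A", "c3": "B"}
tool = {"c1": "X", "c2": "Y", "c3": "Y"}
result = mean_pred_size_by_gt_size(gt, tool)
print(result)  # {1: 2, 2: 1.5}
```

Here the GT clone of size 2 is split by the tool, so its mean predicted size (1.5) falls below the true size.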

Parameters

| Name | Type | Description |
| --- | --- | --- |
| gt_cluster_file | str | Path to the Ground Truth CSV file containing cluster assignments. |
| tool_cluster_files | List[str] | A list of file paths to the clusters.csv files generated by each tool. |
| tool_names | List[str] | A list of tool names corresponding to the tool_cluster_files (must match in length and order). |
| outfile | str | The name of the detailed output CSV file. Defaults to "clone_size_by_cluster.csv". |
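Because tool_names and tool_cluster_files are matched by position, a quick check before calling the function can catch a mismatch early. The following is a minimal sketch; `pair_tools` is a hypothetical helper and is not part of hcbench.

```python
from typing import Dict, List


def pair_tools(tool_names: List[str], tool_cluster_files: List[str]) -> Dict[str, str]:
    """Map each tool name to its cluster file, failing on a length mismatch."""
    if len(tool_names) != len(tool_cluster_files):
        raise ValueError(
            f"{len(tool_names)} tool names but {len(tool_cluster_files)} cluster files"
        )
    # zip pairs entries by position, so the order of both lists matters.
    return dict(zip(tool_names, tool_cluster_files))


pairs = pair_tools(
    ["CHISEL", "SIGNALS"],
    ["out/chisel/clusters.csv", "out/signals/clusters.csv"],
)
print(pairs["CHISEL"])  # out/chisel/clusters.csv
```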

Input File Format

Cluster Files

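Both the gt_cluster_file and each entry in tool_cluster_files are expected to be CSV files with a header row and one row per cell, mapping a cell_id to a clone_id, as in the following example: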

cell_id,clone_id
cell_001,A
cell_002,A
cell_003,B
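As a rough illustration of how clone sizes follow from this format, the sketch below counts the cells per clone_id using only the standard library. It reads the sample rows above from an in-memory string; `clone_sizes` is a hypothetical helper, not an hcbench function.

```python
import csv
import io
from collections import Counter

# The sample cluster file shown above, held in memory for the example.
SAMPLE = "cell_id,clone_id\ncell_001,A\ncell_002,A\ncell_003,B\n"


def clone_sizes(handle):
    """Count how many cells are assigned to each clone_id."""
    reader = csv.DictReader(handle)
    return Counter(row["clone_id"] for row in reader)


sizes = clone_sizes(io.StringIO(SAMPLE))
print(dict(sizes))  # {'A': 2, 'B': 1}
```

For a file on disk, the same function can be passed an open file handle, for example `open(path, newline="")`.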

Output

The function writes two CSV files to self.output_dir:

  1. Detailed Table: {self.output_dir}/{outfile}

Contains the raw comparison of sizes for every individual across all tools.

  2. Summary Table: {self.output_dir}/mean_{outfile}

Contains the averaged results grouped by GT cluster.

| Column | Meaning |
| --- | --- |
| cluster_size | The actual size of the clones in the Ground Truth. |
| {Tool}_pred_size | The average size predicted by the specific tool for clones of that GT size. |

Example

from hcbench.gtbench import gtbench

# Initialize the runner
gtbench_runner = gtbench.GTBench(output_dir="out/gt_output/")

# Define inputs
gt_path = "/home/jianganna/workspace/HCDSIM/data/new-gt/clusters.csv"
tool_files = [
    "out/chisel/clusters.csv",
    "out/signals/clusters.csv"
]
tool_names = ["CHISEL", "SIGNALS"]

# Execute analysis
gtbench_runner.cloneSizebycluster(
    gt_cluster_file=gt_path,
    tool_cluster_files=tool_files,
    tool_names=tool_names,
    outfile="comparison_results.csv"
)

# Example output:
#    cluster_size  CHISEL_pred_size  SIGNALS_pred_size
# 0            50              48.2               51.5
# 1           200             185.0              205.2
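One way to read the summary table is to divide each predicted size by the GT size, so that values below 1 suggest under-estimation and values above 1 suggest over-estimation. The sketch below does this with the standard library, using the rows from the example output above. It assumes the summary CSV has a cluster_size column plus one {Tool}_pred_size column per tool. Any other column is ignored. `size_ratios` is a hypothetical helper.

```python
import csv
import io

# Rows copied from the example output above; in practice this text would be
# read from {self.output_dir}/mean_{outfile}.
SUMMARY_CSV = """cluster_size,CHISEL_pred_size,SIGNALS_pred_size
50,48.2,51.5
200,185.0,205.2
"""


def size_ratios(handle):
    """Return {tool_column: {cluster_size: predicted / actual}}."""
    ratios = {}
    for row in csv.DictReader(handle):
        actual = float(row["cluster_size"])
        for column, value in row.items():
            # Only the per-tool prediction columns are compared against GT.
            if column.endswith("_pred_size"):
                ratios.setdefault(column, {})[int(actual)] = float(value) / actual
    return ratios


ratios = size_ratios(io.StringIO(SUMMARY_CSV))
for tool, by_size in ratios.items():
    for size, ratio in by_size.items():
        print(f"{tool} size={size}: ratio={ratio:.3f}")
```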