cloneSizebycluster
Function
```python
cloneSizebycluster(
    self,
    gt_cluster_file: str,
    tool_cluster_files: List[str],
    tool_names: List[str],
    outfile: str = "clone_size_by_cluster.csv",
) -> pd.DataFrame
```
This function evaluates the consistency of clone size predictions across different tools compared to a Ground Truth (GT) reference.
- Result: Groups the results by the GT `cluster_size` and calculates the mean predicted size for each group to identify scaling trends or biases.
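The grouping step can be illustrated with a small dependency-free sketch. It assumes that the "predicted size" of a cell is the size of the clone the tool assigned it to, and that only cells present in both files are compared; the source does not confirm either detail, and the toy data and the helper `size_per_cell` are hypothetical.

```python
from collections import Counter, defaultdict

# Toy assignments (hypothetical data): cell_id -> clone_id.
gt = {"c1": "A", "c2": "A", "c3": "A", "c4": "A", "c5": "B", "c6": "B"}
tool = {"c1": "X", "c2": "X", "c3": "Y", "c4": "Y", "c5": "Y", "c6": "Y"}


def size_per_cell(assignments):
    """Map each cell to the size of the clone it was assigned to."""
    clone_sizes = Counter(assignments.values())
    return {cell: clone_sizes[clone] for cell, clone in assignments.items()}


gt_size = size_per_cell(gt)      # c1..c4 -> 4, c5..c6 -> 2
pred_size = size_per_cell(tool)  # c1..c2 -> 2, c3..c6 -> 4

# Group predicted sizes by GT cluster size (assumption: shared cells only).
grouped = defaultdict(list)
for cell in gt_size.keys() & pred_size.keys():
    grouped[gt_size[cell]].append(pred_size[cell])

# Mean predicted size per GT cluster size.
mean_pred_size = {
    size: sum(values) / len(values) for size, values in sorted(grouped.items())
}
print(mean_pred_size)  # {2: 4.0, 4: 3.0}
```

In this toy case the GT clone of size 4 is predicted at 3.0 on average (under-sized), while the GT clone of size 2 is predicted at 4.0 (over-sized), which is the kind of bias the summary is meant to expose.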
Parameters
| Name | Type | Description |
|---|---|---|
| `gt_cluster_file` | `str` | Path to the Ground Truth CSV file containing cluster assignments. |
| `tool_cluster_files` | `List[str]` | A list of file paths to the `clusters.csv` files generated by each tool. |
| `tool_names` | `List[str]` | A list of tool names corresponding to the `tool_cluster_files` (must match in length and order). |
| `outfile` | `str` | The name of the detailed output CSV file. Defaults to `"clone_size_by_cluster.csv"`. |
Input File Format
Cluster Files
Both the `gt_cluster_file` and each entry in `tool_cluster_files` are expected to be CSV files with a `cell_id` column and a `clone_id` column, for example:
```
cell_id,clone_id
cell_001,A
cell_002,A
cell_003,B
```
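Before running the comparison, it can help to check that a cluster file has the expected shape. The following is a minimal sketch using only the standard library; the helper `read_clusters` is hypothetical and not part of hcbench, and the duplicate-cell check is an assumption about what a well-formed file looks like rather than a documented requirement.

```python
import csv
import io

# The sample file shown above, inlined so the sketch is self-contained.
SAMPLE = """cell_id,clone_id
cell_001,A
cell_002,A
cell_003,B
"""


def read_clusters(handle):
    """Read a cluster CSV into a cell_id -> clone_id dict, validating columns."""
    reader = csv.DictReader(handle)
    missing = {"cell_id", "clone_id"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    assignments = {}
    for row in reader:
        # Assumption: each cell should be assigned to exactly one clone.
        if row["cell_id"] in assignments:
            raise ValueError(f"duplicate cell_id: {row['cell_id']}")
        assignments[row["cell_id"]] = row["clone_id"]
    return assignments


clusters = read_clusters(io.StringIO(SAMPLE))
print(clusters)  # {'cell_001': 'A', 'cell_002': 'A', 'cell_003': 'B'}
```

For a file on disk, the same helper could be called with `open(path, newline="")` in place of the `io.StringIO` wrapper.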
Output
The function writes two CSV files to self.output_dir:
- Detailed Table: `{self.output_dir}/{outfile}`. Contains the raw comparison of sizes for every individual across all tools.
- Summary Table: `{self.output_dir}/mean_{outfile}`. Contains the averaged results grouped by GT cluster.
| Column | Meaning |
|---|---|
| `cluster_size` | The actual size of the clones in the Ground Truth. |
| `{Tool}_pred_size` | The average size predicted by the specific tool for clones of that GT size. |
Example
```python
from hcbench.gtbench import gtbench

# Initialize the runner
gtbench_runner = gtbench.GTBench(output_dir="out/gt_output/")

# Define inputs
gt_path = "/home/jianganna/workspace/HCDSIM/data/new-gt/clusters.csv"
tool_files = [
    "out/chisel/clusters.csv",
    "out/signals/clusters.csv",
]
tool_names = ["CHISEL", "SIGNALS"]

# Execute analysis
gtbench_runner.cloneSizebycluster(
    gt_cluster_file=gt_path,
    tool_cluster_files=tool_files,
    tool_names=tool_names,
    outfile="comparison_results.csv",
)

# Example output:
#    cluster_size  CHISEL_pred_size  SIGNALS_pred_size
# 0            50              48.2               51.5
# 1           200             185.0              205.2
```
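One way to read the summary table for scaling bias is to divide each mean predicted size by the GT size. The sketch below does this for the example rows above; the ratio is an illustrative metric chosen here, not something the function itself reports.

```python
# Rows copied from the example output: (cluster_size, CHISEL, SIGNALS).
rows = [(50, 48.2, 51.5), (200, 185.0, 205.2)]
tools = ["CHISEL", "SIGNALS"]

# Ratio < 1 means the tool under-sizes clones of that GT size; > 1 means over-sizes.
ratios = {}
for size, *preds in rows:
    for tool, pred in zip(tools, preds):
        ratios[(tool, size)] = round(pred / size, 3)

for (tool, size), ratio in sorted(ratios.items()):
    print(f"{tool:8s} GT size {size:4d}: predicted/GT = {ratio}")
```

With these numbers, CHISEL's ratio falls from 0.964 at size 50 to 0.925 at size 200, while SIGNALS stays near 1.03, suggesting that CHISEL under-sizes larger clones more strongly in this example.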