
clusterConsistency

Function

clusterConsistency(
    tool_clone_files: List[str],
    tool_names: List[str],
) -> pd.DataFrame

For each tool, the function loads that tool's clone file (for example, clusters.csv) and calculates:

  • ARI: Adjusted Rand Index between the reference cluster label derived from cell_id and the tool's clone_id
  • AMI: Adjusted Mutual Information between the same two labelings

It aggregates results across tools into a single table, writes it to:

  • {self.output_dir}/clustering_result.csv

and returns the result as a DataFrame.
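
The steps above can be sketched roughly as follows. This is a minimal illustration and not the hcbench implementation: the function name cluster_consistency_sketch and the explicit output_dir argument are hypothetical, and the use of pandas with scikit-learn's adjusted_rand_score and adjusted_mutual_info_score is an assumption about how such metrics are commonly computed.

```python
import os
from typing import List

import pandas as pd
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score


def cluster_consistency_sketch(
    tool_clone_files: List[str],
    tool_names: List[str],
    output_dir: str,
) -> pd.DataFrame:
    """Hypothetical stand-in for clusterConsistency; not the library's code."""
    if len(tool_clone_files) != len(tool_names):
        raise ValueError("tool_clone_files and tool_names must have the same length")

    rows = []
    for path, name in zip(tool_clone_files, tool_names):
        # Read both columns as strings so labels are treated as categories.
        df = pd.read_csv(path, dtype={"cell_id": str, "clone_id": str})
        # Reference label: the part of cell_id before the first underscore.
        reference = df["cell_id"].str.split("_").str[0].tolist()
        predicted = df["clone_id"].tolist()
        rows.append({
            "Tool": name,
            "ARI": adjusted_rand_score(reference, predicted),
            "AMI": adjusted_mutual_info_score(reference, predicted),
        })

    # One row per tool, written to disk and returned.
    result = pd.DataFrame(rows, columns=["Tool", "ARI", "AMI"])
    os.makedirs(output_dir, exist_ok=True)
    result.to_csv(os.path.join(output_dir, "clustering_result.csv"), index=False)
    return result
```

Passing output_dir explicitly keeps the sketch free of class state; the documented method reads it from self.output_dir instead.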


Parameters

Name | Type | Description
tool_clone_files | List[str] | List of file paths, one per tool. Each file should be a CSV containing at least cell_id and clone_id columns.
tool_names | List[str] | List of tool names aligned with tool_clone_files (same length and order).

Input File Format

Each tool_clone_files[i] is expected to be a CSV with at least:

  • cell_id: string cell identifier, expected format like "<cluster>_<rest>" so that cell_id.split("_")[0] yields the reference cluster label
  • clone_id: clone assignment label from the tool

Example:

cell_id,clone_id
A_cell0001,1
A_cell0002,1
B_cell0003,2

From this, the function derives:

  • clusters1 = ["A", "A", "B"]
  • clusters2 = ["1", "1", "2"]
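
The derivation can be reproduced with the standard library alone. The snippet below is a sketch under the assumption that the reference label is exactly the text before the first underscore and that clone_id is kept as a string, as the lists above suggest; the variable csv_text is introduced only for illustration.

```python
import csv
import io

# Same rows as the example file above, inlined so the snippet is self-contained.
csv_text = """cell_id,clone_id
A_cell0001,1
A_cell0002,1
B_cell0003,2
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Reference label: everything before the first underscore in cell_id.
clusters1 = [row["cell_id"].split("_")[0] for row in rows]
# Tool label: clone_id, kept as a string.
clusters2 = [row["clone_id"] for row in rows]

print(clusters1)  # ['A', 'A', 'B']
print(clusters2)  # ['1', '1', '2']
```

A cell_id without an underscore would pass through split unchanged, so the whole identifier would become its own reference label.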

Return Type

  • pd.DataFrame

Returns

A DataFrame with one row per tool:

Column | Meaning
Tool | Tool name from tool_names
ARI | Adjusted Rand Index between derived clusters and clone_id
AMI | Adjusted Mutual Information between derived clusters and clone_id
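
To help read the ARI column, here is a small pair-counting implementation of the Adjusted Rand Index in plain Python. It is an illustrative re-implementation written for this page, not code taken from hcbench; in practice a library routine such as scikit-learn's adjusted_rand_score would normally be used. The function name adjusted_rand_index is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting ARI; an illustrative re-implementation, not hcbench code."""
    n = len(labels_a)
    # Pairs of items placed together in both labelings, in A only, and in B only.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    expected = pairs_a * pairs_b / total_pairs  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (pairs_both - expected) / (maximum - expected)


reference = ["A", "A", "B", "B"]
# Identical grouping under different label names scores a perfect 1.0.
print(adjusted_rand_index(reference, ["1", "1", "2", "2"]))  # 1.0
# A grouping that splits every reference cluster scores below zero.
print(adjusted_rand_index(reference, ["1", "2", "1", "2"]))
```

The score depends only on which cells are grouped together, so a tool's clone_id values do not need to match the reference cluster names.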

Output

  • Writes a CSV summary to: os.path.join(self.output_dir, "clustering_result.csv")

Example

from hcbench.gtbench import gtbench

# The runner's output_dir (self.output_dir inside the class) is "out/gt_output/"

gtbench_runner = gtbench.GTBench(
    output_dir="out/gt_output/",
)

tool_clone_files = [
    "out/chisel/clusters.csv",
    "out/alleloscope/clusters.csv",
    "out/signals/clusters.csv",
]
tool_names = ["CHISEL", "Alleloscope", "SIGNALS"]

# Call the method on the runner; the summary CSV is written under its output_dir (out/gt_output/)
df = gtbench_runner.clusterConsistency(
    tool_clone_files=tool_clone_files,
    tool_names=tool_names,
)

print(df)
# Expected output (values are illustrative):
#           Tool     ARI     AMI
# 0       CHISEL  0.2300  0.3010
# 1  Alleloscope  0.5321  0.4880
# 2      SIGNALS  0.8123  0.7451

After running, you will also find:

out/gt_output/clustering_result.csv

containing the same summary table.