Cndetect
Function
cndetect(
self,
tool_cna_files: List[str],
cna_profile_file: str,
tool_names: List[str],
haplotype: str = "combined",
profile_bin_size=100000,
outfile: str = "bin_level_results.csv",
index_col: Optional[int] = 0,
) -> pd.DataFrame
This function evaluates bin-level CNA detection performance of multiple tools against a reference CNA profile file (ground truth).
For each tool CNA matrix, it:
- Loads the tool CNA file and the reference CNA file.
- Splits tool regions into a uniform binning resolution (
profile_bin_size). - Aligns tool predictions and reference profiles on common regions and cells.
- Computes evaluation metrics (RMSE, SCC, ACC) under a specified haplotype mode.
- Saves a summary CSV across all tools.
Parameters
| Name | Type | Description |
|---|---|---|
tool_cna_files |
List[str] |
List of tool CNA profile CSV files. Must align with tool_names in order. |
cna_profile_file |
str |
Path to the reference (truth) CNA profile CSV file. |
tool_names |
List[str] |
Tool names used in logs and the result table. |
haplotype |
str |
Haplotype evaluation mode passed to evaluate_haplotype_predictions. Default: "combined". |
profile_bin_size |
int |
Bin size used to split regions in tool predictions before alignment. Default: 100000 (100kb). |
outfile |
str |
Output filename for the summary table. Default: "bin_level_results.csv". |
index_col |
Optional[int] |
(Currently unused in implementation.) Intended CSV index column specification. Default: 0. |
Input File Format
cna_profile_file (truth) and each file in tool_cna_files
Expected to be CNA matrices with:
- one column named
region - remaining columns as cell IDs
- values as CNA states (e.g.,
"1|1","2|1"), consistent with your pipeline conventions
Example:
region,cell_001,cell_002,cell_003
chr1:1-100000,1|1,1|1,2|1
chr1:100001-200000,1|1,1|1,2|1
Implementation details:
- Both truth and predictions are loaded with
read_and_drop_empty(...)(drops empty columns/cells). - The truth dataframe is indexed by
"region"(truth_df.set_index("region", inplace=True)). - Tool predictions are re-binned using:
pred = split_all_regions(pred.set_index("region"), profile_bin_size)
Then truth and pred are aligned with:
truth, pred = align(truth_df, pred)
Evaluation Metrics
For each tool, metrics are computed via:
rmse, scc, acc = evaluate_haplotype_predictions(pred, truth, haplotype)
The output table contains:
| Column | Meaning |
|---|---|
Tool |
Tool name |
RMSE |
Root Mean Squared Error between prediction and truth |
SCC |
Similarity/Correlation metric returned by your evaluator (often “Spearman Correlation Coefficient” in CNA contexts, but exact definition depends on your implementation) |
ACC |
Accuracy metric returned by your evaluator |
Notes:
SCCmay beNonefor certain inputs depending on yourevaluate_haplotype_predictionsimplementation (e.g., constant vectors / insufficient comparable bins).
Output
A single CSV is written to self.output_dir:
os.path.join(self.output_dir, outfile)
Default path:
{self.output_dir}/bin_level_results.csv
Example output:
Tool,RMSE,SCC,ACC
CHISEL,0.12,0.85,0.93
SIGNALS,0.20,0.71,0.88
Return Value
Returns a pd.DataFrame with one row per tool and the columns:
ToolRMSESCCACC
Example
from hcbench.gtbench.gtbench import GTBench
bench = GTBench(output_dir="out/gt_output")
df = bench.cndetect(
tool_cna_files=[
"/path/to/chisel/haplotype_combined.csv",
"/path/to/signals/haplotype_combined.csv",
],
cna_profile_file="/path/to/gt/haplotype_combined.csv",
tool_names=["CHISEL", "SIGNALS"],
haplotype="combined",
profile_bin_size=100000,
outfile="bin_level_results.csv",
)
print(df)