SIGNALS Parser
This module provides SignalsParser, specialized for parsing Signals outputs and exporting standardized matrices and helper files used by hcbench workflows.
Key features:
- Run the existing pipeline on Signals CNA tables
- Parse Signals cluster assignments into a standardized clusters.csv
- Export bin-level count matrices as bin_counts.csv
- Convert a VAF long table into sparse matrix outputs
🚀 Quick Start
🧬 Exporting hscn_data.tsv from R
Run the following R code to extract and save the hscn$data component:
mnt <- "/demo_output/signals/output/"
hscn <- readRDS("/demo_output/signals/hscn.rds")
# Export the data component
write.table(
hscn$data,
file = paste0(mnt, "hscn_data.tsv"),
sep = "\t",
quote = FALSE,
row.names = FALSE
)
This will produce the file hscn_data.tsv, which is then used as input for the parser.
🧬 1. Parse the CNA Matrix
from hcbench.parsers.signals import SignalsParser
signals_input = "/demo_output/signals/output/hscn_data.tsv"
signals_output = "/output/signals/"
signals_parser = SignalsParser(input_path=signals_input, output_path=signals_output)
signals_parser.run()
After running, the parser reads the input file and saves the results to the output directory, which typically contains the following files:
/output/signals/
├── haplotype_combined.csv
├── haplotype_1.csv
├── haplotype_2.csv
├── minor.csv
├── major.csv
└── minor_major.csv
haplotype_combined.csv: main CNA matrix (regions × cells).
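The output file names suggest that each allele-specific state is split into per-haplotype and minor/major values. The sketch below shows one way such a split could work on a state_AS_phased string like "2|1". It is not taken from hcbench: the function name split_phased_state and the mapping of the first number to haplotype 1 are assumptions made for illustration.

```python
def split_phased_state(state: str) -> dict:
    """Split a phased state such as "2|1" into per-haplotype copy numbers.

    Assumption: the value before "|" belongs to haplotype 1 and the value
    after it to haplotype 2. hcbench may use a different convention.
    """
    first, second = state.split("|")
    hap1, hap2 = int(first), int(second)
    return {
        "haplotype_1": hap1,
        "haplotype_2": hap2,
        "major": max(hap1, hap2),  # larger of the two allele copy numbers
        "minor": min(hap1, hap2),  # smaller of the two allele copy numbers
    }


example = split_phased_state("2|1")
print(example)
```

Applying a function like this to every (region, cell) entry would yield one value per output matrix, which is why a single input table can produce several files.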
🧬 2. Parse the Cluster File
If you have a Signals cluster mapping file, you can parse it separately using the get_cluster() method:
cluster_file = "/demo_output/signals/clusters.csv"
signals_parser.get_cluster(cluster_file)
An example of the input cluster file:
cell_id,clone_id
AAACCTGAGAAGGACA,CloneA
AAACCTGAGATCTGCT,CloneB
AAACCTGAGTAATCCC,CloneA
After running this command, the following standardized file will be created:
/output/signals/
└── clusters.csv
🧬 3. Parse the Bin Counts Matrix
Unlike the CHISEL parser, the Signals parser requires the explicit path to the bin counts file:
bin_count_file = "hmmcopy_results/reads.csv.gz"
signals_parser.get_bin_counts(bin_count_file)
After running this command, the following standardized file will be created:
/output/signals/
└── bin_counts.csv
🧬 4. Parse the VAF Sparse Matrices
signals_parser.get_VAF_matrix(
vaf_file_path="hscn_pipeline_apptainer/results/counthaps/allele_counts_all.csv.gz",
min_dp=3,
min_cells=10,
)
After running this command, the following standardized directory and files will be created:
/output/signals/
├── VAF/
│ └── cellSNP_*.mtx
⚙️ Initialization
SignalsParser(
input_path: str,
output_path: str,
**kwargs
)
input_path: Path to the Signals CNA output. An example of the expected input format:
chr start end reads copy state cell_id alleleA alleleB totalcounts BAF state_min A B state_AS_phased state_AS LOH phase state_phase state_BAF
1 5000001 10000000 34762 NA 2 clone1_cell1 677 193 870 0.22183908045977 1 1 1 1|1 1|1 NO Balanced Balanced 0.5
1 20000001 25000000 34200 NA 2 clone1_cell1 639 222 861 0.257839721254355 1 1 1 1|1 1|1 NO Balanced Balanced 0.5
1 30000001 35000000 42510 NA 2 clone1_cell1 807 203 1010 0.200990099009901 1 1 1 1|1 1|1 NO Balanced Balanced 0.5
The required columns are:
chr, start, end, cell_id, state_AS_phased
output_path: Base output directory where the standardized matrices will be saved.
🧠 Core Methods
SignalsParser.run()
Executes the standard pipeline for the CNA matrix: reads the file at input_path and writes the standardized matrices to output_path.
SignalsParser.get_cluster(cluster_file_path)
Parses a Signals cluster mapping file and writes a standardized CSV.
Input
cluster_file_path: CSV file containing at least the columns cell_id and clone_id.
Output writes to:
{self.output_path}/clusters.csv, retaining only the cell_id and clone_id columns.
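To make the "retain only two columns" behaviour concrete, here is a minimal standard-library sketch. It is an illustration, not the hcbench implementation: the helper name standardize_clusters and the extra score column in the sample are hypothetical.

```python
import csv
import io


def standardize_clusters(src, dst):
    """Copy only cell_id and clone_id from src to dst, dropping other columns."""
    reader = csv.DictReader(src)
    missing = {"cell_id", "clone_id"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"cluster file is missing columns: {sorted(missing)}")
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(["cell_id", "clone_id"])
    for row in reader:
        writer.writerow([row["cell_id"], row["clone_id"]])


# "score" is a hypothetical extra column that should be dropped.
raw = io.StringIO(
    "cell_id,clone_id,score\n"
    "AAACCTGAGAAGGACA,CloneA,0.9\n"
    "AAACCTGAGATCTGCT,CloneB,0.8\n"
)
out = io.StringIO()
standardize_clusters(raw, out)
print(out.getvalue())
```

Failing early on a missing column keeps a malformed cluster file from silently producing an empty or misaligned clusters.csv.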
SignalsParser.get_bin_counts(bin_count_file_path)
Creates a region-by-cell wide matrix of per-bin counts. Note that this method creates a temporary internal parser configured for the hg38 reference genome; it reads values from the reads column and does not add chromosome prefixes.
Input
bin_count_file_path: Path to the raw bin counts CSV file.
Output writes to:
{self.output_path}/bin_counts.csv
This is a wide matrix:
- rows: regions
- columns: cells
- values: counts
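The long-to-wide reshaping described above can be sketched with the standard library. This is only an illustration under stated assumptions: the input columns chr, start, end, cell_id, and reads, the chr:start-end region label, and the fill value of 0 for missing entries are all assumed rather than confirmed by hcbench, and pivot_bin_counts is a hypothetical helper name.

```python
import csv
import io


def pivot_bin_counts(src, dst, value_col="reads"):
    """Pivot a long per-bin table into a region x cell wide matrix."""
    matrix = {}  # region label -> {cell_id: value}
    cells = {}   # dict used as an insertion-ordered set of cell ids
    for row in csv.DictReader(src):
        # Assumed region label format; no chromosome prefix is added.
        region = f"{row['chr']}:{row['start']}-{row['end']}"
        matrix.setdefault(region, {})[row["cell_id"]] = row[value_col]
        cells[row["cell_id"]] = None
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(["region", *cells])
    for region, values in matrix.items():
        # Assumption: a bin with no record for a cell is written as 0.
        writer.writerow([region, *(values.get(cell, 0) for cell in cells)])


long_table = io.StringIO(
    "chr,start,end,cell_id,reads\n"
    "1,1,500000,cellA,120\n"
    "1,1,500000,cellB,98\n"
    "1,500001,1000000,cellA,130\n"
)
wide = io.StringIO()
pivot_bin_counts(long_table, wide)
print(wide.getvalue())
```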
SignalsParser.get_VAF_matrix(vaf_file_path, output_path=None, min_dp=1, min_cells=1, prefix="cellSNP")
Converts a VAF long table into sparse matrix outputs.
Parameters
output_path (optional)
- If provided, outputs are written under {output_path}/VAF; otherwise they are written under {self.output_path}/VAF.
min_dp
- Minimum depth; sites with lower depth are filtered out (default: 1).
min_cells
- Minimum number of supporting cells; sites supported by fewer cells are filtered out (default: 1).
prefix
- Output file prefix (default: cellSNP).
Output
Creates a VAF/ directory containing the Matrix Market (.mtx) files:
.../VAF/
└── cellSNP_*.mtx
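As a rough illustration of how min_dp and min_cells could interact before a sparse matrix is written, consider the sketch below. Everything here is an assumption made for the example: the record layout (cell, site, alt count, depth), the rule that depth is checked per record, and the helper names filter_sites and write_mtx. Only the Matrix Market coordinate layout (header line, size line, 1-based entries) follows the published format.

```python
import io


def filter_sites(records, min_dp=1, min_cells=1):
    """Keep records with depth >= min_dp at sites seen in >= min_cells cells."""
    kept = [r for r in records if r[3] >= min_dp]
    cells_per_site = {}
    for cell, site, _alt, _dp in kept:
        cells_per_site.setdefault(site, set()).add(cell)
    sites = [s for s, c in cells_per_site.items() if len(c) >= min_cells]
    site_idx = {s: i for i, s in enumerate(sites)}
    kept = [r for r in kept if r[1] in site_idx]
    cells = list(dict.fromkeys(r[0] for r in kept))
    cell_idx = {c: i for i, c in enumerate(cells)}
    # Sparse entries as (row = site, column = cell, value = depth).
    entries = [(site_idx[s], cell_idx[c], dp) for c, s, _alt, dp in kept]
    return sites, cells, entries


def write_mtx(entries, n_rows, n_cols, dst):
    """Write entries in Matrix Market coordinate format (1-based indices)."""
    dst.write("%%MatrixMarket matrix coordinate integer general\n")
    dst.write(f"{n_rows} {n_cols} {len(entries)}\n")
    for row, col, value in entries:
        dst.write(f"{row + 1} {col + 1} {value}\n")


# Hypothetical records: (cell_id, site, alt_count, depth)
records = [
    ("c1", "chr1:100", 1, 5),
    ("c2", "chr1:100", 0, 4),
    ("c1", "chr1:200", 2, 6),
    ("c2", "chr1:300", 0, 1),
]
sites, cells, entries = filter_sites(records, min_dp=3, min_cells=2)
buf = io.StringIO()
write_mtx(entries, len(sites), len(cells), buf)
print(buf.getvalue())
```

In this toy input only chr1:100 survives: chr1:300 fails the depth threshold, and chr1:200 is covered by a single cell.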
🧩 Example of hscn_data.tsv
An example of the exported file may look like:
chr start end reads copy state cell_id alleleA alleleB totalcounts BAF state_min A B state_AS_phased state_AS LOH phase state_phase state_BAF
1 5000001 10000000 34762 NA 2 clone1_cell1 677 193 870 0.22183908045977 1 1 1 1|1 1|1 NO Balanced Balanced 0.5
1 20000001 25000000 34200 NA 2 clone1_cell1 639 222 861 0.257839721254355 1 1 1 1|1 1|1 NO Balanced Balanced 0.5
1 30000001 35000000 42510 NA 2 clone1_cell1 807 203 1010 0.200990099009901 1 1 1 1|1 1|1 NO Balanced Balanced 0.5
Each row corresponds to one genomic bin in a given single cell.
The *required columns* are:
```
chr,start,end,cell_id,state_AS_phased
```
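Before handing the exported file to the parser, it can be useful to confirm that the header contains the required columns. The following is a small standalone check, not part of hcbench; the function name missing_columns is hypothetical, and the tab delimiter matches the sep = "\t" used in the R export.

```python
import csv
import io

REQUIRED_COLUMNS = ("chr", "start", "end", "cell_id", "state_AS_phased")


def missing_columns(src, delimiter="\t"):
    """Return the required columns that are absent from the header row."""
    header = next(csv.reader(src, delimiter=delimiter), [])
    return [col for col in REQUIRED_COLUMNS if col not in header]


complete = io.StringIO(
    "chr\tstart\tend\treads\tcell_id\tstate_AS_phased\n"
    "1\t5000001\t10000000\t34762\tclone1_cell1\t1|1\n"
)
incomplete = io.StringIO("chr\tstart\tend\tcell_id\n")

print(missing_columns(complete))    # []
print(missing_columns(incomplete))  # ['state_AS_phased']
```

With a real file, the same function can be called on an open handle, for example `with open(path, newline="") as fh: missing_columns(fh)`.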