CNRein Parser
This module provides CNReinParser, specialized for parsing CNRein outputs and exporting standardized matrices and helper files used by hcbench workflows.
Key features:
- Read CNRein prediction tables and merge separate haplotype columns into a unified haplotype-level CNA matrix
- Optional region splitting by a user-defined bin_size
- Extract bin-level Read Depth Ratio (RDR) from .npz files into a standardized bin_rdr.csv
- Convert a directory of separate chromosome VCF files into sparse VAF matrix outputs
🚀 Quick Start
🧬 1. Parse the CNA Matrix
from hcbench.parsers.cnrein import CNReinParser
cnrein_input = "/demo_output/cnrein/finalPrediction/CNReinPrediction.csv"
cnrein_output = "/output/cnrein/"
cnrein_parser = CNReinParser(
input_path=cnrein_input,
output_path=cnrein_output,
)
cnrein_parser.run()
After running, the parser will read the input table, merge the Haplotype 1 and Haplotype 2 columns into a HAP_CN column, and save the standardized files to the output directory:
/output/cnrein/
├── haplotype_combined.csv
├── haplotype_1.csv
├── haplotype_2.csv
├── minor.csv
├── major.csv
└── minor_major.csv
haplotype_combined.csv: main CNA matrix (regions × cells).
Each value represents the combined haplotype copy number in the form "hap1|hap2".
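As a rough illustration of the "hap1|hap2" encoding, the sketch below splits one combined value into its parts. The helper name split_hap_cn is hypothetical and not part of hcbench, and treating minor and major as the smaller and larger of the two haplotype copy numbers is an assumption based on the output file names.

```python
def split_hap_cn(value: str) -> dict:
    """Split a combined "hap1|hap2" string into per-haplotype values.

    Assumption: minor/major are the smaller/larger haplotype copy number.
    """
    hap1_text, hap2_text = value.split("|")
    hap1, hap2 = int(hap1_text), int(hap2_text)
    return {
        "hap1": hap1,
        "hap2": hap2,
        "minor": min(hap1, hap2),
        "major": max(hap1, hap2),
    }


print(split_hap_cn("2|1"))
# {'hap1': 2, 'hap2': 1, 'minor': 1, 'major': 2}
```

The same split could be applied cell by cell when reading haplotype_combined.csv for downstream analysis.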
🧬 2. Parse the bin RDR matrix
Unlike parsers that extract counts from a text file, this parser reads RDR values and cell names from the NumPy .npz files that CNRein writes.
counts_file = "/demo_output/cnrein/binScale/filtered_RDR_avg.npz"
cells_file = "/demo_output/cnrein/initial/cellNames.npz"
cnrein_parser.get_bin_rdr(
counts_path=counts_file,
cell_name_path=cells_file
)
After running this command, the following standardized file will be created:
/output/cnrein/
└── bin_rdr.csv
🧬 3. Parse the VAF sparse matrices
This method expects vaf_file_dir to be a directory containing 22 separate chromosome VCF files named precisely as seperates_chr1.vcf.gz through seperates_chr22.vcf.gz.
cnrein_parser.get_VAF_matrix(
vaf_file_dir="/demo_output/cnrein/vcf_output/",
min_dp=3,
min_cells=10,
prefix="cellSNP"
)
After running this command, the following standardized directory and Matrix Market files will be created:
/output/cnrein/
├── VAF/
│ ├── cellSNP_AD.mtx
│ ├── cellSNP_DP.mtx
│ └── ...
⚙️ Initialization
CNReinParser(
input_path: str,
output_path: str,
chrom_col: str = "Chromosome",
start_col: str = "Start",
end_col: str = "End",
cell_col: str = "Cell barcode",
value_col: str = "HAP_CN",
start_offset: int = 0,
add_chr_prefix: bool = True,
**kwargs
)
input_path: Path to the CNRein prediction output table.
The required columns based on the default configuration are:
Chromosome, Start, End, Cell barcode, HAP_CN
output_path: Base output directory where the standardized matrices will be saved.
🧠 Core Methods
CNReinParser.run()
Executes the standard pipeline for the CNA matrix:
- Parses the table dynamically.
- Writes the reshaped region-by-cell matrices directly into output_path.
CNReinParser.get_bin_rdr(counts_path, cell_name_path, cna_path=None)
Creates a region-by-cell wide matrix of per-bin Read Depth Ratios (RDR).
Input
- counts_path: Path to an .npz file containing the counts array (expected shape: n_cells, n_bins).
- cell_name_path: Path to an .npz file containing the cell names array (expected shape: n_cells,).
- cna_path (optional): Path to the haplotype_combined.csv file generated by run(). If not provided, the parser looks for it in output_path. It is used to extract the correct region index.
The output is written to:
{self.output_path}/bin_rdr.csv
This is a wide matrix:
- rows: region
- columns: cells
- values: RDR values
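Because the input array is cell-major (n_cells, n_bins) while the output is region-major, the conversion amounts to a transpose with labels attached. The sketch below shows that reshaping with plain Python lists standing in for the loaded arrays. The function name to_region_by_cell and the region label format are hypothetical, and this is not the parser's actual implementation.

```python
import csv
import io


def to_region_by_cell(counts, cell_names, regions):
    """Transpose a cells x bins table into rows of [region, value per cell]."""
    if len(counts) != len(cell_names):
        raise ValueError("counts must have one row per cell")
    if any(len(row) != len(regions) for row in counts):
        raise ValueError("each counts row must have one value per region")
    header = ["region"] + list(cell_names)
    rows = [
        [region] + [counts[c][b] for c in range(len(cell_names))]
        for b, region in enumerate(regions)
    ]
    return header, rows


# Toy data: 2 cells x 3 bins (region labels are made up for illustration).
counts = [[1.0, 0.5, 1.2], [0.9, 1.1, 1.0]]
cell_names = ["cellA", "cellB"]
regions = ["chr1:0-1000", "chr1:1000-2000", "chr1:2000-3000"]

header, rows = to_region_by_cell(counts, cell_names, regions)

# Write the wide matrix as CSV text, one row per region.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue())
```

The length checks mirror the expected shapes listed above: one counts row per cell name, and one value per region in each row.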
CNReinParser.get_VAF_matrix(vaf_file_dir, output_path=None, min_dp=1, min_cells=1, prefix="cellSNP")
Converts split chromosome VCF files into sparse matrix outputs (AD and DP).
Input format
vaf_file_dir must be a directory containing exactly 22 gzipped VCF files named:
- seperates_chr1.vcf.gz
- seperates_chr2.vcf.gz
- ...
- seperates_chr22.vcf.gz
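Since the file names must match exactly, it can help to check the directory before calling get_VAF_matrix. The following is a minimal sketch of such a check. The helper missing_vcf_files is hypothetical and not part of hcbench.

```python
from pathlib import Path


def missing_vcf_files(vaf_file_dir: str) -> list:
    """Return the expected per-chromosome file names absent from the directory."""
    directory = Path(vaf_file_dir)
    # File names follow the spelling required by the parser (chr1 to chr22).
    expected = [f"seperates_chr{n}.vcf.gz" for n in range(1, 23)]
    return [name for name in expected if not (directory / name).is_file()]
```

An empty list means all 22 files are present. Any returned names point to files that need to be added or renamed.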
Parameters
output_path (optional)
- If provided, outputs are written under {output_path}/VAF; otherwise under {self.output_path}/VAF.
min_dp
- Filter low depth sites.
min_cells
- Filter sites supported by too few cells.
prefix
- Output file prefix (default: cellSNP).
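To make the two thresholds concrete, here is one plausible reading of how they could combine: a site is kept when at least min_cells cells have depth of at least min_dp. This interpretation is an assumption, and the actual filtering in hcbench may differ (for example, min_dp could apply to total depth across cells). The function filter_sites and the dictionary layout are hypothetical.

```python
def filter_sites(dp_by_site: dict, min_dp: int = 1, min_cells: int = 1) -> list:
    """Keep sites whose depth reaches min_dp in at least min_cells cells.

    Assumption: dp_by_site maps a site label to {cell: depth}; this layout
    and the combination rule are illustrative, not taken from hcbench.
    """
    kept = []
    for site, depths in dp_by_site.items():
        supporting_cells = sum(1 for depth in depths.values() if depth >= min_dp)
        if supporting_cells >= min_cells:
            kept.append(site)
    return kept


dp = {
    "chr1:1000": {"cellA": 5, "cellB": 0, "cellC": 4},
    "chr1:2000": {"cellA": 1, "cellB": 0, "cellC": 0},
}
print(filter_sites(dp, min_dp=3, min_cells=2))
# ['chr1:1000']
```

Under this reading, raising either threshold shrinks the set of sites written to the AD and DP matrices.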
Output
Creates a VAF/ directory containing the Matrix Market (.mtx) files.