
CNRein Parser

This module provides CNReinParser, specialized for parsing CNRein outputs and exporting standardized matrices and helper files used by hcbench workflows.

Key features:

  • Read CNRein prediction tables and merge separate haplotype columns into a unified haplotype-level CNA matrix
  • Optional region splitting by a user-defined bin_size
  • Extract bin-level Read Depth Ratio (RDR) from .npz files into a standardized bin_rdr.csv
  • Convert a directory of separate chromosome VCF files into sparse VAF matrix outputs

🚀 Quick Start

🧬 1. Parse the CNA Matrix

Python

from hcbench.parsers.cnrein import CNReinParser

cnrein_input = "/demo_output/cnrein/finalPrediction/CNReinPrediction.csv"
cnrein_output = "/output/cnrein/"

cnrein_parser = CNReinParser(
    input_path=cnrein_input,
    output_path=cnrein_output,
)

cnrein_parser.run()

After running, the parser will read the input table, merge the Haplotype 1 and Haplotype 2 columns into a HAP_CN column, and save the standardized files to the output directory:

Plaintext

/output/cnrein/
├── haplotype_combined.csv
├── haplotype_1.csv       
├── haplotype_2.csv       
├── minor.csv             
├── major.csv             
└── minor_major.csv       
  • haplotype_combined.csv: main CNA matrix (regions × cells).

Each value represents the combined haplotype copy number in the form "hap1|hap2".
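
The merge-and-reshape step described above can be pictured with a small pandas sketch. This is an illustration under stated assumptions, not the parser's actual code: the region label format (chr<chrom>:<start>-<end>) and the toy values are hypothetical, and the real index format may differ.

```python
import pandas as pd

# Toy long-format table; all values are made up for illustration.
long_df = pd.DataFrame({
    "Chromosome": ["1", "1", "1", "1"],
    "Start": [0, 0, 1000, 1000],
    "End": [1000, 1000, 2000, 2000],
    "Cell barcode": ["cellA", "cellB", "cellA", "cellB"],
    "Haplotype 1": [1, 2, 1, 0],
    "Haplotype 2": [1, 1, 2, 1],
})

# Merge the two haplotype columns into a single "hap1|hap2" string.
long_df["HAP_CN"] = (
    long_df["Haplotype 1"].astype(str) + "|" + long_df["Haplotype 2"].astype(str)
)

# Assumed region label; the parser's real index format may differ.
long_df["region"] = (
    "chr" + long_df["Chromosome"] + ":"
    + long_df["Start"].astype(str) + "-" + long_df["End"].astype(str)
)

# Reshape from long format to a regions x cells matrix.
wide = long_df.pivot(index="region", columns="Cell barcode", values="HAP_CN")
print(wide)
```

Each cell of the resulting table holds one "hap1|hap2" string, which is the layout haplotype_combined.csv is described as having.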


🧬 2. Parse the Bin RDR Matrix

Unlike tools whose counts are read from a text file, CNRein stores the RDR values and the cell names in NumPy .npz archives.

counts_file = "/demo_output/cnrein/binScale/filtered_RDR_avg.npz"
cells_file = "/demo_output/cnrein/initial/cellNames.npz"

cnrein_parser.get_bin_rdr(
    counts_path=counts_file, 
    cell_name_path=cells_file
)

After running this command, the following standardized file will be created:

Plaintext

/output/cnrein/
└── bin_rdr.csv

🧬 3. Parse the VAF Sparse Matrices

This method expects vaf_file_dir to be a directory containing 22 separate chromosome VCF files named precisely as seperates_chr1.vcf.gz through seperates_chr22.vcf.gz.

cnrein_parser.get_VAF_matrix(
    vaf_file_dir="/demo_output/cnrein/vcf_output/",
    min_dp=3,
    min_cells=10,
    prefix="cellSNP"
)

After running this command, the following standardized directory and Matrix Market files will be created:

Plaintext

/output/cnrein/
├── VAF/
│   ├── cellSNP_AD.mtx
│   ├── cellSNP_DP.mtx
│   └── ...

⚙️ Initialization

CNReinParser(
    input_path: str,
    output_path: str,
    chrom_col: str = "Chromosome",
    start_col: str = "Start",
    end_col: str = "End",
    cell_col: str = "Cell barcode",
    value_col: str = "HAP_CN",
    start_offset: int = 0,
    add_chr_prefix: bool = True,
    **kwargs
)

input_path: Path to the CNRein prediction output table.

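With the default configuration, the following columns are required:

Chromosome, Start, End, Cell barcode, HAP_CN

As described in the Quick Start, HAP_CN is the column the parser builds by merging the Haplotype 1 and Haplotype 2 columns.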

output_path: Base output directory where the standardized matrices will be saved.

🧠 Core Methods

CNReinParser.run()

Executes the standard pipeline for the CNA matrix:

  • Reads the prediction table at input_path and merges the Haplotype 1 and Haplotype 2 columns into HAP_CN.
  • Writes the reshaped region-by-cell matrices directly into output_path.

CNReinParser.get_bin_rdr(counts_path, cell_name_path, cna_path=None)

Creates a region-by-cell wide matrix of per-bin Read Depth Ratios (RDR).

Input

  • counts_path: Path to an .npz file containing the RDR array (expected shape: (n_cells, n_bins)).
  • cell_name_path: Path to an .npz file containing the cell names array (expected shape: (n_cells,)).
  • cna_path (optional): Path to the haplotype_combined.csv file generated by run(). If not provided, the parser looks for it in output_path. The file is used to extract the correct region index.

Output writes to:

  • {self.output_path}/bin_rdr.csv

This is a wide matrix:

  • rows: region
  • columns: cells
  • values: RDR
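
To make the orientation concrete, here is a minimal sketch of turning a (n_cells, n_bins) array into a region-by-cell table. It is not the parser's implementation: the assumption that each .npz archive holds a single array, the helper first_array, and the region labels are all hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

tmp = tempfile.mkdtemp()
counts_path = os.path.join(tmp, "filtered_RDR_avg.npz")
cells_path = os.path.join(tmp, "cellNames.npz")

# Toy inputs: 2 cells x 3 bins. Positional arrays are stored under "arr_0".
np.savez(counts_path, np.array([[1.0, 0.5, 1.5], [0.9, 1.1, 1.0]]))
np.savez(cells_path, np.array(["cellA", "cellB"]))


def first_array(path):
    """Return the first array in an .npz archive (assumed to be the only one)."""
    with np.load(path) as archive:
        return archive[archive.files[0]]


rdr = first_array(counts_path)    # shape: (n_cells, n_bins)
cells = first_array(cells_path)   # shape: (n_cells,)

# Hypothetical region labels; the parser takes them from the CNA matrix index.
regions = ["chr1:0-1000", "chr1:1000-2000", "chr1:2000-3000"]

# Transpose so that rows are regions and columns are cells.
bin_rdr = pd.DataFrame(rdr.T, index=regions, columns=cells)
bin_rdr.index.name = "region"
print(bin_rdr)
```

The transpose is the key step: the input has one row per cell, while bin_rdr.csv has one row per region.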

CNReinParser.get_VAF_matrix(vaf_file_dir, output_path=None, min_dp=1, min_cells=1, prefix="cellSNP")

Converts split chromosome VCF files into sparse matrix outputs (AD and DP).

Input format

vaf_file_dir must be a directory containing exactly 22 gzipped VCF files named:

  • seperates_chr1.vcf.gz
  • seperates_chr2.vcf.gz
  • ...
  • seperates_chr22.vcf.gz
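
Because the file names must match exactly, it can help to check the directory before calling get_VAF_matrix. The helper functions below are hypothetical and not part of hcbench; they only use the naming pattern listed above.

```python
from pathlib import Path
import tempfile


def expected_vcf_names():
    """File names for chromosomes 1-22, using the spelling the parser expects."""
    return [f"seperates_chr{i}.vcf.gz" for i in range(1, 23)]


def missing_vcfs(vaf_file_dir):
    """Return the expected file names that are absent from vaf_file_dir."""
    directory = Path(vaf_file_dir)
    return [name for name in expected_vcf_names() if not (directory / name).is_file()]


# Demo on a temporary directory that holds only chromosomes 1-21.
demo_dir = Path(tempfile.mkdtemp())
for name in expected_vcf_names()[:-1]:
    (demo_dir / name).touch()

print(missing_vcfs(demo_dir))  # ['seperates_chr22.vcf.gz']
```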

Parameters

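output_path (optional)

  • If provided, files are written under {output_path}/VAF; otherwise they are written under {self.output_path}/VAF.

min_dp

  • Minimum read depth; sites with lower depth are filtered out (default: 1).

min_cells

  • Minimum number of supporting cells; sites supported by fewer cells are filtered out (default: 1).

prefix

  • Output file prefix (default: cellSNP).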

Output

Creates a VAF/ directory containing the Matrix Market (.mtx) files.
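
The sketch below shows how depth filtering and Matrix Market output could fit together. It is an illustration, not hcbench's code: the filter rule (keep a site when at least min_cells cells reach min_dp reads) is one plausible reading of the two parameters, and the sites-by-cells orientation and toy counts are assumptions. It relies on SciPy's mmwrite and mmread.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from scipy.io import mmread, mmwrite

# Toy per-site counts (rows: SNP sites, columns: cells); values are made up.
dp = np.array([[5, 0, 4],
               [1, 2, 0],
               [3, 6, 3]])   # total read depth
ad = np.array([[2, 0, 1],
               [0, 1, 0],
               [1, 3, 3]])   # alternate-allele depth

min_dp, min_cells = 3, 2

# Assumed filter: keep a site when at least min_cells cells reach min_dp reads.
keep = (dp >= min_dp).sum(axis=1) >= min_cells

out_dir = os.path.join(tempfile.mkdtemp(), "VAF")
os.makedirs(out_dir)

# Sparse storage keeps only the non-zero entries of each matrix.
mmwrite(os.path.join(out_dir, "cellSNP_AD.mtx"), sparse.csr_matrix(ad[keep]))
mmwrite(os.path.join(out_dir, "cellSNP_DP.mtx"), sparse.csr_matrix(dp[keep]))

# Read one file back to confirm the round trip.
dp_back = mmread(os.path.join(out_dir, "cellSNP_DP.mtx")).toarray()
print(dp_back)
```

With AD and DP stored separately, the VAF of a site in a cell can be recovered downstream as AD / DP wherever DP is non-zero.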