SEACON Parser

This module provides SeaconParser, a parser that reads SEACON outputs and exports the standardized matrices and helper files used by hcbench workflows.

Key features:

Read SEACON CNA tables and automatically preprocess copy number values by replacing commas with pipes (|).
Generate minor.csv, major.csv, and combined minor_major.csv outputs by default.
Export bin-level count matrices as bin_counts.csv, dynamically mapped to the standard regions extracted from the CNA output.
Convert a headerless, tab-separated VAF long table into sparse matrix outputs.

🚀 Quick Start

🧬 1. Parse the CNA Matrix

from hcbench.parsers.seacon import SeaconParser

seacon_input = "/demo_output/seacon/cna_output.tsv"
seacon_output = "/output/seacon/"

seacon_parser = SeaconParser(
    input_path=seacon_input, 
    output_path=seacon_output,
)
seacon_parser.run()

When run, the parser reads the input file, splits the haplotypes, and saves the standardized files to the output directory, which typically contains:

/output/seacon/
├── minor.csv             
├── major.csv             
└── minor_major.csv

minor_major.csv — combined CNA matrix (regions × cells).

Each value represents the combined haplotype copy number.

🧬 2. Parse the Bin Counts Matrix

⚠️ Important: This method requires minor_major.csv to already exist in your output directory, because the region index is extracted from it. Run seacon_parser.run() first.

counts_file = "/demo_output/seacon/counts.tsv"

seacon_parser.get_bin_counts(counts_path=counts_file)

After running this command, the following standardized file will be created:

/output/seacon/
└── bin_counts.csv

🧬 3. Parse the VAF Sparse Matrices

This method expects a tab-separated VAF file with no header.

seacon_parser.get_VAF_matrix(
    vaf_file_path="/demo_output/seacon/vaf.tsv",
    min_dp=3,
    min_cells=10,
    prefix="cellSNP"
)

After running this command, the following standardized directory and Matrix Market files will be created:

/output/seacon/
├── VAF/
│   ├── cellSNP_AD.mtx
│   ├── cellSNP_DP.mtx
│   └── ...

⚙️ Initialization

SeaconParser(
    input_path: str,
    output_path: str,
    chrom_col: str = "chrom",
    start_col: str = "start",
    end_col: str = "end",
    cell_col: str = "cell",
    value_col: str = "CN",
    start_offset: int = 0,
    add_chr_prefix: bool = False,
    output_haplotype: bool = False,
    **kwargs
)

input_path: Path to the SEACON CNA output table.

The required columns based on the default configuration are:

chrom, start, end, cell, CN

output_path: Base output directory where the standardized matrices will be saved.
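
Before constructing the parser, it can help to confirm that the input table actually carries the required columns. The following is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper written for this illustration and is not part of hcbench, and the in-memory table is made-up demo data.

```python
import csv
import io

# Default column names taken from the SeaconParser signature above.
REQUIRED_COLUMNS = {"chrom", "start", "end", "cell", "CN"}


def missing_columns(handle, required=REQUIRED_COLUMNS, delimiter="\t"):
    """Return the required column names that are absent from the header row."""
    header = next(csv.reader(handle, delimiter=delimiter), [])
    return sorted(set(required) - set(header))


# In-memory stand-in for a SEACON CNA table (hypothetical contents).
demo = io.StringIO("cell\tchrom\tstart\tend\tCN\nc1\tchr1\t1\t100\t0,2\n")
print(missing_columns(demo))  # []
```

If your table uses different column names, pass them through chrom_col, start_col, end_col, cell_col, and value_col, and adjust the `required` set in the check accordingly.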

🧠 Core Methods

`SeaconParser.run()`

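Executes the standard pipeline for the CNA matrix: it reads the input table, converts CN values from "x,y" to "x|y", and writes minor.csv, major.csv, and minor_major.csv to the output directory.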

`SeaconParser.get_bin_counts(counts_path)`

Creates a region-by-cell wide matrix of per-bin counts.

Input

counts_path: Path to the counts table. The parser expects a table with cells in rows, so it transposes the dataframe automatically.

Output is written to:

{self.output_path}/bin_counts.csv.

This is a wide matrix:

rows: region (dynamically matched from minor_major.csv).
columns: cells.
values: counts.
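
The transposition described above can be sketched in plain Python. This is an illustration only, not the parser's implementation: it assumes each cell row lists its bins in the same order as the region index, and both `counts_to_region_matrix` and the demo data are hypothetical.

```python
def counts_to_region_matrix(cell_rows, regions):
    """Transpose {cell: [count, ...]} into {region: {cell: count}}.

    Assumes every cell row lists its bins in the same order as `regions`.
    """
    for cell, counts in cell_rows.items():
        if len(counts) != len(regions):
            raise ValueError(
                f"{cell}: {len(counts)} bins but {len(regions)} regions"
            )
    return {
        region: {cell: counts[i] for cell, counts in cell_rows.items()}
        for i, region in enumerate(regions)
    }


# Hypothetical inputs: two cells, two bins, and region labels as they might
# appear in the index of minor_major.csv (the label format is an assumption).
regions = ["chr1:5000001-10000000", "chr1:15000001-20000000"]
cell_rows = {"cell_a": [12, 7], "cell_b": [9, 0]}
matrix = counts_to_region_matrix(cell_rows, regions)
print(matrix["chr1:15000001-20000000"])  # {'cell_a': 7, 'cell_b': 0}
```

The length check fails early when the counts table and the region index disagree, which is the situation the minor_major.csv prerequisite is meant to prevent.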

`SeaconParser.get_VAF_matrix(vaf_file_path, output_path=None, min_dp=1, min_cells=1, prefix="cellSNP")`

Converts a tab-separated VAF table into sparse matrix outputs using long_to_mtx().

Input format

vaf_file_path must be a tab-separated file with no header and exactly 5 columns. The parser automatically maps them to:

column index	meaning
0	`chr`
1	`position`
2	`cell`
3	`Acount`
4	`Bcount`
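
To make the expected layout concrete, here is a minimal sketch that reads such a headerless table with the standard library and rejects rows that do not have exactly 5 columns. `VafRecord` and `read_vaf` are hypothetical names used for illustration; they are not part of hcbench, and the demo rows are invented.

```python
import csv
import io
from typing import NamedTuple


class VafRecord(NamedTuple):
    # Field names follow the column mapping in the table above.
    chr: str
    position: int
    cell: str
    Acount: int
    Bcount: int


def read_vaf(handle):
    """Parse a headerless, tab-separated, 5-column VAF table."""
    records = []
    for line_no, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not fields:  # skip blank lines
            continue
        if len(fields) != 5:
            raise ValueError(f"line {line_no}: expected 5 columns, got {len(fields)}")
        chrom, position, cell, a_count, b_count = fields
        records.append(
            VafRecord(chrom, int(position), cell, int(a_count), int(b_count))
        )
    return records


# Invented demo rows.
demo = io.StringIO("chr1\t12345\tcell_a\t3\t1\nchr1\t12345\tcell_b\t0\t4\n")
for record in read_vaf(demo):
    print(record.cell, record.Acount + record.Bcount)
```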

Parameters

output_path (optional)

If provided, outputs are written under {output_path}/VAF; otherwise they are written under {self.output_path}/VAF.

min_dp

Minimum read depth; sites below this depth are filtered out (default: 1).

min_cells

Minimum number of supporting cells; sites supported by fewer cells are filtered out (default: 1).

prefix

Output file prefix (default: cellSNP).

Output

Creates a VAF/ directory containing the Matrix Market (.mtx) files.
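
For readers unfamiliar with the Matrix Market coordinate format, the sketch below filters a small depth table and serializes it by hand. It does not reproduce long_to_mtx(): the filter semantics (drop entries below min_dp, then drop sites left with fewer than min_cells cells), the sites-by-cells orientation, and the helper names are all assumptions made for illustration.

```python
from collections import defaultdict


def filter_sites(depths, min_dp=1, min_cells=1):
    """Filter {(site, cell): depth} under the assumed semantics.

    1. Drop entries whose depth is below min_dp.
    2. Drop sites that are left with fewer than min_cells cells.
    """
    deep_enough = {key: d for key, d in depths.items() if d >= min_dp}
    support = defaultdict(int)
    for site, _cell in deep_enough:
        support[site] += 1
    return {
        key: d for key, d in deep_enough.items() if support[key[0]] >= min_cells
    }


def to_matrix_market(depths):
    """Serialize {(site, cell): depth} as a sites x cells coordinate matrix."""
    sites = sorted({site for site, _ in depths})
    cells = sorted({cell for _, cell in depths})
    site_idx = {site: i for i, site in enumerate(sites, start=1)}  # 1-based
    cell_idx = {cell: j for j, cell in enumerate(cells, start=1)}
    lines = [
        "%%MatrixMarket matrix coordinate integer general",
        f"{len(sites)} {len(cells)} {len(depths)}",
    ]
    for (site, cell), depth in sorted(depths.items()):
        lines.append(f"{site_idx[site]} {cell_idx[cell]} {depth}")
    return "\n".join(lines) + "\n"


# Invented depths keyed by ((chr, position), cell).
depths = {
    (("chr1", 100), "c1"): 5,
    (("chr1", 100), "c2"): 4,
    (("chr1", 200), "c1"): 1,
}
kept = filter_sites(depths, min_dp=3, min_cells=2)
print(to_matrix_market(kept), end="")
```

With these demo values only the site at chr1:100 survives, so the size line reads `1 2 2` and two entries follow.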

📂 Input Files

The output directory of SEACON typically includes a single main file containing copy number information for all cells:

demo_output/seacon/
└── calls.tsv

calls.tsv — the primary SEACON output file containing inferred copy number (CN) states per cell and genomic region.

An example of calls.tsv:

cell    chrom   start   end CN
clone9_cell6    chr1    5000001 10000000    0,2
clone9_cell6    chr1    15000001    20000000    0,2
clone9_cell6    chr1    20000001    25000000    0,2
clone9_cell6    chr1    25000001    30000000    0,2

The required columns are:

chrom, start, end, cell, CN

Each entry in the CN column may contain comma-separated allele-specific copy numbers (e.g. "0,2"). The parser automatically converts these values to a standardized "hap1|hap2" format (e.g. "0|2").
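
The conversion can be sketched with the standard library as follows. This is a simplified illustration rather than the parser's code: the region label format (`chrom:start-end`), the helper name `long_to_wide`, and the second demo cell are assumptions.

```python
import csv
import io
from collections import defaultdict


def long_to_wide(handle):
    """Pivot a long SEACON-style table into {region: {cell: "hap1|hap2"}}."""
    wide = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        # The region label format is an assumption made for this sketch.
        region = f"{row['chrom']}:{row['start']}-{row['end']}"
        wide[region][row["cell"]] = row["CN"].replace(",", "|")
    return dict(wide)


# First row mirrors the calls.tsv example; the second cell is invented.
demo = io.StringIO(
    "cell\tchrom\tstart\tend\tCN\n"
    "clone9_cell6\tchr1\t5000001\t10000000\t0,2\n"
    "clone9_cell7\tchr1\t5000001\t10000000\t1,1\n"
)
matrix = long_to_wide(demo)
print(matrix["chr1:5000001-10000000"])
# {'clone9_cell6': '0|2', 'clone9_cell7': '1|1'}
```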

📤 Output Files

After parsing, the following file is generated in the specified output directory:

haplotype_combined.csv

haplotype_combined.csv — standardized CNA matrix (regions × cells). Each value represents the haplotype-level copy number in the format "hap1|hap2".

If split_haplotype=True is enabled in the base class, the parser will also produce additional derived matrices:

haplotype_1.csv
haplotype_2.csv
minor.csv
major.csv
minor_major.csv
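
As a rough guide to how these derived matrices relate to the combined one, the sketch below splits a single "hap1|hap2" value. It assumes the usual convention that minor is the smaller and major the larger of the two haplotype copy numbers; the source does not spell this out, and both function names are hypothetical.

```python
def split_haplotypes(value):
    """Split "hap1|hap2" into (hap1, hap2), preserving haplotype order."""
    hap1, hap2 = value.split("|")
    return int(hap1), int(hap2)


def split_minor_major(value):
    """Split "hap1|hap2" into (minor, major), assuming minor <= major."""
    hap1, hap2 = split_haplotypes(value)
    return min(hap1, hap2), max(hap1, hap2)


print(split_haplotypes("2|0"))   # (2, 0)
print(split_minor_major("2|0"))  # (0, 2)
```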

⚙️ Key Parameters

Parameter	Description	Default
`chrom_col`	Column name for chromosome.	`"chrom"`
`start_col`	Column name for region start.	`"start"`
`end_col`	Column name for region end.	`"end"`
`cell_col`	Column name for cell ID.	`"cell"`
`value_col`	Column name for copy number value.	`"CN"`
`start_offset`	Integer offset added to region start coordinates (for example, 1 shifts starts by +1).	`0`
`add_chr_prefix`	Whether to enforce `"chr"` prefix on chromosome names.	`False`

🚀 Example Usage

from hcbench.parsers.seacon import SeaconParser

seacon_input = "/demo_output/seacon/calls.tsv"
seacon_output = "/output/seacon/"

seacon_parser = SeaconParser(
    input_path=seacon_input,
    output_path=seacon_output
)
seacon_parser.run()

When run, the parser reads calls.tsv, converts CN values from "x,y" to "x|y", and saves the standardized file haplotype_combined.csv in your output directory.