🧩 Script Parameter Reference: `001_prepareData.py`

This script prepares datasets for neural-network-based classification of scRNA-seq data. It filters genes, normalizes the data, generates reference profiles, and splits the dataset into training and testing sets.

It is typically the first step in the pipeline and should be run before `002_geneContribution.py`.

🔧 Parameters

`-i`, `--infile` (Required)

Type: str
Description:
Path to the input data file.
The input file should be a tab-delimited matrix where:
- First column = Cell IDs
- Second column = Cell type / cluster labels
- Remaining columns = Gene expression values
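The layout above can be checked before running the script. The following is a minimal sketch, not part of `001_prepareData.py`: the function name `validate_matrix`, the column names, and the presence of a header row are all assumptions made for illustration (the source does not say whether the file has a header, so it is a switchable option here).

```python
import csv
import io


def validate_matrix(handle, has_header=True):
    """Check the expected layout: cell ID, label, then numeric expression values.

    `has_header` is an assumption; the documentation does not state whether
    the input file carries a header row.
    """
    rows = list(csv.reader(handle, delimiter="\t"))
    if has_header:
        n_cols = len(rows[0])
        rows = rows[1:]
    else:
        n_cols = len(rows[0])
    if n_cols < 3:
        raise ValueError("expected at least 3 columns (ID, label, one gene)")

    first_line = 2 if has_header else 1
    labels = set()
    for line_no, row in enumerate(rows, start=first_line):
        if len(row) != n_cols:
            raise ValueError(
                f"line {line_no}: expected {n_cols} columns, got {len(row)}"
            )
        for value in row[2:]:
            float(value)  # raises ValueError if a value is not numeric
        labels.add(row[1])
    # Number of cells, number of genes, and the distinct group labels.
    return len(rows), n_cols - 2, sorted(labels)


# Hypothetical two-cell, two-gene example in the documented layout.
example = (
    "cell_id\tcell_type\tGeneA\tGeneB\n"
    "cell_1\tT_cell\t5\t0\n"
    "cell_2\tB_cell\t0\t12\n"
)
n_cells, n_genes, labels = validate_matrix(io.StringIO(example))
print(n_cells, n_genes, labels)
```

For a real file, pass an open file handle instead of the `io.StringIO` object.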

`-o`, `--outDIR` (Required)

Type: str
Description:
Output directory where the processed files will be saved.

`-t`, `--file_type`

Type: str
Choices: "count", "normalized"
Default: "count"
Description:
Type of expression matrix.
- "count": Raw count matrix (will be normalized to CPM)
- "normalized": Already normalized/log-transformed data
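To illustrate what CPM normalization means for a `"count"` matrix, here is a minimal sketch for a single cell. The function names are hypothetical, and the `log2(CPM + 1)` transform is an assumption: the source says count data is log-transformed but does not state the base or pseudocount the script uses.

```python
import math


def counts_to_cpm(counts):
    """Scale one cell's raw counts so that they sum to one million."""
    total = sum(counts)
    if total == 0:
        # A cell with no counts stays all-zero rather than dividing by zero.
        return [0.0 for _ in counts]
    return [c / total * 1_000_000 for c in counts]


def log_transform(values):
    """Assumed transform: log2(x + 1); the script's actual choice may differ."""
    return [math.log2(v + 1) for v in values]


cell_counts = [5, 0, 15]          # raw counts for three genes in one cell
cpm = counts_to_cpm(cell_counts)  # [250000.0, 0.0, 750000.0]
log_cpm = log_transform(cpm)
print(cpm)
```

Because each cell is scaled by its own total, CPM values are comparable across cells with different sequencing depths.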

🧬 Core Workflow

This script will:
- Filter genes based on expression (e.g., max value > 0.01 and group specificity)
- Normalize raw count matrix to CPM if needed
- Generate artificial reference data for training stability
- Format expression values (log-transformed if count)
- Create files:
  - train.dat, test.dat, pool.dat: For training/evaluation
  - universal.dat: Simulated reference profiles
  - expression_by_group_gene.npy: Gene expression by group
  - orderedGeneList.dat: Filtered gene list
  - cellCode.dat: Mapping of group labels to numeric codes
  - sampleList.pool.dat: List of all sample IDs
  - meta.dat: Dataset summary (e.g., number of genes, references, training samples)
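The gene-filtering step can be pictured with a small sketch. This is one possible reading, not the script's actual code: it assumes "max value" means the largest per-group mean expression of a gene, and it leaves out the group-specificity criterion because the source does not define it. The names `group_means` and `filter_genes` are hypothetical.

```python
def group_means(labels, matrix):
    """Average each gene's expression within each group label."""
    sums = {}
    counts = {}
    for label, row in zip(labels, matrix):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}


def filter_genes(genes, labels, matrix, min_max=0.01):
    """Keep genes whose highest group mean exceeds `min_max` (assumed rule)."""
    means = group_means(labels, matrix)
    kept = []
    for j, gene in enumerate(genes):
        if max(m[j] for m in means.values()) > min_max:
            kept.append(gene)
    return kept


# Hypothetical data: three cells, three genes, two groups.
genes = ["GeneA", "GeneB", "GeneC"]
labels = ["T_cell", "T_cell", "B_cell"]
matrix = [
    [2.0, 0.0, 0.00],
    [4.0, 0.0, 0.01],
    [0.0, 0.0, 0.00],
]
kept = filter_genes(genes, labels, matrix)
print(kept)
```

Here only `GeneA` survives: `GeneB` is never expressed, and the highest group mean of `GeneC` is 0.005, which is below the 0.01 threshold.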

✅ Example Usage

python 001_prepareData.py \
  -i data/scRNA_expression_matrix \
  -o results/output_dir/ \
  -t count

Need help? Contact the developer at zhao_yongbing@gibh.ac.cn.
