🧩 Script Parameter Reference: 001_prepareData.py

This script is responsible for preparing datasets for neural network-based classification on scRNA-seq data. It performs gene filtering, data normalization, reference generation, and splits the dataset into training/testing sets.

It is typically the first step in the pipeline and should be executed before 002_geneContribution.py.


🔧 Parameters

-i, --infile (Required)

Type: str
Description:
Path to the input data file.
The input file should be a tab-delimited matrix where:
- First column = Cell IDs
- Second column = Cell type / cluster labels
- Remaining columns = Gene expression values
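
As a rough illustration of the layout described above, the following sketch parses such a matrix using only the Python standard library. It is not taken from 001_prepareData.py: the function name `read_expression_matrix` is hypothetical, and the presence of a header row carrying gene names is an assumption that the description above does not confirm.

```python
import csv
import io

def read_expression_matrix(handle, has_header=True):
    """Parse a tab-delimited matrix: cell ID, group label, then one value per gene.

    The header row (has_header=True) is an assumption about the file layout.
    """
    reader = csv.reader(handle, delimiter="\t")
    genes = None
    if has_header:
        header = next(reader)
        genes = header[2:]  # gene names follow the two metadata columns
    cell_ids, labels, values = [], [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        cell_ids.append(row[0])                      # first column: cell ID
        labels.append(row[1])                        # second column: cell type / cluster
        values.append([float(x) for x in row[2:]])   # remaining columns: expression
    return genes, cell_ids, labels, values

# Tiny in-memory example with two cells and two genes.
example = "cell\tgroup\tGeneA\tGeneB\nc1\tT_cell\t3\t0\nc2\tB_cell\t0\t5\n"
genes, cell_ids, labels, values = read_expression_matrix(io.StringIO(example))
```

For a real file, the same function could be called with `open(path, newline="")` in place of the `io.StringIO` object.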


-o, --outDIR (Required)

Type: str
Description:
Output directory where the processed files will be saved.


-t, --file_type

Type: str
Choices: "count", "normalized"
Default: "count"
Description:
Type of expression matrix.
- "count": Raw count matrix (will be normalized to CPM)
- "normalized": Already normalized/log-transformed data
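
To make the "count" option concrete, here is a minimal sketch of counts-per-million (CPM) scaling followed by a log transform. The helper names are hypothetical, and the log base (2) and pseudocount (1) are assumptions; the script's actual transform may differ.

```python
import math

def counts_to_cpm(counts):
    """Scale one cell's raw counts so that they sum to one million."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # avoid division by zero for empty cells
    return [c * 1_000_000 / total for c in counts]

def log_transform(values, pseudocount=1.0):
    """Apply log2(x + pseudocount); base and pseudocount are assumptions."""
    return [math.log2(v + pseudocount) for v in values]

# One cell with three genes and 100 total counts.
cpm = counts_to_cpm([10, 30, 60])
log_cpm = log_transform(cpm)
```

With "normalized" input, neither step would apply, since the values are taken as already normalized or log-transformed.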


🧬 Core Workflow

This script will:
- Filter genes based on expression (e.g., max value > 0.01 and group specificity)
- Normalize raw count matrix to CPM if needed
- Generate artificial reference data for training stability
- Format expression values (log-transformed if count)

It creates the following files:
- train.dat, test.dat, pool.dat: For training/evaluation
- universal.dat: Simulated reference profiles
- expression_by_group_gene.npy: Gene expression by group
- orderedGeneList.dat: Filtered gene list
- cellCode.dat: Mapping of group labels to numeric codes
- sampleList.pool.dat: List of all sample IDs
- meta.dat: Dataset summary (e.g., number of genes, references, training samples)
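
Three of the steps above (the maximum-value gene filter, the label-to-code mapping, and the training/testing split) can be sketched as follows. This is an illustrative approximation only, not the script's implementation: all function names are hypothetical, the group-specificity filter is omitted, and the sorted label order, the 20% test fraction, and the random seed are assumptions.

```python
import random

def filter_genes(genes, values, min_max=0.01):
    """Keep genes whose maximum value across all cells exceeds min_max."""
    keep = [j for j in range(len(genes))
            if max(row[j] for row in values) > min_max]
    kept_genes = [genes[j] for j in keep]
    kept_values = [[row[j] for j in keep] for row in values]
    return kept_genes, kept_values

def encode_labels(labels):
    """Map each distinct group label to an integer code (sorted order is an assumption)."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

def split_indices(n_cells, test_fraction=0.2, seed=0):
    """Shuffle cell indices and split them; fraction and seed are hypothetical."""
    idx = list(range(n_cells))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_cells * test_fraction))
    return idx[n_test:], idx[:n_test]  # (train indices, test indices)

genes = ["GeneA", "GeneB", "GeneC"]
values = [[2.0, 0.0, 0.005], [0.0, 0.0, 0.001], [1.5, 0.0, 0.0]]
labels = ["T_cell", "B_cell", "T_cell"]

kept_genes, kept_values = filter_genes(genes, values)  # only GeneA passes 0.01
codes = encode_labels(labels)
train_idx, test_idx = split_indices(10)
```

In this toy input, GeneB is never expressed and GeneC never exceeds the 0.01 threshold, so only GeneA survives the filter.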


✅ Example Usage

python 001_prepareData.py \
  -i data/scRNA_expression_matrix \
  -o results/output_dir/ \
  -t count

Need help? Contact the developer at zhao_yongbing@gibh.ac.cn.