🧩 Script Parameter Reference: 001_prepareData.py
This script prepares datasets for neural network-based classification of scRNA-seq data. It filters genes, normalizes the data, generates reference profiles, and splits the dataset into training and testing sets.
It is typically the first step in the pipeline and should be executed before 002_geneContribution.py.
🔧 Parameters
-i, --infile (Required)
Type: str
Description:
Path to the input data file.
The input file should be a tab-delimited matrix where:
- First column = Cell IDs
- Second column = Cell type / cluster labels
- Remaining columns = Gene expression values
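The layout above can be checked with a short parser. This is a minimal sketch, not the script's own loader: the function name `read_expression_matrix` is hypothetical, and it assumes the first row is a header naming the columns, which this reference does not state.

```python
import csv
import io


def read_expression_matrix(handle):
    """Parse a tab-delimited matrix: cell ID, label, then one column per gene.

    Assumption: the first row is a header whose third and later fields
    are gene names. The reference above does not confirm this.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    genes = header[2:]
    cell_ids, labels, values = [], [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} fields, got {len(row)}")
        cell_ids.append(row[0])                      # first column: cell ID
        labels.append(row[1])                        # second column: cell type / cluster
        values.append([float(x) for x in row[2:]])   # remaining columns: expression
    return genes, cell_ids, labels, values


# Tiny in-memory example with made-up cells and genes.
example = io.StringIO(
    "cell\tlabel\tGeneA\tGeneB\n"
    "c1\tT_cell\t3\t0\n"
    "c2\tB_cell\t0\t5\n"
)
genes, cell_ids, labels, values = read_expression_matrix(example)
```

Running a check like this before the script makes column-count mismatches visible early, since a ragged row raises a `ValueError` instead of being silently misread.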
-o, --outDIR (Required)
Type: str
Description:
Output directory where the processed files will be saved.
-t, --file_type
Type: str
Choices: "count", "normalized"
Default: "count"
Description:
Type of expression matrix.
- "count": Raw count matrix (will be normalized to CPM)
- "normalized": Already normalized/log-transformed data
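For orientation, the three options documented above could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the parameter descriptions, not the script's actual source; the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Declare -i/--infile, -o/--outDIR and -t/--file_type as documented above."""
    parser = argparse.ArgumentParser(
        description="Prepare scRNA-seq data for classification (illustrative sketch)."
    )
    parser.add_argument("-i", "--infile", type=str, required=True,
                        help="Path to the tab-delimited input matrix.")
    parser.add_argument("-o", "--outDIR", type=str, required=True,
                        help="Directory where processed files are written.")
    parser.add_argument("-t", "--file_type", type=str,
                        choices=["count", "normalized"], default="count",
                        help="Type of expression matrix.")
    return parser


# Parse an explicit argument list; -t is omitted, so the default applies.
args = build_parser().parse_args(
    ["-i", "data/scRNA_expression_matrix", "-o", "results/output_dir/"]
)
```

With this declaration, omitting `-t` yields `args.file_type == "count"`, and any value other than `"count"` or `"normalized"` is rejected by `argparse` before the script does any work.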
🧬 Core Workflow
This script will:
- Filter genes based on expression (e.g., max value > 0.01 and group specificity)
- Normalize the raw count matrix to CPM (counts per million) when the file type is "count"
- Generate artificial reference data for training stability
- Format expression values (log-transformed when the input is a count matrix)
- Create files:
  - train.dat, test.dat, pool.dat: For training/evaluation
  - universal.dat: Simulated reference profiles
  - expression_by_group_gene.npy: Gene expression by group
  - orderedGeneList.dat: Filtered gene list
  - cellCode.dat: Mapping of group labels to numeric codes
  - sampleList.pool.dat: List of all sample IDs
  - meta.dat: Dataset summary (e.g., number of genes, references, training samples)
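Some of the steps above can be illustrated in a few lines of plain Python. This is a sketch under stated assumptions, not the script's implementation: all function names are hypothetical, the log transform is assumed to be log2(x + 1), the 0.01 threshold is assumed to apply to CPM values, and the order of steps is a guess. Group-specificity filtering, reference generation, and the train/test split are not shown.

```python
import math


def counts_to_cpm(counts):
    """Scale one cell's raw counts so they sum to one million."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # avoid division by zero for empty cells
    return [c / total * 1_000_000 for c in counts]


def log_transform(values):
    """Apply log2(x + 1). The base and pseudocount are assumptions."""
    return [math.log2(v + 1.0) for v in values]


def filter_genes(matrix, genes, min_max=0.01):
    """Keep genes whose maximum value across all cells exceeds min_max."""
    keep = [j for j in range(len(genes))
            if max(row[j] for row in matrix) > min_max]
    kept_genes = [genes[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_genes, kept_matrix


def encode_labels(labels):
    """Map each distinct group label to an integer code (sorted order assumed)."""
    codes = {label: i for i, label in enumerate(sorted(set(labels)))}
    return codes, [codes[label] for label in labels]


# Made-up data: three cells, three genes; GeneB is never expressed.
genes = ["GeneA", "GeneB", "GeneC"]
labels = ["T_cell", "B_cell", "T_cell"]
counts = [[30, 0, 10], [0, 0, 50], [20, 0, 20]]

cpm = [counts_to_cpm(row) for row in counts]
kept_genes, kept_cpm = filter_genes(cpm, genes)
log_expr = [log_transform(row) for row in kept_cpm]
label_codes, coded_labels = encode_labels(labels)
```

In this toy example GeneB is dropped because its maximum is 0, and the label mapping plays the role that cellCode.dat is described as holding: a lookup from group label to numeric code.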
✅ Example Usage
python 001_prepareData.py \
-i data/scRNA_expression_matrix \
-o results/output_dir/ \
-t count
Need help? Contact the developer at zhao_yongbing@gibh.ac.cn.