
scMarkerGene Tutorial

Welcome to the scMarkerGene tutorial! This guide walks you through the key functionalities of the pipeline, from installation to data preparation and quick execution.


🧠 scMarkerGene Workflow

workflow plot

In Step 1, the input scRNA-seq expression matrix is randomly split into training and validation sets (9:1) and used to train an MLP classifier supervised by cell type annotations. The training process is guided by validation accuracy, with hyperparameters iteratively optimized using Optuna. Once the optimal hyperparameter set is determined, ensemble learning is further applied to generate a more robust attribution score matrix. In Step 2, the top N genes (user-defined) with the highest contribution scores are selected as candidate marker sets for each cell type. These gene lists are then re-ranked by computing a marker score, yielding the final marker lists.
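The two selection operations described above, the random 9:1 split and the per-cell-type top-N pick, can be sketched in plain Python. This is only an illustration: the function names `split_train_val` and `top_n_genes` and the toy scores are assumptions made for this example, not part of the pipeline, and the MLP training, Optuna search, and attribution steps are omitted.

```python
import random

def split_train_val(cell_ids, val_fraction=0.1, seed=0):
    """Randomly split cell IDs into training and validation lists (9:1 by default)."""
    shuffled = list(cell_ids)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def top_n_genes(scores, n):
    """For each cell type, return the n genes with the highest attribution score."""
    return {
        cell_type: sorted(gene_scores, key=gene_scores.get, reverse=True)[:n]
        for cell_type, gene_scores in scores.items()
    }

# Toy attribution scores (made-up numbers, not pipeline output).
toy_scores = {
    "cell_type1": {"GeneA": 0.9, "GeneB": 0.1, "GeneC": 0.5},
    "cell_type2": {"GeneA": 0.2, "GeneB": 0.8, "GeneC": 0.3},
}

train_ids, val_ids = split_train_val([f"cell_{i:03d}" for i in range(1, 101)])
candidates = top_n_genes(toy_scores, n=2)
print(len(train_ids), len(val_ids))
print(candidates)
```

In the actual workflow the candidate lists are then re-ranked by marker score, which this sketch does not attempt to reproduce.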


πŸ› οΈ Installation

Make sure Mamba is installed on your Linux system.

Then, run the following commands:

mamba create -n scMarkerGene python=3.10
mamba activate scMarkerGene
mamba install pandas captum optuna scikit-learn

📥 Input Preparation

Your input single-cell RNA-seq expression matrix must be provided as a tab-delimited text file (.txt) with the following format:

  • File must be in cell × gene wide format (rows = cells, columns = genes).
  • The first two columns must be:
      • cell_id : unique identifier for each cell (e.g., cell_001, cell_002, …)
      • group : known label, annotation, or cell type of each cell
  • All remaining columns correspond to gene expression values, with column names as gene symbols.
    These values can be either:
      • Raw counts
      • Normalized expression

⚠️ Important:

The file must be tab-separated (\t), not comma-separated.
Missing values must be filled with 0.


Example:

cell_id group GeneA GeneB GeneC
cell_001 cell_type1 0.12 0.03 1.22
cell_002 cell_type2 0.45 0.00 0.98
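Before launching the pipeline, it can help to confirm that a file follows the layout above. The following is a minimal sketch using only the standard library; `check_input_matrix` is a hypothetical helper written for this tutorial, not a function shipped with scMarkerGene, and it checks only the rules stated in this section (tab delimiter, `cell_id` and `group` as the first two columns, numeric values with no gaps).

```python
import csv
import io

def check_input_matrix(handle):
    """Return a list of format problems found in a tab-delimited cell x gene matrix."""
    problems = []
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    if header[:2] != ["cell_id", "group"]:
        problems.append("first two columns must be cell_id and group")
    if len(header) < 3:
        problems.append("no gene columns found")
    seen = set()
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} fields, got {len(row)}")
            continue
        if row[0] in seen:
            problems.append(f"line {line_no}: duplicate cell_id {row[0]}")
        seen.add(row[0])
        # Every gene column must hold a number; an empty field counts as missing.
        for gene, value in zip(header[2:], row[2:]):
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: {gene} is not numeric ({value!r})")
    return problems

example = (
    "cell_id\tgroup\tGeneA\tGeneB\tGeneC\n"
    "cell_001\tcell_type1\t0.12\t0.03\t1.22\n"
    "cell_002\tcell_type2\t0.45\t0.00\t0.98\n"
)
print(check_input_matrix(io.StringIO(example)))  # prints []
```

For a real file, pass an open handle instead, for example `open(path, newline="")`. A comma-separated file is caught by the header check, because the whole header line is then read as a single field.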

📂 Demo Data

The demo dataset is provided as raw counts, where each value represents the original sequencing count per gene per cell.

You can download a demo dataset in the correct format from:
👉 Demo Data


🚀 Quick Start

After downloading and extracting the source code package, navigate to the folder containing the four Python files:

cd path/to/scMarkerGene

Run the automation script to start the full marker gene analysis:

python 000_makeAutoRunScript.py -i <input_file> -w <workdir> -t <count|normalized> --strategy <upsample|weights|none>

| Argument | Requirement | Default | Description |
|---|---|---|---|
| -i INFILE | required | – | Path to the input expression matrix (must follow the Input Preparation format). |
| -w WORKDIR | required | – | Absolute path to the working directory for output files. Must exist and be writable. |
| -t {count, normalized} | optional | count | Input data type: count = raw counts, normalized = log-normalized / CPM / TPM. |
| --strategy | optional | upsample | Strategy for handling class imbalance: upsample = simple oversampling, weights = class-weighted loss, none = no balancing. |

After running the above command, an autorun.sh script will be generated in <workdir>.

You can then automatically execute the entire pipeline by running:

bash <workdir>/autorun.sh

This will run all analysis steps and generate the output file in <workdir>.
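If you prefer to drive both commands from Python, for example to process several datasets in a loop, the calls can be wrapped with `subprocess`. The sketch below uses only the arguments documented above; `build_command` and `run_pipeline` are hypothetical helper names, and the example paths are placeholders.

```python
import subprocess
from pathlib import Path

def build_command(infile, workdir, data_type="count", strategy="upsample"):
    """Assemble the 000_makeAutoRunScript.py call from the documented arguments."""
    if data_type not in {"count", "normalized"}:
        raise ValueError(f"unknown data type: {data_type}")
    if strategy not in {"upsample", "weights", "none"}:
        raise ValueError(f"unknown strategy: {strategy}")
    return [
        "python", "000_makeAutoRunScript.py",
        "-i", str(infile),
        "-w", str(workdir),
        "-t", data_type,
        "--strategy", strategy,
    ]

def run_pipeline(infile, workdir, **options):
    """Generate autorun.sh, then execute it; raises if either command fails."""
    subprocess.run(build_command(infile, workdir, **options), check=True)
    subprocess.run(["bash", str(Path(workdir) / "autorun.sh")], check=True)

# Placeholder paths: replace them with your own input file and working directory.
cmd = build_command("demo_counts.txt", "/data/scmarker_out", data_type="count")
print(" ".join(cmd))
```

`run_pipeline` must be called from the folder containing the four Python files, and the working directory must already exist, as required by the -w argument.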

Note: Expression values can be either raw counts or normalized values.
Use -t normalized if your data has already been log-transformed or scaled.
We recommend adding --strategy upsample or --strategy weights to mitigate class imbalance, though this may increase runtime.
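To make the difference between the two balancing strategies concrete, here is a generic illustration in plain Python. It is not taken from the scMarkerGene source: the function names are assumptions, the weights use the common inverse-frequency heuristic (the same formula as scikit-learn's "balanced" class weights), and the oversampling simply resamples each smaller class with replacement up to the size of the largest class.

```python
import random
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

def upsample_indices(labels, seed=0):
    """Row indices in which every class is resampled up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = max(len(rows) for rows in by_class.values())
    picked = []
    for rows in by_class.values():
        picked.extend(rows)  # keep every original cell
        picked.extend(rng.choices(rows, k=target - len(rows)))  # add random duplicates
    return picked

# Toy labels: 8 cells of one type, 2 of another.
labels = ["cell_type1"] * 8 + ["cell_type2"] * 2
weights = class_weights(labels)
balanced = Counter(labels[i] for i in upsample_indices(labels))
print(weights)
print(balanced)
```

Oversampling enlarges the training set, which is one reason runtime can grow, while class weights leave the data unchanged and instead scale each cell's contribution to the loss.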


📥 Source Code

The source code includes four Python files:

  • 000_makeAutoRunScript.py
  • 001_prepareData.py
  • 002_geneContribution.py
  • 003_extractCellMarker.py

000_makeAutoRunScript.py generates the autorun.sh script, which runs 001_prepareData.py, 002_geneContribution.py, and 003_extractCellMarker.py in sequence and writes the final output files to <workdir>.

📄 Detailed descriptions of the main marker extraction Python files:

  • 001_prepareData
  • 002_geneContribution
  • 003_extractCellMarker

You can download the complete source code package here:
👉 Source code file


📑 Output File

The main result is a marker gene table named: scMarkerGene_Top20_marker_exp.txt

It is saved in the output folder of the marker extraction step:

<workdir>/003_extractCellMarker_output/scMarkerGene_Top20_marker_exp.txt

This txt file contains the top 20 ranked marker genes per cell type.

Columns:

  • marker → gene name of the marker
  • target_cell_type → the cell type where the gene is identified as a marker
  • fold_level → the expression fold-change of the gene in the target cell type relative to other cell types
  • marker_score → composite score (silhouette × specificity) indicating the gene's discriminative power


Example (partial):

marker target_cell_type fold_level marker_score
Gene1 cell_type1 2.0 2.5
Gene2 cell_type1 2.0 2.3
Gene3 cell_type1 1.9 2.1
... ... ... ...

📌 By default:
For each cell type, the top 20 marker genes are listed (ranked from highest to lowest marker score).
The values in the expression matrix (from column 3 onwards) represent the median expression of each marker gene across different cell types.
For example, in cell_type1, the highest-scoring marker is Gene1.
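For downstream use, the marker table can be grouped by cell type with the standard library. This sketch assumes the output file is tab-delimited with a header row containing the column names listed above; `markers_by_cell_type` is a hypothetical helper, and the sample rows are copied from the partial example rather than from a real run.

```python
import csv
import io
from collections import defaultdict

def markers_by_cell_type(handle):
    """Group marker genes by target cell type, sorted by descending marker_score."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        grouped[row["target_cell_type"]].append((row["marker"], float(row["marker_score"])))
    return {
        cell_type: [gene for gene, _ in sorted(pairs, key=lambda pair: pair[1], reverse=True)]
        for cell_type, pairs in grouped.items()
    }

# Rows copied from the partial example above (assumed tab-delimited).
sample = (
    "marker\ttarget_cell_type\tfold_level\tmarker_score\n"
    "Gene1\tcell_type1\t2.0\t2.5\n"
    "Gene2\tcell_type1\t2.0\t2.3\n"
    "Gene3\tcell_type1\t1.9\t2.1\n"
)
print(markers_by_cell_type(io.StringIO(sample)))
# prints {'cell_type1': ['Gene1', 'Gene2', 'Gene3']}
```

To read the real result, open `<workdir>/003_extractCellMarker_output/scMarkerGene_Top20_marker_exp.txt` and pass the file handle in place of the `io.StringIO` object.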