scRNA-seq Preprocessing
This document outlines the standard for single-cell RNA sequencing (scRNA-seq) datasets generated by a block that processes raw sequencing data. Upstream blocks that perform scRNA-seq preprocessing should produce a p-frame containing the p-columns defined here. This ensures that downstream tools for analysis, visualization, and comparison can operate on a consistent and predictable data structure.
Overview
The diagram below illustrates a typical user flow involving a scRNA-seq preprocessing block.
Blocks Result pool
┌───────────
┌─────────────────────────┤
│ │
v │
╔═══════════════════════╗ exports │
║ Samples & Data ║───────>─────┤ Sequencing Dataset
╚═══════════════════════╝ │ ------------------
│
├ [sampleId](readIndex)(lane) -> file
┌─────────────────────────┤
│ │
v │
╔═══════════════════════════╗ exports │
║ scRNA-seq Preprocessing ║─────>───┤ Count Matrices, Gene & Cell Properties
╚═══════════════════════════╝ │ --------------------------------------
│
├ [sampleId][cellId][geneId] -> raw & normalized counts
├ [geneId] -> gene properties
├ [sampleId][cellId] -> cell metrics
┌─────────────────────────┤
│ │
v │
╔═══════════════════════╗ │
║ Downstream Analysis ║ │
╚═══════════════════════╝ │
- Samples & Data: The flow starts with a universal entry point for importing and organizing raw sequencing data (FASTQ files).
- scRNA-seq Preprocessing Block: This block takes the sequencing dataset as input. It runs a preprocessing pipeline and generates a standardized scRNA-seq dataset as its primary output. This dataset consists of several main p-columns:
  - Count Matrix p-columns: Two matrices keyed by [sampleId][cellId][geneId], one for raw UMI counts and one for normalized counts.
  - Gene Property p-columns: Keyed by [geneId], these store descriptive attributes of the genes, such as gene symbols.
  - Cell Metrics p-columns: Keyed by [sampleId][cellId], these store per-cell quality control metrics.
- Downstream Blocks: Subsequent blocks consume the scRNA-seq dataset, using the anchor raw count matrix to identify the dataset.
Core structure: axes and p-columns
A standard scRNA-seq dataset is a p-frame composed of p-columns that describe gene expression per cell. The structure of these p-columns hinges on three primary axes.
Primary axes
| Axis Name | Type | Description |
|---|---|---|
| pl7.app/sampleId | String | Uniquely identifies the sample from which the data was derived. |
| pl7.app/sc/cellId | String | Uniquely identifies a single cell, typically by its barcode sequence. |
| pl7.app/rna-seq/geneId | String | Uniquely identifies a gene, typically using a standard identifier such as an Ensembl ID. |
The anchor column: Raw Count Matrix
To facilitate discovery by downstream blocks, the raw count matrix p-column must be designated as the "anchor" for the dataset.
- P-column name: pl7.app/rna-seq/countMatrix
- Domain: Must contain {"pl7.app/rna-seq/normalized": "false"}
- Annotation: Must contain {"pl7.app/isAnchor": "true"}
- Axes: [pl7.app/sampleId][pl7.app/sc/cellId][pl7.app/rna-seq/geneId]
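The four anchor conditions above can be read as a single predicate. The sketch below is illustrative only: `PColumnSpecLike`, `AxisSpecLike`, and `isScRnaSeqAnchor` are hypothetical names invented for this example and are not part of the Platforma SDK, whose real spec types may differ.

```typescript
// Hypothetical, simplified shape of a p-column spec (assumption, not the SDK type).
interface AxisSpecLike {
  name: string;
}

interface PColumnSpecLike {
  name: string;
  axesSpec: AxisSpecLike[];
  domain?: Record<string, string>;
  annotations?: Record<string, string>;
}

// Axis order required for the anchor column, as listed above.
const COUNT_MATRIX_AXES = [
  "pl7.app/sampleId",
  "pl7.app/sc/cellId",
  "pl7.app/rna-seq/geneId",
];

// Returns true only if the name, domain, annotation, and axes conditions all hold.
function isScRnaSeqAnchor(spec: PColumnSpecLike): boolean {
  const axesMatch =
    spec.axesSpec.length === COUNT_MATRIX_AXES.length &&
    spec.axesSpec.every((axis, i) => axis.name === COUNT_MATRIX_AXES[i]);
  return (
    spec.name === "pl7.app/rna-seq/countMatrix" &&
    spec.domain?.["pl7.app/rna-seq/normalized"] === "false" &&
    spec.annotations?.["pl7.app/isAnchor"] === "true" &&
    axesMatch
  );
}

const rawSpec: PColumnSpecLike = {
  name: "pl7.app/rna-seq/countMatrix",
  axesSpec: COUNT_MATRIX_AXES.map((name) => ({ name })),
  domain: { "pl7.app/rna-seq/normalized": "false" },
  annotations: { "pl7.app/isAnchor": "true" },
};

// Same column name, but normalized and without the anchor annotation.
const normalizedSpec: PColumnSpecLike = {
  name: "pl7.app/rna-seq/countMatrix",
  axesSpec: COUNT_MATRIX_AXES.map((name) => ({ name })),
  domain: { "pl7.app/rna-seq/normalized": "true" },
};

console.log(isScRnaSeqAnchor(rawSpec)); // true
console.log(isScRnaSeqAnchor(normalizedSpec)); // false
```

Note that domain and annotation values are compared as the strings "true" and "false", matching how they are written in this standard.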
Count Matrix p-columns
The block produces two count matrices, distinguished by their domain.
1. Raw Count Matrix (Anchor)
- Description: The raw number of UMIs for each gene in each cell.
- Requirement: Required.
- Specification:
name: pl7.app/rna-seq/countMatrix
valueType: Long
axesSpec:
  - name: pl7.app/sampleId
  - name: pl7.app/sc/cellId
  - name: pl7.app/rna-seq/geneId
domain:
  pl7.app/species: "hsa"
  pl7.app/blockId: "<block-run-id>"
  pl7.app/rna-seq/normalized: "false"
annotations:
  pl7.app/isAnchor: "true"
  pl7.app/label: "Raw gene expression"
2. Normalized Count Matrix
- Description: Normalized gene expression values (e.g., CP10k).
- Requirement: Required.
- Specification:
name: pl7.app/rna-seq/countMatrix
valueType: Double
axesSpec:
  - name: pl7.app/sampleId
  - name: pl7.app/sc/cellId
  - name: pl7.app/rna-seq/geneId
domain:
  pl7.app/species: "hsa"
  pl7.app/blockId: "<block-run-id>"
  pl7.app/rna-seq/normalized: "true"
annotations:
  pl7.app/label: "Normalized gene expression"
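As an illustration of how the normalized values relate to the raw ones, here is a minimal sketch of CP10k (counts per 10,000) scaling for a single cell. It is not the method of any particular preprocessing block: the function name `normalizeCp10k`, the per-cell object layout, and the handling of empty cells are all assumptions made for this example.

```typescript
// Target sum per cell for CP10k scaling.
const SCALE = 10000;

// Scales one cell's raw UMI counts so that they sum to 10,000.
// `rawCounts` maps geneId -> raw UMI count for a single cell (hypothetical layout).
function normalizeCp10k(rawCounts: Record<string, number>): Record<string, number> {
  const geneIds = Object.keys(rawCounts);

  // Total UMIs in this cell.
  let total = 0;
  for (const geneId of geneIds) {
    total += rawCounts[geneId];
  }

  const normalized: Record<string, number> = {};
  // Assumption: a cell with no counts yields no normalized values.
  if (total === 0) {
    return normalized;
  }

  for (const geneId of geneIds) {
    normalized[geneId] = (rawCounts[geneId] * SCALE) / total;
  }
  return normalized;
}

// Placeholder gene IDs; 8 UMIs in total.
const exampleCell = { geneA: 1, geneB: 3, geneC: 4 };
console.log(normalizeCp10k(exampleCell));
// { geneA: 1250, geneB: 3750, geneC: 5000 }
```

The results are fractional in general, which is why the normalized matrix uses valueType Double while the raw matrix uses Long.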
Gene Property p-columns
A preprocessing block must provide a human-readable label for the geneId axis. This is accomplished by generating a pl7.app/label p-column. This column is keyed by pl7.app/rna-seq/geneId and its values are the corresponding gene symbols.
- P-column name:
pl7.app/label - Description: The gene symbol corresponding to the
geneId. - Requirement: Required.
- Specification:
name: pl7.app/label
valueType: String
axesSpec:
  - name: pl7.app/rna-seq/geneId
    type: String
domain:
  # The species for the gene annotations
  pl7.app/species: "hsa"
annotations:
  pl7.app/label: "Gene Symbol"
Cell Metrics p-columns
These p-columns are keyed by [pl7.app/sampleId][pl7.app/sc/cellId] and provide important quality control information for each cell.
- pl7.app/rna-seq/totalCounts: The total number of UMIs detected in the cell.
- pl7.app/rna-seq/nGenesByCounts: The number of unique genes detected in the cell.
- pl7.app/rna-seq/pctCountsMt: The percentage of reads mapping to mitochondrial genes.
- pl7.app/rna-seq/complexity: The library complexity, calculated as log10(nGenesByCounts) / log10(totalCounts).
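The four metrics above can be derived from one cell's raw counts, as the following sketch shows. Several things here are assumptions rather than part of the standard: the names `CellMetrics` and `computeCellMetrics`, the caller-supplied `isMitochondrial` predicate (how mitochondrial genes are identified is not specified here), the use of UMI counts for the mitochondrial percentage, and returning NaN for complexity when the formula is undefined.

```typescript
interface CellMetrics {
  totalCounts: number;
  nGenesByCounts: number;
  pctCountsMt: number;
  complexity: number;
}

// `counts` maps geneId -> raw UMI count for one cell (hypothetical layout).
// `isMitochondrial` is supplied by the caller; this sketch does not define
// how mitochondrial genes are recognized.
function computeCellMetrics(
  counts: Record<string, number>,
  isMitochondrial: (geneId: string) => boolean
): CellMetrics {
  let totalCounts = 0;
  let nGenesByCounts = 0;
  let mtCounts = 0;

  for (const geneId of Object.keys(counts)) {
    const count = counts[geneId];
    if (count <= 0) continue; // genes with zero counts are not "detected"
    totalCounts += count;
    nGenesByCounts += 1;
    if (isMitochondrial(geneId)) mtCounts += count;
  }

  // Assumption: the percentage is computed from UMI counts; 0 for an empty cell.
  const pctCountsMt = totalCounts > 0 ? (100 * mtCounts) / totalCounts : 0;

  // log10(a) / log10(b) equals ln(a) / ln(b), so Math.log gives the same ratio.
  // The ratio is undefined when totalCounts <= 1 (log of 1 is 0); NaN is an assumption.
  const complexity =
    totalCounts > 1 ? Math.log(nGenesByCounts) / Math.log(totalCounts) : NaN;

  return { totalCounts, nGenesByCounts, pctCountsMt, complexity };
}

// Placeholder gene IDs; "geneMt" stands in for a mitochondrial gene.
const cellCounts = { geneMt: 10, geneA: 90, geneB: 0 };
const metrics = computeCellMetrics(cellCounts, (id) => id === "geneMt");
console.log(metrics.totalCounts); // 100
console.log(metrics.nGenesByCounts); // 2
console.log(metrics.pctCountsMt); // 10
console.log(metrics.complexity.toFixed(4)); // "0.1505"
```

In the example, complexity is log10(2) / log10(100), roughly 0.30103 / 2. Values close to 1 indicate that most UMIs come from distinct genes, while low values indicate a few genes dominating the cell's counts.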
Querying for scRNA-seq data: examples
The following examples show how downstream blocks can reliably find and use scRNA-seq datasets without knowing the specifics of the upstream block that generated them.
Model: Populating a UI with available datasets
This TypeScript example shows how a block's model can find all available scRNA-seq datasets by looking for the anchor column. This is useful for populating a UI dropdown that allows a user to select an input for their analysis.
// In a block's model file (`/model/src/index.ts`)
import { BlockModel } from "@platforma-sdk/model";

export const model = BlockModel.create()
  // ...
  .output("datasetOptions", (ctx) =>
    ctx.resultPool.getOptions(
      // This matcher finds anchor p-columns for scRNA-seq datasets by looking
      // for the specific axes and the `isAnchor` annotation.
      {
        axes: [
          { name: "pl7.app/sampleId" },
          { name: "pl7.app/sc/cellId" },
          { name: "pl7.app/rna-seq/geneId" },
        ],
        annotations: { "pl7.app/isAnchor": "true" },
      },
      {
        // This setting improves the UI by showing only the dataset label,
        // not the specific name of the anchor p-column (e.g. "Raw gene expression").
        label: { includeNativeLabel: false },
      }
    )
  )
  // ...
  .done();
Workflow: Using the count matrix for calculations
This Tengo example shows how a workflow, having been given an inputAnchor dataset by the user, can reliably fetch the associated raw count matrix for use in statistical calculations.
// In a block's workflow file (`/workflow/src/main.tpl.tengo`)
wf := import("@platforma-sdk/workflow-tengo:workflow")

wf.prepare(func(args) {
	bundleBuilder := wf.createPBundleBuilder()
	bundleBuilder.addAnchor("main", args.inputAnchor)

	// Add the raw count matrix to the bundle.
	// The platform will find the correct p-column by matching the name,
	// the domain, and the axes taken from the provided anchor.
	bundleBuilder.addSingle({
		axes: [
			{ anchor: "main", idx: 0 },
			{ anchor: "main", idx: 1 },
			{ anchor: "main", idx: 2 }
		],
		name: "pl7.app/rna-seq/countMatrix",
		domain: {
			"pl7.app/rna-seq/normalized": "false"
		}
	},
	"rawCounts")

	return {
		bundle: bundleBuilder.build()
	}
})

wf.body(func(args) {
	// ...
	// The `rawCounts` p-column is now available
	// in the `args.bundle` bundle for processing.
	// ...
})
Summary of standard p-columns
| P-Column Name | Description | Axes | Requirement |
|---|---|---|---|
| Count Matrices | | | |
| pl7.app/rna-seq/countMatrix | Raw UMI counts per gene per cell. | [sampleId][cellId][geneId] | Required |
| pl7.app/rna-seq/countMatrix | Normalized expression values. | [sampleId][cellId][geneId] | Required |
| Gene Properties | | | |
| pl7.app/label | Gene symbol. | [geneId] | Required |
| Cell Metrics | | | |
| pl7.app/rna-seq/totalCounts | Total UMIs per cell. | [sampleId][cellId] | Required |
| pl7.app/rna-seq/nGenesByCounts | Number of detected genes per cell. | [sampleId][cellId] | Required |
| pl7.app/rna-seq/pctCountsMt | Percentage of mitochondrial reads. | [sampleId][cellId] | Required |
| pl7.app/rna-seq/complexity | Library complexity per cell. | [sampleId][cellId] | Required |
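A block author may want to check a produced set of specs against the summary table before exporting. The sketch below encodes each table row as a matching rule and reports the rows that have no matching column. `SpecLike`, `REQUIRED_COLUMNS`, and `missingStandardColumns` are hypothetical names for this example, and the flattened `axes` array is a simplification of the real axis specs.

```typescript
// Hypothetical, simplified spec shape used only for this sketch.
interface SpecLike {
  name: string;
  axes: string[]; // axis names, in order
  domain?: Record<string, string>;
}

const SAMPLE = "pl7.app/sampleId";
const CELL = "pl7.app/sc/cellId";
const GENE = "pl7.app/rna-seq/geneId";

function sameAxes(actual: string[], expected: string[]): boolean {
  return (
    actual.length === expected.length &&
    actual.every((name, i) => name === expected[i])
  );
}

// Rule for one of the two count matrices, distinguished by the `normalized` domain.
function countMatrixRule(normalized: "true" | "false") {
  return (s: SpecLike) =>
    s.name === "pl7.app/rna-seq/countMatrix" &&
    sameAxes(s.axes, [SAMPLE, CELL, GENE]) &&
    s.domain?.["pl7.app/rna-seq/normalized"] === normalized;
}

// Rule for a per-cell metric column keyed by [sampleId][cellId].
function cellMetricRule(name: string) {
  return (s: SpecLike) => s.name === name && sameAxes(s.axes, [SAMPLE, CELL]);
}

// One entry per row of the summary table above.
const REQUIRED_COLUMNS: { label: string; matches: (s: SpecLike) => boolean }[] = [
  { label: "countMatrix (raw)", matches: countMatrixRule("false") },
  { label: "countMatrix (normalized)", matches: countMatrixRule("true") },
  {
    label: "label (gene symbol)",
    matches: (s) => s.name === "pl7.app/label" && sameAxes(s.axes, [GENE]),
  },
  { label: "totalCounts", matches: cellMetricRule("pl7.app/rna-seq/totalCounts") },
  { label: "nGenesByCounts", matches: cellMetricRule("pl7.app/rna-seq/nGenesByCounts") },
  { label: "pctCountsMt", matches: cellMetricRule("pl7.app/rna-seq/pctCountsMt") },
  { label: "complexity", matches: cellMetricRule("pl7.app/rna-seq/complexity") },
];

// Returns the labels of required columns that no spec in `specs` satisfies.
function missingStandardColumns(specs: SpecLike[]): string[] {
  return REQUIRED_COLUMNS
    .filter((rule) => !specs.some(rule.matches))
    .map((rule) => rule.label);
}

// A dataset that only exports the raw count matrix.
const partial: SpecLike[] = [
  {
    name: "pl7.app/rna-seq/countMatrix",
    axes: [SAMPLE, CELL, GENE],
    domain: { "pl7.app/rna-seq/normalized": "false" },
  },
];
console.log(missingStandardColumns(partial).length); // 6
```

An empty result means every row of the table is covered. The check deliberately ignores annotations and other domain keys such as species, which a stricter validator might also compare.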