
scRNA-seq Preprocessing

This document outlines the standard for single-cell RNA sequencing (scRNA-seq) datasets generated by a block that processes raw sequencing data. Upstream blocks that perform scRNA-seq preprocessing should produce a p-frame containing the p-columns defined here. This ensures that downstream tools for analysis, visualization, and comparison can operate on a consistent and predictable data structure.

Overview

The diagram below illustrates a typical user flow involving a scRNA-seq preprocessing block.

 Blocks                                   Result pool
                                        ┌───────────
              ┌─────────────────────────┤
              │                         │
              v                         │
  ╔═══════════════════════╗   exports   │
  ║    Samples & Data     ║───────>─────┤ Sequencing Dataset
  ╚═══════════════════════╝             │ ------------------
                                        ├ [sampleId](readIndex)(lane) -> file
              ┌─────────────────────────┤
              │                         │
              v                         │
╔═══════════════════════════╗  exports  │
║  scRNA-seq Preprocessing  ║─────>─────┤ Count Matrices, Gene & Cell Properties
╚═══════════════════════════╝           │ --------------------------------------
                                        ├ [sampleId][cellId][geneId] -> raw & normalized counts
                                        ├ [geneId] -> gene properties
                                        ├ [sampleId][cellId] -> cell metrics
              ┌─────────────────────────┤
              │                         │
              v                         │
  ╔═══════════════════════╗             │
  ║  Downstream Analysis  ║             │
  ╚═══════════════════════╝             │

  1. Samples & Data: The flow starts with a universal entry point for importing and organizing raw sequencing data (FASTQ files).
  2. scRNA-seq Preprocessing Block: This block takes the sequencing dataset as input, runs a preprocessing pipeline, and exports a standardized scRNA-seq dataset as its primary output. The dataset consists of three groups of p-columns:
    • Count Matrix p-columns: Two matrices keyed by [sampleId][cellId][geneId], one for raw UMI counts and one for normalized counts.
    • Gene Property p-columns: Keyed by [geneId], these store descriptive attributes of the genes, such as gene symbols.
    • Cell Metrics p-columns: Keyed by [sampleId][cellId], these store per-cell quality control metrics.
  3. Downstream Blocks: Subsequent blocks consume the scRNA-seq dataset, using the anchor raw count matrix to identify the dataset.

Core structure: axes and p-columns

A standard scRNA-seq dataset is a p-frame composed of p-columns that describe gene expression per cell. The structure of these p-columns hinges on three primary axes.

Primary axes

Axis Name              | Type   | Description
-----------------------|--------|---------------------------------------------------------------------------
pl7.app/sampleId       | String | Uniquely identifies the sample from which the data was derived.
pl7.app/sc/cellId      | String | Uniquely identifies a single cell, typically by its barcode sequence.
pl7.app/rna-seq/geneId | String | Uniquely identifies a gene, typically using a standard ID like Ensembl ID.

The anchor column: Raw Count Matrix

To facilitate discovery by downstream blocks, the raw count matrix p-column must be designated as the "anchor" for the dataset.

  • P-column name: pl7.app/rna-seq/countMatrix
  • Domain: Must contain {"pl7.app/rna-seq/normalized": "false"}
  • Annotation: Must contain {"pl7.app/isAnchor": "true"}
  • Axes: [pl7.app/sampleId][pl7.app/sc/cellId][pl7.app/rna-seq/geneId]
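
The four anchor requirements above can be checked mechanically. The following is a minimal TypeScript sketch, not the platform's own matcher: `PColumnSpecLike`, `AxisSpecLike`, and `isScRnaSeqAnchor` are hypothetical names introduced for illustration, and they mirror only the fields shown in this document.

```typescript
// Hypothetical, simplified shape of a p-column spec (not an SDK type).
interface AxisSpecLike {
  name: string;
  type?: string;
}

interface PColumnSpecLike {
  name: string;
  valueType: string;
  axesSpec: AxisSpecLike[];
  domain?: Record<string, string>;
  annotations?: Record<string, string>;
}

// Axis names of the anchor column, in the order required by this standard.
const ANCHOR_AXES = [
  "pl7.app/sampleId",
  "pl7.app/sc/cellId",
  "pl7.app/rna-seq/geneId",
];

// Returns true when the spec satisfies every anchor rule listed above.
function isScRnaSeqAnchor(spec: PColumnSpecLike): boolean {
  const axesMatch =
    spec.axesSpec.length === ANCHOR_AXES.length &&
    spec.axesSpec.every((axis, i) => axis.name === ANCHOR_AXES[i]);
  return (
    spec.name === "pl7.app/rna-seq/countMatrix" &&
    axesMatch &&
    spec.domain?.["pl7.app/rna-seq/normalized"] === "false" &&
    spec.annotations?.["pl7.app/isAnchor"] === "true"
  );
}

// Illustrative specs: a raw (anchor) matrix and a normalized one.
const rawSpec: PColumnSpecLike = {
  name: "pl7.app/rna-seq/countMatrix",
  valueType: "Long",
  axesSpec: ANCHOR_AXES.map((name) => ({ name })),
  domain: { "pl7.app/rna-seq/normalized": "false" },
  annotations: { "pl7.app/isAnchor": "true" },
};

const normalizedSpec: PColumnSpecLike = {
  ...rawSpec,
  valueType: "Double",
  domain: { "pl7.app/rna-seq/normalized": "true" },
  annotations: {},
};
```

Here `isScRnaSeqAnchor(rawSpec)` holds while `isScRnaSeqAnchor(normalizedSpec)` does not, which is why only one column per dataset shows up when downstream blocks search for anchors.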

Count Matrix p-columns

The block produces two count matrices, distinguished by their domain.

1. Raw Count Matrix (Anchor)

  • Description: The raw number of UMIs for each gene in each cell.
  • Requirement: Required.
  • Specification:
    name: pl7.app/rna-seq/countMatrix
    valueType: Long
    axesSpec:
      - name: pl7.app/sampleId
      - name: pl7.app/sc/cellId
      - name: pl7.app/rna-seq/geneId
    domain:
      pl7.app/species: "hsa"
      pl7.app/blockId: "<block-run-id>"
      pl7.app/rna-seq/normalized: "false"
    annotations:
      pl7.app/isAnchor: "true"
      pl7.app/label: "Raw gene expression"

2. Normalized Count Matrix

  • Description: Normalized gene expression values (e.g., CP10k).
  • Requirement: Required.
  • Specification:
    name: pl7.app/rna-seq/countMatrix
    valueType: Double
    axesSpec:
      - name: pl7.app/sampleId
      - name: pl7.app/sc/cellId
      - name: pl7.app/rna-seq/geneId
    domain:
      pl7.app/species: "hsa"
      pl7.app/blockId: "<block-run-id>"
      pl7.app/rna-seq/normalized: "true"
    annotations:
      pl7.app/label: "Normalized gene expression"
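
As a worked illustration of the CP10k example mentioned above: counts-per-10,000 scaling divides each raw count by the cell's total count and multiplies by 10,000, so every cell's values sum to 10,000. The sketch below assumes plain CP10k with no log transform. This standard does not prescribe a normalization method, and `normalizeCp10k`, `CellCounts`, and the gene IDs are hypothetical names used only for illustration.

```typescript
// Raw UMI counts for a single cell, keyed by gene ID (illustrative values).
type CellCounts = Record<string, number>;

// Scales one cell's raw counts to counts per 10,000 (CP10k).
function normalizeCp10k(raw: CellCounts): CellCounts {
  const total = Object.values(raw).reduce((sum, count) => sum + count, 0);
  const normalized: CellCounts = {};
  for (const [geneId, count] of Object.entries(raw)) {
    // A cell with zero total counts stays all-zero instead of producing NaN.
    normalized[geneId] = total === 0 ? 0 : (count / total) * 10_000;
  }
  return normalized;
}

// Example cell: total = 3 + 12 + 5 = 20 UMIs.
const exampleCell: CellCounts = { geneA: 3, geneB: 12, geneC: 5 };
const cp10k = normalizeCp10k(exampleCell);
// geneA: 3 / 20 * 10000 = 1500, geneB: 6000, geneC: 2500 (up to rounding)
```

The raw values are integers (`Long`) while the scaled values are fractional in general, which matches the `Double` value type of the normalized matrix.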

Gene Property p-columns

A preprocessing block must provide a human-readable label for the geneId axis. This is accomplished by generating a pl7.app/label p-column. This column is keyed by pl7.app/rna-seq/geneId and its values are the corresponding gene symbols.

  • P-column name: pl7.app/label
  • Description: The gene symbol corresponding to the geneId.
  • Requirement: Required.
  • Specification:
    name: pl7.app/label
    valueType: String
    axesSpec:
      - name: pl7.app/rna-seq/geneId
        type: String
    domain:
      # The species for the gene annotations
      pl7.app/species: "hsa"
    annotations:
      pl7.app/label: "Gene Symbol"

Cell Metrics p-columns

These p-columns are keyed by [pl7.app/sampleId][pl7.app/sc/cellId] and provide important quality control information for each cell.

  • pl7.app/rna-seq/totalCounts: The total number of UMIs detected in the cell.
  • pl7.app/rna-seq/nGenesByCounts: The number of unique genes detected in the cell.
  • pl7.app/rna-seq/pctCountsMt: The percentage of the cell's total UMI counts that map to mitochondrial genes.
  • pl7.app/rna-seq/complexity: The library complexity, calculated as log10(nGenesByCounts) / log10(totalCounts).
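
All four metrics can be derived from a single cell's raw counts. The following TypeScript sketch shows one way to do it under stated assumptions: mitochondrial genes are identified by a caller-supplied set of gene IDs (this standard does not say how they are identified), a gene counts as detected when its count is above zero, and complexity is reported as NaN when totalCounts is 1 or less, because log10(1) = 0 makes the ratio undefined. `computeCellMetrics` and `CellMetrics` are hypothetical names.

```typescript
// Per-cell QC metrics, named after the p-columns listed above.
interface CellMetrics {
  totalCounts: number;
  nGenesByCounts: number;
  pctCountsMt: number;
  complexity: number;
}

// Computes QC metrics from one cell's raw UMI counts keyed by gene ID.
// `mitoGeneIds` is an assumption: the caller decides which genes are mitochondrial.
function computeCellMetrics(
  raw: Record<string, number>,
  mitoGeneIds: Set<string>,
): CellMetrics {
  let totalCounts = 0;
  let nGenesByCounts = 0;
  let mtCounts = 0;
  for (const [geneId, count] of Object.entries(raw)) {
    if (count <= 0) continue; // undetected genes contribute nothing
    totalCounts += count;
    nGenesByCounts += 1;
    if (mitoGeneIds.has(geneId)) mtCounts += count;
  }
  const pctCountsMt = totalCounts === 0 ? 0 : (mtCounts / totalCounts) * 100;
  // log10(nGenesByCounts) / log10(totalCounts); undefined for totalCounts <= 1.
  const complexity =
    totalCounts > 1
      ? Math.log10(nGenesByCounts) / Math.log10(totalCounts)
      : Number.NaN;
  return { totalCounts, nGenesByCounts, pctCountsMt, complexity };
}

// Example: 100 UMIs over three detected genes, one of them mitochondrial.
const exampleMetrics = computeCellMetrics(
  { geneA: 90, geneB: 9, mtGene1: 1, geneC: 0 },
  new Set(["mtGene1"]),
);
// totalCounts = 100, nGenesByCounts = 3, pctCountsMt = 1 (percent),
// complexity = log10(3) / log10(100), roughly 0.2386
```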

Querying for scRNA-seq data: examples

The following examples show how downstream blocks can reliably find and use scRNA-seq datasets without knowing the specifics of the upstream block that generated them.

Model: Populating a UI with available datasets

This TypeScript example shows how a block's model can find all available scRNA-seq datasets by looking for the anchor column. This is useful for populating a UI dropdown that allows a user to select an input for their analysis.

// In a block's model file (`/model/src/index.ts`)

import { BlockModel } from "@platforma-sdk/model";

export const model = BlockModel.create()
  // ...
  .output("datasetOptions", (ctx) =>
    ctx.resultPool.getOptions(
      // This matcher finds anchor p-columns for scRNA-seq datasets by looking
      // for the specific axes and the `isAnchor` annotation.
      {
        axes: [
          { name: "pl7.app/sampleId" },
          { name: "pl7.app/sc/cellId" },
          { name: "pl7.app/rna-seq/geneId" },
        ],
        annotations: { "pl7.app/isAnchor": "true" },
      },
      {
        // This setting improves the UI by showing only the dataset label,
        // not the specific name of the anchor p-column (e.g. "Raw gene expression").
        label: { includeNativeLabel: false },
      }
    )
  )
  // ...
  .done();

Workflow: Using the count matrix for calculations

This Tengo example shows how a workflow, having been given an inputAnchor dataset by the user, can reliably fetch the associated raw count matrix for use in statistical calculations.

// In a block's workflow file (`/workflow/src/main.tpl.tengo`)

wf := import("@platforma-sdk/workflow-tengo:workflow")

wf.prepare(func(args) {
    bundleBuilder := wf.createPBundleBuilder()
    bundleBuilder.addAnchor("main", args.inputAnchor)

    // Add the raw count matrix to the bundle.
    // The platform will find the correct p-column by matching
    // the name and domain below, with axes taken from the provided anchor.
    bundleBuilder.addSingle({
        axes: [
            { anchor: "main", idx: 0 },
            { anchor: "main", idx: 1 },
            { anchor: "main", idx: 2 },
        ],
        name: "pl7.app/rna-seq/countMatrix",
        domain: {
            "pl7.app/rna-seq/normalized": "false",
        }
    },
    "rawCounts")

    return {
        bundle: bundleBuilder.build()
    }
})

wf.body(func(args) {
    // ...
    // The `rawCounts` p-column is now available
    // in the `args.bundle` bundle for processing.
    // ...
})

Summary of standard p-columns

P-Column Name                  | Description                                 | Axes                       | Requirement
-------------------------------|---------------------------------------------|----------------------------|------------
Count Matrices
pl7.app/rna-seq/countMatrix    | Raw UMI counts per gene per cell.           | [sampleId][cellId][geneId] | Required
pl7.app/rna-seq/countMatrix    | Normalized expression values.               | [sampleId][cellId][geneId] | Required
Gene Properties
pl7.app/label                  | Gene symbol.                                | [geneId]                   | Required
Cell Metrics
pl7.app/rna-seq/totalCounts    | Total UMIs per cell.                        | [sampleId][cellId]         | Required
pl7.app/rna-seq/nGenesByCounts | Number of detected genes per cell.          | [sampleId][cellId]         | Required
pl7.app/rna-seq/pctCountsMt    | Percentage of counts in mitochondrial genes. | [sampleId][cellId]        | Required
pl7.app/rna-seq/complexity     | Library complexity per cell.                | [sampleId][cellId]         | Required
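
A block author may want to check an output p-frame against the table above before exporting it. The following TypeScript sketch is one possible approach, not a platform API: `SpecLike`, `Requirement`, `REQUIRED`, and `findMissingColumns` are hypothetical names, and specs are reduced to a name, an ordered list of axis names, and an optional domain.

```typescript
// Hypothetical, reduced view of a p-column spec: name, ordered axis names, domain.
interface SpecLike {
  name: string;
  axes: string[];
  domain?: Record<string, string>;
}

interface Requirement extends SpecLike {
  label: string; // human-readable name used when reporting a missing column
}

const SAMPLE = "pl7.app/sampleId";
const CELL = "pl7.app/sc/cellId";
const GENE = "pl7.app/rna-seq/geneId";
const NORMALIZED = "pl7.app/rna-seq/normalized";

// One entry per row of the summary table.
const REQUIRED: Requirement[] = [
  { label: "raw count matrix", name: "pl7.app/rna-seq/countMatrix", axes: [SAMPLE, CELL, GENE], domain: { [NORMALIZED]: "false" } },
  { label: "normalized count matrix", name: "pl7.app/rna-seq/countMatrix", axes: [SAMPLE, CELL, GENE], domain: { [NORMALIZED]: "true" } },
  { label: "gene label", name: "pl7.app/label", axes: [GENE] },
  { label: "totalCounts", name: "pl7.app/rna-seq/totalCounts", axes: [SAMPLE, CELL] },
  { label: "nGenesByCounts", name: "pl7.app/rna-seq/nGenesByCounts", axes: [SAMPLE, CELL] },
  { label: "pctCountsMt", name: "pl7.app/rna-seq/pctCountsMt", axes: [SAMPLE, CELL] },
  { label: "complexity", name: "pl7.app/rna-seq/complexity", axes: [SAMPLE, CELL] },
];

// A spec satisfies a requirement when name and axes match exactly and
// every required domain entry is present (extra domain keys are allowed).
function satisfies(spec: SpecLike, req: Requirement): boolean {
  const sameAxes =
    spec.axes.length === req.axes.length &&
    spec.axes.every((axis, i) => axis === req.axes[i]);
  const domainOk = Object.entries(req.domain ?? {}).every(
    ([key, value]) => spec.domain?.[key] === value,
  );
  return spec.name === req.name && sameAxes && domainOk;
}

// Returns the labels of required p-columns that no spec in the frame satisfies.
function findMissingColumns(specs: SpecLike[]): string[] {
  return REQUIRED.filter((req) => !specs.some((spec) => satisfies(spec, req))).map(
    (req) => req.label,
  );
}

// Example: a frame that exports only the raw matrix and the gene labels.
const partialFrame: SpecLike[] = [
  { name: "pl7.app/rna-seq/countMatrix", axes: [SAMPLE, CELL, GENE], domain: { [NORMALIZED]: "false", "pl7.app/species": "hsa" } },
  { name: "pl7.app/label", axes: [GENE], domain: { "pl7.app/species": "hsa" } },
];
const missingColumns = findMissingColumns(partialFrame);
```

For `partialFrame`, the result lists the normalized count matrix and the four cell metrics. Note that the two count matrices share a name and axes, so the check has to compare the `pl7.app/rna-seq/normalized` domain value to tell them apart.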