scRNA-seq Preprocessing

This document defines the standard for single-cell RNA sequencing (scRNA-seq) datasets produced by blocks that process raw sequencing data. A block that performs scRNA-seq preprocessing should produce a p-frame containing the p-columns defined here, so that downstream tools for analysis, visualization, and comparison can rely on a consistent, predictable data structure.

Overview

The diagram below illustrates a typical user flow involving a scRNA-seq preprocessing block.

 Blocks                                 Result pool
                                       ┌───────────
             ┌─────────────────────────┤
             │                         │
             v                         │
 ╔═══════════════════════╗   exports   │
 ║    Samples & Data     ║───────>─────┤ Sequencing Dataset
 ╚═══════════════════════╝             │ ------------------
                                       │
                                       ├ [sampleId](readIndex)(lane) -> file
             ┌─────────────────────────┤
             │                         │
             v                         │
 ╔═══════════════════════════╗ exports │
 ║  scRNA-seq Preprocessing  ║────>────┤ Count Matrices, Gene & Cell Properties
 ╚═══════════════════════════╝         │ --------------------------------------
                                       │
                                       ├ [sampleId][cellId][geneId] -> raw & normalized counts
                                       ├ [geneId] -> gene properties
                                       ├ [sampleId][cellId] -> cell metrics
             ┌─────────────────────────┤
             │                         │
             v                         │
 ╔═══════════════════════╗             │
 ║  Downstream Analysis  ║             │
 ╚═══════════════════════╝             │

Samples & Data: The flow starts with a universal entry point for importing and organizing raw sequencing data (FASTQ files).
scRNA-seq Preprocessing Block: This block takes the sequencing dataset as input, runs a preprocessing pipeline, and generates a standardized scRNA-seq dataset as its primary output. This dataset consists of three groups of p-columns:
- Count Matrix p-columns: Two matrices keyed by [sampleId][cellId][geneId], one for raw UMI counts and one for normalized counts.
- Gene Property p-columns: Keyed by [geneId], these store descriptive attributes of the genes, such as gene symbols.
- Cell Metrics p-columns: Keyed by [sampleId][cellId], these store per-cell quality control metrics.
Downstream Blocks: Subsequent blocks consume the scRNA-seq dataset, using the anchor raw count matrix to identify the dataset.

Core structure: axes and p-columns

A standard scRNA-seq dataset is a p-frame composed of p-columns that describe gene expression per cell. The structure of these p-columns hinges on three primary axes.

Primary axes

Axis Name	Type	Description
`pl7.app/sampleId`	`String`	Uniquely identifies the sample from which the data was derived.
`pl7.app/sc/cellId`	`String`	Uniquely identifies a single cell, typically by its barcode sequence.
`pl7.app/rna-seq/geneId`	`String`	Uniquely identifies a gene, typically using a standard ID like Ensembl ID.

The anchor column: Raw Count Matrix

To facilitate discovery by downstream blocks, the raw count matrix p-column must be designated as the "anchor" for the dataset.

P-column name: pl7.app/rna-seq/countMatrix
Domain: Must contain {"pl7.app/rna-seq/normalized": "false"}
Annotation: Must contain {"pl7.app/isAnchor": "true"}
Axes: [pl7.app/sampleId][pl7.app/sc/cellId][pl7.app/rna-seq/geneId]
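
The four requirements above can be read as a single predicate over a column specification. The following is a minimal sketch of that check, not SDK code: the `AxisSketch` and `ColumnSpecSketch` types and the `isScRnaSeqAnchor` function are hypothetical names introduced for illustration, and the real spec type in the SDK may be shaped differently.

```typescript
// Hypothetical, simplified spec shapes for illustration only;
// they are NOT the SDK's actual types.
interface AxisSketch {
  name: string;
  type?: string;
}

interface ColumnSpecSketch {
  name: string;
  valueType: string;
  axesSpec: AxisSketch[];
  domain?: Record<string, string>;
  annotations?: Record<string, string>;
}

// Axis order required for the anchor column.
const ANCHOR_AXES = [
  "pl7.app/sampleId",
  "pl7.app/sc/cellId",
  "pl7.app/rna-seq/geneId",
];

// Returns true only when the name, domain, annotation, and axes
// all match the anchor requirements listed above.
function isScRnaSeqAnchor(spec: ColumnSpecSketch): boolean {
  const axesMatch =
    spec.axesSpec.length === ANCHOR_AXES.length &&
    spec.axesSpec.every((axis, i) => axis.name === ANCHOR_AXES[i]);
  return (
    spec.name === "pl7.app/rna-seq/countMatrix" &&
    spec.domain?.["pl7.app/rna-seq/normalized"] === "false" &&
    spec.annotations?.["pl7.app/isAnchor"] === "true" &&
    axesMatch
  );
}

// Example spec mirroring the raw count matrix described in this document.
const rawCountsSpec: ColumnSpecSketch = {
  name: "pl7.app/rna-seq/countMatrix",
  valueType: "Long",
  axesSpec: ANCHOR_AXES.map((name) => ({ name })),
  domain: { "pl7.app/rna-seq/normalized": "false" },
  annotations: { "pl7.app/isAnchor": "true" },
};

// The normalized matrix shares the name and axes but is not an anchor.
const normalizedSpec: ColumnSpecSketch = {
  ...rawCountsSpec,
  valueType: "Double",
  domain: { "pl7.app/rna-seq/normalized": "true" },
  annotations: {},
};
```

Here `isScRnaSeqAnchor(rawCountsSpec)` evaluates to `true` and `isScRnaSeqAnchor(normalizedSpec)` to `false`, because the normalized matrix fails both the domain and the annotation requirement.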

Count Matrix p-columns

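The block produces two count matrices. Both use the name pl7.app/rna-seq/countMatrix and the same axes; they are distinguished by the pl7.app/rna-seq/normalized key in their domain.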

1. Raw Count Matrix (Anchor)

Description: The raw number of UMIs for each gene in each cell.
Requirement: Required.
Specification:

name: pl7.app/rna-seq/countMatrix
valueType: Long
axesSpec:
  - name: pl7.app/sampleId
  - name: pl7.app/sc/cellId
  - name: pl7.app/rna-seq/geneId
domain:
  pl7.app/species: "hsa"
  pl7.app/blockId: "<block-run-id>"
  pl7.app/rna-seq/normalized: "false"
annotations:
  pl7.app/isAnchor: "true"
  pl7.app/label: "Raw gene expression"

2. Normalized Count Matrix

Description: Normalized gene expression values (e.g., CP10k).
Requirement: Required.
Specification:

name: pl7.app/rna-seq/countMatrix
valueType: Double
axesSpec:
  - name: pl7.app/sampleId
  - name: pl7.app/sc/cellId
  - name: pl7.app/rna-seq/geneId
domain:
  pl7.app/species: "hsa"
  pl7.app/blockId: "<block-run-id>"
  pl7.app/rna-seq/normalized: "true"
annotations:
  pl7.app/label: "Normalized gene expression"
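
To make the CP10k example concrete, here is a minimal sketch of that normalization for a single cell. It assumes plain counts-per-10,000 scaling with no log transform; this standard does not prescribe a normalization method, and `normalizeCp10k` is a hypothetical helper name.

```typescript
// Scales one cell's raw UMI counts so that they sum to 10,000 (CP10k).
// Assumption: no log transform or other post-processing is applied.
function normalizeCp10k(rawCounts: number[]): number[] {
  const total = rawCounts.reduce((sum, count) => sum + count, 0);
  // A cell with no counts cannot be scaled; return zeros instead of NaN.
  if (total === 0) {
    return rawCounts.map(() => 0);
  }
  return rawCounts.map((count) => (count * 10000) / total);
}

// One cell with 10 UMIs spread across four genes.
const normalizedCell = normalizeCp10k([3, 1, 0, 6]);
```

The input values are integers (matching the `Long` raw matrix) and the output values are fractional in general, which is why the normalized matrix uses `Double`.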

Gene Property p-columns

A preprocessing block must provide a human-readable label for the geneId axis. This is accomplished by generating a pl7.app/label p-column. This column is keyed by pl7.app/rna-seq/geneId and its values are the corresponding gene symbols.

P-column name: pl7.app/label
Description: The gene symbol corresponding to the geneId.
Requirement: Required.
Specification:

name: pl7.app/label
valueType: String
axesSpec:
  - name: pl7.app/rna-seq/geneId
    type: String
domain:
  # The species for the gene annotations
  pl7.app/species: "hsa"
annotations:
  pl7.app/label: "Gene Symbol"

Cell Metrics p-columns

These p-columns are keyed by [pl7.app/sampleId][pl7.app/sc/cellId] and provide important quality control information for each cell.

pl7.app/rna-seq/totalCounts: The total number of UMIs detected in the cell.
pl7.app/rna-seq/nGenesByCounts: The number of unique genes detected in the cell.
pl7.app/rna-seq/pctCountsMt: The percentage of the cell's UMI counts that map to mitochondrial genes.
pl7.app/rna-seq/complexity: The library complexity, calculated as log10(nGenesByCounts) / log10(totalCounts).
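
The four metrics above can be derived from one cell's row of the raw count matrix. The sketch below shows one way to do so under stated assumptions: mitochondrial genes are recognized by the `MT-` gene symbol prefix (a human naming convention, not something this standard specifies), and complexity is set to 0 for cells with at most one UMI to avoid dividing by log10(1) = 0. The `CellMetricsSketch` type and `computeCellMetrics` function are hypothetical names.

```typescript
interface CellMetricsSketch {
  totalCounts: number;
  nGenesByCounts: number;
  pctCountsMt: number;
  complexity: number;
}

// counts[i] is the raw UMI count of the gene whose symbol is geneSymbols[i],
// all for a single cell.
function computeCellMetrics(
  counts: number[],
  geneSymbols: string[],
): CellMetricsSketch {
  if (counts.length !== geneSymbols.length) {
    throw new Error("counts and geneSymbols must have the same length");
  }
  let totalCounts = 0;
  let nGenesByCounts = 0;
  let mtCounts = 0;
  for (let i = 0; i < counts.length; i++) {
    totalCounts += counts[i];
    if (counts[i] > 0) {
      nGenesByCounts += 1;
    }
    // Assumption: the "MT-" prefix marks mitochondrial genes.
    if (geneSymbols[i].startsWith("MT-")) {
      mtCounts += counts[i];
    }
  }
  const pctCountsMt = totalCounts > 0 ? (100 * mtCounts) / totalCounts : 0;
  // log10(nGenesByCounts) / log10(totalCounts), guarded against log10(1) = 0.
  const complexity =
    totalCounts > 1 ? Math.log10(nGenesByCounts) / Math.log10(totalCounts) : 0;
  return { totalCounts, nGenesByCounts, pctCountsMt, complexity };
}

// A cell with 10 UMIs across three detected genes, 3 of them mitochondrial.
const exampleMetrics = computeCellMetrics(
  [5, 0, 3, 2],
  ["ACTB", "GAPDH", "MT-CO1", "CD3E"],
);
```

For this cell, `totalCounts` is 10, `nGenesByCounts` is 3, `pctCountsMt` is 30, and `complexity` is log10(3) / log10(10), roughly 0.477.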

Querying for scRNA-seq data: examples

The following examples show how downstream blocks can reliably find and use scRNA-seq datasets without knowing the specifics of the upstream block that generated them.

Model: Populating a UI with available datasets

This TypeScript example shows how a block's model can find all available scRNA-seq datasets by looking for the anchor column. This is useful for populating a UI dropdown that allows a user to select an input for their analysis.

// In a block's model file (`/model/src/index.ts`)

import { BlockModel } from "@platforma-sdk/model";

export const model = BlockModel.create()
  // ...
  .output("datasetOptions", (ctx) =>
    ctx.resultPool.getOptions(
      // This matcher finds anchor p-columns for scRNA-seq datasets by looking
      // for the specific axes and the `isAnchor` annotation.
      {
        axes: [
          { name: "pl7.app/sampleId" },
          { name: "pl7.app/sc/cellId" },
          { name: "pl7.app/rna-seq/geneId" },
        ],
        annotations: { "pl7.app/isAnchor": "true" },
      },
      {
        // This setting improves the UI by showing only the dataset label,
        // not the specific name of the anchor p-column (e.g. "Raw gene expression").
        label: { includeNativeLabel: false },
      }
    )
  )
  //...
  .done();

Workflow: Using the count matrix for calculations

This Tengo example shows how a workflow, having been given an inputAnchor dataset by the user, can reliably fetch the associated raw count matrix for use in statistical calculations.

// In a block's workflow file (`/workflow/src/main.tpl.tengo`)

wf := import("@platforma-sdk/workflow-tengo:workflow")

wf.prepare(func(args) {
	bundleBuilder := wf.createPBundleBuilder()
	bundleBuilder.addAnchor("main", args.inputAnchor)

	// Add the raw count matrix to the bundle.
	// The platform will find the correct p-column by matching
	// its name, domain, and axes against the provided anchor.
	bundleBuilder.addSingle({
		axes: [
			{ anchor: "main", idx: 0 },
			{ anchor: "main", idx: 1 },
			{ anchor: "main", idx: 2 },
		],
		name: "pl7.app/rna-seq/countMatrix",
		domain: {
			"pl7.app/rna-seq/normalized": "false",
		}
	}, "rawCounts")

	return {
		bundle: bundleBuilder.build()
	}
})

wf.body(func(args) {
	// ...
	// The `rawCounts` p-column is now available
	// in the `args.bundle` bundle for processing.
	// ...
})

Summary of standard p-columns

P-Column Name	Description	Axes	Requirement
Count Matrices
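`pl7.app/rna-seq/countMatrix`	Raw UMI counts per gene per cell (anchor; domain `pl7.app/rna-seq/normalized` = `"false"`).	`[sampleId][cellId][geneId]`	Required
`pl7.app/rna-seq/countMatrix`	Normalized expression values (domain `pl7.app/rna-seq/normalized` = `"true"`).	`[sampleId][cellId][geneId]`	Required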
Gene Properties
`pl7.app/label`	Gene symbol.	`[geneId]`	Required
Cell Metrics
`pl7.app/rna-seq/totalCounts`	Total UMIs per cell.	`[sampleId][cellId]`	Required
`pl7.app/rna-seq/nGenesByCounts`	Number of detected genes per cell.	`[sampleId][cellId]`	Required
`pl7.app/rna-seq/pctCountsMt`	Percentage of counts from mitochondrial genes.	`[sampleId][cellId]`	Required
`pl7.app/rna-seq/complexity`	Library complexity per cell.	`[sampleId][cellId]`	Required
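
A block author may want to check that an output set covers every row of the summary table. The following is a rough sketch of such a check, assuming a simplified description of each produced column; `ProducedColumn`, `RequiredColumn`, and `missingColumns` are hypothetical names and not part of the SDK.

```typescript
// Hypothetical minimal description of a produced p-column (not an SDK type).
interface ProducedColumn {
  name: string;
  axes: string[];
  domain?: Record<string, string>;
}

interface RequiredColumn extends ProducedColumn {
  label: string;
}

const SAMPLE = "pl7.app/sampleId";
const CELL = "pl7.app/sc/cellId";
const GENE = "pl7.app/rna-seq/geneId";
const NORMALIZED = "pl7.app/rna-seq/normalized";
const COUNT_MATRIX = "pl7.app/rna-seq/countMatrix";

// One entry per row of the summary table.
const REQUIRED_COLUMNS: RequiredColumn[] = [
  { label: "raw count matrix", name: COUNT_MATRIX, axes: [SAMPLE, CELL, GENE], domain: { [NORMALIZED]: "false" } },
  { label: "normalized count matrix", name: COUNT_MATRIX, axes: [SAMPLE, CELL, GENE], domain: { [NORMALIZED]: "true" } },
  { label: "gene symbol", name: "pl7.app/label", axes: [GENE] },
  { label: "totalCounts", name: "pl7.app/rna-seq/totalCounts", axes: [SAMPLE, CELL] },
  { label: "nGenesByCounts", name: "pl7.app/rna-seq/nGenesByCounts", axes: [SAMPLE, CELL] },
  { label: "pctCountsMt", name: "pl7.app/rna-seq/pctCountsMt", axes: [SAMPLE, CELL] },
  { label: "complexity", name: "pl7.app/rna-seq/complexity", axes: [SAMPLE, CELL] },
];

// A produced column meets a requirement when the name and axes are equal
// and every required domain entry is present with the same value.
function matchesRequirement(col: ProducedColumn, req: RequiredColumn): boolean {
  const sameAxes =
    col.axes.length === req.axes.length &&
    col.axes.every((axis, i) => axis === req.axes[i]);
  const requiredDomain = req.domain ?? {};
  const domainOk = Object.keys(requiredDomain).every(
    (key) => col.domain?.[key] === requiredDomain[key],
  );
  return col.name === req.name && sameAxes && domainOk;
}

// Returns the labels of required columns that are absent from `columns`.
function missingColumns(columns: ProducedColumn[]): string[] {
  return REQUIRED_COLUMNS
    .filter((req) => !columns.some((col) => matchesRequirement(col, req)))
    .map((req) => req.label);
}

// Example: an output set that forgot the mitochondrial percentage column.
const incompleteOutput = REQUIRED_COLUMNS.filter((c) => c.label !== "pctCountsMt");
const missing = missingColumns(incompleteOutput);
```

In this example `missing` contains only `"pctCountsMt"`. Additional domain keys on a produced column, such as `pl7.app/species`, do not affect the result, since only the required entries are compared.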
