Clonotype Clustering Block Guide

This document outlines the standard inputs and outputs for a downstream block that performs clonotype clustering. By adhering to this standard, a clustering block can seamlessly process VDJ datasets from any compliant clonotyping block and produce results that are easy to understand and use in further analyses.

Overview

The diagram below illustrates where a clustering block fits in a typical VDJ analysis pipeline. It consumes a VDJ dataset and produces a new, augmented dataset with cluster information.

 Blocks                                 Result pool
                                       ┌───────────
             ┌─────────────────────────┤
             │                         │
             v                         │
 ╔═══════════════════════╗   exports   │
 ║   Clonotyping Block   ║───────>─────┤ Abundance, Sequences, Gene Hits
 ╚═══════════════════════╝             │ -------------------------------
                                       │
                                       ├ [sampleId][clonotypeKey] -> abundance
                                       ├ [clonotypeKey] -> sequence
                                       ├ [clonotypeKey] -> V, J genes
             ┌─────────────────────────┤
             │                         │
             v                         │
 ╔═══════════════════════╗   exports   │
 ║   Clustering Block    ║───────>─────┤ Cluster Abundance & Properties
 ╚═══════════════════════╝             │ --------------------------------
                                       │
                                       ├ [clusterId] -> cluster props
                                       ├ [sampleId][clusterId] -> abundance
                                       ├ [clonotypeKey][clusterId] -> 1 (linker)
             ┌─────────────────────────┤
             │                         │
             v                         │
 ╔═══════════════════════╗   exports   │
 ║  Downstream Blocks    ║───────>─────┤ (Downstream Results)
 ╚═══════════════════════╝             │ --------------------
                                       │

Inputs

A standard clustering block operates on a VDJ dataset (either bulk or single-cell). The developer implementing a clustering block should ensure it can consume the following p-columns from an upstream clonotyping block:

VDJ Dataset Anchor: A reference to the anchor p-column (pl7.app/isAnchor: "true") of the input dataset. This is the primary input that defines the scope of the analysis.
Sequence P-Columns: One or more sequence p-columns (pl7.app/vdj/sequence) to be used for calculating similarity. The specific sequences are typically selected by the user in the UI.
Primary Abundance: The primary, non-normalized abundance p-column (e.g., readCount or uniqueMoleculeCount). This is fetched via an anchored query and used to calculate aggregated abundance for the resulting clusters.
V/J Gene Hits (Optional): The block may optionally use V and J gene hit information to constrain clustering (e.g., only cluster sequences that use the same V gene).
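
The optional V-gene constraint can be sketched as a pre-partitioning step: clonotypes are grouped by V gene, and the similarity search then runs inside each group independently. The helper name `partition_by_v_gene` and the dict-based input are assumptions made for illustration; a real block would read gene hits from the corresponding p-column.

```python
from collections import defaultdict
from typing import Dict, List


def partition_by_v_gene(v_gene_hits: Dict[str, str]) -> Dict[str, List[str]]:
    """Group clonotype keys by V gene so each group can be clustered on its own."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for clonotype_key, v_gene in v_gene_hits.items():
        groups[v_gene].append(clonotype_key)
    # Sort keys inside each group so the result is deterministic.
    return {gene: sorted(keys) for gene, keys in groups.items()}


# Hypothetical input: clonotypeKey -> best V gene hit.
v_gene_hits = {"c1": "IGHV1-69", "c2": "IGHV3-23", "c3": "IGHV1-69"}
groups = partition_by_v_gene(v_gene_hits)
```

With this partitioning, two clonotypes with different V genes can never land in the same cluster, whatever the sequence similarity.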

Exports

A clustering block ingests a VDJ dataset and produces a new, augmented p-frame. The core of this output is a new axis and a set of p-columns that describe the clusters.

NOTE: For brevity, [clonotypeKey] is used throughout this section to refer to either [clonotypeKey] for bulk data or [scClonotypeKey] for single-cell data, as the specific key depends on the input dataset.

The `clusterId` axis

The block must introduce a new primary axis to identify the clusters.

Name: pl7.app/vdj/clusterId
Domain: The domain of this axis should be inherited from the input clonotypeKey or scClonotypeKey and extended with information about the clustering run, such as the algorithm used and a unique ID for the block execution.

name: pl7.app/vdj/clusterId
type: String
domain:
  # Inherited from input clonotypeKey
  pl7.app/vdj/chain: "IGHeavy"
  pl7.app/vdj/clonotypingRunId: "a3e90973-e470-473f-b3fa-96c53965dd78"
  # --- Clustering-specific domain keys ---
  pl7.app/vdj/clustering/algorithm: "mmseqs2"
  pl7.app/vdj/clustering/blockId: "39211d98-1573-4eb9-9bce-3d6331282cc4"
annotations:
  pl7.app/label: "Cluster ID"
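
The domain rule for this axis (inherit from the input key, then extend) amounts to a dictionary merge. The sketch below models the spec as a plain Python dict; `build_cluster_axis_spec` is a hypothetical helper for illustration and not part of any SDK.

```python
from typing import Dict


def build_cluster_axis_spec(
    input_domain: Dict[str, str], algorithm: str, block_id: str
) -> dict:
    """Return a clusterId axis spec whose domain extends the input key's domain."""
    domain = dict(input_domain)  # copy, so the input spec is not mutated
    domain["pl7.app/vdj/clustering/algorithm"] = algorithm
    domain["pl7.app/vdj/clustering/blockId"] = block_id
    return {
        "name": "pl7.app/vdj/clusterId",
        "type": "String",
        "domain": domain,
        "annotations": {"pl7.app/label": "Cluster ID"},
    }


# Domain of the input clonotypeKey axis (values taken from the example above).
input_domain = {
    "pl7.app/vdj/chain": "IGHeavy",
    "pl7.app/vdj/clonotypingRunId": "a3e90973-e470-473f-b3fa-96c53965dd78",
}
axis_spec = build_cluster_axis_spec(
    input_domain, "mmseqs2", "39211d98-1573-4eb9-9bce-3d6331282cc4"
)
```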

Cluster membership and properties

These p-columns describe the clusters themselves and link the original clonotypes to them.

1. Centroid sequence (`pl7.app/vdj/sequence`)

Description: An optional column that stores the representative sequence for each cluster.
Requirement: Optional.
Specification:

name: pl7.app/vdj/sequence
valueType: String
axesSpec:
  - name: pl7.app/vdj/clusterId
    type: String
# The domain is inherited from the input sequence used for clustering
domain:
  pl7.app/vdj/feature: "CDR3"
  pl7.app/alphabet: "aminoacid"
annotations:
  pl7.app/label: "Centroid CDR3 aa"
  pl7.app/table/fontFamily: "monospace"

2. Distance to centroid (`pl7.app/vdj/distanceToCentroid`)

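Description: An optional column that stores the calculated distance (e.g., Levenshtein distance, scaled to the 0-1 range declared in the annotations below) of a clonotype's sequence from its assigned cluster's centroid. Because it is marked as a score, this column can be used to rank clonotypes.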
Requirement: Optional.
Specification:

name: pl7.app/vdj/distanceToCentroid
valueType: Float
axesSpec:
  - name: pl7.app/vdj/clonotypeKey # or scClonotypeKey
    type: String
annotations:
  pl7.app/label: "Distance to centroid"
  pl7.app/table/visibility: "optional"
  # --- Score & Ranking Annotations ---
  pl7.app/isScore: "true"
  # Lower distance is better, so the ranking order is increasing.
  pl7.app/score/rankingOrder: "increasing"
  pl7.app/min: "0"
  pl7.app/max: "1"
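
Since the annotations declare a 0 to 1 range, one plausible way to fill this column is a Levenshtein distance divided by the length of the longer sequence. This normalization is an assumption, not something the standard prescribes, and the function names below are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def distance_to_centroid(sequence: str, centroid: str) -> float:
    """Levenshtein distance scaled into [0, 1] by the longer sequence length."""
    longest = max(len(sequence), len(centroid))
    if longest == 0:
        return 0.0  # two empty sequences are identical
    return levenshtein(sequence, centroid) / longest
```

A clonotype identical to its centroid gets 0.0, which ranks first under the "increasing" ranking order.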

3. Cluster size (`pl7.app/vdj/clustering/clusterSize`)

Description: Stores the total number of unique clonotypes within each cluster.
Requirement: Required.
Specification:

name: pl7.app/vdj/clustering/clusterSize
valueType: Int
axesSpec:
  - name: pl7.app/vdj/clusterId
    type: String
annotations:
  pl7.app/label: "Cluster Size"
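
Cluster size is a count of clonotypes per cluster. A minimal sketch, assuming the cluster assignment is available as a clonotypeKey-to-clusterId mapping (the variable and function names are hypothetical):

```python
from collections import Counter
from typing import Dict


def cluster_sizes(assignments: Dict[str, str]) -> Dict[str, int]:
    """Count how many unique clonotypes were assigned to each cluster."""
    # Keys of the mapping are unique clonotype keys, so counting values is enough.
    return dict(Counter(assignments.values()))


assignments = {"c1": "clusterA", "c2": "clusterA", "c3": "clusterB"}
sizes = cluster_sizes(assignments)
```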

4. Cluster label (`pl7.app/label`)

Description: Provides a short, human-readable label for the clusterId.
Requirement: Required.
Format: The label should be a string prefixed with "CL-", followed by 6-7 alphanumeric characters in upper case (e.g., "CL-7VCA13").
Specification:

name: pl7.app/label
valueType: String
axesSpec:
  - name: pl7.app/vdj/clusterId
    type: String
annotations:
  pl7.app/label: "Cluster Label"
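
One way to produce labels in the required format is to derive them deterministically from the cluster ID. The sketch below hashes the ID and keeps the first characters of its Base32 encoding, whose alphabet (A-Z, 2-7) is upper-case alphanumeric. The hashing scheme is an assumption for illustration; the standard only fixes the label format, and short labels can collide, so a real block may need a uniqueness check.

```python
import base64
import hashlib


def cluster_label(cluster_id: str, length: int = 6) -> str:
    """Build a 'CL-' label with 6 or 7 upper-case alphanumeric characters."""
    if length not in (6, 7):
        raise ValueError("length must be 6 or 7")
    digest = hashlib.sha256(cluster_id.encode("utf-8")).digest()
    # Base32 output uses only A-Z and 2-7; padding appears at the very end,
    # far beyond the first 7 characters of a 32-byte digest.
    token = base64.b32encode(digest).decode("ascii")[:length]
    return "CL-" + token


label = cluster_label("clusterA")
```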

5. Cluster ID (`pl7.app/vdj/clusterId`)

Description: This p-column maps each original clonotype to its new cluster ID. Values in this column are cluster labels.
Requirement: Required.
Specification:

name: pl7.app/vdj/clusterId
valueType: String
axesSpec:
  - name: pl7.app/vdj/clonotypeKey # or scClonotypeKey
    type: String
# The domain for the axis is the full clusterId spec from above
domain:
  pl7.app/vdj/chain: "IGHeavy"
  pl7.app/vdj/clonotypingRunId: "a3e90973-e470-473f-b3fa-96c53965dd78"
  pl7.app/vdj/clustering/algorithm: "mmseqs2"
  pl7.app/vdj/clustering/blockId: "39211d98-1573-4eb9-9bce-3d6331282cc4"
annotations:
  pl7.app/label: "Cluster ID"

Cluster abundance

Requirement: Required. At least one aggregated abundance column must be produced, corresponding to the primary abundance from the input dataset.

Specification (Read count example): Note how the domain is inherited from the input's clonotypeKey axis, and the annotations are a combination of inherited and new values.

# --- Core Identity ---
name: pl7.app/vdj/readCount
valueType: Long

# --- Axes ---
axesSpec:
  - name: pl7.app/sampleId
    type: String
    # Domain is inherited from the input sampleId axis
  - name: pl7.app/vdj/clusterId
    type: String
    # Domain is constructed by the clustering block

# --- Domain ---
# The domain is inherited from the input abundance column's clonotypeKey axis,
# ensuring that the biological context (e.g., chain) is preserved.
domain:
  pl7.app/vdj/chain: "IGHeavy"
  pl7.app/vdj/clonotypingRunId: "a3e90973-e470-473f-b3fa-96c53965dd78"

# --- Annotations ---
# Annotations are a mix of inherited and new values.
annotations:
  # The label is updated to reflect the new context.
  pl7.app/label: "Number of Reads in cluster"
  # Other abundance annotations are preserved.
  pl7.app/isAbundance: "true"
  pl7.app/abundance/unit: "reads"
  pl7.app/abundance/normalized: "false"
  # Note: isPrimary and isAnchor are typically NOT inherited for aggregated columns.
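
The aggregation behind this column is a sum of clonotype abundances over each (sampleId, clusterId) pair. A minimal sketch with in-memory dicts standing in for p-columns; the function name and the decision to skip unassigned clonotypes are assumptions.

```python
from collections import defaultdict
from typing import Dict, Tuple


def aggregate_abundance(
    abundance: Dict[Tuple[str, str], int], assignments: Dict[str, str]
) -> Dict[Tuple[str, str], int]:
    """Sum [sampleId][clonotypeKey] abundance into [sampleId][clusterId]."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for (sample_id, clonotype_key), count in abundance.items():
        cluster_id = assignments.get(clonotype_key)
        if cluster_id is None:
            continue  # assumption: clonotypes without a cluster are skipped
        totals[(sample_id, cluster_id)] += count
    return dict(totals)


read_counts = {
    ("s1", "c1"): 10,
    ("s1", "c2"): 5,
    ("s1", "c3"): 7,
    ("s2", "c1"): 3,
    ("s2", "c4"): 2,  # c4 has no cluster assignment in this example
}
assignments = {"c1": "clusterA", "c2": "clusterA", "c3": "clusterB"}
cluster_reads = aggregate_abundance(read_counts, assignments)
```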

The linker column

To establish a formal link between the original clonotypes and the new clusters, the block must generate a linker p-column. This column represents the many-to-one relationship between clonotypes and clusters.

Name: pl7.app/vdj/clusterLink
Value: The integer 1.
Axes: [clonotypeKey][clusterId]
Annotation: Must be marked with {"pl7.app/isLinkerColumn": "true"}.

This linker makes it trivial for downstream tools to join the original VDJ dataset with the new clustering information.
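
To illustrate how the linker supports such a join, the sketch below builds linker rows from a clonotype-to-cluster mapping and uses them to attach a cluster-level property to each clonotype. Plain dicts stand in for p-columns, and both function names are hypothetical.

```python
from typing import Dict, Tuple, TypeVar

T = TypeVar("T")


def build_linker(assignments: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """One row per (clonotypeKey, clusterId) pair, always with value 1."""
    return {(key, cluster_id): 1 for key, cluster_id in assignments.items()}


def join_cluster_property(
    linker: Dict[Tuple[str, str], int], cluster_property: Dict[str, T]
) -> Dict[str, T]:
    """Project a [clusterId] column onto [clonotypeKey] through the linker."""
    return {
        clonotype_key: cluster_property[cluster_id]
        for (clonotype_key, cluster_id) in linker
        if cluster_id in cluster_property
    }


assignments = {"c1": "clusterA", "c2": "clusterA", "c3": "clusterB"}
linker = build_linker(assignments)
size_per_clonotype = join_cluster_property(linker, {"clusterA": 2, "clusterB": 1})
```

Because each clonotype appears in exactly one linker row, the projection onto clonotype keys is unambiguous.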

Summary of standard exports

The following table provides a summary of all standard p-columns that a developer can expect to be produced by a compliant clustering block.

P-Column Name	Description	Axes	Requirement
Cluster Definition & Membership
`pl7.app/vdj/clusterId`	Links an original clonotype to its assigned cluster.	`[clonotypeKey]`	Required
`pl7.app/vdj/clusterLink`	The join key between clonotypes and clusters.	`[clonotypeKey][clusterId]`	Required
Cluster Properties
`pl7.app/vdj/clustering/clusterSize`	Total number of clonotypes in the cluster.	`[clusterId]`	Required
`pl7.app/label`	Human-readable label for the cluster.	`[clusterId]`	Required
`pl7.app/vdj/sequence`	The representative/centroid sequence of the cluster.	`[clusterId]`	Optional
`pl7.app/vdj/distanceToCentroid`	Distance of a clonotype from its cluster centroid.	`[clonotypeKey]`	Optional
Aggregated Abundance
`pl7.app/vdj/readCount`	Total reads for all clonotypes in the cluster.	`[sampleId][clusterId]`	Required⁵
`pl7.app/vdj/uniqueMoleculeCount`	Total UMIs for all clonotypes in the cluster.	`[sampleId][clusterId]`	Required⁵
`pl7.app/vdj/uniqueCellCount`	Total cells for all clonotypes in the cluster.	`[sampleId][clusterId]`	Required⁵

⁵ Required if the corresponding abundance column exists in the input VDJ dataset.
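
The requirement rules in this table, including the conditional footnote, can be expressed as a small check. This is a sketch only: it matches columns by name alone, which is a simplification (a full check would also compare axes and domains), and `missing_exports` is a hypothetical helper.

```python
from typing import Iterable, List

ALWAYS_REQUIRED = (
    "pl7.app/vdj/clusterId",
    "pl7.app/vdj/clusterLink",
    "pl7.app/vdj/clustering/clusterSize",
    "pl7.app/label",
)
ABUNDANCE_COLUMNS = (
    "pl7.app/vdj/readCount",
    "pl7.app/vdj/uniqueMoleculeCount",
    "pl7.app/vdj/uniqueCellCount",
)


def missing_exports(
    input_columns: Iterable[str], exported_columns: Iterable[str]
) -> List[str]:
    """List required column names that the clustering block did not export."""
    # An aggregated abundance column is required only if the input has its source.
    conditional = set(ABUNDANCE_COLUMNS) & set(input_columns)
    required = set(ALWAYS_REQUIRED) | conditional
    return sorted(required - set(exported_columns))


missing = missing_exports(
    input_columns=["pl7.app/vdj/readCount", "pl7.app/vdj/sequence"],
    exported_columns=list(ALWAYS_REQUIRED),
)
```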
