P-frames and p-columns

A p-column is the foundational data structure of the Platforma ecosystem: a typed column that represents biological and experimental data in a structured, scalable form.

Definition and concept

A p-column is a typed, structured column of data that serves as a mapping from a tuple of keys (called axes) to a value. Each p-column is defined by its specification, a metadata document that describes the types and meaning of its axes, the type of the value it stores, and annotations that provide additional context.
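As a rough illustration of the mapping described above, the following TypeScript sketch stores values in a Map keyed by an encoded axis tuple. The names ColumnSketch, encodeKey, setValue, and getValue are hypothetical and are not part of the Platforma SDK; actual p-column data lives in the storage formats described later on this page, not in an in-memory map.

```typescript
// Hypothetical sketch: a p-column viewed as a mapping from a key tuple to a value.
type AxisKey = number | string;

interface ColumnSketch<V> {
  axes: string[];       // axis names, in key order
  data: Map<string, V>; // encoded key tuple -> value
}

// Encode a key tuple as a string so it can serve as a Map key.
function encodeKey(key: AxisKey[]): string {
  return JSON.stringify(key);
}

// Store a value, checking that the key has one part per axis.
function setValue<V>(col: ColumnSketch<V>, key: AxisKey[], value: V): void {
  if (key.length !== col.axes.length) {
    throw new Error(`expected ${col.axes.length} key parts, got ${key.length}`);
  }
  col.data.set(encodeKey(key), value);
}

// Look up a value by its full key tuple; returns undefined when absent.
function getValue<V>(col: ColumnSketch<V>, key: AxisKey[]): V | undefined {
  return col.data.get(encodeKey(key));
}

// A two-axis column: (sampleId, clonotypeKey) -> read count.
const readCount: ColumnSketch<number> = {
  axes: ["sampleId", "clonotypeKey"],
  data: new Map(),
};
setValue(readCount, ["Sample 1", "clonotype_A"], 100);
setValue(readCount, ["Sample 2", "clonotype_A"], 200);
```

Encoding the tuple with JSON.stringify is only a convenience for this sketch; it keeps keys of different types distinct (1 versus "1") without a custom hash.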

P-columns are the building blocks for p-frames. A p-frame brings together a collection of p-columns whose axes are compatible, allowing them to be joined and queried together as a cohesive, multidimensional dataset. Unlike a simple table where all columns share one index, different p-columns within a p-frame can have different, though related, sets of axes. This makes them highly flexible for modeling the complex, hierarchical data commonly found in biological experiments.

Data and axis types

A p-column's specification defines the data types for its axes and values.

Axis types can be Int, Long, or String.
Value types can be Int, Long, String, Float, Double, or File (a reference to a file resource).
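The two lists above can be written down as TypeScript union types. The type names come from this page; the constants and the helper functions isAxisType and isValueType are hypothetical and exist only for this sketch.

```typescript
// Allowed types, as listed above. Every axis type is also a valid value type.
const AXIS_TYPES = ["Int", "Long", "String"] as const;
const VALUE_TYPES = ["Int", "Long", "String", "Float", "Double", "File"] as const;

type AxisType = (typeof AXIS_TYPES)[number];
type ValueType = (typeof VALUE_TYPES)[number];

// Hypothetical helper: is this type name usable for an axis?
function isAxisType(t: string): t is AxisType {
  return (AXIS_TYPES as readonly string[]).indexOf(t) !== -1;
}

// Hypothetical helper: is this type name usable for a column value?
function isValueType(t: string): t is ValueType {
  return (VALUE_TYPES as readonly string[]).indexOf(t) !== -1;
}

console.log(isAxisType("Double"));  // false: Double is not in the axis list
console.log(isValueType("Double")); // true: Double is in the value list
```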

P-columns in Platforma

Role in data organization

In Platforma, p-columns are the atomic units of data storage and manipulation. All biological data, from raw sequencing files to processed results, is represented as p-columns within p-frames. This keeps data structured throughout a project, which makes processing, querying, and interoperability between workflow blocks more efficient.

Relationship with p-frames

A p-frame is a container for a set of related p-columns. When a p-column is created and stored, it is assigned a unique ID. A p-frame simply groups these p-columns, allowing blocks to operate on them collectively. The model and UI can then resolve a p-frame to access all the p-columns it contains for display or further processing.

Practical examples

Example 1: Clonotype read abundance

A common p-column in VDJ analysis stores the read count for each clonotype in a given sample. Its specification might look like this:

Axes: Two String axes, one named pl7.app/sampleId and the other pl7.app/vdj/clonotypeKey.
Value: Long integer.
Name: pl7.app/vdj/readCount.
Annotations: It would be annotated with pl7.app/isAbundance: true and pl7.app/abundance/unit: reads to give it clear semantic meaning for other blocks.
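Written out as an object, the specification above might look roughly like the sketch below. The interfaces and their field names (valueType, axesSpec, and so on) are assumptions made for illustration, as is storing annotation values as strings; the authoritative shape is PColumnSpec.

```typescript
// Illustrative shapes only; field names are assumptions, not the SDK definition.
interface AxisSpecSketch {
  type: "Int" | "Long" | "String";
  name: string;
}

interface ColumnSpecSketch {
  name: string;
  valueType: "Int" | "Long" | "String" | "Float" | "Double" | "File";
  axesSpec: AxisSpecSketch[];
  domain?: Record<string, string>;
  annotations?: Record<string, string>;
}

// Example 1: read count per (sample, clonotype).
const readCountSpec: ColumnSpecSketch = {
  name: "pl7.app/vdj/readCount",
  valueType: "Long",
  axesSpec: [
    { type: "String", name: "pl7.app/sampleId" },
    { type: "String", name: "pl7.app/vdj/clonotypeKey" },
  ],
  annotations: {
    "pl7.app/isAbundance": "true",
    "pl7.app/abundance/unit": "reads",
  },
};

// The specification is plain data, so it serializes directly to JSON.
const readCountSpecJson = JSON.stringify(readCountSpec, null, 2);
console.log(readCountSpecJson);
```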

Example 2: CDR3 amino acid sequence

Another p-column might store the amino acid sequence of the CDR3 region, which is a key part of a clonotype's identity.

Axes: A String axis named pl7.app/vdj/clonotypeKey, which uniquely identifies the clonotype.
Value: String.
Name: pl7.app/vdj/sequence.
Domain: It would include a domain like {"pl7.app/vdj/feature": "CDR3", "pl7.app/alphabet": "aminoacid"} to specify precisely what kind of sequence it is.
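One way to picture the role of the domain is as a filter that tells apart columns sharing the same name. In the sketch below, the SpecSketch shape, the matchesDomain helper, and the second column with alphabet "nucleotide" are all hypothetical and only serve to show the idea.

```typescript
// Minimal, illustrative spec shape: just a name and an optional domain.
interface SpecSketch {
  name: string;
  domain?: Record<string, string>;
}

// True when every wanted domain entry is present in the spec with the same value.
function matchesDomain(spec: SpecSketch, wanted: Record<string, string>): boolean {
  const domain = spec.domain || {};
  return Object.keys(wanted).every((k) => domain[k] === wanted[k]);
}

const columns: SpecSketch[] = [
  {
    name: "pl7.app/vdj/sequence",
    domain: { "pl7.app/vdj/feature": "CDR3", "pl7.app/alphabet": "aminoacid" },
  },
  {
    // Hypothetical sibling column, added only to show the disambiguation.
    name: "pl7.app/vdj/sequence",
    domain: { "pl7.app/vdj/feature": "CDR3", "pl7.app/alphabet": "nucleotide" },
  },
];

// Both columns share a name; the domain selects the amino acid one.
const aminoAcidCdr3 = columns.filter(
  (c) =>
    c.name === "pl7.app/vdj/sequence" &&
    matchesDomain(c, { "pl7.app/vdj/feature": "CDR3", "pl7.app/alphabet": "aminoacid" }),
);
```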

Joins

In the user interface of a block, p-frames are commonly visualized using a tabular representation (like a data grid) or passed to a component like GraphMaker, which can render the data on a plot.

The strength of p-frames is that they can join multiple p-columns whose axes are compatible, but not identical, into a single, coherent view. Here is how the two example p-columns, clonotype abundance and CDR3 sequence, look when visualized as a table.

Sample ID	Clonotype ID	CDR3 Sequence	Read Count
Sample 1	clonotype_A	CASSLAPGATNEQFF	100
Sample 1	clonotype_B	CASSLDRVGGYTF	50
Sample 2	clonotype_A	CASSLAPGATNEQFF	200

Here's what's happening:

The Sample ID and Clonotype ID columns are derived from the axes of the abundance p-column.
The Read Count column is derived from the values of the abundance p-column.
The CDR3 Sequence column is derived from the values of the sequence p-column.
The system automatically joins the data using the common clonotypeKey axis to correctly associate the sequence with the relevant abundance records.
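The join described above can be imitated in a few lines of TypeScript. This is a simplified sketch that uses a left join from the abundance column; it is not how the p-frames engine is implemented, and the type and function names (AbundanceRecord, TableRow, joinOnClonotype) are hypothetical.

```typescript
// Abundance column: keyed by (sampleId, clonotypeKey), value is the read count.
interface AbundanceRecord {
  sampleId: string;
  clonotypeKey: string;
  readCount: number;
}

// One row of the joined table.
interface TableRow {
  sampleId: string;
  clonotypeKey: string;
  cdr3: string | null;
  readCount: number;
}

const abundance: AbundanceRecord[] = [
  { sampleId: "Sample 1", clonotypeKey: "clonotype_A", readCount: 100 },
  { sampleId: "Sample 1", clonotypeKey: "clonotype_B", readCount: 50 },
  { sampleId: "Sample 2", clonotypeKey: "clonotype_A", readCount: 200 },
];

// Sequence column: keyed by clonotypeKey only.
const cdr3ByClonotype = new Map<string, string>([
  ["clonotype_A", "CASSLAPGATNEQFF"],
  ["clonotype_B", "CASSLDRVGGYTF"],
]);

// For each abundance record, look up the sequence through the shared axis.
function joinOnClonotype(
  records: AbundanceRecord[],
  sequences: Map<string, string>,
): TableRow[] {
  return records.map((r) => {
    const seq = sequences.get(r.clonotypeKey);
    return {
      sampleId: r.sampleId,
      clonotypeKey: r.clonotypeKey,
      cdr3: seq === undefined ? null : seq,
      readCount: r.readCount,
    };
  });
}

const rows = joinOnClonotype(abundance, cdr3ByClonotype);
for (const row of rows) {
  console.log([row.sampleId, row.clonotypeKey, row.cdr3, row.readCount].join("\t"));
}
```

Because the sequence column has no sample axis, the same CDR3 value is repeated for clonotype_A in both samples, which matches the table above.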

Technical implementation and storage

Specification storage

A p-column's specification, which contains all of its metadata (name, types, axes, domain, annotations), is always stored as a simple JSON resource.

Data storage

The actual data of a p-column—the mapping from keys to values—is stored in one of several formats, chosen based on the data's size and intended use:

Binary Format: Used for most biological data generated by analysis workflows. This format is highly efficient for storing very large datasets (billions of records) of primitive types (String, Int, Double, etc.) and can be partitioned. Partitioning allows for both efficient parallel processing and quick access to specific data slices without needing to read the entire dataset.
JSON Format: Used for smaller datasets, particularly metadata that needs to be accessed quickly from the block's model or UI without invoking the backend p-frames engine. A good example is sample metadata provided by the "Samples & Data" block.
Resource Maps: A special format used for p-columns whose values are references to other resources, most commonly files. For instance, a p-column mapping sample IDs to their corresponding FASTQ files would use a resource map.
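The partitioning idea mentioned for the binary format can be sketched as grouping records by the value of their first axis, so that one slice (for example, one sample) can be read without touching the others. This only illustrates the concept; it says nothing about the actual on-disk layout, and KeyedRecord and partitionByFirstAxis are hypothetical names.

```typescript
// A record with a two-part key (for example sampleId, clonotypeKey) and a value.
interface KeyedRecord {
  key: [string, string];
  value: number;
}

// Group records by the first key part; each group plays the role of a partition.
function partitionByFirstAxis(records: KeyedRecord[]): Map<string, KeyedRecord[]> {
  const partitions = new Map<string, KeyedRecord[]>();
  for (const rec of records) {
    const existing = partitions.get(rec.key[0]);
    if (existing) {
      existing.push(rec);
    } else {
      partitions.set(rec.key[0], [rec]);
    }
  }
  return partitions;
}

const partitions = partitionByFirstAxis([
  { key: ["Sample 1", "clonotype_A"], value: 100 },
  { key: ["Sample 1", "clonotype_B"], value: 50 },
  { key: ["Sample 2", "clonotype_A"], value: 200 },
]);

// Reading one slice only needs the matching partition.
const sample1 = partitions.get("Sample 1") || [];
console.log(sample1.length); // 2
```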

All p-column resources are reference-counted. When no blocks reference a p-column, it is flagged for garbage collection to reclaim storage.
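Reference counting in general works along the lines of the sketch below: each reference increments a counter, each release decrements it, and a resource whose counter reaches zero becomes eligible for collection. The functions retain and release and the identifier "col-1" are hypothetical; this is a generic illustration of the technique, not Platforma's implementation.

```typescript
// Reference counts per column ID, plus the set of IDs flagged for collection.
const refCounts = new Map<string, number>();
const flaggedForGc = new Set<string>();

// A block starts referencing a column.
function retain(columnId: string): void {
  refCounts.set(columnId, (refCounts.get(columnId) || 0) + 1);
  flaggedForGc.delete(columnId);
}

// A block stops referencing a column; flag it once nothing references it.
function release(columnId: string): void {
  const current = refCounts.get(columnId);
  if (current === undefined) {
    return; // unknown ID: nothing to release
  }
  if (current <= 1) {
    refCounts.delete(columnId);
    flaggedForGc.add(columnId);
  } else {
    refCounts.set(columnId, current - 1);
  }
}

// Two blocks reference the column, then one of them lets go.
retain("col-1");
retain("col-1");
release("col-1");
console.log(flaggedForGc.has("col-1")); // false: one reference remains
```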

Related docs

P-column Spec

A p-column's structure is formally defined by its specification (PColumnSpec). This specification is the metadata that details the column's identity, the type of data it holds, its dimensional axes, and various other attributes for context and behavior.

Usage in Model

The Block Model, written in TypeScript, acts as the bridge between the workflow and the UI. It serves two primary functions. First, it defines the block's arguments (args) and generates choice options for the UI components (like dropdowns) that set these arguments. Second, it processes results from the workflow and the project's result pool to generate outputs for visualization. This creates a reactive loop: user selections in the UI update the args, which in turn can trigger recalculations of the model's outputs.

Label p-columns

In the Platforma ecosystem, we distinguish between stable, machine-readable identifiers and mutable, human-readable labels. For example, a sample has a unique, permanent sampleId, but a user might want to refer to it with a descriptive name like "Day 7 Post-Infusion". If this descriptive name were used as the primary key, changing it would trigger massive recalculations.

Usage in Workflow

The workflow, written in a Tengo-based scripting language, is where the core data processing logic of a block is described. It defines input data requirements, executes bioinformatics tools, processes the results, and exports new p-columns and p-frames for use by the UI or other blocks.

XSV Conversion Spec

The Platforma SDK provides tools for converting between p-frames and flat file formats like CSV or TSV (XSV). This is handled by the xsv and pframes libraries, which wrap the pframes-conv command-line tool. This document details the specifications for xsv.importFile and the XSV file builders, complete with real-world examples.

ProcessColumn API

The pframes.processColumn function is an advanced utility in the Workflow SDK for performing mapping and aggregation operations on a p-column. It allows you to iterate over a p-column's data, apply custom logic for each entry (or group of entries) via a separate Tengo template, and generate one or more new p-columns or other artifacts as output.

Standard Annotations

This document lists standard annotations used in block development for p-column specifications. These annotations provide hints for the UI, define behavior in workflows, and add semantic meaning to the data. They are applied to the annotations field of either a PColumnSpec or an AxisSpec.
