P-frames and p-columns
A "p-column" is a foundational data structure in the Platforma ecosystem, designed for robust, scalable, and structured representation of biological and experimental data.
Definition and concept
A p-column is a typed, structured column of data that serves as a mapping from a tuple of keys (called axes) to a value. Each p-column is defined by its specification, a metadata document that describes the types and meaning of its axes, the type of the value it stores, and annotations that provide additional context.
P-columns are the building blocks for p-frames. A p-frame brings together a collection of p-columns whose axes are compatible, allowing them to be joined and queried together as a cohesive, multidimensional dataset. Unlike a simple table where all columns share one index, different p-columns within a p-frame can have different, though related, sets of axes. This makes them highly flexible for modeling the complex, hierarchical data commonly found in biological experiments.
Data and axis types
A p-column's specification defines the data types for its axes and values.
- Axis types can be: Int, Long, or String.
- Value types can be: Int, Long, String, Float, Double, or File (to reference file resources).
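The allowed types can be written down as TypeScript unions. This is a minimal sketch: the names AxisSketch and ColumnSpecSketch and their field names are assumptions made for illustration, not the SDK's own definitions, which are covered in the P-column Spec document.

```typescript
// Hypothetical, simplified shapes for illustration only; not the SDK's types.
const AXIS_TYPES = ["Int", "Long", "String"] as const;
const VALUE_TYPES = ["Int", "Long", "String", "Float", "Double", "File"] as const;

type AxisType = (typeof AXIS_TYPES)[number];
type ValueType = (typeof VALUE_TYPES)[number];

interface AxisSketch {
  name: string;
  type: AxisType;
}

interface ColumnSpecSketch {
  name: string;
  valueType: ValueType;
  axes: AxisSketch[];
  domain?: Record<string, string>;
  annotations?: Record<string, string>;
}

// Runtime guards, useful when a specification arrives as untyped JSON.
function isAxisType(t: string): t is AxisType {
  return (AXIS_TYPES as readonly string[]).indexOf(t) !== -1;
}

function isValueType(t: string): t is ValueType {
  return (VALUE_TYPES as readonly string[]).indexOf(t) !== -1;
}
```

Note that Float, Double, and File are valid only for values, so isAxisType("Double") is false while isValueType("Double") is true.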
P-columns in Platforma
Role in data organization
In Platforma, p-columns are the atomic units of data storage and manipulation. All biological data, from raw sequencing files to processed results, are represented as p-columns within p-frames. This approach ensures that data is always highly structured, facilitating efficient processing, querying, and interoperability between workflow blocks.
Relationship with p-frames
A p-frame is a container for a set of related p-columns. When a p-column is created and stored, it is assigned a unique ID. A p-frame simply groups these p-columns, allowing blocks to operate on them collectively. The model and UI can then resolve a p-frame to access all the p-columns it contains for display or further processing.
Practical examples
Example 1: Clonotype read abundance
A common p-column in VDJ analysis stores the read count for each clonotype in a given sample. Its specification might look like this:
- Axes: Two String axes, one named pl7.app/sampleId and the other pl7.app/vdj/clonotypeKey.
- Value: Long integer.
- Name: pl7.app/vdj/readCount.
- Annotations: It would be annotated with pl7.app/isAbundance: true and pl7.app/abundance/unit: reads to give it clear semantic meaning for other blocks.
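As a rough sketch, the specification above and a few data records could be modeled like this. The field names (valueType, axes, annotations), the string encoding of annotation values, and the helpers encodeKey and lookup are assumptions for illustration; the stored format may differ.

```typescript
// Hypothetical field names; see the P-column Spec document for the real format.
const readCountSpec = {
  name: "pl7.app/vdj/readCount",
  valueType: "Long",
  axes: [
    { name: "pl7.app/sampleId", type: "String" },
    { name: "pl7.app/vdj/clonotypeKey", type: "String" },
  ],
  annotations: {
    "pl7.app/isAbundance": "true",
    "pl7.app/abundance/unit": "reads",
  },
};

// The data is a mapping from a tuple of axis keys to a value.
// JSON-encoding the tuple yields a Map key that compares by content.
type Key = [string, string]; // [sampleId, clonotypeKey]
const encodeKey = (key: Key): string => JSON.stringify(key);

// Long values are held as number here for brevity; counts beyond 2^53
// would need bigint.
const readCounts = new Map<string, number>([
  [encodeKey(["Sample 1", "clonotype_A"]), 100],
  [encodeKey(["Sample 1", "clonotype_B"]), 50],
  [encodeKey(["Sample 2", "clonotype_A"]), 200],
]);

function lookup(sampleId: string, clonotypeKey: string): number | undefined {
  return readCounts.get(encodeKey([sampleId, clonotypeKey]));
}
```

A key tuple that has no record, such as ("Sample 2", "clonotype_B"), simply yields undefined: the column is a sparse mapping, not a dense matrix.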
Example 2: CDR3 amino acid sequence
Another p-column might store the amino acid sequence of the CDR3 region, which is a key part of a clonotype's identity.
- Axes: A String axis named pl7.app/vdj/clonotypeKey, which uniquely identifies the clonotype.
- Value: String.
- Name: pl7.app/vdj/sequence.
- Domain: It would include a domain like {"pl7.app/vdj/feature": "CDR3", "pl7.app/alphabet": "aminoacid"} to specify precisely what kind of sequence it is.
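The domain is what tells apart columns that share a name. The sketch below selects columns by name plus a required subset of domain entries. The SpecSketch shape, the helper functions, and the second candidate column (with alphabet "nucleotide") are hypothetical and exist only to make the example meaningful.

```typescript
// Hypothetical minimal shape: just the fields needed for matching.
interface SpecSketch {
  name: string;
  domain?: Record<string, string>;
}

// True when every required domain entry is present with the same value.
function matchesDomain(spec: SpecSketch, required: Record<string, string>): boolean {
  const domain = spec.domain || {};
  return Object.keys(required).every((k) => domain[k] === required[k]);
}

function findColumns(
  specs: SpecSketch[],
  name: string,
  required: Record<string, string>,
): SpecSketch[] {
  return specs.filter((s) => s.name === name && matchesDomain(s, required));
}

const candidates: SpecSketch[] = [
  {
    name: "pl7.app/vdj/sequence",
    domain: { "pl7.app/vdj/feature": "CDR3", "pl7.app/alphabet": "aminoacid" },
  },
  {
    // Hypothetical sibling column, assumed here only for contrast.
    name: "pl7.app/vdj/sequence",
    domain: { "pl7.app/vdj/feature": "CDR3", "pl7.app/alphabet": "nucleotide" },
  },
];

const aminoAcidCdr3 = findColumns(candidates, "pl7.app/vdj/sequence", {
  "pl7.app/vdj/feature": "CDR3",
  "pl7.app/alphabet": "aminoacid",
});
```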
Joins
In the user interface of a block, p-frames are commonly visualized using a tabular representation (like a data grid) or passed to a component like GraphMaker, which can render the data on a plot.
The power of p-frames is that they can join together multiple p-columns that have compatible, but not identical, axes into a single, coherent view. Let's see how our two example p-columns—clonotype abundance and CDR3 sequence—would look when visualized as a table.
| Sample ID | Clonotype ID | CDR3 Sequence | Read Count |
|---|---|---|---|
| Sample 1 | clonotype_A | CASSLAPGATNEQFF | 100 |
| Sample 1 | clonotype_B | CASSLDRVGGYTF | 50 |
| Sample 2 | clonotype_A | CASSLAPGATNEQFF | 200 |
Here's what's happening:
- The Sample ID and Clonotype ID columns are derived from the axes of the abundance p-column.
- The Read Count column is derived from the values of the abundance p-column.
- The CDR3 Sequence column is derived from the values of the sequence p-column.
- The system automatically joins the data using the common clonotypeKey axis to correctly associate the sequence with the relevant abundance records.
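The join described above can be mimicked in a few lines of plain TypeScript. This is only an in-memory illustration of the idea, not how the p-frames engine is implemented. The record shapes and the function joinOnClonotype are assumptions, and the sketch keeps every abundance record even when no sequence is found.

```typescript
// Abundance column: axes (sampleId, clonotypeKey) -> read count.
interface AbundanceRecord {
  sampleId: string;
  clonotypeKey: string;
  readCount: number;
}

// One row of the joined, tabular view.
interface JoinedRow {
  sampleId: string;
  clonotypeKey: string;
  cdr3: string | undefined;
  readCount: number;
}

const abundance: AbundanceRecord[] = [
  { sampleId: "Sample 1", clonotypeKey: "clonotype_A", readCount: 100 },
  { sampleId: "Sample 1", clonotypeKey: "clonotype_B", readCount: 50 },
  { sampleId: "Sample 2", clonotypeKey: "clonotype_A", readCount: 200 },
];

// Sequence column: axis (clonotypeKey) -> CDR3 amino acid sequence.
const cdr3ByClonotype = new Map<string, string>([
  ["clonotype_A", "CASSLAPGATNEQFF"],
  ["clonotype_B", "CASSLDRVGGYTF"],
]);

// For each abundance record, look up the sequence through the shared axis.
function joinOnClonotype(
  records: AbundanceRecord[],
  sequences: Map<string, string>,
): JoinedRow[] {
  return records.map((r) => ({
    sampleId: r.sampleId,
    clonotypeKey: r.clonotypeKey,
    cdr3: sequences.get(r.clonotypeKey),
    readCount: r.readCount,
  }));
}

const rows = joinOnClonotype(abundance, cdr3ByClonotype);
```

Because the sequence column has no sample axis, the single sequence stored for clonotype_A is repeated on both the Sample 1 and Sample 2 rows.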
Technical implementation and storage
Specification storage
A p-column's specification, which contains all of its metadata (name, types, axes, domain, annotations), is always stored as a simple JSON resource.
Data storage
The actual data of a p-column—the mapping from keys to values—is stored in one of several formats, chosen based on the data's size and intended use:
- Binary Format: Used for most biological data generated by analysis workflows. This format is highly efficient for storing very large datasets (billions of records) of primitive types (String, Int, Double, etc.) and can be partitioned. Partitioning allows for both efficient parallel processing and quick access to specific data slices without needing to read the entire dataset.
- JSON Format: Used for smaller datasets, particularly metadata that needs to be accessed quickly from the block's model or UI without invoking the backend p-frames engine. A good example is sample metadata provided by the "Samples & Data" block.
- Resource Maps: A special format used for p-columns whose values are references to other resources, most commonly files. For instance, a p-column mapping sample IDs to their corresponding FASTQ files would use a resource map.
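To illustrate why partitioning gives quick access to slices, the sketch below groups records by their first axis key so that one sample's data can be read without scanning the rest. It is a conceptual, in-memory illustration only: the on-disk binary layout is not described here, and the names Rec and partitionByFirstAxis are hypothetical.

```typescript
// A record of a two-axis column: [sampleId, clonotypeKey] -> value.
interface Rec {
  key: [string, string];
  value: number;
}

// Group records by the first axis key; each group stands in for one partition.
function partitionByFirstAxis(records: Rec[]): Map<string, Rec[]> {
  const partitions = new Map<string, Rec[]>();
  for (const r of records) {
    const existing = partitions.get(r.key[0]);
    if (existing) {
      existing.push(r);
    } else {
      partitions.set(r.key[0], [r]);
    }
  }
  return partitions;
}

const partitions = partitionByFirstAxis([
  { key: ["Sample 1", "clonotype_A"], value: 100 },
  { key: ["Sample 1", "clonotype_B"], value: 50 },
  { key: ["Sample 2", "clonotype_A"], value: 200 },
]);

// Reading one slice touches only that partition.
const sample2 = partitions.get("Sample 2") || [];
```

Each partition can also be handed to a separate worker, which is the parallel-processing benefit mentioned above.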
All p-column resources are reference-counted. When no blocks reference a p-column, it is flagged for garbage collection to reclaim storage.
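Reference counting in general works as in the toy sketch below. The class RefCounter and its methods are hypothetical and say nothing about how the Platforma backend tracks references; they only show the acquire, release, and flag cycle.

```typescript
// Toy reference counter: a resource is flagged once its count drops to zero.
class RefCounter {
  private counts = new Map<string, number>();
  private flagged = new Set<string>();

  // A block starts referencing the resource.
  acquire(id: string): void {
    this.counts.set(id, (this.counts.get(id) || 0) + 1);
    this.flagged.delete(id);
  }

  // A block stops referencing the resource.
  release(id: string): void {
    const current = this.counts.get(id);
    if (current === undefined) return; // unknown resource: nothing to do
    if (current <= 1) {
      this.counts.delete(id);
      this.flagged.add(id); // eligible for garbage collection
    } else {
      this.counts.set(id, current - 1);
    }
  }

  isFlagged(id: string): boolean {
    return this.flagged.has(id);
  }
}

const refs = new RefCounter();
refs.acquire("column-1"); // referenced by a first block
refs.acquire("column-1"); // referenced by a second block
refs.release("column-1");
const flaggedWhileInUse = refs.isFlagged("column-1");
refs.release("column-1");
const flaggedWhenUnused = refs.isFlagged("column-1");
```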
Related docs
📄️ P-column Spec
A p-column's structure is formally defined by its specification (PColumnSpec). This specification is the metadata that details the column's identity, the type of data it holds, its dimensional axes, and various other attributes for context and behavior.
📄️ Usage in Model
The Block Model, written in TypeScript, acts as the bridge between the workflow and the UI. It serves two primary functions. First, it defines the block's arguments (args) and generates choice options for the UI components (like dropdowns) that set these arguments. Second, it processes results from the workflow and the project's result pool to generate outputs for visualization. This creates a reactive loop: user selections in the UI update the args, which in turn can trigger recalculations of the model's outputs.
📄️ Label p-columns
In the Platforma ecosystem, we distinguish between stable, machine-readable identifiers and mutable, human-readable labels. For example, a sample has a unique, permanent sampleId, but a user might want to refer to it with a descriptive name like "Day 7 Post-Infusion". If this descriptive name were used as the primary key, changing it would trigger massive recalculations.
📄️ Usage in Workflow
The workflow, written in a Tengo-based scripting language, is where the core data processing logic of a block is described. It defines input data requirements, executes bioinformatics tools, processes the results, and exports new p-columns and p-frames for use by the UI or other blocks.
📄️ XSV Conversion Spec
The Platforma SDK provides tools for converting between p-frames and flat file formats like CSV or TSV (XSV). This is handled by the xsv and pframes libraries, which wrap the pframes-conv command-line tool. This document details the specifications for xsv.importFile and the XSV file builders, complete with real-world examples.
📄️ ProcessColumn API
The pframes.processColumn function is a powerful and advanced utility in the Workflow SDK for performing mapping and aggregation operations on a p-column. It allows you to iterate over a p-column's data, apply custom logic for each entry (or group of entries) via a separate Tengo template, and generate one or more new p-columns or other artifacts as output.
📄️ Standard Annotations
This document lists standard annotations used in block development for p-column specifications. These annotations provide hints for the UI, define behavior in workflows, and add semantic meaning to the data. They are applied to the annotations field of either a PColumnSpec or an AxisSpec.