P-frames and p-columns

A p-column is the basic data structure of the Platforma ecosystem. It provides a structured, scalable representation of biological and experimental data.

Definition and concept

A p-column is a typed, structured column of data that serves as a mapping from a tuple of keys (called axes) to a value. Each p-column is defined by its specification, a metadata document that describes the types and meaning of its axes, the type of the value it stores, and annotations that provide additional context.

P-columns are the building blocks for p-frames. A p-frame brings together a collection of p-columns whose axes are compatible, allowing them to be joined and queried together as a cohesive, multidimensional dataset. Unlike a simple table where all columns share one index, different p-columns within a p-frame can have different, though related, sets of axes. This makes them highly flexible for modeling the complex, hierarchical data commonly found in biological experiments.

Data and axis types

A p-column's specification defines the data types for its axes and values.

  • Axis types can be: Int, Long, or String.
  • Value types can be: Int, Long, String, Float, Double, or File (to reference file resources).
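The allowed types can be written down as a small TypeScript sketch. The type and field names used here (AxisType, ValueType, ColumnSpec, axes, valueType) are assumptions chosen for illustration and may not match the identifiers used by the Platforma SDK.

```typescript
// Hypothetical names for illustration; not the Platforma SDK's own definitions.
type AxisType = "Int" | "Long" | "String";
type ValueType = "Int" | "Long" | "String" | "Float" | "Double" | "File";

interface AxisSpec {
  name: string;
  type: AxisType;
}

interface ColumnSpec {
  name: string;
  valueType: ValueType;
  axes: AxisSpec[];
  domain?: Record<string, string>;
  annotations?: Record<string, string>;
}

const AXIS_TYPES: readonly string[] = ["Int", "Long", "String"];

// Runtime check that a type name is allowed for an axis.
function isAxisType(t: string): t is AxisType {
  return AXIS_TYPES.indexOf(t) !== -1;
}
```

Under these definitions the compiler rejects an axis declared as Double or File, which matches the rule that only Int, Long, and String may be used as keys.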

P-columns in Platforma

Role in data organization

In Platforma, p-columns are the atomic units of data storage and manipulation. All biological data, from raw sequencing files to processed results, is represented as p-columns within p-frames. Keeping all data in this structured form supports efficient processing and querying, and lets workflow blocks exchange data with one another.

Relationship with p-frames

A p-frame is a container for a set of related p-columns. When a p-column is created and stored, it is assigned a unique ID. A p-frame simply groups these p-columns, allowing blocks to operate on them collectively. The model and UI can then resolve a p-frame to access all the p-columns it contains for display or further processing.

Practical examples

Example 1: Clonotype read abundance

A common p-column in VDJ analysis stores the read count for each clonotype in a given sample. Its specification might look like this:

  • Axes: Two String axes, one named pl7.app/sampleId and the other pl7.app/vdj/clonotypeKey.
  • Value: Long integer.
  • Name: pl7.app/vdj/readCount.
  • Annotations: It would be annotated with pl7.app/isAbundance: true and pl7.app/abundance/unit: reads to give it clear semantic meaning for other blocks.
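As a rough sketch, the specification above could be written as the following object. The field names (valueType, axes, annotations) and the choice to keep annotation values as strings are assumptions made for this example; only the pl7.app/... names come from the description above. The helper isAbundance is hypothetical and shows how another block might use the annotation.

```typescript
// Illustrative shape only; field names are assumptions, not the SDK's schema.
const readCountSpec = {
  name: "pl7.app/vdj/readCount",
  valueType: "Long",
  axes: [
    { name: "pl7.app/sampleId", type: "String" },
    { name: "pl7.app/vdj/clonotypeKey", type: "String" },
  ],
  annotations: {
    // Assumption: annotation values are kept as strings.
    "pl7.app/isAbundance": "true",
    "pl7.app/abundance/unit": "reads",
  },
};

// Hypothetical helper: detect abundance columns by annotation
// rather than by column name.
function isAbundance(spec: { annotations?: Record<string, string> }): boolean {
  return spec.annotations?.["pl7.app/isAbundance"] === "true";
}

const abundanceFlag = isAbundance(readCountSpec);
```

Checking the annotation instead of the name lets a downstream block accept any abundance measure, whatever the column is called.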

Example 2: CDR3 amino acid sequence

Another p-column might store the amino acid sequence of the CDR3 region, which is a key part of a clonotype's identity.

  • Axes: A String axis named pl7.app/vdj/clonotypeKey, which uniquely identifies the clonotype.
  • Value: String.
  • Name: pl7.app/vdj/sequence.
  • Domain: It would include a domain like {"pl7.app/vdj/feature": "CDR3", "pl7.app/alphabet": "aminoacid"} to specify precisely what kind of sequence it is.
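The role of the domain can be sketched as follows: the column name says "a sequence", and the domain narrows it down to a specific feature and alphabet. The field names, the domainMatches helper, and the "nucleotide" alphabet value are assumptions for illustration only.

```typescript
type Domain = Record<string, string>;

// Illustrative spec for the CDR3 amino acid sequence column
// (field names are assumptions, not the SDK's schema).
const cdr3Spec = {
  name: "pl7.app/vdj/sequence",
  valueType: "String",
  axes: [{ name: "pl7.app/vdj/clonotypeKey", type: "String" }],
  domain: {
    "pl7.app/vdj/feature": "CDR3",
    "pl7.app/alphabet": "aminoacid",
  } as Domain,
};

// Hypothetical helper: true when every key/value pair in `query`
// is present in `domain`.
function domainMatches(domain: Domain, query: Domain): boolean {
  return Object.keys(query).every((k) => domain[k] === query[k]);
}

const isCdr3 = domainMatches(cdr3Spec.domain, { "pl7.app/vdj/feature": "CDR3" });
// "nucleotide" is a made-up value used only to show a non-match.
const isNucleotide = domainMatches(cdr3Spec.domain, { "pl7.app/alphabet": "nucleotide" });
```

Here isCdr3 is true and isNucleotide is false, so a block looking for a particular kind of sequence can select the right column even if several columns share the same name.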

Joins

In the user interface of a block, p-frames are commonly visualized using a tabular representation (like a data grid) or passed to a component like GraphMaker, which can render the data on a plot.

The power of p-frames is that they can join together multiple p-columns that have compatible, but not identical, axes into a single, coherent view. Let's see how our two example p-columns—clonotype abundance and CDR3 sequence—would look when visualized as a table.

Sample ID    Clonotype ID    CDR3 Sequence      Read Count
Sample 1     clonotype_A     CASSLAPGATNEQFF    100
Sample 1     clonotype_B     CASSLDRVGGYTF      50
Sample 2     clonotype_A     CASSLAPGATNEQFF    200

Here's what's happening:

  • The Sample ID and Clonotype ID columns are derived from the axes of the abundance p-column.
  • The Read Count column is derived from the values of the abundance p-column.
  • The CDR3 Sequence column is derived from the values of the sequence p-column.
  • The system automatically joins the data using the common clonotypeKey axis to correctly associate the sequence with the relevant abundance records.
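The join described above can be imitated in a few lines of TypeScript. This is a minimal in-memory sketch, not how the p-frames engine is implemented: the entry shapes, the joinOnClonotype function, and the choice to keep every abundance record (leaving the sequence undefined when none is found) are all assumptions.

```typescript
// Abundance p-column: keyed by (sampleId, clonotypeKey).
// Long values are modeled as number here for simplicity.
type AbundanceEntry = { key: [string, string]; value: number };
// Sequence p-column: keyed by (clonotypeKey) only.
type SequenceEntry = { key: [string]; value: string };

interface Row {
  sampleId: string;
  clonotypeId: string;
  cdr3: string | undefined;
  readCount: number;
}

const readCounts: AbundanceEntry[] = [
  { key: ["Sample 1", "clonotype_A"], value: 100 },
  { key: ["Sample 1", "clonotype_B"], value: 50 },
  { key: ["Sample 2", "clonotype_A"], value: 200 },
];

const cdr3Sequences: SequenceEntry[] = [
  { key: ["clonotype_A"], value: "CASSLAPGATNEQFF" },
  { key: ["clonotype_B"], value: "CASSLDRVGGYTF" },
];

// Join on the shared clonotypeKey axis: index the sequences once,
// then look each abundance record up by its second key component.
function joinOnClonotype(abundance: AbundanceEntry[], sequences: SequenceEntry[]): Row[] {
  const seqByClonotype: Record<string, string> = {};
  for (const s of sequences) {
    seqByClonotype[s.key[0]] = s.value;
  }
  return abundance.map((a) => ({
    sampleId: a.key[0],
    clonotypeId: a.key[1],
    cdr3: seqByClonotype[a.key[1]],
    readCount: a.value,
  }));
}

const rows = joinOnClonotype(readCounts, cdr3Sequences);
```

The result has one row per abundance record. The sequence for clonotype_A appears twice, once for each sample, even though it is stored only once in the sequence p-column.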

Technical implementation and storage

Specification storage

A p-column's specification, which contains all of its metadata (name, types, axes, domain, annotations), is always stored as a simple JSON resource.

Data storage

The actual data of a p-column—the mapping from keys to values—is stored in one of several formats, chosen based on the data's size and intended use:

  • Binary Format: Used for most biological data generated by analysis workflows. This format is highly efficient for storing very large datasets (billions of records) of primitive types (String, Int, Double, etc.) and can be partitioned. Partitioning allows for both efficient parallel processing and quick access to specific data slices without needing to read the entire dataset.

  • JSON Format: Used for smaller datasets, particularly metadata that needs to be accessed quickly from the block's model or UI without invoking the backend p-frames engine. A good example is sample metadata provided by the "Samples & Data" block.

  • Resource Maps: A special format used for p-columns whose values are references to other resources, most commonly files. For instance, a p-column mapping sample IDs to their corresponding FASTQ files would use a resource map.
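To illustrate why partitioning gives quick access to a slice, here is a small sketch that groups records by their first axis. It assumes, for illustration only, that partitions are formed from the leading axis; the actual binary layout and partitioning rules are not described here, and the names Entry and partitionByFirstAxis are hypothetical.

```typescript
type Entry = { key: [string, string]; value: number };

// Group entries by the first axis so that each group could be
// stored, read, and processed independently of the others.
function partitionByFirstAxis(entries: Entry[]): Record<string, Entry[]> {
  const parts: Record<string, Entry[]> = {};
  for (const e of entries) {
    const partitionKey = e.key[0];
    const bucket = parts[partitionKey] ?? (parts[partitionKey] = []);
    bucket.push(e);
  }
  return parts;
}

const partitions = partitionByFirstAxis([
  { key: ["Sample 1", "clonotype_A"], value: 100 },
  { key: ["Sample 1", "clonotype_B"], value: 50 },
  { key: ["Sample 2", "clonotype_A"], value: 200 },
]);

// Reading one sample's slice touches only that partition.
const sample1Slice = partitions["Sample 1"] ?? [];
```

With this layout, a request for a single sample reads one partition rather than the whole dataset, and separate partitions can be processed in parallel.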

All p-column resources are reference-counted. When no blocks reference a p-column, it is flagged for garbage collection to reclaim storage.

Usage in Model

The Block Model, written in TypeScript, acts as the bridge between the workflow and the UI. It serves two primary functions. First, it defines the block's arguments (args) and generates choice options for the UI components (like dropdowns) that set these arguments. Second, it processes results from the workflow and the project's result pool to generate outputs for visualization. This creates a reactive loop: user selections in the UI update the args, which in turn can trigger recalculations of the model's outputs.