
How to design a phage data system

Design decisions and considerations behind building a system for collecting data from every part of the phage therapy pipeline

October 07, 2022

Jan Zheng

Cover image: How to design a phage data system. Image credit: Jan Zheng, DALL-E 2

For phage therapy to be efficient, cost-effective, and reliable at scale, every stage of the phage therapy pipeline — phage hunting, matching, sequencing, manufacturing, administration, and therapeutic monitoring — needs to be measured.

Data collected from each stage tells us that a phage remains safe and viable, and helps us easily track down issues. By looking at patterns in our data, we can replace methods with safer and more efficient alternatives.

To scale up phage treatment, we need to scale up data collection. This means that both adding data and drawing useful insights and patterns from that data need to be easy.

At Phage Australia, we’re building a data system that tracks and connects our phage and bacterial data with outputs from the rest of the therapy pipeline. We use this data to generate safety and viability reports, and we will eventually use emerging patterns to make our phage therapy process cheaper, faster, and safer.

This post is a continuation of the informal “Data Series” posts and builds on some of the previous concepts like how to keep track of phages, how to name a phage, and how to organize biobank data.

Considerations of a phage data system

Previously, we discussed how to organize a collection and how to accession items into it. We covered core identity data (a phage name), descriptive data (where it came from), and instance data (which fridge the sample is in). We also covered derived properties like how the phage behaves against a range of hosts.

A phage data system collects these properties for phages and strains at scale. Most of the data it collects about its phages and strains will be derived — meaning this data will come from experiments, device measurements, plaque assay images, CSVs and spreadsheets, and bioinformatics file outputs.

This is fundamentally a hard problem. The system should aggregate and make sense of both well-structured relational data and semi- and unstructured data. And because we’re discussing phage therapy at scale, it needs to support a diversity of labs, researchers, methods, protocols, and tools.

If we’re designing such a system for both data capture and regulatory accountability, we need to start with a few core assertions about the system’s design. Namely, the system needs to be auditable, replicable, and accessible.

1. The System needs to be Auditable

To make the system auditable, we can borrow concepts from finance, such as: an append-only log (no edits or deletes are allowed; any corrections are added as new entries), event-driven architecture (e.g. the number of books available in a library is a consequence of the books previously checked in and out), and double-entry bookkeeping (every transaction is recorded as both a debit and a credit event, and the two need to match).
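As a rough illustration of the first two ideas, here is a minimal Python sketch of an append-only log where the current stock of a phage sample is derived by replaying events. The class names, field names, event kinds, and sample IDs are all hypothetical, and this is not a description of the actual Phage Australia system. Double-entry bookkeeping is not shown.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """One immutable log entry. All field names here are hypothetical."""
    sample_id: str    # e.g. a phage lot identifier
    kind: str         # "deposit", "withdraw", or "correction"
    quantity: int     # vials moved; a correction may be negative
    recorded_by: str  # lab member or machine that produced the entry
    recorded_at: str  # ISO 8601 timestamp, set when the entry is appended


class AppendOnlyLog:
    """A log that only grows: there is no edit or delete method."""

    def __init__(self):
        self._events = []

    def append(self, sample_id, kind, quantity, recorded_by):
        if kind not in ("deposit", "withdraw", "correction"):
            raise ValueError(f"unknown event kind: {kind}")
        event = Event(sample_id, kind, quantity, recorded_by,
                      datetime.now(timezone.utc).isoformat())
        self._events.append(event)
        return event

    def events(self):
        # Hand out a tuple so callers cannot mutate the underlying list.
        return tuple(self._events)

    def vials_on_hand(self, sample_id):
        """Stock is derived by replaying events; it is never stored or edited."""
        total = 0
        for event in self._events:
            if event.sample_id != sample_id:
                continue
            if event.kind == "withdraw":
                total -= event.quantity
            else:  # deposits and signed corrections are added
                total += event.quantity
        return total


log = AppendOnlyLog()
log.append("PHAGE-001", "deposit", 10, "alice")
log.append("PHAGE-001", "withdraw", 3, "bob")
# Bob actually took 2 vials, not 3: append a correction instead of editing.
log.append("PHAGE-001", "correction", 1, "bob")
print(log.vials_on_hand("PHAGE-001"))  # 8
```

Because the mistaken withdrawal stays in the log next to its correction, an auditor can see both what was originally recorded and how it was fixed.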

The system should also be accountable, meaning each piece of data should be attributed to a lab member and/or the machine that produced the result. Any “mysterious, unaccounted-for data” needs to be invalidated. This can be implemented through account management and lab agreements.

Finally, the system should be tamper-resistant. If there are any signs that the data has been changed, either through editing or through data corruption (e.g. a faulty hard drive), the data should be invalidated. “Tamper-evident seals” can be created with cryptographic techniques like checksums and Merkle trees.
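To make the idea of a “tamper-evident seal” concrete, here is a small Python sketch that checksums each record with SHA-256 and folds the checksums into a single Merkle root. The record fields and values are made up for illustration, and duplicating the last hash on odd-sized levels is a common simplification rather than a hardened design.

```python
import hashlib
import json


def checksum(record):
    """SHA-256 of a record serialized with sorted keys, so equal records hash equally."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def merkle_root(leaf_hashes):
    """Fold a list of hex digests pairwise until a single root digest remains."""
    if not leaf_hashes:
        raise ValueError("need at least one leaf")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd-sized levels
        level = [
            hashlib.sha256((level[i] + level[i + 1]).encode("ascii")).hexdigest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Illustrative records only; the field names and values are hypothetical.
records = [
    {"phage_id": "PHAGE-001", "assay": "plaque", "titer_pfu_per_ml": 2.0e9},
    {"phage_id": "PHAGE-001", "assay": "endotoxin", "eu_per_ml": 0.4},
    {"phage_id": "PHAGE-002", "assay": "plaque", "titer_pfu_per_ml": 5.0e8},
]
sealed_root = merkle_root([checksum(r) for r in records])

# Changing any single value changes the root, which exposes the edit.
tampered = [dict(r) for r in records]
tampered[1]["eu_per_ml"] = 0.1
print(merkle_root([checksum(r) for r in tampered]) == sealed_root)  # False
```

If the root is stored somewhere the lab cannot quietly change (for example, in a published report), anyone can later recompute it from the records and compare.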

An auditable system creates accountability and trust in the output of each step of the phage production pipeline. Traceable outputs also help us more easily track and fix errors, generate reports for compliance and regulations, cite and reward contributors who provide data, and improve inefficient methods.

Finally, the data we collect and the reports we generate will build a track record for each phage, which should increase our confidence in the phage’s efficacy and safety profile.

This track record also helps us show authenticity and provenance. A phage provided with an auditable track record should inspire more confidence in its safety and efficacy compared to a phage without any data and questionable provenance.

And with provenance and authenticity systems in place, we can start doing neat things like sharing phages with others and collaboratively generating data.

2. The System needs to support Replicability

An auditable system is useless if the data isn’t accurate.

Our data is only as good as our protocols, methods, and machines. If triplicates are necessary for valid results, then replicates from additional labs should further validate those results.

The system needs to support adding replicated data from multiple labs working on phages from the same production batch with the same protocols. Pooling the data allows us to get better averages and detect anomalies. Such a “Proof of Data” mechanism would allow multiple labs to separately confirm and validate a phage’s profile in a decentralized way.
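As a sketch of what pooling might look like, the Python below averages each lab's replicates, then flags labs whose average sits far from the others using a modified z-score based on the median absolute deviation. The lab names, the values, and the 3.5 threshold (a common rule of thumb) are assumptions for illustration, not part of any established “Proof of Data” protocol.

```python
from statistics import mean, median


def pool_replicates(measurements, threshold=3.5):
    """Pool per-lab replicate values and flag labs that disagree with the rest.

    measurements maps a lab name to its replicate values for one phage batch
    and one protocol (for example, log10 PFU/mL from a triplicate plaque assay).
    """
    lab_means = {lab: mean(values) for lab, values in measurements.items()}
    center = median(lab_means.values())
    # Median absolute deviation: a spread estimate one outlier cannot inflate much.
    mad = median(abs(m - center) for m in lab_means.values())

    flagged = []
    for lab, m in lab_means.items():
        # Modified z-score; 0.6745 rescales the MAD toward a standard deviation.
        score = 0.6745 * abs(m - center) / mad if mad > 0 else 0.0
        if score > threshold:
            flagged.append(lab)

    consistent = [m for lab, m in lab_means.items() if lab not in flagged]
    return {"pooled_mean": mean(consistent), "flagged_labs": flagged}


# Hypothetical triplicates (log10 PFU/mL) from four labs on the same batch.
result = pool_replicates({
    "lab_a": [9.1, 9.0, 9.2],
    "lab_b": [9.0, 9.1, 8.9],
    "lab_c": [9.2, 9.1, 9.3],
    "lab_d": [7.0, 7.2, 7.1],
})
print(result["flagged_labs"])  # ['lab_d']
```

A flagged lab is not necessarily wrong. The flag is a prompt to check the protocol, the sample, or the equipment before the data is pooled.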

To support replicability, the system needs to be “federated”. This means that labs will generate and collect their own data, but then elect to consolidate their data into a central registry. As long as the lab follows the protocols and collects data in a way that meets requirements, their data (and even their phages and hosts) can be made available to the wider network.
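One way to picture this arrangement: each lab keeps its own records, and a central registry accepts a submitted record only if it meets the shared requirements. The Python sketch below assumes a hypothetical set of required fields and a hypothetical protocol ID; the real schema and acceptance rules would be defined by the network.

```python
# Hypothetical shared requirements: field name -> expected Python type.
REQUIRED_FIELDS = {
    "phage_id": str,
    "host_strain": str,
    "protocol_id": str,
    "lab_id": str,
    "recorded_by": str,
    "value": float,
}


def validate(record, accepted_protocols):
    """Return a list of problems; an empty list means the record meets requirements."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}")
    if record.get("protocol_id") not in accepted_protocols:
        problems.append("protocol not accepted by the registry")
    return problems


class CentralRegistry:
    """Accepts records from any lab, as long as they pass validation."""

    def __init__(self, accepted_protocols):
        self.accepted_protocols = set(accepted_protocols)
        self.records = []

    def submit(self, record):
        problems = validate(record, self.accepted_protocols)
        if not problems:
            self.records.append(dict(record))  # store a copy, not the lab's own object
        return problems


registry = CentralRegistry({"plaque-assay-v1"})
good = {
    "phage_id": "PHAGE-001", "host_strain": "HOST-042",
    "protocol_id": "plaque-assay-v1", "lab_id": "lab_a",
    "recorded_by": "alice", "value": 9.1,
}
bad = {"phage_id": "PHAGE-001", "value": "high"}

print(registry.submit(good))  # []
print(len(registry.submit(bad)) > 0)  # True
```

Labs can run the same `validate` function locally before they submit, so a rejection from the registry should rarely come as a surprise.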

With data federation and replicability, multiple labs across the network could create their own specialized phage collections, send those phages to other labs for replication, and trust that if they followed all the protocols, the federated system could generate reports on their phages and make them generally available.

3. The System needs to be Accessible

For a system like this to work, it needs to be openly available to all. All protocols, processes, requirements, schemas, and tools need to be available and easy to use.

This way, any lab can follow the right protocols and generate the correct kind of data. The data also needs to be openly available for auditing, reporting and further research. This can all be achieved through a combination of web interfaces, reporting tools and APIs.

Lastly, the system needs to be openly available (through the website and through open source), so others can contribute to the growth of the system.

New tools and techniques emerge constantly, and the system needs to account for the rapidly changing landscape.

Only with everyone working together can we hope to create a system that benefits us all.

Looking forward: “bio analytics”

The eventual goal of the system is to turn the data into actionable insights and strategic decisions about how to make phage therapy cheaper, faster, and safer.

In the business world, data scientists inspect, clean, transform, and manage business data for data analysts, who build models on top of the data to inform conclusions and support strategic decision-making. Bioinformatics is the discipline of applying data science to biology, mainly with a focus on discovering new biological insights.

The phage field is missing the equivalent of “data analysts”: people who build models on top of bioinformatics outputs and wet lab data to inform conclusions and support strategic decisions about the phage therapy process itself.

In the next issue, we’ll dig deeper into data analytics, and explore what a data pipeline for phage therapy might look like.