maggtomic (metadata aggregation using the Datomic information model) is intended to be a
modular system for metadata management.
The primary near-term use case is support of metadata submission, processing, and management for the National Microbiome Data Collaborative (NMDC) pilot project. A priority of this project is the cultivation of an open (and open-source-powered) data ecosystem.
The development status is currently a mix of Planning and Pre-Alpha (as per https://pypi.org/classifiers/).
A priority for maggtomic is agility. The system must accommodate, with little impedance, the ongoing
introduction of a wide variety of data sources and sinks, all of which need to be findable, accessible,
interoperable, and reusable (FAIR).
The Datomic information model facilitates such agility via its singular, universal relation. It extends the W3C standard Resource Description Framework (RDF) information model, which was designed to facilitate interoperability of distributed data, with minimal annotation to support ACID transactions. These transactions are reified as durable entities and thus may be annotated with provenance, enabling historical auditing and qualified reproducibility. Each and every fact-as-of-now (a so-called datom) is recorded as a 5-tuple: an RDF triple of entity-attribute-value, a transaction id (annotated with the transaction wall time as a separate fact), and whether the fact is an assertion or retraction.
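As a rough illustration of the 5-tuple described above, the following Python sketch models a datom and derives the facts that hold as of a given transaction. The `Datom` and `as_of` names, the integer transaction ids, and the `ex:` identifiers are assumptions made for this example, not part of maggtomic or Datomic.

```python
from typing import Any, NamedTuple


class Datom(NamedTuple):
    """One fact: entity, attribute, value, transaction id, assertion flag."""
    e: str
    a: str
    v: Any
    tx: int
    added: bool  # True for an assertion, False for a retraction


def as_of(datoms, tx):
    """Return the set of (e, a, v) triples that hold as of transaction `tx`."""
    facts = set()
    # Replay the log in transaction order; within one transaction,
    # retractions (added=False) sort before assertions (added=True).
    for d in sorted(datoms, key=lambda d: (d.tx, d.added)):
        if d.tx > tx:
            break
        if d.added:
            facts.add((d.e, d.a, d.v))
        else:
            facts.discard((d.e, d.a, d.v))
    return facts


# Hypothetical log: transaction 2 corrects a value asserted in transaction 1.
log = [
    Datom("ex:sample1", "ex:depth_m", 10, tx=1, added=True),
    Datom("ex:sample1", "ex:depth_m", 10, tx=2, added=False),
    Datom("ex:sample1", "ex:depth_m", 12, tx=2, added=True),
]
```

Because nothing is overwritten, `as_of(log, 1)` and `as_of(log, 2)` give different answers from the same log. That is the sense in which the model supports historical auditing.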
To implement the Datomic information model in an open-source system that facilitates agility
for the motivating near-term use case -- supporting the NMDC pilot project -- the most important operational
considerations are familiarity and manageability (see: familiarity) with the chosen technology. maggtomic
chooses MongoDB. Why?
- Much of the infrastructural support for NMDC is located at two U.S. Dept. of Energy (DOE) user facilities: Joint Genome Institute (JGI) and National Energy Research Scientific Computing Center (NERSC). The JGI Archive and Metadata Organizer (JAMO), which in turn uses NERSC hardware and staff support, manages user-facing metadata with MongoDB.
- Other large user-facing facilities use MongoDB for (meta)data management through NERSC, such as the Advanced Light Source (ALS) user facility and the Materials Project (MP).
- Another large project with a focus on biological metadata management, the Center for Expanded Data Annotation and Retrieval (CEDAR), uses MongoDB as a metadata repository.
So, there is strong operational familiarity among relevant stakeholders both at the level of infrastructure
support and of suitability for scientific-domain modeling. But what about system features necessary to
support adequate performance of the Datomic information model? Firstly, maggtomic is intended to support
the needs of a pilot project, so "adequate" is an important qualifier on expected performance. Secondly, there
are several features of MongoDB, and choices for configuration, that help address performance concerns:
- redundant indexing to support a variety of access patterns: Datomic redundantly stores all data in at least 4 sort orders, including one index that covers a subset of datoms to support reverse attribute lookup. This redundancy is to flexibly support key-value-store-oriented, row-oriented, column-oriented, document-oriented, and graph-oriented access patterns in the same system. For this functionality, MongoDB supports multiple (covering) compound indexes, and partial indexes, on a collection.
- compression: Because (a) all data is stored redundantly in each of several indexes, and (b) all data is immutable (accumulate-only, great for historical auditing and qualified reproducibility), Datomic index segments are highly compressed. With the MongoDB (default) WiredTiger storage engine, compression is supported for all collections and indexes, with different options to trade off higher compression rates versus CPU usage. The zstd library available in MongoDB 4.2 seems appropriate here, with a higher compression rate than the default snappy option and lower CPU usage (and also higher compression rate) than the zlib option (previously the only built-in alternative to snappy). Furthermore, indexes for WiredTiger collections are prefix-compressed by default: queries against an index, including covering queries, operate directly on the compressed index -- i.e., it remains compressed in RAM.
- fixed-space identifiers rather than redundant string storage: A separate mechanism for space reduction in Datomic is the storage of entity references as numbers rather than storing larger string values. This is particularly important when using the distributed-data model of RDF and thus using fully qualified names. MongoDB supports this space reduction strategy, either via the built-in 12-byte ObjectId reference type, or via an alternative strategy such as Crockford's Base32 for deserializable-for-humans numerical entity IDs.
- transactions: A performance concern in the sense that losing data is poor performance (!). MongoDB supports multi-document transactions as of 4.0 (and across a sharded deployment as of 4.2), and configurable write- and read-concern levels.
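To make the identifier strategy in the list above more concrete, here is a minimal sketch of Crockford's Base32 applied to integer entity IDs. The function names `encode_id` and `decode_id` are hypothetical, and the sketch omits the optional check symbol that the encoding also defines.

```python
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford: no I, L, O, U
DECODE = {ch: i for i, ch in enumerate(ALPHABET)}
# Look-alike characters are accepted when decoding human-typed input.
DECODE.update({"O": 0, "I": 1, "L": 1})


def encode_id(n: int) -> str:
    """Encode a non-negative integer entity ID as a Crockford Base32 string."""
    if n < 0:
        raise ValueError("entity IDs must be non-negative")
    digits = []
    while True:
        n, r = divmod(n, 32)
        digits.append(ALPHABET[r])
        if n == 0:
            break
    return "".join(reversed(digits))


def decode_id(s: str) -> int:
    """Decode a Crockford Base32 string (case-insensitive, hyphens ignored)."""
    n = 0
    for ch in s.upper().replace("-", ""):
        n = n * 32 + DECODE[ch]
    return n
```

The database would store only the integer. The string form is for people, for example in URLs or when reading an ID aloud.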
There is also the matter of supporting analogues to Datomic schema, query, and
transaction functions, all of which in turn support effective and productive interaction with the underlying
information model. maggtomic chooses Python as the language for client interfaces, as this language
is in heavy use by stakeholders.
- For schema support, rather than translate the ad hoc vocabulary used in Datomic, maggtomic aims to support a subset of the RDF-based W3C Shapes Constraint Language (SHACL) standard, which admittedly was only finalized as a standard in 2017, whereas Datomic schema was launched earlier. Crucially, Python tooling such as pySHACL exists to validate data graphs against SHACL shape graphs. Various SHACL node shapes can be checked -- these are analogous to JSON Schema documents, with the advantage of structural sharing-by-reference of SHACL property shapes, whereas the equivalent of property shapes needs to be restated for each JSON Schema document. Through this mechanism, for example, a metadata submission can conform to multiple "templates", and suggestions can be derived to bring a submission into compliance with one or more templates not initially considered by a submitter.
- For query, maggtomic aims to leverage the expressiveness of the MongoDB query language and the MongoDB aggregation pipeline to provide a query interface similar in appearance and composability to Datomic's variant of datalog.
- For transaction functions, maggtomic aims to provide Python functions that return e.g. lists of dictionaries that correspond to tiny MongoDB documents as new datoms, MongoDB aggregation pipeline stages, etc. Modulo performance considerations, transaction functions or query predicates may be arbitrary Python functions, just as their equivalents in Datomic may be arbitrary Clojure functions. In maggtomic, such functions would manifest e.g. as interruptions of a MongoDB aggregation pipeline.
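The transaction-function idea in the list above might look something like the following sketch. Everything here is hypothetical: the function names, and the document field names `e`, `a`, `v`, `tx`, and `o` (for "operation": `True` asserts, `False` retracts) are assumptions for illustration, not a schema that maggtomic defines.

```python
def assert_entity(entity, attributes, tx):
    """Return one datom document per attribute of a new entity."""
    return [
        {"e": entity, "a": a, "v": v, "tx": tx, "o": True}
        for a, v in attributes.items()
    ]


def replace_value(entity, attribute, old, new, tx):
    """Return a retraction of the old value plus an assertion of the new one."""
    return [
        {"e": entity, "a": attribute, "v": old, "tx": tx, "o": False},
        {"e": entity, "a": attribute, "v": new, "tx": tx, "o": True},
    ]


def match_attribute(attribute):
    """Return a MongoDB `$match` aggregation stage selecting one attribute."""
    return {"$match": {"a": attribute}}


# Plain data out: two small documents describing a correction.
docs = replace_value("ex:sample1", "ex:depth_m", 10, 12, tx=2)
```

Because the functions return plain lists and dictionaries, they can be tested without a database. A caller could then hand the documents to a driver call such as pymongo's `insert_many`, ideally inside a transaction.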
Certainly, not all of the above things need to be implemented prior to productive evaluative use of an
alpha version of maggtomic, but it's important to consider longer-term ramifications of choosing Python
and MongoDB to implement
(an ad hoc, informally-specified, bug-ridden, slow implementation of half of) the commercial Datomic offering,
even if one knows the motivating use case is for agility in the context
of a pilot system and thus one must
plan to throw one away; you will, anyhow.
Finally, maggtomic aims to provide interoperability among data sources and
sinks via translation between JSON-LD serializations (as JSON is a familiar
format for stakeholders) and the RDF graphs corresponding to values of the
maggtomic database as-of given times (and thus as a set of
entity-attribute-value tuples for a given filtration of transactions). Again,
Python tooling for this translation is crucial, and e.g. the
PyLD library is a JSON-LD processor
that supports necessary operations such as expansion+flattening --
context-annotated JSON-LD to RDF -- and framing(+compacting) -- RDF to
context-annotated JSON-LD, which can leverage specs, e.g. SHACL shape-graph
schemas and ontologies, installed as facts-as-of-now themselves in the database.
Dataflow may be handled via "builder" ETL processes as with the Materials
Project's maggma system. Though out of scope for the near term, it may be
possible to construct a timely dataflow system to
support adequately-performant interactive queries of the knowledge graph
embodied by a maggtomic database. The tuple space model (e.g. Linda) of
parallel programming may also be fruitful here.
For Web API support for metadata submission and search/retrieval,
maggtomic aims to include a FastAPI server module. For browser-based
metadata submission and basic search/retrieval, maggtomic
aims to include an (authentication-enabled)
static-site frontend that connects to the Web API.
Below is a breadboard sketch for the maggtomic
user interface (UI). Each place has affordances, and the
connection lines show how affordances take a user from place to place. In
addition, here each place represents an entity, and the breadboard doubles as an
entity relationship diagram (ERD) with relationship multiplicities labeled on
the solid edges between entities. Examples: a context associates with zero or
more ("0..n") datasets, a query associates with one and only one ("1")
context, etc.
Each of these entities is described in more detail below as part of the use cases illustrated by sequence diagrams.
Furthermore, the choice of entities and their relationships was inspired by the data.world catalog service.
Below are sequence diagrams for a core set of use cases. Each diagram sketches out a "happy path" and can be used as a checklist for a spike, where each arrow in a diagram corresponds to one checklist item.
A context can be likened to a "project" or "analysis", in that it serves as a mechanism to collect information, ask questions about it, and communicate answers. It's called a context because (a) it isn't limited to one focused activity with a beginning, middle, and end, as is the case for a project or analysis; and (b) its intent is similar to that of a JSON-LD context:
When two people communicate with one another, the conversation takes place in a shared environment, typically called "the context of the conversation". This shared context allows the individuals to use shortcut terms, like the first name of a mutual friend, to communicate more quickly but without losing accuracy. A context in JSON-LD works in the same way. It allows two applications to use shortcut terms to communicate with one another more efficiently, but without losing accuracy.
Specifically, a maggtomic context is used to map terms and relationships
among datasets linked to the context. The below diagram shows a user story for
creating a new context.
A dataset is an RDF graph, i.e. a set of entity-attribute-value triples for which all entities and attributes are URIs, and values are either URIs or data literals (strings, numbers, etc.). A dataset can be used in many contexts, and a context can link to many datasets.
A user imports a new dataset into maggtomic from a working context; other contexts may later also link to and thus use the dataset. A dataset may be uploaded, or HTTP endpoint information can be provided. For example, a URL can be entered for a simple GET request. More complex endpoint registration could be supported, such as using a POST request, providing authentication and other headers, asking that the dataset be re-fetched via the endpoint periodically, etc.
A user can import a dataset in non-RDF form, for example as a collection of TSV
tables or as a collection of JSON documents. In this case, the dataset will be
atomized, that is, destructured to atomic statements of the form
entity-attribute-value. URIs for a newly atomized dataset's entities and
attributes are prefixed using the user's maggtomic user/organization name and
the context's slug, e.g. http://maggtomic-host.example.com/awesomeorg/my-great-context,
analogous to the namespacing of e.g. GitHub code repositories.
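As a sketch of what atomization might involve for a single TSV table, the function below turns each non-empty cell into one entity-attribute-value statement. The function name and the URI layout beneath the context prefix (`<prefix>/<table>/<key>` for entities, `<prefix>/<table>#<column>` for attributes) are assumptions for this example; the text above fixes only the prefix.

```python
import csv
import io
from urllib.parse import quote

# Prefix built from the user/organization name and the context slug.
BASE = "http://maggtomic-host.example.com/awesomeorg/my-great-context"


def atomize_tsv(text, base_uri, table, key_column):
    """Destructure a TSV table into (entity, attribute, value) statements."""
    triples = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        entity = f"{base_uri}/{table}/{quote(row[key_column], safe='')}"
        for column, value in row.items():
            # The key column names the entity; empty cells assert nothing.
            if column == key_column or not value:
                continue
            attribute = f"{base_uri}/{table}#{quote(column, safe='')}"
            triples.append((entity, attribute, value))
    return triples


tsv = "id\tdepth_m\tsite\ns1\t10\tlake\ns2\t\tbog\n"
triples = atomize_tsv(tsv, BASE, "samples", "id")
```

Values stay as strings here. Assigning datatypes (numbers, dates, URIs) would be a separate step, possibly guided by a linked spec.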
The below diagram shows a user story for importing a new dataset.
A spec (for "specification") is a power-up for datasets that increases their FAIRness. A spec is also an RDF graph, and can represent something to be "applied" to a dataset, meaning additional facts are inferred from the dataset in this context -- that is, after verifying that the dataset doesn't contain facts that contradict the spec. A spec of this kind is also called an ontology.
A spec can also be something a dataset is validated against for conformance,
e.g. a schema (called a shape in SHACL). A dataset needn't be conformant to
all such linked specs in a context; rather, linking them can help the
maggtomic system generate suggested mappings. Each context has a "local" spec
where a user can use the namespace of the context to manage a controlled
vocabulary (dictionary of terms) and mappings among terms. In this way, a user
can ensure dataset conformance to specs of interest.
The notion of "spec" here is inspired by that of clojure.spec: represented just like datasets, and more dynamic and flexible than a static system of types.
A central design goal of maggtomic is to facilitate the mapping of imported
datasets to shared specs. Thus, a user should be able to enter a context, import
their dataset, import a spec shared with them by a colleague (for example, the
NMDC Schema), and establish mappings among terms. If another user has already
imported a spec of interest, that spec may be linked from other contexts.
The below diagram shows a user story for importing a new spec. The flow is similar to that for importing a new dataset, but in this case no atomization is needed, as specs are presumed to be imported as RDF-serialized.
To support queries across datasets, a user must ensure mappings among terms in their working context's linked datasets and spec (unless the datasets are already linked through use of the same terms (URIs) for the same concepts -- the dream of Linked Data!). Imagine a simple context of one dataset and one spec: a user has imported a new dataset they wish to share with the community, and has linked to a previously imported recommended spec from the context where they imported their dataset. Now what?
A user can request suggestions for mappings from the maggtomic system. First,
the system should ensure that all inferences have been determined and persisted
given the context's linked datasets and specs. Inferences are
entity-attribute-value statements entailed by applying user-supplied ontology
specs to user-supplied data. In other words, inferences are mappings that are
already unambiguously implied by what the user supplied, so it would be
redundant to offer these mappings as suggestions to be confirmed for explicit
inclusion by the user in their context's local spec.
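The distinction between inferences and suggestions can be sketched as follows. This is a toy under stated assumptions: it applies only two RDFS entailment rules (rdfs9 and rdfs11, both about `rdfs:subClassOf`), the function names are hypothetical, and a real system would likely delegate to an existing reasoner.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def entail(triples):
    """Return `triples` plus everything entailed by rules rdfs9 and rdfs11."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            if p != SUBCLASS:
                continue
            for s2, p2, o2 in facts:
                # rdfs11: subClassOf is transitive.
                if p2 == SUBCLASS and o2 == s:
                    new.add((s2, SUBCLASS, o))
                # rdfs9: an instance of a subclass is an instance of the superclass.
                if p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))
        if new <= facts:  # fixed point: nothing further to infer
            return facts
        facts |= new


def pending_suggestions(candidates, data, ontology):
    """Drop candidate statements that are already asserted or entailed."""
    known = entail(set(data) | set(ontology))
    return [t for t in candidates if t not in known]


data = [("ex:s1", RDF_TYPE, "ex:SoilSample")]
ontology = [
    ("ex:SoilSample", SUBCLASS, "ex:Sample"),
    ("ex:Sample", SUBCLASS, "ex:Material"),
]
candidates = [
    ("ex:s1", RDF_TYPE, "ex:Material"),   # already entailed, so not offered
    ("ex:s1", RDF_TYPE, "ex:Biosample"),  # not entailed, so still a suggestion
]
suggestions = pending_suggestions(candidates, data, ontology)
```

Only the second candidate survives, because the first already follows from the data and ontology the user supplied.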
The below diagram shows a user story for requesting suggestions for mappings to apply to a context's local spec.
A user can curate a context to provide a data dictionary of terms and mappings
that empower and ease the construction of readable queries across datasets.
Mappings may be suggested by maggtomic or entered manually; in either case,
they are applied by a user.
Because datasets and specs are both represented as RDF, a user can update a dataset to include spec statements as part of the dataset itself, making it more readily interoperable. Hooray for Linked Data!
The below diagram shows a user story for applying mappings to a context's local spec. The validation step ensures consistency with the entailments of linked specs.
The standard protocol and language for queries across RDF datasets is
SPARQL, for which prefix mappings enable readable queries given URI terms. An
alternative query interface/language may be helpful, in particular one that is
based on data literals rather than strings, as is the case with Datomic's
Clojure/EDN-based datalog. Eve syntax is another
datalog variant that may be worth investigating. For maggtomic, the JSON-based
query and aggregation languages of MongoDB may prove fruitful for adaptation.
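To give a feel for a query expressed as data literals rather than as a string, here is a small in-memory sketch of datalog-style pattern matching over entity-attribute-value triples. The `find`/`where` document shape, the `?`-prefixed variable convention, and the function names are assumptions for illustration. The sketch does not use MongoDB and is not a proposed maggtomic query syntax.

```python
def is_var(term):
    """Variables are strings that start with '?'."""
    return isinstance(term, str) and term.startswith("?")


def match_clause(clause, triple, binding):
    """Extend `binding` so that `clause` matches `triple`, or return None."""
    binding = dict(binding)
    for term, value in zip(clause, triple):
        if is_var(term):
            # Bind the variable, or check consistency with an earlier binding.
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def query(find, where, triples):
    """Join all `where` clauses and project the `find` variables."""
    bindings = [{}]
    for clause in where:
        extended = []
        for b in bindings:
            for t in triples:
                m = match_clause(clause, t, b)
                if m is not None:
                    extended.append(m)
        bindings = extended
    return {tuple(b[var] for var in find) for b in bindings}


triples = [
    ("ex:s1", "ex:site", "ex:lake"),
    ("ex:s2", "ex:site", "ex:bog"),
    ("ex:s1", "ex:depth_m", 10),
    ("ex:s2", "ex:depth_m", 3),
]
# The query itself is JSON-compatible data: depths of samples taken at the lake.
q = {
    "find": ["?sample", "?depth"],
    "where": [["?sample", "ex:site", "ex:lake"],
              ["?sample", "ex:depth_m", "?depth"]],
}
result = query(q["find"], q["where"], triples)
```

Because the query is an ordinary dictionary, it can be stored, labeled, and composed programmatically, much like MongoDB query documents and aggregation pipelines.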
A query is labeled with a natural-language question, to facilitate user navigation and search for relevant queries that lead the user to relevant contexts and datasets.
The below diagram shows a user story for saving and running a query across one or more datasets of a working context.
An insight is a lightweight annotation for a context that communicates some result and its significance. It is a "post" by a user, with text and perhaps included visualizations in the form of e.g. PNG images. Ideally, an insight links to one or more specific queries of the context that support the insight.
No sequence diagram is shown yet for creating an insight because this is not
considered a core use case for a minimum viable product (MVP) demo of
maggtomic. Thus, it is not crucial at this time to sketch out a checklist for
a spike that demonstrates this feature.