From dataset to data network
One of the original motivations behind EarthCube, and one at the heart of the developing global web, is the concept of data as a commodity that is as reliable, extensible, and simple to use as the existing web of documents. We should be able to query the web for any observation or measurement that has been made by a human or sensor and shared. We should be able to send increasingly complex queries to the data as we come up with questions that have not been answered, and to keep track of those questions and answers to the point where the data network is able to suggest new questions that might be answered.
We are starting to see the data network take shape through implementations of the observations and measurements data model and the sensor web via Open Geospatial Consortium standards, interoperability experiments, and working implementations. Cutting edge work is going on around the edges of these standards to develop semantic integration techniques on top of transmission protocols like the Sensor Observation Service. New research infrastructures and data collections can and should be informed by both established and emerging technologies to build in methods of directly transmitting data into the global data network from the start. However, some of those technologies are still out of reach for smaller research teams, and we will likely continue to have hundreds of datasets created every year as discrete file entities with less than optimal metadata, formats, and structures that make them not directly compatible with the growing data network.
Data that start and stop in file formats intended primarily for offline use through desktop-based analytical tools are not easily accessible to the broader data network. We often rely on catalog records containing metadata of variable quality that provide some information about the properties in the data. We hope these catalogs are exposed in a way that we can at least discover the existence of data and ways to go about accessing them, but it might still be a fairly onerous chore to transform the data into some more usable form. In the Digital Crust building block activity, we are working to develop a framework to enable transformation of some types of observation and measurement data that achieves the following:
Automate inspection of a file to extract the data schema in a form that can be compared against known properties and schemas to determine or propose schema alignments.
Expose the data in a format accessible for streaming access by machine agents via an online API.
Formalize and record source information and transformation details in structured provenance records to ensure traceability from any future derivatives back to original data.
Test the framework for the groundwater-focused use case at the heart of the Digital Crust project and adjust according to results.
Methods and Technical Architecture
In pursuit of these goals, we are developing a working architecture using open source components, open standards, processing algorithms instantiated as microservices, and interfacing with a number of other EarthCube building blocks.
Files containing structured data are accessed from the cloud and processed in a transformation pipeline to get the data into a standard format with registered datatypes. These processed files are then used to support data services (Figure 1).
Figure 1. The data processing flow from random structured data to AVRO formatted, self-describing data files with registered data types that are used to support integrated data web services.
The USGS is implementing this processing component as sbFiles, a module in the USGS ScienceBase Platform. The component includes a cloud-adapted file handling system that serves as a broker to the various physical file stores that make up a logical online file system. Basic functionality includes the following features:
Logical file system structure across multiple file stores (local disk, local network, and remote cloud stores such as Amazon S3)
Checksums for file validity
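As an illustration of the checksum feature, the following is a minimal sketch of how a broker might verify that a file copied between stores is intact. The source does not say which algorithm sbFiles uses; SHA-256 and the function names `file_checksum` and `is_valid_copy` are assumptions made for this example.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    The algorithm is an assumption; any name accepted by hashlib.new works.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for large data files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_copy(path, expected_hex):
    """Compare a file's checksum with the value recorded for the original."""
    return file_checksum(path) == expected_hex
```

A store would record the digest when a file is first registered and call `is_valid_copy` after each transfer between physical stores.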
The sbFiles component is being engineered for portability so that it can be instantiated in any logical infrastructure, for example as an EarthCube Building Block.
A key part of this processing workflow is use of the Apache AVRO project, which provides a data serialization system, rich data structures, a compact, fast, binary data format, a container file format, and simple integration with dynamic languages. The data schema is included in the container file, allowing clients using the AVRO libraries to read data without a priori knowledge of the data structure.
Figure 2. Interaction of the transformation pipeline with the DataType Registry, Provenance Registry and Data Access Provider.
The transformation pipeline executes automated introspection processes on an incoming data file to attempt to match fields against registered datatypes. This process can be run in a fully automated mode, or with operator assistance for greater accuracy. If the collection of fields in a dataset matches a registered datatype, the input dataset is categorized with that type. If not, a new type is defined in the registry. The schema extracted by inspecting the data is encoded as an AVRO schema, and the data are serialized with the schema and sent to the Serialized Data Cache in the Data Access Provider.
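The introspection step described above can be sketched with the Python standard library alone. This is not the project's implementation: the type-inference rules, the sample groundwater columns, and the names `infer_avro_type` and `extract_schema` are all assumptions chosen for illustration. The output is a record schema in the JSON form that Avro defines.

```python
import csv
import io
import json


def infer_avro_type(values):
    """Pick the narrowest Avro primitive type that fits every non-empty value."""
    present = [v for v in values if v != ""]

    def all_parse(cast):
        try:
            for v in present:
                cast(v)
            return True
        except ValueError:
            return False

    if present and all_parse(int):
        base = "long"
    elif present and all_parse(float):
        base = "double"
    else:
        base = "string"
    # Empty cells make the field nullable (an Avro union with "null").
    return ["null", base] if len(present) < len(values) else base


def extract_schema(csv_text, record_name):
    """Build an Avro record schema (as a dict) from CSV text with a header row.

    Assumes the column headers are already valid Avro field names.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fields = [
        {"name": name, "type": infer_avro_type([(row[name] or "") for row in rows])}
        for name in (reader.fieldnames or [])
    ]
    return {"type": "record", "name": record_name, "fields": fields}


# Hypothetical sample data, not taken from the project.
SAMPLE_CSV = (
    "well_id,depth_m,water_level_m,aquifer\n"
    "W-001,120,35.2,Ogallala\n"
    "W-002,85,,Ogallala\n"
)

schema = extract_schema(SAMPLE_CSV, "GroundwaterWell")
print(json.dumps(schema, indent=2))
```

In operator-assisted mode, a person would review and correct the inferred types before the schema is registered.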
Comparison of a new source schema against previously encountered schemas, including those encountered through external authorities, will return the degree of alignment for both the schema itself and its properties. Based on the degree of match, a set of assembly rules will determine the possible read schemas that can be executed against the new data. Those will be passed back to the sbFiles store and recorded in provenance as actions taken on the source data.
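The source does not specify how the degree of alignment is computed. As one possible sketch, the comparison below treats a property as matched when both its name and its Avro type agree, and scores the schema as the share of matched properties among all properties seen in either schema. The metric, the function names, and the sample fields are assumptions.

```python
import json


def field_signature(schema):
    """Map each field name to a comparable string form of its Avro type."""
    return {f["name"]: json.dumps(f["type"], sort_keys=True) for f in schema["fields"]}


def alignment(new_schema, known_schema):
    """Report property-level matches and an overall schema score in [0, 1]."""
    new, known = field_signature(new_schema), field_signature(known_schema)
    shared_names = set(new) & set(known)
    exact = {n for n in shared_names if new[n] == known[n]}
    all_names = set(new) | set(known)
    return {
        "schema_score": len(exact) / len(all_names) if all_names else 1.0,
        "matched_properties": sorted(exact),
        "type_conflicts": sorted(shared_names - exact),
        "unmatched_properties": sorted(all_names - shared_names),
    }


# Hypothetical schemas for illustration only.
new_schema = {"type": "record", "name": "NewWells", "fields": [
    {"name": "well_id", "type": "string"},
    {"name": "depth_m", "type": "long"},
    {"name": "water_level_m", "type": "double"},
]}
known_schema = {"type": "record", "name": "KnownWells", "fields": [
    {"name": "well_id", "type": "string"},
    {"name": "depth_m", "type": "double"},
    {"name": "aquifer", "type": "string"},
]}

report = alignment(new_schema, known_schema)
print(report)
```

Assembly rules could then key off this report, for example offering a known schema as a read schema only when there are no type conflicts.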
Based on the extracted schema and its mapping to existing data types in the Data Type (or Property/Schema) registry, file content can be mapped to one or more ‘read schemas’ that the Avro application makes accessible as binary data structures for exposure on the net. Each of these data type read schemas (including the source schema) becomes a resource accessible via an HTTP API-based distribution point for the data.
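To make the idea of a read schema concrete, the sketch below projects an already-decoded record onto a read schema. It follows the general rules of Avro schema resolution for records (fields missing from the read schema are ignored, fields missing from the data take the read schema's default, and a missing field without a default is an error), but it is a plain-Python stand-in rather than a call into the Avro libraries. The function name and sample fields are assumptions.

```python
def apply_read_schema(record, read_schema):
    """Project one decoded record (a dict) onto a read schema."""
    projected = {}
    for field in read_schema["fields"]:
        name = field["name"]
        if name in record:
            projected[name] = record[name]
        elif "default" in field:
            # The data were written without this field; use the declared default.
            projected[name] = field["default"]
        else:
            raise KeyError(f"no value or default for field {name!r}")
    return projected


# Hypothetical record and read schema for illustration only.
source_record = {"well_id": "W-001", "depth_m": 120, "water_level_m": 35.2}
read_schema = {"type": "record", "name": "WellSummary", "fields": [
    {"name": "well_id", "type": "string"},
    {"name": "depth_m", "type": "long"},
    {"name": "aquifer", "type": ["null", "string"], "default": None},
]}

print(apply_read_schema(source_record, read_schema))
```

Serving the same stored data through several such projections is what lets one container file back multiple API resources.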
PROV and Provenance Registry
Each stage of the processing system records actions in W3C-PROV notation to a provenance registry service (ProvAAS from the EarthCube CINERGI project). PROV entries will ultimately be based almost entirely on persistent identifier references (URIs) to registered collections of entities, agents, and actions.
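A provenance entry for one pipeline stage might look like the sketch below, which loosely follows the PROV-JSON layout (entities, activities, agents, and the relations between them). The identifiers, the `ex:` prefix, and the function name are hypothetical, and nothing here reflects the actual interface of the ProvAAS service.

```python
import json


def prov_record(source_id, output_id, activity_id, agent_id, ended_at):
    """Describe one transformation step as a PROV-style document (a dict)."""
    return {
        "prefix": {"ex": "http://example.org/"},  # placeholder namespace
        "entity": {source_id: {}, output_id: {}},
        "activity": {activity_id: {"prov:endTime": ended_at}},
        "agent": {agent_id: {}},
        # The activity read the source file...
        "used": {"_:u1": {"prov:activity": activity_id, "prov:entity": source_id}},
        # ...and produced the serialized output.
        "wasGeneratedBy": {"_:g1": {"prov:entity": output_id, "prov:activity": activity_id}},
        "wasAssociatedWith": {"_:a1": {"prov:activity": activity_id, "prov:agent": agent_id}},
        # Direct link so that derivatives can be traced back to the original data.
        "wasDerivedFrom": {"_:d1": {"prov:generatedEntity": output_id,
                                    "prov:usedEntity": source_id}},
    }


record = prov_record(
    source_id="ex:wells.csv",
    output_id="ex:wells.avro",
    activity_id="ex:serialization-run-1",
    agent_id="ex:data-serialization-service",
    ended_at="2015-01-01T00:00:00Z",  # illustrative timestamp
)
print(json.dumps(record, indent=2))
```

In a registry built on persistent identifiers, the placeholder `ex:` names would be replaced by URIs for registered entities, agents, and actions.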
Metadata Assembly Service
The Metadata constructor uses a combination of original source documentation (e.g., some form of metadata record), information about the data schema from the processing pipeline, and a provenance trace of everything that happened to that source entity through the process to dynamically generate a metadata record through another microservice. The metadata will be generated in the ISO 19115 standard and encoding, and published for harvesting by any number of data catalogs. Those records will include distribution links to the API interfaces with available read schemas (Figure 1) to provide users and software developers with multiple avenues for data access.
Figure 3. Detail of the Data file to dataNet transformation pipeline
Data File to Datanet Transformation Pipeline
The pipeline is a dispatcher system based on Spring MVC that processes files through a set of discrete microservices (Figure 3):
Processable Filetype Service - determines ability and methods to process various types of files through the rest of the pipeline; focus initially on valid CSV and similar column/row text formats
Schema Sussing Service - reads the file and extracts its schema, encoded as an Apache Avro schema in JSON
Schema Registry Service - registers the extracted schema with the Property/Schema Registry and returns relative property and schema matches with other registered schemas
Data Serialization Service - serializes the data into the Apache Avro binary format
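The dispatcher pattern listed above can be sketched as an ordered list of stages, each taking and returning a job description. The project's dispatcher is built on Spring MVC and calls separate microservices; the Python version below only illustrates the control flow. Every stage body is a simplified placeholder, and the serialization stage emits JSON lines as a stand-in for the Avro binary format.

```python
import json


def check_filetype(job):
    # Placeholder rule: only comma-separated text is treated as processable.
    if not job["filename"].lower().endswith(".csv"):
        raise ValueError("unsupported file type: " + job["filename"])
    return job


def extract_schema(job):
    header = job["text"].splitlines()[0].split(",")
    job["schema"] = {"type": "record", "name": "Dataset",
                     "fields": [{"name": h.strip(), "type": "string"} for h in header]}
    return job


def register_schema(job):
    # Stand-in for a registry call: derive a key from the sorted field names.
    job["schema_key"] = ",".join(sorted(f["name"] for f in job["schema"]["fields"]))
    return job


def serialize(job):
    names = [f["name"] for f in job["schema"]["fields"]]
    rows = job["text"].splitlines()[1:]
    # JSON lines here; the real service writes an Avro container file.
    job["serialized"] = [json.dumps(dict(zip(names, r.split(",")))) for r in rows if r]
    return job


STAGES = [
    ("processable-filetype", check_filetype),
    ("schema-sussing", extract_schema),
    ("schema-registry", register_schema),
    ("data-serialization", serialize),
]


def dispatch(job, stages=STAGES):
    """Run each stage in order, recording which ones completed."""
    job = dict(job, completed=[])
    for name, stage in stages:
        job = stage(job)
        job["completed"].append(name)
    return job


result = dispatch({"filename": "wells.csv", "text": "well_id,depth_m\nW-001,120\n"})
print(result["completed"])
```

Keeping each stage behind the same job-in, job-out interface is what allows a stage to be replaced by a remote service call without changing the dispatcher.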
The results of the pipeline for a given source dataset include the following:
A new Avro-based data container including an explicit schema and binary data
A set of possible read schemas resulting from analysis within the property/schema registry that can also act on the same data
PROV signals indicating actions taken and recording new connections