Project 418: An EarthCube Initiative in Web-based Geoscience Data Discovery

Project 418 is a pilot project managed by the EarthCube Science Support Office (ESSO) that addresses some of the core activities envisioned for the EarthCube Cyberinfrastructure: Resource Registration, Data Discovery, and Data Access. Project 418 serves as a starting point for these tasks, provides a foundation for future initiatives, and is intended to become a core component linking data facilities and EarthCube-funded projects.

1. Background

The EarthCube Council of Data Facilities (CDF) is a federation of existing and emerging geoscience data facilities that serves as a foundation for EarthCube and cyberinfrastructure for the geosciences. Its current membership includes more than 40 data facilities. The CDF Registry Working Group (RWG) was formed to review the alignment of existing approaches to research facility description and discovery; it includes the EarthCube CDF, the Coalition for Publishing Data in the Earth and Space Sciences (COPDESS), and the Registry of Research Data Repositories (re3data). The CDF RWG determined that NSF-funded domain repositories currently have no common way to share information about each repository and its data holdings, and that there is an urgent need to share this information with the larger Earth and Space Science (ESS) community. This includes providing and facilitating support for the FAIR Principles, a set of guiding principles to make data Findable, Accessible, Interoperable, and Reusable. The CDF RWG also concluded that data management and preservation are best conducted by domain repositories, and it developed guidelines for what information would be valuable to share and a machine-readable method to publish that information.

2. Utilization of Schema.org

The CDF RWG recommended the standards at Schema.org as an optimal way to publish information about organizations and datasets. Schema.org is a collaborative community activity founded by Google, Microsoft, Yahoo, and Yandex to create, maintain, and promote schemas for structured data on the Internet. Its vocabularies cover entities, relationships between entities, and actions, and they can easily be extended through a well-documented extension model. Currently, over 10 million sites use Schema.org to mark up their web pages.

3. Principles Over Project

Project 418 is a technical implementation of Schema.org that seeks to demonstrate common publishing approaches for Data Facility holdings using Schema.org and extensions. The project helps place data in context and encourages interoperability among facilities. Project 418 addresses the F (Findable) in FAIR in a scalable manner and in a way that anyone can adopt and implement. It reduces the a priori knowledge needed by all actors (e.g., facilities, developers, scientists).

4. Project 418 Overview

The results of the CDF RWG cover only information about the data facilities themselves, published using Schema.org/Organization and JSON-LD formatted metadata; they do not cover the individual datasets under curation. JSON (JavaScript Object Notation) is a standard web format for storing and exchanging data. JSON-LD encodes Linked Data in the JSON format.
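As a rough illustration of the facility-level metadata described above, the following Python sketch builds a Schema.org/Organization description and serializes it as JSON-LD. The facility name, URLs, and the helper `to_jsonld_script` are placeholders invented for this example; they are not taken from the CDF RWG guidelines.

```python
import json

# Hypothetical facility description; every value below is a placeholder.
facility = {
    "@context": "https://schema.org/",
    "@type": "Organization",
    "@id": "https://example.org/facility",
    "name": "Example Geoscience Data Facility",
    "url": "https://example.org/",
    "description": "A placeholder repository used only for illustration.",
}


def to_jsonld_script(doc: dict) -> str:
    """Wrap a JSON-LD document in the script tag commonly used to embed it in HTML."""
    body = json.dumps(doc, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


print(to_jsonld_script(facility))
```

Embedding the output in a facility's landing page is one common way to expose such metadata to crawlers; the properties a real facility should publish are a matter for the project's publishing guidance.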

Project 418 is a pilot EarthCube project that attempts to:

  • Extend the CDF RWG methodology to data holdings
  • Investigate usage of schema.org/Dataset for Data Facilities
  • Implement schema.org/Dataset with geoscience specific vocabularies
  • Assist Data Facilities with adoption of metadata implementation
  • Develop a cloud-based software stack for crawling, indexing, and access
  • Develop sample user interfaces for accessing cloud-based software stack
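To make the idea of dataset-level metadata more concrete, here is a minimal Python sketch of a schema.org/Dataset record with a spatial extent. The function `make_dataset`, the sample values, and the choice of properties are assumptions for illustration only; the sketch uses core Schema.org terms and does not attempt to show the geoscience-specific vocabulary extensions the project investigates.

```python
import json


def make_dataset(name, description, url, keywords, south, west, north, east):
    """Build a schema.org/Dataset description as a plain dictionary.

    The bounding box is written as "south west north east", one common
    reading of the GeoShape "box" property (an assumption of this sketch).
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": keywords,
        "spatialCoverage": {
            "@type": "Place",
            "geo": {
                "@type": "GeoShape",
                "box": f"{south} {west} {north} {east}",
            },
        },
    }


# Placeholder values, not a real dataset.
example = make_dataset(
    name="Example sediment core descriptions",
    description="Placeholder dataset used only for illustration.",
    url="https://example.org/dataset/1",
    keywords=["sediment", "core"],
    south=20.0, west=-98.0, north=30.0, east=-80.0,
)
print(json.dumps(example, indent=2))
```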

5. Utilization of Web Standards

Project 418 seeks to leverage existing web standards such as HTML5, the W3C Semantic Web, and JSON-LD formatted metadata. By utilizing existing web publication patterns, it incorporates community tools and libraries and applies best practices for resource interfaces. Google Research is currently developing approaches to organically harvest this information, but only for the core Schema.org/Dataset type and not for geoscience vocabulary extensions. The result of this work will enable a broader community to discover the data facilities and their data holdings through search engines like Google and Bing.
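One way to picture the harvesting side of this publication pattern is a small extractor that pulls JSON-LD blocks out of an HTML landing page. The sketch below uses only the Python standard library; the class `JsonLdExtractor`, the function `extract_jsonld`, and the sample page are invented for this example and do not describe how the project's own crawler or any search engine works.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so buffer it.
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than failing the whole page


def extract_jsonld(html: str) -> list:
    """Return every JSON-LD document embedded in the given HTML text."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Hypothetical landing page with one embedded Dataset description.
page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset", "name": "Example dataset"}
</script>
</head><body><p>Landing page</p></body></html>"""

docs = extract_jsonld(page)
print(docs[0]["name"])
```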

6. Scope and Timeline

Project 418 is being developed under a limited timeline with only a six-month development window. The project ends around April 2018 (hence the name), and the deliverables will be publicized at the June EarthCube All Hands Meeting.

Project 418 has a limited scope.

  • Work with a small set (~10) of NSF data providers for pilot implementation
  • Apply alignment of Schema.org with external community vocabularies
  • Enable simple search capability over text/keyword and spatial constraints
  • Deploy interlinked computational modules using the XSEDE JetStream cloud, an NSF-funded computational and web service platform for science
  • Create interactive Python Jupyter notebooks, R/RStudio Markdown notebooks, and web-browser UI components
  • Develop a robust collection of publishing guideline documents and links
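
The simple search capability named in the scope above, text/keyword matching combined with a spatial constraint, can be sketched over a handful of in-memory records. Everything here (the `Record` class, the function names, and the sample records) is hypothetical and says nothing about how the project's indexing services are actually built.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """A hypothetical indexed dataset with a bounding box in decimal degrees."""
    name: str
    description: str
    south: float
    west: float
    north: float
    east: float


def matches_text(record, query):
    """Case-insensitive match: every query term must appear in name or description."""
    haystack = f"{record.name} {record.description}".lower()
    return all(term in haystack for term in query.lower().split())


def intersects(record, south, west, north, east):
    """True when the record's box overlaps the query box (no antimeridian handling)."""
    return not (record.east < west or record.west > east
                or record.north < south or record.south > north)


def search(records, query="", bbox=None):
    """Filter records by text, then optionally by a (south, west, north, east) box."""
    hits = [r for r in records if matches_text(r, query)]
    if bbox is not None:
        hits = [r for r in hits if intersects(r, *bbox)]
    return hits


# Placeholder records, not real holdings.
records = [
    Record("Gulf sediment cores", "Core descriptions from a drilling cruise",
           20.0, -98.0, 30.0, -80.0),
    Record("Alpine GPS stations", "Continuous GPS time series",
           44.0, 5.0, 48.0, 15.0),
]

print([r.name for r in search(records, "cores")])
print([r.name for r in search(records, bbox=(40.0, 0.0, 50.0, 10.0))])
```

A linear scan like this only works for a toy collection; a deployed service would normally rely on a text index and a spatial index instead.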

7. Cloud-based Software Architecture

Project 418 Development Team

Doug Fils - Consortium for Ocean Leadership - Data Management Expert

Adam Shepherd - BCO-DMO - Technical Director

Eric Lingerfelt - EarthCube Science Support Office - Technical Officer

Project 418 Advisory Team (PAT)

Rick Benson - IRIS

Fran Boler - UNAVCO

Steve Kuehn - Concord University

Tanu Malik - DePaul University

Matt Mayernik - NCAR

Sarah Stamps - Virginia Tech

Project 418 Pilot Project Partners

NSF Funded Data Providers

Biological and Chemical Oceanography Data Management Office (BCO-DMO) - https://www.bco-dmo.org/

Continental Scientific Drilling Coordination Office (CSDCO) - https://csdco.umn.edu/

HydroShare - https://www.hydroshare.org/

Incorporated Research Institutions for Seismology (IRIS) - https://www.iris.edu/hq/

Interdisciplinary Earth Data Alliance (IEDA) - https://www.iedadata.org/

Magnetics Information Consortium (MagIC) - https://www2.earthref.org/MagIC

Neotoma Paleoecology Database and Community - https://www.neotomadb.org/

Open Core Data - http://opencoredata.org/

OpenTopography - http://www.opentopography.org/

UNAVCO - http://www.unavco.org/

EarthCube Funded Projects

Brokered Alignment of Long-Tail Observations (BALTO) - https://cires.colorado.edu/research/research-groups/project/balto-earthcube-brokered-alignment-long-tail-observations

LinkedEarth - http://linked.earth/

Upcoming Partners

Arctic Data Center - https://arcticdata.io/

Cloud-Hosted Real Time Data Services (CHORDS) - https://www.eol.ucar.edu/content/chords-cloud-hosted-real-time-data-services

National Center for Atmospheric Research (NCAR) - https://ncar.ucar.edu/home

Rolling Deck to Repository (R2R) - http://www.rvdata.us/

Unidata - https://www.unidata.ucar.edu/

And many more...

Project 418 Web User Portals

Text Search Web App - https://earthcube.org/webapps/p418/textSearch.html

Spatial Search Web App - https://earthcube.org/webapps/p418/spatialSearch.html

Project 418 Interactive Notebooks

Text Search Python Jupyter Notebook - https://github.com/earthcubearchitecture-project418/p418Notebooks/blob/master/Text_Search_Simple.ipynb

Text Search Python Notebook using iPyWidgets - https://github.com/earthcubearchitecture-project418/p418Notebooks/blob/master/Text_Search_Widgets.ipynb

Spatial Search Python Notebook - https://github.com/earthcubearchitecture-project418/p418Notebooks/blob/master/Spatial_Search_Simple.ipynb

Spatial Search Python Notebook using iPyWidgets and Folium Maps - https://github.com/earthcubearchitecture-project418/p418Notebooks/blob/master/Spatial_Search_Widgets.ipynb

R / RStudio Markdown Notebooks - https://github.com/earthcubearchitecture-project418/p418NotebooksR

MATLAB Live Script Notebooks - https://github.com/earthcubearchitecture-project418/p418NotebooksMATLAB

Project 418 Documentation for Data Providers

Main GitHub Documentation Repository - https://github.com/earthcubearchitecture-project418/p418Docs

Publishing Guidance - https://github.com/earthcubearchitecture-project418/p418Docs/blob/master/publishing.md

Vocabulary Guidance at GitHub - https://github.com/earthcubearchitecture-project418/p418Vocabulary

Vocabulary Guidance at Geodex.org - http://geodex.org/voc/

Project 418 Developer Links

Main GitHub Organization - https://github.com/earthcubearchitecture-project418

Web Services Repository - https://github.com/earthcubearchitecture-project418/services

Web App Repository - https://github.com/earthcubearchitecture-project418/webUI2

Crawler Repository - https://github.com/earthcubearchitecture-project418/gleaner

Data Repository - https://github.com/earthcubearchitecture-project418/assay-data