LinkedEarth: Crowdsourcing Data Curation & Standards Development in Paleoclimatology
Rationale and Scope
Current climate change must be understood within the context of past climate variations, which are inferred from indirect measurements known as paleoclimate observations. A grand challenge for paleoclimatology is that these observations come in very disparate formats, so there is no standard way to exchange these records between researchers, or with machines. This hinders their re-use and hence lowers their value to science and society. Traditionally, these observations have been archived in data warehouses where the experts who generated them have very little control over them. LinkedEarth set out to improve on this situation by creating an online platform that (1) enables the curation of a publicly accessible database by paleoclimate experts and (2) fosters the development of standards, so that paleoclimate data are easier to analyze, share, and re-use.
The basic premise of LinkedEarth is that no-one understands data better than the people who generated them. We thus set out to develop a platform that would enable paleoclimatologists to interact with data in an intuitive way, resulting in standardized datasets that are (by construction) editable, interoperable, and discoverable. Editability was achieved via a semantic wiki: the LinkedEarth platform is very similar in user experience to any other wiki (e.g. Wikipedia), and likewise tracks changes and attributes them to authenticated contributors (an ORCID is all that is required to join LinkedEarth). Under the hood, the LinkedEarth platform requires that terms be defined unambiguously: to this end, we developed the first paleoclimate ontology.
An ontology is a formal representation of the knowledge common to a field: take for example the term “proxy”, which is understood by any paleoclimatologist but is, at best, ambiguous to anyone outside the field. Ontologies formally define such terms, and have had an enormous impact in biomedical research, ranging from genomics to diseases to anatomy. To be useful, ontology standards need to be sufficiently rigid that dependent applications can rely on their structure being stable over time, yet sufficiently flexible to accommodate growth and evolution. Rigidity was provided by the Linked Paleo Data (LiPD) format, the ontology’s backbone. Flexibility was provided by the LinkedEarth wiki and its associated charter. This flexibility is essential to any long-term view of data stewardship: in coming years, paleoclimatologists will devise new methods, create new terms, deprecate old ones, and revise outdated interpretations. A fixed schema cannot accommodate such an evolution; only a flexible, organic structure can. LinkedEarth embraces this need for evolution.
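The balance between a rigid backbone and a flexible vocabulary can be illustrated with a minimal Python sketch. It is not the LinkedEarth or LiPD implementation: the class `Vocabulary`, the function `validate`, the set `REQUIRED_KEYS`, and the example terms are all hypothetical, chosen only to show how a small fixed core can coexist with terms that a community adds and deprecates over time.

```python
# Hypothetical sketch: a fixed core (required keys) plus an editable vocabulary.
# None of these names come from LiPD or the LinkedEarth platform.
REQUIRED_KEYS = frozenset({"dataSetName", "archiveType", "paleoData"})


class Vocabulary:
    """Community-editable terms; old terms are deprecated rather than deleted."""

    def __init__(self):
        self._definitions = {}   # term -> definition
        self._replaced_by = {}   # deprecated term -> replacement term

    def define(self, term, definition):
        self._definitions[term] = definition

    def deprecate(self, old_term, new_term):
        if new_term not in self._definitions:
            raise KeyError(f"unknown replacement term: {new_term}")
        self._replaced_by[old_term] = new_term

    def resolve(self, term):
        """Follow deprecations until a currently defined term is reached."""
        seen = set()
        while term in self._replaced_by:
            if term in seen:
                raise ValueError("deprecation cycle")
            seen.add(term)
            term = self._replaced_by[term]
        if term not in self._definitions:
            raise KeyError(f"undefined term: {term}")
        return term


def validate(record, vocabulary):
    """Check the rigid core, then normalize terms against the flexible vocabulary."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    normalized = dict(record)
    normalized["archiveType"] = vocabulary.resolve(record["archiveType"])
    return normalized


# Usage: a record written with an older term still validates after the
# community renames that term.
vocab = Vocabulary()
vocab.define("marine sediment", "Sediment deposited on the sea floor")
vocab.deprecate("ocean sediment", "marine sediment")
record = {"dataSetName": "Example01", "archiveType": "ocean sediment", "paleoData": []}
normalized = validate(record, vocab)
```

In this sketch, applications depend only on the required keys, so the vocabulary can evolve without breaking them.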
Because LinkedEarth datasets are based on LiPD, such datasets are uploaded or downloaded at the push of a button. Thus, any LinkedEarth-hosted dataset benefits from the entire LiPD research ecosystem. This makes LinkedEarth-hosted data inherently interoperable. LiPD only provides the bones, however. Community-led data standards (the flesh on the bones) arose via the unique capabilities of the LinkedEarth platform, including working groups, discussions, and polling (Gil et al., 2017). This standard is currently being written up as a paper, which will provide more opportunity for input and vetting by the paleoclimate community.
Lastly, the semantic part of LinkedEarth means that datasets are broadcast to the web using standard schemas, which make them discoverable by various search engines, including Google. Because of this outward-facing design, LinkedEarth was the first database to be integrated into EarthCube’s Project 418.
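As an illustration of what broadcasting a dataset with a standard schema can look like, the sketch below builds a schema.org `Dataset` description as JSON-LD and wraps it in the script tag that web crawlers read. That LinkedEarth uses schema.org specifically is an assumption here, as are the helper names `dataset_jsonld` and `as_script_tag`, the example values, and the placeholder URL; the actual markup emitted by the platform may differ.

```python
import json


def dataset_jsonld(name, description, url, latitude, longitude, keywords):
    """Build a schema.org Dataset description (hypothetical helper)."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": list(keywords),
        "spatialCoverage": {
            "@type": "Place",
            "geo": {
                "@type": "GeoCoordinates",
                "latitude": latitude,
                "longitude": longitude,
            },
        },
    }


def as_script_tag(doc):
    """Wrap JSON-LD in the script tag used to embed it in a web page."""
    body = json.dumps(doc, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Usage with made-up values; the URL is a placeholder, not a real dataset page.
doc = dataset_jsonld(
    name="Example coral record",
    description="Illustrative paleoclimate dataset entry.",
    url="https://example.org/datasets/example-coral-record",
    latitude=-17.5,
    longitude=-149.8,
    keywords=["paleoclimate", "coral"],
)
tag = as_script_tag(doc)
```

A search engine that indexes the page can read this block without knowing anything about the underlying data format.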
In just two years, LinkedEarth has brought to life a functional platform for the crowd-curation of paleoclimate data and an emerging data standard. Along the way, it also developed a paleoclimate-centric Python package (Pyleoclim), and provided a nucleus for interoperability of EarthCube paleodata.
Despite these rapid accomplishments, the vision still faces notable challenges. Firstly, it has proven difficult to elicit participation from a broad community: only 100 paleoclimatologists have answered our survey on paleoclimate data standards so far. Another issue concerns adoption: despite a considerable investment of resources (funding, personal time for participants), very few scientists are actively using LinkedEarth. Both of these challenges could be remedied if funding agencies and publishers drew attention to this resource. PAGES is playing a leading role in incentivising a new generation of paleoclimate scientists to curate high-quality data compilations. PAGES 2k is a case in point, having motivated the birth of LiPD, the need for crowd-curation, and many of the ontology’s categories, including the very concept of a compilation.
One persistent obstacle to adoption is the perceived redundancy with other data repositories. LinkedEarth was never designed to replace Pangaea or WDS-Paleo, and never will do so. LinkedEarth has been a laboratory to advance the notion of decentralized paleo data curation. WDS-Paleo will ensure its long-term sustainability, since it now accepts LiPD as a submission format. Because of LiPD’s structured nature, LinkedEarth integrates well with repositories: initial links have been developed with Neotoma, with more in the works.
The success of LinkedEarth will primarily be measured by whether this organic, decentralized approach to data curation takes hold in the community. We look forward to many more paleo data compilations being generated, discussed, and published on LinkedEarth. Every new working group brings with it new needs and constraints; so far, LinkedEarth’s intrinsic flexibility has enabled it to accommodate them all, and likely will for the foreseeable future.
Emile-Geay, J., and N. P. McKay (2016), Paleoclimate data standards, PAGES Magazine, 24, 47, doi:10.22498/pages.24.1.47.
Gil, Y., D. Garijo, V. Ratnakar, D. Khider, J. Emile-Geay, and N. McKay (2017), A controlled crowdsourcing approach for practical ontology extensions and metadata annotations, in The Semantic Web – ISWC 2017, doi:10.1007/978-3-319-68204-4_24.
Khider, D., F. Zhu, J. Hu, and J. Emile-Geay (2018), LinkedEarth/Pyleoclim util: Pyleoclim release v0.4.0, doi:10.5281/zenodo.1205662.
McKay, N. P., and J. Emile-Geay (2016), Technical note: The linked paleo data framework: a common tongue for paleoclimatology, Climate of the Past, 12(4), 1093–1100, doi:10.5194/cp-12-1093-2016.
McKay, N., J. Emile-Geay, C. Heiser, and D. Khider (2018), GeoChronR (Version 1.0.0), Zenodo, doi:10.5281/zenodo.60812.
PAGES2k Consortium (2017), A global multiproxy database for temperature reconstructions of the Common Era, Scientific Data, 4, 170088, doi:10.1038/sdata.2017.88.