While EarthCube strives to help others organize their own data, it has generated quite a bit of data of its own, especially in the form of collaborative documents and other research products. For some time, it has been in need of a consistent method for assigning unique identifiers to products that need them, and has also been in need of a document repository. We now have options for both.
EarthCube as an organization must try to follow its own recommendations, including those so elegantly laid out by Yolanda Gil et al. (2016) in the EarthCube Geoscience Paper of the Future initiative. When EarthCube extols to geoscientists the advantages of using unique identifiers, rich metadata, retrievably stored research products with good provenance, and documentation for replicability, then EarthCube must be willing to do the same itself in very practical ways. We tried to be consistent with those guidelines in making the string of choices underlying a document repository. An important additional consideration was making use of the FAIR Principles – making data and other digital research objects more Findable, Accessible, Interoperable, and Reusable – as outlined for geoscience applications by Shelley Stall et al. (2018).
With those guiding lights to illuminate a path, Leadership Council (LC) members Ouida Meier, Rebecca Koskela, and Ken Rubin led a search for appropriate solutions to simultaneously meet multiple needs. This quickly turned into a broader examination of DOIs, metadata schemas, repository choices, search interfaces, compatibilities of all sorts, sustainability, etc. – and not for the first time within EarthCube. Our NSF GEO Program Director, Eva Zanzerkia, encouraged the LC to choose something that could be put to work right away. We were able to make some choices that we think can be rapidly deployed, will be adequately flexible and durable over the long term, and will allow high-priority documents (and other products if desired) to be archived right away.
Unique Identifiers for Documents and Other Objects
Because unique identifiers are the basic handle for the whole Internet of data things, we started there. Every document needed to have a DOI – but what flavor? We considered the opportunity and our core priority requirements. Unique identifiers for individual documents would need to serve several purposes and meet several requirements: 1 – reference, 2 – linkages, 3 – being part of the system of Internet information things, 4 – the ability to cross-link identifiers, 5 – the ability to assign and use searchable keywords, 6 – low cost or free, and 7 – sustainable, even beyond the end of EarthCube funding. From there, we had the choice of using genuine DataCite DOIs or a repository-specific identifier. We decided to go with DataCite DOIs, since datacite.org is one of the places where one can actually go now to search for and reliably find EarthCube-related research products. At this writing, a search for “earthcube” at DataCite yielded 66 different research products of 9 different resource types in 12 different repositories. Therefore, at a bare minimum, if EarthCube researchers obtain a DataCite DOI and use “EarthCube” in the title, keywords, or abstract, that product should be findable. It also seems that DataCite is winning the unique identifier war (15.1 million objects so far), just as ORCID IDs are winning the researcher identifier competition (6.5 million issued so far). Using a DataCite DOI also meant that we were not limited to documents: working out a system for documents means that we could also assign DOIs to other research products, such as software, datasets, workflows, images, videos, presentations, collections, and more.
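The kind of search described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure we actually used: it assumes DataCite's public REST API at https://api.datacite.org/dois, and it assumes that each returned record carries its resource type and publisher under an "attributes" object. The helper names and the sample records (with made-up DOIs and repository names) are hypothetical.

```python
from collections import Counter
from urllib.parse import urlencode

# Public DataCite REST API endpoint for DOI searches.
DATACITE_API = "https://api.datacite.org/dois"


def build_search_url(term, page_size=100):
    """Return a DataCite search URL for a free-text query (hypothetical helper)."""
    params = {"query": term, "page[size]": page_size}
    return f"{DATACITE_API}?{urlencode(params)}"


def summarize(response):
    """Count records by resource type and by publisher (repository).

    Assumes a response shaped like {"data": [{"attributes": {...}}, ...]}.
    """
    types = Counter()
    publishers = Counter()
    for record in response.get("data", []):
        attrs = record.get("attributes", {})
        resource_type = (attrs.get("types") or {}).get("resourceTypeGeneral")
        publisher = attrs.get("publisher")
        if isinstance(publisher, dict):  # tolerate an object-valued publisher
            publisher = publisher.get("name")
        types[resource_type or "Unknown"] += 1
        publishers[publisher or "Unknown"] += 1
    return types, publishers


# Made-up sample records for illustration only; these are not real search results.
sample = {
    "data": [
        {"id": "10.1234/example-1",
         "attributes": {"publisher": "Example Repository A",
                        "types": {"resourceTypeGeneral": "Text"}}},
        {"id": "10.1234/example-2",
         "attributes": {"publisher": "Example Repository A",
                        "types": {"resourceTypeGeneral": "Software"}}},
        {"id": "10.1234/example-3",
         "attributes": {"publisher": "Example Repository B",
                        "types": {"resourceTypeGeneral": "Text"}}},
    ]
}

url = build_search_url("earthcube")
types, publishers = summarize(sample)
print(url)
print(f"{sum(types.values())} products, {len(types)} resource types, "
      f"{len(publishers)} repositories")
```

To run this against the live service, the URL could be fetched with urllib.request.urlopen and parsed with json.load before being passed to summarize; a full count would also need to page through the results, which this sketch does not do.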
Acquiring a DOI provides an obligation and an opportunity to provide metadata for each document or research product. The process can elicit information that meets several interrelated needs: defining authorship and contributorship (including acknowledgment of teams, the EarthCube office, NSF, and other funders); provenance (lineage and history); versioning; licensing; and an approval pathway for release. We noted that metadata required (or possible) for procuring a DOI may be distinct from metadata required (or possible) for storing in a repository. Therefore, for both immediate and future flexibility, the metadata requirements would need to be mappable to each other and to other systems. Current metadata specifications for DataCite are specified at http://schema.datacite.org/, currently at version 4.2. DataCite offers Mandatory, Recommended, and Optional metadata fields or properties; once a property is chosen, it may have subproperty requirements. This metadata specification’s requirements and options appeared to be minimalist but flexible, amenable to mapping to other systems, and likely to persist well into the long-term future. EarthCube example documents are shown in this spreadsheet: https://bit.ly/ecmdata.
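As a rough illustration of the Mandatory tier, the sketch below checks a draft record for the six properties that DataCite schema 4.x marks as mandatory (Identifier, Creator, Title, Publisher, PublicationYear, ResourceType). The dictionary keys, the helper name, and the example record are assumptions made for this sketch; they are not taken from the EarthCube spreadsheet or from any submission form.

```python
# Mandatory DataCite 4.x properties, written here as hypothetical dictionary keys.
MANDATORY = (
    "identifier",
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "resourceType",
)


def missing_mandatory(record):
    """Return the mandatory properties that are absent or empty in a draft record."""
    return [name for name in MANDATORY if not record.get(name)]


# Hypothetical draft record for an EarthCube governance document.
draft = {
    "creators": ["EarthCube Leadership Council"],
    "titles": ["Example Governance Document"],
    "publisher": "EarthCube",
    "publicationYear": "2019",
    "subjects": ["EarthCube", "governance"],  # a Recommended property, not Mandatory
}

print(missing_mandatory(draft))  # properties still needed before a DOI can be minted
```

A check like this only covers presence; subproperty requirements (for example, an identifier type or a general resource type) would need their own rules.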
While choosing a repository is the next step in this sequence, the question of which repository to use arose at every step of consideration, particularly since repositories typically have adopted some metadata schema already. Repository considerations included: 1 – appropriate metadata affordances given EarthCube as a community exemplar; 2 – structure, expandability, and interoperability; 3 – cost and openness for sustainability; 4 – live search interface, discoverability, and capability to discover similar resources; 6 – links to related entities (critical); 7 – compatibility with the developing EarthCube registry; 8 – a public-facing appearance that will reflect what EarthCube is and can do.
We examined several repositories. Being a DataONE member node was a strong positive attribute. Being free or low cost to use was another consideration: e.g., Dataverse, Zenodo, ResearchGate, and figshare were free options (and DataCite members). Figshare was a particularly interesting candidate, since a number of EarthCube products are already stored there and it is free for individuals to use. It also offered a very attractive organizational interface if we were willing to pay for it (example: ESIP), but that option turned out to carry very expensive annual costs and would not meet EarthCube’s sustainability needs. The repository ownership and funding model (university, government, non-profit) was also an issue for long-term sustainability.
All of these choices would require a very significant amount of volunteer time to move documents into the system and to provide the metadata, whether minimalist or rich, that we might want to harvest for DataCite DOIs. In conversations it turned out that the EarthCube office has a stated mandate to preserve EarthCube documents, and also that there might be a potential repository option at NCAR, the current home of the EarthCube Science Support Office (ESSO).
We spoke in some detail with Matt Mayernik and NCAR Library specialists. They have built collections for units of NCAR/UCAR within their OpenSky repository, but OpenSky uses a repository-specific unique identifier, and we had decided for multiple reasons that we strongly preferred a DataCite DOI and its metadata properties for long-term resource findability and sustainability. The Library system was not a member of DataCite, but UCAR was, and had obtained DOIs in addition to OpenSky identifiers for some selected research products. The OpenSky repository was also highly searchable, including – remarkably – the ability to search the full text of any documents stored as text (searchable pdf). This would be an advantage for EarthCube documents in particular, which were the original focus of the effort. The NCAR Library agreed to go the extra step of acquiring DataCite DOIs for all of the EarthCube products deposited there, and had already developed an interface that maps between its own metadata system (MODS) and DataCite’s metadata. The Library also determined that supporting EarthCube products was within its mandate as long as the EarthCube office was located at NCAR. If the EarthCube office were to move to another institution in the future, as it has in the past, documents already ingested into the NCAR Library would continue to be housed there in perpetuity, and would be available as a special collection and icon under the OpenSky umbrella. New negotiations would potentially need to take place if the office moves, but having a DataCite DOI for every resource stored in OpenSky means that each will ultimately remain findable through DataCite as well.
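To give a sense of what a mapping between MODS and DataCite metadata involves, here is a small sketch of a crosswalk. It is not the NCAR Library's interface, which we have not reproduced here: the flattened "path" keys, the choice of which MODS elements map to which DataCite properties, and the sample record are all assumptions made for illustration.

```python
# Hypothetical crosswalk from flattened MODS element paths to DataCite properties.
CROSSWALK = {
    "titleInfo/title": "titles",
    "name/namePart": "creators",
    "originInfo/publisher": "publisher",
    "originInfo/dateIssued": "publicationYear",
    "typeOfResource": "resourceType",
}


def mods_to_datacite(mods):
    """Map a flattened MODS record to DataCite-style properties.

    Returns the mapped record plus a list of MODS paths with no known target,
    so that gaps in the crosswalk are visible rather than silently dropped.
    """
    mapped = {}
    unmapped = []
    for path, value in mods.items():
        target = CROSSWALK.get(path)
        if target is None:
            unmapped.append(path)
        elif target == "publicationYear":
            mapped[target] = str(value)[:4]  # keep only the year of an ISO date
        else:
            mapped[target] = value
    return mapped, unmapped


# Made-up MODS-style record for illustration only.
mods_record = {
    "titleInfo/title": "Example EarthCube Report",
    "name/namePart": "Example Author",
    "originInfo/publisher": "EarthCube",
    "originInfo/dateIssued": "2019-01-15",
    "typeOfResource": "text",
    "physicalDescription/extent": "12 pages",
}

datacite_record, leftovers = mods_to_datacite(mods_record)
print(datacite_record)
print("unmapped:", leftovers)
```

Reporting the unmapped elements is a deliberate choice in this sketch: it shows where one schema carries information the other has no slot for, which is the kind of mismatch a real crosswalk has to resolve.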
We are now working to move test resources into the OpenSky repository as an EarthCube collection. We can also upload metadata descriptions and obtain DOIs for documents or resources stored in other repositories. A Google form is being set up for additional individual contributions, and the first items to be ingested will be those in the current EarthCube online document repository and key governance documents. We are also aware of multiple other kinds of products, such as videos on EarthCube’s YouTube channel, poster collections, software in the Ontosoft registry and repository, and other project outputs. All of these will be eligible for ingestion as metadata-only records, if wished, into the EarthCube OpenSky repository segment. We invite the EarthCube community to stay engaged, offer feedback on this, and prepare your items for submission as soon as initial workflows for ingestion are in place.
Gil, Y., C.H. David, I. Demir, B.T. Essawy, R.W. Fulweiler, J.L. Goodall, L. Karlstrom, H. Lee, H.J. Mills, J.‐H. Oh, S.A. Pierce, A. Pope, M.W. Tzeng, S.R. Villamizar, and X. Yu (2016), Towards the Geoscience Paper of the Future: Best Practices for Documenting and Sharing Research from Data to Software to Provenance. Earth and Space Science 3(10):387-444. DOI: 10.1002/2015EA000136.
Stall, S., et al. (2018), Advancing FAIR data in Earth, space, and environmental science, Eos, 99, DOI: 10.1029/2018EO109301. Published on 05 November 2018.
Grateful appreciation is extended to Emily Villasenor, Rebecca Koskela, and Ken Rubin for editing help, to Matt Mayernik and NCAR Libraries staff for thoughtful and focused discussion during the conclusion of this process, and to so many members of the EarthCube community for years of foundational work that allowed our group to make these decisions: like everything EarthCube does, real progress is always underpinned by deeply collective effort.