Title: Integrating Long-Tail Data and Model Resources for Advancing Earth System Science
Authors: Praveen Kumar, Mostafa Elag, Luigi Marini, Rui Liu, Pieshi Jiang: U. of Illinois,
Scott Peckham: U. of Colorado, Boulder,
Leslie Hsu: Lamont-Doherty Earth Observatory, Columbia U.
Often, scientists and small research groups collect data that target specific scientific issues and have limited geographic or temporal range. However, a large number of such collections together constitute a large database that is of immense value to the Earth science disciplines. These data have been defined as long-tail data. Reusing these data collections is complicated by heterogeneity in dimensions, coordinate systems, scales, variables, providers, users, and scientific contexts. Similarly, we use “long-tail models” to characterize a heterogeneous collection of models and/or modules developed for targeted problems by individuals and small groups, which together provide a large and valuable collection. Linking these models in a workflow is complicated because they use differing variable names and units for the same concept, run at different time steps and spatial resolutions, and follow differing naming and reference conventions. The ability to integrate “long-tail” models and “long-tail” data across the geoscience fields will provide a transformative opportunity for the interoperability and reusability of community resources: not only can models be combined in a workflow, but each model will be able to discover and (re)use data in the application-specific context of space, time, and scientific question. This capability is essential to represent, understand, predict, and manage heterogeneous and interconnected Earth system processes and activities by harnessing the complex, heterogeneous, and extensive set of distributed resources. Because continued advances in computational, sensing, and information technologies are producing long-tail models and data at a staggering rate, an important challenge arises: how can geoinformatics bring together all these resources seamlessly, given the inherent complexity of model and data resources that span various geoscience domains?
Here, we will present a semantic-based framework to support the integration of “long-tail” models and data. The framework builds on existing technologies, including: (i) SEAD (Sustainable Environmental Actionable Data - http://sead-data.net/), which supports curation and preservation of long-tail data throughout their life cycle; and (ii) CSDMS (Community Surface Dynamics Modeling System - http://csdms.colorado.edu/wiki/Main_Page), which “componentizes” models by providing a plug-and-play environment for model integration. In addition, we will describe methods of integration with three ongoing EarthCube initiatives that focus on increasing interoperability among models and data: GeoSoft, Earth System Bridge, and Sediment Experiment Network (SEN).