Jesper Gjerloev1, Robin Barnes1, Colin Waters2
1JHU-APL, United States of America; 2U. Newcastle, Australia
The Magnetosphere-Ionosphere-Atmosphere Coupling (ARCH) project addresses a science need of the magnetosphere-ionosphere-atmosphere communities, the complete electromagnetic solution of the auroral ionosphere, through an implementation that matches the goals of the EarthCube program.
We have developed a set of algorithms that ingest all available high-latitude electrodynamics-related data. By applying first-principles physics constraints, we produce a set of self-consistent output state variables that completely characterize the high-latitude electrodynamic environment on a grid with high spatial and temporal resolution.
We present the implemented website, including data, plots and derived products.
Liping Di, Ziheng Sun, Eugene Yu, Chen Zhang, Juozas Gaigalas
George Mason University, United States of America
This talk will introduce the latest progress of CyberConnector, one of the NSF EarthCube building blocks. The project aims to establish an online facility for automatically tailoring multisource Earth observation data to feed Earth science models. The system has been deployed on the Internet for operational use. We have used it to help atmospheric scientists preprocess data and stream observations directly into air models. Typical use cases include CMAQ (Community Multiscale Air Quality Modeling System) and reanalysis models. We created a series of VDPs (virtual data products) that generate products dynamically and on demand from real-time sensor observations. The VDPs can be scheduled to produce data regularly as new observations arrive. The generated products are either used directly in model assimilation by the scientists or archived for search by the public. We exercised the system on several UCAR datasets (COLA and OSU) to ensure that scientists can truly benefit from using CyberConnector. The major challenge in atmospheric use cases is big data searching and processing. The petabytes of data at UCAR are almost impossible to migrate. We extend the datasets by providing searching, ordering, preprocessing and rendering capabilities so that scientists can manipulate the big atmospheric datasets at the lowest cost. The services of CyberConnector are provided in an on-demand and customizable manner so that modelers can order input products that exactly match their models. We demonstrated the usage of CyberConnector with several inputs for CMAQ, teleconnection and reanalysis, and showed how CyberConnector relieves modelers of intensive data collection and preprocessing.
Sara Graves1, Emily Law2, Chaowei Yang3, Ashish Mahabal4, Ken Keiser1
1University of Alabama in Huntsville, United States of America; 2Jet Propulsion Laboratory; 3George Mason University; 4California Institute of Technology
The realization of an integrated EarthCube cyberinfrastructure is dependent on the interoperability of software tools and data products, within EarthCube projects and including resources from the larger Geo and Earth science communities. The EarthCube Integration and Test Environment (ECITE) was a partially funded pilot project to demonstrate the need for assessing the integration of EarthCube building technologies, to include software and data products. The definition of EarthCube interoperability objectives is crucial to ultimately integrate all the community resources to support science-driven needs and use cases. Interoperable integration includes capabilities to discover resources that are compatible, generate and test workflows, and provide reusable and interoperable technology solutions to science problems. This poster illustrates some of the resource interoperability connections that should be assessed and validated through ECITE to promote successful integration with the EarthCube cyberinfrastructure.
Daniel Garijo Verdejo1, Yolanda Gil1, Scott Peckham2, Christopher Duffy3
1University of Southern California, United States of America; 2University of Colorado; 3The Pennsylvania State University
Model repositories are key resources for scientists in terms of model discovery and reuse, but do not address crucial tasks such as data preparation, model comparison and composition. Model repositories do not typically capture important comparative metadata to describe assumptions and model variables that enable a scientist to discern which models would be better for their purposes. Once a scientist selects a model from a repository, it takes significant effort to understand and use the model. Our goal is to develop model repositories with machine-actionable model metadata that can be used to provide intelligent assistance to scientists in model selection and reuse.
We are extending the OntoSoft semantic software metadata registry (http://www.ontosoft.org/) to include machine-readable metadata of models. This work includes: 1) exposing model variables and their relationships; 2) adopting a standardized representation of model variables based on the conventions of the Geoscience Standard Names ontology (GSN) (http://www.geoscienceontology.org/); 3) capturing the semantic structure of model invocation signatures based on functional inputs and outputs and their correspondence to model variables; 4) associating models with readily reusable workflow fragments for data preparation, model calibration, and visualization of results. We are designing representations to capture the semantic structure of model invocation signatures that map model variables to data requirements to facilitate discovery and comparison of models. With these improvements, OntoSoft is expected to reduce the time to find, understand, compare, and reuse models.
Michael D. Daniels1, Branko Kerkez2, V. Chandrasekar3, Sara Graves4, D. Sarah Stamps5, Aaron Botnick1, Charles Martin1, Ken Keiser4, Joshua Jones5, S. Ryan Gooch3, Matthew Bartos2
1National Center for Atmospheric Research, United States of America; 2University of Michigan; 3Colorado State University; 4University of Alabama/Huntsville; 5Virginia Tech
Cloud-Hosted Real-time Data Services for the Geosciences (CHORDS), an EarthCube Building Block, addresses the ever-increasing importance of real-time scientific data, which is particularly useful in mission-critical scenarios, where informed decisions must be made rapidly. Many of the hazardous phenomena studied within the geosciences, ranging from hurricanes and severe weather to earthquakes, tsunamis, volcanoes and floods, can benefit from better use of real-time data. The National Science Foundation funds both large teams at laboratories and small teams at universities who are taking measurements that could lead to a better understanding of these phenomena in order to ultimately improve forecasts and predictions. CHORDS is supporting and extending the community of real-time data providers by exposing their data via developing standards.
CHORDS is currently in use by hydrology, atmosphere and solid earth sensors. Since our user base spans the geosciences, we are in the midst of navigating the various controlled vocabularies, ontologies, and ingesting and archiving systems most commonly used by each of these disparate communities. In addition, broad use of data streams using the CHORDS framework across science domains will push the need to standardize other aspects of data access through standard services, such as the location and type of measurements being taken, sample rate, spatial coverage, etc. Enabling measurement streams that are “born connected” (Leadbetter et al., 2016) through CHORDS will in turn expand the role of real-time data within the geosciences, enhancing the potential of streaming data sources to enable adaptive experimentation and real-time hypothesis testing across a range of geoscience applications.
Shane Loeffler, Amy Myrbo, Alex Stone, Sijia Ai, Reed McEwan
University of Minnesota, United States of America
EarthCube and other efforts to make the vast amounts of data currently available in domain repositories more easily accessible and interoperable have made great strides, allowing scientists, educators, and tool developers to do novel work. Still, differences in the way domain repositories handle requests and returns of data create barriers to efficient synthesis of data across them. Interdisciplinary research projects are often the flagship example of what the interoperability of these systems is designed to support, but the high variability of research projects makes them challenging for domain repositories to design for, since these projects exert little consistent pressure and researchers have high motivation to perform whatever laborious transitions are required to complete their particular research project. Tools that query data from several domain repositories at once may provide a more consistent and attractive set of requirements that, through collaboration between domain repository developers and tool developers, can lead to increased interoperability between resources. Here we present the collaborations between the Flyover Country mobile app and the domain repositories that provide the data it uses. These collaborations have led to an increase in interoperability between domain repositories. General tools like Flyover Country are not only the product of the creation of interoperable systems, but can also act as part of the process that helps hone the interoperability of those systems through the creation of a consistent and attractive use case which domain repository developers can aim to accommodate.
Nicholas Jarboe1, Rupert Minnett2, Cathrine Constable1, Lisa Tauxe1, Anthony Koppers2, Lori Jonestrask1
1Scripps Institution of Oceanography - UCSD; 2Oregon State University
The Magnetics Information Consortium (MagIC) supports an online database for the paleo, geo, and rock magnetic communities (https://earthref.org/MagIC). The website has an XML sitemap as required for P418 and Google indexing, and each data contribution is served with embedded schema.org and JSON-LD compliant data descriptions. This metadata will also be used by the European Plate Observing System and can be read by any other entities that wish to machine query the MagIC database. MagIC has completed the transition from an Oracle-backed, Perl-based, server-oriented website to an Elasticsearch-backed, Meteor-based thick-client website technology stack. On-the-fly data validation, column header suggestions, and online spreadsheet editing are some new features made possible by using these software technologies. Uploading data into the archive with comprehensive indexing and completing complicated search queries to obtain unique datasets are an order of magnitude quicker than with the old system. Searches return row-level data over all contributions, and the user can choose to download only those rows meeting the search criteria as a single text file or spreadsheet. MagIC meets the FAIR data principles; we mint data DOIs for each data contribution, do data versioning, have web access with a sophisticated search interface, and all data is available online with an open data license. Metadata method codes and vocabulary lists can be browsed via the MagIC website, downloaded as JSON files for others to use, and can be easily updated by the MagIC team via email or by submitting an issue on the MagIC GitHub site. All source code for MagIC is publicly available on GitHub (https://github.com/earthref/MagIC) and the MagIC file format is natively compatible with the PmagPy (https://github.com/PmagPy/PmagPy) paleomagnetic analysis software.
Data downloaded from MagIC can be examined and modified with the command line PmagPy Python programs and contributions with measurement level data can be explored using PmagPy’s easily installed Thellier_GUI and Demag_GUI programs.
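As an illustration of what an embedded schema.org description of a data contribution might look like, the sketch below builds a JSON-LD `Dataset` record in Python and wraps it in the `<script type="application/ld+json">` element that web crawlers read. The helper names, the DOI, the URL, the license, and all field values are hypothetical placeholders; this is not the MagIC implementation or an actual MagIC record.

```python
import json


def build_dataset_jsonld(name, description, doi, url, license_url):
    """Return a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": doi,
        "url": url,
        "license": license_url,
    }


def embed_in_html(jsonld):
    """Wrap the JSON-LD in the script element used for embedded metadata."""
    body = json.dumps(jsonld, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical contribution: every value below is a placeholder.
record = build_dataset_jsonld(
    name="Example paleomagnetic contribution",
    description="Placeholder description of a data contribution.",
    doi="https://doi.org/10.xxxx/placeholder",
    url="https://example.org/contribution/0000",
    license_url="https://creativecommons.org/licenses/by/4.0/",
)
snippet = embed_in_html(record)
print(snippet)
```

A harvester that understands schema.org can parse the content of the script element back into the same dictionary, which is what makes the description machine-queryable.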
Matthew Mayernik1, Mike Daniels1, Don Stott1, Linda Rowan2, Erica Johns3, Huda Khan3, Dean Krafft3
1National Center for Atmospheric Research (NCAR); 2UNAVCO; 3Cornell University
EarthCollab, an EarthCube-funded building blocks project, is focused on improving the discovery and sharing of information to advance research and scientific collaboration. The partners include the NCAR Library, NCAR Earth Observing Lab (EOL), UNAVCO, and Cornell University Library. EarthCollab has produced two operational systems, Connect UNAVCO (http://connect.unavco.org/) and EOL Arctic Data Connects (http://vivo.eol.ucar.edu/), that provide linked information about complex scientific projects and their products. This poster presents EarthCollab’s efforts to engage the project’s relevant scientific communities to gather focused feedback. Over the course of our project, we have used a variety of methods to get input into our project’s directions and outcomes, including a kick-off workshop, surveys, targeted usability testing, focus groups, and lengthy interviews. This poster will discuss insights that came from these methods, and practical outcomes that were implemented based on user feedback. We will also discuss lessons learned in using a multi-method approach, and compare the relative utility of the different types of feedback.
Wonsuck Kim1, Brandon McElroy2, Kimberly Miller2, Leslie Hsu3
1University of Texas, Austin; 2University of Wyoming; 3U.S. Geological Survey, United States of America
The Sediment Experimentalist Network (SEN) is an EarthCube Research Coordination Network that began in 2013. SEN aims to improve the efficiency and transparency of sedimentary and geomorphic research for experimentalists, modelers, and field geologists by providing guidance on best practices for data collection and management. SEN’s accomplishments include publications on recommended data practices in disciplinary journals, clinics and town halls on tools for experimental data management, and workshops that have built an international network that can rapidly share evolving techniques in experimental data collection, management, and dissemination. Most importantly, SEN has established a community platform on which to ask questions and bring up concerns in our changing research data landscape. The ability of the SEN community to act with one voice has facilitated connections with modeling communities such as the Community Surface Dynamics Modeling System (CSDMS) and has introduced a publication solution for our large volumes (hundreds of gigabytes to terabytes) of data through SEAD data services. A challenge for many research coordination networks is sustaining activity and momentum after the initial funding period. This poster summarizes SEN’s activities and lists practices that we’ve found most useful for sustaining engagement with our research community.
Chen Zhang, Liping Di, Ziheng Sun, Eugene Yu, Juozas Gaigalas
Center for Spatial Information Science and Systems, George Mason University
This study investigates the feasibility of integrating geospatial Web services and cloud infrastructure to facilitate Earth science study. A general framework is designed and a prototype that implements Earth science models as web services is developed. The implementation is composed of a hardware layer, a platform layer, a service layer, and a client layer. All services, models, and data are deployed as virtual machine instances, managed by GeoBrain Cloud and bridged by CyberConnector. The results show that publishing Earth science models through the framework would significantly improve their performance and provide great benefits to the Earth science modeling community.
Michael Kirk1, Raphael Attie2, Barbara Thompson3, Alisdair Davey4, W. Dean Pesnell3
1NASA Goddard Space Flight Center, Catholic University of America; 2NASA Goddard Space Flight Center, NASA Postdoctoral Fellow; 3NASA Goddard Space Flight Center; 4National Solar Observatory
With the launch of the Solar Dynamics Observatory (SDO), solar physics was plunged into the era of big data. SDO has revolutionized solar imaging with its unprecedented spatial detail and temporal coverage since 2010 – taking about one image of the sun each second, 24 hours a day. It has also produced nearly 6 petabytes of data. How do we make use of this rich dataset to answer pressing solar physics problems? To begin to answer this question, we had to change the way we thought about data. Instead of thinking of a physics question and then interrogating the dataset, we are querying the data to see what physics questions could possibly be answered. We take this novel approach to survey more than 130 million images and identify and group compact bright points across the entire SDO mission. In the past, these features have been discarded in favor of more energetic eruptions on the sun. We create innovative software to filter and extract 0.000001% of the pixels captured to obtain fundamental information about the sun. These detected bright points are about 0.04% of the diameter of the sun and last for just a few seconds but may hold the clues to some of the biggest outstanding questions in solar physics.
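To give a sense of scale, the percentages quoted above can be turned into rough physical sizes and pixel counts. The sketch below assumes a solar diameter of about 1.39 million km (a standard textbook value, not stated in the abstract) and, as a further assumption, that every surveyed image is a 4096 x 4096 pixel frame. Both inputs are illustrative rather than numbers taken from the project.

```python
# Values quoted in the abstract
BRIGHT_POINT_FRACTION = 0.04 / 100      # "about 0.04% of the diameter of the sun"
N_IMAGES = 130e6                        # "more than 130 million images"
PIXEL_FRACTION = 0.000001 / 100         # "0.000001% of the pixels captured"

# Assumptions that are NOT in the abstract
SOLAR_DIAMETER_KM = 1.39e6              # assumed standard value
PIXELS_PER_IMAGE = 4096 * 4096          # assumed full-resolution frames

# Approximate physical size of one bright point
bright_point_km = SOLAR_DIAMETER_KM * BRIGHT_POINT_FRACTION

# Approximate number of pixels surveyed and retained under the assumptions
pixels_surveyed = N_IMAGES * PIXELS_PER_IMAGE
pixels_kept = pixels_surveyed * PIXEL_FRACTION

print(f"bright point size: about {bright_point_km:.0f} km")
print(f"pixels surveyed:   about {pixels_surveyed:.2e}")
print(f"pixels retained:   about {pixels_kept:.2e}")
```

Under these assumptions a bright point is a few hundred kilometers across, and the retained pixels number in the tens of millions out of roughly two quadrillion surveyed.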
Supporting collaboration for Greenland ice sheet-ocean interaction research
Ginny Catania1, Patrick Heimbach1, Twila Moon2, Leigh Stearns3, Fiamma Straneo4, David Sutherland5
1University of Texas, Austin; 2National Snow and Ice Data Center, University of Colorado, Boulder; 3University of Kansas; 4Scripps Institution of Oceanography; 5University of Oregon
Accelerating Greenland Ice Sheet mass loss and freshwater export is changing the properties of fjord and ocean waters. The extent of these changes, and their connections to other elements of the atmosphere and biosphere, are not well understood. This knowledge gap is exacerbated by difficulties in communicating across different research domains (e.g., glaciologists and oceanographers). Cross-domain research is increasingly important, however, for answering critical questions about how ice sheet changes are influencing the ocean and vice versa. The Greenland Ice Sheet/Ocean Research Coordination Network (GRISO RCN) is advancing collective, integrated understanding of ice/ocean/atmospheric dynamics around Greenland by facilitating knowledge exchange, increasing data accessibility, and leading synthesis activities. These activities include working with the wider research community, and across several disciplines, to define data needs and to support new platforms for distributing and collecting these data. For example, the GRISO RCN is supporting definition and development of a Greenland Ice Sheet Ocean Observing System (GrIOOS), which would provide consistent, coincident long-term data on critical variables across the ice sheet and ocean interface. This effort has led to a new collaboration with the Interagency Ocean Observation Commission (IOOC), and strengthened collaboration with the U.S. Arctic Observing Network (US AON). Another example is ongoing work to provide improved freshwater flux data for ocean modelers, including the CORE Ocean modeling group. Similarly, the GRISO RCN is supporting integration of accurate ocean forcing conditions for use in the Ice Sheet Modeling Intercomparison Project (ISMIP6). These and other efforts of the GRISO RCN are improving interdisciplinary research across key system interfaces and supporting continued improvement to address the most important questions regarding interactions among ice, ocean, and atmosphere.
Karen Stocks, Stephen Diggs, Christopher Olson, Anh Pham
Scripps Institution of Oceanography, United States of America
SeaView is a consortium of ocean data centers and tool developers* working together to make oceanographic data more accessible and interoperable. SeaView produces thematic collections of integrated data in standard formats, aligns the formats and metadata to make them easy for scientists to integrate into common visualization and analysis tools, and serves them from the SeaView data site (www.seaviewdata.org).
In its last year (a no-cost extension), SeaView work has focused on two main areas. The first is producing a final, multidisciplinary, integrated data package on the Southern Ocean and planning a ‘data hack’ to support scientists working with the package. The data package has been released, incorporating data not just from SeaView partners, but also from external programs such as Argo (a global network of ocean profiling floats) and the Palmer Long Term Ecological Research site. The hands-on data workshop will be held at the Polar 2018 conference in mid-June, as a collaboration with the Southern Ocean Observing System.
The second goal for this year is to develop a cookbook for future data projects, within and outside EarthCube, describing the processes we developed to align the data, the problems we solved, and the outstanding challenges that remain to be solved in future projects.
* BCO-DMO, CCHDO, OBIS, ODV, OOI, R2R
Deborah Khider1, Julien Emile-Geay1, Nicholas McKay2, Daniel Garijo1, Yolanda Gil1, Varun Ratnakar1
1University of Southern California, United States of America; 2Northern Arizona University, United States of America
Paleoclimate observations are crucial to assessing current climate change in the context of past variations. However, these observations often come in non-standard formats, forcing paleogeoscientists to spend a significant fraction of their time on the menial tasks of data wrangling and reformatting. This is a drain on community resources, lowering the value of the datasets to scientists and society alike. People expected more of the 21st century.
The EarthCube-supported LinkedEarth project is helping manifest a better future by creating an online platform that (1) enables the curation of a publicly-accessible database by paleoclimate experts themselves, and (2) fosters the development of community data standards, including an ontology. In turn, these developments enable cutting-edge data-analytic tools to be built and applied to a wider array of datasets than ever possible before, supporting more rigorous assessments of the magnitude and rates of pre-industrial climate change.
We illustrate this with three case studies. First, we describe the process of collaboration and iterative development of the Past Global Changes past 2,000 years project (PAGES2k), where LinkedEarth cyberinfrastructure supported the project and LinkedEarth activities identified unmet needs. Second, we describe how paleoclimate observations that cover the past 10,000 years are used along with analysis tools such as Pyleoclim and GeoChronR to quantify regional Holocene climate evolution and its uncertainties. Lastly, we demonstrate how the LinkedEarth platform can help put those tools in the hands of every paleoclimatologist, enable more efficient data curation via semantic web technologies and crowdsourcing, and support broader data syntheses that easily integrate published information.
LinkedEarth tools and standards are being adopted by various organizations such as PAGES and WDS-Paleo (NOAA), and interconnections to other EarthCube registries are being explored as part of Project 418.
Ethan Davis1, Charlie Zender2, David Arctur3, Kevin O'Brien4, Aleksandar Jelenak5, Dave Santek6, Mike Dixon7, Timothy Whiteaker3, Kent Yang5, Jonathan Yu8, Mark Hedley9, Adam Leadbetter10
1UCAR Unidata, United States of America; 2University of California, Irvine; 3University of Texas, Austin; 4University of Washington/JISAO and NOAA/PMEL; 5The HDF Group; 6University of Wisconsin/SSEC; 7NCAR/EOL; 8CSIRO; 9The Met Office, UK; 10Marine Institute, Ireland
NetCDF-CF is a community-developed convention for storing and describing earth system science data in the netCDF binary data format. It is an OGC recognized standard with numerous existing FOSS (Free and Open Source Software) and commercial software tools which can explore, analyze, and visualize data that is stored and described as netCDF-CF data.
The EarthCube netCDF-CF project has led and supported the development of several extensions to netCDF-CF that have been or soon will be proposed to the netCDF-CF community. Work on these extensions involved broad participation by members of the existing netCDF-CF community as well as members of ESS domains not traditionally represented in the netCDF-CF community. Several of the extensions that are furthest along the development / proposal / acceptance process include:
The Geometries proposal (which has been accepted by CF) allows hydrologists to represent, e.g., river flow data for a network of river segments as well as precipitation for a collection of drainage basins.
The Satellite Swath proposal allows satellite remote sensing scientists to represent satellite swath data in the original instrument viewing geometry.
The CF-Radial proposal supports storing radar and lidar data in polar coordinates and with metadata important to represent data from pulsed, scanning instruments.
The netCDF-LD proposal enables encoding Linked Data descriptions in netCDF files with explicit bindings to conventions, vocabularies and other online Linked Data resources.
The Group proposal enables data and metadata to be stored in a way that captures hierarchical directory-like structures.
This presentation will provide an overview and update of this work. It will present some of the data analysis and visualization tools that have been prototyped as well as work to improve performance and usability.
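To make the polar-coordinate idea behind the CF-Radial proposal concrete, the sketch below converts a single radar or lidar gate, given by range, azimuth, and elevation, into Cartesian offsets from the instrument. It uses a flat-Earth, straight-beam simplification, and the function and argument names are hypothetical; it is not the CF-Radial specification or any library's API.

```python
import math


def gate_to_cartesian(range_m, azimuth_deg, elevation_deg):
    """Convert one gate from polar to local Cartesian coordinates.

    Assumptions (illustrative only): azimuth is measured clockwise from
    north, elevation is measured upward from the horizontal, and the beam
    travels in a straight line over a flat Earth.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    ground_range = range_m * math.cos(el)   # horizontal distance to the gate
    x_east = ground_range * math.sin(az)
    y_north = ground_range * math.cos(az)
    z_up = range_m * math.sin(el)
    return x_east, y_north, z_up


# A gate 1 km away, due east of the instrument, at zero elevation.
x, y, z = gate_to_cartesian(1000.0, 90.0, 0.0)
print(x, y, z)
```

Keeping the data in polar form, as the proposal describes, preserves the native sampling of a scanning instrument; a conversion like this one is only needed when the data are gridded or plotted on a map.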
Yu Pan1, Jin Wang1, Michael L. Rilee2, Lina Yu1, Feiyu Zhu1, Kwo-Sen Kuo3, Hongfeng Yu1
1University of Nebraska-Lincoln, United States of America; 2Rilee Systems Technologies LLC, Derwood, MD, United States of America; 3University of Maryland, College Park, United States of America
Few research fields may claim a longer history than geoscience regarding the challenges of voluminous and diverse data. Today, the severity of these challenges is intensifying and the urgency to address them is heightening. While technology advancements in computation are leading to model simulations with higher spatiotemporal resolutions, similar advancements in instrumentation have increased both observation resolution and observation diversity. While the corresponding increase in data volume requires more storage to archive and more computing power to process and analyze, it is the increase in data variety, under existing data practice, that requires disproportionately more labor and time to perform the integrative, interdisciplinary analyses that are required for a complex system of systems like our Earth.
In this poster, we present a unified computation and storage framework to holistically address both the volume and variety challenges of Big Earth Data. In the design of the framework, we employ SciDB, an array-based parallel database management system, to tightly couple the analytics engine with the storage system, in which the detailed map of data partition locations is directly accessible by the analysis engine at run time. Our framework co-locates and co-aligns multiple datasets with diverse data models and resolutions in a distributed environment and thereby significantly reduces data movement, the principal cause of performance bottlenecks in parallel and distributed analytics. We demonstrate that our unified framework has made it possible for multiple concurrent users to conduct interactive geophysical analytics on large-scale heterogeneous data.
Kwo-Sen Kuo1, Hongfeng Yu2
1University of Maryland, College Park, United States of America; 2University of Nebraska-Lincoln, United States of America
The two principal sources of data in geoscience are observation and model simulation output. To better understand the Earth system and exploit technology advancements in instrumentation and computation, we are continuously refining the spatiotemporal resolutions of both data sources. The result is ever-increasing data production rates and volumes of data. Meanwhile, the multiplicity of observations and the diverse varieties of data resulting from it take a heavy toll on interoperability. Different instruments, in accordance with their positions, purposes, requirements, and physical or practical constraints, often utilize different sensing geometries, generating observations with resolutions that vary both spatially and temporally. Further processing of these data produces even greater varieties. These challenges confronting geoscience data analysis become formidable barriers to realizing the potential of advancements in observational and simulation techniques.
We present a set of new techniques to address these Big Earth Data challenges in a distributed, data-intensive environment. We have developed the SpatioTemporal Adaptive-Resolution Encoding (STARE), a new indexing scheme for accessing and managing datasets with different data models and resolutions in a unified manner. STARE-indexed datasets are partitioned and distributed in a cluster of computing units (nodes) with local storage, where partitioned data chunks of all datasets are spatiotemporally co-located and co-aligned, i.e., according to the most prevalent data access pattern of geoscience analysis, thereby minimizing runtime repartitioning and communication overhead and maximizing performance. Based on this tightly coupled solution optimizing data-compute affinity, we have developed a web-based visual analytics interface that allows users to intuitively conduct visual explorations and execute customized queries to search for features of interest across multiple heterogeneous datasets. With our end-to-end analysis system, scientists are equipped with a new interactive and scalable capability to tackle Big Earth Data and make new discoveries without being overburdened by complex data computing and storage management.
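The co-location idea can be sketched with a much simpler stand-in for the index. The example below is not the STARE encoding; it substitutes a plain latitude-longitude quadtree key, ignores the temporal dimension, and uses hypothetical function names. It only shows why a hierarchical key lets datasets of different resolutions land on the same node: a coarse key is a prefix of every finer key inside the same cell.

```python
def quadtree_key(lat, lon, level):
    """Return a string key locating (lat, lon) in a lat-lon quadtree.

    Each character selects one of four quadrants, so a longer key means
    a finer cell. This is an illustrative stand-in, not STARE.
    """
    south, north, west, east = -90.0, 90.0, -180.0, 180.0
    digits = []
    for _ in range(level):
        mid_lat = (south + north) / 2
        mid_lon = (west + east) / 2
        quadrant = 0
        if lat >= mid_lat:
            quadrant += 2
            south = mid_lat
        else:
            north = mid_lat
        if lon >= mid_lon:
            quadrant += 1
            west = mid_lon
        else:
            east = mid_lon
        digits.append(str(quadrant))
    return "".join(digits)


def assign_node(key, n_nodes, partition_level=3):
    """Map a key to a node using only its coarse prefix, so records from
    any dataset that fall in the same coarse cell share a node."""
    prefix = key[:partition_level]
    return int(prefix, 4) % n_nodes if prefix else 0


# The same location indexed at two resolutions (two hypothetical datasets).
coarse = quadtree_key(38.9, -77.0, level=4)
fine = quadtree_key(38.9, -77.0, level=8)
print(coarse, fine, assign_node(coarse, 4), assign_node(fine, 4))
```

Because both keys share their first characters, both records are assigned to the same node and can be joined locally, which is the kind of data movement the co-aligned partitioning described above is meant to avoid.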
Kerstin Lehnert1, Suzanne Carbotte1, Vicki Ferrini1, Megan Orlando1, Stephen Richard1, Neville Shane1, Ilya Zaslavsky2
1Lamont-Doherty Earth Observatory, United States of America; 2San Diego Supercomputer Center, UCSD
The goal of the Alliance Testbed Integrative activity is to support publication of data products through workflows that reuse components that have cross-domain/multidisciplinary utility, and provide hooks to acquire science domain, resource, and community specific information necessary to support data reuse. We are testing the idea that reuse of software components can increase efficiency by reducing duplication of effort in software development and user training. Various social and organizational challenges have become apparent in the course of this project. Because operational data systems have existing accession workflows with working software and trained users, there is little motivation for them to adopt a new general purpose solution. New workflows thus target ‘long tail’ data producers, a large user base with heterogeneous requirements, but small volumes of data. This community requires extensive outreach and training, resulting in high costs per dataset for specialized software and support staff. Implementing a dynamic, configurable, modular workflow engine for data documentation and repository accession that is flexible enough to support heterogeneous data is complex. Our project has only been able to implement the simplest aspects of such a workflow. The promise of smart software to automate extraction of data documentation looms in the future, but flexible, widely applicable metadata extraction tools are not yet available. Significant resources would be required to realize the benefits of a general purpose geoscience data submission hub application, and the benefit to be gained does not clearly outweigh the cost at this point. Technology solutions from other domains might provide a basis for moving forward. Emerging systems (e.g. Figshare, Dryad, TACC, CyVerse, XSEDE) could provide domain-agnostic repository resources for simple file-based data packages (e.g. W3C Data on the Web), while the science community focuses on allocating resources for data curation activities that require domain expertise in the preparation of these packages.
Ruth D Gates1, Ouida W Meier1, Megan Donahue1, Judy Lemus1, Gwen Jacobs1, Erik Franklin1, Ilya Zaslavsky2, CRESCYNT Data Science for Coral Reefs Workshop_Participants1
1Hawaii Inst of Marine Biology / Univ of Hawaii Manoa, United States of America; 2San Diego Supercomputer Center, Univ of California San Diego, United States of America
The CRESCYNT Coral Reef Science and Cyberinfrastructure Network held a workshop, Data Science for Coral Reefs: Data Rescue, to broadly teach essential data management skills, which coral reef researchers consistently identify as a persistent need, and simultaneously to offer an opportunity for data rescue of older coral reef data sets in danger of being lost to science. High quality observations of reefs taken decades ago become more valuable with time, as those particular intersections of space, time, organism, community, and physical environment will never be repeated. A need to develop a set of data rescue workflows that could be widely shared coincided with a need frequently expressed by coral reef researchers for improved data management skills. This workshop was designed as two days of training followed by two days of workathon, and was held at NCEAS in March 2018. Participants undertook basic data management training and metadata creation, and then made progress in salvaging, archiving, and linking collections of related data sets. The opportunity allowed us to develop specific recommendations for suitable repositories, metadata structures consistent with the needs of coral reef researchers, workflows for capturing, preserving, and maximizing future accessibility of valuable reef records, and revelations about how to keep data fresh and therefore actively curated. A data discovery exercise produced researcher assessment of several repositories and a metadata aggregator (CINERGI), and made researchers more aware of how to write thoughtful metadata for future discovery by others. Participants included a fortuitous combination of senior scientists, post docs, graduate students, and skilled technical specialists, who all contributed to what we expect will be a long tail of positive workshop outcomes as participants committed to sharing their new skills with others. 
Materials developed for the workshop have been designed to be shared with the coral reef community and other researchers.
Keshav Arogyaswamy, Emma Aronson
University of California, Riverside, United States of America
The Critical Zone is defined as the three-dimensional region of the biosphere from the highest treetops to the lowest groundwater—in other words, the zone of greatest heterogeneity. Our project has launched a cross-disciplinary research activity involving many universities affiliated with the 10 Critical Zone Observatories spread across the country. The scientific goal of this project is to gain insights into the differences between soil microbial communities as they vary across ecosystems, and with depth within a given soil profile. To that end, we are using a wide range of soil and environmental methods, as well as both metagenomic and targeted-amplicon high throughput sequencing, to analyze nearly 200 unique soil samples. To assemble and share this huge and diverse dataset, we are working with EarthCube and related projects to make the data accessible across the Critical Zone network, as well as throughout the broader community. We have developed procedural and datastream workflows to enable these goals, and are expanding the scope of our project with relevant EarthCube tools. Here, we are presenting one of several projects associated with the Critical Zone Integrative Microbial Ecology Activity (CZIMEA) program.
Soil microbes produce and consume large amounts of greenhouse gases, so understanding the factors that influence those activities is valuable for accurately modeling climate change. With EarthCube support, CZIMEA was able to analyze ~200 unique samples across a range of geospatial locations and depths. Counter to previous works, which were limited in scale and methodology, we found more change between locations than across depth—i.e., microbial communities differ more between diverse ecosystems than they do with increasing depth. These findings provide crucial data for models of climate change, because they demonstrate that microbial contributions to greenhouse gas levels vary widely by ecosystem types and across continental scales.
Anthony M Castronova, Phuong Doan, Martin Seul, Jonathan Goodall, David Tarboton
CUAHSI, United States of America
Innovative water research often requires multiple teams examining large amounts of diverse data. Recent advances in cyberinfrastructure (CI) for water science research are transforming the way scientists approach large collaborative studies. CI efforts for supporting reproducible science, enabling open collaboration across traditional domain and institutional boundaries, and extending the data life cycle have been examples of focus areas. Project Jupyter is one such effort that combines an interactive programming environment with expressive text fields, and has been successfully leveraged in various educational settings. Our work combines the Jupyter software with the CUAHSI HydroShare web-based hydrologic information system, a platform for sharing, publishing, discovering, and analyzing water data. The result is an open source cloud environment for designing, executing, and disseminating scientific toolchains for solving complex hydrologic problems along with supporting data and documentation. The overall goal of this work is to design a platform using open source software that will enable domain scientists to (1) conduct data intensive and computationally intensive collaborative research, (2) utilize high performance libraries, models, and routines within a pre-configured cloud environment, and (3) enable dissemination of research products. This presentation will discuss our approach for supporting cloud-based hydrologic model configuration and execution for educational and research purposes as well as the challenges and pitfalls therein, and our vision for future work.
Reproducing Applications at Large-scale: A case study of parameter sweep with Geounits
Zhihao Yuan1, Bakinam Essawy2, Tanu Malik1, Jonathan Goodall2, Scott Peckham3, David Tarboton4
1DePaul University, United States of America; 2University of Virginia, United States of America; 3University of Colorado, Boulder, USA; 4Utah State University, USA
Scientific research progresses when discoveries are reproduced and verified. Oftentimes, exactly repeating a computation does not guarantee the correctness of results. The computation must also be reproduced, that is, verified by subjecting it to extensional data inputs to establish correctness of results. This oral session will describe how shared reusable research objects can be used to conduct parameter sweeps and sensitivity analysis of shared hydrologic models (SUMMA), and verify the model across multiple ranges of parameters.
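As a rough illustration of the parameter sweep and sensitivity analysis mentioned above, the following minimal Python sketch runs a model once per combination of parameter values and summarizes how much the output varies with each parameter. The function toy_model, its parameters, and all numeric values are hypothetical stand-ins chosen for illustration; they are not SUMMA or the geounit tooling, which the abstract does not describe in detail.

```python
from itertools import product


def toy_model(precip_mm, melt_factor, storage_coeff):
    """Hypothetical stand-in for a hydrologic model run; returns runoff in mm."""
    melt = melt_factor * 10.0  # assumes 10 degree-days of melt (illustrative)
    return storage_coeff * (precip_mm + melt)


def parameter_sweep(model, fixed, grid):
    """Run the model once for every combination of values in `grid`."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, model(**fixed, **params)))
    return results


def sensitivity(results, name):
    """Mean output range when `name` varies and the other parameters are held fixed."""
    groups = {}
    for params, output in results:
        others = tuple(sorted((k, v) for k, v in params.items() if k != name))
        groups.setdefault(others, []).append(output)
    spreads = [max(outputs) - min(outputs) for outputs in groups.values()]
    return sum(spreads) / len(spreads)


# Illustrative parameter ranges (assumed, not taken from the abstract).
grid = {"melt_factor": [1.0, 2.0, 3.0], "storage_coeff": [0.2, 0.4]}
results = parameter_sweep(toy_model, {"precip_mm": 50.0}, grid)

for name in grid:
    print(name, round(sensitivity(results, name), 3))
```

In a reproducibility setting, each call to the model would be replaced by re-executing the packaged research object with a different parameter set, so that the same sweep can be repeated by other researchers.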
Janet Fredericks1, Felimon Gayanilo2, Carlos Rueda3
1Woods Hole Oceanographic Institution, United States of America; 2Texas A&M Corpus Christi; 3Monterey Bay Aquarium Research Institute
Over the past few years, the X-DOMES team has updated our ontology software, which has recently been adopted by the ESIP community as their Community Ontology Registry. We have also created tools to create SensorML and to create RelaxNG models to guide content creation. A SensorML Registry has been created to provide a tool for managing cross-domain sensor descriptions, as well as to manage content within agencies or projects. We have developed a role-based vision of how manufacturers, instrument owners, field operators and data managers can develop, reference and manage descriptions of observational lineage. These tools and mechanisms will be presented.
Steven Ryan Gooch1, V. Chandra1, Mike Daniels2
1Colorado State University, United States of America; 2National Center for Atmospheric Research, United States of America
The Cloud-Hosted Real-Time Data Services for the Geosciences project (CHORDS) has until now focused on delivering a cost-effective, cloud-deployable, real-time data streaming service for 1-dimensional time-series data measured with remote sensors. For upcoming versions, however, the focus will be shifting in part to accommodate the introduction of image data, specifically, weather radar data. This integration with image-style data is a large component of CHORDS Releases 1.0 and 1.1. This poster will demonstrate the major goals, breakthroughs, limitations, and progress towards intelligently and efficiently incorporating weather radar data through the streaming service. Specifically, data handling both prior to ingest and through ingest are focal points of this research. Specifications and advancements will be shown regarding back-end developments for data storage, organization, and metadata for data discovery according to Linked Data standards. Front-end specifications and advances regarding real-time data visualization as part of this work will also be demonstrated. We will demonstrate the concepts and advancements in this project utilizing the CASA DFW Urban Testbed X-band radar network. Finally, future plans for further integration of weather radar data use cases will be presented.
Jianwu Wang, Matthias Gobbert, Zhibo Zhang, Aryya Gangopadhyay
University of Maryland, Baltimore County, United States of America
We present a new initiative to create a training program or graduate-level course (cybertraining.umbc.edu) in big data applied to atmospheric sciences as application area and using high-performance computing as an indispensable tool. The training consists of instruction in all three areas of "Big Data + HPC + Atmospheric Sciences" supported by teaching assistants and followed by faculty-guided project research in a multidisciplinary team of participants from each area. Participating graduate students, post-docs, and junior faculty from around the nation will be exposed to multidisciplinary research and have the opportunity for significant career impact. The poster discusses the challenges, proposed solutions, practical issues of the initiative, and how to integrate high-quality developmental program evaluation into the improvement of the initiative from the start to aid in ongoing development of the program.
Chris Mattmann1,2, SiriJodha Khalsa3, Ruth Duerr4, Wayne Burke2, Omid Davtalab1, Simin Ahmadi Karvigh1
1USC, United States of America; 2NASA JPL, United States of America; 3National Snow and Ice Data Center, United States of America; 4Ronin Institute, United States of America
Research Applications Laboratory, National Center for Atmospheric Research, Boulder, USA
Xarray is an open-source Python package that provides data structures for N-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets with many related variables. Xarray combines core libraries from the greater Scientific Python ecosystem (Numpy, SciPy, Pandas, Dask, NetCDF) to provide an intuitive and powerful platform for scientific analysis of large multi-dimensional geoscientific datasets. As part of the ongoing NSF EarthCube Integration Project: “Pangeo: An Open Source Big Data Climate Science Platform”, we have been improving Xarray’s integration with the Dask library to enhance parallel computations and streaming computation on datasets that don’t fit into memory. In this presentation, I will discuss how recent development on the Xarray package is enabling new scalable scientific analysis using increasingly large datasets in both high performance computing and cloud computing environments. I will also highlight how Xarray’s open-source community origins are facilitating its uptake and development across the earth science community and in other data science domains including physics, astronomy, and finance.
Bento Goncalves1, Bradley Spitzbart1, Shantenu Jha2, Heather Lynch1
1Stony Brook University, United States of America; 2Rutgers University, United States of America
The polar regions are critical to our understanding of climate and biogeochemical cycling, but traditional constraints imposed by their remoteness have made it difficult to even map some of these areas, much less to understand the mechanisms that link the region’s geology, hydrology, and biology. Over the last decade, however, there has been an extraordinary increase in the capture and use of high-resolution satellite imagery in polar areas. As the community moves from smaller-scale projects to demonstrate feasibility, to regular (or even real-time) pan-Arctic and pan-Antarctic surveys, it has become clear that further progress in imagery-enabled science requires the development of cyberinfrastructure to unite high-performance and distributed computing resources with polar imagery and the tools required for their study (software and code for analysis).
Here we demonstrate the use of our new, developing cyberinfrastructure (ICEBERG - Imagery Cyberinfrastructure and Extensible Building-Blocks to Enhance Research in the Geosciences) for a pan-Antarctic pack-ice seal survey. To accomplish this survey, we are using convolutional neural networks for imagery annotation, an approach of broad utility for a range of biological and geological applications involving imagery interpretation and one that requires the careful and efficient coordination of imagery and high performance and distributed computing. We will also introduce several of the other use cases being used to develop ICEBERG's functionality, which we expect will include much of the functionality required by the larger EarthCube community.
iMicrobe: A place of data discovery, integration, and best practices
Elisha Wood-Charlson1, Bonnie Hurwitz2
1University of Hawai'i at Manoa, Honolulu, HI, USA; 2University of Arizona, Tucson, Arizona, USA
iMicrobe is a platform designed to provide scientists with modular tools, high-performance compute, and data storage for analyzing data from the microbial world through partnerships with community-driven cyberinfrastructure. iMicrobe leverages CyVerse cyberinfrastructure for compute and data storage and integrates community-developed tools from Biocontainers. iMicrobe adds search capabilities on top of these data to allow users to quickly and dynamically traverse data sets based on taxonomy, function, or project/sample data associated with ‘omics data sets. This robust architecture and design allow users to discover and search massive data sets, scale compute, and integrate data and tools in a manner that promotes community-driven efforts using modern technologies. We are currently using iMicrobe as a platform to connect and integrate resources for oceanographic data streams from ‘omics to physicochemical datasets.
Alexander Kosovichev1, Gelu Nita1, Vincent Oria1, Viacheslav Sadykov1, Wei Wang1, Sheetal Rajgure1, Shubha Ranjan2
1New Jersey Institute of Technology, United States of America; 2NASA Ames Research Center, United States of America
The primary goal is to develop tools for data access and analysis that can be easily used by the Geoscience community for studying and modeling various components of the coupled Sun-Earth system. The project will develop innovative tools to extract and analyze the available observational and modeling data in order to enable new physics-based and machine-learning approaches for understanding and predicting solar activity and its influence on the geospace and Earth systems. The geospace data are abundant: several terabytes of solar and space observations are obtained every day. Finding the relevant information in numerous spacecraft and ground-based data archives and using it is a paramount and currently difficult task.
The scope of the project is to develop and evaluate data integration tools to meet common data access and discovery needs for two types of Heliophysics data: 1) long-term synoptic activity and variability, and 2) extreme geoeffective solar events caused by solar flares and eruptions. The methodology consists in the development of a data integration infrastructure and access methods capable of 1) automatic search and identification of image patterns and event data records produced by space and ground-based observatories, 2) automatic association of parallel multi-wavelength/multi-instrument database entries with unique pattern or event identifiers, 3) automatic retrieval of such data records and pipeline processing for the purpose of annotating each pattern or event according to a predefined set of physical parameters inferable from complementary data sources, and 4) generation of a pattern or catalog and associated user-friendly graphical interface tools that provide fast search, quick preview, and automatic data retrieval capabilities.
The Team has developed and implemented the Helioportal that provides a synergy of solar flare observations, taking advantage of big datasets from ground- and space-based instruments, and allows the larger research community to significantly speed up investigations of flare events, perform a broad range of new statistical and case studies, and test and validate theoretical and computational models. The Helioportal stores, integrates, and presents records of physical descriptors of solar flares from various catalogs of observational data from different observatories and heliophysics missions.
Ryan Michael May, John Leeman
UCAR/Unidata, United States of America
MetPy is a Python toolkit for meteorology, encompassing tools for reading data, performing calculations, and making plots. As part of the Pangeo project, whose goal is to provide a framework for analyzing earth system model output that scales to petabyte-scale datasets, MetPy serves as a set of domain-specific functionality that rests on a foundation based on other scientific Python libraries, such as NumPy and Matplotlib.
In order to scale to the needs of large datasets, Pangeo has identified the need to leverage the XArray and Dask libraries as part of this foundation. XArray provides a standard data model for n-dimensional gridded data based on the netCDF data model, similar to the Common Data Model used within the netCDF-Java library. Dask provides a framework for distributed computation that greatly simplifies the task of doing out-of-core computation, which is necessary to work with petabyte-scale datasets.
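To make the idea of out-of-core computation concrete, the sketch below computes a mean over a stream of values one chunk at a time, so that only a single chunk is held in memory. It uses only the Python standard library and is not the Dask or XArray API; the function names, the chunk size, and the synthetic temperature values are assumptions made for illustration. Libraries such as Dask apply the same split, reduce, and combine pattern to labeled n-dimensional arrays and can distribute the per-chunk work across workers.

```python
def chunked(values, chunk_size):
    """Yield successive lists of at most `chunk_size` items from any iterable."""
    chunk = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def partial_stats(chunk):
    """Per-chunk reduction; each call is independent and could run on its own worker."""
    return sum(chunk), len(chunk)


def combine(partials):
    """Merge the per-chunk (sum, count) pairs into a single mean."""
    total, count = 0.0, 0
    for chunk_sum, chunk_count in partials:
        total += chunk_sum
        count += chunk_count
    if count == 0:
        raise ValueError("no data")
    return total / count


def out_of_core_mean(values, chunk_size=1000):
    """Mean of an arbitrarily long stream, holding one chunk in memory at a time."""
    return combine(partial_stats(chunk) for chunk in chunked(values, chunk_size))


# A generator stands in for values streamed from a file too large for memory.
temperatures_k = (280.0 + (i % 10) for i in range(10_000))
mean_temp = out_of_core_mean(temperatures_k, chunk_size=256)
print(mean_temp)
```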
This work discusses the experience of integrating MetPy with this broader foundation, including concrete examples of the user-facing benefits that have been achieved, as well as the challenges encountered, such as the integration of MetPy’s support for physical units with XArray.
Brian Mapes, Yuan Ho
University of Miami, United States of America
We will describe the state of our project linking the IDV (a Java-based, mature, advanced software system) to the popular CPython ecosystem. GitHub repos, Python-world standard documentation, and Python package managers are the distribution mechanisms. New features have also been added to the IDV under project support. Students with limited skill sets have tested the system on diverse platforms.
D. Sarah Stamps1, James Gallagher2, Scott Peckham3, Anne Sheehan3, Nathan Potter2, Maria Stoica3, Sean Malloy1, Emmanuel Njinju1, Zachary M. Easton1, Daniel R. Fuka1
1Virginia Tech, United States of America; 2OPeNDAP; 3University of Colorado, Boulder
The objective of the EarthCube broker BALTO (Brokered Alignment of Long-Tailed Observations) is to provide a robust, community-extensible interoperability framework that fully embraces multi-domain interoperability. Here, we present our progress towards this objective, which encompasses four activities. First, we seek integration with EarthCube project P418 by developing a site map generator that lists both catalogs and datasets so that the P418 crawler can find these resources. We are also developing an approach to inject JSON-LD into the catalogs in the 'dataset form', so the P418 crawler can index the broker's datasets. Second, we are applying FAIR data principles (Findable, Accessible, Interoperable, and Reusable) to Global Navigation Satellite System / Global Positioning System (GNSS/GPS) velocity solutions that will be accessible via the NSF Geodesy Facility UNAVCO (www.unavco.org). Third, we are in the process of making BALTO available within the XSEDE container engine environments. We are implementing containerized versions of Hyrax and the resulting BALTO brokering engine, which aligns with the EarthCube P418 efforts because we are utilizing the same NSF XSEDE services. Finally, we have been able to expand broader impacts efforts and initiate our proposed development of training materials geared towards geo-scientists with a range of programming skill levels. The training involves using multi-institution co-developed Hydro-Ecological modeling courses that are currently being developed at Virginia Tech and Cornell University. Training sessions are being integrated into process modeling courses to compare traditional data search, data access, and data pre-processing methods with enhanced brokering methods. Because BALTO is an open-source brokering capability based on the existing, widely used open-source software Hyrax, we expect BALTO to be adopted across the GEO domains, given Hyrax's existing adoption by data centers throughout the US and internationally.
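The first activity above, a site map that lists catalogs and datasets plus JSON-LD embedded in catalog pages, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not BALTO's implementation: the example.org URLs and the dataset fields are hypothetical, and the choice of the schema.org Dataset type for the JSON-LD payload is an assumption about what a crawler would index.

```python
import json
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


def dataset_jsonld(name, url, description):
    """Return a <script> element carrying a schema.org Dataset record as JSON-LD."""
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "url": url,
        "description": description,
    }
    return (
        '<script type="application/ld+json">'
        + json.dumps(record, indent=2)
        + "</script>"
    )


# Hypothetical catalog and dataset URLs, for illustration only.
urls = [
    "https://example.org/opendap/catalog.html",
    "https://example.org/opendap/gnss/velocities.nc.html",
]
sitemap_xml = build_sitemap(urls)
snippet = dataset_jsonld(
    "Example GNSS velocity solution",
    urls[1],
    "Illustrative record; the fields are assumptions, not BALTO output.",
)
print(sitemap_xml)
print(snippet)
```

A crawler that reads the site map can then fetch each listed page and parse the embedded script element to obtain the dataset description.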
Tanu Malik1, Jonathan Goodall2, David Tarboton3, Scott Peckham4, Eunseo Choi5, Asti Bhatt6
1DePaul University, United States of America; 2University of Virginia, United States of America; 3Utah State University, USA; 4University of Colorado, Boulder, USA; 5University of Memphis, USA; 6SRI International, USA
Recent requirements of scholarly communication emphasize the reproducibility of scientific claims. Text-based research papers are considered poor mediums to establish reproducibility. Papers must be accompanied by “research objects”, aggregations of digital artifacts that together with the paper provide an authoritative record of a piece of research. We will present GeoTrust, an integrated workbench for creating, sharing, and reproducing reusable research objects. GeoTrust provides tools for scientists to create ‘geounits’, which are reusable research objects. Geounits are self-contained, annotated, and versioned containers that describe and package computational experiments in an efficient and light-weight manner. Geounits can be shared on public repositories such as HydroShare and, using the repositories' respective APIs, reproduced on provisioned clouds. The latter feature enables science applications to have a lifetime beyond sharing, wherein they can be independently verified and trust be established as they are repeatedly reused.
Through research use cases from several geoscience laboratories across the United States, we will demonstrate how tools provided by GeoTrust, along with HydroShare as its public repository for geounits, are advancing the state of reproducible research in the geosciences. For each use case, we will address different computational reproducibility requirements. Our first use case will be an example of setup reproducibility, which enables a scientist to set up and reproduce an output from a model with complex configuration and development environments. Our second use case will be an example of algorithm/data reproducibility, wherein a shared data science model/dataset can be substituted with an alternate one to verify model output results, and finally an example of interactive reproducibility, in which an experiment is dependent on specific versions of data to produce the result. Toward this we will use software and data used in preparing data for the MODFLOW model in Hydrology, JupyterHub used in HydroShare, PyLith used in Computational Infrastructure for Geodynamics, and GeoSpace Collaborative Observations and Assimilative Modeling used in space science.
Weiming Hu1, Guido Cervone1,2, Michael Mann3, Vivek Balasubramanian4, Matteo Turilli4, Shantenu Jha4
1Dept. of Geography and Institute for CyberScience, Geoinformatics and Earth Observation Laboratory, The Pennsylvania State University, University Park, PA; 2Research Application Laboratory, National Center for Atmospheric Research, Boulder, CO; 3Department of Meteorology and Atmospheric Science, The Pennsylvania State University, University Park, PA; 4Research in Advanced Distributed Cyberinfrastructure and Applications, The State University of New Jersey, NJ
The Analog Ensemble is a statistical technique to generate probabilistic forecasts. This is a computationally efficient solution to ensemble modeling because it does not require multiple NWP simulations, but a single model realization. However, the required computation can grow very large because atmospheric models are routinely run with increasing resolutions. For example, the NAM contains over 262,792 grids to generate a 12 km prediction. NWP models generally use a structured grid to represent the domain, despite the fact that certain physical changes occur non-uniformly across space and time. For example, temperature changes tend to occur more rapidly in mountains than plateaus. A new machine learning based algorithm is proposed to dynamically and automatically learn the optimal unstructured grid pattern. This iterative algorithm is guided by machine learning rule generation and instantiation to identify grid vertices. Analog computations are performed only at vertices, so minimizing the number of vertices reduces the required computation. Identifying their locations is paramount to optimize the available computational resources, minimize queue time, and ultimately achieve better results. The optimal unstructured grid is then used to perform probabilistic forecasts for a variety of applications like uncertainty quantification or renewable energy prediction. In this work, the short-term temperature is used as a study case.
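The core idea of the Analog Ensemble at a single grid vertex can be illustrated with a short Python sketch: find the past forecasts most similar to the current one and use the observations that verified them as ensemble members. This is a simplified sketch under assumptions, not the authors' implementation. It uses a plain Euclidean distance over two hypothetical predictors, the historical values are made up for illustration, and the unstructured-grid learning step described above is not shown.

```python
import math


def analog_ensemble(current_forecast, past_forecasts, past_observations, k):
    """Return the observations paired with the k past forecasts closest to the current one."""

    def distance(a, b):
        # Unweighted Euclidean distance across predictors (a simplifying assumption).
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    ranked = sorted(
        range(len(past_forecasts)),
        key=lambda i: distance(current_forecast, past_forecasts[i]),
    )
    return [past_observations[i] for i in ranked[:k]]


# Made-up history at one vertex: (forecast temperature in C, forecast wind speed in m/s)
# with the temperature that was later observed for each forecast.
past_forecasts = [(10.0, 3.0), (12.0, 5.0), (20.0, 2.0), (11.0, 4.0), (25.0, 8.0)]
past_observations = [9.5, 12.4, 19.0, 11.2, 24.1]

members = analog_ensemble((11.5, 4.5), past_forecasts, past_observations, k=3)
ensemble_mean = sum(members) / len(members)
print(sorted(members), round(ensemble_mean, 2))
```

The spread of the returned members gives a rough measure of forecast uncertainty, which is what makes the ensemble probabilistic even though only one model run is needed.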
A unified experimental-natural digital data system for analysis of rock microstructures
Julie Newman1, Basil Tikoff2, J. Douglas Walker3, Jason Ash3, Jessica Good Novak3, Randolph T. Williams4, Nicolas M. Roberts2, Cunningham Hannah1, Snell Alexandra1
1Texas A&M University, United States of America; 2University of Wisconsin - Madison, United States of America; 3University of Kansas, United States of America; 4McGill University, Canada
The StraboSpot data system (StraboSpot.org), initially developed as a field app for Structural Geology and Tectonics, enables the collection and sharing of field data and images. We are expanding StraboSpot to a desktop environment that allows collection, sharing, analysis and comparison of microstructural data. Rock microstructures relate processes at the microscopic scale to phenomena at the outcrop, orogen, and plate scales. Interpretation of microstructures formed in nature during deformation is aided by comparison with those formed during rock deformation experiments, under known conditions of pressure, temperature, stress, strain and strain rate. Interpretation of experimental rock deformation, likewise, benefits from the ground truth offered through comparison with rocks deformed in nature. However, the ability to search for relevant naturally or experimentally deformed microstructures requires a database that contains both types of these data. By developing a single digital data system for rock microstructures deformed in experiment and in nature, we can enable the critical interaction between practitioners of experimental deformation, those studying natural deformation and the cyberscience community.
To facilitate the collection of microstructural data, and to accommodate the workflow of these communities, this system requires: 1) The development of a common system to communicate the orientation of samples used for microanalysis (e.g., thin sections) and the location of features within samples; 2) Modification of the StraboSpot data system to accept microstructural data from both naturally and experimentally deformed rocks; and 3) Linking the microstructural data to its geologic context – either in nature, or its experimental data/parameters. As first steps to meet these goals, we have designed an orientation and grid system for samples used for microanalysis and are engaging the relevant communities to establish metadata, data standards and protocols for data collection.
U.S. Geological Survey, United States of America
EarthCube and the U.S. Geological Survey’s Community for Data Integration (CDI) are two communities of practice with the common goal of providing tools, infrastructure, and knowledge for Earth Science data and science integration. The CDI is a group that helps its members gain the skills for handling and integrating Earth and biological data through activities like monthly presentations, collaboration areas, and trainings. Through an annual request for proposals, the CDI also provides financial support to USGS data integration and data management activities, which helps to put new technologies to practical use. Can EarthCube and the CDI use each other’s networks and outputs to more quickly achieve their goals? Both Earth Science communities have been in existence for several years and are working to develop pilot demonstrations, technology integration, definitions of success, and ways to enhance engagement in the scientific community and with external partners. This poster identifies areas of potential connection and collaboration between the different funded projects and working groups in the CDI and EarthCube, with the goal of starting conversations that will mutually benefit both of these communities. Potential connections include presentations to share knowledge on common interests and challenges, leveraging outputs and adopting products and best practices, collaboration on future projects, and increased dual membership.
Jason Ash1, Julie Newman2, Jessica Good Novak1, Duncan Casey3, Marjorie Chan3, Diane Kamola1, Elizabeth Hajek4, Kristin Bergmann5, Allen Glazner6, Blair Schoene7, Frank Spear8, Basil Tikoff9, J. Douglas Walker1
1University of Kansas, United States of America; 2Texas A&M University, United States of America; 3University of Utah, United States of America; 4Pennsylvania State University, United States of America; 5Massachusetts Institute of Technology, United States of America; 6University of North Carolina, Chapel Hill, United States of America; 7Princeton University, United States of America; 8Rensselaer Polytechnic Institute, United States of America; 9University of Wisconsin, United States of America
StraboSpot is a data system to collect and share geologic data. It consists of an app for iOS and Android devices and a graph database for the persistence layer. The main website at StraboSpot.org allows user management and data exploration and downloads. StraboSpot is being developed using extensive community input to define workflow, vocabulary, and specifications. The first iteration of StraboSpot was developed for structural geology and tectonics, and was tuned to collecting data on maps and images in the field. We are now expanding functionality to serve the petrology and sedimentary geology communities.
We have focused on the needs of these communities for using the field app. Critical to this process is ensuring that StraboSpot can accommodate the workflow of professionals and students collecting data in the field. For both the sedimentary geology and petrology communities we organized workshops to define vocabulary and workflow, then developed prototypes for field-testing and examination by experts and students. Working with data for igneous and metamorphic petrology proved relatively straightforward, and the workflow and visualization in the app were appropriate with the addition of more vocabulary. For sedimentary geology we have built a new data entry interface that is based on the use of measured stratigraphic sections. This new mode is in addition to developing new specifications for vocabulary. The construction of the stratigraphic section uses the same technology as the original mapping functions, but also incorporates integration of images into the column.
The addition of measured sections has led us to develop what we refer to as a new mode for data collection. The mode relies on a column rather than a map. We anticipate that additional data collection modes will be needed as StraboSpot expands to serve the needs of additional field and laboratory communities.
Janine Krippner1, Stephen Kuehn1, Simon Goring2, Kerstin Lehnert3, Douglas Fils4, Amy Myrbo5, Anders Noren5, Cheryl Cameron6
1University of Concord, United States of America; 2University of Wisconsin - Madison; 3Lamont-Doherty Earth Observatory; 4Consortium for Ocean Leadership; 5LacCore Facility, University of Minnesota; 6State of Alaska, Division of Geological and Geophysical Surveys, Alaska Volcano Observatory
Tephra layers are beds of volcanic ash and pyroclastic material that can travel great distances, providing a near-instant time marker that can reach across oceans and continents. These tephra layers are used as stratigraphic and chronologic markers in fields including, but not limited to, volcanology, tephrochronology, archaeology, and paleolimnology. Their use has been enhanced by recent advances in the study of widespread cryptotephras, tephra shards that are so sparse that they require additional techniques to detect. Each of these fields stores data in online databases, published material, and more commonly in personal repositories. These data may include physical (particle size, bed thickness), location, geochemical, mineralogical, time-stratigraphic, and interpretive information that can be used to match samples from different sites and to calculate age models. THROUGHPUT is a collaborative project working to increase discoverability of and access to tephra data across disciplines. Best practices for tephra data capture, reporting, and storage need to be developed and implemented across fields for everyone to benefit, not only from access to global data, but from the integration of database tools. In order to do this, schemas across existing databases must be understood and incorporated into a system that can generate citable, reproducible workflows that draw information from across data resources, with workflows made available through keyword searches, with full attribution. This collaborative resource will allow data comparisons across data fields, utilizing a much more comprehensive dataset for all involved.
Daniel Garijo1, Jo Martin2, Natalie Freed3, Suzanne A. Pierce3,4, Yolanda Gil1, David R. Thompson5, Ibrahim Demir6, Imme Ebert-Uphoff7
1Information Sciences Institute, University of Southern California; 2Department of Geology, Oberlin College; 3Texas Advanced Computing Center, University of Texas Austin; 4Environmental Science Institute, University of Texas Austin; 5Jet Propulsion Laboratory, California Institute of Technology; 6Department of Civil and Environmental Engineering, University of Iowa; 7Electrical & Computer Engineering, Colorado State University
The EarthCube Research Coordination Network for Intelligent Systems for Geosciences (IS-GEO RCN) catalyzes collaborations to enable advances in our understanding of Earth systems through innovative applications of intelligent and information systems to fundamental geoscience problems. The uncertain, heterogeneous and disparate nature of geoscience data paired with recent IS advances and increases in observational data offer unique opportunities for new approaches and discoveries through joint efforts.
The IS-GEO RCN has jumpstarted interdisciplinary research collaborations through various activities. These include convening a lightning session and townhall at the 2016 American Geophysical Union Fall meeting (Dec 2016), hosting an initial IS-GEO workshop with 30 participants (Jan 2017), completing a lexithon challenge event at the workshop (Jan 2017), organizing a workshop at the 2017 SIAM International Conference on Data Mining (April 2017), running the pilot IS-GEO Summer institute (July 2017), jointly writing overview papers on challenges, opportunities and resources in this emerging field, and establishing an Early Career Committee.
Three active working groups foster collaborations between computer scientists and geoscientists. The working group on Case Studies (CASES) seeks to assemble a collection of case studies from the geosciences, to create easy entry points for IS researchers to work on geoscience applications. The working group on Education (EDU) is coordinating summer schools and writing a survey paper on existing interdisciplinary IS-GEO courses and degree programs. The working group on Models (MODELS) seeks to design a repository of geoscience models, with semantic descriptions of their characteristics using ontologies.
How to get involved: The IS-GEO RCN always welcomes new members. Interested researchers are invited to visit our website at https://is-geo.org/ to 1) learn about our activities; 2) attend monthly telecons with invited IS-GEO speakers; 3) sign up for the general mailing list and 4) join any working group.
James M Done, Cindy L Bruyère
National Center for Atmospheric Research, United States of America
The EarthCube project Accelerating Scientific workflowS using EarthCube Technologies (ASSET) accelerates scientific discovery through integrated use of EarthCube technologies. This is achieved by building capacity for scientists to sketch their workflows, identify the bottlenecks, and access and integrate EarthCube tool solutions.
Initial understanding is being developed through a demonstration use case. This use case is a multi-disciplinary investigation of whether climate change has contributed to the recent increase in U.S. hurricane losses. This presentation sketches the human activity workflow, the computational workflow, and their points of intersection. The workflow identifies the challenges of data discovery, big data, data connectivity, and also lost efficiency, metadata and reproducibility. The workflow sketch uses the Common Motifs in Scientific Workflows framework developed by Garijo et al. (2014) to allow for generalizable outcomes later in the project.
Once the workflow has been analysed, this use case will be used to explore how EarthCube tools may i) reduce time-to-science, and ii) integrate data, analysis and publication. Scientists disagree on whether there has been a climate change contribution to US hurricane losses despite starting with the same data. This use case therefore also presents an opportunity to use EarthCube tools to explore scientific disagreement.
The purposes of the poster are to introduce ASSET and to facilitate conversations among scientists and cyberinfrastructure experts about the major roles for unique combinations of EarthCube tools and technologies across science domains.
Eugene Yu1, Liping Di1, Ziheng Sun1, Chen Zhang1, Juozas Gaigalas1, Benjamin A. Cash2, James L. Kinter2, David H. Bromwich3
1CSISS - George Mason University, United States of America; 2COLA - George Mason University, United States of America; 3Byrd Polar and Climate Research Center, The Ohio State University, United States of America
Climate modeling and teleconnection studies, which link different events across space and time, often start with discovering the right data and observations to use as input, in the right projection, (temporal and/or spatial) resolution, and format. Currently, the science community mainly relies on particular channels in their specific domains (e.g. peer-reviewed papers, specialized conferences) or on personal communication to find the proper data and observations for modeling. Finding and preparing data takes considerable effort away from scientific study. This study focuses on utilizing the semantic data discovery and access capabilities realized in the EarthCube building block projects: BCube, CyberConnector, and GeoWS. Through CyberConnector, a federated search can be realized against THREDDS, CWIC, and FedEO. Their semantic information, especially taxonomy under GCMD, can be used to induce or refine the search when the scientist gives a phrase or a series of topics. The federated, semantic search capability is developed as a function of CyberWay, a service system based on integrating existing EarthCube building block components. The capability of refined or induced semantic search is applied in the teleconnection analysis to study the associations of climate events with other events (climatic or non-climatic). The search success with improved relevance is demonstrated.
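The idea of using a keyword taxonomy to broaden a search across several catalogs can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two-entry keyword hierarchy merely stands in for a taxonomy such as GCMD, the catalogs are in-memory lists rather than remote services, and the function names are hypothetical, not part of CyberConnector or CyberWay.

```python
# Hypothetical keyword hierarchy: broader term -> narrower terms.
TAXONOMY = {
    "precipitation": ["rainfall", "snowfall"],
    "sea ice": ["sea ice extent", "sea ice concentration"],
}

# Hypothetical in-memory stand-ins for remote catalogs (dataset titles only).
CATALOGS = {
    "catalog_a": ["Daily rainfall totals", "Surface wind speed"],
    "catalog_b": ["Monthly snowfall analysis", "Arctic sea ice extent"],
}


def expand_query(term, taxonomy):
    """Return the search term plus any narrower terms listed in the taxonomy."""
    term = term.lower()
    return [term] + taxonomy.get(term, [])


def federated_search(term, catalogs, taxonomy):
    """Search every catalog for titles containing any of the expanded terms."""
    terms = expand_query(term, taxonomy)
    hits = []
    for name, titles in catalogs.items():
        for title in titles:
            if any(t in title.lower() for t in terms):
                hits.append((name, title))
    return hits
```

In this sketch a search for "precipitation" finds the rainfall and snowfall datasets even though neither title contains the word itself, which is the kind of relevance gain that taxonomy-induced search is meant to provide.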
Ilya Zaslavsky1, Stephen Richard2, David Valentine1, Thomas Whitenack1, Gary Hudman7, Karen Stocks4, Jeffrey Grethe3, Amarnath Gupta3, Ouida Meier5, Bernhard Peucker-Ehrenbrink6, Burak Ozyurt3
1San Diego Supercomputer Center; 2Lamont Doherty Earth Observatory; 3University of California, San Diego; 4Scripps Institution of Oceanography; 5University of Hawaii; 6Woods Hole Oceanographic Institution; 7Arizona Geological Survey
Most discovery tools to date have focused on searching resource metadata and related indexed content to locate data files or services, or software tools, of interest for addressing geoscience research problems. A typical scenario involves downloading data files and then preparing them for use in a locally-installed or a server-based software application. Accessing, exploring and configuring the data for use in research applications often takes significant time due to inconsistent or incomplete metadata, poorly described data semantics, and lack of mechanisms for bringing data directly into a workbench environment.
The EarthCube Data Discovery Hub is developing approaches and infrastructure to reduce time to science by linking search results in the catalog directly to software tools and environments. Implementing such linkages can be done in several ways: 1) Generate machine-actionable links that will open web-accessible applications and load the data resource and include them in metadata; 2) Augment metadata descriptions to make the data easier to interpret and incorporate in a research workflow; 3) Provide standardized, structured descriptions of data distribution options ('affordances') that client applications could use to match with applications dynamically, and 4) Develop bi-directional interfaces for invoking online applications and workbenches directly from the data discovery environment, and subsequently updating the catalog with information about data usage on the workbench.
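As a rough illustration of option 3, the Python sketch below matches structured descriptions of distribution options against the formats that applications declare they can open, and builds a launch link for each match. The class names, field names, media types, and URLs are all hypothetical placeholders, not the Data Discovery Hub's actual metadata schema or interfaces.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class Distribution:
    """One way a dataset can be accessed (an 'affordance')."""
    url: str
    media_type: str


@dataclass(frozen=True)
class Application:
    """A tool that declares which media types it can open."""
    name: str
    accepts: frozenset
    launch_template: str  # hypothetical URL template with a {data_url} slot


def workbench_links(distributions, applications):
    """Pair each distribution with every application able to open it."""
    links = []
    for dist in distributions:
        for app in applications:
            if dist.media_type in app.accepts:
                # Percent-encode the data URL so it is safe as a query value.
                target = app.launch_template.format(data_url=quote(dist.url, safe=""))
                links.append((app.name, target))
    return links


# Hypothetical example records (example.org is a placeholder domain).
DISTRIBUTIONS = [
    Distribution("https://example.org/data/wells.csv", "text/csv"),
    Distribution("https://example.org/data/sst.nc", "application/x-netcdf"),
]
APPLICATIONS = [
    Application("table-viewer", frozenset({"text/csv"}),
                "https://example.org/table?src={data_url}"),
    Application("grid-viewer", frozenset({"application/x-netcdf"}),
                "https://example.org/grid?src={data_url}"),
]
```

Because the matching happens at query time, adding a new application only requires declaring what it accepts; existing metadata records do not need to be edited.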
Inclusion of the additional information systematically in metadata is performed by the CINERGI metadata augmentation pipeline, with the metadata subsequently indexed and published in the DDH catalog, which now provides options for 'Workbench' linkage from search results. Adoption of conventions for such additional metadata, and interfacing search results with one or several research workbenches, can enable EarthCube to streamline the workflow from data discovery to data utilization.
Christopher J. Crosby1, Paola Passalacqua2, Nancy Glenn3
1UNAVCO; 2University of Texas at Austin; 3Boise State University
This EarthCube Research Coordination Network (RCN), Advancing the Analysis of High Resolution Topography Data (A2HRT), will bring together communities that create and use high resolution topographic data, including those that conduct research on earth surface processes and those that create the technology to make these types of complex data usable. Members of this network will share currently available resources and best practices, and develop new tools to make data more available to researchers. Training will focus on teaching graduate students and early career researchers to access and use high-resolution topographic data to answer earth science research questions.
Vast quantities of HRT data have been collected, with applications ranging from research to engineering. Full scientific utilization of HRT data is still limited due to challenges associated with the storage, manipulation, processing, and analysis of these data. The cyberinfrastructure community, including computer vision, computer science, informatics, and related engineering fields, is developing advanced tools for visualizing, cataloging, and classifying imagery data including point clouds. Yet, many of these tools are most applicable to engineered structures and small datasets, and not to heterogeneous landscapes. Together the earth science and cyberinfrastructure communities have the opportunity to test and validate emerging tools in challenging landscapes (e.g., heterogeneous and multiscale landforms, vegetation structures, urban footprints). In particular, A2HRT will be focused on four themes: (1) coordination of analysis of HRT data across the earth surface processes and hydrology communities to identify workflows and best practices for data analysis; (2) identification of cyberinfrastructure development needs as new technologies for HRT data acquisition emerge; (3) use of HRT data for numerical model validation and integration of HRT data information in models; (4) training in HRT best practices, and data processing and analysis workflows.
An initial A2HRT RCN workshop will be held August 2018 in Boulder, CO.
Juozas Gaigalas, Liping Di, Ziheng Sun, Eugene Yu, Chen Zhang
GMU, United States of America
The ability to accurately describe and predict the future evolution of our planetary geophysical system is a major challenge and an urgent goal for the scientific community. The dynamics of ocean, atmosphere, polar and other Earth systems can be effectively studied using computer models. Earth System modeling is an active domain of research that joins multiple geophysical and computational disciplines. Earth System computer models require heterogeneous datasets to determine the initial system conditions and to verify model performance. Modeling studies also generate large datasets that represent the predicted physical parameters of the Earth System. In some cases, the results of models are used as inputs by other models. The data consumed and produced by Earth System models are not stored in a single repository but are spread across multiple archives managed by diverse research organizations and academic institutions. Research data archives have individual policies for cataloging the metadata for datasets they manage. This means that modellers who want to incorporate new and unfamiliar datasets into their model research might not be able to utilize metadata catalogs provided by research data repositories without significant additional costs. This adds barriers to experimentation in Earth System modeling research. Our talk will describe the process of combining metadata from several data sources to support model researchers studying the connection between Arctic warming and changes in seasonal ocean and atmosphere dynamics. We will discuss conceptual and technical challenges encountered in our effort to catalog data from multiple research data archives into a single centralized searchable catalog as part of EarthCube CyberWay building block development.
Mak Saito1, Danie Kinkade1, Adam Shepherd1, David Gaylord1, Jacyln Saunders1, Noelle Held1, Michael Chagnon2, Nick Symmonds1, Matthew McIlvin1
1Woods Hole Oceanographic Institution, United States of America; 2RPS Ocean Science, South Kingstown, RI
Achieving EarthCube’s vision of a dynamic cyberinfrastructure requires leveraging existing infrastructure, and creating new components to fill gaps in capabilities. The development of needed data infrastructures enabling interdisciplinary data to be discovered, accessed, analyzed and visualized is critical to fulfilling EarthCube’s vision. Here, we describe new EarthCube data infrastructure in support of ocean proteomics research, an emerging field possessing unique data challenges.
The study of proteins has great potential as a tool for ocean scientists interested in detecting changes in ocean ecosystems. Protein measurements can be used as biomarkers of key biochemical processes, as well as allow broad diagnosis of entire ecosystems. The ability to share these data with non-expert users can increase scientific discovery and understanding of biogeochemical change over time. Yet, protein datasets have specific metadata and contextual data that are critical to their interpretation and are not supported by current biomedical resources.
The Ocean Protein Portal (OPP) aims to make ocean protein data more discoverable and accessible to domain researchers and non-experts. Through collaboration with the Biological and Chemical Data Management Office (BCO-DMO), we are constructing a web portal that allows sequence- and text-based searches of ocean protein datasets. Protein data and metadata are submitted to BCO-DMO where they are indexed in ElasticSearch. This search index is connected to a website for data discovery and investigation. An API allows the OPP to leverage previously developed software for determining taxonomic relationships. The Portal allows users to query the occurrence of proteins of interest within the ocean. Search results are returned in tabular and geospatial forms and are available for export. Users are able to refine their query to generate taxonomic assignments.
Protein results from a 2011 oceanographic expedition have been ingested into BCO-DMO, indexed in the OPP, and are searchable. Associated cruise data at BCO-DMO will be discoverable and accessible through OPP search results. Because BCO-DMO is a participant in the EarthCube P418 Project, the protein-related data, once ingested into BCO-DMO, will not only be discoverable through the OPP, but also through schema.org enabled search engines.
Vivekanandan Balasubramanian1, Matteo Turilli1, Weiming Hu3, Matthieu Lefebvre2, Wenjie Lei2, Guido Cervone3, Ryan Modrak2, Michael Mann3, Jeroen Tromp2, Shantenu Jha1
1Rutgers University, United States of America; 2Princeton University, United States of America; 3Penn State University, United States of America
Many scientific problems require multiple distinct computational tasks to be executed in order to achieve a desired solution. In the “Power of Many” project, we discuss two scientific applications: seismic inversion and adaptive Analog Ensembles.
Seismic inversion is the most powerful tomographic technique to study the Earth’s interior. Scaling this technique is challenging because of the scale of computational resources required, number of failures encountered at large scale and human labor needed.
Adaptive Analog Ensemble generates probabilistic weather forecasts on a dynamically optimized unstructured grid. The adaptive requirement of this algorithm poses challenges in resource and workload management that relate to the size of the search space, the desired prediction accuracy, and the predictability of the weather variable. The goal is to generate predictions for suitability of photovoltaic power generation from various climate simulations.
We developed the Ensemble Toolkit (EnTK) to address the challenges of scale, diversity and reliability of these two scientific applications, generalizing the proposed solutions to all ensemble-based applications. EnTK offers four main capabilities: (i) abstractions to describe and execute ensemble applications; (ii) execution of ensembles on heterogeneous computing infrastructures; (iii) scalability up to O(10^4) tasks; and (iv) task and resource fault tolerance. We used EnTK to provide automation and fault tolerance for the seismic inversion application and to enable the development of a novel Adaptive Unstructured Analog algorithm for the adaptive analog ensemble application.
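The combination of ensemble execution and task-level fault tolerance can be illustrated with a small Python sketch. It deliberately does not use the EnTK API: it relies only on Python's standard concurrent.futures module, and the function names, retry policy, and simulated failure are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retries(task, arg, max_attempts=3):
    """Run task(arg), retrying on any exception up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return {"arg": arg, "result": task(arg), "attempts": attempt}
        except Exception as err:  # a real system would distinguish error types
            last_error = err
    return {"arg": arg, "error": repr(last_error), "attempts": max_attempts}


def run_ensemble(task, args, max_workers=4, max_attempts=3):
    """Run one task per ensemble member concurrently; keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_with_retries, task, a, max_attempts) for a in args]
        return [f.result() for f in futures]


def make_flaky_task():
    """Build a hypothetical task that fails once for odd-numbered members."""
    calls = {}

    def task(member_id):
        calls[member_id] = calls.get(member_id, 0) + 1
        if member_id % 2 == 1 and calls[member_id] == 1:
            raise RuntimeError("simulated node failure")
        return member_id ** 2

    return task
```

Calling `run_ensemble(make_flaky_task(), range(4))` completes all four members, with the odd-numbered ones succeeding on their second attempt. A member that never succeeds is reported with an `error` entry instead of aborting the whole ensemble, which is the behaviour that matters when many tasks run at once.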
We described the achievements of project year one in “Harnessing the Power of Many: Extensible Toolkit for Scalable Ensemble Applications”, in a publication at IPDPS18. In the second year, we have initiated the support of the seismic inversion application at production scale on ORNL Titan and Summit, and the development of the adaptive Analog Ensemble application on NCAR Cheyenne.
Moges Berbero Wagena
Virginia Polytechnic Institute, United States of America
Watershed models that inform water quality management require data from the atmospheric sciences (i.e., weather data: historical records, current conditions, short-term forecasts, and climate forecasts), plant biology (regional and environment-based plant growth characteristics), ecosystem science (landcover surface interactions), and economics (census, commodity sales, fertilizer sales), as well as surface topography and in situ hydrological measurements. In addition, this use case involves collecting and storing data, including Structure from Motion (SfM) topography, in situ hydrological and surface saturation data, and others that are manually entered into CUAHSI, USGS, and/or NOAA as appropriate. This use case will apply metadata injection mechanisms developed in the framework of this project and develop accessors for the hydrologic models. Rather than focusing on a single watershed with local data to test surficial effects on water quality, BALTO will provide the capability to assess surface data from a global project (US, Africa, and several tropical regions) to deepen our understanding of the agile implementation of the use case, i.e., does following the workflow all the way through one time lead to new datasets being required (such as changing a topographic dataset, TOPOSWAT, RSWAT, and others)?
Chad Trabant, Tim Ahern, Mike Stults, Inge Watson, Robert Weekly
IRIS Data Management Center
The IRIS Data Management Center (DMC) has operated a public repository of seismological data for three decades, supporting thousands of researchers. Since its founding, the DMC has operated its own infrastructure to provide the computational and storage resources needed to support its mission. In the EarthCube GeoSciCloud project the DMC is deploying a subset of its archive and key software components into two cloud environments. This project allows the DMC to evaluate the realities of operating in the cloud, explore the potential advantages and disadvantages, and compare costs. The two cloud environments selected for this project are Amazon’s AWS and XSEDE’s Jetstream and Wrangler systems. The XSEDE resources are operated on behalf of NSF by Indiana University jointly with the Texas Advanced Computing Center. The DMC deployed a ~40 terabyte test data set and a subset of its web service-based data access architecture to both environments. The DMC is conducting an extensive evaluation of the capabilities of these deployments. To ensure these systems support and, ideally, improve upon real-world research use cases, the DMC collaborated with scientists who performed their own tests designed to meet their research needs. A promising, expected gain from cloud-like environments over DMC-operated systems is the ability to scale out in order to handle more simultaneous users, both with respect to storage I/O and processor-intensive tasks. Another potential advantage is providing data within, or very near to, a powerful computing environment that researchers may also use. Also, evaluating the relative costs of the cloud environments against the DMC’s own infrastructure will be critical. We will report on the status of this work and lessons learned so far.
Jeremiah Marsicek, Simon Goring, Shaun Marcott, Stephen Meyers, Shanan Peters, Ian Ross, Brad Singer, Jack Williams
University of Wisconsin-Madison, United States of America
GeoDeepDive is an expanding digital library that allows for automated, secure acquisition and management of original documents from publishers and supports large-scale text and data mining of published, peer-reviewed journal articles. GeoDeepDive development continues, but major efforts are now focusing on deployment and on using GeoDeepDive to enable large-scale synthetic geoscientific research. Here, we use GeoDeepDive to study the behavior of Northern and Southern Hemisphere ice sheets since the start of the Pliocene (the last 5.3 million years) with respect to ice-rafted debris (IRD) found in ocean cores, and present the workflow and framework for using GeoDeepDive to answer other important scientific questions. Many publications document the existence of IRD at the level of individual marine drilling sites, but assembling this information across publications into large-scale mapped syntheses is a non-trivial task that has traditionally taken years of painstaking literature compilation. We have been using this cyberinfrastructure, GeoDeepDive, to mine publications (over 6,000,000 from many different publishers including Wiley, Elsevier, etc.) using optical character recognition and applying natural language processing utilities to the documents. Once the documents are obtained, we generate code to scan through thousands or millions of publications to extract information relevant to a set of search queries (e.g., a certain term, time period, or location). The workflow, which will help researchers design their own synthetic data analyses using GeoDeepDive, includes obtaining a set of documents that meet our search criteria, cleaning the corpus of documents to ensure that they contain instances of IRD in cores, and using regular expressions to extract information about the location of the core and the timing of events. This work will ultimately result in summary statistics about the cleaned set of documents and a summary map of IRD for key time periods from across the Pliocene and Pleistocene.
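The cleaning and extraction steps of such a workflow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the sentences are invented examples, the regular expressions cover only one simple coordinate format and ages written as "N Ma", and the function name is hypothetical.

```python
import re

# Hypothetical sentences standing in for text returned by a document search.
SENTENCES = [
    "Ice-rafted debris (IRD) first appears at ODP Site 907 at 3.3 Ma.",
    "The section at 45.5°N, 25.3°W contains abundant foraminifera.",
    "IRD was recorded in the core at 67.2°N, 2.9°E between 2.7 Ma and 2.4 Ma.",
]

# Keep only sentences that mention IRD (the "cleaning" step).
IRD_RE = re.compile(r"\bIRD\b|ice[- ]rafted debris", re.IGNORECASE)
# Decimal-degree coordinates such as "67.2°N, 2.9°E".
COORD_RE = re.compile(r"(\d+(?:\.\d+)?)°([NS]),\s*(\d+(?:\.\d+)?)°([EW])")
# Ages in millions of years, such as "2.7 Ma".
AGE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*Ma\b")


def extract_ird_records(sentences):
    """Return coordinates and ages found in sentences that mention IRD."""
    records = []
    for sentence in sentences:
        if not IRD_RE.search(sentence):
            continue
        coords = []
        for lat, ns, lon, ew in COORD_RE.findall(sentence):
            # Southern latitudes and western longitudes become negative.
            coords.append((float(lat) * (1 if ns == "N" else -1),
                           float(lon) * (1 if ew == "E" else -1)))
        ages = [float(a) for a in AGE_RE.findall(sentence)]
        records.append({"sentence": sentence, "coords": coords, "ages_ma": ages})
    return records
```

Real publications express locations and ages in many more ways (degrees and minutes, site names, "ka", age ranges), so patterns like these would need to be extended and their matches checked by hand before any summary statistics are computed.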