Project Summary

We will deliver a Polar Deep Insights system that collects, analyzes, and makes interactive the wealth of textual and scientific Polar data collected to date in systems such as: (1) the Advanced Cooperative Arctic Data and Information System (ACADIS); (2) the NASA Antarctic Master Directory (AMD); and (3) the NSIDC Arctic Data Explorer (ADE). These repositories exhibit the characteristics of the Deep / Dark data domain and are indicative of the diversity of data that we see in EarthCube. For example, ACADIS comprises 50% textual data and 50% actual science data (which crawlers normally cannot extract) and requires a login to see the data; AMD is a metadata directory of links to science data; and ADE relies heavily on JavaScript, AJAX calls, and dynamic page rendering. All of these are characteristics of the Deep Web, and capabilities to extract and understand this information will be of use throughout EarthCube and its scientific domains.

Our project combines three existing EarthCube Building Blocks (Bcube, GeoDeepDive, and OntoSoft/GPF) and builds on prior NSF Polar CyberInfrastructure work and community workshops in 2013 and 2014, along with investments from the DARPA MEMEX effort. Our system will crawl the Deep Polar Web and will extract knowledge from text using information retrieval and data science (IRDS) techniques that bring together unstructured and structured science data to provide transformative insights. The web data we collect will be actively reviewed by EarthCube Polar geoscientists at ESIP and AGU, via NSIDC, and as part of the Polar Research Coordination Network (RCN). We will also build and deliver extracted data from a rich extraction pipeline that provides the following derived data: (1) Named Entity Recognition of people, places, and Polar topics from data descriptions and abstracts; (2) Automatic Enrichment of Scientific Literature; (3) Automatic Identification of Scientific Measurements, such as “7 cm”; and (4) Automatic Multimedia Text and Metadata Extraction. Extractions and crawled web data will drive an interactive, D3-based web interface combining data visualization, science, and CyberInfrastructure in response to a key recommendation from the NSF DataViz Hackathon.
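As an illustration of derived data item (3), the following is a minimal sketch of how measurements such as “7 cm” might be identified in free text. It is not the project's extraction pipeline: the unit list, the regular expression, and the function name extract_measurements are all assumptions made for this example, and a production system would need a much larger unit vocabulary and handling for ranges, uncertainties, and compound units.

```python
import re

# Hypothetical, deliberately small unit vocabulary (an assumption for this
# sketch). Longer units are listed before "m" so they are tried first.
UNITS = ["km", "cm", "mm", "m", "kg", "g", "°C", "K"]

# A number (optionally negative or decimal), optional whitespace, then a unit.
# The lookbehind stops matches from starting in the middle of a word or number;
# the lookahead stops "3 months" from being read as "3 m".
MEASUREMENT_RE = re.compile(
    r"(?<![\w.])(-?\d+(?:\.\d+)?)\s*("
    + "|".join(re.escape(u) for u in UNITS)
    + r")(?![A-Za-z])"
)


def extract_measurements(text):
    """Return a list of (value, unit) pairs found in the given text."""
    return [(float(value), unit) for value, unit in MEASUREMENT_RE.findall(text)]


sample = "Snow depth was 7 cm near the station, and the core was 2.5 m long."
print(extract_measurements(sample))  # [(7.0, 'cm'), (2.5, 'm')]
```

A pattern-based approach like this is easy to audit but only finds units it has been told about; it is shown here simply to make the kind of output concrete.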

Our preliminary work in this area has shown that unstructured textual data, when combined with structured scientific information, can inform answers to grand challenge problems such as identifying ice sheet breakage/melt over decadal time spans; bird migration around Greenland; oil spills and natural disasters; sea ice decline and its relation to natural disasters; and other critical questions for the Polar community derived from the President’s National Strategy for the Arctic Region. The strategy identifies the Polar regions as a source of critical natural/commercial resources, including oil, iron, and other ore; as a focus of national security interests (maritime/air/sea and land); and as an area highly impacted by climate change, especially the reduction of sea ice caused by warming.