
What kind of discoveries might be hidden in the growing sea of ‘dark data’?

GeoDeepDive might be able to tell us.

As a field, science has been growing, developing, and evolving – much like a living organism. The body of evidence and data in geological sciences alone has become large and complex, growing and transforming over generations. The internet has changed research to such a degree that it would likely be unrecognizable to early scientists. The interconnectivity that scientists and science administrators take for granted has made creating and sharing data immensely easier, resulting in an enormously complex and dynamic body of data.


Already our accelerated output has outpaced our ability to ingest and digest all of the information we are generating. As science continues to inundate itself in data, the different parts of the scientific body have become difficult to understand and balance. Those who work in science need increasing amounts of training just to arrive at a basic understanding of how to manage their data, and they must constantly follow and refresh the work of others while performing, editing, and repeating their own research.

It is challenging enough to stay up to date with all of the moving parts in one’s own specialty, much less any other discipline. This results both in overwhelmed researchers and in missing or overlooked data. Many datasets go unexamined and unnoticed by those who could use them, simply because there are so many out there.


These unexamined datasets, called “Dark Data”, are one of the emergent properties of the rapid evolution of scientific research. Dark Data encompasses datasets that become buried in the immense volume of information that is being generated and disseminated in publications. These datasets essentially disappear after publication, much like old letters that get pushed to the bottom of the drawer. This means that an unimaginable amount of potentially relevant scientific research is being underutilized, and important discoveries could be buried in the sea of Dark Data, if only it could be aggregated.

As science has developed into a multi-modal, dynamic, reactive organism, understanding its smallest parts has become more difficult and more important, just as understanding the human body depends on examining minuscule chemical structures and enzymatic reactions. In order to understand the small parts, we need to access, organize, and analyze Dark Data.


Understanding the microanatomy of any field, from earth science to physics, astronomy, or psychology, takes advanced techniques and tools that can search millions of haystacks, find hundreds of needles, and then separate them all efficiently. In much the same way that CT scans and X-rays revolutionized the study and diagnosis of disease in the human body, making previously unthinkable discoveries commonplace, the geosciences are in dire need of new tools that will more efficiently plumb the depths of our growing sea of data.


One new tool that has already been used by researchers to descend into their Dark Data is called GeoDeepDive. Designed by both geoscientists and computer scientists, GeoDeepDive is a digital library and machine reading system that can be coupled to software capable of searching – and understanding – scientific publications.


Many GeoDeepDive applications begin with a customized full-text keyword search, similar to Google Scholar, but they then use the full text of the documents as input for machine learning and data mining applications. The goal is to bring back and aggregate information, from across thousands or even millions of documents, that is relevant to the specific questions a scientist is asking, as opposed to simply retrieving a list of articles that use the same word (though lists of articles can be generated in this fashion too). This allows not only for retrieval of data, but also for discovery of data and connections that a researcher may not have been aware of. GeoDeepDive can retrieve and organize information in a way that would take any scientist months, or even years, to do on their own.


One of GeoDeepDive’s first team members is Jon Husson, now at the University of Victoria. He estimated that simply searching for just one keyword, say, “stromatolite”, and sifting through all of the articles to find those in which stromatolites are described from named rock formations would take him roughly 16 months of time-consuming tedium, assuming that was all he did in his professional life. GeoDeepDive performed the process in about an hour. That’s a 99.96% decrease in working time to perform a critical discovery task, the results of which are published in Geology (doi:10.1130/G38931.1).
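The 99.96% figure can be checked with a few lines of arithmetic. The sketch below is only an illustration: the article does not say how many hours make up a working month, so the value of 160 hours (40 hours a week for four weeks) is an assumption, and the function name `percent_reduction` is hypothetical.

```python
# Assumption (not stated in the article): one working month is about
# 160 hours, i.e. 40 hours per week for 4 weeks.
HOURS_PER_WORKING_MONTH = 160


def percent_reduction(manual_hours: float, machine_hours: float) -> float:
    """Percentage decrease in working time when moving from manual to machine."""
    return (1 - machine_hours / manual_hours) * 100


# 16 months of manual searching versus about one hour of machine time.
manual_hours = 16 * HOURS_PER_WORKING_MONTH
reduction = percent_reduction(manual_hours, machine_hours=1)
print(f"{manual_hours} hours -> 1 hour: {reduction:.2f}% decrease")
```

Under that assumption, 16 months is 2,560 working hours, and the reduction rounds to the 99.96% quoted above. Counting calendar hours instead would give an even larger percentage.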


An area of challenge for the GeoDeepDive team has been working with publishing companies to gain access to publications and the rights to use them. This alone is a serious challenge because many journals of critical importance to science are copyrighted and difficult to access.

The team has been working steadily to increase its publishing partners, and as of now, GeoDeepDive includes a growing body of over six million articles. It ingests around 6,000 to 10,000 articles every 24 hours. GeoDeepDive eclipses collections like PubMed Central as the largest collection of full-text published scientific content in the world. With a library this large to search through, and the machine ability to do so quickly, a researcher can let GeoDeepDive find and organize the information, leaving far more time for the task most researchers are eager to get to: derivation of meaning.


This ease of search and information retrieval also adds to the reliability of scientific data. Overlapping datasets are compared and verified much more easily, allowing for enhanced peer-review and replicability. GeoDeepDive has been used to assess the reliability of existing data conglomeration tools like the Paleobiology Database.

In addition to that, it simplifies the lives of scientists by bringing the information they care about to them, even if it isn’t a ‘top search’. It lowers the metaphorical blood pressure of the increasingly clogged system of data, and maybe the literal blood pressure of a few scientists who can find the information they need and still have time for the rest of their interests. Most excitingly, by performing analysis that couldn’t happen otherwise, GeoDeepDive can expose connections and ideas in data that weren’t apparent, or even conceivable, before.



Funded in 2013 by the National Science Foundation as an EarthCube Building Block Project, GeoDeepDive is currently intended for the geosciences, but it is certainly not exclusive to them, and the library spans most scientific disciplines. The infrastructure can be applied to any dataset and, in fact, may very well be the beginning of the world’s next form of digital library, changing the way we store and use information of any type.

Some of GeoDeepDive’s contributors, like Andrew Zaffos, Senior Research Scientist at the Arizona Geological Survey and the University of Arizona, are already visualizing how it can apply to many other fields. Essentially any discipline that depends on aggregating a large volume of data from the literature can use this tool. Zaffos even speculates about how GeoDeepDive could handle a project by the Arizona Geological Survey to catalog and organize historical mining documents. These are not modern, digital documents, but historical, handwritten, on-paper ‘datasets’ that range from letters and photographs to reports and licenses.


Could GeoDeepDive help us uncover some of our own human history? Or possibly help us educate our children? Shanan Peters, a geology and paleobiology scientist at the University of Wisconsin and the Principal Investigator on the GeoDeepDive Project, states that GeoDeepDive infrastructure is being used in biomedical research and even to organize and analyze children’s books. Maybe in the future it will allow lawyers or doctors to find small yet related cases, or identify otherwise elusive patterns.

The imagination can go quite far and still only reach the fringes of what kinds of ideas could be flowing through the veins of scientific information. Might there be ideas for combating climate change, concepts that help resist carcinogens, methods for food resilience, or secrets to survival away from Earth? Before GeoDeepDive we could only hope that these breakthroughs were ambiguously ‘there’ and that we might somehow come across them, but by providing more manageable ways to view and qualify data, GeoDeepDive gives us the ability to actively study the parts of this complex system and seek those connections.


Zaffos emphasized the idea that scientists who want to use this tool should not be afraid to do so. Simon Goring, an Assistant Scientist at the University of Wisconsin, began using GeoDeepDive at the EarthCube-sponsored C4P Community Development Workshop in Boulder (2016). “I was interested in helping to augment the data holdings in the Neotoma Paleoecological Database. We wanted to find papers with fossil pollen records that had been published, but we wanted to avoid papers about the genetics, chemical analysis of pollen itself, or other allergy-type publications.”


Goring continued, “We struggled at first finding a balance between search terms that would cover our requirements, without being too broad. In the end we settled on the term ‘pollen diagram’ as a good target.” Goring and his colleagues submitted the search term to the GeoDeepDive team and received a small subset of records back, approximately 100 papers, out of the entire corpus. “The first set of records weren’t all the papers that GeoDeepDive found, but having just a few hundred papers made it easy to get started, without being overwhelmed.”


GeoDeepDive is related to, but different from, the DeepDive project at Stanford University. While it is possible to use the tool to undertake “deep learning” applications, any user, whether geoscientist, bioscientist, or social scientist, can use elements of GeoDeepDive without needing a cutting-edge knowledge of computer science. “From the small set of papers, it was pretty easy to use basic string matching to find some of the things we wanted, and to figure out why some unrelated papers were included in the set of papers returned. For example, we wound up dropping any paper that used ‘pollen diagram’, the key term, only in the bibliography.”
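The kind of basic string matching Goring describes can be sketched in a few lines. This is not his RMarkdown workflow or any part of the GeoDeepDive software. It is a minimal Python illustration that assumes each paper is available as plain text and that the reference list starts at a line reading “References”, “Bibliography”, or “Literature Cited”. The function names and the sample papers are hypothetical.

```python
import re

# Assumption: a reference list begins at a line holding only one of these
# headings. Real papers vary, so this pattern is illustrative.
BIBLIOGRAPHY_HEADING = re.compile(
    r"^[ \t]*(references|bibliography|literature cited)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def body_text(full_text: str) -> str:
    """Return the text before the last bibliography heading, or all of it."""
    matches = list(BIBLIOGRAPHY_HEADING.finditer(full_text))
    if not matches:
        return full_text
    return full_text[: matches[-1].start()]


def mentions_in_body(full_text: str, term: str) -> bool:
    """True if the term appears outside the bibliography (case-insensitive)."""
    # Collapse whitespace so a term split across a line break still matches.
    normalized = " ".join(body_text(full_text).lower().split())
    return term.lower() in normalized


def keep_relevant(papers: dict[str, str], term: str = "pollen diagram") -> list[str]:
    """Return ids of papers that use the term somewhere other than the bibliography."""
    return [pid for pid, text in papers.items() if mentions_in_body(text, term)]


# Hypothetical sample texts, for illustration only.
papers = {
    "a": "We present a pollen diagram from Lake X.\nReferences\nSmith 1990.",
    "b": "An allergy study.\nReferences\nJones 1985. A pollen diagram from Y.",
    "c": "A pollen\ndiagram with no reference list.",
}
print(keep_relevant(papers))
```

Paper "b" is dropped because the term occurs only after the “References” heading, which mirrors the filtering step described in the quote.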

“GeoDeepDive was intimidating at first, but once I realized how much could be done just by searching for text in the documents, I was amazed!”


A user needs only a knowledge of data analytics and basic programming to create a program that guides GeoDeepDive and customizes the results drawn from its immense library. The more scientists contribute their terms and context, the more useful GeoDeepDive is. “We wanted to contribute our workflow so it could be used to help build a cookbook of sorts,” says Goring, who documented his workflow using an RMarkdown notebook. “Neotoma is focused on connecting users to data, and to making sure that the paleoecological community has the most up-to-date inventory of sites available.”

From the GeoDeepDive corpus to an interactive Leaflet map: the map shows locations of pollen records that are currently in Neotoma (red dots) and those that have yet to be acquired by the user-contributed database. By identifying records it is possible to help mobilize data that may otherwise be lost to science, and to connect researchers to one another.

The more users contribute their analysis of the GeoDeepDive corpus, the easier it will be to produce accurate and understandable analysis using GeoDeepDive. As more publishers work with GeoDeepDive, their papers will be ingested using Optical Character Recognition, processed, and added to the GeoDeepDive library. “When our application, the RMarkdown document, is ready to be run against the full GeoDeepDive library we will get automatic updates as new documents are added and match our queries. That’s really exciting for us, because it takes a very time consuming job, it codifies it using scripts that can be managed and modified, and it provides us with high quality data at a volume we’ve never been able to really harness before. Paleoecology has been a ‘long tail’ discipline for so long, GeoDeepDive really brings us into the world of Big Data.”


The scientific enterprise is constantly evolving; data cannot be separated from interpretation, and interpretation changes as more data is added to our understanding of the Earth System. In this way, GeoDeepDive becomes a critical link between the user, publications, the raw data, and authors’ interpretation of the data. As the living, complex body of scientific data grows, we have to understand its parts in order to keep the scientific process healthy. GeoDeepDive helps do that by organizing data in a way that makes it manageable, and transferable between nearly any dataset. All it needs is the input of the scientists who want to use it, both saving them time and enhancing their analyses. The future of GeoDeepDive is bright and entirely unpredictable thanks to the expanse of its potential, and we will get to be witnesses to, if not participants in, its growth.

Visit the GeoDeepDive website to learn more. Contact the GeoDeepDive Team.

Article Contributors:

  • Emily Villaseñor, EarthCube Science Support Office

  • Shanan Peters, Principal Investigator GeoDeepDive, University of Wisconsin

  • Jon Husson, University of Victoria

  • Andrew Zaffos, Arizona Geological Survey, University of Arizona

  • Simon Goring, University of Wisconsin

  • Julie Petro, EarthCube Science Support Office
