Finalized: Thursday, October 29, 2015
Author(s): Lopez, L. A., R. Duerr and S. J. S. Khalsa
Focused crawls are key to acquiring data at large scale in order to implement systems like domain search engines and knowledge databases. Focused crawling adds non-trivial problems to the already difficult problem of web-scale crawling. To address some of these issues, BCube, a building block of the National Science Foundation's EarthCube program, has developed a tailored version of Apache Nutch for data and web service discovery at scale. We describe how we started with a vanilla version of Apache Nutch and how we optimized and scaled it, reaching gigabytes of discovered links and almost half a billion documents of interest crawled so far.
L. A. Lopez, R. Duerr and S. J. S. Khalsa, 2015. Optimizing Apache Nutch for domain-specific crawling at large scale. 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, 2015, pp. 1967-1971. doi: 10.1109/BigData.2015.7363976

This material is based upon work supported by the National Science Foundation under Grant No. 1343802. Opinions, findings, conclusions or recommendations expressed are those of the authors and do not reflect the views of the NSF.