We investigate the impact of data placement on two Big Data technologies, Spark and SciDB, using an Earth Science use case in which the data arrays are multidimensional. The investigation also provides an opportunity to evaluate the performance of the technologies involved. For the comparison, Spark is used with two datastores, HDFS and Cassandra. We find that Spark performs better with Cassandra than with HDFS, but SciDB performs better still than Spark with either datastore. The investigation also underscores the value, on a shared-nothing architecture, of aligning data in advance for the most common analysis scenarios. Otherwise, the data must be repartitioned on the fly, which degrades overall performance.
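The alignment point can be illustrated with a toy model. The sketch below is not taken from the paper and does not reproduce how Spark, HDFS, Cassandra, or SciDB place data. It assumes a one-dimensional array split into fixed-size chunks that are assigned to nodes round-robin, and the names `chunk_owner` and `cells_to_move` are hypothetical. It counts how many cells of a second array sit on a different node than the matching cells of the first, which is a rough proxy for the data that would have to be repartitioned before a cell-by-cell operation.

```python
def chunk_owner(index: int, chunk_size: int, num_nodes: int) -> int:
    """Node holding the chunk that contains `index`.

    Assumption: chunks are assigned to nodes round-robin.
    """
    return (index // chunk_size) % num_nodes


def cells_to_move(length: int, chunk_a: int, chunk_b: int, num_nodes: int) -> int:
    """Count cells of array B stored on a different node than the same cell of A."""
    return sum(
        1
        for i in range(length)
        if chunk_owner(i, chunk_a, num_nodes) != chunk_owner(i, chunk_b, num_nodes)
    )


# Same chunk size for both arrays: every cell pair is already co-located.
aligned = cells_to_move(length=1200, chunk_a=100, chunk_b=100, num_nodes=4)

# Different chunk sizes: some cell pairs sit on different nodes and would
# need to cross the network before they can be combined.
misaligned = cells_to_move(length=1200, chunk_a=100, chunk_b=150, num_nodes=4)

print("cells to move when aligned:", aligned)
print("cells to move when misaligned:", misaligned)
```

In this toy model the aligned layout requires no movement, while the misaligned layout forces part of the second array across nodes at query time. Real systems differ in chunk shape, placement policy, and replication, so the counts are only illustrative.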
Khoa Doan, Amidu Oloso, Kwo-Sen Kuo, Thomas Clune, Hongfeng Yu, Brian Nelson, Jian Zhang. Evaluating the impact of data placement to Spark and SciDB with an Earth Science use case. 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, 2016, pp. 341-346. doi: 10.1109/BigData.2016.7840621

This material is based upon work supported by the National Science Foundation under Grant No. 1541043. Opinions, findings, conclusions or recommendations expressed are those of the authors and do not reflect the views of the NSF.