Talks and Poster Presentations (with Proceedings-Entry):
T. Csar, R. Pichler, E. Sallinger, V. Savenkov:
"Using Statistics for Computing Joins with MapReduce";
Talk: 9th Alberto Mendelzon International Workshop on Foundations of Data Management (AMW 2015), 05-06-2015 - 05-08-2015; in: "Proceedings of the 9th Alberto Mendelzon International Workshop on Foundations of Data Management, Lima, Peru, May 6 - 8, 2015",
A. Calì, M. Vidal (ed.);
CEUR Workshop Proceedings,
Paper ID 13,
The MapReduce model has been designed to cope with ever-growing amounts of data. It has been successfully applied to various computational problems. In recent years, multiple MapReduce algorithms have also been developed for computing joins - one of the fundamental problems in managing and querying data. The main optimization goals of these algorithms for distributing the computation tasks to the available reducers are the replication rate and the maximum load of the reducers. The HyperCube algorithm of Afrati and Ullman minimizes the former by considering only the size of the involved tables. This algorithm was later enhanced by Beame et al. to minimize the latter by also taking into account so-called "heavy hitters" (i.e., attribute values that occur particularly often). However, in contrast to most state-of-the-art database management systems, more elaborate statistics on the distribution of data values have not been used for optimization purposes so far. Recently, several approaches for handling skew in the computation of joins have been proposed, improving the partitioning of the data using histograms or varying a cost model [6, 7], but there is still ample room for enhancements and optimization.
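To make the HyperCube idea concrete, the following is a toy sketch (my own illustrative setup, not the paper's implementation) of the HyperCube mapping for the triangle query R(a,b) ⋈ S(b,c) ⋈ T(c,a), using equal shares p per attribute, i.e. p**3 reducers in total:

```python
from itertools import product

def hypercube_map(R, S, T, p):
    """Map phase: send each tuple to every reducer whose coordinates
    agree with the tuple on the attributes it contains; the coordinate
    of the missing attribute is replicated over all p values."""
    h = lambda v: hash(v) % p
    reducers = {coord: ([], [], []) for coord in product(range(p), repeat=3)}
    for a, b in R:                      # R fixes (a, b), replicates over c
        for zc in range(p):
            reducers[(h(a), h(b), zc)][0].append((a, b))
    for b, c in S:                      # S fixes (b, c), replicates over a
        for za in range(p):
            reducers[(za, h(b), h(c))][1].append((b, c))
    for c, a in T:                      # T fixes (c, a), replicates over b
        for zb in range(p):
            reducers[(h(a), zb, h(c))][2].append((c, a))
    return reducers

def triangles(reducers):
    """Reduce phase: each reducer joins its local fragments."""
    out = set()
    for rs, ss, ts in reducers.values():
        for a, b in rs:
            for b2, c in ss:
                if b2 != b:
                    continue
                for c2, a2 in ts:
                    if c2 == c and a2 == a:
                        out.add((a, b, c))
    return out
```

Each R-tuple is replicated to p reducers (one per value of the missing attribute c), which illustrates the replication rate that the share assignment is chosen to minimize; note that this basic mapping uses only relation sizes, not the distribution of values.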
A survey of recent approaches for dealing with the weaknesses and limitations of the MapReduce model can be found in the literature. The goal of this paper is to study the potential benefit of using more fine-grained statistics on the distribution of data values in MapReduce algorithms for join computation. To this end, we investigate the performance of known algorithms [1, 3] in the presence of skewed data and extend them by utilizing data statistics. We compare the original algorithms with a modified one that makes use of additional statistical measures. Our initial study shows that our approach can indeed improve existing methods.
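The kind of statistics-aware skew handling discussed above can be sketched as follows; this is a minimal illustrative reduce-side equi-join R(a,b) ⋈ S(b,c), not the paper's algorithm, in which join values known to be heavy hitters are spread over several sub-reducers (one of several standard skew-handling schemes):

```python
from collections import defaultdict
from itertools import product

def partition(records, side, heavy, num_splits):
    """Map phase: emit (reducer_key, tagged record) pairs.

    For a heavy-hitter join value, R-tuples are replicated to all
    num_splits sub-reducers, while each S-tuple is hashed to exactly
    one of them; ordinary values go to a single reducer."""
    out = defaultdict(list)
    for rec in records:
        b = rec[1] if side == 'R' else rec[0]
        if b in heavy:
            if side == 'R':
                for i in range(num_splits):      # replicate R tuples
                    out[(b, i)].append(('R', rec))
            else:                                # hash S tuples to one split
                out[(b, hash(rec) % num_splits)].append(('S', rec))
        else:
            out[(b, 0)].append((side, rec))
    return out

def join(R, S, heavy=frozenset(), num_splits=1):
    """Shuffle the map outputs together, then join within each reducer."""
    buckets = defaultdict(list)
    for part in (partition(R, 'R', heavy, num_splits),
                 partition(S, 'S', heavy, num_splits)):
        for key, vals in part.items():
            buckets[key].extend(vals)
    result = []
    for tuples in buckets.values():              # reduce phase
        r_side = [t for tag, t in tuples if tag == 'R']
        s_side = [t for tag, t in tuples if tag == 'S']
        for (a, b), (_, c) in product(r_side, s_side):
            result.append((a, b, c))
    return sorted(result)
```

The set `heavy` stands in for whatever statistics are available; with finer-grained distribution statistics, `num_splits` could be chosen per value rather than globally, trading replication against the maximum reducer load.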
Project Head Reinhard Pichler:
Heterogene Information Integration
Project Head Stefan Woltran:
Created from the Publication Database of the Vienna University of Technology.