Optimizing the Performance of
Data Analytics Frameworks

Thesis

My thesis research was featured in the COMMIT/ Newsletter in 2017. The digital version of my thesis can be accessed through this link.

Abstract

Data analytics frameworks enable users to process large datasets while hiding the complexity of scaling out their computations on large clusters of thousands of machines. Such frameworks parallelize the computations, distribute the data, and tolerate server failures by deploying their own runtime systems and distributed filesystems on subsets of the datacenter resources. Most of the computations required by data analytics applications are conceptually straight-forward and can be performed through massive parallelization of jobs into many fine-grained tasks. Providing efficient and fault-tolerant execution of these tasks in datacenters is ever more challenging and a variety of opportunities for performance optimization still exist. In this thesis we optimize the job performance of data analytics frameworks by addressing several fundamental challenges that arise in datacenters. The first challenge is multi-tenancy: having a large number of users may require isolating their workloads across multiple frameworks. Nevertheless, achieving performance isolation is difficult, because different frameworks may deliver very unbalanced service levels to their users. Second, users have become very demanding from these frameworks, thus expecting timely results for jobs that require only limited resources. However, even with a few long jobs that consume large fractions of the datacenter resources, short jobs may be delayed significantly. Third, improving the job performance in the face of failures is harder still, as we need to allocate extra resources to recompute work which was already done. In order to address these challenges we design, implement, and test several scheduling policies for the evolving usage trends that are derived from the analysis of basic theoretical models. We take an experimental approach and we evaluate the performance of our policies with real-world experiments in a datacenter, using representative workloads and standard benchmarks. Furthermore, we bridge the gap between those experiments and prior theoretical work by performing large-scale simulations of scheduling policies.

Thesis Committee

Prof. dr. ir. D. H. J. Epema Delft University of Technology (promotor)
Prof. dr. ir. Piet Van Mieghem Delft University of Technology
Prof. dr. ir. Henri Bal Vrije University of Amsterdam
Prof. dr. ir. Alexandru Iosup Vrije University of Amsterdam
Dr. Giuliano Casale Imperial College London
Prof. dr. ir. Erik Elmroth Umea University
Dr. Asser Tantawi IBM Research, USA