Bogdan Ghit
I'm a computer scientist and tech lead at Databricks, working broadly on big data, distributed systems, and cloud computing. As a tech lead, I work on cloud computing infrastructure that speeds up SQL workloads in the Databricks SQL warehouse.
Before joining Databricks, I received a PhD in Computer Science from Delft University of Technology, where my research focused on scheduling and resource management for big data processing frameworks.
Impact
At Databricks, I designed Cloud Fetch, a new architecture for high-throughput connectivity with Business Intelligence tools that enables parallel extracts of query results from data warehouses. I also incorporated the Dynamic Partition Pruning optimization into Apache Spark, which detects and avoids scanning data that is irrelevant to the query.
One of the highlights of my research was Fawkes, a resource manager that allocates and balances resources for big data frameworks. My work also advocates multi-queue scheduling with a bias toward short jobs to optimize job slowdown in data centers.
At IBM Research, I co-invented Capri, a new cloud spot market abstraction based on bribe scheduling that provides good fairness guarantees based on bids.
Publications
2022
2021
2020
-
EdgeFrame: Worst-Case Optimal Joins for Graph-Pattern Matching in Spark.
P. Fuchs, P. Boncz, and B. Ghit.
GRADES-NDA, Portland, 2020.
-
SparkFuzz: Searching Correctness Regressions in Modern Query Engines.
B. Ghit, N. Poggi, J. Rosen, R. Xin, and P. Boncz.
DBTest, Portland, 2020.
2019
2018
2017
-
Optimizing the Performance of Data Analytics Frameworks.
B. Ghit. PhD Dissertation. Delft, May 2017.
-
Better Safe than Sorry: Grappling with Failures of In-Memory Data Analytics Frameworks.
B. Ghit and D. Epema.
ACM HPDC. Washington, D.C., June 2017.
-
An Experimental Performance Evaluation of Autoscaling Policies for Complex Workflows.
A. Ilyushkin, A. Ali-Eldin, N. Herbst, A. Papadopoulos, B. Ghit, D. Epema, and A. Iosup.
IEEE/SPEC ICPE, L'Aquila, 2017.
Best Paper Award Runner-Up.
2016
2015
2014
-
KOALA-C: A Task Allocator for Integrated Multicluster and Multicloud Environments.
L. Fei, B. Ghit, A. Iosup, and D. Epema.
IEEE Cluster. Madrid, 2014.
-
V for Vicissitude: The Challenge of Scaling Complex Big Data Workflows.
B. Ghit, M. Capota, T. Hegeman, J. Hidders, D. Epema, and A. Iosup.
IEEE/ACM CCGrid Scale Challenge, Chicago, 2014.
Scale Challenge Award Winner.
-
Balanced Resource Allocations Across Multiple Dynamic MapReduce Clusters.
B. Ghit, N. Yigitbasi, A. Iosup, and D. Epema.
ACM SIGMETRICS, Austin, 2014.
2013
-
The BTWorld use case for big data analytics: Description, MapReduce logical workflow, and empirical evaluation.
T. Hegeman, B. Ghit, M. Capota, J. Hidders, D. Epema, and A. Iosup.
IEEE Big Data. Santa Clara, October 2013.
-
Dynamic Resource Provisioning for Concurrent MapReduce Frameworks.
B. Ghit and D. Epema.
IFIP Performance (Poster), Vienna, 2013.
-
Towards an Optimized Big Data Processing System.
B. Ghit, A. Iosup, and D. Epema.
IEEE/ACM CCGrid (Doctoral Symposium), Delft, 2013.
2012
Talks
- Capri: Achieving Predictable Performance in Cloud Spot Markets. Mascots. November 2021.
- An Approximate Bribe Queueing Model for Bid Advising in Cloud Spot Markets. QEST. August 2021.
- Making Apache Spark SQL Fast with Dynamic Partition Pruning. Big Data for Data Science course at University of Amsterdam. Amsterdam, February 2020.
- Scaling Data Analytics Workloads on Databricks. Spark+AI Summit. Amsterdam, October 2019.
- Dynamic Partition Pruning in Apache Spark. Spark+AI Summit. Amsterdam, October 2019.
- Fast and Reliable Apache Spark SQL Releases. DataWorks Summit. Barcelona, March 2019.
- How We Test Correctness and Performance. Guest lecture at Vrije Universiteit (Distributed Systems). Amsterdam, December 2018.
- Correctness and Performance of Apache Spark SQL. Spark+AI Summit. London, October 2018.
- Better Safe than Sorry: Checkpointing In-Memory Data Analytics Applications. ACM HPDC. Washington, D.C., June 2017.
- Checkpointing In-Memory Data Analytics Applications with Panda. Cloud Control Workshop. Enköping, June 2017.
- Tyrex: Size-based Resource Allocation in MapReduce Frameworks. ACM/IEEE CCGrid. Cartagena, May 2016.
- Reducing Job Slowdown Variability for Data-Intensive Workloads. IEEE Mascots. Atlanta, October 2015.
- Achieving Fairness and High-Performance in Datacenter Scheduling. Guest lecture at TU Eindhoven. Eindhoven, June 2015.
- Dynamic Scheduling of Hadoop Clusters in Datacenters. Invited talk at Ortec. Zoetermeer, April 2015.
- Balanced Resource Allocations Across Multiple Dynamic MapReduce Clusters. ACM SIGMETRICS. Austin, June 2014.
- Performance Evaluation of Dynamic MapReduce Clusters. ASCI Graduate School. Eindhoven, November 2013.
- Towards an Optimized Big Data Processing System. ACM/IEEE CCGrid. Delft, May 2013.
- Dynamic MapReduce Clusters on Demand. Supercomputing MTAGS workshop. Salt Lake City, November 2012.
Research Supervision
Awards
Service
- Program Co-Chair: ICAC/SASO Demo Session 2019
- Program Committee Member: PDSW 2022, PDSW 2021, JSSPP 2021, HotCloudPerf 2021, DEBS 2020, HotCloudPerf 2020, CCGRID 2019
- Invited Reviewer: VLDB, TPDS, PEVA, TCC