ICDE 2024 Invited Talks

Tuesday May 14th, 2024 @ 10:15

Cloud Database Systems Invited Talk [In Theater 2, Chair: Renata Borovica-Gajic]

Vector Search and Databases
by Yannis Papakonstantinou (Google).

Semantic search, via embeddings (vectors) and vector indexing, has been added to Google Cloud Platform (GCP) databases to enable GenAI applications. Including vectors in databases lets developers build GenAI applications in their familiar and trusted SQL environment, with assurance that the vectors are up to date and transactionally consistent. It also raises two R&D questions. First, can databases with vector abilities perform as well as purpose-built vector databases in pure vector search? Second, what opportunities and R&D challenges emerge at the intersection of structured data and vectors? In response to the first question, we present GCP AlloyDB vector indexing (see whitepaper). In response to the second, we discuss how GCP AlloyDB enables unified SQL access to structured data and vectors, for example in queries that combine joins, filters and vector similarity. We show that databases with vector abilities have intrinsic ease-of-use and performance advantages over standalone purpose-built vector databases in processing such queries, and that novel query optimization and plan execution work turns these fundamental advantages into material ones.
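As a rough illustration of the kind of query the abstract mentions (a structured filter combined with vector similarity), the following minimal Python sketch applies the filter first and then ranks the surviving rows by cosine similarity. It is not AlloyDB's implementation: the `products` rows, the `filtered_top_k` helper and the brute-force scan are all assumptions made for illustration, and a real system would use a vector index and a query optimizer rather than scanning every row.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filtered_top_k(rows, query_vec, predicate, k):
    """Hypothetical helper: keep rows passing the structured predicate,
    then return the k rows whose embedding is most similar to query_vec."""
    candidates = [row for row in rows if predicate(row)]
    candidates.sort(
        key=lambda row: cosine_similarity(row["embedding"], query_vec),
        reverse=True,
    )
    return candidates[:k]

# Made-up rows: structured columns plus a toy 2-dimensional embedding.
products = [
    {"id": 1, "category": "shoes", "price": 80, "embedding": [1.0, 0.0]},
    {"id": 2, "category": "shoes", "price": 200, "embedding": [0.9, 0.1]},
    {"id": 3, "category": "hats", "price": 30, "embedding": [1.0, 0.0]},
    {"id": 4, "category": "shoes", "price": 60, "embedding": [0.0, 1.0]},
]

# "Shoes under 100, most similar to the query vector."
result = filtered_top_k(
    products,
    query_vec=[1.0, 0.0],
    predicate=lambda row: row["category"] == "shoes" and row["price"] < 100,
    k=2,
)
result_ids = [row["id"] for row in result]
```

Filtering before ranking is only one possible plan; whether to filter first or search the vector index first is the sort of choice the abstract attributes to query optimization.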

Yannis Papakonstantinou is a Distinguished Engineer at Google Cloud, working on Query Processing and GenAI. He is also an Adjunct Professor of Computer Science and Engineering at the University of California, San Diego, following many years as a regular UCSD faculty member. Previously he was an architect in query processing & ETL at Databricks. Earlier, he was a Senior Principal Scientist at Amazon Web Services from 2018 to 2021, having been a consultant for AWS since 2016. He was the CEO and Chief Scientist of Enosys Software, which built and commercialized an early Enterprise Information Integration platform for structured and semistructured data. The Enosys software was OEM'd and sold under the BEA Liquid Data and BEA Aqualogic brand names, and Enosys was eventually acquired by BEA Systems in 2003. His R&D work has been mostly on query processing, with a focus on querying semistructured data. He has published more than 120 research articles that have received over 20,000 citations. Yannis holds a Diploma of Electrical Engineering from the National Technical University of Athens and an MS and a Ph.D. (1997) in Computer Science from Stanford University.

Tuesday May 14th, 2024 @ 16:21

Machine Learning and Data Science Invited Talk [In Theater 2, Chair: Essam Mansour]

From Truck to Racecar: Revving up Transactional Throughput 1000x in an Analytical Engine
by Jonathan Dees (Snowflake).

Originally designed for efficient analytical data processing, Snowflake has recently introduced hybrid tables, seamlessly integrating transactional and analytical workloads. We explore the nuanced approach required to optimize performance for short-running transactional queries in contrast to analytical queries, resulting in a 1000x speedup. We'll share the challenges we encountered, the strategic steps we implemented, and the insights gained along the way.

Jonathan Dees is a Principal Engineer at Snowflake, based in Berlin, where he specializes in SQL query processing. His focus is optimizing the performance of Snowflake's query execution platform for analytical workloads and, more recently, hybrid transactional and analytical workloads. Previously, at SAP, Jonathan worked on query processing for SAP HANA, including building and tuning new query operators and applying just-in-time code compilation and parallel processing. His general interests include database systems, performance, benchmarks and algorithms.

Wednesday May 15th, 2024 @ 10:00

Infrastructure for Machine Learning Invited Talk [In Theater 2, Chair: Peter Boncz]

Deletion Vectors: No-Regrets Row-Level Updates in Delta Lake
by Bart Samwel (Databricks).

Fine-grained updates to traditional Parquet data lakes are inefficient because they require rewriting entire Parquet files, even to update a single row. In recent years, open Lakehouse table formats such as Delta Lake, Apache Iceberg and Apache Hudi have each introduced support for row-level updates. In this talk, we will discuss why row-level updates are important for common workloads. We will then dive into Delta Lake's Deletion Vectors and how they enable row-level updates with virtually no overhead. Finally, we will compare them with the techniques that other Lakehouse table formats use to support row-level updates, namely Iceberg's Position Deletes and Hudi's Merge-on-Read tables, and show how Deletion Vectors improve on these techniques.
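The general idea behind a deletion vector can be sketched in a few lines of Python: data files stay immutable, a per-file set of row positions marks rows as deleted, and readers skip the marked positions. This is a toy model under stated assumptions, not Delta Lake's actual format or API: the `Table` class and its methods are hypothetical, tuples of dicts stand in for Parquet files, and a plain Python set stands in for whatever compact bitmap a real table format would use.

```python
class Table:
    """Toy table: immutable data files plus one deletion vector per file."""

    def __init__(self, files):
        # Tuples model files that are never rewritten in place.
        self.files = [tuple(rows) for rows in files]
        # One set of deleted row positions per data file.
        self.deletion_vectors = [set() for _ in self.files]

    def scan(self):
        """Yield every row whose position is not marked as deleted."""
        for file_idx, rows in enumerate(self.files):
            deleted = self.deletion_vectors[file_idx]
            for pos, row in enumerate(rows):
                if pos not in deleted:
                    yield row

    def delete_where(self, predicate):
        """Mark matching live rows as deleted without touching the files."""
        for file_idx, rows in enumerate(self.files):
            deleted = self.deletion_vectors[file_idx]
            for pos, row in enumerate(rows):
                if pos not in deleted and predicate(row):
                    deleted.add(pos)

    def update_where(self, predicate, change):
        """Row-level update: mark old versions deleted, append new versions
        as a small new file instead of rewriting the existing files."""
        new_rows = [change(row) for row in self.scan() if predicate(row)]
        self.delete_where(predicate)
        if new_rows:
            self.files.append(tuple(new_rows))
            self.deletion_vectors.append(set())


table = Table([
    [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}],
    [{"id": 3, "qty": 9}],
])
first_file_before = table.files[0]

# Update a single row; the original two-row file is left untouched.
table.update_where(lambda row: row["id"] == 2, lambda row: {**row, "qty": 8})
visible = sorted(table.scan(), key=lambda row: row["id"])
```

In this sketch the update touches only the deletion vector of the first file and writes one new single-row file, whereas a copy-on-write approach would have rewritten the whole first file.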

Bart Samwel is a Principal Engineer at Databricks, based in Amsterdam. He leads an engineering team that focuses on the performance of DML operations on Delta Lake tables, such as MERGE, UPDATE and DELETE. With his team, he created Delta Lake's Deletion Vectors. He takes at least partial blame for many other recent innovations in Delta Lake. Outside work, he likes to tinker with coffee and to exercise his vocal cords.

Wednesday May 15th, 2024 @ 17:00

Query Performance Invited Talk [In Theater 2, Chair: Peter Boncz]

How I Learned to Stop Worrying About Benchmarks
by Hannes Muhleisen (DuckDB Labs).

Developers of data management systems are obsessed with benchmarks. As a result, their systems are, too. Moving only slightly away from the over-trodden paths of TPC-H can yield surprising results. Yet users in the real world are unlikely to only ever run benchmarks. Instead, their workloads are diverse, messy, and hard to capture in a consensus benchmark specification. In my talk, I will discuss the DuckDB team's approach to benchmarks and performance, which prioritizes robustness over peak performance on specific queries.

Hannes Muhleisen is a creator of the DuckDB database management system and Co-founder and CEO of DuckDB Labs, a consulting company providing services around DuckDB. Hannes is also Professor of Data Engineering at Radboud Universiteit Nijmegen. His main interest is analytical data management systems.