Databricks unifies OLTP and OLAP, depending on what counts as a copy

Jul 03, 2026 - 13:11

When Databricks claimed to have cracked an age-old database problem, it came with a clear marketing message: "One data, zero compromises, zero copies." Inevitably, that led engineers to search for clarity. After all, the company claimed to have unified OLTP and OLAP with "no data duplication."

Databricks, which was founded around the open source unified analytics engine Apache Spark, called its invention LTAP, which stands for lake transactional/analytical processing. It works with Reyden – a new compute engine – and Lakebase, its serverless PostgreSQL on open object storage.

Databricks is attempting to address a fundamental database challenge. OLTP (online transactional processing) performs small, row-oriented reads and frequent writes, while OLAP (online analytical processing) performs large, column-oriented reads and batch writes. Down to the physical level, it is challenging to get the two to coexist in a single system. The issue is seen as more pressing now as the database market chases workloads created by the booming deployment of AI agents, both in software development and business applications.
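To make the row-versus-column tension concrete, here is a minimal Python sketch. It is purely illustrative: the sample orders and the helper names (`to_columns`, `get_order`, `total_amount`) are made up for this example and do not reflect how any real engine lays out data.

```python
# Illustrative only: the same three orders held in two physical layouts.
# All names and values here are hypothetical.

# Row-oriented: each record is stored together, which suits OLTP-style
# point lookups and single-row writes.
rows = [
    {"id": 1, "customer": "ana", "amount": 40},
    {"id": 2, "customer": "bo", "amount": 15},
    {"id": 3, "customer": "ana", "amount": 25},
]


def to_columns(rows):
    """Pivot a row-oriented table into a column-oriented one."""
    return {key: [row[key] for row in rows] for key in rows[0]}


# Column-oriented: each attribute is stored together, which suits
# OLAP-style scans that touch one or two columns across many rows.
columns = to_columns(rows)


def get_order(rows, order_id):
    """OLTP-style access: fetch one whole record by key."""
    for row in rows:
        if row["id"] == order_id:
            return row
    return None


def total_amount(columns):
    """OLAP-style access: aggregate one column, ignoring the others."""
    return sum(columns["amount"])
```

The point lookup wants every field of one record side by side, while the aggregate wants one field of every record side by side. A single physical layout favors one access pattern at the expense of the other.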

What did Databricks claim? The publicity material said that rather than forcing both OLTP and OLAP workloads into one engine or concealing the pipeline, LTAP unifies data at the storage layer. Transactions, analytics, streaming, and operational data all sit on a single copy of storage in the data lakehouse, a concept Databricks created to describe the marriage of data lakes and data warehouses.

Does that mean there are "zero copies" of the data, as claimed in several promotional LinkedIn pieces and a Forbes CEO interview? Well, not quite.

The transactional side of LTAP is based on Databricks' first fully managed PostgreSQL database, Lakebase, which in turn is based on technology from Neon, which Databricks bought last year to provide copy-on-write branching and autoscaling serverless compute.

In his search for clarity, one data engineer in financial services posted that, under LTAP, current PostgreSQL data stays in the pageserver format in local storage and is then propagated to object storage for long-term durability in the Parquet file format, where it can be queried in columnar form. If PostgreSQL/Lakebase needs data from cold storage, it can retrieve it from the object store and convert the Parquet data back to the pageserver format. In this way, Databricks has "unified" the OLTP storage and OLAP storage.

"Two copies of data, not one," quipped one commenter from a Databricks rival.

Slides made available at a PostgreSQL conference in May make the link clear. Under the header "Analytics directly on OLTP data," Databricks engineers Hristo Stoyanov and Jonathan Katz said that pageserver provides storage while the Spark analytics executor pulls layer files containing full page images from the image layers in object storage.

In a private messaging community seen by The Register, one Databricks engineer was asked whether there was one copy of the data or two, in object storage and pageservers respectively. Technically two, they responded, since pageservers act as a cache or materialization layer in the Neon architecture. PostgreSQL reads from pageservers, while the analytics engine reads PostgreSQL pages from object storage (Apache Parquet or Iceberg table format) and pageservers.

Databricks is far from alone in trying to crack this nut. Unifying OLTP and OLAP has been tried before, and solved, according to some companies. For example, in 2014, SingleStore began working on an in-memory row store and an on-disk column store with tiered storage, "meaning transactions hit memory first and then they roll off to disk storage," allowing analytics and transactions on a single system. It launched a cloud database service (on AWS, Azure, or GCP) in 2020, which "automatically manages data across a three-tiered storage architecture comprised of memory, local cache, and storage." It moves data "seamlessly" between memory, persistent cache, and object storage without the user being aware, the company says.

Not surprisingly, SingleStore was quick to post its reaction to Databricks' claim that hybrid transactional/analytical processing (HTAP) had effectively failed.

"You don't get to call HTAP a failure and then spend the next 20 minutes describing why the world needs exactly what HTAP promised. Unifying OLTP and OLAP so an agent can read and write in one place is the HTAP goal, whatever you print on the slide. Renaming it LTAP changes the marketing. It doesn't change the physics, and it doesn't retire the questions," SingleStore CTO Nadeem Asghar said in a blog post.

He pointed out that Databricks' claim of "one copy" of the data is about storage, not about the engine. "Three engines still sit on top, each with its own cache, its own sense of how fresh the data is, and its own way of failing at the worst possible moment. Databricks' own framing: a row layout and a columnar layout are different things. If a write lands in a row representation for Postgres and analytics reads a columnar representation, then you have two physical shapes of the same data, and something has to keep them in step," Asghar said.

There are other examples of efforts to bring together analytics and transactional systems. MongoDB offers column-store indexes to help developers build analytical queries into their applications. Oracle's HeatWave for MySQL runs on Oracle Cloud Infrastructure and helps customers run analytics on transactional applications without having to export data to a specialist analytics system such as Teradata, Snowflake, or AWS Redshift. SAP has talked about real-time analytics since 2011, and bases its concept around its in-memory database, HANA, which supports the latest iteration of SAP's enterprise applications.

Databricks maintains its "zero copy" claim is true because it avoids having two authoritative copies of the data that need to be kept in sync.

In a statement to The Register, a Databricks spokesperson said: "In LTAP, the user only operates on one authoritative copy of the data. [It has] one source of truth data in Iceberg (an open source table format which contains Parquet files). Yes, any database system, even a single individual database, always has many intermediate internal copies of data, ranging from memory L1/L2/L3 cache, to DRAM memory, to non-volatile storage, to blob storage etc. This is referred to as 'the database storage hierarchy.'"

In presentations at its recent conference, Databricks qualifies the claim in several ways. There's only one "authoritative" copy of the data, or there is one copy of the data "in storage" or "in the lake." In effect, it is the same approach SingleStore employs when it says its storage tiers are "transparent to the user."

Regardless of the marketing ding-dong, Databricks has done some impressive engineering in the way its new Lakehouse execution engine, Reyden, can read PostgreSQL pages, according to Andy Pavlo, associate professor of databaseology at Carnegie Mellon University.

"They are copying data out eventually," he told The Register. "But initially Databricks is able to have the Neon/PostgreSQL front end read the writes as it normally would, but then the Reyden engine can read those writes, and that part is not easy."

"The Reyden analytics engine has the ability to now interpret the contents of the PostgreSQL pages, which is a non-trivial thing to do, because the pages are not entirely self-contained, meaning that information about what you're allowed to see, or even what the data is, is stored in separate pages, so they have the mechanism to then go back into Neon/PostgreSQL and get that metadata from the catalog."

"Anybody can go and read a PostgreSQL page. It's not hard to write code to read a single page of data. The challenging part is being able to understand what you're allowed to see or what the query is allowed to read from that page, because they intermix all the different versions, then [Databricks] has got to resolve that as well. All that is not trivial."

"Basically, it allows you to do faster analytics, or more timely analytics, without the delay of waiting for things to get shoved out to S3 and you do it in a transaction-safe manner."
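The visibility problem Pavlo describes can be sketched in a few lines of Python. This is a heavily simplified model, not Reyden's or PostgreSQL's actual logic: PostgreSQL tuple headers do carry `xmin` and `xmax` transaction IDs, but real visibility checks also involve snapshots, in-progress transactions, and hint bits. The `TupleVersion` class, the `committed` set, and the sample page are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TupleVersion:
    key: str
    value: int
    xmin: int            # transaction that created this version
    xmax: Optional[int]  # transaction that deleted or superseded it, if any


def is_visible(version, committed):
    """Simplified rule: the creator must have committed, and any
    deleter must not have committed, as far as this reader knows."""
    if version.xmin not in committed:
        return False
    return version.xmax is None or version.xmax not in committed


def read_page(page, committed):
    """Return the rows a reader may see, given the commit status it
    fetched from somewhere other than the page itself."""
    return {v.key: v.value for v in page if is_visible(v, committed)}


# One page holding intermixed versions: transaction 12 updated acct-1,
# and transaction 13 inserted acct-2.
page = [
    TupleVersion("acct-1", 100, xmin=10, xmax=12),
    TupleVersion("acct-1", 70, xmin=12, xmax=None),
    TupleVersion("acct-2", 5, xmin=13, xmax=None),
]
```

The same page yields different answers depending on which transactions the reader considers committed: with only transaction 10 committed, `read_page` returns the old balance for `acct-1`; once 12 is committed, it returns the new one. The bytes on the page alone cannot settle that, which is why an analytics engine reading raw pages has to consult metadata held elsewhere.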

Meanwhile, the Reyden analytics engine is stateless and can scale horizontally "very well" by adding more compute, Pavlo said.

Databricks might have produced some impressive technology by bringing transactional and analytic workloads closer together. But in the way it presents its work, critics might argue it should be careful what it wishes for. It would be a shame if its overzealous marketing claims cast a shadow over its significant engineering achievements. ®
