
Sunday, October 29, 2023

Streaming Databases

Introduction

Traditional transactional databases don’t scale well for analytical workloads, and even analytical databases may take hours to run complex SQL with joins, aggregations, and transformations.

Streaming databases serve results for such complex SQL with sub-second latency, and provide fast, continuous data-transformation capabilities not possible in traditional databases.

Streaming Databases use SQL and familiar RDBMS abstractions (viz. tables, columns, rows, views, indexes), but have a completely different engine (a stream processor) and computing model (dataflows) inside.

  • Traditional databases store data in tables matching the structure of the inserts & updates, and all the computation work happens on read queries.
  • Streaming databases ask for the queries upfront in the form of materialized views, and incrementally update the results of these queries as input data arrives. So, streaming databases move the computation work to the write side (a sketch of the contrast follows).
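To make the contrast concrete, here is a minimal sketch (the table and column names are hypothetical, and materialized-view DDL varies by product):

-- Traditional database: all the computation happens at read time
SELECT product_id, SUM(amount) AS revenue
FROM purchases
GROUP BY product_id;

-- Streaming database: the query is registered upfront...
CREATE MATERIALIZED VIEW revenue_by_product AS
SELECT product_id, SUM(amount) AS revenue
FROM purchases
GROUP BY product_id;

-- ...and reads become cheap lookups against already-maintained results
SELECT * FROM revenue_by_product;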

Origins

Early streaming databases came about in the capital-markets vertical, where the value of fast computation over continuous data is very high, e.g. StreamBase and Kx Systems. This early generation were more event-processing frameworks than databases, optimised for the unique requirements of hedge funds and trading desks rather than for universality and accessibility.

While SQL-like control languages were implemented by these early systems (e.g. StreamBase streams were created with DDL statements like CREATE INPUT STREAM), users still had to be streaming-systems experts.

The SQL below doesn’t care whether the data is static or actively updating. It has the information a streaming database needs to continually provide updated result sets as soon as the data changes.

-- Sum revenue by product category
SELECT categories.name AS category, SUM(line_items.amount) AS total_revenue
FROM purchases
JOIN line_items ON purchases.id = line_items.purchase_id
JOIN products ON products.id = line_items.product_id
JOIN categories ON products.category_id = categories.id
WHERE purchases.is_complete
GROUP BY 1;

ksqlDB and Flink allow users to define transformations in SQL, but users still need to understand, and work around, challenging streaming concepts like eventual consistency.

The recent focus in streaming databases is on expanding access to streaming computation by simplifying the control interface so that it can be operated by anyone familiar with traditional databases, making streaming databases easier to apply.

Example Architectures

Streaming databases are often used “downstream” of primary transactional databases and message brokers, similar to how a Redis cache or a data warehouse might be used.

  • A message broker reliably and continuously feeds streams of data into the database
  • A Change Data Capture (CDC) service translates primary DB updates into structured data messages for the message broker
  • The SQL transformations are managed in dbt, as is done with data warehouses
  • User-facing applications and tools connect directly to the streaming database, with no need for caching and with more flexibility than data warehouses (see the sketch below)
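As an illustrative sketch of this wiring (assuming a Materialize-style streaming database; the connection and topic names are hypothetical, and the exact DDL varies by product):

-- Ingest the CDC stream of the primary DB's purchases table from the broker
CREATE SOURCE purchases
  FROM KAFKA CONNECTION kafka_conn (TOPIC 'pg.public.purchases')
  FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY CONNECTION csr_conn
  ENVELOPE DEBEZIUM;
-- (similar CREATE SOURCE statements would be needed for line_items, products, categories)

-- Register the transformation once; the database maintains it incrementally
CREATE MATERIALIZED VIEW revenue_by_category AS
SELECT categories.name AS category, SUM(line_items.amount) AS total_revenue
FROM purchases
JOIN line_items ON purchases.id = line_items.purchase_id
JOIN products ON products.id = line_items.product_id
JOIN categories ON products.category_id = categories.id
WHERE purchases.is_complete
GROUP BY 1;

-- Applications and tools read up-to-date results directly
SELECT * FROM revenue_by_category;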

Are useful...

  • to build realtime views with ANSI SQL to serve realtime analytics dashboards, APIs & apps
  • to build notifications/alerting, e.g. in fraud and risk models, or in automated services that use event-driven SQL primitives (see the sketch after this list)
  • to build engaging experiences with customer-data aggregations that should always be up-to-date, e.g. personalisation, recommendations, dynamic pricing etc.
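For instance, a fraud-alerting view might look like the following (a hypothetical sketch; the table name and thresholds are illustrative):

-- Continuously flag users with an unusual number of large purchases;
-- the result set updates as purchase events stream in
CREATE MATERIALIZED VIEW suspicious_users AS
SELECT user_id, COUNT(*) AS large_purchases
FROM purchases
WHERE amount > 1000
GROUP BY user_id
HAVING COUNT(*) >= 5;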

Not useful for solutions...

  • that need columnar optimisation
  • using window functions and non-deterministic SQL functions like RANK(), RANDOM() (while straightforward in traditional databases, running these functions may produce continuous chaotic noise in streaming databases)
  • with ad-hoc querying, as response times are compromised since the computation plan is not optimized for point-in-time results

Perform and scale well because...

  • incremental updates ensure that database updates do not slow down as the dataset scales
  • a pre-computed query pattern, kept as a persistent transformation, ensures that reads are fast: no computation is required, just key-value lookups in memory, like a Redis cache (sketched below)
  • a high frequency of concurrent reads from materialized views has minimal performance impact, as complex queries with joins & aggregations are handled as persisted computation
  • aggregations are dramatically improved, since the resources required to handle persistent transformations are proportional to the number of rows in the output (instead of the scale of the input)
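A sketch of that serving path, assuming the streaming database supports indexes on materialized views (reusing the hypothetical revenue_by_category view from the architecture sketch above):

-- Index the pre-computed results so reads are point lookups
CREATE INDEX revenue_by_category_idx ON revenue_by_category (category);

-- This read performs no joins or aggregation; it just looks up a key
SELECT total_revenue FROM revenue_by_category WHERE category = 'Electronics';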

May not perform and scale well...

  • since SQL transformations are always running, joins over large datasets need a significant amount of memory to maintain intermediate state (imagine how you would incrementally maintain a join between two datasets: you never know what new keys will appear on each side, so you have to keep the entirety of each dataset around in memory)
  • when a single change in the inputs triggers changes in the output of many views (or when many layers of views depend on each other), more CPU is required for each update
  • frequently changing data triggers more work in the DB, and thus requires more CPU, than rarely changing data
  • a high number of total unique keys slows down read queries in traditional databases; in streaming databases, high cardinality increases the initial “cold-start” time when a persistent SQL transformation is first created, and requires more memory on an ongoing basis

Saturday, October 28, 2023

Materialized Views

Introduction

Common, frequent queries against a database can become expensive. When the same query is run again and again, it makes sense to ‘virtualize’ the query. Materialized views address this need by enabling common queries to be represented by a database object that is continuously updated as data changes.

A View...

  • is a derived relation defined in terms of stored base relations (generally tables)
  • defines a SQL transformation from a set of base tables to a derived table; this transformation is typically recomputed every time the view is referenced in a query
  • when created, does not compute any results, nor does it change how data is stored or indexed
  • is a saved query on the tables of a DB
  • is referenced in queries as if it were a table

Example:

CREATE VIEW user_purchase_summary AS SELECT
  u.id AS user_id,
  COUNT(*) AS total_purchases,
  SUM(p.amount) AS lifetime_value
FROM users u
JOIN purchases p ON p.user_id = u.id
GROUP BY u.id;

Every time a query referencing a view is executed, the database first computes the results of the view, and then computes the rest of the query using those results.
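For example, a query against the view defined above (illustrative):

SELECT user_id, lifetime_value
FROM user_purchase_summary
WHERE lifetime_value > 1000;
-- executed as if user_purchase_summary were replaced by its defining query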

A Materialized View...

  • takes a regular view and materializes it by computing its results upfront and storing them in a “virtual” table
  • is like a cache, i.e. a copy of the data that can be accessed quickly
  • is a regular view “materialized” by storing the tuples of the view in the database
  • can have index structures, hence access to a materialized view can be much faster than recomputing the view

Example:

CREATE MATERIALIZED VIEW user_purchase_summary AS SELECT
  u.id AS user_id,
  COUNT(*) AS total_purchases,
  SUM(CASE WHEN p.status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled_purchases
FROM users u
JOIN purchases p ON p.user_id = u.id
GROUP BY u.id;

A regular view is a saved query; a materialized view is a saved query along with its results stored as a table.

Implications of materializing a view

1. When referenced in a query, a materialized view is not recomputed, as the results are pre-stored; hence querying materialized views tends to be faster
2. Because it is stored as if it were a table, indexes can be built on the columns of a materialized view
3. Once a view is materialized, it is only accurate until the underlying base relations are modified. The process of updating a materialized view in response to changes in the underlying data is called view maintenance.

A “view” is an anchored perspective on changing inputs; its results are constantly changing as the underlying data changes. Materialization just implies that the transformation is done proactively. So, “materialized views” should update automatically.

However, in practice, some databases need materialized views to be manually refreshed, while others have implemented automatic updates, albeit with limitations.
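For example, PostgreSQL materialized views are refreshed manually:

-- Recompute the stored results from the base tables
REFRESH MATERIALIZED VIEW user_purchase_summary;

-- With a unique index on the view, the refresh can run without blocking reads
REFRESH MATERIALIZED VIEW CONCURRENTLY user_purchase_summary;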

Note: MySQL does not support materialized views as of now; Oracle, Snowflake, MongoDB, Redshift, and PostgreSQL, among others, do.

Materialized views are used...

  • when the SQL query is known ahead of time and needs to be repeatedly recalculated
  • primarily for caching the results of extremely heavy and complex queries that would be too expensive to recompute on every reference as regular views
  • as the ability to define (using SQL) any complex transformation of data in the DB, and let the DB maintain the results in a “virtual” table
  • when low end-to-end latency is required between when data originates and when it is reflected in a query
  • when low-latency query response times are expected under high concurrency or a high volume of queries

Use of materialized views in...

Applications: Incrementally updated materialized views can be used to replace the caching and denormalization traditionally done to “guard” OLTP databases from read-side latency and overload. Instead of waiting for a query and doing computation to get the answer, we are now asking for the query upfront and doing the computation to update the results as the writes (creates, updates and deletes) come in. This inverts the constraints of traditional database architectures, allowing developers to build data-intensive applications without complex cache invalidation or denormalization.

Analytics: ELT bulk-loads raw data into a warehouse and then transforms it via complex SQL. The transformation may use regular views (i.e. no caching; used when recomputation is not overly slow), cached tables built from the results of a SELECT query (used when regular views slow down queries due to re-computation), or incrementally updated tables (where the user is responsible for writing the update strategy).

Or, use the fourth option: “materialized views”, which always remain more up-to-date, are more automated, and are less error-prone than cached tables (the end-user burden of deciding when and how to update is minimized).