Sunday, October 29, 2023

Streaming Databases

Introduction

Traditional Transactional Databases don’t scale well and may take hours to run complex SQLs with joins, aggregations, and transformations in analytical databases.

Streaming Databases serve results for such complex SQLs with sub-second latency, and provide fast continuous data transformation capabilities not possible in traditional databases.

Streaming Databases use SQL and familiar RDBMS abstractions (viz. tables, columns, rows, views, indexes), but have a completely different engine (a stream processor) and computing model (dataflows) inside.

Traditional Databases Streaming Databases
store data in tables matching the structure of the inserts & updates and all the computation work happens on read queries.
ask for the queries upfront in the form of Materialized Views, incrementally update the results of these queries as input data arrives. So, Streaming databases move the computation work to the write side.

Origins

Early came about in the capital markets vertical, where value of fast computation over continuous data is very high e.g. StreamBase and KX System . These early generation of were more event processing frameworks than databases, and optimised for unique requirements of hedge funds and trading desks and not universality and accessibility.

While SQL-like control languages were implemented by the early ones (e.g. StreamBase - created with DDL statements like CREATE INPUT STREAM ), but the users had to be streaming systems experts.

SQL below doesn’t care if the data is static or actively updating. It has the info a streaming database needs to continually provide updated result sets as the soon as the data changes.

--- Sum revenue by product category
SELECT categories.name as category, SUM(line_items.amount) AS total_revenue
FROM purchases
JOIN line_items ON purchases.id = line_items.purchase_id
JOIN products ON products.id = line_items.product_id
JOIN categories ON product.category_id = categories.id
WHERE purchases.is_complete
GROUP BY 1;

ksqlDB and Flink allow users to define transformations in SQL but users still need to understand challenging streaming concepts to work around, like eventual consistency.

Recent focus in streaming databases is on expanding access to streaming computation by simplifying the control interface so that it can be operated by those familiar with traditional databases. Thus making application of streaming databases easier. 

Example Architectures

Streaming databases are often used “downstream” of primary transactional databases and message brokers, similar to how a Redis cache or a data warehouse might be used.

  • A message broker reliably and continuously feeds streams of data into the database
  • A Change Data Capture (CDC) service translates primary DB updates into structured data messages in to the message broker
  • the SQL transformations are managed in dbt, as is in data warehouses
  • user-facing applications and tools connect directly to the streaming database, with no need for caching and with more flexibility as compared to data warehouses

Are useful...

  • to build realtime views with ANSI SQL to serve realtime analytics dashboards, APIs & apps
  • to build notifications/alerting e.g. in fraud and risk models, or in building automated services that use event driven SQL primitives
  • to build engaging experiences with customer data aggregations that should always be up-to-date e.g. personalisation, recommendations, dynamic pricing etc.

Not useful for solutions...

  • that need columnar optimisation
  • using window functions and non-deterministic SQL functions like RANK(), RANDOM() (While straightforward in traditional databases, running a these functions may result in continuous chaotic noise for streaming databases)
  • with ad-hoc querying, as response times are compromised since the computation plan is not optimized for point-in-time results

Perform and scale well because...

    • incremental updates ensure that DB update does not slow down as the dataset scales
    • pre-computed query pattern as a persistent transformation ensure that reads are fast as no computation is required, it is just key-value lookups in memory, like Redis cache
    • high frequency of concurrent reads from materialized views has minimal performance impact as complex queries with joins & aggregations are handled as persisted computation
    • Aggregations are dramatically improved since
                        Resources requirement to handle persistent transformations.   
                                                  ∝                           
                        Number of rows in the output (instead of the scale of the input)

    May not perform and scale well...

      • since SQL transformations are always running, joins over large datasets need a significant amount of memory to maintain intermediate state/s (Imagine how you would incrementally maintain a join between two datasets: You never know what new keys will appear on each side, so you have to keep the entirety of each dataset around in memory.)
      • when a single change in inputs triggers change in output in many views, (or when many layers of views depend on each other) more CPU is required for each updat
      • data updates trigger more work in DB requiring more CPU, than with data changes rarely
      • high number of total unique keys slows down read queries in traditional databases. In streaming databases, high-cardinality increases initial “cold-start” time when a persistent SQL transformation is first created, and requires more memory on an ongoing basis

      No comments: