Showing posts with label Streaming. Show all posts
Showing posts with label Streaming. Show all posts

Sunday, October 29, 2023

Streaming Databases

Introduction

Traditional Transactional Databases don’t scale well and may take hours to run complex SQLs with joins, aggregations, and transformations in analytical databases.

Streaming Databases serve results for such complex SQLs with sub-second latency, and provide fast continuous data transformation capabilities not possible in traditional databases.

Streaming Databases use SQL and familiar RDBMS abstractions (viz. tables, columns, rows, views, indexes), but have a completely different engine (a stream processor) and computing model (dataflows) inside.

Traditional Databases Streaming Databases
store data in tables matching the structure of the inserts & updates and all the computation work happens on read queries.
ask for the queries upfront in the form of Materialized Views, incrementally update the results of these queries as input data arrives. So, Streaming databases move the computation work to the write side.

Origins

Early came about in the capital markets vertical, where value of fast computation over continuous data is very high e.g. StreamBase and KX System . These early generation of were more event processing frameworks than databases, and optimised for unique requirements of hedge funds and trading desks and not universality and accessibility.

While SQL-like control languages were implemented by the early ones (e.g. StreamBase - created with DDL statements like CREATE INPUT STREAM ), but the users had to be streaming systems experts.

SQL below doesn’t care if the data is static or actively updating. It has the info a streaming database needs to continually provide updated result sets as the soon as the data changes.

--- Sum revenue by product category
SELECT categories.name as category, SUM(line_items.amount) AS total_revenue
FROM purchases
JOIN line_items ON purchases.id = line_items.purchase_id
JOIN products ON products.id = line_items.product_id
JOIN categories ON product.category_id = categories.id
WHERE purchases.is_complete
GROUP BY 1;

ksqlDB and Flink allow users to define transformations in SQL but users still need to understand challenging streaming concepts to work around, like eventual consistency.

Recent focus in streaming databases is on expanding access to streaming computation by simplifying the control interface so that it can be operated by those familiar with traditional databases. Thus making application of streaming databases easier. 

Example Architectures

Streaming databases are often used “downstream” of primary transactional databases and message brokers, similar to how a Redis cache or a data warehouse might be used.

  • A message broker reliably and continuously feeds streams of data into the database
  • A Change Data Capture (CDC) service translates primary DB updates into structured data messages in to the message broker
  • the SQL transformations are managed in dbt, as is in data warehouses
  • user-facing applications and tools connect directly to the streaming database, with no need for caching and with more flexibility as compared to data warehouses

Are useful...

  • to build realtime views with ANSI SQL to serve realtime analytics dashboards, APIs & apps
  • to build notifications/alerting e.g. in fraud and risk models, or in building automated services that use event driven SQL primitives
  • to build engaging experiences with customer data aggregations that should always be up-to-date e.g. personalisation, recommendations, dynamic pricing etc.

Not useful for solutions...

  • that need columnar optimisation
  • using window functions and non-deterministic SQL functions like RANK(), RANDOM() (While straightforward in traditional databases, running a these functions may result in continuous chaotic noise for streaming databases)
  • with ad-hoc querying, as response times are compromised since the computation plan is not optimized for point-in-time results

Perform and scale well because...

    • incremental updates ensure that DB update does not slow down as the dataset scales
    • pre-computed query pattern as a persistent transformation ensure that reads are fast as no computation is required, it is just key-value lookups in memory, like Redis cache
    • high frequency of concurrent reads from materialized views has minimal performance impact as complex queries with joins & aggregations are handled as persisted computation
    • Aggregations are dramatically improved since
                        Resources requirement to handle persistent transformations.   
                                                  ∝                           
                        Number of rows in the output (instead of the scale of the input)

    May not perform and scale well...

      • since SQL transformations are always running, joins over large datasets need a significant amount of memory to maintain intermediate state/s (Imagine how you would incrementally maintain a join between two datasets: You never know what new keys will appear on each side, so you have to keep the entirety of each dataset around in memory.)
      • when a single change in inputs triggers change in output in many views, (or when many layers of views depend on each other) more CPU is required for each updat
      • data updates trigger more work in DB requiring more CPU, than with data changes rarely
      • high number of total unique keys slows down read queries in traditional databases. In streaming databases, high-cardinality increases initial “cold-start” time when a persistent SQL transformation is first created, and requires more memory on an ongoing basis

      Saturday, October 28, 2023

      Materialized Views

      Introduction

      Common, frequent queries against a database can become expensive. When the same query is run again and again, it makes sense to ‘virtualize’ the query. Materialized views address this need by enabling common queries to be represented by a database object that is continuously updated as data changes. 

      A View...

      • is a derived relation defined in terms of stored base relations (generally tables) 
      • defines a SQL transformation from a set of base tables to a derived table; this transformation is typically recomputed / re-compiled every time the view is referenced to in a query 
      • when created, does not compute any results nor does it change how data is stored or indexed
      • is a saved query on tables of a DB 
      • is referenced to, in queries, as if it were a table

      Example:

      CREATE VIEW user_purchase_summary AS SELECT
        u.id as user_id,
        COUNT(*) as total_purchases,
        SUM(purchases.amount) as lifetime_value
      
      FROM users u
      JOIN purchases p ON p.user_id = u.id;
      

      Every time a query referencing view/s is executed, it first computes the results of the view, and then computes the rest of the query using those results.

      A Materialized View...

      • takes a regular view and materializes it by upfront computing and storing its results in a “virtual” table 
      • is like a cache, i.e. a copy of the data that can be accessed quickly
      • is a regular view “materialized” by storing tuples of the view in the database
      • can have index structures and hence database access to materialized views can be much faster than recomputing the view

      Example:

      CREATE MATERIALIZED VIEW user_purchase_summary AS SELECT
        u.id as user_id,
        COUNT(*) as total_purchases,
        SUM(CASE when p.status = 'cancelled' THEN 1 ELSE 0 END) as cancelled_purchases
      
      FROM users u
      JOIN purchases p ON p.user_id = u.id;
      

      A regular view is a saved query, and, a materialized view is a saved query along with its results stored as a table.

      Implications of materializing a view

      1. When referenced in a query, a materialized view is not recomputed as the results are pre-stored and hence querying materialized views tends to be faster
      2. Because it’s stored as if it were a table, indexes can be built on the columns of a materialized view
      3. Once a view is materialized, it is only accurate until the underlying base relations are modified. The process of updating a materialized view in response to changes in underlying is called view maintenance.

      A “view” is an anchored perspective on changing inputs, results are constantly changing as the underlying data changes. Materialization just implies that the transformation is done proactively. So, "materialized views" should update automatically.

      However, in practice, some databases need materialized views to be manually refreshed and others have implemented automatic updates, albeit with limitations. 

      Note: MySQL does not support materialized view as of now. Oracle, Snowflake, MongoDB, Redshift, PostgreSQL all others do.

      Materialized views are used...

      • when SQL query is known ahead of time and needs to be repeatedly recalculated
      • primarily for caching the results of extremely heavy and complex queries that cannot be run frequently as regular views
      • as ability to define (using SQL) any complex transformation of data in DB, and let the DB maintain the results in a “virtual” table. when low end-to-end latency is required between when data originates to when it is reflected in a query
      • when low-latency query response times with high concurrency or high volume of queries is expected

      Use of materialized views in...

      Applications: Incrementally updated materialized views can be used to replace the caching and denormalization traditionally done to “guard” OLTP databases from read-side latency and overload. Instead of waiting for a query and doing computation to get the answer, we are now asking for the query upfront and doing the computation to update the results as the writes (creates, updates and deletes) come in. This inverts the constraints of traditional database architectures, allowing developers to build data-intensive applications without complex cache invalidation or denormalization.

      Analytics: ELT bulk loads raw data into a warehouse and then transforms it via some complex SQLs. The transformation may use regular views (i.e. no caching - used when it is not overly slow), or cached tables built from the results of a SELECT query (used when regular views slow down the queries due to re-computations), or incrementally updated table/s (but user is responsible for writing the update strategy).

      OR, use the fourth option i.e.

      Use "materialized views", always remain more up-to-date, more automated and less error-prone to cached tables (the end user burden of deciding when and how to update is minimized).