Sunday, October 29, 2023

Streaming Databases

Introduction

Traditional transactional databases don’t scale well for analytical workloads, and even analytical databases may take hours to run complex SQLs with joins, aggregations, and transformations.

Streaming Databases serve results for such complex SQLs with sub-second latency, and provide fast continuous data transformation capabilities not possible in traditional databases.

Streaming Databases use SQL and familiar RDBMS abstractions (viz. tables, columns, rows, views, indexes), but have a completely different engine (a stream processor) and computing model (dataflows) inside.

• Traditional databases: store data in tables matching the structure of the inserts & updates; all the computation work happens on read queries.
• Streaming databases: ask for the queries upfront in the form of materialized views, and incrementally update the results of these queries as input data arrives. So, streaming databases move the computation work to the write side.

Origins

Streaming databases first came about in the capital markets vertical, where the value of fast computation over continuous data is very high, e.g. StreamBase and Kx Systems. This early generation was more event-processing framework than database, optimised for the unique requirements of hedge funds and trading desks rather than for universality and accessibility.

While the early systems implemented SQL-like control languages (e.g. in StreamBase, streams were created with DDL statements like CREATE INPUT STREAM), users still had to be streaming-systems experts.

The SQL below doesn’t care whether the data is static or actively updating. It has all the info a streaming database needs to continually provide updated result sets as soon as the data changes.

-- Sum revenue by product category
SELECT categories.name AS category, SUM(line_items.amount) AS total_revenue
FROM purchases
JOIN line_items ON purchases.id = line_items.purchase_id
JOIN products ON products.id = line_items.product_id
JOIN categories ON products.category_id = categories.id
WHERE purchases.is_complete
GROUP BY 1;
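
In a streaming database, the same query can be registered upfront as an incrementally maintained materialized view. A minimal sketch, assuming Materialize-style syntax (the view name is illustrative):

-- Register the query upfront; the engine keeps results updated as data arrives
CREATE MATERIALIZED VIEW revenue_by_category AS
SELECT categories.name AS category, SUM(line_items.amount) AS total_revenue
FROM purchases
JOIN line_items ON purchases.id = line_items.purchase_id
JOIN products ON products.id = line_items.product_id
JOIN categories ON products.category_id = categories.id
WHERE purchases.is_complete
GROUP BY 1;

Once created, reads against revenue_by_category return up-to-date results without recomputing the joins and aggregation.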

ksqlDB and Flink allow users to define transformations in SQL, but users still need to understand and work around challenging streaming concepts, like eventual consistency.

The recent focus in streaming databases is on expanding access to streaming computation by simplifying the control interface so that it can be operated by those familiar with traditional databases, thus making streaming databases easier to apply.

Example Architectures

Streaming databases are often used “downstream” of primary transactional databases and message brokers, similar to how a Redis cache or a data warehouse might be used.

  • A message broker reliably and continuously feeds streams of data into the database
  • A Change Data Capture (CDC) service translates primary DB updates into structured data messages and publishes them to the message broker (see the sketch after this list)
  • the SQL transformations are managed in dbt, as is done with data warehouses
  • user-facing applications and tools connect directly to the streaming database, with no need for caching and with more flexibility as compared to data warehouses
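
As an illustration of the ingestion path above, here is a minimal sketch of declaring a CDC-backed source, modeled on an earlier version of Materialize’s syntax; the broker address, topic name, and schema-registry URL are hypothetical:

-- A source fed by Debezium CDC messages on a Kafka topic
CREATE SOURCE purchases
FROM KAFKA BROKER 'broker:9092' TOPIC 'pg.public.purchases'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
ENVELOPE DEBEZIUM;

Materialized views can then be defined directly on top of this source.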

Are useful...

  • to build realtime views with ANSI SQL to serve realtime analytics dashboards, APIs & apps
  • to build notifications/alerting e.g. in fraud and risk models, or in building automated services that use event driven SQL primitives
  • to build engaging experiences with customer data aggregations that should always be up-to-date e.g. personalisation, recommendations, dynamic pricing etc.

Not useful for solutions...

  • that need columnar optimisation
  • using window functions and non-deterministic SQL functions like RANK(), RANDOM() (while straightforward in traditional databases, running these functions may result in continuous chaotic noise in streaming databases)
  • with ad-hoc querying, as response times are compromised since the computation plan is not optimized for point-in-time results

Perform and scale well because...

    • incremental updates ensure that DB updates do not slow down as the dataset scales
    • pre-computing the query pattern as a persistent transformation ensures that reads are fast: no computation is required at read time, just key-value lookups in memory, like a Redis cache (see the index sketch after this list)
    • a high frequency of concurrent reads from materialized views has minimal performance impact, as complex queries with joins & aggregations are handled as persisted computation
    • aggregations are dramatically improved, since the resources required to handle a persistent transformation are proportional to the number of rows in the output (instead of the scale of the input)
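
    A minimal sketch of that read path, assuming Materialize-style syntax (the index name and lookup value are illustrative): an index keeps the view’s results in memory, so reads become key lookups.

    -- Maintain the view's results in memory, keyed by category
    CREATE INDEX revenue_by_category_idx ON revenue_by_category (category);

    -- Reads are now in-memory lookups rather than recomputation
    SELECT total_revenue FROM revenue_by_category WHERE category = 'books';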

    May not perform and scale well...

      • since SQL transformations are always running, joins over large datasets need a significant amount of memory to maintain intermediate state (imagine how you would incrementally maintain a join between two datasets: you never know what new keys will appear on each side, so you have to keep the entirety of each dataset around in memory)
      • when a single change in the inputs triggers changes in the output of many views (or when many layers of views depend on each other), more CPU is required for each update
      • frequently changing data triggers more work in the DB, and hence needs more CPU, than rarely changing data
      • a high number of total unique keys slows down read queries in traditional databases; in streaming databases, high cardinality increases the initial “cold-start” time when a persistent SQL transformation is first created, and requires more memory on an ongoing basis

      AI Opportunities in 2023 - Lecture by Dr. Andrew Ng

      AI as a general purpose tech...

      • is useful for lots of different applications, just as electricity is good for a lot of things
      • AI collection: Supervised learning and Generative AI (in focus today) + Unsupervised learning and Reinforcement learning
        • Supervised learning: Good for labelling things e.g. 
                          e-mail >> spam or not, ship-route >> fuel consumed, Ad & user-info >> will click
          • Workflow of Supervised learning apps: e.g. restaurant reviews classification

                                  Collect dataset >> label data >> train a model >> deploy >> run

      • The last decade was the decade of large-scale supervised learning. Small AI models could be built on not-very-powerful computers; they performed well on a certain small amount of data, but performance would flatten out even with larger amounts of data. With large AI models, however, performance scales better and better with more data
      • This decade is adding to it the excitement of Generative AI
        • When we train a very large AI model on a lot of data, we get a LLM like ChatGPT
        • RLHF and other techniques tune AI output to be more helpful, honest and harmless
        • And at the heart of Generative AI is (supervised learning) repeated prediction of the next sub-part (token) of the text, given the data it has seen
      • The power of LLMs as a developer (not programmer) tool: 
        • With prompt based AI - the workflow is:

                          Specify prompt >> Deploy to cloud (e.g. build a restaurant review system in a few days)

      • Opportunities: massive value will be created with Supervised learning and Generative AI together, by identifying and executing concrete use cases
        • Supervised learning will double in size and Generative AI will much more than double
        • for new start-ups and for large enterprises / companies
      • Lensa was an indefensible use case as it did not add value; AirBnB or Uber are defensible because they create value
      • The work ahead is to find the many diverse, value adding and defensible use cases

      • Refer to the "Potential AI projects space curve"
        • Advertising and web search are the only large money-making domains, with millions of users
        • As we go to the right of the curve, some example projects of interest may be:
          • Food inspection: cheese spread evenly on a pizza
          • Wheat harvesting: how tall is the wheat crop, at what height should it be chopped off
          • Materials grading, cloth grading...
        • Clearly, industries other than advertising and web search have a very long tail of $5 mn projects, but with a very high cost of customisation
          • So, AI community needs to continue building better tools to help aggregate such use cases and make it easy for end users to do the customisations at affordable costs
      Instead of needing to worry about pictures of pizza, the AI community will create tools to enable the IT department of the pizza factory to train an AI system on their own pizzas, and thus realise the value of $5 mn by leveraging some low/no-code tools for AI

      Referring to the AI Stack...

      • H/W semi-conductor layer at bottom is very capital intensive and very concentrated
      • The infrastructure layer above the semiconductor layer is also highly capital intensive and very concentrated
      • Developer tools layer is hyper-competitive, and only a few will be winners
      • All the above-said layers can be successful only if the application layer on top is even more successful, e.g. Amorai - app for romantic relationships coaching
      • The recipe for building startups, "don't rush to solutions", has been inverted: now we can rush to solutions while still keeping it cost effective
                    Ideas >> Validate >> Get CEO >> Prototype (early users) >> Pre-seed Growth (MVP) >> Seed, Growth & Scale
      • Concrete ideas can be validated or falsified efficiently
      • Even highly profitable projects that are low on ethics will / should be killed
      • AGI is still decades away
      • Other areas of interest may be predicting next pandemic, climate change predictions... 

      Saturday, October 28, 2023

      Materialized Views

      Introduction

      Common, frequent queries against a database can become expensive. When the same query is run again and again, it makes sense to precompute it. Materialized views address this need by enabling common queries to be represented by a database object that is continuously updated as data changes.

      A View...

      • is a derived relation defined in terms of stored base relations (generally tables)
      • defines a SQL transformation from a set of base tables to a derived table; this transformation is typically recomputed every time the view is referenced in a query
      • when created, does not compute any results, nor does it change how data is stored or indexed
      • is a saved query on tables of a DB
      • is referenced in queries as if it were a table

      Example:

      CREATE VIEW user_purchase_summary AS SELECT
        u.id AS user_id,
        COUNT(*) AS total_purchases,
        SUM(p.amount) AS lifetime_value
      FROM users u
      JOIN purchases p ON p.user_id = u.id
      GROUP BY u.id;
      

      Every time a query referencing a view is executed, the database first computes the results of the view, and then computes the rest of the query using those results.
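
      For example, a query can filter the view as if it were a table (the threshold is illustrative); the view’s transformation runs first, then the outer query runs over its results:

      -- The view's SELECT is evaluated first, then this filter is applied
      SELECT user_id, lifetime_value
      FROM user_purchase_summary
      WHERE lifetime_value > 1000;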

      A Materialized View...

      • takes a regular view and materializes it by computing its results upfront and storing them in a “virtual” table
      • is like a cache, i.e. a copy of the data that can be accessed quickly
      • is a regular view “materialized” by storing tuples of the view in the database
      • can have index structures and hence database access to materialized views can be much faster than recomputing the view

      Example:

      CREATE MATERIALIZED VIEW user_purchase_summary AS SELECT
        u.id AS user_id,
        COUNT(*) AS total_purchases,
        SUM(CASE WHEN p.status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled_purchases
      FROM users u
      JOIN purchases p ON p.user_id = u.id
      GROUP BY u.id;
      

      A regular view is a saved query, and, a materialized view is a saved query along with its results stored as a table.

      Implications of materializing a view

      1. When referenced in a query, a materialized view is not recomputed as the results are pre-stored and hence querying materialized views tends to be faster
      2. Because it’s stored as if it were a table, indexes can be built on the columns of a materialized view
      3. Once a view is materialized, it is only accurate until the underlying base relations are modified. The process of updating a materialized view in response to changes in the underlying relations is called view maintenance.

      A “view” is an anchored perspective on changing inputs; its results are constantly changing as the underlying data changes. Materialization just implies that the transformation is done proactively. So, "materialized views" should update automatically.

      However, in practice, some databases need materialized views to be manually refreshed and others have implemented automatic updates, albeit with limitations. 

      Note: MySQL does not support materialized views as of now. Oracle, Snowflake, MongoDB, Redshift, PostgreSQL and others do.
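
      In PostgreSQL, for example, a materialized view does not update automatically; it has to be refreshed explicitly to pick up changes in its base tables:

      -- Recompute the stored results from the current base-table data
      REFRESH MATERIALIZED VIEW user_purchase_summary;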

      Materialized views are used...

      • when the SQL query is known ahead of time and needs to be repeatedly recalculated
      • primarily for caching the results of extremely heavy and complex queries that are too expensive to run frequently as regular views
      • as the ability to define (using SQL) any complex transformation of data in the DB, and let the DB maintain the results in a “virtual” table
      • when low end-to-end latency is required between when data originates and when it is reflected in a query
      • when low-latency query response times are expected under high concurrency or a high volume of queries

      Use of materialized views in...

      Applications: Incrementally updated materialized views can be used to replace the caching and denormalization traditionally done to “guard” OLTP databases from read-side latency and overload. Instead of waiting for a query and doing computation to get the answer, we are now asking for the query upfront and doing the computation to update the results as the writes (creates, updates and deletes) come in. This inverts the constraints of traditional database architectures, allowing developers to build data-intensive applications without complex cache invalidation or denormalization.

      Analytics: ELT bulk loads raw data into a warehouse and then transforms it via some complex SQLs. The transformation may use regular views (i.e. no caching - used when it is not overly slow), or cached tables built from the results of a SELECT query (used when regular views slow down the queries due to re-computations), or incrementally updated tables (but the user is responsible for writing the update strategy).

      OR, use the fourth option i.e.

      Use "materialized views", always remain more up-to-date, more automated and less error-prone to cached tables (the end user burden of deciding when and how to update is minimized). 

      Monday, July 10, 2017

      Ethereum mining on AWS


      • Ethereum mining works only on g2.2xlarge or g2.8xlarge instances with Ubuntu 14.04 or later
      • Port 30303 must be opened for both TCP and UDP connections from `anywhere` (in security group settings)
      • The default Ubuntu available with EC2 is minimal, i.e. some DRM files required for the OS to see the GPU drivers are missing. SSH into your machine and run the following steps to fix this:
      > sudo apt-get install linux-generic  
      (Click OK for default option/s when prompted)
      > sudo reboot
      • Download the CUDA drivers for EC2 instances (g2 instances use Nvidia GPUs). Working with the .deb package (instead of .run) is easier (local or network makes no difference):
      > wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/rpmdeb/cuda-repo-ubuntu1404-7-0-local_7.0-28_amd64.deb

      (Newer versions are available here)

      > sudo dpkg -i <cuda repo package>
      > sudo apt-get update
      > sudo apt-get install cuda
      • Run the following command to check driver is installed: 
      > lshw -c video
      • A line that starts with "Configuration:" should mention "...driver=nvidia...", if it doesn't search carefully or try reboot. 
      • If you see "...driver=nouveau..." instead of "...driver=nvidia..." then something is wrong - google how to get rid of it and reinstall cuda.
      • Build geth from source, refer here 
      • run geth to allow it to catch up on the chain: 
      > ~/go-ethereum/build/bin/geth
      • install ethminer from cpp-ethereum dev PPAs, refer here 
      • Use the following command to check the current hash rate (~6 MH/s) and to benchmark ethminer to check that your system is in order:
      > ethminer -G -M 
      • When geth catches up on the blockchain, use the following command to generate a new account: 
      > ~/go-ethereum/build/bin/geth account new
      • start geth again with RPC enabled by using command-line below 
      > ~/go-ethereum/build/bin/geth --rpc
      • Execute following command to start ethminer
      > ethminer -G
      • If using the larger g2 instance with 4 GPUs, ethminer needs to be started 4 times, each time adding a "--opencl-device <0..3>" argument, as shown below
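      For example (one ethminer process per GPU; run each in its own shell or screen session):
      > ethminer -G --opencl-device 0
      > ethminer -G --opencl-device 1
      > ethminer -G --opencl-device 2
      > ethminer -G --opencl-device 3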
      • Check logs carefully, ethminer should be getting work packages from geth and be "mining a block"

      Sunday, December 27, 2015

      How to...

      Make YouTube videos run faster

      Open Google Chrome
      Ctrl + Shift + J - opens Developer Tools, ensure you are on Console tab
      On the prompt copy and paste script below:

      document.getElementsByTagName("video")[0].playbackRate = 2.5;

      Instead of playbackRate = 2.5, you can set any other floating-point number between 1.00 and 4.00

      Create a rss feed for "The Thlog"

      https://thethlog.blogspot.com/feeds/posts/default?alt=rss

      Replace "thethlog" with name of any blog on blogspot, use it for another blog hosted on blogspt. Now, add it to a feed URL to an RSS reader (e.g. Feedly) OR be more creative with GPT.