Monday, October 30, 2023

Scratching the OLAP surface

Introduction

Long ago, business transactions were saved to relational databases and few ad-hoc decision support queries ran on the same database. As data size grew and time sensitivity of business decisions reports became apparent (e.g. fetching monthly sales reports), running complex and expensive SQL queries on the same database impacted saving business transactions to the DB.

Running complex and expensive SQL queries was moved to a purpose-built database called data warehouses (DWH). Data was sent from multiple business operational systems and integrated into central storage - the DWH, a single source of truth of business data.

Data warehouses were built on MPP principles (massively parallel processing) i.e. compute and storage were coupled to benefit from the data locality during query execution. To speed up the complex and expensive queries ETL pipelines pre-joined and normalise data before loading into DWHs. The storage footprint reduced and queries ran faster with tables laid out in two main strategies viz. star and snowflake schemas.

Finally business teams could run complex queries that filter and aggregate billions of records and analyse large volumes of data from different perspectives with help of BI tools. Thus business transaction processing separated from decision supporting analytical processing - former became OLTP and the later was the advent of OLAP.

The advent of OLAP

A data warehouse remains to be a large relational database storing data in a collection of tables unless the following additional layers make it into a multidimensional database:
  • Data is modelled by several dimensions to reduced storage footprint, improving query times e.g. tables are laid out in two main strategies viz. star and snowflake schemas.
  • ETL pipelines ensure pre-joined and normalised data is loaded into DWH
  • Complex queries with several joins and higher level aggregates are managed as materialised views
  • Dedicated UI to run complicated queries across several data dimensions
Such multidimensional database deployed on OLAP server, providing dedicated client interface to run complicated queries, visualise data and perform analytical operations across several data dimensions for generating business intelligence insights came to be called an OLAP Cube.

Key differences in OLAP and OLTP

OLTP Applications...

OLAP Systems...

Purpose

are used to run the business are used to understand the business

Purpose

handle large volumes of transactional data from multiple users e.g. online hotel bookings, mobile banking transactions, e-commerce purchases quickly process large amounts of data for in-depth data analysis across multiple dimensions for decision-making

Data sources

data is created by users as they complete business transactions pull data from OLTP databases via an ETL pipeline to provide insights such as analysing ATM activity and performance over time

Response times

milliseconds range from a second to several hours

Data storage capacity

usually have modest data storage requirements, as historical transaction data is archived require massive amounts of data storage capacity, a modern cloud data warehouse may accommodate these easily

Intended Users

are customer-facing and designed for use by frontline workers such as store clerks and hotel reservation specialists as well as online shoppers are business-facing and are used by data scientists, analysts, and business users such as team leads or executives who access data using analytics dashboards

Types of OLAP Systems

Depending on specific analytical needs of an organisation, the amount and complexity of data to handle, the required query response times, and the kind of analysis and reporting required determines which of the following OLAP systems be used:

Relational OLAP (ROLAP) systems... 

  1. facilitate multidimensional data analysis with high data efficiency
  2. SQLs retrieve and analyse data from relational tables of data warehouse
  3. are highly scalable and can handle large amounts of data
  4. have slower query response times and do not support complex calculations

Multidimensional OLAP (MOLAP) systems...

  1. are fast for multidimensional analysis and running complex calculations & aggregations
  2. store data in a multidimensional cube format, where each dimension represents a different attribute of the data e.g. time, geography, or product
  3. require extensive data preprocessing (as data is stored in multidimensional cubes)
  4. can handle limited data and are not as scalable as ROLAP systems

Hybrid OLAP (HOLAP) systems...

  1. combine the strengths of MOLAP and ROLAP systems
  2. store summary data in multidimensional cubes while detailed business data is stored in relational database, thus also improving data relevance
  3. provide fast data access for high speed querying and handle high volumes of data

Data storage strategies in OLAP systems

Star Schema...

  • is multidimensional data model, used in ROLAP systems
  • organises data into a central fact table surrounded by dimension tables
  • fact table contains measures being analysed i.e. quantitative data like sales revenue, quantity sold, profit margin etc.
  • dimension tables contain descriptive data that provide context for the measures e.g. time, geography, product information, etc.
  • each dimension table is joined to the fact table through a primary key-foreign key relationship
  • is popular as it is easy for business analysts and end users to understand and navigate through different levels of data

Snowflake Schema...

  • is ROLAP data model
  • organises data in a central fact table and normalised dimension tables (i.e. multiple broken but related tables)
  • normalises dimension tables to reduce data redundancy and improve data consistency
  • has more tables and relationships resulting in hard to understand, complex and slow queries
  • improves query performance and reduces storage requirements by eliminating redundant data

Fact Constellation (aka Galaxy) Schema...

  • contains multiple fact tables, each with its own set of dimension tables containing descriptive data
  • the fact tables have shared dimensions which links the fact tables and hence allow for even more complex queries and analyses
  • each fact table represents a different business process or measure e.g. sales or customer satisfaction
  • provides more flexibility in querying and analysing data, as users can analyse multiple business processes or metrics at the same time
  • is harder to use than the star or snowflake schema

Improving performance of OLAP systems

Pre-aggregating data for faster access

  • pre-calculating and storing summary data in OLAP cubes e.g. totals, averages, etc.
  • combining data at different levels of granularity for readily providing high-level overviews

Caching data for quick retrieval

  • repetitive queries and frequently accessed data & query results are stored in memory

Indexing on specific columns and dimensions

  • helps to quickly locate the required data rows without scanning entire storage system

Partitioning into smaller units

  • to optimise performace database engineers divide large tables or cubes into smaller, more manageable parts based on a partitioning key thus reducing the amount of data that needs to be scanned for each query

Parallel processing

  • a query is divided into parts, known as tasks, and distributed across multiple processors or cores. Each processor is assigned tasks simultaneously, allowing the query to be executed much faster than if it were processed sequentially on a single processor.

Materialized Views

  • pre-calculated views store results of complex queries as physical tables

Hardware and infrastructure

scalable infrastructure e.g. fast processors, large memory, high-speed storage etc., facilitate data discovery, unlimited report viewing, and complex analytical calculations. Cloud-based vendors for data analysis are now a default choice. They simplify integration, are reliable, easy to scale, and more affordable than on-premise data infrastructures.

Data preparation... 

Data gathering, storing and cleaning is done via two data integration methods viz. ETL and ELT.

ETL (Extract, Transform, Load) is a predefined sequence for extracting data from sources, transforming it to meet the target system's requirements, and loading it into a target data warehouse. It is complex, time-consuming and requires upfront planning for the data to be correctly transformed and loaded into the target system. 

ELT (Extract, Load and Transform) involves extracting data from data sources, loading it into target data warehouse or data lake, and then transforming it to meet the target system's requirements. 

Differences between ELT and ETL

Unlike ETL, ELT does not require a predefined sequence of steps. The extracted data is loaded into the target system as quickly as possible, and then the data transformation process is applied to it in-place.

ELT solutions are usually applied to modern cloud-based data warehouses that allow for massive parallel processing. So, ELT solutions can process large amounts of data much faster than traditional ETL solutions. Also, ELT solutions are more flexible than ETL solutions, as they allow for data transformation to be performed on the destination system in-place.

Once data has been loaded into a cloud data warehouse, engineers and analysts use modern data stack to prepare data for analysis. 

OLAP Operations

  • OLAP systems use a specific SQL language called MDX or Multidimensional Expressions
  • Also support standard SQL queries to perform OLAP analysis
Some standard multidimensional OLAP operations are:

Slice and Dice

  • Slicing: dividing one dimension within the cube into a separate table, enabling low-level and isolated analysis of a data set
  • Dicing: is dividing two or more dimensions within a cube to generate a separate cube

Drill down and Roll-up

  • drill down: is move from high-level data to view lower-level information
  • roll up: is move from detailed data to less detailed data or summarised information

Pivot

  • is to rotate data from rows to columns or from columns to rows, enabling multidimensional analysis from different perspectives and data comparisons across dimensions

Drill-through

  • creating data points to access detailed information faster. When users click on a data point in a summary, they are shown the underlying data that make up the summary

Drill-across

  • using a common dimension shared by different data sources to enable data analysis across multiple unrelated sources or cubes. This function allows analysts to perform analysis on data from multiple sources without integrating them into a single cube.

Aggregations and Calculations

  • calculations and aggregations such as sum, average, count, minimum, maximum, and variance. Users perform these operations across one or more dimensions.