RDBMS Modeling - (Materialized|Aggregate) (View|Table)

1 - About

Aggregate table are a persistence unit of aggregate data.

Aggregate tables are also known as :

A materialized view (“materialization”) is a table that contains the result of a query (generally an aggregated one but not always ...).

The DBMS maintains it, and uses it to answer queries on other tables.

Aggregate table are used to:

  • cache expensive queries in a data warehouse environment with summarized table. It's then a cache or pre-compute operations engine.
  • replicate data to non-master sites in a replication environment

Pre-aggregating and storing additive information is the standard practice for improving the query performance of relational databases.

When users request information at a high “grain” of aggregation, Pre-aggregating and storing additive information is the standard practice for improving the query performance of relational databases.

However when data becomes bigger, the pre-calculation processing becomes impossible – even with powerful hardware. However, with the benefit of distributed computing power (such as Hadoop), calculation jobs can leverage hundreds of thousands of nodes.

Nowadays, commercial DBMS support aggregation tables. They provide sophisticated techniques for query rewriting, incremental maintenance, and (usually implemented as an external administration tool ) recommending aggregation tables.

When the queries are known in advance and predictable, materialized views are very helpful.

See : RDBMS Modeling - View selection problem (recommending the best aggregation tables) - Data Warehousing

3 - Example

CREATE materialized VIEW EmpSummary AS
   SELECT deptno, COUNT(*) AS c, SUM(sal) AS s
   FROM Emp
   GROUP BY deptno

4 - Materialized views as summary table

Materialized views are the equivalent of a summary table.

In a Online Analytical Processing (Olap) approach, each of the elements of a dimension could be summarized using a hierarchy.

The end user queries the tables and views in the database. The query rewrite mechanism in a database automatically rewrites the SQL query to use this summary tables.

This mechanism reduces response time for returning results from the query. Materialized views within the data warehouse are transparent to the end user or to the database application.

This is relatively straightforward and is answered in a single word - performance. By calculating the answers to the really hard questions up front (and once only), we will greatly reduce the load on the machine, We will experience:

  • Less physical reads - There is less data to scan through.
  • Less writes - We will not be sorting/aggregating as frequently.
  • Decreased CPU consumption - We will not be calculating aggregates and functions on the data, as we will have already done that.
  • Markedly faster response times - Our queries will return incredibly quickly when a summary is used, as opposed to the details. This will be a function of the amount of work we can avoid by using the materialized view, but many orders of magnitude is not out of the question.

Materialized views will increase your need for one resource - more permanently allocated disk. We need extra storage space to accommodate the materialized views, of course, but for the price of a little extra disk space, we can reap a lot of benefit.

Also notice that we may have created a materialized view, but when we ANALYZE, we are analyzing a table. A materialized view creates a real table, and this table may be indexed, analyzed, and so on.

5 - Materialized view as replica

A materialized view is a replica of a target master from a single point in time. The master can be either a master table at a master site or a master materialized view at a materialized view site. Whereas in multimaster replication tables are continuously updated by other master sites, materialized views are updated from one or more masters through individual batch updates, known as a refreshes, from a single master site or master materialized view site.

6 - Algoritm

Given a database with 30 columns, 10M rows. Find X summary tables with under Y rows that improve query response time the most. AdaptiveMonteCarlo algorithm:

  • Based on research [2]
  • Greedy algorithm that takes a combination of summary tables and tries to find the table that yields the greatest cost/benefit improvement
  • Models “benefit” of the table as query time saved over simulated query load
  • The “cost” of a table is its size

7 - Documentation / Reference

data/type/relation/modeling/aggregate_table.txt · Last modified: 2017/09/13 16:31 by gerardnico