Analytic - Data Densification (sparse to dense data) - Preservation of the dimensions

1 - About

You may want to view the data in dense form, with rows for all combination of dimension values displayed even when no fact data exist for them.

The densification is also :

  • known as the preservation of the dimensions
  • describe as show me the null values

For example, if a product did not sell during a particular time period, you may still want to see the product for that time period with zero sales or null value next to it.

Moreover, time series calculations can be performed most easily when data is dense along the time dimension.

This is because dense data will fill a consistent number of rows for each period, which in turn makes it simple to use the analytic windowing functions with physical offsets.

Data densification is the process of converting sparse data into dense form.

3 - Sparse data to densify

A typical situation in the sh schema with a sparse dimension is shown in the following example, which computes the weekly sales and year-to-date sales for the product Bounce for weeks 20-30 in 2000 and 2001:

SELECT SUBSTR(p.Prod_Name,1,15) Product_Name, 
       t.Calendar_Year          YEAR,
       t.Calendar_Week_Number   Week, 
       SUM(Amount_Sold)         Sales
FROM   
       Sales s, 
       Times t, 
       Products p
WHERE  
       s.Time_id = t.Time_id AND 
       s.Prod_id = p.Prod_id AND
       p.Prod_name IN ('Bounce') AND 
       t.Calendar_Year IN (2000,2001) AND
       t.Calendar_Week_Number BETWEEN 20 AND 30
GROUP BY 
       p.Prod_Name, 
       t.Calendar_Year, 
       t.Calendar_Week_Number
ORDER BY 
       product_name, 
       YEAR, 
       week ASC
PRODUCT_NAME          YEAR       WEEK      SALES
--------------- ---------- ---------- ----------
Bounce                2000         20        801
Bounce                2000         21    4062.24
Bounce                2000         22    2043.16
Bounce                2000         23    2731.14
Bounce                2000         24    4419.36
---- Miss one row for the week 25
---- Miss one row for the week 26
Bounce                2000         27    2297.29
Bounce                2000         28    1443.13
Bounce                2000         29    1927.38
Bounce                2000         30    1927.38
Bounce                2001         20     1483.3
Bounce                2001         21    4184.49
Bounce                2001         22    2609.19
Bounce                2001         23    1416.95
Bounce                2001         24    3149.62
Bounce                2001         25    2645.98
---- Miss one row for the week 26
Bounce                2001         27    2125.12
---- Miss one row for the week 28
Bounce                2001         29    2467.92
Bounce                2001         30    2620.17

In this example, we would expect 22 rows of data (11 weeks each from 2 years) if the data were dense. However we get only 18 rows because weeks 25 and 26 are missing in 2000, and weeks 26 and 28 in 2001.

4 - Densification : A right outer join : the first step

To overcome the problem of the sparsity, the first idea is to perform an outer join with the time table to fill the gap

SELECT Product_Name, t.Year, t.Week, NVL(Sales,0) dense_sales
FROM
 (SELECT SUBSTR(p.Prod_Name,1,15) Product_Name,
  t.Calendar_Year YEAR, t.Calendar_Week_Number Week, SUM(Amount_Sold) Sales
  FROM Sales s, Times t, Products p
  WHERE s.Time_id = t.Time_id AND s.Prod_id = p.Prod_id AND
   p.Prod_name IN ('Bounce') AND t.Calendar_Year IN (2000,2001) AND
   t.Calendar_Week_Number BETWEEN 20 AND 30
  GROUP BY p.Prod_Name, t.Calendar_Year, t.Calendar_Week_Number) v
RIGHT OUTER JOIN
 (SELECT DISTINCT Calendar_Week_Number Week, Calendar_Year YEAR
  FROM Times
  WHERE Calendar_Year IN (2000, 2001)
  AND Calendar_Week_Number BETWEEN 20 AND 30) t
ON (v.week = t.week AND v.Year = t.Year)
ORDER BY t.year, t.week;

Unfortunately, the product name is missing and we have to go further to find a solution.

Result :

PRODUCT_NAME          YEAR       WEEK DENSE_SALES
--------------- ---------- ---------- -----------
Bounce                2000         20        1602
Bounce                2000         21     8124.48
Bounce                2000         22     4086.32
Bounce                2000         23     5462.28
Bounce                2000         24     8838.72
                      2000         25           0
                      2000         26           0
Bounce                2000         27     4594.58
Bounce                2000         28     2886.26
Bounce                2000         29     3854.76
Bounce                2000         30     3854.76
Bounce                2001         20      2966.6
Bounce                2001         21     8368.98
Bounce                2001         22     5218.38
Bounce                2001         23      2833.9
Bounce                2001         24     6299.24
Bounce                2001         25     5291.96
                      2001         26           0
Bounce                2001         27     4250.24
                      2001         28           0
Bounce                2001         29     4935.84
Bounce                2001         30     5240.34

5 - Densification with partitioned outer join

To overcome the problem of sparsity and to fill the product column, Oracle recommend to use a partitioned outer join.

During a partition outer join, the database :

  • partitions the rows in your query based on the expression you specify in the PARTITION BY clause (in our example by product name)
  • then apply the outer join on each logical partition defined in the query.
  • and retrieve the result by performing an UNION of the outer joins of each of the partitions in the logically partitioned table on the other side of the join.
SELECT Product_Name, t.Year, t.Week, NVL(Sales,0) dense_sales
FROM
 (SELECT SUBSTR(p.Prod_Name,1,15) Product_Name,
  t.Calendar_Year YEAR, t.Calendar_Week_Number Week, SUM(Amount_Sold) Sales
  FROM Sales s, Times t, Products p
  WHERE s.Time_id = t.Time_id AND s.Prod_id = p.Prod_id AND
   p.Prod_name IN ('Bounce') AND t.Calendar_Year IN (2000,2001) AND
   t.Calendar_Week_Number BETWEEN 20 AND 30
  GROUP BY p.Prod_Name, t.Calendar_Year, t.Calendar_Week_Number) v
PARTITION BY (v.Product_Name)
RIGHT OUTER JOIN
 (SELECT DISTINCT Calendar_Week_Number Week, Calendar_Year YEAR
  FROM Times
  WHERE Calendar_Year IN (2000, 2001)
  AND Calendar_Week_Number BETWEEN 20 AND 30) t
ON (v.week = t.week AND v.Year = t.Year)
ORDER BY t.year, t.week;

The result :

PRODUCT_NAME          YEAR       WEEK DENSE_SALES
--------------- ---------- ---------- -----------
Bounce                2000         20        1602
Bounce                2000         21     8124.48
Bounce                2000         22     4086.32
Bounce                2000         23     5462.28
Bounce                2000         24     8838.72
Bounce                2000         25           0
Bounce                2000         26           0
Bounce                2000         27     4594.58
Bounce                2000         28     2886.26
Bounce                2000         29     3854.76
Bounce                2000         30     3854.76
Bounce                2001         20      2966.6
Bounce                2001         21     8368.98
Bounce                2001         22     5218.38
Bounce                2001         23      2833.9
Bounce                2001         24     6299.24
Bounce                2001         25     5291.96
Bounce                2001         26           0
Bounce                2001         27     4250.24
Bounce                2001         28           0
Bounce                2001         29     4935.84
Bounce                2001         30     5240.34

Nevertheless, this clause can not be always available and we can see a partition outer join as a outer join between the sparse data and a cross join data set of the dimension. See the following paragraph for an example.

6 - Densification with cross join and outer join

A cross join between the dimensions is always dense. The idea is therefore to perform it first and then outer join it with the sparse data.

First, the cross join :

SELECT product_name, week, YEAR FROM
 (SELECT DISTINCT Calendar_Week_Number Week, Calendar_Year YEAR 
   FROM Times 
  WHERE Calendar_Year IN (2000, 2001)  AND 
        Calendar_Week_Number BETWEEN 20 AND 30) t,
 (SELECT SUBSTR(p.Prod_Name,1,15) Product_Name 
   FROM Products p 
  WHERE p.Prod_name IN ('Bounce'))

The result :

PRODUCT_NAME          WEEK       YEAR
--------------- ---------- ----------
Bounce                  29       2000
Bounce                  21       2001
Bounce                  26       2001
Bounce                  29       2001
Bounce                  30       2001
Bounce                  20       2001
Bounce                  24       2001
Bounce                  23       2000
Bounce                  27       2000
Bounce                  30       2000
Bounce                  28       2001
Bounce                  25       2000
Bounce                  26       2000
Bounce                  22       2001
Bounce                  25       2001
Bounce                  27       2001
Bounce                  21       2000
Bounce                  22       2000
Bounce                  23       2001
Bounce                  20       2000
Bounce                  24       2000
Bounce                  28       2000

And then an outer join to retrieve the amount sold.

SELECT cross_join.Product_Name, cross_join.Year, cross_join.Week, NVL(data_to_densify.Sales,0) dense_sales
FROM
 (SELECT SUBSTR(p.Prod_Name,1,15) Product_Name,
  t.Calendar_Year YEAR, t.Calendar_Week_Number Week, SUM(Amount_Sold) Sales
  FROM Sales s, Times t, Products p
  WHERE s.Time_id = t.Time_id AND s.Prod_id = p.Prod_id AND
   p.Prod_name IN ('Bounce') AND t.Calendar_Year IN (2000,2001) AND
   t.Calendar_Week_Number BETWEEN 20 AND 30
  GROUP BY p.Prod_Name, t.Calendar_Year, t.Calendar_Week_Number) data_to_densify
RIGHT OUTER JOIN
(
  SELECT week, YEAR, product_name FROM
 (SELECT DISTINCT Calendar_Week_Number Week, Calendar_Year YEAR 
   FROM Times 
  WHERE Calendar_Year IN (2000, 2001)  AND 
        Calendar_Week_Number BETWEEN 20 AND 30) t,
 (SELECT SUBSTR(p.Prod_Name,1,15) Product_Name 
   FROM Products p 
  WHERE p.Prod_name IN ('Bounce') ) p
) cross_join
ON (
        data_to_densify.product_Name = cross_join.Product_Name AND 
        data_to_densify.week = cross_join.week AND 
        data_to_densify.Year = cross_join.Year
   )
ORDER BY YEAR, week

In our example above, a right outer join is used but of course all outer joins as left and full can be used instead.

The result :

PRODUCT_NAME          YEAR       WEEK DENSE_SALES
--------------- ---------- ---------- -----------
Bounce                2000         20        1602
Bounce                2000         21     8124.48
Bounce                2000         22     4086.32
Bounce                2000         23     5462.28
Bounce                2000         24     8838.72
Bounce                2000         25           0
Bounce                2000         26           0
Bounce                2000         27     4594.58
Bounce                2000         28     2886.26
Bounce                2000         29     3854.76
Bounce                2000         30     3854.76
Bounce                2001         20      2966.6
Bounce                2001         21     8368.98
Bounce                2001         22     5218.38
Bounce                2001         23      2833.9
Bounce                2001         24     6299.24
Bounce                2001         25     5291.96
Bounce                2001         26           0
Bounce                2001         27     4250.24
Bounce                2001         28           0
Bounce                2001         29     4935.84
Bounce                2001         30     5240.34

7 - Reference

  • Bookmark "Analytic - Data Densification  (sparse to dense data) - Preservation of the dimensions" at del.icio.us
  • Bookmark "Analytic - Data Densification  (sparse to dense data) - Preservation of the dimensions" at Digg
  • Bookmark "Analytic - Data Densification  (sparse to dense data) - Preservation of the dimensions" at Ask
  • Bookmark "Analytic - Data Densification  (sparse to dense data) - Preservation of the dimensions" at Google
  • Bookmark "Analytic - Data Densification  (sparse to dense data) - Preservation of the dimensions" at StumbleUpon
  • Bookmark "Analytic - Data Densification  (sparse to dense data) - Preservation of the dimensions" at Technorati
  • Bookmark "Analytic - Data Densification  (sparse to dense data) - Preservation of the dimensions" at Live Bookmarks
  • Bookmark "Analytic - Data Densification  (sparse to dense data) - Preservation of the dimensions" at Yahoo! Myweb
  • Bookmark "Analytic - Data Densification  (sparse to dense data) - Preservation of the dimensions" at Facebook
  • Bookmark "Analytic - Data Densification  (sparse to dense data) - Preservation of the dimensions" at Yahoo! Bookmarks
  • Bookmark "Analytic - Data Densification  (sparse to dense data) - Preservation of the dimensions" at Twitter
  • Bookmark "Analytic - Data Densification  (sparse to dense data) - Preservation of the dimensions" at myAOL
olap/data_densification_sparse_to_dense.txt ยท Last modified: 2017/04/18 15:21 by gerardnico