Database - Data Densification (sparse to dense data)
About
You may want to view the data in dense form, with rows for all combination of dimension values displayed even when no fact data exist for them.
The densification is also :
- known as the preservation of the dimensions
- describe as show me the null values
For example, if a product did not sell during a particular time period, you may still want to see the product for that time period with zero sales or null value next to it.
Moreover, time series calculations can be performed most easily when data is dense along the time dimension.
This is because dense data will fill a consistent number of rows for each period, which in turn makes it simple to use the analytic windowing functions with physical offsets.
Data densification is the process of converting sparse data into dense form.
Articles Related
Sparse data to densify
A typical situation in the sh schema with a sparse dimension is shown in the following example, which computes the weekly sales and year-to-date sales for the product Bounce for weeks 20-30 in 2000 and 2001:
SELECT SUBSTR(p.Prod_Name,1,15) Product_Name, t.Calendar_Year Year, t.Calendar_Week_Number Week, SUM(Amount_Sold) Sales FROM Sales s, Times t, Products p WHERE s.Time_id = t.Time_id AND s.Prod_id = p.Prod_id AND p.Prod_name IN ('Bounce') AND t.Calendar_Year IN (2000,2001) AND t.Calendar_Week_Number BETWEEN 20 AND 30 GROUP BY p.Prod_Name, t.Calendar_Year, t.Calendar_Week_Number ORDER BY product_name, year, week ASC
PRODUCT_NAME YEAR WEEK SALES --------------- ---------- ---------- ---------- Bounce 2000 20 801 Bounce 2000 21 4062.24 Bounce 2000 22 2043.16 Bounce 2000 23 2731.14 Bounce 2000 24 4419.36 ---- Miss one row for the week 25 ---- Miss one row for the week 26 Bounce 2000 27 2297.29 Bounce 2000 28 1443.13 Bounce 2000 29 1927.38 Bounce 2000 30 1927.38 Bounce 2001 20 1483.3 Bounce 2001 21 4184.49 Bounce 2001 22 2609.19 Bounce 2001 23 1416.95 Bounce 2001 24 3149.62 Bounce 2001 25 2645.98 ---- Miss one row for the week 26 Bounce 2001 27 2125.12 ---- Miss one row for the week 28 Bounce 2001 29 2467.92 Bounce 2001 30 2620.17
In this example, we would expect 22 rows of data (11 weeks each from 2 years) if the data were dense. However we get only 18 rows because weeks 25 and 26 are missing in 2000, and weeks 26 and 28 in 2001.
Densification : A right outer join : the first step
To overcome the problem of the sparsity, the first idea is to perform an outer join with the time table to fill the gap
SELECT Product_Name, t.Year, t.Week, NVL(Sales,0) dense_sales FROM (SELECT SUBSTR(p.Prod_Name,1,15) Product_Name, t.Calendar_Year Year, t.Calendar_Week_Number Week, SUM(Amount_Sold) Sales FROM Sales s, Times t, Products p WHERE s.Time_id = t.Time_id AND s.Prod_id = p.Prod_id AND p.Prod_name IN ('Bounce') AND t.Calendar_Year IN (2000,2001) AND t.Calendar_Week_Number BETWEEN 20 AND 30 GROUP BY p.Prod_Name, t.Calendar_Year, t.Calendar_Week_Number) v RIGHT OUTER JOIN (SELECT DISTINCT Calendar_Week_Number Week, Calendar_Year Year FROM Times WHERE Calendar_Year IN (2000, 2001) AND Calendar_Week_Number BETWEEN 20 AND 30) t ON (v.week = t.week AND v.Year = t.Year) ORDER BY t.year, t.week;
Unfortunately, the product name is missing and we have to go further to find a solution.
Result :
PRODUCT_NAME YEAR WEEK DENSE_SALES
--------------- ---------- ---------- -----------
Bounce 2000 20 1602
Bounce 2000 21 8124.48
Bounce 2000 22 4086.32
Bounce 2000 23 5462.28
Bounce 2000 24 8838.72
2000 25 0
2000 26 0
Bounce 2000 27 4594.58
Bounce 2000 28 2886.26
Bounce 2000 29 3854.76
Bounce 2000 30 3854.76
Bounce 2001 20 2966.6
Bounce 2001 21 8368.98
Bounce 2001 22 5218.38
Bounce 2001 23 2833.9
Bounce 2001 24 6299.24
Bounce 2001 25 5291.96
2001 26 0
Bounce 2001 27 4250.24
2001 28 0
Bounce 2001 29 4935.84
Bounce 2001 30 5240.34
Densification with partitioned outer join
To overcome the problem of sparsity and to fill the product column, Oracle recommend to use a partitioned outer join.
During a partition outer join, the database :
- partitions the rows in your query based on the expression you specify in the PARTITION BY clause (in our example by product name)
- then apply the outer join on each logical partition defined in the query.
- and retrieve the result by performing an UNION of the outer joins of each of the partitions in the logically partitioned table on the other side of the join.
SELECT Product_Name, t.Year, t.Week, NVL(Sales,0) dense_sales FROM (SELECT SUBSTR(p.Prod_Name,1,15) Product_Name, t.Calendar_Year Year, t.Calendar_Week_Number Week, SUM(Amount_Sold) Sales FROM Sales s, Times t, Products p WHERE s.Time_id = t.Time_id AND s.Prod_id = p.Prod_id AND p.Prod_name IN ('Bounce') AND t.Calendar_Year IN (2000,2001) AND t.Calendar_Week_Number BETWEEN 20 AND 30 GROUP BY p.Prod_Name, t.Calendar_Year, t.Calendar_Week_Number) v PARTITION BY (v.Product_Name) RIGHT OUTER JOIN (SELECT DISTINCT Calendar_Week_Number Week, Calendar_Year Year FROM Times WHERE Calendar_Year IN (2000, 2001) AND Calendar_Week_Number BETWEEN 20 AND 30) t ON (v.week = t.week AND v.Year = t.Year) ORDER BY t.year, t.week;
The result :
PRODUCT_NAME YEAR WEEK DENSE_SALES --------------- ---------- ---------- ----------- Bounce 2000 20 1602 Bounce 2000 21 8124.48 Bounce 2000 22 4086.32 Bounce 2000 23 5462.28 Bounce 2000 24 8838.72 Bounce 2000 25 0 Bounce 2000 26 0 Bounce 2000 27 4594.58 Bounce 2000 28 2886.26 Bounce 2000 29 3854.76 Bounce 2000 30 3854.76 Bounce 2001 20 2966.6 Bounce 2001 21 8368.98 Bounce 2001 22 5218.38 Bounce 2001 23 2833.9 Bounce 2001 24 6299.24 Bounce 2001 25 5291.96 Bounce 2001 26 0 Bounce 2001 27 4250.24 Bounce 2001 28 0 Bounce 2001 29 4935.84 Bounce 2001 30 5240.34
Nevertheless, this clause can not be always available and we can see a partition outer join as a outer join between the sparse data and a cross join data set of the dimension. See the following paragraph for an example.
Densification with cross join and outer join
A cross join between the dimensions is always dense. The idea is therefore to perform it first and then outer join it with the sparse data.
First, the cross join :
SELECT product_name, week, year FROM (SELECT DISTINCT Calendar_Week_Number Week, Calendar_Year Year FROM Times WHERE Calendar_Year IN (2000, 2001) AND Calendar_Week_Number BETWEEN 20 AND 30) t, (SELECT SUBSTR(p.Prod_Name,1,15) Product_Name FROM Products p WHERE p.Prod_name IN ('Bounce'))
The result :
PRODUCT_NAME WEEK YEAR --------------- ---------- ---------- Bounce 29 2000 Bounce 21 2001 Bounce 26 2001 Bounce 29 2001 Bounce 30 2001 Bounce 20 2001 Bounce 24 2001 Bounce 23 2000 Bounce 27 2000 Bounce 30 2000 Bounce 28 2001 Bounce 25 2000 Bounce 26 2000 Bounce 22 2001 Bounce 25 2001 Bounce 27 2001 Bounce 21 2000 Bounce 22 2000 Bounce 23 2001 Bounce 20 2000 Bounce 24 2000 Bounce 28 2000
And then an outer join to retrieve the amount sold.
SELECT cross_join.Product_Name, cross_join.Year, cross_join.Week, NVL(data_to_densify.Sales,0) dense_sales FROM (SELECT SUBSTR(p.Prod_Name,1,15) Product_Name, t.Calendar_Year Year, t.Calendar_Week_Number Week, SUM(Amount_Sold) Sales FROM Sales s, Times t, Products p WHERE s.Time_id = t.Time_id AND s.Prod_id = p.Prod_id AND p.Prod_name IN ('Bounce') AND t.Calendar_Year IN (2000,2001) AND t.Calendar_Week_Number BETWEEN 20 AND 30 GROUP BY p.Prod_Name, t.Calendar_Year, t.Calendar_Week_Number) data_to_densify RIGHT OUTER JOIN ( SELECT week, year, product_name FROM (SELECT DISTINCT Calendar_Week_Number Week, Calendar_Year Year FROM Times WHERE Calendar_Year IN (2000, 2001) AND Calendar_Week_Number BETWEEN 20 AND 30) t, (SELECT SUBSTR(p.Prod_Name,1,15) Product_Name FROM Products p WHERE p.Prod_name IN ('Bounce') ) p ) cross_join ON ( data_to_densify.product_Name = cross_join.Product_Name AND data_to_densify.week = cross_join.week AND data_to_densify.Year = cross_join.Year ) ORDER BY year, week
The result :
PRODUCT_NAME YEAR WEEK DENSE_SALES --------------- ---------- ---------- ----------- Bounce 2000 20 1602 Bounce 2000 21 8124.48 Bounce 2000 22 4086.32 Bounce 2000 23 5462.28 Bounce 2000 24 8838.72 Bounce 2000 25 0 Bounce 2000 26 0 Bounce 2000 27 4594.58 Bounce 2000 28 2886.26 Bounce 2000 29 3854.76 Bounce 2000 30 3854.76 Bounce 2001 20 2966.6 Bounce 2001 21 8368.98 Bounce 2001 22 5218.38 Bounce 2001 23 2833.9 Bounce 2001 24 6299.24 Bounce 2001 25 5291.96 Bounce 2001 26 0 Bounce 2001 27 4250.24 Bounce 2001 28 0 Bounce 2001 29 4935.84 Bounce 2001 30 5240.34