Data Warehouse - Staging Area

1 - About

The first destination of the data that has been extracted from source is the staging area. Sometimes staging areas are also called landing zones for flat files, XML files, Cobol files and the like.

This logical layer:

  • acts as a temporary storage area for data manipulation before it enters the data Warehouse
  • and serves to isolate the rate at which data is received into the data warehouse from the frequency at which data is refreshed and made available to the end user.

It is also possible that in some implementations this layer is not necessary, as all data transformation processing will be done “on the fly” as data is extracted from the source system before it is inserted directly into the Operational data layer (Foundation).

3 - Implementation

3.1 - Data Structure

The most basic approach for the staging layer is to have it be an identical schema (a replicate) of the data source:

  • with the data that we need.
  • and may be with some structural changes to the tables, such as range partitioning

3.2 - Process

INDEPENDENT TABLES / FILES that “arrive” when the data is ready on the source.

By having them be independent, (no foreign keys, no referential integrity, no lookups, no matches, no checks of any kind of ) the processes are optimized to scale at high speed by adding no overhead to the data load process.

Truncate and re-load makes the processes completely restartable and fast.

The whole point to staging batch data is:

  1. Data Ready
  2. Logon to source
  3. Get Data in parallel process as fast as possible
  4. Put data in staging area as fast as possible
  5. Logout of source.

3.2.1 - Quality / Data Type

The staging area may also the layer where rejected data for data type reasons are retained. If we report directly on this layer, we might see data quality problems.

4 - Documentation / Reference

data/warehouse/staging_area.txt · Last modified: 2017/12/14 10:48 by gerardnico