Data Warehousing Concepts II

In continuation to the previous posting, Data Warehousing Concepts I, here are the topics related to high level information in building a data warehouse covered in this post.

Concepts Covered In This Course

1. Star Schema
2. Snowflake Schema
3. Data Integrity
4. What is OLAP, MOLAP, ROLAP, and HOLAP?


Star Schema

This is the simplest form of a data warehouse schema. In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables). Each dimension is represented as a single table. The primary key in each dimension table is related to a foreign key in the fact table. The entity relationship diagram of this schema resembles a star, thus deriving its name.

Clipboard01

Sample Star Schema

All measures in the fact table are related to all the dimensions that the fact table is related to. In other words, they all have the same level of granularity.

A simple star schema design consists of a single fact table. A complex star schema design can have more than one fact table.

Snowflake Schema

The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimensional table. In a snowflake schema, the dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.

Clipboard02

Sample Snowflake Schema

Data Integrity

Data integrity refers to the validity of data – meaning the data is consistent and correct. The building of a data warehouse is dependent on the source having high quality data. If an incorrect data source is fed into a data warehouse, the reports generated from this data warehouse will not make meaningful sense. This highlights the importance of data integrity.

To maintain data integrity, a data warehouse or a data mart has to be enforced with data quality measures at the following levels:

a) Database level:

  • Where the relationship of tables have to be maintained in terms of primary and foreign keys which is termed as referential integrity.
  • Primary key and unique constraints are maintained as needed
  • Handling NULLs
  • Tables containing valid values

b) ETL process: The key process responsible for building a data warehouse with quality data. This process should be well maintained and checked to output quality data.

c) Data access levels: Any unauthorized access to databases, either via ETL process or by manual means should be restricted.

What is OLAP, MOLAP, ROLAP, and HOLAP?

OLAP stands for On-Line Analytical Processing, which is a protocol for processing business data. The purpose of using OLAP is to allow businesses to analyze data in different ways for budgeting, planning, trend analysis, and data warehouse reporting.

An added advantage in OLAP is multidimensional calculations known as MOLAP. MOLAP typically completes in just seconds, a process which would otherwise take a much longer time. Apart for quick multidimensional calculations, another major benefit of this type of OLAP is its consistency in calculations. The reporting results help higher management know what to look for and where, which is very helpful while comparing current reports with past and projected future reports. There are also some disadvantages, such as:

  • The amount of data that can be handled is limited
  • The cube technology used to derive the information is an additional investment

Another feature in OLAP is Relational OLAP known as ROLAP. This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP’s slicing and dicing functionality. Advantages of using ROLAP include handling large amounts of data and leveraging the functionalities inherent in the relational database. ROLAP also has some disadvantages:

  • Performance wise, since each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
  • SQL functionality limitation:  ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs.

A combination of MOLAP and ROLAP is called HOLAP (Hybrid OLAP) which can use cube technology for faster performance when summary information is needed and can also drill down into the cube to the underlying relational data for detailed information.

References Used:

Link 1