Data Warehousing Concepts I

Update: A follow up to this post has been written. Read the Data Warehousing Concepts II post.

This post will provide you with high level information about building a data warehouse (DW) and the associated maintenance. The main focus will be on the concepts around implementing a data warehouse. A future blog will provide information about some more advanced data warehousing concepts.

Concepts Covered In This Post

  1. Dimensional Data Model
  2. Slowly Changing Dimension
  3. Conceptual Data Model
  4. Logical Data Model
  5. Physical Data Model
  6. Conceptual, Logical, and Physical Data Model

Concept Covered In A Future Posting

  1. Star Schema
  2. Snowflake Schema
  3. Data Integrity
  4. What is OLAP
  5. MOLAP, ROLAP, and HOLAP
  6. Bill Inmon vs. Ralph Kimball

Dimensional Data Model

A dimensional data model is the most often used model in data warehousing systems. This is stored in a different form than a 3rd normal form data model. Here is some of the commonly used terminology within this type of data modeling:

  • Dimension: A category of information. For example, the Time dimension.
  • Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
  • Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
  • Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: date, store, and sales amount.
  • Lookup Table: The lookup table provides detailed information about the individual attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specifies how that particular quarter is represented on a report (for example, first quarter of 2001 may be represented as “Q1 2001” or “2001 Q1”).

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Star and Snowflake are the most commonly used schemas in designing a data warehouse. The one chosen for implementation depends on the organization’s requirements.

Slowly Changing Dimension (SCD)

This is a common problem in maintaining a data warehouse where the attribute for a record varies with time. There are three types of SCDs.

  • Type 1: The new record replaces the original record. No trace of the old record exists.
  • Type 2: A new record is added into the customer dimension table. Therefore, the customer is essentially treated as two people.
  • Type 3: The original record is modified to reflect the change.

Conceptual Data Model

A conceptual data model identifies the highest-level relationships between the different entities. Features of conceptual data model include:

  • Includes the important entities and the relationships among them.
  • No attribute is specified.
  • No primary key is specified.

The only information shown via the conceptual data model is the entities that describe the data and the relationships between those entities. No other information is shown through the conceptual data model.

Logical Data Model

A logical data model describes the data in as much detail as possible, without regard to how they will be physically implemented in the database. Features of a logical data model include:

  • Includes all entities and relationships among them.
  • All attributes for each entity are specified.
  • The primary key for each entity is specified.
  • Foreign keys (keys identifying the relationship between different entities) are specified.
  • Normalization occurs at this level.

The steps for designing the logical data model are as follows:

  1. Specify primary keys for all entities.
  2. Find the relationships between different entities.
  3. Find all attributes for each entity.
  4. Resolve many-to-many relationships.
  5. Normalization.

Comparing the logical data model above with the conceptual data model diagram, we see the following differences between the two:

  • In a logical data model primary keys are present. In a conceptual data model no primary key is present.
  • In a logical data model all attributes are specified within an entity. No attributes are specified in a conceptual data model.
  • Relationships between entities are specified using primary keys and foreign keys in a logical data model. In a conceptual data model, the relationships are simply stated, not specified, so we simply know that two entities are related, but we do not specify what attributes are used for this relationship.

Physical Data Model

Physical data model represents how the model will be built in the database. A physical database model shows all table structures, including column name, column data type, column constraints, primary key, foreign key, and relationships between tables. Features of a physical data model include:

  • Specification of all tables and columns.
  • Foreign keys are used to identify relationships between tables.
  • Denormalization may occur based on user requirements.
  • Physical considerations may cause the physical data model to be quite different from the logical data model.
  • Physical data model will be different for different RDBMS. For example, data type for a column may be different between MySQL and SQL Server.

The steps for designing the physical data model are as follows:

  1. Convert entities into tables.
  2. Convert relationships into foreign keys.
  3. Convert attributes into columns.
  4. Modify the physical data model based on physical constraints / requirements.

Comparing the logical data model above with the physical data model diagram, we see the following differences between the two:

  • Entity names are now table names.
  • Attributes are now column names.
  • Data type for each column is specified. Data types can be different depending on the actual database being used

Conceptual, Logical, and Physical Data Models

The three levels of data modeling: conceptual data model, logical data model, and physical data model were discussed in the prior sections. Here we compare these three types of data models. The table below compares the different features:

Feature Conceptual Logical Physical
Entity Names
Entity Relationships
Attributes
Primary Keys
Foreign Keys
Table Names
Column Names
Column Data Types

 

We can see that the complexity increases from conceptual to logical to physical. This is why we always first start with the conceptual data model (so we understand at a high level what are the different entities in our data and how they relate to one another), then move on to the logical data model (so we understand the details of our data without worrying about how they will actually implemented), and finally the physical data model (so we know exactly how to implement our data model in the database of choice). In a data warehousing project, sometimes the conceptual data model and the logical data model are considered as a single deliverable.

References Used:

Link 1