
Data Warehouse Concepts

1. Define Data Warehouse: A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources. A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented, Integrated, Nonvolatile, and Time Variant.

Subject Oriented: Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Integrated: Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile: Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant:

In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.

2. Schemas in Data Warehouses

A schema is a collection of database objects, including tables, views, indexes, and synonyms. There are a variety of ways of arranging schema objects in the schema models designed for data warehousing. One data warehouse schema model is a star schema. The Sales History sample schema (the basis for most of the examples here) uses a star schema. However, there are other schema models that are commonly used for data warehouses. The most prevalent of these is the third normal form (3NF) schema. Additionally, some data warehouse schemas are neither star schemas nor 3NF schemas, but instead share characteristics of both; these are referred to as hybrid schema models. The Oracle9i database is designed to support all data warehouse schemas. Some features may be specific to one schema model (such as the star transformation feature, described in "Using Star Transformation", which is specific to star schemas). However, the vast majority of Oracle's data warehousing features are equally applicable to star schemas, 3NF schemas, and hybrid schemas. Key data warehousing capabilities such as partitioning (including the rolling window load technique), parallelism, materialized views, and analytic SQL are implemented in all schema models. The choice of schema model for a data warehouse should be based on the requirements and preferences of the data warehouse project team. Comparing the merits of the alternative schema models is outside the scope of this guide; instead, this section briefly introduces each schema model and suggests how Oracle can be optimized for those environments.

Third Normal Form

Although this guide primarily uses star schemas in its examples, you can also use third normal form for your data warehouse implementation. Third normal form modeling is a classical relational-database modeling technique that minimizes data redundancy through normalization. When compared to a star schema, a 3NF schema

typically has a larger number of tables due to this normalization process. For example, in Figure 17-1, the orders and order items tables contain information similar to the sales table in the star schema in Figure 17-2. 3NF schemas are typically chosen for large data warehouses, especially environments with significant data-loading requirements that are used to feed data marts and execute long-running queries. The main advantages of 3NF schemas are that they provide a neutral schema design, independent of any application or data-usage considerations, and may require less data transformation than denormalized schemas such as star schemas. Figure 17-1 presents a graphical representation of a third normal form schema.

Figure 17-1 Third Normal Form Schema


Optimizing Third Normal Form Queries

Queries on 3NF schemas are often very complex and involve a large number of tables. The performance of joins between large tables is thus a primary consideration when using 3NF schemas. One particularly important feature for 3NF schemas is partition-wise joins. The largest tables in a 3NF schema should be partitioned to enable partition-wise joins. The most common partitioning technique in these environments is composite range-hash partitioning for the largest tables, with the most common join key chosen as the hash-partitioning key. Parallelism is often heavily utilized in 3NF environments and should typically be enabled.
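As a sketch of how such a table might be partitioned (the orders table, its columns, and the partition boundaries below are illustrative, not taken from the sample schema), composite range-hash partitioning in Oracle looks roughly like this:

CREATE TABLE orders (
  order_id    NUMBER,
  customer_id NUMBER,
  order_date  DATE,
  amount      NUMBER
)
PARTITION BY RANGE (order_date)                    -- range on the date supports rolling window loads
SUBPARTITION BY HASH (customer_id) SUBPARTITIONS 8 -- hash on the most common join key
(
  PARTITION orders_2001_q1 VALUES LESS THAN (TO_DATE('01-APR-2001','DD-MON-YYYY')),
  PARTITION orders_2001_q2 VALUES LESS THAN (TO_DATE('01-JUL-2001','DD-MON-YYYY'))
);

If the table it is most frequently joined to is hash partitioned (or subpartitioned) on the same key, the optimizer can execute the join as a partition-wise join.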

Star Schemas

The star schema is perhaps the simplest data warehouse schema. It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table. The center of the star consists of a large fact table, and the points of the star are the dimension tables. A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table. A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. The cost-based optimizer recognizes star queries and generates efficient execution plans for them. A typical fact table contains keys and measures. For example, in the sh sample schema, the fact table, sales, contains the measures quantity_sold, amount, and cost, and the keys cust_id, time_id, prod_id, channel_id, and promo_id. The dimension tables are customers, times, products, channels, and promotions. The product dimension table, for example, contains information about each product number that appears in the fact table.
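For illustration, a simple star query against the sh schema described above might look like the following sketch; the dimension column names used in the filter and GROUP BY (calendar_year, prod_category, cust_state_province) are assumed here rather than quoted from the schema definition:

SELECT t.calendar_year, p.prod_category, SUM(s.amount) AS total_sales
FROM   sales s, times t, products p, customers c
WHERE  s.time_id = t.time_id                 -- fact joined to each dimension
AND    s.prod_id = p.prod_id                 -- on a primary key / foreign key pair
AND    s.cust_id = c.cust_id
AND    c.cust_state_province = 'CA'          -- filters are applied on the dimensions
GROUP BY t.calendar_year, p.prod_category;   -- the dimensions are never joined to each other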

A star join is a primary key to foreign key join of the dimension tables to a fact table. The main advantages of star schemas are that they provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design; provide highly optimized performance for typical star queries; and are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data warehouse schema contain dimension tables. Star schemas are used for both simple data marts and very large data warehouses. Figure 17-2 presents a graphical representation of a star schema.

Figure 17-2 Star Schema

Snowflake Schemas

The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema. It is called a snowflake schema because the diagram of the schema resembles a snowflake. Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table. For example, a product dimension table in a star schema might be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance. Figure 17-3 presents a graphical representation of a snowflake schema.

Figure 17-3 Snowflake Schema


Note: Oracle Corporation recommends you choose a star schema over a snowflake schema unless you have a clear reason not to.

3. Data Mart

A data mart is a collection of subject areas organized for decision support based on the needs of a given department; it is a data warehouse that is designed for a particular line of business, such as sales, marketing, or finance. In a dependent data mart, the data can be derived from an enterprise-wide data warehouse. In an independent data mart, data can be collected directly from sources.

4. Metadata

Metadata is data that describes data and other structures, such as objects, business rules, and processes. For example, the schema design of a data warehouse is typically stored in a repository as metadata, which is used to generate scripts used to build and populate the data warehouse. A repository contains metadata. Examples include: for data, the definition of a source-to-target transformation that is used to generate and populate the data warehouse; for information, definitions of tables, columns, and associations that are stored inside a relational modeling tool; for business rules, "discount by 10 percent after selling 1,000 items."

5. OLAP

Once we model our data warehouse in the form of a multidimensional data cube, it is necessary to explore the different analytical tools with which to perform complex analysis of the data. These analysis tools are called OLAP (On-Line Analytical Processing) tools. OLAP is mainly used to access data online and to analyze it, and OLAP tools are designed to accomplish such analyses on very large databases.

6. Slicing

Slicing and dicing operations are used for reducing the data cube by one or more dimensions. The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.
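In relational terms, a slice is simply a restriction on a single dimension. A minimal sketch, reusing the sales cube columns from the star schema example above (the calendar_year column is assumed):

SELECT s.prod_id, s.cust_id, SUM(s.amount) AS total_sales
FROM   sales s, times t
WHERE  s.time_id = t.time_id
AND    t.calendar_year = 2001      -- the slice: the time dimension is fixed to one member
GROUP BY s.prod_id, s.cust_id;

The dice operation, described next, adds the same kind of restriction on two or more dimensions at once.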

7. Dicing

The dice operation selects a smaller data cube and analyzes it from different perspectives: it defines a subcube by performing a selection on two or more dimensions.

8. Dimensional Data Model

The dimensional data model is most often used in data warehousing systems. This is different from third normal form, commonly used for transactional (OLTP) systems. As you can imagine, the same data would be stored differently in a dimensional model than in a 3rd normal form model. To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.

Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: a date column, a store column, and a sales amount column.

Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables. In designing data models for data warehouses / data marts, the most commonly used schema types are the Star Schema and the Snowflake Schema.

Whether one uses a star or a snowflake largely depends on personal preference and business needs. Personally, I am partial to snowflakes when there is a business case to analyze the information at that particular level.

Granularity

The first step in designing a fact table is to determine its granularity. By granularity, we mean the lowest level of information that will be stored in the fact table. This involves two steps: determine which dimensions will be included, and determine where along the hierarchy of each dimension the information will be kept. The determining factors usually go back to the requirements.

Which Dimensions To Include

Determining which dimensions to include is usually a straightforward process, because business processes will often dictate clearly what the relevant dimensions are. For example, in an off-line retail world, the dimensions for a sales fact table are usually time, geography, and product. This list, however, is by no means complete for all off-line retailers. A supermarket with a rewards card program, where customers provide some personal information in exchange for a rewards card and the supermarket offers lower prices on certain items to customers who present the card at checkout, will also have the ability to track the customer dimension. Whether the data warehousing system includes the customer dimension will then be a decision that needs to be made.

What Level Within Each Dimension To Include

Determining where along each dimension's hierarchy the information is stored is a bit more tricky. This is where user requirements (both stated and possibly future) play a major role. In the above example, will the supermarket want to do analysis at the hourly level (that is, looking at how certain products sell during different hours of the day)? If so, it makes sense to use 'hour' as the lowest level of granularity in the time dimension. If daily analysis is sufficient, then 'day' can be used as the lowest level of granularity. Since the lower the level of detail, the larger the amount of data in the fact table, the granularity exercise is in essence figuring out the sweet spot in the tradeoff between detailed level of analysis and data storage. Note that sometimes the users will not specify certain requirements, but based on industry knowledge, the data warehousing team may foresee that certain requirements will be forthcoming that may result in the need for additional detail. In such cases, it is prudent for the data

warehousing team to design the fact table such that lower-level information is included. This will avoid possibly needing to re-design the fact table in the future. On the other hand, trying to anticipate all future requirements is an impossible and hence futile exercise, and the data warehousing team needs to fight the urge to dump the lowest level of detail into the data warehouse and include only what is practically needed. Sometimes this can be more of an art than a science, and prior experience becomes invaluable here.

Fact And Fact Table Types

Types of Facts

There are three types of facts:

Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.

Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.

Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns: Date, Store, Product, Sales_Amount. The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.
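Because Sales_Amount is additive, any such roll-up is a plain aggregation. A sketch, assuming the fact table is named sales_fact and uses the columns listed above (the date column is renamed sales_date here only to avoid a reserved word):

SELECT Store, Product, SUM(Sales_Amount) AS weekly_sales
FROM   sales_fact
WHERE  sales_date BETWEEN DATE '2001-01-01' AND DATE '2001-01-07'   -- the 7 days of one week
GROUP BY Store, Product;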

Say we are a bank with a fact table containing the following columns: Date, Account, Current_Balance, Profit_Margin. The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add balances up across all accounts (what is the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add it up at either the account level or the day level.
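A sketch of the distinction, assuming a fact table named account_snapshot with the columns above (the date column again renamed snapshot_date): summing Current_Balance across accounts on a single day is meaningful, while summing it across days for one account is not, so an average or period-end value is used instead.

-- Meaningful: total balance held across all accounts on one day
SELECT SUM(Current_Balance) AS total_balance
FROM   account_snapshot
WHERE  snapshot_date = DATE '2003-01-15';

-- Not meaningful to SUM over time; an average over the month is used instead
SELECT Account, AVG(Current_Balance) AS avg_monthly_balance
FROM   account_snapshot
WHERE  snapshot_date BETWEEN DATE '2003-01-01' AND DATE '2003-01-31'
GROUP BY Account;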

Types of Fact Tables

Based on the above classifications, there are two types of fact tables:

Cumulative: This type of fact table describes what has happened over a period of time. For example, it may describe the total sales by product by store by day. The facts in this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.

Snapshot: This type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Star Schema

In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. Each dimension is represented as a single table. The primary key in each dimension table is related to a foreign key in the fact table.


Sample star schema

All measures in the fact table are related to all the dimensions that the fact table is related to. In other words, they all have the same level of granularity. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table. Let's look at an example: assume our data warehouse keeps store sales data, and the different dimensions are time, store, product, and customer. In this case, the sample star schema figure represents our star schema. The lines between two tables indicate that there is a primary key / foreign key relationship between the two tables. Note that different dimensions are not related to one another.
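A minimal DDL sketch of this star schema (all table and column names below are illustrative) shows each dimension key in the fact table acting as a foreign key to its lookup table:

CREATE TABLE time_dim     (time_id     INT PRIMARY KEY, calendar_date DATE, month_no INT, year_no INT);
CREATE TABLE store_dim    (store_id    INT PRIMARY KEY, store_name VARCHAR(50), region VARCHAR(50));
CREATE TABLE product_dim  (product_id  INT PRIMARY KEY, product_name VARCHAR(50), category VARCHAR(50));
CREATE TABLE customer_dim (customer_id INT PRIMARY KEY, customer_name VARCHAR(50));

CREATE TABLE sales_fact (
  time_id      INT REFERENCES time_dim(time_id),         -- one foreign key per dimension
  store_id     INT REFERENCES store_dim(store_id),
  product_id   INT REFERENCES product_dim(product_id),
  customer_id  INT REFERENCES customer_dim(customer_id),
  sales_amount DECIMAL(10,2)                              -- the measure, at the chosen granularity
);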

Snowflake Schema The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.


Sample snowflake schema

For example, consider a Time dimension that consists of 2 different hierarchies: 1. Year → Month → Day and 2. Week → Day. We will have 4 lookup tables in a snowflake schema: a lookup table for year, a lookup table for month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is then connected to Day. Week is only connected to Day. The sample snowflake schema referenced above illustrates these relationships in the Time dimension. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.
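A DDL sketch of those four lookup tables (names assumed) makes the two hierarchies explicit:

CREATE TABLE year_lookup  (year_id  INT PRIMARY KEY, year_no INT);
CREATE TABLE month_lookup (month_id INT PRIMARY KEY, month_name VARCHAR(20),
                           year_id  INT REFERENCES year_lookup(year_id));    -- Month rolls up to Year
CREATE TABLE week_lookup  (week_id  INT PRIMARY KEY, week_no INT);
CREATE TABLE day_lookup   (day_id   INT PRIMARY KEY, calendar_date DATE,
                           month_id INT REFERENCES month_lookup(month_id),   -- Day rolls up to Month
                           week_id  INT REFERENCES week_lookup(week_id));    -- Day also rolls up to Week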

Slowly Changing Dimensions


The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a nutshell, this applies to cases where the attribute for a record varies over time. We give an example below: Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record: Customer Key 1001 Name Christina State Illinois

At a later date, in January 2003, she moved to Los Angeles, California. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem. There are in general three ways to solve this type of problem, and they are categorized as follows:

Type 1: The new record replaces the original record. No trace of the old record exists.

Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.

Type 3: The original record is modified to reflect the change.

We next take a look at each of the scenarios and what the data model and the data look like for each of them. Finally, we compare and contrast the three alternatives.

Type 1 Slowly Changing Dimension

In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept. In our example, recall we originally have the following table:

Customer Key  Name       State
1001          Christina  Illinois

After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:

Customer Key  Name       State
1001          Christina  California

Advantages: This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

Disadvantages: All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.

Usage: About 50% of the time.

When to use Type 1: Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
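In SQL terms, a Type 1 change is a plain overwrite. A sketch, assuming the customer lookup table is named customer_dim with the columns shown above:

UPDATE customer_dim
SET    state = 'California'        -- the old value 'Illinois' is simply lost
WHERE  customer_key = 1001;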

Type 2 Slowly Changing Dimension

In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key. In our example, recall we originally have the following table:

Customer Key  Name       State
1001          Christina  Illinois

After Christina moved from Illinois to California, we add the new information as a new row into the table:

Customer Key  Name       State
1001          Christina  Illinois
1005          Christina  California

Advantages: This allows us to accurately keep all historical information.

Disadvantages: This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern. This necessarily complicates the ETL process.

Usage: About 50% of the time.

When to use Type 2: Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
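In SQL terms, a Type 2 change is an insert of a new row under a new surrogate key, leaving the old row untouched. A sketch, again assuming a customer_dim table with the columns shown above:

INSERT INTO customer_dim (customer_key, name, state)
VALUES (1005, 'Christina', 'California');   -- 1005 is a new surrogate key; row 1001 still says 'Illinois'

In practice, a Type 2 dimension usually also carries effective-date or current-flag columns so that facts can be tied to the version of the row that was valid when they occurred.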

Type 3 Slowly Changing Dimension

In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active. In our example, recall we originally have the following table:

Customer Key  Name       State
1001          Christina  Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns: Customer Key, Name, Original State, Current State, Effective Date.

After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):

Customer Key  Name       Original State  Current State  Effective Date
1001          Christina  Illinois        California     15-JAN-2003

Advantages: This does not increase the size of the table, since the new information is simply updated in place. This allows us to keep some part of the history.

Disadvantages: Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.

Usage: Type 3 is rarely used in actual practice.

When to use Type 3: Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
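In SQL terms, a Type 3 change updates the paired columns in place. A sketch, assuming the customer_dim table now has the original_state, current_state, and effective_date columns described above:

UPDATE customer_dim
SET    current_state  = 'California',
       effective_date = DATE '2003-01-15'   -- original_state keeps the value 'Illinois'
WHERE  customer_key = 1001;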

Conceptual, Logical, And Physical Data Models

The three levels of data modeling, the conceptual data model, the logical data model, and the physical data model, were discussed in prior sections. Here we compare these three types of data models. The table below compares their different features:


Feature               Conceptual   Logical   Physical
Entity Names          Yes          Yes
Entity Relationships  Yes          Yes
Attributes                         Yes
Primary Keys                       Yes       Yes
Foreign Keys                       Yes       Yes
Table Names                                  Yes
Column Names                                 Yes
Column Data Types                            Yes

Below we show the conceptual, logical, and physical versions of a single data model (the Conceptual Model Design, Logical Model Design, and Physical Model Design diagrams).


We can see that the complexity increases from conceptual to logical to physical. This is why we always start with the conceptual data model (so we understand at a high level what the different entities in our data are and how they relate to one another), then move on to the logical data model (so we understand the details of our data without worrying about how they will actually be implemented), and finally the physical data model (so we know exactly how to implement our data model in the database of choice). In a data warehousing project, sometimes the conceptual data model and the logical data model are considered a single deliverable.

Conceptual Data Model

A conceptual data model identifies the highest-level relationships between the different entities. Features of a conceptual data model include: it includes the important entities and the relationships among them; no attributes are specified; and no primary keys are specified.

The figure below is an example of a conceptual data model.


Conceptual Data Model

From the figure above, we can see that the only information shown via the conceptual data model is the entities that describe the data and the relationships between those entities. No other information is shown through the conceptual data model.

Logical Data Model

A logical data model describes the data in as much detail as possible, without regard to how it will be physically implemented in the database. Features of a logical data model include: all entities and the relationships among them are included; all attributes for each entity are specified; the primary key for each entity is specified; foreign keys (keys identifying the relationship between different entities) are specified; and normalization occurs at this level. The steps for designing the logical data model are as follows: specify primary keys for all entities; find the relationships between different entities; find all attributes for each entity; resolve many-to-many relationships; and perform normalization.


The figure below is an example of a logical data model.

Logical Data Model

Comparing the logical data model shown above with the conceptual data model diagram, we see the main differences between the two: In a logical data model, primary keys are present, whereas in a conceptual data model, no primary key is present. In a logical data model, all attributes are specified within an entity. No attributes are specified in a conceptual data model. Relationships between entities are specified using primary keys and foreign keys in a logical data model. In a conceptual data model, the relationships are simply stated, not specified, so we


simply know that two entities are related, but we do not specify what attributes are used for this relationship.

Physical Data Model

A physical data model represents how the model will be built in the database. A physical database model shows all table structures, including column names, column data types, column constraints, primary keys, foreign keys, and relationships between tables. Features of a physical data model include: all tables and columns are specified; foreign keys are used to identify relationships between tables; and denormalization may occur based on user requirements. Physical considerations may cause the physical data model to be quite different from the logical data model. The physical data model will also differ from one RDBMS to another; for example, the data type for a column may be different between MySQL and SQL Server. The steps for physical data model design are as follows: convert entities into tables, convert relationships into foreign keys, convert attributes into columns, and modify the physical data model based on physical constraints and requirements. The figure below is an example of a physical data model.


Physical Data Model

Comparing the physical data model shown above with the logical data model diagram, we see the main differences between the two: entity names are now table names, attributes are now column names, and the data type for each column is specified. Data types can be different depending on the actual database being used.
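For example, a small fragment of a physical model can be written directly as DDL. The tables and columns below are illustrative, and the data types are exactly the part that would change from one RDBMS to another:

CREATE TABLE customer (
  customer_id   INT          NOT NULL PRIMARY KEY,
  customer_name VARCHAR(100) NOT NULL            -- e.g. VARCHAR2 in Oracle, NVARCHAR in SQL Server
);

CREATE TABLE sales_order (
  order_id    INT  NOT NULL PRIMARY KEY,
  customer_id INT  NOT NULL REFERENCES customer(customer_id),  -- the relationship becomes a foreign key
  order_date  DATE NOT NULL
);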

Data integrity refers to the validity of data, meaning data is consistent and correct. In the data warehousing field, we frequently hear the term "Garbage In, Garbage Out." If there is no data integrity in the data warehouse, any resulting report and analysis will not be useful. In a data warehouse or a data mart, there are three areas where data integrity needs to be enforced: the database level, the ETL process, and the access level.


Database level: We can enforce data integrity at the database level. Common ways of enforcing data integrity include:

Referential integrity: The relationship between the primary key of one table and the foreign key of another table must always be maintained. For example, a primary key cannot be deleted if there is still a foreign key that refers to it.

Primary key / unique constraint: Primary keys and the UNIQUE constraint are used to make sure every row in a table can be uniquely identified.

NOT NULL vs. NULL-able: Columns identified as NOT NULL may not have a NULL value.

Valid values: Only allowed values are permitted in the database. For example, if a column can only hold positive integers, a value of '-1' cannot be allowed.

ETL process: For each step of the ETL process, data integrity checks should be put in place to ensure that source data is the same as the data in the destination. The most common checks include record counts and record sums.

Access level: We need to ensure that data is not altered by any unauthorized means either during the ETL process or in the data warehouse. To do this, there need to be safeguards against unauthorized access to data (including physical access to the servers), as well as logging of all data access history. Data integrity can only be ensured if there is no unauthorized access to the data.
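Several of the database-level checks above can be declared directly in the table definition. A sketch (the order_line and product tables and their columns are illustrative):

CREATE TABLE order_line (
  order_id   INT NOT NULL,                                  -- NOT NULL enforced
  product_id INT NOT NULL REFERENCES product(product_id),   -- referential integrity
  quantity   INT NOT NULL CHECK (quantity > 0),             -- only valid (positive) values allowed
  PRIMARY KEY (order_id, product_id)                        -- every row uniquely identified
);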


OLAP

OLAP stands for On-Line Analytical Processing. The first attempt to provide a definition of OLAP was by Dr. Codd, who proposed 12 rules for OLAP. Later, it was discovered that this particular white paper was sponsored by one of the OLAP tool vendors, thus causing it to lose objectivity. The OLAP Report has proposed the FASMI test: Fast Analysis of Shared Multidimensional Information. For a more detailed description of both Dr. Codd's rules and the FASMI test, please visit The OLAP Report.

For people on the business side, the key feature out of the above list is "Multidimensional": in other words, the ability to analyze metrics in different dimensions such as time, geography, gender, and product. For example, sales for the company are up. What region is most responsible for this increase? Which store in this region is most responsible for the increase? What particular product category or categories contributed the most to the increase? Answering these types of questions in order means that you are performing an OLAP analysis. Depending on the underlying technology used, OLAP can be broadly divided into two different camps: MOLAP and ROLAP. A discussion of the different OLAP types can be found in the MOLAP, ROLAP, and HOLAP section.

MOLAP, ROLAP, And HOLAP

In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

MOLAP

This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

Advantages: Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations. Can perform complex calculations: all calculations have been pre-generated when the cube is created, so complex calculations are not only doable, they return quickly.

Disadvantages: Limited in the amount of data it can handle: because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible, but in that case only summary-level information will be included in the cube itself. Requires additional investment: cube technology is often proprietary and does not already exist in the organization, so to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.


ROLAP

This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.

Advantages: Can handle large amounts of data: the data size limitation of ROLAP technology is the limitation on data size of the underlying relational database; in other words, ROLAP itself places no limitation on data amount. Can leverage functionalities inherent in the relational database: often, relational databases already come with a host of functionalities, and ROLAP technologies, since they sit on top of the relational database, can leverage them.

Disadvantages: Performance can be slow: because each ROLAP report is essentially a SQL query (or multiple SQL queries) against the relational database, the query time can be long if the underlying data size is large. Limited by SQL functionalities: because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building out-of-the-box complex functions into the tool, as well as the ability to allow users to define their own functions.

HOLAP

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.

In the data warehousing field, we often hear discussions about whether a person's or organization's philosophy falls into Bill Inmon's camp or into Ralph Kimball's camp. We describe the difference between the two below.

Bill Inmon's paradigm: The data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in third normal form.

Ralph Kimball's paradigm: The data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.


There is no right or wrong between these two ideas, as they represent different data warehousing philosophies. In reality, the data warehouses in most enterprises are closer to Ralph Kimball's idea. This is because most data warehouses started out as a departmental effort, and hence they originated as data marts. Only when more data marts are built later do they evolve into a data warehouse.

Data Mining
Data Mining refers to using a variety of techniques to identify nuggets of information or decision-making knowledge in databases and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting, and estimation.

REPOSITORY

The place where you store the metadata is called a repository. The more sophisticated your repository, the more complex and detailed metadata you can store in it.

CUBES

Cubes are logical representations of multidimensional data. The edges of the cube contain dimension members, and the body of the cube contains data values.

What is Normalization?

Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they


reduce the amount of space a database consumes and ensure that data is logically stored.

The Normal Forms

The database community has developed a series of guidelines for ensuring that databases are normalized. These are referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) through five (fifth normal form or 5NF). In practical applications, you'll often see 1NF, 2NF, and 3NF along with the occasional 4NF. Fifth normal form is very rarely seen and won't be discussed in this article. Before we begin our discussion of the normal forms, it's important to point out that they are guidelines and guidelines only. Occasionally, it becomes necessary to stray from them to meet practical business requirements. However, when variations take place, it's extremely important to evaluate any possible ramifications they could have on your system and account for possible inconsistencies. That said, let's explore the normal forms.

First Normal Form (1NF)

First normal form (1NF) sets the very basic rules for an organized database: eliminate duplicative columns from the same table, and create separate tables for each group of related data, identifying each row with a unique column or set of columns (the primary key).

Second Normal Form (2NF)

Second normal form (2NF) further addresses the concept of removing duplicative data: meet all the requirements of the first normal form; remove subsets of data that apply to multiple rows of a table and place them in separate tables; and create relationships between these new tables and their predecessors through the use of foreign keys.

Third Normal Form (3NF)

Third normal form (3NF) goes one large step further: meet all the requirements of the second normal form, and remove columns that are not dependent upon the primary key.

Fourth Normal Form (4NF)


Finally, fourth normal form (4NF) has one additional requirement: Meet all the requirements of the third normal form. A relation is in 4NF if it has no multi-valued dependencies.

Remember, these normalization guidelines are cumulative. For a database to be in 2NF, it must first fulfill all the criteria of a 1NF database.

What do these rules mean when contemplating the practical design of a database? It's actually quite simple. The first rule dictates that we must not duplicate data within the same row of a table. Within the database community, this concept is referred to as the atomicity of a table. Tables that comply with this rule are said to be atomic. Let's explore this principle with a classic example: a table within a human resources database that stores the manager-subordinate relationship. For the purposes of our example, we'll impose the business rule that each manager may have one or more subordinates, while each subordinate may have only one manager. Intuitively, when creating a list or spreadsheet to track this information, we might create a table with the following fields: Manager, Subordinate1, Subordinate2, Subordinate3, Subordinate4.

However, recall the first rule imposed by 1NF: eliminate duplicative columns from the same table. Clearly, the Subordinate1-Subordinate4 columns are duplicative. Take a moment and ponder the problems raised by this scenario. If a manager only has one subordinate, the Subordinate2-Subordinate4 columns are simply wasted storage space (a precious database commodity). Furthermore, imagine the case where a manager already has 4 subordinates: what happens if she takes on another employee? The whole table structure would require modification. At this point, a second bright idea usually occurs to database novices: we don't want to have more than one column and we want to allow for a flexible amount of data storage. Let's try something like this: Manager, Subordinates

where the Subordinates field contains multiple entries in the form "Mary, Bill, Joe"


This solution is closer, but it also falls short of the mark. The Subordinates column is still duplicative and non-atomic. What happens when we need to add or remove a subordinate? We need to read and write the entire contents of the table. That's not a big deal in this situation, but what if one manager had one hundred employees? Also, it complicates the process of selecting data from the database in future queries. Here's a table that satisfies the first rule of 1NF: Manager, Subordinate

In this case, each subordinate has a single entry, but managers may have multiple entries. Now, what about the second rule: identify each row with a unique column or set of columns (the primary key)? You might take a look at the table above and suggest the use of the Subordinate column as a primary key. In fact, the Subordinate column is a good candidate for a primary key, since our business rules specify that each subordinate may have only one manager. However, the data that we've chosen to store in our table makes this a less than ideal solution. What happens if we hire another employee named Jim? How do we store his manager-subordinate relationship in the database? It's best to use a truly unique identifier (such as an employee ID) as a primary key. Our final table would look like this: Manager ID, Subordinate ID.
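A DDL sketch of that final 1NF table (names assumed):

CREATE TABLE manager_subordinate (
  manager_id     INT NOT NULL,
  subordinate_id INT NOT NULL PRIMARY KEY   -- each subordinate has exactly one manager,
                                            -- so the subordinate ID can act as the primary key
);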

Second Normal Form (2NF) Now, let's continue our journey and cover the principles of second normal form (2NF). Recall the general requirements of 2NF: Remove subsets of data that apply to multiple rows of a table and place them in separate tables. Create relationships between these new tables and their predecessors through the use of foreign keys. These rules can be summarized in a simple statement: 2NF attempts to reduce the amount of redundant data in a table by extracting it, placing it in new table(s) and creating relationships between those tables. Let's look at an example. Imagine an online store that maintains customer


information in a database. They might have a single table called Customers with the following elements: CustNum, FirstName, LastName, Address, City, State, ZIP.

A brief look at this table reveals a small amount of redundant data. We're storing the "Sea Cliff, NY 11579" and "Miami, FL 33157" entries twice each. Now, that might not seem like too much added storage in our simple example, but imagine the wasted space if we had thousands of rows in our table. Additionally, if the ZIP code for Sea Cliff were to change, we'd need to make that change in many places throughout the database. In a 2NF-compliant database structure, this redundant information is extracted and stored in a separate table. Our new table (let's call it ZIPs) might have the following fields: ZIP, City, State.

If we want to be super-efficient, we can even fill this table in advance -- the post office provides a directory of all valid ZIP codes and their city/state relationships. Surely, you've encountered a situation where this type of database was utilized. Someone taking an order might have asked you for your ZIP code first and then knew the city and state you were calling from. This type of arrangement reduces operator error and increases efficiency. Now that we've removed the duplicative data from the Customers table, we've satisfied the first rule of second normal form. We still need to use a foreign key to tie the two tables together. We'll use the ZIP code (the primary key from the ZIPs table) to create that relationship. Here's our new Customers table: CustNum, FirstName, LastName, Address, ZIP.

We've now minimized the amount of redundant information stored within the database and our structure is in second normal form!
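A DDL sketch of the resulting 2NF structure, with the ZIP code serving as the foreign key (column sizes are assumptions):

CREATE TABLE ZIPs (
  ZIP   CHAR(5) PRIMARY KEY,
  City  VARCHAR(50),
  State CHAR(2)
);

CREATE TABLE Customers (
  CustNum   INT PRIMARY KEY,
  FirstName VARCHAR(50),
  LastName  VARCHAR(50),
  Address   VARCHAR(100),
  ZIP       CHAR(5) REFERENCES ZIPs(ZIP)   -- City and State are now looked up through this key
);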

Third Normal Form (3NF)


There are two basic requirements for a database to be in third normal form: already meet the requirements of both 1NF and 2NF, and remove columns that are not fully dependent upon the primary key.

Imagine that we have a table of widget orders that contains the following attributes: Order Number, Customer Number, Unit Price, Quantity, Total.

Remember, our first requirement is that the table must satisfy the requirements of 1NF and 2NF. Are there any duplicative columns? No. Do we have a primary key? Yes, the order number. Therefore, we satisfy the requirements of 1NF. Are there any subsets of data that apply to multiple rows? No, so we also satisfy the requirements of 2NF. Now, are all of the columns fully dependent upon the primary key? The customer number varies with the order number and it doesn't appear to depend upon any of the other fields. What about the unit price? This field could be dependent upon the customer number in a situation where we charged each customer a set price. However, looking at the data above, it appears we sometimes charge the same customer different prices. Therefore, the unit price is fully dependent upon the order number. The quantity of items also varies from order to order, so we're OK there. What about the total? It looks like we might be in trouble here. The total can be derived by multiplying the unit price by the quantity, therefore it's not fully dependent upon the primary key. We must remove it from the table to comply with the third normal form. Perhaps we use the following attributes: Order Number, Customer Number, Unit Price, Quantity.

Now our table is in 3NF. But, you might ask, what about the total? This is a derived field, and it's best not to store it in the database at all. We can simply compute it "on the fly" when performing database queries. For example, we might have previously used this query to retrieve order numbers and totals:

SELECT OrderNumber, Total
FROM   WidgetOrders

We can now use the following query:


SELECT OrderNumber, UnitPrice * Quantity AS Total
FROM   WidgetOrders

to achieve the same results without violating normalization rules.

De-normalization is the process of attempting to optimize the performance of a database by adding redundant data. It is sometimes necessary because current DBMSs implement the relational model poorly. A true relational DBMS would allow for a fully normalized database at the logical level, while providing physical storage of data that is tuned for high performance. De-normalization is a technique to move from higher to lower normal forms of database modeling in order to speed up database access.

Denormalization of Database! Why?

Only one valid reason exists for denormalizing a relational design: to enhance performance. However, there are several indicators which will help to identify systems and tables which are potential denormalization candidates. These are:

* Many critical queries and reports exist which rely upon data from more than one table. Often these requests need to be processed in an on-line environment.
* Repeating groups exist which need to be processed in a group instead of individually.
* Many calculations need to be applied to one or many columns before queries can be successfully answered.
* Tables need to be accessed in different ways by different users during the same timeframe.
* Many large primary keys exist which are clumsy to query and consume a large amount of DASD when carried as foreign key columns in related tables.
* Certain columns are queried a large percentage of the time. Consider 60% or greater to be a cautionary number flagging denormalization as an option.

