
Integrated Assignment Solution

PERFORMING ETL
PERFORMING MDDM
PERFORMING REPORTING

Performing ETL
Description
As discussed in the chapter titled Basics of Data Integration (Extraction Transformation and Loading),
ETL is the process of transforming the source data (source schema) into a desired target database (target
schema). In our case study, the source data is present partly in MS Access and partly in flat text files.
Our case study involves five tables in all, viz. Time, Assessment, Trainees, Modules and Score. The ETL
process, in accordance with best practices, will be performed in three phases:
1. Source to Backup
2. Backup to Staging
3. Staging to Data Warehouse

Steps
Let us now look into the steps one by one.
(A) Source to Backup
Data from the source is fed directly into Excel spreadsheets without any transformations. Source
files could be in any form, ranging from simple flat text files to complex relational databases. Our
case study deals with two such sources: a flat text file for the Time data and a relational Access source
for the Assessment, Score, Modules and Trainees tables. Let us now go through this process for the
Trainees table in the Access source database.
Steps:
1. Click on the Data Tab from Menu Bar.

2. Choose the From Access option from the ribbon interface.

3. Select the source Access database file (having extension .accdb).

4. Choose the required (Trainees) table from the list.

5. As the last step, select any cell (preferably A1) as the top left point of the table.

At the end of this, we have a spreadsheet that has the Trainees table data as follows:

Similarly we can load the Assessment, Modules and Score tables into separate Excel sheets. The outputs for
these tables are as follows:
Assessment Table:

Modules Table:

Score Table:

Now to load the Time data from the text file source, we can use the text import option available in the form
of the From Text button under the Data tab. Text source files can be in delimited format or in fixed
width format.
1. Delimited Format: Fields are separated using a delimiter character, which therefore cannot appear
within the field data. Characters such as the comma (,) and colon (:) are commonly used to separate
fields, and the end-of-line marker generally separates rows.
2. Fixed Width Format: Each field occupies a fixed number of character positions, aligned and padded
with spaces. Rows are separated using end-of-line markers.

Time data provided to us is in delimited format, the delimiter being a comma (,). With this information, let us try to
load time data into the backup Excel sheet.
Steps:
1. Choose the From Text option in the Data Tab.

2. Browse and select the source file.

3. As mentioned before, the Time source file provided is in the delimited format with
the comma (,) as its delimiter. Therefore choose the Delimited option and click on Next.

4. Next, choose the correct delimiter character, in our case the comma (,), and click on Next.

5. The next step is to choose the data type for each field. Excel provides a General data type
that automatically detects and assigns the appropriate data type. Set the field data types to
General and click on Next.

6. As before, the last step is to select any cell (preferably A1) as the top left cell of the table.

At the end of this, we have a spreadsheet that has the Time table data as follows:

(B) Backup to Staging


In this stage, data from the backup database is transformed and loaded into a staging database.
As before, let us go through the Trainees table's transformation from backup to staging. The target
schema requires three additional columns: EmpName, EmpKey and BU. EmpName is the
concatenation of EmpFirstName, EmpMiddleName and EmpLastName, whereas EmpKey is a
surrogate key taking incremental integer values (1, 2, ..., NoOfRows). The value of the BU field is
determined by EmpNumber: if EmpNumber is less than 100150, the BU (Business Unit) is
SI (Systems Information); otherwise it is TRPU (Training Practice Unit).
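The three derivation rules can be sketched outside Excel as well; the following Python snippet mirrors the same logic (the sample employee rows are invented for illustration):

```python
# Sample backup rows: (EmpNumber, First, Middle, Last) -- illustrative data.
trainees = [
    (100101, "Anil", "K", "Rao"),
    (100200, "Meera", "S", "Nair"),
]

staged = []
for i, (emp_number, first, middle, last) in enumerate(trainees, start=1):
    emp_name = " ".join([first, middle, last])    # EmpName: concatenation
    emp_key = i                                   # EmpKey: surrogate key 1, 2, ...
    bu = "SI" if emp_number < 100150 else "TRPU"  # BU: rule on EmpNumber
    staged.append((emp_key, emp_name, emp_number, bu))

print(staged[0])  # (1, 'Anil K Rao', 100101, 'SI')
print(staged[1])  # (2, 'Meera S Nair', 100200, 'TRPU')
```

The Excel steps below implement exactly these three rules, one derived column at a time.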
Steps:
1. Click on the Data tab in the Menu bar, and then click on the From Other Sources
button in the ribbon interface and choose the From Microsoft Query option.


2. Now that we have transferred the entire dataset into a backup database in Excel, we shall
use this Excel file as source. Choose Excel Files* and click OK.

3. Browse for the back-up Excel file.

4. Choose the desired table (Trainees Table) from the list. In the list box on the right, we can
view the fields in the table and unselect (using the button labeled <) any if required.
However, as we require all fields of the Trainees table, proceed by clicking on Next.


5. The Filter Data window enables selection of those entries in the database that satisfy a
desired condition on a column. As we need the entire dataset from the table, do not add any
conditions. Simply click on Next to proceed.

6. The Sort Order window enables us to sort a dataset based on any column in the dataset.
The dataset is sorted based on the first mentioned column and the remaining columns are
used only in case of clash. It is a good practice to sort data based on the primary key
column, hence let us select an ascending sort on EmpNumber.


7. To complete loading data choose Return Data to Microsoft Excel and click on Finish.


8. As before, select any cell (preferably A1) as the top left point of the table.

9. Now that we have the source dataset, we will add the required new columns. To insert a
new empty column, right click on the column header and choose Insert. This will insert a
column to the left of the current column.

10. As mentioned before, EmpName is a concatenation of three fields. To perform this
concatenation we use the & operator. Go to the first row where the formula is to be
applied (in our case B2) and use the formula

=C2 & " " & D2 & " " & E2

Then rename the column as EmpName.

11. EmpKey is an incremental count and can be derived from the row number. For this we can
use the built-in ROW() function.
To add the column EmpKey, insert an empty column and, in the second row, insert the
formula =ROW(B2)-1 (subtracting 1 to discount the column header row).
Function: ROW()
Syntax: ROW(reference)
Returns: The row number of the given reference

12. The values of the BU column depend on EmpNumber. We will use the IF() function
to derive them.
Function: IF()
Syntax: IF(<condition>, expr1, expr2)
Returns: expr1 if <condition> is TRUE, expr2 if <condition> is FALSE.

To add the column BU, insert an empty column, and in the second row insert the
formula =IF(C2<100150,"SI","TRPU").

At the end of these steps we have the desired Trainees table.

Let us now look into the Assessment, Modules and Time tables. The source and target schemas
for these tables are the same, and hence they require no transformations. Therefore we shall
simply load these tables into Excel sheets (steps 1-7 of the Trainees table). The outputs of these
tables are as follows:
Assessment Table:


Modules Table:

Time Table:


Score Table:
According to the data warehouse schema, we have EmpKey, ModuleKey and AssessmentTypeKey present
in the Score table.
Now, in the Score table of the source, we have EmpId, ModuleName and AssessmentType. The values of
EmpKey, ModuleKey and AssessmentTypeKey need to be extracted from the corresponding tables. This
can be done using the VLOOKUP() function in Excel.
Function: VLOOKUP()
Syntax: VLOOKUP (lookup_value,table_array,col_index_num,[range_lookup])
Returns: The value in the column (col_index_num) of table_array corresponding to lookup_value.
The VLOOKUP function searches the first column of a range (a range is defined as two or more cells
on a sheet) and then returns a value from any cell in the same row of the range. Hence the value
to be looked up must be present in the first column of the table array.
Example:
Consider two tables, Employee and Department:
Employee Table

  EmpKey   EmpName   Dno   Dname
  E101     Rahul
  E102     Shyam

Department Table

  Dno   Dname
        Mech
        I.T.

The field Dname can be fetched into the Employee table using a VLOOKUP function as:
D2=VLOOKUP(C2,Department,2,FALSE)
Here C2 refers to the cell whose value must be looked up (searched for) in the table
array named Department. Department is a name assigned to the table array. 2 refers to the index
number of the column that contains the value to be fetched: the first column of the table array has
index 1, the second has index 2, and so on. Thus 2 refers to the column Dname.
The FALSE option ensures that VLOOKUP returns a value only on an exact match. The TRUE option
returns the closest (approximate) match instead and requires the first column of the table array to be
sorted in ascending order.

Output:

  EmpKey   EmpName   Dno   Dname
  E101     Rahul           Mech
  E102     Shyam           I.T.
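An exact-match VLOOKUP behaves like a dictionary lookup keyed on the first column of the table array. A minimal Python sketch of the example above (the Dno key values 1 and 2 are assumed for illustration; the source snapshot does not show them):

```python
# Department table as a lookup keyed on its first column (Dno).
# The key values 1 and 2 are assumptions for illustration.
department = {1: "Mech", 2: "I.T."}

# Employee rows: (EmpKey, EmpName, Dno).
employees = [("E101", "Rahul", 1), ("E102", "Shyam", 2)]

# Exact-match lookup, like VLOOKUP(..., 2, FALSE): a missing key
# raises KeyError, just as FALSE makes VLOOKUP return #N/A.
result = [(k, n, dno, department[dno]) for k, n, dno in employees]
print(result)
```

The dictionary plays the role of the named table array, and indexing with `department[dno]` corresponds to fetching column 2 on an exact first-column match.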

Steps:
1. Load the data from the Score table of the back-up Excel sheet onto a new sheet in the staging
Excel file, as shown above in steps 1-7. At the end of these steps we have the Score table
as:

2. Now we shall create a table array for the look-up. Go to the staging Trainees Excel sheet and
select the dataset. Remember that for look-up, the first column of the table array must be the
look-up value, which in our case is EmpNumber. Therefore bring EmpNumber to the first
column, which can be done simply by cut and paste. Select the entire dataset and name it
EmployeeDetails in the Name Box (the text box to the left of the formula bar, highlighted in the
snapshot below). Similarly, name the table arrays AssessmentDetails and ModuleDetails in
the respective sheets.


3. Now, to add the column EmpKey, insert an empty column and in the second row apply the
formula:
=VLOOKUP(E2,EmployeeDetails,2,FALSE)
Then rename the column as EmpKey.

4. Next, add the column AssessmentTypeKey, insert an empty column and in the second row
apply the formula:
=VLOOKUP(C2,AssessmentDetails,2,FALSE)
Rename this column as AssessmentTypeKey.
5. Next, add the column ModuleKey, insert an empty column and in the second row apply the
formula:

=VLOOKUP(G2,ModuleDetails,2,FALSE)
Rename this column as ModuleKey.
At the end of these steps, we obtain the Score table as:

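The three look-ups amount to three key-based joins; the following Python sketch mirrors that logic (all key mappings and sample score rows are invented for illustration):

```python
# Lookup tables built from the staging sheets (sample values are illustrative).
emp_keys = {100101: 1, 100200: 2}          # EmpNumber -> EmpKey
assess_keys = {"Test": 1, "Retest": 2}     # AssessmentType -> AssessmentTypeKey
module_keys = {"SQL": 1, "Java": 2}        # ModuleName -> ModuleKey

# Score rows: (EmpNumber, AssessmentType, ModuleName, Percentage).
scores = [(100101, "Test", "SQL", 82), (100200, "Retest", "Java", 65)]

# Replace each natural key with its surrogate key, as the three
# VLOOKUP columns do in the staging sheet.
fact_rows = [
    (emp_keys[e], assess_keys[a], module_keys[m], p)
    for e, a, m, p in scores
]
print(fact_rows)  # [(1, 1, 1, 82), (2, 2, 2, 65)]
```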
This brings us to the end of the second stage of ETL, i.e., from backup to staging.
(C) Staging to Warehouse
In this stage, data from the staging database is cleaned and loaded into the warehouse. Cleaning
refers to the process of excluding incorrect or unwanted data present in the staging table.
For instance, if the warehouse is to hold only data from the past decade, then any older data
should be excluded from the warehouse. Our case study, however, specifies no such
cleaning activity, and the data can be loaded directly into the warehouse. Care must still be
taken to exclude unnecessary columns present in the staging database, such as EmpFirstName,
EmpMiddleName and EmpLastName. This can be done by removing the unnecessary columns in
the Query Wizard's Choose Columns window. Follow the same steps as for the backup-to-staging
transformation of the Trainees table for each of the five tables in the database to create the
warehouse.
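Excluding columns is a simple projection onto the warehouse schema; sketched in Python below (the exact field set is an assumption for illustration):

```python
# A staging Trainees row as a dict (field set is illustrative).
staging_row = {
    "EmpKey": 1, "EmpName": "Anil K Rao", "EmpNumber": 100101, "BU": "SI",
    "EmpFirstName": "Anil", "EmpMiddleName": "K", "EmpLastName": "Rao",
}

# Keep only the warehouse columns, dropping the split name fields --
# the same effect as unticking them in the Choose Columns window.
warehouse_cols = ["EmpKey", "EmpName", "EmpNumber", "BU"]
warehouse_row = {c: staging_row[c] for c in warehouse_cols}
print(warehouse_row)
```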


Similarly unselect any unnecessary fields in the Score table to obtain the final data warehouse.
DimAssessment Table:


DimModule Table:

DimTrainee Table:


DimTime Table:

FactScores Table:


Multi Dimensional Data Modeling (MDDM)

Performing MDDM
Description
Dimensions are the points of view from which the facts can be seen. From the data warehouse that we got
after performing ETL on the source database we can identify the following dimensions:
1. DimTime
2. DimModule
3. DimAssessment
4. DimTrainee

Also, a fact table is the table from which the largest number of other tables in the database can be
reached; in our data warehouse, FactScores is one such table.
Excel cannot form a true multidimensional cube, so we will instead bring all the data needed
for multidimensional analysis onto a single sheet.
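Flattening the star schema onto a single sheet is, in effect, a join from the fact table to each dimension; the following Python sketch shows the idea (all tables and values are invented for illustration):

```python
# Minimal dimensions keyed by their surrogate keys (values are illustrative).
dim_trainee = {1: {"EmpName": "Anil K Rao", "BU": "SI"}}
dim_module = {1: {"ModuleName": "SQL"}}
dim_time = {1: {"Year": 2010, "Quarter": "Q1"}}

# A fact row referencing each dimension plus the measure.
fact = [{"EmpKey": 1, "ModuleKey": 1, "TimeKey": 1, "Percentage": 82}]

# Join the dimension attributes onto each fact row to build the flat cube,
# just as the VLOOKUP columns do on the IPACube sheet.
cube = [{**f, **dim_trainee[f["EmpKey"]], **dim_module[f["ModuleKey"]],
         **dim_time[f["TimeKey"]]} for f in fact]
print(cube[0]["ModuleName"], cube[0]["Year"])  # SQL 2010
```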

Steps for performing MDDM


1. For creating a cube in Excel, first load the data from the FactScores sheet of the warehouse Excel
file onto a new sheet.
2. We need all the relevant data from all the dimensions to make the cube. In our case we need
EmpName, BatchName, Stream and IBU from DimTrainee, ModuleName and ModuleCreditPoints
from DimModule, AssessmentType and Duration from DimAssessment, FullDateAlternateKey,
Year, Quarter, MonthNumber and MonthName from DimTime. These fields can be loaded into the
IPACube sheet using VLOOKUP.
After performing lookup operations on the required columns, the final output of these steps is:

Thus we have got the MS Excel version of a cube. Reports can be made on this cube using a pivot table or a
pivot chart.

Reporting
Performing Reporting
In Excel, reports can be made on pivot tables and pivot charts, which have the ability to summarize large
volumes of complicated data. A pivot element comprises four components:
1. Report Filters: Report filters are used to enable in-depth analysis of large amounts of data in the
pivot element by providing a subset of data. For instance, we can restrict our analysis to a specific
number of products or a specific region or span of time.
2. Row Labels (Axis Fields): Row labels are used to view facts through a dimension, that is, a row
label contains an attribute of any dimension. An attribute is preferred as a row label when its
domain is large (i.e., there is a large number of possible values). For instance, the CustomerName
attribute is generally used as a row label.
3. Column Labels (Legend Fields): Column labels, like row labels, are also used to view facts with
respect to a dimension. A column label is an attribute of a dimension and generally attributes with
smaller domains are used as column labels. For example, Quarter, Month of year, etc. are generally
used as column labels.
4. Values: Values are facts (measures) which can be generally aggregated across one or more
dimensions, for example, Quantity or SalesAmount.
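The four pivot components map onto a filter-group-aggregate pattern; the following Python sketch mimics a pivot with a report filter, a row label and an averaged value (field names and sample data are illustrative):

```python
from collections import defaultdict

# Flat cube rows: (AssessmentType, ModuleName, Percentage) -- sample data.
rows = [("Test", "SQL", 80), ("Test", "SQL", 90),
        ("Retest", "Java", 60), ("Exam", "SQL", 70)]

# Report filter: keep only Test and Retest.
filtered = [r for r in rows if r[0] in {"Test", "Retest"}]

# Row label = ModuleName; value = average of Percentage.
buckets = defaultdict(list)
for _, module, pct in filtered:
    buckets[module].append(pct)
pivot = {m: sum(v) / len(v) for m, v in buckets.items()}
print(pivot)  # {'SQL': 85.0, 'Java': 60.0}
```

This is the shape of Report 1 below: AssessmentType as the filter, ModuleName as the row label, and Percentage averaged as the value.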
Report 1:
The report requires us to create a chart that displays percentage scores of employees for all modules with
assignment type as either Test or Retest. Now to achieve this, we create a pivot table on the Excel cube
sheet IPACube with AssessmentType and EmpName as report filters, ModuleName as row label and
percentage as value.
Steps:
1. Open the Excel cube Sheet IPACube and click on the Insert Tab to find the PivotTable button
in the ribbon interface.


2. Click on PivotTable and select PivotChart.

3. Next, select the table array on which the pivot chart has to be made. In our case, we select the entire
cube sheet as the source (selected by default by Excel). We may also choose to create the pivot
chart in a new worksheet or in the same sheet. For readability we shall go with a new worksheet.

4. This creates an empty pivot sheet in the Excel file. Fields may be dragged into areas such as report
filter, axis fields (row label), legend fields (column label) and values. Note that a field cannot
appear in any other area if it is present as a part of the report filter.

5. Now, as discussed before, drag ModuleName into the Axis Fields (row label) area, Percentage into
the Values area and AssessmentType and EmpName into the Report Filter area.

6. As a percentage is meaningful only when summarized as an average, we now change the Value
field setting to Average. For this, left click on the field in the Value area and click on Value Field
Settings.


7. Choose the calculation type as Average in the Value Field Settings dialog box and click on OK.

8. Our report is required to provide information for the assessment types Test and Retest. Hence in the
AssessmentType report filter on the top left corner of the sheet, check the boxes for Select Multiple
Items, Retest and Test.


At the end of these steps, we obtain the desired chart report:

Report 2:
The report requires us to create a table displaying the module names and their assessment types conducted in a
month of a quarter of a year, with drill-downs active on the calendar hierarchy. We shall create a pivot table
with Year, Quarter, Month and ModuleName as row labels, AssessmentType as the column label and, finally,
percentage summarized as Count of Percentage as the value. The pivot table is created in the same way as the
pivot chart, so follow steps 1-7 to create the table report. Keep in mind that percentage must this
time be summarized as a count, not an average or a sum. Excel provides the drill-down functionality for
the table by default, so we need not handle it explicitly. The final report shows
the following details:
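The grouped count behind this table report can be sketched in Python as follows (the sample cube rows are invented for illustration):

```python
from collections import Counter

# Cube rows: (Year, Quarter, MonthName, ModuleName, AssessmentType) -- sample.
rows = [(2010, "Q1", "Jan", "SQL", "Test"),
        (2010, "Q1", "Jan", "SQL", "Test"),
        (2010, "Q1", "Feb", "Java", "Retest")]

# The row labels form the hierarchy key; AssessmentType is the column label;
# the value is the count of percentage entries (one per cube row).
counts = Counter(((y, q, m, mod), at) for y, q, m, mod, at in rows)
print(counts[((2010, "Q1", "Jan", "SQL"), "Test")])  # 2
```

Drilling down in the pivot table simply narrows which hierarchy keys are displayed; the counts themselves are the same grouped tallies.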

