Prepared for:
Prepared by:
October 7, 2002
Table of Contents
1. INTRODUCTION
   1.1. Purpose of Document
   1.2. Background
2. HIGH-LEVEL ETL ARCHITECTURE
3. DETAILED ETL ARCHITECTURE
   3.1. ETL Mapping tables
   3.2. Instructional_Course_Section Dimension
   3.3. Student Dimension
   3.4. Section_Fact_Table
4. UNIT AND STRING TESTING RESULTS
5. OPERATIONAL PROCESS FLOW
   5.1.
   5.2.
1. Introduction
1.1. Purpose of Document
The purpose of this document is to detail the extract, transformation, and load (ETL) design and operational flow of the processes required to load course and student information into the Course Analytics data mart. It covers the design and operation of the ETL processes that load current registration cycle-point information, as follows:
- High-level overview of the ETL architecture that supports loading current information into the Course Analytics data mart
- Detailed ETL architecture that identifies data sources, transformations, and procedures
- Unit and string testing results
- Operational process flow
1.2. Background
The Ohio State University engaged Covansys to manage the Course Analytics Data Warehouse Pilot project. As part of this project, data from existing OSU systems needed to be extracted, transformed, and loaded into the Course Analytics data mart for both ongoing (current) course registration cycles and previous (historical) course registration cycles. During the analysis of business requirements and information needs, it became clear that no single source of data could provide both current and historical course analytics information. The decision was made to initially design and develop the ETL processes that load the current course registration cycle information into the Course Analytics data mart. The historical ETL process would be developed later in the project if time permitted.
[Figure: High-level ETL architecture. Diagram labels: ES-OURDW (Windows 2000); Data Warehouse Staging Area; Informatica (shown twice); Course Analytics Data Mart; GEORGE (Windows 2000).]
Student
   Stdnt_Key: int
   Stdnt_SSN: char(9)
   Stdnt_Old_SSN: char(9)
   Stdnt_Gndr_Cd: char(1)
   Stdnt_Rpt_Ethncy_Cd: char(1)
   Stdnt_Rpt_Ethncy_Shrt_Desc: char(16)
   Stdnt_Rpt_Ethncy_Lng_Desc: varchar(30)
   Stdnt_Brth_Dt: datetime
   Stdnt_PT_FT_Cd: char(1)
   Stdnt_Rnk: char(1)
   Stdnt_Rpt_Rnk: char(2)
   Stdnt_Rpt_Rnk_Desc: varchar(33)
   Stdnt_Rpt_Cls_Cd: char(1)
   Stdnt_Rpt_Cls_Desc: char(20)
   Stdnt_Enroll_Proj_Rnk: char(2)
   Stdnt_Lev_Cd: char(1)
   Stdnt_Hon_Flg: char(1)
   Stdnt_Athlt_Flg: char(1)
   Stdnt_Schlr_Flg: char(1)
   Stdnt_Mult_Maj_Qty: tinyint
   Stdnt_Enroll_Stat_Cd: char(1)
   Stdnt_Enroll_Stat_Desc: varchar(35)
   Stdnt_Fee_Paid_Flg: char(1)
   Stdnt_OSU_Qtrs: smallint
   Stdnt_Cum_Hrs: smallint
   Stdnt_Cum_Pnts: decimal(5,1)
   Stdnt_Cum_GPA: decimal(3,2)
   Stdnt_Qtr_Hrs: smallint
   Stdnt_Qtr_Pnts: decimal(5,1)
   Stdnt_Qtr_GPA: decimal(3,2)
   Stdnt_Atmpt_Crse_Hrs: smallint
   Stdnt_Fail_Crse_Hrs: smallint
   Stdnt_Earn_Hrs: smallint
   Stdnt_Maj_Cd: char(3)
   Stdnt_Maj_Abbrv: varchar(8)
   Stdnt_Maj_VP_Coll_Num: char(2)
   Stdnt_Maj_Coll_Cd: varchar(3)
   Stdnt_Maj_Coll_Nam: varchar(25)
   Stdnt_Maj_Coll_Fisc_Unt_Cd: char(4)
   Stdnt_Maj_Coll_Fisc_Unt_Desc: varchar(25)
   Stdnt_Oth_Declrtn_Typ: char(1)
   Stdnt_Oth_Declrtn_Typ_Desc: varchar(20)
   Stdnt_Oth_Declrtn_Cd: char(3)
   Stdnt_Oth_Declrtn_Abbrv: varchar(8)
   Stdnt_Oth_Declrtn_VP_Coll_Num: numeric(2)
   Stdnt_Oth_Declrtn_Coll_Cd: char(3)
   Stdnt_Oth_Declrtn_Coll_Nam: varchar(25)
   Stdnt_Oth_Declrtn_Coll_Fisc_Unt_Cd: char(4)
   Stdnt_Oth_Declrtn_Coll_Fisc_Unt_Desc: varchar(25)
   Eff_Beg_Dt: datetime
   Eff_End_Dt: datetime

Section_Fact_Table
   Reg_Time_Key: int
   Enroll_Coll_Key: int
   Instrnl_Crse_Sect_Key: int
   Stdnt_Key: int
   Stdnt_Enroll_Qty: smallint
   Crse_Wtlst_Dmd: smallint
   Grd_Pnts: decimal(3,1)
   Cred_Hrs: smallint
   Ltr_Grd: char(2)
   Qual_Pnt: decimal(3,1)

Enrollment_College
   Enroll_Coll_Key: int
   Enroll_Cmps_Num: char(1)
   Enroll_Cmps_Abbrv: char(3)
   Enroll_Cmps_Nam: varchar(20)
   Enroll_Coll_Cd: char(3)
   Enroll_Coll_Nam: varchar(25)
   Enroll_Secnd_Coll_Cd: char(3)
   Enroll_Secnd_Coll_Nam: varchar(25)
   Eff_Beg_Dt: datetime
   Eff_End_Dt: datetime
Extract, Transformation, and Load (ETL) Design Document

There are four dimension tables and one fact table.

The Instructional_Course_Section table represents the hierarchy of the instructional side of the university. It includes the VP college, campus, fiscal unit, academic unit, course, section, and instructor information associated with a university instructional offering. The Student table describes the student's enrollment, majors, minors, academic performance and progress, etc. The Enrollment_College table represents all possible combinations of primary and secondary colleges within a university college. The Registration_Cycle table is the directory of university registration cycle-points that occur during a university academic period over a period of years.

The Section_Fact table represents a student's enrollment and performance within a specific course/section. The dimension tables provide specific reference to a given student's enrollment based on the instructional college hierarchy, the enrollment college, and a specific point within the registration cycle of an academic period. Information contained in the Section_Fact table includes an indication of enrollment, waitlist, grade points achieved, credit hours earned, and letter grade achieved. Because waitlist information is recorded at the course level for a student, waitlist information is associated with a section number that is all zeros. Student enrollment is associated with a valid university section number.

In order to load information into the Course Analytics data mart, six areas must be completed:
1. Registration Cycle Dimension
2. Enrollment College Dimension
3. ETL Mapping tables used for reference and validation
4. Instructional Course/Section Dimension
5. Student Dimension
6. Section_Fact table
The Registration Cycle dimension is a static table and does not change from one load cycle to the next. There is no automated process used to populate this dimension. This dimension will be populated directly using a spreadsheet given by the Office of Enrollment Services. The Enrollment College Dimension was created using all the possible combinations of primary and secondary colleges in the staging area. Any changes for this dimension should be made in the staging area, which will then be propagated to the Enrollment College Dimension. A data steward will have to be assigned to maintain this data in the staging area.
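The way the Enrollment College Dimension is derived can be illustrated with a minimal Python sketch. It reads "all the possible combinations" as a cross product of primary and secondary college codes; if the intent is only the distinct pairs that actually occur in the staging area, the cross product would be replaced by that set of pairs. The function name, the sequential surrogate keys, and the sample college codes are hypothetical and are not taken from the staging area.

```python
from itertools import product


def build_enrollment_college_rows(primary_colleges, secondary_colleges):
    """Return one dimension row per (primary, secondary) college pair.

    Surrogate keys are assigned sequentially starting at 1. That numbering
    scheme is an assumption made for illustration only.
    """
    pairs = product(sorted(primary_colleges), sorted(secondary_colleges))
    rows = []
    for key, (primary, secondary) in enumerate(pairs, start=1):
        rows.append(
            {
                "Enroll_Coll_Key": key,
                "Enroll_Coll_Cd": primary,
                "Enroll_Secnd_Coll_Cd": secondary,
            }
        )
    return rows


# Hypothetical college codes, used only to exercise the function.
rows = build_enrollment_college_rows({"ART", "ENG"}, {"BUS", "EDU", "NON"})
for row in rows:
    print(row)
```

Sorting the inputs keeps the key assignment repeatable from one run to the next, which matters if the dimension is ever rebuilt from the staging data.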
[Figure: ETL mapping tables. Diagram labels, in the order they appear:]
Perm_Academic_Unit_Mf
Perm_Acad_Unit_Map
Campus_Agg_Map
Campus_Map
Fiscal_Dept_Map
Fiscal_Unit_Map
College_Map
College_Map
Qtr_Yr is NOT Null
Rtrim & Ltrim Course Number, Convert Fiscal_Year to Char, Populate Course Funding Indicator
Funded_Courses
Route Courses
Instructional_Course_Section_Staging
Qtr_Yr is Null
[Figure: Instructional_Course_Section staging load. Diagram labels, in the order they appear:]
CSECTION_Current
(CSECTION.Campus < 9 AND NOT CSECTION.Call_Check = 'A') OR (CSECTION.Campus < 9 AND Call_Check = 'A' AND Call NOT IN (SELECT Parent FROM CSECTION))
Extrctn_Year_Qtr = CSection.Year_Qtr
Extraction_Cycle
Join on Year_Qtr
Ltrim and Rtrim Course Number & Course Title; Populate Course_Lvl, Ugrad_Cd, Grad_Cd, & Prof_Cd; Decode Dpt_Fiscal
Join on Campus, College, Dept Number, Course Number
Campus_Map
Crse_Tier_Lev_Map
Instructional_Course_Section_Staging
Populate GEC Flag
Acad. Unit Abbrv Desc., Acad. Unit Sched Desc., Fisc. Unit Desc, VP College Number, VP College Name, Campus Abbrv Desc., Campus Name, GEC Code, Funded Courses
Fiscal_Unit_Map LookUp
GEC_Courses
Funded_Courses
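The section selection condition in the diagram above (campus below 9, and either a check value other than 'A' or an 'A' row whose call number is not the parent of another row) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Informatica mapping itself: rows are modeled as dictionaries, Campus is treated as an integer, and empty Parent values are ignored when building the parent set, which sidesteps the SQL behavior where NOT IN against a list containing NULL matches nothing. The sample rows are hypothetical.

```python
def select_sections(sections):
    """Apply the section selection condition to a list of row dictionaries.

    A row is kept when Campus < 9 and either Call_Check is not 'A', or
    Call_Check is 'A' and the row's Call is not the Parent of any row.
    """
    # Empty or missing Parent values are skipped (an assumption; see lead-in).
    parents = {s["Parent"] for s in sections if s.get("Parent")}
    kept = []
    for s in sections:
        if s["Campus"] >= 9:
            continue
        if s["Call_Check"] != "A" or s["Call"] not in parents:
            kept.append(s)
    return kept


# Hypothetical rows, used only to exercise the predicate.
sections = [
    {"Call": "10001", "Call_Check": "3", "Campus": 1, "Parent": ""},
    {"Call": "10002", "Call_Check": "A", "Campus": 1, "Parent": ""},
    {"Call": "10003", "Call_Check": "A", "Campus": 1, "Parent": ""},
    {"Call": "10004", "Call_Check": "5", "Campus": 1, "Parent": "10003"},
    {"Call": "10005", "Call_Check": "7", "Campus": 9, "Parent": ""},
]
kept_calls = [s["Call"] for s in select_sections(sections)]
print(kept_calls)
```

In the sample, row 10003 is dropped because it has check value 'A' and is the parent of 10004, and row 10005 is dropped because its campus is not below 9.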
[Figure: Instructor information lookups. Diagram labels, in the order they appear:]
Instructor Information
LookUp
CInstructor
Primary Instructor, Secondary Instructor, Primary Section Type (Concat MB_Group 1 & 2 Section Type)
LookUp
Instructional_Course_Section_Staging
Instructional_Course_Section
CInstructor
CSection
Primary Instructor, Primary Section Type (Concat MB_Group 1 & 2 Section Type)
CInstructor
CSection
[Figure: Student dimension staging load. Diagram labels, in the order they appear:]
enrollment
Extrctn_Year_Qtr = enrollment.yyyyq_code
Enrollment_yyyyq_code = yyyyq_code
crse_grade
Extraction_Cycle
Extraction_Cycle
expression: populate level, reported rank, ethnicity, part-time/full-time, projected rank, reported class code, reported class code description, etc.
Filter: option_code is null or option_code != 'R' AND mobility_grp_code is null or mobility_grp_code != '2' AND call_num is not null and call_num not like 99% final_grade not in ('K', 'KM', 'KD', 'E', 'R') AND drop_date is null or drop_date > effective_date (or the 14th day date if 15th Day or EOQ+30; add_date is also checked if 15th Day or EOQ+30)
Extraction_Cycle
personal
ssn_change_event
dept2
Extrcn_Year_Qtr = dept2.enroll_yyyyq_code AND dept2_type_code in ('2', '3', '4', '5', '6', '7', '8', '9')
Extraction_Cycle
Lookup: Stdnt_Hon_Flag, Stdnt_Sch_Flag, Stdnt_Athl_Flag
current_UHS_view
enroll_status_map
current_SPT_view
current_SCH_view
[Figure: Student dimension load, continued. Diagram labels, in the order they appear:]
Lookup Stdnt_Oth_Declrtn_Cd
current_secondary_major
current_minor
College_map
fiscal_unit_map
current_AOI_less_than_900
current_AOI_greater_than_900
current_specialization
Student_Dimension_Staging
Student
College_map
fiscal_unit_map
3.4. Section_Fact_Table
The student enrollment and waitlist information is built from information contained in the Enrollment, Crse_Grade, and Waitlist files. Specific information from these files is extracted and consolidated to determine student enrollment and waitlist in different instructional course sections. Figure 8 describes the process that determines all the information required to pick up the surrogate keys from the corresponding dimensions. This information is stored in the Section_Fact_Staging_Table. Figure 9 describes the process that uses that information to determine the surrogate keys and populate them in the Section_Fact_Table.
[Figure 8: Section_Fact_Staging_Table load. Diagram labels, in the order they appear:]
crse_grade
Filter: WHERE (mobility_grp_code is null or mobility_grp_code != '2') AND call_num is not null AND call_num not like 99% AND (final_grade is null OR final_grade not in ('K', 'KD', 'EM', 'KM')) AND drop_date is null
Filter: include only appropriate TERM records for SU Quarter, and include only records with add_date <= 14th Day and drop_date > 14th Day for Final Fifteenth Day and EOQ+30 cycle points
Rtrim final_grade
grade_map
Extraction_Cycle
expression: calculate grade points, quality points, Stdnt_Enroll_Qty = 1
Aggregate records by ssn, course, call_number. Sum grade points, credit_hours, quality points; calculate avg. grade point
Enrollment
Lookup: permacun_coll_code, secondary_coll_code_abbrev
Extraction_Cycle
secondary_college_mf
Extraction_Cycle
waitlist.request
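The aggregation step in the flow above (group by ssn, course, and call number; sum grade points, credit hours, and quality points; compute an average grade point) can be sketched in Python as shown here. This is an illustrative sketch, not the Informatica Aggregator configuration. The source does not say how the average grade point is calculated, so the sketch assumes a simple mean of grade points over the grouped records; the field names and sample records are also hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


def aggregate_enrollment(records):
    """Group records by (ssn, course, call_num) and total the measures."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["ssn"], r["course"], r["call_num"])].append(r)

    out = []
    for (ssn, course, call_num), rows in sorted(groups.items()):
        grade_points = sum((r["grade_points"] for r in rows), Decimal("0"))
        out.append(
            {
                "ssn": ssn,
                "course": course,
                "call_num": call_num,
                "Grd_Pnts": grade_points,
                "Cred_Hrs": sum(r["credit_hours"] for r in rows),
                "Qual_Pnt": sum((r["quality_points"] for r in rows), Decimal("0")),
                # Assumption: average = mean of grade points in the group.
                "avg_grade_point": grade_points / len(rows),
            }
        )
    return out


# Hypothetical records: two rows for one key, one row for another.
records = [
    {"ssn": "000000001", "course": "MATH151", "call_num": "12345",
     "grade_points": Decimal("4.0"), "credit_hours": 3, "quality_points": Decimal("12.0")},
    {"ssn": "000000001", "course": "MATH151", "call_num": "12345",
     "grade_points": Decimal("3.0"), "credit_hours": 2, "quality_points": Decimal("6.0")},
    {"ssn": "000000002", "course": "ENGL110", "call_num": "54321",
     "grade_points": Decimal("2.0"), "credit_hours": 5, "quality_points": Decimal("10.0")},
]
aggregated = aggregate_enrollment(records)
for row in aggregated:
    print(row)
```

Decimal is used instead of float so the totals stay exact, in keeping with the decimal(3,1) columns of the fact table.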
[Figure 9: Section_Fact_Table load. Diagram labels, in the order they appear:]
Lookup Enroll_Coll_Key
Lookup Instrnl_Crse_Sect_Key
Yes
No
Student
Enrollment_College
Instructional_Course_Section
Section_Fact_Table_Rejects
Section_Fact_Table
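The surrogate key lookup with a reject path can be sketched in Python as follows. This is a hedged illustration of the general pattern, not the actual mapping: the natural-key columns used for each lookup (ssn, a college code pair, and call_num), the dictionary-based lookup tables, and the sample values are all assumptions. The diagram only establishes that keys are looked up against the Student, Enrollment_College, and Instructional_Course_Section dimensions and that rows go either to Section_Fact_Table or to Section_Fact_Table_Rejects.

```python
def resolve_keys(staged_rows, student_keys, college_keys, section_keys):
    """Split staged fact rows into loadable rows and rejects.

    Each *_keys argument maps a natural key to a surrogate key. A row is
    rejected when any of the three lookups finds no match.
    """
    loaded, rejects = [], []
    for row in staged_rows:
        keys = {
            "Stdnt_Key": student_keys.get(row["ssn"]),
            "Enroll_Coll_Key": college_keys.get((row["coll_cd"], row["secnd_coll_cd"])),
            "Instrnl_Crse_Sect_Key": section_keys.get(row["call_num"]),
        }
        if None in keys.values():
            rejects.append(row)
        else:
            loaded.append({**keys, "Cred_Hrs": row["credit_hours"]})
    return loaded, rejects


# Hypothetical lookup tables and staged rows.
student_keys = {"000000001": 101}
college_keys = {("ART", "BUS"): 7}
section_keys = {"12345": 5001}
staged = [
    {"ssn": "000000001", "coll_cd": "ART", "secnd_coll_cd": "BUS",
     "call_num": "12345", "credit_hours": 3},
    {"ssn": "000000001", "coll_cd": "ART", "secnd_coll_cd": "BUS",
     "call_num": "99999", "credit_hours": 5},  # no matching section
]
loaded, rejects = resolve_keys(staged, student_keys, college_keys, section_keys)
print(len(loaded), "loaded;", len(rejects), "rejected")
```

Keeping the rejected rows intact, rather than dropping them, makes it possible to reconcile source and target counts after the load.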
4. Unit and String Testing Results

| Name of Source | Selection Criteria | Source Count | Name of Target | Target Count | Comments |
|---|---|---|---|---|---|
| CSection | Year_Qtr = 20022, Campus < '9', Dpt_Number < '999', Call_Check <> 'A', MB_Group is Null or '1' | 19750 | Instructional_Course_Section | 19750 | |
| CSection/Ccourse | Campus < '9', Year_Qtr = 20022, Dpt_Number < '999', Call_Check <> 'A', MB_Group is Null, Parent = '' | 18056 | Instructional_Course_Section | 18056 | |
| CSection/Ccourse | Campus < '9', Year_Qtr = 20022, Dpt_Number < '999', Call_Check <> 'A', MB_Group is Null, Parent > '' | 1776 | Instructional_Course_Section | 1776 | |
| CSection/Ccourse | Campus < '9', Year_Qtr = 20022, Dpt_Number < '999', Call_Check <> 'A', MB_Group = '1', Parent > '' | 104 | Instructional_Course_Section | 104 | |
| CSection/Ccourse | Campus < '9', Year_Qtr = 20022, Dpt_Number < '999', Call_Check <> 'A', MB_Group = '1', Parent = '' | 55 | Instructional_Course_Section | 55 | |
| Student_Current.enrollment | Enroll_Yyyyq_Code = 20022 | 61,657 | Student | 61,657 | |
| Student_Current.crse_grade | Enroll_Yyyyq_Code = 20022, Call_Num is not Null, Mobility_grp_code <> 2, Final_Grade <> (K, KD, KM, EM), (Drop_Date is Null or Drop_Date > Extrctn_Cycle_Pt_Eff_Dt) | 164,932 | Section_Fact_Table | 163,506 | 1,426 records were rejected because corresponding information was missing from Instructional_Course_Section. Basically, the check digit in CSection was non-numeric (A). |
| Waitlist.Request | | 5,937 | Section_Fact_Table | 5,870 | 67 records were rejected because corresponding information was missing from Instructional_Course_Section. Basically, the check digit in CSection was non-numeric (A). |
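The reject counts reported for the two Section_Fact_Table loads can be reconciled against the source and target counts with a small check like the one below. The counts are the ones stated in the test results; the function name and the dictionary layout are hypothetical.

```python
def rejected_count(source_rows, target_rows):
    """Return how many source rows did not reach the target."""
    if target_rows > source_rows:
        raise ValueError("target has more rows than source")
    return source_rows - target_rows


# Source and target counts as reported in the test results.
results = {
    "Student_Current.crse_grade": rejected_count(164_932, 163_506),
    "Waitlist.Request": rejected_count(5_937, 5_870),
}
print(results)
```

Both differences match the reject counts given in the comments (1,426 and 67), so the source, target, and reject figures for these two loads are consistent with one another.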
5. Operational Process Flow

Step  Name/Description
1. Insert a record in the Extraction_Cycle table
   (The Extraction_Cycle table is joined to the course and student mappings to pull the records for a particular registration cycle. This process truncates the Extraction_Cycle table and inserts one record that contains the proper registration cycle record key for the data to be loaded.)
2. Move all source data to staging database
   (This step moves all data from the source databases to the staging area (cpstage). These tables will be backed up from the staging area when the process is complete.)
3. Run the preliminary updates for mapping (lookup) tables
   (The mapping tables are truncated and current values are loaded into perm_academic_unit_map, campus_map, fiscal_unit_map, and college_map.)
4. Copy staging tables to flat files
   (This step copies relevant tables in cpstage to the DW_Staging_Backups directory on the D: drive of DWETL.)
5. Load Instructional_Course_Section_Staging table
   (This step loads the Instructional_Course_Section_Staging table. The Ccourse_Current and Ccourse_Section tables are used as input. The Extraction_Cycle table is joined to them, all the lookups are performed, and four groups of records are output. The output records are processed through the router, and each output group is then transformed and inserted into the Instructional_Course_Section table.)
6. Load Student_Dimensional_Stage table
   (This step loads the Student_Dimensional_Stage table. This table is identical to the Student table with the exception of the Stdnt_Key, Eff_Beg_Dt, and Eff_End_Dt fields. These fields are generated by the Slowly Changing Dimension load step.)
7. Load Section_Fact_Stage table
   (This step loads the Section_Fact_Stage table. This table has all the fact data, plus the fields needed to do lookups against the dimension tables to populate the dimension key fields.)
8. Load Instructional_Course_Section table
   (This step is called the slowly changing dimension mapping. The Type 2 Dimension/Effective Date Range mapping is used to update the slowly changing dimension table. For each source row with a matching primary key in the target, the Expression compares user-defined source and target columns. If those columns do not match, the Expression marks the row as changed. Each time the Informatica server inserts a changed dimension row, it updates the previous version in the target, using the current date to fill the end date column. A Sequence Generator creates a primary key for each new row to be inserted. The mapping uses the current date to indicate the start of the effective date range and leaves the end date null, which indicates that the new row contains current dimension data.)
9. Load Student table
   (This step uses the Student_Dimension_Stage table as input and performs a Type 2 slowly changing dimension update to the Student table where appropriate.)
10. Load Section_Fact_Table
   (This step loads the Section_Fact_Table in the Course Analytics data mart using the Section_Fact_Stage table as input. Lookups are performed against the various dimension tables to determine the surrogate key values.)
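The Type 2 effective date range behavior described for the dimension loads can be sketched in plain Python as follows. This is a simplified illustration of the general technique, not the mapping that Informatica generates: rows are dictionaries, the natural key and compared columns are passed in by the caller, and the surrogate key comes from a simple counter standing in for the Sequence Generator. The sample student data is hypothetical.

```python
from datetime import date


def apply_type2(target, source_rows, natural_key, compare_cols,
                surrogate_col, today, next_key):
    """Apply a Type 2 update to `target` in place and return the next free key.

    Current rows are those whose Eff_End_Dt is None. A changed source row
    closes the current version (end date = today) and inserts a new version
    with Eff_Beg_Dt = today and Eff_End_Dt = None.
    """
    current = {r[natural_key]: r for r in target if r["Eff_End_Dt"] is None}
    for src in source_rows:
        old = current.get(src[natural_key])
        if old is not None and all(old[c] == src[c] for c in compare_cols):
            continue  # unchanged: nothing to insert or close
        if old is not None:
            old["Eff_End_Dt"] = today  # close the previous version
        new_row = {**src, surrogate_col: next_key,
                   "Eff_Beg_Dt": today, "Eff_End_Dt": None}
        target.append(new_row)
        current[src[natural_key]] = new_row
        next_key += 1
    return next_key


# Hypothetical data: one existing student whose major changes, one new student.
student = [{"Stdnt_Key": 1, "Stdnt_SSN": "000000001", "Stdnt_Maj_Cd": "ABC",
            "Eff_Beg_Dt": date(2002, 1, 1), "Eff_End_Dt": None}]
staged = [{"Stdnt_SSN": "000000001", "Stdnt_Maj_Cd": "XYZ"},
          {"Stdnt_SSN": "000000002", "Stdnt_Maj_Cd": "ABC"}]
next_key = apply_type2(student, staged, "Stdnt_SSN", ["Stdnt_Maj_Cd"],
                       "Stdnt_Key", date(2002, 10, 7), 2)
for row in student:
    print(row)
```

Running the same staged rows through the function a second time inserts nothing, because every staged row then matches its current version in the target.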
5.2. Course Analytics Operational Process
This process is used to load the Course Analytics data mart.
Environment:
- All Informatica Workflows, Sessions, and Mappings exist in the ca_production folder of the etl_repository Informatica Repository (etl_repository DB on DWPROD server).
- Production JCL to submit these tasks exists on US.PANLIB as WPJ148Zn02, where n = 1 to 8 and corresponds to the step number below.
- Job WPJ148ZA02 can be used to execute all 8 steps at once.
- Copies also exist in DW.CNTL.CA.LOADS. The US.PANLIB versions are authoritative.
Step  Name/Description
1. Insert a record in the Extraction_Cycle table (J148Z102)
   Edit JCL: Change // SET VALUE=nnn, where nnn = reg_time_key of current cycle point.
   Execute Workflow s_m_extraction_cycle_load:
   a. s_m_extraction_cycle_load
2. Move all source data to staging database (J148Z202)
   Execute Workflow b_source_to_staging:
   a. s_CCOURSE_CURRENT_landing
   b. s_CSECTION_CURRENT_landing
   c. s_m_source_to_staging_student_current
   d. s_m_source_to_staging_waitlist
   e. s_m_source_to_staging_mapfiles
   f. s_Course_Daily_Lookup_Files
3. Run the preliminary updates for mapping (lookup) tables (J148Z302)
   Execute Workflow b_preliminary_mappings:
   a. s_Mapping_Staging
   b. s_Funded_Courses
   c. s_GEC_Courses
4. Back up staging tables to flat files (J148Z402)
   Edit JCL: Change // SET VALUE=nnn, where nnn = reg_time_key of current cycle point.
   Execute Workflow b_backup_staging_to_files:
   a. s_m_staging_to_files_course_daily
   b. s_m_staging_to_files_mapfiles
   c. s_m_staging_to_files_student_current
   d. s_m_staging_to_files_waitlist
5.