ETL Concepts
Presented By:
Navneet Aggarwal
© Xchanging 2006, no part of this document may be circulated, quoted or reproduced without prior written approval of Xchanging.
What is ETL?
EXTRACT → TRANSFORM → LOAD
Extract, transform and load (ETL) is a core data integration process in which
data is extracted from one or more chosen sources, transformed into new
formats according to business rules, and loaded into one or more target data
structures.
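As a minimal, hypothetical sketch (the table, field names and rules below are invented for illustration, not taken from any real system), one ETL pass might look like:

```python
import sqlite3

# Hypothetical source rows, standing in for an extract from a source system.
source_rows = [
    {"policy_id": "P001", "premium": "1200.50", "country": "uk"},
    {"policy_id": "P002", "premium": "980.00", "country": "de"},
]

def transform(row):
    # Apply business rules: cast types and normalise codes.
    return (row["policy_id"], float(row["premium"]), row["country"].upper())

# Load into a target structure (an in-memory table here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policy (policy_id TEXT, premium REAL, country TEXT)")
conn.executemany("INSERT INTO policy VALUES (?, ?, ?)",
                 [transform(r) for r in source_rows])
print(conn.execute("SELECT COUNT(*) FROM policy").fetchone()[0])
```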
Data Completeness
One of the most basic tests of data completeness is to verify that all expected data
loads into the data warehouse. This includes validating that all records, all fields
and the full contents of each field are loaded. Strategies to consider include:
- Comparing record counts between source data, data loaded to the warehouse and rejected records.
- Comparing unique values of key fields between source data and data loaded to the warehouse. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
- Utilizing a data profiling tool that shows the range and value distribution of fields in a data set. This can be used during testing and in production to compare source and target data sets and point out any data anomalies from source systems that may be missed even when the data movement is correct.
- Populating the full contents of each field to validate that no truncation occurs at any step in the process. For example, if the source data field is a string(30), make sure to test it with 30 characters.
- Testing the boundaries of each field to find any database limitations. For example, for a decimal(3) field include values of -99 and 999, and for date fields include the entire range of dates expected. Depending on the type of database and how it is indexed, it is possible that the range of values the database accepts is too small.
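The first two strategies above (record-count and key comparisons) can be sketched as SQL checks against hypothetical source, warehouse and reject tables, using SQLite purely for illustration:

```python
import sqlite3

# Hypothetical tables standing in for the source system, the warehouse
# target and the reject log.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id TEXT, amount REAL);
    CREATE TABLE wh_orders  (order_id TEXT, amount REAL);
    CREATE TABLE rejects    (order_id TEXT, reason TEXT);
    INSERT INTO src_orders VALUES ('A1', 10.0), ('A2', 20.0), ('A3', 30.0);
    INSERT INTO wh_orders  VALUES ('A1', 10.0), ('A2', 20.0);
    INSERT INTO rejects    VALUES ('A3', 'bad amount');
""")

def scalar(sql):
    return conn.execute(sql).fetchone()[0]

# 1. Source count should equal loaded count plus rejected count.
assert scalar("SELECT COUNT(*) FROM src_orders") == \
       scalar("SELECT COUNT(*) FROM wh_orders") + scalar("SELECT COUNT(*) FROM rejects")

# 2. Every key loaded to the warehouse should exist in the source.
missing = scalar("""SELECT COUNT(*) FROM wh_orders w
                    WHERE w.order_id NOT IN (SELECT order_id FROM src_orders)""")
assert missing == 0
```

The same two queries can be run in production as a lightweight reconciliation after each load.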
Data transformation
Validating that data is transformed correctly based on business rules can be the most
complex part of testing an ETL application with significant transformation logic. One
typical method is to pick some sample records and "stare and compare" to validate
data transformations manually. This can be useful but requires manual testing steps
and testers who understand the ETL logic. A combination of automated data profiling
and automated data movement validations is a better long-term strategy. Here are
some simple automated data movement techniques:
- Create a spreadsheet of input-data scenarios and expected results and validate these with the business customer. This is a good requirements elicitation exercise during design and can also be used during testing.
- Create test data that includes all scenarios. Enlist the help of an ETL developer to automate the process of populating data sets from the scenario spreadsheet, allowing for flexibility because scenarios will change.
- Utilize data profiling results to compare the range and distribution of values in each field between source and target data.
- Validate parent-to-child relationships in the data. Set up data scenarios that test how orphaned child records are handled.
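A scenario spreadsheet can drive an automated check directly. In this sketch the transformation rule, field names and threshold are all hypothetical stand-ins for real business logic:

```python
import csv
import io

# Hypothetical transformation under test: derive a risk band from an amount.
def risk_band(amount):
    return "HIGH" if amount >= 1000 else "LOW"

# Scenario table as it might be exported from a business-reviewed
# spreadsheet (inline CSV here for the sake of a self-contained example).
scenarios = """amount,expected_band
999,LOW
1000,HIGH
0,LOW
"""

# Replay every scenario against the transformation and flag mismatches.
for row in csv.DictReader(io.StringIO(scenarios)):
    actual = risk_band(float(row["amount"]))
    assert actual == row["expected_band"], (row, actual)
print("all scenarios passed")
```

Because the scenarios live in data rather than in test code, the business customer can add rows without touching the harness.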
Data quality
For the purposes of this discussion, data quality is defined as "how the ETL system
handles data rejection, substitution, correction and notification without modifying
data." To ensure success in testing data quality, include as many data scenarios as
possible. Data quality rules are typically defined during design.
Depending on the data quality rules of the application being tested, scenarios to
test might include null key values, duplicate records in source data and invalid
data types in fields (e.g., alphabetic characters in a decimal field). Review the
detailed test scenarios with business users and technical designers to ensure that
all are on the same page. Data quality rules applied to the data will usually be
invisible to the users once the application is in production; users will only see
what's loaded to the database. For this reason, it is important to ensure that what
is done with invalid data is reported to the users. These data quality reports
present valuable data that sometimes reveals systematic issues with source data.
In some cases, it may be beneficial to populate the "before" data in the database
for users to view.
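A minimal sketch of the reject-and-report idea, assuming invented rules for null keys, duplicate keys and non-numeric values in a decimal field:

```python
# Hypothetical validation pass: load clean rows, reject bad ones with a reason.
def validate(rows):
    loaded, rejected, seen = [], [], set()
    for row in rows:
        if row.get("key") is None:
            rejected.append((row, "null key"))
        elif row["key"] in seen:
            rejected.append((row, "duplicate key"))
        elif not str(row.get("amount", "")).replace(".", "", 1).isdigit():
            rejected.append((row, "non-numeric amount"))
        else:
            seen.add(row["key"])
            loaded.append(row)
    return loaded, rejected

rows = [
    {"key": "K1", "amount": "10.5"},
    {"key": None, "amount": "20"},   # null key
    {"key": "K1", "amount": "30"},   # duplicate key
    {"key": "K2", "amount": "abc"},  # alphabetic characters in a decimal field
]
loaded, rejected = validate(rows)

# The rejection report is what makes invalid data visible to users.
for row, reason in rejected:
    print(row, "->", reason)
```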
Performance and Scalability
- Load the database with peak expected production volumes to ensure that this volume of data can be loaded by the ETL process within the agreed-upon window.
- Compare these ETL loading times to loads performed with a smaller amount of data to anticipate scalability issues. Compare the ETL processing times component by component to point out any areas of weakness.
- Monitor the timing of the reject process and consider how large volumes of rejected data will be handled.
- Validate query performance on large database volumes. Work with business users to develop sample queries and acceptable performance criteria for each query.
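Comparing load times at different volumes can be scripted. This sketch times a plain bulk insert as a stand-in for a real ETL run; in practice the timer would wrap each ETL component so that weak areas show up individually:

```python
import sqlite3
import time

# Time the same "load" at a given volume to spot non-linear scaling.
def timed_load(n_rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
    start = time.perf_counter()
    conn.executemany("INSERT INTO t VALUES (?, ?)",
                     ((i, f"row{i}") for i in range(n_rows)))
    conn.commit()
    return time.perf_counter() - start

small, large = timed_load(10_000), timed_load(100_000)
print(f"10k: {small:.3f}s, 100k: {large:.3f}s, ratio: {large / small:.1f}x")
```

A ratio far above the volume ratio (10x here) would suggest the load does not scale linearly and needs investigation before production volumes arrive.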
Integration Testing
Typically, system testing only includes testing within the ETL application. The
endpoints for system testing are the input and output of the ETL code being tested.
Integration testing shows how the application fits into the overall flow of all
upstream and downstream applications. When creating integration test scenarios,
consider how the overall process can break and focus on touch points between
applications rather than within one application. Consider how process failures at
each step would be handled and how data would be recovered or deleted if
necessary.
Most issues found during integration testing are either data related or result
from false assumptions about the design of another application. Therefore, it is
important to integration test with production-like data. Real production data is
ideal, but depending on the contents of the data, there could be privacy or security
concerns that require certain fields to be randomized before using it in a test
environment. As always, don't forget the importance of good communication
between the testing and design teams of all systems involved. To help bridge this
communication gap, gather team members from all systems together to formulate
test scenarios and discuss what could go wrong in production. Run the overall
process from end to end in the same order and with the same dependencies as in
production. Integration testing should be a combined effort and not the
responsibility solely of the team testing the ETL application.
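One way to address the privacy concern above is to mask sensitive fields deterministically before production data is copied to a test environment. The field names here are hypothetical; a deterministic hash is used so that joins across systems still line up after masking:

```python
import hashlib

# Hypothetical set of fields that must not appear in clear text in test.
SENSITIVE = {"customer_name", "email"}

def mask(row):
    out = dict(row)
    for field in SENSITIVE & row.keys():
        # A deterministic hash hides the real value but keeps the same
        # input mapping to the same output in every system.
        out[field] = hashlib.sha256(row[field].encode()).hexdigest()[:12]
    return out

row = {"customer_id": "C9", "customer_name": "Jane Doe", "email": "j@x.com"}
masked = mask(row)
assert masked["customer_id"] == "C9"          # non-sensitive fields untouched
assert masked["customer_name"] != "Jane Doe"  # sensitive fields masked
```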
User Acceptance Testing
- Use data that is as close to production data as possible. Users typically find issues once they see the "real" data, sometimes leading to design changes.
- If users are given database views, it is important that users sign off and clearly understand how the views are created.
- Plan for the system test team to support users during UAT. The users will likely have questions about how the data is populated and need to understand details of how the ETL works.
- Consider how the users would require the data loaded during UAT and negotiate how often the data will be refreshed.
Regression Testing
Regression testing is revalidation of existing functionality with each new
release of code. When building test cases, remember that they will likely be
executed multiple times as new releases are created due to defect fixes,
enhancements or upstream system changes. Building automation
during system testing will make the process of regression testing much
smoother. Test cases should be prioritized by risk in order to help
determine which need to be rerun for each new release. A simple but effective
and efficient strategy to retest basic functionality is to store source data sets
and results from successful runs of the code and compare new test results with
previous runs. When doing a regression test, it is much quicker to compare
results to a previous execution than to do an entire data validation again.
Taking these considerations into account during the design and testing portions
of building a data warehouse will ensure that a quality product is produced and
prevent costly mistakes from being discovered in production.
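The store-and-compare strategy above can be sketched as a baseline ("golden") comparison; the metric names and file path are hypothetical:

```python
import json
import os
import pathlib
import tempfile

# Compare a run's summary metrics against a stored baseline; on the first
# successful run, the results themselves become the baseline.
def compare_to_baseline(results, baseline_path):
    path = pathlib.Path(baseline_path)
    if not path.exists():
        path.write_text(json.dumps(results, sort_keys=True))
        return []
    baseline = json.loads(path.read_text())
    # Report only the metrics that changed since the last good run.
    return [key for key in set(baseline) | set(results)
            if baseline.get(key) != results.get(key)]

base = os.path.join(tempfile.mkdtemp(), "baseline.json")
# First run stores the baseline; a later run reports only what changed.
assert compare_to_baseline({"row_count": 100, "sum_amount": 5000.0}, base) == []
assert compare_to_baseline({"row_count": 101, "sum_amount": 5000.0}, base) == ["row_count"]
```

Comparing a handful of stored metrics per release is far quicker than repeating a full data validation, while still catching regressions in the basics.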
Thank You