
THE BASICS OF DATA PROFILING

Data profiling consists of multiple analyses to investigate the structure and content of data and make inferences about data.

Column Examination
- Identify all values in column along with frequency of occurrence
- Identify min and max values
- Determine true data type
- Determine degree of uniqueness
- Determine encoding patterns used, frequency of each pattern
- Compute values: AVG, SUM, MEDIAN, STD DEVIATION
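
The column checks listed above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive profiler: the function names (`encoding_pattern`, `infer_type`, `profile_column`) are hypothetical, the column is assumed to arrive as a plain list with `None` marking a missing value, and the standard deviation is computed as a population value.

```python
import re
import statistics
from collections import Counter


def encoding_pattern(value):
    """Map each letter to 'A' and each digit to '9'; keep other characters."""
    text = re.sub(r"[A-Za-z]", "A", str(value))
    return re.sub(r"[0-9]", "9", text)


def infer_type(values):
    """Guess the narrowest type that fits the text form of every value."""
    if not values:
        return "unknown"
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for value in values:
                caster(str(value))
            return name
        except ValueError:
            continue
    return "string"


def profile_column(values):
    """Summarize one column given as a list of raw values (None = missing).

    Assumption: all present values are mutually comparable, so that
    min() and max() are meaningful.
    """
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "frequencies": Counter(present),
        "patterns": Counter(encoding_pattern(v) for v in present),
        "uniqueness": len(set(present)) / len(present) if present else 0.0,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "type": infer_type(present),
    }
    # Numeric statistics only make sense when every present value is a number.
    if present and all(isinstance(v, (int, float)) for v in present):
        profile["sum"] = sum(present)
        profile["avg"] = statistics.mean(present)
        profile["median"] = statistics.median(present)
        profile["std_dev"] = statistics.pstdev(present)  # population std dev
    return profile
```

For a phone-number column, the `patterns` counter would show how many values follow `999-9999` versus `9999999`, which is one way inconsistent encodings of the same kind of value become visible.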

Row Examination
- Find all primary key candidates (single or multi-column)
- Find intra-row column dependencies (find de-normalization instances)
- Find multi-column value relationships
- Value ordering rules
- NULL value dependencies
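
Two of the row-level checks above, finding key candidates and testing a column dependency, might look like the following. This is a brute-force sketch under stated assumptions: rows are dictionaries with identical keys, the helper names `key_candidates` and `holds_dependency` are hypothetical, and the search is capped at `max_columns` because the number of combinations grows quickly.

```python
from itertools import combinations


def key_candidates(rows, max_columns=2):
    """Return column combinations whose values are unique across all rows.

    Supersets of an already found candidate are skipped, so only minimal
    candidates are reported.
    """
    if not rows:
        return []
    columns = list(rows[0])
    found = []
    for size in range(1, max_columns + 1):
        for combo in combinations(columns, size):
            if any(set(key) <= set(combo) for key in found):
                continue
            seen = {tuple(row[c] for c in combo) for row in rows}
            if len(seen) == len(rows):
                found.append(combo)
    return found


def holds_dependency(rows, determinant, dependent):
    """Check whether determinant -> dependent holds in every row."""
    mapping = {}
    for row in rows:
        left = tuple(row[c] for c in determinant)
        right = row[dependent]
        # The first value seen for a determinant must match all later ones.
        if mapping.setdefault(left, right) != right:
            return False
    return True
```

A dependency such as `zip -> city` holding inside an order table is a typical sign of de-normalization. Candidates found this way are only consistent with the sample examined; they still need confirmation against the intended design.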

Multi-table Examination
- Find matching columns across tables
    - Match by column name, data type
    - Match by values
- Find primary/foreign key pairs (single and multi-column)
- Determine 1-1, 1-M, 1-0, M-1, M-M, 0-1 rules
- Find primary values not found in secondary tables
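
Matching columns by value and checking key pairs across tables can be illustrated with plain set operations. The sketch below assumes each column is already extracted as a list of values; the function names are hypothetical and chosen only for this example.

```python
from collections import Counter


def value_overlap(left_values, right_values):
    """Fraction of distinct left values that also appear on the right."""
    left, right = set(left_values), set(right_values)
    return len(left & right) / len(left) if left else 0.0


def unmatched_primary(primary_keys, secondary_keys):
    """Primary values that never appear in the secondary table."""
    return sorted(set(primary_keys) - set(secondary_keys))


def orphans(primary_keys, secondary_keys):
    """Secondary values that have no matching primary value."""
    return sorted(set(secondary_keys) - set(primary_keys))


def children_per_parent(primary_keys, secondary_keys):
    """Count secondary rows per primary value (0 means no children)."""
    counts = Counter(secondary_keys)
    return {key: counts[key] for key in primary_keys}
```

The minimum and maximum of the counts returned by `children_per_parent` suggest which cardinality rule the data actually follows: a minimum of 0 indicates an optional relationship, and a maximum above 1 indicates a one-to-many relationship.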

Invalid Values
- Values missing when they should not be
- Values out of range or not in domain of expected values
- Value in one column not possible when combined with values in one or more other columns
- Example: obviously wrong values
    Name = Donald Duck
    Address = 1600 Pennsylvania Avenue
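
The first three kinds of invalid value above lend themselves to simple mechanical checks. The following is a sketch only: `check_row`, its parameters, and the column names in the usage note are assumptions made for illustration, and a row is assumed to be a dictionary.

```python
def check_row(row, required, ranges, domains, combination_checks):
    """Return a list of problems found in one row.

    required: columns that must not be missing
    ranges: {column: (low, high)} inclusive bounds
    domains: {column: set of allowed values}
    combination_checks: {label: function(row) -> True when the row is possible}
    """
    problems = []
    for column in required:
        if row.get(column) in (None, ""):
            problems.append(f"{column}: missing")
    for column, (low, high) in ranges.items():
        value = row.get(column)
        if value not in (None, "") and not low <= value <= high:
            problems.append(f"{column}: {value} outside {low}..{high}")
    for column, allowed in domains.items():
        value = row.get(column)
        if value not in (None, "") and value not in allowed:
            problems.append(f"{column}: {value!r} not in domain")
    for label, is_possible in combination_checks.items():
        if not is_possible(row):
            problems.append(f"combination: {label}")
    return problems
```

A call might pass `required=["name"]`, `ranges={"age": (0, 120)}`, `domains={"country": {"US", "FR"}}`, and a combination check such as "a state code is only allowed when the country is US". Obviously wrong but well-formed values, like the Donald Duck example, generally cannot be caught by checks of this kind and need frequency analysis or human review.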

Examples of problems easily uncovered through data profiling analysis:


- Data elements used for purposes other than those assumed
- Empty columns; columns containing no data at all
- Invalid values in columns
- Inconsistent methods of representing the same value
- Missing values
- Violation of structural dependencies
- Violation of expected column relationships missing date values
- Violation of business rules
- Unrealistic percentages of specific values appearing in a column

Data profiling is an organized methodology that analyzes data in stages in order to produce a thorough result. An analyst typically works through the following stages:

1. Analyze individual values to determine if they are valid values for a column
2. Analyze all the values in a column together to find problems with uniqueness rules, consecutive rules and unexpected frequencies of specific values
3. Analyze structure rules governing functional dependencies, primary keys, foreign keys, synonyms and duplicate columns
4. Validate data rules that must hold true within a row of data
5. Validate data rules that must hold true over all rows for a single business object
6. Validate data rules that must hold true over collections of a business object
7. Validate data rules that must hold true between collections of different types of business objects

Data rules are a subset of business rules that define relationships between sets of columns or rows that must always be true within the data. A violation may mean either that the data is inaccurate or that the business rules the data rules are based on are not being followed in the real world. In the first case the data was entered inaccurately. In the second case the data was entered correctly, but the transaction was handled outside of the corporation's business policies. Both of these situations are important to expose. Examples of data rules are:

- Employees must be at least 18 years old.
- Part-time employees are paid hourly.
- Checkout periods for tools cannot overlap for the same tool.
- Customers with more than $50,000 in sales last quarter get a 5 percent discount.
- Suppliers cannot supply radioactive part numbers unless certified.
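
Three of the example rules above can be written as executable checks. The first two are single-row rules; the checkout rule spans several rows of the same tool. This is a hedged sketch: the field names (`birth_date`, `status`, `pay_type`), the tuple layout of a checkout, and the choice to treat the end of a checkout period as exclusive are all assumptions made for the example.

```python
from datetime import date
from itertools import combinations


def age_on(birth_date, today):
    """Whole years between birth_date and today."""
    before_birthday = (today.month, today.day) < (birth_date.month, birth_date.day)
    return today.year - birth_date.year - before_birthday


def employee_violations(employee, today):
    """Return the single-row data rules that an employee record breaks."""
    violations = []
    if age_on(employee["birth_date"], today) < 18:
        violations.append("must be at least 18 years old")
    if employee["status"] == "part-time" and employee["pay_type"] != "hourly":
        violations.append("part-time employees are paid hourly")
    return violations


def overlapping_checkouts(checkouts):
    """Return pairs of checkouts of the same tool whose periods overlap.

    Each checkout is a tuple (tool_id, start, end); end is treated as
    exclusive, so a checkout may start on the day the previous one ends.
    """
    by_tool = {}
    for checkout in checkouts:
        by_tool.setdefault(checkout[0], []).append(checkout)
    overlaps = []
    for periods in by_tool.values():
        for a, b in combinations(periods, 2):
            if a[1] < b[2] and b[1] < a[2]:
                overlaps.append((a, b))
    return overlaps
```

A reported violation does not say which of the two causes applies. Whether the record was entered wrongly or the business policy was bypassed still has to be determined by investigation.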
