Basic Data Profiling

Copyright:

Attribution Non-Commercial (BY-NC)

Uploaded by

Wayne Yaddow

THE BASICS OF DATA PROFILING

Data profiling consists of multiple analyses that investigate the structure and content of data and support inferences about that data.

Column Examination
- Identify all values in the column, along with frequency of occurrence
- Identify min and max values
- Determine true data type
- Determine degree of uniqueness
- Determine encoding patterns used and the frequency of each pattern
- Compute values: AVG, SUM, MEDIAN, STD DEVIATION
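Most of the column checks above can be sketched in a few lines of standard-library Python. The function names `profile_column` and `encoding_pattern`, the 9/A pattern notation, and the sample values are illustrative assumptions, not part of the original handout; data type inference is left out to keep the sketch short.

```python
import re
import statistics
from collections import Counter

def encoding_pattern(value):
    """Map every digit to 9 and every ASCII letter to A; keep punctuation."""
    text = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", text)

def profile_column(values):
    """Summarize one column; None is treated as a missing value."""
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "frequencies": Counter(present),
        "patterns": Counter(encoding_pattern(v) for v in present),
        # 1.0 means every non-missing value is distinct.
        "uniqueness": len(set(present)) / len(present) if present else 0.0,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }
    # Aggregates only make sense when every present value is numeric.
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if numeric and len(numeric) == len(present):
        profile["sum"] = sum(numeric)
        profile["avg"] = statistics.mean(numeric)
        profile["median"] = statistics.median(numeric)
        profile["stdev"] = statistics.stdev(numeric) if len(numeric) > 1 else 0.0
    return profile

# Hypothetical sample data.
zip_profile = profile_column(["10001", "10001", "1000A", "94105-1234", None])
amounts_profile = profile_column([10, 20, 20, 50])
```

In this sample, the pattern counts show two values shaped `99999`, one `9999A`, and one `99999-9999`, which is the kind of inconsistency a pattern frequency report is meant to surface.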

Row Examination
- Find all primary key candidates (single or multi-column)
- Find intra-row column dependencies (find de-normalization instances)
- Find multi-column value relationships
- Value ordering rules
- NULL value dependencies
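As a rough illustration of the first two row checks, the sketch below searches for key candidates by brute force and tests whether one column determines another. The names `key_candidates` and `depends_on`, the `max_columns` limit, and the sample rows are assumptions made for this example; rows are assumed to be dictionaries sharing the same columns.

```python
from itertools import combinations

def key_candidates(rows, max_columns=2):
    """Return column combinations whose values are unique and never None.

    Supersets of a candidate already found are skipped, so only minimal
    candidates are reported.
    """
    columns = list(rows[0].keys())
    found = []
    for size in range(1, max_columns + 1):
        for combo in combinations(columns, size):
            if any(set(k) <= set(combo) for k in found):
                continue
            keys = [tuple(row[c] for c in combo) for row in rows]
            if any(None in key for key in keys):
                continue
            if len(set(keys)) == len(keys):
                found.append(combo)
    return found

def depends_on(rows, determinant, dependent):
    """True if each determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[determinant], row[dependent]) != row[dependent]:
            return False
    return True

# Hypothetical sample data.
orders = [
    {"order_id": 1, "line": 1, "zip": "10001", "city": "New York"},
    {"order_id": 1, "line": 2, "zip": "10001", "city": "New York"},
    {"order_id": 2, "line": 1, "zip": "94105", "city": "San Francisco"},
]

candidates = key_candidates(orders)
zip_determines_city = depends_on(orders, "zip", "city")
order_determines_line = depends_on(orders, "order_id", "line")
```

On a sample this small, `candidates` includes `("line", "zip")` and `("line", "city")` alongside the expected `("order_id", "line")`. A candidate only says the sample does not contradict the key, so each one still needs review against more data and against the intended design. The `zip` to `city` dependency holding is the kind of intra-row dependency that can indicate de-normalization.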

Multi-table Examination
- Find matching columns across tables
  - Match by column name, data type
  - Match by values
- Find primary/foreign key pairs (single and multi-column)
- Determine 1-1, 1-M, 1-0, M-1, M-M, 0-1 rules
- Find primary values not found in secondary tables
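A minimal sketch of the value-based cross-table checks, assuming two tables held as lists of dictionaries. The helper names (`orphan_values`, `unmatched_parents`, `child_count_range`) and the customer/order data are hypothetical.

```python
from collections import Counter

def orphan_values(child_rows, child_col, parent_rows, parent_col):
    """Foreign key values in the child table with no matching parent row."""
    parent_keys = {row[parent_col] for row in parent_rows}
    return sorted({row[child_col] for row in child_rows
                   if row[child_col] is not None
                   and row[child_col] not in parent_keys})

def unmatched_parents(parent_rows, parent_col, child_rows, child_col):
    """Primary values that never appear in the secondary (child) table."""
    child_keys = {row[child_col] for row in child_rows}
    return sorted(row[parent_col] for row in parent_rows
                  if row[parent_col] not in child_keys)

def child_count_range(parent_rows, parent_col, child_rows, child_col):
    """(min, max) number of child rows per parent row."""
    counts = Counter(row[child_col] for row in child_rows)
    per_parent = [counts.get(row[parent_col], 0) for row in parent_rows]
    return min(per_parent, default=0), max(per_parent, default=0)

# Hypothetical sample data.
customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 2},
    {"order_id": 13, "customer_id": 9},
]

orphans = orphan_values(orders, "customer_id", customers, "id")
childless = unmatched_parents(customers, "id", orders, "customer_id")
count_range = child_count_range(customers, "id", orders, "customer_id")
```

Here the range `(0, 2)` suggests an optional one-to-many relationship from customers to orders, while the orphan value `9` points at an order whose customer is missing.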

Invalid Values
- Values missing when they should not be
- Values out of range or not in the domain of expected values
- Value in one column not possible when combined with values in one or more other columns
- Example of obviously wrong values:
  - Name = Donald Duck
  - Address = 1600 Pennsylvania Avenue
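The first three kinds of invalid value lend themselves to simple per-row checks. The sketch below is one possible shape for them; the function name, the rule dictionaries, and the sample row are all assumptions for illustration. Plausible-looking but wrong values such as the Donald Duck example generally need reference data or human review rather than a rule like these.

```python
def invalid_value_issues(row, required, domains, ranges, impossible):
    """Return a list of human-readable issues found in one row."""
    issues = []
    for col in required:
        if row.get(col) in (None, ""):
            issues.append(f"{col}: missing")
    for col, allowed in domains.items():
        value = row.get(col)
        if value not in (None, "") and value not in allowed:
            issues.append(f"{col}: not in domain")
    for col, (low, high) in ranges.items():
        value = row.get(col)
        # Range checks assume the value is already numeric.
        if value is not None and not (low <= value <= high):
            issues.append(f"{col}: out of range")
    for label, predicate in impossible.items():
        if predicate(row):
            issues.append(f"combination: {label}")
    return issues

# Hypothetical sample row and rules.
row = {"name": "", "state": "ZZ", "quantity": -3,
       "status": "shipped", "ship_date": None}

issues = invalid_value_issues(
    row,
    required=["name"],
    domains={"state": {"NY", "CA"}},
    ranges={"quantity": (1, 1000)},
    impossible={
        "shipped without a ship date":
            lambda r: r["status"] == "shipped" and r["ship_date"] is None,
    },
)
```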

Examples of problems easily uncovered through data profiling analysis:

- Data elements used for purposes other than those assumed
- Empty columns; columns containing no data at all
- Invalid values in columns
- Inconsistent methods of representing the same value
- Missing values
- Violation of structural dependencies
- Violation of expected column relationships
- Missing date values
- Violation of business rules
- Unrealistic percentages of specific values appearing in a column

Data profiling is an organized methodology that analyzes data in stages in order to produce a thorough result. The stages an analyst typically works through are:

- Analyze individual values to determine if they are valid values for a column
- Analyze all the values in a column together to find problems with unique rules, consecutive rules and unexpected frequencies of specific values
- Analyze structure rules governing functional dependencies, primary keys, foreign keys, synonyms and duplicate columns
- Validate data rules that must hold true within a row of data
- Validate data rules that must hold true over all rows for a single business object
- Validate data rules that must hold true over collections of a business object
- Validate data rules that must hold true between collections of different types of business objects

Data rules are a subset of business rules that define relationships between sets of columns or rows that must always be true within the data. A violation may mean either that the data is inaccurate or that the business rules the data rules are based on are not being followed in the real world. In the first case the data was entered inaccurately. In the second case the data was entered correctly, but the transaction itself fell outside the corporation's business policies. Both situations are important to expose. Examples of data rules are:

- Employees must be at least 18 years old.
- Part-time employees are paid hourly.
- Checkout periods for tools cannot overlap for the same tool.
- Customers with more than $50,000 in sales last quarter get a 5 percent discount.
- Suppliers cannot supply radioactive part numbers unless certified.
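The first three example rules can be expressed as executable checks. The sketch below is an illustration only: the field names (`birth_date`, `status`, `pay_type`), the tuple layout for checkouts, and the choice to treat a checkout's end date as exclusive (a tool returned on a given day may go out again that day) are all assumptions.

```python
from datetime import date

def employee_rule_violations(employee, today):
    """Row-level rules: minimum age and pay type for part-time staff."""
    violations = []
    birth = employee["birth_date"]
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - birth.year - (
        (today.month, today.day) < (birth.month, birth.day))
    if age < 18:
        violations.append("under 18")
    if employee["status"] == "part-time" and employee["pay_type"] != "hourly":
        violations.append("part-time not paid hourly")
    return violations

def overlapping_checkouts(checkouts):
    """Rule across rows: report each checkout that starts before an
    earlier checkout of the same tool has ended.

    checkouts is a list of (tool_id, start, end) tuples.
    """
    by_tool = {}
    for tool_id, start, end in checkouts:
        by_tool.setdefault(tool_id, []).append((start, end))
    overlaps = []
    for tool_id, periods in by_tool.items():
        latest_end = None
        for start, end in sorted(periods):
            if latest_end is not None and start < latest_end:
                overlaps.append((tool_id, start, end))
            latest_end = end if latest_end is None else max(latest_end, end)
    return overlaps

# Hypothetical sample data.
employee = {"birth_date": date(2007, 1, 15),
            "status": "part-time", "pay_type": "salary"}
employee_issues = employee_rule_violations(employee, today=date(2024, 6, 1))

checkouts = [
    ("drill", date(2024, 5, 1), date(2024, 5, 5)),
    ("drill", date(2024, 5, 4), date(2024, 5, 6)),
    ("saw", date(2024, 5, 1), date(2024, 5, 2)),
    ("saw", date(2024, 5, 2), date(2024, 5, 3)),
]
checkout_issues = overlapping_checkouts(checkouts)
```

A reported violation does not say which case applies: the birth date may have been keyed in wrong, or an under-age employee may really have been hired. Either way the row is worth exposing for follow-up.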
