
Investigate stage

Understanding your data is a necessary precursor to cleansing. You can use WebSphere Information Analyzer to create a direct input into the cleansing process by using shared metadata, or use the Investigate stage to create this input. The Investigate stage shows the actual condition of data in legacy sources and identifies and corrects data problems before they corrupt new systems. Investigation parses and analyzes free-form fields, counts unique values, and classifies or assigns a business meaning to each occurrence of a value within a field. Investigation achieves these goals:

Uncovers trends, potential anomalies, metadata discrepancies, and undocumented business practices.
Identifies invalid or default values.
Reveals common terminology.
Verifies the reliability of fields proposed as matching criteria.

The Investigate stage takes a single input, which can be a link from any database connector that is supported by WebSphere DataStage, from a flat file or data set, or from any processing stage. Inputs to the Investigate stage can be fixed length or variable length.

Figure 1. Designing the Investigate stage

As Figure 1 shows, you use the WebSphere DataStage and QualityStage Designer to specify the Investigate stage. The stage can have one or two output links, depending on the type of investigation that you specify.

The Word Investigation stage parses free-form data fields into individual tokens and analyzes them to create patterns. This stage also provides frequency counts on the tokens. To create the patterns in address data, for example, the Word Investigation stage uses a set of rules for classifying personal names, business names, and addresses. The stage provides pre-built rule sets for investigating patterns on names and postal addresses for a number of different countries. For example, for the United States the stage parses the following components:

USPREP: Name, address, and area if the data is not previously formatted
USNAME: Individual and organization names
USADDR: Street and mailing addresses
USAREA: City, state, ZIP code, and so on

The test field 123 St. Virginia St. is analyzed in the following way:

1. Field parsing breaks the address into the individual tokens 123, St., Virginia, and St.
2. Lexical analysis determines the business significance of each piece:
   a. 123 = number
   b. St. = street type
   c. Virginia = alpha
   d. St. = street type
3. Context analysis identifies the various data structures and content as 123 St. Virginia, St.
   a. 123 = House number
   b. St. Virginia = Street address
   c. St. = Street type

The Character Investigation stage parses a single-domain field (one that contains one data element or token, such as Social Security number, telephone number, date, or ZIP code) to analyze and classify data. The Character Investigation stage provides a frequency distribution and pattern analysis of the tokens.

A pattern report is prepared for all types of investigations and displays the count, the percentage of data that matches the pattern, the generated pattern, and sample data. This output can be presented in a wide range of formats to conform to standard reporting tools.
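The first two steps of the walkthrough for 123 St. Virginia St. (field parsing and lexical analysis) can be sketched in Python. This is only a minimal illustration: the STREET_TYPES table and the function names parse_tokens, classify, and lexical_analysis are hypothetical, and they are not the USADDR rule set or any QualityStage API.

```python
import re

# Hypothetical classification table for illustration only;
# a real rule set such as USADDR is far larger.
STREET_TYPES = {"st", "st.", "ave", "ave.", "rd", "rd.", "blvd", "blvd."}


def parse_tokens(field):
    """Field parsing: split a free-form field into whitespace-separated tokens."""
    return field.split()


def classify(token):
    """Lexical analysis: assign a coarse class to a single token."""
    if token.isdigit():
        return "number"
    if token.lower() in STREET_TYPES:
        return "street type"
    if re.fullmatch(r"[A-Za-z]+", token):
        return "alpha"
    return "mixed"


def lexical_analysis(field):
    """Return (token, class) pairs for every token in the field."""
    return [(token, classify(token)) for token in parse_tokens(field)]


if __name__ == "__main__":
    for token, token_class in lexical_analysis("123 St. Virginia St."):
        print(f"{token} = {token_class}")
```

The sketch stops before context analysis. Deciding that the first St. belongs to the street name while the last St. is the street type requires position-aware rules, which this token-by-token lookup does not attempt.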

qsInvCount - the number of rows where the pattern occurred.
qsInvPercent - the percentage of rows where the pattern occurred.
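To make the count and percentage fields concrete, here is a small Python sketch of a character-pattern frequency report for a single-domain field such as a ZIP code. It is an illustration under stated assumptions, not the stage's implementation: the symbols (n for a digit, a for a letter, b for a blank), the function names char_pattern and pattern_report, and the sample values are all hypothetical. The count and percent keys play the roles described for qsInvCount and qsInvPercent.

```python
from collections import Counter


def char_pattern(value):
    """Map each character to a class symbol (assumed symbols, for illustration):
    n = digit, a = letter, b = blank; any other character is kept as-is."""
    symbols = []
    for ch in value:
        if ch.isdigit():
            symbols.append("n")
        elif ch.isalpha():
            symbols.append("a")
        elif ch == " ":
            symbols.append("b")
        else:
            symbols.append(ch)
    return "".join(symbols)


def pattern_report(values):
    """Return one row per pattern: pattern, count, percent, and a sample value."""
    counts = Counter()
    samples = {}
    for value in values:
        pattern = char_pattern(value)
        counts[pattern] += 1
        samples.setdefault(pattern, value)  # keep the first value seen as the sample
    total = len(values)
    return [
        {
            "pattern": pattern,
            "count": count,                     # rows where the pattern occurred
            "percent": 100.0 * count / total,   # share of all rows
            "sample": samples[pattern],
        }
        for pattern, count in counts.most_common()
    ]


if __name__ == "__main__":
    zip_codes = ["02134", "90210", "9021A", "12345-6789"]  # made-up sample data
    for row in pattern_report(zip_codes):
        print(row)
```

Patterns that occur rarely, such as nnnna in the sample data, are the ones worth inspecting first, because they often point to invalid or default values.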
