Challenges
Understanding, management and analysis of data from heterogeneous sources
Topics
Various formats of data
Data storage mechanisms
Data access methods
Management of data
Extracting the desired information
Challenges posed by different formats of data
Digital Data
Digital data falls into three categories: unstructured (about 80%), semi-structured, and structured (organized form, in tables). According to Merrill Lynch, 80-90% of business data is either unstructured or semi-structured. Because most data is in these formats, it is difficult to extract information from it.
Unstructured: data that does not conform to a data model and is not in a form that a computer program can use easily. Examples: images, videos, memos, chat rooms, PPT presentations, research reports, white papers, the body of an email.
Semi-structured: data that does not conform to a data model but has some structure. It is still not in a form that a computer program can use easily. Examples: emails, XML, markup languages, data whose metadata is not sufficient.
Structured: data that is in an organized form and is easily understood by a computer program. Example: data stored in a database.
Structured data is organized in rows and columns in a rigidly defined format so that applications can retrieve and process it efficiently. It is typically stored using a database management system (DBMS).
Data is unstructured if its elements cannot be stored in rows and columns, which makes it difficult for business applications to query and retrieve. For example, customer contacts may be stored in various forms such as sticky notes, e-mail messages, business cards, or digital files such as .doc, .txt, and .pdf. Due to its unstructured nature, this information is difficult to retrieve using a customer relationship management application.
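The difference between the three categories can be seen by looking up the same phone number in each form. The following is a minimal sketch, not a recommended design: the contact name, the phone number, the table layout, and the regular expression are all made-up assumptions for illustration.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Structured: rows and columns in a DBMS, queried by column name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, phone TEXT)")
conn.execute("INSERT INTO contacts VALUES (?, ?)", ("Asha Rao", "555-0101"))
structured_phone = conn.execute(
    "SELECT phone FROM contacts WHERE name = ?", ("Asha Rao",)
).fetchone()[0]

# Semi-structured: tags give some structure, but no fixed schema is enforced.
xml_doc = "<contact><name>Asha Rao</name><phone>555-0101</phone></contact>"
semi_structured_phone = ET.fromstring(xml_doc).findtext("phone")

# Unstructured: free text; the program can only guess at a pattern.
# The pattern below is an assumption about how phone numbers are written.
note = "Met Asha Rao at the conference, call her on 555-0101 next week."
match = re.search(r"\b\d{3}-\d{4}\b", note)
unstructured_phone = match.group(0) if match else None

print(structured_phone, semi_structured_phone, unstructured_phone)
```

The structured lookup relies on a known column, the semi-structured lookup relies on a tag that happens to be present, and the unstructured lookup works only as long as the text matches the guessed pattern.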
Types of data
At the hospital (GoodLife), data is maintained in a structured way, so anyone can locate the desired information easily. Structured data comes from sources such as Access, OLTP systems, SQL databases, and spreadsheets.
Datasets are fully described. Categories and subcategories are clearly defined. Data is neatly placed in rows and columns. Indexing can be done easily.
Unstructured Data
An email was not successfully updated in the medical system database because it fell into the unstructured format. It is difficult to determine the meaning of such data. It does not follow any rules or semantics. It can be of any type, so it is unpredictable. It is free-form text without any structure.
Web pages, even in HTML (which is structured): tagged elements do not capture the meaning of the data, so it is difficult to process the information automatically. Pages also carry links and references to external unstructured content, such as images.
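The point about HTML can be illustrated with a small parser. This is only a sketch using Python's standard `html.parser` module; the class name `PageScanner` and the sample page are hypothetical and not taken from the slides.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collects visible text and references to external content."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.external_refs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Links and images point to content that lives outside the page.
        if tag == "a" and "href" in attrs:
            self.external_refs.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.external_refs.append(attrs["src"])

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


# Made-up sample page for illustration.
page = (
    "<html><body><h1>Blood test report</h1>"
    "<p>Hemoglobin is 13.5 g/dL.</p>"
    '<img src="scan.png"><a href="notes.pdf">Doctor notes</a>'
    "</body></html>"
)
scanner = PageScanner()
scanner.feed(page)
scanner.close()

print(scanner.text_parts)
print(scanner.external_refs)
```

The parser recovers the text and the external references, but nothing in the `<p>` tag says that the sentence is a hemoglobin reading. The tags describe layout, and the meaning still has to be inferred from free text.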
Tags/metadata: Data in a document can be tagged using metadata, but with unstructured data this is difficult because little or no metadata is available. The structure of the document cannot be determined because it comes from more than one source and does not have a particular format.
Classification/taxonomy: Taxonomy means classifying data on the basis of the relationships that exist between data items. Data can be arranged in groups and placed in hierarchies based on the taxonomy prevalent in an organization. Classifying unstructured data is difficult because identifying relationships between data items is not an easy task.
CAS (content addressable storage): It stores data based on its metadata. It assigns a unique identifier to every object stored in it. It is used extensively to store emails.
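The idea of assigning a unique identifier to every stored object can be sketched as follows. This is a toy in-memory store, not a real CAS product. It assumes, as many CAS designs do, that the identifier is a hash of the object's content, whereas the slide describes storage based on metadata. The class name `ContentStore` and the sample email are hypothetical.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store (hypothetical, for illustration)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # Assumption: the unique identifier is the SHA-256 digest of the content.
        key = hashlib.sha256(data).hexdigest()
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
email_a = b"Subject: Lab results\n\nYour report is attached."
email_b = b"Subject: Lab results\n\nYour report is attached."

key_a = store.put(email_a)
key_b = store.put(email_b)  # identical content maps to the same identifier

print(key_a == key_b, len(store))
```

Because the identifier depends only on the content, two copies of the same email are stored once, which is one reason this style of storage suits email archives.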
Challenges to store
Example
Addresses with different attributes
Emails (format, body)
Blood test reports (format, diagnosis, conclusion)
Web pages (tags, meta tags, data)
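The email example can be made concrete with Python's standard `email` package. This is a minimal sketch, and the addresses and message text are invented for illustration.

```python
from email import message_from_string

# Made-up message: headers follow a fixed format, the body is free text.
raw_email = (
    "From: patient@example.com\n"
    "To: clinic@example.com\n"
    "Subject: Appointment change\n"
    "\n"
    "Hello, I cannot come on Monday. Could we move it to later in the week?\n"
)

message = message_from_string(raw_email)

# The format part: named fields that a program can read directly.
sender = message["From"]
subject = message["Subject"]

# The body part: unstructured text that needs interpretation.
body = message.get_payload()

print(sender, subject)
print(body)
```

The headers can be read by name, but finding out which day the patient wants requires interpreting the body, which is the storage and processing challenge the examples above describe.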