
Types of Digital Data

Challenges
Understanding, management and analysis of data from heterogeneous sources

Topics
Various formats of data; data storage mechanisms; data access methods; management of data; extracting the desired information; challenges posed by different data formats.

Digital Data
Digital data falls into three types: unstructured (about 80%), semi-structured, and structured (organized form, in tables). According to Merrill Lynch, 80-90% of business data is either unstructured or semi-structured. Because data is usually in one of these formats, it is difficult to extract information from it.

Unstructured: data that does not conform to a data model and is not in a form that a computer program can easily use. Examples: images, videos, memos, chat rooms, PPT presentations, research reports, white papers, the body of an email.
Semi-structured: data that does not conform to a data model but has some structure. It is still not in a form that a computer program can easily use. Examples: emails, XML and other markup languages. The metadata these formats carry is not sufficient on its own.
Structured: data that is in an organized form and is easily understood by a computer program. Example: data stored in a database.

Structured data is organized in rows and columns in a rigidly defined format so that applications can retrieve and process it efficiently. It is typically stored using a database management system (DBMS). Data is unstructured if its elements cannot be stored in rows and columns, which makes it difficult for business applications to query and retrieve. For example, customer contacts may be stored in various forms such as sticky notes, e-mail messages, business cards, or even digital files such as .doc, .txt, and .pdf. Due to its unstructured nature, this information is difficult to retrieve using a customer relationship management application.
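As a minimal sketch of the contrast above, the following Python example stores customer contacts in rows and columns with the standard-library sqlite3 module, then tries to recover the same detail from a free-text note. The table name, column names, and sample values are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical customer contacts, stored as rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, email TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Asha Rao", "asha@example.com", "Pune"),
        ("John Lee", "john@example.com", "Delhi"),
    ],
)

# Structured data: the application asks for a column by name.
structured_hit = conn.execute(
    "SELECT email FROM contacts WHERE name = ?", ("Asha Rao",)
).fetchone()[0]

# The same contact as free text (for example, typed up from a sticky note).
note = "met Asha at the expo, she said to write to asha@example.com next week"

# Unstructured data: there is no 'email' field, so the program has to guess.
guessed = [word for word in note.split() if "@" in word]
```

The query works because the schema names the field. The guess over free text only works while the note happens to contain a cleanly separated address.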

Types of data

In the hospital (GoodLife), data is maintained in a structured way, so anyone can easily locate the desired information. Structured data comes from sources such as Access, OLTP systems, SQL databases, and spreadsheets.
Datasets are fully described. Categories and subcategories are clearly defined. Data is neatly placed in rows and columns. Indexing can be done easily.

Characteristics of Structured data

Unstructured Data
The email had not been successfully updated in the medical system database because it was in an unstructured format. It is difficult to determine the meaning of the data. It does not follow any rules or semantics. It can be of any type, so it is unpredictable. It is free-form text without any structure.

Characteristics of Unstructured data

Anything in non-database form.


Bitmap objects: image, video, and audio files.
Textual objects: Word documents and email. The body of an email is raw data without any structure, which is why the email had not been updated into the medical database record.
Noisy text such as chats, emails, and SMS messages. The language used also differs from normal language.

Web pages, even though they are in HTML (which is structured). The tagged elements do not capture the meaning of the data, so it is difficult to process the information automatically. Web pages also carry links and references to external unstructured content, such as images.

Sources of unstructured data

How to manage unstructured data


An index in SQL is created on an existing table to retrieve rows quickly. When a table holds thousands of records, retrieving information can take a long time. Indexes are therefore created on columns that are accessed frequently, so that the information can be retrieved quickly. Indexes can be created on a single column or on a group of columns. When an index is created, it sorts the indexed values and stores each one with the ROWID of its row.
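A short sketch of creating indexes on a single column and on a group of columns, using Python's standard-library sqlite3 module. The patients table, its columns, and the index names are hypothetical. SQLite is used here only because it ships with Python, and its internals may differ from other database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (id INTEGER PRIMARY KEY, last_name TEXT, ward TEXT)"
)
conn.executemany(
    "INSERT INTO patients (last_name, ward) VALUES (?, ?)",
    [(f"name{i}", f"ward{i % 5}") for i in range(1000)],
)

# Index on a single, frequently accessed column.
conn.execute("CREATE INDEX idx_patients_last_name ON patients (last_name)")
# Index on a group of columns.
conn.execute("CREATE INDEX idx_patients_ward_name ON patients (ward, last_name)")

# Ask SQLite how it would run the lookup; the last field of each row
# is a human-readable description of the chosen plan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM patients WHERE last_name = ?",
    ("name42",),
).fetchall()
plan_text = " ".join(row[3] for row in plan)

matches = conn.execute(
    "SELECT id FROM patients WHERE last_name = ?", ("name42",)
).fetchall()
```

Printing plan_text shows whether the planner chose an index for the lookup instead of reading every row of the table.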
An index is an identifier that represents data in a data set. Indexing is possible for unstructured data, based on its text or on other attributes such as the filename. However, indexing unstructured data is difficult because it does not follow any naming conventions.
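One common way to index text is an inverted index, which maps each word to the documents that contain it. The sketch below is an illustration of that general idea, not a method the slides prescribe. The filenames and document texts are hypothetical.

```python
from collections import defaultdict

# Hypothetical free-text documents keyed by filename.
documents = {
    "note1.txt": "Patient reports mild fever and headache",
    "mail7.txt": "Follow up on the fever medication dosage",
    "memo3.txt": "Staff meeting moved to Friday",
}


def build_index(docs):
    """Map each lowercase word to the set of filenames that contain it."""
    index = defaultdict(set)
    for filename, text in docs.items():
        for word in text.lower().split():
            index[word].add(filename)
    return index


index = build_index(documents)
fever_docs = sorted(index["fever"])
```

Here fever_docs lists the two files that mention "fever". The index only knows the literal words, so a note that says "high temperature" instead would be missed, which is part of why indexing unstructured data is hard.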

Tags/metadata: Using metadata, the data in a document can be tagged. With unstructured data this is difficult because little or no metadata is available. The structure of the document cannot be determined because it comes from more than one source and does not have a particular format.

Classification/taxonomy: Taxonomy is the classification of data on the basis of the relationships that exist between data. Data can be arranged in groups and placed in hierarchies based on the taxonomy prevalent in an organization. Classifying unstructured data is difficult because identifying relationships between data is not an easy task.
CAS (content addressable storage): It stores data based on its metadata. It assigns a unique identifier to every object stored in it. It is used extensively to store emails.
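The CAS idea can be sketched in a few lines of Python. This toy version makes an assumption the slides do not state: the unique identifier is a SHA-256 hash of the object's content, which is one common way to build such a store. The ContentStore class and the sample email are hypothetical.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is derived from the content."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # Assumption: the SHA-256 digest serves as the object's unique identifier.
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]


store = ContentStore()
first = store.put(b"Subject: lab results\n\nPlease see attached.")
second = store.put(b"Subject: lab results\n\nPlease see attached.")
```

Storing the same email twice yields the same identifier, so the store keeps a single copy.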

Challenges to store

Solution to storage challenges

UIMA : Unstructured Information Management Architecture


UIMA is a solution for unstructured data. It is an open-source platform from IBM that integrates different kinds of analysis engines to provide a complete solution for knowledge discovery from unstructured data. UIMA stores information in a structured format. The various analysis engines analyze unstructured data in different ways, such as:
breaking up documents; grouping and classifying according to taxonomy; detecting parts of speech, grammar, and synonyms; detecting events and times; detecting relationships between various elements.
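The idea of chaining analysis engines can be illustrated with a small Python sketch. This is not the UIMA API. The function names, the dictionary used to carry annotations, and the sample text are all hypothetical, and each "engine" is reduced to a simple regular expression.

```python
import re


def split_sentences(doc):
    """Break the document into sentences (a stand-in for document segmentation)."""
    parts = re.split(r"[.!?]", doc["text"])
    doc["sentences"] = [s.strip() for s in parts if s.strip()]
    return doc


def detect_times(doc):
    """Annotate clock times such as 10:30."""
    doc["times"] = re.findall(r"\b\d{1,2}:\d{2}\b", doc["text"])
    return doc


def run_pipeline(text, engines):
    """Pass one shared document through each engine in order."""
    doc = {"text": text}
    for engine in engines:
        doc = engine(doc)
    return doc


result = run_pipeline(
    "Surgery starts at 10:30. The patient was admitted at 08:15!",
    [split_sentences, detect_times],
)
```

Each engine adds its own annotations to the shared document, so the output is structured information derived from unstructured text.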

Semi structured data


Only about 10 percent of the data in an organization is semi-structured. Semi-structured data does not conform to any data model, and it cannot be stored in rows and columns. It has tags and markers that help group the data and describe how the data is stored, giving some metadata, but these are not sufficient for management and automation of data. Similar entities are grouped and organized in a hierarchy. The properties or attributes within a group may or may not be the same.
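A small sketch of entities in the same group carrying different attributes. JSON is used here as an assumption for illustration, since its keys play the role of tags. The records and field names are hypothetical.

```python
import json

raw = """
[
  {"type": "address", "name": "Asha Rao", "city": "Pune", "pin": "411001"},
  {"type": "address", "name": "John Lee", "city": "Delhi", "phone": "555-0100"}
]
"""
records = json.loads(raw)

# Attributes that every record in the group shares.
shared = set(records[0]).intersection(*(set(r) for r in records[1:]))
# Attributes that appear in at least one record.
all_keys = set().union(*(set(r) for r in records))
# Attributes present in some records but not in others.
optional = sorted(all_keys - shared)
```

Both records are addresses, yet one has a pin and the other a phone. A fixed table would need a column for every attribute that might ever appear.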

Characteristics of semi-structured data

Example
Addresses with different attributes.
Emails (format, body).
Blood test reports (format, diagnosis, conclusion).
Web pages (tags, metatags, data).

There is a fine line between unstructured and semi-structured data.

How semi structured data is stored


Schemas: used to define the structure of data. The problem with a schema is that requirements are ever-changing, and changes required in the data also lead to changes in the schema.
Graph-based data models: these can be used to describe data. They are self-describing and use a tree-like structure to describe relationships and hierarchies. This is a schema-less approach.
XML: used to store and exchange semi-structured data. It allows the user to define tags to store data in hierarchical form.
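As an illustration of user-defined tags in hierarchical form, the sketch below parses a small XML document with Python's standard-library xml.etree.ElementTree module. The tag names and patient records are hypothetical.

```python
import xml.etree.ElementTree as ET

xml_text = """
<patients>
  <patient id="1">
    <name>Asha Rao</name>
    <bloodTest><result>normal</result></bloodTest>
  </patient>
  <patient id="2">
    <name>John Lee</name>
    <phone>555-0100</phone>
  </patient>
</patients>
"""

root = ET.fromstring(xml_text)

# Every patient has a name, so it can be read uniformly.
names = [p.findtext("name") for p in root.findall("patient")]

# The remaining child tags differ from one patient to the next.
child_tags = {
    p.get("id"): [child.tag for child in p] for p in root.findall("patient")
}
```

The tags describe the data well enough to navigate it, but nothing forces the two patient elements to carry the same fields.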

Challenges in Storage of semi structured data

Solution for storing

Challenges to extract information.

Solutions to extract data

Difference between structured and semi structured data
