
TEXT ANALYTICS

C Sudhakar
CEO

Raskey Software Solutions Ltd


Email: sudhakar@raskeysoft.com

Web: www.raskeysoft.com

SMART ANALYTICS

Start with strategy
Measure metrics and data
Apply analytics
Report results
Transform business

TYPES OF ANALYTICS
Data analytics
Compete on Analytics
Text analytics
Video analytics
Social networking analytics
Web analytics
Speech analytics

TEXT ANALYTICS
Text analytics is the process of analyzing
unstructured text, extracting relevant
information, and transforming it into useful
business intelligence.
Text analysis can now tell us things we did
not already know and, perhaps more
importantly, had no way of knowing before.
Access to huge text data sets and improved
technical capability mean we can now mine
text for patterns and trends that can be
incredibly useful in business.

TEXT ANALYTICS TASKS INCLUDE


Text categorization
Text clustering
Concept extraction
Sentiment analysis
Document summarization

TEXT CATEGORIZATION
Text categorization applies some structure to
the text, which can then be used for analysis
or query.
It assigns a document to one or more classes
or categories according to its subject or to
other attributes such as document type,
author, creation date, etc.
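
A minimal sketch of subject-based categorization, assuming a hand-made keyword list per category. The category names, cue words, and the `categorize` helper are illustrative assumptions and are not taken from any particular product; a production system would more likely use a trained classifier.

```python
import re

# Hypothetical categories and cue words, chosen only for illustration.
CATEGORY_KEYWORDS = {
    "networking": {"network", "router", "firewall", "wifi"},
    "storage": {"storage", "backup", "disk"},
    "hiring": {"job", "vacancy", "applicant"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def categorize(text, min_hits=1):
    """Return every category whose cue words occur at least min_hits times."""
    tokens = tokenize(text)
    labels = []
    for category, cues in CATEGORY_KEYWORDS.items():
        hits = sum(1 for token in tokens if token in cues)
        if hits >= min_hits:
            labels.append(category)
    return sorted(labels)


print(categorize("Authorize a purchase of storage area network equipment"))
# prints ['networking', 'storage']
```

A document can land in more than one category, which matches the "one or more classes" wording above.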

TEXT CLUSTERING

As the name suggests, text clustering
allows you to automatically cluster huge
repositories of text into meaningful topics or
categories for fast information retrieval or
filtering.
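
One simple way to picture clustering is a single pass that puts each document into the first cluster it resembles. The sketch below assumes Jaccard similarity over word sets and a hypothetical threshold of 0.25; real systems usually use weighted vectors and algorithms such as k-means, so this is only an illustration of the idea.

```python
import re


def token_set(text):
    """Represent a document as the set of its lowercase word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(documents, threshold=0.25):
    """Single-pass clustering: join the first cluster whose seed document
    is similar enough, otherwise start a new cluster."""
    clusters = []  # each cluster is a list of document indices
    for index, doc in enumerate(documents):
        tokens = token_set(doc)
        for members in clusters:
            seed = token_set(documents[members[0]])
            if jaccard(tokens, seed) >= threshold:
                members.append(index)
                break
        else:
            clusters.append([index])
    return clusters


docs = [
    "network security policy update",
    "new network security policy",
    "school lunch menu",
    "lunch menu for the school",
]
print(cluster(docs))  # prints [[0, 1], [2, 3]]
```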

CONCEPT EXTRACTION
Concept extraction allows you to extract
concepts from text.
Meaning varies with concept.

SENTIMENT ANALYSIS

Sentiment analysis (also known as opinion mining) refers to the use of
natural language processing, text analysis and computational linguistics
to identify and extract subjective information in source materials.
An important part of our information-gathering behavior has always
been to find out what other people think. With the growing availability
and popularity of opinion-rich resources such as online review sites and
personal blogs, new opportunities and challenges arise as people now
can, and do, actively use information technologies to seek out and
understand the opinions of others. The sudden eruption of activity in the
area of opinion mining and sentiment analysis, which deals with the
computational treatment of opinion, sentiment, and subjectivity in text,
has thus occurred at least in part as a direct response to the surge of
interest in new systems that deal directly with opinions as a first-class
object.
The basic purpose of sentiment analysis is to classify the polarity of any
given text as positive, negative, or neutral, or alternatively to place it
on a star rating or another scale.
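
The polarity classification described above can be sketched with a tiny word lexicon. The word lists and the `polarity` function below are assumptions made for illustration; they ignore negation, sarcasm, and context, which practical systems have to handle.

```python
import re

# Tiny hand-made lexicon, assumed for illustration only.
POSITIVE = {"nice", "cool", "clear", "good", "great"}
NEGATIVE = {"mad", "expensive", "bad", "poor"}


def polarity(text):
    """Classify text as positive, negative, or neutral by counting cue words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("It was such a nice phone."))  # prints positive
```

A star or scale classification would replace the three-way decision with a mapping from the score to a rating band.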

EXAMPLE

(1) I bought an iPhone 2 days ago.
(2) It was such a nice phone.
(3) The touch screen was really cool.
(4) The voice quality was clear too.
(5) However, my mother was mad with me as I did
not tell her before I bought it.
(6) She also thought the phone was too expensive,
and wanted me to return it to the shop.
The first thing that we may notice is that there are
several opinions in this review.

ANALYSIS

Sentences (2), (3) and (4) express three positive opinions, while
sentences (5) and (6) express negative opinions.
Then we also notice that the opinions all have some targets on which they are
expressed.
The opinion in sentence (2) is on iPhone as a whole,
the opinions in sentences (3) and (4) are on the touch screen and voice
quality features of iPhone respectively.
The opinion in sentence (6) is on the price of iPhone, but the opinion/emotion in
sentence (5) is on me, not iPhone.
This is an important point.
In an application, the user may be interested in opinions on certain targets, but
not on all (e.g., unlikely on me).
Finally, we may also notice the sources or holders of opinions.
The source or holder of the opinions in sentences (2), (3) and (4) is the author of
the review (I), but in sentences (5) and (6) it is my mother. With this example in
mind, we can define sentiment

OBJECT AND FEATURE


In general, opinions can be expressed on
any target entity, e.g., a product, a service,
an individual, an organization, or an event.
We use the term object to denote the target
entity that has been commented on.
An object can have a set of components (or
parts) and a set of attributes (or properties)
[1, 4], which we collectively call the features
of the object.
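
The object, feature, polarity, and holder of an opinion can be held in a small record. The sketch below encodes the iPhone review by hand; the `Opinion` class and its field names are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Opinion:
    sentence: int            # sentence number in the review
    obj: str                 # target entity (the "object")
    feature: Optional[str]   # component or attribute; None means the whole object
    polarity: str            # "positive" or "negative"
    holder: str              # who holds the opinion


# The iPhone review, encoded by hand from the analysis in this deck.
OPINIONS = [
    Opinion(2, "iPhone", None, "positive", "author"),
    Opinion(3, "iPhone", "touch screen", "positive", "author"),
    Opinion(4, "iPhone", "voice quality", "positive", "author"),
    Opinion(5, "author", None, "negative", "mother"),
    Opinion(6, "iPhone", "price", "negative", "mother"),
]


def opinions_on(obj, opinions=OPINIONS):
    """Keep only the opinions whose target is the given object."""
    return [o for o in opinions if o.obj == obj]


def polarity_by_feature(obj, opinions=OPINIONS):
    """Map each feature of the object to the polarity expressed on it."""
    return {o.feature or "(whole object)": o.polarity for o in opinions_on(obj, opinions)}


print(polarity_by_feature("iPhone"))
```

Filtering by object is what lets an application ignore sentence (5), whose target is the author rather than the phone.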

TECHNICAL CHALLENGES
Object Identification
Feature grouping and synonym grouping
Opinion orientation classification
Integration
Identification of spam reviews/ documents

CLASSIFICATION
Document-level sentiment analysis;
Sentence-level sentiment analysis;
Aspect-based sentiment analysis;
Comparative sentiment analysis; and,
Sentiment lexicon acquisition.

DOCUMENT SUMMARIZATION
Again, as the name suggests, this text analytics
tool allows you to automatically summarize
documents to retain the most important
points from the original document.
Extraction
Abstraction
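
Of the two approaches, extraction is the easier one to sketch: pick the sentences whose words are most frequent in the document. The stop-word list, the naive sentence splitter, and the scoring rule below are simplifying assumptions; abstraction, which generates new sentences, is not covered.

```python
import re
from collections import Counter

# Small stop-word list, assumed for illustration.
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "and", "to", "in", "it", "on", "for"}


def split_sentences(text):
    """Naive sentence splitter on '.', '!' and '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=1):
    """Score each sentence by the average document-wide frequency of its
    content words and return the top sentences in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]


sample = "The phone has a clear screen. The phone screen is bright. My mother called."
print(summarize(sample))  # prints ['The phone screen is bright.']
```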

SUMMARY

Text analytics is particularly useful for
information retrieval, pattern recognition,
tagging and annotation, information
extraction, sentiment assessment and
predictive analytics.

A REAL TIME PROCESS

SMALL EXAMPLE IN AI

THIS APPROACH WORKS IN CASE OF BOUNDED GROUND

CURATOR ENGINE INTELLIGENCE ENGINES


Domain Intelligence
Extraction Engine
Context Intelligence
Keyword Intelligence
Intent Analysis Engine
Lead Validity Intelligence

[Diagram labels: Positive; Opportunity]

DOMAIN INTELLIGENCE
[Diagram labels: Document; Url & Name; Url / Name pattern (twice); Dmoz / Jigsaw Data; Positive; Negative; Unsure (Both Positive and Negative, Neither Positive Nor Negative)]

Challenges

Insufficient domain knowledge: More elimination can be achieved with
more domain knowledge from source.

Solution

Insufficient domain knowledge: SLED crawler and domain classification
should provide more knowledge.
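
A minimal sketch of pattern-based domain classification, assuming that "Unsure" means the URL or name matched both kinds of pattern or neither. The pattern lists and the `classify_domain` helper are hypothetical; the deck does not list the patterns actually used.

```python
import re

# Hypothetical patterns; the real ones are not given in the deck.
POSITIVE_PATTERNS = [r"\.gov\b", r"\.edu\b", r"procurement"]
NEGATIVE_PATTERNS = [r"careers", r"jobs", r"blog"]


def matches_any(patterns, text):
    """True if at least one pattern occurs in the text, ignoring case."""
    return any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)


def classify_domain(url, name=""):
    """Return 'positive', 'negative', or 'unsure' for a document source."""
    text = f"{url} {name}"
    pos = matches_any(POSITIVE_PATTERNS, text)
    neg = matches_any(NEGATIVE_PATTERNS, text)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return "unsure"  # both matched, or neither matched


print(classify_domain("https://example.edu/procurement/rfp.pdf"))  # prints positive
```

More domain knowledge would show up here as longer, better-targeted pattern lists, which shrinks the "unsure" bucket.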

EXTRACTION ENGINE
[Diagram labels: Document (Old Document, New Document); Parser (Tika and Pdf2Xml, PdfMiner); Text, Xml and Metadata]

Challenges

Non-visible characters: Raise exceptions or cause misinterpretation (2%).
schools is extracted as schools and changes the meaning.
Parser failures: PdfMiner is an accurate parser but fails at times (10%).

Solutions

Parser failures: Using Tika and Pdf2Xml as a combination reduces context
leakage.
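
Combining parsers can be sketched as a fallback chain: try one parser, and move to the next if it fails or returns nothing. The code below does not call Tika, Pdf2Xml, or PdfMiner; `failing_parser` and `working_parser` are stand-ins, and `strip_invisible` is one assumed way to deal with non-visible characters.

```python
import unicodedata


def strip_invisible(text):
    """Drop control and format characters (Unicode categories Cc and Cf),
    keeping newlines and tabs."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )


def parse_with_fallback(path, parsers):
    """Try each (name, parser) pair in order and return the first success
    as (parser_name, cleaned_text, errors)."""
    errors = []
    for name, parser in parsers:
        try:
            text = parser(path)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append((name, str(exc)))
            continue
        if text and text.strip():
            return name, strip_invisible(text), errors
        errors.append((name, "empty output"))
    return None, "", errors


# Stand-in parsers for demonstration only.
def failing_parser(path):
    raise ValueError("cannot parse " + path)


def working_parser(path):
    return "sch\u200bools budget from " + path  # contains a zero-width space


result = parse_with_fallback("a.pdf", [("first", failing_parser), ("second", working_parser)])
print(result)
```

Recording the errors per parser makes it possible to measure failure rates like the percentages quoted above.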

CONTEXT INTELLIGENCE
[Diagram labels: Parser; Document Titles and Headers; Positive; Unsure; Negative]

Challenges

Ambiguous context: Misleads the decision, e.g. a job posting inside an agenda.
Insufficient context: Context is away from the keyword location or missing.

Solutions

Insufficient context: Extract context from various locations, such as
information from source, directory information, domain intelligence,
etc.

KEYWORD INTELLIGENCE
[Diagram labels: Parser; Context Around Keyword; Paragraphs; Bullet Points; Tables]

Challenges

Identification of keyword phrases: Reduces data leakage.
Keyword specific intelligence: Negative extensions, support words, etc.
Examples: free wifi, wireless mouse, network security policy.

Solutions

Keyword specific intelligence: Manually collected for popular keywords.
Use statistical bigram approach for other keywords.
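
One reading of the statistical bigram approach is to count the word pairs in which a keyword appears and keep the frequent ones as candidate phrases. The sketch below follows that reading; the function names, the sample sentences, and the `min_count` threshold are assumptions.

```python
import re
from collections import Counter


def keyword_bigrams(texts, keyword):
    """Count every adjacent word pair that contains the keyword."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for left, right in zip(tokens, tokens[1:]):
            if keyword in (left, right):
                counts[(left, right)] += 1
    return counts


def frequent_phrases(texts, keyword, min_count=2):
    """Return bigrams containing the keyword seen at least min_count times."""
    counts = keyword_bigrams(texts, keyword)
    return [" ".join(pair) for pair, n in counts.most_common() if n >= min_count]


corpus = [
    "Free wifi in the lobby",
    "The library offers free wifi",
    "Upgrade the wifi controller",
]
print(frequent_phrases(corpus, "wifi"))  # prints ['free wifi']
```

A phrase such as "free wifi" surfaced this way could then be reviewed and added to the list of negative extensions for the keyword.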

INTENT ANALYSIS

[Diagram: Context Around Keyword, split into three branches]

Paragraph: Direct Relation, Indirect Relation
Bullet Point: Header Analysis, Bullet Point Analysis
Table: Row Analysis, Header Analysis

INTENT ANALYSIS CHALLENGES

Human Ambiguity

Improved productivity and streamlined IT infrastructure through file
storage capabilities
The plan includes providing sufficient network capacity (this sentence
is present in an analysis document from a writer)

Machine Ambiguity

Authorize a purchase of storage area network equipment (keyword
is network equipment)
The technology director shall enhance awareness regarding network
security

Solution

Experimenting by building probabilistic language models.
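
As an illustration of what a probabilistic language model can do here, the sketch below trains one bigram model on sentences that express buying intent and one on sentences that do not, then compares how well each explains a new sentence. The training sentences, the class name, and the add-one smoothing are assumptions; the deck does not say which models were tried.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Word tokens wrapped in sentence start and end markers."""
    return ["<s>"] + re.findall(r"[a-z]+", text.lower()) + ["</s>"]


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            toks = tokens(sentence)
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab_size = len(self.unigrams) + 1  # one extra slot for unseen words

    def log_prob(self, sentence):
        """Sum of smoothed log probabilities of each word given the previous one."""
        toks = tokens(sentence)
        total = 0.0
        for prev, word in zip(toks, toks[1:]):
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


# Hypothetical training data, two sentences per class.
intent = BigramModel([
    "authorize a purchase of network equipment",
    "approve the purchase of storage equipment",
])
no_intent = BigramModel([
    "enhance awareness regarding network security",
    "improve awareness of security policy",
])

sentence = "authorize the purchase of network equipment"
label = "intent" if intent.log_prob(sentence) > no_intent.log_prob(sentence) else "no intent"
print(label)  # prints intent
```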

INTENT ANALYSIS CHALLENGES

Stanford Mistakes

The Stanford software we are using sometimes builds wrong relations.
Ex: In "IT Infrastructure", IT is identified as "it".

Solution

Replace the keyword with a generic keyword before parsing it with
Stanford. The generic keyword shouldn't spoil the relations.
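
The replacement step can be sketched as a whole-phrase, case-insensitive substitution done before the sentence is handed to the parser. The placeholder word "product" and the `mask_keyword` helper are assumptions; the parser call itself is left out.

```python
import re

PLACEHOLDER = "product"  # hypothetical generic noun; any neutral noun could be used


def mask_keyword(sentence, keyword, placeholder=PLACEHOLDER):
    """Replace whole-phrase, case-insensitive occurrences of the keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)
    return pattern.sub(placeholder, sentence)


masked = mask_keyword(
    "Improved productivity and streamlined IT infrastructure through file storage",
    "IT infrastructure",
)
print(masked)
# prints Improved productivity and streamlined product through file storage
```

Because the placeholder is an ordinary noun, the parser no longer sees "IT" and cannot mistake it for the pronoun "it".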

Indirect buying decision

Information security is recognized as a top management challenge
for the department

OTHER CHALLENGES

Noisy Keywords

Keywords like vmware, firewall and gis contribute lots of noise.
Unavoidable: these keywords also contribute towards positives.

Noisy Domains

Domains like itdashboard.gov contribute lots of noise.
Contributed 22% noise to Tegile leads in June.

Duplicates

Same-domain documents appear multiple times, contributing
to duplicate documents.
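
Duplicates from the same domain can be dropped by keeping one document per domain and content fingerprint. The sketch below assumes documents arrive as (url, text) pairs and treats texts as identical when they match after lowercasing and whitespace normalization; near-duplicates would need a fuzzier comparison.

```python
import hashlib
from urllib.parse import urlparse


def fingerprint(text):
    """Hash the whitespace-normalized, lowercased text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(documents):
    """Keep the first document for each (domain, content fingerprint) pair.

    documents is a list of (url, text) tuples.
    """
    seen = set()
    unique = []
    for url, text in documents:
        key = (urlparse(url).netloc.lower(), fingerprint(text))
        if key not in seen:
            seen.add(key)
            unique.append((url, text))
    return unique


found = [
    ("https://example.gov/a.pdf", "Network  security plan"),
    ("https://example.gov/b.pdf", "network security plan"),
    ("https://other.gov/a.pdf", "network security plan"),
]
print([url for url, _ in drop_duplicates(found)])
# prints ['https://example.gov/a.pdf', 'https://other.gov/a.pdf']
```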

POSITIVE MARKED DOCUMENTS


[Bar chart: positive-marked documents for May and June, broken down into Lost Business, Wrong Context, Rejected by reviewer, and Approved; vertical axis 0% to 45%. Bar labels: 40%, 37%, 32%, 28%, 19%, 13%, 12%, 8%.]

LEAD VALIDITY INTELLIGENCE

[Diagram: False Positives, broken down into Lost Business, Low Budget, Wrong Industry, Too Early, Others]

Challenges

Duplicates
Company specific constraints: Campus Management requires only
higher education leads.
Identifying budget constraints: e.g. < $10k

Solution

Implemented patterns to identify lost businesses.
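
The lost-business patterns themselves are not listed in the deck, so the sketch below illustrates the pattern idea on the budget constraint instead: find dollar amounts in the text and flag the document when the largest one is under a threshold. The regular expression, the helper names, and the default threshold of $10,000 are assumptions.

```python
import re

# A dollar sign, a number with optional commas and decimals, optional k/M suffix.
AMOUNT = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)([kKmM]\b)?")


def dollar_amounts(text):
    """Return every dollar amount in the text as a float."""
    amounts = []
    for number, suffix in AMOUNT.findall(text):
        value = float(number.replace(",", ""))
        if suffix.lower() == "k":
            value *= 1_000
        elif suffix.lower() == "m":
            value *= 1_000_000
        amounts.append(value)
    return amounts


def is_low_budget(text, threshold=10_000):
    """True when the text states amounts and the largest is below the threshold."""
    amounts = dollar_amounts(text)
    return bool(amounts) and max(amounts) < threshold


print(is_low_budget("Budget of $8,500 approved for wireless upgrades"))  # prints True
print(is_low_budget("Authorize up to $25k for storage equipment"))       # prints False
```

Documents with no stated amount are not flagged, since the absence of a figure says nothing about the budget.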

AFTER APPLYING LOST BUSINESS PATTERNS


[Bar chart: Identified L.B versus Not identified L.B for Juniper (55/120), Google (30/150), and Tegile (19/55); vertical axis 0% to 90%. Bar labels: 79%, 65%, 57%, 43%, 35%, 21%.]

CURRENTLY COMPANIES ARE WORKING ON

Probabilistic Language Models

Build semi-supervised language models to handle machine
ambiguity.
Develop a diversified language-based dataset for training.

Driver Based Patterns

Develop patterns specific to the driver word.
E.g.:
Provide: specifies intent of an action
Provides: specifies intent of a solution/service

Keyword Intelligence

Methodologies to derive and handle keyword phrases.
Start with manually adding keyword phrases and slowly
move towards an automated system.

THANK YOU
