Text Analytics

TEXT ANALYTICS
C Sudhakar
CEO
Raskey Software Solutions Ltd

Email: sudhakar@raskeysoft.com
Web: www.raskeysoft.com
SMART ANALYTICS
Start with strategy

Measure Metrics and Data
Apply analytics
Report results
Transform Business
TYPES OF ANALYTICS
Data analytics
Compete on Analytics
Text analytics
Video analytics
Social networking analytics
Web analytics
Speech analytics
TEXT ANALYTICS
Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into useful business intelligence.
Text analysis can now tell us things we did not already know and, perhaps more importantly, had no way of knowing before.
Access to huge text data sets and improved technical capability mean we can now mine text for patterns and trends that can be very useful in business.
TEXT ANALYTICS TASKS INCLUDE

Text categorization
Text clustering
Concept extraction
Sentiment analysis
Document summarization
TEXT CATEGORIZATION
Text categorization applies structure to the text, which can then be used for analysis or query.
It assigns a document to one or more classes or categories according to its subject or other attributes such as document type, author, or creation date.
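A minimal sketch of categorization by subject, assuming a small hand-labeled training set and a multinomial Naive Bayes model with add-one smoothing. The training sentences, category names and function names are illustrative assumptions, and this sketch assigns a single category per document rather than several.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def train(labeled_docs):
    """Count documents and words per category from (text, category) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def categorize(text, model):
    """Return the category with the highest log-probability for the text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, float("-inf")
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[category][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best, best_score = category, score
    return best


# Hypothetical training data, for illustration only.
TRAINING = [
    ("server outage network firewall", "it"),
    ("network storage upgrade purchase", "it"),
    ("student enrollment campus tuition", "education"),
    ("campus library student grants", "education"),
]

model = train(TRAINING)
print(categorize("firewall upgrade for the network", model))
```

Categorizing by attributes such as author or creation date would normally read document metadata instead of the text itself.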
TEXT CLUSTERING
As the name suggests, text clustering automatically groups huge repositories of text into meaningful topics or categories for fast information retrieval or filtering.
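A rough sketch of clustering without labels, assuming a single-pass greedy scheme that compares word sets with Jaccard similarity. The threshold value and the sample documents are assumptions; production systems typically use weighted vectors and algorithms such as k-means.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.2):
    """Assign each document to the most similar existing cluster,
    or start a new cluster when no similarity reaches the threshold."""
    clusters = []  # each cluster: {"terms": set of words, "docs": list of texts}
    for doc in docs:
        terms = set(doc.lower().split())
        best, best_sim = None, threshold
        for candidate in clusters:
            sim = jaccard(terms, candidate["terms"])
            if sim >= best_sim:
                best, best_sim = candidate, sim
        if best is None:
            clusters.append({"terms": set(terms), "docs": [doc]})
        else:
            best["terms"] |= terms
            best["docs"].append(doc)
    return [c["docs"] for c in clusters]


# Illustrative documents only.
DOCS = [
    "network security policy update",
    "campus tuition fees increase",
    "new network security audit",
    "tuition fees and campus housing",
]

for group in cluster(DOCS):
    print(group)
```

With these four documents the network security texts end up in one group and the tuition texts in another.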
CONCEPT EXTRACTION
Concept extraction identifies the concepts expressed in a text.
Meaning varies with concept
SENTIMENT ANALYSIS
Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials.
An important part of our information-gathering behavior has always
been to find out what other people think. With the growing availability
and popularity of opinion-rich resources such as online review sites and
personal blogs, new opportunities and challenges arise as people now
can, and do, actively use information technologies to seek out and
understand the opinions of others. The sudden eruption of activity in the
area of opinion mining and sentiment analysis, which deals with the
computational treatment of opinion, sentiment, and subjectivity in text,
has thus occurred at least in part as a direct response to the surge of
interest in new systems that deal directly with opinions as a first-class
object.
The basic purpose of sentiment analysis is to classify the polarity of a given text as positive, negative, or neutral. The result can also be expressed as a star rating or on a scale.
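A minimal sketch of polarity classification, assuming a tiny hand-made word lexicon. The word lists are illustrative assumptions; real lexicons are far larger, and this approach ignores negation and context.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"nice", "cool", "clear", "good", "great"}
NEGATIVE = {"mad", "expensive", "bad", "poor"}


def polarity(text):
    """Classify text as positive, negative or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("It was such a nice phone."))
print(polarity("She also thought the phone was too expensive."))
print(polarity("I bought an iPhone 2 days ago."))
```

The same score could be mapped onto a star rating or a numeric scale instead of three labels.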
EXAMPLE
(1) I bought an iPhone 2 days ago.
(2) It was such a nice phone.
(3) The touch screen was really cool.
(4) The voice quality was clear too.
(5) However, my mother was mad with me as I did not tell her before I bought it.
(6) She also thought the phone was too expensive, and wanted me to return it to the shop.
The first thing that we may notice is that there are several opinions in this review.
ANALYSIS
Sentences (2), (3) and (4) express three positive opinions, while
sentences (5) and (6) express negative opinions.
Then we also notice that the opinions all have some targets on which they are
expressed.
The opinion in sentence (2) is on iPhone as a whole,
the opinions in sentences (3) and (4) are on the touch screen and voice
quality features of iPhone respectively.
The opinion in sentence (6) is on the price of iPhone, but the opinion/emotion in
sentence (5) is on me, not iPhone.
This is an important point.
In an application, the user may be interested in opinions on certain targets, but
not on all (e.g., unlikely on me).
Finally, we may also notice the sources or holders of opinions.
The source or holder of the opinions in sentences (2), (3) and (4) is the author of the review (I), but in sentences (5) and (6) it is my mother.
With this example in mind, we can define sentiment
OBJECT AND FEATURE

In general, opinions can be expressed on
any target entity, e.g., a product, a service,
an individual, an organization, or an event.
We use the term object to denote the target
entity that has been commented on.
An object can have a set of components (or
parts) and a set of attributes (or properties)
[1, 4], which we collectively call the features
of the object.
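As a sketch, the opinions found in the iPhone review can be recorded as structured records holding object, feature, polarity and holder. The class name, field names and the "GENERAL" marker for an opinion on the object as a whole are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    sentence: int   # sentence number in the review
    target: str     # object the opinion is about
    feature: str    # component or attribute; "GENERAL" means the whole object
    polarity: str   # "positive" or "negative"
    holder: str     # who holds the opinion


# Opinions as read off the example review, sentences (2) to (6).
REVIEW_OPINIONS = [
    Opinion(2, "iPhone", "GENERAL", "positive", "author"),
    Opinion(3, "iPhone", "touch screen", "positive", "author"),
    Opinion(4, "iPhone", "voice quality", "positive", "author"),
    Opinion(5, "me", "GENERAL", "negative", "my mother"),
    Opinion(6, "iPhone", "price", "negative", "my mother"),
]


def opinions_on(target, opinions):
    """Keep only the opinions expressed on the given target object."""
    return [o for o in opinions if o.target == target]


def polarity_counts(opinions):
    """Count opinions per polarity."""
    return Counter(o.polarity for o in opinions)


iphone_opinions = opinions_on("iPhone", REVIEW_OPINIONS)
print(polarity_counts(iphone_opinions))
```

Filtering by target drops the opinion on "me" in sentence (5), which an application interested in the product would not want.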
TECHNICAL CHALLENGES
Object Identification
Feature grouping and synonym grouping
Opinion orientation classification
Integration
Identification of spam reviews/ documents
CLASSIFICATION
Document-level sentiment analysis;
Sentence-level sentiment analysis;
Aspect-based sentiment analysis;
Comparative sentiment analysis; and,
Sentiment lexicon acquisition.
DOCUMENT SUMMARIZATION
Again, as the name suggests, this text analytics tool automatically summarizes documents, retaining the most important points from the original document.
Extraction
Abstraction
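A minimal sketch of the extraction approach, assuming sentences are scored by the average frequency of their non-stopword terms and the top-scoring sentences are kept in their original order. The stopword list and sample text are illustrative assumptions; abstraction, which generates new sentences, is not covered here.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "from", "in", "is", "of", "on", "the", "to", "was"}


def content_words(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=1):
    """Pick the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original order
    return " ".join(sentences[i] for i in keep)


SAMPLE = (
    "Text analytics extracts information from text. "
    "The weather was pleasant. "
    "Analytics on text supports business decisions."
)

print(summarize(SAMPLE, max_sentences=1))
```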
SUMMARY
Text analytics is particularly useful for information retrieval, pattern recognition, tagging and annotation, information extraction, sentiment assessment and predictive analytics.
A REAL TIME PROCESS
SMALL EXAMPLE IN AI
THIS APPROACH WORKS IN CASE OF BOUNDED GROUND
CURATOR ENGINE INTELLIGENCE ENGINES
Domain Intelligence
Extraction Engine
Context Intelligence
Keyword Intelligence
Intent Analysis Engine
Lead Validity Intelligence
Positive Opportunity
DOMAIN INTELLIGENCE
Diagram labels: Document Url & Name; Url / Name pattern; Dmoz / Jigsaw Data; Positive; Negative; Unsure (Both Positive and Negative, or Neither Positive Nor Negative).
Challenges
Insufficient domain knowledge: more elimination can be achieved with more domain knowledge from the source.
Solution
Insufficient domain knowledge: the SLED crawler and domain classification should provide more knowledge.
EXTRACTION ENGINE
Diagram labels: Document (Old Document, New Document); Parser (Tika and Pdf2Xml); Text, Xml and Metadata.
Challenges
Non-visible characters: raise exceptions or cause misinterpretation (2%).
Example (PdfMiner): schools is extracted as schools and changes the meaning.
Parser failures: PdfMiner is an accurate parser but fails at times (10%).
Solutions
Parser failures: using Tika and Pdf2Xml in combination reduces context leakage.
CONTEXT INTELLIGENCE
Diagram labels: Parser; Document Titles and Headers; Positive; Unsure; Negative.
Challenges
Ambiguous context: misleads the decision, e.g. a job posting inside an agenda.
Insufficient context: the context is away from the keyword location or missing.
Solutions
Insufficient context: extract context from various locations, such as information from the source, directory information, domain intelligence, etc.
KEYWORD INTELLIGENCE
Diagram labels: Parser; Context Around Keyword; Paragraphs; Bullet Points; Tables.
Challenges
Identification of keyword phrases: reduces data leakage.
Keyword specific intelligence: negative extensions, support words etc., e.g. free wifi, wireless mouse, network security policy.
Solutions
Keyword specific intelligence: manually collected for popular keywords. Use a statistical bigram approach for other keywords.
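The slides do not say which bigram statistic is used. As one possible reading, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI), so that pairs such as "free wifi" surface as candidate keyword phrases. The corpus, the minimum count and the function name are assumptions.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=2):
    """Rank bigrams seen at least min_count times by PMI, highest first."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative corpus only.
CORPUS = [
    "free wifi in the lobby",
    "guests get free wifi",
    "buy a wireless mouse",
    "the wireless mouse is new",
    "upgrade the wifi network",
]

ranked = bigram_pmi(CORPUS)
for pair, pmi in ranked:
    print(" ".join(pair), round(pmi, 2))
```

Phrases found this way could then be reviewed and added to the manually collected keyword intelligence.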
INTENT ANALYSIS
Diagram labels: Context Around Keyword; Paragraph (Direct Relation, Indirect Relation); Bullet Point (Header Analysis, Bullet Point Analysis); Table (Row Analysis, Header Analysis).
INTENT ANALYSIS CHALLENGES
Human Ambiguity
Improved productivity and streamlined IT infrastructure through file storage capabilities
The plan includes providing sufficient network capacity (this sentence is present in an analysis document from a writer)
Machine Ambiguity
Authorize a purchase of storage area network equipment (keyword is network equipment)
The technology director shall enhance awareness regarding network security
Solution
Experimenting by building probabilistic language models.
INTENT ANALYSIS CHALLENGES
Stanford Mistakes
The Stanford software we are using sometimes builds wrong relations.
Ex: in IT Infrastructure, IT is identified as it.
Solution
Replace the keyword with a generic keyword before parsing it with Stanford. The generic keyword shouldn't spoil the relations.
Indirect buying decision
Information security is recognized as a top management challenge for the department
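The keyword replacement workaround described above for the Stanford parsing mistakes can be sketched as a masking step that runs before parsing. The placeholder word "product" and the function name are assumptions, and the call to the Stanford software itself is left out.

```python
import re

# Hypothetical generic keyword: an ordinary noun the parser handles well.
PLACEHOLDER = "product"


def mask_keyword(sentence, keyword, placeholder=PLACEHOLDER):
    """Replace whole-phrase, case-insensitive matches of keyword with placeholder."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return pattern.sub(placeholder, sentence)


masked = mask_keyword("Upgrade the IT Infrastructure this year",
                      "IT Infrastructure")
print(masked)
```

After parsing, relations that involve the placeholder would be read as relations on the original keyword.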
OTHER CHALLENGES
Noisy Keywords
Keywords like vmware, firewall and gis contribute lots of noise.
Unavoidable: these keywords also contribute towards positives.
Noisy Domains
Domains like itdashboard.gov contribute lots of noise.
Contributed 22% noise to Tegile leads in June.
Duplicates
Documents from the same domain appear multiple times, contributing to duplicate documents.
POSITIVE MARKED DOCUMENTS
Bar chart (May vs. June): Lost Business, Wrong Context, Rejected by reviewer, Approved.
LEAD VALIDITY INTELLIGENCE
Diagram labels: False Positives; Lost Business; Low Budget; Wrong Industry; Too Early; Others; Duplicates.
Challenges
Company specific constraints: Campus Management requires only higher education leads.
Identifying budget constraints, e.g. < $10k
Solution
Implemented patterns to identify lost businesses.
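The actual patterns are not shown in the slides. The sketch below only illustrates the idea with two hypothetical lost-business phrases and a dollar-amount check against the $10k budget example. Every pattern, name and threshold here is an assumption.

```python
import re

# Hypothetical phrases suggesting the purchase has already been made.
LOST_BUSINESS_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"\bcontract (was )?awarded to\b",
        r"\bpurchase (was )?completed\b",
    )
]

# Dollar amounts such as $8k, $8,500 or $2m.
AMOUNT = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)\s*(k|m)?\b", re.IGNORECASE)


def amounts(text):
    """Return all dollar amounts found in text as floats."""
    found = []
    for match in AMOUNT.finditer(text):
        value = float(match.group(1).replace(",", ""))
        suffix = (match.group(2) or "").lower()
        value *= {"k": 1_000, "m": 1_000_000}.get(suffix, 1)
        found.append(value)
    return found


def rejection_reason(text, min_budget=10_000):
    """Return 'lost business', 'low budget', or None if the lead still looks valid."""
    if any(p.search(text) for p in LOST_BUSINESS_PATTERNS):
        return "lost business"
    found = amounts(text)
    if found and max(found) < min_budget:
        return "low budget"
    return None


print(rejection_reason("The contract was awarded to Acme for $250,000."))
print(rejection_reason("Budget of $8k approved for new switches."))
print(rejection_reason("Seeking bids for a $50,000 storage upgrade."))
```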
AFTER APPLYING LOST BUSINESS PATTERNS
Bar chart (Identified L.B vs. Not identified L.B): Juniper (55/120), Google (30/150), Tegile (19/55).
CURRENTLY COMPANIES ARE WORKING ON
Probabilistic Language Models
Build semi-supervised language models to handle machine ambiguity.
Develop a diversified language based dataset for training.
Driver Based Patterns
Develop patterns specific to the driver word. Eg:
Provide: specifies intent of an action
Provides: specifies intent of solution/service
Keyword Intelligence
Methodologies to derive and handle keyword phrases.
Start with manually adding keyword phrases and slowly move towards an automated system.
THANK YOU
