Semantic Search: With Contributions From Thanh Tran (KIT)

Semantic Search
Peter Mika
Senior Research Scientist
Yahoo! Research
With contributions from Thanh Tran (KIT)
- 2 -
Yahoo! serves over 700 million users in 25 countries
- 3 -
Yahoo! Research: visit us at research.yahoo.com
- 4 -
Yahoo! Research Barcelona
Established January, 2006
Led by Ricardo Baeza-Yates
Research areas
Web Mining
content, structure, usage
Social Media
Distributed Systems
Semantic Search

- 5 -
Search is really fast, without necessarily being intelligent
- 6 -
Why Semantic Search? Part I
Improvements in IR are harder and harder to come by
Machine learning using hundreds of features
Text-based features for matching
Graph-based features provide authority
Heavy investment in computational power, e.g. real-time
indexing and instant search
Remaining challenges are not computational, but in
modeling user cognition
Need a deeper understanding of the query, the content and/or
the world at large
Could Watson explain why the answer is Toronto?

- 7 -
Poorly solved information needs
Multiple interpretations
paris hilton
Long tail queries
george bush (and I mean the beer brewer in Arizona)
Multimedia search
paris hilton sexy
Imprecise or overly precise searches
jim hendler
pictures of strong adventures people
Searches for descriptions
countries in africa
32 year old computer scientist living in barcelona
reliable digital camera under 300 dollars
Many of these queries would not be asked by users, who learned over time what search technology can and cannot do.
- 8 -
Example: multiple interpretations

- 9 -
Why Semantic Search? Part II
The Semantic Web is now a reality
Large amounts of RDF data
Heterogeneous schemas, quality
Users who are not skilled in writing
complex queries (e.g. SPARQL)
and may not be experts in the
domain

Searching data instead of or in
addition to searching documents
Direct answers
Novel search tasks
- 10 -
Example: direct answers in search
Information box with content from and links to Yahoo! Travel
Points of interest in Vienna, Austria
Since Aug, 2010, regular search results are "Powered by Bing"
Faceted search for Shopping results
Information from the Knowledge Graph
- 11 -
Novel search tasks
Aggregation of search results
e.g. price comparison across websites
Analysis and prediction
e.g. world temperature by 2020
Semantic profiling
Ontology-based modeling of user interests
Semantic log analysis
Linking query and navigation logs to ontologies
Support for complex tasks (search apps)
e.g. booking a vacation using a combination of services

- 13 -
Interactive search and task completion
- 14 -
Why Semantic Search? Part III
There is a use case
Consumers want to understand content
Publishers want consumers to understand their content
Semantic Web standards seem to be a good fit

http://en.wikipedia.org/wiki/Underpants_Gnomes
- 16 -
Example: Facebook's Like and the Open Graph Protocol
The Like button provides publishers with a way to promote
their content on Facebook and build communities
Shows up in profiles and news feed
Site owners can later reach users who have liked an object
Facebook Graph API allows 3rd party developers to access the data
Open Graph Protocol is an RDFa-based format that allows publishers
to describe the object that the user Likes

- 17 -
Example: Facebook's Open Graph Protocol
RDF vocabulary to be used in conjunction with RDFa
Simplify the work of developers by restricting the freedom in RDFa
Activities, Businesses, Groups, Organizations, People, Places,
Products and Entertainment
Only HTML <head> accepted
http://opengraphprotocol.org/

<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
<meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
</head> ...
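A minimal sketch of how a consumer might read such markup, using only Python's standard `html.parser`. The class name `OpenGraphParser` and the helper `extract_open_graph` are illustrative assumptions, not part of the Open Graph Protocol or of any Facebook API. A production consumer would use a full RDFa parser.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        if prop.startswith("og:") and "content" in attr_map:
            self.properties[prop] = attr_map["content"]


def extract_open_graph(html):
    """Return a dict of og: properties found in the given HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


page = """<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head></html>"""

og = extract_open_graph(page)
```

Here `og` maps each `og:` property name to its `content` value, which is the flat attribute-value view that the restricted RDFa usage is meant to make easy.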
- 18 -
Example: schema.org
Agreement on a shared set of schemas for common types of
web content
Bing, Google, and Yahoo! as initial supporters
Similar in intent to sitemaps.org (2006)
Use a single format to communicate the same information to all
three search engines
Support for microdata
schema.org covers areas of interest to all search engines
Business listings (local), creative works (video), recipes,
reviews
User defined extensions
Each search engine continues to develop its products

- 19 -

Documentation and OWL ontology
- 20 -
Current state of metadata on the Web
31% of webpages, 5% of domains contain some
metadata
Analysis of the Bing Crawl (US crawl, January, 2012)
RDFa is most common format
By URL: 25% RDFa, 7% microdata, 9% microformat
By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
Adoption is stronger among large publishers
Especially for RDFa and microdata
See also
P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus,
LDOW 2012
H. Mühleisen, C. Bizer. Web Data Commons - Extracting Structured
Data from Two Large Web Corpora, LDOW 2012
- 21 -
Exponential growth in RDFa data
Percentage of URLs with embedded metadata in various formats
Five-fold increase between March, 2009 and October, 2010
Another five-fold increase between October, 2010 and January, 2012
Semantic Search
- 23 -
Semantic Search: a definition
Semantic search is a retrieval paradigm that
Makes use of the structure of the data or explicit schemas
to understand user intent and the meaning of content
Exploits this understanding at some part of the search
process
Web search vs. vertical/enterprise/desktop search
Related fields:
XML retrieval
Keyword search in databases
Natural Language Retrieval
- 24 -
Semantics at every step of the IR process
[Figure: the IR process between the user, the IR engine and the Web. The user's question ("bla bla bla?") becomes a query (q=bla * 3). Steps labelled: crawling and indexing, indexing, ranking (q,d), query interpretation, result presentation]
Crawling and Indexing
- 26 -
Data on the Web
Most web pages on the Web are generated from structured
data
Data is stored in relational databases (typically)
Queried through web forms
Presented as tables or simply as unstructured text
The structure and semantics (meaning) of the data is not
directly accessible to search engines
Two solutions
Extraction using Information Extraction (IE) techniques
(implicit metadata)
Supervised vs. unsupervised methods
Relying on publishers to expose structured data using standard
Semantic Web formats (explicit metadata)
Particularly interesting for long tail content

- 27 -
Information Extraction methods
Natural Language Processing
Extraction of triples
Suchanek et al. YAGO: A Core of Semantic Knowledge
Unifying WordNet and Wikipedia, WWW, 2007.
Wu and Weld. Autonomously Semantifying Wikipedia, CIKM
2007.
Filling web forms automatically (form-filling)
Madhavan et al. Google's Deep-Web Crawl. VLDB 2008
Extraction from HTML tables
Cafarella et al. WebTables: Exploring the Power of Tables on
the Web. VLDB 2008
Wrapper induction
Kushmerick et al. Wrapper Induction for Information
Extraction. IJCAI 1997

- 28 -
Semantic Web
Sharing data across the Web
Publish information in standard formats (RDF, RDFa)
Share the meaning using powerful, logic-based languages
(OWL, RIF)
Query using standard languages and protocols (HTTP, SPARQL)
Two main forms of publishing
Linked Data
Data published as RDF documents linked to other RDF documents
and/or using SPARQL end-points
Community effort to re-publish large public datasets (e.g. DBpedia,
open government data)
RDFa
Data embedded inside HTML pages
Recommended for site owners by Yahoo, Google, Facebook
- 29 -
Crawling the Semantic Web
Linked Data
Similar to HTML crawling, but the crawler needs to parse
RDF/XML (and others) to extract URIs to be crawled
Semantic Sitemap/VOID descriptions
RDFa
Same as HTML crawling, but data is extracted after crawling
Mika et al. Investigating the Semantic Gap through Query Log
Analysis, ISWC 2010.
SPARQL endpoints
Endpoints are not linked, need to be discovered by other
means
Semantic Sitemap/VOID descriptions
- 30 -
Data fusion
Ontology matching
Widely studied in Semantic Web research, see e.g. list of
publications at ontologymatching.org
Unfortunately, not much of it is applicable in a Web context due to the
quality of ontologies
Entity resolution
Logic-based approaches in the Semantic Web
Studied as record linkage in the database literature
Machine learning based approaches, focusing on attributes
Graph-based approaches (see e.g. the work of Lise Getoor) are
applicable to RDF data
Improvements over only attribute based matching
Blending
Merging objects that represent the same real world entity and
reconciling information from multiple sources

- 31 -
Data quality assessment and curation
Heterogeneity, quality of data is an even larger issue
Quality ranges from well-curated data sets (e.g. Freebase) to
microformats
In the worst of cases, the data becomes a graph of words
Short amounts of text: prone to mistakes in data entry or
extraction
Example: mistake in a phone number or state code
Quality assessment and data curation
Quality varies from data created by experts to user-generated
content
Automated data validation
Against known-good data or using triangulation
Validation against the ontology or using probabilistic models
Data validation by trained professionals or crowdsourcing
Sampling data for evaluation
Curation based on user feedback
- 32 -
Indexing
Search requires matching and ranking
Matching selects a subset of the elements to be scored
The goal of indexing is to speed up matching
Retrieval needs to be performed in milliseconds
Without an index, retrieval would require streaming through the
collection
The type of index depends on the query model to support
DB-style indexing
IR-style indexing

- 33 -
IR-style indexing
Index data as text
Create virtual documents from data
One virtual document per subgraph, resource or triple
typically: resource
Key differences to Text Retrieval
RDF data is structured
Minimally, queries on property values are required
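A small sketch of the virtual-document idea, assuming the common choice of one document per resource. The function name `build_virtual_documents` and the `ex:` identifiers are made up for illustration. Each property becomes a field, so queries on property values remain possible.

```python
from collections import defaultdict


def build_virtual_documents(triples):
    """Group (subject, property, value) triples into one fielded document per resource."""
    docs = defaultdict(lambda: defaultdict(list))
    for subject, prop, value in triples:
        docs[subject][prop].append(value)
    # Convert nested defaultdicts to plain dicts for a stable, printable result.
    return {subject: dict(fields) for subject, fields in docs.items()}


# Illustrative data only.
triples = [
    ("ex:bob", "name", "Bob"),
    ("ex:bob", "knows", "Alice"),
    ("ex:sunset", "title", "Beautiful Sunset"),
    ("ex:sunset", "creator", "Bob"),
]
docs = build_virtual_documents(triples)
```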

- 34 -

Horizontal index structure
Two fields (indices): one for terms, one for properties
For each term, store the property on the same position in the
property index
Positions are required even without phrase queries
Query engine needs to support the alignment operator
Dictionary is number of unique terms + number of
properties
Occurrences is number of tokens * 2
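One way to picture the horizontal layout is the toy in-memory sketch below. It assumes whitespace tokenization and plain Python dictionaries in place of a real index. The names `build_horizontal_index` and `aligned_match` are hypothetical. Every token is written twice, once per field, which is why the occurrence count doubles.

```python
from collections import defaultdict


def build_horizontal_index(docs):
    """docs: {doc_id: [(property, text), ...]}.
    Returns a term index and a property index that share token positions."""
    term_index = defaultdict(lambda: defaultdict(set))
    prop_index = defaultdict(lambda: defaultdict(set))
    for doc_id, fields in docs.items():
        position = 0
        for prop, text in fields:
            for token in text.lower().split():
                term_index[token][doc_id].add(position)
                # Store the property at the same position as the term.
                prop_index[prop][doc_id].add(position)
                position += 1
    return term_index, prop_index


def aligned_match(term_index, prop_index, term, prop):
    """Alignment operator: documents where `term` occurs at a position labelled `prop`."""
    term_postings = term_index.get(term, {})
    prop_postings = prop_index.get(prop, {})
    return sorted(
        doc_id
        for doc_id, positions in term_postings.items()
        if positions & prop_postings.get(doc_id, set())
    )


docs = {
    "d1": [("name", "Bob"), ("title", "Beautiful Sunset")],
    "d2": [("name", "Sunset Smith"), ("title", "Bob")],
}
term_index, prop_index = build_horizontal_index(docs)
name_bob = aligned_match(term_index, prop_index, "bob", "name")
```

Both documents contain the term "bob", but only in `d1` does it sit at a position that the property index labels `name`, so positions are needed even when no phrase query is involved.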

- 35 -

Vertical index structure
One field (index) per property
Positions are not required
But useful for phrase queries
Query engine needs to support fields
Dictionary is number of unique terms
Occurrences is number of tokens
Number of fields is a problem for merging, query performance
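For contrast, a toy sketch of the vertical layout under the same assumptions (whitespace tokenization, in-memory dictionaries, hypothetical names `build_vertical_index` and `fielded_match`). There is one inverted index per property, so no positions are needed to answer a query on a property value.

```python
from collections import defaultdict


def build_vertical_index(docs):
    """docs: {doc_id: [(property, text), ...]}.
    Returns {property: {term: set(doc_ids)}}, i.e. one inverted index per field."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, fields in docs.items():
        for prop, text in fields:
            for token in text.lower().split():
                index[prop][token].add(doc_id)
    return index


def fielded_match(index, prop, term):
    """Documents whose field `prop` contains `term`."""
    return sorted(index.get(prop, {}).get(term, set()))


docs = {
    "d1": [("name", "Bob"), ("title", "Beautiful Sunset")],
    "d2": [("name", "Sunset Smith"), ("title", "Bob")],
}
index = build_vertical_index(docs)
name_bob = fielded_match(index, "name", "bob")
```

The number of top-level entries in `index` equals the number of distinct properties, which hints at why many properties become a problem for merging and query performance.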
- 36 -
Distributed indexing
MapReduce is ideal for building inverted indices
Map creates (term, {doc1}) pairs
Reduce collects all docs for the same term: (term, {doc1,
doc2})
Sub-indices are merged separately
Term-partitioned indices
Peter Mika. Distributed Indexing for Semantic Search,
SemSearch 2010.
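The map and reduce steps can be imitated in a single process, as in the sketch below. This is only an illustration of the data flow, not the implementation described in the cited paper. The function names and the in-memory shuffle are assumptions.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in the document."""
    return [(term, doc_id) for term in sorted(set(text.lower().split()))]


def reduce_phase(term, doc_ids):
    """Collect all documents for one term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_inverted_index(collection):
    """collection: {doc_id: text}. Simulates map, shuffle and reduce in one process."""
    # Shuffle step: group the mapper output by key (the term).
    grouped = defaultdict(list)
    for doc_id, text in collection.items():
        for term, emitted_doc in map_phase(doc_id, text):
            grouped[term].append(emitted_doc)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


collection = {"doc1": "shared apartment Berlin", "doc2": "Berlin Alice"}
index = build_inverted_index(collection)
```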

Query Processing
- 38 -
What is search?
The search problem
A data collection consisting of a set of items (units of retrieval)
Information needs expressed as queries
Ambiguity in the interpretation of the data and/or the queries
Search is the task of efficiently finding items that are relevant
to the information need
Query processing mainly focuses on efficiency of matching
whereas ranking deals with degree of matching (relevance)!

- 40 -
Types of data models (1)
Textual
Bag-of-words
Represent documents, text in structured data, real-world
objects (captured as structured data)
Lacks structure
Text structure, e.g. linguistic structure, outlines, hyperlinks etc.
Structure in structured data representation

[Figure: bag-of-words example. Source text: "In combination with Cloud Computing technologies, promising solutions for the management of `big data' have emerged. Existing industry solutions are able to support complex queries and analytics tasks with terabytes of data. For example, using a Greenplum." Extracted terms: combination, Cloud, Computing, Technologies, solutions, management, `big data', industry, solutions, support, complex. Each entry is a term (statistics)]
- 41 -
Graph structure
Relationships in the data
Hyperlinks
Typed relationships
Ontology
Bob
Person
creator
Picture
- 42 -
Hybrid
RDF data embedded in text (RDFa)
- 43 -
Formalisms for querying semantic data (1)
Example information need
Information about a friend of Alice, who shared
an apartment with her in Berlin and knows
someone working at KIT.
- 44 -
Unstructured
NL
Keywords
apartment Berlin Alice shared
- 45 -
Fully-structured
SPARQL: BGP, filter, optional, union, select, construct, ask, describe
PREFIX ns: <http://example.org/ns#>
SELECT ?x
WHERE { ?x ns:knows ?y . ?y ns:name "Alice" .
        ?x ns:knows ?z . ?z ns:works ?v . ?v ns:name "KIT" }
- 46 -
Hybrid: both content and structure constraints
?x ns:knows ?y . ?y ns:name "Alice" .
?x ns:knows ?z . ?z ns:works ?v .
?v ns:name "KIT"
shared apartment Berlin Alice
- 47 -
Summary: data and queries in Semantic Search
Query
Keywords
NL Questions
Form- / facet-based Inputs
Structured Queries (SPARQL)
Data
OWL ontologies with rich, formal semantics
Structured RDF data
Semi-structured RDF data
RDF data embedded in text (RDFa)
Matching
Ambiguities
Ambiguities: confidence degree, truth/trust value
Semantic Search targets different groups of users, information needs, and types of data. Query processing for semantic search is a hybrid combination of techniques!
- 48 -
Processing hybrid graph patterns (1)
[Figure: example data graph. Alice's description reads: "Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:" Other labels in the graph: trouble with bob, Bob, sunset.jpg, Beautiful Sunset, Thanh, KIT, Germany, Semantic Search, 2009, Germany, Peter, FluidOps, 34]
?y ns:name "Alice" . ?x ns:knows ?y
apartment shared Berlin Alice
?x ns:knows ?z . ?z ns:works ?v . ?v ns:name "KIT"
Information about a friend of Alice, who shared an apartment with
her in Berlin and knows someone working at KIT.
- 49 -
Matching keyword query against text
Retrieve documents
Inverted list (inverted index)
keyword {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...}
AND-semantics: top-k join

[Figure: the inverted lists for "shared", "Berlin" and "Alice" are joined on document ID. D1 appears in all three lists and matches the query "shared berlin alice"]
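A sketch of AND-semantics over inverted lists. It assumes each posting carries a single precomputed score and that scores are combined by summation. Both are illustrative choices. This version intersects the full lists and then selects the k best. A real top-k join would stop reading the lists early. The name `and_query` is hypothetical.

```python
import heapq


def and_query(index, keywords, k):
    """index: {keyword: {doc_id: score}}.
    Returns up to k doc ids that contain all keywords, best summed score first."""
    postings = [index.get(word, {}) for word in keywords]
    if not postings:
        return []
    # Start from the shortest list so the intersection does the least work.
    postings.sort(key=len)
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates.intersection_update(posting)
    scored = [(sum(p[doc] for p in postings), doc) for doc in candidates]
    return [doc for _, doc in heapq.nlargest(k, scored)]


# Illustrative scores only.
index = {
    "shared": {"D1": 1.0, "D2": 0.5},
    "berlin": {"D1": 2.0, "D3": 1.0},
    "alice": {"D1": 0.5, "D2": 2.0},
}
hits = and_query(index, ["shared", "berlin", "alice"], 10)
```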
- 50 -
Matching structured query against structured data
Retrieve data for triple patterns
Index on tables
Multiple redundant indexes to cover different access patterns
Join (conjunction of triples)
Blocking, e.g. linear merge join (requires sorted input)
Non-blocking, e.g. symmetric hash-join
Materialized join indexes

[Figure: for the query ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name KIT, the pattern Per1 ns:works ?v is answered from the SP-index and ?v ns:name KIT from the PO-index. Joining on ?v gives Per1 ns:works Ins1 and Ins1 ns:name KIT]
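The lookup-then-join pattern can be sketched as follows. It assumes two of the possible redundant indexes, SP and PO, held as in-memory dictionaries, and a simple build-and-probe hash join, which is blocking on its build side rather than symmetric. The triples beyond Per1, Ins1 and KIT are invented for the example.

```python
from collections import defaultdict


def build_indexes(triples):
    """Build two redundant indexes:
    (subject, predicate) -> objects and (predicate, object) -> subjects."""
    sp_index, po_index = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        sp_index[(s, p)].add(o)
        po_index[(p, o)].add(s)
    return sp_index, po_index


def hash_join(left, right, key):
    """Join two lists of variable bindings (dicts) on the shared variable `key`."""
    table = defaultdict(list)
    for row in left:  # build phase
        table[row[key]].append(row)
    # Probe phase: merge every matching pair of bindings.
    return [
        {**left_row, **right_row}
        for right_row in right
        for left_row in table.get(right_row[key], [])
    ]


triples = [
    ("Per1", "ns:works", "Ins1"),
    ("Per2", "ns:works", "Ins2"),
    ("Ins1", "ns:name", "KIT"),
    ("Ins2", "ns:name", "FluidOps"),
]
sp_index, po_index = build_indexes(triples)

# Per1 ns:works ?v : subject and predicate are bound, so use the SP-index.
works = [{"?v": o} for o in sp_index[("Per1", "ns:works")]]
# ?v ns:name "KIT" : predicate and object are bound, so use the PO-index.
named_kit = [{"?v": s} for s in po_index[("ns:name", "KIT")]]
result = hash_join(works, named_kit, "?v")
```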
- 51 -
Matching keyword query against structured data
Retrieve keyword elements
Using inverted index
keyword {<el1, score, ...>, <el2, score, ...>, ...}
Exploration / Join
Data indexes for triple lookup
Materialized index (paths up to graphs)
Top-k Steiner tree search, top-k subgraph exploration

[Figure: the keywords Alice, Bob and KIT are matched to data elements and connected through the triples below]
Alice ns:knows Bob
Bob ns:works Inst1
Inst1 ns:name KIT
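The exploration step can be illustrated with a plain breadth-first search that connects two keyword elements. This is a deliberately simplified stand-in. It finds one shortest connecting path between two nodes, not a top-k Steiner tree over many keywords, and the function name is hypothetical.

```python
from collections import defaultdict, deque


def shortest_connection(triples, start, goal):
    """Breadth-first exploration over triples, following edges in both directions.
    Returns the list of triples on a shortest path, or None if unconnected."""
    neighbours = defaultdict(list)
    for s, p, o in triples:
        neighbours[s].append((o, (s, p, o)))
        neighbours[o].append((s, (s, p, o)))
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for next_node, triple in neighbours[node]:
            if next_node not in visited:
                visited.add(next_node)
                queue.append((next_node, path + [triple]))
    return None


triples = [
    ("Alice", "ns:knows", "Bob"),
    ("Bob", "ns:works", "Inst1"),
    ("Inst1", "ns:name", "KIT"),
]
path = shortest_connection(triples, "Alice", "KIT")
```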
- 52 -
Matching structured query against text
Offline IE
Online IE, i.e., retrieval proceeds as follows
Derive keywords to retrieve relevant documents
On-the-fly information extraction, i.e., matching phrase patterns
such as "X name Y"
Retrieve extracted data for structured part
Retrieve documents for derived text patterns, e.g. sequence,
windows, reg. exp.

- 53 -
Matching structured query against text
Index
Inverted index for document retrieval and pattern matching
Join index: an inverted index for storing materialized joins
between keywords
Neighborhood indexes for phrase patterns

- 54 -
Query processing main tasks
Retrieval
Documents, data elements, triples,
paths, graphs
Inverted index, but also other (B+ tree)
Index documents, triples, materialized
paths
Join
Different join implementations, efficiency
depends on availability of indexes
Non-blocking join good for early result
reporting and for unpredictable Linked
Data / data streams scenario

- 55 -
Query processing more tasks
More complex queries: disjunction,
aggregation, grouping, analytics
Join order optimization
Approximate
Approximate the search space
Approximate the results (matching, join)
Parallelization
Top-k
Use only some entries in the input
streams to produce k results
Multiple sources
Federation, routing
On-the-fly mapping, similarity join
Hybrid
Join text and data


Ranking
- 58 -
Ranking problem definition

Ambiguities arise when
representation is incomplete /
imprecise
Ambiguities at the level of
elements (content ambiguity)
structure between elements
(structure ambiguity)

Due to ambiguities in the representation of the
information needs and the underlying resources, the
results cannot be guaranteed to exactly match the query.
Ranking is the problem of determining the degree of
matching using some notions of relevance.
- 59 -
Content ambiguity
[Figure: the example data graph (Alice, Bob, Thanh, Peter, KIT, FluidOps, Germany, sunset.jpg "Beautiful Sunset", Semantic Search 2009), including Alice's text about sharing an apartment with Bob in Berlin]
What is meant by Berlin in the query?
What is meant by Berlin in the data?
A city with the name Berlin? a person?
What is meant by KIT in the query?
What is meant by KIT in the data?
A research group? a university? a location?
- 60 -
Structure ambiguity
[Figure: the example data graph (Alice, Bob, Thanh, Peter, KIT, FluidOps, Germany, sunset.jpg "Beautiful Sunset", Semantic Search 2009), including Alice's text about sharing an apartment with Bob in Berlin]
What is the connection between
Berlin and Alice?
Friend? Co-worker?
What is meant by works?
Works at? employed?
- 61 -
Ambiguity
Ambiguities arise when data or query allow for multiple
interpretations, i.e. multiple matches
Syntactic, e.g. works vs. works at
Semantic, e.g. works vs. employ
Aboutness, i.e., contain some elements which represent the
correct interpretation
Ambiguities arise when matching elements of different granularities
Does i contain the interpretation for j, given that some part(s) of i
(syntactically/semantically) match j?
E.g. Berlin vs. we went to the same university, and also, we shared
an apartment in Berlin in 2008
Strictly speaking, ranking is performed after syntactic / semantic
matching is done!

- 62 -
Features: What to use to deal with ambiguities?
What is meant by Berlin? What is the
connection between Berlin and Alice?
Content features
Frequencies of terms: a document d is more likely to be about a query
term k when d mentions k more often (probabilistic IR)
Co-occurrences: terms K that often co-occur form a contextual
interpretation, i.e., topics (cluster hypothesis)
Structure features
Consider relevance at level of fields
Link-based popularity

- 63 -
Ranking paradigms
Explicit relevance model
Foundation: probability ranking principle
Ranking results by the posterior probability (odds) of being
observed in the relevant class:
P(w|R) varies in different approaches
binary independence model
Two-Poisson model
BM25

$$\frac{P(D|R)}{P(D|N)}, \qquad P(D|R) = \prod_{w \in D} P(w|R) \prod_{w \notin D} \bigl(1 - P(w|R)\bigr)$$
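As a worked illustration of ranking by odds under the binary independence model, the sketch below computes log P(D|R) - log P(D|N) from per-term presence probabilities. The probabilities are made-up numbers, and the function name `bim_log_odds` is hypothetical. In practice these values have to be estimated, which is where the approaches listed above differ.

```python
import math


def bim_log_odds(doc_terms, vocabulary, p_rel, p_non):
    """log P(D|R) - log P(D|N), assuming terms occur independently.
    p_rel[w] and p_non[w] are the probabilities that w is present in a
    relevant and in a non-relevant document respectively."""
    score = 0.0
    for w in vocabulary:
        if w in doc_terms:
            score += math.log(p_rel[w] / p_non[w])
        else:
            score += math.log((1 - p_rel[w]) / (1 - p_non[w]))
    return score


# Illustrative probabilities only.
vocabulary = ["berlin", "apartment"]
p_rel = {"berlin": 0.8, "apartment": 0.5}
p_non = {"berlin": 0.2, "apartment": 0.5}

score_with_berlin = bim_log_odds({"berlin"}, vocabulary, p_rel, p_non)
score_without_berlin = bim_log_odds({"apartment"}, vocabulary, p_rel, p_non)
```

A term that is equally likely in both classes ("apartment" here) contributes nothing, so the ranking is driven by the discriminating term.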
- 64 -
Ranking paradigms
No explicit notion of relevance: similarity between the query
and the document model
Vector space model (cosine similarity)
Language models (KL divergence)
$$Sim(q,d) = Cos\bigl((w_{1,q},\ldots,w_{k,q}),\,(w_{1,d},\ldots,w_{t,d})\bigr)$$
$$Sim(q,d) = -KL(\theta_q \,\|\, \theta_d) = -\sum_{t \in V} P(t|\theta_q) \log\frac{P(t|\theta_q)}{P(t|\theta_d)}$$
(negated so that a smaller divergence gives a higher similarity)
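Both similarity notions can be sketched in a few lines, assuming sparse term-weight dictionaries for the vector space model and already smoothed term distributions for the language models. The function names are illustrative. Ranking by language models uses the negated divergence, since a smaller divergence means a better match.

```python
import math


def cosine_similarity(q_weights, d_weights):
    """Cosine of the angle between two sparse term-weight vectors."""
    terms = set(q_weights) | set(d_weights)
    dot = sum(q_weights.get(t, 0.0) * d_weights.get(t, 0.0) for t in terms)
    norm_q = math.sqrt(sum(w * w for w in q_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in d_weights.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def kl_divergence(p_query, p_doc):
    """KL(theta_q || theta_d). p_doc must be non-zero wherever p_query is,
    which is what smoothing of the document model guarantees."""
    return sum(
        pq * math.log(pq / p_doc[t]) for t, pq in p_query.items() if pq > 0
    )


query_model = {"berlin": 0.5, "alice": 0.5}
close_doc = {"berlin": 0.5, "alice": 0.5}
far_doc = {"berlin": 0.9, "alice": 0.1}

close = kl_divergence(query_model, close_doc)
far = kl_divergence(query_model, far_doc)
```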
- 65 -
Model construction
How to obtain
Relevance models?
Weights for query / document terms?
Language models for document / queries?

- 66 -
Content-based features
Document statistics, e.g.
Term frequency
Document length
Collection statistics, e.g.
Inverse document frequency
Background language models

$$P(t|\theta_d) = \lambda \frac{tf}{|d|} + (1-\lambda)\, P(t|C)$$
$$w_{t,d} = \frac{tf}{|d|} \cdot idf$$
An object is more likely about Berlin when
it contains a relatively high number of mentions of the term Berlin
the number of mentions of this term in the overall collection is relatively low
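The two weighting schemes can be tried out numerically with the sketch below. The mixing weight `lam=0.8` and the idf definition log(N/df) are assumptions chosen for illustration. Other smoothing weights and idf variants are common.

```python
import math


def smoothed_term_probability(tf, doc_length, p_collection, lam=0.8):
    """P(t | theta_d) = lam * tf/|d| + (1 - lam) * P(t | C).
    lam is an assumed mixing weight between document and background model."""
    return lam * tf / doc_length + (1 - lam) * p_collection


def tf_idf_weight(tf, doc_length, num_docs, doc_freq):
    """w_{t,d} = tf/|d| * idf, with idf = log(N / df) as one common choice."""
    return (tf / doc_length) * math.log(num_docs / doc_freq)


# A term mentioned 3 times in a 100-token object, rare in the collection.
p_seen = smoothed_term_probability(tf=3, doc_length=100, p_collection=0.001)
# An unseen term still gets a small non-zero probability from the background model.
p_unseen = smoothed_term_probability(tf=0, doc_length=100, p_collection=0.001)
weight = tf_idf_weight(tf=3, doc_length=100, num_docs=1000, doc_freq=10)
```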

- 67 -
Structure-based features
Consider structure of objects
Content-based features for structured objects, documents and
for general tuples

$$P(t|\theta_d) = \sum_{f \in F} \alpha_f \, P(t|\theta_{d_f})$$
An object is more likely about Berlin when
one of its (important) fields contains a relatively high
number of mentions of the term Berlin
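A small numeric sketch of the field-weighted mixture. It assumes each field model is the unsmoothed relative frequency of the term in that field, and that the field weights sum to one. The field names, weights and the function name are made up for the example.

```python
def field_weighted_probability(term, fields, alphas):
    """P(t | theta_d) as a weighted sum over fields f of alpha_f * P(t | field f of d),
    with each field model estimated as a relative term frequency."""
    probability = 0.0
    for name, tokens in fields.items():
        if tokens:  # skip empty fields to avoid dividing by zero
            probability += alphas[name] * tokens.count(term) / len(tokens)
    return probability


fields = {
    "title": ["berlin", "guide"],
    "body": ["a", "guide", "to", "berlin", "and", "potsdam"],
}
alphas = {"title": 0.7, "body": 0.3}  # assumed field importance weights

p_berlin = field_weighted_probability("berlin", fields, alphas)
p_potsdam = field_weighted_probability("potsdam", fields, alphas)
```

"berlin" and "potsdam" each occur once in the body, but the mention in the heavily weighted title field makes the object much more likely to be about Berlin.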
- 68 -
Structure-based features (2)
PageRank
Link analysis algorithm
Measuring relative importance of nodes
Link counts as a vote of support
The PageRank of a node recursively depends on the number
and PageRank of all nodes that link to it (incoming links)
ObjectRank
Types and semantics of links vary in structured data setting
Authority transfer schema graph specifies connection strengths
Recursively compute authority transfer data graph

An object about Berlin is more important than another when
a relatively large number of objects link to it
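The recursive definition of PageRank is usually computed by power iteration, as in the sketch below. The damping factor 0.85, the fixed iteration count and the even redistribution of rank from nodes without outgoing links are conventional choices assumed here. ObjectRank would additionally weight each edge by its type, which this sketch does not do.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {node: [nodes it links to]}. Returns {node: score}, scores sum to 1."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Each outgoing link passes on an equal share of the node's rank.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank


# Illustrative link graph: three objects point to "berlin".
links = {"a": ["berlin"], "b": ["berlin"], "c": ["berlin", "a"], "berlin": []}
scores = pagerank(links)
```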
- 69 -
In practice
Many more aspects of relevance
User profiles
History
Context, e.g. geo-location
etc.
Combination of features using Machine Learning
Several hundred features in modern search engines
Pre-compute static features such as PageRank/ObjectRank
Two-phase scoring for efficiency
Round 1: easy to compute features
Round 2: more expensive features
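Two-phase scoring can be sketched as a cheap shortlist followed by an expensive re-rank. The scoring functions, field names and shortlist size below are placeholders, not the features of any particular search engine.

```python
import heapq


def two_phase_rank(candidates, cheap_score, expensive_score, shortlist_size, k):
    """Round 1 keeps the best `shortlist_size` candidates by a cheap score.
    Round 2 re-ranks only that shortlist with the expensive score."""
    shortlist = heapq.nlargest(shortlist_size, candidates, key=cheap_score)
    return heapq.nlargest(k, shortlist, key=expensive_score)


# Illustrative documents with precomputed scores.
docs = [
    {"id": "d1", "cheap": 0.9, "full": 0.2},
    {"id": "d2", "cheap": 0.8, "full": 0.7},
    {"id": "d3", "cheap": 0.1, "full": 0.99},
]
expensive_calls = []


def expensive(doc):
    # Record which documents reach the second round.
    expensive_calls.append(doc["id"])
    return doc["full"]


top = two_phase_rank(docs, lambda d: d["cheap"], expensive, shortlist_size=2, k=1)
```

The trade-off is visible in the example: `d3` would score highest in round 2 but never reaches it, so the cheap features must be good enough not to discard strong candidates.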

Evaluation
Harry Halpin, Daniel Herzig, Peter Mika, Jeff Pound,
Henry Thompson, Roi Blanco, Thanh Tran Duc

- 71 -
Semantic Search challenge (2010/2011)
Two tasks
Entity Search
Queries where the user is looking for a single real world object
Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW
2010.
List search (new in 2011)
Queries where the user is looking for a class of objects
Billion Triples Challenge 2009 dataset
Evaluated using Amazon's Mechanical Turk
Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010
Blanco et al. Repeatable and Reliable Search System
Evaluation using Crowd-Sourcing, SIGIR 2011

- 72 -
Evaluation form
- 73 -
Other evaluations
TREC Entity Track
Related Entity Finding
Entities related to a given entity through a particular relationship
Retrieval over documents (ClueWeb 09 collection)
Example: (Homepages of) airlines that fly Boeing 747
Entity List Completion
Given some elements of a list of entities, complete the list
Question Answering over Linked Data
Retrieval over specific datasets (DBpedia and MusicBrainz)
Full natural language questions of different forms
Correct results defined by an equivalent SPARQL query
Example: Give me all actors starring in Batman Begins.

Search interface
- 75 -
Search interface
Input and output functionality
helping the user to formulate complex queries
presenting the results in an intelligent manner
Semantic Search brings improvements in
Query formulation
Snippet generation
Suggesting related entities
Adaptive and interactive presentation
Presentation adapts to the kind of query and results presented
Object results can be actionable, e.g. buy this product
Aggregated search
Grouping similar items, summarizing results in various ways
Filtering (facets), possibly across different dimensions
Task completion
Help the user to fulfill the task by placing the query in a task context

- 76 -
Query formulation
Snap-to-grid: suggest the most likely interpretation of
the query
Given the ontology or a summary of the data
While the user is typing or after issuing the query
Example: Freebase suggest, TrueKnowledge

- 77 -

Enhanced results/Rich Snippets
Use mark-up from the webpage to generate search snippets
Originally invented at Yahoo! (SearchMonkey)
Google, Yahoo!, Bing, Yandex now consume schema.org
markup
Validators available from Google and Bing
- 79 -
Aggregated search: facets
- 80 -
Aggregated search: Sig.ma
- 81 -
Related entities

Related actors
and movies
- 83 -
Resources
Books
Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern
Information Retrieval. ACM Press. 2011
Survey papers
Thanh Tran, Peter Mika. Survey of Semantic Search Approaches.
Under submission, 2012.
Conferences and workshops
ISWC, ESWC, WWW, SIGIR, CIKM, SemTech
Semantic Search workshop series
Exploiting Semantic Annotations in Information Retrieval (ESAIR)
Entity-oriented Search (EOS) workshop
Upcoming
Joint Intl. Workshop on Entity-oriented and Semantic Search
(JIWES) at SIGIR 2012
ESAIR 2012 at CIKM 2012

- 84 -
The End
Many thanks to Thanh Tran (KIT) and members of the
SemSearch group at Yahoo! Research in Barcelona
Contact
pmika@yahoo-inc.com
Internships available for PhD students (deadline in January)
