Introduction to the Semantic Web
Stefane Fermigier, Olivier Grisel - Nuxeo
Solutions Linux - Paris - May 2011

Wednesday, May 11, 2011


Agenda

• A pragmatic introduction to the Semantic Web
• Experience report and demos from Nuxeo
• Apache tools for Big Linked Data



1. Introduction to the
Semantic Web



Prelude



Source: Mills Davis, “Semantic Social Computing”, sept. 2007
History



Tim Berners-Lee:
• Invented the web in 1989 (yeah!)
• Invented the semantic web in 1994 (duh?)



Historical perspective
• From web 1.0: web of sites and pages,
aka the World Wide Web
• To web 2.0: web of people and of
participation, aka the Social Web (Blogs,
RSS, tags, Facebook, Wikipedia, etc.)
• To web 3.0: web of data, of meaning and
connected knowledge, aka the Semantic
Web



Semantics & Ontologies



Some examples
• FOAF: relationships between people (social
network)
• SIOC: relationships between websites,
articles, blogs, comments
• Rich Snippets: RDFa content picked up by Google and Yahoo
for richer search results (SEO)
• GoodRelations: e-commerce (eBay...)
• rNews: metadata for news agencies (AFP,
Reuters...)
How is it related to
the Web?



The traditional Web

• A principle: hypertext
• A protocol: HTTP
• An identification scheme: URNs/URIs
• A language: HTML



“To a computer, then, the web is a flat,
boring world devoid of meaning”

Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/


“This is a pity, as in fact documents on the
web describe real objects and imaginary
concepts, and give particular relationships
between them”

Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/


“Adding semantics to the web involves two things:
allowing documents which have information in
machine-readable forms, and allowing links to be
created with relationship values.”

Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/


“The Semantic Web is not a separate Web but an
extension of the current one, in which information
is given well-defined meaning, better enabling
computers and people to work in cooperation.”

Tim Berners-Lee, James Hendler and Ora Lassila, “The Semantic Web”, Scientific American, May 2001


The traditional Web

• A principle: hypertext
• A protocol: HTTP
• An identification scheme: URNs/URIs
• A language: HTML



The semantic Web

• A principle: hypertext
• A protocol: HTTP
• An identification scheme: URNs/URIs
• A language: RDF (in place of HTML)



The W3C “Layer Cake”


[Figure: the W3C layer cake, with the already standardized layers marked]



URIs and the
Web of Things

• URIs (Uniform Resource Identifiers) are used to identify things
(also called entities) in the real world
• For instance: people, places, events,
companies, products, movies, etc.



The RDF model

RDF is used to describe relationships between objects,
identified by their URIs

(Subject) --[Predicate]--> (Object)
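
The subject / predicate / object model can be sketched in a few lines of Python. This is only an illustration of the data model, not an RDF library: the `Triple` type, the `objects` helper and the `example.org` URIs are made up for the example, while the FOAF namespace and its `name` and `knows` properties are real.

```python
from collections import namedtuple

# One RDF statement: a subject and a predicate (both URIs) and an object
# (a URI or a literal value).
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

FOAF = "http://xmlns.com/foaf/0.1/"
ALICE = "http://example.org/people/alice"  # hypothetical URI
BOB = "http://example.org/people/bob"      # hypothetical URI

# A graph is simply a collection of triples.
graph = [
    Triple(ALICE, FOAF + "name", "Alice"),
    Triple(ALICE, FOAF + "knows", BOB),
    Triple(BOB, FOAF + "name", "Bob"),
]


def objects(graph, subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return [t.object for t in graph
            if t.subject == subject and t.predicate == predicate]


known = objects(graph, ALICE, FOAF + "knows")
print(known)
```

Because every statement has the same three-part shape, graphs coming from different sources can be merged by concatenating their triples.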



Example

Source: http://www.slideshare.net/AntidotNet/web-smantique-web-de-donnes-web-30-linked-data-quelques-repres-pour-sy-retrouver
RDF serialization
As XML:

Others, e.g. N3:
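
As a rough illustration of what a serialization does, the sketch below writes two triples in N-Triples, a simple line-based syntax that is a subset of Turtle and N3. The `to_ntriples` and `term` helpers and the `example.org` URIs are hypothetical, and only backslashes and double quotes are escaped in literals, so this is not a complete serializer.

```python
def term(value, is_literal=False):
    """Render a URI as <...> or a plain literal as "..."."""
    if is_literal:
        # Minimal escaping only: backslash and double quote.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{value}>"


def to_ntriples(triples):
    """Serialize (subject, predicate, object, object_is_literal) tuples,
    one statement per line, each terminated by ' .'."""
    lines = []
    for s, p, o, o_is_literal in triples:
        lines.append(f"{term(s)} {term(p)} {term(o, o_is_literal)} .")
    return "\n".join(lines)


triples = [
    ("http://example.org/people/alice",
     "http://xmlns.com/foaf/0.1/name", "Alice", True),
    ("http://example.org/people/alice",
     "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/people/bob", False),
]

output = to_ntriples(triples)
print(output)
```

The same graph could equally be written as RDF/XML; the serialization changes, the triples do not.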



SPARQL
• Query language for RDF databases
• Several implementations
• OSS: Apache Jena, Sesame, 4Store,
Virtuoso, Mulgara, Redland, Open Anzo...
• Proprietary: 5Store, AllegroGraph
RDFStore, Stardog, Dydra, OWLIM...
• More expressive than SQL, but scalability is still
an open question
SPARQL Sample
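
A SPARQL query is essentially a set of triple patterns with variables. The sketch below is an assumption-laden illustration, not a SPARQL engine: it keeps a sample query as a string for reference and evaluates the same two patterns against a tiny in-memory list of triples. The `ex:` resources, the `match` and `solve` helpers and the data are all made up for the example.

```python
# The query this sketch imitates (shown for reference only, it is not parsed).
QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE {
  ?person foaf:knows ?friend .
  ?friend foaf:name  ?name .
}
"""

# Tiny in-memory graph using prefixed names for brevity.
TRIPLES = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
]


def match(pattern, triple, binding):
    """Unify one triple pattern with one triple.

    Returns an extended copy of `binding`, or None if they conflict."""
    new = dict(binding)
    for p, v in zip(pattern, triple):
        if p.startswith("?"):          # variable: bind it or check the binding
            if new.setdefault(p, v) != v:
                return None
        elif p != v:                   # constant: must be identical
            return None
    return new


def solve(patterns, triples):
    """Evaluate a list of patterns as a join, one pattern at a time."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match(pattern, t, b)) is not None]
    return bindings


PATTERNS = [("?person", "foaf:knows", "?friend"),
            ("?friend", "foaf:name", "?name")]
names = [b["?name"] for b in solve(PATTERNS, TRIPLES)]
print(names)
```

A real store would use indexes instead of scanning every triple for every pattern, which is where the scalability question comes from.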



Where and how
to find these data?



Solution 1: “Lift”
• One can use HTML scraping and natural
language processing (NLP) techniques to
extract semantic information from existing
content / sites
• Generic solutions: OpenCalais, Zemanta,
Apache Stanbol
• Pro: no need to change existing content
• Con: error prone, needs human checks
Example: DBpedia



Solution 2: export
• RDFa and microformats are used to embed
semantic information (expressed using the
RDF model) into regular web pages
• RDFa does it using existing (rel) and
additional (about, property, typeof)
attributes
• Microformats use only standard HTML
attributes (class)
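
To make the RDFa attributes concrete, here is a minimal sketch that scans an HTML fragment with Python's standard `html.parser` and collects the `about`, `typeof`, `property` and `rel` attributes. It is not an RDFa processor (it does not resolve prefixes or build triples); the `RDFaAttributeCollector` class and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class RDFaAttributeCollector(HTMLParser):
    """Collect the RDFa-related attributes found on each start tag."""

    RDFA_ATTRS = ("about", "typeof", "property", "rel")

    def __init__(self):
        super().__init__()
        self.found = []  # list of (tag, {attribute: value}) pairs

    def handle_starttag(self, tag, attrs):
        rdfa = {name: value for name, value in attrs
                if name in self.RDFA_ATTRS}
        if rdfa:
            self.found.append((tag, rdfa))


# Hypothetical page fragment describing a person with FOAF terms.
PAGE = """
<div about="http://example.org/people/alice" typeof="foaf:Person">
  <span property="foaf:name">Alice</span>
  <a rel="foaf:knows" href="http://example.org/people/bob">Bob</a>
</div>
"""

collector = RDFaAttributeCollector()
collector.feed(PAGE)
for tag, attributes in collector.found:
    print(tag, attributes)
```

The page still renders as ordinary HTML; the extra attributes are only meaningful to software that looks for them.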



Solution 3: reuse

• Linked Open Data: (usually large) data
repositories available on the web (for free
or not), expressed using the RDF model
• Interoperability between these repositories
(their ontologies) must be defined



Linked Open Data in 2007, 2008, 2009 and 2010

[Figures: four successive snapshots of the Linking Open Data cloud diagram, one per year]
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

Good for Enterprise apps too!

Diagram source: http://www.w3.org/2007/Talks/0130-sb-W3CTechSemWeb/


Why now?



Key Enablers
• Open Data and Linked Online Data
• Advances in automatic content analysis
(linguistics, image processing) and machine
learning
• Classical logic and classical AI
• Computing power (Moore’s law +
MapReduce)



The technologies and data
are available,

Let’s put them to use!



2. Nuxeo &
Semantic ECM



Nuxeo: an open source
ECM vendor
Our Focus is Enterprise Content Management
ECM as a Platform for Content Applications
Open Source as Efficient Development Model
Modern architecture for 21st Century business
“Lean, mobile, social, interoperable”

A Social Marketplace in action


Innovation driven by community of customers, partners,
and our core developers



Nuxeo ECM - From Platform to Products

[Diagram, from top to bottom:]
• Construction, Media, Government, Life Sciences
• Business Solutions: Correspondence Management, Contracts Management,
Records Management, Invoice Processing
• Horizontal Packages: Document Management, Digital Asset Management,
Case Management Framework, Structured Content Document Server, Aggregator
• Platform: Nuxeo Enterprise Platform, a complete set of components
covering all aspects of ECM
• Content Infrastructure: Nuxeo Core, a lightweight, scalable, embeddable
content repository

Major Customers



Goals for Semantic ECM
• Repurpose existing content better

• Improve search and collaboration

• Make information more contextual

• Extract and use information from content

• Leverage Open and Linked Data, contribute

• Make ECM users’ content smarter!

• => Gain efficiency, effectiveness and strategic positioning on the ECM market



Demo



IKS project
• European project under FP7 (the EU's 7th Framework Programme), with
13 partners (6 SMEs) and an 8.5 MEUR budget

• Goal: create a semantic software “stack” that will be used by CMS
vendors to add semantic features to their products

• Started in Jan. 2009, will last until Dec. 2012

• First tangible result: Apache Stanbol, already integrated in a Nuxeo plugin



The Semantic Engine

• From unstructured content to Knowledge

• Language guessing

• Topic classification (Business, Sports, Media, ...)

• Named Entities extraction and linking

• Relationships and properties extraction





RESTful is Beautiful



Apache Stanbol =
Semantic Engines (Apache OpenNLP)
+ Fast Linked Data local index (Apache Solr)
+ Semantic Rule Engine (Apache Jena)

Apache Stanbol

[Diagram labels: Nuxeo DM addon; Apache Stanbol with Engine 1, Engine 2 and Engine 3; data sources DBpedia, Freebase, Geonames and LDAP; a box marked “Local IT infrastructure (LAN)”; steps numbered 1 to 3]



3. Apache tools for
processing
Big and/or Linked Data



Training statistical models for NER with
Wikipedia and DBpedia
• Extract sentences with link positions from Wikipedia articles

• Use DBpedia to find the type of the target entity (Person,
Location, Organization)

• Apache Pig scripts to compute the join + format the result as
training files for OpenNLP

• Apache OpenNLP to build and evaluate the models

• Apache Hadoop for distributed processing

• Apache Whirr for deployment and management on an Amazon
EC2 cluster
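
The join and formatting steps above can be sketched on a single sentence. This is a plain-Python stand-in for what the Pig scripts do at scale, not the actual pipeline: the `ENTITY_TYPES` table plays the role of the DBpedia type lookup, `to_opennlp_line` is a hypothetical helper, and the output follows the `<START:type> ... <END>` annotation style used by OpenNLP name finder training files (one whitespace-tokenized sentence per line).

```python
# Stand-in for the DBpedia lookup: link target -> entity type.
ENTITY_TYPES = {
    "http://dbpedia.org/resource/Tim_Berners-Lee": "person",
    "http://dbpedia.org/resource/London": "location",
}


def to_opennlp_line(sentence, links):
    """Annotate a tokenized sentence with OpenNLP-style name tags.

    `links` holds (start, end, target_uri) character spans taken from the
    wiki links of the sentence; spans must not overlap. Links whose target
    has no known type are left as plain text."""
    parts = []
    cursor = 0
    for start, end, target in sorted(links):
        entity_type = ENTITY_TYPES.get(target)
        if entity_type is None:
            continue
        parts.append(sentence[cursor:start])
        parts.append(f"<START:{entity_type}> {sentence[start:end]} <END>")
        cursor = end
    parts.append(sentence[cursor:])
    return "".join(parts)


sentence = "Tim Berners-Lee was born in London ."
person_start = sentence.index("Tim Berners-Lee")
city_start = sentence.index("London")
links = [
    (city_start, city_start + len("London"),
     "http://dbpedia.org/resource/London"),
    (person_start, person_start + len("Tim Berners-Lee"),
     "http://dbpedia.org/resource/Tim_Berners-Lee"),
]

line = to_opennlp_line(sentence, links)
print(line)
```

Running the same transformation over every linked sentence of Wikipedia is what makes a distributed setup (Pig on Hadoop) worthwhile.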





Training statistical models for topic
classification from Wikipedia and DBpedia
• Filter the category tree from DBpedia SKOS entries (~500k)

• Pig scripts to compute the joins with article abstracts for all
the articles categorized in Wikipedia

• Export as a 2.8GB TSV file to be indexed in Apache Solr

• Use the Solr MoreLikeThisHandler to find the top 5 most related
Wikipedia categories for any kind of text

• Apache Whirr & Hadoop for deployment and management on an
Amazon EC2 cluster
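
The MoreLikeThis lookup could be driven by a request such as the one built below. This is a sketch under several assumptions: the handler is taken to be registered at `/mlt`, the Solr URL and the `text` field name are placeholders, and the server must accept content streams for `stream.body` to work. The parameters `mlt.fl`, `mlt.mintf`, `rows`, `fl` and `wt` are standard Solr request parameters; `build_mlt_url` is a hypothetical helper and no request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_mlt_url(base_url, text, field="text", rows=5):
    """Build a MoreLikeThis query URL asking for the `rows` indexed
    documents (here: Wikipedia categories) most similar to `text`."""
    params = {
        "stream.body": text,  # the free text to classify
        "mlt.fl": field,      # field used to compute the similarity
        "mlt.mintf": 1,       # keep terms that occur only once in the text
        "rows": rows,         # top 5 categories by default
        "fl": "id,score",
        "wt": "json",
    }
    return f"{base_url}/mlt?{urlencode(params)}"


# Placeholder Solr location; adjust to the actual deployment.
url = build_mlt_url("http://localhost:8983/solr",
                    "The Semantic Web is an extension of the current Web")
print(url)
```

Treating classification as a similarity search avoids training one model per category, at the cost of depending on how well the index reflects each category's vocabulary.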



What’s next?

• Integrate the R&D results into Stanbol / Nuxeo

• Work on user interface / high level JavaScript toolkits for Linked
Data editing

• http://github.com/bergie/VIE based on backbone.js

• Experiment / Integrate / Refine



Resources
• http://iks-project.eu

• http://stanbol.demo.nuxeo.com

• http://incubator.apache.org/stanbol

• http://blogs.nuxeo.com/dev

• http://hadoop.apache.org/

• http://incubator.apache.org/opennlp/

• http://github.com/ogrisel/pignlproc