
Search Intelligence & MarkLogic Search API

MarkLogic World 2012 Will Thompson wthompson@jonesmcclure.com

Search API Resources


5-minute Guide to the Search API
MarkLogic Search Developer's Guide
developer.marklogic.com
MarkMail.org
MarkLogic Developer Listserv

Code

GitHub:
https://github.com/wthoolihan/MLUC-2012-Examples

Search Intelligence
Get the most out of our XML in search
Approach 1: GUI
Approach 2: Syntax
Approach 3: Facets, constraints, filters
Infer (Search Intelligence)

Enrich Your Query!


Infer
Use knowledge about the user
Look for meaning in search terms

Enrich
Translate into more complex query
Gain speed, accuracy

Enrich Your Query!


Strategies
Custom term handling
o Works well for single-term transformations
o See: http://developer.marklogic.com/try/ninja/page13

Roll your own parser
o A lot of work (see Michael Blakeley's xqysp)

Work between parse and search steps

Search API Overview


The Search API is an XQuery library module designed to simplify creating search applications:
o Parser
o Constraints
o Faceting
o Snippets

High performance, scalability
Extensible

Search API Extensibility


Search API provides several points to hook in
Hooks are defined in the Search API options XML node
o Custom constraints
o Custom grammar
o Custom snippets
o Custom term handling
o Search operators

Search API Basics


Search API module:
import module namespace search = "http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";

Main entry point: search:search()


parses $qtext with given $options
executes search
returns <search:response>
o set of <search:result>s
o facets
o snippets
o metrics and other info
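As a minimal sketch of the entry point described above (the query text is a made-up example, and no options are passed, so the Search API defaults apply), a call and a look at the response might read:

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Hypothetical query text; with no options argument the defaults are used. :)
let $response := search:search("United States Olympics")
return (
  (: estimated total number of matches, from <search:response total="..."> :)
  fn:data($response/@total),
  (: URI of each result on the first page :)
  for $result in $response/search:result
  return fn:data($result/@uri)
)
```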

Search API Basics


Search API options:
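A small options node, as an illustration only: the constraint name and the no-namespace `title` element are assumptions about the content, not taken from the talk.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Assumed content model: articles carry a <title> element in no namespace. :)
let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="title">
      <word>
        <element ns="" name="title"/>
      </word>
    </constraint>
    <return-facets>false</return-facets>
  </options>
(: "title:olympics" is parsed as a word query scoped to <title> :)
return search:search("title:olympics", $options)
```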

Search API Extensibility


Snippet:

Constraint:

Search API Extensibility


Term handler:

Parser:
let $custom-parser-output := my:parse($qtext)
return search:resolve($custom-parser-output, $options)
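The snippet, constraint and term-handler hooks are all wired up in the options node through `apply`/`ns`/`at` attributes. The sketch below shows the general shape only; the function names, the `http://example.com/my` namespace and the `/lib/my.xqy` module path are hypothetical placeholders.

```xquery
xquery version "1.0-ml";

(: All apply/ns/at values below are hypothetical; they must point at
   functions in a library module that you supply yourself. :)
let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="cite">
      <custom facet="false">
        <parse apply="parse-cite" ns="http://example.com/my" at="/lib/my.xqy"/>
      </custom>
    </constraint>
    <transform-results apply="my-snippet" ns="http://example.com/my" at="/lib/my.xqy"/>
    <term apply="my-term" ns="http://example.com/my" at="/lib/my.xqy">
      <empty apply="all-results"/>
    </term>
  </options>
return $options
```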

Search API Basics


Search API parser: search:parse()
1st half of search:search()
returns annotated cts:query XML

Execute search: search:resolve()
2nd half of search:search()
accepts cts:query XML as input

search:parse() Strategy

1. Call search:parse()
2. Analyze and enrich the query XML
3. Call search:resolve()
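The three steps can be sketched as follows. `local:enrich` is a hypothetical placeholder that returns the query unchanged; the enrichment functions discussed later would take its place.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Hypothetical placeholder for step 2: passes the query XML through as-is. :)
declare function local:enrich($q as element()) as element()
{
  $q
};

let $qtext    := "1996 United States Olympics"
let $options  := <options xmlns="http://marklogic.com/appservices/search"/>
let $parsed   := search:parse($qtext, $options)    (: 1. annotated cts:query XML :)
let $enriched := local:enrich($parsed)             (: 2. analyze and enrich      :)
return search:resolve($enriched, $options)         (: 3. run the search          :)
```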

Our Use Case


O'Connor's Online
Search portal built on MarkLogic
Legal rules and commentaries content

Problem
Users will enter citation numbers, abbreviations, etc., expecting complete results
Text editorial content follows different conventions
Detect special cases pre-search and enrich query

Solution

Example: detect year


Content:
MarkLogic database of news/op-ed articles

Organized into year directories:


/content/1990
/content/1991
/content/1992
...
/content/2012

Year is in directory structure, not article text


But users will still include year in search terms

How to transform query?


Recursive typeswitch
(function mapping on):

do-stuff-here($q)
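A recursive typeswitch over the parsed query might look like the sketch below. `local:do-stuff-here` stands in for whatever transformation is wanted and is an identity placeholder here; `local:transform` is a hypothetical name. The recursion uses an explicit `for` so it does not depend on function mapping; with function mapping on, `local:transform($q/node())` would behave the same way.

```xquery
xquery version "1.0-ml";

(: Placeholder transformation: returns the word-query unchanged. :)
declare function local:do-stuff-here($q as element()) as node()*
{
  $q
};

declare function local:transform($q as node()) as node()*
{
  typeswitch ($q)
    (: the node type of interest gets handed to the transformation :)
    case element(cts:word-query) return local:do-stuff-here($q)
    (: any other element is rebuilt and its children are visited :)
    case element() return
      element { fn:node-name($q) } {
        $q/@*,
        for $child in $q/node() return local:transform($child)
      }
    (: text, comments, etc. are copied through :)
    default return $q
};

local:transform(
  <cts:and-query xmlns:cts="http://marklogic.com/cts">
    <cts:word-query><cts:text xml:lang="en">Olympics</cts:text></cts:word-query>
  </cts:and-query>)
```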

Example: detect year


let $terms := "1996 United States Olympics"
return local:detect-year(search:parse($terms))
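One possible shape for `local:detect-year`, offered as a sketch and not as the talk's actual implementation: a word-query whose text is a year between 1990 and 2012 is wrapped in an or-query together with a directory query on the matching year directory. Keeping the original term in the or-query, the trailing slash on the directory URI, and the "infinity" depth are all assumptions.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

declare function local:detect-year($q as node()) as node()*
{
  typeswitch ($q)
    case element(cts:word-query) return
      let $text := fn:string($q/cts:text[1])
      return
        (: years 1990 through 2012, matching the directory layout :)
        if (fn:matches($text, "^(199[0-9]|200[0-9]|201[0-2])$"))
        then
          <cts:or-query>{
            $q,
            (: wrapping a cts:query in an element serializes it to XML :)
            <x>{
              cts:directory-query(fn:concat("/content/", $text, "/"), "infinity")
            }</x>/*
          }</cts:or-query>
        else $q
    case element() return
      element { fn:node-name($q) } {
        $q/@*,
        for $child in $q/node() return local:detect-year($child)
      }
    default return $q
};

let $terms := "1996 United States Olympics"
return local:detect-year(search:parse($terms))
```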

Example: detect year


Strategy depends on your content model
Other possibilities
o date detection
o date ranges
o locations
o etc.

search:parse() Strategy
Weakness
Limited to single word token
Similar to custom term handling

What about multiple tokens?


Analyze querystring text directly using regex
Dangerous

Transform cts:query XML into intermediate form


Preserve Boolean logic & grouping
Preserve phrases
Preserve constraints

Building Intermediate Query


The hack
Basically, undoing some of the parser's work
Text "run" concept
Similar to WordprocessingML

Building Intermediate Query


Intermediate query strategy
1. Flatten query
2. Join sibling words in <run>
3. Transform <run>s
4. Convert <run>s back to word queries

Example: multi-word thesaurus


Content:
Same MarkLogic database of news/op-ed articles from detect-year() example

Query:
Same as before: "1996 United States Olympics"
Start with the search:parse() output

Example: multi-word thesaurus


1. Flatten query
remove implicit and-queries from search:parse() output:

Example: multi-word thesaurus


1. Flatten query
XML should look more like cts:query string representation:
cts:and-query(
  (cts:word-query("1996", "lang=en", 1),
   cts:word-query("United", "lang=en", 1),
   cts:word-query("States", "lang=en", 1),
   cts:word-query("Olympics", "lang=en", 1)),
  ())

Example: multi-word thesaurus


1. Flatten query
Typeswitch on cts:and-query:
1. Check and-queries for parent and-query
2. Remove the nested ones, copy through anything else
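A sketch of the flattening step under those two rules. The function name follows the `local:unnest-ands` call used later in the deck, but the body is an assumption, and the literal input is a hand-written stand-in for search:parse() output.

```xquery
xquery version "1.0-ml";

declare function local:unnest-ands($q as node()) as node()*
{
  typeswitch ($q)
    case element(cts:and-query) return
      if ($q/parent::cts:and-query)
      (: nested and-query: drop the wrapper, keep its processed children :)
      then for $child in $q/node() return local:unnest-ands($child)
      else
        element { fn:node-name($q) } {
          $q/@*,
          for $child in $q/node() return local:unnest-ands($child)
        }
    (: anything else is copied through, visiting its children :)
    case element() return
      element { fn:node-name($q) } {
        $q/@*,
        for $child in $q/node() return local:unnest-ands($child)
      }
    default return $q
};

local:unnest-ands(
  <cts:and-query xmlns:cts="http://marklogic.com/cts">
    <cts:word-query><cts:text>1996</cts:text></cts:word-query>
    <cts:and-query>
      <cts:word-query><cts:text>United</cts:text></cts:word-query>
      <cts:word-query><cts:text>States</cts:text></cts:word-query>
    </cts:and-query>
  </cts:and-query>)
```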

Example: multi-word thesaurus


1. Flatten query
Typeswitch function output:

Example: multi-word thesaurus


2. Join sibling words in <run>:
Typeswitch on cts:word-query:
1. Ignore phrases
2. Delete the word-query if it is not the first in its sequence of sibling words
3. Take first word-query in sequence and join with its following siblings into a <run>
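The three rules could be sketched as below. The function name follows the `local:create-runs` call used in the deck; `local:is-single-word` is a hypothetical helper, and treating any word-query whose text contains whitespace as a phrase is an assumption about the parser output. Options and weights on the joined word-queries are dropped in this sketch.

```xquery
xquery version "1.0-ml";

declare option xdmp:mapping "false";

(: True for a word-query holding a single token; phrases contain whitespace. :)
declare function local:is-single-word($n as node()?) as xs:boolean
{
  $n instance of element(cts:word-query)
    and fn:not(fn:matches(fn:string($n/cts:text[1]), "\s"))
};

declare function local:create-runs($q as node()) as node()*
{
  typeswitch ($q)
    case element(cts:word-query) return
      (: 1. phrases are copied through untouched :)
      if (fn:not(local:is-single-word($q))) then $q
      (: 2. not the first word of its run: already joined, so drop it :)
      else if (local:is-single-word($q/preceding-sibling::*[1])) then ()
      (: 3. first word: join with the consecutive single words that follow :)
      else
        let $rest := $q/following-sibling::*
        let $stop := $rest[fn:not(local:is-single-word(.))][1]
        let $tail := if (fn:exists($stop)) then $rest[. << $stop] else $rest
        return
          <run>{ fn:string-join(($q, $tail)/cts:text[1]/fn:string(.), " ") }</run>
    case element() return
      element { fn:node-name($q) } {
        $q/@*,
        for $child in $q/node() return local:create-runs($child)
      }
    default return $q
};

local:create-runs(
  <cts:and-query xmlns:cts="http://marklogic.com/cts">
    <cts:word-query><cts:text>1996</cts:text></cts:word-query>
    <cts:word-query><cts:text>United</cts:text></cts:word-query>
    <cts:word-query><cts:text>States</cts:text></cts:word-query>
    <cts:word-query><cts:text>Olympics</cts:text></cts:word-query>
  </cts:and-query>)
```

For the flat four-word input shown, the four word-queries collapse into one `<run>` holding the text "1996 United States Olympics" inside the and-query.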

Example: multi-word thesaurus


2. Join sibling words in <run>:
Input:
search:parse("1996 United States Olympics")/local:unnest-ands(.)/local:create-runs(.)

Output:

Example: multi-word thesaurus


2. Join sibling words in <run>:
Input:
search:parse("1996 (sprint OR marathon) United States Olympics")/local:unnest-ands(.)/local:create-runs(.)

Output:

Example: multi-word thesaurus


3. Transform <run>s:
1. Store terms in thesaurus
2. Build cts:or-query of thesaurus terms
3. Using cts:or-query of terms, cts:highlight() <run>s, and replace with thesaurus synonyms

Example: multi-word thesaurus


3. Transform <run>s:
1. store terms in thesaurus
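For illustration, a thesaurus document in the MarkLogic thesaurus schema could be loaded as below. The entry and its synonyms are invented sample data; the URI matches the `doc("thesaurus.xml")` reference used in the deck.

```xquery
xquery version "1.0-ml";

(: Sample data only: one multi-word term with two made-up synonyms. :)
xdmp:document-insert(
  "thesaurus.xml",
  <thesaurus xmlns="http://marklogic.com/xdmp/thesaurus">
    <entry>
      <term>United States</term>
      <synonym><term>USA</term></synonym>
      <synonym><term>U.S.</term></synonym>
    </entry>
  </thesaurus>)
```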

Example: multi-word thesaurus


3. Transform <run>s:
2. build cts:or-query of thesaurus terms:

Example: multi-word thesaurus


3. Transform <run>s:
3. replace matches with synonyms:
cts:highlight() - powerful cts:query-based find/replace
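A hedged sketch of the find/replace idea: cts:highlight() walks the node, and wherever the thesaurus or-query matches, the matched text is replaced by the third argument, in which `$cts:text` is bound to the matched string. The `<match>`/`<synonym>` wrapper markup is invented for this example, and the function body is an assumption, not the talk's own code. Note that the highlight applies to every text node in the tree, not only to those inside `<run>`.

```xquery
xquery version "1.0-ml";

declare namespace thsr = "http://marklogic.com/xdmp/thesaurus";

declare function local:thsr-expand($runs as node(), $q-thsr as cts:query) as node()*
{
  cts:highlight($runs, $q-thsr,
    (: hypothetical markup: the matched term followed by its synonyms :)
    <match>{
      $cts:text,
      for $syn in fn:doc("thesaurus.xml")
                    //thsr:entry[fn:lower-case(thsr:term) eq fn:lower-case($cts:text)]
                    /thsr:synonym/thsr:term
      return <synonym>{ fn:string($syn) }</synonym>
    }</match>)
};

let $q-thsr := cts:or-query(
  fn:doc("thesaurus.xml")//thsr:entry/thsr:term/cts:word-query(fn:string(.)))
return
  local:thsr-expand(
    <cts:and-query xmlns:cts="http://marklogic.com/cts">
      <run>1996 United States Olympics</run>
    </cts:and-query>,
    $q-thsr)
```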

Example: multi-word thesaurus


3. Transform <run>s:
Input:
let $q-thsr := cts:or-query(
  doc("thesaurus.xml")//thsr:entry/thsr:term/cts:word-query(string(.))
)
let $runs := search:parse("1996 United States Olympics")
  /local:unnest-ands(.)/local:create-runs(.)
return local:thsr-expand($runs, $q-thsr)

Example: multi-word thesaurus


3. Transform <run>s:
Output:

Example: multi-word thesaurus


4. Convert <run>s back to word queries
Typeswitch:
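One way the conversion back could look, as a sketch. It assumes each expanded `<run>` holds plain text plus hypothetical `<match>` elements (matched term as text, followed by `<synonym>` children); that intermediate markup is invented for this example. Plain text becomes one word-query per token, and each match becomes an or-query over the term and its synonyms. `local:words` is a hypothetical helper, and omitting xml:lang on cts:text assumes the default language is acceptable.

```xquery
xquery version "1.0-ml";

(: One word-query per whitespace-separated token of $text. :)
declare function local:words($text as xs:string) as element(cts:word-query)*
{
  for $w in fn:tokenize(fn:normalize-space($text), " ")[. ne ""]
  return <cts:word-query><cts:text>{ $w }</cts:text></cts:word-query>
};

declare function local:resolve-runs($q as node()) as node()*
{
  typeswitch ($q)
    case element(run) return
      for $part in $q/node()
      return
        if ($part instance of element(match))
        then
          (: matched term OR any of its synonyms, each kept as a phrase :)
          <cts:or-query>{
            for $t in ($part/text(), $part/synonym/text())
            return <cts:word-query><cts:text>{ fn:string($t) }</cts:text></cts:word-query>
          }</cts:or-query>
        else local:words(fn:string($part))
    case element() return
      element { fn:node-name($q) } {
        $q/@*,
        for $child in $q/node() return local:resolve-runs($child)
      }
    default return $q
};

local:resolve-runs(
  <cts:and-query xmlns:cts="http://marklogic.com/cts">
    <run>1996 <match>United States<synonym>USA</synonym></match> Olympics</run>
  </cts:and-query>)
```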

Example: multi-word thesaurus


4. Convert <run>s back to word queries
Input:
let $q-thsr := cts:or-query(
  doc("thesaurus.xml")//thsr:entry/thsr:term/cts:word-query(string(.))
)
let $runs := search:parse("1996 United States Olympics")
  /local:unnest-ands(.)/local:create-runs(.)
let $expanded := local:thsr-expand($runs, $q-thsr)
return local:resolve-runs($expanded)

Example: multi-word thesaurus


4. Convert <run>s back to word queries
Output:

Combining Examples
local:thsr-expand($runs, $q-thsr)/local:resolve-runs(.)/local:detect-year(.)

Enrich Your Query!

Takeaway
1. No added GUI
2. Didn't ask the user for additional input
3. Able to build more robust query before executing search

Search API Hacking


Many potential applications:
Ad-hoc weighting:
local:q-add-weights(
  search:parse("bananas"),
  (<element ns="{$ns}" name="p" weight="1"/>,
   <element ns="{$ns}" name="b" weight="2"/>,
   <element ns="{$ns}" name="title" weight="3.5"/>)
)
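A possible body for `local:q-add-weights`, given purely as a sketch: each word-query is replaced by an or-query of weighted element-word-queries, one per configured element. The replacement strategy and the `http://example.com/ns` namespace are assumptions.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

declare function local:q-add-weights($q as node(), $elements as element()*) as node()*
{
  typeswitch ($q)
    case element(cts:word-query) return
      if (fn:empty($elements)) then $q
      else
        <cts:or-query>{
          for $e in $elements
          return
            (: wrapping a cts:query in an element serializes it to XML :)
            <x>{
              cts:element-word-query(
                fn:QName(fn:string($e/@ns), fn:string($e/@name)),
                fn:string($q/cts:text[1]),
                (),
                xs:double($e/@weight))
            }</x>/*
        }</cts:or-query>
    case element() return
      element { fn:node-name($q) } {
        $q/@*,
        for $child in $q/node() return local:q-add-weights($child, $elements)
      }
    default return $q
};

(: Hypothetical namespace for the content elements. :)
let $ns := "http://example.com/ns"
return
  local:q-add-weights(
    search:parse("bananas"),
    (<element ns="{$ns}" name="p" weight="1"/>,
     <element ns="{$ns}" name="title" weight="3.5"/>))
```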

Search API Hacking


Many potential applications:
Automatic spell correction:

Search API Hacking


Many potential applications:

Detect entities
Transform text into element-based query
Fewer false positives and exclusions
Leverage indexes:
"New York Times"

Search API Hacking


Other ideas
Regex unparsed query string
o apply constraints, operators, etc. as configured in Search API based on keywords/patterns

Custom term handler
o single-term transformations

Combine with data enrichment on ingestion
o MarkLogic Entity Framework
o Linguistic processing

Hazards
Chaos
o Daisy-chained transformations can have unintended consequences
Performance
o Pre-search transformations need to be fast
o Make sure to leverage indexes as much as possible
o Larger queries do take longer

Questions
