
Endeca® Navigation Platform

Advanced Features Guide


Copyright and Disclaimer
Product specifications are subject to change without notice and do not
represent a commitment on the part of Endeca Technologies, Inc. The
software described in this document is furnished under a license
agreement. The software may not be reverse assembled and may be
used or copied only in accordance with the terms of the license
agreement. It is against the law to copy the software on any medium
except as specifically allowed in the license agreement.

No part of this document may be reproduced or transmitted in any form
or by any means, electronic or mechanical, including photocopying and
recording, for any purpose without the express written permission of
Endeca Technologies, Inc.

Copyright © 2003-2005 Endeca Technologies, Inc. All rights reserved.


Printed in USA.
Corda PopChart® and Corda Builder™ Copyright 1996-2005 Corda
Technologies, Inc.

Outside In® SearchML © 1992-2005 Stellent Chicago, Inc. All rights
reserved.

Rosette® Globalization Platform Portions Copyright © Basis
Technology Corp. 2003-2005. All rights reserved.

Teragram Language Identification Software Portions Copyright ©
1997-2005 Teragram Corporation. All rights reserved.

Trademarks
Don't Stop At Search, Endeca, Endeca InFront, Endeca Navigation
Engine, Guided Navigation, and ProFind are registered trademarks, and
Endeca Data Foundry and Endeca Latitude are trademarks of Endeca
Technologies, Inc.

Basis Technology and Rosette are trademarks of Basis Technology Corp.

All other trademarks or registered trademarks contained herein are the
property of their respective owners.

Endeca Advanced Features Guide • August 2005


Contents

Preface
About This Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
Who Should Use This Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
Symbols and Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
Endeca Documentation Set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Contacting Endeca Standard Customer Support . . . . . . . . . . . . . . . . xx

SECTION I DATA IMPORT FEATURES


Chapter 1 Content Acquisition System
Sections of This Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
CAS and Security Information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Components that Support CAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
CAS Reference Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Full Crawls versus Differential Crawls . . . . . . . . . . . . . . . . . . . . . . . . 27
URL and Record Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Redundant URLs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Source Documents and Endeca Records. . . . . . . . . . . . . . . . . . . . . . . 31
Property Name Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Viewing all Properties Generated by CAS . . . . . . . . . . . . . . . . . . . 38
Creating a Full Crawling Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Creating a Record Adapter to Read Documents . . . . . . . . . . . . . . 41
Creating a Record Manipulator. . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Adding a RETRIEVE_URL Expression . . . . . . . . . . . . . . . . . . . . . . . 45
Converting Documents to Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

Identifying the Language of the Documents . . . . . . . . . . . . . . . . . . 50


Removing Document Body Properties . . . . . . . . . . . . . . . . . . . . . . 52
Modifying Records with a Perl Manipulator . . . . . . . . . . . . . . . . . . 54
Creating a Spider. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Specifying Root URLs to Crawl . . . . . . . . . . . . . . . . . . . . 59
Configuring URL Extraction Settings. . . . . . . . . . . . . . . . . . . . . . . . 60
Example Syntax of URL Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Specifying a Record Source for the Spider . . . . . . . . . . . . . . . . . . . 65
Specifying Timeouts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Specifying Proxy Servers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Removing any Unnecessary Records after a Crawl . . . . . . . . . . . . 68
Handling Crawling Errors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Properties Generated by CAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Formats Supported by ProFind . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

Chapter 2 Web Crawling with Authentication


Configuring Basic Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
KEY_RING Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
SITE Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
HOST Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
PORT Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
HTTP Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
REALM Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
KEY Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Configuring HTTPS Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Boot-Strapping Server Authentication . . . . . . . . . . . . . . . . . . . . . . 95
CA_DB Element. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Disabling Server Authentication for a Host. . . . . . . . . . . . . . . . . . . 96
HTTPS Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
AUTHENTICATE_HOST Attribute . . . . . . . . . . . . . . . . . . . . . . . . 96
Configuring Client Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . 97
CERT Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
PATH Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

PRIV_KEY_PATH Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Authenticating with a Microsoft Exchange Server . . . . . . . . . . . . 98
EXCHANGE_SERVER Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Authenticating with a Proxy Server . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
PROXY Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Using Forge to Encrypt Keys and Pass Phrases . . . . . . . . . . . . . . . . 100
Encrypting a Username/Password Pair . . . . . . . . . . . . . . . . . . . . 101
Encrypting a Pass Phrase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

SECTION II RECORD FEATURES


Chapter 3 Creating Aggregated Records
Aggregated Record Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Enabling Record Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Generating and Displaying Aggregated Records . . . . . . . . . . . . . . . 109
Determining the Available Rollup Keys . . . . . . . . . . . . . . . . . . . . 109
Creating Aggregated Record Navigation Queries . . . . . . . . . . . . 112
Specifying the Rollup Key for the Navigation Query . . . . . . . 112
Setting the Maximum Number of Returned Records . . . . . . 113
Creating Aggregated Record Queries. . . . . . . . . . . . . . . . . . . . . . 114
Displaying Aggregated Records . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Retrieving an Aggregated Record from
an ENEQueryResults Object . . . . . . . . . . . . . . . . . . . . 115
Retrieving an Aggregated Record List from
a Navigation Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Displaying Aggregated Record Attributes . . . . . . . . . . . . . . . 117
Displaying the Records in the Aggregated Record . . . . . . . . 119

Chapter 4 Using Derived Properties


Specifying Derived Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Displaying Derived Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Troubleshooting Derived Properties . . . . . . . . . . . . . . . . . . . . . . . . . 128
Derived Property Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

Chapter 5 Selecting a Record Set Based on a Key


About the Select Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Configuring the Select Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Using URL Query Parameters for Select . . . . . . . . . . . . . . . . . . . . . . 133
Selecting Keys in the Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Using the Java Selection Method. . . . . . . . . . . . . . . . . . . . . . . . . . 133
Using the .NET Selection Property . . . . . . . . . . . . . . . . . . . . . . . . 135
Using the COM/Perl Selection Methods . . . . . . . . . . . . . . . . . . . . 135

Chapter 6 Bulk Export of Records


Configuring the Bulk Export Feature . . . . . . . . . . . . . . . . . . . . . . . . . 137
Using URL Query Parameters for Bulk Export . . . . . . . . . . . . . . . . . 138
Retrieving Bulk Records in the Application . . . . . . . . . . . . . . . . . . . . 138
Setting the Number of Bulk Records. . . . . . . . . . . . . . . . . . . . 138
Retrieving the Bulk-format Records . . . . . . . . . . . . . . . . . . . . . . . 140
Using Java Bulk Export Methods . . . . . . . . . . . . . . . . . . . . . . . 140
Using COM/Perl Bulk Export Methods. . . . . . . . . . . . . . . . . . . 142
Using .NET Bulk Export Methods . . . . . . . . . . . . . . . . . . . . . . . 143
Performance Impact for Bulk Export Records. . . . . . . . . . . . . . . . . . 144

Chapter 7 Record Filters


Record Filter Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
ENE Query Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
XML Syntax for File-based Record Filter Expressions . . . . . . . . 149
Enabling Properties for Use in Record Filters . . . . . . . . . . . . . . . . . . 151
Data Configuration for File-based Filter Expressions. . . . . . . . . . . . 151
Record Filter Result Caching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
ENE URL Query Parameters for Record Filters . . . . . . . . . . . . . . . . 153
Sample Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Record Filter Performance Implications . . . . . . . . . . . . . . . . . . . . . . 154
Memory Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Expression Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Record Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

Interaction with Spelling Auto-correction and Spelling Did You
Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Memory Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Expression Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

SECTION III DIMENSION FEATURES


Chapter 8 Using Inert Dimension Values
Configuring Inert Dimension Values . . . . . . . . . . . . . . . . . . . . . . . . . 162
Using Inert Dimension Values in the Application . . . . . . . . . . . . . . . 163
Sample Java Code for Inert Dimension Values . . . . . . . . . . . 164
Sample .NET Code for Inert Dimension Values . . . . . . . . . . . 165
Sample COM Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

Chapter 9 Working with Externally Created Dimensions


XML Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
XML Syntax to Specify Dimension Hierarchy . . . . . . . . . . . . . . . . 171
Example of Using Nested node Elements . . . . . . . . . . . . . . . . . . 172
Example of Using Parent Attributes . . . . . . . . . . . . . . . . . . . . . . . 173
Example of Using Child Elements . . . . . . . . . . . . . . . . . . . . . . . . 173
Node ID Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Importing an Externally Created Dimension . . . . . . . . . . . . . . . . 174

Chapter 10 Working with an Externally Managed Taxonomy


XSLT and XML Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
XSLT Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
XML Syntax to Specify Dimension Hierarchy . . . . . . . . . . . . . . . . 181
Example of Using Nested node Elements . . . . . . . . . . . . . . . . . . 183
Example of Using parent Attributes . . . . . . . . . . . . . . . . . . . . . . . 183
Example of Using child Elements . . . . . . . . . . . . . . . . . . . . . . . . . 183
Node ID Requirements and Identifier Management in Forge . . 184
Pipeline Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Integrating an Externally Managed Taxonomy . . . . . . . . . . . . . . 185

Transforming an Externally Managed Taxonomy. . . . . . . . . . . . . 187


Loading an Externally Managed Dimension . . . . . . . . . . . . . . . . . 188
Running a Second Baseline Update. . . . . . . . . . . . . . . . . . . . . . . . 189
Updating an Externally Managed Taxonomy in Your Pipeline . . . 190

Chapter 11 Classifying Documents with Stratify


Sections of This Document. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Frequently Used Terms and Concepts . . . . . . . . . . . . . . . . . . . . . 193
How Endeca and Stratify Classify Unstructured Documents . . . 196
Overview of the Integration Process . . . . . . . . . . . . . . . . . . . . . . . 199
Required Stratify Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Developing a Stratify Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Building a Taxonomy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Exporting a Taxonomy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Creating a Pipeline to Incorporate Stratify . . . . . . . . . . . . . . . . . . 204
Creating a CAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Classifying Documents with Stratify Classification Server . . . . . 207
Adding a Property Mapper and Indexer Adapter . . . . . . . . . . . . . 212
Integrating a Stratify Taxonomy. . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Running the First Baseline Update . . . . . . . . . . . . . . . . . . . . . . . . 215
Loading a Dimension and its Dimension Values . . . . . . . . . . . . . 216
About Synonym Values and Dimension Values. . . . . . . . . . . . . . . 219
Mapping a Dimension Based on a Stratify Taxonomy . . . . . . . . . 222
Running the Second Baseline Update . . . . . . . . . . . . . . . . . . . . . . 222
Updating a Taxonomy in Your Pipeline . . . . . . . . . . . . . . . . . . . . . 223

SECTION IV LOGGING AND PERFORMANCE FEATURES


Chapter 12 Forge Hierarchical Logging System
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Log Levels and Message Categories . . . . . . . . . . . . . . . . . . . . . . . 228
Log Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
Message Category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

Log Appenders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232


Format of the Appenders. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Configuring MustMatch Messages . . . . . . . . . . . . . . . . . . . . . . . . 236
Configuring the Dimension Server Match Count Log . . . . . . . . . 237
Reference log.ini File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
A Simple Reference log.ini file . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
The Forge Logging Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

Chapter 13 Using Multithreaded Mode


Understanding Multithreaded Mode . . . . . . . . . . . . . . . . . . . . . . . . . 245
Costs of Multithreaded Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Configuration for Multithreaded Mode . . . . . . . . . . . . . . . . . . . . . . . 247
Multithreaded Navigation Engine Performance . . . . . . . . . . . . . . . . 247
Application Query Characteristics . . . . . . . . . . . . . . . . . . . . . . . . 248
Thread Pool Size and OS Platform . . . . . . . . . . . . . . . . . . . . . . . . 249
Hyperthreaded Intel Processors . . . . . . . . . . . . . . . . . . . . . . . 249
Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
Solaris . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

Chapter 14 Coremetrics Integration


Using the Integration Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254

SECTION V OTHER ADVANCED FEATURES


Chapter 15 Implementing Merchandising and Content Spotlighting
Introduction to Dynamic Business Rules and Promoting Records . . 258
Comparing Dynamic Business Rules to Content Management
Publishing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
Dynamic Business Rule Constructs . . . . . . . . . . . . . . . . . . . . . . . 260
Query Results and Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
Two Examples of Promoting Records . . . . . . . . . . . . . . . . . . . . . 262
An Example with One Rule Promoting Records. . . . . . . . . . . 263

An Expanded Example with Three Rules. . . . . . . . . . . . . . . . . 265


Suggested Workflow Using Endeca Tools to Promote Records . . . . 268
Incremental Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
Building the Supporting Constructs for a Rule . . . . . . . . . . . . . . . . . 270
Ensuring Promoted Records are Always Produced . . . . . . . . . . . 270
Creating a Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Controlling the Number of Promoted Records. . . . . . . . . . . . 272
Performance and the Maximum Records Setting . . . . . . . . . 273
Ensuring Consistent Property Usage with
Property Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Indicating How to Display Promoted Records. . . . . . . . . . . . . 275
Creating Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Creating a Rule and Ordering Its Results . . . . . . . . . . . . . . . . . . . 278
Specifying When to Promote Records . . . . . . . . . . . . . . . . . . . . . . 279
Specifying a Time Trigger to Promote Records . . . . . . . . . . . . . . 283
Synchronizing Time Zone Settings. . . . . . . . . . . . . . . . . . . . . . 284
Specifying Which Records to Promote . . . . . . . . . . . . . . . . . . . . . 284
Adding Custom Properties to a Rule . . . . . . . . . . . . . . . . . . . . . . . 286
Adding Static Records in Rule Results . . . . . . . . . . . . . . . . . . . . . 287
Order of Featured Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
No Uniqueness Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
No Maximum Record Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Sorting Rules in the Rules View . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Prioritizing Rules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
Presenting Rule Results in a Web Application. . . . . . . . . . . . . . . . . . 291
Required Navigation Engine URL Query Parameters . . . . . . . . . 292
Adding Web Application Code to Extract Rule Results . . . . . . . . 293
Sample Java Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
Sample ASP .NET Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Sample COM Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
Adding Web Application Code to Render Rule Results . . . . . . . . 301
Grouping Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Deleting a Rule Group. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303

Prioritizing Rule Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304


Interaction Between Rules and Rule Groups. . . . . . . . . . . . . . . . 305
Performance Considerations for Dynamic Business Rules. . . . . . . 305
Rules without Explicit Triggers . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
Using an Agraph and Dynamic Business Rules . . . . . . . . . . . . . . . . 306
Applying Relevance Ranking to Rule Results . . . . . . . . . . . . . . . . . . 308
About Overloading Supplement Objects . . . . . . . . . . . . . . . . . . . . . . 309

Chapter 16 Implementing User Profiles


Profile-Based Trigger Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Developer Studio Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
User Profile Query Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
Objects and Method Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
Java Code Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
.NET C# Code Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
COM Code Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
Performance Impact of User Profiles . . . . . . . . . . . . . . . . . . . . . . . . 318

Chapter 17 Implementing Partial Updates


About Partial Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
Partial Update Capabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Partial Updates Reference Implementation . . . . . . . . . . . . . . . . 322
Baseline Pipeline Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
Creating a Partial Update Pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . 324
Creating the Record Adapter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
Creating the Record Manipulator . . . . . . . . . . . . . . . . . . . . . . . . . 327
IF Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
UPDATE_RECORD Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
Examples of UPDATE_RECORD Expressions. . . . . . . . . . . . . 333
UPDATE_RECORD Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
Format of Update Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
Format of Records to Be Deleted . . . . . . . . . . . . . . . . . . . . . . 335
Format of Records to Be Updated. . . . . . . . . . . . . . . . . . . . . . 335

Format of Records to Be Added . . . . . . . . . . . . . . . . . . . . . . . . 335


Format of Records in Your Implementation . . . . . . . . . . . . . . 335
Creating the Update Adapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
Dimension Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
Dimension Adapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
Dimension Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
Naming Format of Update Source Data Files. . . . . . . . . . . . . . . . 339
Index Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
Record Specification Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
Navigation Engine Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
Dgidx Flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Control Script Development and Execution . . . . . . . . . . . . . . . . . . . . 343
Directory Structure for Updates . . . . . . . . . . . . . . . . . . . . . . . . . . 344
Running the Baseline Updates Script . . . . . . . . . . . . . . . . . . . . . . 346
Step 1: Delete Old Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
Step 2: Run Forge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
Step 3: Run Dgidx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
Step 4: Stop the Navigation Engine . . . . . . . . . . . . . . . . . . . . . 347
Step 5: Move the Index Files to the Dgraph Directory . . . . . . 348
Step 6: Start the Navigation Engine . . . . . . . . . . . . . . . . . . . . . 348
Running the Partial Updates Script . . . . . . . . . . . . . . . . . . . . . . . . 349
Step 1: Run Forge on the New Source Data . . . . . . . . . . . . . . 349
Step 2: Apply a Timestamp to the Record File . . . . . . . . . . . . 350
Step 3: Update the Navigation Engine . . . . . . . . . . . . . . . . . . . 351
Adding Other Bricks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
URL Update Command Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 353
Partial Updates in Agraph Implementations . . . . . . . . . . . . . . . . . . . 355
Choosing a Distribution Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . 355
How the Agraph Partitions Handle Updates . . . . . . . . . . . . . . 357
Use of Record Spec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
Naming Convention for Source Data Files . . . . . . . . . . . . . . . . . . 358
Random Distribution Format . . . . . . . . . . . . . . . . . . . . . . . . . . 358
Deterministic Distribution Format . . . . . . . . . . . . . . . . . . . . . . 359

Configuring the Partial Updates Pipeline. . . . . . . . . . . . . . . . . . . 360


Configuring the Record Adapter . . . . . . . . . . . . . . . . . . . . . . . 360
Configuring the Record Manipulator . . . . . . . . . . . . . . . . . . . 361
Configuring the Update Adapter . . . . . . . . . . . . . . . . . . . . . . . 365
Control Script for Agraph Updates . . . . . . . . . . . . . . . . . . . . . . . . 367
Forge Partial Updates Brick . . . . . . . . . . . . . . . . . . . . . . . . . . 368
Distributing the Forge Output to the Dgraphs . . . . . . . . . . . . 368

Chapter 18 Using the Agraph


What You Should Know First. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
Overview of Distributed Query Processing . . . . . . . . . . . . . . . . . . . . 371
Agraph Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
Data Foundry Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
Guidance about When to Use an Agraph . . . . . . . . . . . . . . . . . . . 376
Implementation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
Modifying the Project for Agraph Partitions . . . . . . . . . . . . . . . . . . . 376
Provisioning an Agraph Implementation . . . . . . . . . . . . . . . . . . . . . . 378
Running an Agraph Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 383
Agraph Presentation API Development . . . . . . . . . . . . . . . . . . . . . . . 384
Agraph Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
Agraph Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
Control Script Environment Considerations . . . . . . . . . . . . . . . . . . . 386
Arranging Partitions and Files . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
Agraph and Dynamic Business Rules. . . . . . . . . . . . . . . . . . . . . . 387

Chapter 19 Using Internationalized Data


Installing the Supplemental Language Pack . . . . . . . . . . . . . . . . . . 390
Specifying the License Key . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
Configuring Forge Components for Languages . . . . . . . . . . . . . . . . 391
Setting the Encoding for the Incoming Source Data . . . . . . . . . . 391
Specifying the Language for Documents . . . . . . . . . . . . . . . . . . . 393
Forge Language Support Table . . . . . . . . . . . . . . . . . . . . . . . . 394
Performance Considerations for Language Identification . . . . . 397

Configuring Languages for the Navigation Engine . . . . . . . . . . . . . . 398


Using Language Identifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
Specifying a Global Language ID . . . . . . . . . . . . . . . . . . . . . . . 400
Specifying a Per-Record Language ID . . . . . . . . . . . . . . . . . . . 401
Specifying a Per-Dimension/Property Language ID. . . . . . . . 401
Specifying a Per-Query Language ID . . . . . . . . . . . . . . . . . . . . 402
Configuring Language-Specific Spelling Correction . . . . . . . . . . 403
Using Encoding in the Web Application . . . . . . . . . . . . . . . . . . . . . . . 405
Setting the Encoding for URLs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
Setting the Page Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
Viewing Navigation Platform Logs . . . . . . . . . . . . . . . . . . . . . . . . . . . 406

Index
Preface

The Endeca® Navigation Platform is the foundation for building
applications based on Endeca Navigation Engine® technology. With the
Endeca Navigation Platform, you can build solutions that allow your
users to quickly, precisely, and easily search and navigate through
large data sets, avoiding all the traditional problems associated
with information overload and finding information online. Endeca
applications generate precise, relevant results with sub-second
response times, even across very large data sets.

The Endeca Navigation Platform allows you to build Guided
Navigation® functionality into your Web applications. The Endeca
Guided Navigation solution puts the results of all search,
navigation, and analytic queries in an organized context that shows
users precisely how to refine and explore further. This helps solve
the problems associated with information overload by guiding users
as they quickly, precisely, and easily navigate through large data
sets. The Endeca Navigation Platform is based on technology that
makes it possible to scale to very large data sources and user loads
while running on low-cost hardware.

About This Guide


The Endeca Advanced Features Guide provides
procedures for implementing advanced Endeca features
such as the Content Acquisition System and partial
updates.

Who Should Use This Guide


This guide is intended for developers who are implementing an
Endeca system.

Symbols and Conventions


The Endeca documentation set uses the following symbols and
conventions:

IMPORTANT: Text marked as important requires special attention.

Note: Notes provide related information, recommendations, and
suggestions.

1. Numbered lists, when the order of the items is important.
   a. Alphabetical lists, when the order of secondary items is
      important.
• Bulleted lists, when the order of the items is unimportant.


Italic text represents variables you should substitute a value for,
such as:
C:\RootDirectory\MyDirectory\MyFile

Italic text may also indicate new terms that appear in the
Endeca Glossary.

Courier text indicates code snippets or commands that you should
enter exactly as they are written in the documentation.

Endeca Documentation Set


Note: In addition to the documentation deliverables listed
below, you can find useful information, including the Endeca
Performance Tuning Guide, in the knowledge base on the
Endeca Customer Support site at https://customers.endeca.com.

The Endeca documentation set consists of the following:


• Endeca Installation Guide for UNIX and Endeca
Installation Guide for Windows describe how to install
Endeca software.
• Endeca Migration Guide provides information on
migrating from previous versions of Endeca software.
• Endeca Concepts Guide introduces the critical
concepts you should understand before learning how
to build an Endeca application. The information in this
guide is the foundation upon which all the other
Endeca documentation depends.


• Endeca Developer's Guide for Java, Endeca Developer's Guide for
COM, and Endeca Developer's Guide for .NET provide an overview of
the Endeca development process as well as procedures and code
snippets for all non-advanced Endeca development tasks.
• Endeca Advanced Features Guide provides
procedures for implementing advanced Endeca
features such as the Content Acquisition System and
partial updates.
• Endeca Administrator's Guide for UNIX and Endeca
Administrator's Guide for Windows provide
information on using Endeca's administrative and
logging tools to configure and manage your Endeca
implementation, and create logging reports.
• Endeca Tools Guide provides information on
configuring and administering Endeca tools, including
the Endeca Manager, Endeca Developer Studio, and
Endeca Web Studio.
• Endeca Developer Studio Help provides online
information for developing data pipelines using the
Endeca Developer Studio.
• Endeca Web Studio Help provides online information
for the administrative tasks, as well as search and
merchandising configuration, that you can do using
Endeca Web Studio.
• Endeca Javadocs provide online access to class and
method descriptions for the Java version of the
Presentation and Logging APIs.


• Endeca API Guide for COM, Endeca API Guide for Perl, and Endeca
API Guide for .NET provide class and method descriptions for the
COM, Perl, and .NET versions of the Presentation and Logging APIs.
The Endeca API Guide for .NET is in an online format.
• Endeca Security Guide for Java and Endeca Security
Guide for .NET and COM describe how to implement
user authentication and how to structure your data to
limit access to only those users with the correct
permissions. The Java version of this guide also
provides information on using SSL certificates and
encryption to secure your Endeca application.
• Endeca Performance Tuning Guide provides
guidelines on monitoring and tuning the performance
of the Endeca Navigation Engine. It also contains tips
on resolving associated operational issues.
• Endeca Content Adapter Developer's Guide describes
the Content Adapter Development Kit (CADK), a
framework that provides developers with a flexible
and simple mechanism to extract data from a data
source and load it into Forge. The CADK is only
available from Endeca customer support.
• Endeca Data Indexing API Guide provides class and
method descriptions of the Data Indexing API and
describes how to use the API to move source data to
the Forge directory and run updates.
• Endeca Forge API Guide for Perl provides online information for
the class and method descriptions of the Perl Manipulator
component. You can use a Perl manipulator within a data pipeline to
perform record manipulation.
• Endeca XML Reference provides detailed, online
reference information for the XML files used in a Data
Foundry pipeline.
• Endeca Glossary defines terms used in the Endeca
Navigation Platform documentation set.
• Release Announcement describes the major new features and changes
for the release.
• Release Notes detail the changes specific to the release,
including bug fixes and new features.
• Endeca Third-Party Software Usage and Licenses
provides copyright, license agreement, and/or
disclaimer of warranty information for the third-party
software packages that Endeca incorporates.

Contacting Endeca Standard Customer Support


You can contact Endeca Standard Customer Support
through the online Endeca Support Center
(https://customers.endeca.com).

The Endeca Support Center provides registered users with important
information regarding Endeca software, implementation questions,
product and solution help, training and professional services
consultation as well as overall news and updates from Endeca.



SECTION I
Data Import Features


Chapter 1
Content Acquisition System

The Content Acquisition System (CAS) provides the capability to
crawl file systems, HTTP hosts, and HTTPS hosts to fetch documents
in a variety of formats. You use a record adapter to read in the
documents to a CAS pipeline. Once read into the CAS pipeline, Forge
processes the documents and converts them into Endeca records.
These records can contain property values, dimension values, and
metadata based on each document's content. You can then build an
Endeca application to access the records and allow your application
users to search and navigate the document contents contained in the
records.

Note: You can use either Endeca InFront® or ProFind® to build a CAS
pipeline; however, InFront can process only .html and .txt
documents. ProFind can process over 200 document types. See
“Formats Supported by ProFind” on page 81 for the full list of
document types that ProFind supports.

Sections of This Chapter


This chapter is divided into the following sections:
• The introductory sections, which provide an overview of
CAS and its components.

• “URL and Record Processing”, which describes the flow and
processing of URLs and records in a CAS pipeline.
• “Source Documents and Endeca Records”, which
describes the relationship of source documents that
CAS crawls with the Endeca records that CAS
produces.
• “Creating a Full Crawling Pipeline”, which provides
the procedure to create a CAS pipeline.
• “Properties Generated by CAS”, which describes each
property generated by the components in a CAS
pipeline.
• “Formats Supported by ProFind”, which lists the
source document types that Profind can process.

CAS and Security Information


The CAS also supports accessing hosts that require basic
client authentication and HTTPS authentication. For
details on setting up the CAS to crawl secure resources,
see the chapter “Web Crawling with Authentication” on
page 89.

In addition, the CAS can be used to support the Access Control
System. The CAS generates access control list (ACL) properties for
each record. These properties can be used in conjunction with
security login modules to limit access to records based on user
login profiles. (Login modules are a part of the Access Control
System.) For details on gathering security information, see the
Endeca Security Guide.

Components that Support CAS


Developer Studio exposes CAS functionality using the
following components. These components are the core of
a CAS pipeline.
• A spider—Crawls documents starting at the root URLs
you specify. In the spider component, you indicate the
root URLs from which to begin a crawl, URL filters to
determine which documents to crawl, as well as other
configuration information that specifies how the crawl
should proceed. This information may include timeout
values for a crawl, proxy server values, and so on. The
spider crawls the URLs and manages a URL queue that
feeds the record adapter.
• A record adapter configured to read
documents—Receives URLs from the spider and
creates an Endeca record for each document located at
a URL. Each record contains a number of properties,
one of which is the record’s identifying URL. A
downstream record manipulator uses the record
identifier to retrieve the document and extract its data.
Unlike basic pipelines, which use a record adapter to
input source data from a variety of formats, a CAS
pipeline uses a record adapter to input URLs provided
by the spider. In a basic pipeline, the format type of a
record adapter matches the source data, for example,
delimited, XML, fixed-width, ODBC, and so on. In a
CAS pipeline, the format type of a record adapter must
be set to Document. See “URL and Record Processing”
on page 28 for further explanation of the flow of
records and URLs among CAS pipeline components.
• A record manipulator incorporating CAS
expressions—Contains several Data Foundry
expressions that support crawling and document
processing tasks. At a minimum, a record manipulator
contains one CAS expression to retrieve a URL based
on the record's identifier and a second expression to
extract and convert the document’s content to text. In
addition, you can include optional expressions to
identify the language of a document, remove
temporary properties after processing is complete, or
perform a variety of other processing tasks.

In addition to the CAS components listed above, a CAS pipeline can
use other components common to basic pipelines, for example, a
dimension adapter, dimension server, property mapper, indexer
adapter, and so on. For the sake of simplicity, this feature
document does not emphasize the common components but rather
focuses on creating, configuring, and explaining the components
specific to CAS.

CAS Reference Implementation


The Endeca Navigation Platform includes a sample CAS
reference implementation that you can examine and run.
This project is stored in ENDECA_ROOT\reference
\sample_CAS_data. The CAS reference implementation
crawls http://endeca.com and produces Endeca records
that can be searched with user-provided search terms
and navigated according to the meta-data properties for
the records. In other words, the navigation controls are
based on meta-data properties, such as date modified,
encoding, MIME type, file size, fetch status, and so on.

Full Crawls versus Differential Crawls


There are two types of CAS crawls a spider can perform:
• Full crawl—A crawl in which the spider retrieves all
the documents that it is configured to access. A full
crawl in a CAS pipeline is analogous to a full update in
a basic data pipeline. This feature document describes
a full crawl.
• Differential crawl—A re-crawl in which the spider
retrieves only the documents that have changed since
the last crawl. The differential crawl URL specified on
the General tab of the Spider editor indicates a file in
which Forge stores URLs and metadata about URLs. By
reading this file at the beginning of a crawl, the spider
can detect source documents that have been modified
(updated, added, and so on) since the last crawl. For
more information on creating a differential crawl,
contact your Endeca technical team.


URL and Record Processing


As mentioned in the introduction, Developer Studio
exposes crawling and text extraction functionality in the
context of a pipeline. It is important to understand how
this functionality fits into the Forge processing
framework.

The following figure shows a diagram of a full crawling pipeline.
There are two kinds of flow in the pipeline:
• URLs flow from the spider to the record adapter (a record adapter
that uses the Document format).
• Documents flow into the indexer adapter and get turned into
Endeca records.

When Forge executes this pipeline using Developer Studio, the flow
of URLs and records is as follows:
1. The terminating component (indexer adapter) requests the next
record from its record source (property mapper).


2. At this point, the property mapper has no record, so the
property mapper asks its record source (spider) for the next
record.
3. The spider has no record, so the spider asks its record
source (record manipulator) for the next record.
4. The record manipulator also has no record, so it
passes the request for the next record upstream to the
record adapter (with format type Document).
5. The record adapter then asks the spider for the next
URL it is to retrieve (the first iteration through, this is
the root URL configured on the Root URL tab of the
Spider editor).
6. Based on the URL that the spider provides, the record
adapter creates a record containing the URL and a
limited set of metadata.
7. The created record then flows down to the record
manipulator where the following takes place:
• The document associated with the URL is fetched
(using the RETRIEVE_URL expression).
• Content (searchable text) is extracted from the
document (using the CONVERTTOTEXT or PARSE_DOC
expression).
• Any URLs in the text are also extracted for additional
crawling.
8. The record then moves to the spider where additional
URLs (those extracted in the record manipulator) are
queued for crawling.


9. The property mapper performs property-to-dimension and source
property to Endeca property mapping.
10. The indexer adapter receives the record and writes it out to
disk.

The process repeats until there are no URLs in the URL queue
maintained by the spider.
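
This pull-based flow is mirrored in the pipeline configuration:
each downstream component names the component it pulls records
from, and the record adapter names the spider as its URL source.
The sketch below is illustrative only — the element and attribute
names (RECORD_ADAPTER, RECORD_SOURCE, SPIDER, and so on) are
assumptions modeled on the components described above, not the
authoritative schema; see the Endeca XML Reference for the real
syntax:

    <!-- Illustrative sketch of the record source chain; element and
         attribute names are assumptions, not the documented schema. -->
    <RECORD_ADAPTER NAME="LoadDocuments" FORMAT="DOCUMENT" DIRECTION="INPUT">
      <!-- The adapter asks this spider for the next URL (step 5). -->
      <PASS_THROUGH NAME="URL_SOURCE">CrawlSpider</PASS_THROUGH>
    </RECORD_ADAPTER>

    <RECORD_MANIPULATOR NAME="CrawlManipulator">
      <!-- Pulls records from the record adapter (step 4). -->
      <RECORD_SOURCE>LoadDocuments</RECORD_SOURCE>
    </RECORD_MANIPULATOR>

    <SPIDER NAME="CrawlSpider">
      <!-- Pulls records from the manipulator and queues any newly
           extracted URLs (steps 3 and 8). -->
      <RECORD_SOURCE>CrawlManipulator</RECORD_SOURCE>
    </SPIDER>

    <PROP_MAPPER NAME="PropertyMapper">
      <!-- Pulls records from the spider (step 2). -->
      <RECORD_SOURCE>CrawlSpider</RECORD_SOURCE>
    </PROP_MAPPER>

    <INDEXER_ADAPTER NAME="IndexerAdapter">
      <!-- The terminating component that initiates each request (step 1). -->
      <RECORD_SOURCE>PropertyMapper</RECORD_SOURCE>
    </INDEXER_ADAPTER>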

Redundant URLs

To minimize the possibility of creating redundant records in an
application, the spider avoids enqueuing a URL that is already
queued or that has already been retrieved. The spider does this by
comparing URLs to determine if they are equivalent.

Even in cases where flexible URL formatting allows two URLs to be
different strings and yet point to the same resource, the spider
can determine that both URLs are equivalent, and does not queue
both URLs for processing. For example,
http://www.endeca.com:80/about/../index.shtml and
http://www.endeca.com/index.shtml point to the same resource. The
spider recognizes the equivalence and does not queue both URLs.

However, there is no foolproof way to determine if two
non-equivalent URLs point to the same resource. If the following
configurations are present in a crawling environment, it is
possible to cause the spider to queue URLs that point to the same
resource:


• Symbolic links: For example, if file:///data.txt is a symbolic
link to file:///machineA/directory/data.txt, the spider queues both
URLs and processes data.txt twice.
• Virtual hosts: For example, if a spider was set up to
crawl both http://web.mit.edu/index.html and
http://www.mit.edu/index.html, it would crawl the
same resource twice because one URL is an alias for
the other.
• IP Address/DNS name equivalence: The spider does
not perform reverse DNS lookups to determine if two
URLs are equivalent. For example, if a spider is set up
to crawl http://www.mit.edu/index.html and
http://18.181.0.31/index.html, it crawls the same
resource twice because the second URL is the IP
address of the first.

Source Documents and Endeca Records


Note: For a general overview of Endeca records, see
“Understanding Endeca Records, Endeca Properties, and
Dimensions” in the Endeca Concepts Guide.

In the context of a CAS application, Endeca records represent the
data in the source documents that a spider crawls. CAS provides the
means to get the source documents into a pipeline. The source
documents themselves may reside on a file system, HTTP, or HTTPS
host and be in a wide variety of file formats (common examples
include .pdf, .html, .doc, and .txt). As is the case with basic
(non-CAS) pipelines, the source documents themselves are not
modified in any way by CAS pipeline processing.

Here is an example source document from the CAS reference
implementation. The reference implementation crawls
http://endeca.com; this file is the homepage for that URL. The
.html file has a title, text that describes Endeca solutions, and
links to other areas of the Web site.

[Figure: the example .html source document, with callouts
indicating its title, its links to more URLs, and its document
text]


Suppose a full crawl has been configured to crawl and process this
document. In a basic CAS application, the document content on the
.html page is crawled, converted to text, and stored in a single
Endeca property named Endeca.Document.Text. (Forge properties
generated in a CAS pipeline will be explained in later sections.)
In addition, CAS crawls the links to additional URLs and queues
them for further crawling and text extraction.

If only the Endeca.Document.Text and Endeca.Title properties are
mapped using the property mapper, the Endeca record for this
document looks like this:

[Figure: the resulting Endeca record, with callouts indicating the
record's title and the points in the Endeca.Document.Text property
where the “Solutions for Industry” and “Solutions for Enterprise
Search” sections begin]


Notice the following correspondence between the source document and
the Endeca record:
• The heading in blue is the document's title.
• The first line in the record's Endeca.Document.Text property also
lists the source document's title and the first line under the
document's main graphic. The text begins “Award Winning…”
• The remaining lines correspond to the two sections of the source
document called “Solutions for Industry” and “Solutions for
Enterprise Search.”

Note: The order in which a source document's data appears in an
Endeca record depends upon the structure of the actual source
document. The ordering shown here is an example.

Although this example is useful for illustrative purposes, such a
record is not very useful to application users. It shows the
simplest relationship between a source document and an Endeca
record with two properties (title and text). An application for
users is not likely to have all of a document's data contained in a
single property.

In a more user-oriented application, a CAS pipeline might include
Perl code to parse properties from the document text and use those
to build Endeca properties and dimensions. Alternatively, a CAS
pipeline might build dimensions based on any of the metadata
properties that CAS generates for a record. The CAS reference
implementation is designed in this way: it builds dimensions based
on metadata properties of a record.


Suppose the example shown so far were modified to show a record
page based on the same source document with the metadata properties
that Forge can generate. From the metadata properties available,
the pipeline is set up to expose properties such as encoding, date
modified, application type, fetch status, and so on, for use as
dimensions. These properties are mapped with a property mapper
component to provide both record details and navigation controls in
the application. (See the Developer Studio Help for information on
these tasks that are common to both CAS and basic pipelines.)

Re-running a full crawl would produce an Endeca record for the
source document that looks like this:

[Figure: the revised Endeca record, showing the Forge-generated
metadata properties and dimension values]


The properties Forge generates are prefixed with Endeca. A
user-oriented application may not employ all of the properties that
Forge generates, but it is useful to see some of them here. Notice
the following changes to the revised Endeca record:
• All metadata properties in the source document appear with the
record, rather than just Endeca.Document.Text and
Endeca.Document.Title as shown previously.
• There are dimension values based on metadata properties for
encoding, fetch status, and MIME type.

To build a record page that displays all properties, the property
mapper must be configured to map all of the Forge-generated source
properties to Endeca properties. (See the Developer Studio Help for
information on these tasks that are common to both CAS and basic
pipelines.)
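
As a purely hypothetical illustration of what such a configuration
amounts to (property mapping is normally done in the Property
Mapper editor, and the element and attribute names below —
PROP_MAPPER, PROPERTY_MAPPING, SOURCE, TARGET — are assumptions,
not the documented schema), a pair of explicit source-to-Endeca
property mappings might look like this:

    <!-- Hypothetical sketch only; element and attribute names are
         assumptions for illustration, not the documented schema. -->
    <PROP_MAPPER NAME="PropertyMapper">
      <RECORD_SOURCE>CrawlSpider</RECORD_SOURCE>
      <!-- Map Forge-generated source properties to Endeca properties. -->
      <PROPERTY_MAPPING SOURCE="Endeca.Title" TARGET="Title"/>
      <PROPERTY_MAPPING SOURCE="Endeca.Document.Text" TARGET="DocumentText"/>
    </PROP_MAPPER>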

Property Name Syntax

During a crawl, Forge produces properties according to a
standardized naming scheme. The naming scheme is made up of a
qualified name with a period (.) to separate qualifier terms. The
first term, Endeca, indicates that the property was automatically
created by Forge. There may be any number of additional terms after
Endeca, depending on the property value being described.

Simple properties require only one additional term to fully qualify
the property, for example, Endeca.Identifier or Endeca.Title.
Often, a second term describes a property category and a third term
fully qualifies the property, for example, Endeca.Document.Body or
Endeca.Fetch.Timeout. Less frequently, properties require
additional terms to be fully qualified.

The following table provides an overview of naming syntax and
includes several common property examples.

Common Qualifier  Description                   Examples with Fully
Term                                            Qualified Names

Document          Text of the document or       Endeca.Document.Text
                  metadata about the            Endeca.Document.Revision
                  document.                     Endeca.Document.MimeType
                                                Endeca.Document.Language

Fetch             Configuration information     Endeca.Fetch.ConnectTimeout
                  about how the document is     Endeca.Fetch.Proxy
                  retrieved.

ACL               Security information          Endeca.ACL.Allow.Read
                  (access control list)         Endeca.ACL.Allow.Write
                  about the document.           Endeca.ACL.Allow.Execute

Relation          Reference information to      Endeca.Relation.References
                  other documents.

For a description of each property generated by CAS, see
“Properties Generated by CAS” on page 73.


Viewing all Properties Generated by CAS

Depending on the type of source documents a spider crawls, Forge
can generate dozens of properties for each record. Although you may
or may not employ all the properties in your application, it is
useful to see which properties are available. You control which
properties are available by modifying the property mapper in your
CAS pipeline to map all source properties to Endeca properties.

Note: The following procedure assumes you created a full crawling
pipeline and can also access your Endeca records via an Endeca
application.

To view all properties:

1. In the Project tab of Developer Studio, double-click Pipeline
Diagram.
2. Double-click the property mapper. The Property
Mapper editor displays.
3. Click the Advanced tab of the Property Mapper editor.
4. Check “If no mapping is found, map source properties
to Endeca:” and then click Properties.
5. Click OK.
6. Perform a full update.
7. Start an Endeca application, for example, a generic JSP
reference implementation, and view your Endeca
records.


See the Endeca Developer Studio Help for more information about
configuring the property mapper and running full updates.

Creating a Full Crawling Pipeline


The remaining sections of this feature document describe
how to create and configure a full crawling pipeline using
Developer Studio. As mentioned in the introduction, the
goal of the section is to describe the pipeline components
that are specific to crawling. Therefore, components that
you would create in both a crawling pipeline and in a
basic pipeline (dimension server, property mapper,
indexer adapter, and so on) are omitted here for
simplicity. The document focuses on the processing loop
for a crawling pipeline that is made up of the record
adapter, record manipulator, and spider components.

Note: Creating a differential crawl pipeline requires more than
just providing a URL to store metadata about URLs. In addition to
providing the URL, you have to create a differential pipeline,
which requires a different design than that of a full crawl. For
more information on creating a differential crawl pipeline, contact
your Endeca technical team for assistance.

The high-level overview of a full crawling pipeline is as follows:


1. Create a record adapter to read documents (required). See
“Creating a Record Adapter to Read Documents” on page 41.
2. Create a record manipulator to perform the following
tasks. See “Creating a Record Manipulator” on
page 43.
• Retrieve documents from a URL (required).
• Extract and convert document text for each URL
(required).
• Identify the language of a document (optional).
• Remove document body properties (optional).
3. Modify records with a Perl manipulator (optional). See
“Modifying Records with a Perl Manipulator” on
page 54.
4. Create a spider to send URLs to the record adapter
(required). See “Creating a Spider” on page 55.
• Provide root URLs from which to start a crawl
(required).
• Configure URL extraction settings (required).
• Specify a record source for the spider (required).
• Specify spider settings such as timeout values and
proxy servers (optional).
5. Create a record manipulator to remove any
unnecessary records after processing (optional). See
“Removing any Unnecessary Records after a Crawl” on
page 68.


Here is an example of a CAS pipeline that calls out the core CAS components and also shows the components common to both basic and CAS pipelines (that is, dimension flow, a property mapper, an indexer adapter, and so on).

[Pipeline diagram: a record adapter of type Document feeds a record manipulator, which feeds a spider; the spider sends queued URLs back to the record adapter, forming the crawling loop alongside the common components.]

Creating a Record Adapter to Read Documents

A record adapter reads in the documents associated with the URLs provided by the spider component, and creates a record for each document. As long as the spider has URLs queued, the record adapter creates a record for each URL until all are processed.


To create a Document record adapter:

1. Start Developer Studio.
2. From the File menu, select New Project.
3. In the Project tab of Developer Studio, double-click Pipeline Diagram.
4. In the Pipeline Diagram editor, click New.
5. Select Record > Adapter. The Record Adapter editor displays.
6. In the Name text box, type in the name of this record adapter.
7. In the Direction frame, make sure the Input option is selected.
8. From the Format drop-down list, choose Document.
9. Leave the URL text box empty, and leave Filter Empty Properties and Multi File unchecked. These settings are ignored by a record adapter configured for the Document format.
10. Enter a language encoding in the Encoding text box if you know that all of the source documents use the same encoding. If you do not provide an encoding value, CAS automatically attempts to determine the encoding of each document, either by requesting that information from the Web server or by examining the document's body.
11. Click the Pass Throughs tab of the record adapter.
   Note: You may have to use the left/right arrows to scroll to the Pass Throughs tab.


12. Enter URL_SOURCE in the Name text box and enter the name of the spider component in the Value text box. You will create and configure the spider component later in "Creating a Spider" on page 55. For now, you only have to choose the name of the spider. The URL source is required and must name a spider component.
13. Click Add.
14. Click OK to add the new record adapter to the project.
15. From the File menu, choose Save.

For a description of each property generated by the record adapter, see "Properties Generated by CAS" on page 73.
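Developer Studio writes this configuration into Pipeline.epx for you, so there is nothing to hand-edit. Purely as an illustration, a record adapter configured as above might be recorded along the following lines; the RECORD_ADAPTER and PASS_THROUGH element names are assumptions based on typical Pipeline.epx files, and the LoadDocuments and MySpider names are hypothetical:

<RECORD_ADAPTER NAME="LoadDocuments" FORMAT="DOCUMENT" DIRECTION="INPUT">
  <!-- URL_SOURCE is required and must name the spider component -->
  <PASS_THROUGH NAME="URL_SOURCE">MySpider</PASS_THROUGH>
</RECORD_ADAPTER>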

Creating a Record Manipulator

Expressions in a record manipulator perform document retrieval, text extraction, language identification, record or property clean-up, and other tasks related to crawling. These expressions are evaluated against each record as it flows through the pipeline, and the record is changed as necessary.

At a minimum, a CAS pipeline requires a record manipulator with two expressions: one to retrieve documents (RETRIEVE_URL) and another to convert documents to text (CONVERTTOTEXT or PARSE_DOC). In addition to these expressions, you can include other optional expressions to identify the language of documents (ID_LANGUAGE) or to delete the temporary files created on disk by RETRIEVE_URL (using REMOVE_EXPORTED_PROP).

The expressions associated with these operations are described in the sections below.

To create a record manipulator:

1. In the Project tab of Developer Studio, double-click Pipeline Diagram.
2. In the Pipeline Diagram editor, click New.
3. Select Record > Manipulator. The New Record Manipulator editor displays.
4. In the Name text box, type in the name of this record manipulator.
5. From the Record source drop-down list, choose the name of the record adapter that you created in "Creating a Record Adapter to Read Documents" on page 41.
6. Click OK to add the new record manipulator to the project.
7. From the File menu, choose Save.
8. If you are ready to add the expressions described in the sections below, double-click the record manipulator in your pipeline diagram. The Expression editor displays.


Adding a RETRIEVE_URL Expression

The RETRIEVE_URL expression is required in a CAS pipeline to retrieve a document from its URL and store the document in a file on disk. A STRING DIGEST sub-expression of RETRIEVE_URL typically determines the name of the file in which the document is stored. RETRIEVE_URL places the file's location into the Endeca.Document.Body property. Later in pipeline processing, a text extraction expression examines Endeca.Document.Body and converts the body content into text stored in Endeca.Document.Text.

Forge also places any metadata it can retrieve about the document in properties on the record.

To add RETRIEVE_URL to a record manipulator:

1. If the Expression editor is not already open, double-click the Pipeline Diagram on the Project tab of Developer Studio.
2. Double-click the record manipulator. The Expression editor displays.
3. Starting at the first line in the Expression editor, insert a RETRIEVE_URL expression using the example below as a guide. The nested sub-expressions within RETRIEVE_URL configure how it functions. Here are several important points to consider when configuring RETRIEVE_URL:
   • A STRING sub-expression is required to name the file created to store the document content for a URL. Typically, you use a STRING DIGEST expression to create a shorter property identifier (a digest) of the URL indicated by PROP_NAME. This digest is necessary because URLs may contain values that are invalid for use as file names. DIGEST creates a file name based on the URL but uses only the characters a-f and the digits 0-9, so the file name is valid.
   • The VALUE expression node in the CONST expression specifies the path where the contents of each URL are stored on disk after retrieval.
   • The PROP_NAME expression node in the DIGEST expression specifies the property that contains the URL to retrieve. The default name of this property is Endeca.Identifier.
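The printed guide shows the example expression as a screenshot. The following sketch reconstructs it from the sub-expressions described above; the ./state/crawl/ storage path is a hypothetical value, and the exact attribute set may vary by release (see the Endeca XML Reference):

<EXPRESSION NAME="RETRIEVE_URL" TYPE="VOID">
  <!-- CONST names the directory where fetched documents are written -->
  <EXPRESSION NAME="CONST" TYPE="STRING">
    <EXPRNODE NAME="VALUE" VALUE="./state/crawl/"/>
  </EXPRESSION>
  <!-- DIGEST builds a file-system-safe file name from the URL property -->
  <EXPRESSION NAME="DIGEST" TYPE="STRING">
    <EXPRNODE NAME="PROP_NAME" VALUE="Endeca.Identifier"/>
  </EXPRESSION>
</EXPRESSION>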

4. Click Check Syntax to ensure the expressions are well formed.
5. Click Commit Changes and close the Expression editor.

Note: It is not necessary to provide attribute values for the LABEL or URL attributes.

For additional information on expression configuration, see the Endeca XML Reference.

The following properties contain values that are passed as parameters to RETRIEVE_URL. The property values configure additional fetching options. The Endeca.Fetch properties exist for a record if you provided values on the Timeout tab, Proxy tab, and User Agent text box of the Spider editor.
• Endeca.Fetch.Timeout


• Endeca.Fetch.ConnectTimeout
• Endeca.Fetch.TransferRateLowSpeedLimit
• Endeca.Fetch.TransferRateLowSpeedTime
• Endeca.Fetch.Proxy
• Endeca.Fetch.UserAgent

For a description of each property generated by the RETRIEVE_URL expression, see "Properties Generated by CAS" on page 73.

Converting Documents to Text

An expression such as CONVERTTOTEXT or PARSE_DOC is required in a CAS pipeline to extract document content from the file created by RETRIEVE_URL and convert the content into text.

If you are using Endeca ProFind, you can use CONVERTTOTEXT to convert over 200 document types into text. If you are using Endeca InFront, you can use PARSE_DOC to convert .html and .txt documents. See "Formats Supported by ProFind" on page 81 for the full list of document types that ProFind supports.

After a record manipulator retrieves a URL and stores a path to the file in Endeca.Document.Body, a text extraction expression examines the file indicated by Endeca.Document.Body, extracts the document body from the file, and converts the document body into text. The text is stored by default in Endeca.Document.Text.


To guide text extraction and conversion, the text extraction expression refers to the Endeca.Document.MimeType and Endeca.Document.Encoding properties. If no Endeca.Document.Encoding exists, Forge attempts to identify the encoding automatically. See "Identifying the Language of the Documents" on page 50 if you want to explicitly identify encoding.

As the document body is being extracted from the file and converted to text, the expression examines the document body for any URLs. The text extraction expression adds any URLs it finds as Endeca.Relation.References properties to the record.

For example, if a product overview document contains links to ten product detail pages, the Endeca record for the overview document will have ten Endeca.Relation.References properties, one for each product detail link. When the record for this document is passed to the downstream spider component, the spider queues the URL in each Endeca.Relation.References property and crawls it. This process continues until CAS processes all URLs contained in a document.
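For illustration, here is a hypothetical fragment of such a record, using the PROP/PVAL record syntax shown later in "Properties Generated by CAS"; the overview URL and the two relative detail links are invented for the example:

<PROP NAME="Endeca.Identifier">
  <PVAL>http://example.com/overview.html</PVAL>
</PROP>
<!-- One Endeca.Relation.References property per extracted link -->
<PROP NAME="Endeca.Relation.References">
  <PVAL>detail1.html</PVAL>
</PROP>
<PROP NAME="Endeca.Relation.References">
  <PVAL>detail2.html</PVAL>
</PROP>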

One of the following text extraction expressions must be included in a CAS pipeline:

• CONVERTTOTEXT expression – Extracts documents based on content-type, converts the document body to text, and extracts any URL links contained in the document. This expression uses a document conversion library to convert files from more than 200 different document types into text. See "Formats Supported by ProFind" on page 81 for the complete list of supported document formats. CONVERTTOTEXT subsumes the functionality of PARSE_DOC by doing the same extraction and conversion on .html and .txt documents. CONVERTTOTEXT is only available as part of Endeca ProFind.

• PARSE_DOC expression – Extracts .html and .txt documents, converts the document body to text, and extracts any URL links contained in the document. PARSE_DOC is available as part of both Endeca InFront and Endeca ProFind.

To add a text extraction expression to a record manipulator:

1. In the Pipeline diagram of Developer Studio, double-click the record manipulator. The Expression editor displays.
2. After the RETRIEVE_URL expression, add either the CONVERTTOTEXT or the PARSE_DOC expression using the examples below as a guide. No nested expressions or expression nodes are required.
   • For CONVERTTOTEXT:

   <EXPRESSION NAME="CONVERTTOTEXT" TYPE="VOID"/>

   • For PARSE_DOC:

   <EXPRESSION NAME="PARSE_DOC" TYPE="VOID"/>

3. Click Check Syntax to ensure the expressions are well formed.


4. Click Commit Changes and close the Expression editor.

Note: It is not necessary to provide attribute values for the LABEL or URL attributes.

For additional information on expression configuration, see the Endeca XML Reference.

For a description of each property generated by either of the text extracting and converting expressions, see "Properties Generated by CAS" on page 73.

Identifying the Language of the Documents

If your pipeline requires explicitly identifying source documents that may be in multiple languages, you can use the ID_LANGUAGE expression in your record manipulator. This identification requirement may be necessary if you crawl a set of source documents where each document may be in a different language, and an aspect of your application depends on identifying the language. For example, an application might organize documents to be navigated by language.

The ID_LANGUAGE expression examines a property that you specify, determines the language of the document, and tags the record with a corresponding language value in an Endeca.Document.Language property. ISO 639 lists the valid language codes. See http://www.oasis-open.org/cover/iso639a.html for a full list of the language codes.


If you do not use the ID_LANGUAGE expression in your pipeline, the RETRIEVE_URL expression attempts to determine a language value based on the Content-type header of the document that a Web server returns to CAS. If no value exists for the Content-type header, then a text extraction expression (for example, CONVERTTOTEXT or PARSE_DOC) attempts to determine the language value and to generate the property.

The advantage of explicitly using the ID_LANGUAGE expression is twofold: you can specify any property to examine, and you can modify the number of bytes to examine in the property. Increasing the number of bytes leads to more accurate language detection. Decreasing the number of bytes improves processing performance.

To identify the language of a document:

1. In the Pipeline view of Developer Studio, double-click the record manipulator. The Expression editor displays.
2. After the text extraction expression, add the ID_LANGUAGE expression using the example below as a guide. Here are several important points to consider when configuring ID_LANGUAGE:
   • The PROPERTY expression node specifies the property to use for language identification. Typically, this is the Endeca.Document.Body property.
   • The LANG_PROP_NAME expression node specifies the property to store a value representing the language of the document. If unspecified, the value is stored in Endeca.Document.Language.
   • The LANG_ID_BYTES expression node specifies the number of bytes Forge uses to determine the language. A larger number provides a more accurate determination, but requires more processing time. The default value is 300 bytes.
3. Click Check Syntax to ensure the expressions are well formed.
4. Click Commit Changes and close the Expression editor.

Note: It is not necessary to provide attribute values for the LABEL or URL attributes.

Here is an example of an ID_LANGUAGE expression configured to examine 500 bytes of Endeca.Document.Body.
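(The printed guide shows this example as a screenshot. The following is a reconstruction based on the expression nodes described above; it assumes the EXPRNODE element syntax documented in the Endeca XML Reference.)

<EXPRESSION NAME="ID_LANGUAGE" TYPE="VOID">
  <!-- Examine the stored document body to detect the language -->
  <EXPRNODE NAME="PROPERTY" VALUE="Endeca.Document.Body"/>
  <!-- Store the result in the default language property -->
  <EXPRNODE NAME="LANG_PROP_NAME" VALUE="Endeca.Document.Language"/>
  <!-- Use 500 bytes rather than the default 300 for better accuracy -->
  <EXPRNODE NAME="LANG_ID_BYTES" VALUE="500"/>
</EXPRESSION>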

For additional information on expression configuration, see the Endeca XML Reference.

For a description of each property generated by ID_LANGUAGE, see "Properties Generated by CAS" on page 73.

Removing Document Body Properties

As a system clean-up task, you may want to remove the files indicated by each record's Endeca.Document.Body property. These files are no longer necessary after the text extraction expression runs. This is an optional task in a CAS pipeline that can occur after a text extraction expression evaluates each record.

As part of CAS document processing, the following two steps occur in the record manipulator:

• RETRIEVE_URL retrieves a URL and automatically exports its contents to a file indicated by Endeca.Document.Body.
• A text extraction expression, for example, CONVERTTOTEXT or PARSE_DOC, examines the file indicated by Endeca.Document.Body, converts the contents of the file to text, and stores the text in Endeca.Document.Text.

After the text extraction expression completes, you can use a REMOVE_EXPORTED_PROP expression to remove the exported file indicated by Endeca.Document.Body and also, if desired, the Endeca.Document.Body property itself.

To add REMOVE_EXPORTED_PROP to a pipeline:

1. In the Pipeline view of Developer Studio, double-click the record manipulator. The Expression editor displays.
2. After the text extraction expression (either CONVERTTOTEXT or PARSE_DOC), add a REMOVE_EXPORTED_PROP expression using the example below as a guide. Here are several important points to consider when configuring REMOVE_EXPORTED_PROP:
   • The PROP_NAME expression node specifies the name of the property that indicates the file to remove. Typically, this is the Endeca.Document.Body property.
   • The URL expression node specifies the URL that files were written to (by RETRIEVE_URL). This value may be either an absolute path or a path relative to the location of the Pipeline.epx file.
   • The PREFIX expression node specifies any prefix used in the file name to remove.
   • The REMOVE_PROPS expression node specifies whether to remove the property from the record after deleting the file where the property was stored. TRUE removes the property from the record after removing the corresponding file. FALSE does not remove the property.
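(The printed guide shows this example as a screenshot. The following sketch is a reconstruction based on the expression nodes described above; the ./state/crawl/ path is a hypothetical value that should match the path given to RETRIEVE_URL, and the optional PREFIX node is omitted.)

<EXPRESSION NAME="REMOVE_EXPORTED_PROP" TYPE="VOID">
  <!-- The property whose exported file should be deleted -->
  <EXPRNODE NAME="PROP_NAME" VALUE="Endeca.Document.Body"/>
  <!-- The directory where RETRIEVE_URL wrote the files -->
  <EXPRNODE NAME="URL" VALUE="./state/crawl/"/>
  <!-- Also drop the Endeca.Document.Body property from the record -->
  <EXPRNODE NAME="REMOVE_PROPS" VALUE="TRUE"/>
</EXPRESSION>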
3. Click Check Syntax to ensure the expressions are well formed.
4. Click Commit Changes and close the Expression editor.

Note: It is not necessary to provide attribute values for the LABEL or URL attributes.

For additional information on expression configuration, see the Endeca XML Reference.

Modifying Records with a Perl Manipulator

Although there is no requirement that a CAS pipeline use a Perl manipulator component, this component is useful for performing more extensive record modification during processing. For example, the component can be used to strip values out of a property such as Endeca.Document.Text and add the values back to a record for use in dimension mapping. The component can also be used to concatenate properties and add the resulting new property to a record, and so on.

For information about how to add a Perl manipulator component to a pipeline, see the Endeca Developer Studio Help. For information about how to implement Perl code in a Perl manipulator, see the Endeca Forge API Guide for Perl.

Creating a Spider

The spider component is the core of a CAS pipeline. Working in conjunction with a record adapter and a record manipulator, the spider forms a document-processing loop whose function is to get documents into a CAS pipeline. The primary function of the spider in a loop is to crawl URLs, filter URLs, send URLs to the record adapter, and manage the URL queue until all source documents are processed. For a review of the role of each component in this loop, see "URL and Record Processing" on page 28.

The Spider editor is where you indicate the URLs to crawl, create URL filters to determine which documents to crawl, and specify timeout, proxy, and other configuration information that controls how the crawl proceeds.


Once configured and run, the spider loops through processing documents in a CAS pipeline as described in the steps below. Note that the spider's tasks begin at step 5 in the larger process described earlier in "URL and Record Processing" on page 28. These steps focus only on the spider's document-processing loop.

1. For the first loop of source document processing, the spider crawls the root URL indicated on the Root URLs tab of the Spider editor.
2. Based on the root URL that the spider crawls, the record adapter creates a record containing the URL, indicated by Endeca.Identifier, and a limited set of metadata properties.
3. The newly created record then flows down to the record manipulator where the following takes place:
   • The document associated with the URL is fetched (using the RETRIEVE_URL expression) and stored in Endeca.Document.Body.
   • Content (searchable text) is extracted from Endeca.Document.Body (using the CONVERTTOTEXT or PARSE_DOC expression) and stored in Endeca.Document.Text.
   • Any URLs in Endeca.Document.Body are extracted for additional crawling and, by default, are stored in Endeca.Relation.References.
4. The record based on the root URL then moves downstream to the spider where additional URLs (those extracted from the root URL and stored in Endeca.Relation.References) are queued for crawling.


5. The spider crawls URLs from the record as indicated in the Endeca.Relation.References properties. This is the next loop of source document processing.
6. Based on the queued URL that the spider crawls, the record adapter creates a record containing the URL, indicated by Endeca.Identifier, and a limited set of metadata properties.
7. Steps 3 through 6 repeat until the spider processes all URLs and the record adapter creates corresponding records.

To create a spider:

1. In the Project tab of Developer Studio, double-click Pipeline Diagram.
2. In the Pipeline Diagram editor, choose New > Spider. The New Spider editor displays.
3. In the Name box, type a unique name for the spider. This should be the same name you specified as the value of URL_SOURCE when you created the record adapter.
4. If you want to limit the number of hops from the root URL (specified on the Root URLs tab), enter a value in the Maximum hops field. The Maximum hops value specifies the number of links that may be traversed beginning with the root URL before the spider reaches the document at a target URL. For example, if http://www.endeca.com is a root URL and it links to a document at http://www.endeca.com/news.html, then http://www.endeca.com/news.html is one hop away from the root.


5. If you want to limit the depth of the crawl from the root URL, enter a value in the Maximum depth field. Maximum depth is based on the number of separators in the path portion of the URL. For example, http://endeca.com has a depth of zero (no separators), whereas http://endeca.com/products/index.shtml has a depth of one. The /products/ portion of the URL constitutes one separator.
6. To specify the User-Agent HTTP header that the spider should present to Web servers, enter the desired value in the Agent name field. The Agent name identifies the name of the spider, as it will be referred to in the User-agent field of a Web server's robots.txt file. If you provide a name, the spider adheres to the robots.txt standard. If you do not provide a name, the spider responds only to rules in a robots.txt file where the value of the User-agent field is "*".
   Note: A robots.txt file allows Web-server administrators to identify robots, like spiders, and control what URLs a robot may or may not crawl on a Web server. The file specifies a robot's User-agent name and the rules associated with the name. These crawling rules configured in robots.txt are often known as the robots.txt standard or, more formally, as the Robots Exclusion Standard. For more information on this standard, see http://www.robotstxt.org/wc/robots.html. A sample appears below.
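(A minimal robots.txt sketch, assuming a hypothetical Agent name of "endeca_spider"; the file lives at the Web server's document root:)

# Rules applied only when the spider's Agent name matches
User-agent: endeca_spider
Disallow: /private/

# Rules applied to any robot that matches no named entry
User-agent: *
Disallow: /tmp/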

7. To instruct the spider to ignore the robots.txt file on a Web server, check Ignore robots. By ignoring the file, the spider does not obey the robots.txt standard and proceeds with the crawl using the parameters you configure.
8. If you want the spider to reject cookies, check Disable Cookies. If you leave this unchecked, CAS adds cookie information to the records during the crawl, and CAS also stores and sends cookies to the server as it crawls.
9. For the full crawl described in this chapter, do not provide any value in the Differential Crawl URL box. For information about configuring a differential crawl, contact your Endeca technical team for assistance.

Specifying Root URLs to Crawl

On the Root URLs tab, you provide the starting points for the spider to crawl. Each root URL must have a scheme of file, http, or https. The URL must be absolute and well formed. A useful URL reference is available at http://www.w3.org/Addressing/URL.

In addition to starting a crawl from a root URL, you can also start a crawl by posting data to a URL if necessary. You can simulate a form post (the HTTP POST protocol) by specifying a root URL with post syntax and values. To construct a POST URL, postfix the URL with "?", add name=value pairs delimited by "&", and then add a "$" followed by the post data. For example, given this URL:

http-post://web01.qa:8080/qa/post/NavServlet?arg0=foo&arg1=bar$link=1/3

The spider executes an HTTP POST request to web01.qa:8080/qa/post/NavServlet with query data arg0=foo&arg1=bar and post data link=1/3.


When you run the pipeline, the spider validates each root URL and checks whether or not the URL passes the appropriate filters, including the site's robots.txt exclusions (if the Ignore Robots checkbox is not set). If a root URL is invalid or does not pass any of the filters, an appropriate message is logged.

To specify root URLs:

1. In the Spider editor, select the Root URLs tab.
2. In the URL text box, type the location from which the spider starts crawling. This value can use file, http, https, or form post URLs.
3. Click Add.
4. Repeat steps 2 and 3 for additional locations.

Configuring URL Extraction Settings

On the URL Configuration tab, you provide the names for the properties used to store queued URLs, and you provide URL filters.

• Enqueue URL – indicates the name of the property that stores links (URLs) to other documents to crawl. When a spider crawls a root URL, CAS extracts any URLs contained on the root and adds those URLs as properties to the record. The spider queues the URLs, crawls them, and CAS creates additional records until all URLs stored in this property are processed. In a simple crawler application, you only need to specify the Endeca.Relation.References property here. This is the default property name, produced by either text extraction expression, that holds the URLs to be queued.

• URL Filter – specifies the filters by which the spider includes or excludes URLs during a crawl. Filters are expressed as wildcards or Perl regular expressions. URL filters are mutually exclusive; that is, URL filter A does not influence URL filter B and vice versa. At least one URL filter is required to allow the spider to make additional processing loops over the root URL.

To configure URL extraction settings:

1. In the Spider editor, click the URL Configuration tab.
2. Right-click the Enqueue URLs folder and click Add. The Enqueue URL editor displays.
3. Enter a property name in the Enqueue URL editor that designates the property of the record that contains links to queue.
4. Optionally, select Remove if you want to remove the property from the record after its value has been queued.
5. Click OK.
6. If necessary, repeat steps 2 through 5 to add additional enqueue URL properties.
7. Select the URL Filters folder and click Add. The URL Filter editor displays.
8. In the URL Filter text box, enter either a wildcard filter or a regular expression filter according to the following guidelines and samples. There are additional samples in "Example Syntax of URL Filters" on page 64.


Filters can be specified either by using wildcard filters, for example *.endeca.com, or Perl regular expressions, for example /.*\.html/i. Generally, you should use Wildcard patterns for Host filters and Regular Expression patterns for URL filters.

This example shows a host include filter. It uses a wildcard to include all hosts that are in the endeca.com domain:

*.endeca.com

This example shows a URL inclusion filter that uses a regular expression to include all .html files, regardless of case:

/.*\.html/i
9. In the Type frame, select either Host or URL.
   • Host filters apply only to the host name portion of a URL.


   • URL filters are more flexible and can filter URLs based on whether the entire URL matches the specified pattern. For example, the spider may crawl a file system in which a directory named presentations contains PowerPoint documents that, for some reason, should not be crawled. They can be excluded using a URL exclusion filter with the pattern /.*\/presentations\/.*\.ppt/.
10. In the Action frame, select either Include or Exclude. Include indicates that the spider crawls documents that match the URL filter. Exclude indicates that the spider excludes documents that match the URL filter. A URL must pass both inclusion and exclusion filtering for the spider to queue it. In other words, a URL must match at least one inclusion filter, and a URL must not match any exclusion filter.
11. In the Pattern frame, select either Wildcard or Regular expression, depending on the syntax of the filter you specified in step 8.
12. Repeat steps 7 through 11 to create additional URL filters as necessary. At a minimum, the spider requires one host inclusion filter that corresponds to each root URL you specified on the Root URLs tab. For example, if you set up a spider to crawl http://endeca.com, then the spider needs a host include filter for endeca.com. The filter allows the spider to include any links found on the root for additional processing. If you omit this filter, the spider processes the root URL but not the URLs that the root contains.


Example Syntax of URL Filters

Here are additional examples of common URL filter syntax:

• To crawl only file systems (not HTTP or HTTPS hosts), use a URL inclusion filter with a regular expression pattern of: /^file/i
• To crawl only documents with an .htm or .html extension, use a URL inclusion filter with a regular expression pattern of: /\.html?$/i
• To crawl the development branch of the Example corporate Web site, use a URL inclusion filter with a regular expression pattern of: /example\.com\/dev\/.*/i
  This pattern confines the crawler to URLs of the form: example.com/dev/
• To restrict a crawler so that it does not crawl URLs on a corporate intranet (for example, those located on host intranet.example.com), use a Host exclusion filter with a regular expression pattern of: /intranet\.example\.com/


Specifying a Record Source for the Spider

A spider requires an upstream pipeline component to act as its record source. In most cases, this record source is the record manipulator that contains the RETRIEVE_URL and the text extraction expression. The record source could also be a record adapter or another spider.

To specify a record source:

1. In the Spider editor, select the Sources tab.
2. From the Record source list, choose the name of the record manipulator that you created.
3. Optionally, specify timeouts and proxy server settings as described in the two sections that follow.
4. Click OK to finish creating the spider.
5. From the File menu, choose Save.

Specifying Timeouts

The spider may be configured with three timeout values specified on the Timeout tab. These values control connection timeouts and URL retrieval timeouts for each URL that the spider fetches. Providing values on this tab is optional.

If you do provide values, the spider sends them with each URL to the record adapter. The record adapter generates Endeca.Fetch properties for each record. The property values become parameters to the RETRIEVE_URL expression during the fetch. For a description of each Endeca.Fetch property, see "Properties Generated by CAS" on page 73.

To specify timeouts:

1. In the Pipeline Diagram editor, double-click the Spider component. The Spider editor displays.
2. Click the Timeout tab.
3. If you want to limit the time that the spider spends retrieving a URL before aborting the fetch, type a value in the "Maximum time spent fetching a URL" text box.
4. If you want to limit the time that the spider spends making a connection to a host before aborting the retrieve operation, type a value in the "Maximum time to wait for a connection to be made" text box.
5. If you want to abort a fetch based on transfer rate, type values in the "Bytes/Sec for at Least" and "Second" text boxes.
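For illustration, Timeout tab values surface on each record as Endeca.Fetch properties. Here is a hypothetical record fragment, in the PROP/PVAL record syntax used later in this chapter, assuming a 30-second fetch limit and a 10-second connection limit (both values invented for the example):

<!-- Maximum seconds to spend fetching the URL -->
<PROP NAME="Endeca.Fetch.Timeout">
  <PVAL>30</PVAL>
</PROP>
<!-- Maximum seconds to wait for the connection to be made -->
<PROP NAME="Endeca.Fetch.ConnectTimeout">
  <PVAL>10</PVAL>
</PROP>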

Specifying Proxy Servers

You can configure a spider to use a proxy server when accessing HTTP or HTTPS URLs. There are several ways to configure the spider for use with proxy servers:

• You can specify a single proxy server, through which the spider accesses both HTTP and HTTPS URLs.
• You can specify separate proxy servers for HTTP URLs and HTTPS URLs.
• You can bypass proxy server settings for a specified URL.

You specify these settings on the Proxy tab of the Spider editor.

To specify a single proxy server for HTTP and HTTPS:

1. Click the Proxy tab.
2. Select Use a Proxy Server to Fetch URLs from the list.
3. In the Host text box of the Proxy server frame, type the name of the proxy server.
4. In the Port text box, type the port number that the proxy server listens to for URL requests from the spider.
5. If you want to bypass the specified proxy server for a URL, click the Bypass URLs button. The Bypass URLs editor displays.
6. Type the name of the host you want to access without the use of a proxy server and click Add. You can use wildcards to indicate a number of Web servers within a domain. Repeat this step as necessary for additional URLs.
7. Click OK.

To specify separate proxy servers for HTTP and HTTPS:

1. On the Proxy tab of the Spider editor, select Use Separate HTTP/HTTPS Proxy Servers from the list.
2. In the Host text box of the HTTP Proxy server frame, type the name of the proxy server.
3. In the Port text box, type the port number that the proxy server listens to for HTTP URL requests from the spider.
4. In the Host text box of the HTTPS Proxy server frame, type the name of the proxy server.
5. In the Port text box, type the port number that the proxy server listens to for HTTPS URL requests from the spider.
6. If you want to bypass the specified proxy server for a URL, click the Bypass URLs button. The Bypass URLs editor displays.
7. Type the name of the host you want to access without the use of a proxy server and click Add. Repeat this step as necessary for additional URLs.
8. Click OK on the Bypass URLs editor.
9. Click OK on the Spider editor.
10. From the File menu, choose Save.

Removing any Unnecessary Records after a Crawl

After the CAS components of a pipeline have processed all the source documents, you may want to remove any records that merely reflect source data structure before Forge writes out these records with an indexer adapter. This record removal is typically necessary when CAS creates records based on directory pages, index pages, or other forms of source documents that reflect the structure of the source data but do not correspond to a source document that you need in an application.


If you do not remove these records before indexing, the records become available to users of your Endeca application. For example, suppose a spider crawls a directory list page at ..\data\incoming\red\index.html and creates a corresponding record. You are unlikely to want users to search the record for the index.html page because it primarily contains a list of links; however, the spider must crawl the index page to queue and retrieve the other pages that index.html links to, such as ..\data\incoming\red\product1.html, ..\data\incoming\red\product2.html, ..\data\incoming\red\product3.html, and so on.

You can remove records from a pipeline using a REMOVE_RECORD expression. In the pipeline, the REMOVE_RECORD expression must appear in a record manipulator that is placed after the CAS record processing loop. Specifically, the expression must appear after the spider component because the spider needs to crawl all URLs that may appear on a directory page.


For example, see the position of the RemoveRecords component in the following example CAS pipeline:

[Pipeline diagram: the RemoveRecords record manipulator is placed downstream of the spider, after the CAS record processing loop.]

To add REMOVE_RECORD to a pipeline:

1. In the Project tab of Developer Studio, double-click Pipeline Diagram.
2. In the Pipeline Diagram editor, click New.
3. Select Record > Manipulator. The New Record Manipulator editor displays.
4. In the Name text box, type in the name of this record manipulator.
5. From the Record source drop-down list, choose the name of the spider that you created.
6. From the Dimension source drop-down list, choose the dimension source for the pipeline.
7. Click OK to add the new record manipulator to the project.
8. Open the PropertyMapper component and change its record source to the new record manipulator you just created.
9. From the File menu, choose Save.
10. In the Pipeline Diagram, double-click the record manipulator. The Expression editor displays.
11. Starting at the first line in the Expression editor, insert a REMOVE_RECORD expression using the example below as a guide. REMOVE_RECORD is typically used within an IF expression to remove records that meet or do not meet certain criteria. There are no nested expressions within REMOVE_RECORD to configure how it functions.
12. Click Check Syntax to ensure the expressions are well formed.
13. Click Commit Changes and close the Expression editor.
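(The printed guide shows this example as a screenshot. The following is a reduced sketch: the PROP_EXISTS test on a hypothetical Remove.Me flag property is illustrative only; consult the Endeca XML Reference for the conditional expressions available in your release.)

<EXPRESSION NAME="IF" TYPE="VOID">
  <!-- Condition: remove the record only when the flag property exists -->
  <EXPRESSION NAME="PROP_EXISTS" TYPE="BOOLEAN">
    <EXPRNODE NAME="PROP_NAME" VALUE="Remove.Me"/>
  </EXPRESSION>
  <!-- Action taken when the condition is true -->
  <EXPRESSION NAME="REMOVE_RECORD" TYPE="VOID"/>
</EXPRESSION>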

Note: It is not necessary to provide attribute values for the LABEL or URL attributes.

For additional information on expression configuration, see the Endeca XML Reference.


Handling Crawling Errors

Processing source documents, including retrieving and extracting text, can introduce problems. This section lists several common errors and any workarounds, if applicable.

• .pdf content in Endeca.Document.Text displays as binary data: If CAS processes .pdf files and the content from those files appears in your application as binary data, the .pdf files may contain custom-encoded embedded fonts. CAS cannot always correctly display content that contains custom-encoded embedded fonts. To solve the issue, CAS attempts to substitute a system font for the custom-encoded font. The substitution succeeds if the encoding in the substituted system font is the same as the custom encoding in the embedded font. When the substitution is not successful, you see binary data in Endeca.Document.Text.

Here are several issues related to retrieving documents from HTTP hosts, and an explanation of how the spider handles them:

• Connection timeout: The spider retries the request five times. Each timeout is logged in an informational message. After the fifth timeout, an error message is logged, and the record for the offending URL is created with its Endeca.Document.Status property set to "Fetch Aborted."
• URL not found: The spider logs a warning message that the URL could not be located and creates a record with its Endeca.Document.Status property set to "Fetch Failed."
• Malformed URL: The spider logs a warning message that the URL is malformed and creates a record with its Endeca.Document.Status property set to "Fetch Failed."
• Authentication failure: The spider logs a warning message that the URL could not be retrieved and creates a record with its Endeca.Document.Status property set to "Fetch Failed."

Properties Generated by CAS

The following table describes the properties generated by the record adapter:

Endeca.Identifier
    The source URL of this document.

Endeca.Document.NumberOfHops
    The minimum number of hops (link traversals) to get to this document from a root URL during the crawl.

Endeca.Fetch.Timeout
    The maximum time in seconds that Forge should wait to retrieve the resource indicated by Endeca.Identifier before aborting the retrieve operation.

Endeca.Fetch.ConnectTimeout
    The maximum time in seconds that Forge should wait to establish a connection to the server hosting the resource indicated by Endeca.Identifier before aborting the retrieve operation.

Endeca.Fetch.TransferRateLowSpeedLimit
    The minimum transfer rate in bytes per second below which Forge should abort the retrieve operation after the timeout specified in Endeca.Fetch.TransferRateLowSpeedTime.

Endeca.Fetch.TransferRateLowSpeedTime
    The maximum time in seconds that Forge should allow the transfer operation to fall below the minimum transfer rate specified in Endeca.Fetch.TransferRateLowSpeedLimit before aborting.

Endeca.Fetch.Proxy
    The URI (HOST:PORT) of the proxy server that Forge should use to retrieve the resource indicated by Endeca.Identifier.

Endeca.Fetch.UserAgent
    The User-Agent HTTP header that Forge should present to the server hosting the resource indicated by Endeca.Identifier. For more information about user agent settings, see the procedure on page 55.

Note: Records created by the record adapter may contain properties prefixed with the name Endeca.Fetch that describe how to access the resource identified by the Endeca.Identifier property. The Endeca.Fetch property values are created based on the values you provide, if any, in the Timeout tab of the Spider editor. When the spider sends a URL to the record adapter, the URL is accompanied by the timeout configuration values that determine how the URL should be retrieved. See "Specifying Timeouts" on page 65 for more information.


The following table describes the properties generated by the RETRIEVE_URL expression:

Endeca.Document.Body
    The body of the retrieved document. The value of this property is a path to the file that stores the document body on disk. A text extraction expression, such as CONVERTTOTEXT or PARSE_DOC, uses Endeca.Document.Body to extract the text of the document (Endeca.Document.Text).

Endeca.Document.Status
    The status of the fetch. This property can have any of the following values:
    • Fetch Succeeded – The document fetch was successful and the document body is in the file indicated by the Endeca.Document.Body property.
    • Fetch Skipped – The document was not retrieved because, based on the value of Endeca.Document.Revision, the document should not yet be considered for re-fetch. For example, if the HTTP Expires header specifies a time in the future, the document will not be fetched.
    • Fetch Aborted – The document was not retrieved. This status indicates that a potentially transient phenomenon, for example a timeout, prevented the document from being fetched. Additional information can often be found in scheme-specific properties.
    • Fetch Failed – The document was not retrieved. This status indicates that a non-transient phenomenon, for example an HTTP error like Document Not Found, caused the failure. Additional information can often be found in scheme-specific properties.

Endeca.Document.Revision
    The scheme-specific revision information of the document. This value is typically a timestamp of the document's last modified date and may also include revision information about references and metadata. The information contained in the revision varies from scheme to scheme (for example, file vs. http).

Endeca.Document.IsRedirection
    If the document is a redirection to another document (for example, an HTTP Redirect, or a symbolic link file), this property has the value true. The URL that the document redirects to is stored in an Endeca.Relation.References property.

Endeca.Document.IsUnchanged
    If the document is unchanged, either because the fetch was skipped or because the fetch determined that the document has not changed, this property has the value true. This property is generated in differential crawl pipelines, not full crawl pipelines.

Endeca.Document.MimeType
    The MIME type of the document, if it can be determined. Common examples of this property value include text/html, application/pdf, image/gif, and so on.

Endeca.Document.Encoding
    The encoding of the body of the document, if it can be determined. This property value is an ISO code that describes the encoding, for example, ISO-8859-1. The RETRIEVE_URL expression attempts to determine this value based on the Content-type header of the document that a Web server returns to CAS. If no value exists for the Content-type header, then a text extraction expression (for example, CONVERTTOTEXT or PARSE_DOC) attempts to determine the encoding value and to generate the property. If the value cannot be determined, Forge logs an error.

Endeca.Relation.References
    If a document references other documents, each reference is placed into this property. This is the default property that stores URLs to be queued by the spider.

Endeca.ACL.Allow.Read (Write, Execute, Delete, and so on)
    Security information about the document. There can be several dozen access control list (ACL) properties. These properties take their names from system security attributes. For more information, see the Endeca Security Guide.

Endeca.ACL.Deny.Read (Write, Execute, Delete, and so on)
    Security information about the document. For more information, see the Endeca Security Guide.

Endeca.Cookie
    Information about cookies associated with the fetched URL. This information may include the name of the cookie, the Web server's domain, the path to the Web server, fetch dates, and remove dates. When RETRIEVE_URL gets a Set-Cookie header as part of its HTTP response, RETRIEVE_URL can pass this value back to the server, when appropriate, to simulate a session.

Endeca.Document.Info.<scheme>.*
    Scheme-specific metadata supplied by each scheme. For example, the HTTP scheme supplies HTTP protocol information in the property Endeca.Document.Info.http.protocol, and the FILE scheme supplies the file type in the Endeca.Document.Info.file.Attribute property.

The following table describes properties generated by extracting document text:

Endeca.Document.Encoding
    The document encoding, which is added if it does not already exist.

Endeca.Document.Text
    The extracted text of the document. The text is extracted from the file indicated by Endeca.Document.Body.

Endeca.Title
    The title of the document.

Endeca.Relation.References
    References to the documents that this document links to. Endeca.Relation.References is the default property name produced by either text extraction expression. The property stores any additional URLs to be queued. Property values are either absolute URLs or URLs relative to the record's Endeca.Identifier property. For example, crawling a directory overview page that has links to three sub-category pages produces the following Endeca.Relation.References properties:

    <PROP NAME="Endeca.Relation.References">
      <PVAL>http://endeca.com/products/</PVAL>
    </PROP>
    <PROP NAME="Endeca.Relation.References">
      <PVAL>red/index.html</PVAL>
    </PROP>
    <PROP NAME="Endeca.Relation.References">
      <PVAL>white/index.html</PVAL>
    </PROP>
    <PROP NAME="Endeca.Relation.References">
      <PVAL>sparkling/index.html</PVAL>
    </PROP>

In addition to the properties described above, a text extraction expression may create additional properties with arbitrary names that describe metadata about a document, if such metadata exists. For example, if a document contains HTML META tags, the expression creates corresponding properties. Or, for example, if you parse MS Word documents that contain an Author attribute, the expression creates a corresponding property for author.

The following table describes properties generated by classifying documents with Stratify:

Endeca.Stratify.Topic.HID<hierarchy ID>=<topic ID>
    This property corresponds to the ID value of a topic in your published Stratify taxonomy. Each topic in your taxonomy has an ID value assigned by Stratify. For example, if an Eating Disorders topic has an ID of 2097222 in a health care taxonomy whose hierarchy ID is 15, then the Endeca property is Endeca.Stratify.Topic.HID15="2097222".

Endeca.Stratify.Topic.Name.HID<hierarchy ID>.TID<topic ID>=<topic name>
    This property corresponds to a topic name from your published Stratify taxonomy for its corresponding topic ID. For example, for the Eating Disorders topic in the health care taxonomy mentioned earlier, this property is Endeca.Stratify.Topic.Name.HID15.TID2097222="Eating Disorders".

Endeca.Stratify.Topic.Score.HID<hierarchy ID>.TID<topic ID>=<score>
    This property indicates the classification score between an unstructured document and the topic it has been classified into. The value of <score> is a percentage expressed as a value between zero and one. Zero indicates the lowest classification score (0%), and one indicates the highest score (100%). You can use this property to remove records from your application that have a low score for classification matching, for example, Endeca.Stratify.Topic.Score.HID15.TID2097222="0.719380021095276".


Property Generated by ID_LANGUAGE

Endeca.Document.Language
    The document language code. ISO 639 lists the valid language codes. Common examples include the following: en for English, es for Spanish, fr for French, de for German, ja for Japanese, and ko for Korean. See http://www.oasis-open.org/cover/iso639a.html for a full list of the language codes.

Formats Supported by ProFind

Endeca ProFind supports the following source document formats.
WORD PROCESSING FORMATS
GENERIC TEXT
ANSI Text ..................................................................................7 & 8 bit
ASCII Text .................................................................................7 & 8 bit
EBCDIC
HTML .................................................... through 3.0 (some limitations)
IBM FFT ............................................................................... All versions
IBM Revisable Form Text.................................................... All versions
Microsoft Rich Text Format (RTF) ...................................... All versions
Text Mail (MIME).................................................... No specific version
Unicode Text ....................................................................... All versions
WML ...................................................................................... Version 5.2
DOS WORD PROCESSORS
DEC WPS Plus (DX) ............................................................ through 4.0
DEC WPS Plus (WPL).......................................................... through 4.1


DisplayWrite 2 & 3 (TXT) ................................................... All versions


DisplayWrite 4 & 5 .................................................through Release 2.0
Enable ............................................................................3.0, 4.0 and 4.5
First Choice.......................................................................... through 3.0
Framework......................................................................................... 3.0
IBM Writing Assistant ...................................................................... 1.01
Lotus Manuscript .................................................................. Version 2.0
MASS11 ................................................................. Versions through 8.0
Microsoft Word ..................................................... Versions through 6.0
Microsoft Works.................................................... Versions through 2.0
MultiMate .............................................................. Versions through 4.0
Navy DIF.............................................................................. All versions
Nota Bene ............................................................................. Version 3.0
Novell WordPerfect .............................................. Versions through 6.1
Office Writer ...............................................................Versions 4.0 - 6.0
PC-File Letter ........................................................ Versions through 5.0
PC-File+ Letter ...................................................... Versions through 3.0
PFS:Write................................................................ Versions A, B and C
Professional Write................................................. Versions through 2.1
Q&A ...................................................................................... Version 2.0
Samna Word................................... Versions through Samna Word IV+
SmartWare II ...................................................................... Version 1.02
Sprint..................................................................... Versions through 1.0
Total Word ............................................................................ Version 1.2
Volkswriter 3 & 4.................................................. Versions through 1.0
Wang PC (IWP) .................................................... Versions through 2.6
WordMARC ....................................... Versions through Composer Plus
WordStar................................................................ Versions through 7.0
WordStar 2000....................................................... Versions through 3.0


XyWrite .......................................................... Versions through III Plus


WINDOWS WORD PROCESSORS
Adobe FrameMaker (MIF).................................................... Version 6.0
Hangul ................................................ Version 97 and 2002 (text only)
JustSystems Ichitaro......................... Versions 5.0, 6.0, 8.0 – 13.0, 2004
JustWrite ................................................................ Versions through 3.0
Legacy ................................................................... Versions through 1.1
Lotus AMI/AMI Professional ................................ Versions through 3.1
Lotus Word Pro ................Versions 96 through Millennium Edition 9.6 (text only)
Microsoft Write ..................................................... Versions through 3.0
Microsoft Word .................................................. Versions through 2003
Microsoft WordPad .............................................................. All versions
Microsoft Works.................................................... Versions through 4.0
Novell Perfect Works............................................................ Version 2.0
Novell/Corel WordPerfect .................................. Versions through 12.0
Professional Write Plus......................................................... Version 1.0
Q&A Write ........................................................................... Version 3.0
Star Office/Open Office Writer ...... Star Office Versions 5.2, 6.x, and 7.x; Open Office version 1.1 (text only)
WordStar................................................................................ Version 1.0
MACINTOSH WORD PROCESSORS
MacWrite II ........................................................................... Version 1.1
Microsoft Word (Mac) .......................................... Versions 4.0 — 2004
Microsoft Works (Mac) ......................................... Versions through 2.0
Novell WordPerfect ...................................... Versions 1.02 through 3.0
SPREADSHEET FORMATS
Enable .............................................................Versions 3.0, 4.0 and 4.5
First Choice ........................................................... Versions through 3.0
Framework............................................................................ Version 3.0


Lotus 1-2-3 (DOS & Windows)............................ Versions through 5.0


Lotus 1-2-3 (OS/2)................................................ Versions through 2.0
Lotus 1-2-3 Charts (DOS & Windows) ................ Versions through 5.0
Lotus 1-2-3 for SmartSuite...................... Versions 97 – Millennium 9.6
Lotus Symphony........................................ Versions 1.0, 1.1, and 2.0
Microsoft Excel Charts................................................Versions 2.x - 7.0
Microsoft Excel (Mac) ........................... Versions 3.0 – 4.0, 98 — 2004
Microsoft Excel (Windows)......................... Versions 2.2 through 2003
Microsoft Multiplan .............................................................. Version 4.0
Microsoft Works (Windows) ................................ Versions through 4.0
Microsoft Works (DOS) ........................................ Versions through 2.0
Microsoft Works (Mac) ......................................... Versions through 2.0
Mosaic Twin ......................................................................... Version 2.5
Novell Perfect Works............................................................ Version 2.0
PFS: Professional Plan.......................................................... Version 1.0
Quattro Pro (DOS) ............................................... Versions through 5.0
Quattro Pro (Windows) ..................................... Versions through 12.0
SmartWare II ....................................................................... Version 1.02
Star Office/Open Office Calc..... Star Office Versions 5.2, 6.x, and 7.x
......................................................Open Office version 1.1 (text only)
SuperCalc 5........................................................................... Version 4.0
VP Planner 3D ...................................................................... Version 1.0
PRESENTATION FORMATS
Corel/Novell Presentations ................................ Versions through 12.0
Harvard Graphics for DOS ...................................... Versions 2.x & 3.x
Harvard Graphics (Windows).................................. Windows versions
Freelance (Windows) ....................... Versions through Millennium 9.6
Freelance for OS/2 ............................................... Versions through 2.0
Microsoft PowerPoint (Windows) .............. Versions 3.0 through 2003

Microsoft PowerPoint (Mac) ................. Versions 4.0, 98 through 2004
StarOffice / OpenOffice Impress............... StarOffice 5.2, 6.x, and 7.x
.................................................................... OpenOffice 1.1 (text only)
GRAPHICS FORMATS
Adobe Photoshop (PSD)...................................................... Version 4.0
Adobe Illustrator............................................ Versions through 7.0, 9.0
Adobe FrameMaker graphics (FMV) ............ Vector/raster through 5.0
Adobe Acrobat (PDF)........................ Versions 2.1, 3.0 – 6.0, Japanese
Ami Draw (SDW) .................................................................. Ami Draw
AutoCAD Interchange and Native Drawing formats ... DXF and DWG
AutoCAD Drawing......... Versions 2.5 - 2.6, 9.0 - 14.0, 2000i and 2002
AutoShade Rendering (RND)............................................... Version 2.0
Binary Group 3 Fax............................................................. All versions
Bitmap (BMP, RLE, ICO, CUR, OS/2 DIB & WARP) .......... All versions
CALS Raster (GP4)...................................................Type I and Type II
Corel Clipart format (CMX).................................. Versions 5 through 6
Corel Draw (CDR) ..................................................... Versions 3.x – 8.x
Corel Draw (CDR with TIFF header) ....................... Versions 2.x – 9.x
Computer Graphics Metafile (CGM).......ANSI, CALS NIST version 3.0
Encapsulated PostScript (EPS) ...................................TIFF header only
GEM Paint (IMG).................................................... No specific version
Graphics Environment Mgr (GEM)............................. Bitmap & vector
Graphics Interchange Format (GIF) ...................... No specific version
Hewlett Packard Graphics Language (HPGL)........................Version 2
IBM Graphics Data Format (GDF) ...................................... Version 1.0
IBM Picture Interchange Format (PIF) ................................ Version 1.0
Initial Graphics Exchange Spec (IGES) ............................... Version 5.1
JFIF (JPEG not in TIFF format) ........................................... All versions
JPEG (including EXIF)......................................................... All versions

Kodak Flash Pix (FPX)........................................................ All versions
Kodak Photo CD (PCD)....................................................... Version 1.0
Lotus PIC.............................................................................. All versions
Lotus Snapshot .................................................................... All versions
Macintosh PICT1 & PICT2 ................................................. Bitmap only
MacPaint (PNTG).....................................................No specific version
Micrografx Draw (DRW) ...................................... Versions through 4.0
Micrografx Designer (DRW) ................................ Versions through 3.1
Micrografx Designer (DSF) ........................... Windows 95, version 6.0
Novell PerfectWorks (Draw)................................................ Version 2.0
OS/2 PM Metafile (MET) ...................................................... Version 3.0
Paint Shop Pro 6 (PSP) ................... Windows only, versions 5.0 – 6.0
PC Paintbrush (PCX and DCX)............................................ All versions
Portable Bitmap (PBM) ....................................................... All versions
Portable Graymap (PGM) .......................................No specific version
Portable Network Graphics (PNG)...................................... Version 1.0
Portable Pixmap (PPM)...........................................No specific version
Postscript (PS)............................................................................. Level II
Progressive JPEG .....................................................No specific version
Sun Raster (SRS) ......................................................No specific version
Star Office/Open Office Draw...........Star Office 5.2, 6.x, and 7.x and
....................................................... OpenOffice version 1.1 (text only)
TIFF .......................................................................... Versions through 6
TIFF CCITT Group 3 & 4 ........................................ Versions through 6
Truevision TGA (TARGA) .......................................................Version 2
Visio (preview) ........................................................................Version 4
Visio ............................................................... Versions 5, 2000 — 2003
WBMP ......................................................................No specific version
Windows Enhanced Metafile (EMF) .......................No specific version

Windows Metafile (WMF) ...................................... No specific version
WordPerfect Graphics (WPG & WPG2) ......... Versions through 2.0, 7, and 10
X-Windows Bitmap (XBM) ...........................................x10 compatible
X-Windows Dump (XWD)............................................x10 compatible
X-Windows Pixmap (XPM) ...........................................x10 compatible
COMPRESSED FORMATS
GZIP ..................................................................................... All versions
LZA Self Extracting Compress............................................. All versions
LZH Compress ..................................................................... All versions
Microsoft Binder ............................................................ Versions 7.0-97
...................... (conversion of Binder is supported only on Windows)
MIME Text Mail
UUEncode
UNIX Compress
UNIX TAR
ZIP..................................................... PKWARE versions through 2.04g
DATABASE FORMATS
Access ................................................................... Versions through 2.0
dBASE ................................................................... Versions through 5.0
DataEase ...............................................................................Version 4.x
dBXL...................................................................................... Version 1.3
Enable .............................................................Versions 3.0, 4.0 and 4.5
First Choice ........................................................... Versions through 3.0
FoxBase................................................................................. Version 2.1
Framework............................................................................ Version 3.0
Microsoft Works (Windows) ................................ Versions through 4.0
Microsoft Works (DOS) ........................................ Versions through 2.0
Microsoft Works (Mac) ......................................... Versions through 2.0
Paradox (DOS) ..................................................... Versions through 4.0

Paradox (Windows) ............................................. Versions through 1.0
Personal R:BASE ................................................................... Version 1.0
R:BASE 5000 ......................................................... Versions through 3.1
R:BASE System V.................................................................. Version 1.0
Reflex .................................................................................... Version 2.0
Q & A.................................................................... Versions through 2.0
SmartWare II ....................................................................... Version 1.02
OTHER FORMATS
Executable (EXE, DLL)
Executable (Windows) NT
Microsoft Outlook Express (EML) ..........................No specific version
Microsoft Outlook Folder (PST) ........ Versions 97, 98, 2000, and 2002
Microsoft Outlook Message (MSG) .................................... All versions
Microsoft Project.................................... Versions 98 - 2003 (text only)
vCard..................................................................................... Version 2.1

Chapter 2
Web Crawling with Authentication

This chapter describes how to configure a Forge crawler pipeline to access sites that require client authentication over HTTP using either basic authentication or HTTPS. It also describes how to set up Forge to require authentication from the server when using HTTPS.

Note: If Forge is to be used to crawl a file system, you must ensure that the Forge process is run from an account that is granted all of the appropriate permissions to access the target data.

This chapter assumes that you have already created a Forge pipeline for Web crawling, as described in “Content Acquisition System” on page 23.

Configuring Basic Authentication


When Forge connects to a Web site that requires basic
authentication, it needs to provide the site with a valid
username and password before the Web server will transmit
a response. You can use a key ring file to supply Forge with
an appropriate username/password pair to access a
particular site that requires basic authentication.
The following is a sample key ring file that could be used to configure Forge for basic authentication:
<KEY_RING>
<SITE HOST="www.endeca.com" PORT="6000">
<HTTP>
<REALM NAME="Sales Documents">
<KEY>BOcxV3wFSGuoBqbhPHkFGmA=</KEY>
</REALM>
</HTTP>
</SITE>
</KEY_RING>

To use this key ring file, you specify its location via the
third argument of the RETRIEVE_URL expression in the
Forge crawler pipeline, which is used to fetch URLs from
the targeted Web server, as shown below (the relevant line is the EXPRNODE that specifies the key ring path):
<EXPRESSION TYPE="VOID" NAME="RETRIEVE_URL">
<!-- this expression generates a filename for
the retrieved file: -->
<EXPRESSION TYPE="STRING" NAME="CONCAT">
<EXPRESSION TYPE="STRING" NAME="CONST">
<EXPRNODE NAME="VALUE" VALUE="&cwd;"/>
</EXPRESSION>
<EXPRESSION TYPE="STRING" NAME="DIGEST">
<EXPRESSION TYPE="PROPERTY" NAME="IDENTITY">
<EXPRNODE NAME="PROP_NAME" VALUE="Endeca.Identifier"/>
</EXPRESSION>
</EXPRESSION>
</EXPRESSION>
<!-- this expression node specifies the path to the
key ring file: -->
<EXPRNODE NAME="KEY_RING" VALUE="key_ring.xml"/>
</EXPRESSION>

The path to the key ring file is expressed relative to the pipeline file or as an absolute path. In the above example, the key ring file is in the same directory as the pipeline file.

Note that the specified key ring applies only to the RETRIEVE_URL expression from which it is referenced.

The following subsections describe each element of the sample key ring file in detail.

KEY_RING Element

The KEY_RING element is the root element of the key ring file. All other components of the key ring file are contained within the KEY_RING element.

SITE Element

The SITE element is used to refer to a target Web site or server. All of the directives within a SITE element are targeted at the site or server specified by the parent SITE element. For example, the HTTP element in the sample key ring file refers to an HTTP connection to the Web site on host www.endeca.com at port 6000.

The SITE element may contain one sub-element for each URL scheme by which it can be accessed. The authentication parameters for each of these schemes are specified in the body of each scheme sub-element. The two schemes that currently support authentication are HTTP and HTTPS, represented by HTTP and HTTPS elements, respectively.

The SITE element has one required attribute, HOST, and one optional attribute, PORT.

HOST Attribute

The value of the HOST attribute should be the fully-qualified domain name of the server that hosts the target site.

PORT Attribute

If the target site is not accessed via the default port for all
relevant URL schemes, the PORT attribute can be used to
specify the port explicitly. If the PORT attribute is
unspecified, the default port for each access scheme
specified will be used.

For example, the following sample key ring file would be used to specify the authentication configuration settings for accessing host www.endeca.com via port 80 for HTTP and port 443 for HTTPS:
<KEY_RING>
<SITE HOST="www.endeca.com">
<HTTP>
<!-- HTTP Settings... -->
</HTTP>
<HTTPS>
<!-- HTTPS Settings... -->
</HTTPS>
</SITE>
</KEY_RING>


HTTP Element

The HTTP element is used to encapsulate the basic authentication settings for accessing the parent host via HTTP. Some parts of a site may be password-protected, while others are not. The parts of an HTTP site that require authentication are called realms.

A realm is an arbitrary name for a directory on an HTTP server and all of its contents, including subdirectories. For instance, the realm “Sales Documents” (referenced in the sample key ring file) might refer to the directory:
http://www.endeca.com:6000/sales/

which in turn contains the “contracts” and “bookings” subdirectories, each of which may contain some Word documents or Excel spreadsheets. If a Forge crawler attempted to access any of this content, including the “sales”, “contracts”, or “bookings” directories themselves, it would be prompted for a username and password to gain access to the “Sales Documents” realm.

To provide Forge with a username/password pair for accessing this realm, a REALM element is used. An HTTP site may have many realms, so an HTTP element may contain any number of REALM sub-elements.

REALM Element

Each REALM element is used to set up basic authentication for a particular named realm on the target site. The REALM element has one required attribute, NAME, which specifies the name of the realm. The body of a REALM element must contain one (and only one) KEY element, which encapsulates the username and password combination that should be used by Forge to access the specified realm on the target site.

KEY Element

The body of a KEY element can contain a username/password pair or a pass phrase. For protection, Forge expects the contents of a KEY element to be encrypted. “Using Forge to Encrypt Keys and Pass Phrases” on page 100 describes how Forge can be used to encrypt username/password pairs or pass phrases for use in KEY elements in such a way that only Forge itself is capable of decoding them.

Configuring HTTPS Authentication


HTTPS configuration is similar to HTTP authentication
configuration. Forge supports HTTPS authentication of
the server, client authentication with certificates, and
secure communication over HTTPS. The following
sub-sections describe how to use a key ring file to
configure an HTTPS connection to a particular site, given
various security constraints.


Boot-Strapping Server Authentication

To make an HTTPS connection, often all that is required is for Forge (as a client) to be able to authenticate the server. When Forge connects to a server via HTTPS, it attempts to validate the server’s certificate by checking its signature.

Therefore, Forge must be supplied with the public keys of the certificate authority (CA) that signed the server’s certificate. This information can be provided via a key ring file that contains a CA_DB element, as in this example:
<KEY_RING>
<CA_DB>eneCA.pem</CA_DB>
<SITE HOST="www.endeca.com" PORT="6000">
<HTTP>
<REALM NAME="Sales Documents">
<KEY>BOcxV3wFSGuoBqbhPHkFGmA=</KEY>
</REALM>
</HTTP>
</SITE>
</KEY_RING>

CA_DB Element

The body of a CA_DB element specifies the path to a PEM format certificate which contains one or more public keys that Forge should use to validate the CA signatures it encounters on server certificates when it retrieves URLs via HTTPS. The path to this certificate may be relative to the parent pipeline XML file or an absolute path.

If Forge is unable to find the public key of the CA that signed a server certificate that it receives when attempting to initiate an HTTPS transfer, it will fail to retrieve the requested document and report an error. If a certificate chain is necessary to validate the server certificate, the public key of each CA along the chain must be present in the CA_DB in order for host authentication to succeed.

Disabling Server Authentication for a Host

By default, Forge always attempts to validate CA signatures for every HTTPS host. Host authentication can be disabled for an individual host, however, by setting the AUTHENTICATE_HOST attribute of the appropriate HTTPS element in the key ring to FALSE (for more information, see the AUTHENTICATE_HOST Attribute section).

HTTPS Element

The HTTPS element is the analog of the HTTP element. It encapsulates the HTTPS configuration information that applies to a particular site, which is defined by the HTTPS element’s parent SITE element.

AUTHENTICATE_HOST Attribute

The HTTPS element has one optional attribute, AUTHENTICATE_HOST. This attribute specifies whether or not to verify the CA signature of server certificates received from the target host. By default, the value of this attribute is TRUE. To disable host authentication for HTTPS connections to the target host, set this attribute to FALSE:

<HTTPS AUTHENTICATE_HOST="FALSE"/>

Configuring Client Authentication

Some HTTPS servers may require clients to authenticate themselves. A client does this by presenting a certificate that has been signed by a CA that the server trusts. In order for Forge to be able to connect to a server that requires client authentication, it must be supplied with an appropriate client certificate as well as an associated private key, as illustrated by this example:
<KEY_RING>
<CA_DB>cacert.pem</CA_DB>
<SITE HOST="www.endeca.com" PORT="6000">
<HTTPS>
<CERT PATH="clientcert.pem" PRIV_KEY_PATH="clientkey.key">
<KEY>AqS6+A3u+ivX</KEY>
</CERT>
</HTTPS>
</SITE>
</KEY_RING>

CERT Element

One CERT element can be inserted in the body of an HTTPS element to bootstrap the HTTPS connection with a certificate and corresponding private key for a site that requires client authentication. The CERT element has two required attributes, PATH and PRIV_KEY_PATH, which specify the locations of the certificate and private key.

If these files are protected by a pass phrase, the pass phrase can be provided in the body of a KEY child element of the CERT element, as in the above example.

As with HTTP username/password keys, Forge expects a CERT’s key to be stored in an encrypted form. For more information on how to put a CERT’s key in an encrypted form, see “Using Forge to Encrypt Keys and Pass Phrases” on page 100.

PATH Attribute

The PATH attribute of a CERT element specifies the location of the certificate file. The certificate must be stored in the PEM format. The path may be expressed relative to the pipeline file or as an absolute path.

PRIV_KEY_PATH Attribute

The PRIV_KEY_PATH attribute specifies the path to a PEM format file containing the private key associated with the certificate referenced in the PATH attribute. This path also may be expressed relative to the pipeline file or as an absolute path.

Authenticating with a Microsoft Exchange Server

A key ring file may also be used to specify authentication configuration for a Microsoft Exchange Server when using a record adapter with an EXCHANGE format. The Exchange server will expect a valid username and password combination, which may be specified via a KEY element embedded in an EXCHANGE_SERVER element within a key ring, as in the following example:
<KEY_RING>
<EXCHANGE_SERVER HOST="exchange.mycompany.com">
<KEY>B9qtQOON6skNTFTHm9rnn04=</KEY>
</EXCHANGE_SERVER>
</KEY_RING>

EXCHANGE_SERVER Element

This element opens a block of configuration for authenticating to an Exchange server. It has one required attribute, the HOST attribute.

The HOST attribute specifies the name of the Exchange server to which the supplied configuration information applies.

Authenticating with a Proxy Server


A key ring file may also be used to specify authentication
configuration for proxy servers.

Note: Basic authentication is the only method supported by Forge for authenticating with proxy servers.

The proxy server will expect a valid username and password combination, which may be specified via a KEY element embedded in a PROXY element within a key ring, as in the following example:
<KEY_RING>
<PROXY HOST="proxy.mycompany.com" PORT="8080">
<KEY>J9dtQOOR6skPTFTHm5rnn08=</KEY>
</PROXY>
</KEY_RING>

PROXY Element

The PROXY element contains the configuration for proxy authentication. It has two required attributes:
• The HOST attribute specifies the host name of the proxy
server for which the supplied configuration
information applies.
• The PORT attribute specifies the port number on the
proxy host specified in the HOST attribute for which the
supplied configuration information applies.

Using Forge to Encrypt Keys and Pass Phrases


Forge requires the username/password pairs or pass
phrases kept in KEY elements within the key ring file to be
stored in an encrypted form which only Forge can
decode. Forge provides a command-line argument,
--encryptKey, which should be used to put the contents
of KEY elements in this form. The encrypt key flag has the
following syntax:
forge --encryptKey [username:]passphrase


Encrypting a Username/Password Pair

The following example shows how to run Forge to encrypt a username/password pair (username=sales, password=Endeca) for use in an HTTP block of a key ring file:
forge --encryptKey sales:Endeca

As the example illustrates, the username and password must be entered together, separated by a colon, as the argument to the --encryptKey flag. Forge then outputs the encrypted key, which you then insert in the body of the applicable KEY element.

Encrypting a Pass Phrase

To encrypt the pass phrase “burning down the house”, Forge should be executed with the following command:
forge --encryptKey "burning down the house"

SECTION II
Record Features
Chapter 3
Creating Aggregated Records

The Endeca aggregated records feature allows the end user to group records by dimension or property values. By configuring aggregated records, you enable the Navigation Engine to handle a group of multiple records as though it were a single record, based on the value of the rollup key. A rollup key can be any property or dimension with the rollup attribute set to true, as described in “Enabling Record Aggregation” on page 107.

Aggregated records are typically used to eliminate duplicate display entries. For example, an album with the same title may exist in several formats, with different prices. Each of these instances is represented in the Navigation Engine as a distinct Endeca record. When querying the Navigation Engine, you may want to treat these instances as a single record. This is accomplished by creating an Endeca aggregated record.

From a performance perspective, aggregated Endeca records are not an expensive feature. However, they should only be used when necessary, because they add organization and implementation complexity to the application (particularly if the rollup key is different from the display information).

Aggregated Record Behavior


Aggregated records behave differently than ordinary
records, as follows:
• Representative values—Given a single record,
evaluating the record’s information is straightforward.
However, aggregated records consist of many records,
which can have different representative values.
Generally for display and other logic requiring record
values, a single representative record from the
aggregated record is used. The representative record is
the individual record that occurs first in order of the
underlying records in the aggregated record. This
order is determined by either a specified sort key or a
relevance ranking strategy.
• Sort—The sort feature is first applied to all records in
the data set (prior to aggregating the records). The
record at the top of this set is the record with the
highest sort value. Given the sorted set of records,
aggregated records are created by iterating over the
set in descending order, aggregating records with the
same rollup key. An aggregated record’s rank is equal
to that of the highest ranking record in that aggregated
record set. The result is the same as aggregating all
records on the rollup key, taking the highest value of
the sort key for these aggregated records and sorting
the set based on this value.
Note that if you have a defined list of sort keys, the
first key is the primary sort criterion, the second key is
the secondary sort criterion, and so on.


• More control—The user may want to gain more control over representative values and/or sort. For example, the desired behavior may be to sort the aggregated records by the maximum price. This can be accomplished by configuring a derived property. In this case the property would derive from price with the MAX function applied. This derived property could be configured as a sort key, ensuring that the aggregated records are sorted by the maximum price of the records in the set.

The presentation developer has more power over retrieving the representative values. The individual records are returned with the aggregated record. Therefore, the developer has all the information necessary to correctly represent aggregated records (at the cost of increased complexity). However, to achieve the desired sort behavior, the Navigation Engine must be configured correctly, because the internals of this operation are not exposed to the presentation developer.

Enabling Record Aggregation


In Developer Studio, you enable aggregate Endeca record
creation by allowing record rollups based on properties
and dimensions.

Proper configuration of this feature requires that the rollup key is a single assign value. That is, each record should have at most one value from this dimension or property. If the value is not single assign, the “first” (arbitrarily-chosen) value is used to create the aggregated record. This can cause the results to vary arbitrarily, depending upon the navigation state of the user. In addition, features such as sort can change the grouping of aggregated records that are assigned multiple values of the rollup key.

To enable a property for record aggregation:

1. In the Project tab of Developer Studio, double-click Properties to open the Properties view.
2. Select a property and click Edit. The Property editor is
displayed.
3. In the General tab, check Rollup.
4. Click OK.
5. From the File menu, choose Save.

To enable a dimension for record aggregation:

1. In the Project tab of Developer Studio, double-click Dimensions to open the Dimensions view.
2. Select a dimension and click Edit. The Dimension
editor is displayed.
3. In the Dimension editor, click the Advanced tab.
4. Check Enable for Rollup.
5. Click OK.


Generating and Displaying Aggregated Records


The general procedure of generating and displaying
aggregated records is as follows:
1. Determine which rollup keys are available to be used
for an aggregated record navigation query.
2. Create an aggregated record navigation query by using
one of the available rollup keys. This rollup key is
called the active rollup key, while all the other rollup
keys are inactive.
3. Retrieve the list of aggregated records from the
Navigation object and display their attributes.

These steps are discussed in detail below.

Determining the Available Rollup Keys

Assuming that you have a navigation state, the following objects and method calls are used to determine the available rollup keys. These rollup keys can be used in subsequent queries to generate aggregated records.
• Navigation.getRollupKeys() - Gets the rollup keys
applicable for this navigation query. The .NET version
is the Navigation.RollupKeys property. The rollup
keys are returned as an ERecRollupKeyList object.
• ERecRollupKeyList.size() - Gets the number of rollup
keys in the ERecRollupKeyList object (Java). The .NET
and COM APIs have the ERecRollupKeyList.Count
property.

• ERecRollupKeyList.getKey() - Gets the rollup key from the ERecRollupKeyList object, using a zero-based index (Java). The COM version is the ERecRollupKeyList.Item() method and the .NET API has the ERecRollupKeyList.Item property. The rollup key is returned as an ERecRollupKey object.
• ERecRollupKey.getName() - Gets the name of the
rollup key (Java and COM). The .NET version is the
ERecRollupKey.Name property.

• ERecRollupKey.isActive() - Returns true if this rollup key was applied in the navigation query or false if it was not.

The rollup keys are retrieved from the Navigation object in an ERecRollupKeyList object. Each ERecRollupKey in this list contains the name and active status of the rollup key:
• The name is used to specify the rollup key in a
subsequent navigation or aggregated record query.
• The active status indicates whether the rollup key was
applied to the current query.

The following code fragments show how to retrieve a list of rollup keys, iterate over them, and display the names of keys that are active in the current navigation state.


Sample Java Code for Retrieving Rollup Keys

// Get rollup keys from the Navigation object
ERecRollupKeyList rllupKeys = nav.getRollupKeys();
// Loop through rollup keys
for (int i=0; i < rllupKeys.size(); i++) {
    // Get each rollup key from the list
    ERecRollupKey rllupKey = rllupKeys.getKey(i);
    // If the key is active, display the key name
    if (rllupKey.isActive()) {
        %>Active rollup key: <%= rllupKey.getName() %><%
    }
}

Sample .NET Code for Retrieving Rollup Keys

// Get rollup keys from the Navigation object
ERecRollupKeyList rllupKeys = nav.RollupKeys;
// Loop through rollup keys
for (int i=0; i < rllupKeys.Count; i++) {
    // Get each rollup key from the list
    ERecRollupKey rllupKey = (ERecRollupKey)rllupKeys[i];
    // If the key is active, display the key name
    if (rllupKey.IsActive()) {
        %>Active rollup key: <%= rllupKey.Name %><%
    }
}


Sample COM Code for Retrieving Rollup Keys

' Get rollup keys from the Navigation object
dim rllupKeys
set rllupKeys = nav.GetRollupKeys()
' Loop through rollup keys
For i = 1 to rllupKeys.Count
    ' Get rollup key
    Dim rllupKey
    set rllupKey = rllupKeys(i)
    ' If the key is active, display the key name
    if (rllupKey.isActive()) then
        %>Active rollup key: <%= rllupKey.GetName() %><%
    end if
Next

Creating Aggregated Record Navigation Queries

You can generate aggregated records with URL query parameters or with Presentation API methods.

Note: Regardless of how many properties or dimensions you have enabled as rollup keys, you can specify a maximum of one rollup key per navigation query.

Specifying the Rollup Key for the Navigation Query

To generate aggregated Endeca records, the query must be appended with an Nu parameter. The value of the Nu parameter specifies a rollup key for the returned aggregated records, using the following syntax:
Nu=<rollupkey>

For example:
controller.jsp?N=0&Nu=Winery

The records associated with the navigation query are grouped with respect to the rollup key prior to computing the subset specified by the Nao parameter (that is, if Nu is specified, Nao applies to the aggregated records rather than individual records). Aggregated records only apply to a navigation query. Therefore, the Nu query parameter is only valid with an N parameter.
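For example, in a query such as controller.jsp?N=0&Nu=Winery&Nao=10 (an illustrative URL, not one taken from the reference implementation), the Nao offset of 10 skips the first ten aggregated records, not the first ten individual records.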

The equivalent Java API method to the Nu parameter is the ENEQuery.setNavRollupKey() method; for example:
usq.setNavRollupKey("Winery");

The .NET version is the ENEQuery.NavRollupKey property.

When the aggregated record navigation query is made, the returned Navigation object will contain an AggrERecList object.

Setting the Maximum Number of Returned Records

You can use the Np parameter to control the maximum number of Endeca records returned in any aggregated record. Set the parameter to 0 (zero) for no records, 1 for one record, or 2 for all records. For example:
controller.jsp?Np=1&N=1&Nu=Winery

The ENEQuery.setNavERecsPerAggrERec() method is the equivalent API method. The .NET version is the ENEQuery.NavERecsPerAggrERec property.
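For example, the following Java fragment is a minimal sketch of the API equivalent of the Np=1 query above. It reuses the usq query object and nec connection from the other samples in this chapter, and assumes the method accepts a plain numeric argument:

// Sketch: roll up on Winery (Nu=Winery) and return at most one
// member record per aggregated record (Np=1)
ENEQuery usq = new UrlENEQuery(request.getQueryString(), "UTF-8");
usq.setNavRollupKey("Winery");
usq.setNavERecsPerAggrERec(1);
ENEQueryResults qr = nec.query(usq);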


Creating Aggregated Record Queries

An aggregated record request is similar to an ordinary record request with these exceptions:
• If you are using URL query parameters, the A
parameter is specified (instead of R). The value of the
A parameter is the record spec of the aggregated
record.
• If you are using API methods, use the
ENEQuery.setAggrERecSpec() method to specify the
aggregated record to be queried for. The .NET version
is the ENEQuery.AggrERecSpec property.
• The element returned is an aggregated record (not a
record).

As with an ordinary record request, the An parameter (instead of N) specifies the user’s navigation state. Only records that satisfy this navigation state are included in the aggregated record. In addition, the Au parameter must be used to specify the aggregated record rollup key.

The following are examples of queries using An:

controller.jsp?An=0&A=32905&Au=Winery

controller.aspx?A=7&An=123&Au=ssn
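The corresponding Java API call is shown in the following sketch. Only the setAggrERecSpec() method named above is assumed; the An and Au parameters are parsed from the query string by UrlENEQuery:

// Sketch: an aggregated record request for record spec 32905.
ENEQuery usq = new UrlENEQuery(request.getQueryString(), "UTF-8");
// The A value can also be set explicitly:
usq.setAggrERecSpec("32905");
ENEQueryResults qr = nec.query(usq);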

The following example, from the nav_agg_records.jsp page in the JSP reference, shows how to use the UrlGen class to construct the URL query string:


// Create aggregated record request (start from empty request)
UrlGen urlg = new UrlGen("", "UTF-8");
urlg.addParam("A", aggrec.getSpec());
urlg.addParam("An", usq.getNavDescriptors().toString());
urlg.addParam("Au", usq.getNavRollupKey());
urlg.addParam("eneHost", (String)request.getAttribute("eneHost"));
urlg.addParam("enePort", (String)request.getAttribute("enePort"));
urlg.addParam("displayKey", (String)request.getParameter("displayKey"));
urlg.addParam("sid", (String)request.getAttribute("sid"));
String url = CONTROLLER+"?"+urlg;
%>
<a href="<%= url %>">

Note that the ENEQuery.setAggrERecSpec() method provides the aggregated record specification to the A parameter, the ENEQuery.getNavDescriptors() method gets the navigation values for the An parameter, and the ENEQuery.getNavRollupKey() method gets the name of the rollup key for the Au parameter.

Displaying Aggregated Records

The following sections describe how to handle aggregated records that have been returned by the Navigation Engine and how to display them.

Retrieving an Aggregated Record from an ENEQueryResults Object

On an aggregated record request, the aggregated record is returned as an AggrERec object in the ENEQueryResults object. Use these methods:

• ENEQueryResults.containsAggrERec() returns true if the ENEQueryResults object contains an aggregated record.
• ENEQueryResults.getAggrERec() retrieves the AggrERec
object from the ENEQueryResults object. The .NET
version is the ENEQueryResults.AggrERec property.

For example:

// Make Navigation Engine request
ENEQueryResults qr = nec.query(usq);
// Check for an AggrERec object in ENEQueryResults
if (qr.containsAggrERec()) {
    AggrERec aggRec = qr.getAggrERec();
    ...
}

Retrieving an Aggregated Record List from a Navigation Object

On an aggregated record navigation query, a list of aggregated records (an AggrERecList object) is returned in the Navigation object. Use these methods:
• Navigation.getAggrERecs() retrieves a list of
aggregated records returned by the navigation query,
as an AggrERecList object. The .NET version is the
Navigation.AggrERecs property.

Note: By default, the Navigation Engine returns a maximum of 10 aggregated records. To change this number, use the ENEQuery.setNavNumAggrERecs() method.

• Navigation.getTotalNumAggrERecs() returns the number of aggregated records that matched the navigation query. Typically, this number is much higher than the number of aggregated records returned in the Navigation object, unless you used the ENEQuery.setNavNumAggrERecs() method to change the default number of 10 returned aggregated records.
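For example, the following one-line sketch (using the usq query object from the earlier samples, and assuming the method takes a plain numeric argument) raises the limit before the query is issued:

// Sketch: return up to 20 aggregated records instead of the default 10.
usq.setNavNumAggrERecs(20);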

Displaying Aggregated Record Attributes

After you retrieve an aggregated record, you can use the following AggrERec class methods:
• getERecs() gets the Endeca records (ERec objects) that
are in this aggregated record. The .NET version is the
ERecs property.

• getProperties() returns the properties (as a PropertyMap object) of the aggregated record. The .NET version is the Properties property.
• getRepresentative() gets the Endeca record (ERec
object) that is the representative record of this
aggregated record. The .NET version is the
Representative property.

• getSpec() gets the specification of the aggregated record to be queried for. The .NET version is the Spec property.
• getTotalNumERecs() returns the number of Endeca
records (ERec objects) that are in this aggregated
record. The .NET version is the TotalNumERecs
property.

The following code snippets illustrate these methods.


Sample Java Code for AggrERec Methods

Navigation nav = qr.getNavigation();
// Get total number of aggregated records that matched the query
long nAggrRecs = nav.getTotalNumAggrERecs();
// Get the aggregated records from the Navigation object
AggrERecList aggrecs = nav.getAggrERecs();
// Loop over the aggregated record list
for (int i=0; i<aggrecs.size(); i++) {
    // Get individual aggregated record
    AggrERec aggrec = (AggrERec)aggrecs.get(i);
    // Get number of records in this aggregated record
    long recCount = aggrec.getTotalNumERecs();
    // Get the aggregated record's attributes
    String aggrSpec = aggrec.getSpec();
    PropertyMap propMap = aggrec.getProperties();
    ERecList recs = aggrec.getERecs();
    ERec repRec = aggrec.getRepresentative();
}

Sample .NET Code for AggrERec Properties

Navigation nav = qr.Navigation;
// Get total number of aggregated records that matched the query
long nAggrRecs = nav.TotalNumAggrERecs;
// Get the aggregated records from the Navigation object
AggrERecList aggrecs = nav.AggrERecs;
// Loop over the aggregated record list
for (int i=0; i<aggrecs.Count; i++) {
    // Get individual aggregated record
    AggrERec aggrec = (AggrERec)aggrecs[i];
    // Get number of records in this aggregated record
    long recCount = aggrec.TotalNumERecs;
    // Get the aggregated record's attributes
    String aggrSpec = aggrec.Spec;
    PropertyMap propMap = aggrec.Properties;
    ERecList recs = aggrec.ERecs;
    ERec repRec = aggrec.Representative;
}


Sample COM Code for AggrERec Methods

dim nav
set nav = qr.getNavigation()
' Get total number of aggregated records that matched the query
dim nAggrRecs
nAggrRecs = nav.GetTotalNumAggrERecs()
' Get the aggregated records from the Navigation object
dim aggrecs
set aggrecs = nav.GetAggrERecs()
' Loop over aggregated record list
For i = 1 to aggrecs.Count
    ' Get individual aggregated record
    dim aggrec
    set aggrec = aggrecs(i)
    ' Get number of records in this aggregated record
    dim recCount
    recCount = aggrec.GetTotalNumERecs()
    ' Get the aggregated record's specifier, property map,
    ' list of records, and representative record
    dim aggrSpec
    aggrSpec = aggrec.GetSpec()
    dim aggPropsMap
    set aggPropsMap = aggrec.GetProperties()
    dim recs
    set recs = aggrec.GetERecs()
    dim repRec
    set repRec = aggrec.GetRepresentative()
Next

Displaying the Records in the Aggregated Record

You display the Endeca records (ERec objects) in an aggregated record with the same procedures as described in the “Working with Endeca Records” chapter in the Endeca Developer’s Guide.

In the following example, a list of aggregated records is retrieved from the Navigation object and the properties of each representative record are displayed.

Sample Java Code for Displaying the Representative Record

// Get aggregated record list from the Navigation object
AggrERecList aggrecs = nav.getAggrERecs();
// Loop over aggregated record list
for (int i=0; i<aggrecs.size(); i++) {
    // Get an individual aggregated record
    AggrERec aggrec = (AggrERec)aggrecs.get(i);
    // Get representative record of this aggregated record
    ERec repRec = aggrec.getRepresentative();
    // Get property map for representative record
    PropertyMap repPropsMap = repRec.getProperties();
    // Get property iterator to loop over the property map
    Iterator repProps = repPropsMap.entrySet().iterator();
    // Display representative record properties
    while (repProps.hasNext()) {
        // Get a property
        Property prop = (Property)repProps.next();
        // Display name and value of the property
        %>
        <tr>
          <td>Property name: <%= prop.getKey() %></td>
          <td>Property value: <%= prop.getValue() %></td>
        </tr>
        <%
    }
}


Sample .NET Code for Displaying the Representative Record

// Get aggregated record list from the Navigation object
AggrERecList aggrecs = nav.AggrERecs;
// Loop over aggregated record list
for (int i=0; i<aggrecs.Count; i++) {
    // Get an individual aggregated record
    AggrERec aggrec = (AggrERec)aggrecs[i];
    // Get representative record of this aggregated record
    ERec repRec = aggrec.Representative;
    // Get property map for representative record
    PropertyMap repPropsMap = repRec.Properties;
    // Get property list for representative record
    System.Collections.IList repPropsList = repPropsMap.EntrySet;
    // Display representative record properties
    foreach (Property repProp in repPropsList) {
        %>
        <tr>
          <td>Property name: <%= repProp.Key %></td>
          <td>Property value: <%= repProp.Value %></td>
        </tr>
        <%
    }
}


Sample COM Code for Displaying the Representative Record

' Get Aggregated Records list from the Navigation object
dim aggrecs
set aggrecs = nav.GetAggrERecs()
' Loop over aggregated record list
For i = 1 to aggrecs.Count
    ' Get an individual aggregated record
    dim aggrec
    set aggrec = aggrecs(i)
    ' Get representative record of this aggregated record
    dim repRec
    set repRec = aggrec.GetRepresentative()
    ' Get property map for representative record
    dim repPropsMap
    set repPropsMap = repRec.GetProperties()
    ' Get property iterator for representative record
    dim repProps
    set repProps = repPropsMap.EntrySet()
    ' Display representative record properties
    For k = 1 to repProps.Count
        ' Get property
        set prop = repProps(k)
        ' Display property
        %>
        <tr>
          <td>Representative property name: <%= prop.GetKey() %></td>
          <td>Representative property value: <%= prop.GetValue() %></td>
        </tr>
        <%
    Next
Next

Chapter 4
Using Derived Properties

A derived property is a property that is calculated by applying a function to properties or dimension values from each member record of an aggregated record. Subsequently, the resultant derived property is assigned to the aggregated record.

Aggregated records are a prerequisite to derived properties. (If you are not already familiar with specifying a rollup key and creating aggregated records, see “Creating Aggregated Records” on page 105.)

To see how derived properties work, consider a book application for which only unique titles are to be displayed. The books are available in several formats (various covers, special editions, and so on) and the price varies by format. Specifying Title as the rollup key aggregates books of the same title, regardless of format. To control the aggregated record’s representative price (for display purposes), use a derived property.

For example, the representative price can be the price of the aggregated record’s lowest priced member record. The derived property used to obtain the price in this example would be configured to apply a minimum function to the Price property.

Specifying Derived Properties


The DERIVED_PROP element in the Derived_props.xml file
specifies a derived property. The attributes of the
DERIVED_PROP element are:

• DERIVE_FROM - The property or dimension from which the derived property will be calculated.
• FCN - The function to be applied to the DERIVE_FROM
properties of the aggregated record. Valid functions
are MIN, MAX, AVG, or SUM. Any dimension or property
type can be used with the MIN or MAX functions. Only
INTEGER or FLOAT properties may be used in AVG and
SUM functions.

• NAME - The name of the derived property. This name can be the same as the DERIVE_FROM attribute.

Note: Developer Studio currently does not support configuring derived properties. The workaround is to hand-edit the XML file to add the DERIVED_PROP element.

Below is an example of the XML element that defines the derived property described in the book example above.
<DERIVED_PROP
DERIVE_FROM="PRICE"
FCN="MIN"
NAME="LOW_PRICE"
/>

Similarly, a derived property can derive from dimension values, if the dimension name is specified in the DERIVE_FROM attribute. In addition, the function attribute (FCN) can be MAX, AVG, or SUM, depending on the desired behavior.

Displaying Derived Properties


The Presentation API’s semantics for a derived property
are similar to those of ordinary properties, though there
are a few differences. Derived properties apply only to
aggregated Endeca records. Therefore, the Navigation
Engine query must be properly formulated to include a
rollup key (see “Creating Aggregated Records” on
page 105 for more information).

In the aggregated record, the derived properties are accessed via the getProperties method (Properties property for .NET) and the representative properties are obtained by calling the getProperties method on the representative record. The representative record is retrieved from the aggregated record via the getRepresentative method (Representative property for .NET).

The following code example demonstrates how to display the names and values of an aggregated record’s derived properties. (For an example of how to display the representative record’s property values, see “Creating Aggregated Records” on page 105.)


Sample Java Code


// Get aggregated record list
AggrERecList aggrecs = nav.getAggrERecs();
// Loop over aggregated record list
for (int i=0; i<aggrecs.size(); i++) {
    // Get individual aggregated record
    AggrERec aggrec = (AggrERec)aggrecs.get(i);
    // Get all derived properties
    PropertyMap derivedProps = aggrec.getProperties();
    Iterator derivedPropIter = derivedProps.entrySet().iterator();
    // Loop over each derived property, handling it as an ordinary property
    while (derivedPropIter.hasNext()) {
        Property derivedProp = (Property)derivedPropIter.next();
        // Display property
        %>
        <tr>
          <td>Derived property name: <%= derivedProp.getKey() %></td>
          <td>Derived property value: <%= derivedProp.getValue() %></td>
        </tr>
        <%
    }
}


Sample .NET Code


// Get aggregated record list
AggrERecList aggrecs = nav.AggrERecs;
// Loop over aggregated record list
for (int i=0; i<aggrecs.Count; i++) {
    // Get an individual aggregated record
    AggrERec aggrec = (AggrERec)aggrecs[i];
    // Get all derived properties
    PropertyMap derivedPropsMap = aggrec.Properties;
    // Get property list for the aggregated record
    System.Collections.IList derivedPropsList = derivedPropsMap.EntrySet;
    // Loop over each derived property, handling it as an ordinary property
    foreach (Property derivedProp in derivedPropsList) {
        // Display property
        %>
        <tr>
          <td>Derived property name: <%= derivedProp.Key %></td>
          <td>Derived property value: <%= derivedProp.Value %></td>
        </tr>
        <%
    }
}


Sample COM Code


' Get Aggregated Records list from the Navigation object
dim aggrecs
set aggrecs = nav.GetAggrERecs()
' Loop over aggregated record list
For i = 1 to aggrecs.Count
    ' Get an individual aggregated record
    dim aggrec
    set aggrec = aggrecs(i)
    ' Get all derived properties
    dim derivedPropsMap
    set derivedPropsMap = aggrec.GetProperties()
    ' Get property iterator
    dim derivedProps
    set derivedProps = derivedPropsMap.EntrySet()
    ' Display derived properties
    For k = 1 to derivedProps.Count
        ' Get property
        set prop = derivedProps(k)
        ' Display property
        %>
        <tr>
          <td>Derived property name: <%= prop.GetKey() %></td>
          <td>Derived property value: <%= prop.GetValue() %></td>
        </tr>
        <%
    Next
Next

Troubleshooting Derived Properties


A derived property can derive from either a property or a
dimension. The DERIVE_FROM attribute specifies the
property name or dimension name, respectively. Avoid
name collisions between properties and dimensions, as
this is likely to be confusing.


Derived Property Performance


Some overhead is introduced to calculate derived
properties. In most cases this should be negligible.
However, large numbers of derived properties and, more importantly, aggregated records with many member records may degrade performance.

Chapter 5
Selecting a Record Set Based on a Key

A set of Endeca records is returned with every navigation query result. By default, each record includes the values from all the keys (properties and dimensions) that have record page and record list attributes. These attributes are set with the Show with Record (for record page) and Show with Record List (for record list) checkboxes, as configured in Developer Studio.

However, if you do not want all the key values, you can
control the characteristics of the records returned by
navigation queries by using the Select feature.

About the Select Feature


The Select feature allows you to select specific keys
(Endeca properties and/or dimensions) from the data so
that only a subset of values will be transferred for Endeca
records in a query result set. The Select functionality allows
the application developer to determine these keys
dynamically, instead of at Dgraph or Agraph startup. This
selection will override the default record page and record
list fields.

A Web application that does not make use of all of the properties and dimension values on a record can be more efficient by requesting only the values that it will use. The ability to limit which fields are returned is useful for exporting bulk-format records and other scenarios. For example, if a record has properties that correspond to the same data in a number of languages, the application can retrieve only the properties that correspond to the current language. Or, the application may render the record list using tabs to display different sets of data columns (for example, one tab to view customer details and another to view order details) without always returning the data needed to populate both tabs.

This functionality avoids transferring properties and dimension values that will not be used by the front-end Web application. It therefore makes the application more efficient, because the unneeded data does not take up network bandwidth and memory on the application server.

The Select feature can also be used to specifically request fields that are not transferred by default.

Configuring the Select Feature


No system configuration is required for the Select feature. In other words, no instance configuration is required in Developer Studio and no Dgidx or Dgraph/Agraph flags are required to enable selection of properties and dimensions. Any existing property or dimension can be selected.

Using URL Query Parameters for Select


A query for selected fields is the same as any valid
navigation query. Therefore, the Navigation parameter (N)
is required for the request.

Selecting Keys in the Application


With the Select feature, the Web application can specify
which properties and dimensions should be returned for
the result record set from the navigation query.

Using the Java Selection Method

You set the selection list on the ENEQuery object with the
setSelection() method, which has this syntax:

ENEQuery.setSelection(FieldList selectFields)

where selectFields is a list of property or dimension names that should be returned with each record. You can populate the FieldList object with string names (such as "P_WineType") or with Property or Dimension objects. In the case of objects, the FieldList.addField method will automatically extract the string name from the object and add it to the FieldList object.

During development, you can check which fields are set with the ENEQuery.getSelection() method, which returns a FieldList object.

The FieldList object will contain a list of Endeca property and/or dimension names for the query. For details on the methods of the FieldList class, see the Endeca Javadocs.
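For example, a development-time sketch might print the current selection. The size() and getField() accessor names below are assumptions for illustration; consult the FieldList Javadoc for the actual method names:

// Sketch: inspect the fields currently selected on the query.
FieldList selected = usq.getSelection();
for (int i = 0; i < selected.size(); i++) {
    // size() and getField(int) are assumed accessor names
    System.out.println("Selected field: " + selected.getField(i));
}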

The following is a simple Java example of setting an Endeca property and dimension for a navigation query:
// Create a query
ENEQuery usq = new UrlENEQuery(request.getQueryString(), "UTF-8");
// Create an empty selection list
FieldList fList = new FieldList();
// Add an Endeca property to the list
fList.addField("P_WineType");
// Add an Endeca dimension to the list
fList.addField("Designation");
// Add the selection list to the query
usq.setSelection(fList);
// Make the ENE query
ENEQueryResults qr = nec.query(usq);

When the ENEQueryResults object is returned, it will have
a list of records that have been tagged with the
P_WineType property and the Designation dimension.
You extract the records as with any record query.

Note: The setSelection() and getSelection() methods are
also available in the UrlENEQuery class.


Using the .NET Selection Property

In a .NET application, the ENEQuery.Selection property is
used to get and set the FieldList object. You can add
properties or dimensions to the FieldList object with the
FieldList.AddField method.

The following is a C# example of setting an Endeca
property and dimension for a navigation query:
// Create a query
ENEQuery usq = new UrlENEQuery(queryString, "UTF-8");
// Create an empty selection list
FieldList fList = new FieldList();
// Add an Endeca property to the list
int i = fList.AddField("P_WineType");
// Add an Endeca dimension to the list
i = fList.AddField("Designation");
// Add the selection list to the query
usq.Selection = fList;
// Make the ENE query
ENEQueryResults qr = nec.Query(usq);

Using the COM/Perl Selection Methods

In a COM or Perl application, the Endeca property and/or
dimension names for the record query are supplied via an
Endeca FieldList collection object. You add the key
names with the FieldList.Add method.

You then use the ENEQuery.SetSelection() method to set
the FieldList object in the query to the Navigation
Engine.

The following is a COM example of setting an Endeca
property and dimension for a navigation query:
' Create a query
Dim usq
Set usq = Server.CreateObject("Endeca.UrlENEQuery")
usq.init Request.QueryString, "UTF-8"
' Create an empty selection list
Dim flist
Set flist = Server.CreateObject("Endeca.FieldList")
' Add an Endeca property to the list
flist.Add("P_WineType")
' Add an Endeca dimension to the list
flist.Add("Designation")
' Add the selection list to the query
usq.SetSelection(flist)
' Make Navigation Engine query
Dim qr
Set qr = nec.Query(usq)

Chapter 6
Bulk Export of Records

The Bulk Export feature allows your application to perform
a navigation query for a large number of records. Each
record in the resulting record set is returned from the
Navigation Engine in a bulk export-ready format, which is a
gzipped format. The records can then be exported to
external tools, such as a Microsoft Excel file or a CSV
(comma-separated value) file.

Applications are typically limited in the number of records
that can be requested by the memory requirements of the
front-end application server. The Bulk Export feature adds a
means of delaying parsing and ERec or AggrERec object
instantiation, which allows front-end applications to handle
requests for large numbers of records.

Configuring the Bulk Export Feature


Endeca properties and/or dimensions that will be included
in a result set for bulk exporting must be configured in
Developer Studio with the Show with Record List checkbox
enabled. When this checkbox is set, the property or
dimension will appear in the record list display.
No Dgidx or Dgraph flags are necessary to enable the
bulk exporting of records. Any property or dimension
that has the Show with Record List attribute is available to
be exported.

Using URL Query Parameters for Bulk Export


A query for bulk export records is the same as any valid
navigation query. Therefore, the Navigation parameter (N)
is required for the request.

Retrieving Bulk Records in the Application


By using members from the ENEQuery and Navigation
classes, you can set the number of bulk-format records to
be returned by the Navigation Engine and then retrieve
them from the Navigation query object.

Setting the Number of Bulk Records

When creating the navigation query, the application can
specify the number of Endeca records or aggregated
records that should be returned in a bulk format with
these Java/COM/Perl methods:
• ENEQuery.setNavNumBulkERecs() sets the maximum
number of Endeca records (ERec objects) to be
returned in a bulk format from a navigation query.
• ENEQuery.setNavNumBulkAggrERecs() sets the
maximum number of aggregated Endeca records
(AggrERec objects) to be returned in bulk format from
a navigation query.

The MAX_BULK_ERECS_AVAILABLE constant can be used with
either method to specify that all of the records that match
the query should be exported; for example:
usq.setNavNumBulkERecs(MAX_BULK_ERECS_AVAILABLE);

To find out how many records will be returned for a
bulk-record navigation query, use these Java/COM/Perl
methods:
• ENEQuery.getNavNumBulkERecs() is for Endeca records.
• ENEQuery.getNavNumBulkAggrERecs() is for aggregated
Endeca records.

The ENEQuery.NavNumBulkAggrERecs and
ENEQuery.NavNumBulkERecs properties are the .NET
versions of the above methods.

Note: All of the above methods are also available in the
UrlENEQuery class.

The following Java example sets the maximum number of
bulk-format records to 5,000 for a navigation query:
// Set Navigation Engine connection
ENEConnection nec = new HttpENEConnection(eneHost,enePort);
// Create a query
ENEQuery usq = new UrlENEQuery(request.getQueryString(), "UTF-8");
// Specify the maximum number of records to be returned
usq.setNavNumBulkERecs(5000);
// Make the query to the Navigation Engine
ENEQueryResults qr = nec.query(usq);


Retrieving the Bulk-format Records

The list of Endeca records is returned from the Navigation
Engine inside the standard Navigation object. The records
are returned compressed in a gzipped format. The format
is not directly exposed to the application developer; the
developer only has access to the bulk data through the
methods of the language being used.

It is up to the front-end application developer to
determine what to do with the retrieved records. For
example, you can display each record’s property and/or
dimension values, as documented in the Endeca Developer’s
Guide. You can also write code to properly format the
property and dimension values for export to an external
file, such as a Microsoft Excel file or a CSV file.

Using Java Bulk Export Methods

The list of Endeca records is returned as a standard Java
Iterator object. To access the bulk-record iterator, use
one of these methods:
• Navigation.getBulkERecIter() returns an Iterator
object containing the list of Endeca bulk-format
records (ERec objects).
• Navigation.getBulkAggrERecIter() returns an
Iterator object containing the list of aggregated
Endeca bulk-format records (AggrERec objects).

The Iterator class provides access to the bulk-exported
records with the following methods:
• Iterator.hasNext()—Returns true if the Iterator has
more records.
• Iterator.next()—Extracts (using gunzip) and returns
the next record in the iteration. The record is returned
as either an ERec or AggrERec object, depending on
which Navigation method was used to retrieve the
iterator.

The following Java code fragment shows how to set the
maximum number of bulk-format records to 5,000 and
then obtain a record list and iterate through the list:
// Create a query
ENEQuery usq = new UrlENEQuery(request.getQueryString(), "UTF-8");
// Specify the maximum number of bulk export records to be returned
usq.setNavNumBulkERecs(5000);
// Make the query to the Navigation Engine
ENEQueryResults qr = nec.query(usq);
// Verify we have a Navigation object before doing anything.
if (qr.containsNavigation()) {
// Get the Navigation object
Navigation nav = qr.getNavigation();
// Get the Iterator object that has the ERecs
Iterator bulkRecs = nav.getBulkERecIter();
// Loop through the record list
while (bulkRecs.hasNext()) {
// Get a record, which will be gunzipped
ERec record = (ERec)bulkRecs.next();
// Display its properties or format the record for export
...
}
}


Using COM/Perl Bulk Export Methods

In a COM or Perl application, the list of Endeca records is
returned as an Endeca ERecIter collection object. To
access this object, use the Navigation.GetBulkERecIter()
or Navigation.GetBulkAggrERecIter() method.

The ERecIter class provides HasNext(), NextERec(), and
NextAggrERec() methods to access the bulk-exported
records. The latter two methods will gunzip the next
result record and materialize the per-record object.

This COM sample shows how to set the maximum number of
bulk-format records to 5,000 and iterate through the
returned ERecIter object.

' Create a query
Dim usq
Set usq = Server.CreateObject("Endeca.UrlENEQuery")
usq.init Request.QueryString, "UTF-8"
usq.SetNavNumBulkERecs(5000)
' Make the Navigation Engine request
Dim qr
Set qr = nec.Query(usq)
If qr.ContainsNavigation() then
Dim nav
Set nav = qr.GetNavigation()
' Get the ERecIter object that has the records
Dim bulkrecs
Set bulkrecs = nav.GetBulkERecIter()
' Loop over ERecIter object and get records
Do While bulkrecs.HasNext()
' Get individual bulk-format record
dim rec
set rec = bulkrecs.NextERec()
' Display properties or format for export
' ...
Loop
End If


Using .NET Bulk Export Methods

In a .NET application, the list of Endeca records is
returned as an Endeca ERecEnumerator object. To retrieve
this object, use the Navigation.BulkAggrERecEnumerator
or Navigation.BulkERecEnumerator property.

The following .NET code sample shows how to set the
maximum number of bulk-format records to 5000, obtain
the record list, and iterate through the collection:
// Create a query
ENEQuery usq = new UrlENEQuery(queryString, "UTF-8");
// Set max number of returned bulk-format records
usq.NavNumBulkERecs = 5000;
// Make the query to the Navigation Engine
ENEQueryResults qr = nec.Query(usq);
// First verify we have a Navigation object.
if (qr.ContainsNavigation()) {
// Get the Navigation object
Navigation nav = qr.Navigation;
// Get the ERecEnumerator object that has the ERecs
ERecEnumerator bulkRecs = nav.BulkERecEnumerator;
// Loop through the record list
while (bulkRecs.MoveNext()) {
// Get a record, which will be gunzipped
ERec record = (ERec)bulkRecs.Current;
// Display its properties or format for export
...
}
}

After the ERecEnumerator object is created, an enumerator
is positioned before the first element of the collection,
and the first call to MoveNext() moves the enumerator
over the first element of the collection. After the end of
the collection is passed, subsequent calls to MoveNext()
return false. The Current property will gunzip the
current result record in the collection and materialize the
per-record object.

Performance Impact for Bulk Export Records


Unneeded overhead is typically experienced when
exporting records from a Navigation Engine without the
Bulk Export feature. Currently, the front-end converts the
on-wire representation of all the records into objects in
the API language, which is not appropriate for bulk
export given the memory footprint that results from
multiplying a large number of records by the relatively
high overhead of the Endeca record object format. For
export, converting all of the result records to API
language objects at once requires an unacceptable
amount of application server memory.

Reducing the per-record memory overhead allows you to
output a large number of records from existing
applications. Without this feature, applications that want
to export large amounts of data are required to split up
the task and deal with a few records at a time to avoid
running out of memory in the application server’s
threads. This division of exports adds query processing
overhead to the Navigation Engine which reduces system
throughput and slows down the export process.

In addition, the compressed format of bulk-export
records further reduces the application's memory usage.

Chapter 7
Record Filters

Note: This feature is available for customers who have purchased
Endeca InFront’s Custom Catalogs or Endeca ProFind’s User
Access Filters component.

Record filters allow an Endeca application to define
arbitrary subsets of the total record set and dynamically
restrict search and navigation results to these subsets.

For example, the catalog might be filtered to a subset of
records appropriate to the specific end user or user role.
The records might be restricted to contain only those visible
to the current user based on security policies. Or, an
application might allow end users to define their own
custom record lists (that is, the set of parts related to a
specific project) and then restrict search and navigation
based on a selected list. Record filters enable these and
many other application features that depend on applying
Endeca search and navigation to dynamically defined and
selected subsets of the data.

Record filters support Boolean syntax using property values
and dimension values as base predicates and standard
Boolean operators (AND, OR, and NOT) to compose complex
expressions. For example, a filter can consist of a list of part
number property values joined in a multi-way OR
expression. Or, a filter might consist of a complex nested
expression of ANDs, ORs, and NOTs on dimension IDs and
property values.

Filter expressions can be saved to and loaded from XML
files, or passed directly as part of a Navigation Engine
query. In either case, when a filter is selected, the set of
visible records is restricted to those matching the filter
expression. For example, record search queries will not
return records outside the selected subset, and refinement
dimension values are restricted to lead only to records
contained within the subset.

Finally, it is important to note that record filters are
case-sensitive.

Record Filter Syntax


Record filters can be specified directly within a
Navigation Engine query. For example, the complete
Boolean expression representing the desired record
subset can be passed directly in an application URL.

In some cases, however, filter expressions require
persistence (in the case where the application allows the
end user to define and save custom part lists) or may
grow too large to be passed conveniently as part of the
query (for example, a filter list containing thousands
of part numbers). To handle cases such as these, the
Navigation Engine also supports file-based filter
expressions.

File-based filter expressions are simply files stored in a
defined location containing XML representations of filter
expressions. This section describes both the ENE Query
and XML syntaxes for filter expressions.

ENE Query Syntax

The following BNF grammar describes the syntax for
query-level filter expressions:

<filter> ::= <and-expr>
| <or-expr>
| <not-expr>
| <filter-expr>
| <literal>
<and-expr> ::= AND(<filter-list>)
<or-expr> ::= OR(<filter-list>)
<not-expr> ::= NOT(<filter>)
<filter-expr> ::= FILTER(<string>)
<filter-list> ::= <filter>
| <filter>,<filter-list>
<literal> ::= <pval>
| <dval-id>
| <dval-path>
<pval> ::= <prop-key>:<prop-value>
<prop-key> ::= <string>
<prop-value> ::= <string>
<dval-path> ::= <string>
| <string>:<dval-path>
<dval-id> ::= <unsigned-int>
<string> ::= Any character string.

The following five special reserved characters must be
prepended with an escape character (\) for inclusion in a
string: ( ) , : \
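For example, a filter term that matches a hypothetical
Manufacturer value containing a comma and parentheses
would escape each reserved character individually:

Manufacturer:Acme\, Inc. \(USA\)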

Basically, the syntax supports prefix-oriented Boolean
functions (AND, OR, and NOT), colon-separated paths for
dimension values and property values, and numeric
dimension value IDs.

The following example illustrates a basic filter expression:

OR(AND(Manufacturer:Sony,1001),
AND(Manufacturer:Aiwa,NOT(1002)),
Manufacturer:Denon)

This expression will match the set of records satisfying
any of the following statements:
• Value for the Manufacturer property is Sony and the
record is assigned dimension value 1001.
• Value for Manufacturer is Aiwa and the record is not
assigned dimension value 1002.
• Value for the Manufacturer property is Denon.

Aside from the nested Boolean operations illustrated by
the above example, a key aspect of query filter
expressions is the ability to refer to file-based filter
expressions using the FILTER operator. For example, if a
filter is stored in a file called MyFilter, that filter can be
selected as follows:

FILTER(MyFilter)

FILTER operators can be combined with normal Boolean
operators to compose filter operations, as in this example:

AND(FILTER(MyFilter),NOT(Manufacturer:Sony))

The expression selects records that are satisfied by the
expression contained in the file MyFilter but that do not
have the value Sony for the Manufacturer property.

XML Syntax for File-based Record Filter Expressions

The syntax for file-based record filter expressions closely
mirrors the query level syntax, with the following
differences:
• In place of the AND, OR, NOT, and FILTER operators, the
FILTER_AND, FILTER_OR, FILTER_NOT, and FILTER_NAME
XML elements are used, respectively.
• In place of the property and dimension value syntax
used for query expressions, the PROP, DVAL_ID, and
DVAL_PATH elements common to other Endeca XML DTDs are
used to refer to property and dimension values.
• Instead of parentheses to enclose operand lists,
normal XML element nesting (implicit in the locations
of element start and end tags) is used.

The full DTD for XML file-based record filter expressions
is provided in the filter.dtd file packaged with the Endeca
software release.

For example, the following query expression:

OR(AND(Manufacturer:Sony,1001),
AND(Manufacturer:Aiwa,NOT(1002)),
Manufacturer:Denon)

is represented as a file-based expression using the
following XML syntax:
<FILTER>
  <FILTER_OR>
    <FILTER_AND>
      <PROP NAME="Manufacturer"><PVAL>Sony</PVAL></PROP>
      <DVAL_ID ID="1001"/>
    </FILTER_AND>
    <FILTER_AND>
      <PROP NAME="Manufacturer"><PVAL>Aiwa</PVAL></PROP>
      <FILTER_NOT>
        <DVAL_ID ID="1002"/>
      </FILTER_NOT>
    </FILTER_AND>
    <PROP NAME="Manufacturer"><PVAL>Denon</PVAL></PROP>
  </FILTER_OR>
</FILTER>

Just as file-based expressions can be composed with
query expressions, file expressions can also be composed
within other file expressions. For example, the following
query expression:

AND(FILTER(MyFilter),NOT(Manufacturer:Sony))

can be represented as a file-based expression using the
following XML:
<FILTER>
  <FILTER_AND>
    <FILTER_NAME NAME="MyFilter"/>
    <FILTER_NOT>
      <PROP NAME="Manufacturer"><PVAL>Sony</PVAL></PROP>
    </FILTER_NOT>
  </FILTER_AND>
</FILTER>


Enabling Properties for Use in Record Filters


All dimension values are automatically enabled for use in
record filter expressions. Properties must be explicitly
enabled for use in record filters by using Developer
Studio.

To configure an existing property for use in record filters:
1. In the Project tab of Developer Studio, double-click
Properties.
2. From the Properties view, select a property and click
Edit. The Property editor is displayed.
3. Check Enable for Record Filters.
4. Click OK. The Properties view is redisplayed.
5. From the File menu, choose Save.

Data Configuration for File-based Filter Expressions


To use file-based filter expressions in an application, you
must create a directory to contain record filter files in the
same location where the Navigation Engine index data
will reside. The name of this directory must be
<index_prefix>.fcl.

For example, if the Navigation Engine index data resides
in the directory:
/usr/local/endeca/my_app/data/partition0/dgidx_output/


and the index data prefix is:
/usr/local/endeca/my_app/data/partition0/dgidx_output/index

then the directory created to contain record filter files
must be:
/usr/local/endeca/my_app/data/partition0/dgidx_output/index.fcl

Record filters that are needed by the application should
be stored in this directory, which is searched
automatically when record filters are selected in an ENE
query. For example, if in the above case you create a
filter file with the path:
/usr/local/endeca/my_app/data/partition0/dgidx_output/index.fcl/MyFilter

then the filter expression stored in this file will be used
when the query refers to the filter MyFilter. For example,
the URL query:

N=0&Nr=FILTER(MyFilter)

will use this filter file.

Record Filter Result Caching


The Navigation Engine caches the results of all record
filter evaluations for re-use on subsequent ENE queries as
part of the global dynamic cache. Both file-based record
filters (that is, record filter expressions using the FILTER
operator), and UrlENEQuery-based record filters are
cached.

The one caveat to this general rule is that any information
derived from file-based record filters is not cached. This
means that navigation refinements, navigation counts,
and so on are not cached if they result from file-based
record filters.

Therefore, Endeca recommends that you use the
ENEQuery (via UrlENEQuery) based record filters instead
of file-based record filters whenever possible.

ENE URL Query Parameters for Record Filters


Three ENE URL query parameters are available to control
the use of record filters:
• Nr - Links to the ENEQuery.setNavRecordFilter
method. The Nr parameter can be used to specify a
record filter expression that will restrict the results of a
navigation query.
• Ar - Links to the ENEQuery.setAggrERecNavRecordFilter
method. The Ar parameter can be
used to specify a record filter expression that will
restrict the records contained in an aggregated-record
result returned by the Navigation Engine.
• Dr - Links to the ENEQuery.setDimSearchNavRecordFilter
method. The Dr parameter can be
used to specify a record filter expression that will
restrict the universe of records considered for a
dimension search. Only dimension values represented
on at least one record satisfying the specified filter will
be returned as search results.


Sample Queries
<application>?N=0&Nr=FILTER(MyFilter)
<application>?A=2496&An=0&Ar=OR(10001,20099)
<application>?D=Hawaii&Dn=0&Dr=NOT(Subject:Travel)
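Record filters can also be set programmatically rather than
through the URL. The following is a minimal Java sketch,
assuming the query classes used in the earlier examples and
that setNavRecordFilter() accepts the filter expression as a
string, exactly as it is passed in the Nr parameter:

// Create a navigation query from the request URL
ENEQuery usq = new UrlENEQuery(request.getQueryString(), "UTF-8");
// Restrict navigation results to records matching the file-based
// filter MyFilter, excluding records whose Manufacturer is Sony
usq.setNavRecordFilter("AND(FILTER(MyFilter),NOT(Manufacturer:Sony))");
// Make the ENE query
ENEQueryResults qr = nec.query(usq);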

Record Filter Performance Implications


Record filters can impact the following areas:


• Spelling auto-correction and spelling Did You Mean
• Memory cost
• Expression evaluation

Interaction with Spelling Auto-correction and Spelling Did You Mean

Record filters impose an extra cost on spelling
auto-correction and spelling Did You Mean.

Memory Cost

The evaluation of record filter dimension value
expressions is based on the same indexing technology
that supports navigation queries in the Dgraph. Because
of this, there is no additional memory or indexing cost
associated with using navigation dimension values in
record filters. When using property values in record filter
expressions, additional memory and indexing cost is
incurred because properties are not normally indexed for
navigation.

This feature is controlled in Developer Studio by the
Enable for Record Filters setting on the Property editor.

Expression Evaluation

Because expression evaluation is based on composition
of indexed information, most expressions of moderate
size (that is, tens of terms and operators) do not add
significantly to request processing time. Furthermore,
because the Dgraph caches the results of record filter
operations on an LRU (least recently used) basis, the
costs of expression evaluation are
typically only incurred on the first use of a filter during a
navigation session. However, some expected uses of
record filters have known performance bounds, which
are described in the following two sections.

Large OR Filters (“Part Lists”)

One common use of record filters is the specification of
lists of individual records to identify data subsets (for
example, custom part lists for individual customers,
culled from a superset of parts for all customers).

The total cost of processing a record filter can be broken down
into two main parts: the parsing cost and the evaluation
cost. For large expressions such as these, which will
commonly be stored as file-based filters, XML parsing
performance dominates total processing cost.

XML parsing cost is linear in the size of the filter
expression, but incurs a much higher unit cost than actual
expression evaluation. Though lightweight, expression
evaluation exhibits non-linear slowdown as the size of
the expression grows.

OR expressions with a small number of operands perform
linearly in the number of results, even for large result
sets. While the expression evaluation cost is reasonable
into the low millions of records for large OR expressions,
parsing costs relative to total query execution time can
become too large, even for smaller numbers of records.

Part lists beyond approximately one hundred thousand
records generally result in unacceptable performance (10
seconds or more load time, depending on hardware
platform). Lists with over one million records can take a
minute or more to load, depending on hardware.
Because results are cached, load time is generally only an
issue on the first use of a filter during a session. However,
long load times can cause other Dgraph requests to be
delayed and should generally be avoided.

Large-scale Negation

In most common cases, where the NOT operator is used
in conjunction with other positive expressions (that is,
AND with a positive property value), the cost of negation
does not add significantly to the cost of expression
evaluation.

However, the costs associated with less typical,
large-scale negation operations can be significant. For
example, while still sub-second, top-level negation
filtering (such as “NOT availability=FALSE”) of a record
set in the millions does not allow high throughput
(generally less than 10 operations per second).

If possible, attempt to rephrase expressions to avoid the
top-level use of NOT in Boolean expressions. For
example, in the case where you want to list only available
products, the expression “availability=TRUE” will yield
better performance than “NOT availability=FALSE.”

SECTION III
Dimension Features
Chapter 8
Using Inert Dimension Values

Marking a dimension value as inert makes it non-navigable.
That is, the dimension value should not be included in the
navigation state.

From an end user perspective, the behavior of an inert
dimension value is similar to the behavior of a dimension
within a dimension group: With dimension groups, the
dimension group behaves like a dimension and the
dimension itself behaves like an inert child dimension
value. When the user selects the dimension, the navigation
state is not changed, but instead the user is presented with
the child dimension values. Similarly, when a user selects
an inert dimension value, the navigation state is not
changed, but the children of the dimension value are
displayed for selection.

Whether or not a dimension value should be inert is a
subjective design decision about the navigation flow within
a dimension. Two examples of when you might use inert
dimension values are the following:
• You want the “More...” option to be displayed at the
bottom of an otherwise long list. To do this, use
Developer Studio’s Dimension editor to enable dynamic
ranking for the dimension and generate a “More…”
dimension value.
• You want to define other dimension values that
provide additional information to users, but for which
it is not meaningful to filter items.

Note that the inert dimension value feature is purely a
presentation feature and has no performance impact on
the system.

Configuring Inert Dimension Values


To configure an existing dimension value as non-navigable:

1. In the Project tab of Developer Studio, double-click
Dimensions to open the Dimensions view.
2. Select a dimension and click Edit. The Dimension
editor is displayed.
3. Select a dimension and click Values. In the Dimension
Values view, the Inert column indicates which
dimensions have been marked as inert.
4. Select a dimension value and click Edit. The
Dimension Value editor is displayed.
5. Check Inert.
6. Click OK. The Dimensions view is redisplayed, with a
Yes indicator in the Inert column for the changed
dimension.
7. From the File menu, choose Save.

There are no Dgidx or Dgraph flags necessary to mark a
dimension value as inert. Once a dimension has been
marked as inert in Developer Studio, the Presentation API
will be aware of its status.

Using Inert Dimension Values in the Application


When sending the new navigation state to the Navigation
Engine, the Endeca application should check the value of
the isNavigable() method on each DimVal object. Only
dimension values that are navigable (that is, not inert)
should be sent to the Navigation Engine, for example, via
the ENEQuery.setNavDescriptors() method.

Setting the Inert attribute for a dimension value indicates
to the Presentation API that the dimension value should
be inert. However, it is up to the front-end application to
check for inert dimension values and handle them in an
appropriate manner.

The following code snippet shows how a DimVal object
is checked to determine if it is a navigable or inert
dimension value. In the example, the N parameter is
added to the navigation request only if the dimension
value is navigable (not inert).


Sample Java Code for Inert Dimension Values

// Get refinement list for a Dimension object
DimValList refs = dim.getRefinements();
// Loop over refinement list
for (int k=0; k < refs.size(); k++) {
// Get refinement dimension value
DimVal dimref = refs.getDimValue(k);
// Create request to select refinement value
urlg = new UrlGen(request.getQueryString(), "UTF-8");
// If refinement is navigable, change the Navigation parameter
if (dimref.isNavigable()) {
urlg.addParam("N",
(ENEQueryToolkit.selectRefinement(nav,dimref)).toString());
urlg.addParam("Ne",Long.toString(rootId));
}
// If refinement is non-navigable, change only the exposed
// dimension parameter (Leave the Navigation parameter as is)
else {
urlg.addParam("Ne",Long.toString(dimref.getId()));
}
}


Sample .NET Code for Inert Dimension Values

// Get refinement list for a Dimension object
DimValList refs = dim.Refinements;
// Loop over refinement list
for (int k=0; k < refs.Count; k++) {
// Get refinement dimension value
DimVal dimref = (DimVal)refs[k];
// Create request to select refinement value
urlg = new UrlGen(Request.Url.Query.Substring(1), "UTF-8");
// If refinement is navigable, change the Navigation parameter
if (dimref.IsNavigable()) {
urlg.AddParam("N",
(ENEQueryToolkit.SelectRefinement(nav,dimref)).ToString());
urlg.AddParam("Ne",rootId.ToString());
}
// If refinement is non-navigable, change only the exposed
// dimension parameter (Leave the Navigation parameter as is)
else {
urlg.AddParam("Ne",dimref.Id.ToString());
}
}


Sample COM Code

' Get refinement list for a Dimension object
dim refs
set refs = dimn.GetRefinements
' Loop over refinement list
For k = 1 to refs.Count
' Get refinement dimension value
set dimref = refs(k)
' Create request to expose dimension values
set urlg = Server.CreateObject("Endeca.UrlGen")
urlg.init Request.QueryString, "UTF-8"
' If refinement is navigable, change the Navigation parameter
if (dimref.isNavigable) then
set qt = Server.CreateObject("Endeca.ENEQueryToolkit")
set newref = qt.SelectRefinement(nav, dimref)
urlg.addParam "N" , newref.idString
urlg.addParam "Ne", rootId
' If refinement is non-navigable, change only the exposed
' dimension parameter (Leave the Navigation parameter as is)
else
urlg.addParam "Ne", dimref.getId()
end if
Next

Chapter 9
Working with Externally Created
Dimensions

This chapter describes how to include and work with an
externally created dimension in a Developer Studio project.
This capability allows you to build all or part of a logical
hierarchy for your data set outside of Developer Studio and
then import that logical hierarchy as an Endeca dimension
available for use in search and Guided Navigation.

An externally created dimension describes a logical
hierarchy of a data set; however, the dimension hierarchy is
transformed from its source format to Endeca compatible
.xml outside of Developer Studio. The logical hierarchy of
the dimension conforms to Endeca’s external interface for
describing a data hierarchy (external_dimensions.dtd)
before you import the dimension into your project. Once
you import an externally created dimension, its ownership
is wholly transferred to Developer Studio. As such, you can
modify the dimension in any way necessary using
Developer Studio.

It is important to clarify the difference between an
externally managed taxonomy and an externally created
dimension to determine which feature document is
appropriate for your purposes. The two concepts are similar
yet have two important key differences: externally managed
taxonomies and externally created dimensions differ in
how you include them in a Developer Studio project and
how Developer Studio treats them once they are part of a
project. Use the table below to determine which one you
are working with.

The following table compares an externally managed
taxonomy and an externally created dimension:
How do you modify or update the hierarchy after it is in the project?
• Externally managed taxonomy: Any changes to the dimension must be made in the third-party tool. You then export the taxonomy from the tool, and Forge transforms the taxonomy and re-integrates the changes into your project.
• Externally created dimension: You generally do not update the source file for the hierarchy after you import it into your project. If you do update the file and re-import, then any changes you made to the dimension using Developer Studio are discarded. After you import the hierarchy, you can modify the dimension just as if you had created it manually using Developer Studio.

How does Developer Studio manage the hierarchy?
• Externally managed taxonomy: The third-party tool that created the file retains ownership. The dimension is almost entirely read-only in the project. You cannot add or remove dimension values from the dimension. However, you can modify whether dimension values are inert and collapsible.
• Externally created dimension: After you import the file, Developer Studio takes full ownership of the dimension and its dimension values. You can modify any characteristics of the dimension and its dimension values.

How do you create the .xml file?
• Externally managed taxonomy: Created using a third-party tool.
• Externally created dimension: Created either directly in an .xml file or created using a third-party tool.

How do you include the file in a Developer Studio project?
• Externally managed taxonomy: Read into a pipeline using a dimension adapter with Format set to XML - Externally Managed. Forge transforms the taxonomy file into a dimension according to the .xslt file that you specify on the Transformer tab of the dimension adapter.
• Externally created dimension: By choosing Import External Dimension on the File menu. During import, Developer Studio creates internal dimensions and dimension values for each node in the file's hierarchy. If you create the file using a third-party tool and any .xml transformation is necessary, you must transform the file outside the project before you choose Import External Dimension on the File menu. The file must conform to external_dimensions.dtd before you import it.

If you are working with externally created dimensions,
use this chapter. If you are working with an externally
managed taxonomy, see “Working with an Externally
Managed Taxonomy” on page 177.

An overview of the process to include an externally
created dimension in a Developer Studio project is as
follows:
1. You create a dimension hierarchy either manually in
an .xml file or you create a dimension using a
third-party tool. The dimension file must conform to
Endeca’s external_dimensions.dtd file (described
below).
2. You import the .xml file for the dimension into
Developer Studio, and modify the dimension and
dimension values as necessary.

XML Requirements
When you create an external dimension, whether by
creating it directly in an .xml file or by transforming it
from a source file, the dimension must conform to
Endeca’s external_dimensions.dtd file before you import
it into your project. The external_dimensions.dtd file defines
Endeca-compatible .xml used to describe dimension
hierarchies in an Endeca system. This file is located in
%ENDECA_ROOT%\version\conf\dtd on Windows and
$ENDECA_ROOT/version/conf/dtd on UNIX.

Also, an XML declaration that specifies the
external_dimensions.dtd file is required in an external
dimensions file. If you omit specifying the DTD in the
XML declaration, none of the DTD’s implied values or
other default values, such as classification values, are
applied to the external dimensions during Data Foundry
processing. Here is an example XML declaration that
should appear at the beginning of an external dimension
file:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE external_dimensions SYSTEM "external_dimensions.dtd">


Here is a very simple example of an external dimension
file with the required XML declaration and two
dimensions.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE external_dimensions SYSTEM "external_dimensions.dtd">

<external_dimensions>
  <node id="1" name="color" classify="true">
    <node id="2" name="red" classify="true"/>
    <node id="3" name="blue" classify="true"/>
  </node>

  <node id="10" name="size" classify="true">
    <node id="20" name="small" classify="true"/>
    <node id="30" name="med" classify="true"/>
  </node>
</external_dimensions>

You can describe a dimension hierarchy using any of the
three syntax options described in the following section.

XML Syntax to Specify Dimension Hierarchy

The XML elements available in external_dimensions.dtd
allow a flexible XML syntax to describe a dimension
hierarchy. There are three different syntax approaches
you can choose from when building the hierarchy
structure of your externally created dimension. All three
are supported by external_dimensions.dtd. Each
approach provides a slightly different syntax structure to
define a dimension and express the parent/child
relationship among dimensions and dimension values.
The three syntax choices are as follows:

• Use nested node elements within node elements.
• Use the parent attribute of a node to reference a
parent’s node ID.
• Use the child element to reference a child’s node ID.

You can use only one of the three approaches to describe
a hierarchy within a single .xml file. In other words, do
not mix different syntax structures within one file. Any
node element without a parent node describes a new
dimension. You can describe as many dimensions as
necessary in a single .xml file.

The following examples show each approach to building
a dimension hierarchy. These examples are
semantically equivalent: each describes the same
dimension and child dimension values.

Example of Using Nested node Elements

This example shows nested dimension values red and
blue within the dimension color.
<node name="color" id="1">
<node name="red" id="2"/>
<node name="blue" id="3"/>
</node>


Example of Using Parent Attributes

This example shows the red and blue dimension values
using the parent attribute. The value of the parent
attribute references the ID for the dimension color.
<node name="color" id="1"/>
<node id="2" name="red" parent="1"/>
<node id="3" name="blue" parent="1"/>

Example of Using Child Elements

This example uses child elements to indicate that red and
blue are dimension values of the color dimension. The ID
of each child element references the ID of the red and
blue nodes.
<node name="color" id="1">
<child id="2"/>
<child id="3"/>
</node>
<node name="red" id="2"/>
<node name="blue" id="3"/>

Note: You can also find additional information and examples
of using the elements in external_dimensions.dtd in the Endeca
XML Reference.

Node ID Requirements

Each node element in your dimension hierarchy must
have an id attribute. Depending on your requirements,
you may choose to provide any of the following values
for the id attribute:

• Name - If the name of a dimension value is what
determines its identity, then provide the id attribute
with the name.
• Path - If the path from the root node to the dimension
value determines its identity, then provide a value
representing the path in the id attribute.
• Existing identifier - If a node already has an identifier,
then that identifier can be used in the id attribute.

You can provide an arbitrary id value as long as the value
is unique. If you are including multiple .xml files, the
identifier must be unique across multiple files.

There is one scenario where an id attribute is optional. It
is optional only if you are using an externally created
dimension and also defining your dimension hierarchy
using nested node sub-elements (rather than using parent
or child ID referencing).
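For example, under those conditions a hierarchy such as the
following (a hypothetical fragment shown only to illustrate
the point) is acceptable without id attributes, because the
nesting itself expresses the parent/child relationships:

<external_dimensions>
  <node name="color" classify="true">
    <node name="red" classify="true"/>
    <node name="blue" classify="true"/>
  </node>
</external_dimensions>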

Importing an Externally Created Dimension

You add an externally created dimension to your pipeline
by importing it from the File menu of Developer Studio.
Once you import the .xml file into Developer Studio, the
dimension appears in the Dimensions view, and
Developer Studio has full read-write ownership of the
dimension. You can modify any aspects of a dimension
and its dimension values as if you created it manually
using Developer Studio.


To import an externally created dimension:

1. From the File menu, choose Import External
Dimensions. The Import External Dimensions dialog
box displays.
2. Specify the .xml file that defines the dimensions.
3. Choose a dimension adapter from the “Dimension
adapter to receive imported dimensions” drop-down
list.
4. Click OK. The dimensions appear in the Dimensions
editor for you to configure as necessary.

Note: Unlike the procedure to import an externally managed
taxonomy, you do not need to run a baseline update to import
an externally created dimension.

Chapter 10
Working with an Externally Managed
Taxonomy

This chapter describes how to include and work with an
externally managed taxonomy in a Developer Studio
project. This capability allows you to build all or part of a
logical hierarchy for your data set outside of Developer
Studio and use Developer Studio to transform that logical
hierarchy into Endeca dimensions and dimension values for
use in search and Guided Navigation.

An externally managed taxonomy is a logical hierarchy for a
data set that is built and managed using a third-party tool.
Once you include an externally managed taxonomy in your
project, it becomes a dimension whose hierarchy is
managed by the third-party tool that created it. In
Developer Studio, you cannot add or remove dimension
values from it. If you want to modify a dimension or its
dimension values, you have to edit the taxonomy using the
third-party tool and then update the taxonomy in your
project.

It is important to clarify the difference between an
externally managed taxonomy and an externally created
dimension to determine which feature document is
appropriate for your purposes. The two concepts are similar
yet have two important key differences: externally managed
taxonomies and externally created dimensions differ in
how you include them in a Developer Studio project and
how Developer Studio treats them once they are part of a
project. Use the table below to determine which one you
are working with.

The following table compares an externally managed
taxonomy and an externally created dimension:
How do you modify or update the hierarchy after it is in the project?
• Externally managed taxonomy: Any changes to the dimension must be made in the third-party tool. You then export the taxonomy from the tool, and Forge transforms the taxonomy and re-integrates the changes into your project.
• Externally created dimension: You generally do not update the source file for the hierarchy after you import it into your project. If you do update the file and re-import, then any changes you made to the dimension using Developer Studio are discarded. After you import the hierarchy, you can modify the dimension just as if you had created it manually using Developer Studio.

How does Developer Studio manage the hierarchy?
• Externally managed taxonomy: The third-party tool that created the file retains ownership. The dimension is almost entirely read-only in the project. You cannot add or remove dimension values from the dimension. However, you can modify whether dimension values are inert and collapsible.
• Externally created dimension: After you import the file, Developer Studio takes full ownership of the dimension and its dimension values. You can modify any characteristics of the dimension and its dimension values.

How do you create the .xml file?
• Externally managed taxonomy: Created using a third-party tool.
• Externally created dimension: Created either directly in an .xml file or created using a third-party tool.

How do you include the file in a Developer Studio project?
• Externally managed taxonomy: Read into a pipeline using a dimension adapter with Format set to XML - Externally Managed. Forge transforms the taxonomy file into a dimension according to the .xslt file that you specify on the Transformer tab of the dimension adapter.
• Externally created dimension: By choosing Import External Dimension on the File menu. During import, Developer Studio creates internal dimensions and dimension values for each node in the file's hierarchy. If you create the file using a third-party tool and any .xml transformation is necessary, you must transform the file outside the project before you choose Import External Dimension on the File menu. The file must conform to external_dimensions.dtd before you import it.

If you are working with an externally managed
taxonomy, use this chapter. If you are working with
externally created dimensions, see “Working with
Externally Created Dimensions” on page 167.

An overview of the process to include an externally
managed taxonomy in a Developer Studio project is as
follows:
1. You build an externally managed taxonomy using a
third-party tool. This document does not describe any
third-party tools or procedures that you might use to
perform this task.
2. You create an .xslt style sheet that instructs Forge how
to transform the taxonomy into Endeca .xml that
conforms to external_dimensions.dtd. This
requirement is described below in XSLT and XML
Requirements.
3. You configure your Developer Studio pipeline to
perform the following tasks:
• Describe the location of an externally managed
taxonomy and an .xslt style sheet with a dimension
adapter.
• Transform an externally managed taxonomy into an
externally managed dimension by running a
baseline update.
• Add an externally managed dimension to the
Dimensions view and the Dimension Values view.

After you finish the tasks listed above, you can perform
additional pipeline configuration that uses the externally
managed dimension, and then run a second baseline
update to process and tag your Endeca records.

Note: You must set up and use the Endeca Manager in projects
that incorporate an externally managed taxonomy.

XSLT and XML Requirements


In order to transform an externally managed taxonomy
into an externally managed dimension, you have to
create an .xslt style sheet that instructs Forge how to map
the taxonomy .xml to Endeca .xml. The mapping in your
.xslt style sheet and your resulting hierarchy must
conform to the Endeca external_dimensions.dtd file.

Both the .xslt and .xml requirements are further described
in the sections below.

XSLT Mapping

In order for Developer Studio to process the .xml from
your externally managed taxonomy, you have to create
an .xslt stylesheet that instructs Forge how to map the
.xml elements in an externally managed taxonomy to
Endeca-compatible .xml. Later in this document, you will
configure the Transformer tab of a dimension adapter
with the path to the .xslt style sheet and the path to the
taxonomy .xml file, and then run a baseline update to
transform the external taxonomy into an Endeca
dimension.
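The content of the style sheet depends entirely on the
structure of the source taxonomy, so no single example
applies universally. The following is a minimal illustrative
sketch only, assuming a hypothetical source format in which
each category is a <category> element with name and code
attributes; it maps the source to the nested-node syntax of
external_dimensions.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Wrap the transformed taxonomy in the required root element -->
  <xsl:template match="/taxonomy">
    <external_dimensions>
      <xsl:apply-templates select="category"/>
    </external_dimensions>
  </xsl:template>
  <!-- Map each source category to a node, preserving the nesting -->
  <xsl:template match="category">
    <node id="{@code}" name="{@name}" classify="true">
      <xsl:apply-templates select="category"/>
    </node>
  </xsl:template>
</xsl:stylesheet>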

The external_dimensions.dtd file defines Endeca-compatible
.xml to describe dimension hierarchies. This file is located
in %ENDECA_ROOT%\version\conf\dtd on Windows and
$ENDECA_ROOT/version/conf/dtd on UNIX. You can
describe a dimension hierarchy using any of the three
syntax options described in the following section.

Note: You can also find additional information and examples
of using the elements in external_dimensions.dtd in the Endeca
XML Reference.

XML Syntax to Specify Dimension Hierarchy

The .xml elements available in external_dimensions.dtd
allow a flexible .xml syntax to describe dimension
hierarchy. There are three different syntax approaches

you can choose from when building the hierarchy
structure of your externally managed dimension. All three
are supported by external_dimensions.dtd. Each
approach provides a slightly different syntax structure to
define a dimension and express the parent/child
relationship among dimensions and dimension values.
The three syntax choices are as follows:
• Use nested node elements within node elements.
• Use the parent attribute of a node to reference a
parent’s node ID.
• Use the child element to reference a child’s node ID.

You can use only one of the three approaches to describe
a hierarchy within a single .xml file. In other words, do
not mix different syntax structures within one file. Any
node element without a parent node describes a new
dimension. You can describe as many dimensions as
necessary in a single .xml file.

The following examples show each approach to building
a dimension hierarchy. These examples are semantically
equivalent: each describes the same dimension and child
dimension values.


Example of Using Nested node Elements

This example shows nested dimension values red and
blue within the dimension color.
<node name="color" id="1">
<node name="red" id="2"/>
<node name="blue" id="3"/>
</node>

Example of Using parent Attributes

This example shows the red and blue dimension values
using the parent attribute. The value of the parent
attribute references the ID for the dimension color.
<node name="color" id="1"/>
<node name="red" id="2" parent="1"/>
<node name="blue" id="3" parent="1"/>

Example of Using child Elements

This example uses child elements to indicate that red and
blue are dimension values of the color dimension. The ID
of each child element references the ID of the red and
blue nodes.
<node name="color" id="1">
<child id="2"/>
<child id="3"/>
</node>
<node name="red" id="2"/>
<node name="blue" id="3"/>


Node ID Requirements and Identifier Management in Forge

When you transform the hierarchy structure from an
external taxonomy, each node element in your dimension
hierarchy must have an id attribute. Forge ensures that
each identifier is unique across an Endeca
implementation by creating a mapping between a node’s
ID and an internal identifier that Forge creates.

This internal mapping ensures that Forge assigns the
same identifier to a node from an external taxonomy each
time the taxonomy is processed. For example, if you
provide updated versions of a taxonomy file, Forge
determines which dimension values map to dimension
values from a previous version of the file according to the
internal identifier. However, there is a scenario where
Forge does not preserve the mapping between the id
attribute and the internal identifier that Forge creates for
the dimension value. This scenario occurs if you
reorganize a dimension value to become a child of a
different parent dimension. Reorganizing a dimension
value within the same parent dimension does not affect
the id mapping when Forge reprocesses updated files.

Depending on your requirements, you may choose to
provide any of the following values for the id attribute:
• Name - If the name of a dimension value is what
determines its identity, then the .xslt style sheet should
fill the id attribute with the name.
• Path - If the path from the root node to the dimension
value determines its identity, then the .xslt style sheet
should put a value representing the path in the id
attribute.
• Existing identifier - If a node already has an identifier,
then that identifier can be used in the id attribute.

You can provide an arbitrary ID as long as the value is
unique. If you are including multiple .xml files, the
identifier must be unique across all files. As described
above, Forge ensures that identifiers are unique across the
system.

Pipeline Configuration
The following sections describe the pipeline
configuration requirements to incorporate an externally
managed taxonomy into your Developer Studio project.

Integrating an Externally Managed Taxonomy

You use a dimension adapter to read in .xml from an
externally managed taxonomy and transform it to an
externally managed Endeca dimension. If necessary, you
can import and transform multiple taxonomies by using a
different dimension adapter for each taxonomy file.

To perform the taxonomy transformation, you configure a
dimension adapter with the .xml file of the taxonomy and
the .xslt style sheet that Forge uses to transform the
taxonomy file's .xml elements. You then build the rest of
your pipeline and run a baseline update. When the
update runs, Forge transforms the taxonomy into a
dimension that you can load and examine in the
Dimensions view.

To integrate an externally managed taxonomy:

1. In the Project tab of Developer Studio, double-click
Pipeline Diagram.
2. In the Pipeline Diagram editor, choose New >
Dimension > Adapter. The Dimension Adapter editor
displays.
3. In the Dimension Adapter Name text box, enter a
unique name for the dimension adapter.
4. In the General tab, do the following:
• In the Direction frame, select Input.
• In the Format field, select XML - Externally
Managed.
• In the URL field, enter the path to the source
taxonomy file. This path can be absolute or relative
to the location of your project’s Pipeline.epx file.
• Check Require Data if you want Forge to generate
an error if the file does not exist or is empty.
5. In the Transformer tab, do the following:
• In the Type field, enter XSLT.
• In the URL field, specify the path to the .xslt file you
created.
6. Click OK.
7. From the File menu, select Save.

8. If necessary, repeat steps 2 through 6 to include additional taxonomies.
9. Create a dimension server to provide a single point of
reference for other pipeline components to access
dimension information. For more information about
dimension servers, see the Endeca Developer Studio
Help.

Note: If you want to modify an externally managed dimension, see "Updating an Externally Managed Taxonomy in Your Pipeline" on page 190.

Transforming an Externally Managed Taxonomy

In order to transform your externally managed taxonomy into an Endeca dimension, you have to run a baseline update. Running the update allows Forge to transform the taxonomy and store a temporary copy of the resulting dimension in the Endeca Manager. After you run the update, you can then create a dimension in the Dimensions view according to "Loading an Externally Managed Dimension" below.

Note: The Endeca Manager must be running before you start a baseline update. Also, you must add any required pipeline components to your pipeline for the update to run. For example, you cannot run the update without a property mapper. However, you can temporarily add a default property mapper and later configure it with property and dimension mapping.


To transform an externally managed taxonomy:

1. Ensure you have sent the latest instance configuration to the Endeca Manager.
2. In the Endeca Manager toolbar, click Start Baseline Update.

Note: To reduce processing time for large source data sets, you
may want to run the baseline update using the -n flag for
Forge. (The -n flag controls the number of records processed in
a pipeline, for example, -n 10 processes ten records.) You can
specify the flag in the Forge field of the Endeca Manager
Settings dialog box.

Loading an Externally Managed Dimension

After you transform an external taxonomy into an Endeca dimension, you can then load the dimension in the Dimensions view and add its dimension values to the Dimension Values view.

Rather than click New, as you would to manually create a dimension in Developer Studio, you instead click Discover in Dimensions view to add an externally managed dimension. Developer Studio discovers the dimension by reading in the dimension's temporary file that Forge created when you ran the first baseline update. Next, you load the dimension values in the Dimension Values editor.

Note: Because the dimension values are externally managed, you cannot add or remove dimension values. You can, however, modify whether dimension values are inert or collapsible.


To load a dimension and its dimension values:

1. In the Project tab of Developer Studio, double-click Dimensions. The Dimensions view displays.
2. Click the Discover button to add the externally
managed dimension to the Dimensions view. The
dimension appears in the Dimensions view with its
Type column set to Externally Managed.
3. In Dimensions view, select the externally managed
dimension and click Values. The Dimension Values
view appears with the root dimension value of the
externally managed dimension displayed.
4. Select the root dimension value and click Load. The
remaining dimension values display.
5. Repeat steps 3 to 4 for any additional externally
managed taxonomies you integrated in your project.

Note: Most characteristics of an externally managed dimension and its dimension values are not modifiable. These characteristics either appear as unavailable or Developer Studio displays a message indicating what actions are possible. If you want to modify these characteristics of a dimension or its dimension values, you have to edit the source taxonomy by performing the tasks described in "Updating an Externally Managed Taxonomy in Your Pipeline" on page 190.

Running a Second Baseline Update

After loading dimension values and building the rest of your pipeline, you must run a second baseline update to process and tag your Endeca records.


The second baseline update performs property and dimension mapping that could not be performed in the first baseline update because the externally managed dimensions had not yet been transformed and made available for mapping.

To run a second baseline update:

1. Ensure you have sent the latest instance configuration to the Endeca Manager.
2. In the Endeca Manager toolbar, click Start Baseline Update.

Updating an Externally Managed Taxonomy in Your Pipeline

If you want to modify an externally managed taxonomy and replace it with a newer version, you have to revise the taxonomy using the third-party tool that created it, and then repeat the process of incorporating the externally managed taxonomy into your pipeline as described in "Pipeline Configuration" on page 185.

Chapter 11
Classifying Documents with Stratify

This document describes how to integrate Stratify taxonomies and Stratify classification capabilities into a Developer Studio project. Incorporating Stratify into your project allows you to classify unstructured source data, for example, Web pages, .pdf documents, and Microsoft Word documents, for use in Endeca Guided Navigation applications.

Unstructured source data requires different processing than structured source data. Structured data, such as databases, .csv files, character-delimited files, fixed-width files, and so on, has name/value pairs that Endeca can translate into dimensions and Endeca properties.

Unstructured data, on the other hand, is not composed of name/value pairs that Endeca can translate into dimensions and Endeca properties. For unstructured data, you have to use tools like the Stratify Discovery System to evaluate the content of an unstructured document and assign the document a topic based on classification logic that you configure. In an Endeca pipeline, this topic becomes a property that can be used like any other property associated with an Endeca record; for example, it can be manipulated and mapped to dimensions or Endeca properties.

In the Stratify Discovery System, you use the Stratify Taxonomy Manager to build a taxonomy to organize your source data, and you use the Stratify Classification Server to classify unstructured source data against that taxonomy. Endeca Developer Studio provides the capability to include a Stratify taxonomy and transform it to an Endeca dimension. Endeca uses the results of Stratify's document classification to tag Endeca records, that is, your unstructured documents, with classification properties. After the records contain classification properties, you can map the properties to dimension values. The interaction between Endeca and the Stratify Discovery System is covered in more detail in "How Endeca and Stratify Classify Unstructured Documents" on page 196.

You integrate Stratify into your pipeline by adding a dimension adapter to transform the Stratify taxonomy, a Content Acquisition System (CAS) to crawl unstructured documents, and a record manipulator with a STRATIFY expression to access the Stratify Classification Server. For an overview of the process, see "Overview of the Integration Process" on page 199. You will find it helpful to read the Content Acquisition System chapter beginning on page 23 before integrating Stratify into your project.

Sections of This Document

This document contains the following main sections:

• Classifying Documents with Stratify
• Frequently Used Terms and Concepts
• How Endeca and Stratify Classify Unstructured Documents
• Overview of the Integration Process
• Required Stratify Tools
• Developing a Stratify Taxonomy
• Building a Taxonomy
• Exporting a Taxonomy
• Creating a Pipeline to Incorporate Stratify
• Creating a CAS
• Classifying Documents with Stratify Classification Server
• Adding a Property Mapper and Indexer Adapter
• Integrating a Stratify Taxonomy
• Running the First Baseline Update
• Loading a Dimension and its Dimension Values
• Mapping a Dimension Based on a Stratify Taxonomy
• Running the Second Baseline Update
• Updating a Taxonomy in Your Pipeline

Frequently Used Terms and Concepts

Endeca concepts and terms for record organization generally correspond to Stratify terms for document organization. For example, a Stratify taxonomy contains topics into which source documents are organized; an Endeca dimension contains dimension values into which Endeca records are organized. (Remember that Endeca records are based on source documents.) Both sets of concepts provide a similar framework to describe a way of hierarchically organizing information.

There are several frequently used terms in this document:

• Structured data is source data based on name/value pairs. Common example formats that provide this structure include databases, .csv files, character-delimited files, fixed-width files, and so on. For example, this pairing might occur in source data as Color/Red, Price/$8.00, and Size/Medium. Endeca can translate these properties into dimensions and Endeca properties for use in search and Guided Navigation.
• Unstructured data is source data that does not have name/value pairs. Common examples of unstructured data include MS Word documents, .pdf files, Web pages, and so on. If you have unstructured data, applications such as Stratify can examine the document and classify the document into topics based on logic that you configure. Endeca can use the document-to-topic classifications to create name/value properties that can be manipulated and mapped for Guided Navigation.
• Taxonomies provide a logical hierarchy to organize data, typically based on a theme. A taxonomy has a root and topics. Topics provide sub-grouping for the theme. For example, in a taxonomy whose root theme is Health Care, topics may include Physical Health and Emotional Health. Respective sub-topics may include Immunizations and Depression. Source documents are organized in a taxonomy according to a classification model (described below). After you import a taxonomy into your pipeline, it becomes an Endeca dimension.
• Taxonomy topics provide sub-grouping organization in a Stratify taxonomy. Think of topics as nodes in a taxonomy structure. Once you import a taxonomy into your pipeline, the topics in a Stratify taxonomy become the dimension values of a single Endeca dimension.
• Externally managed taxonomies are logical hierarchies to organize your data that are built and managed by a third-party tool and transformed by Developer Studio into a dimension. A Stratify taxonomy is an example of an externally managed taxonomy that you can create to assist in classifying unstructured source documents. The Stratify Classification Server manages a published taxonomy to perform classification tasks. If you want to modify the dimension or dimension values, you have to edit the taxonomy and taxonomy topics in Taxonomy Manager, publish it to the Stratify Classification Server, export the .xml, and integrate the .xml into Developer Studio.
• Classification model describes the classification method used to organize source documents into topics. Classification models may use any of the following methods either individually or in combination: statistical, Boolean, keyword, and source rules. The Stratify Classification Server uses both the taxonomy and its classification model to classify unstructured source documents.

How Endeca and Stratify Classify Unstructured Documents

Endeca components and Stratify components communicate closely during record processing to classify each unstructured source document. In particular, a record manipulator using a STRATIFY expression interacts with a Stratify Classification Server to classify each record as it is processed through the pipeline. A simplified summary of the interaction is as follows: Forge crawls unstructured source documents, hands them off to Stratify to classify them, and then Forge appends classification properties to the record for the corresponding source document. You map the Stratify properties to a dimension created from the Stratify taxonomy.

The illustration below shows the interaction between Endeca components and Stratify components in greater detail. There are three kinds of flow in the diagram:
• URLs flow from the spider to the record adapter (a record adapter that uses the Document format).
• Documents flow to the indexer adapter and get turned into Endeca records.
• The Stratify taxonomy is published to the Stratify Classification Server, exported from the Taxonomy Manager as .xml, and transformed in the pipeline as a dimension. Strictly speaking, this step is not part of the record processing flow. This step must be performed only once before you run the pipeline.


When Forge executes this pipeline using Developer Studio, the flow of URLs and records is as follows:
1. The terminating component (indexer adapter) requests the next record from its record source (property mapper). At this point, none of the pipeline components between the indexer adapter and the record adapter has records to process yet.
2. When the request for the next record reaches the record adapter, the record adapter asks the spider for the next URL it is to retrieve (on the first iteration through the URL processing loop, the URL is the root URL configured on the Root URL tab of the Spider editor).
3. Based on the URL that the spider provides, the record adapter creates a record containing the URL and a limited set of metadata.
4. The created record then flows down to the first record
manipulator where the following takes place:
• The document associated with the URL is fetched
(using the RETRIEVE_URL expression).
• Content (searchable text) is extracted from the
document using the CONVERTTOTEXT or PARSE_DOC
expression. Any URLs in the document are also
extracted for additional crawling.
5. The record then moves to the spider where additional
URLs from the document (those extracted in the
record manipulator) are queued for crawling.
6. The created record then flows down to the second record manipulator where the following takes place:
• The STRATIFY expression requests that the Stratify Classification Server classify each document. Forge sends the document as an attachment to a Stratify Classification Server.
• The Stratify Classification Server examines the document, including the document's structure, and classifies it according to the classification model you developed in the Stratify Taxonomy Manager.
• The Stratify Classification Server then replies to Forge with a classification response that indicates what properties to append to the record.
7. The property mapper performs source property to
dimension and source property to Endeca property
mapping.
8. The indexer adapter receives the record and writes it
out to disk.

The process repeats until there are no URLs in the URL queue maintained by the spider.

Overview of the Integration Process

There are two main phases to integrating a Stratify taxonomy and Stratify classification capabilities into a Developer Studio project. You perform the first phase using Stratify tools. You perform the second phase using Endeca tools.

In phase one, you develop a Stratify taxonomy by performing the following tasks:
1. Build and validate the taxonomy, including its classification model.
2. Publish the taxonomy and classification model to the Stratify Classification Server.
3. Export the taxonomy as .xml.

See "Developing a Stratify Taxonomy" on page 201 for details about each step listed above.

In phase two, you create a Developer Studio pipeline to incorporate the Stratify taxonomy and access the Stratify Classification Server by performing the following tasks:
1. Create a CAS.
2. Create a record manipulator to classify documents with the Stratify Classification Server.
3. Add a property mapper and indexer adapter.
4. Integrate the Stratify taxonomy.
5. Run a baseline update to transform the taxonomy.
6. Load a dimension and its dimension values.
7. Map the dimension in a property mapper.
8. Run another baseline update.

See "Creating a Pipeline to Incorporate Stratify" on page 204 for details about each step listed above.

Required Stratify Tools

In addition to installing and configuring the Endeca Navigation Platform, Endeca Developer Studio, and the Endeca Manager, you must also install and configure the Stratify Discovery System 3.0 before you begin the integration process. At a minimum, this installation requires the following two components of the Stratify Discovery System:
• Stratify Taxonomy Manager—manages the total taxonomy lifecycle, using a combination of automated technologies, advanced analytics, and human review to create, define, test, publish, and refine taxonomies.
• Stratify Classification Server—stores the taxonomy and uses it to classify documents in response to requests from Forge.

You do not need to use either Stratify Analytics or Stratify Notification Server as part of the integration.

Developing a Stratify Taxonomy

Before you can build an integrated project, you first need to develop your Stratify taxonomy. Development includes building the taxonomy itself, publishing it and its classification model to the Stratify Classification Server, and then exporting the taxonomy as .xml.

You perform all of the steps involved in building and exporting a taxonomy using Stratify tools. This document describes the high-level tasks of building a taxonomy and some of the detailed procedures for exporting a taxonomy. Refer to the User's Guide for Stratify Discovery System 3.0 for additional details about how to perform each task.


Building a Taxonomy

The following steps outline the high-level process to build and publish a Stratify taxonomy using the Taxonomy Manager. For complete documentation of these steps, see the User's Guide for Stratify Discovery System 3.0.
1. Build a taxonomy using the Taxonomy Manager. This step includes creating the taxonomy in any of the following ways: manually, by generating it automatically, or by importing it from another external source. This step may also include editing the taxonomy as necessary during development.
Note: When creating a Stratify taxonomy, the name you provide becomes the name of the root dimension value, and the names of the taxonomy's topics become the names of the dimension values beneath the root.
Note: Also, if you are building multiple Stratify taxonomies, the identifier for each Stratify topic must be unique across all taxonomy files.
2. Define the classification model for the taxonomy. You can employ any of the four Stratify classification models: statistical, keyword, source rules, or Boolean.
3. Compile the model of the taxonomy, if you created a statistical classification model. This step is not necessary for the other models.
4. Test the taxonomy's classification model.
5. Publish the taxonomy to the Stratify Classification Server and version any other published taxonomy if necessary.


Exporting a Taxonomy

After you have built and published a taxonomy to the Stratify Classification Server, you must export the taxonomy as an .xml file. You will later include this .xml file in Developer Studio and transform it to an Endeca dimension to use during mapping.

Note: This procedure describes exporting a taxonomy from the Stratify Taxonomy Manager. If you have additional questions about the procedure, see the User's Guide for Stratify Discovery System 3.0.

To export a taxonomy:

1. Start Stratify Taxonomy Manager.
2. From the View menu, select Taxonomies.
3. In the Select Taxonomies to View dialog box, check the taxonomy that you want to export.
4. Click OK.
5. In the taxonomy tree, right-click the published taxonomy.
6. Select Export from the submenu.
7. In the Export dialog box, perform the following steps:
a. In the Topic field, confirm that you selected the correct taxonomy.
b. In the To File field, provide a path and file name for the taxonomy you want to export.
c. Select the Export Children option.
d. Under Advanced Options, select All.
8. Click OK.
9. If necessary, repeat steps 2 through 8 to export multiple taxonomies.

Creating a Pipeline to Incorporate Stratify

A Developer Studio project uses CAS to crawl unstructured source documents. Next, a record manipulator accesses the Stratify Classification Server, which classifies the Endeca record for each source document. The following sections of this document describe how to create and configure a pipeline that includes Stratify taxonomies and classification capabilities.

There are two specific additions to a typical CAS pipeline that allow you to incorporate Stratify taxonomies and classification capabilities:
• A dimension adapter is required to read in and transform the Stratify taxonomy.
• A STRATIFY expression is required after the RETRIEVE_URL and text extraction expressions (either PARSE_DOC or CONVERTTOTEXT) to communicate with the Stratify Classification Server and classify documents.

Because of the similarity between a typical CAS pipeline and a CAS pipeline with Stratify, this document relies on cross-references to the Content Acquisition System chapter. In cases where the two CAS pipelines are exactly the same, this document references the CAS procedure. In cases where a typical CAS pipeline differs from a CAS pipeline that includes Stratify, this document indicates those differences.

This diagram shows an example pipeline similar to the one you create in this document.


Creating a CAS

Begin your pipeline by creating a CAS that crawls your unstructured documents. Most of the steps to create a CAS pipeline that includes Stratify are common to a typical CAS pipeline. The pipeline differs in the components that follow the spider. After you finish the procedure in this section, go on to "Classifying Documents with Stratify Classification Server" on page 207.

To create a CAS:

1. Create a record adapter to read source documents (required). See "Creating a Record Adapter to Read Documents" on page 41.
2. Create a record manipulator (required). See "Creating a Record Manipulator" on page 43 for this task and also for the following items:
a. Add a RETRIEVE_URL expression (required).
b. Convert documents to text (required).
c. Identify the language of the documents (optional).
d. Remove document body properties (optional but recommended).
3. Create a spider (required). See "Creating a Spider" on page 55.


Classifying Documents with Stratify Classification Server

A STRATIFY expression is required after the RETRIEVE_URL and text extraction expressions (either PARSE_DOC or CONVERTTOTEXT). The STRATIFY expression identifies a Stratify Classification Server that classifies the unstructured document associated with an Endeca record.

For the sake of pipeline clarity, we recommend you add the STRATIFY expression in its own record manipulator that follows the spider component. The recommended position of a record manipulator containing the STRATIFY expression is after the spider component and before the property mapper.


If you have more than one Stratify Classification Server in your environment, then you need one STRATIFY expression to specify the host, port, hierarchy ID, and so on for each server. Typically, a single taxonomy is published to a single Stratify Classification Server. Note, however, that you can publish multiple taxonomies to a single Stratify Classification Server.


To add a STRATIFY expression to a record manipulator:

1. In the Project tab of Developer Studio, double-click Pipeline Diagram.
2. In the Pipeline Diagram editor, choose New > Record > Manipulator. The Record Manipulator editor displays.
3. Click OK.
4. Double-click the record manipulator.
5. Add a STRATIFY expression as shown in the example below. If necessary, add additional STRATIFY expressions for each Stratify Classification Server in your environment.

Nested expression nodes within STRATIFY configure how it functions. The following expression nodes are required:
• STRATIFY_HOST – The machine name or IP address of the Stratify Classification Server.
• STRATIFY_PORT – The port on which the Stratify Classification Server listens for requests from Forge.
• HIERARCHY_ID – The identifier of a Stratify classification model. To determine the VALUE of HIERARCHY_ID:
1. Navigate to the working directory of the Stratify Classification Server that contains your classification model and taxonomy files. This directory is typically located at <Stratify Install Directory>\ClassificationServer\ClassificationServer\ClassificationServerWorkDir\Taxonomy-N, where N is the number of the directory that contains the classification model you want to use with your Endeca project. (Your environment may have multiple \Taxonomy-N directories, each containing different classification model and taxonomy files.)
2. Note the number at the end of the \Taxonomy-N directory. This number is the value of HIERARCHY_ID. For example, if the classification model you want to use is stored in ...\Taxonomy-2, then HIERARCHY_ID should have VALUE="2". If you published more than one taxonomy to your Stratify Classification Server, include a HIERARCHY_ID node for each taxonomy.
• IDENTIFIER_PROP_NAME – The Endeca identifier for the record being processed. The default is Endeca.Identifier.
• BODY_PROP_NAME – The property that the Stratify Classification Server examines to classify the document. The default property is Endeca.Document.Body. You can provide either Endeca.Document.Body or Endeca.Document.Text. However, specifying Endeca.Document.Body provides better classification because Forge can send the document to the Stratify Classification Server as an attachment, and the Stratify Classification Server can use the attachment to determine structural information about the document that aids in classification. If you specify Endeca.Document.Text, Forge sends the converted text of the document without any of its structural information.

Note: It is not necessary to provide attribute values for the LABEL or URL attributes.
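For reference, here is a sketch of what such an expression might look like in the pipeline XML, using the EXPRESSION/EXPRNODE element syntax of Forge expressions. All values shown are placeholders that you must replace with values from your own environment:

<EXPRESSION LABEL="" NAME="STRATIFY" TYPE="VOID" URL="">
  <!-- Placeholder host and port of your Stratify Classification Server -->
  <EXPRNODE NAME="STRATIFY_HOST" VALUE="stratify.example.com"/>
  <EXPRNODE NAME="STRATIFY_PORT" VALUE="8050"/>
  <!-- The N of the \Taxonomy-N working directory that holds your model -->
  <EXPRNODE NAME="HIERARCHY_ID" VALUE="2"/>
  <EXPRNODE NAME="IDENTIFIER_PROP_NAME" VALUE="Endeca.Identifier"/>
  <EXPRNODE NAME="BODY_PROP_NAME" VALUE="Endeca.Document.Body"/>
</EXPRESSION>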


For general information about how to create expressions, see the Endeca Developer Studio Help.

When you run the pipeline, here is how the classification process takes place:
1. For each record that passes through the record manipulator, the STRATIFY expression requests that the Stratify Classification Server classify the document indicated by Endeca.Document.Body.
2. Forge sends the document as an attachment to the Stratify Classification Server.
3. The Stratify Classification Server examines the document, including the document's structure, and classifies it according to the classification model you developed in Taxonomy Manager. (You indicate the classification model in the HIERARCHY_ID expression node.)
4. The Classification Server then replies to Forge with a classification response that indicates what property values to append to the record.

The STRATIFY expression generates at least the following properties to append to each record:
• Endeca.Stratify.Topic.HID<hierarchy ID>=<topic ID>
This property corresponds to the ID value of a topic in your published Stratify taxonomy. Each topic in your taxonomy has an ID value assigned by Stratify. The value of <hierarchy ID> corresponds to your HIERARCHY_ID expression node. For example, if an Eating Disorders topic has a topic ID of 2097222 in a health care taxonomy whose hierarchy ID is 15, then the Endeca property is Endeca.Stratify.Topic.HID15="2097222".
• Endeca.Stratify.Topic.Name.HID<hierarchy ID>.TID<topic ID>=<topic name>
This property corresponds to a topic name from your published Stratify taxonomy for its corresponding topic ID. For example, for the Eating Disorders topic in the health care taxonomy mentioned earlier, this property is Endeca.Stratify.Topic.Name.HID15.TID2097222="Eating Disorders".
• Endeca.Stratify.Topic.Score.HID<hierarchy ID>.TID<topic ID>=<score>
This property indicates the classification score between an unstructured document and the topic it has been classified into. The value of <score> is a percentage expressed as a value between zero and one. Zero indicates the lowest classification score (0%), and one indicates the highest score (100%). You can use this property to remove records from your application that have a low score for classification matching, for example, Endeca.Stratify.Topic.Score.HID15.TID2097222="0.719380021095276".
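To make the result concrete, here is a sketch of how these properties might appear on a processed record, written in the PROP/PVAL record XML convention; the record identifier and all values are hypothetical:

<RECORD>
  <PROP NAME="Endeca.Identifier"><PVAL>doc-0001</PVAL></PROP>
  <!-- Topic ID assigned by Stratify for hierarchy 15 -->
  <PROP NAME="Endeca.Stratify.Topic.HID15"><PVAL>2097222</PVAL></PROP>
  <!-- Topic name corresponding to that topic ID -->
  <PROP NAME="Endeca.Stratify.Topic.Name.HID15.TID2097222"><PVAL>Eating Disorders</PVAL></PROP>
  <!-- Classification score between zero and one -->
  <PROP NAME="Endeca.Stratify.Topic.Score.HID15.TID2097222"><PVAL>0.719380021095276</PVAL></PROP>
</RECORD>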

Adding a Property Mapper and Indexer Adapter

Although the pipeline does not yet have any external dimensions available for mapping (you have not yet integrated the Stratify taxonomy), you need to add a property mapper to the pipeline to successfully run the first baseline update. You will come back to this property mapper later in the process to add dimension mapping for your external taxonomy.

For now, create an empty property mapper and select the upstream record manipulator as its record source. Also add an indexer adapter to your project as you would for a basic pipeline. There are no special configuration requirements for the indexer adapter in a pipeline that integrates Stratify. See the Developer Studio Help for more information about creating a property mapper and indexer adapter.

Integrating a Stratify Taxonomy

You use a dimension adapter to read in the Stratify taxonomy XML and transform it to Endeca-compatible XML. If necessary, you can include and transform multiple Stratify taxonomies by using a single dimension adapter for each taxonomy file.

In the Dimension Adapter editor, you specify the XML file of the taxonomy that Forge transforms. To perform the taxonomy transformation, you have to run a baseline update. When the update runs, Forge transforms the taxonomy into an externally managed Endeca dimension that you can load in the Dimensions view.

After you integrate the Stratify taxonomy, a newly transformed dimension is very similar to a dimension created using Developer Studio. However, because the dimension is externally managed by Stratify, you cannot edit the dimension or its dimension values using Developer Studio. If you want to modify an externally managed dimension, you have to repeat the process of integrating Stratify into your pipeline. For more information, see "Updating a Taxonomy in Your Pipeline" on page 223.

To integrate a Stratify taxonomy:

1. In the Project tab of Developer Studio, double-click Pipeline Diagram.
2. In the Pipeline Diagram editor, choose New > Dimension > Adapter. The Dimension Adapter editor displays.
3. In the Dimension Adapter Name text box, enter a unique name for the dimension adapter.
4. In the General tab, do the following:
• In the Direction frame, select Input.
• In the Format field, select Stratify.
• In the URL field, enter the path to the source taxonomy file that you exported in "Exporting a Taxonomy" on page 203. This path must be relative to the location of your project's Pipeline.epx file.
• Check Require data if you want Forge to generate an error if the file does not exist or is empty.
5. Click OK.
6. From the File menu, select Save.
7. If necessary, repeat steps 2 through 6 to include additional Stratify taxonomies.
8. Create a dimension server to provide a single point of reference for other pipeline components to access dimension information from multiple dimension adapters. See the Endeca Developer Studio Help for more information.

Running the First Baseline Update

In order to transform the source .xml from your Stratify taxonomy into an Endeca dimension, you have to run a baseline update. Running the update allows Forge to transform the taxonomy and store a copy of the resulting dimensions.xml in the Endeca Manager. After you run the update, you can then load the dimension as described in "Loading a Dimension and its Dimension Values" on page 216. When you load the dimension, Developer Studio reads the dimensions.xml file and displays the dimension values in the Dimension Values view.

Note: The Endeca Manager must be running before you start a baseline update.

To transform the taxonomy:

1. Place a copy of your source taxonomy file into the directory you specified in the Incoming Directory field of the Web Studio's Provisioning System page.
2. Ensure you have sent the latest instance configuration to the Endeca Manager.
3. In the Endeca Manager toolbar, click Start Baseline Update.


Note: To reduce processing time for large source data sets, you
may want to run the baseline update using the -n flag for
Forge. (The -n flag controls the number of records processed in
a pipeline.) You can specify the flag in the Forge field of the
Endeca Manager Settings dialog box.

Loading a Dimension and its Dimension Values

After you transform a Stratify taxonomy into an Endeca dimension, you can then add the dimension to the Dimensions view and load the dimension values in the Dimension Values view.

Rather than choose New, as you would to create a standard dimension in Developer Studio, you instead click Discover in the Dimensions view to see an externally managed dimension. Developer Studio discovers the dimension by reading in the dimensions.xml file created when you ran the first baseline update. Next, you click Load in the Dimension Values view and Forge reads in dimension value information from the dimensions.xml file and presents the dimension values in the Dimension Values view.

The core characteristics of an externally managed dimension and its dimension values are not modifiable in the Dimension Values view. These features appear as unavailable, or Developer Studio displays a message indicating what actions are possible. If you want to modify these features of a dimension or its dimension values, you have to edit the features in the source taxonomy by performing the tasks described in "Updating a Taxonomy in Your Pipeline" on page 223. You can, however, modify whether dimension values are inert or collapsible.

Note: You must set up and use the Endeca Manager to view
dimensions and dimension values in Developer Studio. If you
choose to use Developer Studio in standalone mode for pipeline
development, you can run Forge by hand to transform your
taxonomy and create a dimensions.xml file containing the
Endeca dimensions for your taxonomy. However, you will not
be able to view the dimensions in the Dimensions view without
the Endeca Manager.

To load a dimension based on a Stratify taxonomy:

1. In the Project tab of Developer Studio, double-click Dimensions. The Dimensions view displays.
2. Click the Discover button to display any externally managed dimensions that you imported in "Integrating a Stratify Taxonomy" on page 213. The dimension or dimensions appear in the Dimensions view with the value of their External Type column set to External.
3. In the Dimensions view, select the dimension based on a Stratify taxonomy and click Values. The Dimension Values view appears with the root dimension value displayed.
4. Select the root dimension value and click Load. The dimension values appear in the Dimension Values view with the value of their External Type column set to External.
5. Repeat steps 3-4 for any additional dimensions based on a Stratify taxonomy that you integrated into your project.


Here is an example of a Dimensions view with one dimension based on a Stratify taxonomy:

Here is an example of a Dimension Values view with the loaded dimension values:


About Synonym Values and Dimension Values

After you load the dimension values, you will notice that
each dimension value has two synonyms. One synonym
is the name of the dimension value used for display in an
application, and the second synonym is an ID required
for dimension value mapping. You should not modify
these synonym values.

The synonym with the name of the dimension value has default settings where Search is checked and (Display) is enabled, as shown in the following example. In this example, these settings allow an application user to search and navigate on the Eating Disorders dimension value.

The ID synonym has default settings on the Synonym editor where Classify is checked and (Display) is disabled, as shown in the following example. The Classify setting instructs Forge to use the ID synonym during record classification.

The ID synonym is based on a topic's id element as shown in the Stratify taxonomy. In the example above, here is a portion of the topic element for the Eating Disorders topic in the Ask Alice taxonomy:

<topic id="2097222" name="Eating Disorders" ...</topic>

An ID synonym based on a Stratify id is not intended to provide an alternative way of describing or searching a dimension value in the same way that synonyms created using Developer Studio are often used. The Endeca Navigation Platform uses the id synonym for dimension value tagging because an integer based on a Stratify id is guaranteed to be unique across other external dimensions, whereas the name of the dimension value is not guaranteed to be unique across other external dimensions.


The following process describes the role of id synonyms in dimension value mapping.
1. After you transform a taxonomy and load a dimension, each dimension value has an id synonym that comes from the topic id in the Stratify taxonomy.
2. After Stratify classifies the unstructured documents, Forge tags each record with properties that include the Stratify topic id and Stratify topic name. Remember, this topic ID is the same as the id synonym. In this example, Forge assigns the properties Endeca.Stratify.Topic.Name.HID15.TID2097222="Eating Disorders" and Endeca.Stratify.Topic.HID15="2097222" after Stratify classifies an unstructured document into the Eating Disorders topic.
3. A property mapper maps a source property named Endeca.Stratify.Topic.HID<hierarchy id> to a target dimension for a Stratify taxonomy. In this example, Endeca.Stratify.Topic.HID15 is mapped to the AskAlice target dimension.
4. Forge uses the ID synonyms to classify records into dimension values during mapping. In this example, Forge classifies any document tagged with Endeca.Stratify.Topic.HID15="2097222", via the ID synonym 2097222, into the Eating Disorders dimension value of the AskAlice dimension.


Mapping a Dimension Based on a Stratify Taxonomy

Using Discover to add the dimension makes it available for explicit mapping as a target dimension. You map the source property Endeca.Stratify.Topic.HID<hierarchy ID> to your target dimension for the taxonomy. All topics within that hierarchy are mapped as dimension values.

If you are using more than one Stratify taxonomy in your pipeline, create one property-to-dimension mapping for each taxonomy. For more information about using a property mapper, see the Endeca Developer Studio Help.

Running the Second Baseline Update

After discovering and loading the dimension and mapping the dimension to a source property, you must run a second baseline update to process and tag your Endeca records.

The second baseline update performs property and dimension mapping that could not be performed in the first baseline update because the Stratify taxonomy had not yet been transformed into an Endeca dimension and made available for mapping.

To run a second baseline update:

1. Ensure you have sent the latest instance configuration to the Endeca Manager.
2. In the Endeca Manager toolbar, click Start Baseline Update.


Note: If you ran the first baseline update with the Forge -n flag,
you should delete the flag before running the second baseline
update.

Updating a Taxonomy in Your Pipeline

If you want to modify an externally managed dimension and replace it with a newer version, you have to revise the taxonomy in Stratify Taxonomy Manager, publish it to the Stratify Classification Server, include it in Developer Studio, and make the other necessary changes in Developer Studio. In effect, you have to repeat the process of integrating Stratify into your pipeline as described in this document. This process is necessary because the Stratify Classification Server must manage all taxonomy changes in order to properly classify documents during Forge's record processing.

SECTION IV
Logging and Performance Features


Chapter 12
Forge Hierarchical Logging System

This section provides a brief introduction to the hierarchical logging system used by Forge. It covers the features and design rationale, as well as an outline of the log category hierarchy. The Forge hierarchical logging system combines cascading log levels, hierarchical message categories, runtime configuration, and the ability to direct log messages to multiple, arbitrary destinations.

Overview

The constituent components of a Forge pipeline tend to generate large log files. When all of these log messages are directed to a single log file, key messages are less easily accessed and often overlooked amid the sheer number of messages.

Providing a means to create logical message groupings, and the capability to place these message groups in separate logs, is the key to organizing this information so that it is easy for a user to locate the data of interest. While this functionality is the primary goal, the logging system was designed with several other important considerations in mind:

• Users should be able to organize log messages hierarchically in logical message groups.
• Users should be able to specify a verbosity level in log messages.

Log Levels and Message Categories

The hierarchical logging system provides the capability to create logical message groupings by specifying two dimensions for each log message:
• Log level, which indicates what types of messages will be logged.
• Message category, which defines the component from which the message comes.

Log Levels

The value of a log level is chosen from the following ordered list:
• DEBUG—debugging messages, to be used during development or in tracking down problems.
• VERBOSE—verbose messages, which give as much information as possible on an event.
• INFO—informational messages, which indicate information that may be of interest, such as internal application state changes and events. These do not indicate errors, and processing should continue with expected output.


• STAT—status messages.
• WARN—warning messages, which indicate that
processing will continue, but the component output
may not be as expected.
• ERR—error messages, which indicate a deviation from
normal processing.
• FTL—fatal messages, which indicate a serious problem
that may cause the issuing component to stop
functioning.

Each log level on the above list includes messages from all lower levels.

[Diagram: nested log levels, from outermost to innermost: DEBUG, VERBOSE, INFO, STAT, WARN, ERR, FTL.]

Message Category

The value of a message category consists of a category name which adheres to a hierarchical naming convention in which the dot (.) character specifies a child category. The category names currently used in Forge are defined below in the Forge Logging Hierarchy section.

A category is an ancestor of another category if its name followed by a dot is a prefix of the descendant category name. A category is a parent of a child category if there are no ancestors between itself and the descendant category.

For example, the category named Edf.Pipeline is a parent of the categories named Edf.Pipeline.RecordPipeline and Edf.Pipeline.DimensionPipeline. Similarly, Edf is a parent of Edf.Pipeline and an ancestor of both Edf.Pipeline.RecordPipeline.RecordServer and Edf.Pipeline.DimensionPipeline.DimensionServer.

This hierarchical naming convention defines a partial ordering of message categories using an inclusion relation. In the example above, the categories Edf.Pipeline.RecordPipeline and Edf.Pipeline.DimensionPipeline are sub-categories that are included within the Edf.Pipeline parent category. The Edf.Pipeline category includes log messages generated by all of its descendent categories. That is, any message in the Edf.Pipeline.RecordPipeline category also belongs to both the Edf.Pipeline category and the Edf category. Log messages bubble up to the root of the message category hierarchy (in our case, Edf).

[Diagram: the Edf category contains Pipeline, which in turn contains RecordPipeline and DimensionPipeline.]

The log level and message category dimensions allow a user of the hierarchical log system to specify arbitrary groupings of messages which can then be directed to specific output destinations, such as a file or one of the standard output streams (stdout, stderr). A message grouping consists of a verbosity level and a message category. The verbosity level is given by one of the log level values and specifies a group. As the diagram above shows, a message grouping with verbosity level INFO would include all the log messages whose log level is INFO, STAT, WARN, ERR or FTL.

A message group's category value specifies that the group include all messages from the specified message category and all of its sub-categories. Thus a message group with category value Edf.Pipeline (continuing with the example above) would include all messages within the categories Edf.Pipeline, Edf.Pipeline.RecordPipeline and Edf.Pipeline.DimensionPipeline.
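For example, using the log.ini syntax described in the next section, such a message group can be mapped to a single destination with one line (the file name here is a placeholder):

logger=Edf.Pipeline;class=stream;file;name=pipeline.log;level=INFO

This line captures all messages of level INFO or less from the Edf.Pipeline category and every category beneath it, including Edf.Pipeline.RecordPipeline and Edf.Pipeline.DimensionPipeline.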


Log Appenders

Log appenders determine how log messages for a specific message group are handled. Each message group can have a set of appenders associated with it, each of which may handle the same message differently. For example, one appender might pipe log messages to stdout while another appender writes the same log message to a file.

The logging system uses a configuration file named log.ini to map log appenders to message groups. Forge looks for the log.ini file in the same directory as the pipeline input file (which means that you cannot move the log.ini file to another location).

As an example, consider the following log.ini file:

VERSION=00.02
logger=Edf.Pipeline.RecordPipeline.Spider;class=stream;file;name=spider.log;level=INFO
logger=Edf.Pipeline.RecordPipeline.Spider.mySpider;class=stream;file;name=myspider.log;level=DEBUG

The first line specifies the version of the log.ini file format in use. The next two lines map message groups to log appenders.

In the first line, the command:

logger=Edf.Pipeline.RecordPipeline.Spider

specifies that the message group shall include all messages in the Edf.Pipeline.RecordPipeline.Spider category. The command:

level=INFO

specifies that the group should include only messages with verbosity level INFO or less. The remaining commands in the first line:

class=stream;file;name=spider.log

specify the behavior of the appender to which this message group should be sent. These commands indicate that the appender is a stream appender of type file with the filename spider.log.

The stream appender implementation supports multiple loggers appending to a single file. Similarly, the second line of the file specifies that all messages in the Edf.Pipeline.RecordPipeline.Spider.mySpider category with verbosity DEBUG or less should be appended to the myspider.log file.

A log.ini file consists of a <first-line> followed by a <line>. These components are defined recursively as follows:

first-line         VERSION=<version-num>
version-num        <digits>.<digits>
digits             0|1|2|3|4|5|6|7|8|9
line               <logger>\n<line> | <comment>\n<line>
comment            Any sequence of characters starting with #
logger             logger=<logger-name>;<appender-list>
logger-name        Any sequence of characters excluding '.', '\n' and '\r'
appender-list      <appender-info> | <appender-info>;<appender-list>
appender-info      class=<appender-class>[;<compression-info>][;<level-info>][;<format-info>]
appender-class     stream;<appender-stream> | unique-stream;<appender-stream>
appender-stream    file;name=<filename> | cout
compression-info   compression=<compression-level>
level-info         level=<log-level>
format-info        format=<format-option>
compression-level  -1|0|1|2|3|4|5|6|7|8|9
format-option      simple | bare | bare-counted
log-level          DEBUG | VERBOSE | INFO | STAT | WARN | ERR | FTL
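As a worked example against this grammar, the following hypothetical line is a <logger> whose <appender-info> exercises the optional compression, level, and format fields (the category and file name are placeholders):

logger=Edf.Pipeline.RecordPipeline;class=unique-stream;file;name=rp_unique.log;compression=3;level=WARN;format=bare-counted

It appends de-duplicated messages of level WARN or less from the Edf.Pipeline.RecordPipeline category to a compressed file named rp_unique.log, with a count appended to each unique message.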


Format of the Appenders

An interpretation of the above BNF results in the following commonly-used sections/keywords.

VERSION: The version number of the log file.

logger: A category name, as described in the Message Category section above.

class: Specifies how messages should be handled when logged. Values are:
• stream—the individual messages should be logged consecutively as they come in.
• unique-stream—each message should be made unique by removing duplicates.

file: Takes no parameter. If this appender is present, indicates that the messages should be written to a disk file.

name: The pathname of the log file. Used only if file is specified.

compression: Can be used if file is specified. Indicates the level of data compression in the log file. The default compression level is -1, which is the same as omitting this appender. Alternatively, you can specify levels 0 through 9.

level: The level of verbosity of the log messages. The value is one of the levels described in the Log Levels section above.

format: Specifies the format of the messages. Values are:
• simple—the standard log message from Forge. It prepends the level (INF:, VER:, WRN:, etc.) and possibly the file name, line number, and a timestamp, depending on other options.
• bare—indicates that timestamps, file names, and log levels should not be included.
• bare-counted—similar to bare, except that a message count is appended in the format ":::count" (where count is the number of messages with the same string). This format should generally be used with the unique-stream class.

Configuring MustMatch Messages

When configuring dimension mapping, you can specify Must Match for the Match mode. The MustMatch mode tags resulting records with any matching dimension values. If none match, a warning message is issued.

To capture all MustMatch messages, use the following logger appender:

logger=Edf.Pipeline.Expression.MustMatch


For example, the following lines will write all MustMatch messages to the match.txt and match.txt.stats files:

logger=Edf.Pipeline.Expression.MustMatch;class=unique-stream;file;name=match.txt;level=DEBUG;format=bare
logger=Edf.Pipeline.Expression.MustMatch;class=unique-stream;file;name=match.txt.stats;level=DEBUG;format=bare-counted

To capture MustMatch messages for a specific property, append the property name to the category name, as in the following example lines:

logger=Edf.Pipeline.Expression.MustMatch.Color;class=unique-stream;file;name=colors.log;level=DEBUG;format=bare
logger=Edf.Pipeline.Expression.MustMatch.Color;class=unique-stream;file;name=colors.log.stats;level=DEBUG;format=bare-counted

Configuring the Dimension Server Match Count Log

A dimension server can be configured with a Match Count log. This log keeps track of how many times during the auto-generation process certain properties were matched.

You can add the following line to the log.ini file to log all MatchCount messages:

logger=Edf.DimensionPipeline.DimensionServer.DimServerName.MatchCount;class=stream;file;name=matchCount.log;level=DEBUG;format=bare

where DimServerName is the configured name of the dimension server and matchCount.log is the name of the file to which the MatchCount messages will be logged.

Reference log.ini File

The following log.ini file serves as a reference for users who want to customize the default behavior of log output in Forge. This file should serve as a reasonable default and should be customized further to account for the specific tasks of each individual pipeline. For example, different pipelines will have different components and component names for the message categories.

The set of available message categories is specified in the next section, titled The Forge Logging Hierarchy. Categories starting with "my" refer to the component names specified in the pipeline. These must always be customized to match the specific name chosen within the pipeline.


VERSION=00.02
# log edf messages of level INFO or less to stdout
logger=Edf;class=stream;cout;level=INFO
# log all edf messages to a file named Edf.log
logger=Edf;class=stream;file;name=Edf.log;level=DEBUG
# log pipeline messages of level INFO or less to pipeline.log
logger=Edf.Pipeline;class=stream;file;name=pipeline.log;level=INFO

# log dimension pipeline messages of level INFO or less to


# dimensionpipeline.log
logger=Edf.Pipeline.DimensionPipeline;class=stream;file;name=
dimensionpipeline.log;level=INFO
# log messages from ANY dimension adapter of level INFO or less to
# dimensionadapter.log
logger=Edf.Pipeline.DimensionPipeline.DimensionAdapter;class=stream;file;
name=dimensionadapter.log;level=INFO
# log messages from the dimension adapter named myDimensionAdapter1 of
# level INFO or less to mydimensionadapter1.log
logger=Edf.Pipeline.DimensionPipeline.DimensionAdapter.myDimensionAdapter1;
class=stream;file;name=mydimensionadapter1.log;level=INFO
# log messages from the dimension adapter named myDimensionAdapter2 of
# level VERBOSE or less to mydimensionadapter2.log, and compress the file
logger=Edf.Pipeline.DimensionPipeline.DimensionAdapter.myDimensionAdapter2;
class=stream;file;name=mydimensionadapter2.log;compression=5;level=VERBOSE
# log record pipeline messages of level INFO or less to
# recordpipeline.log
logger=Edf.Pipeline.RecordPipeline;class=stream;file;name=recordpipeline.log;level=INFO
# log expression messages of level INFO or less to expression.log
logger=Edf.Pipeline.Expression;class=stream;file;name=expression.log;level=INFO

# log DimensionServer messages of level INFO or less to servers.log
logger=Edf.Pipeline.DimensionPipeline.DimensionServer;class=stream;file;name=servers.log;level=INFO


A Simple Reference log.ini File

The following simple log.ini file can be used as-is and requires no customization.
VERSION=00.02

# log edf messages of level INFO or less to stdout
logger=Edf;class=stream;cout;level=INFO
# log edf messages of level VERBOSE or less to a file named
# logs/edf_verbose.log
logger=Edf;class=stream;file;name=logs/edf_verbose.log;level=VERBOSE
# log edf messages of level WARN or less to a file named
# logs/edf_warn.log
logger=Edf;class=stream;file;name=logs/edf_warn.log;level=WARN

# log dimension pipeline messages of level INFO or less to
# logs/dimensionpipeline.log
logger=Edf.Pipeline.DimensionPipeline;class=stream;file;name=logs/dimensionpipeline.log;level=INFO

# log record pipeline messages of level INFO or less to
# logs/recordpipeline.log
logger=Edf.Pipeline.RecordPipeline;class=stream;file;name=logs/recordpipeline.log;level=INFO


The Forge Logging Hierarchy


The following outline shows the log category hierarchy currently used in Forge.

Note: Instance name refers to the name of a component as it is defined in a pipeline.

Edf
  Pipeline
    DimensionPipeline
      DimensionServer
        instance name
      DimensionAdapter
        instance name
    NavConfigPipeline
      NavConfigServer
        instance name
      NavConfigAdapter
        instance name
      NavConfigAssembler
        instance name
    Expression
    RecordPipeline
      Assembler
        instance name
          RecordJoin
      RecordManipulator
        instance name
      RecordAdapter
        instance name
      RecordCache
        instance name
      IndexerAdapter
        instance name
      Spider
        instance name
          RobotExclusion
          URLFilter
      UpdateAdapter
        instance name
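Following the same pattern as the dimension adapter entries in the reference file above, messages from an individual pipeline component can be isolated by appending its instance name to its category. For example, for a hypothetical record adapter named myRecordAdapter1:

logger=Edf.Pipeline.RecordPipeline.RecordAdapter.myRecordAdapter1;class=stream;file;name=myrecordadapter1.log;level=INFO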

Chapter 13
Using Multithreaded Mode

By default, and in most Endeca applications, the Navigation Engine runs in single-threaded mode. In this normal mode of operation, the Navigation Engine processes queries one at a time. Once the Navigation Engine starts processing a new query, it continues working on the query until it is completed.

While working on this query, the Navigation Engine does not work on or respond to other queries. In many cases, this simple method of execution is sufficient. Because most queries complete in a tiny fraction of a second, queries never have to wait long to be serviced, and the Navigation Engine appears immediately responsive at all times.

However, in some cases, this simple “one request at a time” single-threaded execution model does not adequately meet the hardware utilization and/or responsiveness requirements of the application. To address such situations, the Navigation Engine supports a multithreaded execution mode.

Applications that are candidates for multithreaded execution mode generally exhibit one or both of the following characteristics:
• Large memory footprint with high throughput—The
Navigation Engine relies on in-memory index
structures to provide sub-second responses to
complex queries. As the scale of application data
increases, so does the memory required to host a
single instance of the Navigation Engine. Throughput
scalability of the Navigation Engine is generally
achieved by running multiple independent instances.
For example, if a single Navigation Engine can service
20 operations per second on a given data set and
query load but site traffic is 60 operations per second,
then three Navigation Engine instances should be run
(each on its own CPU) to ensure sufficient application
throughput.
For applications with small-to-medium data scale, the
cost of hardware to service additional load is
reasonable, as each additional CPU need only be
coupled with a moderate amount of RAM. But for
applications with large-data scale, each additional CPU
would need to be configured with up to 4GB of RAM
in single-threaded mode, which can play a significant
role in hardware cost.
Multithreaded execution mode enables more efficient
utilization of RAM via SMP configurations. For
example, if the data scale requires 4GB of RAM, and
query throughput requires four CPUs, multithreaded
execution allows the site to be hosted on a single
quad-processor machine with 4GB of RAM, rather than a more costly option such as four single-processor machines, each with 4GB of RAM. In addition to reduced hardware costs, this approach simplifies system management and network architecture, and reduces the hardware hosting space required.
• Long-running queries—For applications that rely on
commonly-used Navigation Engine features, the vast
majority of queries complete in a fraction of a second.
This allows the Navigation Engine to remain
responsive at all times. However, some applications
that make use of more advanced features (such as
computing complex aggregate analytics) will
encounter longer running queries. For such
applications, multithreaded mode allows the
Navigation Engine to remain responsive while
working on long-running queries.

Understanding Multithreaded Mode


In multithreaded mode, the Navigation Engine is started
with a pool of worker threads. These threads represent
sequences of program execution managed and scheduled
by the operating system and hosted within the Navigation
Engine process.

Each worker thread acts like an independent Navigation Engine, processing queries one at a time. But the threads share data, memory, and the server network port. Essentially, this allows a multithreaded Navigation Engine with N threads to appear as a single Navigation Engine process that can work on N queries at a time. Each of the independent worker threads can run on independent CPUs, allowing a single multithreaded Navigation Engine to make use of multi-processor hardware.

Multiple threads can also share a CPU, allowing a multithreaded Navigation Engine running on a single-CPU host to remain responsive as long-running queries are handled.

Costs of Multithreaded Mode


Although the benefits of threaded mode are critical in
certain applications, multithreaded mode is not without
costs, and should not be engaged in situations where it is
unnecessary. Because the worker threads in
multithreaded mode share data, memory, and other
resources, they must synchronize their execution using
mechanisms such as OS-supported locks.

These synchronization operations introduce runtime overhead that is avoided in the single-threaded case. Because of this, a multithreaded Navigation Engine typically provides slightly lower performance than multiple independent single-threaded Navigation Engine processes. Performance is discussed in more detail below; as a general rule, if an application does not exhibit one or both of the characteristics described above on page 244, single-threaded mode is the recommended approach.


Configuration for Multithreaded Mode


By default, the Navigation Engine runs in non-threaded
mode. To enable multithreaded mode, specify the
--threads option along with the number of worker
threads desired when starting your Navigation Engine
(Dgraph).

For example:

--threads 4

will start the Navigation Engine in multithreaded mode with 4 worker threads.
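For instance, a complete startup command might look like the following sketch; the port number and index path are illustrative assumptions, and only the --threads option is the subject here:

dgraph --port 8000 --threads 4 /data/apps/wine/dgidx_output/wine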

Multithreaded Navigation Engine Performance


The performance possible with a multithreaded
Navigation Engine process is a function of a number of
factors, such as:
• Base, single-threaded performance, given the
application data and query profile
• Number of CPUs on the host system
• Query characteristics
• Host operating system

Generally, on a host system with N CPUs, where a single non-threaded Navigation Engine process can serve K operations/second of query load, N or more independent Navigation Engine processes will serve somewhat less than N times K, commonly in the 80-90% utilization range. In other words, given the base single-instance performance of K, the expected N-processor performance is given by U × K × N, where 0.8 ≤ U ≤ 0.9.

The expected performance for one multithreaded Navigation Engine on an N-processor machine is similar, but generally somewhat less. In this case, the expected performance is given by the above formula, except with utilization in the 60% to 80% range (0.6 ≤ U ≤ 0.8).

For example, if a single Navigation Engine provides 20 ops/sec on a given load, running two Navigation Engines on a dual-processor host may provide around 36 ops/sec (U=90%, K=20, N=2). Running the same application with a single multithreaded Navigation Engine may provide 32 ops/sec (U=80%, K=20, N=2).
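The arithmetic behind these estimates is easy to check in code. The following is a minimal Java sketch (not part of any Endeca API) that evaluates the U × K × N formula; the optional hyperthreading factor H anticipates the discussion under “Hyperthreaded Intel Processors” below:

/** Expected-throughput sketch based on the U x K x N (x H) formulas in this chapter. */
public class ThroughputEstimate {

    // u: utilization (0.8-0.9 for independent processes; 0.6-0.8 for one
    //    multithreaded Navigation Engine)
    // k: single-instance throughput in ops/sec
    // n: number of physical CPUs
    // h: hyperthreading boost (1.0 if none; 1.2-1.4 on hyperthreaded hosts)
    static double expectedOpsPerSec(double u, double k, double n, double h) {
        return u * k * n * h;
    }

    public static void main(String[] args) {
        // Two independent single-threaded engines on a dual-processor host:
        System.out.println(expectedOpsPerSec(0.9, 20, 2, 1.0)); // 36.0
        // One multithreaded engine on the same host:
        System.out.println(expectedOpsPerSec(0.8, 20, 2, 1.0)); // 32.0
    }
}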

Application Query Characteristics

Actual utilization depends on a number of factors. One important factor is the types of queries commonly executed in the application. While most query operations (such as basic data navigation, merchandising, or sorting) provide good concurrency, and hence high utilization, other operations require costly thread synchronization and reduce performance. For example, spelling correction relies on a non-thread-safe spelling library, which requires worker thread synchronization. Thus, if most queries in an application require spelling correction, processor utilization is likely to be on the lower end of the expected range.

Thread Pool Size and OS Platform

Other factors that impact performance and processor utilization are the size of the thread pool and the host operating system type. For example, on the Solaris operating system, which provides an efficient hierarchical threads system, the Navigation Engine will exhibit little decrease in performance at higher thread pool sizes. On other systems, such as Windows and Linux, performance will degrade at larger thread pool sizes.

In general, we recommend using one to four threads per processor for good performance in most cases. The actual optimal number of threads for a given application depends on many factors, and is best determined through experimental performance measurements using expected query load on production data.

The following sections describe detailed issues for specific platforms.

Hyperthreaded Intel Processors

Hyperthreaded Intel processors appear at the application level like two CPUs. For example, a host with two hyperthreaded physical processors will appear to applications like a machine with four normal processors. Despite this virtual view, these apparent processors share resources, and don’t provide the full performance of true independent processors. A single hyperthreaded processor generally provides a 20% to 40% performance boost to multiprocess or multithreaded applications.

The expected performance of the Navigation Engine on a hyperthreaded system is U × K × N × H, where 1.2 ≤ H ≤ 1.4, and where U, K, and N are as described above in “Multithreaded Navigation Engine Performance” on page 247.

For example, if a single Navigation Engine provides 20 ops/sec on a given load, running four Navigation Engines on a dual-processor hyperthreaded host (at least one Navigation Engine must run on each logical processor to achieve the full benefits of hyperthreading) may provide around 46.8 ops/sec (U=90%, K=20, N=2, H=1.3). Running the same application with a single multithreaded Navigation Engine may provide 41.6 ops/sec (U=80%, K=20, N=2, H=1.3).

Linux

On Linux, the Navigation Engine makes use of the LinuxThreads implementation of POSIX threads (pthreads) for low-level thread services. In this implementation, threads appear in the system as independent processes.

For example, if a Navigation Engine is started with a worker thread pool size of four (--threads 4), it will cause six processes to appear in the process table for the system (which can be examined in tools such as ps and top). These six processes correspond to: four for worker threads, one for the main startup thread of the Navigation Engine, and one for a control process used by the LinuxThreads implementation itself.

This is normal behavior and does not indicate that these processes incur the normal costs of independent processes. For example, these processes share memory space with one another, allowing them to provide the normal benefits of the multithreaded Navigation Engine.

Solaris

On Solaris (Sparc and Intel hardware platforms), the Navigation Engine uses the native Solaris threads interface (as opposed to the pthreads compatibility interface), and is linked against the alternate Solaris threads implementation (described in the threads(3THR) Solaris man page). This threads implementation provides a light-weight, single-level, higher performance alternative to the standard two-level Solaris threads implementation.

Windows

On Windows, the Navigation Engine uses native Win32 threads. The thread count for a Navigation Engine can be examined in the Windows Task Manager in the Threads column (the number of threads listed will be one greater than the value specified for the --threads option; the additional thread is the main thread of the Navigation Engine).

Chapter 14
Coremetrics Integration

Endeca offers integration with the Coremetrics Online Analytics product through an integration module that is packaged with the Endeca reference library. The integration module contains the code required to capture search terms information and enable the Coremetrics On-Site Search Intelligence report. Coremetrics integration is offered for the JSP, ASP, and ASP .NET versions of the UI reference implementation.

All of the reference implementations assume that the code supplied by Coremetrics is located in the /coremetrics directory at the root of your application server. If you have installed Coremetrics in another directory, or are using a different version of Coremetrics, you will have to modify the coremetrics include statement in the integration module. In addition, the reference implementations are set up to point to the Coremetrics test server. In order to enable Coremetrics integration for production, you must add a cmSetProduction() call above the cmCreatePageviewTag() call in the integration module.
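For example, in the JSP version of the integration module, the production switch is a one-line addition along these lines (a sketch only; the surrounding markup is an assumption, and only the ordering of the two calls comes from the requirement above):

<script type="text/javascript">
  cmSetProduction();  // add this call above the existing page view tag
  // ... the module's existing cmCreatePageviewTag() call follows here ...
</script>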

Using the Integration Module


Each reference implementation has a new module that contains the logic for when to include the Coremetrics tags:
• For the JSP, the integration code is in coremetrics.jsp.
• For the ASP, the integration code is in coremetrics.asp.
• For the ASP .NET, the integration code is in coremetrics.aspx.

Each reference implementation also has a commented-out include statement. Uncomment this statement to enable the Coremetrics code:
• For the JSP, the include statement is in nav.jsp.
• For the ASP, the include statement is in controller.asp.
• For the ASP .NET, the include statement is in controller.aspx.
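In the JSP version, for instance, the enabled include might look like the following sketch; the directive form and file path are assumptions about how the reference implementation packages the module:

<%-- In nav.jsp, uncomment the Coremetrics include: --%>
<%@ include file="coremetrics.jsp" %>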

SECTION V
Other Advanced Features
Implementing Merchandising and
Content Spotlighting

This chapter describes how to implement merchandising in Endeca InFront and content spotlighting in Endeca ProFind using dynamic business rules. The chapter includes the following sections:
• Introduction to Dynamic Business Rules and Promoting
Records
• Suggested Workflow Using Endeca Tools to Promote
Records
• Building the Supporting Constructs for a Rule
• Creating Rules
• Presenting Rule Results in a Web Application
• Grouping Rules
• Performance Considerations for Dynamic Business Rules
• Using an Agraph and Dynamic Business Rules
• Applying Relevance Ranking to Rule Results
• About Overloading Supplement Objects

Introduction to Dynamic Business Rules and Promoting Records
Endeca provides the functionality to promote contextually
relevant records to application users as they search and
navigate within a data set. In Endeca InFront, this activity
is called merchandising because the Endeca records you
promote often represent product data. In Endeca ProFind,
this activity is called content spotlighting because the
Endeca records you promote often represent some type
of document (HTML, DOC, TXT, XLS, and so on). For the
sake of simplicity, this document uses “promoting
records” to generically describe both merchandising and
content spotlighting.

You implement merchandising and content spotlighting using dynamic business rules. The rules and their supporting constructs define when to promote records, which records may be promoted, and also indicate how to display the records to application users.

Here is a simple merchandising example using a wine data set. An application user enters a query with the search term Bordeaux. This search term triggers a rule that is set up to promote wines tagged as Best Buys. In addition to returning standard query results for the term Bordeaux, the rule instructs the Navigation Engine to dynamically generate a subset of records that are tagged with both the Best Buy and Bordeaux properties. The Web application displays the standard query results that match Bordeaux and also displays some number of the rule results in an area of the screen set aside for “Best Buy” records. These are the promoted records.

Comparing Dynamic Business Rules to Content Management Publishing

Endeca’s record promotion works differently from traditional content management systems (CMS), where you select an individual record for promotion, place it on a template or page, and then publish it to a Web site. Endeca’s merchandising is dynamic, or rule based. In rule-based merchandising, a dynamic business rule specifies how to query for records to promote, and not necessarily what the specific records are.

This means that, as your users navigate or search, they continue to see relevant results, because appropriate rules are in place. Also, as records in your data set change, new and relevant records are returned by the same dynamic business rule. The rule remains the same, even though the promoted records may change.

In a traditional CMS scenario, if Wine A is “Recommended,” it is identified as such and published onto a static page. If you need to update the list of recommended wines to remove Wine A and add Wine B to the static page, you must manually remove Wine A, add Wine B, and publish the changes.

With Endeca’s dynamic record promotion, the effect is much broader and requires much less maintenance. A rule is created to promote wines tagged as “Recommended,” and the search results page is designed to render promoted wines.

In this scenario, a rule promotes recommended Wine A on any number of pages in the result set. In addition, removing Wine A and adding Wine B is simply a matter of updating the source data to reflect that Wine B is now included and tagged as “Recommended.” After making this change, the same rule can promote Wine B on any number of pages in the result set, without adjusting or modifying the rule or the pages.

Dynamic Business Rule Constructs

Two constructs make up a dynamic business rule:
• Trigger—a set of conditions that must exist in a query
for a rule to fire. A trigger may include dimension
values, a keyword or phrase, time values, and
user-profiles. When a user’s query contains a
condition that triggers a rule, the Navigation Engine
evaluates the rule and returns a set of records that are
candidates for promotion to application users.
• Target—specifies which records are eligible for
promotion to application users. A target may include
dimension values, custom properties, and featured
records. For example, dimension values in a target are
used to identify a set of records that are candidates for
promotion to application users.


Three additional constructs support rules:
• Zone—specifies a collection of rules to ensure that
rule results are produced in case a single rule does not
provide a result.
• Style—specifies the minimum and maximum number
of records a rule can return. A style also specifies any
property templates associated with a rule. Rule
properties are key/value pairs that are typically used
to return supplementary information with promoted
record pages. For example, a property key might be
set to “SpecialOffer” and its value set to
“BannerAd.gif”.
A rule’s style is passed back, along with the rule’s results, to the Web application. The Web application
uses the style as an indicator for how to render the
rule’s results.
Note: The code to render the rule’s results is part of the Web
application, not the style itself.
• Rule Group (optional)—provides a means to logically organize large numbers of rules into categories. This organization facilitates editing by multiple business users.

The core of a dynamic business rule is its trigger and target values. The target identifies a set of records that are candidates for promotion to application users. The zone and style settings associated with a rule work together to restrict the candidates to a smaller subset of records that the Web application then promotes.


Query Results and Rules

Once you implement dynamic business rules in your application, each query a user makes is compared to each rule to determine if the query triggers a rule. If a user's query triggers a rule, the Navigation Engine returns several types of results:
• Standard record results for the query.
• Promoted records specified by the triggered rule’s
target.
• Any rule properties specified for the rule.

Two Examples of Promoting Records

The following sections explain two examples of using dynamic business rules to promote Endeca records. The first example shows how a single rule provides merchandising results when an application user navigates to a dimension value in a data set. The scope of the merchandising coverage is somewhat limited by using just one rule.

The second example builds on the first by providing broader merchandising coverage. In this example, an application user triggers two additional dynamic business rules by navigating to the root dimension value for the application. These two additional rules ensure that merchandising results are always presented to application users.


An Example with One Rule Promoting Records

This simple example demonstrates a basic merchandising scenario where an application user navigates to Wine Type > White, and a dynamic business rule called “Recommended Chardonnays” promotes chardonnays that have been tagged as Highly Recommended. From a merchandising perspective, the marketing assumption is that users who are interested in white wines are also likely to be interested in highly recommended chardonnays.

The “Recommended Chardonnays” rule is set up as follows:
• The rule’s trigger, which specifies when to promote
records, is the dimension value Wine Type > White.
• The rule’s target, which specifies which records to
promote, is a combination of two dimension values,
Wine Type > White > Chardonnay and Designation >
Highly Recommended.
• The style associated with this rule is configured to
provide a minimum of at least one promoted record
and a maximum of exactly one record.
• The zone associated with this rule is configured to
allow only one rule to produce rule results.

When an application user navigates to Wine Type > White in the application, the rule is triggered. The Navigation Engine evaluates the rule and returns promoted records from the combination of the Chardonnay and Highly Recommended dimension values. There may be a number of records that match these two dimension values, so zone and style settings restrict the number of records actually promoted to one.

The promoted record, along with the user’s query and standard query results, is called out in the following graphic:

[Graphic: a results page calling out the user’s query (also the trigger value), the standard results for the query Wine Type > White, and the rule results for Recommended Chardonnays.]


An Expanded Example with Three Rules

The previous example used just one rule to merchandise highly recommended chardonnays. The following example expands on the previous one by adding two more rules called “Best Buys” and “Highly Recommended.” These rules merchandise wines tagged with a Best Buy property and a Highly Recommended property, respectively. Together, the three rules merchandise records to expose a broader set of potential wine purchases.

The “Best Buys” rule is set up as follows:
• The rule’s trigger is set to the Web application’s root
dimension value. In other words, the trigger always
applies.
• The rule’s target is the dimension value named Best
Buy.
• The style associated with this rule is configured to
provide a minimum of four promoted records and a
maximum of eight records.
• The zone associated with this rule is configured to
allow only one rule to produce rule results.

The “Highly Recommended” rule is set up as follows:
• The rule’s trigger is set to the Web application’s root
dimension value. In other words, the trigger always
applies.
• The rule’s target is the dimension value named Highly
Recommended.


• The style associated with this rule is configured to provide a minimum of at least one promoted record and a maximum of three records.
• This is the only rule associated with the zone, so no other rules are available to produce results. For details on how zones can be used when more rules are available, see “Ensuring Promoted Records are Always Produced” on page 270.

When an application user navigates to Wine Type > White, the “Recommended Chardonnays” rule fires and provides rule results as described in “An Example with One Rule Promoting Records”. In addition, the Highly Recommended and Best Buys rules also fire and provide results because their triggers always apply to any navigation query.

The promoted records for each of the three rules, along with the user’s query and standard query results, are called out in the following graphic:

[Graphic: a results page calling out the rule results for Recommended Chardonnays, the user’s query and trigger value, the standard results for the query Wine Type > White, the rule results for Highly Recommended, and the rule results for Best Buys.]


Suggested Workflow Using Endeca Tools to Promote Records
You can build dynamic business rules and their constructs
in Developer Studio. In addition, business users can use
Web Studio to perform any of the following rule-related
tasks:
• Create a new dynamic business rule.
• Modify an existing rule.
• Deploy a rule to a preview application and test or
audit its results.

Because either tool can modify a project, the tasks involved in promoting records require coordination between the pipeline developer and the business user. The recommended workflow is as follows:
1. A pipeline developer uses Developer Studio in a development environment to create the supporting constructs for rules (zones, styles, rule groups, and so on), and perhaps a small number of dynamic business rules as placeholders or test rules.
2. An application developer creates the Web application, including rendering code for each style.
3. The pipeline developer makes the project available to business users via the Endeca Manager.
4. A business user starts Endeca Web Studio to access the project, create new rules, modify rules, and test the rules as necessary.


For general information about using Endeca tools and sharing projects, see the Endeca Tools Guide. Web Studio tasks are described in the Endeca Web Studio Help.

Note: Any changes to the constructs that support rules, such as changes to zones, styles, rule groups, and property templates, have to be performed in Endeca Developer Studio.

Incremental Implementation

Merchandising and content spotlighting are complex features to implement, and the best way to develop your dynamic business rules is incrementally, as you and business users of Web Studio coordinate tasks. It is also helpful to define the purpose of each dynamic business rule in the abstract (before implementing it in Developer Studio or Web Studio) so that everyone knows what to expect when the rule is implemented. If rules are only loosely defined when implemented, they may have unexpected side effects.

Begin with a single, simple business rule to become familiar with the core functionality. Later, you can add more advanced elements, along with additional rules, rule groups, zones, and styles. As you build the complexity of how you promote records, you will have to coordinate the tasks you do in Developer Studio (for example, zone and style definitions) with the work that is done in Web Studio.


Building the Supporting Constructs for a Rule


As discussed in “Dynamic Business Rule Constructs” on page 260, the records identified by a rule’s target are candidates for promotion and may or may not all be promoted in a Web application. It is a combination of zone and style settings that work together to effectively restrict which rule results are actually promoted to application users.
• A zone identifies a collection of rules to ensure at least
one rule always produces records to promote.
• A style controls the minimum and maximum number
of results to display, defines any property templates,
and indicates how to display the rule results to the
Web application.

The following sections describe zone and style usage in detail.

Ensuring Promoted Records are Always Produced

You ensure promoted records are always produced by creating a zone in Developer Studio to associate with a number of dynamic business rules. A zone is a logical collection of rules that allows you to have multiple rules available, in case a single rule does not produce a result. The rules in a zone ensure that the screen space dedicated to displaying promoted records is always populated.


A zone has a rule limit that dictates how many rules may successfully return rule results. For example, if three rules are assigned to a certain zone but the “Rule limit” is set to one, the Navigation Engine stops after the first rule that successfully provides rule results; any remaining rules in the zone are ignored.

To create a zone in Developer Studio:

1. In the Project Explorer, expand Dynamic Business Rules.
2. Double-click Zones to open the Zones view.
3. Click New to open the Zone editor.
4. In the Name field, provide a unique name for the
zone.
5. (optional) If you want to limit the number of rules that
can provide rule results within a zone, type a number
in “Rule limit.”
6. (optional) If you want to randomly order the rules in
the zone, select the “Shuffle rules” check box. When
checked, this indicates that the Navigation Engine
randomly shuffles the order of the rules within this
zone before evaluating them.
7. (optional) Select “Valid for search” to indicate whether
a zone (and all of the rules associated with that zone)
is valid for navigation queries that include a record
search parameter. Rules that include a keyword trigger
require the “Valid for search” setting.
8. (optional) If you want to prevent the same record from appearing multiple times for multiple rules, check "Unique by this dimension/property" and specify a unique record criterion. Selecting a dimension or property allows the Navigation Engine to identify individual records and prevent the same record from appearing multiple times. If you check "Unique by this dimension/property" and do not select a dimension or property, the same record may appear multiple times for multiple rules within a single zone.
9. Click OK.

Creating a Style

You create a style in the Styles view of Endeca Developer Studio. A style serves three functions:
• It controls the minimum and maximum number of
records that may be promoted by a rule.
• It defines property templates, which facilitate
consistent property usage between pipeline
developers and business users of Web Studio.
• It indicates to a Web application which rendering code
should be used to display a rule’s results.

Controlling the Number of Promoted Records

Styles can be used to affect the number of promoted records in two scenarios:
• A rule produces less than the minimum number of
records. For example, if the “Best Buys” rule produces
only two records to promote and that rule is assigned
a style that has Minimum Records set to three, the rule
does not return any results.


• A rule produces more than the maximum. For example, if the “Best Buys” rule produces 20 records, and the Maximum Records value for that rule’s style is five, only the first five records are returned.

If a rule produces a set of records that falls between the minimum and maximum settings, the style has no effect on the rule’s results.

Performance and the Maximum Records Setting

The Maximum Records setting for a style prevents dynamic business rules from returning a large set of matching records, potentially overloading the network, memory, and page size limits for a query. For example, if Maximum Records is set to 1000, then 1000 records could potentially be returned with each query, causing significant performance degradation.

Ensuring Consistent Property Usage with Property Templates

As discussed in the “Dynamic Business Rule Constructs” section, rule properties are key/value pairs typically used to return supplementary information with promoted record pages. For example, a property key might be set to “SpecialOffer” and its value set to “BannerAd.gif.”

As Web Studio users and Developer Studio users share a project with rule properties, it is easy for a key to be mistyped. If this happens, the supplementary information represented by a property does not get promoted correctly in a Web application. To address this, you can optionally create property templates for a style.


Property templates ensure that property keys are used consistently when pipeline developers and Web Studio users share project development tasks.

If you add a property template to a style in Endeca Developer Studio, that template is visible in Web Studio in the form of a pre-defined property key with an empty value. Web Studio users are allowed to add a value for the key when editing any rule that uses the template’s associated style. Web Studio users are not allowed to edit the key itself.

Furthermore, pipeline developers can restrict Web Studio users to creating new properties based only on property templates, thereby minimizing potential mistakes or conflicts with property keys.

For example, a pipeline developer can add a property template called “WeeklyBannerAd” and then make the project available to Web Studio users. Once the project is loaded in Web Studio, a property template is available with a populated key called “WeeklyBannerAd” and an empty value. The Web Studio user provides the property value. In this way, property templates reduce simple project-sharing mistakes such as creating a similar, but not identical property called “weeklybannerad”.

Note: Property templates are associated with styles in Developer Studio, not rules. Therefore, they are not available for use on the Properties tab of the Rule editor.


Indicating How to Display Promoted Records

You indicate how to display promoted records to users by creating a style to associate with each rule and by creating application-level rendering code for the style. You create a style in Developer Studio. You create rendering code in your Web application. This section describes how to create styles. Rendering code is described later in “Adding Web Application Code to Render Rule Results” on page 301.

A style has a name and an optional title. Either the name or title can be displayed in the Web application. When the Navigation Engine returns rule results to your application, the engine also passes the name and title values to your application. The name uniquely identifies the style. The title does not need to be unique, so it is often more flexible to display the title if you use the same title for many dimension value targets; for example, the title “On Sale” may be commonly used.

Without application-level rendering code that uses the specific style or title values, the style and title are meaningless. Both require application-level rendering code in an application.

To create a style in Developer Studio:

1. In the Project Explorer, double-click Styles to open the Styles view.
2. Click New to open the Style editor.
3. In the Name field, provide a unique name for the
style.


4. If desired, specify a title in the Title field. The title does not need to be unique.
5. In the Minimum Records field, specify the minimum
number of records that must be returned by a rule’s
target in order for that rule to return results. (The
default Minimum Records value, if not specified, is
zero.)
6. In the Maximum Records field, specify the maximum
number of records that can be returned for a rule.
(The default Maximum Records value, if not specified,
is ten. If Maximum Records is set to zero, the rule
returns zero records.)
7. If you want to create a property template, click Add in
the Property Templates frame. The Property Template
dialog box displays.
a. Provide the key for the property template.
b.Click OK.
c. Repeat for as many new property templates as
necessary.
8. If you need to remove a property template, select it in
the Property Templates frame and click Remove.
9. If you want to allow Web Studio users the ability to
create new properties for a rule, check “Allow
additional, custom properties.” Unchecking this option
prevents Web Studio users from creating new
properties to associate with a rule. In other words,
Web Studio users will be restricted to creating properties based only on the property templates you have created in Developer Studio.
10. Click OK.

Creating Rules
After you have created your zones and styles, you can
start creating the rules themselves.

Note: It is not necessary to create rule groups unless your application requires it. If you want to create rule groups before you create a rule, see “Grouping Rules” on page 301.

As mentioned in “Suggested Workflow Using Endeca Tools to Promote Records”, a developer usually creates the preliminary rules and the other constructs in Developer Studio, and then hands off the project to a business user to fine-tune the rules and create additional rules in Web Studio. However, the business user can use Web Studio to perform any of the tasks described in the following sections that are related to creating a rule. For details, see Endeca Web Studio Help.

The following sections guide you through creating a rule using Developer Studio, including the rule’s trigger, targets, featured records, properties, and result ordering.


Creating a Rule and Ordering Its Results

This section describes how to create a dynamic business rule and sort its results.

To create a dynamic business rule in Developer Studio:

1. In the Project Explorer, expand Dynamic Business Rules.
2. Make sure you have already created at least one zone
and one style.
3. Double-click Rules, which opens the Rules view and
also activates the Rule menu on the menu bar.
4. From the Rule menu, select New. The Rule editor
displays.
5. In the Name text box, enter a unique name for the
new rule.
6. From the Zone list, choose a zone to associate with
the rule.
7. From the Style list, choose a style to associate with the
rule.
8. (optional) From the “Members of this rule group” list,
select a rule group to which this rule belongs. Rule
groups are optional. This field is unavailable if no rule
groups have been defined. For more information, see
“Grouping Rules” on page 301.
9. If you want to sort the rule’s promoted records by a
property or dimension value, select the property or
dimension value from the Sort key list. Select [None] to
accept the default sort order specified for the project.


10. If you chose a Sort key, choose Ascending or Descending to define sort order.
11. If you want to randomly order the promoted records for this rule, select Shuffle. Selecting Shuffle overrides any Sort key and Order options you specified above.
12. If you want to exclude promoted results that resemble the standard results of a query, select Self-pivot. This control ensures the Navigation Engine does not return the same records twice for a query, once as the standard navigation results and then again as the promoted results.
13. Continue with the following section to define the rule’s trigger.

Specifying When to Promote Records

You indicate when to promote records on the Triggers tab and on the Time Triggers tab of the Rule editor. If a user’s query matches a trigger, the Navigation Engine evaluates the rule. A rule may have any of the following triggers:
• Dimension triggers
• Time triggers
• Keyword triggers
• User-profile triggers
• Any combination of the above.

A dimension trigger is a collection of one or more dimension values that identify when a dynamic business rule is evaluated (or fired) by the Navigation Engine. If a user’s query contains the dimension values identified in a rule’s trigger, the Navigation Engine evaluates that rule. For example, in a wine data set, you could set up a rule that is triggered when a user clicks Red. If the user clicks White, the Navigation Engine does not evaluate the rule. If the user clicks Red, the Navigation Engine evaluates the rule and returns any promoted records.

In addition to specifying explicit dimension value triggers, dimension value triggers can also be empty (unspecified) on the Triggers tab. When there is no explicit dimension value trigger for a rule, any navigation query may trigger the rule merely by navigating to the root dimension value for an application (N=0). Such a rule effectively has a global trigger: any query from the root dimension value triggers the rule.

A time trigger is a date/time value. If a user makes a query after a specified start time and before a specified expiration time, then the Navigation Engine fires the associated rule.

A keyword trigger is a single word or phrase. If a user’s query includes that word or phrase, then the Navigation Engine fires the associated rule. Keyword triggers require that the zone associated with the rule have “Valid for search” enabled on the Zone editor in Developer Studio. Keyword triggers also require a match mode that specifies how the query keyword should match in order to trigger the rule. There are three match modes:


• Phrase—A user’s query must match all of the words of the keyword trigger, in the same order, for the rule to fire. Phrase mode also allows the rule to fire if the spelling and stemming corrections of a user’s query match the keyword triggers.
• All—A user’s query must match all of the words of the keyword trigger, without regard for order, for the rule to fire. All mode also allows the rule to be eligible if the spelling and stemming corrections of a user’s query match the keyword triggers.
• Exact—A user’s query must exactly match the keyword trigger for the rule to fire. Unlike the other two modes, a user’s query must exactly match the keyword triggers in the number of words and cannot be a superset of the keyword triggers. Exact mode does not allow the rule to be qualified by spelling or stemming corrections.

A user-profile trigger is a label, such as premium_subscriber, that identifies an application user. If a user who has such a profile makes a query, the query triggers the associated rule. For more information, see “Implementing User Profiles” on page 311.

To specify when to promote records:

1. In the Rule editor for the rule you want to configure, click the Triggers tab.
2. If you want to add a dimension value trigger, click Add. The Select Dimension Value dialog box displays.
a. Choose a dimension value from the list.


b. Repeat step 2 if you want to add multiple dimension values to the rule’s trigger. Note that only one dimension value within each dimension may be selected.
3. Check Inherit to allow child dimension values of the
specified dimension value to also trigger the rule. If
unchecked, a query can trigger the rule only when a
user navigates to the exact dimension value you
specify.
Note: If a rule has an empty dimension value trigger (a
global trigger), checking Inherit triggers the rule at any
navigation state because the rule is inheriting from the root.
This scenario has performance implications because the
Navigation Engine must evaluate the rule at every
navigation state. For details, see “Performance
Considerations for Dynamic Business Rules” on page 305.
Unchecking Inherit, with an empty trigger, triggers the rule
at the root dimension value but does not trigger the rule at
any other navigation state.
4. To add a keyword trigger, type a keyword or phrase in
Keyword trigger. You can provide only one term or
phrase per rule.
5. If you provided a keyword, select a match mode from
the Match mode drop-down list. You can choose
Phrase, All, or Exact as explained above.
6. If you want to add a user profile trigger, select a
pre-defined user profile from the drop-down list.
7. Go on to the following section to specify a time trigger
or see “Specifying Which Records to Promote” on
page 284 to specify the rule’s target.


Specifying a Time Trigger to Promote Records

You specify a time trigger on the Time Trigger tab of the Rule editor. A trigger specified on this tab is a date/time value that indicates the time at which to start the rule’s trigger and the time at which the trigger ends. Any matching query that occurs between these two values triggers the rule.

A time trigger is useful if you want to promote records for a particular period of time. For example, you might create a rule called “This Weekend Only Sale” whose time trigger starts Friday at midnight and expires on Sunday at 6 p.m.

Only a start time value is required for a time trigger. If you do not specify an expiration time, the rule can be triggered indefinitely.

To specify a time trigger:

1. In the Rule editor for the rule you want to configure, click the Time Trigger tab.
2. Select “Give this rule a time trigger” to enable the Start time options.
3. From the “Start time” drop-down list, select the start time for the time trigger.
4. If desired, select “Give this rule an expiration date” and, from the “Expiration time” drop-down list, choose the end time for the time trigger. If you do not specify an expiration time, the trigger does not expire.
5. Go on to the following section to specify the rule’s
target.

Synchronizing Time Zone Settings

The start time and expiration time values do not specify time zones. The server clock that runs your Web application identifies the time zone for the start and expiration times. If your application is distributed on multiple servers, you must synchronize the server clocks to ensure the time triggers are coordinated.

Specifying Which Records to Promote

You indicate which records to promote by specifying a target on the Targets tab of the Rule editor. A target is a collection of one or more dimension values. These dimension values identify a set of records that are all candidates for promotion. Zone and style settings further control the specific records that are actually promoted to a user.

To specify which records to promote:

1. In the Rule editor for the rule you want to configure, click the Targets tab.
2. Click Add. The Select Dimension Value dialog box displays.
3. Select a dimension value from the list and click OK.


4. If necessary, repeat the above steps to add multiple dimension values to the target.
Note: You cannot add auto-generated dimension values until you first load and promote the dimension values. See the Endeca Developer Studio Help for more information.
5. Check “Augment navigation state” to add target
dimension values to the existing navigation state when
evaluating the rule. If not checked, the Navigation
Engine ignores all current navigation state filters when
evaluating the rule. Navigation state filters include
dimension value selections, record search, range
filters, and so on. The one exception is custom catalog
filters, which always apply regardless of this setting.
For example, if checked, and a user navigates to Wine
Type > Red and that triggers a rule that promotes
wines under $10, then the rule results will include
only red wines that are under $10. The rule results are
always a subset of the standard query results.
Conversely, if “Augment navigation state” is not
checked, and a user navigates to Wine Type > Red
which triggers a rule that promotes wines under $10,
then the rules results will include any wine type (red,
white, sparkling) with wines under $10. The rule
results are not a subset of the standard query results.
6. Click OK if you are finished configuring the rule, or
proceed with the following sections to promote
custom properties or featured records.


Adding Custom Properties to a Rule

You can optionally promote custom properties by creating key/value pairs on the Properties tab of the Rule editor. Rule properties are typically used to return supplementary information with promoted record pages. Properties could specify editorial copy, point to rule-specific images, and so on. For example, a property name might be set to “SpecialOffer” and its value set to “BannerAd.gif.”

You can add multiple properties to a dynamic business rule. These properties are accessed with the same method calls used to access system-defined properties that are included in a rule’s results, such as a rule’s zone and style. For details, see “Adding Web Application Code to Extract Rule Results” on page 293.

To add a custom rule property:

1. In the Rule editor for the rule you want to configure, click the Properties tab.
2. Type the property name in the Property field and its corresponding value in the Value field.
3. Click Add.
4. Repeat steps 2 and 3 if you want to add additional properties.
5. Click OK.

Note: You can also create templates to facilitate the creation of rule properties in Web Studio. See “Ensuring Consistent Property Usage with Property Templates” on page 273 for details.


Adding Static Records in Rule Results

In addition to defining a rule’s dimension value targets and custom properties, you can optionally specify any number of static records to promote. These static records are called featured records, and you specify them on the Featured Records tab of the Rule editor. You access featured records in your Web application using the same methods you use to access dynamically generated records. For details, see “Adding Web Application Code to Extract Rule Results” on page 293.

To add featured records to a rule:

1. In the Rule editor for the rule you want to configure, click the Featured Records tab.
2. From the Record spec list, choose an Endeca property
to uniquely identify featured records.
3. In the Value text box, type a value for the selected
Endeca property. This value identifies a featured
record you want to promote.
4. Click Add.
5. If desired, repeat steps 3 and 4 to add additional
featured records to the list.
6. To change the order in which the featured records
appear, select a record and click Up or Down.
7. To change a record spec value, select it from the
Record spec values list, modify its value in the Value
text box, and click Update.


8. Click OK when you are done adding featured records to a rule.

The Navigation Engine treats featured records differently than dynamically generated records. In particular, featured records are not subject to any of the following:
• Record order sorting by sort key
• Uniqueness constraints
• Maximum record limits

Order of Featured Records

The General tab of the Rule editor allows you to specify a sort order for dynamically generated records that the Navigation Engine returns. This sort order does not apply to featured records. Featured records are returned in a Supplement object in the same order that you specified them on the Featured Records tab. The featured records occur at the beginning of the record list for the rule’s results and are followed by any dynamically generated records. The dynamically generated records are sorted according to your specified sort options.

No Uniqueness Constraints

The zone associated with a rule allows you to indicate whether rule results are unique by a specified property or dimension value. This uniqueness constraint does not apply to featured records, even if uniqueness is enabled for dynamically generated rule results. For example, if you enabled “Color” to be the unique property for record results and you have two dynamically generated records with “Blue” as the property value, then the Navigation Engine excludes the second record as a duplicate. On the other hand, if you have the same scenario but the two records are featured results, not dynamically generated results, the Navigation Engine returns both records.

No Maximum Record Limits

The style associated with a rule allows you to set a maximum number of records that the Navigation Engine may return as rule results. This Maximum Records value does not apply to featured records. For example, if the Maximum Records value is set to three and you specify five featured records, the Navigation Engine returns all five records. Also, the Navigation Engine returns featured records before dynamically generated records, and the featured records count toward the maximum limit. Consequently, the number of featured records could restrict the number of dynamically generated rule results.

Sorting Rules in the Rules View

The dynamic business rules you create in Developer Studio appear in the Rules view. To make rules easier to find and work with, they can be sorted by name (in alphabetical ascending or descending order) or by priority.

The procedure described below changes the way rules are sorted in the Rules view only. Sorting does not affect the priority used when processing the rules. Prioritizing rules in Developer Studio is described in “Prioritizing Rules” on page 290.

To sort the rules in Rules view:

• In the Rules view, click the Name column header to cycle the sort order through Sorted by name (ascending), Sorted by name (descending), and Sorted by priority.

Prioritizing Rules

In addition to sorting rules by name or priority, you can also modify a rule's priority in the Rules view of
Developer Studio. Priority is indicated by a rule’s position
in the Rules view, relative to the position of other rules
when you have sorted the rules by priority. You modify
the relative priority of a rule by moving it up or down in
the Rules view.

A rule’s priority affects the order in which the Navigation Engine evaluates the rule. The Navigation Engine
evaluates rules that are higher in the Rules view before
those that are positioned lower. By increasing the priority
of a rule, you increase the likelihood that the rule is
triggered before another, and in turn, increase the
likelihood that the rule promotes records before others.

It is important to consider rule priority in conjunction with the settings you specify in the Zone editor. For
example, suppose a zone has “Rule limit” set to three. If
you have ten rules available for the zone, the Navigation
Engine evaluates the rules, in the order they appear in the Rules view, and returns results from only the first three
that have valid results. In addition, the “Shuffle rules”
check box on the Zone editor overrides the priority order
you specify in the Rules view. When you check “Shuffle
rules”, the Navigation Engine randomly evaluates the
rules associated with a zone.

If you set up rule groups, you can modify the priority of a rule within a group and modify the priority of a group
with respect to other groups. For details, see “Prioritizing
Rule Groups” on page 304.

To prioritize rules:

1. In the Rules view, click the Name column header to cycle the order of the rule sort so that the rules are
displayed in priority order (you’ll see “Sorted by
priority” in the lower left corner of the Rules view).
2. Select the rule whose priority you want to change and
click the Up or Down buttons to move the rule to the
desired position.

Presenting Rule Results in a Web Application


The Navigation Engine returns rule results to a Web
application in a Supplement object. To display rule results
to Web application users, an application developer writes
code that extracts the rule results from the Supplement
object and displays the results in the application.

Before explaining how these two tasks are accomplished, it is helpful to briefly describe the process from the point at which a user makes a query to the point when an application displays the rule results:
1. A user submits a query that triggers a dynamic
business rule.
2. When a query triggers a rule, the Navigation Engine
evaluates the rule and returns rule results in a single
Supplement object per rule. The rule results are
derived from the rule’s target values refined by zone
and style settings.
3. Web application code extracts the rule results and the
style for the rule from the Supplement object.
4. Custom rendering code defines how to display the
rule results in your application according to the style
supplied with the results.

The following sections describe query parameter requirements and application and rendering code
requirements.

Required Navigation Engine URL Query Parameters

The Navigation Engine evaluates dynamic business rules only for navigation queries. This evaluation also occurs
with variations of navigation queries, such as record
search, range filters, and so on. Dynamic business rules
are not evaluated for record, aggregated record, or
dimension search queries. Therefore, a query must
include a navigation parameter (N) in order to potentially
trigger a rule. No other specific query parameters are
required.
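For example, a query URL like the following is eligible to trigger rules because it includes the N parameter (the page name and parameter value here are illustrative):

controller.jsp?N=0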


Adding Web Application Code to Extract Rule Results

You must add code to your Web application that extracts rule results from the Supplement objects that the
Navigation Engine returns. Supplement objects are
children of the Navigation object and are accessed via the
getSupplements() method for the Navigation object. The
getSupplements() method returns a SupplementList
object that contains some number of Supplement objects.
For example, the following pseudo code gets all
Supplement objects from the Navigation object.

<SupplementList> = <Navigation>.getSupplements()
<Supplement> = <SupplementList>.get(i)

Each Supplement object may contain three types of data: records, navigation references, and properties.
• Records—Each dynamic business rule’s Supplement
object has one or more records attached to it. These
records are structurally identical to the records found
in navigation record results. This pseudo code gets all
records from a Supplement object. See the sample
code sections below for more detail.

<ERecList> = <Supplement>.getERecs()
<ERec> = <ERecList>.get(i)

• Navigation reference—Each business rule’s Supplement object also contains a single reference to
a navigation query. This navigation reference is a
collection of dimension values. These dimension
values create a navigation query that may be used to
direct a user to a new location (usually the full result set that the promoted records were sampled from). This is useful if you want to create a link from the
rule’s title that displays the full result set of promoted
records. This pseudo code gets the navigation
reference from a Supplement object. See the sample
code sections below for more detail.

<NavigationRefsList> = <Supplement>.getNavigationRefs()
<DimValList> = <NavigationRefsList>.get(i)
<DimVal> = <DimValList>.get(j)

• Properties—Each business rule’s Supplement object contains multiple properties, and each property
consists of a key/value pair. Properties are
rule-specific, and are used to specify the style, zone,
title, and so on for each rule. This pseudo code gets all
the properties from a Supplement object. See the
sample code sections below for more detail.

<PropertyMap> = <Supplement>.getProperties()
<Property> = <PropertyMap>.get(string)

There are a number of important properties for each business rule’s Supplement object. They include the
following:
• Title—The title of a rule, as specified on the Name
field of the Rule editor.
• Style—The name of the style associated with the rule,
as specified in the Style drop-down list of the Rule
editor’s General tab.


• Style Title—The title of the style (different than the name of the style) associated with the rule, as
specified in the Title field on the Style editor.
• Zone—The name of the zone the rule is associated
with, as specified by the Zone drop-down list of the
Rule editor’s General tab.
• DGraph.SeeAlsoMerchId—The rule ID. This ID is
system-defined, not user-defined.
• DGraph.SeeAlsoPivotCount—This count specifies the
total number of matching records that were available
when evaluating the target for this rule. This count is
likely to be greater than the actual number of records
returned with the Supplement object, since only the
top N records are returned for a given business rule
style.
• DGraph.SeeAlsoMerchSort—If a sort order has been
specified for a rule, the property or dimension name
of the sort key is listed in this property.
• DGraph.SeeAlsoMerchSortOrder—If a sort key is
specified, the sort direction applied for the key is also
listed.

In addition to the properties listed above, you can create custom properties on the Properties tab of the Rule
editor. Custom properties also appear in a Supplement
object. For details, see “Adding Custom Properties to a
Rule” on page 286.


Sample Java Code

You can use the following sample Java code to assist you
in extracting rule results from Supplement objects.

SupplementList sl = nav.getSupplements();
for (int i = 0; i < sl.size(); i++) {

  // Get Supplement object
  Supplement sup = (Supplement)sl.get(i);

  // Get properties
  PropertyMap supPropMap = sup.getProperties();

  // Check if object is merchandising or
  // content spotlighting result
  if ((supPropMap.get("DGraph.SeeAlsoMerchId") != null) &&
      (supPropMap.get("Style") != null) &&
      (supPropMap.get("Zone") != null)) {
    hasMerch = true; // flag assumed to be declared earlier in the page

    // Get record list
    ERecList recs = sup.getERecs();
    for (int j = 0; j < recs.size(); j++) {

      // Get record
      ERec rec = (ERec)recs.get(j);

      // Get record properties
      PropertyMap recPropsMap = rec.getProperties();

      // Get value of name prop from current record
      String name = (String)recPropsMap.get("Name");
    }

    // Set target link using first navigation reference
    NavigationRefsList nrl = sup.getNavigationRefs();
    DimValList dvl = (DimValList)nrl.get(0);

    // Loop over dimension values to build new target query
    String newNavParam = "";
    for (int j = 0; j < dvl.size(); j++) {
      DimVal dv = (DimVal)dvl.get(j);

      // Add delimiter and id
      newNavParam += " " + dv.getId();
    }

    // Get specific rule properties
    String style = (String)supPropMap.get("Style");
    String title = (String)supPropMap.get("Title");
    String zone = (String)supPropMap.get("Zone");
    String customText = (String)supPropMap.get("CustomText");
  }
}

Sample ASP.NET Code

You can use the following sample ASP.NET code to assist you in extracting rule results from Supplement objects.


// Check if Supplement object is merchandising
// or content spotlighting
if ((supPropMap["DGraph.SeeAlsoMerchId"] != null) &&
    (supPropMap["Style"] != null) &&
    (supPropMap["Zone"] != null) &&
    (Request.QueryString["hideMerch"] == null)) {

  // Get record list
  ERecList supRecs = sup.ERecs;

  // Loop over records
  for (int j = 0; j < supRecs.Count; j++) {

    // Get record
    ERec rec = (ERec)supRecs[j];

    // Get property map for record
    PropertyMap propsMap = rec.Properties;

    // Get value of name prop from current record
    String name = (String)propsMap["Name"];
  }

  // Set target link using first navigation reference
  NavigationRefsList nrl = sup.NavigationRefs;
  DimValList dvl = (DimValList)nrl[0];

  // Loop over dimension values to build new target query
  String newNavParam = "";
  for (int k = 0; k < dvl.Count; k++) {
    DimVal dv = (DimVal)dvl[k];

    // Add delimiter and id
    newNavParam += " " + dv.Id;
  }
...


Sample COM Code

You can use the following sample COM code to assist you in extracting rule results from Supplement objects.

' Get Supplement list


dim sups
set sups = nav.GetSupplements()

' Loop over Supplement objects


For i = 1 to sups.Count

' Get Supplement object


dim sup
set sup = sups(i)

' Get properties


dim supPropMap
set supPropMap = sup.GetProperties()

' Check if Supplement object is merchandising or
' content spotlighting
if ((supPropMap.Get("DGraph.SeeAlsoMerchId") <> "") and _
    (supPropMap.Get("Style") <> "") and _
    (supPropMap.Get("Zone") <> "") and _
    (Request.QueryString("hideMerch") = "")) then
' Get record list
dim supRecs
set supRecs = sup.GetERecs()

' Loop over records


For j=1 to supRecs.count

' Get record


set rec = supRecs(j)

' Get property map for record


set propsMap = rec.GetProperties()

' Get value of name prop from current record


name = propsMap.Get("Name")
next

' Set target link using first navigation reference


set nrl = sup.GetNavigationRefs()
dim dvl
set dvl = nrl(1)

' Loop over dimension values to build new target query


newNavParam = ""

For k=1 to dvl.Count


dim dv
set dv = dvl(k)

' Add delimiter if necessary


if (newNavParam <> "") then
newNavParam = newNavParam & " "
end if

' Add id
newNavParam = newNavParam & dv.GetId()
next
' Get specific rule properties
style = supPropMap.Get("Style")
title = supPropMap.Get("Title")
zone = supPropMap.Get("Zone")
customText = supPropMap.Get("CustomText")
end if
next

Adding Web Application Code to Render Rule Results

In addition to Web application code that extracts rule results from Supplement objects, you must also add
application code to render the rule results on screen.
(Rendering is the process of converting the rule results
into displayable elements in your Web application pages.)

Rendering rule results is a Web application-specific development task. The reference implementations come
with three arbitrary styles of rendering business rule
results, but most applications require their own custom
development that is typically keyed on the Title, Style,
Zone, and other custom properties. For details, see
“Adding Web Application Code to Extract Rule Results”
on page 293.
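For example, rendering code is commonly keyed on the Style property extracted from the Supplement object. A minimal Java sketch, using the objects from the Java sample above (the style name and the rendering helper methods are hypothetical):

// Dispatch to a renderer based on the rule's style
String style = (String)supPropMap.get("Style");
String title = (String)supPropMap.get("Title");
if ("SidebarBox".equals(style)) {
  renderSidebarBox(title, recs);   // hypothetical helper
} else {
  renderDefaultBox(title, recs);   // hypothetical helper
}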

Grouping Rules
Rule groups are a third and optional construct that
complement zones and styles in supporting dynamic
business rules. Rule groups serve two functions:


• They provide a means to logically organize rules into categories to facilitate creating and editing rules.
• They allow multiple business users to access Web
Studio simultaneously.

A rule group provides a means to organize a large number of rules into smaller logical categories, which
usually affect distinct (non-overlapping) parts of a Web
site. For example, a retail application might organize rules
that affect the electronics and jewelry portions of a Web
site into a group for Electronics Rules and another group
for Jewelry Rules.

A rule group also enables multiple business users to access Web Studio simultaneously. Each Web Studio user
can access a single rule group at a time. Once a user
selects a rule group, Web Studio prevents other users
from editing that group until the user returns to the
selection list or closes the browser window.

To create a rule group:

1. In the Project Explorer window, expand Dynamic Business Rules.
2. Double-click Rules. This opens the Rules view and
also activates the Rule Group menu on the menu bar.
3. From the Rule Group menu, select New.
4. Type a unique name in the Group name field. Use
only alphanumeric, dash, or underscore characters.
5. To select a rule for this group, highlight a rule in the "All rules" list and click Add. The rule appears in the "Rules in group" list. (If this is the first group you created, all the rules are moved to the "Rules in group" list and the Remove button is inactive.)
Note: A rule can belong to only one rule group. Adding a
rule to a group removes it from any group to which it
previously belonged.
6. Click OK. The new rule group appears in the Rules
view.
7. Repeat steps 3 through 6 if you want multiple rule groups in
your project.
8. To change the priority of a rule within a group, select
the rule in the “Rules in Group: Name” column and
click either the Up or Down arrow buttons.

Deleting a Rule Group

You can delete rule groups from your project as necessary.

To delete a rule group:

1. In the Rules view’s "Rule groups: Name" column, select a rule group and then click Delete.
2. When the confirmation message appears, click Yes.
3. If your project contains at least one other rule group,
the Select Rule Group dialog box appears. In the drop
down list, select the rule group you want to move the
rules in this group to, and click OK.
This dialog box appears when the rule group you
delete is not the last one in the project. If the rule group you are deleting is the only rule group, Developer Studio lists the rules under the All rules
heading.

Prioritizing Rule Groups

In the same way that you can modify the priority of a rule
within a group, you can also modify the priority of a rule
group with respect to other rule groups.

The Navigation Engine evaluates rules first by group order, as shown in the Rules view of Developer Studio or
Web Studio, and then by their order within a given group.
For example, if Group_B is ordered before Group_A, the
rules in Group_B will be evaluated first, followed by the
rules in Group_A. Rule evaluation proceeds in this way
until a zone’s Rule Limit value is satisfied.

This relationship is shown in the example below. Suppose zone 1 has a Rule Limit setting of 2. Because group B is ordered before group A, rules 1 and 2 satisfy the Rule Limit rather than rules 4 and 5.

Group B
    Rule 1, Zone 1
    Rule 2, Zone 1
    Rule 3, Zone 2
Group A
    Rule 4, Zone 1
    Rule 5, Zone 1
    Rule 6, Zone 2


To prioritize rule groups:

1. In the Rules view, select a group whose priority you want to change in the "Rule groups: Name" column.
2. Click the Up or Down buttons to move the group to
the desired position.

If you want to further prioritize the rules within a particular rule group, see "Prioritizing Rules" on page 290.

Interaction Between Rules and Rule Groups

When creating or editing rule groups, keep in mind the following interactions between rules and rule groups:
• Rules may be moved from one rule group to another. However, a rule can appear in only one group.
• A rule group may be empty (that is, it does not have
to contain rules).
• The order of rule groups with respect to other rule
groups may be changed.

Performance Considerations for Dynamic Business Rules

Dynamic business rules require very little data processing
or indexing, so they do not impact Forge performance,
Dgidx performance, or the Navigation Engine memory
footprint.


However, because the Navigation Engine evaluates dynamic business rules at query time, rules affect the
response-time performance of the Navigation Engine. The
larger the number of rules, the longer the evaluation and
response time. Evaluating more than twenty rules per
query can have a noticeable effect on response time. For
this reason, you should monitor and limit the number of
rules that the Navigation Engine evaluates for each query.

In addition to large numbers of rules slowing performance, query response time is also slower if the
Navigation Engine returns a large number of records. You
can minimize this issue by setting a low value for the
Maximum Records setting in the Style editor for a rule.

Rules without Explicit Triggers

Dynamic business rules without explicit triggers also affect response time performance because the Navigation
Engine evaluates the rules for every navigation query.

Using an Agraph and Dynamic Business Rules


To implement dynamic business rules when you are
using the Agraph, keep in mind the following points:
• Using dynamic business rules with the Agraph affects
performance if you are using zones configured with
“Unique by this dimension/property” and combined
with a high setting for the maximum number of
records or a large number of rules. To avoid response time problems, you may need to reduce the number of rules, reduce the maximum records that can be
returned, or abandon uniqueness.
• All Dgraphs serving one Agraph must share the same
set of dynamic business rules. To ensure this, it is
necessary to update their configurations
synchronously in the Endeca Manager, by running
Dgidx with the --keepcats flag.
• If you update your Dgraphs with dynamic business
rule changes using Developer Studio or Web Studio,
and a request comes to the Agraph while the update is
in progress, the Agraph issues a fatal error similar to
the following:
[Thu Jun 24 16:26:29 2004] [Fatal]
(merchbinsorter.cpp::276) - Dgraph 1 has fewer rules
fired.

As long as the Agraph is running under the Endeca JCD, the JCD automatically restarts the Agraph. No data is lost.
However, end-users will not receive a response to
requests made during this short time. This problem has
little overall impact on the system, because business rule
updates are quick and infrequent. Nevertheless, Endeca
recommends that you shut down the Agraph during
business rule updates. To shut down the Agraph, go to a
CMD prompt on Windows or a shell prompt on UNIX
and type:

GET 'http://HOST:PORT/admin?op=exit'

where HOST is the machine running the Agraph and PORT is the port number of the Agraph. GET is a Perl utility, so be
sure the Perl binaries are in your system path variable.


Applying Relevance Ranking to Rule Results


In some cases, it is a good idea to apply relevance
ranking to a rule’s results. For example, if a user performs
a record search for Mondavi, the results in the Highly
Rated rule can be ordered according to their relevance
ranking score for the term Mondavi. In order to create
this effect, there are three requirements:
• The navigation query that is triggering the rule must
contain record search parameters (Ntt and Ntk).
Likewise, the zone that the rule is assigned to must be
identified as Valid for search. (Otherwise, the rule will
not be triggered.)
• The rule’s target must be marked to Augment
Navigation State.
• The rule must not have any sort parameters specified.
If the rule has an explicit sort parameter, that
parameter overrides relevance ranking. Sort
parameters for a rule are set on the General tab of the
Rule editor.

If these three requirements are met, then the relevance ranking rules specified with Navigation Engine startup
options are used to rank specific business rules when
triggered with a record search request (a keyword
trigger).
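For example, a navigation query that includes record search parameters, such as the following, could trigger a rule and return relevance-ranked rule results (the page name and the search key are illustrative):

controller.jsp?N=0&Ntk=All&Ntt=Mondavi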


About Overloading Supplement Objects


Recall that dynamic business rule results are returned to
an application in Supplement objects. Each rule that
returns results does so via a single Supplement object for
that rule. However, not all Supplement objects contain
rule results.

Supplement objects are also used to support "Did You Mean" suggestions, record search reports, and so on. In
other words, a Supplement object can act as a container
for a variety of features in an application. One
Supplement object instance cannot contain results for two
features. For example, one Supplement object cannot
contain both rule results and also “Did You Mean”
suggestions. For that reason, if you combine dynamic
business rules with these additional features, you should
check each Supplement object for specific properties
such as DGraph.SeeAlsoMerchId to identify which
Supplement object contains rule results.
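For example, a minimal Java sketch of such a check, using the same objects as the samples earlier in this chapter:

// Skip Supplement objects that carry other features
// (such as "Did You Mean" suggestions) rather than rule results.
SupplementList sl = nav.getSupplements();
for (int i = 0; i < sl.size(); i++) {
  Supplement sup = (Supplement)sl.get(i);
  PropertyMap supPropMap = sup.getProperties();
  if (supPropMap.get("DGraph.SeeAlsoMerchId") == null) {
    continue; // not a rule result
  }
  // ... process rule results as shown earlier ...
}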

Chapter 16
Implementing User Profiles

A user profile is a character-string-typed name that identifies a class of end users. User profiles enable applications built
on the Endeca Navigation Platform to tailor the content
displayed to an end user based on that user’s identity.

User profiles can be used to trigger dynamic business rules, where such rules are optionally constructed with an
additional trigger attribute corresponding to a user profile.
The Endeca Navigation Platform can accept information
about the end user, and use that information to trigger
pre-configured rules and behaviors.

This chapter discusses how you create user profiles and then
implement them as dynamic business rule triggers. Before
reading further, make sure you are comfortable with the
information in the Dynamic Business Rules chapter.

Note: Each business rule is allowed to have at most one user profile
trigger.

Profile-Based Trigger Scenario


In the following scenario, an online clothing retailer
wants to set up a dynamic business rule that says: “For
young women, who are browsing stretch t-shirts, also
recommend cropped pants.” We follow the shopping
experience of a customer named Jane.

In order to set up this rule, a few configuration steps are necessary:
1. In Endeca Developer Studio, the retailer creates a user
profile called young_woman, which corresponds to
the set of customers who are female and are between
the ages of 16 and 25.
2. In Endeca Web Studio, a dynamic business rule that
uses the profile as a trigger is created:
young_woman X DVAL(stretch t-shirt) => DVAL(cropped
pants)

No complex Boolean logic programming is necessary here. The business user simply selects a user profile
from a set of available profiles to create the business
rule.
3. In the Web application that’s driving the customer’s
experience, there needs to be logic that identifies the
user and tests to see if he or she meets the
requirements to be classified as a young_woman.
Alternatively, the profile young_woman may already
be stored along with Jane’s information (such as age,
address, and income) in a database or LDAP server.


The user’s experience would go something like this:


1. Jane accesses the clothing retailer’s Web site and is
identified by a cookie on her computer. By looking up
a few database tables, the application knows that it
has interacted with her before. The database indicates
that she is 19 years old and female.
At this point, the database may also indicate the user
profiles that she belongs to: young_woman,
r_and_b_music_fan, college_student. Alternatively, the
application logic may test against her information to
see which profiles she belongs to, as follows: “Jane is
between 16 and 25 years old and she is female, so she
belongs in the young_woman profile.”
2. As Jane is browsing the site, the Endeca Navigation
Engine is driving her catalog experience. As each
query is being sent to the Endeca Navigation Engine, it
is augmented with user profile information. Here is
some sample Java code:
profileSet.add("young_woman");
eneQuery.setProfiles(profileSet);

3. As Jane clicks on a stretch t-shirt link, the Endeca Navigation Engine realizes that a dynamic business
rule has been triggered: young_woman X DVAL(stretch
t-shirt). Therefore, it returns a cropped pants record
in one of the dynamic business rule zones.
4. Jane sees a picture of cropped pants in a box labeled,
“You also might like...”
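The classification test in step 1 of this scenario might look like the following Java sketch (the user object and its accessor methods are hypothetical; the setProfiles() call is shown in the code samples later in this chapter):

// Hypothetical user data retrieved from a database or LDAP server
Set profiles = new HashSet();
if (user.getAge() >= 16 && user.getAge() <= 25
    && "female".equals(user.getGender())) {
  profiles.add("young_woman");
}
eneQuery.setProfiles(profiles);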


Developer Studio Implementation


You set up user profiles in Developer Studio. Both
Developer Studio and Web Studio allow a user profile to
be associated with a business rule’s trigger.

To set up a user profile in Developer Studio:

1. In the Project Explorer tab, double-click User Profiles. The User Profiles view appears.
2. Click New. The New User Profile editor appears.
3. In the Name text box, type a unique name for this
user profile and click OK. The new user profile is
added to the User Profiles view.

To assign a user profile as a business rule trigger in Developer Studio:

1. In the Project Explorer, click Dynamic Business Rules to expand it, and then double-click Rules to open the
Rules view.
2. Select the rule you want to apply the trigger to, and
then click Edit. The Rule editor appears.
3. Click the Triggers tab. In the User Profile list, select a
profile. (You may also specify a dimension trigger
and/or a keyword trigger in the Triggers tab.)
4. Click OK. The user profile information that you added
to the rule now appears in the Rules view.


To assign a user profile as a business rule trigger in Web Studio:

1. Log on to Web Studio and click Rule Manager to display the Rule Manager page.
2. In the Rule List, click the rule you want to apply the
user profile to.
3. Click the Edit Where and What tab.
4. In the User Profile list, select a user profile to use as a
business rule trigger. There can only be one user
profile trigger per rule.

User Profile Query Parameters


There are no URL ENE query parameters associated with
user profiles. In many live application scenarios, the URL
query is exposed to the end user, and it is usually not
appropriate for end users to see or change the user
profiles with which they have been tagged.

Objects and Method Calls


In the following code samples, the application recognizes
the end user as Jane Smith, looks up some database
tables and determines that she is 19 years old, female, a
college student and likes R&B music. These
characteristics map to the following Endeca user profiles
created in Endeca Developer Studio:
• young_woman
• r_and_b_music_fan


• college_student

User profiles can be any string. The user profiles supplied to ENEQuery must exactly match those configured in
Endeca Developer Studio.

Java Code Example


// User profiles can be any string. The user profiles
// supplied to ENEQuery must exactly match those
// configured in Endeca Developer Studio.
// Make sure you have the following import statement at
// the top of your file:

// import java.util.*;

Set profiles = new HashSet();


// Collect all the profiles into a single Set object.
profiles.add("young_woman");
profiles.add("r_and_b_music_fan");
profiles.add("college_student");
// Augment the query with the profile information.
eneQuery.setProfiles(profiles);


.NET C# Code Example


// Make sure you have the following statement at the top
// of your file:
// using System.Collections.Specialized;

StringCollection profiles = new StringCollection();

// Collect all the profiles into a single
// StringCollection object.

profiles.Add("young_woman");
profiles.Add("r_and_b_music_fan");
profiles.Add("college_student");

// Augment the query with the profile information.

eneQuery.Profiles = profiles;

COM Code Example


' Create a zero-based string array that will hold the
' profiles

Dim profiles(2)

' Add each profile as a string

profiles(0) = "young_woman"
profiles(1) = "r_and_b_music_fan"
profiles(2) = "college_student"

' Augment the query with the profile information


eneQuery.SetProfiles(profiles)


Performance Impact of User Profiles


An application using this feature may experience
additional memory costs due to user profiles being set in
an ENEQuery object. In addition, the application may
require additional ENEConnection.query() response time,
because the Navigation Engine must do additional work
to receive profile information and check if business rules
fire. However, in typical application scenarios that set one
to five user profile strings of at most 20 characters in the
ENEQuery object, the performance impact is insignificant.

Chapter 17
Implementing Partial Updates

This chapter describes how to implement partial updates in your deployment.

About Partial Updates


The Endeca Navigation Engine processes two types of
source data transformations:
• Baseline update
• Partial update

A baseline update (also called a full update) is a complete re-index of the entire data set. Baseline updates occur
infrequently, usually once per day or once per week. They
typically involve the customer generating an extract from
their database system and making the files accessible either
on an FTP server or on the indexing server. This data is
processed by Forge and Dgidx, and is then finally made
available through the Navigation Engine.

A partial update is a much smaller change in the overall data set. Partial updates affect a small percentage of the
total records in the system, and therefore occur much more
frequently. They consist of a much smaller extract from the customer’s database and contain volatile information. For example, the price and availability of products on a retail
store site are usually volatile.

The partial update capability of the Endeca Navigation Engine allows it to receive and process changes to its
data without reprocessing the baseline data, thus
allowing it to process the updates in a short amount of
time and continue serving requests.

Implementing partial updates requires a separate pipeline to process the partial updates, as well as starting the Navigation Engine with an additional command-line flag. These requirements are explained in detail in this chapter.

You can implement partial updates by using one of three methods:
• Using a control script, as documented below.
• Using the emgr_update utility to programmatically run
a partial update via the Endeca Manager. For details,
see the Endeca Tools Guide.
• Writing a client application program, using the Endeca
Data Indexing API. For details, see the Endeca Data
Indexing API Guide.

Note: Partial updates cannot be run from Developer Studio or Web Studio.


Partial Update Capabilities

The Endeca software supports updates that allow you to:


• Add auto-generated dimension values to existing
dimensions.
• Add an entirely new record with a new set of property
values and dimension values.
• Delete a record. The dimension values with which the
record is tagged are not removed from the system,
even if there are no other records tagged with the
same dimension values.
• Replace an existing record with an entirely new set of
property values and dimension values. Again,
dimension values no longer associated with any
records remain in the system.
• Update an existing record, selectively adding and
removing dimension and property values. Specifically,
it is possible to:
• Add property values to a record.
• Remove all property values of a property from a
record.
• Add dimension values to a record.
• Remove specific dimension values from a record.
• Remove all dimension values of a dimension from a
record.


Partial Updates Reference Implementation

A data reference implementation for partial updates is included with the installation in these default locations:
• On UNIX:
$ENDECA_REFERENCE_DIR/sample_updates_data

• On Windows:
%ENDECA_REFERENCE_DIR%\sample_updates_data

This implementation contains the components for partial updates, including a baseline update pipeline, a partial
update pipeline, a control script, and source data files.
The descriptions in the rest of this chapter assume the use
of this reference implementation.

Note: The reference implementation is for a single-Dgraph deployment. If you have an Agraph deployment, see the section
“Partial Updates in Agraph Implementations” on page 355.

Baseline Pipeline Restrictions

Forge processing for baseline updates and partial updates is done in separate pipelines, but is coordinated and
synchronized. Because baseline updates have loose
restrictions on both time and computational resources,
they can execute complex business logic and large joins.
The resources available to process partial updates are
much more tightly restricted, so resource-intensive tasks
(such as large joins and complex business logic) should
be avoided if possible.


The required coordination between the baseline and partial update pipelines, coupled with the resource
restrictions on the partial update pipeline, impose
constraints on the baseline update pipeline:
• Properties that are the subject of partial updates, and
properties that are the basis for classifications that are
the subject of partial updates, should not be
assembled onto the records in large joins. Loading the
join table takes time, and the partial update pipeline
may not be able to load the table in the time required
to perform the partial update.
• Records produced by the baseline update pipeline
must be identified in the partial update pipeline by a
single property that is unique for each record. “Record
Specification Attribute” on page 340 describes this
requirement in detail.

Dimensions affected by partial updates must be loaded in the baseline update pipeline. Loading large dimensions
increases the startup time for the partial update pipeline.
Therefore, the size of dimensions affected by partial
updates should be kept small.

Baseline updates must be comprehensive: all of the information in all the partial updates since the previous
baseline update must be incorporated in the new baseline
update. If there are multiple types of updates being
performed, all of the updates of all the different types
must be included.

Baseline updates must not overlap. A new baseline cannot be started until processing of the prior baseline has been completed (completed means that the baseline update has been loaded into a Navigation Engine and
updates have started being processed against it).

To avoid ambiguous behavior and spurious errors, we suggest that partial updates not be extracted while a
baseline data extraction is in progress.

Creating a Partial Update Pipeline


A partial update requires its own pipeline (separate from
the baseline update pipeline) that only deals with partial
updates. Use Developer Studio to create the partial
update pipeline, because Developer Studio can open
both pipeline files at the same time.

Each input record in a partial update pipeline describes a transformation to be performed on a single record in the
running application. This means, for example, that a
single update cannot change the spelling of a property on
many records; instead, a separate update must be
generated to change the spelling on each record in the
application.


The reference implementation’s partial update pipeline consists of a record adapter, a record manipulator, two dimension adapters, a dimension server, and an update adapter. These components are described in the sections that follow.


The partial update pipeline is executed at frequent intervals. Between runs, updates are queued. When the
partial update process starts, all the queued updates are
processed and written to a staging area. When Forge is
complete, the updates are read from the staging area into
the running application.

The partial update pipeline in the sample_updates_data reference implementation works as follows:
1. The partial update pipeline reads its input, using a
record adapter (named LoadUpdateData) with the Multi
File field checked.
2. The input records are transformed into record updates
by a record manipulator (named UpdateManipulator)
using IF and UPDATE_RECORD expressions.
3. The record updates are written out using an update
adapter.

The following sections describe the creation of the individual components of the pipeline.

Creating the Record Adapter

When you create the record adapter in Developer Studio, the General tab of the Record Adapter editor must have
these basic settings:
• Direction – Must be Input.
• URL – Enter an input URL as a path, with the filename
being a pattern. For example, a URL pattern of
../incoming/updates/*.txt.gz means that Forge will read any file that has the txt.gz suffix in the sample_updates_data/data/incoming/updates directory. Each file that matches the pattern will be read in sequence.
• Multi File – Check this box to specify that Forge can
read data from more than one input file and that the
input URL is to be interpreted as a pattern.

You can leave the other tabs (Sources, Record Index, and
so on) in their default state.

Creating the Record Manipulator

When creating the record manipulator, the Sources tab of the Record Manipulator editor must have the following
settings:
• Record source – Select the name of the property
mapper.
• Dimension source – Select None.
• You can leave the Record Index tab empty.

The Expression editor is where you add the expressions described below. You open the Expression editor by double-clicking the record manipulator component in the Pipeline Diagram. You can add expressions after the record manipulator is created.


IF Expression

The record manipulator is essentially an IF expression that calls one of three UPDATE_RECORD expressions based
on a conditional evaluation of the incoming record. The
logic of the IF expression is as follows:
IF the incoming record has a "Remove" field equal to "1"
THEN delete the record
(i.e., call UPDATE_RECORD with an ACTION of "DELETE_OR_IGNORE")
ELSE_IF the incoming record has an "Update" field equal to "1"
THEN update the record
(i.e., call UPDATE_RECORD with an ACTION of "UPDATE")
ELSE add the record
(i.e., call UPDATE_RECORD with an ACTION of "ADD_OR_REPLACE")

Other expressions (such as MATH and PROPERTY) are used in evaluating the incoming record for the value of the
record’s update property.

You can see the entire IF expression (with its three UPDATE_RECORD expressions) by opening the reference
implementation’s record manipulator in the Expression
editor.

If you want to modify the UPDATE_RECORD expressions, the next section provides more details on their usage.

UPDATE_RECORD Expression

The UPDATE_RECORD expression updates existing records by adding, removing, or replacing dimensions, dimension
values, or property values. The expression can also delete
existing records and add new ones.


If different types of partial updates are processed using separate pipelines, the UPDATE_RECORD expression can be
written to perform the same action on all of the input.

For example, a partial update pipeline written to handle only price and availability changes would always
generate UPDATE-type record updates. If the same partial
update pipeline needs to handle REPLACE updates (that is,
reclassification of a record), the input data must contain
some indication of what type of update to perform. Most
commonly, this will simply be a property on the input
record, which is checked inside an IF expression.

The UPDATE_RECORD expression takes a snapshot of the record at the time it is evaluated and generates a
corresponding record update. Thus, the update contains
the property names and values, as well as classifications,
that are in effect at the time of evaluation. If properties
are renamed, have their values changed, or classifications
are added or deleted after the record update expression
has been evaluated, the changes have no impact on the
record update that will be generated. Only one record
update can be generated per record.

Note the following:


• For ADD record updates, a complete record must be set
up before the expression is evaluated.
• For REPLACE record updates, all the necessary property
values and dimension values (as well as the property
specifying the RECORD_SPEC) must be on the record.


• For ADD_OR_REPLACE record updates, if no record exists with the specified property value for the property that
has been designated as the RECORD_SPEC, the system
adds a new record; if the record exists, it is replaced.
• For DELETE record updates, the RECORD_SPEC property
must be on the record. This property is used to
identify the record to be deleted. All other properties
and dimension values are ignored.
• For DELETE_OR_IGNORE record updates, if a record
exists with the specified property value for the
property that has been designated as the RECORD_SPEC,
the system removes the record; if the record does not
exist, the action is ignored and no error message is
generated.
• For UPDATE record updates, further specification is
necessary to describe how to handle the property
values and dimension values on the record.
UPDATE-type record updates must also include the
RECORD_SPEC property with each record. Each property
or dimension can have only one type of update
performed, but a single record update may impact any
or all of the properties and dimensions on a record.


The following table lists the expression nodes that are supported by the UPDATE_RECORD expression.

ACTION – The type of action to perform on the record, as indicated by the VALUE attribute. Valid values for this attribute are:
• ADD – Adds a new record if it does not exist, or generates an error message if it already exists.
• ADD_OR_REPLACE – Adds a new record if it does not exist, or replaces it if it already exists.
• REPLACE – Replaces a record if it exists, or generates an error message if it does not exist.
• DELETE – Removes a record if it exists, or generates an error message if it does not exist.
• DELETE_OR_IGNORE – Removes a record if it exists, or silently ignores the action (without generating an error message) if it does not exist.
• UPDATE – Updates a record if it exists, or generates an error message if it does not exist.
Examples:
<EXPRNODE NAME="ACTION" VALUE="UPDATE"/>
<EXPRNODE NAME="ACTION" VALUE="ADD_OR_REPLACE"/>

PROP_ACTION – If ACTION=UPDATE, the VALUE attribute specifies the type of update to perform on all the values of the named property. Valid values for this attribute are as follows:
• ADD – All values for the property on the update record are added to the current record.
• DELETE – All values for the property on the update record are removed from the current record.
• REPLACE – All values for the property are removed from the current record, then all values for the property on the update record are added to the current record.
This node must be followed by a PROP_NAME expression node that names the property to be modified. Example:
<EXPRNODE NAME="PROP_ACTION" VALUE="REPLACE"/>
<EXPRNODE NAME="PROP_NAME" VALUE="P_WineType"/>

DIM_ACTION – If ACTION=UPDATE, the VALUE attribute specifies the type of update to perform on all the values of the named dimension. Valid values for this attribute are as follows:
• ADD – All dimension values in the dimension on the update record are added to the current record.
• DELETE – All dimension values in the dimension on the update record are removed from the current record.
• REPLACE – All dimension values in the dimension are removed from the current record, then all dimension values in the dimension on the update record are added to the current record.
This node must be followed by a DIMENSION_ID expression node that names the dimension to be modified. Example:
<EXPRNODE NAME="DIM_ACTION" VALUE="ADD"/>
<EXPRNODE NAME="DIMENSION_ID" VALUE="8000"/>

DVAL_ACTION – If ACTION=UPDATE, removes the dimension value from the record. Note that the VALUE attribute only supports DELETE. This node must be followed by a DVAL_ID expression node that names the dimension value to be removed. Example:
<EXPRNODE NAME="DVAL_ACTION" VALUE="DELETE"/>
<EXPRNODE NAME="DVAL_ID" VALUE="P_PriceStr"/>

Examples of UPDATE_RECORD Expressions

Example 1: An expression configured to convert input records to ADD_OR_REPLACE record updates:
<EXPRESSION TYPE="VOID" NAME="UPDATE_RECORD">
<EXPRNODE NAME="ACTION" VALUE="ADD_OR_REPLACE"/>
</EXPRESSION>

Example 2: An expression configured to convert input records to replace the Price property, and the price range
and availability classifications:
<EXPRESSION TYPE="VOID" NAME="UPDATE_RECORD">
<EXPRNODE NAME="ACTION" VALUE="UPDATE"/>
<EXPRNODE NAME="PROP_ACTION" VALUE="REPLACE"/>
<EXPRNODE NAME="PROP_NAME" VALUE="Price"/>
<EXPRNODE NAME="DIM_ACTION" VALUE="REPLACE"/>
<EXPRNODE NAME="DIMENSION_ID" VALUE="100"/><!--100=PriceRange-->
<EXPRNODE NAME="DIM_ACTION" VALUE="REPLACE"/>
<EXPRNODE NAME="DIMENSION_ID" VALUE="200"/><!--200=Availability-->
</EXPRESSION>
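Following the same pattern, a third variation (analogous to the delete branch of the reference implementation's IF expression) converts input records to deletions; the record to delete is identified by its RECORD_SPEC property value:

<EXPRESSION TYPE="VOID" NAME="UPDATE_RECORD">
<EXPRNODE NAME="ACTION" VALUE="DELETE_OR_IGNORE"/>
</EXPRESSION>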


UPDATE_RECORD Errors

The UPDATE_RECORD expression generates an error in the following cases:
• If ACTION is not one of ADD, ADD_OR_REPLACE, REPLACE,
DELETE, DELETE_OR_IGNORE, or UPDATE.

• If ACTION is ADD and a record with that specification already exists. In this case, the record to be added is
skipped instead of replacing the existing record. Use
an ACTION of ADD_OR_REPLACE to add a record if it does
not exist or replace it if it does.
• If ACTION is UPDATE and a record with that specification
does not exist. In this case, the record to be updated is
skipped.
• If ACTION is UPDATE and a sub-action is not specified.
• If ACTION is not UPDATE and a sub-action is specified.
• If ACTION is DELETE and a record with that specification
does not exist. In this case, the record to be deleted is
skipped and an error message is generated. Use an
ACTION of DELETE_OR_IGNORE to suppress the error
message if the record does not exist.
• If more than one sub-ACTION (such as DVAL_ACTION) is
specified for a given property, dimension, or
dimension value.

Format of Update Records

The UPDATE_RECORD expression, as used in the partial updates reference implementation, requires that each
incoming record have one of the Delimited formats
described below.


Format of Records to Be Deleted

Remove|P_WineID|P_Year|P_Wine|P_Winery|...|
1|34699|1992|A Red Blend Alexander Valley|Lyeth|...|

The first column in the header row must be a Remove column. The first column in each record must have a
value of 1 to delete the record.

Format of Records to Be Updated

Update|P_WineID|P_Wine|P_PriceStr|
1|34701|Albarino Rias Baixas|1000.00|

The first column in the header row must be an Update column. The first column in each record must have a
value of 1 to update the record properties.

Format of Records to Be Added

P_WineID|P_Year|P_Wine|P_Winery|P_PriceStr|...|
99000|1992|First New Wine Added|Lyeth|18.00|...|

The header row of records to be added does not begin with a Remove or Update column. Instead, it uses the
normal set of header row columns (P_WineId, P_Year,
and so on). The first column in each record has a normal
property value.

Format of Records in Your Implementation

If your implementation uses Delimited format records, you can use the above format to specify how the records
are handled. If you use another format, you must use a record manipulator with the appropriate expressions to handle your source records.

Creating the Update Adapter

The update adapter is the component that writes out the record file (or files) that define the new, deleted, or
modified records. The Update Adapter editor must have
at least these settings:
• Output URL (General tab) – Enter the directory to
which Forge writes the partial update files and
processed records. The path is either an absolute path
or a path relative to the location of the partial update
pipeline file. With an absolute path, the protocol must
be specified in RFC 2396 syntax (typically, this means
the prefix file:/// precedes the path to the data file).
Relative URLs must not specify the protocol.
• Output prefix (General tab) – Enter the filename prefix
(such as wine) for the Forge output files. Use the same
prefix as in the indexer adapter for the baseline
update pipeline.
• Filter unknown properties (General tab) – Set this so it
matches the Filter Unknown Properties setting in the
indexer adapter of the baseline update pipeline.
• Record source (Sources tab) – Select the name of the
record manipulator.
• Dimension sources (Sources tab) – Select the name of
the dimension server. You need a dimension source if
you are updating dimensions.


• Enable Agraph support (Agraph tab) – Set this so it matches the Agraph tab settings in the indexer adapter
of the baseline update pipeline.

Dimension Components

The partial updates pipeline in the sample_updates_data reference implementation contains two dimension
adapters and one dimension server.

Dimension Adapters

To support classification, the same dimensions that are loaded in the baseline update pipeline must be loaded in
the partial update pipeline. To cut down on startup time,
the dimensions can be split into multiple files, and only
the dimensions actually used by the partial update
pipeline need to be loaded. In the baseline update
pipeline, multiple dimension adapters can feed into the
same dimension server to consolidate the separate
dimension files.

The sample_updates_data reference implementation uses two dimension adapters, one for the dimensions.xml file
and the other for the winetype_dimension.xml file. For
both dimension adapters, the Dimension Source field (on
the Sources tab) is set to None.


Dimension Server

The dimension server uses the two dimension adapters as sources.

The URL field (General tab) specifies the location to which the autogen_dimensions.xml.gz file is written. This
file contains persistent dimension data produced by
auto-generation.

There are some special considerations when using AutoGen classification with partial updates. When new
dimension values are generated in the partial updates
pipeline, the dimension changes are included in the
updates sent to the Navigation Engine.

Because the baseline and partial update pipelines share the same autogen file, changes to AutoGen dimensions
are also shared between the two. However, at any given
time, only one of the two update processes can modify
the Autogen_dimensions file.

Rather than suspend partial updates during baseline updates, Forge supports the --noAutoGen command-line
option, which turns off the creation of new dimension
values. Classification with existing dimension values
continues normally, but classification failures result in no
matching dimension values, rather than in the creation of
new ones.

For more details on this process, see "Control Script Development and Execution" on page 343.


Naming Format of Update Source Data Files

When Forge processes update source data files, it is important to keep two things in mind concerning the
names of the data files:
• The update files should be processed by Forge in
order of their creation. The reason is that if a specific
record appears in more than one update file, you want
the latest update to be processed last, so that it will
override earlier versions when the Dgraph loads the
update record files.
• Forge reads the files in strict lexicographic order of
their filenames. Therefore, you should use a naming
scheme that ensures the processing of the update files
in chronological order of their creation (i.e., last
created, last processed).

For these reasons, it is strongly recommended that you use a timestamp format as the naming scheme for the
filenames. If necessary, use leading zeros to force the
desired numeric order. For example, if you have two files
named 9.xml and 10.xml, Forge will process 10.xml
before 9.xml; therefore, you must rename 9.xml to 09.xml
so that it is processed before 10.xml.
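For example, a naming scheme that embeds a zero-padded timestamp in each filename sorts identically in lexicographic and chronological order (the names below are illustrative):

updates-20050812-153000.txt.gz
updates-20050812-160000.txt.gz
updates-20050813-090000.txt.gz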

Note: For the sake of simplicity, the reference implementation uses source files that do not use a timestamp naming scheme.


Index Configuration

The index configuration files (such as the thesaurus and stop words files) cannot be updated using the partial update mechanism; only records and dimensions can be updated.

Record Specification Attribute

Developer Studio lets you configure how you want to refer to records from your application and during partial
updates. The RECORD_SPEC property attribute controls this
behavior.

The RECORD_SPEC attribute allows you to specify the property that you wish to use to identify specific records
in partial updates. For example, you may wish to use a
field such as UPC, SKU, or part number to identify a
record. You may set the RECORD_SPEC attribute’s value to
TRUE in any property where the values for the property
meet the following requirements:
• The value for this property on each record must be
unique.
• Each record should be assigned exactly one value for
this property.

Only one property in the project may have the RECORD_SPEC attribute set to TRUE.


All updates that add new records must include a valid value (that is, a value that fulfills the above criteria) for the RECORD_SPEC property.

For a partial updates deployment, you must have the RECORD_SPEC attribute of at least one property set to TRUE. If no property is marked as the RECORD_SPEC property, then the Navigation Engine will not process partial updates. If you are not doing partial updates, then you do not need to set the RECORD_SPEC to TRUE for any property.

To configure a RECORD_SPEC attribute for an existing property:
1. In the Project tab of Developer Studio, double-click
Properties.
2. From the Properties view, select a property and click
Edit. The Property editor is displayed.
3. In the General tab, check Use for Record Spec.
4. Click OK. The Properties view is redisplayed.
5. From the File menu, choose Save.

Navigation Engine Configuration


You must start the Navigation Engine with the Dgraph --updatedir flag to enable it to process partial updates. The flag takes as an argument the path of the directory into which completed partial update files (from Forge) are placed. The Navigation Engine does not automatically load update files placed into this directory. The control script must be configured to notify the running Navigation Engine to check for new updates.
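For example, a Dgraph startup command might look similar to the following (the port, paths, and wine db_prefix are illustrative and mirror the reference implementation):

dgraph --port 8000 --updatedir ..\data\partition0\dgraph_input\updates ..\data\partition0\dgraph_input\wine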

Update files are read at startup as well as when the Navigation Engine receives the update signal. Because the Navigation Engine looks for update files automatically at startup, recovery from server failure can be achieved easily by ensuring that the Navigation Engine is provided the same --updatedir directory on recovery as it had prior to failure. The Navigation Engine then reads the existing files in the directory, restoring itself to its pre-failure state.

The Navigation Engine reads update files in numeric-lexicographic order of their filenames (lexicographic order unless the filename contains leading zeros, which are ignored). Therefore, the control scripts should be configured to name update files in ascending numeric-lexicographic order over time to ensure that updates are processed in the order they are produced. For further details, see “Step 2: Apply a Timestamp to the Record File” on page 350.

Note: While the Dgraph reads files in numeric-lexicographic order, Forge reads them in strict lexicographic order. Keep this difference in mind when naming files.
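The difference is easy to demonstrate; a minimal Perl sketch using the filenames from the earlier example (Perl's default sort is strict lexicographic, like Forge's):

# Strict lexicographic order: 9.xml sorts after both 09.xml and 10.xml.
my @files = sort ("9.xml", "10.xml", "09.xml");
print "@files\n";    # prints: 09.xml 10.xml 9.xml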

The Navigation Engine processes updates on a record-by-record basis. Updates fail or succeed entirely at the record level. This means that a record update that fails (for example, because it attempts to assign an unknown dimension value to the record) leaves the record unchanged. Property value changes or dimension value changes in the failed record update have no effect. Previous and future record updates and dimension updates are not affected.

During development, you can use the --updateverbose flag to specify that the Navigation Engine should output verbose messages while processing updates. However, the flag should not be used on production systems because it will negatively impact update performance.

Dgidx Flags

In most cases, the Dgidx --keepcats flag should be used to indicate that unused dimension values (that is, dimension values that have been assigned to no actual records) should be passed through to the Navigation Engine. By default, Dgidx strips out such dimension values. Without such values in the Navigation Engine, record updates that assign a previously unused dimension value to a record will fail with an error.
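For example, a Dgidx invocation with the flag might look similar to the following (the input and output prefixes are illustrative and mirror the reference implementation's brick settings):

dgidx --keepcats ..\data\partition0\forge_output\wine ..\data\partition0\dgidx_output\wine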

Control Script Development and Execution


A reference template control script (named update_index.script) is included in the Endeca sample_updates_data reference implementation. The control script uses two high-level Script bricks, baseline_update and partial_update, to implement the baseline and partial update processes. These processes are described in the sections below.


Note: The sample control script assumes a deployment using only one Dgraph. If you have an Agraph deployment, see “Partial Updates in Agraph Implementations” on page 355.

Directory Structure for Updates

The reference control script uses the following directory structure for handling data flow through the system:

data
   forge_input
   incoming
      updates
   partition0
      dgidx_output
      dgraph_input
         updates
      forge_output
      state

The purposes of these directories are as follows:
• data - Base directory for all other subdirectories. All files and processes related to the data exist and work in or under this directory.
• data\forge_input - Contains the Developer Studio project file (sample_updates.esp), the baseline update pipeline file (pipeline.epx), the partial update pipeline file (partial_pipeline.epx), and the index configuration files (*.xml).
• data\incoming - Contains source data (in the wine_data.txt.gz file) for a baseline update. On a production site, the files in this directory may have been created by a data extraction process on the customer’s database or may be picked up from another FTP server.
• data\incoming\updates - Contains source data for a partial update. The reference implementation ships with three gzipped files: adds.txt.gz (records to be added), deletes.txt.gz (records to be deleted), and updates.txt.gz (records to be updated).
• data\partition0 - Contains files generated by the Forge, Dgidx, and Dgraph programs.
• data\partition0\dgidx_output - Contains indices that have been processed by Dgidx and output in a format that can be read by the Navigation Engine.
• data\partition0\dgraph_input - Contains data that is read by the Navigation Engine on startup. The data includes the Dgidx output indices, spelling correction dictionaries, thesaurus files, and language-encoding files.
• data\partition0\dgraph_input\updates - Contains partial updates that have been processed by Forge. The Navigation Engine reads these updates when it is restarted with the Dgraph --updatedir flag pointing to this directory.
• data\partition0\forge_output - Contains data that has been processed by Forge and is ready for indexing.
• data\partition0\state - Contains any state information (such as auto-generated dimension IDs) that must be saved between Forge runs.


All references to directory names in the following text are relative to the data directory. All references to directory names in example or default brick definitions are relative to the parent of the data directory.

Running the Baseline Updates Script

In the control script, the high-level Script brick, baseline_update, implements the baseline update procedure by making calls to other script bricks:

baseline_update : Script
   clear_updates
   baseline_forge
   baseline_dgidx
   if dgraph.running
      dgraph.stop
   baseline_fetch
   dgraph.start

You run a baseline update with a command line similar to this Windows example (assuming you are in the sample_updates_data\etc directory):

runcommand update_index.script baseline_update

The baseline update process is as follows:

Step 1: Delete Old Updates

All files in the data\partition0\dgraph_input\updates directory are deleted by the clear_updates brick.


Step 2: Run Forge

The baseline_forge brick runs Forge on the source data, using these default settings:

pipeline = ..\data\forge_input\pipeline.epx
forge_options = -vw

You will want to modify the forge_options setting so that it is better suited for your application.

Step 3: Run Dgidx

The baseline_dgidx brick runs Dgidx with these settings:

input = ..\data\partition0\forge_output\wine
output = ..\data\partition0\dgidx_output\wine
dgidx_options = --keepcats

The --keepcats flag specifies that Dgidx should pass unused dimension values to the Navigation Engine instead of stripping them out.

You will probably want to modify the options passed to Dgidx. You will also want to change:
• The input setting so that it points to the location where your pipeline writes out the Forge output data.
• The output setting so that it points to the location where Dgidx should write out data for the Navigation Engine (make sure that the location ends with the prefix that you want to use for the Dgidx output).

Step 4: Stop the Navigation Engine

The dgraph.stop command stops the Navigation Engine.


Step 5: Move the Index Files to the Dgraph Directory

The baseline_fetch brick moves index files from the data\partition0\dgidx_output directory to the data\partition0\dgraph_input directory, where they are used by the Navigation Engine on startup.

Be sure to change the paths in the source and dest settings for your implementation.

Step 6: Start the Navigation Engine

The Navigation Engine is started with the dgraph brick, using these settings:

working_dir = $(sample_updates_data_dir)\logs
input = ..\data\partition0\dgraph_input\wine
port = $(dgraph_port)
dgraph_options = --updatedir ..\data\partition0\dgraph_input\updates

You may want to use the --updateverbose flag during development, but make sure you remove it for production. You may want to add other options relevant for your application. See the Endeca Administrator’s Guide for information about the available options.

At this point, the Navigation Engine should be running correctly with the latest baseline and partial update data.


Running the Partial Updates Script

In the control script, the high-level Script brick, partial_update, implements the partial update procedure as follows:

partial_update : Script
   update_forge
   apply_timestamp
   if dgraph.running
      dgraph.update

You run a partial update with a command line similar to this Windows example (assuming you are in the sample_updates_data\etc directory):

runcommand update_index.script partial_update

The three major steps of the partial_update Script brick are described in the following sections.

Step 1: Run Forge on the New Source Data

The update_forge brick runs Forge with the partial update pipeline and new source data, using these default settings:

pipeline = ..\data\forge_input\partial_pipeline.epx
forge_options = -vw

Because the record adapter uses the Multi Files setting, Forge can read data from multiple input files. (The reference implementation uses three input files.)


You will want to modify the forge_options setting so that it is better suited for your application. Modify the relative paths above as appropriate for your implementation.

When Forge finishes, it produces one or more update record files and stores them in the location specified by the pipeline's update adapter. These files contain XML definitions of how the updated records should be treated by the Navigation Engine (for example, which records to delete or add).

The record files use this naming format:

db_prefix-sgmtn.records.xml

For example, the update_forge brick in the reference implementation produces the wine-sgmt0.records.xml file in the data\partition0\dgraph_input\updates directory.

The -sgmt0 portion of the filename is generated when you roll over by size (i.e., the update adapter contains the ROLLOVER element, as in the reference partial updates pipeline). Forge splits the output into segment files, each of which is no larger than 2GB.

It is important that you know the names of the record files, because they will have to be timestamped, as described in the next section.

Step 2: Apply a Timestamp to the Record File

It is possible to generate multiple partial updates before the next baseline update, at which time all the partial update files are deleted. Therefore, each record file must be timestamped to ensure that the Navigation Engine does not upload a partial update more than once.

The apply_timestamp brick renames the db_prefix-sgmtn.records.xml files by appending a timestamp string to the filename. The resulting filename will use this format:

originalfilename_YYYY.MM.DD.HH.NN.SS

where YYYY is the four-digit year, MM is the two-digit month, DD is the two-digit day, HH is the two-digit hour, NN is the two-digit minute, and SS is the two-digit second. For example:

wine-sgmt0.records.xml_2005.06.07.16.14.08
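The following minimal Perl sketch shows the kind of renaming the brick performs (the path is illustrative and, like the reference brick, it assumes a single record file):

use POSIX qw(strftime);
# Append a timestamp so the Navigation Engine can distinguish new
# update files from files it has already uploaded.
my $file  = "../data/partition0/dgraph_input/updates/wine-sgmt0.records.xml";
my $stamp = strftime("%Y.%m.%d.%H.%M.%S", localtime);
rename($file, "${file}_$stamp") or die "Could not rename $file: $!";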

A running Navigation Engine keeps track of the last timestamped file it uploaded. When it next checks the updates directory, it will only upload partial update files that carry a timestamp later than the last uploaded file.

Note that the apply_timestamp brick in the reference script assumes that only one record file will be renamed. If your implementation generates multiple record files, you will need to change this brick to add the additional renaming statements.

Step 3: Update the Navigation Engine

The dgraph.update command causes the running Navigation Engine to perform the following actions:


1. Go offline while it processes the updates (that is, it stops accepting user queries and temporarily closes its listening port).
2. Check the updates directory (whose path is specified with the --updatedir flag).
3. Upload any partial update with a timestamp later than the last currently-loaded partial update.
4. Go back online after it has processed all updates.

At this point, the Navigation Engine should be running correctly with the latest baseline and partial update data.

Adding Other Bricks

You can modify the update_index.script and add other bricks that are necessary for your implementation. For example, you can add a brick that fetches partial updates from an FTP server. In other installations, the partial updates may be dropped onto the indexing server, directly into the incoming\updates directory.

The following is an example of a Fetch brick:

fetch_updates : Shell
   perl bin/fetch.pl \
      --ftp_ip ftp.somecompany.com \
      --ftp_user anonymous \
      --ftp_pass somecompany.com \
      --fetch_dir incoming/ \
      --fetch_file_regexp "endeca_update_200407(\d+)\.txt" \
      --exclude_file etc/exclude_files \
      --dest_dir data/incoming/updates


The flags in the example are:
• --ftp_ip is the IP address of the FTP server.
• --ftp_user is the username for logging into the FTP server.
• --ftp_pass is the password for the username.
• --fetch_dir is the directory on the FTP server that contains the update files to retrieve.
• --fetch_file_regexp is the regular expression that a filename must match for the file to be considered a partial update file.
• --exclude_file points to a file that is maintained automatically by fetch.pl. It is a list of all the files that have already been retrieved from the FTP server and should not be retrieved again.
• --dest_dir is the directory into which the fetched files will be dropped.

See the Endeca Administrator’s Guide for details on writing and modifying bricks.

URL Update Command Parameters


The dgraph.update command, when issued from a command line, causes the Navigation Engine to check the updates directory and upload all partial updates that have not yet been uploaded.


You can also issue the same command to the Navigation Engine by using the following URL command syntax in your Web browser:

http://hostname:dgraphport/admin?op=update

For example:

http://localhost:8000/admin?op=update

If you are using HTTPS mode, use https in the URL.

On receiving the update command, the Navigation Engine immediately goes offline (that is, closes its port and stops accepting requests) and checks for and loads new update files in its update directory. After the Navigation Engine has processed all updates, it returns to its normal online mode of operation.

If you want the Navigation Engine to stay online during the update process, use the offline=false option of the command, as in this example:

http://localhost:8000/admin?op=update&offline=false

In this online mode, the Navigation Engine will continue to accept queries.

Note: offline=true is the default and is the same as using the update command by itself.

To see the update history, use the updatehistory URL command, similar to the following example:

http://localhost:8000/admin?op=updatehistory


This command will show a list of the update files that the
Navigation Engine has recently processed.
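From a control script or scheduled job, the same admin commands can be issued with any HTTP client. A minimal Perl sketch, assuming the LWP::Simple module is available (the host and port are illustrative):

use LWP::Simple;
# Tell the running Navigation Engine to check its update directory.
my $response = get("http://localhost:8000/admin?op=update");
die "Update request failed\n" unless defined $response;
print $response;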

Partial Updates in Agraph Implementations


Implementing partial updates in Agraph implementations is similar to single-Dgraph deployments, with the important differences listed below.

Choosing a Distribution Strategy

The update record files produced by Forge contain XML definitions of the updated records, including information on how the records should be treated by the Dgraphs. For example, records to be deleted are flagged with a RECORD_DELETE element in the file.

New records (i.e., ones that use an ADD or ADD_OR_REPLACE action for the UPDATE_RECORD expression) are defined with a RECORD_ADD element that contains the partition number (in a PARTITION attribute) to which the record is assigned. Note that only ADD records are assigned partition numbers.


For example, this XML snippet shows a RECORD_ADD element that assigns a new record to Agraph partition 1:

<UPDATE>
  <UPD_UNIT>
    <RECORD_ADD PARTITION="1">
      <PROP NAME="P_WineID">
        <PVAL>99005</PVAL>
      </PROP>
      <PROP NAME="P_Year">
        <PVAL>1992</PVAL>
      </PROP>
      ...
    </RECORD_ADD>
  </UPD_UNIT>
  ...
</UPDATE>

How the partition number is assigned to an ADD record depends on which distribution strategy you have chosen to implement:
• Random distribution, where you let Forge decide which partition gets the new record. That is, Forge uses the configured partition property (typically the record spec or rollup property) as a basis for assigning the partition number to the PARTITION attribute.
• Deterministic distribution, where you control the assignment of records to specific partitions. That is, you tell Forge which partition number it should assign to the PARTITION attribute.

The main advantage of random distribution is that you do not need to know exactly where the records should go in order for updates to be processed correctly. This scheme also simplifies operations because the same update record file is sent to all partitions, so there is less conditional logic in the control script.

Which distribution strategy you choose depends on the needs of your implementation. In general, Endeca recommends that the distribution strategy for partial updates be the same as for baseline updates.

How the Agraph Partitions Handle Updates

Regardless of which distribution strategy you are using, the Agraph partitions (i.e., the individual Dgraphs) handle the record update requests as follows:
• If a DELETE, REPLACE, or UPDATE action request is sent to all partitions of the Agraph, only the partition that contains the record will actually delete, replace, or update the record. The other partitions will issue a warning message, but continue to function as before.
• If an ADD action request is sent to all partitions, only the designated partition (as specified in the PARTITION attribute of the record file) will add the record. The other partitions will ignore the request.

Because any partition knows how to deal with any update request, this architecture allows you to send the record files to all partitions without having to worry about which partition is the correct one.


Use of Record Spec

The record specification in an Agraph deployment requires that the record spec property be unique across all records, across all Navigation Engines.

Naming Convention for Source Data Files

Whether you are using a random or deterministic distribution strategy, it is strongly recommended that you use a timestamp format as the naming scheme for the update source data files. This format, which is explained in “Naming Format of Update Source Data Files” on page 339, ensures that Forge processes the files in the proper order of their creation.

For both strategies, a Perl expression in the record manipulator (described in a later section) can use the timestamp part of the filename for the name of the output record file.

Random Distribution Format

For a random distribution strategy, a suggested format is:

YYYYMMDDHHNNSS.ext

where YYYY is the four-digit year, MM is the two-digit month, DD is the two-digit day, HH is the two-digit hour, NN is the two-digit minute, and SS is the two-digit second, as in this example:

20051023161408.txt


These files may contain ADD records that will be distributed randomly to the Agraph partitions.

Deterministic Distribution Format

For a deterministic distribution strategy, a suggested format is:

YYYYMMDDHHNNSS-partX.ext

where X is the number of the Agraph partition for which these records are intended. For example, records in this source data file are intended for partition 3:

20050717151408-part3.txt

The Perl expression in the record manipulator parses the filename for the partition number and uses it to assign ADD records to that partition.

The expression also uses the timestamp and -partX information for the name of the output record file. For example, the above input filename will generate this output record file:

20050717151408-part3.records.xml

Keep in mind that if you pre-partition your baseline source files, you should also pre-partition the records to be added. That is, all ADD records for the partition 0 Dgraph should be in one file, ADD records for the partition 1 Dgraph should be in a second file, and so on.


Configuring the Partial Updates Pipeline

This section describes how to configure the partial updates pipeline for either distribution strategy.

IMPORTANT: The procedures described below require that you hand-edit the pipeline files with a text editor. After you edit these files, do not open the project in Developer Studio, because it will overwrite the settings of the update adapter.

Configuring the Record Adapter

The record adapter should have the following settings:
• Set the MULTI attribute to TRUE so that Forge can read multiple input data files.
• Set the URL attribute to the path of the incoming directory, with the filename being a pattern (such as ../incoming/updates/*.txt).
• In order to use the naming format of the input file for the record file's name, set the MULTI_PROP_NAME attribute to a value of FILENAME.

These settings apply to both random and deterministic record adapters.


The following is an example of a record adapter for the partial updates pipeline:

<RECORD_ADAPTER
  NAME="LoadUpdateData"
  URL="../incoming/updates/*.txt"
  FORMAT="DELIMITED"
  COL_DELIMITER="|"
  ROW_DELIMITER="|\n"
  DIRECTION="INPUT"
  FILTER_EMPTY_PROPS="TRUE"
  FRC_PVAL_IDX="TRUE"
  MULTI="TRUE"
  MULTI_PROP_NAME="FILENAME"
  REQUIRE_DATA="FALSE"/>

Note that the FILENAME setting for the MULTI_PROP_NAME attribute will be processed by both the update adapter and the Perl expression in the record manipulator.

Configuring the Record Manipulator

For both random and deterministic pipelines, the record manipulator should contain the same IF and UPDATE_RECORD expressions that are documented for the single-Dgraph implementation (see “Creating the Record Manipulator” on page 327).

In addition, you can add a Perl expression that parses the name of each input file (up to the file extension) and uses it to name the output record file (which has a records.xml extension). The exact Perl code in the expression depends on the distribution strategy.


Perl Expression for Random Distribution

For a random distribution pipeline, the following example of a Perl expression can be inserted into the record manipulator:

<EXPRESSION TYPE="VOID" NAME="PERL">
  <COMMENT>
    This Perl expression handles taking the source input filename and
    outputting a record file with the same naming format.
    It assumes filenames of the format: timestamp.ext
  </COMMENT>
  <EXPRBODY><![CDATA[
    # Translate filename of input to filename of output.
    my @props = get_props_by_name("FILENAME");
    # Filename is everything after the last slash
    if ($props[0]->value() =~ /[\/\\](\w+)\.[^\.]+$/) {
      my $filename = $1;
      $props[0]->value($filename);
      replace_prop("FILENAME", 0, $props[0]);
    } else {
      die("Could not parse filename: " . $props[0]->value());
    }
  ]]></EXPRBODY>
</EXPRESSION>

The expression will generate output record files with names similar to this example:

20050717180812.records.xml

Note that this sample expression is for use on Windows machines, where “\” is the directory separator. For UNIX, change the regex from:

/[\/\\](\w+)\.[^\.]+$/

to:

/\/(\w+)\.[^\.]+$/


Keep in mind that you will have to change the Perl regex
code if you use another naming convention for the
source input files.

Perl Expression for Deterministic Distribution

The Perl expression for the record manipulator in a deterministic distribution pipeline is similar to the random distribution example, with the addition of code that extracts the partition ID (the partX piece) from the input filename and stores it in the X_PartitionNum property. The partition ID will be assigned by Forge to that record in the record file (via the PARTITION attribute of the RECORD_ADD element).


The Perl expression is as follows:

<EXPRESSION TYPE="VOID" NAME="PERL">
  <COMMENT>
    This Perl expression handles taking the source input filename and
    determining the appropriate partition. It assumes filenames of the
    format: timestamp-partN.ext
    The expression extracts the N in the "partN" piece.
  </COMMENT>
  <EXPRBODY><![CDATA[
    # Translate filename of input to filename of output.
    my @props = get_props_by_name("FILENAME");
    # Filename is everything after the last slash
    if ($props[0]->value() =~ /[\/\\](\w+\-part\d+)\.[^\.]+$/) {
      my $filename = $1;
      $props[0]->value($filename);
      replace_prop("FILENAME", 0, $props[0]);
      # Extract the partition ID from the filename to determine
      # the partition number for the record.
      $filename =~ /part(\d+)$/;
      my $part_num = $1;
      # X_PartitionNum specifies the target partition for
      # this particular record.
      my $part_prop = new Zinc::PropVal("X_PartitionNum", $part_num);
      add_props($part_prop);
    } else {
      die("Could not parse filename: " . $props[0]->value());
    }
  ]]></EXPRBODY>
</EXPRESSION>

As in the previous example, this sample code is for Windows machines and the regex code must be changed for UNIX.


Configuring the Update Adapter

The configuration of the update adapter is similar to that in single-Dgraph implementations. For both random and deterministic distribution, the update adapter should have the following attribute settings:
• OUTPUT_URL – Set to the path of the directory into which the output record files are written (in the examples below, the dgraph_input updates directory).
• OUTPUT_PREFIX – Set to an empty string, because the output filename will begin with a timestamp format.
• MULTI – Set to TRUE so that Forge can read multiple input data files.
• MULTI_PROP_NAME – Set to a value of FILENAME.

The recommended settings for the ROLLOVER element depend on the type of distribution strategy.

ROLLOVER Element for Random Distribution

In a random distribution pipeline, the following settings are recommended for the ROLLOVER element:
• NUM_IDX – Although this attribute normally sets the number of Agraph partitions, it is recommended that you use the Forge --numPartitions flag in the control script to actually set the number of partitions. Therefore, leave the field blank or use any number.
• PROP_NAME – Set to the partition property, which is the record spec or rollup property by which records are assigned to each partition. An empty field means that Forge will use a round-robin strategy to assign partitions to records.
• PROP_TYPE – Set to the partition property’s type (typically, ALPHA).
• REMOVE_PROP – Typically set to FALSE.
• CUTOFF – Set to the default value of 2000000000.

The following is an example of an update adapter using the above settings:

<UPDATE_ADAPTER NAME="UpdateAdapter"
  OUTPUT_URL="../partition0/dgraph_input/updates"
  OUTPUT_PREFIX=""
  MULTI="TRUE"
  MULTI_PROP_NAME="FILENAME">
  <RECORD_SOURCE>UpdateManipulator</RECORD_SOURCE>
  <ROLLOVER NAME="RECORD"
    NUM_IDX=""
    PROP_NAME="P_WineID"
    PROP_TYPE="ALPHA"
    REMOVE_PROP="FALSE"
    CUTOFF="2000000000"/>
</UPDATE_ADAPTER>

ROLLOVER Element for Deterministic Distribution

In a deterministic distribution pipeline, the following settings are recommended for the ROLLOVER element:
• NUM_IDX – Leave blank or use any number, as it is recommended you use the Forge --numPartitions flag in the control script to actually set the number of Agraph partitions.
• PROP_NAME – Set to the property (X_PartitionNum, for example) created by the Perl expression in the record manipulator.
• PROP_TYPE – Set to INTEGER (because it will hold the partition number of the ADD record).
• REMOVE_PROP – Set to TRUE (because the PROP_NAME property should not be in the output).
• CUTOFF – Set to the default value of 2000000000.

The following is an example of an update adapter using the above settings:

<UPDATE_ADAPTER NAME="UpdateAdapter"
  OUTPUT_URL="../partition0/dgraph_input/updates"
  OUTPUT_PREFIX=""
  MULTI="TRUE"
  MULTI_PROP_NAME="FILENAME">
  <RECORD_SOURCE>UpdateManipulator</RECORD_SOURCE>
  <ROLLOVER NAME="RECORD"
    NUM_IDX=""
    PROP_NAME="X_PartitionNum"
    PROP_TYPE="INTEGER"
    REMOVE_PROP="TRUE"
    CUTOFF="2000000000"/>
</UPDATE_ADAPTER>

Control Script for Agraph Updates

The reference control script implements partial updates for a single-machine, single-Dgraph deployment only. For an Agraph deployment, you can modify the control script to run Forge on a single machine and distribute the Forge output to all the other machines. Then, you notify each Dgraph in your deployment to check for new updates.


Forge Partial Updates Brick

The Forge brick that processes the partial update source data is similar to the brick described in “Step 1: Run Forge on the New Source Data” on page 349. The difference is that we recommend use of the Forge --numPartitions flag to specify the number of Agraph partitions, as in this brick example:

# Runs Forge on the update source data.
update_forge : Forge
   working_machine = indexer
   pipeline = ..\data\forge_input\partial_pipeline.epx
   forge_options = -vw --numPartitions $(numPartitions)

Using the --numPartitions flag (which overrides the NUM_IDX setting in the update adapter) lets you easily add or subtract Agraph partitions from within the control script. You will have to set up a global variable (named numPartitions in the example above) that stores the number of partitions.

Distributing the Forge Output to the Dgraphs

For a random distribution strategy, partial updates in Agraph implementations do not impose any special distribution requirements. Both dimension modifications (i.e., dimension value additions) and record modifications (updates, deletes, replaces, and adds) should be sent to all Dgraphs in the deployment. Each Dgraph should then be notified to check for new updates. If a Dgraph cannot handle data that is associated with another Dgraph, it will simply log a warning but will otherwise continue working. Note that the Agraph process itself does not process updates.

For a deterministic distribution strategy, the distribution of the record files depends on the use of auto-generated dimensions:
• If you are using auto-generated dimensions, distribute all the record files to all the Dgraphs.
• If you are not using auto-generated dimensions, you can distribute each record file to its specific Dgraph.
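The control script can perform the distribution with an added brick. For example, a hypothetical Shell brick (the host name and destination path are illustrative) could copy the record files from the indexing machine to a second Dgraph machine's updates directory:

distribute_updates : Shell
   scp data/partition0/dgraph_input/updates/*.records.xml* \
      dgraph2.somecompany.com:/usr/local/endeca/data/partition1/dgraph_input/updates/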

To make sure that there is no interruption in servicing navigation requests, you may configure your Dgraphs to check for new updates at different times. Or you can have smaller subgroups read in updates simultaneously (for example, three machines at a time in a six-machine implementation).

Chapter 18
Using the Agraph

Implementing the Agraph allows application users to search and navigate very large data sets. An Agraph implementation enables scalable search and navigation by partitioning a very large data set into multiple Dgraphs running in parallel. The Agraph sends an application user’s query to each Dgraph, then coordinates the results from each, and sends a single reply back to the application user.

What You Should Know First


This document assumes you are familiar with the basics of
the Endeca Navigation Engine as described in the Endeca
Developer’s Guide and that you can create, provision, and
run an Endeca implementation using one Dgraph.

Overview of Distributed Query Processing


You can scale the Navigation Engine to accommodate a
large data set by distributing the Navigation Engine across
multiple processors.

In this type of distributed environment, you configure a Developer Studio project to partition your Endeca records into subsets of records—as many partitioned subsets as you need to process all your source data. Each subset of Endeca records is typically referred to as a partition. Each processor runs an instance of the Dgraph program by loading one partition and maintaining a portion of the total Navigation Engine indices in its main memory.

Such a distributed configuration requires an additional program called the Agraph (Aggregated Navigation Engine). The Agraph program receives requests from clients, forwards the requests to the distributed Navigation Engines, and coordinates the results. An Agraph can coordinate as many child Dgraphs as are necessary for your data set.

Agraph Query Processing

From the perspective of the Endeca Presentation API, the Agraph program behaves identically to a Dgraph program. When an Aggregated Navigation Engine receives a request, it sends the request to all of the distributed Navigation Engines. Each Navigation Engine processes the request and returns its results to the Aggregated Navigation Engine, which aggregates the results into a single response and returns that response to the client, via the Endeca Presentation API.


In the following illustration, one Agraph coordinates three Dgraphs.

[Figure: Client requests and responses pass through the Endeca Presentation API to the Aggregated Endeca Navigation Engine (Agraph), which forwards each request to three Endeca Navigation Engines (Dgraph 1, Dgraph 2, and Dgraph 3).]

Data Foundry Processing

The previous section described how an Agraph functions from the perspective of client requests and Agraph responses that are passed through the Endeca API. This section describes the offline processing that Data Foundry components perform to create Agraph partitions.

If you want a full explanation about how Data Foundry processing works for a single Dgraph implementation, see “Data Foundry Components” in the Endeca Developer’s Guide. To summarize a portion of that section, the Data Foundry architecture to process source data for a single partition, running in a single Navigation Engine, looks like this:

[Figure: Source data flows into Forge (which imports, standardizes, tags, and exports), producing Endeca records; Dgidx creates indices from the records; and the Dgraph loads the ENE indices and receives and replies to queries.]

In an Agraph implementation, the Data Foundry processing is very similar; however, multiple Data Foundry components, namely Dgidx, Agidx, and the Dgraph, run in parallel to process each partition’s data. The architecture to process an Agraph implementation with three partitions looks like this:

[Figure: Forge produces Endeca records for partitions 1, 2, and 3. A separate Dgidx process creates each partition's ENE indices, which are loaded by the corresponding Dgraph (Dgraph 1, 2, and 3). The Agidx program creates combined ENE indices, which are loaded by the Agraph.]


When you use either Developer Studio or Web Studio to run a project with three partitions, as shown above, the following occurs:
1. Forge reads in the source data. (Assume Forge has access to source data as shown in the diagram with a single Navigation Engine.)
2. Forge enables parallel processing by producing Endeca records in any number of partitions. You specify the number of partitions in the Agraph tab of the Indexer adapter or the Update adapter. This is described later in “Modifying a Project for Agraph Partitions”.
3. The Data Foundry starts a Dgidx process for each partition that Forge created. The Dgidx processes can run on one or multiple machines, depending on the desired allocation of computation resources.
4. Each Dgidx process creates a set of Navigation Engine indices for its corresponding partition.
5. After all the Dgidx processes complete, the Agidx program runs to create an index specific to the Agraph. This index contains information about each partition’s indices.
6. Each Navigation Engine (Dgraph) starts and loads the index for its corresponding partition.
7. After all Dgraphs start, the Agraph starts and loads its index, which contains information about each child Dgraph’s index.


Guidance about When to Use an Agraph

An Agraph implementation is necessary when you have a set of Endeca records large enough that a single Dgraph process would exceed the maximum process size limits of the machine’s operating system. For details about the Dgraph and process size, see “Endeca Navigation Engine Basics” in the Endeca Developer’s Guide.

One approximate test to gauge whether an Agraph may be necessary is to check how many records of source data you have before Data Foundry processing. One million or more records of source data is sufficient to suggest you may want to use an Agraph in your Endeca implementation.

Implementation Overview

The rest of this chapter describes the following tasks to implement an Agraph:
• Modify the project for Agraph partitions.
• Provision the Agraph implementation.
• Run the Agraph implementation.

Modifying the Project for Agraph Partitions


The first step in implementing an Agraph is to configure the Agraph tab of the project’s Indexer adapter or, if you are working on a partial update pipeline, the Agraph tab of the Update adapter. The Agraph tab serves the following functions:
• Enables Agraph support
• Specifies the number of Agraph partitions (Dgraphs) in your implementation
• Identifies an optional partition property

The partition property field identifies the property by which records are assigned to each partition. This field is read-only. The partition property field can display one of three possibilities:
• A rollup property—If you have a rollup property enabled in your project, the rollup property also functions as the partition property. Forge assigns all records that can be aggregated by the rollup property to the same partition.
For example, suppose “Year” is the rollup property and “Year” can have any number of rollup values such as 2002, 2003, 2004, and so on. Forge assigns all records tagged with a particular year’s value to the same partition. This means that all records tagged with 2002 are in the same partition; all records tagged with 2003 are in the same partition, and so on.
• A record spec property—If you do not have a rollup property but do have a record spec property enabled in your project, the record spec property functions as the partition property. Records are assigned evenly across all partitions according to the record spec property. This allocation provides equally sized partitions.


• An empty field (no property displays)—If you have not enabled a rollup property or record spec property, the partition property field is empty. With no partition property, Forge assigns records to each partition according to a round-robin strategy. This strategy also provides equally sized partitions.

To modify the project for Agraph partitions:
1. In the Project tab of Developer Studio, double-click the Pipeline Diagram.
2. Double-click the Indexer adapter, or if you are modifying a partial update pipeline, double-click the Update adapter.
3. Select the Agraph tab.
4. Check “Enable Agraph support”.
5. In “Number of Agraph partitions”, specify the number of child Dgraphs that the Agraph controls. In an Agraph implementation, this must be a value of 2 or more. You provision each of the partitions in “Provisioning an Agraph Implementation”.

Note: If you want to change the partition property, open the Properties view and modify which properties are enabled for rollup and record spec.

Provisioning an Agraph Implementation


In addition to modifying your project to support Dgraph partitioning, you must also provision your Endeca implementation using Web Studio. Provisioning informs the Endeca Manager about the systems allocated to run the Forge, Dgidx, Agidx, Dgraph, and Agraph programs.

In a production environment, the Agraph and each Dgraph should run on its own processor. Your servers may have one or more processors. For example, you can set up a three Dgraph/one Agraph environment on a quad processor server. In a development environment, where optimal performance is less critical, the Agraph can run on one of the processors running a Dgraph.

An Agraph implementation requires a minimum of two replicas (mirrors) to provide full application uptime during partial updates. The second replica is necessary because one replica’s Agraph goes offline during a partial update. The second replica can continue to receive and reply to user requests during the downtime of the first replica.

To provision an Agraph implementation:
1. Open Internet Explorer, start Web Studio, and log in. If you have any questions about how to use Web Studio, see the Endeca Tools Guide. (This procedure assumes you know how to provision an Endeca implementation and focuses on the issues specific to provisioning an Agraph implementation.)
2. Select the Provisioning page from the Administration section of the navigation menu.
3. In the Hosts section, add each host that runs a Dgraph or Agraph, including host machines that run Dgraph or Agraph replicas.


For example, the set of hosts shown in the following graphic provisions a system similar to the one shown in “Agraph Query Processing” on page 372. Namely, there are hosts to support a total of six Dgraphs and two Agraphs. (There are two replicas in the implementation and each replica runs one Agraph and three Dgraphs.)
4. In the Dgidx section, add as many Dgidx entries as you have Dgraph partitions. In other words, the number of Dgidx entries must correspond to the value of “Number of Agraph partitions” in your Indexer adapter.
To continue the previous example, this illustration shows three Dgidx entries that correspond to three Dgraph partitions.


5. In the Navigation Engines section, add as many Navigation Engines as you have Agraph partitions and replicas for those partitions. (You specified the number of Agraph partitions in “Modifying the Project for Agraph Partitions.”) For example, if you have three Dgraphs and two replicas, you need a total of six Navigation Engines.
To continue the previous example, this illustration shows six total Navigation Engines (three per replica).


6. In the Aggregated Navigation Engines section, add as many Agraphs as you need to support your desired number of Dgraphs and the required number of replicas. An Agraph can support any number of Dgraphs.
To continue the previous example, this illustration shows two Aggregated Navigation Engines, one for each replica.
7. In the Options section, specify the number of replicas and Dgraph partitions.
To continue the previous example, this illustration shows two replicas with three Dgraphs per replica.

8. Click Save Changes and go on to “Running an Agraph Implementation”, next.


Running an Agraph Implementation


After saving your provisioning changes, you can view the components on the Administration page and start a baseline update. The baseline update processes your source data and runs all the components shown on the Administration page, including starting all Dgraphs and the Agraph for each replica.

To run the Agraph:
1. Select the Administration page from the navigation menu of Web Studio.
2. From the System Operation section, click the Start button for Baseline Update.
To continue the previous example, this illustration shows a running Agraph implementation. Each of the two replicas contains three Dgraphs managed by one Agraph.


Agraph Presentation API Development


No additional development is needed in the Presentation
API to support the Agraph. The Agraph can be treated
just like a Dgraph.

Note, however, that when you set a connection to the Navigation Engine, your application should connect to the Agraph, not one of its child Dgraphs. For example, in Java, this connection might look like the following:

// Set connection to Agraph
ENEConnection nec = new HttpENEConnection("engine.endeca.com", "9001");

where engine.endeca.com is the Agraph host and 9001 is the Agraph port.

Agraph Limitations
The following features cannot be used with the Agraph:
• Enabling the “More…” option for dimension value
ranking.
• Relevance ranking for dimension search is not
supported in an Agraph. In addition, the Static
relevance ranking module is not supported in an
Agraph. See “Using Relevance Ranking” in the Endeca
Developer’s Guide for information on configuring
Dgraphs in an Agraph deployment to support
relevance ranking for record search.
• If you are aggregating records in your application, you must specify a single property by which to aggregate the records. Specify the property by enabling the Rollup check box on the General tab of the Property editor.


Agraph Performance
Ideally, the Agraph speeds up both Dgidx indexing and Navigation Engine request processing by a factor of the number of partitions. The indexing speed-up is close to this ideal, assuming that the Dgidx processes do not have to compete for computation or disk resources.

Assuming each Dgraph is running on its own processor as recommended, the Navigation Engine achieves close to the ideal speed-up for handling expensive requests, especially analytics requests. For smaller requests, the overhead of the Agraph tends to nullify the benefits of processing a query in parallel.

Control Script Environment Considerations


The following sections provide detail about using the
Agraph in a control script environment.

Arranging Partitions and Files

When running your implementation with a control script, you have to arrange data files so that they are available to the various Dgidx processes. In particular, each Dgidx process needs to access its corresponding partition of the records, as well as the configuration files that are common to all of the processes. If the Dgidx processes are to be executed on different machines, then the control script must distribute files across machines.


Agraph and Dynamic Business Rules

In a control script environment, Endeca recommends that you shut down the Agraph during dynamic business rule updates. (In a tools environment, the Manager automatically shuts down the Agraph during any type of update process and then brings the Agraph back online after the update completes.)

If you do not shut down the Agraph during the update, end-users will not receive a response to requests made during this short update time and the Agraph issues a fatal error similar to the following:

[Thu Jun 24 16:26:29 2004] [Fatal] (merchbinsorter.cpp::276) - Dgraph 1 has fewer rules fired.

If you are using dynamic business rules with the Agraph, the --keepcats flag must be used with Dgidx. For more information, see “Implementing Merchandising and Content Spotlighting” on page 257.

Chapter 19
Using Internationalized Data

The Endeca suite of products supports the Unicode Standard, Version 4.0, which allows the Endeca Navigation Engine to process and serve data in virtually any of the world’s languages and scripts. The Endeca components can be configured to allow processing of such data when provided in a native encoding system.

This chapter provides a single source of information for implementation details that you need to know about when building a solution that includes internationalized data. The chapter makes the following assumptions:
• If working with Chinese, you are familiar with which native encodings (Big5, GBK, etc.) correspond to which character sets (Traditional versus Simplified).
• If working with Chinese or Japanese, you know that these languages do not use white space to delimit words.
• If working with Japanese, you are familiar with the shift_jis variants and how the same character can be used for either the Yen symbol or the backslash character.

For more information on the Unicode Standard and character encodings, see http://unicode.org.

Installing the Supplemental Language Pack


If you have purchased Japanese, Chinese, or Korean
language functionality, you must install the Endeca
Supplemental Language Pack, which contains Japanese,
Chinese, and Korean dictionary files. Follow the
instructions below to install this software.

To install the Endeca Supplemental Language Pack on UNIX:
1. Download the Endeca Supplemental Language Pack tar file from the Endeca Customer Support Web site:
   https://customers.endeca.com
2. Change directories to INSTALL_BASE, which typically is the /usr/local directory:
   cd INSTALL_BASE
3. Use this command to decompress and unpackage the Endeca Supplemental Language Pack tar file (GNU tar is recommended):
   gzip -dc LANG_TAR_FILE | tar -xvpf -

To install the Endeca Supplemental Language Pack on Windows:
1. Download the Endeca Supplemental Language Pack executable file from the Endeca Customer Support Web site:
   https://customers.endeca.com
   The name of the file will be similar to lang460w2k.exe.
2. Run the file by clicking on it. You are not prompted for any information; instead, the installer automatically adds the appropriate dictionary files to the %ENDECA_ROOT%\conf\dicts directory.


Specifying the License Key

On its own, installing the Supplemental Language Pack does not provide access to Japanese, Chinese, and Korean language support. These languages also require you to specify a license key when you run the Dgidx program and start your Navigation Engine. Contact your Endeca representative to purchase a license key.

After you acquire the license key, use the --lang_license flag to the Dgidx and Dgraph programs; for example:

--lang_license 174923185

Configuring Forge Components for Languages


The following sections discuss how to use Forge components to identify the language of the incoming source data.

Setting the Encoding for the Incoming Source Data

Forge needs to know the encoding of the data in order to process it correctly. For a list of valid encodings, see the ICU Converter Explorer at:

http://oss.software.ibm.com/cgi-bin/icu/convexp


The encoding can be specified in the following ways, depending on the format:
• If the format is Delimited, Vertical, Fixed-width, Exchange, ODBC, JDBC Adapter, or Custom Adapter, you must specify the encoding via the --input-encoding command-line flag when running Forge or in the Encoding field of the Record Adapter editor in Developer Studio (see the example after this list).
• If the format is Document, documents are fetched via the Content Acquisition System. Because each document may have a different encoding, the command-line argument and the Encoding attribute are ignored. Instead, the encoding is determined automatically for each document during the parsing phase, as follows:
   − If the Web server provides the encoding when sending the document, Forge uses that information.
   − If the Web server does not provide the encoding (or if the file system is being crawled), Forge attempts to detect the encoding automatically (for example, by looking for a META tag identifying the encoding or by examining the actual bytes).
   − If both methods fail, Forge emits a warning and defaults to LATIN-1.
• If the format is XML, the encoding must be specified in the DOCTYPE declaration of the XML document as required by the XML standard. Both the command-line argument and the Encoding attribute are ignored.


• If the format is Binary, the command-line argument and the Encoding attribute are ignored because encoding only applies to text files.
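For example, a Forge run over a Delimited source might specify the encoding on the command line as follows (the encoding value and pipeline path are illustrative):

forge --input-encoding ISO-8859-1 ..\data\forge_input\pipeline.epx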

Specifying the Language for Documents

For the Document format, Forge can automatically deduce the language used in a document. Forge’s primary means for doing so is the ID_LANGUAGE expression, as used in a record manipulator.

This example identifies the language of the value stored in the Endeca.Document.Text property and then stores a corresponding language identifier in the Endeca.Document.Language property:

<EXPRESSION TYPE="VOID" NAME="ID_LANGUAGE">
  <EXPRNODE NAME="PROPERTY" VALUE="Endeca.Document.Text"/>
  <EXPRNODE NAME="LANG_PROP_NAME" VALUE="Endeca.Document.Language"/>
  <EXPRNODE NAME="LANG_ID_BYTES" VALUE="500"/>
</EXPRESSION>

The EXPRNODE attributes are:
• PROPERTY - Specifies the name of the property on which to perform language identification.
• LANG_PROP_NAME - Specifies the name of the property in which to store the language. Endeca.Document.Language is the default value.
• LANG_ID_BYTES - Specifies the number of bytes Forge uses to determine the language. A larger number provides a more accurate determination, but requires more processing time. The default value is 300 bytes.


For full details on the ID_LANGUAGE expression, see the Endeca XML Reference.

Forge Language Support Table

With the ID_LANGUAGE expression, Forge can identify the following language and encoding pairs. Each entry lists the encoding label followed by the encoding description in parentheses:

ARABIC: CP1256 (Microsoft Code Page 1256); UTF-8 (Unicode UTF-8)
CATALAN: ASCII (ASCII); CP1252 (Microsoft Code Page 1252); ISO-8859-1 (ISO-8859-1, Latin 1); UTF-8 (Unicode UTF-8)
CHINESE: ASCII (CNS-Roman); ASCII (GB-Roman); BIG5 (Big Five); BIG5-CP950 (Microsoft Code Page 950); CNS (CNS 11643-1986); GB (GB2312-80); EUC-CN (EUC-CN); EUC (DEC Hanzi Encoding); Unicode (Unicode UCS-2); Unicode (Unicode UTF-8)
CZECH: Latin2 (ISO-8859-2, Latin 2); Latin2 (Microsoft Code Page 1250); UTF-8 (Unicode UTF-8)
DANISH: ASCII (ASCII); CP1252 (Microsoft Code Page 1252); ISO-8859-1 (ISO-8859-1, Latin 1); UTF-8 (Unicode UTF-8)
DUTCH: ASCII (ASCII); CP1252 (Microsoft Code Page 1252); ISO-8859-1 (ISO-8859-1, Latin 1); UTF-8 (Unicode UTF-8)
ENGLISH: ASCII (ASCII); CP1252 (Microsoft Code Page 1252); ISO-8859-1 (ISO-8859-1, Latin 1)
ESTONIAN: Latin4 (ISO-8859-4, Latin 4); Latin4 (Microsoft Code Page 1257); UTF-8 (Unicode UTF-8)
FINNISH: ASCII (ASCII); CP1252 (Microsoft Code Page 1252); ISO-8859-1 (ISO-8859-1, Latin 1); UTF-8 (Unicode UTF-8)
FRENCH: ASCII (ASCII); CP1252 (Microsoft Code Page 1252); ISO-8859-1 (ISO-8859-1, Latin 1); UTF-8 (Unicode UTF-8)
GERMAN: ASCII (ASCII); CP1252 (Microsoft Code Page 1252); ISO-8859-1 (ISO-8859-1, Latin 1); UTF-8 (Unicode UTF-8)
GREEK: Greek (ISO-8859-7); Greek (Microsoft Code Page 1253); UTF-8 (Unicode UTF-8)
HEBREW: Hebrew (ISO-8859-8); Hebrew (Microsoft Code Page 1255); UTF-8 (Unicode UTF-8)
HUNGARIAN: Latin2 (ISO-8859-2, Latin 2); Latin2 (Microsoft Code Page 1250); UTF-8 (Unicode UTF-8)
ICELANDIC: ASCII (ASCII); ISO-8859-1 (ISO-8859-1, Latin 1); CP1252 (Microsoft Code Page 1252); UTF-8 (Unicode UTF-8)
ITALIAN: ASCII (ASCII); CP1252 (Microsoft Code Page 1252); ISO-8859-1 (ISO-8859-1, Latin 1); UTF-8 (Unicode UTF-8)
JAPANESE: ASCII (JIS-Roman); CP932 (Microsoft Code Page 932); EUC-JP (EUC-JP); JIS (DEC Kanji); JIS (ISO-2022-JP); JIS (JIS X 0201-1976); JIS (JIS X 0201-1997); JIS (JIS X 0208-1983); JIS (JIS X 0208-1990); JIS (JIS X 0212-1983); JIS (JIS X 0212-1990); SJS (Shift-JIS); Unicode (Unicode UCS-2); Unicode (Unicode UTF-8)
KOREAN: ASCII (KS-Roman); KSC (EUC-KR); KSC (KS C 5861-1992); Unicode (Unicode UCS-2); Unicode (Unicode UTF-8)
LATVIAN: Latin4 (ISO-8859-4); Latin4 (Microsoft Code Page 1257)
LITHUANIAN: Latin4 (ISO-8859-4); Latin4 (Microsoft Code Page 1257); UTF-8 (Unicode UTF-8)
NORWEGIAN: ASCII (ASCII); CP1252 (Microsoft Code Page 1252); ISO-8859-1 (ISO-8859-1, Latin 1); UTF-8 (Unicode UTF-8)
POLISH: Latin2 (ISO-8859-2, Latin 2); Latin2 (Microsoft Code Page 1250); UTF-8 (Unicode UTF-8)
PORTUGUESE: ASCII (ASCII); CP1252 (Microsoft Code Page 1252); ISO-8859-1 (ISO-8859-1, Latin 1); UTF-8 (Unicode UTF-8)
ROMANIAN: Latin2 (ISO-8859-2, Latin 2); Latin2 (Microsoft Code Page 1250); UTF-8 (Unicode UTF-8)
RUSSIAN: CP1251 (Microsoft Code Page 1251); ISO-8859-5 (ISO-8859-5); KOI8R (KOI 8-R); UTF-8 (Unicode UTF-8)
SLOVAK: Latin2 (ISO-8859-2, Latin 2); UTF-8 (Unicode UTF-8)
SPANISH: ASCII (ASCII); CP1252 (Microsoft Code Page 1252); ISO-8859-1 (ISO-8859-1, Latin 1); UTF-8 (Unicode UTF-8)
SWEDISH: ASCII (ASCII); ISO-8859-1 (ISO-8859-1, Latin 1); CP1252 (Microsoft Code Page 1252); UTF-8 (Unicode UTF-8)
THAI: CP874 (Microsoft Code Page 874); UTF-8 (Unicode UTF-8)
TURKISH: CP1254 (Microsoft Code Page 1254); UTF-8 (Unicode UTF-8)

Performance Considerations for Language Identification

Language identification requires a balance between accuracy and performance, and the exact balance depends both on the requirements and the data:
• To increase accuracy, raise the number of bytes in the LANG_ID_BYTES attribute of the ID_LANGUAGE expression.
• To increase performance, either reduce the number of bytes or, if possible, use different criteria to determine the language. For example, if the languages are already segmented by folder, a conditional ADD_PROP expression can create the language property on each record, avoiding the ID_LANGUAGE expression altogether (see the sketch after this list).
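
The following record manipulator fragment is a hypothetical sketch of that folder-based approach. ADD_PROP and IF are Forge expressions documented in this guide, but the MATCHES test, its node names, and the Endeca.Document.URL property used here are assumptions that should be verified against the Endeca XML Reference:

<!-- Hypothetical sketch: tag documents under an /es/ folder as Spanish -->
<EXPRESSION TYPE="VOID" NAME="IF">
  <!-- Condition (assumed syntax): does the document URL contain /es/? -->
  <EXPRESSION TYPE="INTEGER" NAME="MATCHES">
    <EXPRNODE NAME="PROPERTY" VALUE="Endeca.Document.URL"/>
    <EXPRNODE NAME="VALUE" VALUE="*/es/*"/>
  </EXPRESSION>
  <!-- Then: set the language property directly, skipping ID_LANGUAGE -->
  <EXPRESSION TYPE="VOID" NAME="ADD_PROP">
    <EXPRNODE NAME="PROP_NAME" VALUE="Endeca.Document.Language"/>
    <EXPRNODE NAME="PROP_VALUE" VALUE="es"/>
  </EXPRESSION>
</EXPRESSION>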

Note that if the Web server being crawled by the CAS provides incorrect encoding information, you can remove the encoding property (typically the Endeca.Document.Encoding property) before the parse phase. In this case, the PARSE_DOC expression attempts to detect the encoding automatically. If the encoding for all documents being crawled is known in advance, an expression can add the correct encoding to each record before the parse expression.

Configuring Languages for the Navigation Engine


The following sections discuss how to configure language
identifiers for the Navigation Engine, as well as
language-specific spelling correction.

When using internationalized data, keep in mind that the Navigation Engine does not support language-specific sort orders (for example, Spanish speakers expect ch and ll to be sorted as distinct characters), separation of compound words in German, or bi-directional languages like Arabic and Hebrew.

Using Language Identifiers

American English (“en”) is the default language of the Navigation Engine. If your application contains text in other languages, you should tell the Navigation Engine the language of the text so that it can perform language-specific operations correctly.

You use a language ID to identify a language. Language IDs must be specified as a valid RFC-3066 or ISO-639 code, such as the following:
• da – Danish
• de – German
• el – Greek
• en – English (United States)
• en-GB – English (United Kingdom)
• es – Spanish
• fr – French
• it – Italian
• ja – Japanese
• ko – Korean
• nl – Dutch
• pt – Portuguese
• zh – Chinese
• zh-CN – Chinese (simplified)
• zh-TW – Chinese (traditional)

A list of the ISO-639 codes is available at:

http://www.w3.org/WAI/ER/IG/ert/iso639.htm

You can supply the language ID for source data using one of these methods:
• A global language ID can be used if all or most of your
text is in a single language.
• A per-record language ID should be used if the
language varies on a per-record basis.
• A per-dimension/property language ID should be used
if the language varies on a per-dimension basis.
• A per-query language ID should be used in your
front-end application if the language varies on a
per-query basis.

The following sections describe these methods of specifying the language ID for your data.

Specifying a Global Language ID

If most of your text is in a single language, you can use the global language ID by specifying the --lang flag to the Dgidx and Dgraph programs. The Navigation Engine assumes that text not tagged with a more specific language ID (the per-record, per-dimension, or per-query language IDs) is in the global language. The global language ID defaults to en (US English) if left unspecified.
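
For example, to make Simplified Chinese the global language, pass the flag to both programs. This is a sketch: the --lang flag is as documented here, but the remaining arguments (input prefix, output directory, and so on) are placeholders that depend on your deployment.

    dgidx --lang zh-CN <other dgidx arguments>
    dgraph --lang zh-CN <other dgraph arguments>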

[Figure: the Endeca Manager Settings dialog in Developer Studio, configured to use the --lang flag set to zh-CN for Simplified Chinese.]

Specifying a Per-Record Language ID

If your application data is organized so that all the data in a record is in a single language but different records are in different languages, you should use a per-record language ID. This scenario is common in applications that use the Content Acquisition System, because in those applications each record represents an entire document, which is usually all in a single language, while different documents may be in different languages.

To specify a per-record language ID, add a property or dimension named Endeca.Document.Language to your records. This is the default name of the property created by the ID_LANGUAGE expression in Forge, so use of that expression automatically creates a per-record language ID. The value of the property or dimension should be a valid RFC-3066 or ISO-639 language ID.

Specifying a Per-Dimension/Property Language ID

If your application tends to have mixed-language records, and the languages are segregated into different dimensions or properties, use per-dimension/property language IDs. For example, your data may have an English property called Description and a Spanish property called Descripción. In this case, because an individual record can have both English and Spanish text, a per-property language ID would be more appropriate than a per-record language ID.

To specify per-dimension/property language IDs, create a file called db_prefix.languages.xml whose contents list the dimensions and/or properties for which you want to specify a language. Use one KEY_LANGUAGE element for each dimension or property. For details on this element, see the Endeca XML Reference.

Note: You cannot create this file in Developer Studio. You must
create it with a text editor. Place the file in the directory where
the project XML files reside.

The following example illustrates a languages.xml file:


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE LANGUAGES SYSTEM "languages.dtd">
<LANGUAGES>
  <KEY_LANGUAGE NAME="Description" LANGUAGE="en"/>
  <KEY_LANGUAGE NAME="Descripción" LANGUAGE="es"/>
  <KEY_LANGUAGE NAME="Beschreibung" LANGUAGE="de"/>
</LANGUAGES>

In the example, three dimensions are configured for English, Spanish, and German, respectively.

Note: This feature is not supported when using the Endeca Manager.

Specifying a Per-Query Language ID

The ENEQuery and UrlENEQuery classes in the Endeca Presentation API have a setLanguageId() method, which you use to tell the Navigation Engine what language full-text queries are in. Note that in the .NET version of the API, the member is called the LanguageId property.

If you have enabled the language-specific spelling correction feature, a per-query language ID will enable the Navigation Engine to select the appropriate dictionary for a given query.

For details on the ENEQuery and UrlENEQuery class members, see the Endeca Javadocs or the appropriate Endeca API Guide.

The following Java code snippet shows how to set French (using its language code of “fr”) as the language of any text portion of the query (such as search terms):
// Create a Navigation Engine query
ENEQuery usq = new UrlENEQuery(request.getQueryString(), "UTF-8");
// Set French as the language for the query
usq.setLanguageId("fr");
// Set other query attributes
...
// Make the request to the Navigation Engine
ENEQueryResults qr = nec.query(usq);

If no per-query language ID is specified, the Navigation Engine uses the global language ID, which defaults to en (US English) if not set specifically.

Configuring Language-Specific Spelling Correction

You can enable language-specific spelling correction to prevent queries in one language from being spell-corrected to words in a different language.

This feature works by creating separate dictionaries for each language. The dictionaries are generated from the source data and therefore require that the source data be tagged with a language ID. You should also use a per-query language ID, so that the Navigation Engine can select the appropriate dictionary for a given query.

Note: This feature is not supported when using the Endeca Manager.

To enable the language-specific spelling correction feature, create a db_prefix.spell_config.xml file with the following text:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE SPELL_CONFIG SYSTEM "spell_config.dtd">
<SPELL_CONFIG>
  <SPELL_ENGINE>
    <DICT_PER_LANGUAGE>
      <ESPELL/>
    </DICT_PER_LANGUAGE>
  </SPELL_ENGINE>
</SPELL_CONFIG>

See the Endeca XML Reference for more information about the spell_config.xml file and its elements.

Note: You cannot create this file in Developer Studio. You must
create it by hand using a text editor.

If a spell_config.xml file exists, it will override the use of these parameters to the Dgidx --spellmode option:
• espell
• aspell
• aspell_OR_espell
• aspell_AND_espell

The language-specific spelling correction feature uses the Espell language engine, which is part of the base product. The Aspell language engine supports only English, so it is not supported for this feature.

Using Encoding in the Web Application


If you are using internationalized data in your Web
application, you should be aware of the encoding
(character set) requirements described in the following
two sections.

Setting the Encoding for URLs

The UrlENEQuery and UrlGen classes require that you specify a character encoding so that they can properly decode URLs. For example, a URL containing %E5%8D%83 refers to the Chinese character for “thousand” if using the UTF-8 encoding, but refers to three accented European letters if using the windows-1252 encoding.

The following Java code snippet shows how to instantiate a UrlGen object using the UTF-8 character encoding:
// Create request to select refinement value
urlg = new UrlGen(request.getQueryString(), "UTF-8");

For details on these classes, see the Endeca Javadocs or the appropriate Endeca Presentation API Guide.

Setting the Page Encoding

Your application should choose a suitable output encoding for the pages it produces. For example, a multi-lingual European site might choose the windows-1252 encoding, while a Chinese site might choose GB2312 or Big5. If you need to support all languages, we recommend using the UTF-8 encoding.
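
As an illustration, a Java servlet front end typically declares the output encoding on the response before writing any page content. The following sketch uses only the standard Servlet API (javax.servlet) and is not specific to Endeca:

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.PrintWriter;

public class PageEncodingServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws java.io.IOException {
        // Sets both the Content-Type header and the writer's character encoding
        response.setContentType("text/html; charset=UTF-8");
        PrintWriter out = response.getWriter();
        out.println("<html><head>");
        // The META tag repeats the encoding for clients that ignore the header
        out.println("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"/>");
        out.println("</head><body>...</body></html>");
    }
}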

Viewing Navigation Platform Logs


Log messages output by the Navigation Platform binaries
(Forge, Dgidx, Dgraph, and so forth) are in the UTF-8
encoding. Most common UNIX/Linux shells and terminal
programs are not set up to display UTF-8 by default and
will therefore display some valid characters as question
marks (?).

If you find unexpected question marks in the data, first validate that it is not simply a display issue. Try the od command on Linux, or use a UTF-8 compatible display.
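
For example, on Linux you can dump the raw bytes of a log file and confirm that multi-byte UTF-8 sequences are present where the terminal shows question marks (the log file name is a placeholder):

    od -c forge.log | head

If the od output shows multi-byte escape sequences (such as \303 \251) rather than literal ? characters, the data is intact and only the display is at fault.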

Index

A

aggregated records
    creating record queries 114
    displaying records and properties 115
    enabling a dimension for 108
    enabling a property for 108
    ENE URL query parameters 112
    introduced 105
    methods for rollup keys 109
Agraph
    control script 386
    introduced 371
    partial updates 355
    partitions 372
    performance 386
    provisioning 378
    query processing 372
    replicas 379
    running 383
    troubleshooting 385
audience for this guide xvi
authentication
    configuring basic 89
    disabling for a host 96
    example key ring file 90

B

baseline updates
    defined 319
    pipeline details 322
Boolean syntax for record filters 145
boot-strapping server authentication 95
bulk export of records
    Developer Studio configuration 137
    ENE URL query parameters 138
    introduced 137
    objects and method calls 138
    performance 144

C

CA_DB element 95
CAS
    converting documents to text 47
    crawl types 27
    creating a full crawling pipeline 39–41
    creating a record adapter to read documents 41–43
    creating a record manipulator 43–50
    handling crawler errors 72–81
    identifying the language of documents 50–52
    introduced 23
    Perl manipulator 54–55
    property name syntax 36–37
    redundant URLs 30
    reference implementation 26–27
    removing document body properties 52–54
    removing unnecessary records after a crawl 68–71
    RETRIEVE_URL expression 45
    root URL extraction settings 60–63
    security information 24
    source documents and Endeca records 31–36
    source formats supported by ProFind 81
    specifying root URLs to crawl 59–60
    spiders 55–59
    supporting components 25–26
    URL and record processing 28–30
    viewing all properties generated by 38–39
client authentication 97
components that support CAS 25
configuring HTTPS 94
Content Acquisition System. See CAS
content spotlighting 257
converting documents to text in CAS 47
CONVERTTOTEXT expression 48
Coremetrics integration 253
crawl types in CAS 27
crawler errors 72

D

derived properties
    introduced 123
    performance 129
    presentation API development 125
    sample .NET code 127
    sample COM code 128
    sample Java code 126
DERIVED_PROP element 124
deterministic distribution strategy for partial updates 356
Dgraph configuration for partial updates 341
Dgraph partition 372
dimension adapter for partial updates 337
dimension server for partial updates 338
dimension server match count log, configuring 237
directory structure for partial updates 344
dynamic business rule
    CMS comparisons 259
    creating 278
    properties 286
    rendering results 301
    rule groups 301
    style 261
    target 260
    trigger 260
    types of results 262
    zone 261

E

encrypting keys with Forge 100
Endeca Navigation Engine distribution across multiple processors 371
ENE URL query parameters
    bulk export of records 138
    for aggregated records 112
    key-based record sets 133
    record filters 153
ERecEnumerator class 143
ERecIter class 142
example syntax of URL filters 64
Exchange Server authentication 98
EXCHANGE_SERVER element 99
expression evaluation of record filters 155
externally created dimensions
    compared to externally managed taxonomies 168
    Developer Studio configuration 169
    importing 174
    introduced 167
    XML requirements 170
externally managed taxonomies
    compared to an externally created dimension 178
    Developer Studio configuration 179–180
    introduced 177
    loading 188–189
    node ID requirements 184
    pipeline configuration 185–187
    transforming 187–188
    XML syntax 181–183
    XSLT mapping 181

F

featured records 287
Forge
    encoding for internationalized data 391
    encrypting keys with 100
Forge hierarchical logging
    configuring MustMatch messages 236–237
    introduced 227
    log appenders 231–234
    log levels 228–229
    log.ini file 238–240
    message categories 229–231
formats
    Forge log appenders 235
    partial update records 334
full updates
    See baseline updates

G

global language ID 400

H

HTTP element 93
HTTPS element 96

I

ID_LANGUAGE expression 50
IF expression for partial updates 328
implementing "More..." dimension values 161
importing externally created dimensions 174
inert dimension values
    .NET example 165
    COM example 166
    Developer Studio configuration 162
    introduced 161
    Java example 164
    Presentation API development 163–166
internationalized data
    ENE URLs 405
    Forge encoding 391
    global language ID for 400
    ID_LANGUAGE expression 393
    introduced 389
    language identification 398
    language support table 394
    language-specific spelling corrections 403
    license keys for Asian language support 391
    page encoding 406
    per-dimension or property language IDs 401
    performance 397
    per-query language ID 402
    per-record language ID 401
    supplemental language pack 390
Iterator class for bulk export of records 141

K

KEY element 94
key-based record sets
    ENE URL query parameters 133
    introduced 131
    objects and method calls 133
    performance 132

L

language IDs
    global 400
    per-dimension 401
    per-property 401
    per-query 402
    per-record 401
large OR filter performance 156
large scale negation 157
license key for Asian language support 391
logging hierarchy, Forge 241

M

memory costs of record filters 155
merchandising 257
    See also dynamic business rule
    tool workflow 259, 268
multithreaded mode
    associated costs 246
    Dgraph configuration 247
    Intel processors 249
    introduced 243
    Linux 250
    performance 247–248
    Solaris 251
    Windows 251
MustMatch messages 236

N

node ID requirements for externally managed taxonomies 184
non-navigable dimension values
    See inert dimension values

O

objects and method calls
    aggregated record rollup keys 109
    bulk export of records 138
    key-based record sets 133

P

PARSE_DOC expression 49
part list performance 156
partial updates
    adding other control script bricks 352
    capabilities 321
    control script development 343
    control script development for Agraph 367
    deterministic distribution strategy 356
    Dgidx configuration 343
    Dgraph configuration 341
    difference from baseline updates 319
    dimension adapter 337
    dimension server 338
    directory structure 344
    format of update records 334
    IF expression for record manipulator 328
    in Agraph deployments 355
    introduced 319
    naming format of data files 339, 358
    Perl expression for record manipulator 362
    pipeline details 324
    random distribution strategy 356
    record adapter component 326
    record adapter for Agraph deployment 360
    record manipulator component 327
    record manipulator for Agraph deployment 361
    record specification attribute 340
    reference implementation 322
    update adapter 336
    update adapter for Agraph deployment 365
    UPDATE_RECORD expression 328
    URL update command parameters 353
passing phrases with Forge 100
per-dimension language ID 401
performance
    Agraph 386
    bulk export of records 144
    derived properties 129
    internationalized data 397
    key-based record sets 132
    multithreaded mode 247
    user profiles 318
Perl expression for partial updates 362
Perl manipulator in CAS 54
per-property language ID 401
per-query language ID 402
per-record language ID 401
per-record memory overhead 144
promoting records with dynamic business rules 258
properties in CAS
    name syntax 36
    viewing generated 38
provisioning the Agraph in Web Studio 378
PROXY element 100
proxy server authentication 99

R

random distribution strategy for partial updates 356
REALM element 93
record adapter
    creating for Agraph partial updates 360
    creating for CAS 41
    creating for partial updates 326
record filters
    data configuration 151–152
    Developer Studio configuration 151
    ENE query syntax 147
    ENE URL query parameters 153
    expression evaluation 155
    introduced 145
    large scale negation 157
    memory costs 155
    Navigation Engine configuration 152
    performance 154
    syntax 146
    XML file syntax 149
record manipulator
    creating for Agraph partial updates 361
    creating for CAS 43
    creating for partial updates 327
redundant URLs in CAS 30–31
reference implementation, CAS 26
REMOVE_RECORD expression 70
removing document body properties in CAS 52
RETRIEVE_URL expression 45
root URL extraction settings for CAS 60
root URLs for the spider to crawl 59
rule groups 301

S

SITE element 91
specifying root URLs to crawl in CAS 59
spelling correction, language-specific 403
spiders
    specifying proxy servers 66
    specifying record sources 65
    specifying timeouts 65
Stratify document classification
    building a taxonomy 202
    creating a pipeline 204–213
    dimension value synonyms 219
    Endeca integration with 196–199
    exporting a taxonomy 203–204
    integrating the taxonomy 213–215
    introduced 191
    loading the dimensions 216–218
    mapping dimensions 222
    required tools 200–201
    terms and concepts 193–196
styles for dynamic business rules, creating 275
supplemental language pack, installing 390
syntax of record filters 146

T

target for promoting records, creating 284
taxonomy, developing a Stratify 201–223
trigger
    creating 281
    dimension value 279
    keyword 280
    time 280
    user profile 281

U

update adapter for Agraph partial updates 365
update adapter for partial updates 336
UPDATE_RECORD expression for partial updates 328
URL and record processing in CAS 28
URL filters for CAS 64
user profiles
    .NET example 317
    COM example 317
    Developer Studio configuration 314–315
    introduced 311
    Java example 316
    objects and method calls 315–316
    performance 318
    scenario 312–313

W

Web crawling with authentication 89

X

XML syntax for dimension hierarchy 171

Z

zones for dynamic business rules, creating 271
