Administrating TREX using the TREX Admin Tool
Bettina Knauss
NetWeaver RIG EMEA
SAP AG
Walldorf 07.03.2007
TREX Introduction
Landscape Configuration
RFC Connection
Administrating, Monitoring
Traces
TREX Architecture
Example
When a service sends the name server the request
GetServer (IndexServer, SearchMode, MyIndex)
the name server answers with the address
<host>:<port>
of the index server to which to send the request
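The request and answer above can be sketched as a simple registry lookup. This is only an illustration: the class names, the host name trexhost01, and the port 30003 are hypothetical and do not describe the real TREX name server interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceAddress:
    host: str
    port: int

    def __str__(self):
        return f"{self.host}:{self.port}"


class NameServer:
    """Toy registry: (service type, mode, index) -> address. Hypothetical API."""

    def __init__(self):
        self._registry = {}

    def register(self, service, mode, index, address):
        self._registry[(service, mode, index)] = address

    def get_server(self, service, mode, index):
        # Answer in the <host>:<port> form shown on the slide.
        try:
            return str(self._registry[(service, mode, index)])
        except KeyError:
            raise LookupError(f"no {service} known for {index} in {mode}") from None


name_server = NameServer()
# Host and port are made-up example values.
name_server.register("IndexServer", "SearchMode", "MyIndex",
                     ServiceAddress("trexhost01", 30003))
address = name_server.get_server("IndexServer", "SearchMode", "MyIndex")
```

The caller then opens its connection to the returned address instead of hard-coding an index server.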
topology.ini
Read by all name servers
Contains all index-relevant information
– To edit the file, use the TREX standalone admin tool
sapprofile.ini
Read by all TREX services and clients
Specifies:
– Port number of local name server
– Host and port numbers of all master name servers
– Amount of shared memory used by topology.ini data
– System ID
– Path information to where each service saves its data
TREX Preprocessor
Delivers documents that the engines can use directly
Supports almost any data type
Gets documents via HTTP from source
Converts documents to HTML
Keeps the document structure
Extracts attributes
– Metadata from DOC, PDF, ...
– Names from a lexicon
– Application-specific attributes
Performs linguistic processing
– Tokenization
– Stemming
– Tagging
(using third party products)
[Figure: source files (.zip, .ppt, .pdf, .doc, .*) are converted to HTML: <html> <head>…</head> <body>…</body> </html>]
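To make tokenization and stemming concrete, here is a deliberately naive sketch. TREX uses third-party products for this step, so the two functions below are stand-ins written for illustration, not the actual algorithms.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (toy tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Toy stemmer: strip one plural 's'. Real stemmers are far more careful."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


tokens = tokenize("Houses and Invoices")
stems = [naive_stem(token) for token in tokens]
```

With this toy rule, a search for "house" and a document containing "Houses" meet on the same stem.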
TREX Preprocessor
Reduces workload on the other engines
Works independently of the indexes
Is stateless
[Figure: Java client, ABAP client, and Python extensions alongside the index server, name server, and preprocessor]
Search
Exact search
– SAP
Phrase search
– “SAP AG”
Boolean search
– SAP AND ORACLE
Masked or wildcard search
– Web*
Fuzzy or error-tolerant search
– Kagerman → Kagermann
Linguistic search
– Houses → House
Attribute search
– Author = Stevens
Indexing
Many documents at once
– Up to tens of millions
Many formats *
– PDF, doc, ppt, zip, …
With or without queueing
– Synchronous or asynchronous
Automatic language identification *
– 31 languages so far …
Attribute extraction *
– DC and other metadata
Linguistic processing *
– Tokenizing, tagging, stemming, …
Ranking
– TF*IDF and P-norm
* Via Preprocessor
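The slide names TF*IDF as one ranking ingredient without giving a formula. The sketch below uses a common textbook variant (relative term frequency times the logarithm of inverse document frequency); the exact weighting TREX applies, and its combination with the P-norm, are not shown here and are not implied.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each document by summed TF*IDF over the query terms (textbook form)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term]:
                tf = counts[term] / len(tokens)
                idf = math.log(len(documents) / doc_freq[term])
                score += tf * idf
        scores[doc_id] = score
    return scores


# Made-up sample documents.
documents = {
    "a": "invoice verification invoice",
    "b": "invoice payment",
    "c": "travel expenses",
}
scores = tf_idf_scores(["verification"], documents)
ranked = sorted(scores, key=scores.get, reverse=True)
```

A term that occurs in few documents gets a high IDF, so documents containing it rise to the top.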
[Figure: TREX architecture. A Java client talks to the Web server; inside TREX are the name servers, preprocessors, queue servers, and index servers, each index server with a text mining engine, a text search engine, and an attribute engine]
The Web server converts the HTTP message into the format used
inside TREX and sends a request to the name server for the name
and address of a service to handle the request
The name server checks its list of available servers and tells the
Web server the address of an index server that has received the
fewest calls so far and can handle the request
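The selection rule described above (pick the available index server with the fewest calls so far) can be sketched as follows. The class, the host names, and the ports are hypothetical; the real name server's bookkeeping is not documented in these slides.

```python
class LeastCallsSelector:
    """Pick the available server that has received the fewest calls so far."""

    def __init__(self, addresses):
        self.calls = {address: 0 for address in addresses}
        self.available = set(addresses)

    def select(self):
        candidates = [a for a in self.calls if a in self.available]
        if not candidates:
            raise RuntimeError("no index server available")
        chosen = min(candidates, key=self.calls.get)
        self.calls[chosen] += 1  # the chosen server now has one more call
        return chosen


selector = LeastCallsSelector(["indexhost1:30003", "indexhost2:30003"])
selector.calls["indexhost1:30003"] = 5  # pretend host 1 is already busy
picks = [selector.select() for _ in range(3)]
```

Because host 1 starts with five calls, all three requests go to host 2 until the counts even out.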
[Figure: the Web server asks the name server "Where can I send this request?" and the name server answers "Send it to Index Server 1"]
The Web server passes the search request to the index server as
a TCP/IP packet
The index server sees that the request is for a phrase search and
therefore forwards the phrase to the preprocessor for language
identification, tokenization, tagging, and stemming
[Figure: the request "Do a phrase search for invoice verification in the BooksOnline index" reaches the index server. Callout: "A phrase search: this means work for the preprocessor!"]
The language of the search may be specified in advance
[Figure callouts: "This is a simple query: just a 2-word phrase" and "The index listing for invoice is longer than the index listing for verification, so select verification first"]
The search engine finds the row for the term verification in the
BooksOnline index and selects the set of books containing the
term, then it checks this set of books against the row for the term
invoice and selects just the books that contain both terms
Next, it reads the addresses of the terms in each book, calculates
rank values, sorts the results, and takes the top ten (or more)
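The two steps above, starting with the rarer term and then cutting the result down to the top hits, can be sketched with an in-memory inverted index. The index contents and rank values are made up, and term positions (needed for a true phrase match) are left out to keep the example short.

```python
def intersect_postings(index, terms):
    """Return the documents containing every term, smallest posting set first."""
    if not terms:
        return set()
    # Start with the shortest listing so the working set stays small.
    ordered = sorted(terms, key=lambda term: len(index.get(term, ())))
    result = set(index.get(ordered[0], ()))
    for term in ordered[1:]:
        result.intersection_update(index.get(term, ()))
        if not result:
            break
    return result


def top_hits(rank_values, limit=10):
    """Sort document ids by rank value, best first, and keep the top `limit`."""
    return sorted(rank_values, key=rank_values.get, reverse=True)[:limit]


# Made-up postings: book ids per term.
books_online = {
    "invoice": {1, 2, 3, 4, 5, 6},
    "verification": {2, 4, 7},
}
matches = intersect_postings(books_online, ["invoice", "verification"])
rank_values = {book: 1.0 / book for book in matches}  # placeholder rank values
best = top_hits(rank_values)
```

Only the three books listed for verification are ever checked against the longer invoice listing.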
[Figure callouts: "1. Find set of books with verification", "2. Find subset with", "Calculate ranks and sort"]
The search engine reads all the requested attributes for the
selected books, including titles and authors and keys to the
documents
The engine uses the keys to load the document contents and
scans the texts for the first occurrences of the search phrase (or
linguistic variants of the phrase) to create a brief summary text
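A minimal sketch of the summary step: find the first occurrence of the phrase and cut a window of text around it. The function name and the width parameter are assumptions for this example, and linguistic variants of the phrase are not handled here.

```python
def make_summary(text, phrase, width=40):
    """Return a short excerpt around the first occurrence of `phrase`."""
    position = text.lower().find(phrase.lower())
    if position < 0:
        # Phrase not found: fall back to the beginning of the text.
        start, end = 0, min(len(text), 2 * width)
    else:
        start = max(0, position - width)
        end = min(len(text), position + len(phrase) + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end].strip() + suffix


# Made-up sample text.
book_text = ("Chapter 3. Invoice verification is the next step in procurement. "
             "The invoice verification in the system follows.")
summary = make_summary(book_text, "invoice verification", width=15)
```

The ellipsis markers show the reader that the excerpt was cut out of a longer text.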
The search engine passes the result set back via the index server
for merging with results from any other engines (here none)
The index server passes the result set back via the Web server
and the Java client to the graphical user interface
Jane sees a ranked list of books about invoice verification less
than a second after she launched the search
[Figure: the result set returns from the index server via the Web server to the Java client, which displays it]
73 books found in 0.14 seconds
Internal Auditing
by First Author, Second Author
Economic Publishers, New York
Invoice verification is the next step ... The invoice verification in the ...
375 pages First edition ISBN 0-3XX-XXXXX-X
Browse full text
[Figure: TREX architecture with an ABAP client connected through the gateway and the RFC server. Callout: "Create an index called BooksOnline"]
The name server tells the RFC server the address of an index
server that can create the index
In a one-box implementation of TREX, this step is straightforward
unless the index server is down for some reason
The name server uses a round robin procedure to select an index
server
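Round robin simply hands out the index servers in a fixed rotation. A minimal sketch, with hypothetical class and host names:

```python
import itertools


class RoundRobinSelector:
    """Hand out server addresses in a fixed, repeating order."""

    def __init__(self, addresses):
        addresses = list(addresses)
        if not addresses:
            raise ValueError("at least one index server is required")
        self._cycle = itertools.cycle(addresses)

    def select(self):
        return next(self._cycle)


# Made-up host names and ports.
selector = RoundRobinSelector(["indexhost1:30003", "indexhost2:30003"])
picks = [selector.select() for _ in range(3)]
```

In a one-box implementation the list has a single entry, so every request goes to the same index server.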
[Figure callout: "Queueing is an option: indexing can also be done immediately"]
[Figure: documents of type .htm, .pdf, .ppt, .xls, .doc, .txt]
The queue server receives the list of URLs for the documents from the specified folder and persists them in a queue for the index for as long as required until a preprocessor is available
Indexing a large collection of documents can be a long job, so the administrator can hold or flush the queue manually at any time
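The queue behaviour can be sketched as below. This is an in-memory toy with assumed semantics: "hold" stops handing out work, and "flush" drains everything that is pending at once. The class, its method names, and the URLs are hypothetical, and real persistence is omitted.

```python
from collections import deque


class IndexQueue:
    """Toy per-index queue of document URLs (in memory only)."""

    def __init__(self, index_name):
        self.index_name = index_name
        self.on_hold = False
        self._pending = deque()

    def add(self, urls):
        self._pending.extend(urls)

    def hold(self):
        self.on_hold = True

    def release(self):
        self.on_hold = False

    def next_batch(self, size):
        """Hand up to `size` URLs to a free preprocessor, unless on hold."""
        if self.on_hold:
            return []
        count = min(size, len(self._pending))
        return [self._pending.popleft() for _ in range(count)]

    def flush(self):
        """Drain all pending URLs immediately (assumed meaning of 'flush')."""
        batch = list(self._pending)
        self._pending.clear()
        return batch

    def __len__(self):
        return len(self._pending)


queue = IndexQueue("BooksOnline")
# Made-up URLs.
queue.add(["http://docs.example/a.pdf",
           "http://docs.example/b.doc",
           "http://docs.example/c.txt"])
```

While the queue is on hold, documents keep accumulating but none are handed to a preprocessor.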
[Figure callout: "Queue server receives document URLs and adds them to the BooksOnline queue for indexing"]
TREX Administration Tools
Start Tool
DEMO
Landscape Example
IS Index server
M Master
MI Master index
NS Name server
PP Preprocessor
Q Queue
QS Queue server
RFC RFC server
S Slave
SI Slave index
SN Snapshots
WS Web server
[Figure: landscape with a master host (mytrexmaster) and slave hosts (mytrexslave1 ... 2), labeled with the abbreviations above]
http://trex.wdf.sap.corp:1080/ Documentation Distributed Search and Classification (TREX) 7.0 SP2 Systems
[Figure: landscape with a backup host (mytrexbackup), a master host (mytrexmaster2), slave hosts (mytrexslave3/4), and a file server]
[Figure: landscape with a backup host (mytrexbackup2), a master host (mytrexmaster2), slave hosts (mytrexslave3/4), and a file server]
DEMO
Creating RFC Connection
DEMO
Reorg I
TREX Traces
DEMO
No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP AG. The information contained herein may be
changed without prior notice.
Some software products marketed by SAP AG and its distributors contain proprietary software components of other software vendors.
Microsoft, Windows, Outlook, and PowerPoint are registered trademarks of Microsoft Corporation.
IBM, DB2, DB2 Universal Database, OS/2, Parallel Sysplex, MVS/ESA, AIX, S/390, AS/400, OS/390, OS/400, iSeries, pSeries, xSeries, zSeries, System i, System i5, System p,
System p5, System x, System z, System z9, z/OS, AFP, Intelligent Miner, WebSphere, Netfinity, Tivoli, Informix, i5/OS, POWER, POWER5, POWER5+, OpenPower and PowerPC are
trademarks or registered trademarks of IBM Corporation.
Adobe, the Adobe logo, Acrobat, PostScript, and Reader are either trademarks or registered trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Oracle is a registered trademark of Oracle Corporation.
UNIX, X/Open, OSF/1, and Motif are registered trademarks of the Open Group.
Citrix, ICA, Program Neighborhood, MetaFrame, WinFrame, VideoFrame, and MultiWin are trademarks or registered trademarks of Citrix Systems, Inc.
HTML, XML, XHTML and W3C are trademarks or registered trademarks of W3C ®, World Wide Web Consortium, Massachusetts Institute of Technology.
Java is a registered trademark of Sun Microsystems, Inc.
JavaScript is a registered trademark of Sun Microsystems, Inc., used under license for technology invented and implemented by Netscape.
MaxDB is a trademark of MySQL AB, Sweden.
SAP, R/3, mySAP, mySAP.com, xApps, xApp, SAP NetWeaver, and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered
trademarks of SAP AG in Germany and in several other countries all over the world. All other product and service names mentioned are the trademarks of their respective companies.
Data contained in this document serves informational purposes only. National product specifications may vary.
The information in this document is proprietary to SAP. No part of this document may be reproduced, copied, or transmitted in any form or for any purpose without the express prior
written permission of SAP AG.
This document is a preliminary version and not subject to your license agreement or any other agreement with SAP. This document contains only intended strategies, developments,
and functionalities of the SAP® product and is not intended to be binding upon SAP to any particular course of business, product strategy, and/or development. Please note that this
document is subject to change and may be changed by SAP at any time without notice.
SAP assumes no responsibility for errors or omissions in this document. SAP does not warrant the accuracy or completeness of the information, text, graphics, links, or other items
contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability,
fitness for a particular purpose, or non-infringement.
SAP shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. This
limitation shall not apply in cases of intent or gross negligence.
The statutory liability for personal injury and defective products is not affected. SAP has no control over the information that you may access through the use of hot links contained in
these materials and does not endorse your use of third-party Web pages nor provide any warranty whatsoever relating to third-party Web pages.