
Java Content Repository TextSearch in IBM WebSphere Portal: What's new in V7

Malarvizhi Kandasamy (k.malarvizhi@in.ibm.com), Staff Software Engineer, JCR Development, IBM
Ramgopal Kanasani (rakanasa@in.ibm.com), Software Engineer, JCR Development, IBM

March 2011

Copyright International Business Machines Corporation 2011. All rights reserved.

Summary: This white paper explains how text searches are performed on content stored in the Java Content Repository in IBM WebSphere Portal 7. Specifically, we focus on the search component of the IBM Lotus Web Content Management authoring portlet in WebSphere Portal 7, which uses the WebSphere Portal Search Engine and therefore differs from previous WebSphere Portal versions.

Table of Contents
1 Introduction
2 Overview of JCR TextSearch
2.1 Index maintenance
2.2 Search
3 Configuring JCR Content Model TextSearch
3.1 Standalone WebSphere Portal environment
3.2 Manually configuring JCRCollection1 in standalone environment
3.3 Clustered WebSphere Portal environment
3.4 Preparing the remote search service
3.5 Configuring the remote search service
4 Setting up JMS in a clustered environment
4.1 Adding a cluster as a bus member
5 Searching a seedlist document in WCM
6 Troubleshooting JCR TextSearch
7 Conclusion
8 Resources
About the authors

1 Introduction
In WebSphere Portal 7, the Java Content Repository (JCR) uses the WebSphere Portal Search Engine (PSE) for its text search functions. In WebSphere Portal 6 and earlier versions, JCR used the Juru text engine, a Java library developed by the Haifa Research Lab (HRL) that maintains the text index and performs searches over it. In WebSphere Portal 7, to align with the industry-standard approach and to support multiple search engines and repositories, JCR adopted HRL's PSE, which is based on the Apache Lucene search engine. This paper discusses the indexing and search part of Lotus Web Content Management (hereafter called WCM) and the different configurations introduced in version 7. We also describe the configuration techniques for standalone, clustered, and farm environments, and provide some best practices and troubleshooting tips.

2 Overview of JCR TextSearch


JCR TextSearch is the component that enables searching for words and phrases in the content repository and provides linguistic capabilities for the documents stored there. JCR-based content includes content created with WCM or Personalization. The two processes involved in TextSearch functionality are (1) index maintenance and (2) text search.

2.1 Index maintenance


Index maintenance is a scheduled background activity that is invoked at a specified interval. It updates the text index directory to reflect all the actions that have occurred in the repository. In WebSphere Portal 6.1 and previous releases, the administrator had no control over the indexing process; the JCR configuration file, icm.properties, was the only place in which the administrator could configure text search. In WebSphere Portal 7, the indexing process can be monitored and administered from the Search Administration portlet. JCR creates an out-of-the-box search collection (JCRCollection1), which is associated with the JCRSource crawler that performs document indexing. We look at the crawler and its configuration later in this paper.

Full crawl
This activity builds the text index directory from scratch; it usually occurs when WebSphere Portal is installed and the server is started for the first time. In the first run, JCR performs internal processing and prepares to build the index. Only in the next scheduled run does the crawler start collecting documents. The full crawler indexes documents in asynchronous mode only. This means that, even if the WebSphere Portal server goes down while the index is being built, the process can resume from the point of failure when the server comes up again.

Incremental crawl
Once the full crawl has built the index directory from scratch, the WebSphere Portal scheduler checks at the configured interval whether there are any modifications in the repository. If so, it updates the index directory with these changes.

2.2 Search
In WebSphere Portal 7, JCR uses XPath as the query language. To search for information in the content repository, you pass an XPath query as the input to JCR. The XPath query supports two built-in functions, text-contains() and text-score(), which provide text search functions on a JCR node using a search pattern. JCR supports different types of searches, including fielded search, scoped search, fuzzy search, and stemmed search, as well as linguistic features and ranking/scoring of results. For more information on search and the convertors used in TextSearch, refer to the developerWorks article, Java content repository TextSearch in IBM WebSphere Portal and IBM Lotus Web Content Management: Overview and troubleshooting.

3 Configuring JCR Content Model TextSearch


Let's now discuss how to configure the standalone, clustered, and farm environments with respect to the JCR content model.

3.1 Standalone WebSphere Portal environment


This is a system in which WebSphere Portal 7 is installed as a separate application along with IBM WebSphere Application Server (hereafter called WAS). WebSphere Portal uses Apache Derby as its default database; in a production environment, however, a higher-end database such as IBM DB2, Oracle, or Microsoft SQL Server is recommended. Derby is better suited to a development environment or to running WebSphere Portal as a proof of concept.

In the standalone environment, the Java Message Service (JMS) resources required for JCR index maintenance are created during installation. The JCR search collection (JCRCollection1) is created automatically when the first content is created, or when existing content is edited, and is used by the Search component in the WCM Authoring portlet. The collection is created in the local file system at the location specified in the icm.properties file. The JMS bus member is the WebSphere Portal server.

If the database is transferred from Derby to another database, run the ConfigEngine task create-jcr-jms-resources-post-dbxfer to create the JMS resources for the transferred database. For databases other than Derby, the messaging engine's data store is also configured to use the same database as JCR.

The JCR icm.properties configuration file is located in the c:/IBM/WebSphere/wp_profile/PortalServer/jcr/lib/com/ibm/icm folder. Note that any modification to the file requires a restart of the WebSphere Portal server. The following are the essential configuration properties for JCR TextSearch:

Enable TextSearch (jcr.textsearch.enabled=true). Set this value to false to disable text search at runtime. NOTE: When search is disabled, documents are not collected during the crawler schedule, and the Authoring portlet search does not work. By default, this value is set to false. Once WebSphere Portal is installed successfully, enable TextSearch by setting this property to true.

Set the convertor that extracts the binary content (jcr.textsearch.convertor = com.ibm.icm.ts.convertor.WpsConvertor). As mentioned previously, there are three convertor options available for this property. The recommended option is WpsConvertor, which calls the Document Conversion Service internally in JCR.

Create the index directory in the location specified (jcr.textsearch.indexdirectory = c:/IBM/WebSphere/wp_profile/PortalServer/jcr/search).

Set the PSE type to localhost (jcr.textsearch.PSE.type=localhost). This is the value to set in a standalone system. The other options are Simple Object Access Protocol (SOAP) and Enterprise JavaBeans (EJB), which are used to configure the remote search service for a clustered environment. We look at these options in detail later in this document.

Set the incremental topic used during incremental crawl (jcr.textsearch.incrementalcrawl.topic = jms/JCRSeedTopic1100).
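The standalone settings listed above can be checked mechanically before the server is restarted. The following is a minimal sketch in Python, not an IBM tool: the property names and sample values come from the text above, while the helper names (parse_properties, check_standalone) and the simplified key=value parsing are assumptions made for illustration.

```python
# Hypothetical helper: property names are taken from the paper; the parsing
# and checking logic is an illustrative assumption, not part of WebSphere Portal.

# Properties that must have a specific value in a standalone system.
REQUIRED_STANDALONE = {
    "jcr.textsearch.enabled": "true",
    "jcr.textsearch.PSE.type": "localhost",
}
# Properties that only need to be present and non-empty.
REQUIRED_PRESENT = (
    "jcr.textsearch.convertor",
    "jcr.textsearch.indexdirectory",
    "jcr.textsearch.incrementalcrawl.topic",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_standalone(props):
    """Return a list of readable problems; an empty list means none found."""
    problems = []
    for key, expected in REQUIRED_STANDALONE.items():
        actual = props.get(key)
        if actual is None:
            problems.append(f"{key} is missing")
        elif actual.lower() != expected:
            problems.append(f"{key} is {actual!r}, expected {expected!r}")
    for key in REQUIRED_PRESENT:
        if not props.get(key):
            problems.append(f"{key} is missing or empty")
    return problems


# Sample content mirroring the values in the text, with TextSearch still
# at its default (disabled) setting.
SAMPLE = """
jcr.textsearch.enabled=false
jcr.textsearch.convertor = com.ibm.icm.ts.convertor.WpsConvertor
jcr.textsearch.indexdirectory = c:/IBM/WebSphere/wp_profile/PortalServer/jcr/search
jcr.textsearch.PSE.type=localhost
jcr.textsearch.incrementalcrawl.topic = jms/JCRSeedTopic1100
"""

for problem in check_standalone(parse_properties(SAMPLE)):
    print(problem)
```

With the sample above, the only problem reported is the disabled jcr.textsearch.enabled property, which matches the default described in the text. The parser deliberately ignores Java properties features such as line continuations, so treat it as a quick check rather than a full reader.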

3.2 Manually configuring JCRCollection1 in standalone environment


If you want to delete the default JCRCollection1 created by JCR during WebSphere Portal server startup and re-create the Collection with customized values, follow these steps:

1. In WebSphere Portal, select Portal Administration > Search Administration > Manage Search > Search collections. Click the New Collection button (see figure 1).

Figure 1. Create New Collection in the Manage Search Collections window

2. For a standalone environment, select Default Portal Search Service in the Search service field (see figure 2).

3. Keep the Name of the Collection as JCRCollection1, and specify the same location for the Search Collection here and in icm.properties. The default language for the Collection is English, which is used as the indexing language during crawling. The index language enhances the quality of the search results that are returned. All other fields are optional.

Figure 2. Create Search Collection window

4. In this example, we specify the location of the Collection as C:\JCR, which creates the search index directory in that location, as shown in figure 3.

Figure 3. Directory structure of JCRCollection1

5. Once the Collection is created, it appears in the list of Search Collections. Click the JCRCollection1 collection to create a new content source, as shown in figure 4.

Figure 4. Create new JCR Content Source for JCRCollection1

6. The content source handles indexing of the documents and is where you specify the crawler parameters. When specifying the content source parameters, choose Seedlist Provider as the content source type and provide a name for the new content source, in this case, JCRContentSource (see figure 5). Also in figure 5, specify the value for the URL as follows:

http://server_name:port_number/seedlist/server?Action=GetDocuments&Format=ATOM&Locale=en_US&Range=5&Source=com.ibm.lotus.search.plugins.seedlist.retriever.jcr.JCRRetrieverFactory&Start=0&SeedlistId=1@

In this URL, the Range parameter specifies the number of documents in a single page of a crawler session. The Retriever sends the response to the crawler in ATOM XML format, and the response is sent as pages, each of which contains up to Range documents.

7. For example, if there is a list of 100 updates to be indexed and the range is set to 10, then the Retriever sends 10 pages to the crawler, each of which contains 10 documents. By default, the range value is set to 100. The administrator can change the Range parameter in the URL based on the portal requirements. The crawler has an internal timeout of 10 minutes; if the portal is so slow that the Retriever cannot retrieve 100 documents in 10 minutes, the administrator can reduce the range value.
Figure 5. Configure JCR Content Source for JCRCollection1

8. Still in figure 5, under the General Parameters section, configure the parameters for the Content Source:

Levels of links to follow. Use the drop-down menu to select how many levels of pages the crawler will follow from the Seedlist. It is unlimited.

Number of documents to collect. Use this to configure the maximum number of documents to collect.

Force complete crawl. Indicates whether the crawler fetches only updates from the Seedlist feed or the full list of content. When enabled (checked), the crawler requests the full list of content items. When unchecked, the crawler requests only the list of updates.

Stop collecting after. Indicates the maximum duration, in minutes, that the crawler should operate.

Stop fetching a document. Indicates how much time, in seconds, the crawler spends attempting to fetch a document.

If the Content Source is created successfully, a message will display at the top of the page, as shown in figure 6.
Figure 6. Content Source JCRSource in collection JCRCollection1 is OK message
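The seedlist URL and the Range paging described in steps 6 and 7 can be sketched as follows. This is an illustrative Python sketch only: the query parameters are the ones shown in the URL above, the SeedlistId value is reproduced exactly as it appears there, and the host name, port, and function names are hypothetical.

```python
import math
from urllib.parse import urlencode

# Retriever factory class name as it appears in the seedlist URL in step 6.
RETRIEVER = "com.ibm.lotus.search.plugins.seedlist.retriever.jcr.JCRRetrieverFactory"


def seedlist_url(host, port, range_size, start=0, seedlist_id="1@"):
    """Assemble a seedlist URL with the parameters shown in step 6.

    The default seedlist_id is copied as-is from the paper; host and port
    are placeholders supplied by the caller.
    """
    params = [
        ("Action", "GetDocuments"),
        ("Format", "ATOM"),
        ("Locale", "en_US"),
        ("Range", range_size),
        ("Source", RETRIEVER),
        ("Start", start),
        ("SeedlistId", seedlist_id),
    ]
    # safe="@" keeps the SeedlistId readable instead of encoding "@" as %40.
    return f"http://{host}:{port}/seedlist/server?" + urlencode(params, safe="@")


def page_count(total_updates, range_size):
    """Number of pages the Retriever sends for a given number of updates."""
    return math.ceil(total_updates / range_size)


# Hypothetical host and port, used only to show the shape of the URL.
print(seedlist_url("portal.example.com", 10039, range_size=10))
print(page_count(100, 10))
```

For the case in step 7 (100 updates, range 10) page_count gives 10 pages. Lowering the range raises the page count but shortens the work needed per page, which is the trade-off an administrator makes when the Retriever cannot deliver a full page within the crawler timeout.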

NOTE: These instructions for creating the JCRCollection1 collection are also in the WebSphere Product Documentation topic, Setting up JCR search collections. Using this collection and content source, you can search for items within the WCM authoring portlet.

If JCRCollection1 is created manually, then the scheduling interval must be configured for the Content Source so that the crawler runs automatically at the configured interval. To do this:

1. Use the Schedulers tab to set the frequency with which the crawler should run to update the search content (see figure 7).

2. Choose the date, time, and update interval at which the crawler should start running. Click Create.

Figure 7. Schedulers tab for JCR Content Source

For JCRCollection1, which is created automatically by the application, the index maintenance is scheduled to run every 60 minutes. If you want to change this frequency, you can configure it in the scheduler. Delete the existing scheduled Updates, and choose the day, time, and interval; click Create. Figure 8 shows that the scheduler is configured to run on Jan 13, 2011 at 2:00 PM; thereafter, it continues to run every 30 minutes.


Figure 8. Scheduler for JCR Content Source
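As a small worked example of the schedule in figure 8, the run times follow from the start time and the interval. This Python sketch is only arithmetic on the values quoted above; the function name is hypothetical and nothing here calls the portal scheduler.

```python
from datetime import datetime, timedelta


def scheduled_runs(first_run, interval_minutes, count):
    """Return the first `count` crawler run times for a fixed interval."""
    step = timedelta(minutes=interval_minutes)
    return [first_run + i * step for i in range(count)]


# Values from the example: first run on Jan 13, 2011 at 2:00 PM, then every 30 minutes.
runs = scheduled_runs(datetime(2011, 1, 13, 14, 0), 30, 4)
for run in runs:
    print(run.strftime("%b %d, %Y %I:%M %p"))
```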

3.3 Clustered WebSphere Portal environment


To increase capacity and to provide high availability (24x7), you can cluster multiple WebSphere Portal servers by using WebSphere Application Server Network Deployment. The portal in each node of a cluster shares a common configuration, and the load is distributed evenly across all cluster instances.

In such an environment, search must be configured as a remote search service so that all the nodes of the cluster have a common search directory. Indexing and search occur over the common remote search service, which might yield performance benefits by offloading and balancing the system load. You can provide the remote search service either as an EJB or as a Web service via SOAP. With EJB, you can enable security; with SOAP, you cannot. Only the WebSphere Portal nodes in the same cluster can use the same remote search service.

For a clustered setup in WebSphere Portal 7, you need to delete the default search collection JCRCollection1, which is automatically created when content is first created or edited after a restart of WebSphere Portal, because it is configured with the standalone environment settings. To delete it, click the Delete Collection (trash can) icon in the Search Collections window (see figure 9).

Figure 9. Search Collections window

Consider a case in which we have a vertical clustered setup with two nodes configured (members WebSphere_Portal and WebSphere_Portal_jcrislgrp1). The WebSphere_Portal node is running on port 10079, and the WebSphere_Portal_jcrislgrp1 node is running on port 10133 (see figure 10).

Figure 10. Vertical cluster nodes in WAS Admin Console

In this case, you can configure the remote search service either as an EJB or as a Web service via SOAP. Before doing so, however, you must have the required .zip files and install the EJB or SOAP application for the remote search service. Follow the instructions in the Product Documentation topic, Preparing for remote search service, to prepare the system. In this paper, we show the remote search service using SOAP.

3.4 Preparing the remote search service


To prepare the remote search service, follow these steps:

1. Copy the files WebScannerSoap.ear, WebScannerEjbEar.ear, and PseLibs.zip to the was_profile_root/installableApps directory on the machine on which you want to install the remote search service (see figure 11).

Figure 11. Remote search server in a cluster environment

2. Depending on the requirements of your environment, install one of the two applications, WebScannerEjbEar.ear (for the EJB service) or WebScannerSoap.ear (for the SOAP service), on server1 (see figure 12).

Figure 12. WAS Admin Console showing the SOAP application running

3. Extract the PSE libraries and add them to the classpath on server1, as follows:

a) Create a directory with the name Extract under the directory installableApps.
b) Locate the file PseLibs.zip in the directory installableApps and extract its contents into the Extract directory that you created in the previous step.
c) Open the administrative console, select Environment > Shared Libraries, and create or modify the shared library named PSE (see figure 13).
d) Add the extract/lib library to the classpath by adding a new line to the Classpath field, giving the full path, was_profile_root/installableApps/extract/lib.
e) Save your changes to the master configuration.

Figure 13. Environment variable PSE in Shared Libraries

4. Define a new Classloader for server1, as follows (see figure 14):

a) In the WAS administrative console, select Servers > Server types > WebSphere application servers > server1.
b) Under Server Infrastructure, select Java and Process Management, and click Classloaders.
c) Click New and then click Apply.
d) Under Additional Properties, click Libraries, then click Add. Select the Library Name PSE from the drop-down list, and click OK. Save your changes to the master configuration.

Figure 14. Define Classloader for server1 in WAS Admin Console

5. Determine the values required for configuring the portlet parameters; specifically, the port number for the SOAP URL parameter. The appropriate port number is the port on which the application server runs; in other words, the HTTP transport on which server1 is configured to run. To determine the correct port number, in the administrative console, select Application servers > server1 > Ports > WC_defaulthost (see figure 15).

Figure 15. WC_defaulthost port in remote search server

6. Make sure that the port number set in the following file matches this port:

was_profile_root/installedApps/cell/WebScannerEar.ear/WebScannerSoap.war/wsdl/com/ibm/hrl/portlets/WsPSE/WebScannerLiteServerSOAPService.wsdl

where cell is the cell name of your remote search machine.

7. Edit the file, looking for the port given in the value for the SOAP address location, for example (see figure 16):

<soap:address location="http://localhost:your_port_no/WebScannerSOAP/servlet/rpcrouter"/>

Figure 16. WebScannerLiteServerSOAPService.wsdl file in remote search server

8. In the administrative console, select Resources > Asynchronous beans > Work managers, and create a new Work manager named PSEWorkManager with the following attributes (see figure 17):

Name: PSEWorkManager
JNDI Name: wps/searchIndexWM
Minimum Number of Threads: 20
Maximum Number of Threads: 60
Growable: True (ensure that the Growable check box is selected)
Service Names: Application Profiling Service, WorkArea, Security, Internationalization

9. Click Apply and then Save to save your changes to the configuration.

Figure 17. Snapshot of PSEWorkManager resource in remote search server

10. Finally, open the WAS administrative console, select Applications > Application Types > WebSphere enterprise applications, and scroll to WebScannerEar. You can use the filter feature to search for the name. Select the check box and click Start. A message confirms that the application started successfully.
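The port comparison in steps 5 through 7 can also be done with a short script rather than by eye. The following Python sketch assumes that the WSDL file uses the standard WSDL 1.1 SOAP binding namespace for its soap:address element; the trimmed sample WSDL, its port name, and the function names are hypothetical, and the port value 10054 is simply the one used in the SOAP URL example later in this paper.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Assumption: the standard WSDL 1.1 SOAP binding namespace is used.
SOAP_NS = "http://schemas.xmlsoap.org/wsdl/soap/"


def soap_address_ports(wsdl_text):
    """Return the port of every soap:address location found in the WSDL."""
    root = ET.fromstring(wsdl_text)
    ports = []
    for address in root.iter(f"{{{SOAP_NS}}}address"):
        location = address.get("location", "")
        ports.append(urlsplit(location).port)
    return ports


def port_matches(wsdl_text, wc_defaulthost_port):
    """True if the WSDL has at least one address and all use the given port."""
    ports = soap_address_ports(wsdl_text)
    return bool(ports) and all(p == wc_defaulthost_port for p in ports)


# Hypothetical, heavily trimmed WSDL used only to exercise the functions.
SAMPLE_WSDL = """<?xml version="1.0"?>
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/">
  <service name="WebScannerLiteServerSOAPService">
    <port name="ExamplePort">
      <soap:address location="http://localhost:10054/WebScannerSOAP/servlet/rpcrouter"/>
    </port>
  </service>
</definitions>
"""

print(soap_address_ports(SAMPLE_WSDL))
print(port_matches(SAMPLE_WSDL, 10054))
```

In practice you would read the real WebScannerLiteServerSOAPService.wsdl from disk and pass in the WC_defaulthost port found in step 5; a False result means the file still needs the edit described in step 7.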

3.5 Configuring the remote search service


Configure the remote search service per the InfoCenter topic, Configuring a remote search service:

1. Go to Administration > Search Administration > Manage Search > Search Services to create the remote service.

2. For service type EJB, provide values for the parameters EJB Name and IIOP URL, and set PSE_TYPE to ejb (see figure 18).

3. For service type SOAP, provide a value for the parameter SOAP URL, and set PSE_TYPE to soap (see figure 19).

Figure 18. Configuring remote SOAP Search Service (window 1)

Figure 19. Configuring remote SOAP Search Service (window 2)


4. For a clustered setup, configure the DefaultCollectionDirectory as a local directory in the remote search server that is accessible by all the nodes in the setup. Once the service is configured, it appears in the Search Services page (see figure 20).

Figure 20. Search Services in Manage Search page

5. Select the configured remote service, Remote SOAP, and manually create the JCRCollection1 collection with the collection location in the remote search server, as shown in figures 21 and 22.

Figure 21. All Collections in Manage Search window

Figure 22. Create JCR Content Source for JCRCollection1


Once the JCRSource is created, it displays as shown in figure 23.


Figure 23. JCRSource in JCRCollection1 Remote Search Server

6. After the Collection and Content Source are created, update the remote search service properties in the icm.properties JCR configuration file. Here are the configuration properties essential for JCR TextSearch:

Set the Portal Search Engine type to SOAP or EJB. In our case, we set it to SOAP:

jcr.textsearch.PSE.type=SOAP

Set the Portal Search Engine SOAP URL, for example:

jcr.textsearch.SOAP.url=http://9.124.160.188:10054/WebScannerSOAP/servlet/rpcrouter

You should set its value to:

http://your_soap_search_server.your.example_domain.com:port/WebScannerSOAP/servlet/rpcrouter

where your_soap_search_server.your.example_domain.com is the name of the remote search server, and port is the port number that you obtained from the file

was_profile_root/installedApps/cell/WebScannerEar.ear/WebScannerSoap.war/wsdl/com/ibm/hrl/portlets/WsPSE/WebScannerLiteServerSOAPService.wsdl

7. Edit the file, looking for the port given in the value for the SOAP address location, for example:

<soap:address location="http://localhost:your_port_no/WebScannerSOAP/servlet/rpcrouter"/>

When you enter the URL in a Web browser, you should see something like that shown in figure 24.
Figure 24. SOAP RPC Router in Remote Search Server

Because the remote search server is configured in one of the nodes of the clustered environment, the deployment manager is updated with these settings, and the same data appears in the Administration portlet of the secondary cluster nodes. Hence, you should be able to see the Remote SOAP search service and JCRCollection1 in the secondary node, WebSphere_Portal_jcrislgrp1, which is running on port 10133 (see figure 25).

Figure 25. Administration portlet in secondary clustered node

4 Setting up JMS in a clustered environment


For a standalone environment, the JMS bus is created during WebSphere Portal installation. For a clustered environment, the JMS resources for JCR must be added manually on the deployment manager while setting up the portal. Once the cluster environment is set up, that is, after the deployment manager is created and one node is added to it, you can set up the JMS bus and bus member. To create the JCR JMS bus:

1. In the WAS Deployment Manager console, select Service Integration > Buses, and in the Content pane, click New (see figure 26).

Figure 26. Service integration buses

2. Enter a name for the bus; here, it is JCRBus. The name must be the same as the one specified in the JCR icm.properties file by the property jcr.textsearch.busName=JCRBus (see figure 27).

Figure 27. Create New JMS Bus


3. Clear the Bus security check box to disable bus security, and click Next. A summary of the bus creation, with administrative security settings disabled for the bus, is displayed (see figure 28).

Figure 28. Summary of New JMS Bus

4. Review the summary, and click Finish. The JCRBus displays as shown in figure 29. Save your changes to the master configuration.

Figure 29. JCRBus displayed

4.1 Adding a cluster as a bus member


The members of a Service Integration (SI) Bus are the application servers and clusters. Here are the steps to add the bus member:

1. From the navigation pane, select Service Integration > Buses > JCRBus.

2. Under Topology, click Bus members (see figure 30); click Add.

Figure 30. Bus Members in JCRBus

3. Select the Cluster scope in WAS environments that support server clusters (see figure 31).

Figure 31. Add new bus member in JCRBus


4. Select the cluster and click Next. Select the Enable messaging engine policy assistance? check box (see figure 32).

Figure 32. Messaging engine policy assistance for the selected Bus member

5. Select Data store as the type of message store and click Next (see figure 33).

Figure 33. Data store option in Messaging Engine policy

6. Configure the messaging engine for JCRBus, as shown in figure 34.

Figure 34. Configure messaging engines

7. Select Use existing data source, and enter the JNDI name, the schema name, and the authentication alias to be used (see figure 35). The JNDI name is jdbc/wpdbDS_<jcr target db name>; for example, jdbc/wpdbDS_jcr. (Refer to the icm.properties file for the JNDI name.)

Figure 35. Data store configuration in messaging engine

8. Click Next. Optional: You can view the current settings of the initial and maximum Java Virtual Machine (JVM) heap sizes. If you want to tune performance by changing the current settings, select the Change heap sizes check box and enter the changes in the proposed heap sizes fields.

9. Click Next. A summary of the added bus member in the PortalCluster scope displays (see figure 36).

Figure 36. Summary of added bus member in PortalCluster

10. Click Finish to confirm the creation of the bus member, and save your changes to the master configuration.

11. Restart the WebSphere Portal server and the Deployment Manager. After the restart, check the status of the messaging bus to confirm that it is started (see figure 37).

Figure 37. PortalCluster JCRBus is up and running

The JMS resources such as Topic Connection factories and Topics are created during WebSphere Portal installation, so they do not need to be created again for either a standalone or a cluster setup. For the WCM Authoring search to work, you should find the Topic Connection factories (figure 38) and Topics (figure 39) in the Deployment Manager console under JMS resources.

Figure 38. Topic connection factories in JMS Resources

Figure 39. Topics in JMS Resources


Farm environment

In this scenario, WebSphere Portal search should be configured as if it is part of a clustered environment. The search server should be set up as a separate portal instance outside the farm and configured to search the farm. Use the Remote Search Service to configure the search server.

5 Searching a seedlist document in WCM


You've installed WebSphere Portal 7.0 on your system and now want to search for a document. If you check the number of documents in the search index directory, JCRCollection1, you can see that the document count is zero. As mentioned above, the crawler must be run twice for all the documents to be collected and the index to be built completely. In the first run, JCR processes internally and prepares to build the index from scratch. Only in the next scheduled run does the crawler start collecting the documents. You can run the crawler manually by clicking the Start crawler icon. Figure 40 shows the JCR Content Source where the crawler is running manually (note the Running Status).
Figure 40. JCR Content Source where the crawler is running manually

Once the crawler completes its processing, you can see that the document count is 296, and the Status is Idle (see figure 41).


Figure 41. JCR Content Source after crawler completes successfully

If you want the crawler to collect the documents automatically, wait for two scheduled indexing runs to occur. For example, if you've configured the interval as 1 hour, then check after 2 hours; you'll see that the document count is greater than zero.

1. Now, edit any WCM content and save it (figure 42).

Figure 42. WCM document Malar test document

2. Wait until the next index maintenance interval, or run the crawler manually. Figure 43 shows that the crawler has collected the one document that was modified.

Figure 43. Crawler after collecting the modified document

3. To check whether the document was indexed successfully, use the Search and Browse the Collection (spectacles) icon in the Manage Search Collections from All Services window (see figure 44). You can use this to search for content directly against the search collection, which differs from searching in the WCM Authoring portlet.

Figure 44. Search and Browse the Collection icon window

4. Type the string for which you want to search in the "Search for" entry field, and click Search. Search and Browse displays the search results in a table (see figure 45).

Figure 45. Search and Browse the Collection window

NOTE: The WCM Authoring portlet uses a JCR XPath query to fetch search results from the repository. For more information about searching with XPath queries, refer to the developerWorks article, Java content repository TextSearch in IBM WebSphere Portal and IBM Lotus Web Content Management: Overview and troubleshooting.

5. Now, if you search for the edited document in the WCM Authoring portlet (see figure 46), the results display as shown in figure 47.

Figure 46. Search content in the WCM Authoring portlet


Figure 47. Search showing search results


The XPath query for the search text Malars test document is [text-contains(.,'Malar* [SUBTREE_UUID]:[d10b57d5-c1f0-4f0b-8188-5554a5d5ff79]')] order by text-score(.,'Malar* [SUBTREE_UUID]:[d10b57d5-c1f0-4f0b-8188-5554a5d5ff79]') descending.
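The structure of this query, a text-contains() predicate followed by an ordering on text-score() over the same pattern, can be sketched as a small string builder. This Python sketch only reproduces the query shape shown above; the function name is hypothetical, and how the Authoring portlet derives the pattern Malar* from the typed search text is not covered here. Because the escaping rules for quotes inside the pattern are not described in this paper, the sketch simply rejects patterns that contain a single quote.

```python
def text_search_predicate(pattern, subtree_uuid):
    """Build the predicate and ordering clause in the shape shown above.

    Assumption: quote escaping is not handled; patterns or UUIDs containing
    a single quote are rejected rather than guessed at.
    """
    if "'" in pattern or "'" in subtree_uuid:
        raise ValueError("single quotes are not supported in this sketch")
    search = f"{pattern} [SUBTREE_UUID]:[{subtree_uuid}]"
    return (
        f"[text-contains(.,'{search}')] "
        f"order by text-score(.,'{search}') descending"
    )


# Values taken from the example query above.
query = text_search_predicate("Malar*", "d10b57d5-c1f0-4f0b-8188-5554a5d5ff79")
print(query)
```

The same search string appears in both functions, so that the results filtered by text-contains() are ranked by the score computed for that same pattern.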

6 Troubleshooting JCR TextSearch


Let's now discuss some of the common problems reported by users while indexing and searching with WebSphere Portal 7 TextSearch.

(1) When you run the crawler, it fails with the following message in the trace logs:
[9/9/10 11:20:28:074 GMT+05:30] 00000074 JCRCFLLoggerI 3 com.ibm.icm.ts.tss.JCRCFLLoggerImpl com.ibm.icm.ts.tss.app.SubscriptionIndexMaintainer.fetchDocuments [WebContainer : 2] com.ibm.icm.ts.tss.app.SubscriptionIndexMaintainer.fetchDocuments [WebContainer : 2]: Text Search is Disabled, returning empty documents list.

As indicated by the message, TextSearch is not enabled. Enable TextSearch by setting the property jcr.textsearch.enabled to true.

(2) Incremental crawl in WebSphere Portal 7 uses the JMS messaging engine. If the JMS Topic Connection factory does not exist in WAS, the crawl fails with the following message in the SystemErr logs:
[7/30/10 13:29:28:187 EDT] 0000002e SystemErr R Caused by: javax.naming.NameNotFoundException: Context: rtp33/nodes/rtp33/servers/WebSphere_Portal, name: jms/JCRSeedTCF: First component in name jms/JCRSeedTCF not found. [Root exception is org.omg.CosNaming.NamingContextPackage.NotFound: IDL:omg.org/CosNaming/NamingContext/NotFound:1.0]

Ensure that the Topic Connection factory (figure 48) and the Incremental Crawl Topic (figure 49) exist in WAS under the names given by the following properties:
jcr.textsearch.crawl.tcf=jms/JCRSeedTCF
jcr.textsearch.incrementalcrawl.topic=jms/JCRSeedTopic1100


Figure 48. Topic Connection factory in WAS

Figure 49. Incremental crawl Topic in WAS

Also, ensure that the bus named in the property jcr.textsearch.busName=JCRBus is running (see figure 50).


Figure 50. JCRBus in WAS
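The property settings named in items (1) and (2) can be checked before a crawl is started. The following is a minimal Python sketch, not an IBM tool: it parses the text of a properties file and reports whether jcr.textsearch.enabled is true and whether the three JMS-related properties are present. The function names are hypothetical, and the location of the properties file is not assumed; pass in its contents as a string.

```python
# Properties named in items (1) and (2) of this section.
JMS_KEYS = (
    "jcr.textsearch.crawl.tcf",
    "jcr.textsearch.incrementalcrawl.topic",
    "jcr.textsearch.busName",
)


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_textsearch_settings(props: dict) -> list:
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    if props.get("jcr.textsearch.enabled", "").lower() != "true":
        problems.append("jcr.textsearch.enabled is not set to true")
    for key in JMS_KEYS:
        if not props.get(key):
            problems.append(f"{key} is missing or empty")
    return problems


if __name__ == "__main__":
    sample = """
    jcr.textsearch.enabled=true
    jcr.textsearch.crawl.tcf=jms/JCRSeedTCF
    jcr.textsearch.incrementalcrawl.topic=jms/JCRSeedTopic1100
    jcr.textsearch.busName=JCRBus
    """
    for problem in check_textsearch_settings(parse_properties(sample)):
        print(problem)
```

The sketch only confirms that the properties are set. Whether the named Topic Connection factory, Topic, and bus actually exist and are running still has to be verified in WAS, as shown in figures 48 through 50.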

(3) If the database is transferred from Derby to another database such as DB2, ensure that the following ConfigEngine task has been run to create the JMS resources in the new database:
./ConfigEngine.sh create-jcr-jms-resources-post-dbxfer -DWasPassword=password

Failure to run this task may cause the crawler to fail to collect documents in JCRCollection1.

(4) When Document Conversion Services is unable to extract content from binary documents, JCR throws a com.ibm.icm.ts.ConverterException exception:
Caused by: com.ibm.wps.odc.convert.ConvertorException: com.ibm.wps.odc.convert.ConvertorException: Stellent Conversion Error: file is corrupt
Caused by: com.ibm.wps.odc.convert.PasswordProtectedException: com.ibm.wps.odc.convert.PasswordProtectedException: File is password protected or encrypted

For all types of ConvertorException, including corrupt files, password-protected files, and unsupported MIME types, JCR indexes all the attributes of the document except the file content. Hence, the document is still searchable by its attributes.

(5) You are using Microsoft SQL Server with JDBC driver 3.0, and you notice the following SQLExceptions when the WebSphere Portal server starts:
[11/15/10 17:17:01:200 CST] 00000055 SibMessage E [JCRSeedBus:PortalCluster.000JCRSeedBus] CWSIS0002E: The messaging engine encountered an exception while starting. Exception: com.ibm.ws.sib.msgstore.PersistenceException: CWSIS1501E: The data source has produced an unexpected exception: com.microsoft.sqlserver.jdbc.SQLServerException: Cannot resolve the collation conflict between "SQL_Latin1_General_CP1_CI_AS" and "SQL_Latin1_General_CP1_CS_AS" in the INTERSECT operation.
[11/15/10 17:17:01:246 CST] 00000055 SibMessage E [JCRSeedBus:PortalCluster.000JCRSeedBus] CWSID0027E: Messaging engine PortalCluster.000-JCRSeedBus cannot be restarted because a serious error has been reported.
[11/15/10 17:17:01:246 CST] 00000055 SibMessage I [JCRSeedBus:PortalCluster.000JCRSeedBus] CWSID0016I: Messaging engine PortalCluster.000-JCRSeedBus is in state Stopped.

This exception prevents JCR Search from collecting documents in JCRCollection1. The solution is to use the SQL Server JDBC Driver version 2.0: WAS V7 is tested with JDBC Driver version 2.0 against Microsoft SQL Server 2008, and with JDBC Driver version 1.2 against Microsoft SQL Server 2005. WAS V7 currently does not support JDBC driver 3.0.

(6) When the WebSphere Portal server starts, you notice the following errors in the logs:
[12/10/10 9:49:27:739 EST] 0000000b SibMessage E [JCRSeedBus:wpslesblade05.WebSphere_Portal-JCRSeedBus] CWSIS0002E: The messaging engine encountered an exception while starting. Exception: com.ibm.ws.sib.msgstore.PersistenceException: CWSIS1501E: The data source has produced an unexpected exception: java.lang.IllegalStateException: CWSIS1530E: The data type, -9, was found instead of the expected type, 12, for column, URI, in table, jcr.SIBCLASSMAP.
[12/10/10 9:49:27:781 EST] 0000000b WSRdbManagedC W DSRA1300E: Feature is not implemented: javax.sql.PooledConnection.removeStatementEventListener
[12/10/10 9:49:27:784 EST] 0000000b SibMessage E [JCRSeedBus:wpslesblade05.WebSphere_Portal-JCRSeedBus] CWSID0035E: Messaging engine wpslesblade05.WebSphere_Portal-JCRSeedBus cannot be started; detected error reported during com.ibm.ws.sib.msgstore.impl.MessageStoreImpl start()
[12/10/10 9:49:27:786 EST] 0000000b SibMessage E [JCRSeedBus:wpslesblade05.WebSphere_Portal-JCRSeedBus] CWSID0027E: Messaging engine wpslesblade05.WebSphere_Portal-JCRSeedBus cannot be restarted because a serious error has been reported.
[12/10/10 9:49:27:787 EST] 0000000b SibMessage I [JCRSeedBus:wpslesblade05.WebSphere_Portal-JCRSeedBus] CWSID0016I: Messaging engine wpslesblade05.WebSphere_Portal-JCRSeedBus is in state Stopped.

This issue occurs on servers whose WAS version is 7.0.0.11 or earlier, and it prevents JCR Search from running. The workaround, described in the IBM Support document PK11027: Messaging engine startup fails for some Oracle driver/server levels, is to turn off column checking by adding the following line to the sib.properties file:
sib.msgstore.jdbcPerformColumnChecks=false
The problem is corrected by APAR PM13911 in WAS Fix Pack 7.0.0.13 and later, per the Support document titled PM13911: ILLEGALSTATEEXCEPTION ATTEMPTING TO START A MESSAGING ENGINE AGAINST AN SQL SERVER 2008 DATABASE WITH A JDBC 2.0 DRIVER.

(7) In a clustered environment, you notice log errors during WebSphere Portal startup, and JCR Search does not work. For example, you may find the following errors and exceptions in the WebSphere Portal server's SystemOut.log file:
[12/7/10 20:53:19:975 EST] 00000020 SibMessage I [JCRSeedBus:PortalCluster.000JCRSeedBus] CWSIS1538I: The messaging engine, ME_UUID=4A76DD636E4D7F8D, INC_UUID=31828889C3AEC661, is attempting to obtain an exclusive lock on the data store.
[12/7/10 20:53:20:334 EST] 00000021 SibMessage I [JCRSeedBus:PortalCluster.000JCRSeedBus] CWSIS1545I: A single previous owner was found in the messaging engine's data store, ME_UUID=5A4BB5D8F066DB65, INC_UUID=66E9D52EA9D8BA08
[12/7/10 20:53:20:339 EST] 00000021 SibMessage E [JCRSeedBus:PortalCluster.000JCRSeedBus] CWSIS1535E: The messaging engine's unique id does not match that found in the data store. ME_UUID=4A76DD636E4D7F8D, ME_UUID(DB)=5A4BB5D8F066DB65
[12/7/10 20:53:20:346 EST] 00000020 SibMessage I [JCRSeedBus:PortalCluster.000JCRSeedBus] CWSIS1593I: The messaging engine, ME_UUID=4A76DD636E4D7F8D, INC_UUID=31828889C3AEC661, has failed to gain an initial lock on the data store.
[12/7/10 20:53:20:361 EST] 00000020 SibMessage E [JCRSeedBus:PortalCluster.000JCRSeedBus] CWSIS1519E: Messaging engine PortalCluster.000-JCRSeedBus cannot obtain the lock on its data store, which ensures it has exclusive access to the data.

The reason for the errors is that the JCR Message Engine fails to start. To resolve the problem:

1. Stop the NodeAgent and WebSphere Portal servers, and connect to your database.
2. Drop all tables that are related to the message engine's data store. There are nine tables, all beginning with the three letters "SIB":
SIB000
SIB001
SIB002
SIBCLASSMAP
SIBKEYS
SIBLISTING
SIBOWNER
SIBOWNERO
SIBXACTS
3. Restart the Deployment Manager, NodeAgent, and WebSphere Portal servers, to recreate the tables automatically with the correct message engine data store's unique ID.

This is documented in the IBM Support Technote, The messaging engine's unique ID does not match that found in the data store.
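The drop step in the procedure above could be prepared with a small script. The following Python sketch only generates DROP TABLE statements for the nine tables for review; it does not connect to a database. The function name is hypothetical and the schema is an assumption that must be supplied by the administrator (the log in item (6) happens to show a table qualified as jcr.SIBCLASSMAP). Exact DROP syntax can also vary by database.

```python
import re

# The nine message engine data store tables listed in the procedure above.
SIB_TABLES = (
    "SIB000", "SIB001", "SIB002", "SIBCLASSMAP", "SIBKEYS",
    "SIBLISTING", "SIBOWNER", "SIBOWNERO", "SIBXACTS",
)


def drop_statements(schema: str) -> list:
    """Return one DROP TABLE statement per SIB table for the given schema."""
    # Accept only a plain identifier so nothing unexpected ends up in the SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", schema):
        raise ValueError(f"unexpected schema name: {schema!r}")
    return [f"DROP TABLE {schema}.{table};" for table in SIB_TABLES]


if __name__ == "__main__":
    # "jcr" is an example schema; use the schema of your own data store.
    for statement in drop_statements("jcr"):
        print(statement)
```

Review the generated statements and run them only after the NodeAgent and WebSphere Portal servers are stopped, as the procedure requires.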

(8) After installation, you set jcr.textsearch.enabled=true and restart WebSphere Portal. However, when you repeatedly run the crawler manually from the Administration portlet, the exception "Full Crawl Topics are not free. Please retry later." appears in the log:
[7/23/10 11:49:52:340 EST] 0000007a JCRCFLLoggerI E com.ibm.icm.ts.tss.JCRCFLLoggerImpl com.ibm.icm.ts.tss.ls.LibraryServerImpl.retrieveTopicFromPool [WebContainer : 8]: Full Crawl Topics are not free. Please retry later.
com.ibm.icm.ts.tss.TextEngineException: Full Crawl Topics are not free. Please retry later.
at com.ibm.icm.ts.tss.ls.LibraryServerImpl.retrieveTopicFromPool(LibraryServerImpl.java:583)
at com.ibm.icm.ts.tss.app.SubscriptionIndexMaintainer.getTopicFromPool(SubscriptionIndexMaintainer.java:1529)

The error occurs because each manual start of the crawler initiates a full crawl, which publishes messages for the entire repository. Repeated manual starts exhaust all the full-crawl Topics, so the JCRRetriever throws the above exception.


During the next scheduled crawler index run, the expired full-crawl topics are reclaimed, and JCRRetriever indexes the messages in the next interval. We recommend letting the crawler collect documents automatically at the scheduled interval rather than running it manually several times in a row.

(9) When a search in the WCM Authoring portlet does not return a document, you may want to know whether the document is indexed in the JCRCollection1 directory. To check, search for the keyword directly by using the Search and Browse the Collection (spectacles) icon in the Manage Search Collections from All Services window. If the search returns results, the document was indexed in the collection directory successfully; the keyword search may be failing in the Authoring portlet for some other reason, in which case you need to contact JCR Support. If Search and Browse the Collection does not return results, wait until the next scheduled indexing interval and search again.

For other issues in TextSearch, collect the Index Maintenance and Search logs. To do this:

1. Go to WebSphere Portal Administration > Enable Tracing.
2. Set the trace to [com.ibm.icm.ts.*=finest] for text search, and remove other JCR traces such as [com.ibm.icm.*=finest] in the Portal admin console. This reduces the log file content and helps debug the issue faster.
3. Edit the document for which search is not working, and save it.
4. Manually run the crawler in Administration > Search Administration > Manage Search > Search Collections > JCRCollection1 > JCR Content Source. Wait about 5 minutes to allow the crawler to completely build the index directory.
5. Collect the logs as "index-logs".
6. Search for content, and collect the logs as "search-logs".

Contact JCR Support with the collected logs (SystemOut.log, SystemErr.log, trace.log, and all history trace logs) for index maintenance and search.
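Before contacting Support, the log signatures quoted in items (1) through (8) can be matched mechanically against the collected logs. The following is a minimal Python sketch, not an IBM tool: the function name, the pattern table, and the hint wording are assumptions built only from the messages quoted in this section, so a match is a pointer to the relevant item rather than a diagnosis.

```python
import re

# (item number in this section, regex for a log line, short hint)
SIGNATURES = [
    (1, r"Text Search is Disabled", "set jcr.textsearch.enabled to true"),
    (2, r"NameNotFoundException.*jms/",
     "check the Topic Connection factory and Topic in WAS"),
    (4, r"ConvertorException",
     "file content was not indexed; attributes still are"),
    (5, r"collation conflict", "check the SQL Server JDBC driver version"),
    (6, r"CWSIS1530E",
     "disable column checks or apply the WAS fix pack"),
    (7, r"CWSIS1535E",
     "messaging engine ID mismatch; recreate the SIB tables"),
    (8, r"Full Crawl Topics are not free",
     "wait for the next scheduled crawl"),
]


def triage(lines):
    """Return sorted (item, hint) pairs for every signature seen in lines."""
    found = {}
    for line in lines:
        for item, pattern, hint in SIGNATURES:
            if item not in found and re.search(pattern, line):
                found[item] = hint
    return sorted(found.items())


if __name__ == "__main__":
    sample = [
        "SubscriptionIndexMaintainer.fetchDocuments [WebContainer : 2]: "
        "Text Search is Disabled, returning empty documents list.",
        "CWSIS1535E: The messaging engine's unique id does not match "
        "that found in the data store.",
    ]
    for item, hint in triage(sample):
        print(f"item ({item}): {hint}")
```

To scan a real log, pass an open file object to triage, since iterating over a file yields its lines one at a time.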

7 Conclusion
This paper has presented the new features of JCR TextSearch Indexing in WebSphere Portal 7. Our aim is that WebSphere Portal developers, administrators, and customers who use JCR TextSearch can use this document to help them understand configuration techniques in different environments, and to administer and troubleshoot common issues in WebSphere Portal 7.

8 Resources
developerWorks white paper, Making content searchable anywhere using IBM WebSphere Portal's publishing Seedlist Framework.
developerWorks article, Introducing the Java Content Repository API.
developerWorks article, Using Apache Lucene to search text.
WebSphere Portal wiki article, Java content repository TextSearch Support tools for IBM WebSphere Portal: Overview and usage.
WebSphere Portal Server 7 Product Documentation.
developerWorks WebSphere Portal product page.

About the authors


Malarvizhi Kandasamy (Malar) is a Staff Software Engineer who has worked at IBM since April 2005 and on JCR since 2006, with more than 10 years of experience in the software industry. She is an IBM Certified Solution Designer for Content Manager v8.3, an IBM Certified Database Associate for DB2 v9, a Sun Certified Java Programmer 1.5, and an IBM Certified Lotus WebSphere Portal Administrator v6.1. She holds a Bachelor of Engineering degree in Computer Science from Madras University. You can reach her at k.malarvizhi@in.ibm.com.

Ramgopal (Ram) Kanasani is a Software Engineer who has worked at IBM since August 2008. He holds a Bachelor of Technology degree from IIIT Allahabad. You can reach him at rakanasa@in.ibm.com.

Trademarks
DB2, developerWorks, IBM, Lotus, and WebSphere are trademarks or registered trademarks of IBM Corporation in the United States, other countries, or both. Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others.
