
IBM Content Analytics - Enterprise Search Solution - Shweta Chaudhary

What is content analytics?


IBM Content Analytics helps you perform content analysis of unstructured documents, e-mails, databases, and other enterprise content by providing a platform for crawling content, extracting facets of meaning from text, and interactively exploring those facets and relationships between facets. Content analytics starts with the gathering of relevant enterprise content. IBM Content Analytics provides crawlers for a wide variety of enterprise content sources that allow content, metadata, structured data, and unstructured data to be collected and fed into the analytics pipeline. Crawler configuration is flexible and scalable, enabling you to gather the source information that you need to intelligently answer content-based questions.

Component overview
The content analytics components collect data from throughout your enterprise; analyze, parse, and extract meaning from the information; and create a text index and a correlation index that users can explore and mine. A content analytics collection represents the set of sources that users can explore and mine with a single query. When you create a collection, you specify which sources you want to include and configure options for how users can search and mine the indexed data. You can create multiple collections, and each collection can contain data from a variety of data sources. For example, you might create a collection that includes documents from databases, or one that crawls the URLs of your website. When users explore and mine a collection, the results can include documents and links from each of the data sources.

Crawlers
You can start and stop crawlers manually, or you can set up crawling schedules. If you schedule a crawler, you specify when it is initially run and how often it must revisit the data sources to crawl new and changed documents. You can configure the Web crawler to run continuously. You specify which uniform resource locators (URLs) you want to crawl, and the crawler returns periodically to check for data that is new or changed.
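IBM Content Analytics manages crawl schedules internally, but the revisit logic described above can be pictured with a small sketch. The function below is purely illustrative, not a product API; the URLs and the revisit interval are made up.

```python
from datetime import datetime, timedelta

def urls_due_for_recrawl(last_crawled, revisit_interval, now):
    """Return the URLs whose revisit interval has elapsed since the last crawl."""
    return [url for url, crawled_at in last_crawled.items()
            if now - crawled_at >= revisit_interval]

# Hypothetical crawl history: URL -> time of the last successful crawl.
last_crawled = {
    "http://example.com/a": datetime(2011, 1, 1, 8, 0),
    "http://example.com/b": datetime(2011, 1, 1, 11, 30),
}
due = urls_due_for_recrawl(last_crawled, timedelta(hours=2),
                           now=datetime(2011, 1, 1, 12, 0))
# Only /a is due: it was last crawled four hours ago; /b only 30 minutes ago.
```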

Parsers, document processors, and annotators


The parser, document processors, and annotators analyze documents that were collected by a crawler and prepare them for indexing.

Indexing
After documents are parsed and analyzed, the indexing processes incrementally add information about new, changed, and deleted documents to the index to ensure that users always have access to the most current information.
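The incremental update can be sketched as applying a delta of added, changed, and deleted documents to an existing index. This toy model, with a plain dictionary standing in for the real index structures, is only an illustration:

```python
def apply_index_delta(index, added, changed, deleted):
    """Apply an incremental update: drop deleted documents, then merge
    new and changed documents. Returns a new index; the input is untouched."""
    updated = dict(index)
    for doc_id in deleted:
        updated.pop(doc_id, None)
    updated.update(added)
    updated.update(changed)
    return updated

index = {"doc1": "old text", "doc2": "stale"}
index = apply_index_delta(index,
                          added={"doc3": "new report"},
                          changed={"doc1": "revised text"},
                          deleted=["doc2"])
# index now reflects the latest state of all three documents.
```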

Search Servers
The search servers work with your search applications to process queries, search the index, and return results. When you configure the search servers for a collection, you can specify options for how the collection is to be searched or mined:

You can configure a search cache to hold the results of frequently submitted requests. A search cache can improve search and retrieval performance.

If your application developers create custom dictionaries, you can associate the dictionaries with collections:
- When users query a collection that uses a synonym dictionary, documents that contain synonyms of the query terms are included in the results.
- When users query a collection that uses a stop word dictionary, the stop words are removed from the query before the query is processed.
- When users query a collection that uses a boost word dictionary, the importance of documents that contain the boost words is increased or decreased, depending on the boost factor that is associated with the word in the dictionary.

If you predetermine that certain documents are relevant to certain queries, you can configure quick links. A quick link associates a specific URI with specific keywords and phrases. If a query contains any of the keywords or phrases that are specified in a quick link definition, the associated URI is returned automatically in the results.

You can configure type-ahead support. When typing a query, users can see suggestions for potential matches of the query terms and select one of the suggested matches to run the query. Suggested matches are also provided for facet values. Users can select the facet values that they want to add to a search by selecting one or more of the suggestions.

In a multiple server configuration, failover capability is available at the collection level, not just at the server level. If a collection on one search server becomes unavailable for any reason, queries for that collection are routed automatically to another search server.
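The dictionary behaviors described above can be illustrated with a short sketch. This is not the product's query pipeline; the dictionaries and the simple multiplicative boost are invented for illustration:

```python
def preprocess_query(terms, stop_words, synonyms):
    """Remove stop words, then expand each remaining term with its synonyms."""
    expanded = []
    for term in terms:
        if term in stop_words:
            continue
        expanded.append(term)
        expanded.extend(synonyms.get(term, []))
    return expanded

def boosted_score(base_score, matched_terms, boost_factors):
    """Scale a document's score by the boost factor of each matched boost word."""
    for term in matched_terms:
        base_score *= boost_factors.get(term, 1.0)
    return base_score

query = preprocess_query(["the", "laptop", "battery"],
                         stop_words={"the"},
                         synonyms={"laptop": ["notebook"]})
# "the" is dropped; "laptop" is expanded with its synonym "notebook".
```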

Administration Console
The administration console runs in a browser, which means administrative users can access it from any location at any time. Security mechanisms ensure that only those users who are authorized to access administrative functions do so. The administration console includes wizards that can help you do several of the primary administrative tasks. For example, the crawler wizards are specific to a data source type and help you select the sources that you want to enable users to search.

Monitoring the System


You can use the administration console to monitor system activities and adjust operations as needed. If you create a search collection, you can see detailed statistics about current and past query processing. For example, you can select a timeline for the query statistics, see the most popular queries, and see queries that returned no results. You can also export the query history to a comma-separated values (CSV) file, which you can then import into a spreadsheet or other application for further statistical analysis.
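Once the query history is exported to CSV, a few lines of code can surface the most popular queries and the ones that returned nothing. The column names ("query" and "results") below are assumptions for illustration, not the product's actual export format:

```python
import csv
import io
from collections import Counter

def summarize_query_history(csv_text):
    """Return (query frequency counter, sorted zero-result queries).
    Assumes 'query' and 'results' columns; the real export format may differ."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    popular = Counter(row["query"] for row in rows)
    zero = sorted({row["query"] for row in rows if int(row["results"]) == 0})
    return popular, zero

history = "query,results\nrouter,12\nrouter,8\nfoo bar,0\n"
popular, zero = summarize_query_history(history)
# popular.most_common(1) -> [("router", 2)]; zero -> ["foo bar"]
```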

Log Files
Log files are created for individual collections and for system-level sessions. When you configure logging options for a collection or for the system, you specify the types of messages that you want to log, such as error messages and warning messages. You also specify how often you want the system to rotate older log files to make room for recent messages. You can choose options to receive e-mail about specific messages (including alerts), or all error messages, whenever they occur.

Customizing the content analytics system


The application programming interfaces enable you to create custom search, administration, and text miner applications; custom applications to update the content of collections; custom programs for text analysis; and custom dictionaries for synonyms, stop words, and boost words.

Search and text miner applications

You can use the search and text miner applications as a template for developing custom applications.

How Data Flows through the system

Planning for Installation: Hardware and Software Requirements

The supported system configurations depend on whether you install the product on a single server or multiple servers.

Single server installation


After you install the software on one computer, you can install additional servers to support search and document processing. For example, you might want to set up several search servers to spread the query processing load across processors. If you install the product on additional servers, you must use the administration console to specify the purpose of each server. Install the crawl, index, and search components on one computer. You can then add servers to support the following activities:

Search
To provide increased scalability and high availability for query processing, including processing requests for deep inspection of text analysis results, you can add multiple search servers.

Document processing
To provide increased scalability and failover support for parsing and analyzing documents, you can add multiple document processing servers.

Backup server for high availability
To provide increased throughput and failover support for crawling and indexing documents, you can add one high availability backup server on AIX or Windows platforms. This server is supported through IBM PowerHA for AIX and Microsoft Cluster Service (MSCS) on Windows. You must ensure that the backup server that you add and the master server share the same data directory (ES_NODE_ROOT).

The following figure shows a sample single server configuration.

Figure 1. Sample single server configuration

Distributed server installation


If you install the product on more than one computer, you must install the base product on one server, called the master server. You can then add servers to support the following activities:

Crawl
You must install one master crawler server to support crawlers.

Search
You must add at least one server to support search. To provide increased scalability and high availability for query processing, including processing requests for deep inspection of text analysis results, you can add multiple search servers.

Document processing
To provide increased scalability and failover support for parsing and analyzing documents, you can add multiple document processing servers.

Backup servers for high availability
To provide increased throughput and failover support, you can add one high availability server for crawling documents and one high availability server for indexing documents on AIX or Windows platforms. These servers are supported through IBM PowerHA for AIX and Microsoft Cluster Service (MSCS) on Windows. You must ensure that each backup server that you add and the master server share the same data directory (ES_NODE_ROOT).

The following figure shows a sample distributed server configuration.

Figure 2. Sample distributed server configuration

Operating system
The following list specifies the supported operating systems (Windows Server, in this case) for any computer on which you install the server components for IBM Content Analytics. In a multiple server installation, all servers must run the same operating system.

- Microsoft Windows Server 2003 Enterprise with SP2 (32-bit): minimum of one computer with two or more Intel or AMD 32-bit processors.
- Microsoft Windows Server 2003 R2 Enterprise with SP2 (32-bit): minimum of one computer with two or more Intel or AMD 32-bit processors.
- Microsoft Windows Server 2008 Enterprise with SP2 (64-bit): minimum of one computer with two or more 64-bit processors (AMD Opteron or Intel EM64T, 64-bit kernel support only).
- Microsoft Windows Server 2008 Standard with SP2 (64-bit): minimum of one computer with two or more 64-bit processors (AMD Opteron or Intel EM64T, 64-bit kernel support only).
- Microsoft Windows Server 2008 R2 Enterprise (64-bit): minimum of one computer with two or more 64-bit processors (AMD Opteron or Intel EM64T, 64-bit kernel support only).
- Microsoft Windows Server 2008 R2 Standard (64-bit): minimum of one computer with two or more 64-bit processors (AMD Opteron or Intel EM64T, 64-bit kernel support only).
- Microsoft Windows Server 2008 R2 Enterprise with SP1 (64-bit): requires IBM Content Analytics 2.2 Fix Pack 2. Minimum of one computer with two or more 64-bit processors (AMD Opteron or Intel EM64T, 64-bit kernel support only).
- Microsoft Windows Server 2008 R2 Standard with SP1 (64-bit): requires IBM Content Analytics 2.2 Fix Pack 2. Minimum of one computer with two or more 64-bit processors (AMD Opteron or Intel EM64T, 64-bit kernel support only).

Disk space
The minimum disk space required to install the IBM Content Analytics server software on AIX, Linux, or Windows is 2000 MB for the Content Analytics software, plus 1800 MB of temporary disk space. This is the minimum required to install the software; more space is needed to run a content analytics system. The disk space requirements for running a content analytics system vary and depend on the number of documents to be crawled, the types of data sources to be crawled, and how documents are parsed, indexed, and queried. Disk space requirements are also influenced by how regularly you build and update indexes. For a multiple server configuration, the space requirements affect the index server. The ES_NODE_ROOT directory requires the most disk space on your system.

Memory
The RAM specifications in the following list reflect basic server requirements for an IBM Content Analytics system on AIX, Linux, and Windows.

- 4 GB RAM, plus 4 GB paging space: suitable for a small, entry-level solution (one document processor and up to 100 MB of data).
- 8 GB RAM, plus 8 GB paging space: suitable for a medium solution (two document processors and up to 1 GB of data).
- 16 GB RAM, plus 16 GB paging space: suitable for a large solution (four document processors and up to 10 GB of data).

Processor
The CPU specifications in the following list reflect the minimum requirements for small, medium, and large installations on AIX, Linux, and Windows. Actual system requirements depend on the number of active collections and your performance requirements.

- 1 or 2 processors per server: suitable for a small, entry-level solution (one document processor and up to 100 MB of data).
- 2 or 4 processors per server: suitable for a medium solution (two document processors and up to 1 GB of data).
- At least 4 processors per server: suitable for a large solution (four document processors and up to 10 GB of data).

Web application server


IBM Content Analytics requires one of the following web application servers to run the search and text mining applications. Even if you run the search and text mining applications on WebSphere Application Server, the embedded Jetty web application server is required for certain functions. For example, the Jetty server is used for administration, monitoring, and providing search functions.

Supported web application servers (AIX, Linux, Windows):

- Jetty 6.1.22 (32-bit or 64-bit): automatically installed by the IBM Content Analytics installation program.
- WebSphere Application Server 6.1, base edition (32-bit or 64-bit): also requires IBM HTTP Server 6.1 and WebSphere Application Server 6.1 plug-ins. The 32-bit level is not supported on Linux on System z.
- WebSphere Application Server 7.0.0.3 or later, base edition (32-bit or 64-bit): included in the Content Analytics distribution package. You must also install Fix Pack 3 or later. Also requires IBM HTTP Server 7.0 and WebSphere Application Server 7.0 plug-ins. The 32-bit level is not supported on Linux on System z.
- WebSphere Application Server Network Deployment 7.0.0.3 or later (32-bit or 64-bit): also requires IBM HTTP Server 7.0 and WebSphere Application Server 7.0 plug-ins. The 32-bit level is not supported on Linux on System z. Clustering is not supported.

LDAP server
When you use Jetty as the application server, IBM Content Analytics can be configured to authenticate users that are managed by one of the following Lightweight Directory Access Protocol (LDAP) servers.

Supported LDAP servers (AIX, Linux, Solaris, Windows):
- Lotus Domino Enterprise Server 7.0.0, 7.0.1, 7.0.2, 7.0.3, and 7.0.4
- Lotus Domino Enterprise Server 8.0.0, 8.0.1, 8.0.2, 8.5.0, and 8.5.1
- Tivoli Directory Server 6.0, 6.1, 6.2, and 6.3
- Windows Active Directory 2003
- Windows Active Directory 2008

Browser (product components)
The following list specifies the browser software that you can use with the IBM Content Analytics administration console and sample applications. You must enable JavaScript in your browser, and you must also install Adobe Flash Player 10.x. The IBM Content Analytics clients are best viewed with a screen resolution of 1024 x 768 or higher. For the best performance, use Microsoft Internet Explorer 8.x or Mozilla Firefox 3.5.x.

Supported browsers (AIX, Linux, Windows):
- Microsoft Internet Explorer 7
- Microsoft Internet Explorer 8
- Mozilla Firefox 3.5 (not supported on Linux on System z)
- Mozilla Firefox 3.6 (not supported on Linux on System z)

Java environment IBM Content Analytics includes the 32-bit and 64-bit versions of Java Virtual Machine (JVM) 1.6. The following table specifies the requirements for compiling Java applications. The Java SDK is not required to install a content analytics system. Before you can build Java applications, you must also install and configure Apache ANT, a Java-based build tool.

Supported Java environments (AIX, Linux, Windows):
- IBM Software Development Kit for Java 5.0 (32-bit or 64-bit)
- IBM Software Development Kit for Java 6.0 (32-bit or 64-bit)
- Apache ANT 1.7 (see http://ant.apache.org/)

LanguageWare Resource Workbench


IBM LanguageWare is a technology that provides a full range of text analysis functions. You can use IBM LanguageWare Resource Workbench with IBM Content Analytics to implement custom linguistic processing, such as creating custom annotators. For information about adding support for additional languages, contact IBM Software Support.

LanguageWare Resource Workbench 7.2 is supported on AIX, Linux, and Windows. For more information about LanguageWare, see Text Analytics Tools and Runtime for IBM LanguageWare.

Change history
Release 2.2.0 (October 2010): New release.

Security in IBM Content Analytics


Security mechanisms enable you to protect sources from unauthorized searching and restrict administrative functions to specific users. With IBM Content Analytics, users can search and explore a wide range of data sources. To ensure that only users who are authorized to access content do so, and to ensure that only authorized users are able to access the administration console, the system coordinates and enforces security at several levels.

Web server
The first level of security is the Web server. You can assign users to administrative roles and authenticate users who administer the system. When a user logs in to the administration console, only the functions and collections that the user is authorized to administer are available to that user. Search and text miner applications can also use Web server security mechanisms to authenticate users who search collections.

Collection-level security
When you create a collection, you can enable security at the collection level. You cannot change this setting after the collection is created. If you do not enable collection-level security, you cannot later specify document-level security controls. When collection-level security is enabled:
- The global analysis processes apply different rules for indexing duplicate documents.
- You can configure options to enforce document-level security, such as associating security tokens with documents as they are crawled, requiring current credentials to be validated during query processing, and specifying whether anchor text in Web documents is to be indexed.
- You can enforce security by mapping search and text miner applications (not individual users) to the collections that they can search. You then use standard access control mechanisms to permit or deny users access to applications.

There is a trade-off between enabling collection security and search quality: enabling collection security reduces the information that is indexed for each document, so fewer results will be found for some queries.

Document-level security
When you configure crawlers for a collection, you can enable document-level security. For example, you can specify options to associate security tokens with data as the data is collected by crawlers. Your applications can use these tokens, which are stored with documents in the index, to enforce access controls and ensure that only users with the proper credentials are able to query the data and view search results. For certain types of data sources, you can configure options to validate a user's login credentials with current access controls during query processing. This extra layer of security ensures that a user's privileges are validated in real time with the data source. This capability can protect against instances in which a user's credentials change after a document and its security tokens are indexed. The anchor text processing phase of global analysis normally associates text that appears in one document (the source document) with another document (the target document) in which that text does not necessarily appear. When you configure a Web crawler, you can specify whether you want to exclude the anchor text from the index if the text links to a document that the Web crawler is not allowed to crawl.

Because of differences in facet counting that occur when document-level security is enabled, document-level security is not supported for text analytics collections or for search collections that are configured to use facets.

Security for your collections extends beyond the authentication and access control mechanisms that the system can use to protect indexed content. Safeguards also exist to prevent a malicious and unauthorized user from gaining access to data while it is in transit. For example, the search servers use protocols such as the Secure Sockets Layer (SSL), the Secure Shell (SSH), and the Secure Hypertext Transfer Protocol (HTTPS) to communicate with the index server and the search and text miner applications. Additional security is provided through encryption. For example, the password for the system administrator, which is specified when the product is installed, is stored in an encrypted format. Passwords that users specify in user profiles are also stored in an encrypted format.

For increased security, ensure that the server hardware is appropriately isolated and secure from unauthorized intrusion. By installing a firewall, you can protect the servers from intrusion through another part of your network. Also ensure that there are no open ports on the servers. Configure the system so that it listens for requests only on ports that are explicitly assigned to IBM Content Analytics.
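The effect of document-level security tokens on query results can be pictured with a small filter. The token format and result structure here are invented for illustration; the product stores and enforces tokens internally:

```python
def filter_by_security_tokens(results, user_tokens):
    """Keep only documents whose stored security tokens intersect the
    tokens granted to the querying user (hypothetical token format)."""
    granted = set(user_tokens)
    return [doc for doc in results if granted & set(doc["tokens"])]

results = [
    {"id": "doc1", "tokens": ["grp:finance"]},
    {"id": "doc2", "tokens": ["grp:hr", "grp:all"]},
]
visible = filter_by_security_tokens(results, ["grp:all"])
# Only doc2 survives: the user holds none of doc1's tokens.
```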

Backing up and restoring the system


Backup and restore scripts enable you to back up and restore the IBM Content Analytics system.

What the scripts back up


The scripts back up and restore the following files:
- Configuration files from the ES_NODE_ROOT/master_config directory
- System databases, such as the databases for crawled data and identity management
- All files in the ES_NODE_ROOT/data directory
- Index files for collections that are configured with non-default data directories

Restriction: Do not use operating system commands to copy files between the production-level IBM Content Analytics server and a backup server. The system will not work as expected because host name and IP address information, which IBM Content Analytics requires, is not provided through file copy mechanisms. In addition, unless the production IBM Content Analytics system is completely stopped before you copy files, the internal database might be copied in an inconsistent state, which can prevent IBM Content Analytics from starting on the target server.

Backup directory structure


The backup script creates the following subdirectories under a directory that you specify when you run the script. The administrator ID must have permission to write to the directory that you specify.

master_config
Contains the configuration files from the ES_NODE_ROOT/master_config directory.

database
Contains the database files from the crawler server.

data
Contains the index files from the index server.

Usage guidelines

You can back up data from one computer and restore it to another computer. However:
- You cannot restore files that were backed up from a different version of IBM Content Analytics.
- You must restore data to a system that has the same number of servers. For example, if you back up a system that runs on two servers, you must restore data to a system that uses two servers.

- You cannot restore files that were backed up from one operating system to a system that is running a different operating system. For example, if you installed the product on AIX and now want to run it on Linux, you must install a new system on your Linux servers.
- All settings for the installation directory (ES_INSTALL_ROOT), the data directory (ES_NODE_ROOT), and the administrator ID and password must be the same between the backed up system and the system to which data is restored.
- For a multiple server configuration, back up and restore the system from the index server. Because all of the crawler data resides in databases on the crawler server, the scripts run remote commands to back up and restore the crawler data.
- You must have enough disk space available to back up the system files to another directory. The backup and restore scripts do not check the files.
- All system sessions are stopped while the backup and restore scripts are running. To avoid seeing incorrect or inconsistent system information, do not use the administration console while the scripts are running.
- If the system fails because of an irrecoverable error, you must reinstall IBM Content Analytics.
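The restore preconditions above (same product version, operating system, server count, directories, and administrator credentials) amount to an equality check over a handful of settings. A minimal sketch, with made-up metadata fields:

```python
def can_restore(backup_meta, target_meta):
    """Check the restore preconditions: every listed setting must match
    between the backed up system and the restore target."""
    keys = ("version", "os", "server_count",
            "ES_INSTALL_ROOT", "ES_NODE_ROOT", "admin_id")
    return all(backup_meta.get(k) == target_meta.get(k) for k in keys)

backup = {"version": "2.2", "os": "AIX", "server_count": 2,
          "ES_INSTALL_ROOT": "/opt/IBM/es",
          "ES_NODE_ROOT": "/home/esadmin/node", "admin_id": "esadmin"}
target = dict(backup, os="Linux")  # same settings except the operating system
# Restoring to an identical system is allowed; restoring across
# operating systems is not.
```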

Analyzing content
Use the text miner application to analyze your content and discover trends, correlations, and deviations in data over time. The text miner application provides real-time statistical analysis of structured and unstructured data in IBM Content Analytics collections. By using the IBM Content Analytics text miner application, you can explore structured and unstructured content and gain insights that can help drive business decisions. For example, by exploring content that was analyzed through content analytics, you can identify patterns, observe trends in data over time, and take action when unusual correlations or anomalies are surfaced. The overall content analysis workflow consists of iterating through multiple cycles of the following steps:
1. Set the objectives for your analysis.
2. Gather the appropriate data.
3. Analyze and explore the data by using the text miner application.
4. Take actions based on the insights that you discovered.
5. Validate the effect.
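One simple way to surface the deviations mentioned above is to compare a facet value's observed frequency against its historical baseline. The ratio below is a generic illustration, not the product's actual statistical measure:

```python
def deviation_ratio(observed, expected):
    """Ratio of observed to expected frequency; values well above 1.0
    flag a facet value that is occurring more often than its baseline."""
    if expected == 0:
        return float("inf") if observed else 0.0
    return observed / expected

# Hypothetical monthly counts of a "battery failure" facet value:
# the baseline is about 10 mentions per month, but this month has 30.
ratio = deviation_ratio(30, 10)
# A ratio of 3.0 would flag this facet value for closer inspection.
```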

IBM Software Support


IBM Software Support provides assistance with product defects. Before you contact IBM Software Support, your company must have an active IBM software subscription and support contract, and you must be authorized to submit problems to IBM. The type of software subscription and support contract that you need depends on the type of product you have:

IBM distributed software products (including, but not limited to, Tivoli, Lotus, and Rational products, as well as DB2 and WebSphere products that run on Windows or UNIX operating systems)
You can enroll in Passport Advantage in one of the following ways:
- Online: Go to the Passport Advantage Web page (www.ibm.com/software/lotus/passportadvantage/) and click How to Enroll.
- By phone: For the phone number to call in your country, go to the contacts page of the IBM Software Support Handbook on the Web and click the name of your geographic region.

IBM eServer software products (including, but not limited to, DB2 and WebSphere products that run in zSeries, pSeries, and iSeries environments)
You can purchase a software subscription and support agreement by working directly with an IBM sales representative or an IBM Business Partner. For more information about support for eServer software products, go to the IBM Technical Support Advantage Web page (www.ibm.com/systems/support/).

If you are not sure what type of software subscription and support contract you need, call 1-800-IBMSERV (1-800-426-7378) in the United States or, from other countries, go to the contacts page of the IBM Software Support Handbook on the Web and click the name of your geographic region for phone numbers of people who provide support for your location.

Follow these steps to contact IBM Software Support:
1. Subscribe to support updates (RSS feeds). With IBM Software Support RSS feeds, you can stay up to date with the latest content for specific IBM software products. Feeds are updated daily.
2. Determine the business impact of your problem. When you report a problem to IBM, you will be asked to supply a severity level. Therefore, you need to understand and assess the business impact of the problem you are reporting.
3. Describe your problem and gather background information. When explaining a problem to IBM, be as specific as possible. Include all relevant background information so that IBM Software Support specialists can help you solve the problem efficiently.
4. Submit your problem to IBM Software Support. You can submit your problem in several ways.
