Component overview
The content analytics components collect data from throughout your enterprise; analyze, parse, and extract meaning from the information; and create a text index and a correlation index that users can explore and mine. A content analytics collection represents the set of sources that users can explore and mine with a single query. When you create a collection, you specify which sources you want to include and configure options for how users can search and mine the indexed data. You can create multiple collections, and each collection can contain data from a variety of data sources. For example, you might create a collection that includes documents from databases as well as pages from websites whose URLs you specify. When users explore and mine a collection, the results can include documents and links from each of the data sources.
Crawlers
You can start and stop crawlers manually, or you can set up crawling schedules. If you schedule a crawler, you specify when it is initially run and how often it must revisit the data sources to crawl new and changed documents. You can configure the Web crawler to run continuously. You specify which uniform resource locators (URLs) you want to crawl, and the crawler returns periodically to check for data that is new or changed.
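The scheduling rule described above (an initial run time plus a fixed revisit interval) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the product's scheduler: the function name `next_crawl` and its parameters are hypothetical and do not correspond to any IBM Content Analytics API.

```python
from datetime import datetime, timedelta


def next_crawl(first_run, interval, now):
    """Return the first scheduled crawl time that is strictly after `now`.

    Hypothetical helper: `first_run` is when the crawler initially runs and
    `interval` is how often it revisits the data sources.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if now < first_run:
        return first_run
    # Number of whole intervals that have already elapsed since the first run.
    completed = (now - first_run) // interval
    return first_run + (completed + 1) * interval


first = datetime(2010, 10, 1, 2, 0)      # initial run at 02:00
every = timedelta(hours=6)               # revisit every six hours
upcoming = next_crawl(first, every, datetime(2010, 10, 1, 9, 30))
```

With an initial run at 02:00 and a six-hour interval, a check at 09:30 falls between the 08:00 and 14:00 runs, so `upcoming` is 14:00 on the same day.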
Indexing
After documents are parsed and analyzed, the indexing processes incrementally add information about new, changed, and deleted documents to the index to ensure that users always have access to the most current information.
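To make the idea of incremental indexing concrete, the following minimal sketch applies new, changed, and deleted documents to a toy inverted index without rebuilding it. The class `IncrementalIndex` and its methods are invented for illustration; the actual index structures used by the product are not described in this document.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that accepts incremental document changes."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.documents = {}               # doc id -> terms currently indexed

    def upsert(self, doc_id, text):
        """Add a new document or replace a changed one."""
        self.delete(doc_id)  # drop stale postings first (no-op for new documents)
        terms = set(text.lower().split())
        self.documents[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        """Remove a document and every posting that points to it."""
        for term in self.documents.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, term):
        """Return the sorted ids of documents that contain the term."""
        return sorted(self.postings.get(term.lower(), set()))


index = IncrementalIndex()
index.upsert("a", "quarterly sales report")
index.upsert("b", "sales forecast")
index.upsert("a", "quarterly audit")  # document "a" changed
```

After the last call, a search for "sales" returns only document "b", because the changed version of "a" no longer contains that term.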
Search Servers
The search servers work with your search applications to process queries, search the index, and return results. When you configure the search servers for a collection, you can specify options for how the collection is to be searched or mined:
You can configure a search cache to hold the results of frequently submitted requests. A search cache can improve search and retrieval performance.
If your application developers create custom dictionaries, you can associate the dictionaries with collections:
o When users query a collection that uses a synonym dictionary, documents that contain synonyms of the query terms are included in the results.
o When users query a collection that uses a stop word dictionary, the stop words are removed from the query before the query is processed.
o When users query a collection that uses a boost word dictionary, the importance of documents that contain the boost words is increased or decreased, depending on the boost factor that is associated with the word in the dictionary.
If you predetermine that certain documents are relevant to certain queries, you can configure quick links. A quick link associates a specific URI with specific keywords and phrases. If a query contains any of the keywords or phrases that are specified in a quick link definition, the associated URI is returned automatically in the results.
You can configure type ahead support. When typing a query, users can see suggestions for potential matches of the query terms and select one of the suggested matches to run the query. Suggested matches are also provided for facet values. Users can select the facet values that they want to add to a search by selecting one or more of the suggestions.
In a multiple server configuration, failover capability is available at the collection level, not just at the server level. If a collection on one search server becomes unavailable for any reason, then the queries for that collection are routed automatically to another search server.
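The effect of the dictionaries and quick links on a query can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the product's query engine: the function `process_query`, its parameters, the scoring rule (one point per matched term, multiplied by each applicable boost factor), and the example URI are all hypothetical.

```python
def process_query(query, documents, synonyms=None, stop_words=None,
                  boosts=None, quick_links=None):
    """Return result ids for a query, quick link URIs first (illustrative only)."""
    synonyms = synonyms or {}        # term -> list of synonyms
    stop_words = stop_words or set()
    boosts = boosts or {}            # boost word -> factor (>1 raises, <1 lowers)
    quick_links = quick_links or {}  # keyword or phrase -> URI

    words = query.lower().split()
    padded = " " + " ".join(words) + " "

    # Quick links: a URI is returned when its keyword or phrase is in the query.
    results = [uri for key, uri in quick_links.items()
               if " " + key.lower() + " " in padded]

    # Stop word dictionary: remove stop words before the query is processed.
    terms = [w for w in words if w not in stop_words]

    # Synonym dictionary: documents containing synonyms also match.
    expanded = set(terms)
    for term in terms:
        expanded.update(synonyms.get(term, ()))

    scored = []
    for doc_id, text in documents.items():
        doc_words = set(text.lower().split())
        matched = expanded & doc_words
        if not matched:
            continue
        score = float(len(matched))
        # Boost word dictionary: adjust the importance of matching documents.
        for word, factor in boosts.items():
            if word in doc_words:
                score *= factor
        scored.append((score, doc_id))

    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    results.extend(doc_id for _, doc_id in scored)
    return results


docs = {
    "d1": "the car dealership opened",
    "d2": "automobile recall notice official",
    "d3": "weather report",
}
hits = process_query(
    "the car", docs,
    synonyms={"car": ["automobile"]},
    stop_words={"the"},
    boosts={"official": 2.0},
    quick_links={"car": "http://example.com/cars"},
)
```

In this example the quick link URI comes first, then "d2" (matched through the synonym and raised by the boost word), then "d1". Document "d3" matches nothing and is omitted.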
Administration Console
The administration console runs in a browser, which means administrative users can access it from any location at any time. Security mechanisms ensure that only those users who are authorized to access administrative functions do so. The administration console includes wizards that can help you do several of the primary administrative tasks. For example, the crawler wizards are specific to a data source type and help you select the sources that you want to enable users to search.
Log Files
Log files are created for individual collections and for system-level sessions. When you configure logging options for a collection or for the system, you specify the types of messages that you want to log, such as error messages and warning messages. You also specify how often you want the system to rotate older log files to make room for recent messages. You can choose options to receive e-mail about specific messages (including alerts), or all error messages, whenever they occur.
You can use the search and text miner applications as a template for developing custom applications.
The supported system configurations depend on whether you install the product on a single server or multiple servers.
Search
You must add at least one server to support search. To provide increased scalability and high availability for query processing, including processing requests for deep inspection of text analysis results, you can add multiple search servers.
Document processing
To provide increased scalability and failover support for parsing and analyzing documents, you can add multiple document processing servers.
Backup servers for high availability
To provide increased throughput and failover support, you can add one high availability server for crawling documents and one high availability server for indexing documents on AIX or Windows platforms. This server is supported through IBM PowerHA for AIX and Microsoft Cluster Service (MSCS) on Windows. You must ensure that the backup server that you add and the master server share the same data directory (ES_NODE_ROOT).
The following figure shows a sample distributed server configuration.
Figure 2. Sample distributed server configuration
Operating system
The following table specifies the operating system (Windows Server in this case) for any computer on which you install the server components for IBM Content Analytics. In a multiple server installation, all servers must run the same operating system.
Table columns: Operating system (required); hardware platform (x86, x86-64, POWER, zSeries); Notes
Microsoft Windows Server 2003 Enterprise with SP2 (32-bit)
Minimum: One computer with two or more Intel 32-bit or AMD 32-bit processors.
Microsoft Windows Server 2008 R2 Standard with SP1 (64-bit)
Requires IBM Content Analytics 2.2 Fix Pack 2. Minimum: One computer with two or more 64-bit processors (AMD Opteron, 64-bit kernel support only, or Intel EM64T, 64-bit kernel support only).
Note for all 64-bit platforms in the table: Minimum: One computer with two or more 64-bit processors (AMD Opteron, 64-bit kernel support only, or Intel EM64T, 64-bit kernel support only).
Disk space
The following table specifies the minimum disk space required to install the IBM Content Analytics server software. The disk space requirements for running a content analytics system can vary and depend on the number of documents to be crawled, the types of data sources to be crawled, and how documents are parsed, indexed, and queried. Disk space requirements are also influenced by how regularly you build and update indexes. For a multiple server configuration, the space requirements affect the index server. The ES_NODE_ROOT directory requires the most disk space on your system.
Table columns: Disk space (minimum); operating system (AIX, Linux, Win); Notes
Disk space (minimum): 2000 MB for the Content Analytics software, plus 1800 MB of temporary disk space
Notes: This is the minimum required to install the software. More space is needed to run a content analytics system.
Memory
The RAM specifications in the following table reflect basic server requirements for an IBM Content Analytics system.
Table columns: operating system (AIX, Linux, Win); Notes
Suitable for a small, entry-level solution (one document processor and up to 100 MB of data).
Suitable for a medium solution (two document processors and up to 1 GB of data).
Suitable for a large solution (four document processors and up to 10 GB of data).
Processor
The CPU specifications in the following table reflect the minimum requirements for small, medium, and large installations. Actual system requirements depend on the number of active collections and your performance requirements.
Table columns: Processor; operating system (AIX, Linux, Win); Notes
Processor: 1 or 2 processors per server
Notes:
Suitable for a small, entry-level solution (one document processor and up to 100 MB of data).
Suitable for a medium solution (two document processors and up to 1 GB of data).
Suitable for a large solution (four document processors and up to 10 GB of data).
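The small, medium, and large notes above can be read as a simple lookup from data volume to solution size. The sketch below encodes only the figures stated in those notes; the function name `recommend_tier` is hypothetical, and treating 1 GB as 1024 MB is an assumption.

```python
# (tier name, maximum data volume in MB, document processors), from the notes above.
# Assumption: 1 GB is treated as 1024 MB.
TIERS = [
    ("small", 100, 1),
    ("medium", 1024, 2),
    ("large", 10 * 1024, 4),
]


def recommend_tier(data_mb):
    """Return (tier, document_processors) for a data volume, or None if it exceeds the table."""
    if data_mb < 0:
        raise ValueError("data volume cannot be negative")
    for name, max_mb, processors in TIERS:
        if data_mb <= max_mb:
            return name, processors
    return None  # larger than the largest documented tier


medium_case = recommend_tier(500)
```

A return value of `None` signals that the data volume is larger than anything these notes cover, so the sizing would have to be determined some other way.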
Web application server
Table columns: Web application server (required); operating system (AIX, Linux, Win); Notes
Jetty 6.1.22 (32-bit or 64-bit)
Automatically installed by the IBM Content Analytics installation program.
Also requires IBM HTTP Server 6.1 and WebSphere Application Server 6.1 plug-ins. Not supported on Linux on System z.
WebSphere Application Server 6.1, base edition (64-bit)
Also requires IBM HTTP Server 6.1 and WebSphere Application Server 6.1 plug-ins.
Included in the Content Analytics distribution package. You must also install Fix Pack 3 or later. Also requires IBM HTTP Server 7.0 and WebSphere Application Server 7.0 plug-ins. The 32-bit level is not supported on Linux on System z.
WebSphere Application Server Network Deployment 7.0.0.3 or later (32-bit or 64-bit)
Also requires IBM HTTP Server 7.0 and WebSphere Application Server 7.0 plug-ins. The 32-bit level is not supported on Linux on System z. Clustering is not supported.
LDAP server
When you use Jetty as the application server, IBM Content Analytics can be configured to authenticate users that are managed by one of the following Lightweight Directory Access Protocol (LDAP) servers.
Table columns: LDAP server; operating system (AIX)
Lotus Domino Enterprise Server 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4
Domino Enterprise Server 8.0.0, 8.0.1, 8.0.2, 8.5.0, 8.5.1
Tivoli Directory Server 6.0, 6.1, 6.2, 6.3
Windows Active Directory 2003
Windows Active Directory 2008
Browser (Product Components)
The following table specifies the browser software that you can use with the IBM Content Analytics administration console and sample applications. Note that you must enable JavaScript in your browser, and you must also install Adobe Flash Player 10.x. The IBM Content Analytics clients are best viewed with a screen resolution of 1024 x 768 or higher. For the best performance, use Microsoft Internet Explorer 8.x or Mozilla Firefox 3.5.x.
Table columns: Browser; operating system (AIX, Linux, Win); Notes
Microsoft Internet Explorer 7
Microsoft Internet Explorer 8
Mozilla Firefox 3.5
Notes: Not supported on Linux on System z.
Java environment
IBM Content Analytics includes the 32-bit and 64-bit versions of Java Virtual Machine (JVM) 1.6. The following table specifies the requirements for compiling Java applications. The Java SDK is not required to install a content analytics system. Before you can build Java applications, you must also install and configure Apache ANT, a Java-based build tool.
Table columns: Java; operating system (AIX, Linux, Win); Notes
IBM Software Development Kit for Java 5.0 (32-bit or 64-bit)
IBM Software Development Kit for Java 6.0 (32-bit or 64-bit)
Apache ANT 1.7 (see http://ant.apache.org/)
LanguageWare Resource Workbench
Table columns: LanguageWare Resource Workbench; operating system (AIX, Linux, Win); Notes
LanguageWare Resource Workbench 7.2
Notes: For more information about LanguageWare, see Text Analytics Tools and Runtime for IBM LanguageWare.
Change history
Release: 2.2.0; Date: October 2010; Change: New
Because of differences in facet counting that occur when document-level security is enabled, document-level security is not supported for text analytics collections or for search collections that are configured to use facets.
Security for your collections extends beyond the authentication and access control mechanisms that the system can use to protect indexed content. Safeguards also exist to prevent a malicious and unauthorized user from gaining access to data while it is in transit. For example, the search servers use protocols such as the Secure Sockets Layer (SSL), the Secure Shell (SSH), and the Secure Hypertext Transfer Protocol (HTTPS) to communicate with the index server and the search and text miner applications.
Additional security is provided through encryption. For example, the password for the system administrator, which is specified when the product is installed, is stored in an encrypted format. Passwords that users specify in user profiles are also stored in an encrypted format.
For increased security, you need to ensure that the server hardware is appropriately isolated and secure from unauthorized intrusion. By installing a firewall, you can protect the servers from intrusion through another part of your network. Also ensure that there are no open ports on the servers. Configure the system so that it listens for requests only on ports that are explicitly assigned to IBM Content Analytics.
Usage guidelines
You can back up data from one computer and restore it to another computer. However:
o You cannot restore files that were backed up from a different version of IBM Content Analytics.
o You must restore data to a system that has the same number of servers. For example, if you back up a system that runs on two servers, you must restore data to a system that uses two servers.
You cannot restore files that were backed up from one operating system to a system that is running a different operating system. For example, if you installed the product on AIX and now want to run it on Linux, you must install a new system on your Linux servers.
All settings for the installation directory (ES_INSTALL_ROOT), the data directory (ES_NODE_ROOT), and the administrator ID and password must be the same between the backed up system and the system to which data is restored.
For a multiple server configuration, back up and restore the system from the index server. Because all of the crawler data resides in databases on the crawler server, the scripts run remote commands to back up and restore the crawler data.
You must have enough disk space available to back up the system files to another directory. The backup and restore scripts do not check the files.
All system sessions are stopped while the backup and restore scripts are running. To avoid seeing incorrect or inconsistent system information, do not use the administration console while the scripts are running.
If the system fails because of an irrecoverable error, you must re-install IBM Content Analytics.
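Several of these guidelines are equality checks between the backed up system and the restore target, so they can be verified before a restore is attempted. The following sketch is a hypothetical pre-flight check, not part of the product's backup and restore scripts: the `SystemProfile` fields and the `restore_blockers` function are invented for illustration, and the administrator password is left out because it cannot be compared in plain form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """Hypothetical description of a system, limited to what the guidelines compare."""
    version: str
    server_count: int
    operating_system: str
    install_root: str  # value of ES_INSTALL_ROOT
    node_root: str     # value of ES_NODE_ROOT
    admin_id: str


# Attribute name -> label used in the report.
CHECKS = [
    ("version", "product version"),
    ("server_count", "number of servers"),
    ("operating_system", "operating system"),
    ("install_root", "ES_INSTALL_ROOT"),
    ("node_root", "ES_NODE_ROOT"),
    ("admin_id", "administrator ID"),
]


def restore_blockers(backup, target):
    """Return a list of mismatches that would prevent restoring `backup` onto `target`."""
    return [
        f"{label} differs: backup={getattr(backup, field)!r}, "
        f"target={getattr(target, field)!r}"
        for field, label in CHECKS
        if getattr(backup, field) != getattr(target, field)
    ]


source = SystemProfile("2.2.0", 2, "AIX", "/opt/IBM/es", "/home/esadmin/esdata", "esadmin")
same = SystemProfile("2.2.0", 2, "AIX", "/opt/IBM/es", "/home/esadmin/esdata", "esadmin")
linux = SystemProfile("2.2.0", 2, "Linux", "/opt/IBM/es", "/home/esadmin/esdata", "esadmin")
problems = restore_blockers(source, linux)
```

The directory paths and administrator ID in the example are placeholders. An empty list means that none of the compared settings rules out the restore; it does not check available disk space or the integrity of the backup files.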
Analyzing content
Use the text miner application to analyze your content and discover trends, correlations, and deviations in data over time. The text miner application provides real-time statistical analysis of structured and unstructured data in IBM Content Analytics collections. By using the IBM Content Analytics text miner application, you can explore structured and unstructured content and gain insights that can help drive business decisions. For example, by exploring content that was analyzed through content analytics, you can identify patterns, observe trends in data over time, and take action when unusual correlations or anomalies are surfaced.
The overall content analysis workflow consists of iterating through multiple cycles of the following steps:
1. Set the objectives for your analysis.
2. Gather the appropriate data.
3. Analyze and explore the data by using the text miner application.
4. Take actions based on the insights that you discovered.
5. Validate the effect.