29 July 2014
Benjamin G. Leonhardi
Software Engineer
IBM
Piotr Pruski
Partner Enablement Engineer
IBM
Businesses often need to analyze large numbers of documents of various file types. Apache
Tika is a free open source library that extracts text contents from a variety of document formats,
such as Microsoft Word, RTF, and PDF. Learn how to run Tika in a MapReduce job within
InfoSphere BigInsights to analyze a large set of binary documents in parallel. Explore how
to optimize MapReduce for the analysis of a large number of smaller files. Learn to create a
Jaql module that makes MapReduce technology available to non-Java programmers to run
scalable MapReduce jobs to process, analyze, and convert data within Hadoop.
This article describes how to analyze large numbers of documents of various types with IBM
InfoSphere BigInsights. For industries that receive data in different formats (for example, legal
documents, emails, and scientific articles), InfoSphere BigInsights can provide sophisticated text
analytical capabilities that can aid in sentiment prediction, fraud detection, and other advanced
data analysis.
Learn how to integrate Apache Tika, an open source library that can extract the text contents of
documents, with InfoSphere BigInsights, which is built on the Hadoop platform and can scale to
thousands of nodes to analyze billions of documents. Typically, Hadoop works on large files, so
this article explains how to efficiently run jobs on a large number of small documents. Use the
steps here to create a module in Jaql that creates the integration. Jaql is a flexible language for
working with data in Hadoop. Essentially, Jaql is a layer on top of MapReduce that enables easy
analysis and manipulation of data in Hadoop. Combining a Jaql module with Tika makes it easy to
read various documents and use the analytical capabilities of InfoSphere BigInsights, such as text
analytics and data mining, in a single step, without requiring deep programming expertise.
Copyright IBM Corporation 2014
Processing and content analysis of various document types using
MapReduce and InfoSphere BigInsights
This article assumes a basic understanding of the Java programming language, Hadoop,
MapReduce, and Jaql. Details about these technologies are outside the scope of the article,
which focuses instead on sections of code that must be updated to accommodate custom code.
Download the sample data used in this article.
Apache Tika
The Apache Tika toolkit is a free open source project used to read and extract text and other
metadata from various types of digital documents, such as Word documents, PDF files, or files in
rich text format. To see a basic example of how the API works, create an instance of the Tika class
and open a stream by using the instance.
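The basic flow might look like the following sketch (the input file name is illustrative; it assumes the Tika application JAR is on the classpath):

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.Tika;

public class TikaBasicExample
{
    public static void main(String[] args) throws Exception
    {
        Tika tika = new Tika(); //facade class that auto-detects the document format
        try (InputStream stream = new FileInputStream("review1.doc"))
        {
            //parseToString extracts the plain text content of the document
            String text = tika.parseToString(stream);
            System.out.println(text);
        }
    }
}
```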
If your document format is not supported by Tika (Outlook PST files are not supported, for
example), you can substitute a different Java library for the extraction step. Tika can also extract
metadata, but that is outside the scope of this article; it is relatively simple to add that function
to the code.
Jaql
Jaql is primarily a query language for JSON, but it supports more than just JSON. It enables you
to process structured and non-traditional data. Using Jaql, you can select, join, group, and filter
data stored in HDFS in a manner similar to a blend of Pig and Hive. The Jaql query language was
inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig. Jaql is
a functional, declarative query language designed to process large data sets. For parallelism, Jaql
rewrites high-level queries, when appropriate, into low-level queries consisting of Java MapReduce
jobs. This article demonstrates how to create a Jaql I/O adapter over Apache Tika to read various
document formats, and to analyze and transform them all within this one language.
As an alternative to the traditional classes, process small files in Hadoop by creating a set of
custom classes that tell the job the files are small enough to be treated differently from the
traditional one-split-per-block approach.
At the mapping stage, logical containers called splits are defined, and a map processing task takes
place at each split. Use custom classes to define a fixed-sized split, which is filled with as many
small files as it can accommodate. When the split is full, the job creates a new split and fills that
one as well, until it's full. Then each split is assigned to one mapper.
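The packing just described is, in effect, a greedy fill of fixed-size bins. The following standalone sketch (class and method names are illustrative, not part of the sample code) mimics how files are assigned to splits:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPacker
{
    //greedily pack file sizes into splits of at most maxSplitBytes,
    //mirroring how a combining InputFormat fills each split with small files
    public static List<List<Long>> pack(long[] fileSizes, long maxSplitBytes)
    {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long used = 0;
        for (long size : fileSizes)
        {
            if (used + size > maxSplitBytes && !current.isEmpty())
            {
                splits.add(current); //split is full: start a new one
                current = new ArrayList<>();
                used = 0;
            }
            current.add(size);
            used += size;
        }
        if (!current.isEmpty())
        {
            splits.add(current);
        }
        return splits; //one mapper is assigned per split
    }
}
```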
From a Java programming perspective, the class that holds the responsibility of this conversion
is called an InputFormat, which is the main entry point into reading data from HDFS. From the
blocks of the files, it creates a list of InputSplits. For each split, one mapper is created. Then
each InputSplit is divided into records by using the RecordReader class. Each record represents a
key-value pair.
FileInputFormat is the base class for file-based InputFormats; it defines the splits to
be produced from these files. How the splits are converted into key-value pairs is defined in its
subclasses. Some examples of its subclasses are TextInputFormat, KeyValueTextInputFormat, and
CombineFileInputFormat.
Hadoop works more efficiently with large files (files that occupy more than 1 block).
FileInputFormat converts each large file into splits, and each split is created in a way that
contains part of a single file. As mentioned, one mapper is generated for each split. Figure 1
depicts how a file is treated using FileInputFormat and RecordReader in the mapping stage.
However, when the input files are smaller than the default block size, many splits (and therefore,
many mappers) are created. This arrangement makes the job inefficient. Figure 2 shows how too
many mappers are created when FileInputFormat is used for many small files.
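A quick back-of-the-envelope comparison (the numbers are illustrative, not from the article) shows the scale of the problem: with plain FileInputFormat, 10,000 small files produce 10,000 mappers, while combining them into 64 MB splits needs only a handful:

```java
public class MapperCount
{
    //plain FileInputFormat: one mapper per small file
    public static long perFileMappers(long fileCount)
    {
        return fileCount;
    }

    //combining InputFormat: files are packed into fixed-size splits
    public static long combinedMappers(long fileCount, long avgFileBytes, long maxSplitBytes)
    {
        long totalBytes = fileCount * avgFileBytes;
        //ceiling division: one mapper per full or partial split
        return Math.max(1, (totalBytes + maxSplitBytes - 1) / maxSplitBytes);
    }

    public static void main(String[] args)
    {
        long files = 10_000;
        long avg = 100L * 1024;        //100 KiB per file
        long split = 64L * 1024 * 1024; //64 MiB max split size
        System.out.println(perFileMappers(files));           //10000 mappers
        System.out.println(combinedMappers(files, avg, split)); //16 mappers
    }
}
```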
The output of the MapReduce job must be stored in a format that is suitable for
use in downstream MapReduce jobs. The Java classes used for writing files in MapReduce are
OutputFormat and RecordWriter. These classes are similar to InputFormat and RecordReader,
except that they are used for output. FileOutputFormat implements OutputFormat. It contains
the path of the output files and directory and includes instructions for how the write job must be
run. RecordWriter, which is created within the OutputFormat class, defines the way each record passed
from the mappers is written to the output path.
This output can be used later for downstream analysis. The following sections explain the details
of each class.
In this application, assume you want to output a delimited file. Therefore, you need a way to
replace the chosen delimiter character in the original text field with a different character and a
way to replace new lines in the text with the same replacement character. For this purpose, add
two parameters: com.ibm.imte.tika.delimiter and com.ibm.imte.tika.replaceCharacterWith.
As shown in Listing 3, in the TikaHelper class, read those parameters from an instance of
Configuration to get the replacement options. Configuration is passed from RecordReader, which
creates the TikaHelper instance, described in a following section of this article.
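Inside TikaHelper, reading the two parameters might look like this sketch (the default values shown are assumptions, not from the sample code):

```java
//conf is the org.apache.hadoop.conf.Configuration passed in by TikaRecordReader
String delimiter = conf.get("com.ibm.imte.tika.delimiter", "|");
String replaceWith = conf.get("com.ibm.imte.tika.replaceCharacterWith", "");
```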
After preparing the options, call the readPath method to get a stream of data to be converted
to text. After replacing all the desired characters from the configuration, return the string
representation of the file contents.
The replaceAll method is called on a String object and replaces every occurrence of the specified
characters with the replacement given in the argument. Because it takes a regular expression as
input, surround the characters with the regular expression character class brackets [ and ]. The
solution specifies that if com.ibm.imte.tika.replaceCharacterWith is not set, all matched
characters are replaced with an empty string.
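Isolated from the Hadoop classes, the replacement logic amounts to a single replaceAll call. This standalone sketch (class and method names are illustrative) shows the character class at work:

```java
public class DelimiterCleaner
{
    //replace every occurrence of the delimiter and of newlines with the
    //configured replacement character (empty string when not configured)
    public static String clean(String text, String delimiter, String replaceWith)
    {
        if (replaceWith == null)
        {
            replaceWith = ""; //default: strip the characters entirely
        }
        //[ and ] turn the characters into a regex character class, so each
        //character is matched individually; note that a delimiter that is a
        //regex metacharacter inside a class (such as ]) would need escaping
        return text.replaceAll("[" + delimiter + "\n\r]", replaceWith);
    }

    public static void main(String[] args)
    {
        System.out.println(clean("a|b\nc", "|", "_")); //a_b_c
    }
}
```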
In this article, the output is saved as delimited files. This makes them easy to read and process.
However, you do need to remove newline and delimiter characters in the original text. In use cases
such as sentiment analysis or fraud detection, these characters are not important. If you need to
preserve the original text 100 percent, you can output the results as binary Hadoop sequence files
instead.
Listing 5. TikaInputFormat.java
public class TikaInputFormat extends CombineFileInputFormat&lt;Text, Text>
{
  @Override
  public RecordReader&lt;Text, Text> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException
  {
    return new TikaRecordReader((CombineFileSplit) split, context);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file)
  {
    return false;
  }
}
In the constructor shown in Listing 6, store the required information to carry out the job delivered
from TikaInputFormat. Path[] paths stores the path of each file, FileSystem fs represents a
file system in Hadoop, and CombineFileSplit split contains the criteria of the splits. Notice
that you also create an instance of TikaHelper with the Configuration to parse the files in the
TikaRecordReader class.
In the nextKeyValue method shown in Listing 7, you go through each file in the Path[] and return
a key and value of type Text, which contains the file path and the content of each file, respectively.
To do this, first determine whether you are already at the end of the files array. If not, you move on
to the next available file in the array. Then you open a FSDataInputStream stream to the file. In this
case, the key is the path of the file and the value is the text content. You pass the stream to the
TikaHelper to read the contents for the value. (The currentStream field always points to the
current file in the iteration.) Next, close the used-up stream.
This method is run once for every file in the input. Each file generates a key-value pair. As
explained, when the split has been read, the next split is opened to get the records, and so on.
This process also happens in parallel on other splits. In the end, by returning the value false, you
stop the loop.
In addition to the following code, you must also override some default functions, as shown in the
full code, available for download.
public boolean nextKeyValue() throws IOException, InterruptedException
{
  if (count >= split.getNumPaths())
  {
    return false; //no files left in this split
  }
  Path path = null;
  try {
    path = paths[count];
  } catch (Exception e) {
    return false;
  }
  currentStream = null;
  currentStream = fs.open(path);
  key.set(path.getName());
  value.set(tikaHelper.readPath(currentStream));
  currentStream.close();
  count++;
  return true; //we have more data to parse
}
Listing 8. TikaOutputFormat.java
public class TikaOutputFormat extends FileOutputFormat&lt;Text, Text>
{
  @Override
  public RecordWriter&lt;Text, Text> getRecordWriter(TaskAttemptContext context)
      throws IOException, InterruptedException
  {
    //to get output files in part-r-00000 format
    Path path = getDefaultWorkFile(context, "");
    FileSystem fs = path.getFileSystem(context.getConfiguration());
    FSDataOutputStream output = fs.create(path, context);
    return new TikaRecordWriter(output, context);
  }
}
In the write method shown in Listing 10, use the key and value of type Text created in the mapper
to be written in the output stream. The key contains the file name, and the value contains the text
content of the file. When writing these two in the output, separate them with the delimiter and then
separate each row with a new line character.
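A minimal sketch of such a write method (field names are assumptions; the full class is in the downloadable sample code):

```java
//output is the FSDataOutputStream created in TikaOutputFormat;
//delimiter is read from the job configuration
public void write(Text key, Text value) throws IOException
{
    output.writeBytes(key.toString());   //file name
    output.writeBytes(delimiter);        //configured delimiter
    output.writeBytes(value.toString()); //extracted text content
    output.writeBytes("\n");             //one row per input file
}
```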
If the max split size is not defined, the job assigns all of the input files to only one
split, so there is only one map task. To prevent this, define the max split size by setting the
mapreduce.input.fileinputformat.split.maxsize configuration parameter. This way, each split
has a configurable size, 64 MB in this case.
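In the job driver, the limit can be set like this (a sketch assuming the new-API Job class):

```java
//cap each combined split at 64 MB so that multiple map tasks are created
job.getConfiguration().setLong(
    "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);
```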
You have now finished the MapReduce job. It reads all files in the HDFS input folder and
transcodes them into a delimited output file. You can then conveniently continue analyzing the data
with text analytical tools, such as IBM Annotation Query Language (AQL). If you want a different
output format or you want to directly transform the data, you must modify the code appropriately.
Because many people are not comfortable programming Java code, this article explains how to
use the same technology in a Jaql module.
In the next method, as shown in Listing 14, iterate through all the files in the split, one after the
other. After opening a stream to each file, assign the name and the contents as the elements to a
new instance of BufferedJsonRecord. BufferedJsonRecord helps you keep items in an appropriate
format. Jaql internally runs on JSON documents, so all data needs to be translated into valid JSON
objects by the I/O adapters. The BufferedJsonRecord is then assigned as the value of the record.
The key, however, remains empty.
public boolean moveNext() throws IOException
{
  if (count >= split.getNumPaths())
  {
    done = true;
    return false;
  }
  Path file = paths[count];
  fs = file.getFileSystem(conf);
  InputStream stream = fs.open(file);
  //the file name and the Tika-extracted contents are assigned to the
  //BufferedJsonRecord (bjr) here; see the downloadable sample code
  value.setValue(bjr);
  stream.close();
  count++;
  return true;
}
Create the file tika.jaql, as shown in Listing 15 and put it in the $JAQL_HOME/modules/tika
directory so it can be easily imported into other Jaql scripts. The name of the Jaql file is not
relevant, but the name of the folder you created under the modules folder is important. You can
also add modules dynamically using command-line options from a Jaql-supported terminal.
This code looks for the generated JAR files in /home/biadmin/. You need to copy the Tika JAR file
into this folder and export your created class files as TikaJaql.jar to this folder as well. In Eclipse,
you can create a JAR file from a project with the Export command.
Using Jaql
Now that the module has been created, use the following examples to help you see some possible
uses of this function.
Jaql is quite flexible and can be used to transform and analyze data. It has connectors to analytical
tools, such as data mining and text analytics (AQL). It has connectors to various file formats (such
as line, sequence, and Avro) and to external sources (such as Hive and HBase). You can also use
it to read files from the local file system or even directly from the web.
The following section demonstrates three examples for the use of the tika module in Jaql. The
first example shows a basic transformation of binary documents on HDFS into a delimited file
containing their text content. This example illustrates the fundamental capabilities of the module;
it is equivalent to the tasks you carried out with the MapReduce job in the previous sections. The
second example shows how to use Jaql to load and transform binary documents directly from
an external file system source into HDFS. This example can prove to be a useful procedure if
you do not want to store the binary documents in HDFS, but rather to store only the contents in
a text or sequence file format, for instance. The load is single-threaded in this case, so it does not
not have the same throughput as the first approach. The third example shows how to do text
analysis directly within Jaql after reading the files, without first having to extract and persist the text
contents.
Using the code in Listing 16, read files inside a directory from HDFS and write the results back into
HDFS. This method closely mirrors what you have done in the MapReduce job in the first section.
You must import the tika module you created to be able to use the tikaRead() functionality. You
then read the files in the specified folder using the read() function, and write the file names and
text contents to a file in HDFS in delimited file format.
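Under stated assumptions (the module is named tika, it exports a tikaRead() I/O descriptor, and the del() descriptor is used for delimited output), the script might look like this sketch:

```
import tika(*);

read(tikaRead("/tmp/reviews"))
  -> write(del("/tmp/output", { schema: schema { name, text } }));
```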
You can find additional information on Jaql in the InfoSphere BigInsights Knowledge Center.
The demo input is a set of customer reviews in Word format in a folder, as shown in Listing 16. Of
the 10 reviews, some are positive and some are negative. Assume you want to extract the text
and store it in delimited format. Later, you might want to perform text analytics on it. You want
to keep the file name because it tells you who created the review. Normally, that relationship is
documented in a separate table.
As shown in Listing 17, run the Jaql command to read all the supported documents of this folder,
extract the text, and save it into a single delimited file that has one line per original document.
You can now find the output in the /tmp/output folder. This folder contains the text content of the
Word documents originally in /tmp/reviews, one line per document, with the file name followed by
the delimited text content.
You can now easily analyze the document contents with other tools like Hive, Pig, MapReduce, or
Jaql. You have one part file for each map task.
Using Jaql, you are not constrained by reading files exclusively from HDFS. By replacing the
input path to one that points to a local disk (of the Jaql instance), you can read files from the local
file system and use the write() method to copy them into HDFS, as shown in Listing 19. This
approach makes it possible to load documents into InfoSphere BigInsights and transform them in a
single step. The transformation is not done in parallel (because the data was not read in parallel to
begin with), but if the data volumes are not so high, this method can be convenient.
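Assuming the same tika module, the local variant might look like the following sketch (the local path is illustrative; write() with the hdfs() descriptor stores a binary Jaql sequence file by default):

```
import tika(*);

localRead(tikaRead("file:///home/biadmin/reviews"))
  -> write(hdfs("/tmp/output_seq"));
```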
If your operation is CPU-constrained, you can also use a normal read operation that runs in
MapReduce. However, this method requires you to put the files on a network file system and
mount it on all data nodes. The localRead command runs the transformation in a local task.
As you can see, the only difference here is the local file path. Jaql is flexible and can dynamically
change from running in MapReduce to local mode. You can continue to perform all of the data
transformations and analytics in one step. However, Jaql does not run these tasks in parallel
because the local file system is not parallel. Note that in the previous example, the output format
is changed to a Jaql sequence file. This format is binary and faster, and you don't need to
replace characters in the original text. However, the disadvantage is that the output files are no
longer human-readable. This format is great for efficient, temporary storage of intermediate files.
This last example in Listing 20 shows how to run a sentiment detection algorithm on a set of
binary input documents. (The steps to create the AQL text analytics code are omitted because
other comprehensive articles and references go into more detail. In particular, see the
developerWorks article "Integrate PureData System for Analytics and InfoSphere BigInsights for
email analysis" and the InfoSphere BigInsights Knowledge Center.)
In a nutshell, the commands in the previous sections can read the binary input documents, extract
the text content from them, and apply a simple emotive tone detection annotator using AQL. The
resulting output is similar to Listing 21.
{
  "label": "review1.doc",
  "sentiments": {
    "EmotiveTone.AllClues": [
      {
        "clueType": "dislike",
        "match": "not care for"
      }
    ],
    "label": "review1.doc",
    "text": "I do not care for the camera. "
  }
},
{
  "label": "review10.doc",
  "sentiments": {
    "EmotiveTone.AllClues": [
      {
        "clueType": "positive",
        "match": "reliable"
      }
    ],
    "label": "review10.doc",
    "text": "It was very reliable "
  }
},
...
You can now use Jaql to further aggregate the results, such as counting the positive and negative
sentiments by product and directly uploading the results to a database for deeper analytical
queries. For more details on how to create your own AQL files or use them within Jaql, see the
developerWorks article "Integrate PureData System for Analytics and InfoSphere BigInsights for
email analysis" and the InfoSphere BigInsights Knowledge Center.
You can pack the original small files into a Hadoop archive (HAR) by using the hadoop archive
tool. The first argument specifies the output archive file name, and the second designates the
source directory. This example includes only one source directory, but the tool can accept multiple
directories. After the archive has been created, you can browse the content files.
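For example (the archive name and paths are illustrative, not from the article):

```
# pack the original documents into a single HAR file
hadoop archive -archiveName reviews.har -p /tmp/reviews /tmp/archive

# browse the files inside the archive
hadoop fs -ls -R har:///tmp/archive/reviews.har
```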
Because you have the input files in HAR format, you can now delete the original small files to fulfill
the purpose of this process.
Note that HAR files can be used as input for MapReduce. However, processing many
small files, even in a HAR, is still inefficient because there is no archive-aware InputFormat that
can convert a HAR file containing multiple small files to a single MapReduce split. This limitation
means that HAR files are good as a backup method and as a way to reduce memory consumption
on the NameNode, but they are not ideal as input for analytic tasks. For this reason, you need to
extract the text contents of the original files before creating the HAR backup.
Conclusion
This article describes one approach to analyzing a large set of small binary documents with
Hadoop using Apache Tika. This method is certainly not the only way to implement such
functionality. You could also create sequence files out of the binary files or use another storage
method, such as Avro. However, the method described in this article offers a convenient way to
analyze a vast number of files of various types. Combining this method with Jaql technology, you
have the ability to extract contents directly while reading files from various sources.
Apache Tika is one of the most useful examples, but you can replicate the same approach with
essentially any other Java library. For example, you can extract binary documents not currently
supported by Apache Tika, such as Outlook PST files.
You can implement everything described in this article by using only Java MapReduce. However,
the Jaql module created in the second part of this article is a convenient way to load and transform
data in Hadoop without the need for Java programming skills. The Jaql module enables you to
do the conversion process during load and to use analytical capabilities, such as text or statistical
analysis, which can be completed within a single job.
Downloads

Name: SampleCode.zip
Size: 26MB
Resources
Learn
Hadoop: The Definitive Guide is a great way to learn about Hadoop, MapReduce programming,
and the Hadoop classes used in this article.
Engage with the HadoopDev team.
Read "Integrate PureData System for Analytics and InfoSphere BigInsights for email
analysis" in combination with this article to understand an end-to-end solution with
InfoSphere BigInsights, IBM PureData System for Analytics, and IBM Cognos for Email
Analysis.
Learn more about Apache Tika.
The InfoSphere BigInsights Knowledge Center product documentation includes the full
reference for Jaql and AQL.
Self-paced tutorials (PDF): Learn how to manage your big data environment, import data
for analysis, analyze data with BigSheets, develop your first big data application, develop
Big SQL queries to analyze big data, and create an extractor to derive insights from text
documents with InfoSphere BigInsights.
Technical introduction to InfoSphere BigInsights: Learn more on Slideshare.
Get products and technologies
InfoSphere BigInsights Quick Start Edition: Download this no-charge version, available as a
native software installation or as a VMware image.
Discuss
InfoSphere BigInsights forum: Ask questions and get answers.
Benjamin G. Leonhardi
Benjamin Leonhardi is the team lead for the big data/warehousing partner
enablement team. Before that, he was a software developer for InfoSphere
Warehouse at the IBM R&D Lab Boeblingen in Germany. He was a developer on the data mining,
text mining, and mining reporting solutions.
Piotr Pruski
Piotr Pruski is a partner enablement engineer within the Information Management
Business Partner Ecosystem team in IBM. His main focus is to help accelerate sales
and partner success by reaching out to and engaging business partners, enabling
them to work with products within the IM portfolio, namely InfoSphere BigInsights and
InfoSphere Streams.