
ANALYTICS AND TECH MINING FOR ENGINEERING MANAGERS

SCOTT W. CUNNINGHAM AND JAN H. KWAKKEL

MOMENTUM PRESS, LLC, NEW YORK

Analytics and Tech Mining for Engineering Managers


Copyright © Momentum Press, LLC, 2016.
All rights reserved. No part of this publication may be reproduced, stored
in a retrieval system, or transmitted in any form or by any means
(electronic, mechanical, photocopy, recording, or any other), except for
brief quotations, not to exceed 400 words, without the prior permission
of the publisher.
First published by Momentum Press, LLC
222 East 46th Street, New York, NY 10017
www.momentumpress.net
ISBN-13: 978-1-60650-510-6 (print)
ISBN-13: 978-1-60650-511-3 (e-book)
Momentum Press Engineering Management Collection
Collection ISSN: 2376-4899 (print)
Collection ISSN: 2376-4902 (electronic)
Cover and interior design by Exeter Premedia Services Private Ltd.,
Chennai, India
10 9 8 7 6 5 4 3 2 1
Printed in the United States of America

SWC: This book is dedicated to my mother, Joan Cunningham.

Abstract
This book offers practical tools in Python to students of innovation as
well as competitive intelligence professionals to track new developments
in science, technology, and innovation. The book will appeal to both
tech-mining and data science audiences. For tech-mining audiences,
Python presents an appealing, all-in-one language for managing the
tech-mining process. The book is a complement to other introductory
books on the Python language, providing recipes with which a practitioner
can grow a practice of mining text. For data science audiences, this book
gives a succinct overview of the most useful techniques of text mining.
The book also provides relevant domain knowledge from engineering
management, so that an appropriate context for analysis can be created.
This is the first book of a two-book series. This first book discusses
the mining of text, while the second one describes the analysis of text.
This book describes how to extract actionable intelligence from a variety
of sources including scientific articles, patents, PDFs, and web pages.
There are a variety of tools available within Python for mining text.
In particular, we discuss the use of pandas, BeautifulSoup, and pdfminer.

KEYWORDS
data science, natural language processing, patent analysis, Python, science,
technology and innovation, tech mining

Contents
List of Figures  xi
List of Tables  xiii
List of Scripts, Examples, and Outputs  xv
Preface  xix
Acknowledgments  xxiii
1 Tech Mining Using Open Source Tools  1
  1.1 Why This Book
  1.2 Who Would Be Interested
  1.3 The State of Play
  1.4 What Comes Next
2 Python Installation
  2.1 Scripts, Data, and Examples
  2.2 Different Versions of Python  11
  2.3 Installing Python  12
  2.4 Development Environment  13
  2.5 Packages  17
3 Python Basics for Text Mining  21
  3.1 Input, Strings, and Output  21
  3.2 Data Structures  26
  3.3 Compound Data Structures  30
4 Sources of Science and Technology Information  33
  4.1 Collecting and Downloading the Data  34
  4.2 Altmetrics and the Supply and Demand for Knowledge  42
  4.3 Examples Used in the Text  44
5 Parsing Collected Data  47
  5.1 Reading Column-Structured Data  47
  5.2 Reading Row-Structured Data  51
  5.3 Adapting the Parsers for New Databases  53
  5.4 Reading and Parsing from a Directory  54
  5.5 Reading and Printing a JSON Dictionary of Dictionaries  56
6 Parsing Tree-Structured Files  61
  6.1 Reading an XML File  62
  6.2 Web Scraping Using BeautifulSoup  67
  6.3 Mining Content from PDF Files  71
7 Extracting and Reporting on Text  75
  7.1 Splitting JSONs on an Attribute  77
  7.2 Making a Counter  78
  7.3 Making Simple Reports from the Data  79
  7.4 Making Dictionaries of the Data  82
  7.5 Counting Words in Documents  83
8 Indexing and Tabulating the Data  89
  8.1 Creating a Partial Index of the Data  90
  8.2 Making Dataframes  91
  8.3 Creating Cross-Tabs  96
  8.4 Reporting on Dataframes  100
Conclusions  103
References  111
Index  115

List of Figures
Figure 2.1. Data, notebook, and output setup.  11
Figure 2.2. Anaconda download.  12
Figure 2.3. Setup wizard.  13
Figure 2.4. Choose install location.  13
Figure 2.5. Extracting Anaconda Python.  14
Figure 2.6. Completing the setup wizard.  14
Figure 2.7. Accessing the Anaconda command prompt.  15
Figure 2.8. The command prompt.  15
Figure 2.9. Type and run iPython Notebook.  16
Figure 2.10. iPython Notebook server in the browser.  16
Figure 2.11. New startup directory.  17
Figure 2.12. Upgrading a package using Pip.  18
Figure 3.1. Data and output directories.  22
Figure 3.2. Notebook start page with new directories.  22
Figure 3.3. New blank notebook.  22
Figure 3.4. Renaming the notebook.  23
Figure 3.5. Typing a line in the notebook.  23
Figure 3.6. Hello world!  23
Figure 3.7. Simple debug statements.  24
Figure 4.1. Searching for nanotechnology.  36
Figure 4.2. Nanotechnology records.  37
Figure 4.3. Web of science categories.  37
Figure 4.4. Saving the records.  38
Figure 4.5. PubMed search interface.  40
Figure 4.6. PubMed records.  41
Figure 4.7. PubMed downloads.  41

Figure 5.1. Example column-structured format.  48
Figure 5.2. Example row-structured file.  51
Figure 6.1. Tree-structured file.  62
Figure 8.1. Dataframe of article ID by year.  92
Figure 8.2. Expanded dataframe by year.  93
Figure 8.3. Dataframe of indexed data.  94
Figure 8.4. Dataframe with filled missing data.  94
Figure 8.5. Dataframe with organizations.  96
Figure 8.6. Cross-tab of content by year.  98
Figure 8.7. Cross-tab of organization by year.  99
Figure 8.8. Cross-tab of content by organization.  99
Figure 8.9. The info method.  101
Figure C.1. Process for tech mining study.  108

List of Tables
Table 1.1. The journalists' questions  5
Table 1.2. Types of information product  6
Table 1.3. Questions and products  7
Table 4.1. Sources of science and technology information  34
Table 4.2. Most frequently used fields in scientific records  38
Table 4.3. Most frequently used fields in patents  39
Table 7.1. Sourcing the journalists' questions  76
Table 7.2. Information products  77
Table 7.3. Coverage of information products  87

List of Scripts, Examples, and Outputs
Output 2.1. Default profile location  16
Output 2.2. Notebook directory  17
Output 2.3. Changing the notebook directory  17
Output 2.4. Using pip  18
Output 2.5. Upgrading packages with pip  18
Example 3.1. Hello World!  22
Example 3.2. String printing  23
Example 3.3. Opening a file  24
Example 3.4. Writing to a file  25
Example 3.5. The enumerate function  25
Example 3.6. Lists  26
Example 3.7. Dictionaries  27
Example 3.8. Sorting a dictionary  27
Output 3.1. Sorted dictionary  28
Example 3.9. Counters  29
Example 3.10. Using a counter  29
Example 3.11. Adding two counters  30
Example 3.12. Fields in a record  30
Example 3.13. Storing fields in a record  30
Output 3.2. Stored record  31
Example 3.14. Another example of fields in a record  31
Output 3.3. A second stored record  31
Example 3.15. Dictionary of dictionaries  32
Example 3.16. Retrieving an article from a corpus  32
Output 3.4. Example output from a dictionary of dictionaries  32
Example 5.1. Parsing column-structured data  48
Output 5.1. Example record from PubMed database  50
Example 5.2. Saving a dictionary of dictionaries to JSON  50

Example 5.3. Parsing row-structured data  51
Example 5.4. Adapting the parser for a new database  53
Example 5.5. Reading from a directory  55
Output 5.2. Example output of reading from a directory  55
Example 5.6. Loading and pretty-printing a JSON file  56
Output 5.3. Sample dictionary of dictionaries  57
Example 5.7. Extracting a sample from the dictionary of dictionaries  58
Output 5.4. Displaying part of a sample record  58
Example 6.1. XML to dictionary  63
Output 6.1. Patent stored in a dictionary  63
Example 6.2. Pretty-printing a dictionary  63
Output 6.2. Sample pretty-printed output  64
Example 6.3. Recursively printing a dictionary and its contents  64
Output 6.3. Top of the patent  65
Output 6.4. Cited literature in the patent  66
Output 6.5. Description of the invention  66
Example 6.4. BeautifulSoup example  68
Output 6.6. Scraped HTML sample  68
Example 6.5. Extracting readable text from HTML  69
Output 6.7. Example readable text from HTML  70
Example 6.6. Example use of PDFMiner  71
Output 6.8. Sample PDF output to text  72
Example 6.7. Get outlines method  73
Example 7.1. Splitting a corpus  77
Output 7.1. Results from splitting  78
Example 7.2. Making a counter  78
Output 7.2. Screen output from a counter  79
Example 7.3. The most common method  79
Output 7.3. The top 10 years  79
Example 7.4. Counting authors  80
Output 7.4. Top 10 authors  80
Example 7.5. Counting nations  81
Output 7.5. Top 10 nations  81
Example 7.6. Extracting a dictionary  83
Example 7.7. Loading a JSON  83
Example 7.8. Fetching a field  84
Output 7.6. Sample counter  84
Example 7.9. Counting a field  85
Output 7.7. The most frequent words  86
Example 8.1. Selective indexing  90

Output 8.1. Sample counters by records  91
Example 8.2. Making a data frame from an index  92
Example 8.3. Expanding the data frame  93
Example 8.4. Using the head method  94
Example 8.5. Filling missing values  94
Example 8.6. Organizations of interest  95
Example 8.7. Selective organizational search  95
Example 8.8. Creating an organization data frame  96
Example 8.9. Sizing the data frame  97
Output 8.2. The dimensions of a data frame  97
Example 8.10. Creating a content by year cross-tab  97
Example 8.11. Creating an organization by year cross-tab  98
Example 8.12. Creating a content by organization cross-tab  98
Output 8.3. The info method  100
Example 8.13. Summing a data frame  101
Output 8.4. A summed data frame  101
Example 8.14. The describe method  101
Example 8.15. Saving a data frame  102

Preface
The authors of this book asked me to share perspectives on tech mining.
I co-authored the 2004 book on the topic (Porter and Cunningham 2004).
With an eye toward Scott and Jan's materials, here are some thoughts.
These are meant to stimulate your thinking about tech mining and you.
Who does tech mining? Experience suggests two contrasting types
of people: technology and data folks. Technology folks know the subject;
they are either experienced professionals or trained professionals or both,
working in that industry or research field to expand their intelligence via
tech mining. They seek to learn a bit about data manipulation and analytics to accomplish those aims. For instance, imagine a chemist seeking a
perspective on scientific opportunities or an electrical engineer analyzing
emerging technologies to facilitate open innovation by his or her company.
The data science folks are those whose primary skills include some variation
of data science and analytics. I, personally, have usually been in this
group, needing to learn enough about the subject under study to not be
totally unacquainted. Moreover, in collaborating on a major intelligence
agency project to identify emerging technologies from full-text analyses,
we were taken by the brilliance of the data folks: really impressive
capabilities to mine science, technology, and innovation text resources.
Unfortunately, we were also taken by their disabilities in relating those
analyses to real applications. They were unable to practically identify
emergence in order to provide usable intelligence.
So, challenges arise on both sides. But, a special warning to readers
of this book: we suspect you are likely Type B, and we fear that the
challenges are tougher for us. Years ago, we would have said the opposite,
that analysts can analyze anything. Now, we think the other way: you
really need to concentrate on relating your skills to answering real
questions in real time. My advice would be to push yourself to perform
hands-on analyses on actual tech-mining challenges. Seek out internships
or capstone projects or whatever, to orient your tech-mining skills to
generate answers to real questions, and to get feedback to check their utility.
Having said that, an obvious course of action is to team up Types A
and B to collaborate on tech-mining work. This is very attractive, but you
must work to communicate well. Don't invest 90 percent of your energy in
that brilliant analysis and 10 percent in telling about it. Think more toward
a 50/50 process where you iteratively present preliminary results, and get
feedback on the same. Adjust your presentation content and mode to meet
your users' needs, not just your notions of what's cool.
What's happening in tech mining? The field is advancing. It's hard
for a biased insider like me to gauge this well, but check out the
website www.VPInstitute.org. Collect some hundreds of tech-mining-oriented
papers and overview their content. You can quickly get a picture of the
diverse array of science, technology, and innovation topics addressed in
the open-source literature. Less visible, but the major use of tech-mining
tools, are the competitive technical intelligence applications by companies
and agencies.
Tech mining is advancing. In the 2000s, studies largely addressed
who, what, where, and when questions about an emerging technology.
While research profiling is still useful, we now look to go further along
the following directions:
- Assessing R&D in a domain of interest, to inform portfolio management
  or funding agency program review.
- Generating competitive technological intelligence, to track known
  competitors and to identify potential friends and foes. Tech mining
  is a key tool to facilitate open innovation by identifying potential
  sources of complementary capabilities and collaborators.
- Technology road mapping by processing text resources (e.g., sets
  of publication or patent records on a topic under scrutiny) to extract
  topical content and track its evolution over time periods.
- Contributing to future-oriented technology analyses; tech mining
  provides vital empirical grounding to inform future prospects. The
  transition from identifying past trends and patterns of engagement
  to laying out future possibilities is not automatic, and offers a field
  for productive study.
I'd point to some resources to track what's happening in tech mining as
time progresses:
- Note the globalization of tech-mining interest. For instance, this
  book has been translated into Chinese (Porter and Cunningham 2012).
  I am not expecting many of you to rush off to read it, but it is an
  indicator of considerable interest in Asian economies pursuing science,
  technology, and innovation opportunities. And that reinforces
  the potential of text processing of languages other than English.
- Track the scholarly literature. Tech-mining analytics and applications
  splatter across various scholarly fields. Here I note a few
  pieces from our colleagues. Bibliometrics journals cover analytical
  advances; cf. Ma and Porter (2014) and Zhang et al. (2014).
  Management of technology-oriented journals cover analytics and
  applications; cf. Guo et al. (2014), Newman et al. (2014), and
  Porter, Cunningham, and Sanz (2015).
Should you choose open-source or proprietary tools and software?
This book advances an open-source strategy for you to learn skills in
Python and other open software, especially to apply to open source data
resources. I come from the proprietary side, pursuing use of commercial
software (VantagePoint 2016), particularly in analyzing leading science,
technology, and innovation resources like Web of Science and Derwent
Patents. I'd like to say a bit about the pros and cons of each.
Open source advantages favor software, data, and learning. In software,
there are lots of open advantages, including leveraging others'
contributions and free access. In data, free is good, and free data is
certainly on the upswing in science, technology, and innovation. And
finally, there are learning opportunities that offer inherent value beyond
the immediate tasks. For instance, you may be learning Python to do other
things with it, as well as gaining transferable programming skills.
But don't write off proprietary resources. If better data are available
for a price, they may dominate free options. One can waste a lot in
cleaning the crappy data and never catch up with the readily available
alternative. Without such suitable cleaning, one could be generating
garbage out from garbage in. Making your own scripts is expensive. If
good software is already available, you should use it.
And surely, consider combinations. Don't rule out open source data
just because you're using proprietary software; for example, MEDLINE
offers uniquely rich coverage of the world's biomedical research and it's
free to all. Conversely, your open source software may enable you to
generate particularly valuable CTI by analyzing a premium information
resource, such as Web of Science or Derwent Patents, alone or in
conjunction with additional open source resources. Combinations increase
your potential resources; for example, VantagePoint (proprietary) works
well with Pajek (open source) to generate science and patent overlay maps
to show disciplinary participation in R&D areas under study (cf. Kay
et al. 2014).
Alan Porter
Atlanta, Georgia
July 30, 2015

Acknowledgments
SWC: This work was partially funded by a European Commission grant,
grant number 619551.
JHK: This work was partially funded by the Dutch National Science
Foundation, grant number 451-13-018.

CHAPTER 1

Tech Mining Using Open Source Tools
This book is for readers with a pervasive interest in science, technology,
and innovation and those who invest time in analysis in order to get a deeper
sense of the underlying trends in human knowledge. The book provides
the tools to let you monitor and analyze the raw by-products of scientific
activitywhether they are scientific articles, patents, web pages, or social
media posts. Although the book requires a basic level of programming
skills in Python, many detailed examples have been provided that can
be used as a basis for further learning. In addition, the examples can be
further adapted and customized to meet the specific needs of the readers.

1.1 WHY THIS BOOK


This book is about using open source software. Broadly speaking, software
is open source if its source code is made available. The license of
the software specifies that the copyright holder provides the right to
study, change, and distribute the software to anyone and for any purpose.
There exists a plethora of open source licenses, and a complete discussion
of their details is beyond the purpose of this book.
The emergence of open source software is tied to the rise of the
Internet. However, some of the essential ingredients of open source software existed before this. For example, in the 1920s and 1930s, various
motor companies in the United States freely shared patents based on a
cross-licensing agreement. The Internet itself emerged out of the collaborative process adopted in the context of ARPANET for the development of
communication protocols.


Nowadays, open source software spans the space from the operating
system (e.g., the Linux kernel) all the way to very specialized applications
like GIMP (a Photoshop alternative). Moreover, the idea of open source
has spread to other domains as well. For example, in electronics, Arduino
is based on open source principles, and a website like Wikipedia is also
built on open source ideals.
There are many programming languages available. Why are we
using Python in this book? There are several reasons why we have chosen
Python. First, Python is open source software. The licenses under which
most Python libraries are distributed are quite liberal. Therefore,
they can be distributed freely even in the case of commercial applications. It
is also free, and can easily be acquired via the Internet. Python is platform
independent. It runs under Windows, Mac OSX, and virtually all Linux
distributions. Moreover, with a little care, programmers can write code
that will run without change on any of these operating systems.
Second, Python is a general purpose programming language. This
means that Python is designed to be used for developing applications in
a wide range of application domains. Other domain-specific languages
are much more specialized. For example, Matlab, frequently used in
engineering, is designed for matrix operations. Being a general purpose
programming language, Python can be used for, among other things,
string handling, mathematics, file input and output, connecting to databases, connecting over a network and to websites, and GUI development.
Python comes with a comprehensive standard library. The library contains
modules for graphical user interface development, for connecting to the
Internet as well as various standard relational databases, for handling regular expressions and for software testing. Next to the extensive standard
library, there are many libraries under active development for scientific
computing applications. This scientific computing community is vibrant
and actively developing bothcornerstone libraries for general scientific computing and domain-specific libraries. This implies that Python
is increasingly being seen as a viable open source alternative to many
established proprietary tools that have typically been used in science and
engineering applications.
Third, the language design of Python focuses on readability and
coherence. Quite often, it is hard to read code, even if you wrote it
yourself a few weeks ago. In contrast, the language design of Python
strives for code that is easy to read. Both collaboration and education
benefit from this readability. One of the most obvious ways in which
Python enforces readability is through its use of indentation to structure
code blocks. In many programming languages, blocks of code sit between
curly braces; to understand the structure of the code, one has to detect
which curly braces belong together. In contrast, Python structures code
blocks through the use of indentation. This structuring might take some
getting used to, but it produces highly readable code.
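As a quick illustration of this point, consider the small, hypothetical function below (it is not one of the book's own examples). The function, the loop, and the conditional are each delimited purely by indentation; no braces appear anywhere:

```python
# Python marks the body of the function, the loop, and the
# conditional entirely by indentation, not by curly braces.
def count_long_words(words, minimum_length=10):
    count = 0
    for word in words:
        if len(word) >= minimum_length:
            count += 1
    return count

# "nanotechnology" and "bibliometrics" each have 10 or more letters.
print(count_long_words(["nanotechnology", "data", "bibliometrics"]))  # prints 2
```

Reading the structure requires no matching of delimiters; the shape of the code on the page is the structure.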
Fourth, Python is a high level programming language. This means
that many details related to the exact working of a computer are
abstracted away. For example, in Python, a user typically need not worry
about memory management. Because many of these details have been
abstracted away, a programmer can focus on getting things done, and
done quickly. It is not uncommon that performing a given task in Python
requires half the amount of code as compared to the same task in a lower
level language.

1.2 WHO WOULD BE INTERESTED


You might be a practicing scientist, engineer, or a trainee. You might be
an analyst working in a research-intensive industry. You might work in
a government or nonprofit agency, and you might need to evaluate the
impact of current research funding. Or you might be a director or vice
president who wants to know what is possible given current state-of-the-art capabilities for analysis. A number of different professionals need to
use and know more about technology intelligence and text mining.
In fact, this book has a lot in common with data mining and data
science. As a result, we sometimes speak of tech mining, which is the
specific application of data mining to studying science, technology, and
innovation. But we will, on occasion, also use the word "text mining." Text
mining is the application of data mining techniques to qualitative data and,
more specifically, text. We expect there to be an increasing and fruitful
exchange between applied practitioners in tech mining and those who are
mining texts (of social media especially) for other business and organizational purposes.
Given the examples in this book, the interested reader will want
to build and expand upon them for their own standard routines. This
book also does not discuss the design of a complete tech mining study
to meet a practical need. If data and standardization are important to you,
you might consider acquiring proprietary software for analysis; many such
packages come with standard subscriptions to large databases of science
and technology.


1.3 THE STATE OF PLAY


How much scientific data is out there today? According to some estimates,
the world exceeded 1 zettabyte of storage capacity in 2014. Rough estimates suggest that 6 exabytes of this data are texts related to research and
development activity. That's written material equal to the contents of 600
Libraries of Congress. Of course, a lot more raw scientific data is being
collected; earlier, such data was refined and analyzed only in text.
The Industrial Revolution presaged our modern era of change. The
world has been managing rapid technological development for a long time
now! Is there anything unique about the character of technological growth
being experienced today? Development has historically occurred in fits
and startsthere are periods of stability and economic growth, periods of
stagnation, and also periods of rapid technological change and disruption.
We may be entering another such period of rapid technological change,
the likes of which have not been seen since the 1950s.
The character of this growth will be qualitatively different. For one,
it will likely be heavily dependent on computers, data, and the Internet.
Second, it will most likely be science based. And third, this is likely to
be a period of open innovationdistributed across many parties rather
than being concentrated on large government or industrial bureaus.
And finallyalthough perhaps it need not be saidthese technological
changes will be surprising and unforeseen for many.
Novel societal changes demand new techniques for governance and
management. Since this new wave of change is likely to involve (or be
about) computing and the Internet, it makes sense to have a set of tools
utilizing computers and the Internet. And, since this new wave of change
is likely to be science based, it makes sense that keeping abreast with
change will involve monitoring the by-products of science. This includes
scientific articles and also patents. The new age of innovation will also be
distributed. This means that individuals and organizations cannot count
on having necessary knowledge right at hand; in turn, this places a high
demand on coordination between various parties. Participation in science
and technology is increasingly becoming a highly distributed process of
searching social networks, and a highly asynchronous process of reading
and writing large repositories of knowledge.
The fascinating work by Hilbert (2014) quantifies the format of the
world's data. The digital revolution has been a revolution in text. Given
Hilbert's data, we estimate that there are more than 15 exabytes of stored
text today. It plays a more important role than ever before. At least for a
while, before digital video fully takes hold, text is growing faster, and is a
significant part of our computing and communication infrastructure.


Our final point in this section is about the monitoring of change.


Changes bring both disruption and opportunities. We believe that skills
for monitoring technological change will be critical for organizations as
well as technology professionals. In the next section, we'll discuss what
open source software brings to the practice of open innovation.

1.4 WHAT COMES NEXT


This closing section of the chapter discusses three things: the structure of
the book, the case examples used, and the various code examples. In the
structure section of the book, we discuss the organization of the book and
how this relates to general processes for data mining and data science. In
the example section, we discuss various sources of data used to illustrate
the core concepts of the book. In the code example section, we discuss
where you can find online the full iPython notebooks, which accompany
each of the chapters of the book. With these, you can run the examples at
home or office, on your own computers.
1.4.1 STRUCTURE OF THE BOOK
This book is the first of a two-volume set on text mining and text
analysis. It covers the mining of text, and takes a simple approach to
laying out a variety of different possible text analysis questions. The
approach is based on the journalists' questions (Table 1.1). These are the
who, what, when, where, and why of the technology intelligence world.
These questions can be used to appraise the kinds of technological
intelligence you can create using your data. As we progress through the
book, we will clarify how to source information that can help answer each
of these five questions.
Information products come in successive levels of complexity. Simple
lists, such as a top 10 author list or a top 10 publishing organization list,
give quick insight into the data. Tables provide comprehensive input

Table 1.1. The journalists' questions

  Type of question
  Who
  What
  When
  Where
  Why

suitable for further analysis. For instance, you can create a set of articles
fully indexed by content. Then it becomes possible to filter and retrieve
your content. This often reveals surprising relationships in the data. You
can also compile organizational collaborations across the articles. Like
article indices, these collaboration tables are often inputs to data analysis
or machine-learning routines.
The final form of information products, which well consider here,
are cross-tabs. Cross-tabs mix two or more of the journalists questions
to provide more complex insight into a question of research and development management. For instance, a cross-tab that shows which organizations specialize in which content can be used for organizational profiling.
A decision-maker may use this as an input into questions of strategic
alliance. The variety of information products that we will be considering
in the book is shown in Table 1.2.
Lists include quick top 10 summaries of the data. For instance, a list
might be of top 10 most published authors in a given domain. These lists
should not be confused with the Python data structure known as a list.
We'll be discussing this data structure in subsequent chapters. A table is a
complete record of information, indexed by a unique article id or a unique
patent id. Such a table might include the presence or absence of key terms
in an article. Another example of a table could include all the collaborating organizations, unique to each article. A cross-tab merges two tables
together to produce an informative by-product. For instance, we could
combine the content and organization tables to produce an organizational
profile indicating which organizations research what topics. Our usage of
list, table, and cross-tab is deliberately informal here.
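A top 10 list of the kind described above can be produced in a few lines of Python with the standard library's Counter. A minimal sketch, assuming the author names have already been extracted from the records into a flat list (the names below are invented for illustration):

```python
from collections import Counter

# Hypothetical list of author names, one entry per authorship.
authors = ["Porter, A.", "Cunningham, S.", "Porter, A.",
           "Kwakkel, J.", "Porter, A.", "Cunningham, S."]

author_counts = Counter(authors)

# most_common(n) returns the n most frequent items, most frequent first.
for author, count in author_counts.most_common(10):
    print(author, count)
```

The same pattern serves for top 10 years, organizations, or nations; only the field being counted changes.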
The table shows the type of question being asked, as well as the form
of information product, resulting in a five-by-three table of possibilities
(Table 1.3). Although we haven't created examples of all the 15 kinds of
questions and products represented in this table, there is a representative
sample of many of these in the book to follow.
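A cross-tab of the organization-by-content kind can be sketched with nothing more than a dictionary of counters, one row per organization. The organization and topic names below are invented for illustration, and the article records are assumed to have already been reduced to (organization, topic) pairs:

```python
from collections import Counter, defaultdict

# Hypothetical (organization, topic) pairs, one per article.
articles = [("TU Delft", "nanotechnology"),
            ("TU Delft", "text mining"),
            ("Georgia Tech", "nanotechnology"),
            ("Georgia Tech", "nanotechnology")]

# Rows are organizations; each row is a Counter of topic frequencies.
crosstab = defaultdict(Counter)
for organization, topic in articles:
    crosstab[organization][topic] += 1

for organization, topics in crosstab.items():
    print(organization, dict(topics))
```

Chapter 8 builds the same kind of cross-tab with pandas dataframes, which add labeling, filtering, and export for free.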
We now briefly introduce the book to follow. The next chapters,
Chapters 2 and 3, provide a quick start to the Python programming language. While there are many fine introductory texts and materials on
Table 1.2. Types of information product

  Type of information product
  List
  Table
  Cross-tab


Table 1.3. Questions and products

          List    Table    Cross-tab
  Who
  What
  When
  Where
  Why

Python, we offer our own quick start in these two chapters. The chapters provide one standard way of setting up a text mining system, which
can get you started if you are new to Python. The chapters also provide
details on some of the most important features of the language, to get you
started, and to introduce some of the more detailed scripts in the book to
follow.
There is also a chapter on data understanding, Chapter 4 of the book. This chapter covers sources of science, technology, and innovation information. These sources come in a wealth of differently formatted files, but the files basically break down into row-, column-, and tree-structured data. During the data mining process, cleaning and structuring the data is incredibly important.
We provide two full chapters on the topic, Chapters 5 and 6, where we guide you through processes of extracting data from a range of text sources. Here the differences between text mining and more general data mining processes become especially apparent. These chapters introduce the idea that text is structured in three major ways: by rows, by columns, and by trees. The tree format in particular leads us to consider a range of alternative media formats, including the pdf format and the web page.
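The three structures can be illustrated with the same toy record, each read with the Python standard library. The record, its field names, and its values are invented for this sketch, not taken from any real data source.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Row-structured: each line is a self-contained record (e.g., JSON lines).
row = json.loads('{"id": "A1", "title": "Nano sensors", "year": 2015}')

# Column-structured: fields named once in a header, values in columns (e.g., CSV).
reader = csv.DictReader(io.StringIO("id,title,year\nA1,Nano sensors,2015\n"))
col = next(reader)

# Tree-structured: nested elements, as in XML files or web pages.
tree = ET.fromstring("<article><id>A1</id><title>Nano sensors</title></article>")

print(row["title"], col["title"], tree.find("title").text)
```

However the record arrives, the parsing step ends in the same place: a Python data structure from which fields can be pulled by name.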
The book concludes with Chapters 7 and 8, where we discuss producing informative lists and tables from your text data. These chapters walk through gradually increasing levels of complexity, ranging from simple top-10 lists on your data, to full tables, and then to informative cross-tabs of the data. These outputs are useful both for decision-makers and for additional statistical analysis.
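The simplest of these outputs, a top-10 list, can be sketched by counting words across a handful of titles. The titles and the tiny stop-word list below are invented for the example; the book's later chapters build this kind of count from a real downloaded corpus.

```python
from collections import Counter

# Hypothetical article titles standing in for a downloaded corpus.
titles = ["Mining text with Python",
          "Text mining for patents",
          "Patents and innovation"]

# A tiny illustrative stop-word list of words to ignore.
stop_words = {"with", "for", "and"}

words = Counter()
for title in titles:
    words.update(w for w in title.lower().split() if w not in stop_words)

# The top-10 list: the most common remaining words.
top10 = words.most_common(10)
print(top10)
```

From here, the same counts can be arranged into a full table, and two such tables can be merged into a cross-tab.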

Index

A
altmetrics, 42
Anaconda, 12
ARPANET, 1
arrays, 18

B
BeautifulSoup library, 67
  Web scraping, 67–70
blogs, 42
Boolean query, 40
Bouzas, V., 30

C
Canopy, 12
citation
  forward, 82
  measures, 43
cited record field (CR), 39
column-formatted data, 35
  reading, 47–51
compound data structures, 30–32
Continuum Analytics page, 12
corpus, 32, 48, 52, 54, 83
counter, 29, 84–85
counter, making, 78–79
CRISP-DM. See cross-industry standard process for data mining
cross-industry standard process for data mining (CRISP-DM), 106
cross-tabs, 6, 76
  creating, 96–100

D
data
  collecting and downloading, 34–41
  formats, 35
  mining, 5, 7
  unstructured, 35
data directory, 21
data structures, 26–30, 105
  compound, 30–32
  counter, 29
data transformation, 105
databases, 38
DataFrames, 18, 91–96
  reporting on, 100–102
Delicious, 34, 43
Derwent, 38
describe() method, 101
Designated states (DS), 40
development environment, 13–17
dictionaries, 27
  corpus, 32
  reading and parsing, 54–56
  reading and printing JSON dictionary of, 56–59
dictionary, defined, 48
distribution, Python, 12

E
Eclipse, 13. See also integrated development environment (IDE)
enumerate() method, 26
EPO. See European Patent Office
European Patent Office (EPO), 44, 62
Extensible Mark-up Language (XML) file, 62
  reading, 62–67

F
field tag, 52
fields, 38. See also records
forward citations, 82

G
github, 9, 25

H
hyperlink, 40, 82

I
IDE. See integrated development environment
innovation, 33
integrated development environment (IDE), 13
International Patent Classification (IPC), 40, 44
invention, 33
IPC. See International Patent Classification
iPython Notebook, 10, 13–14, 21, 22

J
Java Script Object Notation (JSON), 24, 47, 105
  reading and printing dictionary of, 56–59
  splitting on attribute, 77–78
JSON. See Java Script Object Notation

L
library, 17
lists, 26

M
machine learning, 107
machine-accessible format, 35
Matplotlib, 18
MEDLINE database, 44
Mendeley, 34
module, 17

N
nanometer, 44
nanotechnology, 36, 44
National Library of Medicine, 40
Natural Language Tool Kit (NLTK) package, 107
NetworkX, 18
Networkx package, 107
NLTK. See Natural Language Tool Kit
NumPy package, 18

O
open source software, 1
output directory, 21

P
package, 17–19
  NetworkX, 18
  NumPy, 18
  Pandas, 18
  Pip, 18
  Scikit-learn, 18
  SciPy, 18
Pandas package, 18
  dataframes, 91, 95
parsers
  pdf files, 71
  PubMed, 53–54
partial index, 90–91
pdf files
  mining content, 71–74
  parsing, 71
pdfminer3k, 71
Pip package, 18
PMID, 52
PubMed, 40, 44, 47
  parsing, 53–54
Python, 2–3
  coding, 3, 11
  data structures, 26–30
  dictionaries, 27
  different version of, 11–12
  distribution, 12
  installing, 12–13
  language, 19
  lists, 26
  meaning, 11
  part of text mining, 109–110
  scripts coding, 13
  versions of, 11–12
Python Application Programming Interface, 19

R
R language, 11
records, 38. See also databases
row-formatted data, 35
  reading, 51–53

S
sample record counter, 84
Scikit-learn package, 18
SciPy package, 18
script, 17, 108–109
  Python coding, 13
simple reports, 79–82
stop words, 85
syndicated database, 44

T
t-stochastic neighborhood embedding, 107
tech mining, 103
  lessons, 104–106
technology, 33
text analytics, 106–108
text mining, 3
  input organization, 21
  outputs organization, 21
  python part of, 109–110
text visualization, 107
Text-Mining-Repository, 9
Thomson Data Analyzer, 110
to_csv() method, 102
tree-formatted data, 35
Twitter, 18, 34, 43

U
University of Science and Technology of China, 99
unstructured data, 35
urllib module, 67–68

V
VantagePoint software, 110

W
walk method, 55
Web of Science, 35
web pages, 42
Web scraping, 67
  BeautifulSoup library, 67–70
Wikipedia, 34, 43
wildcards, 35. See also Web of Science
words, counting, 83–88

X
XML. See Extensible Mark-up Language

Z
zip method, 50
