
GUNADARMA UNIVERSITY

FACULTY OF INDUSTRIAL TECHNOLOGY


Internet Résumé Document Classification using
Naïve Bayes Algorithm
Name : Septiawan
Student ID : 56410473
Department : Informatics Engineering
Supervisor : Dr.-Ing. Adang Suhendra, SSi, SKom, Msc
Thesis Submitted to The Faculty of Industrial Technology
Gunadarma University
In Partial Fulfillment of The Requirements
For The Undergraduate Degree
2014
Statement of originality and
publications
Hereby:
Name : Septiawan
Student ID : 56410473
Title of
Undergraduate
Thesis
: "INTERNET RSUM DOCUMENT
CLASSIFICATION USING NAVE BAYES
ALGORITHM"
Session Date : 5 March 2014
Passing Date : 5 March 2014
I state that this writing is the result of my own work and may be published in its entirety by Gunadarma University. All quotations in any form follow the applicable rules and ethics. All copyrights of logos and products mentioned in this book are the property of their respective rights holders, unless otherwise noted. The content and writing are the responsibility of the author, not Gunadarma University.
This statement is made truthfully and with full awareness.
Jakarta, 5 March 2014
(Septiawan)
Validation Page
Advisory Committee
Session Date : 5 March 2014
No. Name Position
1. Dr.-Ing. Adang Suhendra, SSi, SKom, Msc
2.
Board of Examiners
Passing Date: 5 March 2014
No. Name Position
1.
2.
3.
4.
5.
Acknowledged by,
Supervisor Department Head of Bachelor's
Defense Examination
(Dr.-Ing. Adang Suhendra, SSi, SKom, Msc) (Dr. Edi Sukirman, MM)
Abstract
Septiawan. 56410473
"INTERNET RSUM DOCUMENT CLASSIFICATION USING NAVE BAYES
ALGORITHM".
Undergraduate thesis, Faculty of Industrial Technology, Informatics Engineering Department, Gunadarma University, 2014.
Keywords: Job résumé, Document classification, Naïve Bayes classifier, Data Crawling
(xiii + 105 + Appendix)
Classifying or categorizing résumé documents into an appropriate job category takes much time, especially if an employer has many résumé documents to classify properly in time. An automated document classification system can recognize a document if the system has been trained to recognize the document's pattern. Résumé documents are divided into multiple categories, so to classify résumé documents, multi-class document classification is better suited than binary document classification. In the training process, example résumé documents are needed as pattern references. To learn the pattern, a classifier algorithm is needed to create a classification model from the pattern references and to classify the documents one wants to classify. The Naïve Bayes classifier algorithm is one classifier algorithm that can be used to create such a classification model. In this thesis, a résumé document classification system has been developed to classify résumé documents into the appropriate categories. Example résumé documents were obtained from an online résumé directory, Indeed.com. Data crawling is a method that crawls links and fetches all information from website pages using the HTML Document Object Model structure.
References (1996-2014)
Acknowledgements
First of all I would like to thank Jesus Christ for all the blessings, because only with His help and permission was I able to finish my undergraduate thesis, entitled Internet Résumé Document Classification using Naïve Bayes Algorithm, in time.
This undergraduate thesis is intended to complete the requirements to finish my study in the Informatics Engineering Department, Gunadarma University. This writing would not have been possible without the support and encouragement of the people around me. Therefore, I would like to say thanks to:
1. Prof. Dr. E. S. Margianti, SE., MM, as Rector of Gunadarma University.
2. Prof. Suryadi Harmanto, SSi, MMSI, as 2nd Rector Assistant of Gunadarma University, who gave me the opportunity to join the SARMAG Program.
3. Prof. Dr. Ir. Bambang Suryawan, as Dean of Industrial Technology
Faculty.
4. Dr.-Ing. Adang Suhendra, SSi, SKom, MSc, as Head of the Informatics Engineering Program and as my supervisor, for his guidance, support, encouragement, and especially the idea for this thesis. He guided me to do well and showed me what is right and what is not during this research and the thesis writing.
5. Drs. Haryanto, MMSI, as director of the SARMAG program, who gave me the chance to join the SARMAG program.
6. Remi Sanjaya, ST., MMSI, as SARMAG coordinator, who always gave us all the information about SARMAG and his support.
7. All SARMAG lecturers, for the knowledge and experience they gave me.
8. My beloved father, who passed away in 2009 when I was in the second grade of senior high school. My beloved mother, for her exceptional support, love, encouragement to finish this writing in time, and prayers that could not be expressed in words. I dedicate this thesis to my father and mother; I could not even start or finish anything without your love.
9. My beloved sister and brother; even though we were not close to each other when I finished this thesis, your love and support embraced me all the time.
10. My partner, Damai Subimawanto, for his help and support when I was lost and stuck and did not know what I should do, and for sharing lots of ideas and information during this research. Truly, without his help, I could not have finished this thesis. Thank you very much.
11. SMTI05 friends, for the laughs even in the worst times; especially Zaki, Tryono, Donny, Ario, Ropy, Jonathan, and Yusuf.
12. Matius, Noy, and Hoki, for the support.
13. Everyone that I could not mention here.
This writing still needs improvement; therefore, I look forward to constructive criticism and suggestions from anyone. Hopefully, I can continue to higher education, and may this undergraduate thesis bring advantages, especially to me and generally to readers.
Jakarta, March 2014
Septiawan
Contents
Abstract iv
Acknowledgements v
List of Figures xi
List of Tables xii
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Scope of the research . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Objective of the research . . . . . . . . . . . . . . . . . . . . . 4
1.5 Method of the research . . . . . . . . . . . . . . . . . . . . . . 4
1.6 Structure of Thesis . . . . . . . . . . . . . . . . . . . . . 5
2 Theoretical Background 7
2.1 Text Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Text Preprocessing . . . . . . . . . . . . . . . . . . . . 8
2.1.1.1 Text Tokenizing . . . . . . . . . . . . . . . . 9
2.1.2 Text Transformation/Feature Extraction . . . . . . . . 9
2.1.2.1 Text Filtering . . . . . . . . . . . . . . . . . . 9
2.1.3 Pattern Discovery . . . . . . . . . . . . . . . . . . . . . 10
2.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Learning Technique . . . . . . . . . . . . . . . . . . . . 12
2.2.1.1 Supervised Learning . . . . . . . . . . . . . . 12
2.2.1.2 Unsupervised Learning . . . . . . . . . . . . 13
2.2.1.3 Semi-supervised Learning . . . . . . . . . . . 13
2.3 Document Classification . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Classification by Hand . . . . . . . . . . . . . . . . . . 14
2.3.2 Classification with Machine Learning . . . . . . . . . . 15
2.4 Naïve Bayes Algorithm . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Explanation of Bayes's Rule . . . . . . . . . . . . . . . 18
2.4.2 Explanation of Naïve Bayes . . . . . . . . . . . . . . . 19
2.4.3 Naïve Bayes Classifier for Text Classification . . . . . . 22
2.5 Crawling the Web . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.1 HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5.2 DOM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5.3 Jsoup API . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6 WEKA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.7 Online Résumé Directory . . . . . . . . . . . . . . . . . . . . . 30
2.7.1 Indeed.com . . . . . . . . . . . . . . . . . . . . . . . . 30
2.7.2 Résumé Format . . . . . . . . . . . . . . . . . . . . . . 31
2.8 Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Analysis and Design 33
3.1 System Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.1 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.1.1 HTML Element Structure of Indeed Résumé Page . . . 34
3.1.1.2 HTML Element Structure of Searching Result Page Indeed . . . 37
3.1.1.3 Document for Classification Modeling . . . . 39
3.1.1.4 Document for Testing . . . . . . . . . . . . . 40
3.1.2 Document Classification with WEKA . . . . . . . . . . 40
3.1.2.1 Steps of Classification on WEKA . . . . . . . 40
3.1.2.2 Naïve Bayes Classifier on WEKA . . . . . . . 43
3.1.2.3 ARFF File . . . . . . . . . . . . . . . . . . . . 44
3.1.3 System Specifications . . . . . . . . . . . . . . . . . . 45
3.1.4 User Specifications . . . . . . . . . . . . . . . . . . . . 45
3.1.5 Development Tools . . . . . . . . . . . . . . . . . . . . 46
3.1.6 Description of System . . . . . . . . . . . . . . . . . . 46
3.1.6.1 Data Crawling Stage . . . . . . . . . . . . . . 46
3.1.6.2 Preprocessing Stage . . . . . . . . . . . . . . 47
3.1.6.3 Learning Stage . . . . . . . . . . . . . . . . . 48
3.1.6.4 Classification Stage . . . . . . . . . . . . . . 50
3.1.7 Class Analysis of System . . . . . . . . . . . . . . . . . 51
3.2 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2.1 Data Crawling . . . . . . . . . . . . . . . . . . . . . . 52
3.2.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . 55
3.2.3 ARFF File Converter . . . . . . . . . . . . . . . . . . . 56
3.2.4 Data Training/Modelling . . . . . . . . . . . . . . . . . 60
3.2.5 Data Classification . . . . . . . . . . . . . . . . . . . . 61
3.3 Testing Design . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3.1 Objective of Testing . . . . . . . . . . . . . . . . . . . 62
3.3.2 Scenario of Testing . . . . . . . . . . . . . . . . . . . . 63
4 Implementation and Testing 67
4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1.1 Data Crawling class . . . . . . . . . . . . . . . . . . . 67
4.1.1.1 Declaration of Variable . . . . . . . . . . . . 67
4.1.1.2 createFileFromIndeed method . . . . . . . . 68
4.1.1.3 fetchIndeed method . . . . . . . . . . . . . . 71
4.1.1.4 Print Function method . . . . . . . . . . . . . 73
4.1.1.5 Main method . . . . . . . . . . . . . . . . . . 73
4.1.2 Data Preprocessing class . . . . . . . . . . . . . . . . . 74
4.1.2.1 initStopWords method . . . . . . . . . . . . . 74
4.1.2.2 start method . . . . . . . . . . . . . . . . . . 74
4.1.2.3 removeNotLettersAndStopWords . . . . . . . 75
4.1.3 Training/Learning class . . . . . . . . . . . . . . . . . 75
4.1.3.1 loadDataset method . . . . . . . . . . . . . . 76
4.1.3.2 evaluate method . . . . . . . . . . . . . . . . 77
4.1.3.3 learn method . . . . . . . . . . . . . . . . . . 78
4.1.3.4 saveModel method . . . . . . . . . . . . . . . 78
4.1.4 Classication class . . . . . . . . . . . . . . . . . . . . 79
4.1.4.1 loadNewset method . . . . . . . . . . . . . . 79
4.1.4.2 loadModel method . . . . . . . . . . . . . . . 79
4.1.4.3 classifyDocument method . . . . . . . . . . . 80
4.1.5 ARFF File Converter class . . . . . . . . . . . . . . . . 82
4.1.5.1 convertText2Arff method . . . . . . . . . . . 82
4.1.5.2 checkNLoadFile method . . . . . . . . . . . . 82
4.1.5.3 filterFunction method . . . . . . . . . . . . . 84
4.1.5.4 prepareDataTrainTest method . . . . . . . . . 85
4.1.6 Timer class . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2 Testing Result . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2.1 Report of Data Crawling process . . . . . . . . . . . . 88
4.2.2 Report of Document Classification process . . . . . . . 90
4.2.3 Result of Atypical Types of Job . . . . . . . . . . . . . 90
4.2.3.1 With Proportion of Training and Testing data
set . . . . . . . . . . . . . . . . . . . . . . . . 90
4.2.3.2 Without Proportion of Training and Testing
data set (100% data set) . . . . . . . . . . . 94
4.2.4 Result of Typical Types of Job . . . . . . . . . . . . . . 95
4.2.4.1 With Proportion of Training and Testing data
set . . . . . . . . . . . . . . . . . . . . . . . . 95
4.2.4.2 Without Proportion of Training and Testing
data set (100% data set) . . . . . . . . . . . 100
4.2.5 Comparison Atypical and Typical types of job . . . . . 102
5 Concluding Remarks 103
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2 Suggestion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Bibliography 106
Appendix A Code Listing 109
Appendix B Stopwords List 135
List of Figures
2.1 Email filtering illustration . . . . . . . . . . . . . . . . . . . . 11
2.2 Document classification types . . . . . . . . . . . . . . . . . . 15
2.3 HTML DOM Tree of Objects . . . . . . . . . . . . . . . . . . . 29
3.1 An Example of Indeed résumé page . . . . . . . . . . . . . . . 35
3.2 Searching Result Page, with keyword web developer, and shows
the result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Searching Result Page, with next page link information at page
1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Searching Result Page, that shows change of link at page 2 . . 38
3.5 An example output of Evaluation process . . . . . . . . . . . . 42
3.6 An example output of Testing/Validation process . . . . . . . 43
3.7 Workow of System . . . . . . . . . . . . . . . . . . . . . . . 52
3.8 Folder Structure of System . . . . . . . . . . . . . . . . . . . . 57
4.1 Output that shows start of crawling process . . . . . . . . . . 89
4.2 Output that shows end of crawling process . . . . . . . . . . . 89
4.3 Output that shows data preprocessing is done . . . . . . . . . 89
List of Tables
2.1 Example weather data set for predicting play condition . . . . 20
2.2 Naïve Bayes model of example weather data set . . . . . . . . 20
2.3 Example new instance prediction, Evidence E . . . . . . . . . 21
2.4 Prediction result of Evidence E . . . . . . . . . . . . . . . . . 22
2.5 Example training data D0 - D5 . . . . . . . . . . . . . . . . . . 23
2.6 Example Naïve Bayes model for D0 - D5 . . . . . . . . . . . . 23
2.7 Example testing data Dt . . . . . . . . . . . . . . . . . . . . . 24
2.8 Prediction result of data Dt . . . . . . . . . . . . . . . . . . . . 26
2.9 Tag Form of HTML element . . . . . . . . . . . . . . . . . . . 27
3.1 Types of job . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2 Table Design of Accuracy Rate result with Proportion of data set 64
3.3 Table Design of Class Prediction Result with Proportion of data
set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4 Table Design of Accuracy Rate result with 100% of data set . . 65
3.5 Table Design of Class Prediction Result with 100% of data set 66
4.1 Result Accuracy Rate of Atypical group with 1000 documents 91
4.2 Result Class Prediction of Atypical group with 1000 documents 91
4.3 Result Accuracy Rate of Atypical group with 2500 documents 92
4.4 Result Class Prediction of Atypical group with 2500 documents 92
4.5 Result Accuracy Rate of Atypical group with 5000 documents 93
4.6 Result Class Prediction of Atypical group with 5000 documents 93
4.7 Result Accuracy Rate of Atypical group with whole data set . 94
4.8 Result Class Prediction of Atypical group with whole data set . 95
4.9 Result Accuracy Rate of Typical group with 1000 documents . 96
4.10 Result Class Prediction of Typical group with 1000 documents 96
4.11 Result Accuracy Rate of Typical group with 2500 documents . 97
4.12 Result Class Prediction of Typical group with 2500 documents 98
4.13 Result Accuracy Rate of Typical group with 5000 documents . 99
4.14 Result Class Prediction of Typical group with 5000 documents 99
4.15 Result Accuracy Rate of Typical group with whole data set . . 100
4.16 Result Class Prediction of Typical group with whole data set . 101
4.17 Comparison Table between Atypical and Typical types of job . 102
Chapter 1
Introduction
1.1 Background
Growth of information increases along with growth of technology. There are many ways to save information; the most familiar one is saving the information into a document, either in paper form or in digital form. A document is an article that contains information. At present, with the ease of technology use, people prefer the digital form of a document to the paper form, because the digital form gives many more advantages: it is easy to create, easy to use, and, along with the improvement of digital storage technology, easy to carry anywhere. A digital document is created using a computer and kept in the computer's digital storage, but documents can be saved not only on a local hard drive. Since the internet was introduced, documents can also be saved on the internet, through what is known as a document hosting service. There are many kinds of documents that can be hosted or posted on the internet, such as images, music files, or text documents such as job résumés or curricula vitae. A site that provides job résumé posting can be called an online résumé directory. There are many online résumé directories that allow users to post, search, review, or download résumé information; one of the best known is Indeed.com.
Searching for a job position is a difficult and tedious process for both employees and employers. There needs to be a good match between the work experience, skills, and education of the employee and the qualifications that the employer seeks. The qualification information of an employee is listed in a job résumé, also called a CV (Curriculum Vitae).
Most résumés contain the name of the résumé owner, domicile, a work experience list, education, skills, and additional information that should be known about the owner. It is a daunting process to go over a pile of résumés in order to find the right fit. It is not uncommon for an employer to quickly glance over a résumé before deciding whether the candidate is suitable for a job. Sometimes a résumé contains work experiences that are not similar to each other, depending on the work history of the owner, and most résumés do not have a title stating which specific job the résumé is for. That is why employers or hiring managers need to examine a résumé carefully to find out what type of job it belongs to, and this work needs to be done very carefully. But what if the hiring managers receive hundreds or thousands of job résumés? How could they categorize or classify such a pile of résumés by specific job? They could not do the résumé classification manually in a short time; this job takes a lot of time, so it should not be done the manual way, which costs both time and money.
A computer can recognize a specific document if it has been trained to recognize the pattern of the document. The pattern answers questions such as: How does the document look? What kind of information is stored in the document? Which words build that information? And what word features make the information identifiable as a specific kind of document? The way to train the computer to recognize this is by taking samples of documents to build the classifier model, and then taking other samples to test the classifier. If the classifier model is good enough at recognizing the pattern, the model can be used to classify new documents for which the kind of document and the information stored are not yet known. This kind of automated work is known as document classification in the machine learning context. As mentioned before, hiring managers and employers have a problem handling the job of classifying résumé documents, so a computer program or system can be built and trained to classify résumé documents automatically; this time-consuming job can then be done by a computer in a much shorter time than the manual way. Of course, the program or system must be trained first; it must be taught what the pattern of each specific type of résumé document looks like, so examples of each type of résumé document are needed to show and teach the program.
Since a job résumé is written by a person about his or her own experience, one person's résumé will contain different information from another person's, so a single example résumé cannot be used to train the program; many example résumés containing different kinds of information are needed. The more example documents that are given to teach the program, the smarter the program becomes compared to being given only one document. A question then arises: what if someone wants to build a document classification program but does not have as many documents as needed; that is, a researcher lacking a source of résumé documents wants to try building a résumé classification program? In this case, an online résumé directory on the internet, which provides free résumé documents from many people and for every type of job, can be used. As mentioned before, Indeed.com is one of the best known online résumé directories, so the Indeed site can be used as the data source. Because Indeed provides all résumé documents in web page format, all résumé documents stored in each Indeed page must be fetched and saved in a local computer directory as document files. The technique of obtaining all information from web pages is called web crawling (data crawling or data fetching).
To do classification, a classification model must be built. A classification model contains information about the patterns of documents of a specific type; this information is created by a classifier algorithm. Naïve Bayes is one of the most commonly used classifier algorithms for document classification. It is popular along with the other algorithms because of its successful history and its simplicity in classifying large amounts of documents. In this thesis, a system is developed that obtains résumé information from Indeed.com and saves all the information into document files, uses the documents to train a classifier, builds a model for résumé classification, and classifies new résumé documents whose type of job is still unknown.
1.2 Problem statement
The problems of this research are:
1. How to crawl an online résumé directory page to obtain job résumé documents, and save the information into document files?
2. How to train on the example résumé documents (labeled documents) to create a classification model that represents the pattern of the documents, using the Naïve Bayes classifier algorithm?
3. How to classify a new résumé document whose type is unknown into the right category using the classification model?
4. Given two groups of jobs, the Atypical types of job (types of job that are different from each other) and the Typical types of job (types of job that are similar to each other), which one is more suitable for building a good classification model?
5. And for those two groups of jobs, what is the accuracy of the classification model?
1.3 Scope of the research
In this research, example résumé documents are obtained from the Indeed.com site in order to learn the pattern of these résumé documents and to create the classification model used to classify, or predict, unlabeled résumé documents into the most appropriate category. A web crawler that obtains all résumé information from Indeed using the Jsoup tool is built, and the classification system is developed in the WEKA environment.
1.4 Objective of the research
The objective of this research is to build a system that can do web crawling to obtain résumé documents from the internet, and to build a classification system to classify résumé documents. The accuracy rate of this classification system is reported, showing how well the classification system predicts unlabeled documents.
1.5 Method of the research
Theoretical Study
Gather the basic knowledge and theories used in the system development.
Design and Analysis
Analyze the system to be built and design how it works, from data analysis and system specification to user specification and the system description. A testing scenario to test the quality of the system is also designed.
Implementation and Testing
Implement the system design to develop the classification system, and test whether the system works as expected from the design.
1.6 Structure of Thesis
This section gives a brief overview of the thesis organization. The thesis is organized as follows:
Chapter 1 : Introduction
Chapter 1 explains the background, problem statement, scope, objective of the thesis, and research method. The structure of the thesis is also explained briefly in this chapter.
Chapter 2 : Theoretical Background
Chapter 2 provides the relevant and necessary theoretical knowledge used to develop the internet résumé document classification system, so the reader is able to understand the rest of the thesis. By understanding this theoretical knowledge, the reader can follow the whole process of system development. It is not meant as a thorough explanation of all details regarding the topics; it also refers to books and articles (most of them from digital libraries).
Chapter 3 : Analysis and Design
Chapter 3 explains the design of the system to be made, including the analysis and the stages of system development.
Chapter 4 : Implementation and Testing
Chapter 4 shows how the approach was implemented and takes a closer look at the description and explanation of the code. The testing stage of the completed system is also shown and described.
Chapter 5 : Concluding Remarks
This chapter looks at the results and how well the implementations turned out. It also discusses what needs to be done in future work to improve the results.
Chapter 2
Theoretical Background
This chapter provides the relevant information necessary to understand the basic knowledge behind this thesis. For the development of the résumé classification system and the writing of this thesis, a variety of foundations are used, and they are described in this chapter to provide a strong basis for understanding each development technique. All information contained in this chapter is referenced to books, journals, articles, or electronic libraries, with citations that can be traced in the reference section at the end of this thesis, intended for readers who want to understand more clearly the specific information contained in this thesis. The purpose of this chapter is to give a basic introduction to the topics raised in this thesis, so that the thesis as a whole can be understood. The foundations of this thesis are text mining, the web crawling technique, and machine learning.
2.1 Text Mining
Text mining is defined as mining data in the form of text, where the source data is usually obtained from documents, and the goal is to find words that can represent the contents of the documents so that connections between documents can be analyzed (Mooney, 2007). Data Mining: Practical Machine Learning Tools and Techniques (Witten et al., 2011) defines it as: The process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic one. The data is invariably present in substantial quantities. With text mining, tasks associated with analyzing large amounts of text, discovering patterns, and extracting useful information from text can be done. Text mining can also be interpreted as the discovery of information that is new and not previously known by the computer, by automatically extracting information from different sources; the key in this process is to combine the information extracted from the various sources (Hearst, 2003).
Although the core of a classification system is the pattern discovery phase, in general the text mining process is divided into three (3) continuous main stages. The first stage is the initial preparation of the text, called text preprocessing. The second stage is the transformation of the text into a simpler form (text transformation/feature extraction); it is more widely known as feature extraction, and as the term suggests, its aim is to find the keywords or features of a text, the core of the text, much like its headline. The third stage is pattern discovery. The following is a brief description of each stage:
2.1.1 Text Preprocessing
The initial stage of text mining is text preprocessing, whose goal is to prepare the text into data that will be processed in the next stage. The actions performed at this stage range from complex ones, such as part-of-speech (POS) tagging and parse tree construction, to simple ones, such as simple parsing of the text, breaking a sentence down into a set of words, known as text tokenizing. A case folding action, converting all capital letters into lowercase letters, is also conducted at this stage. Case folding is applied to all words contained in the text, without exception, including capitals in names, terms, or acronyms, considering that in the next stage of feature extraction, name words, terms, or acronyms are generally regarded as too unique and invalid as features of a text. For example, among 100 résumé texts there is only a small chance that even 5 texts contain the name word Damai Subimawanto; if any exist, the possibility that this name word can be used as a feature of a text is too small and not relevant.
The part-of-speech (POS) tagging action parses all sentences in the text and then assigns a role to each word, for example: director (as subject), go (as predicate), to (as conjunction), office (as adverb of place). The result of POS tagging can be used to build a parse tree, in which each sentence stands as an independent tree, with each word that builds the sentence as a branch.
In a simple parsing process, no parse tree is built with POS tagging. The system breaks the text down into a group of words (tokenizing), and the group of words is then taken as input to the next stage of the text mining process.
2.1.1.1 Text Tokenizing
Text tokenizing is the action of cutting or breaking down input in sentence form into the smaller word parts that construct the sentence (parsing) (Mooney, 2007). For example, take the phrase "this thesis gives an explanation about crawling data process from indeed directory". If the sentence goes through the tokenizing process, tokens are produced for every word contained in it. The result of tokenizing the sentence is the words "this", "thesis", "gives", "an", "explanation", "about", "crawling", "data", "process", "from", "indeed", and "directory". The tokenizing process can also eliminate duplicate words, so no matter how often the word "thesis" occurs in a sentence, it is counted as only a single token.
2.1.2 Text Transformation/Feature Extraction
After passing the preprocessing stage, the input sentence is no longer in complete sentence form, but has become a collection of words. In the feature extraction stage, the collection of words obtained from the preprocessing stage goes through extraction. It is called an extraction stage because the words obtained before go through a selection process based on the nature and weight of each word in the sentence it builds. The selection process is done by reducing the number of words whose existence is considered not to affect the quality of the sentence; in other words, the presence or absence of such a word does not really affect the quality of the sentence.
Personal pronouns, possessive pronouns, and conjunctions are words that can be classified as removable, referred to as stopwords in the text filtering process. Besides this, the selection process can also reduce words that have the same base form, in the text stemming process. Here is a further explanation of the processes in the feature extraction stage.
2.1.2.1 Text Filtering
The text filtering process removes stopwords from the collection of words obtained in the tokenizing stage and, instead, keeps only the important words. As described previously, stopwords are a collection of words whose existence does not affect the quality of sentence formation, such as personal pronouns, possessive pronouns, and conjunctions. The quality of filtering is determined by the number of words included in the stopword list: the more words in the stopword list, the better the quality of the results of the text filtering process, and the better the quality of the final results of text mining. But keep in mind that adding words to the stopword list should be done as strictly as possible, to avoid errors in the later pattern discovery stage caused by word misclassification, that is, classifying important words as stopwords. The stopword list differs for each language; a stopword list of English words obviously cannot be used to filter words in Indonesian, and vice versa. This dependence of stopwords on language is often said to be a weakness of the filtering process, but filtering remains in use because it greatly reduces the workload of the system.
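A minimal sketch of this filtering step in plain Java (the stopword list below is a tiny illustrative subset, not the full list from Appendix B):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TextFilter {
    // Tiny illustrative stopword list; the real system loads a much
    // longer list from a file (see Appendix B).
    private static final Set<String> STOPWORDS = new HashSet<String>(
        Arrays.asList("this", "an", "about", "from", "the", "a"));

    // Keeps only the tokens that do not appear in the stopword list.
    public static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String t : tokens) {
            if (!STOPWORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter(Arrays.asList(
            "this", "thesis", "gives", "an", "explanation")));
        // prints: [thesis, gives, explanation]
    }
}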
2.1.3 Pattern Discovery
The final stage of text mining is the pattern discovery stage, the most important stage of the entire text mining process. Based on the two previous stages, the preprocessing stage and the feature extraction stage, the collection of words acquired is processed at this stage to determine how far the words in a text are related to one another (Mooney, 2007); in other words, the pattern discovery stage aims to find patterns or knowledge from the whole text. The pattern gives the idea that, if the words A, B, and C occur in a text, the text can be predicted to be of class X.
To be able to predict, the pattern discovery stage requires learning techniques to read patterns from a text and apply them to a new text whose pattern is not yet known. In the discussion of text mining, there are three (3) learning techniques for the pattern discovery stage: Supervised, Unsupervised, and Semi-supervised learning. The difference between the three is whether the training dataset is labeled (given classes) or not; the dataset meant here is the collection of texts processed by the text mining process. These three techniques are learning techniques in machine learning algorithms. Since this thesis uses the classification method, which is a form of the supervised learning technique, the next section explains machine learning, to give a better understanding of the various techniques and algorithms used in it.
2.2 Machine Learning
Machine learning is a subfield of artificial intelligence (AI) concerned with algorithms that allow computers to learn. What this means, in most cases, is that an algorithm is given a set of data and infers information about the properties of the data, and that information allows it to make predictions about other data that it might see in the future. This is possible because almost all nonrandom data contains patterns, and these patterns allow the machine to generalize. In order to generalize, it trains a model with what it determines are the important aspects of the data (Segaran, 2007).
To understand how models come to be, consider a simple example from the otherwise complex field of email filtering. Suppose we receive a lot of spam that contains the words "free résumé search" or "win lottery". As human beings, we are well equipped to recognize patterns, and we quickly determine that any message with the words "free résumé search" or "win lottery" is spam and should be moved directly to the trash. From there, we have created a mental model of what spam is: which messages are spam and which are not. In the same way that we determine which messages are spam, a machine learning algorithm designed to filter spam should also be able to learn what spam looks like and determine which messages in our email storage are spam and which are not. Figure 2.1 shows an illustration of the model.
Figure 2.1 Email filtering illustration
Source: Text Categorization, Raymond J. Mooney, 2007
However, machine learning is not initially equipped with the ability to recognize patterns and cannot determine whether an email is spam or not. Therefore, machine learning needs several important things to be able to do spam filtering, as follows:
1. Learning Algorithm: what kind of learning technique will be used to do the job?
2. Classifier Model: what kind of classifier will be applied to train on the training data? If there is any feature that could make the classifier's output more precise, what feature is it?
3. Training Data: a set of data consisting of input vectors and answer vectors, used to train the knowledge database together with the classifier model. In the spam filtering model, training data can be set up from a collection of spam and non-spam emails, so the machine learning can find out the patterns of spam and non-spam email.
4. Unknown/New Data: actual data for which it is not yet known whether it is a spam or non-spam email, to be tested on the already trained machine learning model.
There are many different machine learning algorithms, all with different strengths and suited to different types of problems: Naïve Bayes, Support Vector Machine, K-means, Fuzzy Logic, and many more. Generally, all machine learning algorithms are divided into three major parts, based on the learning technique used: the Supervised, Unsupervised, and Semi-supervised learning techniques. The difference between these three types of learning technique is whether or not the training data has been hand-labeled (by humans) to generate the output of the algorithm's classifier.
2.2.1 Learning Technique
2.2.1.1 Supervised Learning
Supervised learning assumes that a set of labeled training data has been provided, labeled by a qualified person. A classifier first receives training data in which each item is marked with a label, or what we can call a class, from a discrete finite set (sometimes these labels may be related through a taxonomic structure, such as a hierarchical topic catalog). Then the learning algorithm is trained using this data. Once the classifier is trained, it is given unlabeled test data and has to guess the labels. The more labeled instances in the data the classifier gets, the more precise its output will be. The goal of supervised learning is to predict values from a model for valid input data, after training on the training data. Regression and classification are examples of supervised learning: regression applies when the output is a continuous value, while classification occurs when the output is a particular value of a goal attribute (not continuous). This thesis uses classification, because the desired output is the particular class of the classified data.
2.2.1.2 Unsupervised Learning
Unsupervised learning, on the other hand, has no labeled training data; it attempts to find similar patterns in the data to determine the output. This type of learning needs nobody for the training process. The goal of unsupervised learning is to have the computer learn how to do something without being told how to do it. Clustering is an example of unsupervised learning.
2.2.1.3 Semi-supervised Learning
Semi-supervised learning is a combination that uses both labeled and unlabeled training data. This kind of learning is actually supervised learning that avoids labeling a large number of instances. This is done by using some of the labeled data to help the classifier label the unlabeled data; this automatically labeled data is then also used in the training process.
2.3 Document Classification
Document classification, or simply classification, is one of the supervised learning techniques. As a supervised learning technique, document classification aims to find the pattern (pattern discovery) of a document, a process that requires labeled training data, in order to produce a machine learning model. A typical classification problem can be stated as follows: given a set of labeled examples belonging to two or more classes (the training data), we classify a new test sample into the class with the highest similarity. Document classification can often be viewed as a two-class classification problem, where a document is labeled as belonging to the relevant or the non-relevant class. User feedback provides a set of training examples with positive and negative labels, and a document is presented to the user if it is classified into the relevant class (Li and Jain, 2000).
The documents to be classified may be text, images, music, etc., and each kind of document possesses its own special classification problems. When not otherwise specified, text classification is implied. Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year, etc.). There are two main philosophies of subject classification of documents: the content-based approach and the request-based approach. Content-based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned; for example, it is a rule in much library classification that at least 20% of the content of a book should be about the class to which the book is assigned, and in automatic classification it could be the number of times given words appear in a document. Request-based classification (or indexing) is classification in which the anticipated requests from users influence how documents are being classified (Wik, 2013).
This thesis builds a document classification system with many classes, i.e., multi-class. The document classification system built here aims to classify the (text-based) résumé documents obtained from an online résumé directory, and of course the training data must first be prepared and labeled with the particular classes. Chapter 3 explains that the classification system consists of at least 5 (five) classes; for example, the document classification system should be able to classify a résumé document into an Accounting, ComputerIT, Manager, Medical, or Technician job.
2.3.1 Classification by Hand
Classification has been done by hand (by humans) since the beginning. A person had to either have enough background information about a document from before, or go through the document and try to figure out what class it belonged to. Librarians and other similar professions have been doing this for as long as classification has been around, and it is very manual, labor-intensive work. It is relatively effective as long as the number of documents is small, or you have small amounts of documents to classify at any given time, but as soon as the amount increases substantially it becomes less and less cost-effective.
The case in this thesis involves classifying thousands of documents, in the most effective way possible, and thus machine learning is needed to achieve this.
2.3.2 Classification with Machine Learning
The system built in this thesis can be categorized as classification because we want a system that can predict the class, that is, the kind of job field, of a résumé document.
Figure 2.2 Document classification types
Source: Evaluating the Use of Learning Algorithms in Categorization of
Text, Alf Simen Nygaard Sorensen, 2012
Some classification problems allow documents to be classified into only two categories, like positive or negative; this is called binary classification. On the other hand (as in our case), there may be more than two classes, which is called multi-class classification (Sorensen, 2012). The documents we classify may be categorized within just one class, because we want to restrict the résumé classification to job fields that are far different from each other. In addition, however, we test our classification system on predicting job field classes that are similar to each other; we want to know whether our classification system can predict documents without being distracted by the similarity of the classes. An illustration of the classification types can be seen in Figure 2.2.
Machine learning-based document classification is automated document classification, because the set of rules, or more generally the decision criterion of the document classifier, is learned automatically from training data. Remember, though, that the need for hand classification is not eliminated, because the training documents come from a person who has labeled them (Cam, 2008b).
Supervised learning is used to create a classification model of documents by using already classified, or labeled, documents. The resulting model can then be used to automatically determine the class of new documents that have not yet been labeled with any class. The steps in classification with supervised learning are (Sorensen, 2012):
1. Data collection and preprocessing. All documents that are already labeled with classes are collected in this step; collecting already labeled documents can also be called crawling. After this, a preprocessing step cleans the documents (preprocessing was explained in the text mining section earlier in this chapter). Lastly, the collection of documents is divided into two subsets:
(a) A training set. The training data is used to create the model and may itself be divided into two subsets: one part actually creates the model, and the other is used to fine-tune the parameters for model learning.
(b) A test set. This is unknown, unlabeled, or new data, used to test whether the model is good enough.
2. Building the model. This step is where you actually learn or train the model with the use of a learning algorithm. It is often an iterative process in which you fine-tune the parameters of the feature selection and the algorithm. Steps to iterate:
(a) Apply an evaluator that chooses the appropriate features.
(b) Apply the learning algorithm to get a model.
(c) Validate the model learned from the training data.
3. Testing the model. In this step, the model is applied to the test set from the previous step, and the predicted classes are compared with the actual classes of the documents. It should be noted that here the classes of the documents are only used for evaluation and not used by the learning algorithm as in the model building step.
4. Classification of new documents. When the model is considered good enough, it can be used to classify new documents whose class is unlabeled or unknown. A WEKA-based sketch of steps 2 and 3 is shown below.
2.4 Naïve Bayes Algorithm
To build a classification model, an algorithm is needed, and the choice of which specific learning algorithm to use is a critical step. Once preliminary testing is judged to be satisfactory, the classifier (the term we use for a learning algorithm model that maps unlabeled documents to classes) is available for routine use. A classifier's evaluation is most often based on prediction accuracy (the number of correct predictions divided by the total number of predictions). As already mentioned, there are many different learning algorithms, all with different strengths and suited to different types of problems. A learning algorithm cannot handle all types of problems, and conversely, a problem may not be handled by every algorithm; or, if it can be, the result may not be good enough to justify the cost.
The Naïve Bayes (NB) algorithm is one of the supervised learning algorithms, along with Support Vector Machine, k-Nearest Neighbor, and Logistic Regression, and it is commonly used as a classifier in document classification, especially text classification, where NB really shines. The reason we choose the NB algorithm as our résumé classifier relates to NB's popularity in the text classification field. What makes the NB classifier so popular? NB is one of the simplest classifiers that one can use, because of the simple mathematics involved (Vryniotis, 2013).
Naïve Bayes's main strength is its efficiency: training and classification can be accomplished with one pass over the data. Because it combines efficiency with good accuracy, it is often used as a baseline in text classification research (Cam, 2008a). In general, text classification can be done better with more specialized algorithms; even Support Vector Machines can give more accurate classification results (with huge or small document sets). However, the NB classifier is general-purpose, simple to implement, and good enough for most applications (Nedelcu, 2012). Even though we just use the ready-to-use NB classifier from WEKA's package, so that we need not implement the NB algorithm ourselves later in the classification system development, it is not a bad idea to talk about the algorithm briefly.
One of the main reasons the NB model works well for the text domain is that the evidence consists of the vocabulary, the words appearing in texts, and the size of the vocabulary is typically in the range of thousands. The large size of the evidence (the vocabulary) makes the NB model work well for the text classification problem. The following explanation of the Naïve Bayes classifier is from (Haruechaiyasak, 2008).
2.4.1 Explanation of Bayes's Rule
Naïve Bayes is a simple probabilistic classifier based on applying Bayes's theorem (or Bayes's rule) with strong independence (naïve) assumptions.
Bayes's Rule:
P(H | E) = [P(E | H) × P(H)] / P(E)    (2.1)
The basic idea of Bayes's rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed. From Bayes's rule, we have:
1. The prior probability of H, P(H): the probability of the event before the evidence is observed.
2. The posterior probability of H, P(H | E): the probability of the event after the evidence is observed.
Example case 1: To predict the chance or probability of raining, some evidence such as the amount of dark cloud in the area is usually used.
Let H be the event of raining and E be the evidence of dark cloud; then we have:
P(raining | dark cloud) = [P(dark cloud | raining) × P(raining)] / P(dark cloud)
1. P(dark cloud | raining) is the probability of dark cloud when it rains. Of course, dark clouds can occur in many other events, such as an overcast day or a forest fire, but we only consider dark clouds in the context of the event raining. This probability can be obtained from historical data recorded by meteorologists.
2. P(raining) is the prior probability of raining. This probability can be obtained from statistical records, for example, the number of rainy days throughout a year.
3. P(dark cloud) is the probability of the evidence dark cloud occurring. Again, this can be obtained from statistical records, but evidence is usually not as well recorded as the main event; therefore the full evidence, e.g. P(dark cloud), is sometimes hard to obtain.
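To get a concrete feel for equation 2.1, take some illustrative numbers (assumed here for the sake of the example, not taken from any weather record): if P(dark cloud | raining) = 0.8, P(raining) = 0.3, and P(dark cloud) = 0.4, then
P(raining | dark cloud) = (0.8 × 0.3) / 0.4 = 0.6
so observing dark cloud raises the probability of raining from the prior value of 0.3 to 0.6.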
2.4.2 Explanation of Naïve Bayes
As seen in Example case 1, the outcome of an event can be predicted by observing some evidence. Generally, it is better to have more than one piece of evidence to support the prediction of an event: typically, the more evidence that can be gathered, the better the classification accuracy that can be obtained. But the evidence must relate to the event (it must make sense). For example, if you add the evidence "earthquake" to Example case 1, the model might yield worse performance. This is because "raining" is not related to the evidence "earthquake": if there is an earthquake, it does not mean that it will rain, and vice versa.
When we have more than one piece of evidence for building our NB model, we can run into the problem of dependencies; that is, some evidence may depend on one or more other pieces of evidence. For example, the evidence "dark cloud" directly depends on the evidence "high humidity". However, including dependencies in the model would make it very complicated, because one piece of evidence can depend on many other pieces. To make our life easier, we assume that all evidence is independent of each other (this is why the model is called naïve).
Bayes's rule for multiple evidences:
P(H | E1, E2, ..., En) = [P(E1, E2, ..., En | H) × P(H)] / P(E1, E2, ..., En)    (2.2)
With the independence assumption, we can rewrite Bayes's rule from equation 2.2 as follows:
P(H | E1, E2, ..., En) = [P(E1 | H) × P(E2 | H) × ... × P(En | H) × P(H)] / P(E1, E2, ..., En)    (2.3)
Example case 2: From Example case 1, we now have the following NB model for raining:
P(raining | dark cloud, wind speed, humidity) = [P(dark cloud | raining) × P(wind speed | raining) × P(humidity | raining) × P(raining)] / P(dark cloud, wind speed, humidity)
Example case 3: This example shows how to build an NB model, given a weather data set for predicting the play condition. There are 14 documents and 5 features; in machine learning terms, a document is called an instance and a feature is called an attribute. All attributes are nominal.
Table 2.1 Example weather data set for predicting play condition
Instance outlook temperature humidity windy play
1 sunny hot high false no
2 sunny hot high true no
3 overcast hot high false yes
4 rainy mild high false yes
5 rainy cool normal false yes
6 rainy cool normal true no
7 overcast cool normal true yes
8 sunny mild high false no
9 sunny cool normal false yes
10 rainy mild normal false yes
11 sunny mild normal true yes
12 overcast mild high true yes
13 overcast hot normal false yes
14 rainy mild high true no
The NB model built from the data set in table 2.1 is shown below:
Table 2.2 Nave Bayes model of example weather data set
Outlook Temperature Humidity Windy Play
Yes No Yes No Yes No Yes No Yes No
Sunny 2 3 Hot 2 2 High 3 4 False 6 2 9 5
Overcast 4 0 Mild 4 2 Normal 6 1 True 3 3
Rainy 3 2 Cool 3 1
Sunny 2/9 3/5 Hot 2/9 2/5 High 3/9 4/5 False 6/9 2/5 9/14 5/14
Overcast 4/9 0/5 Mild 4/9 2/5 Normal 6/9 1/5 True 3/9 3/5
Rainy 3/9 2/5 Cool 3/9 1/5
There are two parts in table 2.2, an upper and a lower part. The upper part of the table contains the frequencies of the different evidences; for example, there are 2 instances in the data set showing (outlook = sunny) when (play = yes). When all the frequencies are counted, the NB model can be created by calculating all P(E | H) and P(H). For example,
P(outlook=sunny | play=yes) = 2/9
P(play=yes) = 9/14
With the NB model, we can predict the class "play" based on different sets of evidence. For example, if we observe (outlook=sunny), (temperature=cool), (humidity=high), and (windy=true), then we can estimate the posterior probability (likelihood) as follows (using equation 2.3):
Table 2.3 Example new instance prediction, Evidence E
Outlook Temperature Humidity Windy Play
Sunny Cool High True ?
With E = all attributes (Outlook, Temperature, Humidity, and Windy); we
could ignore Pr[E] because we only need to compare the values of the classes
yes and no relative to each other:

\[ \Pr[\text{play} \mid E] = \frac{\Pr[\text{outlook} \mid \text{play}]\,\Pr[\text{temperature} \mid \text{play}]\,\Pr[\text{humidity} \mid \text{play}]\,\Pr[\text{windy} \mid \text{play}]\,\Pr[\text{play}]}{\Pr[E]} \]
Posterior score of class yes:

\[ \Pr[\text{yes} \mid E] = \frac{\Pr[\text{outlook=sunny} \mid \text{yes}]\,\Pr[\text{temperature=cool} \mid \text{yes}]\,\Pr[\text{humidity=high} \mid \text{yes}]\,\Pr[\text{windy=true} \mid \text{yes}]\,\Pr[\text{yes}]}{\Pr[E]} \]

\[ \Pr[\text{yes} \mid E] \propto \frac{2}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{9}{14} = 0.0053 \]
Posterior score of class no:

\[ \Pr[\text{no} \mid E] = \frac{\Pr[\text{outlook=sunny} \mid \text{no}]\,\Pr[\text{temperature=cool} \mid \text{no}]\,\Pr[\text{humidity=high} \mid \text{no}]\,\Pr[\text{windy=true} \mid \text{no}]\,\Pr[\text{no}]}{\Pr[E]} \]

\[ \Pr[\text{no} \mid E] \propto \frac{3}{5} \times \frac{1}{5} \times \frac{4}{5} \times \frac{3}{5} \times \frac{5}{14} = 0.0206 \]
To get a probability from the posterior scores, we could normalize as
follows:

\[ P(\text{play}_1) = \frac{\Pr[\text{play}_1 \mid E]}{\Pr[\text{play}_1 \mid E] + \Pr[\text{play}_2 \mid E]} \]

Probability of class yes:

\[ P(\text{yes}) = \frac{\Pr[\text{yes} \mid E]}{\Pr[\text{yes} \mid E] + \Pr[\text{no} \mid E]} = \frac{0.0053}{0.0053 + 0.0206} = 0.205 \]
Probability of class no:

\[ P(\text{no}) = \frac{\Pr[\text{no} \mid E]}{\Pr[\text{yes} \mid E] + \Pr[\text{no} \mid E]} = \frac{0.0206}{0.0053 + 0.0206} = 0.795 \]
Because the probability of class no is higher than that of yes, we conclude
that the class for this instance is no. The result of predicting the event play
with attributes outlook = sunny, temperature = cool, humidity = high, and
windy = true is shown in table 2.4.
Table 2.4 Prediction result of Evidence E
Outlook Temperature Humidity Windy Play
Sunny Cool High True NO
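The whole calculation above can be reproduced in a few lines of Java. This is
only a sketch of the arithmetic, with the counts of table 2.2 hard-coded; it is
not the implementation used in this thesis.

public class WeatherNaiveBayes {
    public static void main(String[] args) {
        // Conditional probabilities from table 2.2 for the evidence
        // (outlook=sunny, temperature=cool, humidity=high, windy=true)
        double yes = (2.0 / 9) * (3.0 / 9) * (3.0 / 9) * (3.0 / 9) * (9.0 / 14);
        double no  = (3.0 / 5) * (1.0 / 5) * (4.0 / 5) * (3.0 / 5) * (5.0 / 14);

        // Normalize so the two scores sum to 1 (Pr[E] cancels out)
        double pYes = yes / (yes + no);
        double pNo  = no  / (yes + no);

        System.out.printf("Pr[yes|E] = %.4f, Pr[no|E] = %.4f%n", yes, no); // 0.0053, 0.0206
        System.out.printf("P(yes) = %.3f, P(no) = %.3f%n", pYes, pNo);     // 0.205, 0.795
    }
}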
2.4.3 Naïve Bayes Classifier for Text Classification
The following is an example of text classification using an NB model.
Consider the data set below. We have 6 documents, D0-D5, as the
training data. Suppose we extract and consider only 6 vocabulary words from
all documents. There are two classes of documents: medical and technician.
The documents are preprocessed and shown in table 2.5. The numbers are the
frequencies of the words in the documents; for example, the word bio occurs
twice in document D0.
1. Phase 1: Building the NB model
The NB model for the table 2.5 data set is shown in table 2.6, where:
|V| = the number of vocabulary words
P(c_i) = the prior probability of each class = (number of documents in
the class) / (number of all documents)
Table 2.5 Example training data D0 - D5
Training Data bio medicine chemical operator engine machine Class
D0 2 1 3 0 0 1 Medical
D1 1 1 1 0 0 0 Medical
D2 1 1 2 0 1 0 Medical
D3 0 1 0 2 1 1 Technician
D4 0 0 1 1 1 0 Technician
D5 0 0 0 2 2 0 Technician
Table 2.6 Example Naïve Bayes model for D0 - D5

|V| = 6

Class      P(c_i)  n_i  P(bio|c_i)  P(medicine|c_i)  P(chemical|c_i)  P(operator|c_i)  P(engine|c_i)  P(machine|c_i)
Medical    0.5     15   0.2381      0.1905           0.3333           0.0476           0.0952         0.0952
Technician 0.5     12   0.0556      0.1111           0.1111           0.3333           0.2778         0.1111
P(Medical) = 3/6 = 0.5
P(Technician) = 3/6 = 0.5

n_i = the total word frequency of each class
n_Medical = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_Technician = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12

P(w_i | c_i) = the conditional probability of a keyword occurring given a
class. For example,
P(bio | Medical) = (2 + 1 + 1) / 15 = 4/15
P(bio | Technician) = (0 + 0 + 0) / 12 = 0/12
The problem here is that if a particular feature/word does not appear in a
particular class, its conditional probability is equal to 0. If we use the first
decision method (product of probabilities) the product becomes 0, while if
we use the second decision method (sum of logarithms) log(0) is undefined.
To avoid this zero-frequency problem, we use add-one or Laplace smoothing,
adding 1 to each count:
\[ P(w \mid c) = \frac{T_{cw} + 1}{n_c + |V|} \tag{2.4} \]

where T_{cw} is the number of occurrences of word w in documents of class
c, n_c is the total word count of class c, and |V| is the number of vocabulary
words.
We apply Laplace smoothing by assuming a uniform distribution over
all words as follows, with |V| = 6:
P(bio | Medical) = (2 + 1 + 1 + 1) / (15 + |V|) = 5 / (15 + 6) = 5 / 21 = 0.2381
P(medicine | Medical) = (1 + 1 + 1 + 1) / (15 + 6) = 4 / 21 = 0.1905
P(chemical | Medical) = (3 + 1 + 2 + 1) / (15 + 6) = 7 / 21 = 0.3333
P(operator | Medical) = (0 + 0 + 0 + 1) / (15 + 6) = 1 / 21 = 0.0476
P(engine | Medical) = (0 + 0 + 1 + 1) / (15 + 6) = 2 / 21 = 0.0952
P(machine | Medical) = (1 + 0 + 0 + 1) / (15 + 6) = 2 / 21 = 0.0952
P(bio | Technician) = (0 + 0 + 0 + 1) / (12 + |V|) = 1 / (12 + 6) = 1 / 18 =
0.0556
P(medicine | Technician) = (1 + 0 + 0 + 1) / (12 + 6) = 2 / 18 = 0.1111
P(chemical | Technician) = (0 + 1 + 0 + 1) / (12 + 6) = 2 / 18 = 0.1111
P(operator | Technician) = (2 + 1 + 2 + 1) / (12 + 6) = 6 / 18 = 0.3333
P(engine | Technician) = (1 + 1 + 2 + 1) / (12 + 6) = 5 / 18 = 0.2778
P(machine | Technician) = (1 + 0 + 0 + 1) / (12 + 6) = 2 / 18 = 0.1111
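Equation 2.4 with the word counts of table 2.5 can be checked with a short
program. This is an illustrative sketch with the per-class count arrays
hard-coded; the variable names are ours, not WEKA's.

public class LaplaceSmoothing {
    public static void main(String[] args) {
        // Word counts per class; columns: bio, medicine, chemical, operator, engine, machine
        int[] medical    = {4, 3, 6, 0, 1, 1};   // summed over D0-D2 (n_Medical = 15)
        int[] technician = {0, 1, 1, 5, 4, 1};   // summed over D3-D5 (n_Technician = 12)
        int vocabSize = 6;                        // |V|

        System.out.println("P(w|Medical)  P(w|Technician) with add-one smoothing:");
        for (int w = 0; w < vocabSize; w++) {
            double pMed  = (medical[w] + 1.0) / (sum(medical) + vocabSize);     // (T_cw+1)/(n_c+|V|)
            double pTech = (technician[w] + 1.0) / (sum(technician) + vocabSize);
            System.out.printf("word %d: %.4f  %.4f%n", w, pMed, pTech);
        }
    }

    static int sum(int[] a) { int s = 0; for (int x : a) s += x; return s; }
}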
2. Classifying a test document

Table 2.7 Example testing data D_t

Test Doc bio medicine chemical operator engine machine Class
D_t      2   1        2        0        0      1       ?
To classify a test document D_t (table 2.7), we have to calculate the
posterior probability P(c_i | W) for each class as follows:

\[ P(c_i \mid W) = P(c_i) \prod_{j=1}^{|V|} P(w_j \mid c_i) \tag{2.5} \]
\[ c_{\text{map}} = \arg\max_{c} P(c \mid W) = \arg\max_{c} \left( P(c) \prod_{j=1}^{|V|} P(w_j \mid c) \right) \tag{2.6} \]
\[ P(\text{Medical} \mid W) = P(\text{Medical}) \times P(\text{bio} \mid \text{Medical})^2 \times P(\text{medicine} \mid \text{Medical})^1 \times P(\text{chemical} \mid \text{Medical})^2 \times P(\text{operator} \mid \text{Medical})^0 \times P(\text{engine} \mid \text{Medical})^0 \times P(\text{machine} \mid \text{Medical})^1 \]
\[ = 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1 \]
\[ = 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952 = 5.7 \times 10^{-5} \]

(the exponents are the word frequencies of D_t)
\[ P(\text{Technician} \mid W) = P(\text{Technician}) \times P(\text{bio} \mid \text{Technician})^2 \times P(\text{medicine} \mid \text{Technician})^1 \times P(\text{chemical} \mid \text{Technician})^2 \times P(\text{operator} \mid \text{Technician})^0 \times P(\text{engine} \mid \text{Technician})^0 \times P(\text{machine} \mid \text{Technician})^1 \]
\[ = 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1 \]
\[ = 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111 = 2.27 \times 10^{-7} \]
Since P(Medical | W) has the highest value, D_t is classified into Medical.
This makes sense because D_t contains many words related to the medical
field, such as bio, medicine, and chemical.
3. Underflow Prevention
As the example shows, the posterior probability values are very small.
Typically the number of conditional probabilities is in the range of
thousands or more (the number of words appearing in a document collection),
so the product becomes too small for computers to handle (floating-point
underflow). We would end up with a number so small that it cannot be
represented and is rounded to zero, rendering the analysis useless. This
problem is referred to as the underflow problem. To avoid it, instead of
maximizing the product of the probabilities we maximize the sum of their
logarithms, as shown in equation 2.7.
\[ c_{\text{map}} = \arg\max_{c} \left( \log P(c) + \sum_{j=1}^{|V|} \log P(w_j \mid c) \right) \tag{2.7} \]
\[ \log P(\text{Medical} \mid W) = \log\left(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1\right) \]
\[ = \log(0.5) + 2\log(0.2380) + 1\log(0.1904) + 2\log(0.3333) + 0\log(0.0476) + 0\log(0.0952) + 1\log(0.0952) \]
\[ = -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437 \]
\[ \log P(\text{Technician} \mid W) = \log\left(0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1\right) \]
\[ = \log(0.5) + 2\log(0.0555) + 1\log(0.1111) + 2\log(0.1111) + 0\log(0.3333) + 0\log(0.2777) + 1\log(0.1111) \]
\[ = -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289 \]
Again, since P(Medical | W) has the higher value, D_t is classified into
Medical; the result is shown in table 2.8.

Table 2.8 Prediction result of data D_t

Test Doc bio medicine chemical operator engine machine Class
D_t      2   1        2        0        0      1       Medical
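Putting the pieces together, the log-space scoring of equation 2.7 for the test
document D_t can be sketched as follows. The probabilities are the smoothed
values computed above, and the exponents are the word frequencies of D_t;
this is a hand-rolled illustration, not WEKA code.

public class LogSpaceNB {
    public static void main(String[] args) {
        int[] dt = {2, 1, 2, 0, 0, 1};  // word frequencies of the test document D_t

        double[] pMedical    = {0.2381, 0.1905, 0.3333, 0.0476, 0.0952, 0.0952};
        double[] pTechnician = {0.0556, 0.1111, 0.1111, 0.3333, 0.2778, 0.1111};

        double scoreMed  = score(0.5, pMedical, dt);
        double scoreTech = score(0.5, pTechnician, dt);

        System.out.printf("log P(Medical|W)    = %.4f%n", scoreMed);   // about -4.24
        System.out.printf("log P(Technician|W) = %.4f%n", scoreTech);  // about -6.63
        System.out.println(scoreMed > scoreTech ? "-> Medical" : "-> Technician");
    }

    // log P(c) + sum_j freq_j * log P(w_j|c), base-10 logs as in the text
    static double score(double prior, double[] cond, int[] freq) {
        double s = Math.log10(prior);
        for (int j = 0; j < cond.length; j++) s += freq[j] * Math.log10(cond[j]);
        return s;
    }
}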
2.5 Crawling the Web
Web crawlers are an essential component of search engines, and running a
web crawler is a challenging task. There are tricky performance and
reliability issues and, even more importantly, there are social issues.
Crawling is the most fragile application since it involves interacting with
hundreds of thousands of web servers and various name servers, all beyond
the control of the system. Web crawling speed is governed not only by the
speed of one's own Internet connection, but also by the speed of the sites
to be crawled. Especially when crawling a site from multiple servers, the
total crawling time could be significantly reduced if many downloads are
done in parallel.
Despite the numerous applications for Web crawlers, at the core they are
all fundamentally the same. The following is the process by which Web
crawlers work:
1. Download the web page.
2. Parse through the downloaded page and retrieve all the links.
3. For each link retrieved, repeat the process.
A web crawler could be used for crawling through a whole site on the
Internet. One specifies a start URL, and the crawler follows all links found
in that HTML page. This usually leads to more links, which are followed
again, and so on. A site can be seen as a tree structure: the root is the
start URL, and all links in that root HTML page are direct children of the
root (Peshave, 2010). A minimal sketch of this loop is given below.
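The three-step process above maps almost directly onto code. Below is a
minimal breadth-first sketch using the jsoup library (introduced in section
2.5.3); the start URL, the single-site restriction, and the 100-page stop
condition are arbitrary illustrative choices, not part of any cited design.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {
    public static void main(String[] args) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add("http://www.indeed.com/resumes/web-developer");

        while (!frontier.isEmpty() && visited.size() < 100) {   // crude stop condition
            String url = frontier.poll();
            if (!visited.add(url)) continue;                    // skip already-seen pages

            Document page = Jsoup.connect(url).get();           // 1. download the page
            for (Element link : page.select("a[href]")) {       // 2. retrieve all links
                String next = link.absUrl("href");
                if (next.startsWith("http://www.indeed.com"))   // stay within one site
                    frontier.add(next);                         // 3. repeat for each link
            }
        }
    }
}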
2.5.1 HTML
Web pages are written in a tagged markup language called the hypertext
markup language (HTML). HTML lets the author specify layout and typeface,
embed diagrams, and create hyperlinks (Chakrabarti, 2003). HTML is
written in the form of HTML elements consisting of tags enclosed in angle
brackets (like <html>) within the web page content. HTML tags most
commonly come in pairs like <h1> and </h1>, although some tags represent
empty elements and so are unpaired, for example <img>. The first tag in a
pair is the start tag, and the second tag is the end tag (they are also called
opening and closing tags). Between these tags web designers could add text,
further tags, comments, and other types of text-based content. Table 2.9
shows tag forms of HTML elements; more tag forms can be seen in (Werbach
and Jahja, 1996).
Table 2.9 Tag Form of HTML element
Source: (Werbach and Jahja, 1996)

Function                               Tag Form
Main tag of document                   <html></html>
Title of document                      <title></title>
Header of document                     <head></head>
Contents/body of document              <body></body>
Division of document                   <div></div>
Citation                               <cite></cite>
Code                                   <code></code>
Reference to certain document          <a href="url"></a>
Reference to target in document        <a href="url#***"></a>
Reference to target in same document   <a href="#***"></a>
Relationship                           <a rel="***"></a>
Image content                          <img src="url">
Alternative text                       <img src="url" alt="***">
2.5.2 DOM
A Document Object Model (DOM) is a model of how the various HTML
elements in a page (paragraphs, images, form fields, etc.) are related to each
other and to the topmost structure: the document itself. The document is
represented as a kind of tree, in which each HTML element is a branch or
leaf and has a name (Koch, 2001). The Document Object Model is a
cross-platform and language-independent convention for representing and
interacting with objects in HTML, XHTML, and XML documents. Objects in
the DOM tree may be addressed and manipulated by using methods on the
objects (Ano, 2014).
Using the DOM is a kind of naming magic: if you call on an HTML element
using its proper name, you are granted access and you can influence the
element, forcing the browser to react to your arcane incantations. Of
course, as in fairy tales, if you use a wrong name or try to influence the
wrong property, terrible things may start to happen. It is therefore very
important to know the proper incantations (plural, because sometimes you
need to know several names for the same element).
For instance, when you write a rollover script you access a certain image
in the page by using its correct name:

document.images['thename']

When you are granted access, you can change its src property. As soon
as you do that, the browser reacts to your spell by loading another image
in place of the first. If the image you try to name doesn't exist, however,
or if you misspelled the name, the browser gives error messages and your
magic won't work.
With the HTML DOM, JavaScript could access and change all the elements
of an HTML document. When a web page is loaded, the browser creates
a DOM of the page, and the HTML DOM is constructed as a tree of objects
(Ano, 2013). Figure 2.3 shows the DOM tree of objects. With this
tree-of-objects model, JavaScript gets all the power it needs to create
dynamic HTML: it could change all the HTML elements, attributes, and
CSS styles in the page, remove existing HTML elements and attributes,
add new HTML elements and attributes, react to all existing HTML events
in the page, and create new HTML events in the page.
Figure 2.3 HTML DOM Tree of Objects
Source: JavaScript HTML DOM,
http://www.w3schools.com/js/js_htmldom.asp
2.5.3 Jsoup API
Jsoup is an open-source Java library for working with HTML documents;
it provides an API for extracting and manipulating the data within a
document. It uses the document's DOM in combination with jQuery-like
methods to achieve this. Some of its main applications are: scraping and
parsing an HTML document from a URL, local file, or string; finding and
extracting data from HTML documents by traversing the DOM; manipulating
HTML elements, attributes, and text; cleaning user-submitted content
against a safe white-list to prevent XSS attacks; and outputting tidy HTML.
If we were to parse the HTML by developing our own methods, it would
be like reinventing the wheel, and we would waste a lot of time, probably
ending up with a sub-par solution compared to jsoup. It is therefore a good
choice to use such a library for this kind of task.
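As an illustration of these operations, a few lines suffice. The URL and the
selectors below are placeholders chosen for the example, not part of this
system.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDemo {
    public static void main(String[] args) throws Exception {
        // Scrape and parse an HTML document fetched from a URL
        Document doc = Jsoup.connect("http://example.com/").get();

        System.out.println(doc.title());                  // page title

        // Find and extract data by traversing the DOM with CSS-style selectors
        for (Element link : doc.select("a[href]"))
            System.out.println(link.attr("abs:href") + " -> " + link.text());

        // Manipulate an element's text
        Element heading = doc.select("h1").first();
        if (heading != null) heading.text("New heading");
    }
}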
2.6 WEKA
There are many types of automatic classification development tools, but we
have focused on the ones that employ supervised learning, because this fits
the goal of our thesis. Based on the needs of our résumé classification
system, we much prefer a tool that provides convenience in implementing
the chosen classifier algorithm (in our case, the Naïve Bayes classifier), to
make the development of our classification system easier and faster. In
addition, we want a tool that is open-source and user-friendly, has an API
library that supports Java code, and, of course, has sufficient documentation.
Based on these needs, there are three (3) major names that have been
widely recognized in the field of Machine Learning: WEKA, RapidMiner, and
the Java Machine Learning Library (Java-ML). In summary, we chose WEKA
as the machine learning tool in our résumé classification system, because
WEKA provides a GUI interface along with a Java API library, whereas
Java-ML has no GUI interface, only the libraries. Although the construction
of the system is done completely in Java code, we found that the GUI
interface helps us to test some things beforehand (test-before-coding),
before we implement them in Java code.
WEKA (Waikato Environment for Knowledge Analysis) is a popular suite
of machine learning tools written in Java. WEKA was developed by a machine
learning group at The University of Waikato in New Zealand, whose goal is
to build state-of-the-art software for developing machine learning
techniques and to apply them to real-world data mining problems. From
this goal sprang the WEKA software, now used both by specialists within
particular fields, researchers and scientists, and also in teaching (Witten
et al., 2009). Data Mining: Practical Machine Learning Tools and
Techniques (Witten et al., 2011) shows that it is powerful and could be
used for exactly what we wanted to do.
2.7 Online Résumé Directory
An online résumé directory is a site that provides a résumé search service
based on various kinds of criteria, such as field of employment, education,
the résumé owner's name, domicile, or just keywords listed in the résumés.
Online résumé directories are divided into two kinds based on the data
source: sites that provide a résumé hosting service, and sites that are simply
a form of search engine taking links from other sites. A résumé directory
site is, generally, an implementation of a web crawler.
2.7.1 Indeed.com
Indeed.com is an employment-related metasearch engine for job listings
launched in November 2004. As a single-topic search engine, it is also an
example of vertical search. Indeed is currently available in 53 countries and
26 languages. In 2010, Indeed surpassed Monster.com to become the most
visited job site in the United States. As of February 2013, Indeed reaches
over 100 million unique visitors every month. The site aggregates job listings
from thousands of websites, including job boards, newspapers, associations,
and company career pages. In 2011, Indeed began allowing job seekers to
apply directly to jobs on Indeed's site; also in 2011, Indeed began offering
résumé posting and storage.
Indeed.com is the most visited site among the many résumé posting and
search service providers. Several factors make it the number one résumé
search service provider: Indeed.com is free, giving résumé seekers full
access to search, download, and post résumés. Although it is free, the
résumé search results it offers are of good quality, and most of the résumés
are sourced from paid hosting services. The résumé pages shown on
Indeed.com have a tidy format for each element of the résumé document.
2.7.2 Résumé Format
The format of a résumé document varies with the template used, but on
Indeed.com all résumés share the same overall format. A résumé generally
contains information about Name, Domicile, Work Experience, Education,
and Skill. Indeed.com presents all of this information in formatted HTML
tags; for instance, the name of the résumé owner is carried by an element
with the class name .fn. This makes it easy for us to obtain the information
from each résumé page, using the jsoup API for web crawling and document
parsing.
2.8 Java
Java is a computer programming language that is concurrent, class-based,
object-oriented, and specifically designed to have as few implementation
dependencies as possible. It is intended to let application developers "write
once, run anywhere" (WORA), meaning that code that runs on one platform
does not need to be recompiled to run on another. Java applications are
typically compiled to bytecode (class files) that could run on any Java virtual
machine (JVM) regardless of computer architecture. Java is, as of 2014, one
of the most popular programming languages in use, particularly for
client-server web applications, with a reported 9 million developers.
Chapter 3
Analysis and Design
In this chapter, we discuss in detail the design of the document
classification system that we want to build to fulfill the objective of this
thesis. We describe the system, its design stages, and the testing scenario
of our document classification system.
3.1 System Analysis
System analysis aims to identify the emerging requirements that become the
basis for system design, which is further explained in the description of
the system. System analysis also aims to identify the problems that exist
in the system, including the program, the analysis results, and related
elements. This analysis is needed as a basis for the design stage of the
classification system, and includes data analysis, a description of the
system, and process design.
3.1.1 Data Analysis
The type of document used in our classification system is the job résumé
document, in the form of a text file. We use résumé data obtained from the
online résumé directory site Indeed.com. As briefly explained in the previous
chapter, Indeed.com provides a résumé search service: anyone who wants
job résumé documents in any job field can search for and download the
résumé documents provided by Indeed.com. Indeed.com is henceforth
referred to as Indeed.
According to our analysis, Indeed not only facilitates the résumé seeker
with search and download services, but also makes it easy to do data
crawling on its pages. We conclude this from looking at the HTML element
structure of those pages. In this chapter, we explain the data crawling
process, and the reason why the HTML element structure of Indeed makes
the data crawling process easy.
3.1.1.1 HTML Element Structure of Indeed Résumé Page
Figure 3.1 shows a résumé page obtained from the Indeed directory
(http://www.indeed.com/r/Mike-Fehrle/0c7973c92c030db8?sp=0/). On the
résumé page there is information about the name of the résumé owner,
domicile, a list of work experience, last education, and skills. The list of
work experience consists of work title, name of company, and job
description. This is the information we need as our data. To obtain it, we
parse the information from the HTML element structure of each résumé
page, a simplified extract of which is shown after figure 3.1.
Figure 3.1 An Example of Indeed résumé page
<html>
<head></head>
<body>
  <h1 id="resume-contact" class="fn">Mike Fehrle</h1>
  <p id="headline_location" class="locality">Houston, TX</p>
  <div id="workExperience-Fehrle-1" class="work-experience-section">
    <p class="work_title">Accounting Supervisor</p>
    <div class="work_company">Petris</div>
    <p class="work_description">
      Halliburton | Petris
      Participate in and manage daily operations of US
      Manage Accounting staff, domestic and international
    </p>
  </div>
  <div id="workExperience-Fehrle-2" class="work-experience-section"></div>
  <div id="workExperience-Fehrle-3" class="work-experience-section"></div>
  <div id="education-items" class="items-container">
    Bachelor of Business Administration in Accounting
  </div>
  <div id="additionalinfo-items" class="items-container">
    Software Skills
    Nav, Solomon, Great Plains, Paradox, Peachtree, MS
  </div>
</body>
</html>
To find which HTML elements carry the information we need, we analyze
the structure of each résumé page with the help of the jsoup HTML parser
tool (http://try.jsoup.org). The jsoup HTML parser tool reports the
information contained in a particular HTML element, given the desired CSS
query. For example, to see the information contained in a class named
namecontent, one only needs to input the keyword .namecontent. In this way
we identified the corresponding HTML element for each piece of information,
as follows:
1. A class with name fn gives the name of the résumé owner.
2. A p with id name headline_location gives the domicile of the résumé
owner.
3. A class with name work-experience-section gives the work experience
list. Generally, a résumé contains more than one work experience, and
the children of work-experience-section consist of work title, name of
company, and job description.
4. A p with class name work_title gives the work title of one
corresponding work experience.
5. A class with name work_company gives the name of the company of one
corresponding work experience.
6. A p with class name work_description gives the job description of one
corresponding work experience.
7. A div with id name education-items gives the history of education.
8. A div with id name additionalinfo-items gives the additional info,
which is usually a list of skills.
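Assuming the element names above remain stable, extracting those fields
with jsoup could look like the following sketch; the URL is the example page
of figure 3.1, and error handling is omitted.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ResumeParser {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect(
                "http://www.indeed.com/r/Mike-Fehrle/0c7973c92c030db8?sp=0").get();

        String name     = doc.select(".fn").text();                  // owner name
        String location = doc.select("#headline_location").text();   // domicile

        // Each work experience block holds title, company, and description
        for (Element work : doc.select(".work-experience-section")) {
            String title   = work.select(".work_title").text();
            String company = work.select(".work_company").text();
            String desc    = work.select(".work_description").text();
            System.out.println(title + " @ " + company + ": " + desc);
        }

        String education = doc.select("#education-items").text();
        String skills    = doc.select("#additionalinfo-items").text();
        System.out.println(name + " | " + location + " | " + education + " | " + skills);
    }
}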
3.1.1.2 HTML Element Structure of Indeed Search Result Pages
We have found a way to parse information from the résumé page, but we
also need to know how to get to a particular résumé page automatically. At
this point, the working scheme of data crawling is analyzed.
By using a browser to open the Indeed search page, we could search for
any résumés that we want based on the keywords "what" and "where". For
example, to search for résumés with web developer as the job field, we input
"web developer" as the "what" keyword and ignore the "where" keyword.
We then get 10 search results, in the form of links, on one result page, with
web developer as the job field. Figure 3.2 shows an example result page for
the keyword web developer. Each link directs us to a particular résumé page.
For a search with web developer as the keyword, there are around 160
thousand résumés in the result. Indeed splits these 160 thousand résumé
links across multiple pages, each result page holding at most 10 résumé
links, so there are approximately 16 thousand result pages that could be
browsed. On every result page, there is a direct link to the next page.
Figure 3.2 Search result page, with keyword web developer, showing the
results
Figure 3.3 Search result page, with next-page link information at page 1
Figure 3.4 Search result page, showing the change of link at page 2
We analyzed these pages in the same way as the résumé page, with the
following results:
1. Every search based on a "what" keyword has the URL naming scheme www.
indeed.com/resumes/what-keyword. For example, a search with web developer
as the keyword has the URL www.indeed.com/resumes/web-developer. If
the keyword is composed of two or more words, the words are separated
by the character "-".
2. Tracing from the first result page to the next result pages also follows
a URL naming scheme. The second page of the web-developer keyword
has the URL www.indeed.com/resumes/web-developer?start=10&co=US. Figure
3.3 shows the link information to the next page. For the following pages,
the value 10 is replaced with a multiple of 10. Although Indeed reports
around 160 thousand résumés for web-developer, the result pages only
go up to 100 pages. Figure 3.4 shows the increment of 10 in the links
to the next pages.
3. Links that direct to a certain résumé page from a result page are
contained in the tag a with class name app_link, and the information is
stored as href. For example, on a result page, the search result for the
Mike Fehrle résumé gives the link information to his résumé page as
/r/Mike-Fehrle/0c7973c92c030db8?sp=0. If we append the link to
www.indeed.com, it displays the résumé page of Mike Fehrle.
4. The link that directs to the next result page from the current result
page is stored in tag a with class name instl, which is a child of the
tag div with id name pagination. For example, on the first search result
page of web developer, the class instl gives the href information
?q=web+developer&co=US&start=10. If the link is traced by appending it
after www.indeed.com/resumes/, the next result page is shown.
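Taken together, these naming rules are enough to enumerate result pages and
résumé links programmatically. A sketch follows; the keyword and the
100-page bound come from the observations above, and everything else is an
illustrative choice.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ResultPageWalker {
    public static void main(String[] args) throws Exception {
        String keyword = "web-developer";  // words joined with '-'

        // Result pages only go up to about 100, in steps of 10 resumes per page
        for (int start = 0; start < 1000; start += 10) {
            String url = "http://www.indeed.com/resumes/" + keyword
                       + "?start=" + start + "&co=US";
            Document page = Jsoup.connect(url).get();

            // Each a.app_link holds the relative link to one resume page
            for (Element link : page.select("a.app_link"))
                System.out.println("http://www.indeed.com" + link.attr("href"));
        }
    }
}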
3.1.1.3 Documents for Classification Modeling
To build the classification model of our résumé classification system, we
use résumé documents provided by Indeed. We plan to compare the accuracy
rates of the classification models built from two groupings of types of job:
a group of jobs in a general context and one in a specific context, called
the Atypical and Typical types of job.
The Atypical types of job are jobs that are very different from each other.
On the other hand, the Typical types of job are jobs that are similar to each
other or belong to the same field of work. From the two groups, we create
two different classification models. The number of types of job is 5, and
the number of documents for each type of job depends on the testing
scenario that is explained in the Testing Design subsection.
1. Atypical types of job:
(a) Accounting
(b) ComputerIT
(c) Manager
(d) Medical
(e) Technician
2. Typical types of job:
(a) Web Analyst
(b) Web Designer
(c) Web Developer
(d) Web Master
(e) Web Writer
3.1.1.4 Documents for Testing
To test a classification model, we randomly take 10 documents from each
group; for example, for the Atypical types of job, we take 2 documents from
each job, and the same for the Typical types of job. We do not seek new
documents for testing purposes because, logically, if our classification
system can classify the documents used in modelling, then we predict that
it will also be able to classify a new résumé document, provided that its
type of job is valid.
3.1.2 Document Classification with WEKA
3.1.2.1 Steps of Classification in WEKA
The classification steps in the WEKA framework (whether used through the
GUI interface, the command line, or Java code) are as follows (Abernethy,
2010):
1. Prepare the data set in ARFF format. WEKA has restrictions on the
input files it can process; one of the accepted formats is the ARFF file
format. If the data set is prepared as plain text files (UTF-8), the data
contained in those text files must first be converted into the ARFF file
format. WEKA provides the TextDirectoryLoader function package to
convert such a data set into an ARFF file.
2. Prepare training data and testing data from the data set. For accuracy
purposes, the data set should be divided into two parts. The first part
holds most of the data and is used to build the classification model
(the training data); the second part keeps a small portion of the data
and is used to test the classification model, to ensure that it is not
overfitting. Overfitting is the problem that arises if we tune the model
too closely to its training data: the model fits that data perfectly, but
only that data. Of course, in classification we want to use the model
to predict future unknowns; we don't want a model that only perfectly
predicts values we already know. This is why we create a test set. If a
data set holds about 1500 documents, the ratio is about 60-80% for
training data, and the remaining approximately 20-40% for testing
data.
3. Perform filtering of the data set. If the data set is still a collection of
raw text, WEKA cannot process the text data for classification, so we
must perform a filtering process that turns it into word-vector format.
StringToWordVector is the function package that converts the text into
word-vector format. Text mining preprocessing can also be applied in
this filtering step.
4. Conduct training (evaluation) on the training data set. Evaluating the
training data set is the main process in forming the classification model.
By applying a classifier algorithm to the training data set, WEKA
outputs a model together with values that tell the user whether applying
that classifier algorithm to the training data yields a model good
enough to serve as the classification model for predicting unknown
data later. Figure 3.5 shows example output from the evaluation
process. In analyzing the result of an evaluation, we only need to focus
on the values of Correctly Classified Instances (CCI) and Incorrectly
Classified Instances (ICI), which give the accuracy rate of the model.
In this example model, the CCI value is only 59.1% against 40.9% ICI,
i.e. an accuracy rate of only 59%, from which we conclude that the
model is not good enough to be used as a classification model.
Figure 3.5 An example output of Evaluation process
5. Perform testing on the model. The evaluation output shows the quality
of the model that could be obtained from the data set; at that point we
could say we have a model usable for classification, but we still need to
validate the model with the testing data. If the testing output shows
CCI and ICI values that are not very different from the evaluation
output, the model does not have an overfitting problem, and we predict
it will not give inconsistent prediction results when unknown data is
tested against it. Figure 3.6 shows the output of the testing process.
The CCI and ICI values from the testing process are 55.6% and 44.3%,
indicating that the model estimated from the data set will not give
inconsistent prediction results. However, the classification model
created from this data set is not eligible to serve as a classification
model, because the accuracy rate is only 55%. If this model were used
for the classification of unknown documents, the predicted results
would not be as good as expected.
Figure 3.6 An example output of Testing/Validation process
6. Once a really good classification model has been obtained, the model
could be saved into a model file, to be loaded back into WEKA to
perform the classification process on unknown documents.
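As an illustration, steps 1-3 of this list could be sketched with WEKA's Java
API as follows; the directory path, the random seed, and the split ratio are
placeholders, not fixed parts of the design.

import java.io.File;
import java.util.Random;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class DataPreparation {
    public static void main(String[] args) throws Exception {
        // Step 1: one folder per class, one text file per document -> instances
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("data/resumes"));        // placeholder path
        Instances raw = loader.getDataSet();
        raw.setClassIndex(raw.numAttributes() - 1);           // the @@class@@ attribute

        // Step 3: filter the string attribute into a word vector
        StringToWordVector stwv = new StringToWordVector();
        stwv.setInputFormat(raw);
        Instances data = Filter.useFilter(raw, stwv);         // class is carried through

        // Step 2: shuffle, then split roughly 75%/25% into training and testing data
        data.randomize(new Random(42));
        int trainSize = (int) Math.round(data.numInstances() * 0.75);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);
        System.out.println(train.numInstances() + " training / "
                + test.numInstances() + " testing instances");
    }
}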
3.1.2.2 Naïve Bayes Classifier on WEKA
In our classification system, we want a system that can classify text-based
documents, where each document is classified into one of multiple classes:
at least five (5) different classes of job. This is not the same as binary
classification, where there are only two classes, Positive or Negative.
Implementing multi-class text classification is not easy, because reading
the patterns of every class makes the classification more complex than the
binary case. But WEKA makes it easy for us to implement a multi-class
classification system.
WEKA provides the classifier MultiClassClassifier in its library, contained
in the package weka.classifiers.meta.MultiClassClassifier. We just need to
use the MultiClassClassifier in our Java code, and with only the default
options, multi-class classification can be implemented. MultiClassClassifier
is not itself the classifier algorithm that processes the data set and makes
predictions; we need the Naïve Bayes classifier as the base classifier,
included inside the MultiClassClassifier, so that the two work together in
the classification process: MultiClassClassifier enables multi-class
classification, and Naïve Bayes does the classification job and produces the
classification result. WEKA makes classification easier with base classifier
packages that can be used readily, one of them being the NaiveBayes package.
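Wiring the two classifiers together takes only a few lines. A sketch follows;
the class and method names other than the WEKA ones are ours.

import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.MultiClassClassifier;
import weka.core.Instances;

public class ClassifierSetup {
    // Builds the meta-classifier used in this system: MultiClassClassifier
    // delegating the actual probability estimation to Naive Bayes.
    public static MultiClassClassifier build(Instances train) throws Exception {
        MultiClassClassifier mcc = new MultiClassClassifier();
        mcc.setClassifier(new NaiveBayes());   // Naive Bayes as the base classifier
        mcc.buildClassifier(train);            // train must already be word vectors
        return mcc;
    }
}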
3.1.2.3 ARFF File
An ARFF (Attribute-Relation File Format) file is an ASCII text file that
describes a list of instances sharing a set of attributes. ARFF files were
developed by the Machine Learning Project at the Department of Computer
Science of The University of Waikato for use with the WEKA machine
learning tool. In our text classification system, we convert the data set that
we collect from text file format into ARFF file format with the help of the
TextDirectoryLoader function. The data set consists of text sentences stored
in different folders, where a folder defines a class. Below is a code snippet
of an ARFF file.
@relation jobresume

@attribute this numeric
@attribute is numeric
@attribute example numeric
@attribute document numeric
@attribute as numeric
@attribute class1 numeric
@attribute class2 numeric
@attribute class3 numeric
@attribute @@class@@ {class1, class2, class3}

@data
{0 1,1 1,2 1,3 1,4 1,5 1,6 0,7 0,8 class1}
{0 1,1 1,2 1,3 1,4 1,5 0,6 1,7 0,8 class2}
{0 1,1 1,2 1,3 1,4 1,5 0,6 0,7 1,8 class3}
3.1.3 System Specifications
The system to be built has the following capabilities and features:
1. The system can perform data crawling and parsing from the Indeed.com
online résumé directory automatically.
2. The system can collect résumé documents in large quantities based on
type of job, save them in files, and group them into directories based on
the type of job.
3. The system can perform document labeling based on the type of job.
4. The system can perform text preprocessing on the obtained résumé
documents: changing uppercase into lowercase letters, and removing
stopwords and unimportant characters.
5. The system can convert whole text documents into the ARFF file
format (data set).
6. The system can convert an ARFF file from string-type text into
word-vector type, the valid form for the system's needs.
7. The system can divide the data set (in ARFF format) into two parts,
training data and testing data, with a proportion of 75% of the total
data set for training data and 25% for testing data.
8. The system can train on the training data, with the help of the testing
data, to produce evaluation and validation results and create the
classification model.
9. The system can classify résumé documents whose type is unknown
automatically.
10. The system can show the estimated time needed for the training and
classification processes.
3.1.4 User Specifications
This system is intended for use by all parties that require an automatic
résumé classification system, especially résumé seekers and employers, and
also for those who are interested in studying data crawling and document
classification systems.
3.1.5 Development Tools
In developing the résumé classification system, we use development tools
with the following software specifications:
1. Windows/Linux Operating System
2. Java Standard Development Kit and Libraries
3. NetBeans IDE 7.2.1 and Notepad++ v5.8.1
4. Jsoup API Library
5. WEKA API Library
The hardware used in this thesis is a notebook with an Intel Core 2 Duo
processor and 8 GB of memory.
3.1.6 Description of System
We now discuss the description of the system that is developed in this
thesis. The objective of this system is to create a classification system
that can fetch documents from an online résumé directory site, create a
classification model, and classify résumé documents automatically. This
résumé classification system is an implementation of text mining and
machine learning. The entire system is divided into 4 stages, as follows:
3.1.6.1 Data Crawling Stage
The data crawling stage collects and prepares the documents required in
the next stage. The process is as follows:
1. User specifies five types of job as the Atypical types of job, and five
types of job as the Typical types of job.
2. User inputs a keyword (the name of each job) to be added to the
Indeed.com search section URL address, to fetch documents matching
the desired results.
3. User specifies the directory addresses where the résumé documents are
saved, with folder grouping based on type of job.
4. Grouping documents into folders based on type of job is the data
labeling process. By the definition of classification, each data set must
first be labeled by category before the training process.
5. System performs the crawling process to fetch the URL links of résumé
documents based on the type of job desired by the user.
6. For each résumé page obtained, the system performs a parsing process
to obtain the name, domicile, work experience, education, and
additional info information.
7. For every piece of information obtained from a résumé page, the
system stores the information, without any formatting, in a file,
naming each file by the ID obtained from the link of the résumé page
concerned.
8. System stores each résumé document based on type of job and group.
3.1.6.2 Preprocessing Stage
1. User creates a text file that stores the stopwords list. The stopwords
list used is an English stopwords list, which serves as a dictionary for
removing unimportant and unneeded words from documents (filtering).
User defines the directory address of the stopword file.
2. User defines the directory address of the résumé documents obtained
from the data crawling stage. Each type of job has its own folder.
3. User defines a new directory used as the place to store documents
that have been preprocessed. The directory layout follows the same
rule: a certain folder for a certain type of job.
4. System loads the stopword file into the process.
5. System reads each character of the résumé documents obtained from
the directory, and converts every uppercase letter into a lowercase
letter.
6. System eliminates stopwords from the document, if the word is
included in the stopwords list.
7. System deletes all non-letter characters, and eliminates excessive
newline and whitespace characters.
8. System then stores each preprocessed document, as a text file, in the
directory address already defined by the user.
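A minimal sketch of this stage in plain Java follows; the file names are
placeholders, and the regular expressions are our own rendering of the rules
above.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class Preprocessor {
    public static void main(String[] args) throws Exception {
        // Load the English stopwords list, one word per line
        Set<String> stopwords = new HashSet<>(Files.readAllLines(
                Paths.get("stopwords_en.txt"), StandardCharsets.UTF_8));

        String raw = new String(Files.readAllBytes(Paths.get("resume.txt")),
                StandardCharsets.UTF_8);

        // Lowercase, strip non-alphanumeric characters, collapse whitespace/newlines
        String clean = raw.toLowerCase()
                          .replaceAll("[^a-z0-9\\s]", " ")
                          .replaceAll("\\s+", " ")
                          .trim();

        // Remove stopwords
        StringBuilder out = new StringBuilder();
        for (String word : clean.split(" "))
            if (!stopwords.contains(word)) out.append(word).append(' ');

        Files.write(Paths.get("resume_clean.txt"),
                out.toString().trim().getBytes(StandardCharsets.UTF_8));
    }
}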
3.1.6.3 Learning Stage
1. At the beginning of this stage, the user already has a directory with all
preprocessed résumé documents, categorized into folders based on the
type of job. However, these résumé documents cannot be processed in
the learning stage yet, because the only file type accepted for the
learning stage is the ARFF file format. In this classification system, we
use a machine learning library named WEKA to perform the modeling,
training/learning, and classification processes, and WEKA accepts no
input file format other than ARFF. Therefore all résumé documents
must first be converted into an ARFF file.
2. To convert documents into an ARFF file, we could use the
TextDirectoryLoader function of WEKA. TextDirectoryLoader transforms a
whole set of documents into instances in a data set. But
TextDirectoryLoader can only read documents already grouped by folder,
and the name of the folder, which is the type of job, becomes the
class/label of each instance.
3. System applies TextDirectoryLoader to the documents directory, and
saves all documents into a single ARFF file. The system then saves the
ARFF file into a new directory (at this point there is no need for
multiple folders anymore).
4. This ARFF file then becomes the data set: the job résumé data set.
5. This single ARFF file contains a number of instances, with 5 classes
(types of job), each class having a number of instances. To perform
the learning/training stage, we also need to divide the data set into
two parts, a training data set and a testing data set.
6. To divide the data set into training and testing data sets, the data set
must first be filtered. The filtering process, in the scope of WEKA,
changes string-type instances (the résumé documents are sentences)
into word vectors, by breaking down each sentence into the set of
words composing it (a vector). Filtering can be performed using the
StringToWordVector function of WEKA.
7. System loads the ARFF file from a directory, performs the filtering
process with the StringToWordVector function, and obtains a new,
processed data set.
8. System then divides the data set into two parts, with proportions of
75% and 25% of the data set: 75% becomes the training data, and
25% the testing data. System saves the two data sets, respectively,
into new ARFF files.
9. The learning/training stage uses WEKA as the basis of the training
process. We use the WEKA framework to perform the evaluation,
validation, and classification processes, and WEKA's libraries for the
whole classification pipeline of our résumé classification system. In
particular, we use WEKA's classifier algorithms: MultiClassClassifier
(to allow multi-class classification) and the NaiveBayes classifier. We
do not define the Naïve Bayes algorithm in our own way; it depends
entirely on WEKA's libraries.
10. The learning stage is subdivided into two phases, the Evaluation phase
and the Validation phase. The Evaluation phase is the learning phase
that uses only the training data set, performing CrossValidation
evaluation. The result of the Evaluation phase is a classification model
formed by the training process, but this model is not yet validated for
use as the classification model, so the Validation phase is needed after
the Evaluation phase. The Validation phase is the learning phase that
uses the training and testing data sets simultaneously. If no overfitting
problem occurs after the validation phase, the model can be used in
the classification stage to classify new unlabeled documents.
11. System loads the training data set to perform the Evaluation phase.
The system defines MultiClassClassifier with the NaiveBayes classifier
as the classifier, performs CrossValidation training on the training
data set, and shows the resulting accuracy rate as output.
12. Next, the system loads the testing data set, while the training data set
is still loaded in the system. System defines MultiClassClassifier with
the NaiveBayes classifier once again as the classifier, evaluates the
model against the training and testing data sets, and shows the second
result output.
13. If both result outputs show no overfitting problem, and the user is
satisfied with the accuracy rate of the model, then the system saves
the model as a model file in the same directory where the training and
testing data sets were saved. At this state, we have a classification
model.
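The Evaluation and Validation phases correspond directly to WEKA's
Evaluation class. A sketch follows, assuming train and test are the
word-vector data sets produced earlier; the model file name is a placeholder.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.MultiClassClassifier;
import weka.core.Instances;
import weka.core.SerializationHelper;

public class LearningStage {
    public static void run(Instances train, Instances test) throws Exception {
        MultiClassClassifier mcc = new MultiClassClassifier();
        mcc.setClassifier(new NaiveBayes());

        // Evaluation phase: 10-fold cross-validation on the training data only
        Evaluation evalPhase = new Evaluation(train);
        evalPhase.crossValidateModel(mcc, train, 10, new Random(1));
        System.out.println(evalPhase.toSummaryString("=== Evaluation ===", false));

        // Validation phase: train on the training set, test on the held-out set
        mcc.buildClassifier(train);
        Evaluation valPhase = new Evaluation(train);
        valPhase.evaluateModel(mcc, test);
        System.out.println(valPhase.toSummaryString("=== Validation ===", false));

        // If the CCI/ICI values do not differ wildly, save the model for later use
        SerializationHelper.write("resume.model", mcc);
    }
}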
3.1.6.4 Classification Stage
1. At this stage, we have a classification model that we can use to
classify new unlabeled documents.
2. However, the classification model can only classify a document in the
same file format, an ARFF file. What if we have an unlabeled résumé
document in text file format? The TextDirectoryLoader function can
only convert labeled document collections (not a single document).
This can be handled with the Batch Filtering method that WEKA
offers. Batch Filtering is a method in which the labeled and unlabeled
data sets are filtered together in the same phase, with identical
filtering options. Two ARFF files are still generated, but with the
same filter settings and of compatible types.
3. User saves the unlabeled résumé documents in a new directory,
different from the data set directory. The résumé documents are
stored as text files.
4. Then the system converts the unlabeled résumé documents from text
into an ARFF file, using the Batch Filtering method with the same
filtering options as the labeled data sets, and the ARFF file of
unlabeled résumé documents is ready to be classified.
5. System loads the classification model from the directory, and defines
the model as the classification model with MultiClassClassifier and
the NaiveBayes classifier.
6. System loads the unlabeled data set.
7. Then the system runs the classification process on the unlabeled data
set, based on the classification model. If the type of job contained in
the unlabeled data set belongs to the types of job of the classification
model, the system shows the prediction result for the unlabeled data
set. If the type of job is not included, the system cannot classify the
unlabeled data set, and shows an error message as output.
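A sketch of batch filtering and prediction follows. The essential point is
that the same StringToWordVector instance, initialized on the labeled data,
is reused on the unlabeled data so that both come out with compatible
attributes; class and file names here are placeholders.

import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ClassificationStage {
    public static void classify(Instances labeled, Instances unlabeled) throws Exception {
        // Batch filtering: initialize the filter on the labeled data, then
        // apply it unchanged to the unlabeled data
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(labeled);
        Instances labeledVec   = Filter.useFilter(labeled, filter);
        Instances unlabeledVec = Filter.useFilter(unlabeled, filter);

        // Load the model saved in the learning stage
        Classifier model = (Classifier) SerializationHelper.read("resume.model");

        for (int i = 0; i < unlabeledVec.numInstances(); i++) {
            double cls = model.classifyInstance(unlabeledVec.instance(i));
            System.out.println("Instance " + i + " -> "
                    + unlabeledVec.classAttribute().value((int) cls));
        }
    }
}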
3.1.7 Class Analysis of System
The classes to be built in the system are as follows:
1. Data Crawling class
This class serves as a crawler of Indeed.com (crawling); it fetches and
stores résumé information (parsing) into documents in a directory that
has been categorized by type of job.
2. Data Preprocessing class
The text preprocessing functions, from the lowercase function, through
stopword and non-letter character deletion, to newline and whitespace
deletion, are included in this class.
3. Training/Learning class
This class is used to create a classification model from the training and
testing data sets.
4. Classification class
This class is the main class of the system; it classifies unlabeled résumé
documents into a certain type of job, using the classification model
created by the training class.
5. ARFF File Converter class
This class works as a file converter. The conversion of résumé
documents into the ARFF file format is included in this class, as are
the filtering process that converts string-type data into word-vector
type and the process that divides the data set into training and
testing parts.
6. Timer class
This class works as a stopwatch, or timer. It reports the estimated
time, or the time a process takes from start to finish. This class is
needed because the learning (modelling) stage of this system sometimes
consumes much time, and to compare the time cost of one document
classification process against another.
3.2 System Design
This section explains the design of our résumé document classification
system. We explain what the system does and the features used in it. The
object of the system is the job résumé document, whose entire information
content is sourced from online résumé directory pages. The whole system is
divided into two main parts, (1) the web crawling part and (2) the document
classification part; each part is further divided into subparts. The web
crawling part is divided into two processes, Data Crawling and Data
Preprocessing. The document classification part is divided into three
processes: ARFF File Converter, Data Training/Modelling, and Data
Classification. Figure 3.7 shows the workflow of the system.
Figure 3.7 Workflow of System
3.2.1 Data Crawling
To obtain the object of our system, job résumé information, we need help
from an online job résumé directory site. Most online job résumé directory
sites provide access to review job résumé information. In this case, we use
Indeed.com as our job résumé source, since Indeed.com gives free full access
(without a trial period or premium/non-premium access restrictions) to the
job résumé information that Indeed holds.
There are two kinds of résumé information on Indeed (as far as we
know). First, since Indeed provides a résumé posting service on its own
directory, we could get résumé information that was posted by people
directly into the Indeed service. Second, we could get résumé information
that comes from other sources, such as LinkedIn or VisualCV, but is
presented on an Indeed page. That is possible because, besides providing a
résumé posting service, Indeed also has a web crawler ability, and can crawl
résumé information from other online résumé directories. Indeed fetches
the résumé information posted on another site, stores it in the Indeed
directory, and shows the information on its own résumé page; of course, all
of this is done with permission from the source, considering that LinkedIn
and VisualCV are premium sites. But we need not bother with how Indeed
gets all the résumé information, since we just want to use the résumé
information for our system.
Indeed places each résumé's information on an individual page, so a résumé
has its own specific page, and Indeed gives an identity to each page. The
résumé page is an HTML-based page, and Indeed stores every detail as
specific content in a specific HTML element. Fortunately, Indeed declares
each specific element with a specific HTML element name, so we could say
Indeed has a tidy DOM structure in its HTML. We need to get all detailed
information from every HTML element, so we use a tool (a Java library) to
do this job: the Jsoup library.
Jsoup is a web crawler tool library oriented to the Java programming
language. Jsoup has syntax selector functions that allow users to get every
available piece of content on a web page: links, images, information, the
page title, or specific content stored in HTML elements. Jsoup is equipped
with the ability to do online web crawling, as long as the system is
connected to the internet. Based on the data analysis that has been done,
there are two main activities in this process:
1. Data Fetching
This method fetches the needed information from the HTML elements
associated with every résumé page, as follows:
(a) Make a connection to the Indeed.com page.
(b) Fetch the Name of the résumé owner from the .fn element.
(c) Fetch the Domicile/location of the owner from the
headline_location element. If there is more than one location
entry, fetch only the first one in the list.
(d) Fetch the list of Work Experience information from the
work-experience-section element.
(e) For every Work Experience entry, fetch the Work Title
information from the work_title element, a child of the
work-experience-section element.
(f) For every Work Experience entry, fetch the Work Company
information from the work_company element, a child of the
work-experience-section element.
(g) Fetch the Education information from the education-items
element. If there is more than one education profile, fetch the
first one in the list.
(h) Fetch the Additional Info information from the
additionalinfo-items element. If there is more than one
additional info entry in the list, fetch only the first one.
Once all necessary information has been obtained, it is saved into a
text file, without any additional formatting.
There is a function that fetches the URL of the current résumé page
and parses it to get the page's unique ID. Indeed names every résumé
page with a unique ID along with the name of the résumé owner. The
ID is a randomly generated string of 10-15 characters, a combination of
letters and numbers. Every résumé document is saved into a text file
with the unique ID as its name; if the link URL contains no unique ID,
the name of the résumé owner is used as the file name. This class uses
the BufferedWriter function to store the information in a document,
and the document is saved in a directory, each document grouped by
type of job. A sketch of this naming step is given below.
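As a sketch, deriving the file name from the résumé URL and writing the
fetched text could be done as follows; the regular expression is our own
guess at the ID pattern described above, not a documented Indeed format.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResumeSaver {
    // Indeed resume URLs look like /r/Mike-Fehrle/0c7973c92c030db8?sp=0;
    // the second path segment is assumed to be the unique page ID
    private static final Pattern ID = Pattern.compile("/r/[^/]+/([0-9a-zA-Z]{10,16})");

    public static void save(String url, String content, String jobDir) throws Exception {
        Matcher m = ID.matcher(url);
        String name = m.find()
                ? m.group(1)                                   // unique page ID
                : url.replaceAll(".*/r/([^/?]+).*", "$1");     // fall back to owner name

        try (BufferedWriter out = new BufferedWriter(
                new FileWriter(jobDir + "/" + name + ".txt"))) {
            out.write(content);   // store the raw text, no extra formatting
        }
    }
}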
2. Link Fetching
This method serves as a link searcher, which initializes the résumé
search pages based on a job keyword. In the data analysis it was
mentioned that the résumé search process with a "what" keyword can
be done directly from the URL. For example, by inserting the URL
www.indeed.com/resumes/web-developer into the browser, we get direct
access to the web-developer result pages, even without using the
search column provided by Indeed. From each result page, we get the
résumé result list matching the keyword, and from each result list, we
get links to the résumé pages. To go to each résumé page in the list,
the link is obtained from the app_link element; appending the
obtained link to the URL www.indeed.com leads to the particular
résumé page. On each search result page there are links pointing to
the next search result page; the link that directs us to the next page is
contained in the pagination HTML element. By using the information
contained in the app_link and pagination HTML elements, the system
can crawl all résumé pages from the Indeed directory freely.
3.2.2 Data Preprocessing
After we have the résumé documents stored in text files, one file per résumé
page, we need to clean the documents of unnecessary characters and words.
In the text mining context, cleaning a document of unnecessary characters
or words is well known as Text Preprocessing. We need to do preprocessing
because we really don't want unnecessary characters or words to take part
in the text mining process, where they would affect the result of pattern
discovery. For example, suppose the documents contain the character "+"
and the word "the": the "+" character is just used as a list bullet and has
no meaning, and similarly for "the". If we do not do text preprocessing,
then at the pattern discovery stage "+" and "the", if their occurrence
counts are high, would be processed as a pattern of a document of a specific
class/label; another document with many occurrences of "+" and "the"
would then be classified into the same class as the previous document,
instead of its real class. This would be a fatal error.
The text preprocessing functions included in this process are the lowercase
function, stopword deletion, deletion of non-letter characters, and deletion
of newline and whitespace characters. Removal of stopwords is done using a
stopwords library (a text file that contains an English stopwords list; the
list can be seen in Appendix B). Converting uppercase/capital letters to
lowercase uses the toLowerCase function from the Java library, applied to
each letter in the document. Non-letter character deletion is done with a
per-letter checking method: if a character is not a letter of the alphabet
(A-Z) or a number (0-9), it is deleted. We delete newline characters since
we just need all the words in a row, which makes the next process easier
later.
After all documents have been preprocessed, they are saved into new
text files, in a different folder.
3.2.3 ARFF File Converter
Before we can do the Data Classification process, we need to prepare the
data sets that are used in the classification process. For Data Classification
we use the WEKA library for the whole machine learning process, from
training the data set, through modelling, to classification. But WEKA can
only process data set files in the ARFF format, so we need to convert all
preprocessed résumé documents into an ARFF file. In WEKA, one ARFF file
represents one data set, which includes all documents; that is why, even with
many documents, all of them are converted into a single data set, a single
ARFF file. For the Data Classification process we need more than one data
set: a training data set, a testing data set, and an unlabeled data set to be
classified, and all of them must be in the ARFF file format (an illustrative
ARFF sketch is shown below).
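To illustrate the format, a minimal ARFF file for our case could look like the sketch below; the relation name and data rows are hypothetical, while the text attribute and the @@class@@ attribute with five class values follow the structure used later in Section 4.1.5.

@relation resume_dataset

@attribute text string
@attribute @@class@@ {class1,class2,class3,class4,class5}

@data
'web developer javascript html css experience',class3
'accounting ledger audit tax experience',class1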
Converting the résumé documents from text files into a single ARFF file can
be done with the TextDirectoryLoader function from the WEKA library. This
function only works with a labeled data set: it reads the content of each
document, but the documents must already be grouped into folders by class.
Basically, in the classification process there should be a data set with at
least two classes/labels (binary classification): yes or no, positive or negative,
etc. In our system there are five classes, so to use the TextDirectoryLoader
function to make an ARFF data set from our documents, the directory should
contain five folders (one per class) which store the documents of each class.
Although TextDirectoryLoader can convert many documents into a single
ARFF file, it only accepts a data set that is already grouped into folders, so
this function cannot convert documents that have no class (folder), and there
must be more than one class. So what if we want to make an unlabeled data
set to be classified later, where the class/label of each document is still
unknown? (We use the unlabeled data set to test our classification model,
so its classes can be predicted.) Figure 3.8 shows the folder structure used
in the system; since we compare the Atypical and Typical data sets, there
are two folders for those data sets (general for Atypical and specific for
Typical), and we set one folder for the unlabeled data set, with testClassification
as the folder name.
For unlabeled documents, we need to create the ARFF data set manually.
We can create it with WEKA's help, since WEKA provides types such as
FastVector, Instances, and Attribute that represent the content of an ARFF
file, so we only need to define the relation, instances, and attributes of the
ARFF file we want to create. Every unlabeled résumé document's content is
declared as one instance (one data row).
Figure 3.8 Folder Structure of System
The ARFF data set from the TextDirectoryLoader process already contains all
résumé documents, but still in string (sentence) form; we then need to extract
all words from those sentences to become features (feature extraction), since
in the classification process the pattern is discovered from the occurrence of
each word/feature. WEKA represents the features of each data row as a
Word Vector. We use the StringToWordVector function from the WEKA
library to extract features from all data, so the data set shows the word
occurrences of the data set. StringToWordVector is one of many data filtering
functions provided by WEKA. In our system, we use the filtering functions
as follows:
1. Randomize
Randomize randomly shuffles the order of instances passed through it.
The random number generator is reset with the seed value whenever a
new set of instances is passed in. We apply this function to the labeled
data set because we want to divide it into training and testing data sets
later; with randomization, the division is fair enough for both the training
and the testing data set. We just use the default parameters of this
function.
2. StringToWordVector
StringToWordVector converts string attributes into a set of attributes
representing word occurrence information from the text contained in the
strings. The set of words (attributes) is determined by the first batch
filtered (typically the training data). We use the parameters below for
StringToWordVector; the remaining parameters, not explained here, are
left at their defaults.
(a) outputWordCounts
Output word counts rather than a boolean indicating the presence
or absence of a word. This parameter sets the instance format to
show the number of occurrences of each word contained in a
sentence. We set this parameter to true, since a format that shows
word occurrences is easier to inspect.
(b) attributeIndices
Specify the range of attributes to act on. This is a comma-separated
list of attribute indices. We use first-last for this parameter, so the
attribute range runs from index 0 (first) to the last index. We could
also leave this parameter alone, since the default is first-last.
(c) tokenizer
The tokenizing algorithm to use on the strings. The tokenizer
breaks a sentence down into word parts, or into smaller forms;
each smaller form becomes a feature of the data set. We use
NGramTokenizer, which splits a string into n-grams with given
minimum and maximum sizes. We set the minimum n-gram size
to 1 and the maximum to 3, so the system splits the text into
features of at most 3 words. The reason we use the n-gram
tokenizer is that we want to obtain features constructed from more
than one word.
(d) wordsToKeep
The number of words (per class, if there is a class attribute
assigned) to attempt to keep. We just define 1000 words to keep,
because values above 1000 for this parameter can sometimes weigh
down the process.
3. Reorder
Reorder is a filter that generates output with a new order of the
attributes, useful if one wants to move an attribute to the end to use it
as the class attribute. In WEKA, the class/label is considered one of
the attributes (the class attribute), and the index of the class attribute
must be the last; but after the StringToWordVector function is applied,
the index of the class attribute changes to the first. To change the
index back to the last, Reorder is applied. We use 2-last,1 as the
parameter of this function to move the class attribute to the last index.
4. RemovePercentage
RemovePercentage is a filter that removes a given percentage of a data
set. After a labeled data set has been filtered with StringToWordVector,
we need to divide it into two parts, the training and the testing data
set, because at the training stage we do an evaluation that requires the
data in two different proportions. The proportion of data is specified
with the percentage parameter, and we use different percentage values.
For the invertSelection parameter we use either true or false: true if
we want to remove the inverse of the percentage value, and false if we
want to remove by the percentage value itself. For example, if we want
75% of the data set for training and the rest (25%) for testing, then we
specify the percentage value as 75 for both the training and the testing
data, with invertSelection set to true for the training data and false for
the testing data. A combined sketch of this filter chain is given after
this list.
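The sketch below chains the four filters just described on a labeled data set, using the WEKA classes named above; it is a condensed illustration under assumed variable names, while the actual implementation appears in Section 4.1.5.

import weka.core.Instances;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Reorder;
import weka.filters.unsupervised.attribute.StringToWordVector;
import weka.filters.unsupervised.instance.Randomize;
import weka.filters.unsupervised.instance.RemovePercentage;

public class FilterChainSketch {
    // Returns {trainingSet, testingSet} built from a labeled data set.
    public static Instances[] prepare(Instances data) throws Exception {
        Randomize rand = new Randomize();                   // 1. shuffle the instances
        rand.setInputFormat(data);
        data = Filter.useFilter(data, rand);

        StringToWordVector stwv = new StringToWordVector(); // 2. extract word features
        stwv.setOutputWordCounts(true);
        NGramTokenizer tok = new NGramTokenizer();
        tok.setNGramMinSize(1);
        tok.setNGramMaxSize(3);
        stwv.setTokenizer(tok);
        stwv.setWordsToKeep(1000);
        stwv.setInputFormat(data);
        data = Filter.useFilter(data, stwv);

        Reorder order = new Reorder();                      // 3. class attribute back to last
        order.setAttributeIndices("2-last,1");
        order.setInputFormat(data);
        data = Filter.useFilter(data, order);

        RemovePercentage trainSplit = new RemovePercentage(); // 4a. keep 75% for training
        trainSplit.setPercentage(75);
        trainSplit.setInvertSelection(true);
        trainSplit.setInputFormat(data);
        Instances train = Filter.useFilter(data, trainSplit);

        RemovePercentage testSplit = new RemovePercentage();  // 4b. keep 25% for testing
        testSplit.setPercentage(75);
        testSplit.setInvertSelection(false);
        testSplit.setInputFormat(data);
        Instances test = Filter.useFilter(data, testSplit);

        return new Instances[] { train, test };
    }
}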
In the next process, the classification process, to classify or predict the class
of the unlabeled data set we need the unlabeled data set to have the same
attribute set as the labeled data set. The set of words (attributes) is deter-
mined by the first batch filtered (typically the training data). StringToWordVector
sets the attributes of a data set to the words that occur in that data set, so if
the labeled and unlabeled data sets are given the StringToWordVector filter
separately, the filter produces two different attribute sets, since a word that
occurs in the labeled data set may not occur in the unlabeled one. So we
need to use the Batch Filtering method, which applies the same filter to both
the labeled and the unlabeled data set at the same time. Logically, this
means that before we filter the labeled data set, we should already have the
unlabeled data set, and do the filtering of both at the same time. This is a
bit tricky: if we do not do this, StringToWordVector would almost certainly
create two differently shaped output ARFF files, two incompatible files.
After the filtering process, we need to divide the labeled data set into the
training and testing parts required by the training and modelling process to
create a classification model. The RemovePercentage function creates the
training and testing data sets with a specific proportion of the data.
3.2.4 Data Training/Modelling
Once we have the training and the testing data set, we can do the training
process to create a classification model with a specific classifier. In our case,
we want to create a classifier model with more than two classes, specifically
five, and we need to use the Naïve Bayes algorithm as our classifier. To
implement the Naïve Bayes classifier algorithm we use the NaiveBayes
classifier package from the WEKA library; but technically, as used here, the
NaiveBayes classifier from WEKA does binary classification with two classes.
So we use MultiClassClassifier from the WEKA library to do multi-class
classification with the NaiveBayes classifier algorithm. MultiClassClassifier
is a meta classifier for handling multi-class data sets with 2-class classifiers:
it transforms the multi-class problem into several 2-class ones, so with
MultiClassClassifier the system actually performs as many 2-class classifica-
tions with NaiveBayes as there are classes. We use the default parameter
values for both MultiClassClassifier and NaiveBayes, but we define that the
main classifier is MultiClassClassifier, and the base classifier is NaiveBayes.
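For reference, the decision rule underlying the Naïve Bayes classifier (a standard textbook formula, stated here for clarity rather than taken from the system's code) chooses, for a document with words $w_1, \ldots, w_n$, the class

\[ \hat{c} = \arg\max_{c \in C} \; P(c) \prod_{i=1}^{n} P(w_i \mid c), \]

and MultiClassClassifier decomposes our five-class problem into several 2-class problems, each solved with this rule.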
In this process, there are two phases to be done, as follows:
1. Evaluation Phase
The evaluation phase is done with 10-fold Cross Validation of the
training data set. The purpose of this phase is to build the classification
model and see the accuracy rate of the classification model created
from the training data set.
10-fold Cross Validation randomly partitions the training data set into
10 equal-size subsamples. Of the 10 subsamples, a single subsample is
retained as the validation data for testing the model, and the remaining
9 subsamples are used as training data. The cross-validation process is
then repeated 10 times, with each of the 10 subsamples used exactly
once as the validation data. The 10 results from the folds can then be
averaged (or otherwise combined) to produce a single estimate. 10-fold
Cross Validation is the most commonly used validation method to train
a classifier and build a classification model. We can do cross validation
with the crossValidateModel function from the WEKA library.
The result of this phase is the classification model created from the
training data set, and the system shows the accuracy rate of the model;
but this accuracy rate needs to be compared with the accuracy rate of
the validation phase, to check that there is no overfitting problem in
the model.
2. Validation Phase
After the evaluation phase is done, we need to test the classification
model with the testing data set to validate that the model does not
have an overfitting problem. We can simply use the evaluate function
from the WEKA library with the testing data set to see the accuracy
result. If no overfitting problem occurs, we can save the classification
model in a model-format file. The model file saves the configuration of
the classifier used in the training process and the training result of the
training data set.
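A minimal sketch of these two phases with the WEKA Evaluation API is shown below, assuming classifier, trainData, and testData already exist; the full implementation appears in Section 4.1.3.

import java.util.Random;
import weka.classifiers.Evaluation;

// Evaluation phase: 10-fold cross validation on the training data set.
Evaluation evalPhase = new Evaluation(trainData);
evalPhase.crossValidateModel(classifier, trainData, 10, new Random(1));

// Validation phase: test the built model against the testing data set.
classifier.buildClassifier(trainData);
Evaluation validPhase = new Evaluation(trainData);
validPhase.evaluateModel(classifier, testData);

// A much higher evaluation accuracy than validation accuracy suggests overfitting.
System.out.println("evaluation accuracy: " + evalPhase.pctCorrect());
System.out.println("validation accuracy: " + validPhase.pctCorrect());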
3.2.5 Data Classification
After we get the classification model, we can test the model by classifying
the unlabeled data set. In the ARFF file converting process we already
prepared the unlabeled data set from résumé documents whose class is
unknown. We use the classification model to predict the class of the unlabeled
data set; of course, the class of the unlabeled data must be a valid one, i.e.
listed among the classes of the training data set. The class of each row in
the unlabeled data set is unknown (the character ?). The system loads the
classification model file and the unlabeled data set, and, using the prediction
function from the WEKA library, the result shows the predicted class of each
unlabeled data row.
If there are incorrect predictions, we can conclude that the classification
model is not good enough to classify unlabeled documents. The reasons that
affect the quality of a classification model may be that the training data set
is not good enough or does not have a good pattern, or that the classifier
algorithm is not the right algorithm for the training data set (because of the
amount of data, the type of data, or the quality of the data).
3.3 Testing Design
Previously, we said that we plan to do a comparison test of two job-category
groups, the Atypical and the Typical types of job. We want to see the
difference in the accuracy rate of the classification models for the Atypical
and Typical types of job. Jobs belonging to the Atypical group are very
different from each other (they have little similarity), while jobs belonging to
the Typical group are similar to each other, or belong to the same field of
work. We also do a comparison across different amounts of data, to see the
accuracy rate for each data set size and discuss the enhancement of the
accuracy rate. And of course, along with the comparison of accuracy rates,
we would like to compare the quality of the prediction results of the
classification models for the Atypical and Typical types of job, to see which
one predicts better.
Along with the classification testing, we also test whether our data crawling
process for getting résumé documents from Indeed.com works as expected.
Since our object of testing is résumé documents, we test the crawling system
by fetching different amounts of documents from Indeed.com.
3.3.1 Objective of Testing
1. Checking whether the data crawling process goes as expected (from
crawling through preprocessing).
2. Checking whether the classification process goes as expected (from
ARFF file conversion through classification).
3. Evaluating the accuracy rate of the classification model on the data
sets (in the training and classification stages), and concluding whether
the classification model is a good model or not.
4. Comparing the accuracy rates of the Atypical and Typical types-of-job
data sets.
3.3.2 Scenario of Testing
The object of testing is the job résumé documents obtained from Indeed.com.
The object is divided into two groups, Atypical types of job and Typical types
of job. Each group consists of five (5) types of job, which become the
classes/labels of the documents. Table 3.1 shows the types of job in each group.
Table 3.1 Types of job

    Atypical Group    | Typical Group
    Accounting        | Web Analyst
    ComputerIT        | Web Designer
    Manager           | Web Developer
    Medical           | Web Master
    Technician        | Web Writer
Testing is divided into three (3) phases, based on the amount of data, to
compare the accuracy rates of different data set sizes. The amounts of data are:
1. 200 documents for each class x 5 classes = 1000 documents
2. 500 documents for each class x 5 classes = 2500 documents
3. 1000 documents for each class x 5 classes = 5000 documents
For each phase, we test the system to see the accuracy rate value at the
Training stage for the Atypical and the Typical group. We compare the
testing with two methods:
1. 10-fold Cross Validation with a proportion of training and testing data
set (by percentage).
2. 10-fold Cross Validation with the whole data set, without a training/
testing proportion.
We compare these two methods to see the difference in the accuracy rate of
the data set, and we test each classification model by predicting the unlabeled
résumé documents. For the second method we do not use the RemovePercentage
function, because we intend to use the whole data set (100% of the data)
instead of dividing it into training and testing data sets. But since we need
to check whether the model has an overfitting problem or not, we use 10%
of the data set as the testing data set.
Note: The accuracy rate value is taken from the Correctly Classified Instances
value of the WEKA calculation; we focus on that value, and the other values
from the WEKA calculation are not discussed.
To test the classification model, we intentionally take two (2) résumé docu-
ments from each class and collect them in a folder (depending on which group
the class belongs to): an Atypical group folder and a Typical group folder, as
if these résumé documents had not been labeled yet. We do this so that we
can evaluate the accuracy of the prediction result more easily, and we assume
that if the classification model can predict those documents correctly, it
would predict other unlabeled documents correctly too.
For the first method, with a proportion of training and testing data, the
accuracy rates at the Training stage (evaluation and validation phases) are
shown in a table. Table 3.2 is the design of the accuracy rate result table.
Table 3.2 Table Design of Accuracy Rate result with Proportion of data set

    Group | Training     | Testing      | Accuracy Rate at     | Accuracy Rate at     | Estimation
          | Data set (%) | Data set (%) | Evaluation Phase (%) | Validation Phase (%) | Time (second)
          | 90           | 10           |                      |                      |
          | 75           | 25           |                      |                      |
          | 50           | 50           |                      |                      |
          | 25           | 75           |                      |                      |
          | 10           | 90           |                      |                      |
The class prediction results are shown in a different table. Table 3.3 is the
design of the class prediction result table, with the different proportions of
training data based on the accuracy result table.
Table 3.3 Table Design of Class Prediction Result with Proportion of data set

    Group: ______
    No   | Real Class | With % of training data, predicted as
         |            | 90%  | 75%  | 50%  | 25%  | 10%
    1    | Class 1    |
    2    | Class 2    |
    3    | Class 3    |
    4    | Class 4    |
    5    | Class 5    |
    6    | Class 6    |
    7    | Class 7    |
    8    | Class 8    |
    9    | Class 9    |
    10   | Class 10   |
    Total Data Predicted Correctly |
For the second method, with the whole data set, the accuracy rates at the
Training stage (evaluation and validation phases) are also shown in a table.
Unlike the result table of the first method, this table shows the accuracy
rate for the three phases of data size (1000, 2500, and 5000 documents). As
mentioned before, we use 10% of the data set as testing data, but we keep
the amount of data at 100%. Table 3.4 is the design of the accuracy rate
result table.
Table 3.4 Table Design of Accuracy Rate result with 100% of data set

    Group | Total Amount of      | Accuracy Rate at     | Accuracy Rate at     | Estimation
          | Documents (data set) | Evaluation Phase (%) | Validation Phase (%) | Time (second)
          | 1000                 |                      |                      |
          | 2500                 |                      |                      |
          | 5000                 |                      |                      |
The class prediction results for the second method are shown in a different
table. Table 3.5 is the design of the class prediction result table, with the
different amounts of documents based on the accuracy result table.
Table 3.5 Table Design of Class Prediction Result with 100% of data set

    Group: ______
    No   | Real Class | Total Amount of Documents (data set)
         |            | 1000  | 2500  | 5000
    1    | Class 1    |
    2    | Class 2    |
    3    | Class 3    |
    4    | Class 4    |
    5    | Class 5    |
    6    | Class 6    |
    7    | Class 7    |
    8    | Class 8    |
    9    | Class 9    |
    10   | Class 10   |
    Total Data Predicted Correctly |
Chapter 4
Implementation and Testing
4.1 Implementation
This section describes the implementation of the system design in the
programming language. The code discussed covers the development of the
Data Crawling, Data Preprocessing, Training/Learning, Classification, and
ARFF File Converter classes, and the Timer class.
The discussion focuses only on the functions that support the behaviour of
each class; the full listing of each class's code is provided in the Appendix.
4.1.1 Data Crawling class
As explained in the previous chapter, this class serves as a data crawler for
Indeed.com: it fetches and stores (parses) résumé information into documents
in a directory that has been categorized by type of job. This class uses the
jsoup library to perform data crawling of the information stored in each
HTML element. The class is divided into several methods, each with its own
function in the data crawling process.
4.1.1.1 Declaration of Variable
This part declares the global variables needed throughout the class.
static ArrayList<String> fetched_url = new ArrayList<String>();
private static String root_folder = pathDir;

The snippet of code above defines the path of the directory used to save all
documents, and the fetched_url ArrayList of strings, which temporarily stores
the information (URL) of each résumé page that has already been fetched.
4.1.1.2 createFileFromIndeed method
This method does the data parsing from Indeed.com (getting all information
from specific HTML elements) and saves the information into a text file. As
explained in the previous chapter, we already know the HTML elements that
store the information, such as Name, Domicile, Work Experience, Work Title,
Work Company, and Work Description, and we already know how to connect
to an Indeed page with jsoup. The snippet of code below shows the
implementation of data parsing from each HTML element:
Document resume = Jsoup.connect(url).timeout(0).get();

Element name = resume.select(".fn").first();
Element location = resume.select("#headline_location").first();
Elements work_section = resume.select("div.workexperiencesection");
Elements work_title = resume.select(".work_title");
Elements work_company = resume.select("div.work_company");
Elements work_description = resume.select(".work_description");

Element education = resume.select("#educationitems").first();
Element add_info = resume.select("#additionalinfoitems").first();
To open a connection to an Indeed page we need the jsoup connect function,
with url as the Indeed page link. In every parsing session with jsoup we
need to create a Document in order to get an element or elements from the
HTML page. We use Element when we need one and only one piece of
information stored in a specific element; we use Elements when we need to
fetch a list of elements. For example, in the snippet of code above we use
Elements to fetch the work experience information, because the work
experience information in a résumé page usually contains more than one
experience. The snippet shows that each specific piece of information is
stored in an Element/Elements variable before it can be saved into a text
file. After we have the information, we need to save it into the text file in
order: first the Name, followed by Domicile/Location, then Work Experience,
Education, and Additional Info last. Work Experience can contain more than
one experience, which means there can be more than one Work Title, Work
Company, and Work Description, so we need a loop to save each work
experience together with its work title, work company, and work description.
The snippet of code below shows the code for saving the Name and Domicile
information in the text file, with each piece of information separated by
whitespace.
if (location != null) {
    line = name.text() + " " + location.text();
    lines += line;
} else {
    line = name.text();
    lines += line;
}
And the snippet of code below shows the scheme for saving the Work
Experience information along with the Work Title, Work Company, and Work
Description.
for (Element workSection : work_section) {
    ws = workSection.text();
    for (Element workTitle : work_title) {
        wt = workTitle.text();
        check = ws.contains(wt);

        if (check) {
            line = " " + wt;
            lines += line;
            isWorkTitleEmpty = false;
            break;
        }
    }

    if (isWorkTitleEmpty) {
    }

    for (Element workCompany : work_company) {
        wc = workCompany.text();
        check = ws.contains(wc);

        if (check) {
            line = " " + wc;
            lines += line;
            break;
        }
    }

    for (Element workDescription : work_description) {
        wd = workDescription.text();
        check = ws.contains(wd);

        if (check) {
            line = " " + wd;
            lines += line;
            break;
        }
    }
}
After the Name, Domicile, and Work Experience information has been saved,
the system saves the Education and Additional Info information. The snippet
of code for saving Education and Additional Info into the text file is as
follows:
if (education != null) {
    line = " " + education.text();
    lines += line;
}

if (add_info != null) {
    line = " " + add_info.text();
    lines += line;
}
At this point, we have all the information that we need for a résumé
document. So what is next? Next, we need to define how to name each
document so that it represents its résumé page on Indeed. Once again, as
explained in the previous chapter, every résumé page on Indeed is identified
with a unique ID, a combination of letters and numbers, so we name each
document with that ID. We need to do URL splitting to extract the unique
ID. The snippet of code below shows the URL splitting scheme:
words = url.split("/");
int length_words = words.length - 1;
word_end = words[length_words].split("\\?");
id = word_end[0];
After we have the name of each file, we finally initiate the text file creation
to save the information into the file, and save the file into the specific
directory. With the createNewFile() function we create the text file, and to
write the content to the file we use a BufferedWriter. The snippet of code
is below:
String path = folderPath + "/" + id;
File file = new File(path);
file.createNewFile();

Writer writer = null;

try {
    writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(path), "utf8"));
    writer.write(lines);
} catch (IOException ex) {
} finally {
    try { writer.close(); } catch (Exception ex) {}
}
4.1.1.3 fetchIndeed method
This method's function is to crawl and fetch every link shown on an Indeed
résumé search result page. It aims to crawl all links that point to résumé
pages and to the next search result page. Where the createFileFromIndeed
method finds information in specific HTML elements, this method collects
links to be given to the createFileFromIndeed method. As already explained,
each résumé search result page on Indeed has HTML elements containing
the links to the résumé pages and to the next page: the app_link and
pagination elements. The overall behaviour of this method is defined in a
while loop with the condition that as long as the number of résumé pages
found has not reached the defined threshold value, the search function
continues to look for résumé page links. The snippet of code is shown below:
while (keepSearching) {
    print("\nFETCHING ALL URLs FROM %s...", url_starter);
The value of url_starter is defined in the lower part of this snippet of code.
url_starter is the combination of www.indeed.com and the link from the
pagination element, which constructs the link used to crawl each search
result page. Inside pagination there is a child node instl with the relation
nofollow; this child node contains the links to the next pages, so we can use
these links to find the specified number of résumé page links. If we want to
get around 200 résumés, that means we need to crawl 20 times, since a
result page on Indeed contains at most 10 résumé page links.
System.out.println("Page " + (page_counter + 1));
Document doc = Jsoup.connect(url_starter).timeout(0).get();
Elements link_resume = doc.select(".app_link");
Elements page_indicator = doc.select("#pagination > .instl ~ [rel=nofollow]");
The for loop below stores all the links into an ArrayList named
forPageIndicator. In each iteration, it checks whether the element containing
the link is already in the ArrayList: if not, the element is stored; if it is, the
loop moves on to the next element. This prevents redundant documents in
the list.
for (Element page : page_indicator) {
    pageAttr = page.attr("href");
    if (forPageIndicator.contains(pageAttr)) {
        continue;
    } else {
        forPageIndicator.add(pageAttr);
    }
}
The snippet of code below fetches every résumé page from the links
constructed by the crawl. In this code, the createFileFromIndeed method is
called to do the document fetching. There is a condition that if the number
of résumés exceeds the limit, the keepSearching value is set to false, which
triggers the while loop to stop.
for (Element link : link_resume) {
    url_resume = "http://www.indeed.com" + link.attr("href");

    if (!fetched_url.contains(url_resume)) {
        System.out.print("\n" + (resume_counter + 1));
        createFileFromIndeed(url_resume, folderPath);
        fetched_url.add(url_resume);
        resume_counter += 1;
    }

    if (resume_counter >= resume_page_threshold) {
        keepSearching = false;
        break;
    }
}

if (page_counter < forPageIndicator.size()) {
    url_starter = url_constant + forPageIndicator.get(page_counter);
} else {
    break;
}

page_counter += 1;
}
4.1.1.4 Print Function method
This print function simplifies the System.out.println function into print, so
that we can use print instead of System.out.println to show output. The
snippet of code is as follows:
private static void print(String msg, Object... args) {
    System.out.println(String.format(msg, args));
}
4.1.1.5 Main method
The fetchIndeed method has the parameters (String rootURL, int treshold,
String folderPath), and it is called and executed in the main method. For
example, fetchIndeed("http://www.indeed.com/resumes/web-developer", 200,
path_directory) starts the crawling process with web-developer as the initial
search, fetches around 200 documents, and saves them as text files to
path_directory.
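A minimal sketch of that main method could be as follows; the directory path here is a hypothetical example:

public static void main(String[] args) throws Exception {
    // start crawling with web-developer as the initial search keyword
    String path_directory = "D:/Crawler_Project/resumes/web-developer";
    fetchIndeed("http://www.indeed.com/resumes/web-developer", 200, path_directory);
}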
4.1.2 Data Preprocessing class
The Data Preprocessing class performs the text preprocessing. Before a
document is processed in machine learning, the document must be cleaned
of stopwords, unimportant characters, and non-letter characters, and
uppercase/capital letters must be converted to lowercase. This reduces the
load of the later process during the learning/training stage. This class uses
the stopwords list file, which can be seen in the Appendix. The Data
Preprocessing class is divided into three (3) methods.
4.1.2.1 initStopWords method
This method stores the stopwords acquired from the stopwords file into an
ArrayList. The ArrayList that stores the stopwords list is used in the
removeNotLettersAndStopWords method. The snippet of code is shown below:
stopWords = new ArrayList<String>();

try {
    FileInputStream stop_file = new FileInputStream(root_folder + "stopwords.txt");
    DataInputStream in = new DataInputStream(stop_file);
    BufferedReader br = new BufferedReader(new InputStreamReader(in, "utf8"));
    String line;
    while ((line = br.readLine()) != null) {
        if (line.charAt(0) == 32) {
        } else {
            stopWords.add(line);
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
4.1.2.2 start method
The start method is the main method of this class: it calls initStopWords
and removeNotLettersAndStopWords, performs the preprocessing of each
document, and stores the preprocessing result into a text file in a different
directory. A sketch of it is shown below.
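A minimal sketch of the start method, under assumed folder parameters (the full listing is in the Appendix), could be:

public static void start(String inputFolder, String outputFolder) throws Exception {
    initStopWords();  // load the stopwords list once
    for (File doc : new File(inputFolder).listFiles()) {
        // read the raw document content
        String raw = new String(Files.readAllBytes(doc.toPath()), "utf8");
        // clean it with the preprocessing described in 4.1.2.3
        String clean = removeNotLettersAndStopWords(raw);
        // store the result into a different directory
        Writer writer = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(new File(outputFolder, doc.getName())), "utf8"));
        writer.write(clean);
        writer.close();
    }
}

(This sketch assumes the java.io classes and java.nio.file.Files are imported.)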
4.1.2.3 removeNotLettersAndStopWords
This method deletes stopwords, non-letter characters, and newline and
whitespace characters, and converts uppercase/capital letters into lowercase.
Converting uppercase into lowercase uses the toLowerCase() function from
the Java library. The snippet of code is shown below:
String lowercase = text.toLowerCase();
String allowed_string = "";

for (int i = 0; i < lowercase.length(); i++) {
    int number_of_string = (int) lowercase.charAt(i);
    if ((number_of_string >= 97 && number_of_string <= 122) || (number_of_string == 32)) {
        allowed_string += lowercase.charAt(i);
    } else {
        allowed_string += " ";
    }
}

String[] words = allowed_string.split("\\s+");
List<String> word_list = new ArrayList<String>();

for (int j = 0; j < words.length; j++) {
    if (!stopWords.contains(words[j]) && words[j].length() > 1) {
        word_list.add(words[j]);
    }
}

String output = word_list.toString();
output = output.replaceAll(",", "");
output = output.replaceAll("\\[|\\]", "");

return output;
4.1.3 Training/Learning class
The Training/Learning class is used to create a classification model from the
training and testing data sets, using the MultiClassClassifier and NaiveBayes
classifiers from the WEKA library. This class imports the weka.classifiers.meta.
MultiClassClassifier package to use the MultiClassClassifier function, which
allows the system to perform the training process on a multi-class data set,
and the weka.classifiers.bayes.NaiveBayes package to use the NaiveBayes
classifier as the base classifier in the training process. The training of the
training data set is done using the Evaluation function from the
weka.classifiers.Evaluation package. This class also uses the Timer class to
estimate the time required for the training process from start to finish. The
class is divided into four (4) methods, as follows:
4.1.3.1 loadDataset method
The loadDataset() method loads the training and testing data sets generated
by the ARFF File Converter class into the system, from the directory path
rootFolder. To load a data set saved in ARFF format, we can use the
DataSource function; according to the WEKA documentation, DataSource is
the recommended way to load an ARFF file. The training and testing data
sets are loaded with the DataSource function from a specified directory path.
DataSource sourceTrain = new DataSource(rootFolder + dataTrainArffName);
DataSource sourceTest = new DataSource(rootFolder + dataTestArffName);
trainData = sourceTrain.getDataSet();
testData = sourceTest.getDataSet();
The training data set is stored in an Instances variable named trainData.
Instances is a WEKA type used for representing a data set; a data set can
be stored in an Instances variable for later processing. Likewise, the testing
data set is stored in an Instances variable named testData.
int cIdxTrain = trainData.numAttributes() - 1;
trainData.setClassIndex(cIdxTrain);
int cIdxTest = testData.numAttributes() - 1;
testData.setClassIndex(cIdxTest);

System.out.println("Data Loaded...");
System.out.println("=== Training data: " + dataTrainArffName);
System.out.println("=== Testing data: " + dataTestArffName);
System.out.println("=== Source location: " + rootFolder);
System.out.println();
Typically, when a data set is loaded from an ARFF file, its class attribute
index changes; therefore, before the data set can be used, the attribute index
of the data set must be set up first. The snippet of code above shows the
setting of the class attribute index of the trainData and testData data sets.
After that, the system displays a message that the two data sets were loaded
successfully.
4.1.3.2 evaluate method
This is the method for the evaluation process (Evaluation phase). As
described in the previous chapter, the training stage has two (2) phases: (1)
the Evaluation phase and (2) the Validation phase. In the snippet of code
below, the system declares the type of classifier used in the evaluation and
validation phases. The classifier used is MultiClassClassifier, imported from
the MultiClassClassifier package of WEKA, and the base classifier used inside
MultiClassClassifier is the NaiveBayes classifier. classifier.setClassifier(
baseClassifier) sets baseClassifier, i.e. the NaiveBayes classifier, to be used
inside MultiClassClassifier. With the buildClassifier method, the system builds
the classifier model with MultiClassClassifier and NaiveBayes as the classifier
and trainData as the training data set.
classifier = new MultiClassClassifier();
NaiveBayes baseClassifier = new NaiveBayes();
classifier.setClassifier(baseClassifier);
classifier.buildClassifier(trainData);
The snippet of code below declares the evaluation process of the training
data set with 10-fold Cross Validation: in the evaluation process, the training
data set is trained with the 10-fold scheme. The result of the evaluation
process is shown using the toSummaryString, toClassDetailsString, and
toMatrixString commands. toSummaryString displays the summary of the
evaluation of the training data set, which includes the accuracy value of the
model; toClassDetailsString displays the detailed accuracy by class; and
toMatrixString displays the Confusion Matrix information of the evaluation
phase.
Evaluation eval = new Evaluation(trainData);
eval.crossValidateModel(classifier, trainData, 10, new Random(1));
System.out.println(eval.toSummaryString());
System.out.println(eval.toClassDetailsString());
System.out.println(eval.toMatrixString());
4.1.3.3 learn method
This is the method for the Validation phase, which tests the accuracy of the
model using the training and testing data sets, to see whether the
classification model has an overfitting problem or not.
Evaluation evaluation = new Evaluation(trainData);
evaluation.evaluateModel(classifier, testData);
System.out.println("=== Summary of Training Test dataset ===");
System.out.println(evaluation.toSummaryString());
System.out.println(evaluation.toClassDetailsString());
System.out.println(evaluation.toMatrixString());
System.out.println("=== Training/testing on testing dataset done ====");
System.out.println();
In the snippet of code above, an evaluation is declared once again. However,
the validation phase uses the evaluateModel function, not crossValidateModel,
because the trained model is tested against the testing data. As before, the
result of this phase is shown using toSummaryString, toClassDetailsString,
and toMatrixString, to provide a comparison of the accuracy rate between
the evaluation phase and the validation phase.
4.1.3.4 saveModel method
ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(fileName));
out.writeObject(classifier);
out.close();
System.out.println("=== Classifier Model has been saved ===");
System.out.println();

learner.saveModel("D:/Crawler_Project/FinalMaintain/specificField/multiClassNvBayesSpecific.model");
As its name suggests, this method is used to save the classification model
obtained from the learn method. But remember, the classification model
must be good enough in terms of accuracy rate: if the model gives barely
50% accuracy, then maybe we should build another model to do the job, and
make sure the classification model does not have an overfitting problem.
The model can be saved using writeObject from ObjectOutputStream, which
writes the classification model into a file. The file format is user-defined,
but in our system we save the model in the *.model file format, so the model
can be easily loaded into the WEKA framework in the Classification class.
4.1.4 Classification class
This class performs the classification process of the system, using the
classification model that was created and stored by the Training/Learning
class. The files loaded in this class are the classification model file and the
unlabeled data set. The class is divided into three (3) methods:
4.1.4.1 loadNewset method
This method loads the unlabeled data set file. The unlabeled data set file is
already in ARFF format and already saved in a specific directory, so with
the DataSource function the system loads the unlabeled data set into the
newTestData Instances variable. But before the system can use the data set,
the class attribute index must be configured again with
newTestData.numAttributes()-1, so that the class attribute is at the right
index again. The snippet of code is below:
DataSource sourceNewset = new DataSource(newDataFolder + newDataArff);
newTestData = sourceNewset.getDataSet();

int cIdxNewset = newTestData.numAttributes() - 1;
newTestData.setClassIndex(cIdxNewset);

System.out.println("\n=== Loading new test data ===");
System.out.println("=== File name: " + newDataArff);
System.out.println("=== From directory: " + newDataFolder);
4.1.4.2 loadModel method
This method loads the classification model that was created and saved by
the Learning class. The classification model is saved with the *.model
extension, and the system needs the model to do the classification, i.e. to
predict the label of the unlabeled data set.
ObjectInputStream in = new ObjectInputStream(new FileInputStream(pathDir));
Object tmp = in.readObject();

classifier = (MultiClassClassifier) tmp;

in.close();
System.out.println("\n=== Loading model ===");
System.out.println("=== Classifier: " + classifier.getClass().getName() + " " + Utils.joinOptions(classifier.getOptions()));
The model is loaded using an ObjectInputStream. As explained before, the
model file contains the classification configuration information: the training
data set used, MultiClassClassifier as the multi-class classifier, and the
NaiveBayes classifier as the base classifier. classifier = (MultiClassClassifier)
tmp defines once again that our classification model uses the MultiClassClassifier
classifier to perform the classification process on a multi-class data set. After
every reading session with ObjectInputStream, we need to close the stream
with .close(). To get the classifier name information from our model, we can
use .getClass().getName() on our classifier, and to get the options that were
applied to our classification model, we can use .getOptions().
4.1.4.3 classifyDocument method
This method is the main function of the Classification class; in it, the system
classifies the unlabeled data set. The snippet of code below shows the
function that prints (once again) our classifier name and the options of our
classification model:
System.out.println("=== Setup ===");
System.out.println("=== Classifier: " + classifier.getClass().getName() + " " + Utils.joinOptions(classifier.getOptions()));
System.out.println("=== Dataset Relation: " + newTestData.relationName());
newTestData.relationName() gives the relation name of an Instances object;
in this snippet, the newTestData instance is our unlabeled data set. Before
doing the classification process, we first show the data contained in our
unlabeled data set; the data in the data set has already been converted into
word vector form. To show the data, we can use the snippet of code below,
looping over all the data. We can always refer to a data row in a data set
as an instance, and to show the instance at a given index we use
newTestData.instance(a).
for (int a = 0; a < newTestData.numInstances(); a++) {
    System.out.print((a + 1) + ". ");
    System.out.println(newTestData.instance(a));
}
The classification process is done using .classifyInstance() with our
classification model, applied to each instance contained in the unlabeled data
set. The snippet of code below shows the label prediction of each instance
of the unlabeled data set.
System.out.println("# actual predicted");
for (int i = 0; i < newTestData.numInstances(); i++) {
    double pred = classifier.classifyInstance(newTestData.instance(i));

    System.out.print((i + 1));
    System.out.print(" ");

    System.out.print(" " + newTestData.instance(i).toString(newTestData.classIndex()) + " ");

    System.out.print(" ");
    System.out.print(" " + newTestData.classAttribute().value((int) pred));
    // output message translated from the Indonesian "diprediksi sebagai kelas"
    System.out.println("File " + (i + 1) + ". predicted as class " + newTestData.classAttribute().value((int) pred));
    System.out.println();
}
For each instance contained in the unlabeled data set (newTestData), the
system classifies/predicts the label that seems to suit the instance. The
system can do the prediction based on the word occurrence pattern of our
classification model. The system shows the actual class, which is unknown
(?), and the predicted class, so we can evaluate whether the predicted class
shown is right or not.
4.1.5 ARFF File Converter class
This class prepares every data set, converting text files into the ARFF file
format. It contains four (4) methods, as follows:
4.1.5.1 convertText2Arff method
The convertText2Arff method converts the résumé documents obtained from
the Data Crawling process into the ARFF file format, with the
TextDirectoryLoader function from WEKA. All the résumé documents are
converted into a single ARFF file, with the name of the job category folder
as the label of each document. The system saves the data into the ARFF
file using the ArffSaver function from WEKA. The snippet of code is as
follows:
TextDirectoryLoader dirLoader = new TextDirectoryLoader();
dirLoader.setDirectory(new File(resumeFolder));
dataRaw = dirLoader.getDataSet();

dataRawSaver = new ArffSaver();
dataRawSaver.setInstances(dataRaw);
dataRawSaver.setFile(new File(rootFolder + dataRawArffName));
dataRawSaver.writeBatch();

System.out.println(linePrint);
System.out.println("=== Data resume has already converted into " + "ARFF files... ===");
System.out.println("=== Dataset source: " + resumeFolder + " ===");
System.out.println("=== With name: " + dataRawArffName + " ===");
System.out.println("=== Location: " + rootFolder + " ===");
System.out.println(linePrint);
4.1.5.2 checkNLoadFile method
This method converts unlabeled documents into an ARFF file. The unlabeled
documents are needed for the testing process of the classification system.
All unlabeled documents should be stored together in one folder, and the
system then reads the contents of the files one by one with the snippet of
code below:
File f = new File(newDataFolder);
File[] files = f.listFiles();
String[] text = new String[files.length];
String line;
Converting unlabeled documents into an ARFF file cannot be done with the
TextDirectoryLoader function, but we can build a new ARFF data set
manually. The system must declare an ARFF data set by declaring its
attributes and instances. One attribute holds the labels/classes of the
documents; in our case, the number of class values is five (5). The declaration
of attributes can be done using the FastVector function from WEKA. For a
set of documents, each document is converted into one instance of the data
set. Once all documents have been added to the data set, the ARFF file can
be written with the ArffSaver function. The snippet of code below shows the
declaration of our new data set, with five class values and the text as the
instance content.
FastVector classVal = new FastVector(5);
classVal.addElement("class1");
classVal.addElement("class2");
classVal.addElement("class3");
classVal.addElement("class4");
classVal.addElement("class5");

Attribute attribute1 = new Attribute("text", (FastVector) null);
Attribute attribute2 = new Attribute("@@class@@", classVal);

FastVector fvWEKAAttributes = new FastVector(2);
fvWEKAAttributes.addElement(attribute1);
fvWEKAAttributes.addElement(attribute2);

newData = new Instances("new_data", fvWEKAAttributes, 0);
newData.setClassIndex(newData.numAttributes() - 1);

DenseInstance instance = new DenseInstance(newData.numAttributes());

for (int i = 0; i < files.length; i++) {
    BufferedReader fileReader = new BufferedReader(new FileReader(newDataFolder + "/" + files[i].getName()));
    while ((line = fileReader.readLine()) != null) {
        line = line;
        text[i] = line;
    }
    instance.setValue(attribute1, text[i]);
    newData.add(instance);
    fileReader.close();
}
4.1.5.3 filterFunction method
At this stage, we have obtained the labeled and unlabeled data sets in the
ARFF file format. But the instances in the ARFF file are still in string form,
and we need to convert this string type into the Word Vector type so that
the classifier algorithm can be applied; the StringToWordVector function from
WEKA converts the string type into the word vector type. First we apply
Randomize to the data set: the Randomize function shuffles the data set, so
that dividing the labeled data set into training and testing data sets is fair.
The snippet of code below shows the Randomize function. The random seed
number just initializes the random generator; we do not bother with it and
follow the default option, 42.
Randomize rand = new Randomize();
rand.setRandomSeed(42);
rand.setInputFormat(instances);
instances = rand.useFilter(instances, rand);
The StringToWordVector function can be configured with several options that
help make the data set better. In our case, we use outputWordCounts (to
make the instances show the number of word occurrences in the data set),
attributeIndices set to first-last, and a tokenizer with NGram (to split
instances into n-gram words). And as explained before regarding Batch
Filtering, we apply the same filter to both the unlabeled and the labeled data
set; at this point we already have the unlabeled documents for testing the
classification model. The snippet of code is shown as follows:
StringToWordVector stringFilter = new StringToWordVector();
stringFilter.setOutputWordCounts(true);
stringFilter.setAttributeIndices("first-last");
stringFilter.setTokenizer(new NGramTokenizer());
stringFilter.setWordsToKeep(1000);
stringFilter.setInputFormat(instances);
Instances newTrain = Filter.useFilter(instances, stringFilter);
Instances newTest = Filter.useFilter(newData, stringFilter);
The only problem after applying StringToWordVector to our data set is that
the class attribute index changes to 0, the first index, while the classification
system only accepts a data set with the class attribute at the last index.
That is why we need to apply one more filter, the Reorder function. Setting
its attributeIndices parameter to 2-last,1 moves the class attribute to the last
index. The snippet of code is shown below:
Reorder order = new Reorder();
order.setAttributeIndices("2-last,1");
order.setInputFormat(newTrain);
Instances newerTrain = Filter.useFilter(newTrain, order);
Instances newerTest = Filter.useFilter(newTest, order);
4.1.5.4 prepareDataTrainTest method
The prepareDataTrainTest method divides the labeled data set into two parts,
the training and the testing data set. The distribution scheme uses the
RemovePercentage function, with the parameters setPercentage (percent) and
setInvertSelection (true or false). Suppose we want to divide the training and
testing data sets with a 75%/25% ratio, 75% for training and 25% for testing:
we then set the setPercentage value to 75 for both, with setInvertSelection
set to true for the training data. RemovePercentage removes the given
percentage of the data set, but if setInvertSelection is set to true, it removes
the inverse of the percentage instead; the inverse of 75% is 25%. That is
why for the testing data we set setInvertSelection to false. After dividing the
data set into the training and testing data sets, the system saves both data
sets into new ARFF files. The snippet of code is shown below:
RemovePercentage divideTrainData = new RemovePercentage();
divideTrainData.setPercentage(75);
divideTrainData.setInvertSelection(true);
divideTrainData.setInputFormat(trainData);
trainData = divideTrainData.useFilter(trainData, divideTrainData);
dataTrainSaver = new ArffSaver();
dataTrainSaver.setInstances(trainData);
dataTrainSaver.setFile(new File(rootFolder + dataTrainArffName));
dataTrainSaver.writeBatch();

RemovePercentage divideTestData = new RemovePercentage();
divideTestData.setPercentage(75);
divideTestData.setInvertSelection(false);
divideTestData.setInputFormat(testData);
testData = divideTestData.useFilter(testData, divideTestData);
dataTestSaver = new ArffSaver();
dataTestSaver.setInstances(testData);
dataTestSaver.setFile(new File(rootFolder + dataTestArffName));
dataTestSaver.writeBatch();
4.1.6 Timer class
The Timer class is just an additional function intentionally added to the
system. It serves to report an estimate of the time a process takes from
start to finish; in the learning, model creation, and classification processes
it is important to know how long a process takes. The Timer class is
declared as the Stopwatch class in the system. Here is the snippet of code
of the Timer class:
public class Stopwatch {
    private long startTime = 0;
    private long stopTime = 0;
    private boolean running = false;

    public void start() {
        this.startTime = System.currentTimeMillis();
        this.running = true;
    }

    public void stop() {
        this.stopTime = System.currentTimeMillis();
        this.running = false;
    }

    public long getElapsedTime() {
        long elapsed;
        if (running) {
            elapsed = (System.currentTimeMillis() - startTime);
        } else {
            elapsed = (stopTime - startTime);
        }
        return elapsed;
    }

    public long getElapsedTimeSecs() {
        long elapsed;
        if (running) {
            elapsed = ((System.currentTimeMillis() - startTime) / 1000);
        } else {
            elapsed = ((stopTime - startTime) / 1000);
        }
        return elapsed;
    }
}
Timer class is declared as a public class that can be called/imported by other classes. The start() method marks the beginning of a timed process and stop() marks its end. To measure a piece of code, start() should be placed before the code in question and stop() should be placed after it. Internally, the class uses System.currentTimeMillis(), and the elapsed time can be read in milliseconds or seconds. The Timer class is used in the Training and Classification classes, whose processes take quite a long time.
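For illustration, here is a minimal usage sketch; the Thread.sleep() call merely stands in for a long-running training or classification step:

public class StopwatchDemo {
    public static void main(String[] args) throws InterruptedException {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.start();    // placed before the code to be timed
        Thread.sleep(1500);   // stands in for a long training/classification step
        stopwatch.stop();     // placed after the code
        System.out.println("Elapsed time in milliseconds: " + stopwatch.getElapsedTime());
        System.out.println("Elapsed time in seconds: " + stopwatch.getElapsedTimeSecs());
    }
}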
4.2 Testing Result
This internet job résumé document classification system has been tested according to the testing scenario in subsection 3.3.2. The objectives of the testing are to obtain:
1. A report that shows how the data crawling process went
2. A report that shows how the classification process went
3. A result table that shows the accuracy rate of each classification model
4. A result table that shows the class prediction results
The objects of testing are job résumé documents obtained from Indeed.com. The objects are divided into two groups, Atypical types of job and Typical types of job. Each group consists of five (5) types of job that serve as the class/label of the documents. Refer to Table 3.1 for the types of job in each group.
Testing is divided into three (3) phases, based on the size of the data set, to compare the accuracy rates obtained from different amounts of data. The data set sizes are:
1. 1000 documents
2. 2500 documents
3. 5000 documents
For each phase, we test the system to see the accuracy rate at the Training stage for the Atypical and Typical groups (with and without a proportion of training and testing data), and we test each classification model by predicting unlabeled résumé documents.
4.2.1 Report of Data Crawling process
We tested the data crawling process three times. First, we tested crawling 200 documents for each class, and the process worked well, without problems. Next, we tested crawling 500 documents for each class, and again without problems we got 500 documents for each class. Finally, we tested crawling 1000 documents for each class, but at the end of the process we did not get 1000 documents, only around 850. We analyzed the problem and found the reason: every time the system crawls a link list page of Indeed, Indeed returns the résumé page links in random order, and there is a chance that Indeed serves a résumé page link that was already given previously. On the other hand, our threshold only counts the number of résumé pages encountered, even if a page is the same as a previous one, and when the threshold value is reached, the crawling process stops. As already explained before, the maximum number of link list pages provided by Indeed is only 99, never a full 100 pages. So to get the remaining documents, we have to start the crawling process once again until we reach 1000 documents.

The data preprocessing process worked well and without problems, but we found that the results of our classification system might be better if we added a stemming process to reduce the number of word features.
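The duplicate links suggest a simple remedy: count a fetched page against the threshold only if its URL has not been seen before. A minimal sketch of this idea (the class and method names here are illustrative, not the actual crawler code):

import java.util.HashSet;
import java.util.Set;

public class DedupCounter {
    private final Set<String> seen = new HashSet<String>();
    private int fetched = 0;

    // Returns true while the crawl should keep fetching more pages.
    public boolean offer(String resumeUrl, int threshold) {
        if (seen.add(resumeUrl)) {  // add() returns false for a repeated link
            fetched++;              // only brand-new pages count toward the threshold
        }
        return fetched < threshold;
    }
}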
Figure 4.1 Output that shows the start of the crawling process

Figure 4.2 Output that shows the end of the crawling process; the output reports the elapsed time of the crawling process for 200 documents per class

Figure 4.3 Output that shows that data preprocessing is done; the output reports the elapsed time of data preprocessing
4.2.2 Report of Document Classification process
We have tested the ARFF file conversion, training, and classification processes without flaws. The output ARFF files for the raw, training, testing, and even unlabeled data sets were produced as we expected. The training and classification processes worked well, apart from the poor classification results (both accuracy rate and prediction results), which are entirely caused by a poor data set or an unsuitable classifier. The accuracy rate for each model comes from the Correctly Classified Instances value in the output.
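For reference, a minimal sketch of how this value can be read out with Weka's Evaluation API; the file names are illustrative, and the plain NaiveBayes classifier stands in for the MultiClassClassifier setup used in the system:

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AccuracySketch {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("dataResume.train.arff").getDataSet();
        Instances test = new DataSource("dataResume.test.arff").getDataSet();
        train.setClassIndex(0);  // the class attribute was reordered to the lowest index
        test.setClassIndex(0);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(nb, test);
        // pctCorrect() corresponds to the Correctly Classified Instances percentage
        System.out.println("Accuracy: " + eval.pctCorrect() + " %");
    }
}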
4.2.3 Result of Atypical Types of Job
4.2.3.1 With Proportion of Training and Testing data set
1. 1000 documents
Here are the results of the classification process. The Atypical group with 1000 documents has its best accuracy rate, around 64%, when 75% of the data set is given as training data, or around 750 of 1000 documents. Comparing the accuracy rates at the evaluation and validation phases, the two values are very close, 64.13% and 64.4%, so we can assume that this classification model would not break down on unlabeled data; no overfitting problem occurs. Although the rate is above 50%, it is still lower than 70%; it is not a very good model, but we can try to use it to classify the unlabeled documents and see the prediction results.

From the prediction results, we can see that, surprisingly, the model with the 50% proportion predicted all the unlabeled documents correctly, while the 75% proportion could only predict 8 out of 10. The 6th document was predicted as Accounting instead of Manager, because the résumé document contains many accounting terms even though the résumé major is manager.
Table 4.1 Result Accuracy Rate of Atypical group with 1000 documents

Atypical Types of Job - 1000 documents
Training set (%) | Testing set (%) | Accuracy at Evaluation Phase (%) | Accuracy at Validation Phase (%) | Estimation Time (s)
90 | 10 | 63.33 | 64    | 44
75 | 25 | 64.13 | 64.4  | 28
50 | 50 | 62    | 61.2  | 22
25 | 75 | 60    | 62.4  | 15
10 | 90 | 56    | 58    | 12
Table 4.2 Result Class Prediction of Atypical group with 1000 documents

Atypical Types of Job - 1000 documents (predicted as, with % of training data)
No | Real Class  | 90%        | 75%        | 50%        | 25%        | 10%
1  | Accounting  | Accounting | Accounting | Accounting | Accounting | Accounting
2  | Accounting  | Accounting | Accounting | Accounting | Accounting | Accounting
3  | ComputerIT  | ComputerIT | ComputerIT | ComputerIT | ComputerIT | ComputerIT
4  | ComputerIT  | ComputerIT | ComputerIT | ComputerIT | ComputerIT | ComputerIT
5  | Manager     | Manager    | Manager    | Manager    | Accounting | Accounting
6  | Manager     | ComputerIT | Accounting | Manager    | Manager    | Manager
7  | Medical     | Medical    | Medical    | Medical    | Medical    | Medical
8  | Medical     | Medical    | Medical    | Medical    | Medical    | Medical
9  | Technician  | Manager    | Technician | Technician | Accounting | ComputerIT
10 | Technician  | Medical    | Medical    | Technician | Technician | Medical
Total Predicted Correctly:  | 7 | 8 | 10 | 8 | 7
2. 2500 documents
The Atypical group with 2500 documents has its best accuracy rate, around 67%, when 75% of the data set is given, or around 1875 of 2500 documents. Comparing the accuracy rates at the evaluation and validation phases, the values are separated by about 3%, but they are still close enough that we can say no overfitting problem occurs. The rate is close to 70%, higher than the model with 1000 documents.

From the prediction results, the outcome is worse than the model with 1000 documents; the best prediction result comes from the model with the 50% proportion, which predicts 7 out of 10 correctly, followed by the 25% proportion with 6 out of 10.
Table 4.3 Result Accuracy Rate of Atypical group with 2500 documents

Atypical Types of Job - 2500 documents
Training set (%) | Testing set (%) | Accuracy at Evaluation Phase (%) | Accuracy at Validation Phase (%) | Estimation Time (s)
90 | 10 | 65.87 | 66    | 91
75 | 25 | 67.09 | 64    | 75
50 | 50 | 61.68 | 65.2  | 52
25 | 75 | 65.23 | 65.23 | 51
10 | 90 | 58.93 | 58.93 | 27
Table 4.4 Result Class Prediction of Atypical group with 2500 documents

Atypical Types of Job - 2500 documents (predicted as, with % of training data)
No | Real Class  | 90%        | 75%        | 50%        | 25%        | 10%
1  | Accounting  | Accounting | Accounting | Accounting | Accounting | Accounting
2  | Accounting  | Accounting | Accounting | Accounting | Accounting | Accounting
3  | ComputerIT  | Accounting | Accounting | ComputerIT | Accounting | Accounting
4  | ComputerIT  | ComputerIT | ComputerIT | Accounting | Accounting | Accounting
5  | Manager     | Accounting | Accounting | Manager    | Manager    | Accounting
6  | Manager     | ComputerIT | Accounting | Manager    | Manager    | Manager
7  | Medical     | Medical    | Medical    | Medical    | Medical    | Accounting
8  | Medical     | Medical    | Medical    | Medical    | Medical    | Medical
9  | Technician  | Manager    | Manager    | Manager    | Manager    | Technician
10 | Technician  | Medical    | Medical    | Medical    | Medical    | Medical
Total Predicted Correctly:  | 5 | 5 | 7 | 6 | 5
3. 5000 documents
The Atypical group with 5000 documents has its best accuracy rate, around 67.96%, when 50% of the data set is given, or around 2500 of 5000 documents. This rate is around 68%, which is reasonably high. The accuracy rates at the evaluation and validation phases are very close, so again no overfitting problem occurs for this model. There are two best candidate models here: the 75% and 50% proportions.

From the prediction results, the 75% and 50% proportions perform well, as both predict 9 out of 10 correctly. The error is the same as in the model with 1000 documents, namely the 6th document: even though its major is manager, it contains many accounting terms. So this model would be good enough to use for classification, certainly better than the models with only 1000 and 2500 documents, but considering that the accuracy rate is only around 68%, we cannot fully depend on it.
Table 4.5 Result Accuracy Rate of Atypical group with 5000 documents

Atypical Types of Job - 5000 documents
Training set (%) | Testing set (%) | Accuracy at Evaluation Phase (%) | Accuracy at Validation Phase (%) | Estimation Time (s)
90 | 10 | 66.42 | 68    | 217
75 | 25 | 67.07 | 67.68 | 170
50 | 50 | 67.96 | 65.92 | 102
25 | 75 | 67.04 | 64.11 | 59
10 | 90 | 66.2  | 66.07 | 41
Table 4.6 Result Class Prediction of Atypical group with 5000 documents

Atypical Types of Job - 5000 documents (predicted as, with % of training data)
No | Real Class  | 90%        | 75%        | 50%        | 25%        | 10%
1  | Accounting  | Accounting | Accounting | Accounting | Accounting | Accounting
2  | Accounting  | Accounting | Accounting | Accounting | Accounting | Accounting
3  | ComputerIT  | ComputerIT | ComputerIT | ComputerIT | ComputerIT | ComputerIT
4  | ComputerIT  | ComputerIT | ComputerIT | ComputerIT | ComputerIT | ComputerIT
5  | Manager     | Manager    | Manager    | Manager    | Manager    | Accounting
6  | Manager     | ComputerIT | Accounting | Accounting | Accounting | Accounting
7  | Medical     | Medical    | Medical    | Medical    | Medical    | Medical
8  | Medical     | Medical    | Medical    | Medical    | Medical    | Medical
9  | Technician  | Manager    | Technician | Technician | Manager    | Technician
10 | Technician  | Technician | Technician | Technician | Technician | Technician
Total Predicted Correctly:  | 8 | 9 | 9 | 8 | 8
4.2.3.2 Without Proportion of Training and Testing data set (100% data set)
Here are the results of the classification process with the whole data set. This test differs from the previous one: the training stage uses the data set without a training/testing proportion, so the classification model is created from 100% of the data set. We still use a testing data set (10% of the data set) for the validation phase, to check whether the model has an overfitting problem. Table 4.7 shows the accuracy rate results. The accuracy rates of these models (with 100% of the data set) are not far from the accuracy rates we got from the models with a training/testing proportion, but we can see here that the accuracy rate increases as the amount of data increases. Unlike the models with 2500 and 5000 documents, the accuracy rates at the evaluation and validation phases for the 1000-document model are separated by around 10%, but this can still indicate that the model does not have an overfitting problem.
Table 4.7 Result Accuracy Rate of Atypical group with whole data set

Atypical Types of Job
Total Documents (data set) | Accuracy at Evaluation Phase (%) | Accuracy at Validation Phase (%) | Estimation Time (s)
1000 | 62.6  | 72   | 40
2500 | 66.24 | 66.8 | 129
5000 | 67.3  | 69.6 | 354
From the prediction results in Table 4.8, the models with 2500 and 5000 documents predict 9 out of 10 correctly. The error is in the 6th document, which is always predicted as ComputerIT even though its class is Manager. A similar error occurred in the previous test with a proportioned data set: the document contains many ComputerIT terms, even though the major is Manager.
Table 4.8 Result Class Prediction of Atypical group with whole data set

Atypical Types of Job (predicted as, by total amount of documents)
No | Real Class  | 1000       | 2500       | 5000
1  | Accounting  | Accounting | Accounting | Accounting
2  | Accounting  | Accounting | Accounting | Accounting
3  | ComputerIT  | ComputerIT | ComputerIT | ComputerIT
4  | ComputerIT  | ComputerIT | ComputerIT | ComputerIT
5  | Manager     | Manager    | Manager    | Manager
6  | Manager     | ComputerIT | ComputerIT | ComputerIT
7  | Medical     | Medical    | Medical    | Medical
8  | Medical     | Medical    | Medical    | Medical
9  | Technician  | Manager    | Technician | Technician
10 | Technician  | Manager    | Technician | Technician
Total Predicted Correctly:  | 7 | 9 | 9
Although the total amount of documents was increased across the tests, the accuracy rate of these Atypical group models could not reach 70%. The highest accuracy rate is 67.3%, obtained when 5000 documents are given (1000 documents per class). So the results of the two methods (with and without a proportion of training and testing data) confirm that we cannot fully depend on this Atypical group model, since its accuracy rate is not very good.
4.2.4 Result of Typical Types of Job
4.2.4.1 With Proportion of Training and Testing data set
1. 1000 documents
Here are the results of the classification process for the Typical types of job. The Typical group with 1000 documents has its best accuracy rate, around 46%, when 90% of the data set is given, or around 900 of 1000 documents. The accuracy rate is bad, because it is below 50%. Comparing the accuracy rates at the evaluation and validation phases, the values are separated by about 5%; that is not too high, but we can say there is a chance (even if a small one) that the model would break down on unlabeled data and give prediction errors.

From the prediction results, the model with the 90% proportion could only predict 5 out of 10, about 50:50; it is a bad model, and it is not wise to use it for classification.
Table 4.9 Result Accuracy Rate of Typical group with 1000 documents

Typical Types of Job - 1000 documents
Training set (%) | Testing set (%) | Accuracy at Evaluation Phase (%) | Accuracy at Validation Phase (%) | Estimation Time (s)
90 | 10 | 46.44 | 51   | 45
75 | 25 | 45.6  | 47.6 | 31
50 | 50 | 43    | 48.6 | 22
25 | 75 | 37.6  | 48.2 | 18
10 | 90 | 35    | 42.6 | 15
Table 4.10 Result Class Prediction of Typical group with 1000 documents

Typical Types of Job - 1000 documents (predicted as, with % of training data)
No | Real Class    | 90%           | 75%           | 50%           | 25%           | 10%
1  | Web Analyst   | Web Designer  | Web Designer  | Web Developer | Web Developer | Web Designer
2  | Web Analyst   | Web Analyst   | Web Analyst   | Web Analyst   | Web Analyst   | Web Master
3  | Web Designer  | Web Designer  | Web Designer  | Web Designer  | Web Designer  | Web Designer
4  | Web Designer  | Web Designer  | Web Analyst   | Web Analyst   | Web Analyst   | Web Designer
5  | Web Developer | Web Designer  | Web Designer  | Web Designer  | Web Designer  | Web Designer
6  | Web Developer | Web Developer | Web Developer | Web Developer | Web Analyst   | Web Master
7  | Web Master    | Web Master    | Web Master    | Web Designer  | Web Master    | Web Analyst
8  | Web Master    | Web Designer  | Web Designer  | Web Developer | Web Developer | Web Developer
9  | Web Writer    | Web Designer  | Web Designer  | Web Designer  | Web Analyst   | Web Analyst
10 | Web Writer    | Web Designer  | Web Designer  | Web Writer    | Web Writer    | Web Writer
Total Predicted Correctly:  | 5 | 4 | 4 | 5 | 3
2. 2500 documents
From the previous model with 1000 documents, the accuracy rate is bad. The Typical group with 2500 documents has its best accuracy rate, around 49.6%, when only 10% of the data set is given, or around 250 of 2500 documents. The second best is around 48%, when 50% of the data set is given, or around 1250 of 2500 documents. Considering these rates, it seems the amount of data does not affect the accuracy rate. Comparing the accuracy rates at the evaluation and validation phases, we can say that neither of these models has an overfitting problem.

From the prediction results, the model with the 90% proportion predicts better than the models with the 10% and 50% proportions. Even though the model with the 90% proportion has the lowest accuracy, only about 45%, it predicts 6 out of 10 correctly. But considering that the accuracy rate is below 50%, it is not wise to use this model.
Table 4.11 Result Accuracy Rate of Typical group with 2500 documents

Typical Types of Job - 2500 documents
Training set (%) | Testing set (%) | Accuracy at Evaluation Phase (%) | Accuracy at Validation Phase (%) | Estimation Time (s)
90 | 10 | 45.73 | 46.8  | 218
75 | 25 | 47.79 | 44.48 | 98
50 | 50 | 48.08 | 48.32 | 63
25 | 75 | 46.88 | 45.65 | 37
10 | 90 | 49.6  | 46.49 | 26
Table 4.12 Result Class Prediction of Typical group with 2500 documents

Typical Types of Job - 2500 documents (predicted as, with % of training data)
No | Real Class    | 90%           | 75%           | 50%           | 25%           | 10%
1  | Web Analyst   | Web Designer  | Web Designer  | Web Designer  | Web Designer  | Web Designer
2  | Web Analyst   | Web Analyst   | Web Analyst   | Web Analyst   | Web Analyst   | Web Analyst
3  | Web Designer  | Web Designer  | Web Designer  | Web Designer  | Web Designer  | Web Designer
4  | Web Designer  | Web Designer  | Web Designer  | Web Designer  | Web Writer    | Web Analyst
5  | Web Developer | Web Designer  | Web Designer  | Web Designer  | Web Designer  | Web Designer
6  | Web Developer | Web Developer | Web Developer | Web Developer | Web Developer | Web Developer
7  | Web Master    | Web Writer    | Web Writer    | Web Writer    | Web Writer    | Web Writer
8  | Web Master    | Web Master    | Web Master    | Web Master    | Web Designer  | Web Master
9  | Web Writer    | Web Writer    | Web Designer  | Web Designer  | Web Designer  | Web Designer
10 | Web Writer    | Web Designer  | Web Designer  | Web Designer  | Web Designer  | Web Designer
Total Predicted Correctly:  | 6 | 5 | 5 | 3 | 4
3. 5000 documents
So far, the Typical group has not given a good model with a high enough accuracy rate and good prediction results. The Typical group with 5000 documents has its best accuracy rate, around 48%, when only 25% of the data set is given, or around 1250 of 5000 documents. The accuracy rate is barely 50%, so we can say this is a bad model. Even with 5000 documents, there is no model that reaches 50% accuracy.

From the prediction results, the model with the 25% proportion could only predict 4 out of 10. The model with the 10% proportion actually predicts a little better, 6 out of 10 (see Table 4.14). After all, the model created from the Typical group documents is a bad model and cannot be used for classification, even with a large amount of data. The cause is clear: the data set is not good. The types of job in the Typical group are similar to each other, belonging to the same job field, so we cannot rule out the possibility that the same word features occur in documents of different classes. Since word features are the main factor in a document's pattern, the same words occurring with high frequency in documents of different classes bias the pattern of each document class.
Table 4.13 Result Accuracy Rate of Typical group with 5000 documents

Typical Types of Job - 5000 documents
Training set (%) | Testing set (%) | Accuracy at Evaluation Phase (%) | Accuracy at Validation Phase (%) | Estimation Time (s)
90 | 10 | 44.64 | 42.8  | 328
75 | 25 | 44.82 | 42.96 | 281
50 | 50 | 44.8  | 44.72 | 207
25 | 75 | 47.92 | 47.01 | 120
10 | 90 | 46.6  | 45.38 | 53
Table 4.14 Result Class Prediction of Typical group with 5000 documents

Typical Types of Job - 5000 documents (predicted as, with % of training data)
No | Real Class    | 90%           | 75%           | 50%           | 25%           | 10%
1  | Web Analyst   | Web Writer    | Web Designer  | Web Designer  | Web Designer  | Web Writer
2  | Web Analyst   | Web Analyst   | Web Analyst   | Web Analyst   | Web Analyst   | Web Analyst
3  | Web Designer  | Web Writer    | Web Designer  | Web Writer    | Web Master    | Web Designer
4  | Web Designer  | Web Designer  | Web Designer  | Web Designer  | Web Designer  | Web Designer
5  | Web Developer | Web Designer  | Web Designer  | Web Designer  | Web Designer  | Web Designer
6  | Web Developer | Web Developer | Web Analyst   | Web Developer | Web Developer | Web Analyst
7  | Web Master    | Web Writer    | Web Writer    | Web Writer    | Web Writer    | Web Master
8  | Web Master    | Web Designer  | Web Designer  | Web Designer  | Web Master    | Web Master
9  | Web Writer    | Web Designer  | Web Designer  | Web Designer  | Web Designer  | Web Writer
10 | Web Writer    | Web Designer  | Web Designer  | Web Designer  | Web Designer  | Web Designer
Total Predicted Correctly:  | 3 | 3 | 3 | 4 | 6
4.2.4.2 Without Proportion of Training and Testing data set (100% data set)
Here are the results of the classification process with the whole data set. As before, this test differs from the proportioned test: the training stage uses the data set without a training/testing proportion, so the classification model is created from 100% of the data set. We again use a testing data set (10% of the data set) for the validation phase, to check whether the model has an overfitting problem. Table 4.15 shows the accuracy rate results. The accuracy rates of these models (with 100% of the data set) are not far from the accuracy rates of the models with a training/testing proportion, but an interesting fact here is that the model with 5000 documents has a lower accuracy than the models with 1000 and 2500 documents. Once again, even when we use the whole data set to build the model, the accuracy rate of the Typical group model is very bad. We could only get about 45% accuracy when 5000 documents are provided.
Table 4.15 Result Accuracy Rate of Typical group with whole data set

Typical Types of Job
Total Documents (data set) | Accuracy at Evaluation Phase (%) | Accuracy at Validation Phase (%) | Estimation Time (s)
1000 | 45.2  | 47   | 53
2500 | 45.6  | 51.6 | 154
5000 | 44.84 | 45.6 | 359
From the prediction results in Table 4.16, the models predict at best about 50:50 of the 10 documents: the models with 1000 and 2500 documents predict only 5 out of 10, and the model with 5000 documents is even worse, 3 out of 10.
Table 4.16 Result Class Prediction of Typical group with whole data set

Typical Types of Job (predicted as, by total amount of documents)
No | Real Class    | 1000          | 2500          | 5000
1  | Web Analyst   | Web Designer  | Web Designer  | Web Designer
2  | Web Analyst   | Web Analyst   | Web Analyst   | Web Analyst
3  | Web Designer  | Web Writer    | Web Designer  | Web Writer
4  | Web Designer  | Web Designer  | Web Designer  | Web Designer
5  | Web Developer | Web Designer  | Web Designer  | Web Designer
6  | Web Developer | Web Developer | Web Developer | Web Developer
7  | Web Master    | Web Master    | Web Writer    | Web Writer
8  | Web Master    | Web Master    | Web Master    | Web Designer
9  | Web Writer    | Web Designer  | Web Designer  | Web Designer
10 | Web Writer    | Web Designer  | Web Designer  | Web Designer
Total Predicted Correctly:  | 5 | 5 | 3
Although the total amount of documents was increased relative to earlier testing, the results show that the Typical group models' accuracy rate stays below 50%. Once again, the model created from the Typical group documents is a bad model and cannot be used for classification, even with a large amount of data, and the cause is the same: the types of job in the Typical group belong to the same job field, so the same word features occur across documents of different classes, and since word features are the main factor in a document's pattern, they bias the pattern of each document class.
4.2.5 Comparison of Atypical and Typical types of job
From the testing results, with and without a proportion of training and testing data, no overfitting problem occurs in either case. So we can conclude that the cross-validation technique trains and builds the model without overfitting, and we do not have to divide the data set into training and testing sets by percentage.
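A minimal sketch of this cross-validation approach with Weka (10 folds; the file name is illustrative, and the plain NaiveBayes classifier stands in for the system's full setup):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("dataResume.processed.arff").getDataSet();
        data.setClassIndex(0);  // the class attribute sits at the lowest index

        // Train and evaluate on the whole data set via 10-fold cross-validation,
        // instead of a fixed percentage split into training and testing sets.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(42));
        System.out.println(eval.toSummaryString());
    }
}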
The classification model built from the Atypical data set is better in accuracy rate and prediction rate than the one built from the Typical data set. The types of job in the Typical group data set are similar to each other, belonging to the same job field, so we cannot rule out the possibility that the same word features occur in documents of different classes. Since word features are the main factor in a document's pattern, the same words occurring with high frequency in documents of different classes bias the pattern of each class and cause prediction errors. But even the Atypical data set only reaches 68% as its highest accuracy rate, with 5000 documents, and cannot reach or pass the 70% level. So we can assume there is also a risk in using the classification model built from the Atypical data set: we would always worry about the prediction results, because there is roughly a 30% chance that a prediction is wrong.
Table 4.17 Comparison Table between Atypical and Typical types of job

Comparison between Atypical and Typical types of job
Amount of documents:           | 1000 | 2500  | 5000
Accuracy Rate (%), Atypical    | 62.6 | 66.24 | 67.3
Accuracy Rate (%), Typical     | 45.2 | 45.6  | 44.84
Prediction (n of 10), Atypical | 7    | 9     | 9
Prediction (n of 10), Typical  | 5    | 5     | 3
Chapter 5
Concluding Remarks
5.1 Conclusion
The conclusions we draw from this thesis are as follows:
1. The Internet Résumé Document Classification System has been implemented, with the following features: a data crawling process to get résumé documents from an online résumé directory, and the Naïve Bayes algorithm as the classifier to train on the training data set, build the classification model, and perform classification. The data crawling process against the online résumé directory Indeed works, but fetching a vast amount of documents is still unreliable. This is caused by the résumé directory site itself: every time the system crawls a link list page of Indeed, Indeed returns the résumé page links in random order, and there is a chance that Indeed serves a résumé page link that was already given previously. So, instead of getting the amount of documents we expected, we only get the amount at which the threshold limit is fulfilled.
2. The implementation of the Naïve Bayes classifier for job résumé document classification produced a mediocre model for Atypical types of job résumés and a bad model for Typical types of job résumés. The highest accuracy rate, taken from the Correctly Classified Instances value, is 67.3% for Atypical types of job when 5000 documents are given, and a low 44.84% for Typical types of job when 5000 documents are given. The classification model for Atypical types of job predicts 9 out of 10 correctly; it is not a perfect predictor because the accuracy rate is only 67.3%, and the one error is caused by a document that contains too many terms belonging to another class. The classification model for Typical types of job is not usable for classification, since even the model with 5000 documents only achieves roughly 50:50 predictions.
3. It is hard to get a good model from Typical types of job résumés, since the classes share word features with each other. In our case, the job types of the Typical group belong to the same job field, Web Engineer, and of course there is more than one shared word feature. Since the Naïve Bayes classifier depends heavily on the word occurrences in each document, we should not create a classification model from a Typical-group problem.
4. After discussing why the accuracy rate could not be higher, we conclude that the problem does not come from the classifier, but from the quality of the data set. There are discussions suggesting that the Naïve Bayes classifier gives better accuracy and predictions if the number of labeled documents is increased, or if labeled documents are provided in vast amounts. In our testing, we built a model from 5000 documents, 1000 documents per class, which we consider large enough, and the increase in accuracy between the models built from 1000, 2500, and 5000 documents is not very significant. So we conclude that the problem is the data set, and we admit that the résumé documents we collected are not properly labeled. In the data crawling process, we fetch résumé documents based on a keyword, and Indeed's search service matches résumés in which the keyword occurs as a job title, so there is a chance that a résumé document of a different class is categorized as the class we expected.
5.2 Suggestion
Suggestions to improve the system (future work):
1. Job résumé documents obtained from the data crawling process need to be examined carefully to check whether each document's class is the class we expected. Before using the documents, we first need to check each one manually to make sure the label/class is correct.
2. To enhance the system's results, we could later add a stemming process to get better feature extraction results; stemming also helps reduce the system's workload without decreasing the quality of pattern discovery. We could also use a more precise stopword list to delete stopwords from the résumé documents; in this research we used a stopword list containing around 600+ stopwords, and there are stopword lists with more specific stopwords. A sketch combining this suggestion with the next two is shown after this list.
3. There are a lot of filter options that we did not use in our current system, such as the TF-IDF transform, which reflects how important a word is to a document within a data set. We could also apply an Attribute Selection filter to filter word features based on their frequencies in the documents, so we could simply remove word features with very low frequency.
4. There are four method options that can be used in the MultiClassClassifier to set how the multi-class problem is transformed into several binary (two-class) problems: 1-against-all, 1-against-1, Random correction code, and Exhaustive correction code. In our current system, we just use the default option, 1-against-all. To enhance the system's results, we could compare the results of these methods to see which one performs better.
5. To simplify the system, we need not divide the data set into training and testing sets, since with cross-validation no overfitting problem occurred. We could just use the whole data set to train and build the classifier.
6. We could evaluate the Naïve Bayes classifier by comparing it against other classifier algorithms, instead of only comparing different amounts of documents. For future work, we could compare it with the Support Vector Machine, Random Forest, Decision Tree, or IBk (Nearest Neighbour) classifiers to see which classifier gives better results.
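As a concrete starting point for suggestions 2 to 4, here is a minimal sketch assuming the Weka 3.x API used elsewhere in this thesis; the file name is illustrative, LovinsStemmer is only one possible stemmer choice, and the -M value selects the multi-class binarization method:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.MultiClassClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.stemmers.LovinsStemmer;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class EnhancedPipelineSketch {
    public static void main(String[] args) throws Exception {
        // Raw text instances as produced by TextDirectoryLoader (file name illustrative)
        Instances raw = new DataSource("dataResume.arff").getDataSet();
        raw.setClassIndex(raw.numAttributes() - 1);

        StringToWordVector filter = new StringToWordVector();
        filter.setTokenizer(new NGramTokenizer());
        filter.setStemmer(new LovinsStemmer()); // suggestion 2: stemming to merge word variants
        filter.setTFTransform(true);            // suggestion 3: TF-IDF weighting
        filter.setIDFTransform(true);
        filter.setMinTermFreq(3);               // suggestion 3: drop very low-frequency word features
        filter.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, filter);

        // Suggestion 4: choose the multi-class binarization method via -M:
        // 0 = 1-against-all (default), 1 = random correction code,
        // 2 = exhaustive correction code, 3 = 1-against-1
        MultiClassClassifier mcc = new MultiClassClassifier();
        mcc.setOptions(new String[]{"-M", "3"});
        mcc.setClassifier(new NaiveBayes());

        // Suggestion 5: evaluate with 10-fold cross-validation on the whole set
        Evaluation eval = new Evaluation(vectors);
        eval.crossValidateModel(mcc, vectors, 10, new Random(42));
        System.out.println(eval.toSummaryString());
    }
}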
Bibliography
(2008a). Properties of naive bayes. [Online]. Available: http://nlp.stanford.edu/IR-book/html/htmledition/properties-of-naive-bayes-1.html [Accessed on February 2014].

(2008b). Text classification and naive bayes. [Online]. Available: http://nlp.stanford.edu/IR-book/html/htmledition/text-classification-and-naive-bayes-1.html [Accessed on February 2014].

(2009a). Attribute-relation file format (ARFF). [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/arff.html [Accessed on February 2014].

(2009b). Batch filtering. [Online]. Available: http://weka.wikispaces.com/Batch+filtering [Accessed on February 2014].

(2013). Document classification. [Online]. Available: http://en.wikipedia.org/wiki/Document_classification [Accessed on February 2014].

(2013). JavaScript HTML DOM. [Online]. Available: http://www.w3schools.com/js/js_htmldom.asp [Accessed on February 2014].

(2014). Document Object Model. [Online]. Available: http://en.wikipedia.org/wiki/Document_Object_Model [Accessed on February 2014].

Abernethy, M. (2010). Data mining with WEKA, part 2: Classification and clustering. [Online]. Available: http://www.ibm.com/developerworks/library/os-weka2/ [Accessed on February 2014].

Al-Azmi, A.-A. R. (2013). Data, text, and web mining for business intelligence: A survey. In International Journal of Data Mining & Knowledge Management Process (IJDKP), volume 3.

Chakrabarti, S. (2003). Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers.

Haruechaiyasak, C. (2008). A Tutorial on Naive Bayes Classification.

Hearst, M. (2003). What is text mining? [Online]. Available: http://people.ischool.berkeley.edu/~hearst/text-mining.html [Accessed on February 2014].

Hedley, J. (2013). jsoup cookbook. [Online]. Available: http://jsoup.org/cookbook/ [Accessed on February 2014].

Koch, P.-P. (2001). The Document Object Model: an introduction. [Online]. Available: http://www.digital-web.com/articles/the_document_object_model/ [Accessed on February 2014].

Kotsiantis, S. B. (2007). Supervised machine learning: A review of classification techniques. Informatica, 31:249-268.

Li, Y. H. and Jain, A. K. (2000). Classification of text documents. The Computer Journal, 41(8):537-546.

Mooney, R. J. (2007). Text categorization. Machine Learning Course Presentation.

Nedelcu, A. (2012). How to build a naive bayes classifier. [Online]. Available: https://www.bionicspirit.com/blog/2012/02/09/howto-build-naive-bayes-classifier.html [Accessed on February 2014].

Novak, P. K. (2009). Hands on WEKA: Classification in WEKA.

Olston, C. and Najork, M. (2010). Web crawling. Foundations and Trends in Information Retrieval, 4(3):175-246.

Peshave, M. (2010). How Search Engines Work and A Web Crawler Application.

Segaran, T. (2007). Programming Collective Intelligence: Building Smart Web 2.0 Applications. O'Reilly Media, Inc.

Sorensen, A. S. N. (2012). Evaluating the use of learning algorithms in categorization of text. Master's thesis, Norwegian University of Science and Technology.

Tala, F. Z. (2003). A study of stemming effects on information retrieval in Bahasa Indonesia. Master's thesis, Universiteit van Amsterdam.

Vryniotis, V. (2013). Machine learning tutorial: The naive bayes text classifier. [Online]. Available: http://blog.datumbox.com/machine-learning-tutorial-the-naive-bayes-text-classifier/ [Accessed on February 2014].

Werbach, K. and Jahja, K. (1996). The bare bones guide to HTML. [Online]. Available: http://werbach.com/barebones/barebone_id.html [Accessed on February 2014].

Witten, I. H., Frank, E., and Hall, M. (2011). Data Mining: Practical Machine Learning Tools and Techniques, Second Edition. Morgan Kaufmann Publishers.

Witten, I. H., Hall, M., Reutemann, P., Frank, E., Holmes, G., and Pfahringer, B. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1):10-18.
Appendix A
Code Listing
Data Crawling Class
(Crawl.java)
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.*;
import java.io.IOException;
import java.util.ArrayList;

/*
 * @author damai007 & septiawan_moge
 */

public class Crawl {

    static ArrayList<String> fetched_url = new ArrayList<String>();

    private static String root_folder = "D:/Crawler_Project/FinalMaintain/";

    private static void createFileFromIndeed(String url, String folderPath) throws IOException {
        print("\n> Fetching Resume %s...", url);

        String[] words, word_end;

        String ws, wt, wc, wd;
        String id;
        String line;
        String lines = "";

        Boolean check;
        Boolean isWorkTitleEmpty = true;

        Document resume = Jsoup.connect(url).timeout(0).get();
        Element name = resume.select(".fn").first();
        Element location = resume.select("#headline_location").first();
        Elements work_section = resume.select("div.work-experience-section");
        Elements work_title = resume.select(".work_title");
        Elements work_company = resume.select("div.work_company");
        Elements work_description = resume.select(".work_description");

        Element education = resume.select("#education-items").first();
        Element add_info = resume.select("#additionalinfo-items").first();

        if (location != null) {
            line = name.text() + " " + location.text();
            lines += line;
        } else {
            line = name.text();
            lines += line;
        }

        for (Element workSection : work_section) {
            ws = workSection.text();

            for (Element workTitle : work_title) {
                wt = workTitle.text();
                check = ws.contains(wt);
                if (check) {
                    line = " " + wt;
                    lines += line;
                    isWorkTitleEmpty = false;
                    break;
                }
            }

            if (isWorkTitleEmpty) {
                // intentionally left empty in the original code
            }

            for (Element workCompany : work_company) {
                wc = workCompany.text();
                check = ws.contains(wc);
                if (check) {
                    line = " " + wc;
                    lines += line;
                    break;
                }
            }

            for (Element workDescription : work_description) {
                wd = workDescription.text();
                check = ws.contains(wd);
                if (check) {
                    line = " " + wd;
                    lines += line;
                    break;
                }
            }
        }

        if (education != null) {
            line = " " + education.text();
            lines += line;
        }

        if (add_info != null) {
            line = " " + add_info.text();
            lines += line;
        }

        // derive the file name (resume id) from the URL
        words = url.split("/");
        int length_words = words.length - 1;
        word_end = words[length_words].split("\\?");
        id = word_end[0];

        String path = folderPath + "/" + id;
        File file = new File(path);
        file.createNewFile();

        Writer writer = null;

        try {
            writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(path), "utf-8"));
            writer.write(lines);
        } catch (IOException ex) {
        } finally {
            try { writer.close(); } catch (Exception ex) {}
        }
    }

    private static void fetchIndeed(String rootURL, int threshold, String folderPath) throws IOException {

        ArrayList<String> forPageIndicator = new ArrayList<String>();

        String url_starter = rootURL;
        String url_constant = url_starter;
        String url_resume = "";
        String pageAttr = "";

        int resume_page_threshold = threshold;
        int resume_counter = 0;
        int page_counter = 0;

        Boolean keepSearching = true;

        /*
         * This is the main program.
         */
        while (keepSearching) {
            print("\nFETCHING ALL URLs FROM %s...", url_starter);
            System.out.println("Page " + (page_counter + 1));
            Document doc = Jsoup.connect(url_starter).timeout(0).get();
            Elements link_resume = doc.select(".app_link");
            Elements page_indicator = doc.select("#pagination > .instl ~ [rel=nofollow]");

            for (Element page : page_indicator) {
                pageAttr = page.attr("href");

                if (forPageIndicator.contains(pageAttr)) {
                    continue;
                } else {
                    forPageIndicator.add(pageAttr);
                }
            }

            for (Element link : link_resume) {
                url_resume = "http://www.indeed.com" + link.attr("href");

                if (!fetched_url.contains(url_resume)) {
                    System.out.print("\n" + (resume_counter + 1));
                    createFileFromIndeed(url_resume, folderPath);
                    fetched_url.add(url_resume);
                    resume_counter += 1;
                }

                if (resume_counter >= resume_page_threshold) {
                    keepSearching = false;
                    break;
                }
            }

            if (page_counter < forPageIndicator.size()) {
                url_starter = url_constant + forPageIndicator.get(page_counter);
            } else {
                break;
            }

            page_counter += 1;
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    public static void main(String[] args) throws IOException {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.start();

        // 1. Crawling activity for general job field: accounting, computerit, manager, medical, technician
        fetchIndeed("http://www.indeed.com/resumes/accounting", 999, root_folder + "generalField/01RawData/accounting/");
        fetchIndeed("http://www.indeed.com/resumes/computerit", 400, root_folder + "generalField/01RawData/computerit/");
        fetchIndeed("http://www.indeed.com/resumes/manager", 999, root_folder + "generalField/01RawData/manager/");
        fetchIndeed("http://www.indeed.com/resumes/medical", 999, root_folder + "generalField/01RawData/medical/");
        fetchIndeed("http://www.indeed.com/resumes/technician", 300, root_folder + "generalField/01RawData/technician/");

        fetched_url.clear();

        // 2. Crawling activity for specific job field: webanalyst, webdesigner, webdeveloper, webmaster, webwriter
        fetchIndeed("http://www.indeed.com/resumes/webanalyst", 999, root_folder + "specificField/01RawData/webanalyst/");
        fetchIndeed("http://www.indeed.com/resumes/webdesigner", 999, root_folder + "specificField/01RawData/webdesigner/");
        fetchIndeed("http://www.indeed.com/resumes/webdeveloper", 999, root_folder + "specificField/01RawData/webdeveloper/");
        fetchIndeed("http://www.indeed.com/resumes/webmaster", 999, root_folder + "specificField/01RawData/webmaster/");
        fetchIndeed("http://www.indeed.com/resumes/webwriter", 999, root_folder + "specificField/01RawData/webwriter/");

        stopwatch.stop();
        System.out.println("Elapsed time in milliseconds: " + stopwatch.getElapsedTime());
        System.out.println("Elapsed time in seconds: " + stopwatch.getElapsedTimeSecs());
    }
}
Data Preprocessing Class
(Preprocessing.java)
import java.io.*;
import java.util.List;
import java.util.ArrayList;

import WebCrawling.Stopwatch;

/*
 * @author damai007 & septiawan_moge
 */

public class Preprocessing {
    private static String root_folder = "D:/Crawler_Project/FinalMaintain/";
    private static List<String> stopWords;
    // holds all the stop words

    private static void initStopWords() {
        stopWords = new ArrayList<String>();

        try {
            FileInputStream stop_file = new FileInputStream(root_folder + "stopwords.txt");
            // read the text file containing the stop words
            DataInputStream in = new DataInputStream(stop_file);
            BufferedReader br = new BufferedReader(new InputStreamReader(in, "utf-8"));
            // use utf-8 encoding

            String line;
            while ((line = br.readLine()) != null) {
                if (line.charAt(0) == 32) {
                    // if the first character is a space, do not add the line to stopWords;
                    // the space character itself is left untouched
                } else {
                    // otherwise, add the line to stopWords
                    stopWords.add(line);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void start(String rootCrawl, String rootProcessed) {
        Boolean still_process = true;
        Boolean finished = false;

        File root_crawl = new File(rootCrawl);
        // folder path of the crawler results
        File root_process = new File(rootProcessed);
        // folder path for the preprocessing results

        while (still_process) {
            for (String f1 : root_crawl.list()) {
                // walk through every file/folder in the root path, one by one
                File folder1 = new File(root_crawl + "/" + f1);
                for (String files : folder1.list()) {
                    try {
                        FileInputStream file = new FileInputStream(root_crawl + "/" + f1 + "/" + files);
                        DataInputStream in = new DataInputStream(file);
                        BufferedReader br = new BufferedReader(new InputStreamReader(in));

                        try {
                            FileWriter processed_file = new FileWriter(root_process + "/" + f1 + "/" + files + ".txt");
                            BufferedWriter out = new BufferedWriter(processed_file);
                            String line;
                            String lines = "";
                            while ((line = br.readLine()) != null) {
                                if (!line.equalsIgnoreCase("\n")) {
                                    // as long as no newline marker is found,
                                    lines += line;
                                    // append the whole sentence on that line
                                }
                            }

                            String processed_output = removeNotLettersAndStopWords(lines);
                            // remove words that are not letters or that match a stop word
                            out.write(processed_output);
                            out.close();
                        } catch (Exception e) {
                            // catch errors that occur while writing
                            e.printStackTrace();
                            still_process = false;
                        }
                    } catch (Exception e) {
                        // catch errors that occur while reading
                        e.printStackTrace();
                        still_process = false;
                    }
                } // end of FOR loop over String files
                still_process = false;
            } // end of FOR loop over String f1
            finished = true;
        } // end of WHILE loop

        if (finished) {
            System.out.println("DONE!");
        }
    }

    private static String removeNotLettersAndStopWords(String text) {
        String lowercase = text.toLowerCase();
        // convert the whole text to lowercase
        String allowed_string = "";
        // holds the allowed characters

        for (int i = 0; i < lowercase.length(); i++) {
            int number_of_string = (int) lowercase.charAt(i);
            // convert the character to an integer so it is easy to process

            if ((number_of_string >= 97 && number_of_string <= 122) ||
                // for standard letters
                (number_of_string == 32)) {
                // for the space character
                allowed_string += lowercase.charAt(i);
            } else {
                allowed_string += " ";
                // anything other than those two is replaced by a space in allowed_string
            }
        }

        String[] words = allowed_string.split("\\s+");
        // split the sentence at whitespace
        List<String> word_list = new ArrayList<String>();

        for (int j = 0; j < words.length; j++) {
            if (!stopWords.contains(words[j]) && words[j].length() > 1) {
                // drop words that match a stop word and one-letter words
                word_list.add(words[j]);
            }
        }

        String output = word_list.toString();
        output = output.replaceAll(",", "");
        // remove all commas
        output = output.replaceAll("\\[|\\]", "");
        // remove the "[" and "]" produced by printing the word list
        return output;
    }

    public static void main(String[] args) throws Exception {
        initStopWords();
        // initialize the stop word list
        Stopwatch stopwatch = new Stopwatch();

        System.out.println("Start Processing...");
        stopwatch.start();

        start(root_folder + "generalField/01RawData/", root_folder + "generalField/02PreprocessedData/");
        start(root_folder + "specificField/01RawData/", root_folder + "specificField/02PreprocessedData/");

        stopwatch.stop();
        System.out.println("elapsed time in milliseconds: " + stopwatch.getElapsedTime());
        System.out.println("elapsed time in seconds: " + stopwatch.getElapsedTimeSecs());
    }
}
ARFF File Converter Class
(FileArffPreparation.java)
import weka.core.*;
import weka.core.Instances;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.FastVector;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Reorder;
import weka.filters.unsupervised.attribute.StringToWordVector;
import weka.filters.unsupervised.instance.Randomize;
import weka.filters.unsupervised.instance.RemovePercentage;
import weka.core.tokenizers.NGramTokenizer;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.converters.ArffSaver;
import weka.core.converters.TextDirectoryLoader;

import java.io.*;

/*
 * @author damai007 & septiawan_moge
 */

public class FileArffPreparation {

    Instances dataRaw, instances, trainData, testData, newData;
    ArffSaver dataRawSaver, dataFilteredSaver, dataNewFilterSaver, dataTrainSaver, dataTestSaver, dataNewSaver;

    String linePrint = "=======================================================";
    // String rootFolder = "D:/Crawler_Project/FinalMaintain/generalField/";
    String rootFolder = "D:/Crawler_Project/FinalMaintain/specificField/";
    // String resumeFolder = "D:/Crawler_Project/FinalMaintain/generalField/02PreprocessedData/";
    String resumeFolder = "D:/Crawler_Project/FinalMaintain/specificField/02PreprocessedData/";
    String dataRawArffName = "dataResume.arff";
    String dataFilteredArffName = "dataResume.processed.arff";
    String dataTrainArffName = "dataResume.train.arff";
    String dataTestArffName = "dataResume.test.arff";
    String newDataFolder = "D:/Crawler_Project/FinalMaintain/testClassification/newResume/";
    String newDataFolder2 = "D:/Crawler_Project/FinalMaintain/testClassification/processedData/";
    String newDataRawArff = "nolblData.arff";
    String newDataFltrdArff = "nolblData.processed.arff";

    public void convertText2Arff() {

        try {
            TextDirectoryLoader dirLoader = new TextDirectoryLoader();
            dirLoader.setDirectory(new File(resumeFolder));
            dataRaw = dirLoader.getDataSet();

            dataRawSaver = new ArffSaver();
            dataRawSaver.setInstances(dataRaw);
            dataRawSaver.setFile(new File(rootFolder + dataRawArffName));
            dataRawSaver.writeBatch();

            System.out.println(linePrint);
            System.out.println("=== Data resume has already converted into ARFF files... ===");
            System.out.println("=== Dataset source: " + resumeFolder + " ===");
            System.out.println("=== With name: " + dataRawArffName + " ===");
            System.out.println("=== Location: " + rootFolder + " ===");
            System.out.println(linePrint);

        } catch (Exception e) {
            System.out.println("There is problem when execute convert method!");
        }

    }

    public void checkNLoadFile() {

        try {
            File f = new File(newDataFolder);
            File[] files = f.listFiles();
            String[] text = new String[files.length];
            String line;

            if (files != null) {
                FastVector classVal = new FastVector(5);
                classVal.addElement("class1");
                classVal.addElement("class2");
                classVal.addElement("class3");
                classVal.addElement("class4");
                classVal.addElement("class5");

                Attribute attribute1 = new Attribute("text", (FastVector) null);
                Attribute attribute2 = new Attribute("@@class@@", classVal);

                FastVector fvWekaAttributes = new FastVector(2);
                fvWekaAttributes.addElement(attribute1);
                fvWekaAttributes.addElement(attribute2);
                newData = new Instances("new_data", fvWekaAttributes, 0);
                newData.setClassIndex(newData.numAttributes() - 1);

                DenseInstance instance = new DenseInstance(newData.numAttributes());

                for (int i = 0; i < files.length; i++) {
                    BufferedReader fileReader = new BufferedReader(new FileReader(newDataFolder + "/" + files[i].getName()));
                    while ((line = fileReader.readLine()) != null) {
                        text[i] = line;
                    }

                    instance.setValue(attribute1, text[i]);
                    newData.add(instance);
                    fileReader.close();
                }
                System.out.println(linePrint);
                System.out.println("=== The ARFF file format ===");
                System.out.println(newData);
                System.out.println(linePrint);
                System.out.println();
            }

            dataRawSaver = new ArffSaver();
            dataRawSaver.setInstances(newData);
            dataRawSaver.setFile(new File(newDataFolder2 + newDataRawArff));
            dataRawSaver.writeBatch();

            System.out.println(linePrint);
            System.out.println("=== New unlabeled resume data has already converted into ARFF files... ===");
            System.out.println("=== Data source: " + newDataFolder + " ===");
            System.out.println("=== With name: " + newDataRawArff + " ===");
            System.out.println("=== Location: " + newDataFolder2 + " ===");
            System.out.println(linePrint);

        } catch (Exception e) {
            System.out.println("There is problem when execute the method!");
        }

    }

    public void filterFunction() throws Exception {
        DataSource sourceRawData = new DataSource(rootFolder + dataRawArffName);
        instances = sourceRawData.getDataSet();

        Randomize rand = new Randomize();
        rand.setRandomSeed(42);
        rand.setInputFormat(instances);
        instances = Filter.useFilter(instances, rand);

        StringToWordVector stringFilter = new StringToWordVector();
        stringFilter.setOutputWordCounts(true);
        stringFilter.setAttributeIndices("first-last");
        stringFilter.setTokenizer(new NGramTokenizer());
        stringFilter.setWordsToKeep(1000);
        stringFilter.setInputFormat(instances);
        Instances newTrain = Filter.useFilter(instances, stringFilter);
        Instances newTest = Filter.useFilter(newData, stringFilter);

        Reorder order = new Reorder();
        order.setAttributeIndices("2-last,1");
        order.setInputFormat(newTrain);
        Instances newerTrain = Filter.useFilter(newTrain, order);
        Instances newerTest = Filter.useFilter(newTest, order);

        dataFilteredSaver = new ArffSaver();
        dataFilteredSaver.setInstances(newerTrain);
        dataFilteredSaver.setFile(new File(rootFolder + dataFilteredArffName));
        dataFilteredSaver.writeBatch();

        System.out.println(linePrint);
        System.out.println("=== File raw ARFF has already filtered... ===");
        System.out.println("=== Source: " + dataRawArffName + " ===");
        System.out.println("=== With name: " + dataFilteredArffName + " ===");
        System.out.println("=== Location: " + rootFolder + " ===");
        System.out.println(linePrint);

        dataNewFilterSaver = new ArffSaver();
        dataNewFilterSaver.setInstances(newerTest);
        dataNewFilterSaver.setFile(new File(newDataFolder2 + newDataFltrdArff));
        dataNewFilterSaver.writeBatch();

        System.out.println(linePrint);
        System.out.println("=== File new unlabeled data ARFF has already filtered... ===");
        System.out.println("=== Source: " + newDataRawArff + " ===");
        System.out.println("=== With name: " + newDataFltrdArff + " ===");
        System.out.println("=== Location: " + newDataFolder2 + " ===");
        System.out.println(linePrint);
    }

    public void prepareDataTrainTest() {

        try {
            DataSource sourceFilteredData = new DataSource(rootFolder + dataFilteredArffName);
            trainData = sourceFilteredData.getDataSet();
            DataSource sourceFilteredData2 = new DataSource(rootFolder + dataFilteredArffName);
            testData = sourceFilteredData2.getDataSet();
            int minInstances = 25;

            if (trainData.numInstances() > minInstances) {
                RemovePercentage divideTrainData = new RemovePercentage();
                divideTrainData.setPercentage(10);
                divideTrainData.setInvertSelection(true);
192 di vi deTr ai nDat a . set I nput Format ( t r ai nDat a ) ;
193
194 t r ai nDat a = di vi deTr ai nDat a . us e F i l t e r ( t r ai nDat a ,
di vi deTr ai nDat a ) ;
195
196 dat aTr ai nSaver = new Ar f f Saver () ;
197 dat aTr ai nSaver . s e t I ns t anc e s ( t r ai nDat a ) ;
198 dat aTr ai nSaver . s e t F i l e (new F i l e ( r oot Fol der +
dataTrai nArff Name ) ) ;
199 dat aTr ai nSaver . wr i t eBat ch () ;
200
201 System . out . pr i nt l n ( l i ne Pr i nt ) ;
202 System . out . pr i nt l n ( "=== F i l e ARFF f or t r ai ni ng data was
b ui l t s uc c e s s f ul l . . . ===" ) ;
203 System . out . pr i nt l n ( "=== Source : " + dat aFi l t er edAr f f Name +
" ===" ) ;
204 System . out . pr i nt l n ( "=== With name: " + dataTrai nArff Name +
" ===" ) ;
205 System . out . pr i nt l n ( "=== Locat i on : " + r oot Fol der +
" ===" ) ;
206 System . out . pr i nt l n ( l i ne Pr i nt ) ;
207
208 RemovePercentage di vi deTes t Dat a = new
RemovePercentage () ;
209 di vi deTes t Dat a . s et Per cent age (10) ;
210 di vi deTes t Dat a . s e t I nv e r t Se l e c t i on ( f al s e ) ;
211 di vi deTes t Dat a . set I nput Format ( t es t Dat a ) ;
212 t es t Dat a = di vi deTes t Dat a . us e F i l t e r ( t es t Dat a ,
di vi deTes t Dat a ) ;
213 dat aTest Saver = new Ar f f Saver () ;
125
214 dat aTest Saver . s e t I ns t anc e s ( t es t Dat a ) ;
215 dat aTest Saver . s e t F i l e (new F i l e ( r oot Fol der +
dataTestArff Name ) ) ;
216 dat aTest Saver . wr i t eBat ch () ;
217
218 System . out . pr i nt l n ( l i ne Pr i nt ) ;
219 System . out . pr i nt l n ( "=== F i l e ARFF f or t e s t i ng data was
b ui l t s uc c e s s f ul l . . . ===" ) ;
220 System . out . pr i nt l n ( "=== Source : " + dat aFi l t er edAr f f Name +
" ===" ) ;
221 System . out . pr i nt l n ( "=== With name: " + dataTestArf f Name +
" ===" ) ;
222 System . out . pr i nt l n ( "=== Locat i on : " + r oot Fol der +
" ===" ) ;
223 System . out . pr i nt l n ( l i ne Pr i nt ) ;
224
225 } el se {
226 System . out . pr i nt l n ( " You have i ns t anc e ( s ) equal or
l e s s "
227 + " than " + mi nI nst ances +
228 " ! \nYou need more i ns t anc es t o be val i d ! "
) ;
229 }
230
231 } catch ( Except i on e) {
232 System . out . pr i nt l n ( " There i s problem when l oad f i l e
ARFF! " ) ;
233 }
234
235 }
236
237 publ i c s t at i c void main( St r i ng [] ar gs ) throws Except i on{
238
239 Fi l e Ar f f Pr e pa r a t i on ar f f Pr e par at i on = new
Fi l e Ar f f Pr e par at i on () ;
240 // ar f f Pr e p ar at i o n . c o nv e r t Te x t 2Ar f f () ;
241 // ar f f Pr e p ar at i o n . c hec kNLoadFi l e () ;
242 // ar f f Pr e p ar at i o n . f i l t e r F u n c t i o n () ;
243 ar f f Pr e par at i on . pr epar eDat aTr ai nTest () ;
244 System . out . pr i nt l n ( " \n=== Pr epar at i on pr oces s of F i l e ARFF was
done s uc c e s s f ul l . . " ) ;
245 }
246
247 }
126
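For reference, the raw ARFF file that checkNLoadFile() builds and prints follows the structure defined by the two attributes above: a string attribute holding the resume text and a nominal class attribute whose label is left unknown ("?") for unlabeled data. The following is an illustrative sketch only; the resume text shown here is invented, not taken from the crawled data:

@relation new_data

@attribute text string
@attribute @@class@@ {class1,class2,class3,class4,class5}

@data
'experienced software engineer skilled in java and databases ...',?
'registered nurse with clinical and administrative experience ...',?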
Training Class
(MultiClassLearner.java)
import weka.core.*;
import weka.classifiers.meta.MultiClassClassifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.Evaluation;
import weka.core.converters.ConverterUtils.DataSource;

import java.io.*;
import java.util.Random;
import WebCrawling.Stopwatch;

/**
 * @author damai007 & septiawan_moge
 */

public class MultiClassLearner {
  Instances trainData, testData;
  // Objects to hold the training data and the test data

  MultiClassClassifier classifier;
  // The classifier class that will be used

  // String rootFolder = "D:/Crawler_Project/FinalMaintain/generalField/";
  String rootFolder = "D:/Crawler_Project/FinalMaintain/specificField/";
  String dataTrainArffName = "dataResume.train.arff";
  String dataTestArffName = "dataResume.test.arff";

  Stopwatch stopwatch;

  // Method that loads the data with DataSource; the ARFF files are already in
  // vector form with multiple classes
  public void loadDataset() throws Exception {

    DataSource sourceTrain = new DataSource(rootFolder + dataTrainArffName);
    DataSource sourceTest = new DataSource(rootFolder + dataTestArffName);
    // fetch the datasets from the ARFF files
    trainData = sourceTrain.getDataSet();
    testData = sourceTest.getDataSet();

    // these two lines set the class index of the training data so that it is
    // not negative
    int cIdxTrain = trainData.numAttributes() - 1;
    trainData.setClassIndex(cIdxTrain);

    // these two lines set the class index of the test data so that it is not
    // negative
    int cIdxTest = testData.numAttributes() - 1;
    testData.setClassIndex(cIdxTest);

    System.out.println("Data Loaded...");
    System.out.println("=== Training data: " + dataTrainArffName);
    System.out.println("=== Testing data: " + dataTestArffName);
    System.out.println("=== Source location: " + rootFolder);
    System.out.println();

  }

  // Evaluation method
  public void evaluate() throws Exception {

    // Definition of the classifier that is used:
    // the MultiClassClassifier meta classifier
    classifier = new MultiClassClassifier();

    // Definition of the base classifier that is used
    NaiveBayes baseClassifier = new NaiveBayes();
    classifier.setClassifier(baseClassifier);

    // Build the classifier from the training data
    classifier.buildClassifier(trainData);

    // Definition of the evaluation of the classifier:
    // 10-fold cross-validation on the training data
    Evaluation eval = new Evaluation(trainData);
    eval.crossValidateModel(classifier, trainData, 10, new Random(1));
    // Print the evaluation results
    System.out.println("=== Summary of Evaluation of Training dataset ===");
    System.out.println(eval.toSummaryString());
    System.out.println(eval.toClassDetailsString());
    System.out.println(eval.toMatrixString());
    System.out.println("=== Evaluating on training dataset done ====");
    System.out.println();

  }

  public void learn() throws Exception {
    Evaluation evaluation = new Evaluation(trainData);
    evaluation.evaluateModel(classifier, testData);
    System.out.println("=== Summary of evaluation on the Test dataset ===");
    System.out.println(evaluation.toSummaryString());
    System.out.println(evaluation.toClassDetailsString());
    System.out.println(evaluation.toMatrixString());
    System.out.println("=== Training/testing on testing dataset done ====");
    System.out.println();

  }

  // Save the model to a file
  public void saveModel(String fileName) throws Exception {
    ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(fileName));
    // Write the model object to the file
    out.writeObject(classifier);
    out.close();
    System.out.println("=== Classifier Model has been saved ===");
    System.out.println();
  }

  public static void main(String[] args) throws Exception {
    Stopwatch stopwatch = new Stopwatch();
    MultiClassLearner learner = new MultiClassLearner();
    stopwatch.start();
    learner.loadDataset();
    learner.evaluate();
    learner.learn();
    stopwatch.stop();
    System.out.println("elapsed time in milliseconds: " + stopwatch.getElapsedTime());
    System.out.println("elapsed time in seconds: " + stopwatch.getElapsedTimeSecs());

    // learner.saveModel("D:/Crawler_Project/FinalMaintain/generalField/mltiClassNvBayesGeneral.model");
    // For example, save to a .model file

    learner.saveModel("D:/Crawler_Project/FinalMaintain/specificField/mltiClassNvBayesSpecific.model");
    // For example, save to a .model file
  }

}
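For reference, the NaiveBayes base classifier above implements the standard Naïve Bayes decision rule: a resume represented by feature values x_1, ..., x_n is assigned the class

    \hat{c} = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(x_i \mid c)

where the prior P(c) and the per-attribute likelihoods P(x_i | c) are estimated from the training ARFF file (Weka's NaiveBayes models the numeric word-count attributes with a normal density by default). The surrounding MultiClassClassifier decomposes the five-class problem into binary subproblems (one-against-all in its default setting) and predicts the class whose subproblem scores highest.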
Classification Class
(MultiClassClassify.java)
import weka.core.*;
import weka.classifiers.meta.MultiClassClassifier;
import weka.core.converters.ConverterUtils.DataSource;

import java.io.*;
import java.util.List;
import java.util.ArrayList;
import WebCrawling.Stopwatch;

/**
 * @author damai007 & septiawan_moge
 */

public class MultiClassClassify {

  Instances newTestData; // Object to hold the new, unlabeled ARFF data

  MultiClassClassifier classifier; // The classifier that will be used, as the
                                   // object read from the model file

  String newDataFolder = "D:/Crawler_Project/FinalMaintain/testClassification/processedData/";
  String newDataArff = "nolblData.processed.arff";

  // Method that loads the data with DataSource; the ARFF file is already in
  // vector form but unlabeled (?)
  public void loadNewset() throws Exception {

    DataSource sourceNewset = new DataSource(newDataFolder + newDataArff);
    // fetch the dataset from the ARFF file
    newTestData = sourceNewset.getDataSet();

    // these two lines set the class index of the dataset so that it is not
    // negative
    int cIdxNewset = newTestData.numAttributes() - 1;
    newTestData.setClassIndex(cIdxNewset);

    System.out.println("\n=== Loading new test data ===");
    System.out.println("=== File name: " + newDataArff);
    System.out.println("=== From directory: " + newDataFolder);
    System.out.println("=========================================");
  }

  // Method that loads the classifier model file (.model) saved by the
  // MultiClassLearner class
  public void loadModel() {
    try {
      // Load the model file from the directory
      // ObjectInputStream in = new ObjectInputStream(new FileInputStream("D:/Crawler_Project/" + "FinalMaintain/generalField/mltiClassNvBayesGeneral.model"));
      ObjectInputStream in = new ObjectInputStream(new FileInputStream(
          "D:/Crawler_Project/" + "FinalMaintain/specificField/mltiClassNvBayesSpecific.model"));
      Object tmp = in.readObject();
      // Load the model into Weka as a MultiClassClassifier
      classifier = (MultiClassClassifier) tmp;
      in.close();

      System.out.println("\n=== Loading model ===");
      System.out.println("=== Classifier: " + classifier.getClass().getName()
          + " " + Utils.joinOptions(classifier.getOptions()));
      // Print the type of classifier that is used along with its (default)
      // options
      System.out.println("=========================================");
    } catch (Exception e) {
      System.out.println("\nProblem found when reading the model!!");
    }
  }

  // Method that predicts/evaluates the unlabeled dataset against the
  // classifier model
  public void classifyDocument() {
    try {
      System.out.println();
      System.out.println("=== Setup ===");
      System.out.println("=== Classifier: " + classifier.getClass().getName()
          + " " + Utils.joinOptions(classifier.getOptions()));
      // Print the type of classifier that is used along with its (default)
      // options
      System.out.println("=== Dataset Relation: " + newTestData.relationName());
      // Print the @relation name of the dataset file
      System.out.println();
      System.out.println("=== Data in dataset: ");

      // Loop that prints the contents of the dataset
      for (int a = 0; a < newTestData.numInstances(); a++) {
        System.out.print((a + 1) + ". ");
        // print the sequence number
        System.out.println(newTestData.instance(a));
        // print the data of the dataset at index a
      }

      System.out.println();

      // Loop that prints the predicted result for every index
      System.out.println("# actual predicted");
      for (int i = 0; i < newTestData.numInstances(); i++) {
        // iterate over every index of the dataset
        double pred = classifier.classifyInstance(newTestData.instance(i));
        // the classifier's classification of the instance

        System.out.print((i + 1));
        // print the sequence number
        System.out.print(" ");
        System.out.print(" " + newTestData.instance(i).toString(newTestData.classIndex()) + " ");
        // print the actual class of the instance; since the data is unlabeled
        // (?), the result printed is (?)
        System.out.print(" ");
        System.out.print(" " + newTestData.classAttribute().value((int) pred));
        // print the predicted class based on the index
        System.out.println(" File " + (i + 1) + " is predicted as class "
            + newTestData.classAttribute().value((int) pred));
        System.out.println();

      }
      System.out.println("=== Classification has been done successfully...");
    } catch (Exception e) {
      System.out.println("Problem found when classifying the new data set, or the data is not valid!!");
    }
  }

  public static void main(String[] args) throws Exception {
    Stopwatch stopwatch = new Stopwatch();
    MultiClassClassify classifier = new MultiClassClassify();
    stopwatch.start();
    classifier.loadNewset();
    classifier.loadModel();
    classifier.classifyDocument();
    stopwatch.stop();
    System.out.println("elapsed time in milliseconds: " + stopwatch.getElapsedTime());
    System.out.println("elapsed time in seconds: " + stopwatch.getElapsedTimeSecs());
  }

}
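Note that classifyInstance() returns only the index of the most probable class. If the per-class probabilities are also of interest, Weka's distributionForInstance() can be called on the same classifier. A minimal sketch, assuming the classifier and newTestData objects from the listing above (this loop is a hypothetical addition, not part of the thesis code):

// Print the class probability distribution for every unlabeled instance
for (int i = 0; i < newTestData.numInstances(); i++) {
    double[] dist = classifier.distributionForInstance(newTestData.instance(i));
    for (int c = 0; c < dist.length; c++) {
        System.out.printf("%s: %.3f  ", newTestData.classAttribute().value(c), dist[c]);
    }
    System.out.println();
}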
Timer Class
(Stopwatch.java)
/**
 * @author damai007 & septiawan_moge
 */

public class Stopwatch {
  private long startTime = 0;
  private long stopTime = 0;
  private boolean running = false;

  public void start() {
    this.startTime = System.currentTimeMillis();
    this.running = true;
  }

  public void stop() {
    this.stopTime = System.currentTimeMillis();
    this.running = false;
  }

  // elapsed time in milliseconds
  public long getElapsedTime() {
    long elapsed;
    if (running) {
      elapsed = (System.currentTimeMillis() - startTime);
    } else {
      elapsed = (stopTime - startTime);
    }
    return elapsed;
  }

  // elapsed time in seconds
  public long getElapsedTimeSecs() {
    long elapsed;
    if (running) {
      elapsed = ((System.currentTimeMillis() - startTime) / 1000);
    } else {
      elapsed = ((stopTime - startTime) / 1000);
    }
    return elapsed;
  }
}
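Both driver classes above use this Stopwatch for whole-run timing. One design note: System.currentTimeMillis() reads the wall clock at millisecond resolution, which is adequate for timings of this scale; for finer-grained measurements the monotonic nanosecond clock is generally preferred. A minimal sketch of the alternative, where doWork() is a placeholder for the code being timed:

long t0 = System.nanoTime();          // monotonic, nanosecond resolution
doWork();                             // placeholder for the timed code
long elapsedMs = (System.nanoTime() - t0) / 1_000_000;
System.out.println("elapsed time in milliseconds: " + elapsedMs);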
Appendix B
Stopwords List
No. Word No. Word No. Word No. Word
1 able 24 already 47 appropriate 70 been
2 about 25 also 48 are 71 before
3 above 26 although 49 arent 72 beforehand
4 abroad 27 always 50 around 73 begin
5 according 28 am 51 as 74 behind
6 accordingly 29 amid 52 as 75 being
7 across 30 amidst 53 aside 76 believe
8 actually 31 among 54 ask 77 below
9 adj 32 amongst 55 asking 78 beside
10 after 33 an 56 associated 79 besides
11 afterwards 34 and 57 at 80 best
12 again 35 another 58 available 81 better
13 against 36 any 59 away 82 between
14 ago 37 anybody 60 awfully 83 beyond
15 ahead 38 anyhow 61 back 84 both
16 aint 39 anyone 62 backward 85 brief
17 all 40 anything 63 backwards 86 but
18 allow 41 anyway 64 be 87 by
19 allows 42 anyways 65 became 88 came
20 almost 43 anywhere 66 because 89 can
21 alone 44 apart 67 become 90 cannot
22 along 45 appear 68 becomes 91 cant
23 alongside 46 appreciate 69 becoming 92 cant
93 caption 128 do 163 except 198 had
94 cause 129 does 164 fairly 199 hadnt
95 causes 130 doesnt 165 far 200 half
96 certain 131 doing 166 farther 201 happens
97 certainly 132 done 167 few 202 hardly
98 changes 133 dont 168 fewer 203 has
99 clearly 134 down 169 fifth 204 hasnt
100 cmon 135 downwards 170 first 205 have
101 co 136 during 171 five 206 havent
102 co. 137 each 172 followed 207 having
103 com 138 edu 173 following 208 he
104 come 139 eg 174 follows 209 hed
105 comes 140 eight 175 for 210 hell
106 concerning 141 eighty 176 forever 211 hello
107 consequently 142 either 177 former 212 help
108 consider 143 else 178 formerly 213 hence
109 considering 144 elsewhere 179 forth 214 her
110 contain 145 end 180 forward 215 here
111 containing 146 ending 181 found 216 hereafter
112 contains 147 enough 182 four 217 hereby
113 corresponding 148 entirely 183 from 218 herein
114 could 149 especially 184 further 219 heres
115 couldnt 150 et 185 furthermore 220 hereupon
116 course 151 etc 186 get 221 hers
117 cs 152 even 187 gets 222 herself
118 currently 153 ever 188 getting 223 hes
119 dare 154 evermore 189 given 224 hi
120 darent 155 every 190 gives 225 him
121 definitely 156 everybody 191 go 226 himself
122 described 157 everyone 192 goes 227 his
123 despite 158 everything 193 going 228 hither
124 did 159 everywhere 194 gone 229 hopefully
125 didnt 160 ex 195 got 230 how
126 different 161 exactly 196 gotten 231 howbeit
127 directly 162 example 197 greetings 232 however
233 hundred 268 kept 303 meantime 338 nine
234 id 269 know 304 meanwhile 339 ninety
235 ie 270 known 305 merely 340 no
236 if 271 knows 306 might 341 nobody
237 ignored 272 last 307 mightnt 342 non
238 ill 273 lately 308 mine 343 none
239 im 274 later 309 minus 344 nonetheless
240 immediate 275 latter 310 miss 345 noone
241 in 276 latterly 311 more 346 no-one
242 inasmuch 277 least 312 moreover 347 nor
243 inc 278 less 313 most 348 normally
244 inc. 279 lest 314 mostly 349 not
245 indeed 280 let 315 mr 350 nothing
246 indicate 281 lets 316 mrs 351 notwithstanding
247 indicated 282 like 317 much 352 novel
248 indicates 283 liked 318 must 353 now
249 inner 284 likely 319 mustnt 354 nowhere
250 inside 285 likewise 320 my 355 obviously
251 insofar 286 little 321 myself 356 of
252 instead 287 look 322 name 357 off
253 into 288 looking 323 namely 358 often
254 inward 289 looks 324 nd 359 oh
255 is 290 low 325 near 360 ok
256 isnt 291 lower 326 nearly 361 okay
257 it 292 ltd 327 necessary 362 old
258 itd 293 made 328 need 363 on
259 itll 294 mainly 329 neednt 364 once
260 its 295 make 330 needs 365 one
261 its 296 makes 331 neither 366 ones
262 itself 297 many 332 never 367 ones
263 ive 298 may 333 neverf 368 only
264 just 299 maybe 334 neverless 369 onto
265 k 300 maynt 335 nevertheless 370 opposite
266 keep 301 me 336 new 371 or
267 keeps 302 mean 337 next 372 other
373 others 408 regarding 443 shes 478 thanx
374 otherwise 409 regardless 444 should 479 that
375 ought 410 regards 445 shouldnt 480 thatll
376 oughtnt 411 relatively 446 since 481 thats
377 our 412 respectively 447 six 482 thats
378 ours 413 right 448 so 483 thatve
379 ourselves 414 round 449 some 484 the
380 out 415 said 450 somebody 485 their
381 outside 416 same 451 someday 486 theirs
382 over 417 saw 452 somehow 487 them
383 overall 418 say 453 someone 488 themselves
384 own 419 saying 454 something 489 then
385 particular 420 says 455 sometime 490 thence
386 particularly 421 second 456 sometimes 491 there
387 past 422 secondly 457 somewhat 492 thereafter
388 per 423 see 458 somewhere 493 thereby
389 perhaps 424 seeing 459 soon 494 thered
390 placed 425 seem 460 sorry 495 therefore
391 please 426 seemed 461 specified 496 therein
392 plus 427 seeming 462 specify 497 therell
393 possible 428 seems 463 specifying 498 therere
394 presumably 429 seen 464 still 499 theres
395 probably 430 self 465 sub 500 theres
396 provided 431 selves 466 such 501 thereupon
397 provides 432 sensible 467 sup 502 thereve
398 que 433 sent 468 sure 503 these
399 quite 434 serious 469 take 504 they
400 qv 435 seriously 470 taken 505 theyd
401 rather 436 seven 471 taking 506 theyll
402 rd 437 several 472 tell 507 theyre
403 re 438 shall 473 tends 508 theyve
404 really 439 shant 474 th 509 thing
405 reasonably 440 she 475 than 510 things
406 recent 441 shed 476 thank 511 think
407 recently 442 shell 477 thanks 512 third
513 thirty 548 unto 583 whatever 618 within
514 this 549 up 584 whatll 619 without
515 thorough 550 upon 585 whats 620 wonder
516 thoroughly 551 upwards 586 whatve 621 wont
517 those 552 us 587 when 622 would
518 though 553 use 588 whence 623 wouldnt
519 three 554 used 589 whenever 624 yes
520 through 555 useful 590 where 625 yes
521 throughout 556 uses 591 whereafter 626 you
522 thru 557 using 592 whereas 627 youd
523 thus 558 usually 593 whereby 628 youll
524 till 559 v 594 wherein 629 your
525 to 560 value 595 wheres 630 youre
526 together 561 various 596 whereupon 631 yours
527 too 562 versus 597 wherever 632 yourself
528 took 563 very 598 whether 633 yourselves
529 toward 564 via 599 which 634 youve
530 towards 565 viz 600 whichever 635 zero
531 tried 566 vs 601 while 636 http
532 tries 567 want 602 whilst 637 www
533 truly 568 wants 603 whither 638 css
534 try 569 was 604 who 639 html
535 trying 570 wasnt 605 whod 640 com
536 ts 571 way 606 whoever 641 nv
537 twice 572 we 607 whole 642 ca
538 two 573 wed 608 wholl
539 un 574 welcome 609 whom
540 under 575 well 610 whomever
541 underneath 576 well 611 whos
542 undoing 577 went 612 whose
543 unfortunately 578 were 613 why
544 unless 579 were 614 will
545 unlike 580 werent 615 willing
546 unlikely 581 weve 616 wish
547 until 582 what 617 with
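The final entries of the list (http, www, css, html, com, nv, ca) filter residue of the crawled web pages rather than ordinary English function words. As a minimal sketch of how such a list can be applied during preprocessing (the file name and the whitespace tokenization here are illustrative, not the exact thesis code):

import java.io.*;
import java.util.*;

public class StopwordRemoval {
    public static void main(String[] args) throws IOException {
        // Load the stopword list, one word per line (illustrative file name)
        Set<String> stopwords = new HashSet<String>();
        BufferedReader reader = new BufferedReader(new FileReader("stopwords.txt"));
        String word;
        while ((word = reader.readLine()) != null) {
            stopwords.add(word.trim().toLowerCase());
        }
        reader.close();

        // Remove stopwords from a sample resume fragment
        String text = "Experienced in Java and HTML with five years of work";
        StringBuilder kept = new StringBuilder();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!stopwords.contains(token)) {
                kept.append(token).append(' ');
            }
        }
        System.out.println(kept.toString().trim());
    }
}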