
Resume parserX Implementation

08-Nov-2012
Yogesh Thakur
Sambe Software
www.sambesoftware.com
TABLE OF CONTENTS

Purpose
Introduction
Technical Development Platform
What is zone?

By Yogesh Thakur, Senior Software Engineer, Sambe Software



Purpose
The purpose of this document is to describe the basic implementation process of the parserX component.

Introduction
Nowadays we receive many CVs from many different sources, but the big issue we face is how to retrieve
good-quality information from these CVs. The parserX component is a DLL file which takes CVs in
different formats as input and, after some processing, provides structured, good-quality information.
It is a self-contained component (DLL) which can be integrated with any Windows or web application;
that application will then be able to parse CVs with the help of parserX.

Technical Development Platform


1. VS 2008 (.NET Framework 3.5)
2. C# as the programming language
3. Third-party components to read the text from different sources.
4. MS Office 2007 component to get the XML-formatted data from DOCX files.

What is zone?
A zone is simply a property (field) of a CV. For now, we are going to extract the zones listed below from a CV:

1. Personal Info
i) Name
a. First Name
b. Middle Name
c. Last Name
ii) Date Of Birth
iii) Age (calculated from the Date of Birth)
iv) Gender
v) Marital Status
vi) Father Name
vii) Mother Name
viii) Spouse Name
ix) No Of Children
x) Nationality
xi) Passport Number
xii) Passport Issue Place
xiii) Driver License

2. Contact Info
i) Address
a. Current Address
A. Current Address
B. Current City
C. Current State
D. Current Country
E. Current Zip
b. Permanent Address
A. Permanent Address
B. Permanent City
C. Permanent State
D. Permanent Country
E. Permanent Zip
ii) Email
a. Primary Email
b. Alternate Email
iii) Telephone

3. Professional Info
i) Objective
ii) Summary
iii) Skills
iv) Current Yearly Salary
v) Total Experience
vi) Functional Category
vii) Industry Type

4. Employment History
i) From Date
ii) To Date
iii) Job Title
iv) Company/Employer
v) Position

5. Qualification Info History


i) Year Of Passing
ii) Degree
iii) College/School/University
iv) Percentage/ Grade
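
The Age zone in the list above is derived from the Date of Birth rather than read from the CV. A minimal sketch of that calculation follows; the class and method names (AgeCalculator, CalculateAge) are hypothetical and are not part of parserX as described here.

```csharp
using System;

public static class AgeCalculator
{
    // Returns the age in completed years on the given reference date.
    public static int CalculateAge(DateTime dateOfBirth, DateTime onDate)
    {
        int age = onDate.Year - dateOfBirth.Year;

        // Subtract one year if the birthday has not yet occurred in the reference year.
        if (onDate < dateOfBirth.AddYears(age))
        {
            age--;
        }
        return age;
    }
}
```

Passing the reference date explicitly (instead of reading DateTime.Today inside the method) keeps the calculation repeatable when the same CV is parsed again later.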


How to Parse the CVs


The following are the steps to parse a CV:
1. Check login credential
2. Config File Preparation
i) Check whether the file already exists. If it does not, go to the next step; otherwise go to step 3.
ii) Download the encrypted config file through the web service.
3. Data Loader (load application settings & metadata from the config file)
4. ResumePreProcess
5. Document Type Conversion
6. Resume Verification
7. Resume Text Cleanup
8. Split the whole resume text into three blocks (header, footer & body)
9. Parse the quality data from the header & footer blocks with strong rules for some zones (e.g. Name, Email,
Contact No. & Address).
10. Parse the remaining zones from the body with strong/weak rules.
11. Verify the parsed data for some zones, such as Name, Email & Experience.

1) Check Login Credential (not being implemented in the current scenario)

Class Name: Authenticate

Method Name: IsAuthenticate
public bool IsAuthenticate(string userName, string password)

This method accepts a user name & password and returns true if authentication is successful;
otherwise it returns false.

2) Config File Preparation: This is a config file (Not started)

3) Data Loader: (Not started)

ClassName: DataLoader (static class)

Method Name: ParserXToolConfigDataSet
public static DataSet ParserXToolConfigDataSet()

This method first calls the decrypt method to decrypt the configuration XML file, then reads the
file and loads all the data into a DataSet. It then encrypts the XML file again by calling the
encrypt method and returns the DataSet.

Method Name: ParserXToolConfigDataTable
public static DataTable ParserXToolConfigDataTable(string tableName)

This method accepts a table name as a string & returns the matching DataTable from the DataSet.

Method Name: GetRuleRegX
public static string GetRuleRegX(Rulesa rule)

Method Name: GetXMLResumeFolder
public static string GetXMLResumeFolder()

Properties: Countries, dsConfigAll, EmploymentDurationType, Educational, Gender, Functional, Industry,
MaritalStatus, SkillLookup, WorkAuthorizationType, NumberOfResumesPerHttpCall, ThreadPoolSize,
TimeToParsePerResume, TimeToSavePerResume

public static DataTable Countries
{
    get { return ParserXToolConfigDataTable("Countries"); }
}

public static DataSet dsConfigAll = ParserXToolConfigDataSetNew();

public static DataTable EmploymentDurationType
{
    get { return ParserXToolConfigDataTable("EmploymentDurationType"); }
}

public static DataTable Educational
{
    get { return ParserXToolConfigDataTable("Educational"); }
}

public static DataTable Gender
{
    // Table name without the trailing space, so that the lookup can match.
    get { return ParserXToolConfigDataTable("Gender"); }
}

public static DataTable Functional
{
    get { return ParserXToolConfigDataTable("Functional"); }
}

public static DataTable Industry
{
    get { return ParserXToolConfigDataTable("Industry"); }
}

public static DataTable MaritalStatus
{
    get { return ParserXToolConfigDataTable("MaritalStatus"); }
}

public static DataTable SkillLookup
{
    get { return ParserXToolConfigDataTable("SkillLookup"); }
}

public static DataTable WorkAuthorizationType
{
    get { return ParserXToolConfigDataTable("WorkAuthorizationType"); }
}

public static int NumberOfResumesPerHttpCall
{
    get { return GetInteger("NumberOfResumesPerHttpCall"); }
}

public static int ThreadPoolSize
{
    get { return GetInteger("ThreadPoolSize"); }
}

public static int TimeToParsePerResume
{
    // Setting name without the trailing space, so that the lookup can match.
    get { return GetInteger("TimeToParsePerResume"); }
}

public static int TimeToSavePerResume
{
    get { return GetInteger("TimeToSavePerResume"); }
}
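
The Data Loader described above reads the decrypted configuration XML into a DataSet and then serves tables and integer settings by name. The following is a minimal sketch of that idea. The class name ConfigSketch, the "Settings" table and its Name/Value columns are assumptions made for illustration only; the real layout of the config file and the encryption calls are not specified in this document, so decryption is left out.

```csharp
using System;
using System.Data;
using System.IO;

public static class ConfigSketch
{
    // Builds a DataSet from configuration XML that has already been decrypted.
    public static DataSet LoadConfig(string decryptedXml)
    {
        DataSet config = new DataSet();
        using (StringReader reader = new StringReader(decryptedXml))
        {
            config.ReadXml(reader); // the schema is inferred from the XML
        }
        return config;
    }

    // Returns the named table, or throws if the config does not contain it.
    public static DataTable GetTable(DataSet config, string tableName)
    {
        DataTable table = config.Tables[tableName];
        if (table == null)
        {
            throw new ArgumentException("Unknown config table: " + tableName);
        }
        return table;
    }

    // Hypothetical GetInteger: looks a setting up in an assumed "Settings" table.
    public static int GetInteger(DataSet config, string settingName)
    {
        foreach (DataRow row in GetTable(config, "Settings").Rows)
        {
            if (Convert.ToString(row["Name"]) == settingName)
            {
                return int.Parse(Convert.ToString(row["Value"]));
            }
        }
        throw new ArgumentException("Unknown setting: " + settingName);
    }
}
```

Throwing on an unknown table or setting name makes a misspelled name (for example one with a stray trailing space) fail immediately instead of silently returning nothing.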


4) ResumePreprocess: (work started but not yet completed)

ClassName: ResumePreprocess

Method Name: GetFileInfo
This method accepts a file path & returns a FileInfo object.

public FileInfo GetFileInfo(string filepath)
{
    return fileInfo; // the FileInfo object populated for the given path
}

ClassName: FileInfo

Properties: Name, Type, Size

public string Name { get; set; }

public string Type { get; set; }

public string Size { get; set; }
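
As an illustration of how GetFileInfo could fill the three properties, here is a minimal sketch built on the framework's System.IO.FileInfo. The class is named ResumeFileInfo here only to avoid a name clash with System.IO.FileInfo; that name, and the choice to store the type as a lower-case extension and the size as a byte count, are assumptions and not taken from the parserX code.

```csharp
using System.IO;

// Hypothetical stand-in for the document's FileInfo class.
public class ResumeFileInfo
{
    public string Name { get; set; }
    public string Type { get; set; }
    public string Size { get; set; }
}

public static class ResumePreprocessSketch
{
    public static ResumeFileInfo GetFileInfo(string filepath)
    {
        FileInfo file = new FileInfo(filepath); // System.IO.FileInfo

        ResumeFileInfo info = new ResumeFileInfo();
        info.Name = file.Name;
        // Extension without the leading dot, lower-cased ("docx", "pdf", ...).
        info.Type = file.Extension.TrimStart('.').ToLowerInvariant();
        // Length throws for a missing file, so guard with Exists.
        info.Size = file.Exists ? file.Length.ToString() : "0";
        return info;
    }
}
```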

5) Document Type Conversion:


Classification of Document Types:

(i) WORD (Doc, DocX, RTF, TXT) (Done)
(ii) PDF (Done)
(iii) Scanned documents (images) (Not started)
(iv) HTML (Not started)

For resume conversion we will use one common function, DocumentConversion(string filePath, FileInfo
fileInfo), to which we pass the resume path & the FileInfo object as parameters.

Classification of Document Type Conversion:

(i) If the resume format is .docx, then we will convert the DOCX to XML & process it further
accordingly.

Need to do research on this. Implementation is to be done in a future release.

(ii) If the resume format is Word, then we will use DocumentConversion.dll to convert the resume
to plain text. The method to be used in this case is DoConversion(filePath,
"PLAIN_TEXT");
(iii) If the resume format is PDF, then we will use two DLLs, PDFBox.dll and ExpertPDF.dll.
Initially we will use PDFBox.dll to get the plain text; in case we don't get the desired
output, we will use ExpertPDF.dll.

For PDFBox.dll we will use the code below.

// Parse using PDFBox
PDDocument document = PDDocument.load(filePath);
PDFTextStripper textStripper = new PDFTextStripper();
string result = textStripper.getText(document);

For ExpertPDF.dll we will use the code below.

// Parse using ExpertPDF
PdfToTextConverter converter = new PdfToTextConverter();
string result = converter.ConvertToText(stream);

After the resume has been converted to plain text, we count its lines by splitting on \n.

(iv) If the resume is a scanned document, either in .pdf or image format:

Need to do research on this. Implementation is to be done in a future release.

ClassName: TypeConversion

MethodName: DocumentConversion
public string DocumentConversion(string resumePath, FileInfo fileInfo)

This is the main method. It accepts the resume path & a FileInfo object and, based on the file type,
selects the appropriate method to convert the resume to plain text.

MethodName: ConvertPDFToText
public string ConvertPDFToText(Stream filedat)
public string ConvertPdfToText(byte[] filedat, out string htmlString)

This method is used to convert a PDF resume to plain text.

MethodName: ConvertDocToText
public string ConvertDocToText(Stream filedat)
public string ConvertDocToText(Stream filedat, out string htmlString)

This method is used to convert a Word (DOC) resume to plain text.
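
The selection logic of DocumentConversion, and the "try PDFBox first, then ExpertPDF" rule for PDFs, can be sketched roughly as below. This is only an illustration: the route names, the class TypeConversionSketch and both method names are hypothetical, the converters are passed in as delegates instead of calling the real DLLs, and "desired output not obtained" is assumed to mean that the first converter failed or returned empty text.

```csharp
using System;

public static class TypeConversionSketch
{
    // Maps a file type (with or without a leading dot) to a conversion route.
    public static string ChooseRoute(string fileType)
    {
        string type = (fileType ?? "").Trim().TrimStart('.').ToLowerInvariant();
        switch (type)
        {
            case "docx": return "docx-to-xml";          // planned for a future release
            case "doc":
            case "rtf":
            case "txt": return "word-to-plain-text";    // DocumentConversion.dll
            case "pdf": return "pdf-to-plain-text";     // PDFBox.dll, then ExpertPDF.dll
            default: return "unsupported";              // scanned images, HTML: not started
        }
    }

    // Runs the primary converter and falls back to the second one
    // when the primary fails or yields no text.
    public static string ConvertWithFallback(
        string path, Func<string, string> primary, Func<string, string> fallback)
    {
        string text = null;
        try
        {
            text = primary(path);
        }
        catch (Exception)
        {
            text = null; // treat a failed conversion like empty output
        }
        if (text == null || text.Trim().Length == 0)
        {
            text = fallback(path);
        }
        return text;
    }
}
```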

Document Language Test: We need to test (detect) the language of the resume.

6) Resume Verification: (Partially Done)

Verification of the resume will be done on the following parameters:


(i) File Type.

(ii) Check Line Count
(iii) By Keywords
We will check if the resume contains words such as email, resume,
CV, Education, Experience

ClassName: ResumeVerification

MethodName: IsResumeFile
public bool IsResumeFile(FileInfo fileInfo)

This method accepts a FileInfo object & checks the resume file type. If the type is pdf, doc, docx,
html, text or rtf, it returns true; otherwise it returns false.

MethodName: IsResumeText
public bool IsResumeText(string resumetext)

This method accepts the resume text & checks whether it is a valid resume by checking the line count
& keywords such as email, resume, CV, Education, Experience.
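
A minimal sketch of the two checks described above is given below. The class name, the string-based IsResumeFile signature and the minLines parameter are assumptions for illustration; this document does not say what line count is required, so the threshold is left to the caller.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ResumeVerificationSketch
{
    private static readonly string[] AllowedTypes =
        { "pdf", "doc", "docx", "html", "text", "txt", "rtf" };

    // Whole-word, case-insensitive match on the keywords named in the document.
    private static readonly Regex Keywords = new Regex(
        @"\b(email|resume|cv|education|experience)\b", RegexOptions.IgnoreCase);

    // True if the file type is one of the accepted resume formats.
    public static bool IsResumeFile(string fileType)
    {
        string type = (fileType ?? "").Trim().TrimStart('.').ToLowerInvariant();
        return Array.IndexOf(AllowedTypes, type) >= 0;
    }

    // True if the text has at least minLines lines and contains a resume keyword.
    public static bool IsResumeText(string resumeText, int minLines)
    {
        if (string.IsNullOrEmpty(resumeText))
        {
            return false;
        }
        int lineCount = resumeText.Split('\n').Length; // lines are counted by \n
        if (lineCount < minLines)
        {
            return false;
        }
        return Keywords.IsMatch(resumeText);
    }
}
```

Matching on word boundaries avoids accepting a document merely because "cv" happens to appear inside a longer word.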

7) Text Clean up and Prepare Text: (Partially Done)

ClassName: ResumeCleanup

Method: RemoveExcessiveSpaces
This method is used to remove excessive spaces.

In the resume cleanup process we remove white space:

Replace the equals sign (=) with a blank space.
Pick up each line and trim it.
DeleteWorthlessStartLines
ConvertDrawnLinesToBlankLines
CompressVerticalWhitespace
CollapseExpandedWords
FixBadHyphensAndWeirdChars
FixSlashRSlashRAndOther
ReplaceDrawnLinesWithSpaces
BreakTextIfMissingMostLineBreaks
RemoveExcessSpaceBeforeColons
ConcatenateLinesIfSplitAcrossPrepositionalPhrase
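
A few of the cleanup steps listed above can be sketched as follows. Only the method names RemoveExcessiveSpaces and CompressVerticalWhitespace come from this document; their exact behaviour here (collapsing runs of spaces/tabs to one space, and runs of three or more line breaks to a single blank line), the line-ending normalization and the Clean wrapper are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ResumeCleanupSketch
{
    // Collapses runs of spaces/tabs into one space and trims the line.
    public static string RemoveExcessiveSpaces(string line)
    {
        return Regex.Replace(line, @"[ \t]{2,}", " ").Trim();
    }

    // Reduces three or more consecutive line breaks to a single blank line.
    public static string CompressVerticalWhitespace(string text)
    {
        return Regex.Replace(text, @"\n{3,}", "\n\n");
    }

    public static string Clean(string text)
    {
        // Normalize line endings, then replace the equals sign with a blank space.
        string normalized = text.Replace("\r\n", "\n").Replace('\r', '\n').Replace('=', ' ');

        // Pick up each line and trim it.
        string[] lines = normalized.Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            lines[i] = RemoveExcessiveSpaces(lines[i]);
        }
        return CompressVerticalWhitespace(string.Join("\n", lines));
    }
}
```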

8) Splitting the whole CV into three blocks & retrieving the data: (Not started)

We will divide the resume into three sections: Header, Footer and Body.


For Body:
The body will be further divided into sub-sections called zones:
Objective
Summary
Hobbies
Achievement
Skill
Education
Work Experience
Reference
Seminars

For Header:
1) To get the header from plain text, we will search for "" in the resume and, if we get the data,
we will use it.
2) If we do not get the desired output from step 1, then we will take the first 5 lines of the resume & treat them as
the header.

For Footer:
1) To get the footer from plain text, we will search for " " in the resume and, if we get the data,
we will use it.
2) If we do not get the desired output from step 1, then we will take the last 7 lines of the resume & treat them
as the footer.
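
The fallback rule in step 2 for both header and footer (first 5 lines, last 7 lines, everything in between as the body) can be sketched as below. The ResumeBlocks type and the Split method are hypothetical names; the line counts are passed as parameters so that the values 5 and 7 stay configurable. The marker search of step 1 is not shown because the marker text is not given in this document.

```csharp
using System;

// Hypothetical container for the three blocks.
public class ResumeBlocks
{
    public string[] Header;
    public string[] Body;
    public string[] Footer;
}

public static class ResumeSplitterSketch
{
    // Splits cleaned resume text into header, body and footer by line position.
    public static ResumeBlocks Split(string resumeText, int headerLines, int footerLines)
    {
        string[] lines = resumeText.Split('\n');

        // Short resumes: never take more lines than are available.
        int headerCount = Math.Min(headerLines, lines.Length);
        int footerCount = Math.Min(footerLines, lines.Length - headerCount);
        int bodyCount = lines.Length - headerCount - footerCount;

        ResumeBlocks blocks = new ResumeBlocks();
        blocks.Header = new string[headerCount];
        blocks.Body = new string[bodyCount];
        blocks.Footer = new string[footerCount];

        Array.Copy(lines, 0, blocks.Header, 0, headerCount);
        Array.Copy(lines, headerCount, blocks.Body, 0, bodyCount);
        Array.Copy(lines, headerCount + bodyCount, blocks.Footer, 0, footerCount);
        return blocks;
    }
}
```

In this sketch the header takes priority when a resume has fewer lines than header and footer together; that ordering is a design choice made for the example.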

9) Parse for Personal data:

In this process, we will parse and extract personal data such as Date of Birth, Nationality, Gender,
Marital Status, Driving License, Current Location, Preferred Location, Willing to Relocate, Father's Name,
Mother's Maiden Name, Visa, Passport Number, Current Salary, Required Salary, SSN, Resume Id.
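
This document refers to "strong" and "weak" rules without defining them. One possible reading, sketched below for the Date of Birth zone, is that a strong rule requires an explicit label in front of the value, while a weak rule accepts the first value of the right shape anywhere in the text. Both regular expressions, the accepted date shape and all names in the sketch are assumptions, not the actual parserX rules.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PersonalDataSketch
{
    // Assumed date shape: day and month in 1-2 digits, 4-digit year, separated by - / or .
    private const string DatePattern = @"\d{1,2}[-/.]\d{1,2}[-/.]\d{4}";

    // Strong rule (assumed): the date must follow a "Date of Birth" or "DOB" label.
    private static readonly Regex StrongDob = new Regex(
        @"\b(?:date\s+of\s+birth|d\.?o\.?b\.?)\s*[:\-]?\s*(" + DatePattern + ")",
        RegexOptions.IgnoreCase);

    // Weak rule (assumed): the first date-like token anywhere in the text.
    private static readonly Regex WeakDob = new Regex(@"\b(" + DatePattern + @")\b");

    // Returns the matched date text, or null when no rule matches.
    public static string FindDateOfBirth(string text, bool allowWeakRule)
    {
        Match strong = StrongDob.Match(text);
        if (strong.Success)
        {
            return strong.Groups[1].Value;
        }
        if (allowWeakRule)
        {
            Match weak = WeakDob.Match(text);
            if (weak.Success)
            {
                return weak.Groups[1].Value;
            }
        }
        return null;
    }
}
```

A first parse attempt would call FindDateOfBirth with allowWeakRule set to false, and a second attempt would pass true only for zones that are still empty after verification.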

[Flowchart: parserX processing flow. Start: the application takes the login credential and the Data Loader
prepares the configuration (1. check the XML file version from the server; 2. read the encrypted XML config
file holding the metadata & application settings; 3. decrypt the config file text; 4. build the DataSet;
5. assign all the values to the basic variables). If the user is valid, ParseType() selects Folder Iterate,
Outlook (Email) Iterate or Single File Parse (Doc, Docx, PDF, HTML, TEXT, RTF). The flow then continues with
GetResumePath(), IsResumeFile(), initializing the basic objects of parserX, Get File Info, IsResumeText(),
Resume Cleanup and Parse 1st Attempt by strong rule (1. get info by rules (RegEx); 2. get info zone-wise).
Each resume object is checked one by one (cross verification); if it is not verified, Parse 2nd Attempt by
weak rule follows (1. get info by rules (RegEx); 2. get info zone-wise). Exceptions are logged. The parsed
resume object is written as XML output, single-line CSV output, HTML output or error output.]