
Benha University Faculty of Engineering at Shoubra Computer engineering department

Arabic Keyphrase Extraction


Graduation Project

Supervised by:

Prof. Dr. Abdulwahab Al-Sammak

Prepared by: Ahmed Ali Ahmed Mostafa Mohammed Ahmed Rashad Basiouny Mohab Tarek El-Shishtawy Mostafa Mahmoud El-Abady Sherif Mohammed Nasr

A graduation project submitted to the Computer Engineering Department in fulfillment of the requirements for the degree of B.Sc. in Computer Engineering

Cairo, Egypt June 14, 2012

Acknowledgement
In the name of ALLAH, most Gracious, most Merciful. First of all, we thank Allah for the power and the ability He gave us to make this project real. Then, we would like to express our gratitude to all those who made it possible for us to complete this project: Prof. Dr. Abdulwahab Al-Sammak, for his advice, guidance, support and encouragement; Prof. Dr. Tarek El-Shishtawy, for his advice, effort and support; the Stanford NLP Group, for their valuable resources; and the Linguistic Data Consortium (LDC), for their valuable resources.

Our parents, brothers and sisters who endured this time with us and were always a great source of encouragement.

Table of Contents:
1. Introduction
2. Data Mining
   2.1 Introduction
   2.2 The Scope of Data Mining
   2.3 Background
   2.4 KDD Process
   2.5 The Cross-Industry Standard Process for Data Mining
   2.6 Simplified Process in KDD
3. Keyphrase Extraction
   3.1 Introduction
   3.2 Supervised Machine Learning Techniques
       3.2.1 C4.5 Decision Tree Induction Algorithm
       3.2.2 GenEx (Genitor and Extractor)
             3.2.2.1 Extractor
             3.2.2.2 Genitor
       3.2.3 Sakhr
       3.2.4 Kea
       3.2.5 Using Linguistic Knowledge and Machine Learning Techniques
   3.3 Unsupervised Machine Learning Techniques
       3.3.1 KP-Miner
             3.3.1.1 System Overview
             3.3.1.2 Candidate Keyphrase Selection
             3.3.1.3 Candidate Keyphrase Weight Calculation
             3.3.1.4 Final Candidate Phrase List Refinement
             3.3.1.5 Evaluation and Drawbacks
4. Proposed System
   4.1 Introduction
   4.2 Pre-Processing Phase
   4.3 Segmentation
   4.4 POS Tagging Phase
       4.4.1 Training Data Supplied to POS Tagger
       4.4.2 POS Tag Set
       4.4.3 Lemmatization
   4.5 Candidate Keyphrase Selection
   4.6 Feature Extraction Phase
5. Results and Future Work
   5.1 Results
   5.2 Future Work
Appendix

Chapter 1 Introduction

Many academic journals ask their authors to provide a list of about five to fifteen keywords, to appear on the first page of each article. Since these keywords are often phrases of two or more words, they are called keyphrases. There is a wide variety of tasks for which keyphrases are useful.

Background and Related Work:


The task of extracting keyphrases from free-text documents is becoming increasingly important as the uses for such technology expand. As the amount of electronic textual content grows rapidly, keyphrases can help manage these large volumes of textual information. Keyphrases play an important role in digital libraries, web content, and content management systems, especially for cataloging and information retrieval. The limited number of documents that carry author-assigned keyphrases as metadata raises the need for a tool that can automatically extract keyphrases from text. Such a tool can enable many different types of information retrieval and analysis systems. It can automate:
- Generating metadata that gives a high-level description of a document's contents, supporting text-mining tasks such as document and Web page retrieval.
- Summarizing documents for prospective readers: keyphrases can represent a highly condensed summary of the document in question (Avanzo & Magnini, 2005).
- Highlighting important topics within the body of the text, to facilitate speed reading (skimming) and help the reader decide whether the document is relevant.
- Measuring the similarity between documents, making it possible to cluster and categorize them (Karanikolas & Skourlas, 2006).
- Searching: queries become more precise when keyphrases serve as the basis for search indexes or as a way of browsing a collection of documents.

Many remarkable efforts have been proposed and implemented for automatically extracting keyphrases from English documents and other languages. In contrast, little work has been done for documents written in Arabic. Although some researchers have applied their keyphrase extraction systems to Arabic documents, the efficiency of the extracted keyphrases was not satisfactory.

Work on automatic keyphrase extraction started fairly recently. The first attempts to approach this task were purely based on heuristics (Krulwich and Burkey, 1996). However, keyphrases generated by this approach failed to map well to author-assigned keywords, indicating that the applied heuristics were poor ones (Turney, 1999). Motivated by the spectrum of potential applications of accurate keyphrase extraction and the failings of the heuristic model, Peter Turney devised a powerful, machine-learning-based keyphrase extraction system called GenEx (Turney, 1999; Turney, 2000). In building this system, Turney was the first to approach the task of keyphrase extraction as a supervised learning problem. Turney uses the degree of statistical association, determined through web mining techniques, to measure semantic relatedness. The major drawback of this work is that calculating the coherence feature takes a long time (almost 15 minutes per document) (Turney, 2003). In addition, a number of other systems were designed specifically for extracting keyphrases from web documents, such as those presented in (Chen et al., 2005) and (Kelleher and Luz, 2005).

Kea (Frank et al., 1999; Witten et al., 1999, 2000) is another remarkable effort in this area. It identifies candidate keyphrases in the same manner as Extractor, then uses the Naïve Bayes algorithm to classify the candidate phrases as keyphrases or not. In Kea, candidate phrases are classified using only two features: (i) TFxIDF, and (ii) relative distance. TFxIDF (term frequency times inverse document frequency) captures a word's frequency in a single document compared to its rarity in the whole document collection.
It is used to assign a high value to a phrase that is relatively frequent in the input document (TF component), yet relatively rare in other documents (IDF component). The relative distance of a phrase in a given document is defined as the number of words that precede the first occurrence of the phrase divided by the number of words in the document. Kea uses the Naïve Bayes algorithm to calculate the probability that a candidate phrase belongs to the keyphrase class, and ranks the candidates by this estimated probability. If the user requests N phrases, Kea outputs the top N phrases with the highest estimated probability.

KP-Miner (El-Beltagy & Rafea, 2008) is an unsupervised machine learning algorithm that uses the TFxIDF measure with two boosting factors: the first depends on phrase length, and the second on phrase position in the document. The KP-Miner system does not need to be trained on a particular document set. It also has the advantage of being configurable, as the rules and heuristics adopted by the system relate to the general nature of documents and keyphrases. This implies that users can use their understanding of the input document to fine-tune the system to their particular needs.
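The TFxIDF idea described above can be sketched in a few lines. This is an illustrative computation only, not Kea's actual implementation; the smoothing term in the IDF is an assumption:

```python
import math

def tfidf(phrase, doc_words, corpus_docs):
    """Illustrative TFxIDF: frequency in this document (TF) times
    rarity across the whole collection (IDF)."""
    tf = doc_words.count(phrase) / len(doc_words)          # TF component
    n_containing = sum(1 for d in corpus_docs if phrase in d)
    idf = math.log(len(corpus_docs) / (1 + n_containing))  # IDF component, smoothed
    return tf * idf
```

A phrase that is frequent in the input document but rare elsewhere in the collection scores highest, which is exactly the property Kea exploits.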

Proposed System:
In this work, automatic keyphrase extraction is treated as a supervised machine learning task. Two important issues are addressed: how to define the candidate keyphrase terms, and which features of these terms are considered discriminative, i.e., how to represent the data, and consequently what is given as input to the learning algorithm. Our motivation is that adding linguistic knowledge (such as lexical features and syntactic rules) to the extraction process, rather than relying only on statistics, may yield better results. Thus, the current work combines linguistic knowledge and machine learning techniques to extract keyphrases from Arabic documents with reasonable accuracy. Linguistic knowledge plays an important role in different stages of our proposed system:
1. The analysis stage, where the document is tokenized into sentences and words, and each word is analyzed to extract its POS tag and lemma.

2. The candidate keyphrase extraction stage, where a set of syntactic rules determines the allowed word sequences of the generated n-gram terms according to their POS tags and lemmas.
3. The feature vector calculation stage, where some of the selected features of each candidate phrase are linguistic-based, in addition to the statistical-based features.
The proposed system consists of five main steps: document pre-processing, part-of-speech analysis, lemmatization, candidate phrase extraction, and feature vector calculation. The following sections describe these steps in detail.

Document Preprocessing:
The input document is segmented at two levels. At the first level, the document is segmented into its constituent sentences based on Arabic phrase delimiter characters such as the comma, semicolon, colon, hyphen, and dot. This process is useful for calculating part of the feature vector of the candidate terms. At the second level, each sentence is segmented into its constituent words, based on the criterion that words are usually separated by spaces.
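The two-level segmentation can be sketched as follows. The delimiter set below is an assumption (the text names comma, semicolon, colon, hyphen, and dot; the Arabic comma and semicolon are added as plausible members of the same class):

```python
import re

# Sentence delimiters: Latin punctuation plus Arabic comma (U+060C)
# and Arabic semicolon (U+061B) -- an assumed, illustrative set.
SENTENCE_DELIMS = r"[.,;:\-\u060C\u061B]"

def segment(document):
    """Level 1: split into sentences on delimiter characters.
    Level 2: split each sentence into words on whitespace."""
    sentences = [s.strip() for s in re.split(SENTENCE_DELIMS, document) if s.strip()]
    return [sentence.split() for sentence in sentences]
```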

Part of Speech tagging:


Part-of-speech tagging is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context: its relationship with adjacent and related words in a phrase, sentence, or paragraph. It identifies words as nouns, verbs, adjectives, adverbs, etc.
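As a concept-only illustration, the simplest possible tagger is a lexicon lookup with a fallback tag. The proposed system uses a trained statistical tagger; the toy lexicon and the noun fallback below are made-up assumptions:

```python
# Toy lookup tagger -- illustrative only, not the system's tagger.
LEXICON = {"the": "DET", "cat": "NOUN", "sat": "VERB", "quickly": "ADV"}

def tag(words):
    # Unknown words default to NOUN, a common open-class fallback heuristic.
    return [(w, LEXICON.get(w.lower(), "NOUN")) for w in words]
```

A real tagger additionally uses context (the surrounding tags) to disambiguate words that admit several parts of speech.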

Lemmatization:
Lemmatization is the process of extracting the abstract form of a word: the basic form from which the given word is logically derived. Usually, this form differs from the word's stem, which is obtained by removing the prefix and suffix parts of the word. For example, the stem of the word " " is " ", which represents a human being, while the abstract form of the word is " ", which represents the adjective of a visual object. The abstract form of a given word is represented as follows:
- The singular form for nouns.
- The singular masculine form for adjectives.
- The past form for verbs.
- The stem form for stop-words.
The abstract form of the given word is extremely useful during the process of extracting candidate keyphrases.

Candidate Phrases Extraction:


We used the following syntactic rules for extracting candidate phrases:
1- The candidate phrase can start only with certain kinds of nouns: general-noun, place-noun, proper-noun, and declined-noun.
2- The candidate phrase can end only with a general-noun, place-noun, proper-noun, declined-noun, time-noun, augmented-noun, adjective, or adverb.
3- For a three-word phrase, the second word is also allowed to be a count-noun, conjunction, preposition, or comparison, in addition to those cited in rule 2.
4- We created two lists of stop-words: stop-words that shouldn't appear in a keyphrase at all, and stop-words that can appear only as the middle word of a three-word keyphrase.
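The four rules can be sketched directly as tag-set checks. The tag names abbreviate the noun classes listed above, and the stop-word lists here are placeholders, not the system's actual lists:

```python
# Tag sets derived from rules 1-3; stop-word lists are placeholders.
START_TAGS = {"general-noun", "place-noun", "proper-noun", "declined-noun"}
END_TAGS = START_TAGS | {"time-noun", "augmented-noun", "adjective", "adverb"}
MIDDLE_TAGS = END_TAGS | {"count-noun", "conjunction", "preposition", "comparison"}
STOPWORDS = {"of"}          # rule 4: banned from keyphrases (placeholder)
MIDDLE_STOPWORDS = set()    # rule 4: allowed only as the middle word (placeholder)

def is_candidate(tagged_phrase):
    """tagged_phrase: list of (word, tag) pairs, length 1 to 3."""
    words = [w for w, _ in tagged_phrase]
    tags = [t for _, t in tagged_phrase]
    if any(w in STOPWORDS for w in words[:1] + words[-1:]):
        return False                                  # rule 4
    if tags[0] not in START_TAGS or tags[-1] not in END_TAGS:
        return False                                  # rules 1 and 2
    if len(tags) == 3 and tags[1] not in MIDDLE_TAGS and words[1] not in MIDDLE_STOPWORDS:
        return False                                  # rules 3 and 4
    return True
```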


Feature Vector Calculation:


Each candidate phrase is assigned a number of features used to evaluate its importance. In our algorithm, the following features are computed for each candidate phrase:
a) Normalized Phrase Words (NPW).
b) Phrase Relative Frequency (PRF).
c) Word Relative Frequency (WRF).
d) Normalized Sentence Location (NSL).
e) Normalized Phrase Location (NPL).
f) Normalized Phrase Length (NPLen).
g) Sentence Contains Verb (SCV).
h) Is It Question (IIT).
i) Is-Key.
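Two of the listed features can be sketched as follows. Their exact definitions are not spelled out in this section, so the formulas below are assumptions: NPL is modeled on the relative-distance feature described for Kea in Chapter 1, and NPLen simply scales the phrase length by the three-word maximum:

```python
def normalized_phrase_location(phrase_words, doc_words):
    """NPL sketch (assumed definition): number of words preceding the
    first occurrence of the phrase, divided by document length."""
    n = len(phrase_words)
    for i in range(len(doc_words) - n + 1):
        if doc_words[i:i + n] == phrase_words:
            return i / len(doc_words)
    return 1.0  # phrase absent: treat as maximally late

def normalized_phrase_length(phrase_words, max_len=3):
    """NPLen sketch (assumed definition): length scaled by the maximum."""
    return len(phrase_words) / max_len
```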

Many authors, starting with Turney (1997, 1999, 2000), have used features (a), (b), and (c); the proposed algorithm uses a different normalization technique to satisfy our hypothesis of feature importance. The original form of each candidate keyphrase is retained for presentation to the user in case the phrase does turn out to be a keyphrase; this is a straightforward operation. The proposed algorithm computes features for all candidates instead of unique stemmed keyphrases (as in KEA and Turney), which eliminates the need to select the most frequent form when several different forms of a keyphrase occur.

Results:
The program was tested on documents from various fields and by many authors. To evaluate the performance of the proposed system, many experiments were carried out, using a total of 25 documents. The first experiment aimed to measure the level of acceptance of the extracted keyphrases. Since there are no author-assigned keyphrases for these documents, a human judge was used to evaluate this level. We compared the results with KP-Miner, but we couldn't compare with Sakhr because its output was not in a comparable form.


Chapter 2 Data Mining


2.1 Introduction
Data mining (knowledge discovery) is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. They can answer business questions that were traditionally too time-consuming to resolve: they scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining is the step in the KDD process aimed at discovering patterns and relationships in preprocessed and transformed data.

2.2 The Scope of Data Mining


Data mining derives its name from the similarity between searching for valuable business information in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing capabilities such as automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases; questions that traditionally required extensive hands-on analysis can now be answered directly from the data, quickly. A typical example of a predictive problem is targeted marketing: data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.

A second capability is the automated discovery of previously unknown patterns: data mining tools sweep through databases and identify hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data-entry keying errors.


2.3 Background
The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have dramatically increased data collection, storage, and manipulation ability. As data sets have grown in size and complexity, direct "hands-on" data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees (1960s), and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns in large data sets. It bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management, by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever larger data sets.

2.4 KDD Process:


The Knowledge Discovery in Databases (KDD) process is commonly defined with the following stages:
(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation

Let's examine the knowledge discovery process in more detail:


Data from a variety of sources is integrated into a single data store, called the target data. The data is then pre-processed and transformed into a standard format. Data mining algorithms process the data to produce output in the form of patterns or rules, and those patterns and rules are then interpreted into new, useful knowledge. A wide range of organizations in various industries, including manufacturing, marketing, chemicals, and aerospace, use data mining to gain an advantage over their competitors, so the need for a standard data mining process has increased dramatically. Such a process must be reliable and repeatable by business people with little or no data mining background. In the late 1990s, the Cross-Industry Standard Process for Data Mining (CRISP-DM) was first published, after many workshops and contributions from over 300 organizations. Let's examine it in greater detail.

2.5 The Cross-Industry Standard Process for Data Mining (CRISP-DM)


The Cross-Industry Standard Process for Data Mining (CRISP-DM) consists of six phases, intended as a cyclical process, as shown in the following figure:

Cross-Industry Standard Process for Data Mining (CRISP-DM)


Business understanding - In this phase, we must first understand the business objectives clearly and find out what the client really wants to achieve. Next, we assess the current situation: the available resources, assumptions, constraints, and other important factors. From the business objectives and the current situation, we then define data mining goals that serve the business objective within the current situation. Finally, a good data mining plan has to be established to achieve both the business and data mining goals. The plan should be as detailed as possible, with step-by-step actions to perform during the project, including the initial selection of data mining techniques and tools.

Data understanding - This phase starts with initial data collection from the available sources, to get familiar with the data. Important activities, including data loading and data integration, must be carried out to make the collection successful. Next, the gross or surface properties of the acquired data are examined carefully and reported. Then the data is explored by tackling the data mining questions, which can be addressed using querying, reporting, and visualization. Finally, data quality is examined by answering questions such as: Is the acquired data complete? Are there any missing values?

Data preparation - Data preparation normally consumes about 90% of the project time. Its outcome is the final data set. Once the available data sources are identified, they need to be selected, cleaned, constructed, and formatted into the desired form. Deeper data exploration may also be carried out during this phase to notice patterns, based on business understanding.

Modeling - First, modeling techniques are selected for the prepared dataset. Next, a test scenario is generated to validate the model's quality and validity. Then one or more models are created by running the modeling tool on the prepared dataset. Last but not least, the models are assessed carefully, involving stakeholders, to make sure that the created models meet the business initiatives.

Evaluation - In the evaluation phase, the model results must be evaluated in the context of the business objectives set in the first phase. New business requirements may be raised here, because of new patterns discovered in the model results or because of other factors. Gaining business understanding is an iterative process in data mining. The go/no-go decision to move to the deployment phase must be made in this step.


Deployment - The knowledge gained through the data mining process needs to be presented in such a way that stakeholders can use it when they want it. Based on the business requirements, the deployment phase can be as simple as creating a report or as complex as a repeatable data mining process across the organization. In this phase, deployment, maintenance, and monitoring plans have to be created for rollout and future support. From a project point of view, the final report needs to summarize the project experience and review the project to see what needs improvement, recording the lessons learned. CRISP-DM offers a uniform framework for experience documentation and guidelines, and it can be applied in different industries with different types of data.

2.6 Simplified process in KDD:


This section describes a simplified KDD process: (1) pre-processing, (2) data mining, and (3) results validation.

2.6.1 Pre-processing
Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target dataset must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyze the multivariate datasets before data mining. The target set is then cleaned. Data cleaning removes the observations containing noise and those with missing data.
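The cleaning step described above can be sketched minimally. The field name and the plausible-range check are made-up assumptions used only to illustrate dropping missing and noisy observations:

```python
def clean(rows):
    """Minimal data-cleaning sketch: drop rows with missing values (None)
    and rows whose numeric fields fall outside a plausible range (noise).
    The "age" field and its 0-120 range are illustrative assumptions."""
    cleaned = []
    for row in rows:
        if any(v is None for v in row.values()):
            continue                       # missing data
        if not (0 <= row["age"] <= 120):
            continue                       # obvious noise / keying error
        cleaned.append(row)
    return cleaned
```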

2.6.2 Data Mining


Data mining involves six common classes of tasks:
- Anomaly detection (outlier/change/deviation detection): the identification of unusual data records that might be interesting, or data errors that require further investigation.
- Association rule learning (dependency modeling): searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
- Clustering: the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.


- Classification: the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
- Regression: attempts to find a function that models the data with the least error.
- Summarization: providing a more compact representation of the data set, including visualization and report generation.
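The market basket analysis mentioned under association rule learning can be illustrated with a simple pair-counting sketch (a real system would use an algorithm such as Apriori; this only shows the underlying idea of support counting):

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Count how often each item pair is bought together and keep
    pairs meeting a minimum support count."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}
```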

2.6.3 Results Validation


The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid; it is common for them to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" e-mails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to a test set of e-mails on which it had not been trained. The accuracy of the patterns can then be measured by how many e-mails they correctly classify. A number of statistical methods may be used to evaluate the algorithm, such as ROC curves. If the learned patterns do not meet the desired standards, it is necessary to revisit and change the pre-processing and data mining steps. If they do meet the desired standards, the final step is to interpret the learned patterns and turn them into knowledge.
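The held-out evaluation described above reduces to a short function. The keyword-based classifier in the usage note is a deliberately trivial stand-in for any learned model:

```python
def evaluate(classifier, test_set):
    """Held-out evaluation sketch: apply learned patterns to examples
    the classifier was not trained on and measure accuracy."""
    correct = sum(1 for text, label in test_set if classifier(text) == label)
    return correct / len(test_set)
```

For instance, a toy spam rule such as `lambda text: "spam" if "winner" in text else "legitimate"` can be scored against a labeled test set of e-mails it never saw during training.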


Chapter 3 Keyphrase Extraction


3.1 Introduction
Several keyphrase extraction techniques have been proposed and implemented successfully in different contexts. Attempts at keyphrase extraction can be classified into two main streams: supervised machine learning algorithms (most prior work on the document keyphrase extraction problem is based on these) and unsupervised machine learning algorithms.

3.2 Supervised Machine Learning Techniques


We start with the techniques based on supervised machine learning:
- Turney (1997, 1999, 2000)
- Sakhr

Turney was the first to approach the problem of keyphrase extraction as supervised learning, and presented two different machine learning algorithms for extracting keyphrases from a document. The first is based on the C4.5 decision tree classifier (Quinlan, 1993), and the second is the GenEx (Genitor and Extractor) algorithm (Turney, 1997, 1999, 2000).

3.2.1 C4.5 decision tree induction algorithm


The C4.5 decision tree induction algorithm was used to classify phrases as positive or negative examples of keyphrases. In this section, we describe the feature vectors, the settings used for C4.5's parameters, the bagging procedure, and the method for sampling the training data.

The task of supervised learning is to learn how to assign cases (or examples) to classes. For keyphrase extraction, a case is a candidate phrase, which we wish to classify as a positive or negative example of a keyphrase. We classify a case by examining its features; a feature can be any property of a case that is relevant for determining its class. C4.5 can handle real-valued features, integer-valued features, and features with values that range over an arbitrary, fixed set of symbols. C4.5 takes as input a set of training data in which cases are represented as feature vectors. In the training data, a teacher must assign a class to each feature vector (hence "supervised" learning). C4.5 generates as output a decision tree that models the relationships among the features and the classes (Quinlan, 1993).

A decision tree is a rooted tree in which the internal vertices are labeled with tests on feature values and the leaf vertices are labeled with classes. The edges that leave an internal vertex are labeled with the possible outcomes of the test associated with that vertex. For example, a feature might be "the number of words in the given phrase", and a


test on a feature value might be "the number of words in the given phrase is less than two", which can have the outcomes true or false. A case is classified by beginning at the root of the tree and following a path to a leaf, based on the values of the case's features. The label on the leaf is the predicted class for the given case. The documents were converted into sets of feature vectors by first listing all phrases of one, two, or three consecutive non-stop words that appear in a given document, with no intervening punctuation. The iterated Lovins stemmer was used to find the stemmed form of each of these phrases. For each unique stemmed phrase, we generated a feature vector, as described in Table 3.1, a description of the feature vectors used by C4.5.

Table 3.1


C4.5 has access to nine features (features 3 to 11) when building a decision tree. The leaves of the tree predict the class (feature 12). When a decision tree predicts that the class of a vector is 1, then the phrase whole_phrase is a keyphrase according to the tree; this phrase is suitable for output to a human reader. We used the stemmed form of the phrase, stemmed_phrase, for evaluating the performance of the tree. Table 3.2 shows the number of feature vectors that were generated for each corpus. The large majority of these vectors were negative examples of keyphrases (class 0). In a real-world application, the user would want to specify the desired number of output keyphrases for a given document. However, a standard decision tree does not let the user control the number of feature vectors that are classified as belonging to class 1. Therefore we ran C4.5 with the -p option, which generates soft-threshold decision trees (Carter and Catlett, 1987; Quinlan, 1987, 1990, 1993). Soft-threshold decision trees can generate a probability estimate for the class of each vector. For a given document, if the user specifies that K keyphrases are desired, then we select the K vectors that have the highest estimated probability of being in class 1.
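The final selection step (pick the K vectors with the highest estimated probability of class 1) can be sketched as follows, assuming the tree's probability estimates have already been computed:

```python
def top_k_keyphrases(candidates, k):
    """Soft-threshold selection sketch: given (phrase, P(class=1)) pairs
    from the decision trees, return the K most probable keyphrases."""
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    return [phrase for phrase, _ in ranked[:k]]
```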

Table 3.2

In addition to the -p option, we also used -c100 and -m1 (Quinlan, 1993). These two options maximize the bushiness of the trees. In our preliminary experiments, we found that these parameter settings appear to work well when used in conjunction with bagging. Bagging involves generating many different decision trees and allowing them to vote on the classification of each example (Breiman, 1996a, 1996b; Quinlan, 1996). In general, decision tree induction algorithms have low bias but high variance; bagging multiple trees tends to improve performance by reducing variance, while having relatively little impact on bias. Because we used soft-threshold decision trees, we combined their probability estimates by averaging them, instead of voting. In preliminary experiments with the training documents, we obtained good results by bagging 50 decision trees; adding more trees had no significant effect. The standard approach to bagging is to randomly sample the


training data using sampling with replacement (Breiman, 1996a, 1996b; Quinlan, 1996). In preliminary experiments with the training data, we achieved good performance by training each of the 50 decision trees with a random sample of 1% of the training data. The standard approach to bagging is to ignore the class when sampling, so the distribution of classes in the sample tends to match the distribution in the training data as a whole. In Table 3.2, we see that the positive examples constitute only 0.2% to 2.4% of the total number of examples. To compensate for this, we modified the random sampling procedure so that 50% of the sampled examples were in class 0 and the other 50% were in class 1, which appeared to improve performance in preliminary experiments on the training data. This strategy is called stratified sampling (Deming, 1978; Buntine, 1989; Catlett, 1991; Kubat et al., 1998). Kubat et al. (1998) found that stratified sampling significantly improved the performance of C4.5 on highly skewed data, but Catlett (1991) reported mixed results. Boosting is another popular technique for combining multiple decision trees (Freund and Schapire, 1996; Quinlan, 1996; Maclin and Opitz, 1997). We chose bagging instead of boosting because the modifications we use here (averaging soft-threshold decision trees and stratified sampling) are simpler to apply to the bagging algorithm than to the more complicated boosting algorithm. We believe that analogous modifications would be required for boosting to perform well on this task.
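The stratified sampling variant described above (50% of the sampled examples from each class, drawn with replacement) can be sketched as:

```python
import random

def stratified_sample(examples, size, seed=0):
    """Draw a bagging sample, with replacement, that is half positive
    and half negative -- the modification described above for highly
    skewed training data. examples: list of (case, class) pairs."""
    rng = random.Random(seed)
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    half = size // 2
    return [rng.choice(pos) for _ in range(half)] + \
           [rng.choice(neg) for _ in range(size - half)]
```

Each of the 50 bagged trees would be trained on its own such sample, and their probability estimates averaged at prediction time.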

3.2.2 GenEx (Genitor and Extractor)


GenEx (Genitor and Extractor) has two components: the Genitor genetic algorithm (Whitley, 1989) and the Extractor keyphrase extraction algorithm (Turney, 1997, 1999). Extractor takes a document as input and produces a list of keyphrases as output. Extractor has twelve parameters that determine how it processes the input text. In GenEx, the parameters of Extractor are tuned by the Genitor genetic algorithm (Whitley, 1989) to maximize performance (fitness) on training data. Genitor is used to tune Extractor, but Genitor is no longer needed once the training process is complete: when we know the best parameter values, we can discard Genitor. Thus the learning system is called GenEx (Genitor plus Extractor) and the trained system is called Extractor (GenEx minus Genitor). The GenEx algorithm was originally used to reduce the number of negative training examples.


3.2.2.1 Extractor
What follows is a conceptual description of the Extractor algorithm. For clarity, we describe Extractor at an abstract level that ignores efficiency considerations. That is, the actual Extractor software is essentially an efficient implementation of the following algorithm. In the following, the twelve parameters appear in small capitals (see Table 3.3 for a list of the parameters). There are ten steps to the Extractor algorithm:

1. Find Single Stems: Make a list of all of the words in the input text. Drop words with less than three characters. Drop stop words, using a given stop word list. Convert all remaining words to lower case. Stem the words by truncating them at STEM_LENGTH characters. The advantages of this simple form of stemming (stemming by truncation) are speed and flexibility. Stemming by truncation is much faster than either the Lovins (1968) or Porter (1980) stemming algorithms. The aggressiveness of the stemming can be adjusted by changing STEM_LENGTH. This gives Genitor control over the level of aggressiveness.

2. Score Single Stems: For each unique stem, count how often the stem appears in the text and note when it first appears. If the stem evolut first appears in the word Evolution, and Evolution first appears as the tenth word in the text, then the first appearance of evolut is said to be in position 10. Assign a score to each stem. The score is the number of times the stem appears in the text, multiplied by a factor. If the stem first appears before FIRST_LOW_THRESH, then multiply the frequency by FIRST_LOW_FACTOR. If the stem first appears after FIRST_HIGH_THRESH, then multiply the frequency by FIRST_HIGH_FACTOR. Typically FIRST_LOW_FACTOR is greater than one and FIRST_HIGH_FACTOR is less than one. Thus, early, frequent stems receive a high score and late, rare stems receive a low score. This gives Genitor control over the weight of early occurrence versus the weight of frequency.

3. Select Top Single Stems: Rank the stems in order of decreasing score and make a list of the top NUM_WORKING single stems. Cutting the list at NUM_WORKING, as opposed to allowing the list to have an arbitrary length, improves the efficiency of Extractor. It also acts as a filter for eliminating lower quality stems.

4. Find Stem Phrases: Make a list of all phrases in the input text. A phrase is defined as a sequence of one, two, or three words that appear consecutively in the text, with no intervening stop words or punctuation. Stem each phrase by truncating each word in the phrase at STEM_LENGTH characters. In our corpora, phrases of four or more words are relatively rare. Therefore Extractor only considers phrases of one, two, or three words.


5. Score Stem Phrases: For each stem phrase, count how often the stem phrase appears in the text and note when it first appears. Assign a score to each phrase, exactly as in step 2, using the parameters FIRST_LOW_FACTOR, FIRST_LOW_THRESH, FIRST_HIGH_FACTOR, and FIRST_HIGH_THRESH. Then make an adjustment to each score, based on the number of stems in the phrase. If there is only one stem in the phrase, do nothing. If there are two stems in the phrase, multiply the score by FACTOR_TWO_ONE. If there are three stems in the phrase, multiply the score by FACTOR_THREE_ONE. Typically FACTOR_TWO_ONE and FACTOR_THREE_ONE are greater than one, so this adjustment will increase the score of longer phrases. A phrase of two or three stems is necessarily never more frequent than the most frequent single stem contained in the phrase. The factors FACTOR_TWO_ONE and FACTOR_THREE_ONE are designed to boost the score of longer phrases, to compensate for the fact that longer phrases are expected to otherwise have lower scores than shorter phrases.

6. Expand Single Stems: For each stem in the list of the top NUM_WORKING single stems, find the highest scoring stem phrase of one, two, or three stems that contains the given single stem. The result is a list of NUM_WORKING stem phrases. Keep this list ordered by the scores calculated in step 2. Now that the single stems have been expanded to stem phrases, we no longer need the scores that were calculated in step 5. That is, the score for a stem phrase (step 5) is now replaced by the score for its corresponding single stem (step 2). The reason is that the adjustments to the score that were introduced in step 5 are useful for expanding the single stems to stem phrases, but they are not useful for comparing or ranking stem phrases.

7. Drop Duplicates: The list of the top NUM_WORKING stem phrases may contain duplicates. For example, two single stems may expand to the same two-word stem phrase. Delete duplicates from the ranked list of NUM_WORKING stem phrases, preserving the highest ranked phrase. For example, suppose that the stem evolu (e.g., evolution truncated at five characters) appears in the fifth position in the list of the top NUM_WORKING single stems and psych (e.g., psychology truncated at five characters) appears in the tenth position. When the single stems are expanded to stem phrases, we might find that evolu psych (e.g., evolutionary psychology truncated at five characters) appears in the fifth and tenth positions in the list of stem phrases. In this case, we delete the phrase in the tenth position. If there are duplicates, then the list now has fewer than NUM_WORKING stem phrases.

8. Add Suffixes: For each of the remaining stem phrases, find the most frequent corresponding whole phrase in the input text. For example, if evolutionary psychology appears ten times in the text and evolutionary psychologist appears three times, then


evolutionary psychology is the more frequent corresponding whole phrase for the stem phrase evolu psych. When counting the frequency of whole phrases, if a phrase has an ending that indicates a possible adjective, then the frequency for that whole phrase is set to zero. An ending such as al, ic, ible, etc., indicates a possible adjective. Adjectives in the middle of a phrase (for example, the second word in a three-word phrase) are acceptable; only phrases that end in adjectives are penalized. Also, if a phrase contains a verb, the frequency for that phrase is set to zero. To check for verbs, we use a list of common verbs. A word that might be either a noun or a verb is included in this list only when it is much more common for the word to appear as a verb than as a noun. For example, suppose the input text contains manage, managerial, and management. If STEM_LENGTH is, say, five, the stem manag will be expanded to management (a noun), because the frequency of managerial will be set to zero (because it is an adjective, ending in al) and the frequency of manage will be set to zero (because it is a verb, appearing in the list of common verbs). Although manage and managerial would not be output, their presence in the input text helps to boost the score of the stem manag (as measured in step 2), and thereby increase the likelihood that management will be output.

9. Add Capitals: For each of the whole phrases (phrases with suffixes added), find the best capitalization, where best is defined as follows. For each word in a phrase, find the capitalization with the least number of capitals. For a one-word phrase, this is the best capitalization. For a two-word or three-word phrase, this is the best capitalization, unless the capitalization is inconsistent. The capitalization is said to be inconsistent when one of the words has the capitalization pattern of a proper noun but another of the words does not appear to be a proper noun (e.g., Turing test). When the capitalization is inconsistent, see whether it can be made consistent by using the capitalization with the second lowest number of capitals (e.g., Turing Test). If it cannot be made consistent, use the inconsistent capitalization. If it can be made consistent, use the consistent capitalization. For example, given the phrase psychological association, the word association might appear in the text only as Association, whereas the word psychological might appear in the text as PSYCHOLOGICAL, Psychological, and psychological. Using the least number of capitals, we get psychological Association, which is inconsistent. However, it can be made consistent, as Psychological Association.

10. Final Output: We now have an ordered list of mixed-case (upper and lower case, if appropriate) phrases with suffixes added. The list is ordered by the scores calculated in step 2. That is, the score of each whole phrase is based on the score of the highest scoring single stem that appears in the phrase. The length of the list is at most NUM_WORKING,


and is likely less, due to step 7. We now form the final output list, which will have at most NUM_PHRASES phrases. We go through the list of phrases in order, starting with the top-ranked phrase, and output each phrase that passes the following tests, until either NUM_PHRASES phrases have been output or we reach the end of the list. The tests are (1) the phrase should not have the capitalization of a proper noun, unless the flag SUPPRESS_PROPER is set to 0 (if 0 then allow proper nouns; if 1 then suppress proper nouns); (2) the phrase should not have an ending that indicates a possible adjective; (3) the phrase should be longer than MIN_LENGTH_LOW_RANK, where the length is measured by the ratio of the number of characters in the candidate phrase to the number of characters in the average phrase, where the average is calculated for all phrases in the input text that consist of one to three consecutive non-stop words; (4) if the phrase is shorter than MIN_LENGTH_LOW_RANK, it may still be acceptable, if its rank in the list of candidate phrases is better than (closer to the top of the list than) MIN_RANK_LOW_LENGTH; (5) if the phrase fails both tests (3) and (4), it may still be acceptable, if its capitalization pattern indicates that it is probably an abbreviation; (6) the phrase should not contain any words that are most commonly used as verbs; (7) the phrase should not match any phrases in a given list of stop phrases (where match means equal strings, ignoring case, but including suffixes). That is, a phrase must pass tests (1), (2), (6), (7), and at least one of tests (3), (4), and (5).

Although our experimental procedure does not consider capitalization or suffixes when comparing machine-generated keyphrases to human-generated keyphrases, steps 8 and 9 are still useful, because some of the screening tests in step 10 are based on capitalization and suffixes. Of course, steps 8 and 9 are essential when the output is for human readers.
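Steps 1–3 of the algorithm above can be sketched as follows (a simplified illustration; the default threshold and factor values are invented placeholders, not Genitor-tuned parameter values):

```python
def score_stems(words, stop_words, stem_length=5,
                first_low_thresh=40, first_high_thresh=400,
                first_low_factor=2.0, first_high_factor=0.5):
    """Steps 1-3 of Extractor: stem by truncation, score each stem by its
    frequency adjusted for position of first occurrence, and return the
    stems ranked by decreasing score. Parameter names follow Table 3.3."""
    scores, first_pos = {}, {}
    for pos, word in enumerate(words):
        w = word.lower()
        if len(w) < 3 or w in stop_words:
            continue
        stem = w[:stem_length]                  # stemming by truncation
        scores[stem] = scores.get(stem, 0) + 1
        first_pos.setdefault(stem, pos)
    for stem, freq in scores.items():
        if first_pos[stem] < first_low_thresh:
            scores[stem] = freq * first_low_factor    # early stems boosted
        elif first_pos[stem] > first_high_thresh:
            scores[stem] = freq * first_high_factor   # late stems penalized
    return sorted(scores, key=scores.get, reverse=True)
```

In the full algorithm, the top NUM_WORKING stems from this ranking would then be expanded to stem phrases in steps 4–7.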

3.2.2.2 Genitor
A genetic algorithm may be viewed as a method for optimizing a string of bits, using techniques that are inspired by biological evolution. A genetic algorithm works with a set of bit strings, called a population of individuals. The initial population is usually randomly generated. New individuals (new bit strings) are created by randomly changing existing individuals (this operation is called mutation) and by combining substrings from parents to make new children (this operation is called crossover). Each individual is assigned a score (called its fitness) based on some measure of the quality of the bit string, with respect to a given task. Fitter individuals get to have more children than less fit individuals. As the genetic algorithm runs, new individuals tend to be increasingly fit, up to some asymptote. Genitor is a steady-state genetic algorithm (Whitley, 1989), in contrast to many other genetic algorithms, such as Genesis (Grefenstette 1983, 1986), which are generational. A generational genetic algorithm updates its entire population in one batch, resulting in a sequence of distinct generations. A steady-state genetic algorithm updates its population


one individual at a time, resulting in a continuously changing population, with no distinct generations. Typically a new individual replaces the least fit individual in the current population. Whitley (1989) suggests that steady-state genetic algorithms tend to be more aggressive (they have greater selective pressure) than generational genetic algorithms.
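A steady-state update loop of this kind can be sketched as follows (a minimal illustration with random parent selection and simple one-point crossover, not Genitor itself, which uses rank-based selection and the operators described in Section 3.2.2.3):

```python
import random

def steady_state_ga(fitness, bits=72, pop_size=50, trials=1000):
    """Minimal steady-state genetic algorithm: each trial produces one
    offspring, which immediately replaces the least fit individual, so
    the population changes continuously with no distinct generations."""
    pop = [[random.randint(0, 1) for _ in range(bits)]
           for _ in range(pop_size)]
    for _ in range(trials):
        p1, p2 = random.sample(pop, 2)          # pick two parents
        cut = random.randrange(1, bits)
        child = p1[:cut] + p2[cut:]             # one-point crossover
        if random.random() < 0.2:               # occasional mutation
            i = random.randrange(bits)
            child[i] ^= 1
        worst = min(range(pop_size), key=lambda i: fitness(pop[i]))
        pop[worst] = child                      # replace the least fit
    return max(pop, key=fitness)
```

For example, running this loop with `fitness=sum` (the OneMax toy problem) drives the population toward all-ones bit strings.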

3.2.2.3 GenEx
The parameters in Extractor are set using the standard machine learning paradigm of supervised learning. The algorithm is tuned with a dataset, consisting of documents paired with target lists of keyphrases. The dataset is divided into training and testing subsets. The learning process involves adjusting the parameters to maximize the match between the output of Extractor and the target keyphrase lists, using the training data. The success of the learning process is measured by examining the match using the testing data. We assume that the user sets the value of NUM_PHRASES, the desired number of phrases, to a value between five and fifteen. We then set NUM_WORKING to . The remaining ten parameters are set by Genitor. Genitor uses a binary string of 72 bits to represent the ten parameters, as shown in Table 3.3. We run Genitor with a population size of 50 for 1050 trials (these are default settings). Each trial consists of running Extractor with the parameter settings specified in the given binary string, processing the entire training set. The fitness measure for the binary string is based on the average precision for the whole training set. The final output of Genitor is the highest scoring binary string. Ties are broken by choosing the earlier string. We first tried to use the average precision on the training set as the fitness measure, but GenEx discovered that it could achieve high average precision by adjusting the parameters so that less than NUM_PHRASES phrases were output. 
This is not desirable, so we modified the fitness measure to penalize GenEx when less than NUM_PHRASES phrases were output:

total_matches = total number of matches between GenEx and human (1)
total_machine_phrases = total number of phrases output by GenEx (2)
precision = total_matches / total_machine_phrases (3)
num_docs = number of documents in training set (4)
total_desired = num_docs * NUM_PHRASES (5)
penalty = (total_machine_phrases / total_desired)^2 (6)
fitness = precision * penalty (7)
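Equations (1)–(7) can be sketched as a single function (assuming, as the surrounding text implies, that the number of machine phrases never exceeds the desired total):

```python
def genex_fitness(total_matches, total_machine_phrases, num_docs, num_phrases):
    """Fitness used to tune Extractor: average precision multiplied by a
    penalty factor that is 1 when the number of output phrases equals the
    desired total and approaches 0 as the shortfall grows."""
    precision = total_matches / total_machine_phrases
    total_desired = num_docs * num_phrases
    penalty = (total_machine_phrases / total_desired) ** 2
    return precision * penalty
```

For instance, outputting only half the desired phrases quarters the penalty factor, so a parameter setting cannot win by emitting very few (easy) phrases.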

The penalty factor varies between 0 and 1. It has no effect (i.e., it is 1) when the number of phrases output by GenEx equals the desired number of phrases. The penalty grows (i.e., it approaches 0) with the square of the gap between the desired number of phrases and the actual number of phrases. Preliminary experiments on the training data confirmed that this fitness measure led GenEx to find parameter values with high average precision while ensuring that NUM_PHRASES phrases were output.

Table 3.3: The twelve parameters of Extractor, with types and ranges.

Since STEM_LENGTH is modified by Genitor during the GenEx learning process, the fitness measure used by Genitor is not based on stemming by truncation. If the fitness measure were based on stemming by truncation, a change in STEM_LENGTH would change the apparent fitness, even if the actual output keyphrase list remained constant. Therefore fitness is measured with the Iterated Lovins stemmer. We ran Genitor with a Selection Bias of 2.0 and a Mutation Rate of 0.2. These are the default settings for Genitor. We used the Adaptive Mutation operator and the Reduced Surrogate Crossover operator (Whitley, 1989). Adaptive Mutation determines the appropriate level of mutation for a child according to the hamming distance between its two parents; the less the difference, the higher the mutation rate. Reduced Surrogate Crossover first identifies all positions in which the parent strings differ. Crossover points are only allowed to occur in these positions.
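The Reduced Surrogate Crossover operator just described can be sketched as follows (our illustration, not Whitley's implementation):

```python
import random

def reduced_surrogate_crossover(p1, p2, rng=random):
    """Reduced surrogate crossover: the crossover point is restricted to
    positions where the two parent bit strings differ, so the child is
    never an exact clone of either parent."""
    diff = [i for i in range(len(p1)) if p1[i] != p2[i]]
    if len(diff) < 2:
        return p1[:]   # parents (nearly) identical: nothing to recombine
    # choose a cut that leaves at least one differing bit on each side
    cut = rng.choice(diff[1:])
    return p1[:cut] + p2[cut:]
```

Restricting the cut to differing positions avoids the wasted trials that ordinary crossover spends producing children identical to a parent, which matters when the population begins to converge.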


3.2.3 Sakhr
Sakhr is another notable effort in the field of keyphrase extraction, but it is closed source, so there is no official information about how it works or which algorithms it uses to extract keyphrases. From our use of their online application (which is not linked anywhere on their website), we found that it takes a very long time to process a document, which suggests that its keyphrase extraction relies heavily on a large database.

3.2.4 Kea
Kea (Frank et al., 1999; Witten et al., 1999, 2000) is another remarkable effort in this area. It identifies candidate keyphrases in the same manner as Extractor, then uses the Naïve Bayes algorithm to classify the candidate phrases as keyphrases or not. In Kea, candidate phrases are classified using only two features: (i) TF×IDF and (ii) relative distance. TF×IDF (term frequency times inverse document frequency) captures a word's frequency in a single document compared to its rarity in the whole document collection: it assigns a high value to a phrase that is relatively frequent in the input document (TF component), yet relatively rare in other documents (IDF component). The relative distance of a phrase in a given document is defined as the number of words that precede the first occurrence of the phrase divided by the number of words in the document. Kea uses the Naïve Bayes algorithm to calculate the probability of membership in a class (the probability that the candidate phrase is a keyphrase) and ranks each candidate phrase by this estimated probability. If the user requests N phrases, then Kea outputs the top N phrases with the highest estimated probability.
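Kea's ranking step can be sketched as follows (the `model` structure here is our assumption for illustration; Kea actually discretizes both features and estimates the class priors and likelihoods from training documents):

```python
def kea_score(tfidf, distance, model):
    """Naive Bayes combination of Kea's two features for one candidate
    phrase. `model` bundles the class priors P(key)/P(!key) and the
    per-class likelihood functions for each feature."""
    p_yes = (model["P(key)"]
             * model["P(tfidf|key)"](tfidf)
             * model["P(dist|key)"](distance))
    p_no = (model["P(!key)"]
            * model["P(tfidf|!key)"](tfidf)
            * model["P(dist|!key)"](distance))
    # Normalized probability that the candidate is a keyphrase; candidates
    # are ranked by this value and the top N are returned to the user.
    return p_yes / (p_yes + p_no)
```

With uninformative likelihoods the score reduces to the prior P(key), which is why the two features carry all of Kea's discriminative power.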

3.2.5 Arabic Keyphrase Extraction using Linguistic knowledge and Machine Learning Techniques
This system was developed by El-Shishtawy and Al-Sammak. A supervised learning technique for extracting keyphrases from Arabic documents is presented. The extractor is supplied with linguistic knowledge to enhance its efficiency, instead of relying only on statistical information such as term frequency and distance. During analysis, an annotated Arabic corpus is used to extract the required lexical features of the document words. The knowledge also includes syntactic rules, based on part-of-speech tags and allowed word sequences, to extract the candidate keyphrases. In this work, the abstract form of Arabic words is used instead of the stem form to represent the candidate terms; the abstract form hides most of the inflections found in Arabic words. The paper introduces new keyphrase features based on linguistic knowledge, to capture titles and subtitles of a document. A simple ANOVA test is used to evaluate the validity of the selected features.

Then, the learning model is built using Linear Discriminant Analysis (LDA) and the training documents. Automatic keyphrase extraction is treated as a supervised machine learning task. Two important issues are defined: how to define the candidate keyphrase terms, and which features of these terms are considered discriminative, i.e., how to represent the data, and consequently what is given as input to the learning algorithm. The motivation is that adding linguistic knowledge (such as lexical features and syntactic rules) to the extraction process, rather than relying only on statistics, may obtain better results. Thus, the work is based on combining linguistic knowledge and machine learning techniques to extract keyphrases from Arabic documents with reasonable accuracy. The linguistic knowledge plays important roles in different stages of the system:

1. Analysis stage, where the document is tokenized into sentences and words. Each word is analyzed using an annotated Arabic corpus to extract its POS tags, category, and abstract form.

2. Candidate keyphrase extraction stage, where a set of syntactic rules is used to determine the allowed sequences of words of the generated n-gram terms according to their POS tags and categories.

3. Feature vector calculation stage, where some of the selected features of each candidate phrase are linguistic-based, in addition to the statistical-based features.

Although the system was trained with a training set from the IT field, it gives good results in other domains such as politics. However, despite its acceptable output, it has the drawback of depending on a corpus, which makes it impractical to deploy as a web application: the corpus is very large compared to what is required by other systems such as KP-Miner, which is described next.

3.3 Unsupervised Machine Learning Techniques


In this part of the chapter we give an overview of systems that use unsupervised machine learning techniques to extract candidate keyphrases from input documents. The only system that uses unsupervised machine learning techniques to extract keyphrases is KP-Miner (El-Beltagy & Rafea, 2008).


3.3.1 KP-Miner
KP-Miner (El-Beltagy, 2006; El-Beltagy, 2009) is a system for the extraction of keyphrases from English and Arabic documents. The keyphrase extraction process in KP-Miner is unsupervised.

3.3.1.1 System Overview


KP-Miner is an unsupervised machine learning algorithm for extracting keyphrases. The system has the advantage of being configurable, as the rules and heuristics it adopts are related to the general nature of documents and keyphrases. This implies that users of the system can apply their understanding of the input document(s) to fine-tune it to their particular needs. The work on KP-Miner was inspired by the nature of documents and keyphrases, especially the following three points:

1. The number of keyphrases in any given document will usually be less than that of single keywords. Effective keyphrase extraction is then dependent on the determination of an appropriate boosting factor for keyphrases. In this work, this boosting factor is related to the ratio of single to compound terms in each input document.

2. Without the use of linguistic features, the extraction of meaningful keyphrases depends on the repetition of these phrases within the document. Using IDF information in phrase weight calculation would bias the extraction towards unseen phrases. This would be unfair when building a general rather than a domain-specific extractor, as the number of possible phrase combinations is much larger than what can be captured from a limited IDF training corpus.

3. The position of the first occurrence of any given phrase is significant in two ways. The first is related to the fact that the more important a term is, the more likely it is to appear sooner in the document. The second is based on the observation that after a given threshold is passed in any given document, phrases occurring for the first time are highly unlikely to be keyphrases.

Keyphrase extraction in the KP-Miner system is a three-step process: candidate keyphrase selection, candidate keyphrase weight calculation, and finally keyphrase refinement. Each of these steps is explained in the following sub-sections.


3.3.1.2 Candidate keyphrase selection


In KP-Miner, a set of rules is employed in order to elicit candidate keyphrases. Since a phrase is never separated by punctuation marks within a given text and will rarely have stop words within it, the first condition a sequence of words must satisfy in order to be considered a candidate keyphrase is that it is not separated by punctuation marks or stop words. A total of 187 common stop words (the, then, in, above, etc.) are used in the candidate keyphrase extraction step. After applying this first condition on any given document, too many candidates will be generated, some of which will make no sense to a human reader. To filter these out, two further conditions are applied. The first condition states that a phrase has to have appeared at least n times in the document from which keyphrases are to be extracted, in order to be considered a candidate keyphrase. This is called the least allowable seen frequency (lasf) factor, and in the English version of the system it is set to 3. However, if a document is short, n is decremented depending on the length of the document. The second condition is related to the position where a candidate keyphrase first appears within an input document. Through observation as well as experimentation, it was found that in long documents, phrases occurring for the first time after a given threshold are very rarely keyphrases. So a cutoff constant CutOff is defined in terms of a number of words, after which if a phrase appears for the first time, it is filtered out and ignored. The initial prototype of the KP-Miner system (El-Beltagy, 2006) set this cutoff value to a constant (850). Further experimentation carried out in (El-Beltagy, 2009) revealed that an optimum value for this constant is 400.

In the implementation of the KP-Miner system, the phrase extraction step described above is carried out in two phases. In the first phase, words are scanned until either a punctuation mark or a stop word is encountered. The scanned sequence of words and all possible n-grams within the encountered sequence, where n can vary from 1 to sequence length - 1, are stemmed and stored in both their original and stemmed forms. If the phrase (in its stemmed or original form) or any of its sub-phrases has been seen before, then the count of the previously seen term is incremented by one; otherwise the previously unseen term is assigned a count of one. Very weak stemming is performed in this step, using only the first step of the Porter stemmer (Porter, 1980). In the second phase, the document is scanned again for the longest possible sequence that fulfills the conditions mentioned above. This is then considered as a candidate keyphrase. Unlike most other keyphrase extraction systems, the devised algorithm places no limit on the length of keyphrases, but it was found that extracted keyphrases rarely exceed three terms.
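A highly simplified sketch of this selection step follows (phrase boundaries at stop words and punctuation, the least-allowable-seen-frequency filter, and the first-occurrence cutoff; stemming and the full n-gram enumeration are omitted for brevity):

```python
import re

def candidate_phrases(text, stop_words, lasf=3, cutoff=400):
    """Sketch of KP-Miner candidate selection: word sequences uninterrupted
    by punctuation or stop words are counted, and a sequence is kept only
    if it was seen at least `lasf` times and first appeared within the
    first `cutoff` words of the document."""
    counts, first_pos = {}, {}
    pos, current = 0, []
    for token in re.findall(r"\w+|[^\w\s]", text):
        if token.lower() in stop_words or not token[0].isalnum():
            current = []          # punctuation or stop word ends the phrase
            continue
        current.append(token.lower())
        phrase = " ".join(current)
        counts[phrase] = counts.get(phrase, 0) + 1
        first_pos.setdefault(phrase, pos)
        pos += 1
    return [p for p, c in counts.items()
            if c >= lasf and first_pos[p] <= cutoff]
```

Note that only prefixes of each growing sequence are counted here; the actual system also enumerates and stems all interior n-grams of each scanned sequence.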


3.3.1.3 Candidate keyphrases weight calculation


Single key features obtained from documents by models such as TF-IDF (Salton and Buckley, 1988) have already been shown to be representative of the documents from which they've been extracted, as demonstrated by their wide and successful use in clustering and classification tasks. However, when applied to the task of keyphrase extraction, these same models performed very poorly (Turney, 1999). By looking at almost any document, it can be observed that the occurrence of phrases is much less frequent than the occurrence of single terms within the same document. So it can be concluded that one of the reasons TF-IDF performs poorly on its own when applied to the task of keyphrase extraction is that it does not take this fact into consideration, which results in a bias towards single words as they occur in larger numbers. So a boosting factor is needed for compound terms in order to balance this bias towards single terms. In this work, for each input document d from which keyphrases are to be extracted, a boosting factor Bd is calculated as follows:

Bd = |Nd| / (|Pd| * α), and if Bd > σ then Bd = σ

Here |Nd| is the number of all candidate terms in document d, |Pd| is the number of candidate terms whose length exceeds one in document d, and α and σ are weight adjustment constants. The values used by the implemented system are 3 for σ and 2.3 for α. To calculate the weights of document terms, the TF-IDF model is used in conjunction with the introduced boosting factor. However, another thing to consider when applying TF-IDF for a general application rather than a corpus-specific one is that keyphrase combinations do not occur as frequently within a document set as do single terms. In other words, while it is possible to collect frequency information for use by a general single-keyword extractor from a moderately large set of random documents, the same is not true for keyphrase information. There are two possible approaches to address this observation. In the first, a very large corpus of a varied nature can be used to collect keyphrase-related frequency information. In the second, which is adopted in this work, any encountered phrase is considered to have appeared only once in the corpus. This means that for compound phrases, frequency within a document as well as the boosting factor are really what determine the weight, as the idf value for all compound phrases will be a constant c determined by the size of the corpus used to build frequency information for single terms. If the position rules described in (El-Beltagy, 2009) are also employed, then the position factor is also used in the calculation of the term weights. In summary, the following


equation is used to calculate the weight of candidate keyphrases, whether single or compound:

wij = tfij * idf * Bi * Pf

where:
wij = weight of term tj in document Di
tfij = frequency of term tj in document Di
idf = log2(N / n), where N is the number of documents in the collection and n is the number of documents in which term tj occurs at least once; if the term is compound, n is set to 1
Bi = the boosting factor associated with document Di
Pf = the term position associated factor; if position rules are not used, this is set to 1
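The weight calculation can be sketched as follows (a minimal illustration; the parameter names are ours, and the assignment of 2.3 and 3 to the divisor constant and the cap follows the description above):

```python
import math

def phrase_weight(tf, n_docs, doc_freq, n_candidates, n_compound,
                  alpha=2.3, sigma=3.0, pos_factor=1.0, compound=False):
    """KP-Miner weight w = tf * idf * B * Pf. The boosting factor
    B = |Nd| / (|Pd| * alpha), capped at sigma, compensates for the lower
    frequency of compound phrases; a compound phrase is treated as having
    appeared only once in the corpus (doc_freq = 1)."""
    boost = min(n_candidates / (n_compound * alpha), sigma)
    if compound:
        doc_freq = 1          # constant idf for all compound phrases
    idf = math.log2(n_docs / doc_freq)
    return tf * idf * boost * pos_factor
```

The same function covers single terms (real document frequency, real idf) and compound phrases (constant idf plus the boost), mirroring the single equation used by the system.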

3.3.1.4 Final Candidate Phrase List Refinement


The KP-Miner system allows the user to specify a number n of keyphrases s/he wants back and uses the sorted list to return the top n keyphrases requested by the user. The default value of n is five. As stated in step one, when generating candidate keyphrases, the longest possible sequences of words that are uninterrupted by possible phrase terminators are sought and stored, and so are the sub-phrases contained within each sequence, provided that they appear somewhere in the text on their own. For example, if the phrase excess body weight is encountered five times in a document, the phrase itself will be stored along with a count of five. If the sub-phrase body weight is also encountered on its own, then it will also be stored along with the number of times it appeared in the text, including the number of times it appeared as part of the phrase excess body weight. This means that an overlap between the counts of two or more phrases can exist. Aiming to eliminate this overlap in counting early on can contribute to the dominance of possibly noisy phrases or to overlooking potential keyphrases that are encountered as sub-phrases. However, once the weight calculation step has been performed and a clear picture of which phrases are most likely to be key ones is obtained, this overlap can be addressed through refinement. To refine results in the KP-Miner system, the top n keys are scanned to see if any of them is a sub-phrase of another. If any of them is, then its count is decremented by the frequency of the term of which it is a part. After this step is completed, weights are recalculated and a final list of phrases, sorted by weight, is produced. The reason the top n keys, rather than all candidates, are used in this step is so that lower-weighted keywords do not affect the outcome of the final keyphrase list. It is important to note that the refinement step is an optional one, but


experiments have shown that in the English version of the system, omitting this step leads to the production of keyphrase lists that match better with author-assigned keywords. Nevertheless, in (El-Beltagy, 2009) the authors suggested that employing this step leads to the extraction of higher quality keyphrases.

3.3.1.5 Evaluation and Drawbacks

Despite the fact that KP-Miner was designed as a general-purpose keyphrase extraction system, and despite its simplicity and the fact that it requires no training to function, it performed relatively well when carrying out the task of keyphrase extraction from scientific documents. However, the keyphrases generated by KP-Miner are not accurate in all cases: incorrect output such as verbs and stop-words can appear, since the system depends on statistical features only.
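The refinement step described above can be sketched as follows (an illustrative Python sketch, not KP-Miner's actual code):

```python
def refine_counts(counts, top_n=5):
    """For the top-n phrases, decrement a sub-phrase's count by the
    frequency of any longer top phrase that contains it, as in the
    KP-Miner refinement step."""
    top = sorted(counts, key=counts.get, reverse=True)[:top_n]
    refined = dict(counts)
    for sub in top:
        for phrase in top:
            # word-boundary containment test via space padding
            if sub != phrase and " %s " % sub in " %s " % phrase:
                refined[sub] -= counts[phrase]
    return refined
```

With the example from the text, "body weight" counted 8 times (5 of them inside "excess body weight") ends up with a refined count of 3.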


Chapter 4 Proposed System


4.1 Introduction
Having discussed in the previous chapter the importance of data mining in general and keyphrase extraction in particular, we now present the components of our system: how it works, what we used as-is, what we added, and how we integrated all these components into our Arabic Keyphrase Extraction System.

In this work, automatic keyphrase extraction processes the original text, adapts it for our modules, and is treated as a supervised machine learning task. Two important issues are defined: how to define the candidate keyphrase terms, and which features of these terms are considered discriminative, i.e., how to represent the data, and consequently what is given as input to the learning algorithm. Our motivation is that adding linguistic knowledge (such as lexical features and syntactic rules) to the extraction process, rather than relying only on statistics, may yield better results. Thus, the current work is based on combining linguistic knowledge and machine learning techniques to extract keyphrases from Arabic documents with reasonable accuracy, and it is not domain-specific, as it can run on text from any domain without problems. Linguistic knowledge plays important roles in different stages of our proposed system:

1. Analysis and pre-processing stage, where the document is corrected by removing any non-Arabic characters and removing diacritics, and the text is tokenized into words and sentences. There is also a sub-stage called the Segmenter, where the document gets appropriate preprocessing to be adapted for other stages, such as POS tagging, to get more accurate results.

2. POS tagging stage, where every word in the text is assigned its proper part of speech (noun, verb, adjective, etc.) using the Stanford POS tagger, to be used in further processing.

3. Lemmatization stage, where every word is reduced to its abstract form without any additional prefixes or suffixes; this is done using the AraMorph module, which we discuss in detail below.

4. Candidate keyphrase extraction stage, where a set of syntactic rules is used to determine the allowed sequences of words in the generated n-gram terms according to their POS tags and categories.

5. Feature extraction stage, during which we calculate statistics over words, sentences, and the whole document to determine the weight of every candidate keyphrase.


6. Machine learning stage, where the training process assigns weights to all the features calculated in the previous stage and derives a formula that determines whether a candidate is a keyphrase or not.

In the following sections of this chapter we discuss each of these six stages in detail, showing how we implemented or used them.

4.2 Pre-Processing Phase


This is the first phase in our project; its main tasks are correcting the input document and then performing tokenization. The correction process starts by finding errors in the input document and then correcting them. Before describing how we correct these errors, we should mention where they come from: they arise from using non-Arabic characters in the document, for example a Latin question mark (?) in an Arabic document. We correct such characters to their Arabic counterparts by checking the Unicode code points of the characters most commonly misused in Arabic documents. Table 4.1 shows the most common cases.

Table 4.1
Non-Arabic character (Unicode)    Arabic character (Unicode)
? (\u003f)                        Arabic question mark (\u061f)
, (\u002c)                        Arabic comma (\u060c)
! (\u0021)                        ! (\u0021)

Why do we do this? The correction process helps a lot in getting more accurate results in the next phases, which depend entirely on this step; errors here would propagate into the following phases. During this phase we also remove duplicated punctuation characters when they come one after another; for example, if we find "!!!!" we keep only one of them, because the repetition does not affect our processing of the input document. The next snapshot shows an input string with non-Arabic characters and the output after replacing those characters with Arabic ones.
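A minimal sketch of this replacement (in Python, for illustration; our system itself is implemented in Java), using the mappings of Table 4.1:

```python
# Illustrative sketch of the correction step: map common non-Arabic
# punctuation to its Arabic counterpart (Table 4.1).
PUNCT_MAP = {
    "\u003f": "\u061f",  # Latin '?' -> Arabic question mark
    "\u002c": "\u060c",  # Latin ',' -> Arabic comma
}

def correct_punctuation(text):
    """Replace Latin punctuation with its Arabic equivalent."""
    return "".join(PUNCT_MAP.get(ch, ch) for ch in text)
```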


After we are done replacing non-Arabic characters with Arabic ones, we go to the next step, which is removing diacritics from the input document. Although diacritics could in principle give more accurate results, they are rarely used nowadays, and they conflict with our other modules that obtain the lemma of every word, so we remove them to avoid these problems. (The next table shows the diacritics we remove and their Unicode code points.)

Table 4.2
Diacritic    Unicode
Fathatan     \u064B
Dammatan     \u064C
Kasratan     \u064D
Fatha        \u064E
Damma        \u064F
Kasra        \u0650
Shadda       \u0651
Sukun        \u0652

The next snapshot shows an input string with diacritics and the output after removing them.

Figure 4.1

After removing diacritics, and to guard against further errors, we collapse any run of spaces between words into a single space between every two words, and then trim spaces from the start and the end of the input document.
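The diacritic removal and whitespace normalization steps can be sketched together (illustrative Python; the actual module is Java):

```python
import re

# Arabic diacritics of Table 4.2: \u064B (fathatan) .. \u0652 (sukun)
_DIACRITICS = re.compile("[\u064B-\u0652]")

def clean_text(text):
    text = _DIACRITICS.sub("", text)          # drop diacritics
    return re.sub(r"\s+", " ", text).strip()  # one space between words, trimmed
```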


Now, after replacing non-Arabic characters with Arabic ones and removing diacritics, we come to one of the most important tasks in this phase: tokenization. The tokenization process is based on segmenting the input document at two levels: the first is the segmentation of the input document into sentences, and the second is the segmentation of these sentences into words. At the first level, we segment the input document into its constituent sentences based on the Arabic phrase delimiter characters, such as the comma, semicolon, colon, hyphen, and dot. To do this we use the following regular expression (Figure 4.2), which splits the input document into sentences based on these delimiter characters.

([^\u002d\u003a\u003f\u061f\u0021\u002e\u060c\u061B\u0686\u0698\u06AF\u0621-\u0636\u0637-\u0643\u0644\u0645-\u0648\u0649-\u064B\u064E\u064F\u0650\u0651\u0652]+)
Figure 4.2

This process is useful for calculating part of the feature vector of the candidate terms, such as Normalized Sentence Location (NSL), Normalized Phrase Location (NPL), Normalized Phrase Length (NPLen), and Sentence Contains Verb (SCV). We explain these features in more detail when we discuss the machine learning phase. The next snapshot shows the output after segmenting the input into sentences.

Figure 4.3
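This first segmentation level can be sketched as follows (illustrative Python; the delimiter set here is a simplified subset of the regex in Figure 4.2):

```python
import re

# Subset of Arabic phrase delimiters: dot, Arabic question mark,
# exclamation mark, Arabic comma, Arabic semicolon, colon, hyphen.
_DELIMS = "[\u002e\u061f\u0021\u060c\u061b\u003a\u002d]+"

def split_sentences(text):
    """Split a document into sentences on Arabic phrase delimiters."""
    return [s.strip() for s in re.split(_DELIMS, text) if s.strip()]
```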

Now that we have segmented the input document into sentences based on the Arabic phrase delimiters, we move to the second level, which is the segmentation of these sentences into words; each sentence is segmented into its constituent words based on the criterion that words are usually separated by spaces. In this phase we created a method that checks every character of a word: if the first character of the word is not an Arabic letter we


discard it; otherwise we keep it. The next snapshot shows the output after segmenting the sentences into words based on spaces.

Figure 4.4

During our work on the preprocessing phase we wrote some utility methods that helped us make the correction and segmentation more accurate: a method called isArabic, which checks whether an input character is Arabic or not; a method that checks for punctuation characters; and a method that checks whether an input character is a stop punctuation mark, which was helpful in the segmentation process mentioned earlier. The last things we added in the preprocessing phase were methods that help in the upcoming phases: a method called isNext, which checks whether there is more input to read, and another called getNextSentence, which returns the next sentence to the caller object.
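The helper checks can be sketched as follows (illustrative Python; the method names match those described above, but the exact signatures are assumptions):

```python
def is_arabic(ch):
    """True if the character falls in the basic Arabic Unicode block."""
    return "\u0600" <= ch <= "\u06FF"

def is_stop_punctuation(ch):
    """True for characters treated as phrase/sentence terminators."""
    return ch in "\u002e\u061f\u0021\u060c\u061b\u003a\u002d"
```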


4.3 Segmentation:
This is a sub-stage of the preprocessing stage. Tokenization of raw text is a standard preprocessing step for many NLP tasks; Arabic requires more extensive token pre-processing, which is usually called segmentation. Our segmenter is taken from Stanford: the Stanford Word Segmenter currently supports Arabic and Chinese, and the provided segmentation schemes have been found to work well for a variety of applications. We modified the segmenter to support Arabic only and to integrate with our preprocessing module. Arabic is a root-and-template language with abundant bound morphemes. These morphemes include possessives, pronouns, and discourse connectives. Segmenting bound morphemes reduces lexical sparsity and simplifies syntactic analysis. The Arabic segmenter model processes raw text according to the Penn Arabic Treebank 3 (ATB) standard. We integrated the segmenter so that the text gets appropriate preprocessing adapted to the following stage, POS (Part Of Speech) tagging, for more accurate results. The segmenter can separate the prefix and the suffix of a word. For example, if we have a sentence:

then after processing by the segmenter it becomes:


4.4 POS Tagging Phase:


A Part-Of-Speech tagger (POS tagger) is a piece of software that reads text in some language and assigns a part of speech to each word (and other tokens), such as noun, verb, adjective, etc.; computational applications generally use more fine-grained POS tags such as 'noun-plural'. While searching for a POS tagger we found that Stanford University has a group doing research on Arabic NLP, and one of their products is a POS tagger. After testing it for a while we decided to use this tagger in our work, since no other open-source project doing the same job is available for free; although it has some weaknesses that we tried to avoid, we used this system. The POS tagger module is large and generally works the same way for several languages: it supports multiple languages through grammars trained specifically for each one. The available languages are Arabic, English, Chinese, French, and German; the trained model for Arabic is called arabic-accurate.tagger. As part of our work we eliminated all unused libraries and modules from this tagger, keeping only the modules related to the Arabic language, making it faster to load and more lightweight.

The parser assumes precisely the tokenization of Arabic used in the Penn Arabic Treebank (ATB). A software component for segmenting Arabic exists, but it has to be downloaded and run separately; it is not included in the parser. The Arabic parser itself simply uses a whitespace tokenizer. As far as we are aware, ATB tokenization has only an extensional definition; it is not written down anywhere. Segmentation is done based on the morphological analyses generated by the Buckwalter analyzer, and can be characterized as follows: almost all clitics are separated off as separate words, including clitic pronouns, prepositions, and conjunctions; however, the clitic determiner (definite article) "Al" is not separated off. Inflectional and derivational morphology is not separated off. [GALE ROSETTA: These separated-off clitics are not overtly marked as proclitics/enclitics, although there is a facility to strip off the '+' and '#' characters that the IBM segmenter uses to mark enclitics and proclitics, respectively. See the example below using the option -escaper edu.stanford.nlp.trees.international.arabic.IBMArabicEscaper]


Parentheses are rendered as -LRB- and -RRB-. Quotes are rendered as (ASCII) straight single and double quotes (' and "), not as curly quotes or LaTeX-style quotes (unlike the Penn English Treebank). Dashes are represented with the ASCII hyphen character (U+002D). Non-break space is not used.

The parsers are trained on unvocalized Arabic. One grammar (atbP3FactoredBuckwalter.ser.gz or atb3FactoredBuckwalter.ser.gz) is trained on input represented exactly as it is found in the Penn Arabic Treebank. The other grammars (arabicFactored.ser.gz and arabicFactoredBuckwalter.ser.gz) are trained on a more normalized form of Arabic. This form deletes the tatweel character and other diacritics beyond the short vowel markers which are sometimes not written (Alef with hamza or madda becomes simply Alef, and Alef maksura becomes Yaa), and prefers ASCII characters (Arabic punctuation and number characters are mapped to corresponding ASCII characters). Your accuracy will suffer unless you normalize text in this way, because words are recognized simply based on string identity. [GALE ROSETTA: This is precisely the mapping that the IBM ar_normalize_v5.pl script does for you.]

4.4.1 Training data supplied to POS Tagger


The POS tagger is a machine learning tool that must be supplied with previously trained data. The trained object has been serialized and stored as a model file, which has to be supplied to the program each time it is used. Stanford University provides two trained files included with the program, named arabic-accurate.tagger and arabic-fast.tagger; the first has shown relatively higher accuracy than the second.

Example parse [sent. 1 len. 8]: . (ROOT (S (CC ) (VP (VBD ) (NP (DTNN )) (PP (IN ) (NP (NN ) (NP (NN ( )JJ ))))) (PUNC .)))


4.4.2 POS Tag Set


The parser uses an "augmented Bies" tag set. The so-called "Bies mapping" maps the full morphological analyses from the Buckwalter analyzer down to a subset of the POS tags used in the Penn English Treebank (though some have different meanings). We augment this set to mark which words have the determiner "Al" cliticized to them. These extra tags start with "DT" and appear for all parts of speech that can be preceded by "Al", so we have DTNN, DTCD, etc. This is an early definition of the Bies mapping.

4.4.3 Lemmatization
Abstract form (lemma): the basic form from which the given word is logically derived. Usually, this form differs from the word stem, which is obtained after removing the prefix and suffix parts of the word. For example, a word's stem may represent a human-being object, while its abstract form represents the adjective of a visual object; the abstract form can then be used to represent many different words having the same logical meaning, "visual object". The abstract form of a given word is defined as follows:
- The singular form for nouns.
- The singular masculine form for adjectives.
- The past form for verbs.
- The stem form for stop-words.

The abstract form of a given word is extremely useful during the process of extracting candidate keyphrases. Two words may have different word stems but the same abstract form; this abstract form is used for extracting candidate keyphrases by recommending a strong key term to represent both. Such a unified key term cannot be achieved using the word-stem form of the words. This technique improves the results, provided we can find the right tool to extract the abstract form (lemma) from the text. Searching led us to an open-source project called AraMorph, which is a Java port of the homonymous product developed in Perl by Tim Buckwalter on behalf of the Linguistic Data Consortium (LDC).


The product includes Java classes for the morphological analysis of Arabic text files, whatever their encoding. The associated Arabic WordNet resource consists of 9,228 synsets (6,252 nominal, 2,260 verbal, 606 adjectival, and 106 adverbial), containing 18,957 Arabic expressions. This number includes 1,155 synsets that correspond to Named Entities, which have been extracted automatically and are being checked by the lexicographers.

This module is able to return different forms (all possible lemma solutions); there may be many, since the module does not know the context in which the word is used, and because of the ambiguity of Arabic caused by omitted diacritics and misspelled words. For every solution it also returns an initial POS tag (based on the word alone, not on the whole sentence), the prefix and suffix of the word, plus a glossed English word. For clarification, we give the module an Arabic word (transliteration: ktAb) and examine its output.

Example
Processing token :
Transliteration : ktAb
Token not yet processed.
Token has direct solutions.

SOLUTION #3
Lemma : kAtib
Vocalized as :
Morphology : prefix : Pref-0  stem : N  suffix : Suff-0
Grammatical category : stem : NOUN
Glossed as : stem : authors/writers

SOLUTION #1
Lemma : kitAb
Vocalized as :
Morphology : prefix : Pref-0  stem : Ndu  suffix : Suff-0
Grammatical category : stem : NOUN
Glossed as : stem : book

SOLUTION #2
Lemma : kut~Ab
Vocalized as :
Morphology : prefix : Pref-0  stem : N  suffix : Suff-0
Grammatical category : stem : NOUN
Glossed as : stem : kuttab (village school)/Quran school

As we can see, a word may have different meanings with the same spelling, differing only in diacritics, so the module returns all possible solutions. From these different solutions for every word in a sentence we must choose one; we do so using a parameter passed from the previous module (the Stanford POS tagger), which helps in choosing the right form among the suggested solutions. If we still cannot determine the right solution, we sort the candidates by an algorithm based on their suffixes, prefixes, and alphabetical order, so that we can choose one deterministically and limit randomness. This may not be the best solution, but its results are not bad, especially on large Arabic texts.

As we said, the module depends on a large dataset of Arabic words and their formations, so we can call it domain-unspecific. How does it work inside? It contains three main dictionaries: one for stems, one for all possible prefixes in the language, and one for all possible suffixes in Arabic. It uses a brute-force algorithm, trying every possible prefix/stem/suffix split of the word. For the word "ktAb" it tries the following splits:


prefix  stem  suffix
-       ktAb  -
-       ktA   b
-       kt    Ab
-       k     tAb
k       tAb   -
k       tA    b
k       t     Ab
kt      Ab    -
kt      A     b
ktA     b     -
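The brute-force enumeration of splits can be sketched as follows (illustrative Python; AraMorph itself is a Java library):

```python
def segmentations(word):
    """All prefix/stem/suffix splits of a word with a non-empty stem,
    as tried by the AraMorph-style brute-force lookup."""
    splits = []
    for i in range(len(word)):                 # prefix = word[:i]
        for j in range(i + 1, len(word) + 1):  # stem = word[i:j], non-empty
            splits.append((word[:i], word[i:j], word[j:]))
    return splits
```

Each (prefix, stem, suffix) triple is then looked up in the three dictionaries; only splits whose three parts are all found (and are mutually compatible) survive. For the word "ktAb" this yields 10 splits.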

The dictionaries include entries such as the following. From the prefix dictionary:


w f wa fa Pref-Wa and <pos>wa/CONJ+</pos> Pref-Wa and;so <pos>fa/CONJ+</pos>

And from the suffix dictionary:


perfect verb, null suffix: banA-h, daEA-h h hu PVSuff-0ah he/it <verb> it/him <pos>+(null)/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS</pos> hmA humA PVSuff-0ah he/it <verb> them (both) <pos>+(null)/PVSUFF_SUBJ:3MS+humA/PVSUFF_DO:3D</pos> hm hum PVSuff-0ah he/it <verb> them <pos>+(null)/PVSUFF_SUBJ:3MS+hum/PVSUFF_DO:3MP</pos> hA hA PVSuff-0ah he/it <verb> it/them/her <pos>+(null)/PVSUFF_SUBJ:3MS+hA/PVSUFF_DO:3FS</pos> hn hun~a PVSuff-0ah he/it <verb> them <pos>+(null)/PVSUFF_SUBJ:3MS+hun~a/PVSUFF_DO:3FP</pos> k ka PVSuff-0ah he/it <verb> you <pos>+(null)/PVSUFF_SUBJ:3MS+ka/PVSUFF_DO:2MS</pos> k ki PVSuff-0ah he/it <verb> you <pos>+(null)/PVSUFF_SUBJ:3MS+ki/PVSUFF_DO:2FS</pos>

kmA kumA PVSuff-0ah he/it <verb> you (both) <pos>+(null)/PVSUFF_SUBJ:3MS+kumA/PVSUFF_DO:2D</pos> km kum PVSuff-0ah he/it <verb> you <pos>+(null)/PVSUFF_SUBJ:3MS+kum/PVSUFF_DO:2MP</pos> kn kun~a PVSuff-0ah he/it <verb> you <pos>+(null)/PVSUFF_SUBJ:3MS+kun~a/PVSUFF_DO:2FP</pos> ny niy PVSuff-0ah he/it <verb> me <pos>+(null)/PVSUFF_SUBJ:3MS+niy/PVSUFF_DO:1S</pos> nA nA PVSuff-0ah he/it <verb> us <pos>+(null)/PVSUFF_SUBJ:3MS+nA/PVSUFF_DO:1P</pos>

And finally from the stem dictionary:


;--- ktb
;; katab-u_1
ktb     katab     PV          write
ktb     kotub     IV          write
ktb     kutib     PV_Pass     be written;be fated;be destined
ktb     kotab     IV_Pass_yu  be written;be fated;be destined
;; kAtab_1
kAtb    kAtab     PV          correspond with
kAtb    kAtib     IV_yu       correspond with
;; >akotab_1
>ktb    >akotab   PV          dictate;make write
Aktb    >akotab   PV          dictate;make write
ktb     kotib     IV_yu       dictate;make write
ktb     kotab     IV_Pass_yu  be dictated
;; takAtab_1
tkAtb   takAtab   PV          correspond
tkAtb   takAtab   IV          correspond

4.5 Candidate key phrase:


After investigating different key phrases written for Arabic documents, we found that the following syntactic rules are effective for extracting candidate phrases:

1- The candidate phrase can start only with certain kinds of nouns: general-noun, place-noun, proper-noun, and declined-noun.

2- The candidate phrase can end only with a general-noun, place-noun, proper-noun, declined-noun, time-noun, augmented-noun, adjective, or adverb.


3- For a three-word phrase, the middle word is additionally allowed to be a count-noun, conjunction, preposition, or comparison, in addition to those cited in rule 2. It is worth noting that these rules are language-dependent and applicable only to Arabic.

4- We created two lists of stop-words: stop-words that shouldn't appear in a keyphrase at all, and stop-words that can appear only as the middle word of a three-word keyphrase.
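The rules can be sketched as a filter over POS-tag sequences (illustrative Python; the tag names follow the categories used above, and the stop-word lists are omitted):

```python
START = {"general-noun", "place-noun", "proper-noun", "declined-noun"}
END = START | {"time-noun", "augmented-noun", "adjective", "adverb"}
MIDDLE = END | {"count-noun", "conjunction", "preposition", "comparison"}

def is_candidate(tags):
    """Check a 1-3 word POS-tag sequence against the syntactic rules."""
    if not 1 <= len(tags) <= 3:
        return False
    if tags[0] not in START or tags[-1] not in END:   # rules 1 and 2
        return False
    if len(tags) == 3 and tags[1] not in MIDDLE:      # rule 3
        return False
    return True
```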

4.6 Feature Extraction phase:


The Statistical module is the main repository of all information that needs to be saved. In this module we update the statistics for each sentence and accumulate them over all sentences to obtain statistics for the whole document. The statistics are computed from the candidate keyphrases of each sentence together with their lemmas, their POS (Part Of Speech) tags, and their original words. First we count all words in each sentence, ignoring punctuation; then we count all words in the whole document. After that we find the maximum phrase length in the document, used to normalize each phrase length to the maximum, and the maximum phrase frequency, used in calculating the Phrase Relative Frequency (PRF) feature; in the same way we find the maximum word frequency, used in the Word Relative Frequency (WRF) feature. Within the scope of one sentence, we record whether the sentence contains a verb, which is highly effective in the feature calculation, and likewise whether it contains a question, because many writers pose questions about their main topics.

Features

Each candidate phrase is assigned a number of features used to evaluate its importance. In our algorithm, three factors control the selection of features and their values:
1. The absolute importance of the phrase, which identifies its importance independently of its original document. Therefore, most feature values are normalized, when necessary, to range from zero to one.
2. Heuristics: feature values are computed based on our hypotheses about importance, formed after investigating many human-written key phrases.
3. All the extracted features and values are based on the abstract form of the phrases.


The following features are adopted:


a) Normalized Phrase Words (NPW): the number of words in each phrase normalized to the maximum number of words in a phrase. The values of this feature can be 1, 1/2, or 1/3. The hypothesis is that key phrases consisting of three words are better than key phrases containing two words, and so on.

b) Phrase Relative Frequency (PRF): the frequency of the abstract form of the candidate phrase, normalized by dividing it by the frequency of the most frequent phrase in the given document. PRF has a maximum value of 1, when the candidate keyphrase is the most frequent one in the document.

c) Word Relative Frequency (WRF): the frequency of the most frequent single abstract word in a candidate phrase, normalized by dividing it by the maximum number of repetitions over all phrase words in the document. The feature is calculated as follows. First, the frequency of all unique abstract words used in phrases of the document is computed. Second, the maximum number of repetitions is found and used to normalize the computed frequencies. Third, for each phrase, the maximum normalized frequency of its words is selected as the WRF. WRF has a maximum value of 1 when the phrase contains the most frequent word among all phrase words in the document.

d) Normalized Sentence Location (NSL): measures the location of the sentence containing the candidate phrase within the document. We use the heuristic that keyphrases located near the beginning and end of a document are important. We use the simple distribution function NSL = (2(I/m) - 1)^2, where I is the index of the sentence within the document and m is the total number of sentences in that document. The maximum value of NSL is 1, for the first (I = 0) and last (I = m) sentences of the document.

e) Normalized Phrase Location (NPL): measures the location of the candidate phrase within its sentence. NPL is given by (2(x/n) - 1)^2, where x is the position of the phrase within the sentence and n is the total number of words of that sentence. Our motivation is that important key phrases occur near the beginning and end of sentences.

f) Normalized Phrase Length (NPLen): the length of the candidate phrase (in words) divided by the number of words of its sentence. This feature has a value of one when the whole sentence is a keyphrase. Our hypothesis is that this captures titles and subtitles of the document, which are likely to contain key phrases.

g) Sentence Contains Verb (SCV): this feature has a value of zero if the sentence of the candidate phrase contains a verb; otherwise it has a value of one. Our motivation is that this


feature will give more weight to key phrases written in titles and subtitles of a document. The feature value is assigned after analyzing the parts of speech of the sentence words.

h) Is It Question (IIT): this feature has a value of one if the sentence of the candidate phrase is written in question form; otherwise its value is 0. The hypothesis is that some authors highlight their main concepts as questions, so the feature is adopted to capture important key phrases written in documents as questions. In this work, question forms are identified only by part-of-speech tagging, detecting question marks and/or question words.
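The location features NSL and NPL share the same U-shaped function; a minimal sketch (illustrative Python):

```python
def location_score(position, total):
    """(2*(position/total) - 1)^2: 1.0 at the start and end, 0.0 at the
    middle. Used for both NSL (sentence in document, position = I,
    total = m) and NPL (phrase in sentence, position = x, total = n)."""
    return (2.0 * position / total - 1.0) ** 2
```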


Chapter 5 Results and Future Work


5.1 Results:
The program was tested on many documents from various fields and by many authors. In order to evaluate the performance of the proposed system, many experiments were carried out. A total of 25 documents were used. The first experiment aimed to measure the level of acceptance of the extracted keyphrases. Since there are no author-assigned keyphrases for these documents, a human judge was used to evaluate this level. We compared the results with KP-Miner, but we could not compare with Sakhr because its output was not in a comparable form.

5.1.1 Overall results

Table 5.1
Total # of documents:             25
Categories selected:              politics, sports, community, technology, religion, psychology
Output keyphrases per document:   15, 20

Table 5.2: Results for our system
            15 keyphrases   20 keyphrases
Precision   0.250           0.171
Recall      0.443           0.447

Table 5.3: Results for KP-Miner
            15 keyphrases   20 keyphrases
Precision   0.214           0.178
Recall      0.399           0.414
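The precision and recall figures in Tables 5.2 and 5.3 follow the standard definitions; for a set of extracted keyphrases judged against accepted ones (illustrative Python):

```python
def precision_recall(extracted, accepted):
    """precision = hits / |extracted|, recall = hits / |accepted|."""
    ex, ac = set(extracted), set(accepted)
    hits = len(ex & ac)
    return hits / len(ex), hits / len(ac)
```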


5.1.2 Results for document samples


Although the training was done on the technology field, we also tested on other fields, such as psychology.

Our system


Table 5.4

KP-Miner

The following table shows the worst result we got among those test files, compared to a very good output from KP-Miner.

Our system


Table 5.5

KP-Miner

Another test on politics

Our system


KP-Miner


Table 5.6

5.2 Future Work


There are many improvements that could be made to refine the output and produce more efficient results:
1. Providing more features that represent the writer's typing style, to be more relevant and provide valuable keyphrases that best suit the content.
2. Using a better training technique to generate the equation used with new features.
3. Fixing the code to optimize both the performance and the size of the program.
4. Adding features related to certain punctuation marks in the Arabic language and their effects, such as:
a) A symbol that comes as a conjunction between words or phrases; it can be treated as a conjunction letter.
b) The " : " symbol, which comes after subtitles to give more details about the subtitle, and also introduces quotes; the phrase before this symbol could be given more weight, to be tuned after future testing.
c) A symbol indicating that what comes after it is an explanation of what comes before it.
d) Any phrase or word placed between paired " - - " dashes; such a phrase could be weighted correctly only after much testing, because it may be negligible or may be more important, so this must be decided definitively.
e) The " ... " symbol, an ellipsis mark; it is important not to ignore sentences or phrases associated with this symbol.
f) The " - " symbol: if what comes after it is a single word, that word will be important, because it expresses a subject that comes later.


Appendix
Definitions:
Nouns:
A noun is a name or an attribute of a person (Ali), a place (Mecca), a thing (house), or a quality (honor). The word "noun" comes from the Latin nomen, "name". The noun or substantive category in Arabic includes, in addition to simple nouns, the pronouns, adjectives, adverbs, and verbids (participles and verbal nouns).

Pronouns :
Pronouns in Arabic belong to the category of "nouns." Therefore, everything that applies to nouns will apply to them. Pronouns have genders, numbers, and grammatical case. Pronouns are always definite nouns.

General-noun:
General nouns can be classified into concrete/abstract, human/non-human, and animate/inanimate nouns, which can be used in any type of text to create lexical cohesion. The types of general noun encountered in Arabic are:
a) concrete human nouns
b) abstract human nouns
c) concrete animate non-human nouns
d) abstract inanimate non-human nouns

Place-noun:
A place noun has the form maf'al or similar, e.g. maktab / maktaba "library" (from kataba "to write"); matbakh "kitchen" (from tabakha "to cook"); masrah "theater" (from saraha "to release"). Nouns of place formed from verbs other than Form I have the same form as the passive participle, e.g. mustashfan "hospital" (from the Form X verb istashfa "to cure").


Time-noun:
A time noun is a name derived from a verb to indicate the time of the occurrence of the act, e.g. maw'id (from wa'ada) or madhhab (from dhahaba).

Proper-noun:
A proper noun refers to a unique or particular object (it cannot be preceded by words such as "some" or "any"); typically these are names of persons or places.

Common-noun:
A common noun refers to non-unique or non-particular objects (it can be preceded by words such as "some" or "any").

Adjective:
Adjectives in Arabic follow the nouns or pronouns they modify in gender, number, grammatical case, and the state of definiteness. They always come after the words they modify. Adjectives in Arabic belong to the "noun" category, and there are several types of nouns that can serve as adjectives.

Declined-noun:
Nouns undergo inflection, which means that parts of them change in order to express grammatical changes. The declension of Arabic nouns expresses changes in:
- Gender: Arabic nouns have two grammatical genders (masculine, feminine).
- Number: Arabic nouns have three grammatical numbers (singular, dual, plural).
- Case: Arabic nouns have three grammatical cases: raf' (nominative), nasb (accusative/dative/vocative), jarr (genitive/ablative).
- State: Arabic nouns have three grammatical states (absolute, determinate, construct).


Mass nouns:
are nouns that refer to single as well as plural units when they are grammatically singular, and to plural units when they are grammatically plural. These usually refer to plants or animals. Examples: singular mass nouns: thamar "fruit/fruits", shajar "tree/trees"; plural mass nouns: thimaar "fruits", 'ashjaar "trees".

Adverb:
Arabic adverbs are a part of speech. Generally they are words that modify any part of language other than a noun: adverbs can modify verbs, adjectives (including numbers), clauses, sentences, and other adverbs. In Arabic an adverb is mostly expressed by a noun or adjective in the indefinite accusative, as in Huwa yatakallam kathiiran 3an ibnihi ("he speaks a lot about his son").

Count-noun:
are nouns that refer to single units when they are grammatically singular, and to plural units when they are grammatically plural. Example: singular count noun rajul "man"; plural count noun rijaal "men".

Conjunction:
a word that connects sentences, clauses, or words, e.g. ka- "as" and fa- "thus, so".

Preposition:
it expresses a relationship between two entities. There are only twenty Arabic prepositions; the most important and most commonly used are six: min, ila, ala, bi, li, and fi.


Comparison:
elative forms of adjectives are used for both comparatives (e.g. "bigger") and superlatives (e.g. "biggest"). Elative adjectives are invariable and take three regular forms:
1. af3al, e.g. kibiir "big", elative akbar
2. af3a, corresponding to adjectives that end in -i or -w, e.g. Helw "sweet", elative aHla
3. afall, corresponding to adjectives with a doubled/geminate root, e.g. gediid "new", elative agadd

Nominative case - (al-marfuuE):

This case is marked by a Damma. It is the case of a noun or pronoun functioning as the subject of a clause or sentence; other words, such as adjectives, may take the nominative case in agreement with a noun.

Accusative case - (al-manSuub):

This case is marked by a fatHa. It is the case that identifies the direct object of a verb, as well as certain other grammatical functions.

Dative:
the case that indicates the indirect object of a verb.

Genitive case - (al-majruur):

This case is marked by a kasra; it is the case that indicates possession.

Essive:
a case that expresses the temporary state of the referent specified by a noun. It means "while" or "in the capacity of."

Locative:
a case that indicates a location. It corresponds to the English prepositions "in," "on," "at," and "by." In Arabic, it is only used with place expressions, such as "front" or "back."

The genitive construct:

In Arabic, two nouns can be placed one after the other in what is called a genitive construct (idafa) to indicate possession. First comes the noun being possessed (mudaf), then comes the noun referring to the owner (mudaf ilayhi).

Temporal:
a case that indicates a time. It corresponds to the English prepositions "in," "on," "at," and "by." In Arabic, it is only used with time expressions, such as "morning" or "evening."

Partitive:
a case that indicates "partialness," "without result," or "without specific identity," e.g. "thirteen men came."

Cognate Accusative:

a case that identifies the object of an intransitive verb, where the object shares the same root as the verb.

Final:
a case that indicates the final cause (purpose) of an action.

Comitative:
a case that indicates companionship. It corresponds to the English preposition "with."

Perlative:
in Arabic, it indicates a movement along the referent of the marked noun, e.g. "the man walked along the beach."

Vocative:
a case that indicates that somebody or something is being directly addressed by the speaker.

Ablative:
the case that indicates the source, agent, or instrument of the action of the verb. It also marks the object of the most common prepositions.


BUCKWALTER TRANSLITERATION:
Buckwalter  Unicode  Arabic letter
'           U+0621   ARABIC LETTER HAMZA
|           U+0622   ARABIC LETTER ALEF WITH MADDA ABOVE
>           U+0623   ARABIC LETTER ALEF WITH HAMZA ABOVE
&           U+0624   ARABIC LETTER WAW WITH HAMZA ABOVE
<           U+0625   ARABIC LETTER ALEF WITH HAMZA BELOW
}           U+0626   ARABIC LETTER YEH WITH HAMZA ABOVE
A           U+0627   ARABIC LETTER ALEF
b           U+0628   ARABIC LETTER BEH
p           U+0629   ARABIC LETTER TEH MARBUTA
t           U+062A   ARABIC LETTER TEH
v           U+062B   ARABIC LETTER THEH
j           U+062C   ARABIC LETTER JEEM
H           U+062D   ARABIC LETTER HAH
x           U+062E   ARABIC LETTER KHAH
d           U+062F   ARABIC LETTER DAL
*           U+0630   ARABIC LETTER THAL
r           U+0631   ARABIC LETTER REH
z           U+0632   ARABIC LETTER ZAIN
s           U+0633   ARABIC LETTER SEEN
$           U+0634   ARABIC LETTER SHEEN
S           U+0635   ARABIC LETTER SAD
D           U+0636   ARABIC LETTER DAD
T           U+0637   ARABIC LETTER TAH
Z           U+0638   ARABIC LETTER ZAH
E           U+0639   ARABIC LETTER AIN
g           U+063A   ARABIC LETTER GHAIN
_           U+0640   ARABIC TATWEEL
f           U+0641   ARABIC LETTER FEH
q           U+0642   ARABIC LETTER QAF
k           U+0643   ARABIC LETTER KAF
l           U+0644   ARABIC LETTER LAM
m           U+0645   ARABIC LETTER MEEM
n           U+0646   ARABIC LETTER NOON
h           U+0647   ARABIC LETTER HEH
w           U+0648   ARABIC LETTER WAW
Y           U+0649   ARABIC LETTER ALEF MAKSURA
y           U+064A   ARABIC LETTER YEH
F           U+064B   ARABIC FATHATAN
N           U+064C   ARABIC DAMMATAN
K           U+064D   ARABIC KASRATAN
a           U+064E   ARABIC FATHA
u           U+064F   ARABIC DAMMA
i           U+0650   ARABIC KASRA
~           U+0651   ARABIC SHADDA
o           U+0652   ARABIC SUKUN
`           U+0670   ARABIC LETTER SUPERSCRIPT ALEF
{           U+0671   ARABIC LETTER ALEF WASLA
P           U+067E   ARABIC LETTER PEH
J           U+0686   ARABIC LETTER TCHEH
V           U+06A4   ARABIC LETTER VEH
G           U+06AF   ARABIC LETTER GAF
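Because the mapping above is one-to-one at the character level, converting between Buckwalter and Arabic script is a per-character dictionary lookup. A minimal sketch in Python, covering only the consonant subset of the table (this is an illustration, not the project's actual converter):

```python
# Minimal Buckwalter <-> Arabic transliteration sketch.
# Only the consonant rows of the table above are included.
BUCK2ARAB = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624",
    "<": "\u0625", "}": "\u0626", "A": "\u0627", "b": "\u0628",
    "p": "\u0629", "t": "\u062A", "v": "\u062B", "j": "\u062C",
    "H": "\u062D", "x": "\u062E", "d": "\u062F", "*": "\u0630",
    "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638",
    "E": "\u0639", "g": "\u063A", "f": "\u0641", "q": "\u0642",
    "k": "\u0643", "l": "\u0644", "m": "\u0645", "n": "\u0646",
    "h": "\u0647", "w": "\u0648", "Y": "\u0649", "y": "\u064A",
}
# The table is one-to-one, so the reverse mapping is just an inversion.
ARAB2BUCK = {v: k for k, v in BUCK2ARAB.items()}

def buck_to_arabic(text: str) -> str:
    """Map each Buckwalter character to its Arabic letter (unknowns pass through)."""
    return "".join(BUCK2ARAB.get(ch, ch) for ch in text)

def arabic_to_buck(text: str) -> str:
    """Inverse mapping: Arabic letters back to Buckwalter characters."""
    return "".join(ARAB2BUCK.get(ch, ch) for ch in text)
```

For example, `buck_to_arabic("ktAb")` yields the Arabic spelling of kitAb "book".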

POS Tag Set:


Tag    Description
JJ     adjective
RB     adverb
CC     coordinating conjunction
DT     determiner/demonstrative pronoun
FW     foreign word
NN     common noun, singular
NNS    common noun, plural or dual
NNP    proper noun, singular
NNPS   proper noun, plural or dual
RP     particle
VBP    imperfect verb (nb: imperfect rather than present tense)
VBN    passive verb (nb: passive rather than past participle)
VBD    perfect verb (nb: perfect rather than past tense)
UH     interjection
PRP    personal pronoun
PRP$   possessive personal pronoun
CD     cardinal number
IN     subordinating conjunction (FUNC_WORD) or preposition (PREP)
WP     relative pronoun
WRB    wh-adverb
,      punctuation, token is , (PUNC)
.      punctuation, token is . (PUNC)
:      punctuation, token is : or other (PUNC)
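In a keyphrase-extraction setting, a tag set like this is typically used to keep only nouns and adjectives as candidate keyphrase words. A minimal illustrative sketch (the tag choice and function name are assumptions for illustration, not the project's actual filter):

```python
# Keep only words whose POS tag marks a noun or adjective,
# using tags from the tag set above.
CANDIDATE_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ"}

def candidate_words(tagged):
    """tagged: list of (word, tag) pairs from a POS tagger."""
    return [word for word, tag in tagged if tag in CANDIDATE_TAGS]

# Example input in Buckwalter transliteration (hypothetical sentence).
tokens = [("ktAb", "NN"), ("jdyd", "JJ"), ("fy", "IN"), ("qr>", "VBD")]
```

Here `candidate_words(tokens)` keeps only the noun and the adjective, discarding the preposition and the verb.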

AraMorph
Dictionaries file format:
"dictPrefixes" contains all Arabic prefixes and their concatenations. Each entry lists the undiacritized form, the diacritized form, the morphological category, and the gloss with its POS annotation. Sample entries:

w      wa      Pref-Wa     and                       <pos>wa/CONJ+</pos>
f      fa      Pref-Wa     and;so                    <pos>fa/CONJ+</pos>
b      bi      NPref-Bi    by;with                   <pos>bi/PREP+</pos>
k      ka      NPref-Bi    like;such as              <pos>ka/PREP+</pos>
wb     wabi    NPref-Bi    and + by/with             <pos>wa/CONJ+bi/PREP+</pos>
fb     fabi    NPref-Bi    and + by/with             <pos>fa/CONJ+bi/PREP+</pos>
wk     waka    NPref-Bi    and + like/such as        <pos>wa/CONJ+ka/PREP+</pos>
fk     faka    NPref-Bi    and + like/such as        <pos>fa/CONJ+ka/PREP+</pos>
Al     Al      NPref-Al    the                       <pos>Al/DET+</pos>
wAl    waAl    NPref-Al    and + the                 <pos>wa/CONJ+Al/DET+</pos>
fAl    faAl    NPref-Al    and/so + the              <pos>fa/CONJ+Al/DET+</pos>
bAl    biAl    NPref-BiAl  with/by + the             <pos>bi/PREP+Al/DET+</pos>
kAl    kaAl    NPref-BiAl  like/such as + the        <pos>ka/PREP+Al/DET+</pos>
wbAl   wabiAl  NPref-BiAl  and + with/by + the       <pos>wa/CONJ+bi/PREP+Al/DET+</pos>
fbAl   fabiAl  NPref-BiAl  and/so + with/by + the    <pos>fa/CONJ+bi/PREP+Al/DET+</pos>
wkAl   wakaAl  NPref-BiAl  and + like/such as + the  <pos>wa/CONJ+ka/PREP+Al/DET+</pos>
fkAl   fakaAl  NPref-BiAl  and + like/such as + the  <pos>fa/CONJ+ka/PREP+Al/DET+</pos>

"dictSuffixes" contains all Arabic suffixes and their concatenations. Sample entries:

p      ap          NSuff-ap   [fem.sg.]          <pos>+ap/NSUFF_FEM_SG</pos>
ty     atayo       NSuff-tay  two                <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS</pos>
tyh    atayohi     NSuff-tay  his/its two        <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+hu/POSS_PRON_3MS</pos>
tyhmA  atayohimA   NSuff-tay  their two          <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+humA/POSS_PRON_3D</pos>
tyhm   atayohim    NSuff-tay  their two          <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+hum/POSS_PRON_3MP</pos>
tyhA   atayohA     NSuff-tay  its/their/her two  <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+hA/POSS_PRON_3FS</pos>
tyhn   atayohin~a  NSuff-tay  their two          <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+hun~a/POSS_PRON_3FP</pos>
tyk    atayoka     NSuff-tay  your two           <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+ka/POSS_PRON_2MS</pos>
tyk    atayoki     NSuff-tay  your two           <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+ki/POSS_PRON_2FS</pos>
tykmA  atayokumA   NSuff-tay  your two           <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+kumA/POSS_PRON_2D</pos>
tykm   atayokum    NSuff-tay  your two           <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+kum/POSS_PRON_2MP</pos>
tykn   atayokun~a  NSuff-tay  your two           <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+kun~a/POSS_PRON_2FP</pos>
ty     atay~a      NSuff-tay  my two             <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+ya/POSS_PRON_1S</pos>
tynA   atayonA     NSuff-tay  our two            <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+nA/POSS_PRON_1P</pos>

"dictStems" contains all Arabic stems, grouped by root (lines beginning with ";---") and lemma (lines beginning with ";;"). Sample entries:

;--- ktb
;; katab-u_1
ktb      katab       PV          write
ktb      kotub       IV          write
ktb      kutib       PV_Pass     be written;be fated;be destined
ktb      kotab       IV_Pass_yu  be written;be fated;be destined
;; kAtab_1
kAtb     kAtab       PV          correspond with
kAtb     kAtib       IV_yu       correspond with
;; >akotab_1
>ktb     >akotab     PV          dictate;make write
Aktb     >akotab     PV          dictate;make write
ktb      kotib       IV_yu       dictate;make write
ktb      kotab       IV_Pass_yu  be dictated
;; kitAb_1
ktAb     kitAb       Ndu         book
ktb      kutub       N           books
;; kitAboxAnap_1
ktAbxAn  kitAboxAn   NapAt       library;bookstore
ktbxAn   kutuboxAn   NapAt       library;bookstore
;; kutubiy~_1
ktby     kutubiy~    Ndu         book-related
;; kutubiy~_2
ktby     kutubiy~    Ndu         bookseller    <pos>kutubiy~/NOUN</pos>
ktby     kutubiy~    Nap         booksellers   <pos>kutubiy~/NOUN</pos>
;; kut~Ab_1
ktAb     kut~Ab      N           kuttab (village school);Quran school
ktAtyb   katAtiyb    Ndip        kuttab (village schools);Quran schools
;; kutay~ib_1
ktyb     kutay~ib    NduAt       booklet
;; kitAbap_1
ktAb     kitAb       Nap         writing
;; kitAbap_2
ktAb     kitAb       Napdu       essay;piece of writing
ktAb     kitAb       NAt         writings;essays
;; kitAbiy~_1
ktAby    kitAbiy~    N-ap        writing;written   <pos>kitAbiy~/ADJ</pos>
;; katiybap_1
ktyb     katiyb      Napdu       brigade;squadron;corps
ktA}b    katA}ib     Ndip        brigades;squadrons;corps
ktA}b    katA}ib     Ndip        Phalangists
;; katA}ibiy~_1
ktA}by   katA}ibiy~  Nall        brigade;corps   <pos>katA}ibiy~/NOUN</pos>
ktA}by   katA}ibiy~  Nall        brigade;corps   <pos>katA}ibiy~/ADJ</pos>
;; katA}ibiy~_2
ktA}by   katA}ibiy~  Nall        Phalangist      <pos>katA}ibiy~/NOUN</pos>
ktA}by   katA}ibiy~  Nall        Phalangist      <pos>katA}ibiy~/ADJ</pos>
;; makotab_1
mktb     makotab     Ndu         bureau;office;department
mkAtb    makAtib     Ndip        bureaus;offices
;; makotabiy~_1
mktby    makotabiy~  N-ap        office          <pos>makotabiy~/ADJ</pos>
;; makotabap_1
mktb     makotab     NapAt       library;bookstore
mkAtb    makAtib     Ndip        libraries;bookstores
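All three dictionary files share the same line layout: undiacritized entry, diacritized form, morphological category, and gloss (optionally carrying an explicit <pos>...</pos> tag). A minimal parsing sketch, assuming tab-separated fields as in the released Buckwalter data files (the class and function names here are illustrative, not AraMorph's API):

```python
import re
from typing import NamedTuple, Optional

class DictEntry(NamedTuple):
    entry: str          # undiacritized form, as looked up
    vocalized: str      # diacritized form
    category: str       # morphological category (e.g. NPref-Al, PV, Ndu)
    gloss: str          # English gloss; ';' separates alternatives
    pos: Optional[str]  # explicit <pos>...</pos> annotation, if present

def parse_line(line: str) -> Optional[DictEntry]:
    """Parse one dictionary line; comment/blank lines yield None."""
    if line.startswith(";") or not line.strip():
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        return None
    entry, vocalized, category, gloss = fields[0], fields[1], fields[2], fields[3]
    # Pull out an explicit POS tag if the gloss carries one.
    m = re.search(r"<pos>(.*?)</pos>", gloss)
    pos = m.group(1) if m else None
    gloss = re.sub(r"<pos>.*?</pos>", "", gloss).strip()
    return DictEntry(entry, vocalized, category, gloss, pos)
```

For instance, parsing the dictPrefixes sample line for the definite article yields entry "Al", category "NPref-Al", gloss "the", and pos "Al/DET+".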

THE THREE COMPATIBILITY TABLES:


Compatibility table "tableAB" lists compatible Prefix and Stem morphological categories, such as:

NPref-Al  N
NPref-Al  N-ap
NPref-Al  N-ap_L
NPref-Al  N/At
NPref-Al  N/At_L
NPref-Al  N/ap
NPref-Al  N/ap_L

Compatibility table "tableAC" lists compatible Prefix and Suffix morphological categories, such as:

NPref-Al  Suff-0
NPref-Al  NSuff-u
NPref-Al  NSuff-a
NPref-Al  NSuff-i
NPref-Al  NSuff-An
NPref-Al  NSuff-ayn

Compatibility table "tableBC" lists compatible Stem and Suffix morphological categories, such as:

PV  PVSuff-a
PV  PVSuff-ah
PV  PVSuff-A
PV  PVSuff-Ah
PV  PVSuff-at
PV  PVSuff-ath
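An analysis of a word as prefix + stem + suffix is accepted only when all three pairwise category combinations are licensed by these tables. A minimal sketch with toy table contents (the real tables are loaded from the AraMorph data files; the function name and sample pairs are illustrative):

```python
# Toy compatibility tables: sets of licensed category pairs.
TABLE_AB = {("NPref-Al", "N")}        # prefix-stem
TABLE_AC = {("NPref-Al", "NSuff-u")}  # prefix-suffix
TABLE_BC = {("N", "NSuff-u")}         # stem-suffix

def compatible(pref_cat: str, stem_cat: str, suff_cat: str) -> bool:
    """A (prefix, stem, suffix) segmentation is valid only if
    all three pairwise combinations are licensed."""
    return ((pref_cat, stem_cat) in TABLE_AB
            and (pref_cat, suff_cat) in TABLE_AC
            and (stem_cat, suff_cat) in TABLE_BC)
```

With these toy tables, a definite noun with nominative suffix passes the check, while a verbal suffix on the same stem fails it.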

Grammatical categories

Prefixes
Category                Description
CONJ                    Conjunction
EMPHATIC_PARTICLE       Emphatic particle
FUNC_WORD               Function word (TODO: to be precisely defined)
FUT_PART                Future particle
INTERJ                  Interjection
INTERROG_PART           Interrogative particle
IV1S                    Imperfective 1st person singular
IV2MS                   Imperfective 2nd person masculine singular
IV2FS                   Imperfective 2nd person feminine singular
IV3MS                   Imperfective 3rd person masculine singular
IV3FS                   Imperfective 3rd person feminine singular
IV2D                    Imperfective 2nd person dual
IV2FD                   Imperfective 2nd person feminine dual
IV3MD                   Imperfective 3rd person masculine dual
IV3FD                   Imperfective 3rd person feminine dual
IV1P                    Imperfective 1st person plural
IV2MP                   Imperfective 2nd person masculine plural
IV2FP                   Imperfective 2nd person feminine plural
IV3MP                   Imperfective 3rd person masculine plural
IV3FP                   Imperfective 3rd person feminine plural
NEG_PART                Negative particle
PREP                    Preposition
RESULT_CLAUSE_PARTICLE  Result clause particle

Stems
Category         Description
ABBREV           Abbreviation
ADJ              Adjective
ADV              Adverb
DEM_PRON_F       Feminine demonstrative pronoun
DEM_PRON_FS      Feminine singular demonstrative pronoun
DEM_PRON_FD      Feminine dual demonstrative pronoun
DEM_PRON_MS      Masculine singular demonstrative pronoun
DEM_PRON_MD      Masculine dual demonstrative pronoun
DEM_PRON_MP      Masculine plural demonstrative pronoun
DET              Determinative
INTERROG         Interrogative (TODO: to be precisely defined)
NO_STEM          No stem for the word
NOUN             Noun
NOUN_PROP        Proper noun
NUMERIC_COMMA    Decimal separator
PART             Particle
PRON_1S          Personal pronoun: 1st person singular
PRON_2MS         Personal pronoun: 2nd person masculine singular
PRON_2FS         Personal pronoun: 2nd person feminine singular
PRON_3MS         Personal pronoun: 3rd person masculine singular
PRON_3FS         Personal pronoun: 3rd person feminine singular
PRON_2D          Personal pronoun: 2nd person common dual
PRON_3D          Personal pronoun: 3rd person common dual
PRON_1P          Personal pronoun: 1st person plural
PRON_2MP         Personal pronoun: 2nd person masculine plural
PRON_2FP         Personal pronoun: 2nd person feminine plural
PRON_3MP         Personal pronoun: 3rd person masculine plural
PRON_3FP         Personal pronoun: 3rd person feminine plural
REL_PRON         Relative pronoun
VERB_IMPERATIVE  Imperative verb
VERB_IMPERFECT   Imperfective verb
VERB_PERFECT     Perfective verb
NO_RESULT        Word that could not be analyzed

Suffixes
Category                   Description
CASE_INDEF_NOM             Indefinite, nominative
CASE_INDEF_ACC             Indefinite, accusative
CASE_INDEF_ACCGEN          Indefinite, accusative/genitive
CASE_INDEF_GEN             Indefinite, genitive
CASE_DEF_NOM               Definite, nominative
CASE_DEF_ACC               Definite, accusative
CASE_DEF_ACCGEN            Definite, accusative/genitive
CASE_DEF_GEN               Definite, genitive
NSUFF_MASC_SG_ACC_INDEF    Nominal suffix: masculine singular, accusative, indefinite
NSUFF_FEM_SG               Nominal suffix: feminine singular
NSUFF_MASC_DU_NOM          Nominal suffix: masculine dual, nominative
NSUFF_MASC_DU_NOM_POSS     Nominal suffix: masculine dual, nominative, construct state
NSUFF_MASC_DU_ACCGEN       Nominal suffix: masculine dual, accusative/genitive
NSUFF_MASC_DU_ACCGEN_POSS  Nominal suffix: masculine dual, accusative/genitive, construct state
NSUFF_FEM_DU_NOM           Nominal suffix: feminine dual, nominative
NSUFF_FEM_DU_NOM_POSS      Nominal suffix: feminine dual, nominative, construct state
NSUFF_FEM_DU_ACCGEN        Nominal suffix: feminine dual, accusative/genitive
NSUFF_FEM_DU_ACCGEN_POSS   Nominal suffix: feminine dual, accusative/genitive, construct state
NSUFF_MASC_PL_NOM          Nominal suffix: masculine plural, nominative
NSUFF_MASC_PL_NOM_POSS     Nominal suffix: masculine plural, nominative, construct state
NSUFF_MASC_PL_ACCGEN       Nominal suffix: masculine plural, accusative/genitive
NSUFF_MASC_PL_ACCGEN_POSS  Nominal suffix: masculine plural, accusative/genitive, construct state
NSUFF_FEM_PL               Nominal suffix: feminine plural
POSS_PRON_1S               Possessive pronoun suffix: 1st person singular
POSS_PRON_2MS              Possessive pronoun suffix: 2nd person masculine singular
POSS_PRON_2FS              Possessive pronoun suffix: 2nd person feminine singular
POSS_PRON_3MS              Possessive pronoun suffix: 3rd person masculine singular
POSS_PRON_3FS              Possessive pronoun suffix: 3rd person feminine singular
POSS_PRON_2D               Possessive pronoun suffix: 2nd person common dual
POSS_PRON_3D               Possessive pronoun suffix: 3rd person common dual
POSS_PRON_1P               Possessive pronoun suffix: 1st person plural
POSS_PRON_2MP              Possessive pronoun suffix: 2nd person masculine plural
POSS_PRON_2FP              Possessive pronoun suffix: 2nd person feminine plural
POSS_PRON_3MP              Possessive pronoun suffix: 3rd person masculine plural
POSS_PRON_3FP              Possessive pronoun suffix: 3rd person feminine plural
IVSUFF_DO:1S               Imperfective verb direct object: 1st person singular
IVSUFF_DO:2MS              Imperfective verb direct object: 2nd person masculine singular
IVSUFF_DO:2FS              Imperfective verb direct object: 2nd person feminine singular
IVSUFF_DO:3MS              Imperfective verb direct object: 3rd person masculine singular
IVSUFF_DO:3FS              Imperfective verb direct object: 3rd person feminine singular
IVSUFF_DO:2D               Imperfective verb direct object: 2nd person common dual
IVSUFF_DO:3D               Imperfective verb direct object: 3rd person common dual
IVSUFF_DO:1P               Imperfective verb direct object: 1st person plural
IVSUFF_DO:2MP              Imperfective verb direct object: 2nd person masculine plural
IVSUFF_DO:2FP              Imperfective verb direct object: 2nd person feminine plural
IVSUFF_DO:3MP              Imperfective verb direct object: 3rd person masculine plural
IVSUFF_DO:3FP              Imperfective verb direct object: 3rd person feminine plural
IVSUFF_MOOD:I              Imperfective verb: indicative mood
IVSUFF_SUBJ:2FS_MOOD:I     Imperfective verb: subject marker, 2nd person feminine singular, indicative mood
IVSUFF_SUBJ:D_MOOD:I       Imperfective verb: subject marker, dual, indicative mood
IVSUFF_SUBJ:3D_MOOD:I      Imperfective verb: subject marker, 3rd person common dual, indicative mood
IVSUFF_SUBJ:MP_MOOD:I      Imperfective verb: subject marker, masculine plural, indicative mood
IVSUFF_MOOD:S              Imperfective verb: subjunctive/jussive mood
IVSUFF_SUBJ:2FS_MOOD:SJ    Imperfective verb: subject marker, 2nd person feminine singular, subjunctive/jussive mood
IVSUFF_SUBJ:D_MOOD:SJ      Imperfective verb: subject marker, dual, subjunctive/jussive mood
IVSUFF_SUBJ:MP_MOOD:SJ     Imperfective verb: subject marker, masculine plural, subjunctive/jussive mood
IVSUFF_SUBJ:3MP_MOOD:SJ    Imperfective verb: subject marker, 3rd person masculine plural, subjunctive/jussive mood
IVSUFF_SUBJ:FP             Imperfective verb: subject marker, feminine plural
PVSUFF_DO:1S               Perfective verb direct object: 1st person singular
PVSUFF_DO:2MS              Perfective verb direct object: 2nd person masculine singular
PVSUFF_DO:2FS              Perfective verb direct object: 2nd person feminine singular
PVSUFF_DO:3MS              Perfective verb direct object: 3rd person masculine singular
PVSUFF_DO:3FS              Perfective verb direct object: 3rd person feminine singular
PVSUFF_DO:2D               Perfective verb direct object: 2nd person common dual
PVSUFF_DO:3D               Perfective verb direct object: 3rd person common dual
PVSUFF_DO:1P               Perfective verb direct object: 1st person plural
PVSUFF_DO:2MP              Perfective verb direct object: 2nd person masculine plural
PVSUFF_DO:2FP              Perfective verb direct object: 2nd person feminine plural
PVSUFF_DO:3MP              Perfective verb direct object: 3rd person masculine plural
PVSUFF_DO:3FP              Perfective verb direct object: 3rd person feminine plural
PVSUFF_SUBJ:1S             Perfective verb subject: 1st person singular
PVSUFF_SUBJ:2MS            Perfective verb subject: 2nd person masculine singular
PVSUFF_SUBJ:2FS            Perfective verb subject: 2nd person feminine singular
PVSUFF_SUBJ:3MS            Perfective verb subject: 3rd person masculine singular
PVSUFF_SUBJ:3FS            Perfective verb subject: 3rd person feminine singular
PVSUFF_SUBJ:2MD            Perfective verb subject: 2nd person masculine dual
PVSUFF_SUBJ:2FD            Perfective verb subject: 2nd person feminine dual
PVSUFF_SUBJ:3MD            Perfective verb subject: 3rd person masculine dual
PVSUFF_SUBJ:3FD            Perfective verb subject: 3rd person feminine dual
PVSUFF_SUBJ:1P             Perfective verb subject: 1st person plural
PVSUFF_SUBJ:2MP            Perfective verb subject: 2nd person masculine plural
PVSUFF_SUBJ:2FP            Perfective verb subject: 2nd person feminine plural
PVSUFF_SUBJ:3MP            Perfective verb subject: 3rd person masculine plural
PVSUFF_SUBJ:3FP            Perfective verb subject: 3rd person feminine plural
CVSUFF_DO:1S               Imperative verb direct object: 1st person singular
CVSUFF_DO:3MS              Imperative verb direct object: 3rd person masculine singular
CVSUFF_DO:3FS              Imperative verb direct object: 3rd person feminine singular
CVSUFF_DO:3D               Imperative verb direct object: 3rd person common dual
CVSUFF_DO:1P               Imperative verb direct object: 1st person plural
CVSUFF_DO:3MP              Imperative verb direct object: 3rd person masculine plural
CVSUFF_DO:3FP              Imperative verb direct object: 3rd person feminine plural
CVSUFF_SUBJ:2MS            Imperative verb subject: 2nd person masculine singular
CVSUFF_SUBJ:2FS            Imperative verb subject: 2nd person feminine singular
CVSUFF_SUBJ:2MP            Imperative verb subject: 2nd person masculine plural

