
A Rule-Based Approach for Extraction of Link-Context from Anchor-Text Structure

Suresh Kumar1, Naresh Kumar2, Manjeet Singh3, Asok De1

1 Ambedkar Institute of Advanced Communication Technologies & Research, Delhi-31, India
2 AIIT, Amity University, Noida, India
3 YMCA University of Science & Technology, Faridabad, India
sureshpoonia@yahoo.com, naresh.dhull@gmail.com, mstomer2000@yahoo.com, asok.de@gmail.com

Abstract. Many researchers have explored the use of link-context to determine the theme of a target web-page. Link-context has been applied in areas such as search engines, focused crawlers, and automatic classification. Extraction of precise link-context may therefore be considered an important factor in extracting more relevant information from a web-page. In this paper, we propose a rule-based approach for extracting link-context from anchor-text (AT) structure using a bottom-up simple LR (SLR) parser. We consider only named entity (NE) anchor-text. To validate the proposed approach, we consider a sample of 4 ATs. The results show that the proposed link-context extraction algorithm (LCEA) extracted the actual link-context of each AT considered (100%).

Keywords: Ontology, Augmented Context-Embedded grammar, SLR parser, Indexing, Focused-Crawling, Semantic-Web, NLP, Bare-Concept.

1 Introduction
The World Wide Web (WWW) is a collection of billions of pages that are linked together by hyperlinks. A hyperlink is often written in the following format: <A HREF="http://www.ai.ac.in/about.html/">Ambedkar Institute of Technology</A>. The first part indicates the location of the target web-page, and the second part, i.e. Ambedkar Institute of Technology, gives information about the content of the web-page and is called the anchor-text (AT). Because of the enormous size of the WWW, it is very challenging for crawlers and indexers to retrieve relevant web-pages. Many researchers have explored the use of link-context to determine the theme of a target web-page [1], [2], [3], [4], [5], [13]. Link-context has been applied in areas such as search engines [11], focused crawlers [12], and automatic classification [14]. The extraction of precise link-context may be considered an important factor in extracting more relevant information from a web-page, and it saves the effort of crawling and indexing useless and irrelevant pages [6]. In most cases, the AT or the text around the AT has been used to derive the context of a link. This motivates us to propose a rule-based approach for the extraction of link-context.

In this paper, we propose a link-context extraction algorithm (LCEA) to derive the context of a link from a web-page. By manual analysis of 100 web pages from Wikipedia and the Open Directory Project (ODP), we categorized ATs into various bare-concepts (BC) such as named entity (NE), class name (CLN), technology (T), framework (F), entertainment (E), and sports (S). We then designed an ontology of these ATs. To illustrate our approach, we consider only named entity (NE) ATs, involving 14 terminals and 23 non-terminals. We next developed an augmented context-embedded grammar, which is used by the SLR parser. Finally, the output of the SLR parser is used by LCEA to extract the precise link-context of the web-page. We validated the proposed approach on a limited sample of ATs, and the results show that LCEA extracted the actual link-context of each AT considered (100%). The rest of this paper is organized as follows. Section 2 discusses related literature. Section 3 presents our proposed approach in detail, followed by the conclusion and future work in Section 4.
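The hyperlink format described at the start of this section splits into a target location and an anchor-text. As a minimal sketch of how a crawler might separate the two, the following uses Python's standard html.parser module. The class name AnchorTextParser and the helper extract_links are hypothetical names chosen for illustration and are not part of the proposed system.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect (href, anchor-text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchor_text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._chunks = []    # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalise internal and surrounding whitespace of the anchor-text.
            text = " ".join("".join(self._chunks).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html):
    """Return a list of (target location, anchor-text) pairs."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


example = '<A HREF="http://www.ai.ac.in/about.html/"> Ambedkar Institute of Technology</A>'
print(extract_links(example))
```

Anchors without an HREF attribute are ignored in this sketch, since they carry no target location.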

2 Related Literature
Since search engines came into existence, various techniques have been developed to optimize search results. Some of these techniques focus on ranking the search results, while others concern crawling appropriate pages depending on end-user search trends and the focus area of the search engine [11], [12]. To achieve these objectives, both statistical and natural language processing based techniques [1], [2], [3], [4], [5], [13] have been proposed. For example, in [1] link-context was extracted using dependency analysis. The main idea in [1] is to simulate the browsing behavior of web readers, which the authors divide into four steps: parsing, decomposing, grouping, and selection. However, as the authors themselves state, word variation between the link-context and the target web-page makes the quality of the derived link-context very low. In [2], the authors describe an approach for automatically generating rich semantic annotations of text that can be utilized by the semantic web. In [3], the authors propose that cohesive and non-cohesive text surrounding the AT provides rich semantic cues about a target web-page. In [4], a scheme based on parsing the text around the anchor-link was proposed, which tries to extract the relevant sentence fragments. In this approach, however, the parser was confined to a single sentence, and things were arranged according to their semantic importance using an ontology; the concept of AT was not used to extract link-context. The technique in [5] derives link-context from the HTML tag tree: an HTML parser is first used to find the hierarchical structure of the content arranged by HTML tags, and then the actual AT is analyzed. Two link-context derivation techniques are described in [5]: the first derives context from aggregation nodes, and the second derives context from a text window. Both techniques fail to capture the conceptual information of the AT, which results in a poor-quality theme for the target web-page. In [13], a link-context extraction scheme based on a semi-NLP approach was proposed, but its results were not encouraging with respect to precise extraction of link-context.

3 Proposed Method
In this section, we will first give the overall model of the system and then the issues related to the design of the system will be discussed in the subsequent subsections.

3.1 Model
In this model, we first design an ontology of ATs rooted at owl:Thing [8], [10]. Based on this ontology, we develop an augmented context-embedded SLR grammar [7] for NE anchor-text only. Using this grammar, the AT is parsed by the SLR parser, and the output of the SLR parser is passed to LCEA as an argument. LCEA then derives the context of the AT. The proposed model is shown in Fig. 1.
[Figure: Ontology → Augmented Context-Embedded SLR grammar → SLR parser; AT → SLR parser → LCEA → Context of AT]

Fig. 1. Model for extraction of link-context

3.2 Types of Context Determination


To decide the types of context of an AT, we manually analyzed 100 web pages from Wikipedia and the Open Directory Project (ODP). Based on this analysis, we categorized ATs into various contexts or concepts such as named entity (NE), technology (T), framework (F), phenomenon (P), mechanism (M), sports (S), and entertainment (E). The ontology graph of these contexts/bare-concepts is shown in Fig. 2. In OWL [8], [10] there are two fundamental classes from which all other classes are derived: owl:Thing and owl:Nothing. The resource owl:Thing is the class of all individuals, and every resource that is an instance of a class is implicitly a member of owl:Thing. The resource owl:Nothing represents the empty class, a class that has no members. In our ontology, AT is a subclass of owl:Thing. All of these concepts have an rdf: is relationship with AT [9], [10].

Fig. 2. Ontology graph of ATs

NE anchor-text is further divided into 6 subclasses: person named entity (PN), organization named entity (ON), discipline named entity (DN), country name entity (CTN), state name entity (STN), and city name entity (CN). All 6 subclasses have the relationship rdfs: subclass-of with the NE class [10]. PN is further subdivided into two classes, direct person name (DP) and indirect person name (IP). For example, tim is an instance of DP and president of India is an instance of IP. These instances have an rdf: is-a relationship with their respective base classes. ON is likewise subdivided into education organization (EO) and non-education organization (NEO) subclasses, both of which have an rdfs: subclass-of relationship with the ON class [10]. DN is subclassified into commerce (COM), science (SC), and arts (ART). In this ontology graph, we have explored NE anchor-text only.
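The subclass hierarchy described above can be written down as a simple parent map, which makes it easy to see how a fine-grained class such as IP rolls up to NE and AT. The following is an illustrative sketch only: the dictionary PARENT and the function ancestors are hypothetical names, a plain dictionary stands in for the OWL ontology, and the arts class is written ART as in the grammar.

```python
# Parent of each class in the NE part of the ontology described in Section 3.2.
PARENT = {
    "NE": "AT",
    "PN": "NE", "ON": "NE", "DN": "NE",
    "CTN": "NE", "STN": "NE", "CN": "NE",
    "DP": "PN", "IP": "PN",
    "EO": "ON", "NEO": "ON",
    "COM": "DN", "SC": "DN", "ART": "DN",
}


def ancestors(cls):
    """Return the chain of superclasses from cls up to AT (nearest first)."""
    chain = []
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain


print(ancestors("IP"))   # ['PN', 'NE', 'AT']
print(ancestors("STN"))  # ['NE', 'AT']
```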

3.3 Rule-Based Development


On the basis of the above classification, we constructed the context-embedded grammar shown in Table 2. Rule AT1 → AT is added to make it an augmented grammar for the SLR parser. Initially, we have taken just 14 terminals and 23 non-terminals, which are listed in the abbreviation table, Table 1 (abbreviations are used for the sake of simplicity). For example, for President we use pre as an abbreviation. To keep the discussion clear, we also adopt the notational convention that all terminals are written in lower-case letters and all non-terminals in upper-case letters.
Table 1. Abbreviation table
S. No.  Abbreviation  Semantic Description
1       roh           Rohtak
2       raj           Rajasthan
3       md            MDU
4       uni           University
5       tim           Tim
6       ber           Berner
7       ind           India
8       of            of
9       pre           President
10      acc           Accounts
11      phy           Physics
12      his           History
13      lee           Lee
14      inf           Infosis
15      AT            Anchor-Text
16      NE            Named Entity
17      PN            Person Named Entity
18      ON            Organization Named Entity
19      DN            Discipline Name Entity
22      STN           State Name Entity
23      CTN           Country Name Entity
24      CN            City Name Entity
25      DP            Direct Person Name Entity
26      IP            Indirect Person Name Entity
27      EO            Education Organization
28      NEO           Non-Education Organization
29      SC            Science discipline
30      COM           Commerce discipline
31      ART           Art discipline
32      FN            First Name
33      MN            Middle Name
34      LN            Last Name
35      DES           Designation
36      CNJ           Conjector
37      UT            University Title
38      CL            Class Name
39      COY           Company

Table 2. SLR Grammar
Rule No  Rule                 Rule No  Rule
r0       AT1 → AT             r18      IP → DES CNJ CTN
r1       AT → NE              r19      EO → UT CN
r2       NE → PN              r20      EO → CL CNJ STN
r3       NE → ON              r21      NEO → COY
r4       NE → DN              r22      STN → raj
r5       NE → STN             r23      CN → roh
r6       NE → CTN             r24      CTN → ind
r7       NE → CN              r25      CL → uni
r8       PN → DP              r26      FN → tim
r9       PN → IP              r27      MN → ber
r10      ON → EO              r28      LN → lee
r11      ON → NEO             r29      CNJ → of
r12      DN → SC              r30      DES → pre
r13      DN → COM             r31      COM → acc
r14      DN → ART             r32      SC → phy
r15      DP → FN MN LN        r33      ART → his
r16      DP → FN MN           r34      UT → md
r17      DP → FN              r35      COY → inf
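As a quick consistency check, the grammar of Table 2 can be encoded as data and the terminal and non-terminal sets derived from it. This is a sketch under the assumption that a symbol is a non-terminal exactly when it appears on some left-hand side; the names GRAMMAR, NONTERMINALS, TERMINALS and show are hypothetical.

```python
# Productions r0..r35 of Table 2 as (left-hand side, right-hand side) pairs.
GRAMMAR = [
    ("AT1", ["AT"]), ("AT", ["NE"]),
    ("NE", ["PN"]), ("NE", ["ON"]), ("NE", ["DN"]),
    ("NE", ["STN"]), ("NE", ["CTN"]), ("NE", ["CN"]),
    ("PN", ["DP"]), ("PN", ["IP"]),
    ("ON", ["EO"]), ("ON", ["NEO"]),
    ("DN", ["SC"]), ("DN", ["COM"]), ("DN", ["ART"]),
    ("DP", ["FN", "MN", "LN"]), ("DP", ["FN", "MN"]), ("DP", ["FN"]),
    ("IP", ["DES", "CNJ", "CTN"]),
    ("EO", ["UT", "CN"]), ("EO", ["CL", "CNJ", "STN"]),
    ("NEO", ["COY"]),
    ("STN", ["raj"]), ("CN", ["roh"]), ("CTN", ["ind"]), ("CL", ["uni"]),
    ("FN", ["tim"]), ("MN", ["ber"]), ("LN", ["lee"]), ("CNJ", ["of"]),
    ("DES", ["pre"]), ("COM", ["acc"]), ("SC", ["phy"]), ("ART", ["his"]),
    ("UT", ["md"]), ("COY", ["inf"]),
]

# A symbol is a non-terminal if it occurs on a left-hand side.
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}
TERMINALS = {sym for _, rhs in GRAMMAR for sym in rhs} - NONTERMINALS


def show(i):
    """Format rule i in the style of Table 2, using an ASCII arrow."""
    lhs, rhs = GRAMMAR[i]
    return f"r{i}: {lhs} -> {' '.join(rhs)}"


print(len(TERMINALS))                # 14
print(len(NONTERMINALS - {"AT1"}))   # 23, not counting the augmented start AT1
print(show(18))                      # r18: IP -> DES CNJ CTN
```

The counts agree with the 14 terminals and 23 non-terminals stated in the text once the augmented start symbol AT1 is set aside.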

Using sets-of-items construction algorithm for SLR parser [7], the following table is constructed.
Table 3. Collection of set of items
State  Set of items
I0     AT1 → .AT, AT → .NE, NE → .PN, NE → .ON, NE → .DN, NE → .STN, NE → .CTN, NE → .CN, PN → .DP, PN → .IP, ON → .EO, ON → .NEO, DN → .SC, DN → .COM, DN → .ART, DP → .FN MN LN, DP → .FN MN, DP → .FN, IP → .DES CNJ CTN, EO → .UT CN, EO → .CL CNJ STN, NEO → .COY, STN → .raj, CN → .roh, CTN → .ind, CL → .uni, FN → .tim, DES → .pre, COM → .acc, SC → .phy, ART → .his, UT → .md, COY → .inf
I1     AT1 → AT.
I2     AT → NE.
I3     NE → PN.
I4     NE → ON.
I5     NE → DN.
I6     NE → STN.
I7     NE → CTN.
I8     NE → CN.
I9     PN → DP.
I10    PN → IP.
I11    ON → EO.
I12    ON → NEO.
I13    DN → SC.
I14    DN → COM.
I15    DN → ART.
I16    DP → FN.MN LN, DP → FN.MN, DP → FN., MN → .ber
I17    IP → DES.CNJ CTN, CNJ → .of
I18    EO → UT.CN, CN → .roh
I19    EO → CL.CNJ STN, CNJ → .of
I20    NEO → COY.
I21    STN → raj.
I22    CN → roh.
I23    CTN → ind.
I24    CL → uni.
I25    FN → tim.
I26    DES → pre.
I27    COM → acc.
I28    SC → phy.
I29    ART → his.
I30    UT → md.
I31    COY → inf.
I32    DP → FN MN.LN, DP → FN MN., LN → .lee
I33    MN → ber.
I34    IP → DES CNJ.CTN, CTN → .ind
I35    CNJ → of.
I36    EO → UT CN.
I37    EO → CL CNJ.STN, STN → .raj
I38    DP → FN MN LN.
I39    LN → lee.
I40    IP → DES CNJ CTN.
I41    EO → CL CNJ STN.
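The collection of item sets can be recomputed mechanically with the standard closure and goto operations of LR(0) item-set construction. The sketch below is one possible implementation and not the authors' code: the compact RULES string and all function names are hypothetical, an item is assumed to be a (rule index, dot position) pair, and the states are numbered in discovery order, which need not match the I0 to I41 numbering of Table 3.

```python
# Grammar of Table 2 in compact "LHS:symbols" form, r0..r35 in order.
RULES = """
AT1:AT | AT:NE | NE:PN | NE:ON | NE:DN | NE:STN | NE:CTN | NE:CN
PN:DP | PN:IP | ON:EO | ON:NEO | DN:SC | DN:COM | DN:ART
DP:FN MN LN | DP:FN MN | DP:FN | IP:DES CNJ CTN | EO:UT CN | EO:CL CNJ STN
NEO:COY | STN:raj | CN:roh | CTN:ind | CL:uni | FN:tim | MN:ber | LN:lee
CNJ:of | DES:pre | COM:acc | SC:phy | ART:his | UT:md | COY:inf
"""
GRAMMAR = []
for entry in RULES.replace("\n", "|").split("|"):
    if entry.strip():
        lhs, rhs = entry.split(":")
        GRAMMAR.append((lhs.strip(), tuple(rhs.split())))
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def closure(items):
    """LR(0) closure of a set of (rule_index, dot_position) items."""
    result = set(items)
    work = list(items)
    while work:
        rule, dot = work.pop()
        rhs = GRAMMAR[rule][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            # Add every production of the non-terminal right after the dot.
            for i, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot] and (i, 0) not in result:
                    result.add((i, 0))
                    work.append((i, 0))
    return frozenset(result)


def goto(state, symbol):
    """Advance the dot over symbol in every item that allows it."""
    moved = {(r, d + 1) for r, d in state
             if d < len(GRAMMAR[r][1]) and GRAMMAR[r][1][d] == symbol}
    return closure(moved)


def canonical_collection():
    """Return (list of item sets, transition map) for the augmented grammar."""
    start = closure({(0, 0)})
    states, index, transitions = [start], {start: 0}, {}
    for state in states:  # the list grows while we iterate over it
        symbols = {GRAMMAR[r][1][d] for r, d in state if d < len(GRAMMAR[r][1])}
        for sym in sorted(symbols):
            target = goto(state, sym)
            if target not in index:
                index[target] = len(states)
                states.append(target)
            transitions[(index[state], sym)] = index[target]
    return states, transitions


states, transitions = canonical_collection()
print(len(states))     # 42 item sets, matching I0..I41
print(len(states[0]))  # 33 items in the initial set
```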

Thereafter, using Algorithm 6.1 of [7] for the construction of the SLR parsing table, the parsing table is constructed as indicated in Table 4.
Table 4. Parsing table

3.4 Link-Context Extraction Algorithm (LCEA)


The algorithm takes three inputs: AL[ ], an array of strings holding the actions carried out during parsing (taken from the SLR operation table); n, the total number of moves needed to reach accept in the SLR operation table; and AT, the input string whose context is to be determined. AL[ ] is indexed from 0, as the executions in Table 9 show.

LCEA (AL[ ], n, AT)
(1) Declare a string variable s, and set s = AL[n-3].
(2) Set NT = the right-hand side of → in s.
(3) If NT is any of (STN, CTN, CN), then print NT as the context of the given AT, and stop.
(4) Otherwise, set s = AL[n-4].
(5) Set NT = the right-hand side of → in s.
(6) Print NT as the context of the given AT, and stop.
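The six steps of LCEA translate almost directly into code. The following is a minimal sketch, assuming that each action string uses the ASCII arrow "->" in place of the arrow printed in the grammar and that AL is a 0-indexed list; the function returns the context instead of printing it, and the AT argument is omitted because the steps never inspect it.

```python
def lcea(al, n):
    """Return the context non-terminal for a trace of n parser actions.

    al is the list of action strings from the SLR operation table, for
    example "reduce by NE -> PN", with "accept" as the last entry.
    """
    def right_hand_side(action):
        # Everything after the arrow in a "reduce by X -> Y" action.
        return action.split("->", 1)[1].strip()

    nt = right_hand_side(al[n - 3])          # steps (1) and (2)
    if nt in {"STN", "CTN", "CN"}:           # step (3)
        return nt
    return right_hand_side(al[n - 4])        # steps (4) to (6)


# Action column for the AT "History".
history = [
    "shift",
    "reduce by ART -> his",
    "reduce by DN -> ART",
    "reduce by NE -> DN",
    "reduce by AT -> NE",
    "accept",
]
print(lcea(history, len(history)))  # ART
```

Step (3) covers ATs such as a bare state, country or city name, where the rule reduced just before AT → NE already names the most specific class.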

3.5 Testing of LCEA


In order to test LCEA, we have taken the following four ATs as input strings: President of India; Tim Berner Lee; University of Rajasthan; History.

The SLR operations for the above ATs are shown in Table 5 to Table 8, respectively. The moves in these tables use the abbreviations listed in Table 1. The shift operation moves the terminal at the front of the input string to the top of the stack, followed by the new state. The reduce operation pops twice as many stack entries as there are symbols on the right-hand side of the rule (one grammar symbol and one state per symbol) and then pushes the left-hand side of the rule, followed by the resulting state, onto the stack.
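The stack handling of shift and reduce can be illustrated in a few lines. This is a sketch of the stack mechanics only: the target state of each move would normally be looked up in the parsing table, so here it is passed in explicitly, and the function names shift and reduce are hypothetical.

```python
def shift(stack, terminal, next_state):
    """Push the terminal at the front of the input, then the new state."""
    stack.extend([terminal, next_state])


def reduce(stack, lhs, rhs_len, goto_state):
    """Pop 2 * rhs_len entries (a symbol and a state per right-hand-side
    symbol), then push the left-hand side and the state reached on it."""
    del stack[len(stack) - 2 * rhs_len:]
    stack.extend([lhs, goto_state])


# Stack just before the reduction by IP -> DES CNJ CTN for "President of India".
stack = [0, "DES", 17, "CNJ", 34, "CTN", 40]
reduce(stack, "IP", 3, 10)
print(stack)  # [0, 'IP', 10]
```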
Table 5. SLR Operation table of AT: President of India
Stack                       Input         Action
0                           pre of ind $  shift
0 pre 26                    of ind $      reduce by DES → pre
0 DES 17                    of ind $      shift
0 DES 17 of 35              ind $         reduce by CNJ → of
0 DES 17 CNJ 34             ind $         shift
0 DES 17 CNJ 34 ind 23      $             reduce by CTN → ind
0 DES 17 CNJ 34 CTN 40      $             reduce by IP → DES CNJ CTN
0 IP 10                     $             reduce by PN → IP
0 PN 3                      $             reduce by NE → PN
0 NE 2                      $             reduce by AT → NE
0 AT 1                      $             accept

Table 6. SLR Operation table of AT: Tim Berner Lee
Stack                       Input          Action
0                           tim ber lee $  shift
0 tim 25                    ber lee $      reduce by FN → tim
0 FN 16                     ber lee $      shift
0 FN 16 ber 33              lee $          reduce by MN → ber
0 FN 16 MN 32               lee $          shift
0 FN 16 MN 32 lee 39        $              reduce by LN → lee
0 FN 16 MN 32 LN 38         $              reduce by DP → FN MN LN
0 DP 9                      $              reduce by PN → DP
0 PN 3                      $              reduce by NE → PN
0 NE 2                      $              reduce by AT → NE
0 AT 1                      $              accept

Table 7. SLR Operation table of AT: University of Rajasthan
Stack                       Input         Action
0                           uni of raj $  shift
0 uni 24                    of raj $      reduce by CL → uni
0 CL 19                     of raj $      shift
0 CL 19 of 35               raj $         reduce by CNJ → of
0 CL 19 CNJ 37              raj $         shift
0 CL 19 CNJ 37 raj 21       $             reduce by STN → raj
0 CL 19 CNJ 37 STN 41       $             reduce by EO → CL CNJ STN
0 EO 11                     $             reduce by ON → EO
0 ON 4                      $             reduce by NE → ON
0 NE 2                      $             reduce by AT → NE
0 AT 1                      $             accept

Table 8. SLR Operation table of AT: History
Stack       Input  Action
0           his $  shift
0 his 29    $      reduce by ART → his
0 ART 15    $      reduce by DN → ART
0 DN 5      $      reduce by NE → DN
0 NE 2      $      reduce by AT → NE
0 AT 1      $      accept

We now use Table 5 to Table 8 as input to the LCEA discussed in Section 3.4. The result of manual execution of LCEA for these 4 ATs is shown in Table 9. Moreover, we tested LCEA on around 100 ATs and obtained the actual link-context in all cases (100%). Due to space constraints, this paper illustrates LCEA only for the 4 ATs mentioned above.

Table 9. Result of Link Context Extraction Algorithm
S. No.  Input string / AT  Execution steps of LCEA
1       pre of ind         Here n = 11. (1) s = AL[8] = NE → PN (2) NT = PN (3) Not executed (4) s = AL[7] = PN → IP (5) NT = IP (6) Print IP (Indirect Person Name) as the context of AT = President of India, and stop.
2       tim ber lee        Here n = 11. (1) s = AL[8] = NE → PN (2) NT = PN (3) Not executed (4) s = AL[7] = PN → DP (5) NT = DP (6) Print DP (Direct Person Name) as the context of AT = Tim Berner Lee, and stop.
3       uni of raj         Here n = 11. (1) s = AL[8] = NE → ON (2) NT = ON (3) Not executed (4) s = AL[7] = ON → EO (5) NT = EO (6) Print EO (Education Organization Name) as the context of AT = University of Rajasthan, and stop.
4       his                Here n = 6. (1) s = AL[3] = NE → DN (2) NT = DN (3) Not executed (4) s = AL[2] = DN → ART (5) NT = ART (6) Print ART (Subject of Art Discipline) as the context of AT = History, and stop.

4 Conclusion and Future work


In this paper, we have proposed a rule-based approach to derive link-context from anchor-text structure, where the link-context is inherently embedded in the rules. The approach has been successfully tested on new anchor-texts generated using the terminal symbols of the grammar. We tested LCEA on around 100 ATs and obtained the actual link-context in all cases (100%); due to space constraints, we have illustrated LCEA with only 4 ATs. We have considered a limited ontology of AT types and hence a limited SLR grammar. In future, further types of ATs and their relations in our ontology will be explored, and the performance of LCEA will be tested. Subsequently, we will also try to extend the terminal and non-terminal lists so that all categories of BC of ATs shown in the ontology graph can be covered.

References
1. Tao Jing, Tao Ping, Wanli Zuo: Deriving Link Context through Dependency Analysis. IEEE International Conference on Education Technology and Computer, 2009
2. Akshay Java et al.: Using a Natural Language Understanding System to Generate Semantic Web Content. International Journal on Semantic Web and Information Systems, 3(4), 2007
3. Naresh Chauhan, A. K. Sharma: Analyzing Anchor-Links to Extract Semantic Inference of a Web Page. In: 10th IEEE International Conference on Information Technology, 2007
4. Q. Xu, Wanli Zuo: Extracting Precise Link Context Using NLP Parsing Technique. Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI04)
5. Gautam Pant: Deriving Link-context from HTML Tag Tree. Proceedings of the 8th SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003
6. M. Henzinger et al.: Link Analysis in Web Information Retrieval. IEEE Data Engineering Bulletin, 23(3):3-8, 2000
7. A. V. Aho, J. D. Ullman: Principles of Compiler Design. Narosa Publishing House (25th reprint 2003), pp. 197-214
8. Fensel, D., Van Harmelen, Horrocks, I., McGuinness, Patel-S.: OIL: An Ontology Infrastructure for the Semantic Web. IEEE Intelligent Systems, vol. 16, issue 2, pp. 38-45 (2001)
9. Klein, M.: Tutorial: The Semantic Web - XML, RDF, and Relatives. IEEE Intelligent Systems 16(2), 26-28 (2001)
10. J. Hebeler, M. Fisher, R. Blace, A. P. Lopez: Semantic Web Programming. Wiley Publication (2009), pp. 63-139
11. S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998
12. C. C. Aggarwal, F. Al-Garawi, P. S. Yu: Intelligent Crawling on the World Wide Web with Arbitrary Predicates. In WWW 10, Hong Kong, May 2001
13. Naresh Chauhan, A. K. Sharma: A Framework to Derive Web Page Context from Hyperlink Structure. International Journal of Information and Communication Technology, vol. 1 (3/4), pp. 329-346
14. G. Attardi, A. Gulli, F. Sebastiani: Automatic Web Page Categorization by Link and Context Analysis. In Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, 1999
