
CHAMELI DEVI INSTITUTE OF

TECHNOLOGY AND MGMT.

Session 2009-10

TOPIC – XML Parsers

Submitted to: Mr. Hemant Verma

Submitted by:
Shraddha Porwal (0832cs071103)
Prateek Sharma (0832cs071071)
CONTENTS-
• Definition
• Need
• Parsing
• Types of Parsers
• DOM Parser
• JDOM: Better than DOM
• SAX Parser
• Use of CallBacks
• SAX interfaces
• StAX: Better than SAX
• SAX v/s DOM
What is an XML Parser?

It is a software library (or package) that provides
methods (or interfaces) for client applications to work
with XML documents
It checks well-formedness
It may validate the documents
It handles many other details so that a client is
shielded from those complexities


Need for XML Parsers?

• As we have seen, XML is a standard way of communicating


in the J2EE environment.
• To communicate, a system needs to be able to both create an
XML document and process an XML document.
• There are many ways to create this XML document
– Two popular options are:
• Use a text editor
• Create it in Java
Continue…

• Once you have created an XML document, you need to be


able to process/read it.
– To do this in Java, you use a Java XML parser.
– You can use the parsers included in the SDK or you can
use a vendor-developed parser.
– There are two basic types of XML parser: SAX and
DOM
What is parsing?

• Interpretation of text.
• The XML parser’s job is to load the document, check that it
follows all necessary rules (at minimum, for
well-formedness), and build a document tree structure that
can be passed on to the application.
• The application is any program (e.g. browser, reader,
middleware) that acts upon the tree structure, processing
the data it contains.
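The load, check, and build steps above can be sketched with the JAXP DOM parser that ships with the JDK. This is a minimal sketch, not a definitive implementation: the class name ParseCheck, the method rootNameOrNull, and the sample XML strings are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: loads XML text, checks well-formedness, builds a tree.
class ParseCheck {
    // Returns the root element name if the text is well-formed, or null if it is not.
    static String rootNameOrNull(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser
            // from printing its own message to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (SAXException e) {
            return null; // the document broke a well-formedness rule
        } catch (Exception e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootNameOrNull("<note><to>Asha</to></note>")); // note
        System.out.println(rootNameOrNull("<note><to>Asha</note>"));      // null
    }
}
```

The application only ever sees the finished Document; a document that is not well-formed never reaches it.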
How is parsing done?

[Figure: an XML application. XML Document → XML Parser → packets of parsed XML data → Application to manipulate XML data]

A parser is a program that:


Reads a document
Checks whether it is syntactically correct
Takes some action as it processes the document
Types of XML parsers
• XML parsers can be classified as tree-based or event-based
– Tree-based parsers read the entire XML document into memory and
store it as a tree data structure
Tree-based parsers allow random access to the XML, hence are usually
more convenient to work with
It’s usually possible to manipulate the tree and write out the modified
XML
Parse trees can take up a lot of memory
DOM and XOM are tree-based
– Event-based (or streaming) parsers read sequentially through the
XML file
Event-based parsers are faster and take very little memory—this is
important for large documents and for web sites
SAX and StAX are event-based
In the event-driven model, the parser reads through the document and
signals each significant parsing event (e.g. start of document, start of
element, end of element). Callback methods are used to handle these events
as they occur. This approach is used by the Simple API for XML (SAX).
Some Java XML Parsers
• DOM
– Sun JAXP
– IBM XML4J
– Apache Xerces
– Resin (Caucho)
– DXP (DataChannel)
• SAX
– Sun JAXP
– SAXON
• JDOM
• StAX
DOM Parser-Document Object Model

• Flexible, can walk the tree up and down


• Builds a tree that represents the document
• Easier to use for most applications
• Parsed tree gives complete overview of data
• DOM standard defines interfaces and methods to analyze
and modify the tree structure that represents an XML
document
• Has methods like getChildNodes(), getNextSibling(),
getParentNode(), getNodeType(), etc.
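The navigation methods above can be tried out with a short walk over the children of the root element. This is only a sketch: the class name DomWalk, the method childElementNames, and the sample book document are made up for illustration, while the DOM calls themselves come from the standard org.w3c.dom interfaces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: walks the children of the root via sibling links.
class DomWalk {
    // Collects the names of the element children of the root, in document order.
    static List<String> childElementNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node root = doc.getDocumentElement();
            List<String> names = new ArrayList<>();
            for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                // Whitespace between tags appears as Text nodes, so filter by type.
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    names.add(n.getNodeName());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<book>\n  <title>XML</title>\n  <author>A. Writer</author>\n</book>";
        System.out.println(childElementNames(xml)); // [title, author]
    }
}
```

The node-type check matters because the tree also holds Text nodes for the line breaks and indentation between tags.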
DOM Document Object
• A DOM document is an object containing all the
information of an XML document
• It is composed of a tree (DOM tree) of nodes, and various
nodes that are somehow associated with other nodes in the
tree but are not themselves part of the DOM tree
• There are 12 types of nodes in a DOM - Element,
Attribute, Comment, Text, CDATA section, Entity
reference, Entity, Processing Instruction, Document,
Document type, Document fragment, and Notation.
Continue…
• A DOM parser creates an internal structure in memory
which is a DOM document object
• Client applications get the information of the original
XML document by invoking methods on this Document
object or on other objects it contains
An XML Document
Tree View of XML Document
• Advantage:
(1) It is good when random access to widely
separated parts of a document is required
(2) It supports both read and write operations

• Disadvantage:
(1) It is memory inefficient
(2) It seems complicated, although it is not really
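To illustrate the read and write support, the sketch below parses a small document, appends one element to the in-memory tree, and serializes the result with the JDK's identity Transformer. The class name DomEdit, the method addItem, and the list/item element names are assumptions chosen for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: modifies a DOM tree and writes it back out as text.
class DomEdit {
    // Appends <item>text</item> to the root element and returns the new XML.
    static String addItem(String xml, String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Write side: create a node and attach it to the tree.
            Element item = doc.createElement("item");
            item.setTextContent(text);
            doc.getDocumentElement().appendChild(item);

            // Serialize the modified tree with an identity transform.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not edit the sample XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(addItem("<list><item>a</item></list>", "b"));
    }
}
```

The whole document sits in memory while this runs, which is the memory cost listed under the disadvantages.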
The Trouble With DOM

• Written by C programmers

• Cumbersome API
– Node does double-duty as collection

– Multiple ways to traverse, with different interfaces

• Tedious to walk around tree to do simple tasks

• Doesn't support Java standards (java.util collections)


JDOM: Better than DOM

• Java from the ground up

• Open source

• Clean, simple API

• Uses Java Collections


JDOM vs. DOM

• Classes / Interfaces

• Java / Many languages

• Java Collections / Idiosyncratic collections

• getChildText() and other useful methods /


getNextSibling() and other useless methods
SAX parsers – Simple API For XML

Event-driven
• This is an event-driven, serial-access mechanism
that does element-by-element processing.
It calls a method you provide to process each
construct it encounters
More efficient for handling large XML documents
Gives you the information in bits and pieces
Continue…

It does not first create any internal structure


Client does not specify what methods to call
Client just overrides the methods of the API and places its
own code inside them
• When the parser encounters start-tag, end-tag, etc., it
thinks of them as events
Continue…

• When such an event occurs, the parser automatically
calls back to a particular method overridden by the client,
and passes what it sees as arguments to that method
SAX parser is event-based; it works like an event handler
in Java (e.g. MouseAdapter)
Client application just receives the data passively,
from the data flow point of view
SAX uses callbacks
• SAX works through callbacks: you call the parser, it calls
methods that you supply

[Figure: your program's main(...) calls the SAX parser's parse(...); the parser calls back into your program's startDocument(...), startElement(...), characters(...), endElement(...), and endDocument(...)]
Simple SAX program

• The program consists of two classes:


– Sample -- This class contains the main method; it
• Gets a factory to make parsers
• Gets a parser from the factory
• Creates a Handler object to handle callbacks from the parser
• Tells the parser which handler to send its callbacks to
• Reads and parses the input XML file
– Handler -- This class contains handlers for three kinds of
callbacks:
• startElement callbacks, generated when a start tag is seen
• endElement callbacks, generated when an end tag is seen
• characters callbacks, generated for the contents of an
element
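The two classes described above might look roughly like the sketch below, written against the JAXP SAX API in the JDK. The details are assumptions: the handler records each callback in a list instead of printing it, the input is an in-memory string rather than a file, and the handler is given to the parser as an argument to parse(...).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class Sample {
    // Parses the text and returns the callbacks the handler received, in order.
    static List<String> parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance(); // get a factory
            SAXParser parser = factory.newSAXParser();                 // get a parser
            Handler handler = new Handler();                           // create a handler
            // Passing the handler here tells the parser where to send callbacks.
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        for (String event : parse("<greeting><to>World</to></greeting>")) {
            System.out.println(event);
        }
    }
}

class Handler extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start:" + qName); // a start tag was seen
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName); // an end tag was seen
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver element text in more than one chunk.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("text:" + text);
        }
    }
}
```

Extending DefaultHandler means only the three callbacks of interest need to be overridden; the rest keep their empty default bodies.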
SAX Parser interfaces

• ContentHandler

• DTDHandler

• EntityResolver

• ErrorHandler
[Figure: SAX parser components]
XMLReaderFactory: used to create a SAX parser (XMLReader)
ContentHandler: handles document events (start tag, end tag, etc.)
ErrorHandler: handles parser errors
DTDHandler: handles DTD
EntityResolver: handles entities
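As a small illustration of how these handler interfaces are attached to a parser, the sketch below registers a ContentHandler and an ErrorHandler on an XMLReader. Some choices here are assumptions: the reader is obtained through JAXP's SAXParserFactory rather than XMLReaderFactory, and the class name ErrorReport and its result strings are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports the first fatal parse error through an ErrorHandler.
class ErrorReport {
    // Returns "ok" for well-formed input, or "fatal at line N" otherwise.
    static String check(String xml) {
        final String[] result = {"ok"};
        ErrorHandler errors = new ErrorHandler() {
            @Override public void warning(SAXParseException e) { /* ignored here */ }
            @Override public void error(SAXParseException e) { /* ignored here */ }
            @Override public void fatalError(SAXParseException e) throws SAXException {
                result[0] = "fatal at line " + e.getLineNumber();
                throw e; // stop parsing
            }
        };
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler()); // document events (ignored)
            reader.setErrorHandler(errors);                 // parser errors
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // Already recorded by fatalError above.
        } catch (Exception e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(check("<a><b/></a>"));
        System.out.println(check("<a>\n<b>\n</a>"));
    }
}
```

DTDHandler and EntityResolver are registered the same way, through setDTDHandler(...) and setEntityResolver(...) on the XMLReader.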
• Advantage:

(1) It is simple

(2) It is memory efficient

(3) It works well in stream application


• Disadvantage:

The data is broken into pieces and clients never have all the
information as a whole unless they create their own data
structure
StAX better than SAX

• At this point, there seems to be no reason to use SAX rather than


StAX--if you have a choice
– SAX and StAX are both streaming parsers
– StAX is faster and simpler
– With StAX your program has control, rather than being controlled by the
parser
• This means:
– You can choose what tags to look at, rather than having to deal with them all
– You can stop whenever you like

• However,
– StAX is new, so most existing projects use SAX
– Many ideas, such as the use of factories, are the same for each
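The point about the program being in control can be shown with the StAX cursor API (javax.xml.stream) from the JDK. In this sketch the program pulls events, looks only at the tag it wants, and stops at the first match. The class name StaxFirst, the method firstText, and the sample document are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: pulls events from the parser until a wanted tag appears.
class StaxFirst {
    // Returns the text of the first element named tag, or null if there is none.
    static String firstText(String xml, String tag) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // The program asks for the next event; nothing is pushed at it.
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && reader.getLocalName().equals(tag)) {
                        return reader.getElementText(); // stop here, skip the rest
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not read the sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<books><title>SAX</title><title>StAX</title></books>";
        System.out.println(firstText(xml, "title")); // SAX
    }
}
```

As with SAX, the reader comes from a factory; the difference is that the loop belongs to the program, so it can return as soon as it has what it needs.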
SAX vs DOM Parsing: Efficiency

• The DOM object built by DOM parsers is usually complicated
and requires more memory storage than the XML file itself
• A lot of time is spent on construction before use
• For some very large documents, this may be impractical

• SAX parsers store only local information that is encountered
during the serial traversal

• Hence, programming with SAX parsers is, in general, more
efficient
Which should we use? DOM vs. SAX
• If your document is very large and you need to extract
only a few elements – use SAX

• If you need to manipulate (i.e., change) the XML – use DOM

• If you need to access the XML many times – use DOM


(assuming the file is not too large)
DOM and SAX Parsers
Summary -
• Parsing is mainly interpretation of
text.
• DOM and SAX are generic parsers.
• SAX is more efficient.
THANK YOU

QUERIES?
