
WEB TECHNOLOGIES

http://www.lectnote.blogspot.com/ WEB TECHNOLOGIES (RT705)

MODULE I
Introduction to SGML - Features of XML - XML as a Subset of SGML - XML vs HTML - Views of an XML Document - Simple XML Document - Starting and Ending Tags - Attributes and Tags - Entity References - Comments - CDATA Sections

I INTRODUCTION TO SGML

A markup language is a way of marking up a document. It determines the structure and meaning of textual elements. It consists of codes and tags that are added to the text to change the look or meaning of the text or document. There are two types of markup languages.

a) Specific Markup Language
It is used to generate code that is specific to a particular application. Examples are:
HTML - its purpose is to format documents for the web
RTF - used for Rich Text Formatting (MS Word supports RTF)

b) Generalized Markup Language
It was created to solve some of the problems associated with porting documents from one platform and operating system configuration to another. GML was introduced by Dr. C. F. Goldfarb in the 1960s and was first developed for IBM. Later it was standardized by the International Organization for Standardization (ISO) in 1986, and thus SGML (Standard Generalized Markup Language) originated.

SGML Structure
An SGML application consists of two parts: the SGML declaration and the SGML DTD (Document Type Definition).

SGML Declaration - The declaration part identifies the characters to be used in a document. It also provides a way to identify the objects that will be used throughout the SGML document. These objects are called entities.

SGML DTD - In the Document Type Definition we list the element types we wish to use in the document and indicate the structural order in which they can occur.

SGML Features
1. The term SGML stands for Standard Generalized Markup Language.
2. It is a system for defining markup languages.
3. SGML is a metalanguage. It facilitates the creation of other languages.
4. SGML is extensible. It allows the author to define a particular structure by defining the parts that fit the structure.
5. SGML is a system for organizing and tagging the elements of a document.
6. SGML specifies the rules for tagging elements.
7. It is widely used to manage large documents that are subject to frequent revision and need to be printed in different formats.
8. Authors can mark up their documents by representing structural, presentational and semantic information along with the content.
9. SGML is intended to be absolutely independent of any application.
10. Closing tags may be optional, and nothing in an SGML document indicates how the data should look.
11. HTML is an application of SGML because HTML was created using SGML standards.
12. SGML added provisions for identifying the characters to be used in the document and for identifying the objects that will be used throughout a document.

II XML FEATURES

1. XML stands for Extensible Markup Language.
2. It is designed to describe data or information, and it focuses on what the data is.
3. XML is a smaller language than SGML (i.e. a subset of SGML).
4. It is used to format and transfer data in an easy and convenient way.
5. It is a markup language like HTML.
6. XML can work with HTML for data display and presentation.
7. It is a standard language used to structure and describe data so that it can be understood by different applications.
8. XML documents are called self-describing documents.
9. XML tags are not predefined. You must define your own tags.
10. XML is free and extensible. It is a complement to HTML.
11. XML is accompanied by a specification for a style sheet language called eXtensible Stylesheet Language (XSL).
12. XML is accompanied by a specification for a hyperlinking scheme, which is described as a separate language called eXtensible Link Language (XLL), later developed as XLink and XPointer.
13. Every XML document consists of data and markup. You can literally tag up your data with your own tags.
14. XML can be used as a data interchange format. Since the XML text format is standards based, data can be converted and then easily read by another system or application.

III XML AS A SUBSET OF SGML

SGML is a very powerful and very general standard markup language, but with that power comes increased complexity.

XML is a subset of SGML intended to make SGML light enough for use on the web. As XML is a proper subset of SGML, all XML documents are valid SGML documents, but not all SGML documents are valid XML documents.

Fig: Relationship of XML to SGML (diagram labels: SGML, XML)

SGML is intended to be absolutely independent of any application.

The complexity of implementing SGML's power limits its users to big companies that need all that power. Hence XML arrived: a simplified SGML that retains most of the inherent power of SGML in a simple, tidy, easy-to-use and easy-to-implement form.

Since XML is optimized for use on the World Wide Web, it is designed in such a way that it has some benefits that are not found in SGML.

XML is a smaller language than SGML because the designers of XML removed the parts of the SGML specification that were not needed for web delivery.

IV COMPARISON OF HTML AND XML

1. HTML: HTML is HyperText Markup Language.
   XML: XML is eXtensible Markup Language.
2. HTML: It is used for displaying information and to format the document.
   XML: It is designed to describe data and to focus on what the data is.
3. HTML: HTML is not extensible. The user can't modify the structure or format by adding new tags.
   XML: It is extensible. It allows the author to define a particular structure.
4. HTML: HTML tags are predefined.
   XML: Tags are not predefined.
5. HTML: Closing tags are optional for many elements.
   XML: Closing tags are compulsory.
6. HTML: HTML is not case sensitive.
   XML: XML is highly case sensitive.
7. HTML: HTML does not let the author supply a Document Type Definition (DTD) of their own.
   XML: XML uses a DTD to describe the data elements used in the document.
8. HTML: Document display is direct and easy using any web browser.
   XML: XML needs XSL interaction for web browser display of a document.
9. HTML: Cascading Style Sheets (CSS), a style sheet standard for HTML, can be embedded within HTML code.
   XML: In XML, presentation and content are kept separate, i.e. the XSL page acts independently.

XML BASED SYSTEM

A simple XML based system involves at least three distinct items (the document, the processor and the application), plus an optional DTD.

1) XML Document: It consists of a mixture of XML character strings known as markup, along with the actual information content of the document, known as character data.
Markup - start tags, end tags, comments etc.
Character data - the content of user-defined elements

2) XML DTD: An XML document can optionally be associated with a set of rules that specify what order of occurrence of markup and character data is permitted. These rules are housed in a Document Type Definition, or DTD for short. So an XML document can have an XML DTD. The DTD component is optional, which is denoted by the dashed lines in the diagram.

3) XML Processor: It reads XML documents and provides access to their content and structure. It is also known as an XML parser. It is responsible for processing the XML document, with or without the presence of a DTD, in order to split it up into a group of markup and a group of character data. Examples of XML processors include msxml (developed by Microsoft), XP (a Java based XML processor) etc.
There are two classes of XML processors. An XML processor capable of checking for validity is known as a validating XML processor. The msxml processor from Microsoft is an example of a validating XML processor. An XML processor that ignores any validity constraints spelled out in DTDs is known as a nonvalidating XML processor. Ælfred, a Java based XML processor developed by Microstar Corporation, is an example of a nonvalidating XML processor.

4) XML Application: The XML processor feeds the processed information to an XML application. Examples of XML applications include online banking, push technology with Microsoft Active Channels (daily news, stock prices etc.), web automation, database publishing etc.
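The flow from document to processor to application can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `order` document, the `process` and `application` function names are hypothetical, and Python's standard `xml.etree.ElementTree` stands in for the XML processor (it behaves as a nonvalidating processor, since it does not check a document against a DTD).

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: markup plus character data.
DOCUMENT = "<order><item>PC</item><quantity>2</quantity></order>"


def process(document_text):
    """XML processor step: parse the text and expose content and structure."""
    return ET.fromstring(document_text)


def application(root):
    """XML application step: work with the parsed content, not the raw text."""
    item = root.findtext("item")
    quantity = int(root.findtext("quantity"))
    return f"{quantity} x {item}"


summary = application(process(DOCUMENT))
print(summary)
```

The application never touches angle brackets itself; it only sees the element names and character data that the processor hands over.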

Comparison of XML Document & HTML Document

Let us compare an XML document and an HTML document linked to the World Wide Web with an example. Consider a company running an e-business selling PCs on the Internet. Here is the sort of information the company needs to publish:

    Maker   : Acme PC Inc
    Item    : PC
    Brand   : Acme Deluxe
    Storage : RAM - 72 MB, Hard Disk - 10 GB
    CPU     : speed 500 GHz

In order to publish this information using HTML, they need to create a document looking something like this:

    <html><body>
    <h1>PC For Sale</h1>
    <h2>Maker :Acme PC Inc</h2>
    <h3>Brand :Acme Deluxe</h3>
    <table border=1 align=center>
    <tr><td>Storage</td><td>CPU</td></tr>
    <tr><td>RAM :72 MB<br> Hard Disk :10 GB</td>
    <td>CPU Speed :500GHz</td></tr>
    </table></body></html>

Fig: HTML document linked to the World Wide Web

The HTML version of the data knows nothing about PCs or hard disk sizes. All it knows are heading levels, tables, table rows, italic text etc. When this document is put on the World Wide Web, search engines and users alike see only a collection of headings, tables, bold or italic text etc.

XML sample:

    <PcForSale>
      <item type="PC">
        <Maker>Acme PC Inc</Maker>
        <Brand>Acme Deluxe</Brand>
        <storage>
          <RAM units="MB">72</RAM>
          <HardDisk units="GB">10</HardDisk>
        </storage>
        <CPU>Speed 500 GHz</CPU>
      </item>
    </PcForSale>

This document can have a much richer interface to the web, an interface that presents all sorts of possibilities about how it might be put to use, as in the figure.

Fig: XML document linked to the World Wide Web

So the XML document consists of three distinct components:
Data content - the words themselves.
Structure - the document type and the organization of its elements, i.e. what kinds of elements it can contain and in what order they can occur.
Presentation - the way the information is presented to the reader.
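To make the contrast with the HTML version concrete, here is a small Python sketch that reads the PC sample and pulls out the RAM and hard disk sizes by name. It assumes the attribute values in the sample are quoted (as XML requires) and uses the standard `xml.etree.ElementTree` module; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

PC_XML = """
<PcForSale>
  <item type="PC">
    <Maker>Acme PC Inc</Maker>
    <Brand>Acme Deluxe</Brand>
    <storage>
      <RAM units="MB">72</RAM>
      <HardDisk units="GB">10</HardDisk>
    </storage>
    <CPU>Speed 500 GHz</CPU>
  </item>
</PcForSale>
"""

root = ET.fromstring(PC_XML)
item = root.find("item")

# Because the tags describe the data, a program can ask for it by name.
ram = item.find("storage/RAM")
disk = item.find("storage/HardDisk")
ram_text = f"{ram.text} {ram.get('units')}"
disk_text = f"{disk.text} {disk.get('units')}"

print(item.findtext("Maker"), "|", ram_text, "|", disk_text)
```

With the HTML version, the same program would have to guess which table cell happens to hold the RAM size.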

V TWO VIEWS OF AN XML DOCUMENT

The overall structure of any given XML document can be looked at in two distinct ways: it has a logical structure and, side by side with the logical structure, a physical structure.

1. Logical Structure
Viewed from this angle, an XML document is a hierarchy of information. The logical structure lists the elements to be included in a document and the order in which they have to be included. The elements and character data of the document hang in individual groups in a tree-like structure created by the markup. The element at the very top of the tree is called the root element, from which all the further logical structure develops. Thus the logical structure refers to the organization of the different parts of a document, i.e. it indicates how a document is built.


Fig: The logical structure of the acmepc catalog XML document

The logical structure is the layer above the physical structure. At this level an XML document consists of an optional prolog, a root element and an optional epilog.

Prolog: Everything that precedes the first start tag is collectively known as the prolog, i.e. the prolog is everything that occurs before the root element starts. It can be completely empty but should at least contain an XML declaration. The XML declaration identifies the version of the XML specification to which the document conforms. The sample document begins with the XML declaration

    <?xml version="1.0"?>

If the XML document is going to be associated with a Document Type Definition, then the prolog will contain a Document Type Declaration. The Document Type Declaration is the area of the prolog used to declare element types, attributes, entities and so on. It takes the following general form:

    <!DOCTYPE ...>

It consists of markup code that indicates the grammar rules. It can also point to an external file that contains all or part of the DTD. The following code adds a Document Type Declaration to the sample document:

    <?xml version="1.0"?>
    <!DOCTYPE catalog SYSTEM "catalog.dtd">

The above statement tells the XML parser that the document is of the class catalog and conforms to the rules given in the DTD file named catalog.dtd.

Root Element: The root element of an XML document is the element that contains all the other elements in the document.

    <?xml version="1.0"?>
    <hello> Welcome to XML</hello>

Here hello is the root element. The root element can be empty:

    <?xml version="1.0"?>
    <hello/>
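The order of prolog and root element can be observed with an event-based parser. The sketch below uses Python's standard `xml.parsers.expat` module; the `item` child element and the handler names are assumptions added for illustration, and the parser is not asked to fetch or validate against `catalog.dtd`.

```python
import xml.parsers.expat

DOC = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE catalog SYSTEM "catalog.dtd">\n'
    '<catalog><item/></catalog>\n'
)

events = []


def on_xml_decl(version, encoding, standalone):
    # Prolog, part 1: the XML declaration.
    events.append(("xml-declaration", version))


def on_doctype(name, system_id, public_id, has_internal_subset):
    # Prolog, part 2: the Document Type Declaration.
    events.append(("doctype", name, system_id))


def on_start(name, attrs):
    # The first start tag reported is the root element.
    events.append(("start", name))


parser = xml.parsers.expat.ParserCreate()
parser.XmlDeclHandler = on_xml_decl
parser.StartDoctypeDeclHandler = on_doctype
parser.StartElementHandler = on_start
parser.Parse(DOC, True)

root_name = next(event[1] for event in events if event[0] == "start")
for event in events:
    print(event)
```

The declaration and the DOCTYPE are reported before any element, which matches the definition of the prolog as everything before the root element starts.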


Epilog: The epilog is everything that occurs after the root element ends. The word epilog is used here to name that area, which can contain processing instructions, comments or white space.

2. Physical Structure
The physical structure of an XML document is composed of all the content used in the document. A single XML document can be made up of a number of distinct physical storage units known as entities. An entity is a unit of text, and entities are the building blocks of an XML document. The full document is rooted in the entity known as the document entity. An entity can be part of the XML document or external to the document. Each entity is identified by a unique name and contains its own content, from a single character inside the document to a large file that exists outside the document. Entities are declared in the prolog of the document and referenced in the document element. An entity can contain references to other entities, which themselves can contain references to other entities. The previous XML document is split across five separate entities, typically files or other storage units.

    Acme PC (document entity)
        Entity A (part1.xml)
            Entity A1 (part11.xml)
            Entity A2 (part12.xml)
        Entity B (part2.xml)

Fig: Physical view of an XML document

An XML processor sees an XML document as a series of characters, which it reads serially. When it sees an entity reference, it reads the name of the entity and replaces the entity reference with the actual text, graphic or other type of media that is referred to.

Types of Entities

1. Predefined Entity
In XML certain characters (such as <, > and &) are used specifically for marking up the document, so they cannot be typed directly as content; in particular, a literal < or & is never allowed in character data. You must use an entity reference (&lt;, &gt;, &amp; etc.) to insert such a character into the document:

    <myelement>7 &gt; 2</myelement>

2. Parsed Entity
It contains text data that becomes part of the XML document once the data is processed. A parsed entity is intended to be read by the XML processor, which will extract the content.


After the content is extracted, it becomes part of the document at the location of the entity reference. For example, a publisher information (PUB1) entity can be declared as

    <!ENTITY PUB1 "BPB Publishers">

Whenever the entity is referenced in the document, the reference will be replaced by the entity's content. To write an entity reference, insert an ampersand (&), then the entity name, followed by a semicolon (;):

    <publisher>This book is from &PUB1;</publisher>

3. Unparsed Entity
The contents may or may not be text. It is often a binary file or image that is not directly interpreted by the XML processor. An unparsed entity requires a notation. The notation identifies the format or type of the resource to which the entity refers.

    <!ENTITY myimage SYSTEM "1.gif" NDATA GIF>

Here GIF is the notation. The notation declaration for GIF is

    <!NOTATION GIF SYSTEM "utils\gifview.exe">

The above declaration tells the processor that whenever it encounters an entity of type GIF, gifview.exe should be used to process it.

4. External Entity
It refers to a storage unit in its declaration by using a SYSTEM or PUBLIC identifier. It provides a pointer to a location at which the entity can be found.

    <!ENTITY myimage SYSTEM "http://www.abc.com/image/1.gif" NDATA GIF>

In this example the XML processor must read the file 1.gif to retrieve the content of this entity.

VI SIMPLE XML DOCUMENT

Create a test.xml file with the following content.

    <greeting> Hello World </greeting>

The one-line document has 3 component parts:
A start tag (<greeting>)
An end tag (</greeting>)
Character data (Hello World)

By default the XML parser does not produce any output. It gives a simple tree structure that it has built from the XML document. The document consists of

    Element    -- greeting
    PCDATA     -- Hello World
    WHITESPACE -- 0xa

Here the Hello World text has been encapsulated beneath a greeting element. At the same level there is some white space, in the form of the end-of-line code added to the file by the text editor. The parser reports this as a line feed character, denoted by 0xa (line feed in Unicode and ASCII).


GRAPHICAL REPRESENTATION OF SIMPLE XML DOCUMENT

    GREETING
        HELLO WORLD
    WHITE SPACE
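The parts of the one-line greeting document can be listed with an event-based parser. This is a minimal sketch using Python's standard `xml.parsers.expat` module rather than the parser the notes had in mind; whether the line feed after the end tag is reported depends on the parser, so the sketch only collects the element and its character data.

```python
import xml.parsers.expat

# The one-line document, followed by the line feed a text editor would add.
DOC = "<greeting> Hello World </greeting>\n"

nodes = []

parser = xml.parsers.expat.ParserCreate()
parser.buffer_text = True  # deliver adjacent character data as one chunk
parser.StartElementHandler = lambda name, attrs: nodes.append(("Element", name))
parser.CharacterDataHandler = lambda text: nodes.append(("PCDATA", text))
parser.Parse(DOC, True)

# Join the character data in case the parser delivered it in pieces.
pcdata = "".join(text for kind, text in nodes if kind == "PCDATA")
print(nodes[0], repr(pcdata))
```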

CREATING XML DOCUMENT

There are seven forms of markup that can occur in an XML document:
1. Start and end tags
2. Attribute assignments
3. Entity references
4. Comments
5. CDATA sections
6. Processing instructions
7. Document Type Declaration

VII START AND END TAGS

Elements are the primary building blocks of an XML document. These elements are denoted by tags of various forms. The majority of elements are intended to contain characters, other elements or a mixture of the two. These elements have start and end points denoted by start and end tags respectively.

    Tag               Meaning
    <greeting>        Starts a greeting element
    </introduction>   Ends an introduction element
    <Joe Black>       Bad start tag. No space allowed
    <42>              Element name cannot begin with a number
    </ Product>       No space allowed between the slash and the element name

Elements can be nested to an arbitrary depth to describe rich information structures. An element which does not have content is an empty element, e.g. <hello/>. Another example is the <br> (line break) element in HTML, which cannot sensibly have any content. In XML it would be written as an empty element, <br/>. An empty element can have attributes: <hello happy="TRUE"/> is valid. Also, in

    <hello
    />

the new line will be ignored, as it occurs within markup. An empty element can also have matching start and end tags:

    <hello></hello>

and

    <hello
    ></hello>

are also valid. Elements that contain some mixture of markup and character data must have matching start and end tags, as the following examples illustrate:

1)
    <person>Sean Smith</person>

2)
    <printer type="laser">Acer</printer>

3)
    <address>
      <street>Main CG st</street>
      <country>India</country>
      <business type="retail"/>
    </address>

4)
    <printer
      Name="acer"
      type="laser"
    ><description>Bright pink and noisy</description>
    </printer>

The new lines surrounding the printer attributes above are ignored, as they occur in markup.
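The tag rules in this section can be checked mechanically. The following sketch assumes Python's standard `xml.etree.ElementTree` parser and a hypothetical helper named `is_well_formed`; each sample wraps one of the tags discussed above into a complete one-element document.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = [
    "<greeting>Hi</greeting>",      # matching start and end tags
    "<Joe Black>Hi</Joe Black>",    # space inside an element name
    "<42>Hi</42>",                  # name begins with a number
    "<Product>Hi</ Product>",       # space between slash and name
    "<hello/>",                     # empty element
    '<hello happy="TRUE"/>',        # empty element with an attribute
    "<hello\n/>",                   # new line inside markup is ignored
    "<hello\n></hello>",            # empty element with start and end tags
    "<hello></Hello>",              # names are case sensitive
]

results = {sample: is_well_formed(sample) for sample in samples}
for sample, ok in results.items():
    print(repr(sample), "->", "well formed" if ok else "not well formed")
```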

VIII ATTRIBUTE ASSIGNMENT

Attributes are pieces of information, typically small, that are associated with an XML element. An attribute assignment always appears within the start tag or the empty-element tag of an element. It always takes the form of an assignment of an attribute value to an attribute name, with = in between:

    [name of the attribute] = [value of the attribute]

Attribute values cannot contain other forms of markup such as start tags, comments, CDATA sections etc.


    Attribute Assignment                  Meaning
    1) <fruit type="apple">               The type attribute has the value apple.
    2) <fruit type='apple'>               Single quotes can also be used.
    3) <table border=2>                   Invalid. Attribute values must be quoted.
    4) <animal leg="4"   blood="cold">    The leg attribute has the value 4. The blood
                                          attribute has the value cold. White space within
                                          the start tag is ignored by the parser.

Attribute values can be delimited by either matching double quotes or matching single quotes. Examples:

    <person name="Sean O'Dear">
    <quotation text='He said "hello" to Fred'>

Attribute values can contain entity references. Here is an example of an attribute value specification that contains an entity reference.

    <!DOCTYPE test [
    <!ENTITY company "AcmePC Inc">
    ]>
    <intro title="&company; will solve all your problems"/>

This gives the title attribute the value: AcmePC Inc will solve all your problems.
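A short Python sketch can confirm these attribute rules. It assumes the standard `xml.etree.ElementTree` module, which expands entities declared in the internal DTD subset but does not validate; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Double or single quotes both work; extra white space in the tag is ignored.
animal = ET.fromstring("<animal leg=\"4\"   blood='cold'/>")
person = ET.fromstring("<person name=\"Sean O'Dear\"/>")

# An entity reference inside an attribute value is replaced by its content.
DOC = """<!DOCTYPE test [
<!ENTITY company "AcmePC Inc">
]>
<intro title="&company; will solve all your problems"/>"""
title = ET.fromstring(DOC).get("title")

# An unquoted attribute value is rejected by the parser.
try:
    ET.fromstring("<table border=2/>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False

print(animal.attrib, person.get("name"), title, unquoted_accepted, sep="\n")
```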

IX ENTITY REFERENCES

Entities are the building blocks of an XML document. They are included in the XML document by means of entity references. A simple and common use of entity references is to insert characters into XML that cannot be entered directly without confusing the XML parser. Suppose we wish to have the following content in an XML document.

    <Document> if a<b and b<c then a<c </Document>

This is not well formed XML, as the character < is reserved for markup and cannot be used as content. To escape a character, use an entity reference. The general form is: prepend an & and append a ; to the name of the entity. The built-in entity for < in XML is the same as in HTML, namely lt, so the above code can be rewritten as

    <Document> if a &lt; b and b &lt; c then a &lt; c </Document>

Here the parser successfully detects three entity references to the entity lt. This entity is built into the parser, and thus the parser knows that the content of the entity lt is <. When the XML processor parses the document, it will replace each entity reference with the actual character and will not interpret that character as markup.

Five Built-in Entities

    Entity Reference    Interpretation
    &lt;                <
    &gt;                >
    &amp;               &
    &apos;              '
    &quot;              "
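The replacement of built-in entity references can be observed directly. The sketch below assumes Python's standard `xml.etree.ElementTree` for parsing and `xml.sax.saxutils.escape` for the reverse step of escaping raw text before placing it into a document; the `raw` string is a made-up example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing: each &lt; reference is replaced by the character <.
doc = ET.fromstring(
    "<Document> if a &lt; b and b &lt; c then a &lt; c </Document>"
)
parsed_text = doc.text

# Writing: escape() turns &, < and > into entity references.
raw = "if a<b & b<c"
escaped = escape(raw)
round_trip = ET.fromstring(f"<Document>{escaped}</Document>").text

print(parsed_text)
print(escaped)
```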

Regardless of the entity type, all entities are referenced in the same way: &name;

The following code declares a simple entity

    <!ENTITY iso "International Organization for Standardization">

and uses it within a sentence:

    The &iso; sets the standard for character encoding.

When interpreted by an XML parser the result is: The International Organization for Standardization sets the standard for character encoding.

External entities are referenced in the same way as internal text entities, with the content of the external file named in the entity declaration replacing the entity reference. Binary entities can be referenced only through an attribute of an element that is declared to take an entity value.

[NB: add more points from physical structure]

X COMMENTS

1. XML comments take exactly the same form as HTML comments:

    <!-- This is a comment -->

2. Note that the string -- cannot occur within a comment; apart from that anything goes, including new lines. For example, the following is not well formed:

    <!-- This is not a -- well formed XML comment -->

An XML processor is not required to pass comments along to an application. Comments can appear anywhere outside other markup, i.e. a comment can appear in the prolog, the root element or the epilog.

3. Here is a comment spanning two lines:

    <!-- this is perfectly
    legal comment spanning two lines -->

4. Here is an illegal comment:

    <!-- this comment is not legal
    <!ELEMENT apple (#PCDATA)><!-- the apple element -->
    because it is trying to wrap itself around another comment -->

5. Here is a comment occurring within the Document Type Declaration:

    <?xml version="1.0"?>
    <!DOCTYPE apples [
    <!-- this DTD is for apples -->
    <!ELEMENT apples (#PCDATA)>
    ]>
    <apples>12</apples>

6. Here is a comment occurring within the epilog:

    <?xml version="1.0"?>
    <!DOCTYPE apples [
    <!ELEMENT apples (#PCDATA)>
    ]>
    <apples>12</apples>
    <!-- end of apples -->

7. Here is an illegal comment (comments cannot occur within other markup):

    <?xml version="1.0"?>
    <!DOCTYPE apples [
    <!ELEMENT apples (#PCDATA) <!-- apples contain PCDATA -->
    >
    ]>
    <apples>12</apples>
    <!-- end of apples -->
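The comment rules can be tried out with a DOM parser that keeps comments. This is a sketch assuming Python's standard `xml.dom.minidom` module; the `parses` helper and the sample documents are hypothetical.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

DOC = """<?xml version="1.0"?>
<!-- a comment in the prolog -->
<apples>12<!-- a comment in the root element --></apples>
<!-- end of apples -->"""

dom = minidom.parseString(DOC)

# Top-level nodes: prolog comment, root element, epilog comment.
top_level = [node.nodeName for node in dom.childNodes]
inside_root = [node.nodeName for node in dom.documentElement.childNodes]


def parses(text):
    """Return True if the text is accepted as well-formed XML."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False


double_hyphen_ok = parses("<apples><!-- not a -- legal comment --></apples>")
inside_tag_ok = parses("<apples <!-- inside a tag --> >12</apples>")

print(top_level, inside_root, double_hyphen_ok, inside_tag_ok)
```

A parser is free to discard comments instead; `xml.etree.ElementTree`, for instance, drops them by default, which is consistent with the note that a processor is not required to pass comments to the application.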

XI CDATA SECTION

A CDATA section is used in XML to shield a body of text from the attentions of the XML processor. In a document, a CDATA section instructs the parser not to interpret the data within it as markup. CDATA stands for character data. You can mark a section as character data using this syntax:

    <![CDATA[content]]>

Ex1:

    <Document>
    <![CDATA[ if a<b and b<c then a<c]]>
    </Document>

Between the start of the section, <![CDATA[, and the end of the section, ]]>, all character data is passed directly to the application. Comments are not recognized in a CDATA section. Here the parser has detected the presence of a CDATA section and waved the entire string if a<b and b<c then a<c through to the application directly.

Ex2:

    <![CDATA[this is not an <apple> start tag]]>
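The shielding effect can be seen by parsing the two examples and inspecting what reaches the application. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the `note` wrapper element around Ex2 is added only so that the fragment forms a complete document.

```python
import xml.etree.ElementTree as ET

# Ex1: the < characters inside the CDATA section are not treated as markup.
doc = ET.fromstring("<Document><![CDATA[ if a<b and b<c then a<c]]></Document>")
shielded = doc.text

# Ex2: something that looks like a start tag stays plain character data.
note = ET.fromstring("<note><![CDATA[this is not an <apple> start tag]]></note>")

print(repr(shielded))
print(repr(note.text), "child elements:", len(note))
```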


CDATA sections can occur anywhere character data can occur. Because the first occurrence of ]]> will terminate the CDATA section, CDATA sections cannot be nested. Here is an example of a valid CDATA section:

Ex3:

    <?xml version="1.0"?>
    <apples><![CDATA[this is not an </apple> end tag and this is not an &entity; reference]]>
    </apples>

Note the shielding effect of the CDATA section, which protects what looks like an apple end tag and what looks like an entity reference.

In element type models, the keyword #PCDATA denotes character data.

    <!ELEMENT para (#PCDATA)>

#PCDATA means zero or more characters.

PROCESSING INSTRUCTION (PI)

PIs are markup that provides information to be used by software applications. A PI begins with <? and ends with ?>. XML itself uses the same syntax in what is known as the XML declaration. The simplest form, which should head up every XML document, is

    <?xml version="1.0"?>

DOCUMENT TYPE DECLARATION

1. A Document Type Declaration is a statement embedded in an XML document whose purpose is to acknowledge the existence and location of a Document Type Definition (DTD).
2. A Document Type Declaration is a statement that points to the Document Type Definition (DTD).
3. A Document Type Definition is a set of rules that defines the structure of an XML document, whereas a Document Type Declaration is a statement that tells the parser which DTD to use for checking and validation.
4. All Document Type Declarations start with the string <!DOCTYPE
5. The DTD referred to by a Document Type Declaration can be external or internal.
6. If external, the DTD must be specified either as SYSTEM or PUBLIC.
7. If PUBLIC, the DTD can be used by anyone by referring to its URL.
8. If SYSTEM, it typically resides on the local hard disk and may not be available for use by other applications.

For example, suppose there is an XML document called myfile.xml that we want to parse and validate against a DTD called my-rules.dtd.


The way to associate the content structure of myfile.xml with the rules specified in my-rules.dtd is to insert the following line after the XML declaration. The DTD can be housed in the external subset, the internal subset or both. The simplest declaration, which allows only an external subset of the DTD, follows.

    <!DOCTYPE name SYSTEM "file">

If the DTD is internal the syntax is

    <!DOCTYPE root-element [<!internal type definition>]>

-----------------------------------------------------

XML Examples

Ex1

    <?xml version="1.0"?>
    <hello>
      <wel>welcome to XML</wel>
      <xm>
        <xm1>XML is eXtensible Markup Language</xm1>
        <xm2>
          <sg1>SGML is Standard Generalized Markup Language</sg1>
          <sg2>XML is a subset of SGML</sg2>
        </xm2>
      </xm>
    </hello>

WELL-FORMED XML DOCUMENT FOR ACMEPC CATALOG APPLN

Step 1: Open Notepad and type this code.

    <?xml version="1.0"?>
    <!--commented out for the moment
    <!DOCTYPE acmepc SYSTEM "acmepc.dtd">-->
    <acmepc>
      <item type="PC" code="ACME1">
        <make>
          <brand>Acme Deluxe</brand>
          <supplier id="Acme"/>
        </make>
        <specification>
          <cpu type="986" speed="500"/>
          <harddisk type="IDE" size="20" units="GB"/>
        </specification>
        <price n="1000" units="USD"/>
        <feature>
          <p>A versatile PC for &amp; business use</p>
          <ul>
            <li><index term="multimedia"/>Sound Card</li>
            <li>90 day money back guarantee</li>
            <li>Mouse pad</li>
          </ul>
        </feature>
        <internal>
          <instock num="44"/>
          <perunitcost n="900" units="USD"/>
        </internal>
      </item>
    </acmepc>

Step 2: Save this file as catalog.xml and open the file in the browser. In the browser the well-formed XML is parsed into a tree-like structure as shown below. For a proper display in the browser we need XSL.

    <?xml version="1.0" ?>
    - <acmepc>
      - <item type="PC" code="ACME1">
        + <make>
        + <specification>
          <price n="1000" units="USD" />
        + <feature>
        + <internal>
        </item>
      </acmepc>
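The tree that the browser displays can also be produced programmatically. The sketch below assumes Python's standard `xml.etree.ElementTree` and a hypothetical `outline` helper; to stay short it uses a trimmed copy of the catalog (the feature element is left out), so it illustrates the idea rather than reproducing the browser's display.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the acmepc catalog (feature element omitted for brevity).
CATALOG = """<?xml version="1.0"?>
<acmepc>
  <item type="PC" code="ACME1">
    <make><brand>Acme Deluxe</brand><supplier id="Acme"/></make>
    <specification>
      <cpu type="986" speed="500"/>
      <harddisk type="IDE" size="20" units="GB"/>
    </specification>
    <price n="1000" units="USD"/>
    <internal><instock num="44"/><perunitcost n="900" units="USD"/></internal>
  </item>
</acmepc>"""


def outline(element, depth=0):
    """Return one indented line per element, mirroring the tree view."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(CATALOG)
item = root.find("item")
item_children = [child.tag for child in item]

# Attribute values arrive as strings; convert before doing arithmetic.
price = int(item.find("price").get("n"))
unit_cost = int(item.find("internal/perunitcost").get("n"))
margin = price - unit_cost

print("\n".join(outline(root)))
print("margin:", margin, item.find("price").get("units"))
```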
