
Welcome to:

What Is XML?

Copyright IBM Corporation 2004


Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

3.1

Unit Objectives
After completing this unit, you should be able to:
Describe the basic rules of XML
Describe what it means for an XML document to be well-formed
List the components that make up an XML document
Differentiate between XML and HTML
Describe the internationalization support in XML
Define some best practices for XML


What Is XML?
At its core XML is text formatted to follow a well-defined set of rules.
XML documents consist primarily of tags and text.
If you've ever seen the source of an HTML document, then the
XML structure should look familiar.
This text may be stored/represented in:
A normal file stored on disk
A message being sent over HTTP
A character string in a programming language
A CLOB (character large object) in a database
Any other way textual data can be used
XML documents do not need to exist as documents; they may be:
Byte streams sent between applications
Fields in a database record
Collections of XML Infoset information items
For simplicity they will be referred to as though they are
documents and files.
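
As a small illustration of XML held as "a character string in a programming language", the sketch below parses a book document from a string and lists its elements as an indented outline. It assumes Python's standard-library xml.etree.ElementTree parser, which is not part of the original course material; the outline() helper is a hypothetical name introduced here.

```python
import xml.etree.ElementTree as ET

# A small book document held as an ordinary character string.
BOOK_XML = """<?xml version="1.0"?>
<book>
  <author>Tom Wolfe</author>
  <title>The Right Stuff</title>
  <price>$6.00</price>
</book>"""


def outline(element, depth=0):
    """Return one indented line per element: the tag plus any trimmed text."""
    text = (element.text or "").strip()
    label = f"<{element.tag}>" + (f' "{text}"' if text else "")
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(BOOK_XML)
print("\n".join(outline(root)))
```

The same call works whether the text came from a file, an HTTP message, or a database field, since the parser only sees the characters.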

Example Tree Representation of XML


XML documents should be thought of as a hierarchical tree
structure.

<?xml version="1.0"?>
<book>
<author>
Tom Wolfe
</author>
<title>
The Right Stuff
</title>
<price>
$6.00
</price>
</book>

ROOT
  <book>
    <author>  "Tom Wolfe"
    <title>   "The Right Stuff"
    <price>   "$6.00"


A Simple XML Document - Basic Structure


<?xml version="1.0"?>                  "Optional" first line; only required if
                                       encoding is not UTF-8 or UTF-16
<book>                                 Root element start tag
<title>                                First child element with data
Alphabet from A to Z
</title>
<isbn number="1112-23-4356" />         Empty element (no data)
<author>                               Begin element tag
<firstName>Boreng</firstName>          Nested child elements
<lastName>Riter</lastName>
</author>                              End element tag
<chapter title="Letter A">             Element containing an attribute and
The letter A is the first in           parsed character data (PCDATA)
the alphabet. It is also the
first of five vowels.
</chapter>
<!-- The rest of the letter            Comment
chapters are missing -->
<chapter title="Letter Z">             Last element in document
The letter Z is the last
letter in the alphabet.
</chapter>
</book>                                Root element end tag


A Simple XML Document - Basic Nomenclature


The XML instance on the previous page consists of:
One main element book
Subelements title, isbn, author, and chapter, plus a comment
Author contains other subelements firstName and lastName
ISBN and chapter contain attributes number and title, respectively
Title, firstName, and lastName contain only strings:
Elements that contain numbers, strings, dates, and so forth but no
subelements (or attributes) are said to have simple types
ISBN and chapter carry attributes; author has subelements:
Elements that contain subelements or carry attributes are said to have
complex types
Attributes always have simple types (that is, they are numbers, strings,
dates, and so forth).
A later chapter describes XML Schemas, which have access to
a collection of built-in simple types


Basics of Well-formed XML (1 of 2)


XML documents are considered to be well-formed when they
adhere to a set of five rules that define basic XML syntax and
structure + a sixth for worldwide conformity.
1. There must be a single root element:

All other elements are nested inside the root element

2. Elements must be properly terminated:

For every opening tag "<...>" there must be a matching closing tag
"</...>"
The exception is an empty (no content or body) tag "<.../>"

3. Elements must be properly nested underneath a parent tag


(except for the single, root element):
A nested tag-pair may not overlap another tag
There is no limit to the nesting level of children elements


Basics of Well-formed XML (2 of 2)


4. Tag names are case sensitive:

All element and attribute names must comply with XML naming rules;
attribute values and data are not subject to those naming rules.

5. Attributes, extra information that can be provided for elements,


must be properly quoted:
That is, all attribute values must be in quotes.

6. The first line should contain the XML declaration, which identifies
the version of the XML specification to apply:
XML 1.0 is currently the most common.
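
One way to see these rules enforced is to hand small fragments to a parser and note which ones it rejects. The sketch below assumes Python's standard-library xml.etree.ElementTree; is_well_formed() is a hypothetical helper introduced for illustration, and the sample fragments are made up to exercise one rule each.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# One fragment per rule; only the first follows all of them.
samples = {
    "single root": "<colors><color>red</color></colors>",
    "two roots": "<color>red</color><color>green</color>",
    "missing end tag": "<root><class id='XML'></root>",
    "overlapping tags": "<a><b></a></b>",
    "case mismatch": "<color>red</COLOR>",
    "unquoted attribute": "<book isbn=34323></book>",
}

for name, text in samples.items():
    print(f"{name}: {is_well_formed(text)}")
```

A parser stops at the first violation it finds, so a fragment that breaks several rules is still reported only once.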


Element Rules - Rule 1. Single Root Element


All XML documents must have a single root element.

Legal:

<?xml version="1.0"?>
<colors>
  <color>red</color>
  <color>green</color>
</colors>

Colors is the root element for this XML.

Not legal:

<?xml version="1.0"?>
<color>red</color>
<color>green</color>

Color represents multiple root elements.


Element Rules - Rule 2. Element Tag Rules


Elements consist of start and end tags.
End tag is identified by the /.
Example:
<color>red</color>
Elements may contain attributes within the start tag.
Example:
<book isbn="34323"></book>
Note: The attribute is isbn.
Empty elements contain no child elements or data.
These elements can be represented with a special shorthand
notation.
Example:
<record key="123"></record>
Can be shortened to:
<record key="123" /> (preferred)
Or, if the element also has no attributes, as:
<record />
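
The long and short forms of an empty element carry the same information. As a quick check, the sketch below parses each form and compares the results; it assumes Python's standard-library xml.etree.ElementTree, which is not part of the original course material.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring('<record key="123"></record>')
short_form = ET.fromstring('<record key="123" />')
bare = ET.fromstring("<record />")

# Each line shows: tag, attributes, text content, number of child elements.
for element in (long_form, short_form, bare):
    print(element.tag, element.attrib, element.text, len(element))
```

With this parser, an element that has no character data reports its text as None in both forms.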


Element Rules - Rule 3. Element Nesting


Elements must be properly nested.
The end tags of inner elements must occur before the end tags of
outer elements.
Any number of child elements or data may be nested within the start
and end tags of an element.


Element Nesting Example


Legal:

<?xml version="1.0"?>
<shirt>
  <style>Polo</style>
  <color>red</color>
  <size>large</size>
</shirt>

All elements are properly nested.

Not legal:

<?xml version="1.0"?>
<shirt>
<style>
<size>large
<color>red
Polo
</style>
</size></color>
</shirt>

The element tags are mixed up and not ordered.

Best Practice: Use indentation to represent the document's hierarchy.


Important if your document will likely be read by humans.
Computers and programs don't usually care.
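
Indentation can also be applied by a tool rather than by hand. The following is a minimal sketch assuming Python 3.9 or later, where the standard-library function xml.etree.ElementTree.indent() is available; the two-space width is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# The shirt document written on a single line, as a program might produce it.
root = ET.fromstring(
    "<shirt><style>Polo</style><color>red</color><size>large</size></shirt>"
)

ET.indent(root, space="  ")  # assumes Python 3.9+; inserts whitespace in place
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

The added whitespace becomes part of the document's character data, which is harmless here but can matter for elements with mixed content.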

Element Rules - Rule 4. XML Naming Rules


XML name construction:
The first character must be a letter (such as A-Z or a-z) or _ (underscore);
a colon is also permitted but is best left to namespace prefixes.
Any number of subsequent letters, numbers, hyphens,
periods, colons, and underscore characters.
XML names are case sensitive.
Names cannot contain spaces.
Names must not have a prefix of xml in any case combination
(such names are reserved).
Best Practice: Brevity in tag names is not necessary.
Use descriptive names for elements and attributes.
<Queue> or <que> is far better than <q>.
Best Practice: Maintain standard naming conventions and
quoting.
Camelback, dot and underscore notation are all common
(For example, camelBackNotation, dot.notation, and
underscore_notation).
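
The naming rules on this slide can be approximated with a regular expression. The sketch below is deliberately simplified: it covers ASCII letters only and follows the slide's wording rather than the full character ranges in the XML specification. check_name() is a hypothetical helper introduced for illustration.

```python
import re

# Simplified rule from the slide: a letter or underscore first, then any
# number of letters, digits, hyphens, periods, colons, or underscores.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.:\-]*")


def check_name(name):
    """Classify a candidate XML name as 'legal', 'reserved', or 'not legal'."""
    if not _NAME.fullmatch(name):
        return "not legal"
    if name.lower().startswith("xml"):
        return "reserved"  # names beginning with xml in any case are reserved
    return "legal"


for candidate in ["title", "_street", "name:first", "1name", "f name", "xmlName"]:
    print(f"{candidate!r}: {check_name(candidate)}")
```

Note that nameXML passes the check: only the prefix xml is reserved, not the letters appearing elsewhere in a name.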

Rule 4. Tag Naming - Samples


Examples of legal and illegal element names:
Legal: title, book.isdn, lastName, _street, addrLine1, name:first
Not legal: 1name, -street, &name

Element names are case sensitive and start and end tags must match:
Legal: <color>red</color> and <SIZE>small</SIZE>
Not legal: <color>red</COLOR> and <SIZE>small</SiZe>

Element names must not contain spaces:
Legal: <fname>John</fname>
Not legal: <f name>John</f name>

Element names must not begin with xml (in any case combination), which is reserved by the W3C:
Legal: <nameXML>John</nameXML>
Not legal: <xmlName>John</xmlName>


Rule 4. Element Content (1 of 2): General


An XML instance is composed of elements expressed in tag pairs
(except for empty tags) plus optional attributes that always have
quoted values and optional data that appears between the element
start tag and the element end tag.
Mixed content - element content that contains data (PCDATA is
shown) and other elements.
Example (snippet):
<title><ref>XML</ref> Example</title>
<chapter>
Chapter information
<para>What is XML</para>
<para>What is HTML</para>
More chapter information
</chapter>
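
How a program sees mixed content depends on the parser's data model. As one illustration, the sketch below assumes Python's standard-library xml.etree.ElementTree, where character data before the first child is stored in the parent's text and character data after a child is stored in that child's tail.

```python
import xml.etree.ElementTree as ET

# The chapter snippet above, written without extra whitespace.
chapter = ET.fromstring(
    "<chapter>Chapter information"
    "<para>What is XML</para>"
    "<para>What is HTML</para>"
    "More chapter information</chapter>"
)

paras = chapter.findall("para")
print(repr(chapter.text))       # data before the first child element
print([p.text for p in paras])  # data inside each para element
print(repr(paras[-1].tail))     # data following the last para element

# itertext() walks all character data in document order.
all_text = "".join(chapter.itertext())
print(all_text)
```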


Rule 4. Element Content (2 of 2): Data


Element data content is handled in one of two ways:
1. Parsed Character Data (PCDATA): is examined by the XML
parser to discover XML content embedded within it.
2. Character Data (CDATA): is delimited by the special syntax
<![CDATA[ ... ]]> and is not processed by the parser.


Rule 4. PCDATA - Parsed Character Data


Predefined entities exist to address ambiguous syntax situations,
situations where the literal would be interpreted as part of the
XML document syntax rather than its content.

Entity    Character    Description
&lt;      <            "less than"
&gt;      >            "greater than"
&amp;     &            "ampersand"
&apos;    '            "apostrophe"
&quot;    "            "quote"

Examples:
<range>&gt; 6 &amp; &lt; 20</range>
<quotes characters="'&quot;'"/>
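
Programs that generate XML usually rely on a library to apply these entities. The sketch below assumes Python's standard-library xml.sax.saxutils module and reproduces the two examples above; it is an illustration, not part of the original course material.

```python
from xml.sax.saxutils import escape, quoteattr, unescape

# Element content: &, < and > are replaced by their predefined entities.
raw = "> 6 & < 20"
escaped = escape(raw)
print(f"<range>{escaped}</range>")

# unescape() reverses the substitution.
restored = unescape(escaped)

# Attribute values: quoteattr() picks the quotes and escapes as needed.
attr = quoteattr("'\"'")
print(f"<quotes characters={attr}/>")
```

By default escape() handles only &, < and >; quote characters matter only inside attribute values, which is why quoteattr() exists as a separate function.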


Rule 4. CDATA - Character Data


Syntax:
<![CDATA[ ...Anything can go here... ]]>
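Note: Anything except the literal string "]]>";
to include "]]>" in the data, end the CDATA section and write it as "]]&gt;"
in parsed character data, or split it across two CDATA sections.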
CDATA is not parsed and is treated as-is.
Useful for embedding other languages within the XML.
HTML documents.
XML documents.
JavaScript source.
Or any other text with a lot of special characters.
Generally speaking the escaping rules inside a CDATA section are
those of the embedded language.
For example, an ampersand in JavaScript is written literally as &; a
reference such as &#38; is not expanded inside a CDATA section.


Rule 4. CDATA Examples


These script elements contain the same JavaScript. The first uses a CDATA
section, so < and & are written literally:
<script><![CDATA[
function matchwo(a,b) {
  if (a < b && a < 0)
    { return 1 }
  else
    { return 0 }
}
]]></script>

The second uses parsed character data, so < and & must be escaped:
<script>
function matchwo(a,b) {
  if (a &lt; b &#38;&#38; a &lt; 0)
    { return 1 }
  else
    { return 0 }
}
</script>

This nameXML element stores actual XML to be treated as text:


<nameXML>
<![CDATA[
<name common="freddy" breed="springer-spaniel">
Sir Frederick of Ledyard's End
</name>
]]>
</nameXML>
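
To confirm that markup inside a CDATA section reaches the application as plain text, the sketch below parses a one-line version of the nameXML example. It assumes Python's standard-library xml.etree.ElementTree; the parses() helper is a hypothetical name introduced here.

```python
import xml.etree.ElementTree as ET

doc = (
    "<nameXML><![CDATA["
    '<name common="freddy" breed="springer-spaniel">'
    "Sir Frederick of Ledyard's End</name>"
    "]]></nameXML>"
)
element = ET.fromstring(doc)
print(element.text)   # the embedded markup, delivered as a string
print(len(element))   # number of child elements


def parses(xml_text):
    """Return True if the parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The same script text is accepted inside CDATA and rejected outside it.
with_cdata = "<script><![CDATA[if (a < b && a < 0) { return 1 }]]></script>"
without_cdata = "<script>if (a < b && a < 0) { return 1 }</script>"
print(parses(with_cdata), parses(without_cdata))
```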

Element Rules - Rule 5. Element Attributes


Attributes are used to attach information to elements.
Attributes consist of a name="value" pair, where the name is a legal
XML name. This is often referred to as a "key-value" pair.
Attributes are placed in the start tag of the element to which they
apply.
An element may have several attributes, each uniquely named.
Examples:
<title type="section" number="1">XML overview</title>
<title type="boat" state="FL">Yacht</title>

Notice the different usage of the attribute "type" in the two elements;
semantically they are not the same.
Attributes must have a value.
Values must be quoted with either double or single quotes.
Convention is to stick with one or the other.
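
The attribute rules can be tried out with a parser. The sketch below assumes Python's standard-library xml.etree.ElementTree; the parses() helper and the rejected fragments are made up for illustration.

```python
import xml.etree.ElementTree as ET

title = ET.fromstring('<title type="section" number="1">XML overview</title>')
print(title.attrib)          # all attributes as name/value pairs
print(title.get("number"))   # a single attribute value, always a string
print(title.get("state"))    # None when the attribute is absent


def parses(xml_text):
    """Return True if the parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


mixed_quotes = "<title type='boat' state=\"FL\">Yacht</title>"  # allowed
duplicate = '<title type="section" type="boat"/>'               # same name twice
no_value = '<input type="radio" checked/>'                      # attribute without a value
print(parses(mixed_quotes), parses(duplicate), parses(no_value))
```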


Element Rules - Rule 6. XML Declaration (1 of 2)

The XML Declaration is an optional first line in all XML documents:
<?xml version="1.0" ?>
<?xml version="1.0" encoding="UTF-8" ?>
<?xml version="1.0" standalone="yes"?>

If this declaration is used, the version attribute is mandatory.


The encoding attribute indicates the character encoding used in the
document; if UTF-8 or UTF-16 is used it may be omitted.
ASCII is a subset of UTF-8 and need not be declared.
Comments are not allowed before this statement.
The XML Declaration follows the syntax of a Processing Instruction or PI,
which is described on a subsequent chart, but it is considered to be
unique and is treated separately in the 1.0 XML specification.
GENERAL NOTE OF CAUTION: You can not always rely on a browser or
tool to completely/correctly enforce the specifications. Nor are the
specifications always written in language that, to a particular reader, is
unambiguous. Still, the best advice is when in doubt, refer to the
specification, which for XML is www.w3.org/XML.

Element Rules - Rule 6. XML Declaration (2 of 2)

The standalone attribute is included here for completeness: it is used to
indicate if this XML document depends on information declared externally to
this document (in an external DTD, for example); value may be yes
or no.
A value of "yes" indicates there are no external markup declarations; if
there are no external markup declarations, the declaration has no
meaning.
A value of "no" indicates there are or may be such external markup
declarations; if there are such declarations but there is no standalone
declaration, "no" is assumed.
. . . so it is typically not used.
In any event, the inclusion in the XML instance of references to external
entities, such as those in an embedded DTD, does not change its
standalone status.
A bigger issue associated with the stand-alone attribute is that of defining or
setting values in any entity that may be external to the XML instance.
Arguably, the principal reason for using XML is that it explicitly defines the
elements it includes. If attribute values are overridden then the XML
instance before us is no longer declarative.

Comments
<!-- --> Defines a comment.
A space after the beginning and before the trailing hyphens is
recommended but not required.
<?xml version="1.0"?>
<!-- This is a comment. They can go anywhere
inside an XML document except within an element
tag.
-->
<book>
<chapter>A is the first letter</chapter>
<!-- Here is another comment. -->
<chapter>Z is the last letter</chapter>
</book>
Improper usage:
<chapter <!-- comment -->>Some text.</chapter>
...or before the XML Declaration statement.

Internationalization and Encoding (1 of 2)


Support for different character encodings is provided through the
encoding attribute of the XML Declaration.
<?xml version="1.0" encoding="charset"?>
The encoding attribute names the character encoding in which the
document is stored.
In the absence of an encoding declaration, the document is expected to
be encoded in UTF-8 or UTF-16.
Documents exchanged via network may be presented to the
processor in an encoding format other than the specified encoding
as long as the transport protocol (for example, HTTP) indicates the
encoding used.


Internationalization and Encoding (2 of 2)


It is very important that the editor and operating system used to
write and save an XML document support the encoding specified in
the XML Declaration.
Sample encoding declarations:
Latin-1 (Western European; a superset of ASCII)
<?xml version="1.0" encoding="ISO-8859-1"?>
16 bit UNICODE
<?xml version="1.0" encoding="UTF-16"?>
<?xml version="1.0" encoding="ISO-10646-UCS-2"?>
...
Japanese
<?xml version="1.0" encoding="ISO-2022-JP"?>
<?xml version="1.0" encoding="Shift_JIS"?>
...
Note: Encoding names are case-insensitive
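
The effect of the encoding declaration shows up when the same text is stored as different bytes. The sketch below assumes Python's standard-library xml.etree.ElementTree and a made-up name element; it parses Latin-1 bytes with a matching declaration, UTF-8 bytes with no declaration, and Latin-1 bytes whose declaration is missing.

```python
import xml.etree.ElementTree as ET

name = "Jos\u00e9"  # contains one non-ASCII character

# Declared as ISO-8859-1 and stored as ISO-8859-1: the parser decodes it.
declared = f'<?xml version="1.0" encoding="ISO-8859-1"?><name>{name}</name>'
from_latin1 = ET.fromstring(declared.encode("iso-8859-1")).text

# No declaration, stored as UTF-8: the default applies.
from_utf8 = ET.fromstring(f"<name>{name}</name>".encode("utf-8")).text

# No declaration, but stored as ISO-8859-1: the bytes are not valid UTF-8.
try:
    ET.fromstring(f"<name>{name}</name>".encode("iso-8859-1"))
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False

print(from_latin1, from_utf8, mismatch_accepted)
```

The third case is the one to watch for in practice: an editor that saves in a legacy encoding without a matching declaration produces a file that a conforming parser rejects.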

Processing Instruction
Syntax <? target arg*?>
Processing Instruction is often abbreviated as PI in
documentation.
A feature inherited from SGML.
Used to embed application-specific instructions in documents.
The target name immediately follows "<?" and is used to
associate the PI with an application.
May include zero or more arguments.
May be preceded by comments.

For example, <?xml-stylesheet href="common.css" type="text/css"?>
associates the stylesheet common.css with the document for simple formatting.


Well-formed versus Valid


A well-formed XML document:
Consists of XML elements that are properly nested within one another.
Has a unique root element.
Follows the XML naming conventions.
Follows the XML rules for quoting attributes.
Has tags that are properly terminated.
All XML parsers check for well-formedness.
A valid XML document has an associated vocabulary and obeys the
structural rules specified by that vocabulary.
Associated vocabulary is typically defined by either a DTD or an
XML Schema.
XML parsers may be validating or non-validating depending upon
whether or not they can apply an associated grammar.
Studio is an example of a tool whose XML capabilities include
validation.


HTML versus XML (1 of 2)

HTML is about presentation and browsing

XML is about structured information interchange

<course>
<name>Java Programming</name>
<department>EECS</department>
<teacher>
<name>Paul Thompson</name>
</teacher>
<student>
<name>Ron Jones</name>
</student>
<student>
<name>Uma Abingdon</name>
</student>
<student>
<name>Lindsay Garmon</name>
</student>
</course>


HTML versus XML (2 of 2)


HTML:

<html>
<title>Course Roster</title>
<body>
<center>
<h1>Course Roster</h1>
<h2>XML Programming</h2>
<h3>Department: EECS</h3>
<p>
<table border=2>
<tr>
<th>Teacher</th>
<td>Paul Thompson</td>
</tr><tr>
<th>Student<br>List</th>
<td>Ron Jones<br>
Uma Abingdon<br>
Lindsay Garmon
</td>
</tr>
</table>
</center>
</body>
</html>

XML:

<?xml version="1.0"?>
<course>
<name>Java Programming</name>
<department>EECS</department>
<teacher>
<name>Paul Thompson</name>
</teacher>
<student>
<name>Ron Jones</name>
</student>
<student>
<name>Uma Abingdon</name>
</student>
<student>
<name>Lindsay Garmon</name>
</student>
</course>
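
Because the XML version labels each piece of data, a program can pick out the teacher and the students by name rather than by position in a table. The sketch below assumes Python's standard-library xml.etree.ElementTree and repeats the course document as a string so that it is self-contained.

```python
import xml.etree.ElementTree as ET

COURSE_XML = """<?xml version="1.0"?>
<course>
  <name>Java Programming</name>
  <department>EECS</department>
  <teacher>
    <name>Paul Thompson</name>
  </teacher>
  <student><name>Ron Jones</name></student>
  <student><name>Uma Abingdon</name></student>
  <student><name>Lindsay Garmon</name></student>
</course>"""

course = ET.fromstring(COURSE_XML)

# Paths are relative to the course element; findtext returns the text content.
course_name = course.findtext("name")
teacher = course.findtext("teacher/name")
students = [student.findtext("name") for student in course.findall("student")]

print(course_name, teacher, students)
```

Extracting the same facts from the HTML roster would mean relying on presentation details such as which table row holds the student list.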


HTML and XML Key Differences


HTML: Predefined tags define how to present data.
XML: Defines its own tags to identify data.

HTML: Allows missing end tags. <br> and <p>
XML: Requires matching end tags.

HTML: Attributes do not require quotes. <img src=myDog.jpeg>
XML: Attributes must be quoted. <book isdn="3432"></book>

HTML: Attributes do not require a value. <input type=radio checked>
XML: Attributes must have a value. <device type="radio" />

HTML: Tolerates non-nested tags. <H1><center>Hello!</H1></center>
XML: Strict nesting and tag matching rules. <H1><center>Hello!</center></H1>

HTML: Browsers will almost always do a "best guess" on ill-formed HTML.
XML: XML parsers will generate a fatal exception for well-formedness violations.

HTML: Does not support empty elements, but allows single start tags. <br> and <hr>
XML: Provides for empty elements. <device type="radio" />

HTML: Is not case sensitive. <TABLE> ... </table> is valid
XML: Is case sensitive. <name>test</name>


Checkpoint Questions (1 of 3)
1. Basic XML can be described as:
A. A hierarchical structure of tagged elements, attributes and text.
B. All the HTML tags plus a set of new XML only tags.
C. Object-oriented structure of rows and columns.
D. Processing instructions (PIs) for text data.
E. Textual data with tags for visual presentation.
2. Which of these XML fragments is not well-formed?
A. <root><class>XML</class></root>
B. <class><root>XML</root></class>
C. <root><class id="XML"></root>
D. <root>XML<class id="XML"/>XML</root>
E. <root class="XML"><class id="root"/>XML</root>


Checkpoint Questions (2 of 3)
3. XML Comments are allowed (Select all that apply):
A. Before the XML Declaration
B. Anywhere
C. Between element tags
D. Before the root element
E. All of the Above
4. Which of these XML elements with attributes is not well-formed?
A. <name first='Tony' LAST="Romeo" />
B. <name name="Tony" NAME="ROMEO" />
C. <_name_ first-name="Tony" last-name="Romeo"/>
D. <name="Tony Romeo" />
E. <name name="first='Tony' last='Romeo'" />
F. All of the Above


Checkpoint Questions (3 of 3)
5. Which of these comments regarding HTML and XML is not true?
A. HTML markup is focused on presentation.
B. XML markup is based on defining the data.
C. XML is based on HTML.
D. HTML tags are not case sensitive.
E. XML tags are case sensitive.
F. Both XML and HTML support attributes.


Unit Summary
Having completed this unit, you should be able to:
Describe the basic rules of XML
Describe what it means for an XML document to be well-formed
List the components that make up an XML document
Describe the differences between XML and HTML
Describe the internationalization support in XML
Describe some best practices in XML

