Lecture: Introduction to XML - What Is XML?

Welcome to:
What Is XML?
Copyright IBM Corporation 2004

Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
3.1
Unit Objectives
After completing this unit, you should be able to:
Describe the basic rules of XML
Describe what it means for an XML document to be well-formed
List the components that make up an XML document
Differentiate between XML and HTML
Describe the internationalization support in XML
Define some best practices for XML
What Is XML?
At its core XML is text formatted to follow a well-defined set of rules.
XML documents consist primarily of tags and text.
If you've ever seen the source of an HTML document, then the
XML structure should look familiar.
This text may be stored/represented in:
A normal file stored on disk
A message being sent over HTTP
A character string in a programming language
A CLOB (character large object) in a database
Any other way textual data can be used
XML documents do not need to exist as documents; they may be:
Byte streams sent between applications
Fields in a database record
Collections of XML Infoset information items
For simplicity they will be referred to as though they are
documents and files.
Example Tree Representation of XML

XML documents should be thought of as a hierarchical tree structure.
<?xml version="1.0"?>
<book>
  <author>
    Tom Wolfe
  </author>
  <title>
    The Right Stuff
  </title>
  <price>
    $6.00
  </price>
</book>
Tree representation:
ROOT
  <book>
    <author>
      "Tom Wolfe"
    <title>
      "The Right Stuff"
    <price>
      "$6.00"
A Simple XML Document - Basic Structure

<?xml version="1.0"?>                    "Optional" first line; only required if encoding is not UTF-8 or UTF-16
<book>                                   Root element start tag
  <title>                                First child element with data
    Alphabet from A to Z
  </title>
  <isbn number="1112-23-4356" />         Empty element (no data)
  <author>                               Begin element tag
    <firstName>Boreng</firstName>        Nested child elements
    <lastName>Riter</lastName>
  </author>                              End element tag
  <chapter title="Letter A">             Element containing an attribute and parsed character data (PCDATA) [TBD]
    The letter A is the first in
    the alphabet. It is also the
    first of five vowels.
  </chapter>
  <!-- The rest of the letter ... -->    Comment
  <chapter title="Letter Z">             Last element in document
    The letter Z is the last
    letter in the alphabet.
  </chapter>
</book>                                  Root element end tag
A Simple XML Document Basic Nomenclature

The XML instance on the previous page consists of:
One main element, book
Subelements title, isbn, author, and chapter, plus a comment
The author element contains the subelements firstName and lastName
The isbn and chapter elements carry the attributes number and title, respectively
The title, firstName, and lastName elements contain only strings:
Elements that contain numbers, strings, dates, and so forth (TBD) but no
subelements (or attributes) are said to have simple types
The isbn and chapter elements carry attributes; author has subelements:
Elements that contain subelements or carry attributes are said to have
complex types
Attributes always have simple types (that is, they are numbers, strings,
dates, and so forth).
TBD: A later chapter describes XML Schemas, which have access to
a collection of built-in simple types
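The simple/complex distinction can be checked mechanically. The sketch below uses Python's standard xml.etree.ElementTree on a compact copy of the book example; the helper names classify and describe are illustrative and not part of the course material.

```python
import xml.etree.ElementTree as ET

# Compact copy of the book example from the previous page.
BOOK_XML = """<book>
  <title>Alphabet from A to Z</title>
  <isbn number="1112-23-4356" />
  <author>
    <firstName>Boreng</firstName>
    <lastName>Riter</lastName>
  </author>
  <chapter title="Letter A">The letter A is the first in the alphabet.</chapter>
  <chapter title="Letter Z">The letter Z is the last letter in the alphabet.</chapter>
</book>"""

def classify(element):
    """Return 'complex' if the element has subelements or attributes, else 'simple'."""
    has_children = len(element) > 0
    has_attributes = len(element.attrib) > 0
    return "complex" if (has_children or has_attributes) else "simple"

def describe(element, depth=0):
    """Yield one indented line per element, walking the tree top-down."""
    yield f"{'  ' * depth}{element.tag}: {classify(element)}"
    for child in element:
        yield from describe(child, depth + 1)

root = ET.fromstring(BOOK_XML)
report = list(describe(root))
print("\n".join(report))
```

The indentation in the report mirrors the nesting depth, so the output doubles as a rough tree view of the document.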
Basics of Well-formed XML (1 of 2)

XML documents are considered to be well-formed when they
adhere to a set of five rules that define basic XML syntax and
structure, plus a sixth for worldwide conformity.
1. There must be a single root element:
All other elements are nested inside the root element
2. Elements must be properly terminated:
For every opening tag "<...>" there must be a matching closing tag
"</...>"
The exception is an empty (no content or body) tag "<.../>"
3. Elements must be properly nested underneath a parent tag
(except for the single root element):
A nested tag pair may not overlap another tag
There is no limit to the nesting level of child elements
Basics of Well-formed XML (2 of 2)

4. Tag names are case sensitive:
All element and attribute names must comply with XML naming rules;
attribute values and data are not subject to those rules.
5. Attributes, extra information that can be provided for elements,
must be properly quoted:
That is, all attribute values must be in quotes.
6. The first line should contain the XML declaration (it is optional in
XML 1.0), which identifies the version of the XML specification to apply:
XML 1.0 is currently the most common.
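One way to see these rules in action is to hand small fragments to a parser and watch which ones it rejects. This is a minimal sketch using Python's standard xml.etree.ElementTree; the helper is_well_formed and the sample labels are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text, False on a well-formedness error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "single root":        "<colors><color>red</color></colors>",
    "two roots":          "<color>red</color><color>green</color>",
    "unterminated":       "<colors><color>red</colors>",
    "overlapping":        "<a><b></a></b>",
    "case mismatch":      "<color>red</COLOR>",
    "unquoted attribute": "<book isbn=34323></book>",
    "with declaration":   '<?xml version="1.0"?><book isbn="34323"/>',
}
results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name:20} {'well-formed' if ok else 'NOT well-formed'}")
```

Only the first and last samples should be accepted; each of the others breaks exactly one of rules 1 through 5.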
Element Rules - Rule 1. Single Root Element

All XML documents must have a single root element.
Legal:
<colors>
  <color>red</color>
  <color>green</color>
</colors>
Colors is the root element for this XML.
Not legal:
<color>red</color>
<color>green</color>
Color represents multiple root elements.
Element Rules - Rule 2. Element Tag Rules

Elements consist of start and end tags.
End tag is identified by the /.
Example:
<color>red</color>
Elements may contain attributes within the start tag.
Example:
<book isbn="34323"></book>
Note: The attribute is isbn.
Empty elements contain no child elements or data.
These elements can be represented with a special shorthand
notation.
Example:
<record key="123"></record>
Can be shortened to:
<record key="123" /> (preferred)
Or, if the element also has no attributes, as:
<record />
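The two spellings of an empty element are equivalent to a parser. A small sketch with Python's standard xml.etree.ElementTree shows this; the summary helper is a hypothetical name used only here.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring('<record key="123"></record>')
short_form = ET.fromstring('<record key="123" />')
bare = ET.fromstring('<record />')

def summary(element):
    """Collapse an element to (tag, attributes, text, number of children)."""
    return (element.tag, dict(element.attrib), element.text, len(element))

print(summary(long_form))
print(summary(short_form))
print(summary(bare))
```

The first two summaries are identical, so choosing the shorthand is purely a matter of readability.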
Element Rules - Rule 3. Element Nesting

Elements must be properly nested.
The end tags of inner elements must occur before the end tags of
outer elements.
Any number of child elements or data may be nested within the start
and end tags of an element.
Element Nesting Example

Legal:
<shirt>
  <style>Polo</style>
  <color>red</color>
  <size>large</size>
</shirt>
All elements are properly nested.
Not legal:
<shirt>
<style>
<size>large
<color>red
Polo
</style>
</size></color>
</shirt>
The element tags are mixed up and not ordered.
Best Practice: Use indentation to represent the document's hierarchy.
Important if your document will likely be read by humans.
Computers and programs don't usually care.
Element Rules - Rule 4. XML Naming Rules

XML name construction:
The first character must be a letter (A-Z, a-z, or a letter from
another alphabet) or _ (underscore). A colon is also allowed but is
best left to namespace prefixes.
Any number of subsequent letters, digits, hyphens,
periods, colons, and underscore characters may follow.
XML names are case sensitive.
Names cannot contain spaces.
Names must not start with xml in any case combination
(such names are reserved).
Best Practice: Brevity in tag names is not necessary.
Use descriptive names for elements and attributes.
<Queue> or <que> is far better than <q>.
Best Practice: Maintain standard naming conventions and
quoting.
Camelback, dot, and underscore notation are all common
(for example, camelBackNotation, dot.notation, and
underscore_notation).
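The naming rules can be approximated with a regular expression. The sketch below is an ASCII-only simplification (the XML specification also admits many non-ASCII letters), and is_legal_name is a hypothetical helper. It also rejects names starting with xml because they are reserved, even though a parser may still accept them.

```python
import re

# ASCII-only simplification of the naming rules; the XML specification
# also admits many non-ASCII letters.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._:-]*\Z")

def is_legal_name(name):
    """Return True if name follows the simplified rules and is not reserved."""
    if not NAME_PATTERN.match(name):
        return False
    return not name.lower().startswith("xml")

candidates = ["title", "book.isdn", "_street", "name:first", "nameXML",
              "1name", "-street", "&name", "f name", "xmlName"]
verdicts = {name: is_legal_name(name) for name in candidates}
for name, ok in verdicts.items():
    print(f"{name:12} {ok}")
```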
Rule 4. Tag Naming - Samples

Examples of legal and illegal element names:
Legal: title, book.isdn, lastName, _street, addrLine1, name:first
Not legal: 1name, -street, &name

Element names are case sensitive and start and end tags must match:
Legal: <color>red</color>, <SIZE>small</SIZE>
Not legal: <color>red</COLOR>, <SIZE>small<SiZe>

Element names must not contain spaces:
Legal: <fname>John</fname>
Not legal: <f name>John</f name>

Element names must not begin with the W3C reserved prefix xml:
Legal: <nameXML>John</nameXML>
Not legal: <xmlName>John</xmlName>
Rule 4. Element Content (1 of 2): General

An XML instance is composed of elements expressed in tag pairs
(except for empty tags) plus optional attributes that always have
quoted values and optional data that appears between the element
start tag and the element end tag.
Mixed content - element content that contains data (PCDATA is
shown) and other elements.
Example (snippet):
<title><ref>XML</ref> Example</title>
<chapter>
Chapter information
<para>What is XML</para>
<para>What is HTML</para>
More chapter information
</chapter>
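Mixed content shows up clearly once the chapter snippet is parsed. In this sketch, Python's standard xml.etree.ElementTree keeps the text before the first child on the parent and the text after a child on that child; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

chapter = ET.fromstring(
    "<chapter>Chapter information"
    "<para>What is XML</para>"
    "<para>What is HTML</para>"
    "More chapter information</chapter>"
)

# Text before the first child is stored on the parent as .text;
# text that follows a child element is stored on that child as .tail.
leading_text = chapter.text
paragraphs = [para.text for para in chapter.findall("para")]
trailing_text = chapter[-1].tail
all_text = "".join(chapter.itertext())

print(leading_text, paragraphs, trailing_text, sep="\n")
```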
Rule 4. Element Content (2 of 2): Data

Element data content is handled in one of two ways:
1. Parsed Character Data (PCDATA) is examined by the XML
parser to discover XML content embedded within it.
2. Character Data (CDATA) is delimited by the special syntax
<![CDATA[ ... ]]> and is not processed by the parser.
Rule 4. PCDATA - Parsed Character Data

Predefined entities exist to address ambiguous syntax situations,
situations where the literal would be interpreted as part of the
XML document syntax rather than its content.
Entity    Character    Description
&lt;      <            "less than"
&gt;      >            "greater than"
&amp;     &            "ampersand"
&apos;    '            "apostrophe"
&quot;    "            "quote"
Examples:
<range>&gt; 6 &amp; &lt; 20</range>
<quotes characters="&apos;&quot;&apos;"/>
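The predefined entities work in both directions: a parser turns them back into characters, and a writer must produce them when emitting text. The sketch below uses Python's standard xml.etree.ElementTree together with escape and quoteattr from xml.sax.saxutils; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Parsing replaces the predefined entities with the characters they stand for.
parsed = ET.fromstring("<range>&gt; 6 &amp; &lt; 20</range>")
print(parsed.text)

# Going the other way: escape() handles &, < and > for element content.
raw = "> 6 & < 20"
escaped = escape(raw)
round_trip = ET.fromstring(f"<range>{escaped}</range>").text

# quoteattr() adds the surrounding quotes and escapes the value as needed.
value = "'\"'"
quotes = ET.fromstring(f"<quotes characters={quoteattr(value)}/>")
print(quotes.get("characters"))
```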
Rule 4. CDATA - Character Data

Syntax:
<![CDATA[ ...Anything can go here... ]]>
Note: Anything except the literal string "]]>", which ends the section;
to embed "]]>", split it across two CDATA sections or write it outside
the section as "]]&gt;".
CDATA is not parsed and is treated as-is.
Useful for embedding other languages within the XML:
HTML documents.
XML documents.
JavaScript source.
Or any other text with a lot of special characters.
Generally speaking, the escaping rules inside a CDATA section are
those of the embedded language.
For example, an ampersand in JavaScript is written as a plain &, not as &amp;.
Rule 4. CDATA Examples

This script element contains JavaScript:
<script><![CDATA[
function matchwo(a,b) {
  if (a < b && a < 0)
    { return 1 }
  else
    { return 0 }
}
]]></script>
This nameXML element stores actual XML to be treated as text:

<nameXML>
<![CDATA[
<name common="freddy" breed="springer-spaniel">
Sir Frederick of Ledyard's End
</name>
]]>
</nameXML>
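Parsing the nameXML example confirms that markup inside a CDATA section arrives as plain text. This is a minimal sketch using Python's standard xml.etree.ElementTree; the code element in the second part is a made-up name used to show how "]]>" can be split across two sections.

```python
import xml.etree.ElementTree as ET

doc = """<nameXML>
<![CDATA[
<name common="freddy" breed="springer-spaniel">
Sir Frederick of Ledyard's End
</name>
]]>
</nameXML>"""

element = ET.fromstring(doc)
child_count = len(element)   # markup inside CDATA creates no child elements
text = element.text          # the markup arrives as ordinary character data
print(child_count)
print(text.strip())

# "]]>" cannot appear inside one CDATA section, so it is split across two.
split = ET.fromstring("<code><![CDATA[a]]]]><![CDATA[>b]]></code>")
print(split.text)
```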
Element Rules - Rule 5. Element Attributes

Attributes are used to attach information to elements.
Attributes consist of a name="value" pair, where the name is a legal
XML name. This is often referred to as a "key-value" pair.
Attributes are placed in the start tag of the element to which they
apply.
An element may have several attributes, each uniquely named.
Examples:
<title type="section" number="1">XML overview</title>
<title type="boat" state="FL">Yacht</title>
Notice the different usage of the attribute "type" in the two elements;
semantically they are not the same.
Attributes must have a value.
Values must be quoted with either double or single quotes.
Convention is to stick with one or the other.
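The attribute rules can be exercised directly. The sketch below uses Python's standard xml.etree.ElementTree; the parses helper and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

title = ET.fromstring('<title type="section" number="1">XML overview</title>')
attributes = dict(title.attrib)
print(attributes)

# Single and double quotes are interchangeable delimiters.
mixed = ET.fromstring("<title type='boat' state=\"FL\">Yacht</title>")

def parses(text):
    """Return True if the parser accepts text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

duplicate_ok = parses('<title type="a" type="b"/>')   # names must be unique
no_value_ok = parses("<input checked/>")              # a value is required
unquoted_ok = parses("<title type=section/>")         # the value must be quoted
```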
Element Rules - Rule 6. XML Declaration (1 of 2)
The XML Declaration is an optional first line in all XML documents:
<?xml version="1.0" ?>
<?xml version="1.0" encoding="UTF-8" ?>
<?xml version="1.0" standalone="yes"?>
If this declaration is used, the version attribute is mandatory.

The encoding attribute indicates the character encoding used in the
document; if UTF-8 or UTF-16 is used it may be omitted.
ASCII is a subset of UTF-8 and need not be declared.
Comments are not allowed before this statement.
The XML Declaration follows the syntax of a Processing Instruction or PI,
which is described on a subsequent chart, but it is considered to be
unique and is treated separately in the 1.0 XML specification.
GENERAL NOTE OF CAUTION: You cannot always rely on a browser or
tool to enforce the specifications completely and correctly. Nor are the
specifications always written in language that is unambiguous to every
reader. Still, the best advice is: when in doubt, refer to the
specification, which for XML is at www.w3.org/XML.
Element Rules - Rule 6. XML Declaration (2 of 2)
The standalone attribute is included here for completeness: it indicates
whether this XML document depends on information declared externally to
the document (in a DTD or XSL file (TBD), for example); its value may be yes
or no.
A value of "yes" indicates there are no external markup declarations; if
there are no external markup declarations, the declaration has no
meaning.
A value of "no" indicates there are or may be such external markup
declarations; if there are such declarations but there is no standalone
declaration, "no" is assumed.
. . . so it is typically not used.
In any event, the inclusion in the XML instance of references to external
entities, such as those in an embedded DTD, does not change its
standalone status.
A bigger issue associated with the standalone attribute is that of defining or
setting values in any entity that may be external to the XML instance.
Arguably, the principal reason for using XML is that it explicitly defines the
elements it includes. If attribute values are overridden, then the XML
instance before us is no longer declarative.
Comments
<!-- comment text --> Defines a comment.
A space after the beginning and before the trailing hyphens is
recommended but not required.
<book>
<chapter>A is the first letter</chapter>
<!-- ... -->
<chapter>Z is the last letter</chapter>
</book>
Improper usage:
<chapter <!-- ... --> >Some text.</chapter>
...or before the XML Declaration statement.
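Where a comment may appear can be confirmed with a parser. This sketch uses Python's standard xml.etree.ElementTree; the parses helper and the comment text "note" are made up for illustration.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

between_elements = parses(
    "<book><chapter>A</chapter><!-- note --><chapter>Z</chapter></book>"
)
before_root = parses('<?xml version="1.0"?><!-- note --><book/>')
inside_tag = parses("<chapter <!-- note --> >Some text.</chapter>")
before_declaration = parses('<!-- note --><?xml version="1.0"?><book/>')

print(between_elements, before_root, inside_tag, before_declaration)
```

The first two placements are accepted; a comment inside a tag or ahead of the XML Declaration is rejected.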
Internationalization and Encoding (1 of 2)

Support for different character encodings is provided through the
encoding attribute of the XML Declaration.
<?xml version="1.0" encoding="charset"?>
The encoding attribute indicates the character encoding in which the
document is stored.
In the absence of an encoding declaration, the document is expected to
be encoded as Unicode UTF-8 or UTF-16.
Documents exchanged via network may be presented to the
processor in an encoding format other than the specified encoding
as long as the transport protocol (for example, HTTP) indicates the
encoding used.
Internationalization and Encoding (2 of 2)

It is very important that the editor and operating system used to
write and save an XML document support the encoding specified in
the XML Declaration.
Sample encoding declarations:
ASCII (subset of UTF-8): no encoding declaration is needed
Western European (ISO Latin-1)
<?xml version="1.0" encoding="ISO-8859-1"?>
16-bit Unicode
<?xml version="1.0" encoding="UTF-16"?>
<?xml version="1.0" encoding="ISO-10646-UCS-2"?>
...
Japanese
<?xml version="1.0" encoding="ISO-2022-JP"?>
<?xml version="1.0" encoding="Shift_JIS"?>
...
Note: Encoding names are case-insensitive
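The link between the declared encoding and the bytes actually saved can be demonstrated with a short experiment. The sketch below uses Python's standard xml.etree.ElementTree; the word element and the sample text are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

text = "caf\u00e9"   # contains one non-ASCII character

# Encode the same content two ways; the declaration must match the bytes.
latin1_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><word>' + text + "</word>"
).encode("iso-8859-1")
utf8_bytes = ("<word>" + text + "</word>").encode("utf-8")  # no declaration: UTF-8 assumed

from_latin1 = ET.fromstring(latin1_bytes).text
from_utf8 = ET.fromstring(utf8_bytes).text

# Latin-1 bytes without a declaration are read as UTF-8 and rejected.
undeclared = ("<word>" + text + "</word>").encode("iso-8859-1")
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
```

This is the failure an editor causes when it saves a file in one encoding while the XML Declaration names another, or names none.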
Processing Instruction
Syntax: <?target arg*?>
Processing Instruction is often abbreviated as PI in
documentation.
A feature inherited from SGML.
Used to embed application-specific instructions in documents.
The target name immediately follows "<?" and is used to
associate the PI with an application.
May include zero or more arguments.
May be preceded by comments.
For example, <?xml-stylesheet href="common.css" type="text/css"?>,
which is a generally available stylesheet for simple formatting.
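A parser hands a processing instruction to the application as a target plus an uninterpreted data string. The sketch below reads the stylesheet example with Python's standard xml.dom.minidom; the doc root element is an assumption added so the fragment is a complete document.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml-stylesheet href="common.css" type="text/css"?><doc/>'
)
instructions = [
    node for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
pi = instructions[0]
print(pi.target)   # the application the instruction is addressed to
print(pi.data)     # everything after the target, passed through unparsed
```

Note that href and type look like attributes but are not parsed as such; interpreting the data string is left to the application named by the target.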
Well-formed versus Valid

A well-formed XML document:
Consists of XML elements that are nested within one another.
Has a unique root element.
Follows the XML naming conventions.
Follows the XML rules for quoting attributes.
Has tags that are properly terminated.
All XML parsers check for well-formedness.
A valid XML document has an associated vocabulary and obeys the
structural rules specified by that vocabulary.
Associated vocabulary is typically defined by either a DTD or an
XML Schema.
XML parsers may be validating or non-validating depending upon
whether or not they can apply an associated grammar.
Studio is an example of a tool whose XML capabilities include
validation.
HTML versus XML (1 of 2)
HTML is about presentation and browsing
XML is about structured information interchange
<course>
<name>Java Programming</name>
<department>EECS</department>
<teacher>
<name>Paul Thompson</name>
</teacher>
<student>
<name>Ron Jones</name>
</student>
<student>
<name>Uma Abingdon</name>
</student>
<student>
<name>Lindsay Garmon</name>
</student>
</course>
HTML versus XML (2 of 2)

HTML:
<html>
<title>Course Roster</title>
<body>
<center>
  <h1>Course Roster</h1>
  <h2>XML Programming</h2>
  <h3>Department: EECS</h3>
  <p>
  <table border=2>
    <tr>
      <th>Teacher</th>
      <td>Paul Thompson</td>
    </tr><tr>
      <th>Student<br>List</th>
      <td>Ron Jones<br>
          Uma Abingdon<br>
          Lindsay Garmon
      </td>
    </tr>
  </table>
</center>
</body>
</html>
XML:
<course>
  <name>Java Programming</name>
  <department>EECS</department>
  <teacher>
    <name>Paul Thompson</name>
  </teacher>
  <student>
    <name>Ron Jones</name>
  </student>
  <student>
    <name>Uma Abingdon</name>
  </student>
  <student>
    <name>Lindsay Garmon</name>
  </student>
</course>
HTML and XML Key Differences

HTML: Predefined tags define how to present data.
XML: Defines its own tags to identify data.

HTML: Allows missing end tags.
    <br> and <p>
XML: Requires matching end tags.

HTML: Attributes do not require quotes.
    <img src=myDog.jpeg>
XML: Attributes must be quoted.
    <book isdn="3432"></book>

HTML: Attributes do not require a value.
    <input type=radio checked>
XML: Attributes must have a value.
    <device type="radio" />

HTML: Tolerates non-nested tags.
    <H1><center>Hello!</H1></center>
XML: Strict nesting and tag matching rules.
    <H1><center>Hello!</center></H1>

HTML: Browsers will almost always do a "best guess" on ill-formed HTML.
XML: XML parsers will generate a fatal exception for well-formedness violations.

HTML: Does not support empty elements, but allows single start tags.
    <br> and <hr>
XML: Provides for empty elements.
    <name>test</name>
    <device type="radio" />

HTML: Is not case sensitive.
    <TABLE> ... </table> is valid
XML: Is case sensitive.
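The difference in strictness is easy to observe by feeding the same overlapping markup to an HTML parser and an XML parser. This sketch uses Python's standard html.parser and xml.etree.ElementTree; the TagLogger class is a hypothetical helper written for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record start and end tags in the order the HTML parser reports them."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

snippet = "<H1><center>Hello!</H1></center>"

logger = TagLogger()
logger.feed(snippet)   # no exception: tags are reported as they appear
logger.close()
print(logger.events)

try:
    ET.fromstring(snippet)
    xml_accepts = True
except ET.ParseError:
    xml_accepts = False
print(xml_accepts)
```

The HTML parser also reports H1 in lowercase, which reflects the case-insensitivity noted above, while the XML parser stops at the first mismatched end tag.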
Checkpoint Questions (1 of 3)
1. Basic XML can be described as:
A. A hierarchical structure of tagged elements, attributes and text.
B. All the HTML tags plus a set of new XML only tags.
C. Object-oriented structure of rows and columns.
D. Processing instructions (PIs) for text data.
E. Textual data with tags for visual presentation.
2. Which of these XML fragments is not well-formed?
A. <root><class>XML</class></root>
B. <class><root>XML</root></class>
C. <root><class id="XML"></root>
D. <root>XML<class id="XML"/>XML</root>
E. <root class="XML"><class id="root"/>XML</root>
3. XML Comments are allowed (Select all that apply):
A. Before the XML Declaration
B. Anywhere
C. Between element tags
D. Before the root element
E. All of the Above
4. Which of these XML elements with attributes is not well-formed?
A. <name first='Tony' LAST="Romeo" />
B. <name name="Tony" NAME="ROMEO" />
C. <_name_ first-name="Tony" last-name="Romeo"/>
D. <name="Tony Romeo" />
E. <name name="first='Tony' last='Romeo'" />
F. All of the Above
5. Which of these comments regarding HTML and XML is not true?
A. HTML markup is focused on presentation.
B. XML markup is based on defining the data.
C. XML is based on HTML.
D. HTML tags are not case sensitive.
E. XML tags are case sensitive.
F. Both XML and HTML support attributes.
Unit Summary
Having completed this unit, you should be able to:
Describe the basic rules of XML
Describe what it means for an XML document to be well-formed
List the components that make up an XML document
Describe the differences between XML and HTML
Describe the internationalization support in XML
Describe some best practices in XML
