XML

From Knowino


XML (eXtensible Markup Language) is a markup language that became a "Recommendation"^[1] of the World Wide Web Consortium (W3C) in early 1998. It was developed by a working group consisting of members from the computer industry, W3C, and academia.^[2] In general, a document written in a markup language contains human-readable text in which important terms are marked by special tags. XML, a simplified subset of the older Standard Generalized Markup Language^[3] (SGML), follows the SGML convention of marking noteworthy terms by tags in angle brackets: <tag_name> ⋅⋅⋅ </tag_name>. For instance, an XML document about the science and culture of the Interbellum (the period between WWI and WWII) could contain the marked-up sentence:


Do not confuse the German scientist <physicist>Einstein</physicist> with the Russian filmmaker <cinematographer>Eisenstein</cinematographer> !

This example contains two elements, which, by definition, consist of a start and end tag plus content in between. The name of the element is the name between < and > in the start tag, which must be exactly the same as the name between </ and > in the end tag. The tagged pieces of text can be extracted by a computer program (often referred to as "user agent"), which for example can prepare a typeset version of the document. The user agent might set the tagged contents in a special font and remove the tags.
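As a minimal sketch of what such a user agent might do, the following Python fragment uses the standard-library module xml.etree.ElementTree to extract the tagged terms and to produce the sentence with the tags removed. The wrapping <sentence> root element and the function names are assumptions added for illustration; they are not part of the example sentence.

```python
import xml.etree.ElementTree as ET

# The example sentence, wrapped in an assumed <sentence> root element
# because an XML document needs exactly one root.
SENTENCE = (
    "<sentence>Do not confuse the German scientist "
    "<physicist>Einstein</physicist> with the Russian filmmaker "
    "<cinematographer>Eisenstein</cinematographer> !</sentence>"
)

def tagged_terms(xml_text):
    """Return (element name, content) pairs for the children of the root."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in root]

def strip_tags(xml_text):
    """Return the character data of the document with all tags removed."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())

print(tagged_terms(SENTENCE))
print(strip_tags(SENTENCE))
```

A real user agent would go on to typeset the extracted terms; here they are only printed.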

An important application of XML is in creating legible databases. A public library, for instance, could build a computer-readable catalog as follows (note that, in contrast to HTML, element names in XML are case-sensitive: <library> is not the same as <Library>):

 <Library>
   <Fiction>
      <Book>
         <Author>Adams, Douglas</Author>
         <Title>The Hitchhiker's Guide to the Galaxy</Title>
         <Publisher>Crown</Publisher>
         <ISBN-10> 1400052920 </ISBN-10>
      </Book>
      <Book>  ...  </Book>
      <Book>  ...  </Book>
   </Fiction>
   <Non-Fiction>
      <Languages>
         <Book>  ...  </Book>
      </Languages>
      <Science>
         ...
      </Science>
   </Non-Fiction>
 </Library>

A user agent, knowing the tags appearing in this XML document, can parse it, extract the book titles, and, after some formatting, publish them on the internet, so that the titles become easily accessible to members of the library. At the same time the program (i.e., the user agent) could store the information about the books into a proprietary database.
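The extraction step could be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the catalog is reduced to the one book that is spelled out in the text (the elided entries are left out), and the function names book_titles and book_records are hypothetical.

```python
import xml.etree.ElementTree as ET

# Reduced catalog: only the one book spelled out in the text is included.
CATALOG = """
<Library>
   <Fiction>
      <Book>
         <Author>Adams, Douglas</Author>
         <Title>The Hitchhiker's Guide to the Galaxy</Title>
         <Publisher>Crown</Publisher>
         <ISBN-10> 1400052920 </ISBN-10>
      </Book>
   </Fiction>
</Library>
"""

def book_titles(xml_text):
    """Return the text of every <Title> element, at any depth."""
    root = ET.fromstring(xml_text)
    return [title.text for title in root.iter("Title")]

def book_records(xml_text):
    """Return one dict per <Book>, mapping child element names to trimmed text."""
    root = ET.fromstring(xml_text)
    return [
        {field.tag: (field.text or "").strip() for field in book}
        for book in root.iter("Book")
    ]

print(book_titles(CATALOG))
print(book_records(CATALOG))
```

Because element names are case-sensitive, a search for "title" instead of "Title" finds nothing in this catalog. The dictionaries returned by book_records are the kind of record a program could pass on to a database.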

XML is a meta-language, which means that it prescribes the syntax (grammar) of elements but does not fix their actual names or semantics (meanings). One could say that XML is infinitely large: practically any element name can be chosen by the author of an XML document. Only relatively few characters are forbidden in element names, and many different scripts besides Latin are allowed. A markup language with an infinite number of different element names is unworkable, so a finite, well-defined set of permitted names and their content types must be chosen. The chosen elements can be recorded in an XML schema or in a Document Type Definition (DTD), either of which in fact defines a valid XML-conforming markup language. For example, a markup language describing the cultural history of the Interbellum could have an associated DTD in which <physicist>, <cinematographer>, <mathematician>, <sculptor>, etc., are recorded as valid elements.

When a document is to be published (on screen, in print, in braille, or as an audio file), the appearance of the elements is dictated by a stylesheet. For instance, the stylesheet may prescribe that the contents of <author> ⋅⋅⋅ </author> are printed in bold. XML itself does not do anything; a computer program (a user agent) written in a language like C++ or Java is needed to perform the layout, storage, or transfer of XML documents.
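The division of labor between document, stylesheet, and user agent can be illustrated with a toy sketch in Python. Everything in it is an assumption made for illustration: the STYLE table stands in for a stylesheet, asterisks stand in for bold type, and the <book> and <title> elements are hypothetical. Real stylesheet languages such as CSS are far richer.

```python
import xml.etree.ElementTree as ET

# Hypothetical "stylesheet": element name -> function that decorates its text.
# Asterisks stand in for bold type.
STYLE = {"author": lambda text: "*" + text + "*"}

def render(element):
    """Return the text of `element`, decorated according to STYLE."""
    parts = [element.text or ""]
    for child in element:
        parts.append(render(child))      # styled content of the child
        parts.append(child.tail or "")   # text following the child
    text = "".join(parts)
    decorate = STYLE.get(element.tag)
    return decorate(text) if decorate else text

document = ET.fromstring(
    "<book><author>Adams, Douglas</author>: "
    "<title>The Hitchhiker's Guide to the Galaxy</title></book>"
)
print(render(document))
```

The XML document itself is unchanged by the rendering; replacing the STYLE table changes the appearance without touching the markup.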

The best-known XML-compliant language is XHTML, a language for web pages that differs only in minor details from the older HTML 4.01. The definitions of XHTML are contained in a DTD file accessible to the world as a web document. The stylesheets most commonly used with XHTML are written according to the CSS 2.1 standard (Cascading Style Sheets version 2.1). CSS 2.1 also applies to XML documents. User agents such as the web browsers Internet Explorer, Firefox, and Chrome perform the parsing and further processing of XHTML documents. XML has found many applications besides XHTML; see Ref. ^[4] for an extensive list of XML-based languages.

The XML document

An XML document may be text-based, or contain data to be transferred from one electronic application to another. A document might also contain abstract structures, such as graphic shapes described by vectors, as in SVG (scalable vector graphics) documents.^[5] A single XML document can span multiple computer files.

As was discussed in the introduction, content in an XML document is marked with two surrounding tags. The markup is opened with an element name between angle brackets (the start tag) and closed by the same name between </ and > (the end tag). Empty elements (without content) are also allowed; they are denoted by a single tag closed by />:

<element_name />

Often empty elements are used to convey information through their attributes, like the src attribute in the img element of XHTML:

<img src="Einstein.jpg" />

Most of the letters in the Unicode character set are allowed in names of elements, meaning that letters in many different scripts may be employed. However, names cannot begin with a digit ([0-9]), a dot (.), a middle dot (·), or a hyphen (-). The characters commonly used in names are: [A-Z], [a-z], [0-9], - (hyphen), _ (underscore), and . (dot). Colons are used in flagging namespaces (see below) and hence are better avoided. Names containing one or more slashes or backslashes (as used in file paths and URLs) are forbidden, as are names containing spaces. Any of the white space characters (tab, line feed, carriage return, and space) terminates a name. As stated earlier, in contrast to HTML, element names are case-sensitive: <head> is not the same as <HEAD>, although both are allowed.
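One way to try out these naming rules is to hand candidate names to a parser and see whether it accepts them. The sketch below does this with Python's standard-library parser; the helper accepts_name is hypothetical. It tests a name only indirectly, by wrapping it in an empty-element tag.

```python
import xml.etree.ElementTree as ET

def accepts_name(name):
    """Report whether the parser accepts `name` as an element name."""
    try:
        ET.fromstring("<" + name + "/>")
    except ET.ParseError:
        return False
    return True

# Names built from letters, digits, hyphen, underscore, and dot are fine,
# as are letters from scripts other than Latin.
print(accepts_name("a-b.c_1"), accepts_name("физик"))

# Names may not start with a digit, a hyphen, or a dot, nor contain a slash.
print(accepts_name("1abc"), accepts_name("-abc"), accepts_name(".abc"), accepts_name("a/b"))
```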

There are five predefined character references in XML:

&lt;	<	less than
&gt;	>	greater than
&amp;	&	ampersand
&apos;	'	apostrophe
&quot;	"	quotation mark

The names given in the first column are part of XML and do not have to be defined by the user. The two characters & and < have special meaning for XML parsers and should be referred to by &amp; and &lt;, respectively. The other three characters in the table may be invoked by either value, whichever is more convenient. When the characters & and < are needed several times in a piece of text, it is often more convenient to enclose the text between <![CDATA[ and ]]>. The text thus enclosed is not parsed and may contain any characters (except the string consisting of the three closing characters). Example:

<![CDATA[  Here is an unadorned ampersand: &. And here a less-than symbol: <. Followed by the closing ]]>.

A web browser will show the text (including & and <), but not the bracketing strings.
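The two ways of writing & and < can be compared in a short Python sketch that uses the standard-library parser. The element name t, the sample text, and the helper names are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The same character data, once with references and once in a CDATA section.
escaped = ET.fromstring("<t>a &amp; b &lt; c</t>").text
cdata = ET.fromstring("<t><![CDATA[a & b < c]]></t>").text

def parses(xml_text):
    """Report whether the parser accepts `xml_text`."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

def make_element(name, raw_text):
    """Build an element whose content is `raw_text`, escaping &, < and >."""
    return "<" + name + ">" + escape(raw_text) + "</" + name + ">"

print(escaped == cdata)               # both forms yield the same text
print(parses("<t>a & b < c</t>"))     # a bare & or < is rejected
print(make_element("t", "a & b < c"))
```

The function escape from xml.sax.saxutils replaces &, < and > by their references, which is a convenient way to prepare arbitrary text for inclusion in a document.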

The elements in an XML document form a tree. The tree starts at the root (in the example below the element <Library> is the root of the document), and branches to the lower levels of the tree. The presence of a root element is mandatory in an XML document (in contrast to HTML 4). All XML elements may have children. <Fiction> is a child of <Library> and the siblings <Book> are children of <Fiction> and descendants of the common ancestor <Library>:

<Library>
   <Fiction>
      <Book>  ...  </Book>
      <Book>  ...  </Book>
   </Fiction>
</Library>

Thus, the terms parent, child, sibling, ancestor, and descendant are used to describe the relationships between elements. Children of the same parent are called siblings, which is a generic term for brothers and sisters. A document tree is only properly defined when its elements are properly nested, that is, a construct such as <parent><child> ... </parent></child> is forbidden. All elements must be closed by an end tag (this is optional in HTML 4, but obligatory in XHTML 1.0, because the latter is XML-conforming).
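The tree relationships can be explored with a few lines of Python. This is a sketch using the standard-library module xml.etree.ElementTree; the variable and function names are hypothetical, and the elided book contents are kept as the literal text "...".

```python
import xml.etree.ElementTree as ET

TREE = """<Library>
   <Fiction>
      <Book>  ...  </Book>
      <Book>  ...  </Book>
   </Fiction>
</Library>"""

root = ET.fromstring(TREE)
fiction = root[0]                                     # first child of the root
children_of_root = [child.tag for child in root]
siblings = [child.tag for child in fiction]           # the two <Book> elements
descendants = [node.tag for node in root.iter()][1:]  # iter() yields the root first

def is_properly_nested(xml_text):
    """Report whether the parser accepts the nesting in `xml_text`."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

print(root.tag, children_of_root, siblings, descendants)
print(is_properly_nested("<parent><child> ... </parent></child>"))
```

Note that this particular module stores only child links; an element does not record its parent, so ancestors have to be found by walking down from the root.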

Additional information about elements may be provided by attributes, which have the general syntax

<element  attr1 = "val1" attr2="val2" attr3='val3'> ... </element>

An element may have any number of attributes with attribute names that obey the same rules as element names [almost any Unicode letter is allowed, but names cannot start with a digit, hyphen, or (middle) dot, etc.]. The value of an attribute must be surrounded by a pair of single or double quotes. Spaces around the assignment symbol (=) are optional. Examples of syntactically well-formed elements:

<physicist born = '1879' died="1955" nationality="German-American">Albert Einstein</physicist>

<Image source = "Photo_of_Einstein.jpeg" description = "Albert Einstein at age 26"  />
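A program reading these two elements might access the attributes as in the following Python sketch, which uses the standard-library parser. The variable names are hypothetical. Attribute values always arrive as strings, so numeric use requires an explicit conversion.

```python
import xml.etree.ElementTree as ET

physicist = ET.fromstring(
    "<physicist born = '1879' died=\"1955\" nationality=\"German-American\">"
    "Albert Einstein</physicist>"
)
image = ET.fromstring(
    '<Image source = "Photo_of_Einstein.jpeg" '
    'description = "Albert Einstein at age 26"  />'
)

# Attributes are exposed as a mapping from names to string values.
print(physicist.attrib)
print(image.get("source"))

# Values are strings; convert them before doing arithmetic.
lifespan = int(physicist.get("died")) - int(physicist.get("born"))
print(lifespan)
```

The empty element <Image ... /> has neither text nor children; all of its information is carried by the attributes.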

Just as in HTML 4 documents, an XML document can contain comments, opened by <!-- and closed by -->.

The two-character markup string "<?" has special meaning. It may mark some processing instructions as in

<?php PHP instructions ?> ^[6]

or an XML declaration, as in

<?xml version="1.0" encoding="ASCII" ?>

The latter line may optionally precede an XML document (it comes before <!DOCTYPE). This line is advisable if the encoding of the document is not the default UTF-8.
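The effect of the encoding declaration can be sketched in Python. The element name word and its content are assumptions chosen for illustration; the text contains the letter é, which is the single byte 0xE9 in ISO-8859-1 but a two-byte sequence in UTF-8.

```python
import xml.etree.ElementTree as ET

# The same element encoded in two different ways.
LATIN1_BODY = "<word>caf\u00e9</word>".encode("iso-8859-1")
UTF8_BODY = "<word>caf\u00e9</word>".encode("utf-8")
DECLARATION = b'<?xml version="1.0" encoding="ISO-8859-1" ?>'

def parses(data):
    """Report whether the parser accepts the byte string `data`."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True

# With the declaration the parser decodes the ISO-8859-1 bytes correctly.
declared = ET.fromstring(DECLARATION + LATIN1_BODY)
print(declared.text)

# Without it the parser assumes UTF-8 and rejects the lone byte 0xE9.
print(parses(LATIN1_BODY), parses(UTF8_BODY))
```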

The DTD

An XML Schema Definition (XSD) is a set of definitions of XML elements and their attributes. XSDs provide a means for defining the structure, content and semantics (meaning) of XML documents. They are updated versions of the older DTDs that go back to SGML. XSDs are more powerful, but also more complex, than DTDs. See Ref. ^[7] for more details about XSDs. Here we will consider only some rudimentary notions of DTDs,^[8] which to date (2012) are still part of both XHTML 1 and HTML 4. A DTD puts on record elements and their attributes, but it may also define entity references, indicated in a document by a beginning & and a closing semicolon (;). For instance, &pi; refers to the entity (Greek letter) π.

A document is valid (in the terminology of XML) when all elements and attributes are defined in an accompanying DTD and, obviously, if all elements and attributes satisfy their definitions. A valid document must be well-formed (syntactically correct), but the converse is not necessarily true (in fact, a well-formed document may not even have a DTD). Most XHTML browsers check for well-formedness (and are obliged to stop execution as soon as they detect an error). This is not true for invalidity. Usually browsers try to fix the invalid expressions that they encounter. Special validators exist for XHTML^[9] and XML.^[10]

To give a taste of a DTD declaration, the following simple example presents an XML document that has an internal DTD (that is, the declarations are included in the XML document):

<!-- <!DOCTYPE Courtman_family SYSTEM "example1.dtd"> -->
<!DOCTYPE Courtman_family [
    <!ELEMENT  Courtman_family  (#PCDATA|Mother|Father|Son|Daughter)*>
    <!ELEMENT  Mother           (First_names+, Last_name)>
    <!ELEMENT  Father           (First_names*, Son*, Last_name+)>
    <!ELEMENT  First_names      (#PCDATA)>
    <!ELEMENT  Last_name        (#PCDATA)>
    <!ELEMENT  Son              (#PCDATA)>
    <!ELEMENT  Daughter         (#PCDATA)>
    <!ATTLIST  Father            born  CDATA #IMPLIED 
                                 profession  CDATA #IMPLIED >
    <!ENTITY   Address           "Hillsdale Blvd">
]>
<?xml-stylesheet href="" ?>
<Courtman_family>
   Ms. <Mother><First_names>Corina S. </First_names><Last_name>Robinson</Last_name></Mother>
   has two children  from her previous marriage: son <Son>Theo</Son> and daughter 
   <Daughter>Lizz</Daughter>. She has a son: <Father><Son>Mike</Son> <Last_name>Courtman</Last_name>
   </Father> with her present husband, mr. <Father born="March 6, 1964" 
   profession="Carpenter"><Last_name>Courtman</Last_name></Father>.
   The family lives on &Address;
</Courtman_family>

A browser outputs this document simply as:

Ms. Corina S. Robinson has two children from her previous marriage: son Theo and daughter Lizz. She has a son: Mike Courtman with her present husband, mr. Courtman. The family lives on Hillsdale Blvd
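How a program arrives at such output can be sketched in Python with the standard-library parser, which expands entities declared in an internal DTD but does not validate. The document below is a shortened version of the example (an assumption made to keep the sketch small): it keeps the entity declaration and a few of the other declarations, and only the last two sentences of the text, with "Mr." capitalized at the start.

```python
import xml.etree.ElementTree as ET

# Shortened version of the example document; the declarations are copied
# from the internal DTD, the body is cut down to two sentences.
DOCUMENT = """<!DOCTYPE Courtman_family [
    <!ELEMENT  Courtman_family  (#PCDATA|Mother|Father|Son|Daughter)*>
    <!ELEMENT  Father           (First_names*, Son*, Last_name+)>
    <!ELEMENT  Last_name        (#PCDATA)>
    <!ATTLIST  Father            born  CDATA #IMPLIED
                                 profession  CDATA #IMPLIED >
    <!ENTITY   Address           "Hillsdale Blvd">
]>
<Courtman_family>
   Mr. <Father born="March 6, 1964"
   profession="Carpenter"><Last_name>Courtman</Last_name></Father>.
   The family lives on &Address;
</Courtman_family>"""

root = ET.fromstring(DOCUMENT)

# Join all character data and collapse runs of white space,
# roughly as a browser does when laying out running text.
rendered = " ".join("".join(root.itertext()).split())
father = root.find("Father")

print(rendered)
print(father.get("born"), father.get("profession"))
```

The reference &Address; has been replaced by its declared value before the program sees the text, and the attributes of <Father> do not appear in the running text at all.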

Explanation

The first line (commented out) in the XML document gives the DTD declaration when the DTD is external (in the separate file example1.dtd).
The second line starts the DTD declaration (anything between the matching square brackets). The element <Courtman_family> is the root element of the document. This line associates the DTD with the document. Recall that the name of the root element (<Courtman_family>) is unique within the document.
The third line (the first actual DTD declaration) gives the syntax of the root element <Courtman_family>. Any element declaration starts with <!ELEMENT and is followed by the name of the element. Then the element syntax is described by a string between parentheses.
The string defining the syntax of <Courtman_family> contains the primitive (atomic) value #PCDATA (Parsed Character DATA), which is a sequence of UTF-8 characters that does not have any children, and hence is not allowed to contain < or & (an entity reference is also seen as a child).
The syntax of <Courtman_family> allows also for the non-primitive children Mother, Father, Son, Daughter. They may be defined in the DTD (which they are further down in the example). Any element that appears in the document as child of root must be defined here. Lower descendants (grandchildren, etc.) must be defined separately. If a child of root is not used, it may be left here without further definition.
Note that the element <Son> appears as child of root and as child of <Father> (i.e., grandchild of root). This is legitimate, but <Son> must be defined in both relationships.
The line <?xml-stylesheet href="" ?> specifies a null stylesheet and triggers a browser to apply its default style rules. If this line is omitted, a browser outputs an XML document as a tree. If the line refers to a separate CSS file, this dictates the style.
The symbol * appearing in the first element declaration (the third line) is borrowed from the Unix editor ed as part of a regular expression. It says that the elements between the parentheses may occur zero or more times, that is, any of the children #PCDATA|Mother|Father|Son|Daughter of the root may appear an indeterminate number of times. The vertical bar (|) indicates a logical "or", i.e., a choice between the elements separated by |.
The fourth line defines <Mother> as a sequence, recognized by members between commas. The first child must be <First_names> (and cannot be text). The + sign originates from regular expressions: the child <First_names> must appear at least once and possibly more than once. Because the comma defines a sequence, all appearances of <First_names> must come before <Last_name>. This last child of <Mother> must appear always and just once.
<Father> has three children. The order of their appearance in the document is fixed. The first two are optional, while the last child must appear at least once.
The following four lines define elements as childless pieces of text.
The last-but-one declaration defines the attributes of <Father>. They are "IMPLIED", meaning that they are optional, and consist of CDATA (simple character data not containing & or <).
The last line in the DTD defines an entity, referred to as &Address;.
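Because the occurrence symbols come from regular expressions, the content models of <Mother> and <Father> can be mimicked with ordinary regular expressions over the names of the children. The Python sketch below is a toy check under stated assumptions: the two patterns are hand translations of the declarations above, the names CONTENT_MODELS and children_match are hypothetical, and the check looks only at the order of child elements. It is not a validator.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of two content models; each child name is followed by a
# comma so that names cannot run together.
CONTENT_MODELS = {
    # (First_names+, Last_name)
    "Mother": re.compile(r"(First_names,)+Last_name,"),
    # (First_names*, Son*, Last_name+)
    "Father": re.compile(r"(First_names,)*(Son,)*(Last_name,)+"),
}

def children_match(element):
    """Check the sequence of child names against the element's content model."""
    model = CONTENT_MODELS[element.tag]
    child_names = "".join(child.tag + "," for child in element)
    return model.fullmatch(child_names) is not None

mother = ET.fromstring(
    "<Mother><First_names>Corina S. </First_names>"
    "<Last_name>Robinson</Last_name></Mother>"
)
father = ET.fromstring(
    "<Father><Son>Mike</Son> <Last_name>Courtman</Last_name></Father>"
)
wrong_order = ET.fromstring(
    "<Mother><Last_name>Robinson</Last_name>"
    "<First_names>Corina S. </First_names></Mother>"
)

print(children_match(mother), children_match(father), children_match(wrong_order))
```

A validating parser performs this kind of check for every element, and in addition checks attributes, entities, and the places where text is allowed.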

For use internal to the DTD, parameter entities may be defined, as in:

<!ENTITY % URI "CDATA">

Note the spaces around %; they must be there. A parameter entity may be used everywhere in the DTD, for instance in the declaration of the attribute href of the empty element base:

<!ELEMENT base EMPTY>
<!ATTLIST base href %URI; #REQUIRED>

In the XML document this would be used as, e.g., <base href="http://knowino.org/wiki/Knowino:Welcome!" />. Inside the DTD the reference %URI; is replaced by CDATA, so that the value of href, here http://knowino.org/wiki/Knowino:Welcome!, is declared to consist of simple character data.

The syntax of XML and DTD is formally (and precisely) declared in Extended Backus-Naur Form (EBNF); see Ref. ^[11]. To give the flavor of it, an incomplete definition of element is given and explained:

S := (#x20|#x9|#xD|#xA)+

Stag := '<'Name (S Attribute)* S? '>'

Etag := '</'Name S? '>'

Name := (Letter | '_' | ':') (NameChar)*

element := Stag content Etag

First, white space S is defined as space, tab, carriage return, or line feed, with the respective code points in hexadecimal (decimal) x20 (32), x9 (9), xD (13), and xA (10). Note that S contains at least one of these characters, and possibly more; this is indicated by the terminating regular expression symbol +. The start tag (Stag) is opened with the literal <, followed, with no white space in between, by the name of the element and then by zero or more attributes. The Etag (end tag) begins with the literal </, followed without a space by the name of the element. Zero or more white space characters precede the literal >. The question mark is borrowed from regular expressions and indicates an option (zero or one). A Name starts with a Letter, underscore, or colon. The allowed entities Letter cover almost all of Unicode, but from the ASCII set (first 128 code points) only [A-Z] and [a-z] can be used, i.e., a name cannot start with a digit or punctuation mark (other than the underscore and colon). The exact definition of Letter and NameChar can be found in Ref. ^[8]. Finally, an element consists of an Stag, content, and an Etag (the definition of empty elements is omitted here). The definition of content is fairly involved, because recursive nesting occurs, but basically content consists of one or more sequences (strings separated by commas) and/or choices (strings separated by vertical bars).
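The productions for S, Name, Stag, and Etag translate almost symbol by symbol into regular expressions. The Python sketch below shows this under loudly stated assumptions: Letter and NameChar are restricted to ASCII, and the production for Attribute, which is not given above, is filled in with a simplified guess (a name, an equals sign with optional white space, and a quoted value without < or &). The production for element is left out because its recursive content cannot be captured by a plain regular expression.

```python
import re

# S := (#x20|#x9|#xD|#xA)+
S = r"[\x20\x09\x0D\x0A]+"

# Name := (Letter | '_' | ':') (NameChar)*   -- ASCII-only approximation
NAME = r"[A-Za-z_:][A-Za-z0-9._:\-]*"

# Assumed, simplified Attribute production (not part of the excerpt above).
ATT_VALUE = r"""(?:"[^<&"]*"|'[^<&']*')"""
ATTRIBUTE = NAME + "(?:" + S + ")?=(?:" + S + ")?" + ATT_VALUE

# Stag := '<'Name (S Attribute)* S? '>'
STAG = re.compile("<" + NAME + "(?:" + S + ATTRIBUTE + ")*(?:" + S + ")?>")

# Etag := '</'Name S? '>'
ETAG = re.compile("</" + NAME + "(?:" + S + ")?>")

print(bool(STAG.fullmatch("<physicist born = '1879' died=\"1955\">")))
print(bool(ETAG.fullmatch("</physicist >")))   # white space before > is allowed
print(bool(ETAG.fullmatch("</ physicist>")))   # white space after </ is not
print(bool(re.fullmatch(NAME, "1abc")))        # a name cannot start with a digit
```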

The style sheet

A markup language proper only defines the syntax and semantics of terms. It is not concerned with appearance (indentation, vertical spacing, choice of fonts, etc.). For many applications, such as database applications, appearance is irrelevant. For humans, however, the presentation (printed or on screen) of a document improves readability and hence is important. The appearance of certain tagged terms may be defined separately in a style sheet. A style sheet may have different sources that together form a cascade of sources. Priority in a cascade is well-defined: if two sources of a style sheet give contradictory definitions, then the definition lowest in the cascade takes priority. In this manner Cascading Style Sheets (CSS) are defined, which can be used by XML documents to determine their appearance on screen, in print, in braille, or as an audio file. The latest standard of style definitions is CSS 2.1. As of this writing (spring 2012) the definition of CSS3 is nearing completion.

The definition of CSS 2.1 is such that it is fully interoperable with XML, HTML 4, and XHTML 1.0; see Ref. ^[12] for a quick tutorial on CSS with XML. Because CSS 2.1 is described elsewhere, no more details are given here.

The XML parser

The software module that reads and checks the information in XML documents is known as an XML parser or processor. It usually serves as the front end of a user agent. That is, parsers are part of XML-compliant applications (such as web browsers or database servers). One of the tasks of a parser is to ascertain that the XML document follows all the rules of the XML markup syntax. If that is the case, the document is said to be well-formed. Some parsers go a step further and check the document against the DTD or XML schema; these are validating parsers. If a document passes both checks, it is said to be valid. Once it has been established that a document is well-formed, the parsed document is passed on to the engine (core part) of the user agent. In the case of web browsers, for example, the engine takes care of the layout of the page on the user's screen.
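The front-end role of a parser can be sketched in Python with the standard-library module xml.etree.ElementTree, which is a non-validating parser: it checks well-formedness only and does not compare the document with a DTD or schema. The function name check_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return (root, None) if well-formed, else (None, (line, column))."""
    try:
        return ET.fromstring(xml_text), None
    except ET.ParseError as error:
        # error.position holds the place where the parser gave up.
        return None, error.position

root, problem = check_well_formed("<a><b/></a>")
print(root.tag, problem)        # parsed tree, ready to hand to the engine

root, problem = check_well_formed("<a><b></a>")
print(root, problem)            # no tree, only the position of the error
```

An application would pass the returned tree on to its engine in the first case and report the position of the error in the second.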

XML namespaces

Every XML application has its own markup vocabulary consisting of element names and attributes that the application understands. A single XML document may be input to multiple XML applications with different vocabularies. For instance, an XHTML 1.0 document may contain SVG diagrams that are illustrations to the XHTML text. The XML-conforming application SVG^[13] describes vector-based drawings. In this case a parser must distinguish SVG from XHTML elements, as it must decide to which application each element must be sent. A problem here is that vocabularies may overlap: two element names may "collide", meaning that they are the same. For example, the parser needs to differentiate between two meanings of title: a tooltip of a drawing (in the vocabulary of SVG) and a title of a document (in the vocabulary of XHTML).

To resolve this possible conflict the W3C created the namespace convention.^[14] A declaration provides a long and unique namespace name for a particular XML vocabulary that is conveniently represented by a shorthand consisting of a unique (within the document) prefix. Prefixes attached to local names of elements distinguish the names originating from different vocabularies. It is important to emphasize that namespaces are independent of DTDs: a DTD does not know whether a document contains elements with names from a namespace. A document is valid only if all element names (prefixed or not) of the document are properly recorded in its DTD. For a DTD a prefixed name is a name like any other. Conversely, a document with different namespaces does not need to have an associated DTD (and will be non-valid in that case, but can still be well-formed).

As said, a namespace obviously must have a unique name. It is common to give a name that is a URL.^[15] This choice kills two birds with one stone: it gives information about the organization that maintains the application with its vocabulary, and a URL is unique.

Namespaces are declared in an XML document by the xmlns[:prefix] attribute. One can establish a namespace for an element and all its descendants. All elements being descendants of root, declaration as attribute of root applies to the whole document. A strict XHTML 1.0 document (that must have <html> as root) is required to have its (default) namespace declared. This is done by:

<html xmlns="http://www.w3.org/1999/xhtml">

This declaration associates the namespace with an empty prefix; all unprefixed names are associated with this (default) namespace.

If one plans to invoke a vocabulary repeatedly inside an XHTML document (for instance SVG), the xmlns attributes may be added to the root element, as in:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:s="http://www.w3.org/2000/svg">

This declaration associates elements <s:local_name>, starting with s:, with the namespace http://www.w3.org/2000/svg containing local_name. If a namespace declaration does not specify a prefix, it acts as default namespace and its elements are referred to without prefix. The name of a prefix is free, but for obvious reasons it cannot contain a colon.

An ellipse within a blue rectangle can now be drawn as follows (all elements, except <p> and <html>, are from the SVG namespace):

   <html xmlns="http://www.w3.org/1999/xhtml"  xmlns:s="http://www.w3.org/2000/svg">
      <p> Here comes some SVG</p> 
      <s:svg width="15cm" height="15cm">
         <s:rect x="1cm" y="1cm" width="12cm" height="12cm"
                 fill="none" stroke="blue" stroke-width="2" />
         <s:g transform="translate(250 275)">
            <s:ellipse rx="125" ry="90"
                fill="cyan"  />
         </s:g>
      </s:svg>		
   </html>

If one applies a namespace to part of the document only, one can do it by defining a prefix on a parent element.

In general attributes are not prefixed and keep the meaning defined by the element to which they belong, as is shown by the example above. In exceptional cases one can associate an attribute with a namespace that differs from its element and then a prefix attached to the attribute is required.
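How a namespace-aware parser sees the XHTML document with the embedded SVG drawing can be sketched in Python. The standard-library module xml.etree.ElementTree reports each element name as {namespace name}local_name, so the prefix used in the document disappears. The variable names and the prefix "drawing" used in the search are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"  xmlns:s="http://www.w3.org/2000/svg">
   <p> Here comes some SVG</p>
   <s:svg width="15cm" height="15cm">
      <s:rect x="1cm" y="1cm" width="12cm" height="12cm"
              fill="none" stroke="blue" stroke-width="2" />
      <s:g transform="translate(250 275)">
         <s:ellipse rx="125" ry="90"
             fill="cyan"  />
      </s:g>
   </s:svg>
</html>"""

XHTML = "http://www.w3.org/1999/xhtml"
SVG = "http://www.w3.org/2000/svg"

root = ET.fromstring(PAGE)

# Local names of all elements that belong to the SVG namespace.
svg_names = [
    node.tag.split("}")[1]
    for node in root.iter()
    if node.tag.startswith("{" + SVG + "}")
]

# The prefix used for searching ("drawing") need not match the "s" of the
# document; only the namespace name counts.
ellipse = root.find(".//drawing:ellipse", {"drawing": SVG})

print(root.tag)
print(svg_names)
print(sorted(ellipse.attrib))   # unprefixed attributes carry no namespace
```

The unprefixed elements <html> and <p> end up in the default XHTML namespace, while the attributes of the SVG elements keep their plain names.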

Notes

[1] The W3C issues "recommendations". W3C Recommendations are similar to the standards published by other organizations.
[2] W3C XML Working Group members
[3] ISO 8879:1986 SGML standard. Not free
[4] List of XML based languages on Wikipedia
[5] The SVG language is XML compliant.
[6] PHP is a server-side computer language
[7] XML Schema, a W3C publication.
[8] Annotated definition of the XML grammar in EBNF
[9] HTML and XHTML validation
[10] XML validation (only for Internet Explorer).
[11] Explanation of Extended Backus-Naur notation (EBNF)
[12] CSS 2.1 tutorial for XML
[13] Scalable Vector Graphics (SVG) 1.1
[14] Namespaces in XML 1.0
[15] Here a shortcut is taken: the official standard says that a namespace name must be a URI (Uniform Resource Identifier). There are two general forms of URI: Uniform Resource Locators (URL) and Uniform Resource Names (URN). Either type of URI may be used as a namespace identifier, but URLs are more common.
