XML - XPath

1 - About

XPath is a language for navigating in XML documents.

The XPath specification is the foundation for a variety of other specifications:

  • XSLT, which uses XPath to select nodes from the source document and applies templates to them to create a result document.
  • Linking and addressing specifications such as XPointer.

XPath operators, functions, wild cards, and node-addressing mechanisms can be combined in a wide variety of ways.

3 - Expression

An XPath expression specifies a pattern that selects a set of XML nodes.

The nodes in question are those of the XPath data model, not only XML elements: they also include text and attributes, among other things.

In fact, the XPath specification defines an abstract document model with seven kinds of nodes: root, element, attribute, text, namespace, processing instruction, and comment.

4 - Basic XPath Addressing

4.1 - Node Navigation and Content

An XML document is a tree-structured (hierarchical) collection of nodes. As with a hierarchical directory structure, it is useful to specify a path that points to a particular node in the hierarchy (hence the name of the specification: XPath).

In fact, much of the notation of directory paths is carried over intact:

Character  Designation        Meaning                         Tip
/          The forward slash  Path separator                  An absolute path from the root of the document starts with a /. A relative path from a given location starts with anything else.
..         A double period    The parent of the current node  Passed to a function, it stands for that node and its content.
.          A single period    The current node                Passed to a function, it stands for that node and its content.

For example, in an Extensible HTML (XHTML) document, the path /h1/h2 would indicate an h2 element under an h1. (Recall that in XML, element names are case-sensitive, so this kind of specification works much better in XHTML than it would in plain HTML, because HTML is case-insensitive.)

A name specified in an XPath expression refers to an element. For example, h1 in /h1/h2 refers to an h1 element.

In a pattern-matching specification such as XPath, the specification /h1/h2 selects all h2 elements that lie under an h1 element.
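
The path notation above can be tried out with a short sketch. It assumes Python's standard xml.etree.ElementTree module, which implements only a subset of XPath and evaluates paths relative to an element, so the absolute /h1/h2 becomes h1/h2 from the parent node. The sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document shaped like the example: h2 elements nested under h1.
doc = ET.fromstring(
    "<body>"
    "<h1><h2>Intro</h2><h2>Scope</h2></h1>"
    "<h1><h2>Usage</h2></h1>"
    "</body>"
)

# Relative path from the context node (body): every h2 that lies under an h1.
titles = [h2.text for h2 in doc.findall("h1/h2")]
print(titles)  # ['Intro', 'Scope', 'Usage']

# ".." steps from each matched h2 up to its parent; each parent is returned once.
parents = doc.findall("h1/h2/..")
print([p.tag for p in parents])  # ['h1', 'h1']
```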

4.2 - Attribute

5 - Basic XPath Expressions

The full range of XPath expressions takes advantage of the wild cards, operators, and functions that XPath defines.

5.1 - Square-bracket

5.1.1 - Indexing

The square-bracket notation ([]) is normally associated with indexing.

To select a specific h2 element, you use square brackets [] for indexing. The path /h1[4]/h2[5] would therefore select the fifth h2 element under the fourth h1 element.

The position() function returns the index of the current node within the set being filtered, so /h1[4] is shorthand for /h1[position()=4].
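
A minimal sketch of index predicates, assuming Python's standard xml.etree.ElementTree module and a hypothetical sample document. Its XPath subset accepts [n] and [last()]; the position() function itself is not covered here, and paths are written relative to the parsed element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two h1 elements with two and three h2 children.
doc = ET.fromstring(
    "<body>"
    "<h1><h2>a1</h2><h2>a2</h2></h1>"
    "<h1><h2>b1</h2><h2>b2</h2><h2>b3</h2></h1>"
    "</body>"
)

# Indexes are 1-based: the third h2 under the second h1.
picked = doc.findall("h1[2]/h2[3]")
print([e.text for e in picked])  # ['b3']

# last() addresses the final h2 under the first h1.
last = doc.findall("h1[1]/h2[last()]")
print([e.text for e in last])  # ['a2']
```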

5.1.2 - Boolean

The expression @type="unordered" specifies an attribute named type whose value is unordered. An expression such as LIST/@type specifies the type attribute of a LIST element.

The expression LIST[@type="unordered"] selects all LIST elements whose type value is unordered.
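
As a rough illustration, the attribute test can be run with Python's standard xml.etree.ElementTree module (an assumption of this sketch, as is the sample document). ElementTree does not return attribute nodes, so the value addressed by LIST/@type is read with get() instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with three LIST elements.
doc = ET.fromstring(
    "<DOC>"
    "<LIST type='unordered'><ITEM>x</ITEM></LIST>"
    "<LIST type='ordered'><ITEM>y</ITEM></LIST>"
    "<LIST type='unordered'><ITEM>z</ITEM></LIST>"
    "</DOC>"
)

# LIST[@type="unordered"]: keep only the LIST elements whose type is unordered.
unordered = doc.findall("LIST[@type='unordered']")
print([lst.find("ITEM").text for lst in unordered])  # ['x', 'z']

# Counterpart of LIST/@type: the type attribute value of every LIST.
types = [lst.get("type") for lst in doc.findall("LIST")]
print(types)  # ['unordered', 'ordered', 'unordered']
```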

5.1.3 - Extended

Examples that use the extended square-bracket notation:

  • /PROJECT[.="MyProject"]: Selects a PROJECT named “MyProject”.
  • /PROJECT[STATUS]: Selects all projects that have a STATUS child element.
  • /PROJECT[STATUS="Critical"]: Selects all projects that have a STATUS child element with the string-value Critical.
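
The three predicates above can be sketched as follows, assuming Python's standard xml.etree.ElementTree module (the [.='text'] form needs Python 3.7 or later) and a hypothetical document with a NAME child added for readability.

```python
import xml.etree.ElementTree as ET

# Hypothetical project list; NAME is an illustrative element, not from the text.
doc = ET.fromstring(
    "<PROJECTS>"
    "<PROJECT>MyProject</PROJECT>"
    "<PROJECT><NAME>Alpha</NAME><STATUS>Critical</STATUS></PROJECT>"
    "<PROJECT><NAME>Beta</NAME><STATUS>Normal</STATUS></PROJECT>"
    "<PROJECT><NAME>Gamma</NAME></PROJECT>"
    "</PROJECTS>"
)

# PROJECT[.="MyProject"]: the string-value of the element equals MyProject.
by_value = doc.findall("PROJECT[.='MyProject']")

# PROJECT[STATUS]: the element has at least one STATUS child.
with_status = [p.findtext("NAME") for p in doc.findall("PROJECT[STATUS]")]

# PROJECT[STATUS="Critical"]: a STATUS child whose string-value is Critical.
critical = [p.findtext("NAME") for p in doc.findall("PROJECT[STATUS='Critical']")]

print(len(by_value), with_status, critical)  # 1 ['Alpha', 'Beta'] ['Alpha']
```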

5.1.4 - Combining Index Addresses

The XPath specification defines quite a few addressing mechanisms, and they can be combined in many different ways in order to get interesting combinations:

  • LIST[@type="ordered"][3]: Selects all LIST elements of the type ordered, and returns the third.
  • LIST[3][@type="ordered"]: Selects the third LIST element, but only if it is of the type ordered.
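
Why the order of the two predicates matters can be modelled without an XPath engine. The sketch below uses plain Python lists; the sample data and the nth helper are hypothetical and only mimic the filter-then-index and index-then-filter behaviour described above.

```python
# Hypothetical LIST elements in document order, reduced to their type attribute
# and paired with their position among all LIST siblings.
lists = ["unordered", "ordered", "ordered", "unordered", "ordered"]
indexed = list(enumerate(lists, start=1))


def nth(items, n):
    """XPath-style 1-based index: a one-item list, or empty if out of range."""
    return items[n - 1:n]


# LIST[@type="ordered"][3]: filter first, then take the third of the survivors.
filter_then_index = nth([p for p in indexed if p[1] == "ordered"], 3)

# LIST[3][@type="ordered"]: take the third LIST, then test that single element.
index_then_filter = [p for p in nth(indexed, 3) if p[1] == "ordered"]

print(filter_then_index)  # [(5, 'ordered')]
print(index_then_filter)  # [(3, 'ordered')]
```

With this data the first form returns the fifth LIST overall, while the second returns the third.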

Many more combinations of address operators are listed in section 2.5 of the XPath specification. This is arguably the most useful section of the specification for defining an XSLT transform.

5.2 - Wild Cards

By definition, an unqualified XPath expression selects the set of XML nodes that match the specified pattern.

For example, /HEAD matches all top-level HEAD entries, whereas /HEAD[1] matches only the first.

Wild card  Meaning
*          Matches any element node (not attributes or text).
node()     Matches any node of any kind: element node, text node, attribute node, processing instruction node, namespace node, or comment node.
@*         Matches any attribute node.

In the project database example, /*/PERSON[.="Fred"] matches any PERSON element with the value Fred, whether it sits under a PROJECT or an ACTIVITY element.
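
A small sketch of the * wild card, assuming Python's standard xml.etree.ElementTree module and a hypothetical stand-in for the project database. ElementTree paths are relative, so */PERSON plays the role of /*/PERSON here.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the project database.
doc = ET.fromstring(
    "<DB>"
    "<PROJECT><PERSON>Fred</PERSON></PROJECT>"
    "<ACTIVITY><PERSON>Fred</PERSON></ACTIVITY>"
    "<PROJECT><PERSON>Wilma</PERSON></PROJECT>"
    "</DB>"
)

# */PERSON[.="Fred"]: PERSON elements with the value Fred under any element.
freds = doc.findall("*/PERSON[.='Fred']")
print([p.tag for p in freds])  # ['PERSON', 'PERSON']

# *[PERSON="Fred"]: the enclosing elements themselves, whatever their name.
owners = [e.tag for e in doc.findall("*[PERSON='Fred']")]
print(owners)  # ['PROJECT', 'ACTIVITY']
```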

6 - Extended-Path Addressing

6.1 - double forward slash

So far, all the patterns you have seen have specified an exact number of levels in the hierarchy.

For example, /HEAD specifies any HEAD element at the first level in the hierarchy, whereas /*/* specifies any element at the second level in the hierarchy.

To specify an indeterminate level in the hierarchy, use a double forward slash (//). For example, the XPath expression //PARA selects all paragraph elements in a document, wherever they may be found.

The // pattern can also be used within a path. So the expression /HEAD/LIST//PARA indicates all paragraph elements in a subtree that begins from /HEAD/LIST.
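
A minimal sketch of the double forward slash, assuming Python's standard xml.etree.ElementTree module and a hypothetical document. Because ElementTree paths are relative to an element, //PARA is written .//PARA.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with PARA elements at different depths.
doc = ET.fromstring(
    "<DOC>"
    "<HEAD><LIST><ITEM><PARA>in list</PARA></ITEM></LIST>"
    "<PARA>in head</PARA></HEAD>"
    "<BODY><PARA>in body</PARA></BODY>"
    "</DOC>"
)

# .//PARA: every PARA below the context node, at any depth, in document order.
everywhere = [p.text for p in doc.findall(".//PARA")]
print(everywhere)  # ['in list', 'in head', 'in body']

# HEAD/LIST//PARA: only the PARA elements inside the HEAD/LIST subtree.
under_list = [p.text for p in doc.findall("HEAD/LIST//PARA")]
print(under_list)  # ['in list']
```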

6.2 - Operator

XPath expressions yield either a set of nodes, a string, a Boolean (a true/false value), or a number.

Operator           Meaning
|                  Alternative. For example, PARA|LIST selects all PARA and LIST elements.
or, and            Returns the or/and of two Boolean values.
=, !=              Equal or not equal, for Booleans, strings, and numbers.
<, >, <=, >=       Less than, greater than, less than or equal to, greater than or equal to, for numbers.
+, -, *, div, mod  Add, subtract, multiply, floating-point divide, and modulus (remainder) operations (e.g., 6 mod 4 = 2).

Expressions can be grouped in parentheses, so you do not have to worry about operator precedence.

Note - Operator precedence is a term that answers the question, “If you specify a + b * c, does that mean (a+b) * c or a + (b*c)?” (Apart from |, which binds most tightly, the table lists the operators roughly from lowest to highest precedence.)
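
The numeric operators can be approximated in Python as a rough sketch. The helper names xpath_mod and xpath_div are hypothetical. The sketch assumes XPath 1.0 semantics, where mod is the remainder of a truncating division (so the result takes the sign of the left operand) and div is floating-point division. Division by zero is not modelled: Python raises an error where XPath yields Infinity or NaN.

```python
import math


def xpath_mod(a, b):
    """Remainder of a truncating division, as XPath 1.0 defines mod."""
    return math.fmod(a, b)


def xpath_div(a, b):
    """Floating-point division, as XPath 1.0 defines div (b must be non-zero here)."""
    return a / b


print(xpath_mod(6, 4))   # 2.0
print(xpath_mod(-5, 2))  # -1.0, whereas Python's -5 % 2 gives 1
print(xpath_div(7, 2))   # 3.5

# Precedence: multiplication binds tighter than addition unless parenthesized.
print(2 + 3 * 4, (2 + 3) * 4)  # 14 20
```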

6.3 - String-Value of an Element

The string-value of an element is the concatenation of all descendant text nodes, no matter how deep. Consider this mixed-content XML data:

<PARA>This paragraph contains a <b>bold</b> word</PARA>

The string-value of the <PARA> element is “This paragraph contains a bold word”. In particular, note that <b> is a child of <PARA> and that the text bold is a child of <b>.

The point is that all the text in all descendants of a node joins in the concatenation to form the string-value.
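
The string-value of the <PARA> example can be reproduced with a short sketch, assuming Python's standard xml.etree.ElementTree module, where itertext() plays the role of collecting the descendant text nodes.

```python
import xml.etree.ElementTree as ET

para = ET.fromstring("<PARA>This paragraph contains a <b>bold</b> word</PARA>")

# itertext() walks every descendant text node in document order.
string_value = "".join(para.itertext())
print(string_value)  # This paragraph contains a bold word

# The element's own .text stops at the first child element.
print(repr(para.text))  # 'This paragraph contains a '
```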

6.4 - normalized

Also, it is worth understanding that the text in the abstract data model defined by XPath is fully normalized. So whether the XML structure contains the entity reference &lt; or < in a CDATA section, the element's string-value will contain the < character. Therefore, when generating HTML or XML with an XSLT stylesheet, you must convert occurrences of < to &lt; or enclose them in a CDATA section. Similarly, occurrences of & must be converted to &amp;.
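
A brief sketch of this normalization, assuming Python's standard xml.etree.ElementTree module: the entity reference and the CDATA section both parse to the same text, and the serializer escapes < and & again on output.

```python
import xml.etree.ElementTree as ET

# The same character written as an entity reference and inside a CDATA section.
from_entity = ET.fromstring("<PARA>a &lt; b</PARA>")
from_cdata = ET.fromstring("<PARA><![CDATA[a < b]]></PARA>")
print(from_entity.text == from_cdata.text)  # True, both are 'a < b'

# When writing XML back out, < and & have to be escaped again.
out = ET.Element("PARA")
out.text = "a < b & c"
serialized = ET.tostring(out, encoding="unicode")
print(serialized)  # <PARA>a &lt; b &amp; c</PARA>
```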

7 - Documentation / Reference

markup/xslt/xpath.txt · Last modified: 2017/06/12 12:59 by gerardnico