Examples of xpath queries to html. Application of the preceding axis. Standard XPath Functions

XPath examples taken from practical experience of parsing information from websites. Only the XPath fragments are shown.

Get h1 title text

//h1/text()

Get title text with class produnctName

//h1[@class="produnctName"]/text()

Get the value of a specific span by class

//span[@class="price"]

Get the value of the title attribute of a button with the class addtocart_button

//input[@class="addtocart_button"]/@title

Link text

//a/text()

Link URL

//a/@href

Image src

//img/@src

The image in the first div after a certain element in the DOM, using the following axis

//h1[@class="produnctName"]/following::div[1]/img/@src

The image in the fourth div by count (counted among the sibling divs)

//div[4]/img/@src
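
The queries above can be tried out with a short script. What follows is only a minimal sketch, assuming Python's standard xml.etree.ElementTree module and a small hypothetical, well-formed page fragment (the markup and values are invented for illustration). ElementTree implements only a subset of XPath: it cannot return text or attribute nodes directly, so the sketch selects the element and then reads .text or .get(), and it writes .// because ElementTree does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product page fragment (not taken from a real site).
PAGE = """
<html><body>
  <h1 class="produnctName">Blue Kettle</h1>
  <span class="price">19.99</span>
  <input class="addtocart_button" title="Add to cart"/>
  <div><img src="/img/kettle.png"/></div>
  <a href="/reviews">Reviews</a>
</body></html>
"""

root = ET.fromstring(PAGE)

# //h1[@class="produnctName"]/text()  ->  select the element, then read .text
title = root.find(".//h1[@class='produnctName']").text

# //span[@class="price"]
price = root.find(".//span[@class='price']").text

# //input[@class="addtocart_button"]/@title  ->  read the attribute with .get()
button_title = root.find(".//input[@class='addtocart_button']").get("title")

# //a/text() and //a/@href
links = [(a.text, a.get("href")) for a in root.findall(".//a")]

# //img/@src
images = [img.get("src") for img in root.findall(".//img")]

print(title, price, button_title, links, images)
```

A full XPath 1.0 engine would accept the original expressions unchanged; the rewriting into .text and .get() is needed only because of the ElementTree subset assumed here.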

XPath (XML Path Language) is a language for querying the elements of an XML document. It was designed to give access to parts of an XML document from XSLT transformation files and is a W3C standard. XPath is intended for navigating the DOM of an XML document.

XML has a tree structure. Every element of the tree has ancestors and descendants, except for the root element, which has no ancestors, and the leaf elements, which have no descendants.

A path is evaluated step by step: at each step, the elements that meet that step's selection conditions are selected, and the result of traversing the whole path is the set of elements that satisfy it.
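
The step-by-step evaluation described above can be sketched in a few lines. This is an illustration only, assuming Python's standard xml.etree.ElementTree module; the document, the walk helper and the element names are hypothetical. Each step is applied to every node selected by the previous step, and the result of the last step is the result of the path.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a root, two inner nodes and three leaves.
DOC = ET.fromstring(
    "<library>"
    "<shelf><book>A</book><book>B</book></shelf>"
    "<shelf><book>C</book></shelf>"
    "</library>"
)


def walk(context_nodes, steps):
    """Apply each step to every node selected by the previous step."""
    for step in steps:
        next_nodes = []
        for node in context_nodes:
            next_nodes.extend(node.findall(step))
        context_nodes = next_nodes
    return context_nodes


# Evaluating the path shelf/book one step at a time...
step_by_step = [b.text for b in walk([DOC], ["shelf", "book"])]

# ...selects the same nodes as evaluating the whole path at once.
whole_path = [b.text for b in DOC.findall("shelf/book")]

# Leaves are the elements without child elements.
leaves = [e.tag for e in DOC.iter() if len(e) == 0]

print(step_by_step, whole_path, leaves)
```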

Notation

  • * - denotes any name or set of characters along the specified axis, for example: * - any child node; @* - any attribute.
  • $name - access to a variable, where name is the name of the variable or parameter.
  • [] - additional selection conditions or, equivalently, the predicate of a location step. The predicate is evaluated as a boolean. If it evaluates to a number, that number is treated as the position of the node, which is equivalent to writing "position()=" before it.
  • {} - if used inside a tag of another language (for example HTML), the XSLT processor treats the contents of the curly braces as XPath.
  • / - denotes a level of the tree, that is, it separates location steps.
  • | - combines results. You can write several paths separated by the | sign, and the result of such an expression includes everything found by any of these paths.

Functions over node sets

  • node-set node()

Returns all nodes. The "*" wildcard is often used instead, but unlike the asterisk, node() also returns text nodes.

  • node-set text()

Returns the set of text nodes.

  • node-set current()

Returns a set containing a single element, the current one. When a set is processed with a predicate, this function is the only way to refer to the current element from inside that predicate.

  • number position()

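Returns the position of the current element in the set being processed. It is meaningful only where such a set exists, for example inside a predicate or an xsl:for-each loop.

  • number last()

Returns the position number of the last element in the set being processed. Like position(), it is meaningful only inside a predicate or a loop.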

  • number count(node-set)

Returns the number of elements in a node-set.

  • string name(node-set?)

Returns the full name of the first node in the set.

  • string namespace-uri(node-set?)

Returns the namespace URI of the first node in the set.

  • string local-name(node-set?)

Returns the name of the first node in the set, without the namespace prefix.

  • node-set id(object)

Returns the element with the given unique identifier.
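
Several of these functions have direct counterparts when a node set is held as a list in a host language. The sketch below is an approximation, assuming Python's standard xml.etree.ElementTree module, which supports last() in predicates but not count(), position(), namespace-uri() or local-name() as functions; the document, the urn:example namespace and the split_name helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two plain items and one item in the urn:example namespace.
DOC = ET.fromstring(
    '<list xmlns:x="urn:example">'
    "<item>one</item><item>two</item><x:item>three</x:item>"
    "</list>"
)

items = DOC.findall("item")

# count(item): the size of the node set.
count = len(items)

# item[last()]: the last node of the set (supported directly by ElementTree).
last_text = DOC.find("item[last()]").text

# position(): 1-based position of each node within the set.
positions = [(pos, el.text) for pos, el in enumerate(items, start=1)]


def split_name(element):
    """Return (namespace-uri, local-name) from ElementTree's '{uri}local' tag."""
    if element.tag.startswith("{"):
        uri, local = element.tag[1:].split("}", 1)
        return uri, local
    return "", element.tag


namespaced = DOC.find("{urn:example}item")
print(count, last_text, positions, split_name(namespaced))
```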

Axes are the foundation of the XPath language. Some axes have abbreviated forms.

  • ancestor:: - Returns the set of ancestors.
  • ancestor-or-self:: - Returns the set of ancestors together with the current element.
  • attribute:: - Returns the set of attributes of the current element. This axis can be abbreviated as "@".
  • child:: - Returns the set of children one level below. This axis name can be omitted altogether.
  • descendant:: - Returns the full set of descendants (that is, both the immediate children and all of their children).
  • descendant-or-self:: - Returns the full set of descendants together with the current element. The expression "/descendant-or-self::node()/" can be abbreviated as "//". Used as a second step, this axis lets you select elements starting from any node rather than only from the root: it is enough to take all descendants of the root node as the first step. For example, the path "//span" selects every span node in the document regardless of its position in the hierarchy, examining the name of the root element and the names of all its descendants at every nesting depth.
  • following:: - Returns the set of elements that come after the current element in document order, excluding its descendants.
  • following-sibling:: - Returns the set of elements at the same level that follow the current one.
  • namespace:: - Returns the set of namespaces in scope for the current element (those declared through xmlns attributes).
  • parent:: - Returns the ancestor one level up. This axis can be abbreviated as "..".
  • preceding:: - Returns the set of elements that come before the current element in document order, excluding its ancestors.
  • preceding-sibling:: - Returns the set of elements at the same level that precede the current one.
  • self:: - Returns the current element. This axis can be abbreviated as ".".
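
The difference between the axes is easiest to see on a small tree. The sketch below is not an XPath engine: it assumes Python's standard xml.etree.ElementTree module, which has no following or preceding axis, and re-implements a few axes by hand from their definitions (document order, minus descendants or ancestors). The tree and all helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical tree:  a -> (b -> (c, d), e -> (f), g)
DOC = ET.fromstring("<a><b><c/><d/></b><e><f/></e><g/></a>")

order = list(DOC.iter())  # all elements in document order
parent_of = {child: parent for parent in order for child in parent}


def ancestors(node):
    """ancestor:: - parent, grandparent, ... up to the root."""
    result = []
    while node in parent_of:
        node = parent_of[node]
        result.append(node)
    return result


def descendants(node):
    """descendant:: - children and all of their children."""
    return list(node.iter())[1:]


def following(node):
    """following:: - everything later in document order, minus descendants."""
    skip = set(descendants(node))
    return [n for n in order[order.index(node) + 1:] if n not in skip]


def preceding(node):
    """preceding:: - everything earlier in document order, minus ancestors."""
    skip = set(ancestors(node))
    return [n for n in order[:order.index(node)] if n not in skip]


def following_sibling(node):
    """following-sibling:: - later children of the same parent."""
    siblings = list(parent_of[node])
    return siblings[siblings.index(node) + 1:]


def tags(nodes):
    return [n.tag for n in nodes]


e = DOC.find("e")
print(tags(preceding(e)), tags(following(e)), tags(ancestors(DOC.find("e/f"))))
```

Note that XPath treats preceding and ancestor as reverse axes for positional predicates; the helpers here only compute which nodes belong to each axis.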

XPath is a query language for the elements of an XML or XHTML document. Like SQL, XPath is declarative: to obtain the data you are interested in, you only need to write a query that describes it, and the XPath interpreter does all the dirty work for you.
Very convenient, isn't it? Let's see what XPath offers for accessing the nodes of a web page.

Creating a query for web page nodes

Here is a small lab exercise in which I demonstrate how to create XPath queries against a web page. You can repeat the queries I give and, more importantly, try writing your own. I hope this makes the article equally interesting to beginners and to programmers who already know XPath from working with XML.

For the lab we will need:
- an XHTML web page;
- the Mozilla Firefox browser with the Firebug and FirePath add-ons (any other browser with visual XPath support will also do);
- a little time.

As the web page for the experiment, I suggest the home page of the World Wide Web Consortium website ("http://w3.org"). This is the organization that develops the XQuery (XPath) languages, the XHTML specification and many other Internet standards.

Task
Using XPath queries, extract information about the consortium's conferences from the XHTML code of the w3.org home page.
Let's start writing XPath queries.
First XPath query
Open the FirePath tab in Firebug, pick the element to analyze with the element selector, and click it: FirePath generates an XPath query for the selected element.

If you selected the title of the first event, the generated query addresses that single element and contains explicit indexes.

After removing the unnecessary indexes, the query matches all elements of the title type:
.//*[@id="w3c_home_upcoming_events"]/ul/li/div/p/a

FirePath highlights the elements that match the query, so you can see in real time which document nodes it selects.

Query for information about conference venues:
.//*[@id="w3c_home_upcoming_events"]/ul/li/div/p[2]

This is how we get the list of sponsors:
.//*[@id="w3c_home_upcoming_events"]/ul/li/div/p[3]

XPath syntax

Let's go back to the queries we created and see how they are structured.
Let's look at the first query in detail:
.//*[@id="w3c_home_upcoming_events"]/ul/li/div/p/a

I have split this query into three parts to demonstrate the capabilities of XPath. (The division into parts is arbitrary.)

First part
.// - recursive descent through zero or more levels of the hierarchy from the current context. In our case, the current context is the document root.

Second part
* - any element,
[@id="w3c_home_upcoming_events"] - a predicate that selects the node whose id attribute equals "w3c_home_upcoming_events". XHTML element IDs must be unique, so the query "any element with a specific ID" should return exactly the one node we are looking for.

In this query we can replace * with the exact node name, div:
div[@id="w3c_home_upcoming_events"]

Thus we descend the document tree to the div[@id="w3c_home_upcoming_events"] node we need. We do not care at all which nodes the DOM tree consists of or how many levels of hierarchy lie above it.

Third part
/ul/li/div/p/a - the XPath path to a specific element. The path consists of location steps and node tests (ul, li, etc.). Steps are separated by the "/" (slash) character.
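
The three parts can be checked separately on a small sample. The following is a sketch under stated assumptions: it uses Python's standard xml.etree.ElementTree module, and the markup is a hypothetical stand-in for the events block (the real w3.org page may be structured differently and is not guaranteed to be well-formed XML).

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the events block of the page.
PAGE = ET.fromstring("""
<html><body>
  <div id="w3c_home_upcoming_events">
    <ul>
      <li><div>
        <p><a href="/events/1">Conference One</a></p>
        <p>Berlin, Germany</p>
        <p>Sponsor A</p>
      </div></li>
      <li><div>
        <p><a href="/events/2">Conference Two</a></p>
        <p>Tokyo, Japan</p>
        <p>Sponsor B</p>
      </div></li>
    </ul>
  </div>
</body></html>
""")

# Parts one and two: recursive descent to any element with the given id.
container = PAGE.findall(".//*[@id='w3c_home_upcoming_events']")

# The same node, addressed by its exact name instead of *.
container_div = PAGE.findall(".//div[@id='w3c_home_upcoming_events']")

# Part three appended: the path from the container down to the title links.
titles = PAGE.findall(".//*[@id='w3c_home_upcoming_events']/ul/li/div/p/a")

print([a.text for a in titles], [a.get("href") for a in titles])
```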

XPath collections
It is not always possible to reach the node of interest using a predicate or location steps alone. Very often there are many nodes of the same type at one level of the hierarchy, and you need to select "only the first" or "only the second" of them. Collections exist for such cases.

XPath collections let you access an element by its index. The indexes follow the order in which the elements appear in the original document. Numbering in collections starts from one.

Since the "venue" is always the second paragraph, after the "conference name", we get the following query:
.//*[@id="w3c_home_upcoming_events"]/ul/li/div/p[2]
Here p[2] is the second p element within each /ul/li/div node.

Similarly, we can get the list of sponsors with the query:
.//*[@id="w3c_home_upcoming_events"]/ul/li/div/p[3]
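
Index access is applied per parent: p[2] is evaluated inside each div, not once over the whole result. A minimal sketch, again assuming Python's standard xml.etree.ElementTree module and hypothetical sample markup in place of the real page:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the events block; names and values are invented.
PAGE = ET.fromstring("""
<div id="w3c_home_upcoming_events">
  <ul>
    <li><div><p>Conference One</p><p>Berlin, Germany</p><p>Sponsor A</p></div></li>
    <li><div><p>Conference Two</p><p>Tokyo, Japan</p><p>Sponsor B</p></div></li>
  </ul>
</div>
""")

# The context node is the container itself, so the path starts at ul.
names = [p.text for p in PAGE.findall("ul/li/div/p[1]")]
venues = [p.text for p in PAGE.findall("ul/li/div/p[2]")]
sponsors = [p.text for p in PAGE.findall("ul/li/div/p[3]")]

print(names, venues, sponsors)
```

Each query returns one paragraph per list item, because the index counts from one within every div separately.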

Some XPath functions
XPath has many functions for working with elements inside a collection. I will mention only a few of them.

last():
Returns the index of the last element of the collection, so the predicate [last()] selects the last element.
The query ul/li/div/p[last()] returns the last paragraph in each item of the "ul" list.
There is no first() function. To access the first element, use the index "1".

text():
Returns the text content of an element.
.//a[text()="Archive"] - we get all links with the text "Archive".
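
Both functions can be tried on a small sample. This is a sketch under assumptions: it relies on Python's standard xml.etree.ElementTree module (version 3.7 or later), which supports [last()] but has no text() test, so the comparison with the element's text is written as [.='Archive'] instead. The markup is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: list items with a varying number of paragraphs, plus links.
BLOCK = ET.fromstring("""
<div>
  <ul>
    <li><div><p>Title A</p><p>Place A</p><p>Sponsor A</p></div></li>
    <li><div><p>Title B</p><p>Place B</p></div></li>
  </ul>
  <a href="/news">News</a>
  <a href="/archive">Archive</a>
</div>
""")

# ul/li/div/p[last()]: the last paragraph of each list item.
last_paragraphs = [p.text for p in BLOCK.findall("ul/li/div/p[last()]")]

# There is no first(); index 1 addresses the first paragraph.
first_paragraphs = [p.text for p in BLOCK.findall("ul/li/div/p[1]")]

# Links whose text is "Archive" (ElementTree spelling of a[text()="Archive"]).
archive_links = [a.get("href") for a in BLOCK.findall(".//a[.='Archive']")]

print(last_paragraphs, first_paragraphs, archive_links)
```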

position() and mod:
position() - returns the position of an element in the set.
mod - the remainder of a division.

By combining them we can get:
- odd elements: ul/li[position() mod 2 = 1]
- even elements: ul/li[position() mod 2 = 0]

Comparison operations

  • < - logical "less than"
  • > - logical "greater than"
  • <= - logical "less than or equal to"
  • >= - logical "greater than or equal to"
ul/li[position() > 2] , ul/li[position() <= 2] - the list elements starting from the third, and the complementary set (the first two).
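
Positional filters such as li[position() mod 2 = 1] or li[position() > 2] can be mirrored in a host language by numbering the nodes from one. The sketch below assumes Python's standard xml.etree.ElementTree module, whose XPath subset has no mod operator and no comparisons on position(), so the filtering is done in Python; the five-item list is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical list with five items.
LIST = ET.fromstring(
    "<ul>" + "".join(f"<li>item {n}</li>" for n in range(1, 6)) + "</ul>"
)

# Number the li children from 1, as position() does.
numbered = list(enumerate(LIST.findall("li"), start=1))

# li[position() mod 2 = 1] and li[position() mod 2 = 0]
odd = [el.text for pos, el in numbered if pos % 2 == 1]
even = [el.text for pos, el in numbered if pos % 2 == 0]

# li[position() > 2] and li[position() <= 2]
from_third = [el.text for pos, el in numbered if pos > 2]
first_two = [el.text for pos, el in numbered if pos <= 2]

print(odd, even, from_third, first_two)
```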

Try it yourself

Try to get:
- the URLs of the even-numbered links in the left-hand "Standards" menu;
- the headings of all news items except the first one on the main page of w3c.org.

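XPath in PHP5

// Parse the HTML source held in $HTMLCode into a DOM tree.
$dom = new DomDocument();
$dom->loadHTML($HTMLCode);

// Run the XPath query and print the URL and text of every matching link.
$xpath = new DomXPath($dom);
$_res = $xpath->query('.//*[@id="w3c_home_upcoming_events"]/ul/li/div/p/a');
foreach ($_res as $obj) {
    echo "URL: " . $obj->getAttribute("href");
    echo $obj->nodeValue;
}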
Finally

Using a simple example, we saw the power of XPath for accessing the nodes of a web page.

XPath is the industry standard for accessing XML and XHTML elements and for XSLT transformations.

You can use it to parse any html page. If the source html code contains significant errors in the markup, run it through

XPath shorthand syntax

XPath syntax shortcuts can be quite convenient. Here are the rules:

self::node() can be abbreviated as . ;

parent::node() can be abbreviated as .. ;

child::childname can be abbreviated as childname ;

attribute::childname can be abbreviated as @childname ;

/descendant-or-self::node()/ can be abbreviated as // .

For example, the location path .//PLANET is shorthand for self::node()/descendant-or-self::node()/child::PLANET . You can also abbreviate a positional predicate by omitting "position()=", that is, write a predicate of the form [position()=n] as [n]. Working with XPath paths in the shorthand syntax is much easier. The following list gives a number of example location paths in the shorthand syntax:

PLANET returns the PLANET children of the context node;

* returns all element children of the context node;

text() returns all child text nodes of the context node;

@UNITS returns the UNITS attribute of the context node;

@* returns all attributes of the context node;

PLANET[3] returns the third PLANET child of the context node;

PLANET[last()] returns the last PLANET child of the context node;

*/PLANET returns all PLANET grandchildren of the context node;

/PLANETS/PLANET[3]/NAME[2] returns the second NAME element of the third PLANET element of the PLANETS element;

//PLANET returns all PLANET descendants of the document root;

PLANETS//PLANET returns the PLANET descendants of the PLANETS children of the context node;

//PLANET/NAME returns all NAME elements that have a PLANET parent;

. returns the context node itself;

.//PLANET returns the PLANET descendants of the context node;

.. returns the parent of the context node;

../@UNITS returns the UNITS attribute of the context node's parent;

.//.. returns all parents of the context node's descendants, together with the parent of the context node;

PLANET[NAME] returns the PLANET children of the context node that have NAME children;

PLANET[@UNITS="days"] returns the PLANET children of the context node that have a UNITS attribute with the value "days"; combined with a positional predicate, it returns the child at that position only if that child has a UNITS attribute with the value "days";

PLANET[@COLOR and @UNITS] returns all PLANET children of the context node that have both a COLOR attribute and a UNITS attribute;

//PLANET[not(.=preceding::PLANET)] selects all PLANET elements whose value differs from the value of every preceding PLANET element;

*[1] selects any element that is the first child of its parent;

*[position() <= 5][@UNITS] selects those of the first five children of the context node that have a UNITS attribute.
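
Several of the shorthand paths above can be run as they stand. The sketch below assumes Python's standard xml.etree.ElementTree module and a small hypothetical PLANETS document (attributes and values are invented). ElementTree's XPath subset has no "and" operator, so the combined attribute test is written as two chained predicates, which is equivalent here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the context node is the PLANETS root element.
PLANETS = ET.fromstring("""
<PLANETS>
  <PLANET COLOR="gray" UNITS="days"><NAME>Mercury</NAME></PLANET>
  <PLANET COLOR="white"><NAME>Venus</NAME></PLANET>
  <PLANET UNITS="days"><NAME>Earth</NAME></PLANET>
</PLANETS>
""")


def names(path):
    """Evaluate a path that selects PLANET elements and return their NAME texts."""
    return [planet.find("NAME").text for planet in PLANETS.findall(path)]


all_planets = names("PLANET")                # PLANET children
third = names("PLANET[3]")                   # third PLANET child
last = names("PLANET[last()]")               # last PLANET child
with_days = names("PLANET[@UNITS='days']")   # UNITS attribute equal to "days"
both = names("PLANET[@COLOR][@UNITS]")       # PLANET[@COLOR and @UNITS]
venus = names("PLANET[NAME='Venus']")        # child NAME with the text "Venus"

# */NAME: all NAME grandchildren; .//NAME: all NAME descendants.
grandchildren = [n.text for n in PLANETS.findall("*/NAME")]
descendants = [n.text for n in PLANETS.findall(".//NAME")]

print(all_planets, third, last, with_days, both, venus, grandchildren)
```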


March 2, 2011 at 08:49 pm
