PHP has a SimpleXMLElement class for parsing XML documents. The official documentation gives quite a few examples of how to use it, but does not explain the related concepts.
XML has its own set of concepts. Everything in an XML document is a node: elements (formed by tags; the name inside a tag's <> is the element name, and the topmost element is the root element), attributes (the name="value" pairs that follow the element name inside a start tag), and atomic values (nodes without children, such as an attribute value or the text between <xxx> and </xxx>). An XML element corresponds to a SimpleXMLElement object. For example,
$xml=simplexml_load_string($xmlstr);
Here $xml corresponds to the root element of the document in $xmlstr. The child elements of an element are exposed as properties of its SimpleXMLElement object, and each property name is the name of the child element.
<?php
$xmlstr = <<<XML
<?xml version='1.0' standalone='yes'?>
<movies>
 <movie>
  <title>PHP: Behind the Parser</title>
  <characters>
   <character>
    <name>Ms. Coder</name>
    <actor>Onlivia Actora</actor>
   </character>
   <character>
    <name>Mr. Coder</name>
    <actor>El ActÓr</actor>
   </character>
  </characters>
  <plot>
   So, this language. It's like, a programming language. Or is it a
   scripting language? All is revealed in this thrilling horror spoof
   of a documentary.
  </plot>
  <great-lines>
   <line>PHP solves all my web problems</line>
  </great-lines>
  <rating type="thumbs">7</rating>
  <rating type="stars">5</rating>
 </movie>
</movies>
XML;

$movies=simplexml_load_string($xmlstr);
echo $movies->movie->title . "<br>";
In the above example, $movies corresponds to the root element (<movies>), $movies->movie corresponds to <movies>'s child element <movie>, and $movies->movie->title corresponds to <movie>'s child element <title>.

In XML, an element can have multiple child elements, even with the same name. For example, <movies> may have several <movie> children. So $movies->movie is really a collection of elements, even though it is itself a SimpleXMLElement object. A SimpleXMLElement object can therefore mean different things: a single element, a collection of elements, an attribute of an element, or a collection of attributes of an element.

A SimpleXMLElement object can be traversed with foreach (since PHP 8.0 the class formally implements RecursiveIterator), and it also supports the [] operator with an integer index; the result is again a SimpleXMLElement object. If a SimpleXMLElement obj represents a collection of elements, obj[0] is the first element of the collection. If obj represents a single element, obj[0] is obj itself, and reading obj[i] with i>0 yields null, so a property access chained after it raises a notice or warning. The “->” operator gets a child of obj if obj represents a single element, and a child of obj[0] if obj represents a collection of elements. In our case, the following lines have the same effect.
echo $movies->movie->title . "<br>";
echo $movies->movie[0]->title . "<br>";
echo $movies->movie->title[0] . "<br>";
We can echo $movies->movie->title[0] because SimpleXMLElement has a __toString() method, which returns the text directly inside the element.
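The single-element versus collection behaviour described above can be checked with a small sketch. The two-movie document here is made up for illustration and is not the example document used in this post; the casts to string rely on __toString().

```php
<?php
// Hypothetical two-movie document, invented for this illustration.
$xml = simplexml_load_string(
    '<movies>'
    . '<movie><title>First</title></movie>'
    . '<movie><title>Second</title></movie>'
    . '</movies>'
);

// $xml->movie stands for every <movie> child of the root.
$count = count($xml->movie);

// "->" applied to a collection works on its first element,
// so these two expressions read the same <title>.
$first         = (string) $xml->movie->title;
$firstExplicit = (string) $xml->movie[0]->title;

// An integer index picks another member of the collection.
$second = (string) $xml->movie[1]->title;

// Reading an index that does not exist gives null.
$missing = $xml->movie[5];

echo $count, ' movies: ', $first, ', ', $second, PHP_EOL;
var_dump($missing);
```

Casting to string before comparing or storing a value is a common habit, because otherwise you are still holding a SimpleXMLElement object rather than text.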
A problem arises, however, when you try to print
$movies->movie->great-lines->line
because PHP treats the hyphen in great-lines as the subtraction operator, not as part of the property name. In such cases, enclose the name in curly braces and quotes:
echo $movies->movie->{'great-lines'}->line;
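As a minimal sketch of the two ways to reach a hyphenated element name, the snippet below uses a cut-down document containing only the <great-lines> part. The variable names are chosen for illustration.

```php
<?php
// Cut-down document containing only the hyphenated element.
$xml = simplexml_load_string(
    '<movie><great-lines><line>PHP solves all my web problems</line></great-lines></movie>'
);

// Curly braces with a quoted string keep the hyphen inside the property name.
$viaBraces = (string) $xml->{'great-lines'}->line;

// A variable holding the name works as well.
$name = 'great-lines';
$viaVariable = (string) $xml->$name->line;

echo $viaBraces, PHP_EOL;
echo $viaVariable, PHP_EOL;
```

The variable form is handy when the element name is only known at run time.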
Since a SimpleXMLElement is traversable, you can use foreach to iterate over it, for example:
foreach ($movies->movie->characters->character as $character) {
    echo $character->name, ' played by ', $character->actor, PHP_EOL;
}
If a SimpleXMLElement obj represents a single element, iterating over obj is the same as iterating over obj->children(), i.e., it visits all the children of obj. But if obj represents a collection of elements, iterating over obj is not the same as iterating over obj->children(). The former visits all the elements represented by obj, while the latter visits all the children of the first element in obj.
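The difference between the two kinds of iteration is easier to see side by side. This is a sketch on a made-up document; getName() is used only to record which element each loop visits.

```php
<?php
// Hypothetical document with two <movie> elements, each with two children.
$xml = simplexml_load_string(
    '<movies>'
    . '<movie><title>A</title><year>1999</year></movie>'
    . '<movie><title>B</title><year>2005</year></movie>'
    . '</movies>'
);

// Iterating a collection visits each element of the collection.
$overCollection = [];
foreach ($xml->movie as $movie) {
    $overCollection[] = $movie->getName();
}

// children() on a collection visits the children of its first element.
$overChildren = [];
foreach ($xml->movie->children() as $child) {
    $overChildren[] = $child->getName();
}

// Iterating a single element (picked with [0]) also visits its children.
$overSingle = [];
foreach ($xml->movie[0] as $child) {
    $overSingle[] = $child->getName();
}

echo implode(',', $overCollection), PHP_EOL;
echo implode(',', $overChildren), PHP_EOL;
echo implode(',', $overSingle), PHP_EOL;
```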
The [] operator of SimpleXMLElement also accepts a string. With a string key it does not pick a member of a collection; it returns the attribute with that name. So
echo $movies->movie->rating[1]["type"];
will display “stars”, i.e., the value of the “type” attribute of the second rating element.
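A short sketch of attribute access, using only the two <rating> elements from the example document. It also shows attributes(), which lists every attribute of an element, and isset() for testing whether an attribute is present; the attribute name unit is made up to show the missing case.

```php
<?php
// Only the <rating> elements from the example document.
$xml = simplexml_load_string(
    '<movie><rating type="thumbs">7</rating><rating type="stars">5</rating></movie>'
);

// Integer index first (which element), then string key (which attribute).
$secondType = (string) $xml->rating[1]['type'];

// Without an integer index, the attribute is read from the first element.
$firstType = (string) $xml->rating['type'];

// attributes() lets you walk over all attributes of one element.
$attrs = [];
foreach ($xml->rating[1]->attributes() as $attrName => $attrValue) {
    $attrs[$attrName] = (string) $attrValue;
}

// isset() tells you whether an attribute exists ("unit" is hypothetical).
$hasUnit = isset($xml->rating[0]['unit']);

echo $firstType, ' / ', $secondType, PHP_EOL;
```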
XPath is an important concept in XML: an XPath expression is a string that describes a selection criterion. With it we can select the elements in an XML document that match the criterion. For example,
echo $movies->xpath("/movies/movie/title")[0];
will display the title of the movie. The xpath function of SimpleXMLElement takes an XPath expression as its parameter; this one selects the title element of the movie element of the movies element. Note that the xpath function returns an ordinary PHP array of SimpleXMLElement objects, not a SimpleXMLElement object, so you must use [] to get a member.
XPath syntax can get complex. You can learn XPath in this tutorial. Here are a few commonly used XPath examples:
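Because xpath() hands back a plain array, it is worth checking that the array is not empty before indexing into it. A minimal sketch, assuming a cut-down document and a made-up <director> element that does not exist in it:

```php
<?php
// Cut-down version of the example document.
$xml = simplexml_load_string(
    '<movies><movie><title>PHP: Behind the Parser</title></movie></movies>'
);

// A matching expression gives an array of SimpleXMLElement objects.
$titles = $xml->xpath('/movies/movie/title');
$isArray = is_array($titles);
$firstTitle = (string) $titles[0];

// An expression that matches nothing gives an empty result
// (<director> is hypothetical and not present in the document).
$directors = $xml->xpath('/movies/movie/director');
$director = empty($directors) ? '(none)' : (string) $directors[0];

echo $firstTitle, PHP_EOL;
echo $director, PHP_EOL;
```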
- nodename – select all nodename child elements (direct children of the current element)
- //nodename – select all nodename elements, wherever they appear in the XML document
- /nodename – select the root element named nodename
- /root/parent/child – select the child elements of the parent element of the root element
- parent/child – select the child elements of a parent element that is itself a child of the current element
- nodename[@attributename] – select the elements named nodename that have an attributename attribute
- nodename[@attributename='attributevalue'] – select the elements named nodename whose attributename attribute has the value attributevalue (the value must be quoted)
- nodename[1] – select the first child element named nodename
- nodename[last()-1] – select the second-to-last child element named nodename
- nodename[childnodename>39] – select the nodename elements that have a childnodename element whose value is greater than 39
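Several of the patterns above can be tried out in one go. This sketch uses a made-up three-movie document; the year attribute, the <length> element, and the helper function texts() are all invented for the illustration.

```php
<?php
// Hypothetical document: "year" and <length> are made up for this example.
$xmlstr = <<<XML
<movies>
  <movie year="1999"><title>A</title><rating type="stars">5</rating><length>95</length></movie>
  <movie><title>B</title><length>30</length></movie>
  <movie year="2005"><title>C</title><length>120</length></movie>
</movies>
XML;

$xml = simplexml_load_string($xmlstr);

// Hypothetical helper: run an XPath query and return the matches as strings.
function texts(SimpleXMLElement $context, string $path): array
{
    return array_map('strval', $context->xpath($path));
}

$allMovies    = count($xml->xpath('movie'));                 // nodename
$withYear     = texts($xml, 'movie[@year]/title');           // has the attribute
$year2005     = texts($xml, "movie[@year='2005']/title");    // attribute equals value
$firstMovie   = texts($xml, 'movie[1]/title');               // first child
$lastButOne   = texts($xml, 'movie[last()-1]/title');        // second-to-last child
$longMovies   = texts($xml, 'movie[length>39]/title');       // child value comparison
$starsRatings = texts($xml, "//rating[@type='stars']");      // anywhere in the document

echo implode(',', $withYear), PHP_EOL;
echo implode(',', $longMovies), PHP_EOL;
```

Note that the context for the relative paths is the root element <movies>, because xpath() is called on $xml.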
An XPath beginning with / is called an absolute path; otherwise it is a relative path. You may wonder about the difference between “nodename” and “//nodename”.
$currentnode->xpath("nodename");
$currentnode->xpath("//nodename");
Well, “nodename” selects the elements named “nodename” that are direct children of $currentnode, while “//nodename” selects the elements named “nodename” in the whole XML document, regardless of whether they are descendants or ancestors of $currentnode. So with “//nodename” you may find that the resulting elements are not descendants of the current node. If you want the descendants of the current node without requiring them to be direct children, you can use $currentnode->xpath(".//nodename"). Similarly, if you want the nodename descendants (not necessarily direct children) of a somenode element that is itself a descendant (not necessarily a direct child) of the current node, you should use $currentnode->xpath(".//somenode//nodename").
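The following sketch contrasts the three forms on a made-up library document; the element names library, shelf, box and book are invented for the illustration. The current node is the second <shelf>.

```php
<?php
// Hypothetical document: one book on shelf "a", two on shelf "b"
// (one directly on the shelf, one inside a <box>).
$xmlstr = <<<XML
<library>
  <shelf id="a"><book><title>One</title></book></shelf>
  <shelf id="b">
    <book><title>Two</title></book>
    <box><book><title>Three</title></book></box>
  </shelf>
</library>
XML;

$xml = simplexml_load_string($xmlstr);
$currentnode = $xml->shelf[1];   // the shelf with id="b"

// Direct children of the current node only.
$children = $currentnode->xpath('book');

// Every <book> in the whole document, including shelf "a".
$anywhere = $currentnode->xpath('//book');

// Every <book> below the current node, at any depth.
$descendants = $currentnode->xpath('.//book');

// <title> elements below a <box> that is itself below the current node.
$boxedTitles = array_map('strval', $currentnode->xpath('.//box//title'));

echo count($children), ' ', count($anywhere), ' ', count($descendants), PHP_EOL;
echo implode(',', $boxedTitles), PHP_EOL;
```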