HTML entities in XML and how they are parsed

$dom = new DOMDocument();

$dom->load("test.xml");

$text = $dom->getElementsByTagName("Name")->item(0)->nodeValue;

The lines above get the text between <Name> and </Name>.

Suppose the text between the tags is:

<Name>http://myprogrammingnotes.com/index.php?p1=value1&amp;p2=http%3A%2F%2Fgoogle.com</Name>

What would $text be?

Is $text "http://myprogrammingnotes.com/index.php?p1=value1&amp;p2=http%3A%2F%2Fgoogle.com" or "http://myprogrammingnotes.com/index.php?p1=value1&p2=http://google.com"?

Neither, in fact. DOMDocument parses the text and replaces entity references such as &amp; with the characters they represent. All other characters stay unchanged, including the percent-encoded URL parameters. So the correct answer is:

$text === "http://myprogrammingnotes.com/index.php?p1=value1&p2=http%3A%2F%2Fgoogle.com"
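To check this yourself, here is a minimal sketch. It uses loadXML on an inline string instead of loading test.xml, and the <Root> wrapper element is an assumption made only so the snippet is self-contained.

```php
<?php
// Inline stand-in for test.xml; the <Root> wrapper is hypothetical.
$xml = '<Root><Name>http://myprogrammingnotes.com/index.php?p1=value1&amp;p2=http%3A%2F%2Fgoogle.com</Name></Root>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$text = $dom->getElementsByTagName("Name")->item(0)->nodeValue;

// &amp; has become &, while %3A and %2F are left as they were.
var_dump($text === "http://myprogrammingnotes.com/index.php?p1=value1&p2=http%3A%2F%2Fgoogle.com"); // bool(true)
```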

If you want to extract a URL and pass it as a parameter value in another URL, encode the first URL with urlencode. Otherwise, you may get a "Not Acceptable" error when you visit the second URL. Note that percent-encoded characters already present in the first URL are encoded again: each % becomes %25 in the second URL.
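The following sketch shows the double encoding in practice. The second URL (example.com/redirect.php) and its "target" parameter are made up for illustration; only urlencode, parse_url and parse_str are real PHP functions here.

```php
<?php
// The URL as it comes out of nodeValue; it already contains %3A and %2F.
$first = "http://myprogrammingnotes.com/index.php?p1=value1&p2=http%3A%2F%2Fgoogle.com";

// Hypothetical second URL that carries the first one as a parameter value.
$second = "http://example.com/redirect.php?target=" . urlencode($first);

echo $second, "\n";
// http://example.com/redirect.php?target=http%3A%2F%2Fmyprogrammingnotes.com%2Findex.php%3Fp1%3Dvalue1%26p2%3Dhttp%253A%252F%252Fgoogle.com

// The receiving side decodes once and gets the first URL back unchanged.
parse_str(parse_url($second, PHP_URL_QUERY), $params);
var_dump($params["target"] === $first); // bool(true)
```

The %253A and %252F sequences are the original %3A and %2F with their % encoded again, which is what lets a single decode on the receiving side restore the first URL exactly.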

According to the PHP documentation, urlencode leaves letters, digits, and the characters -_. unchanged, percent-encodes every other character, and replaces spaces with +. Its counterpart, urldecode, decodes all percent-encoded sequences and replaces + with spaces.
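A small round trip makes these rules visible. The sample string is arbitrary and chosen only to contain one character of each kind.

```php
<?php
// Arbitrary sample: unreserved characters, a space, and characters that get encoded.
$raw = "a-b_c.d e&f=g%";

$encoded = urlencode($raw);
echo $encoded, "\n"; // a-b_c.d+e%26f%3Dg%25

// urldecode reverses the encoding, turning + back into a space.
var_dump(urldecode($encoded) === $raw); // bool(true)
```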

If the text between the tags is wrapped in <![CDATA[ and ]]>, nodeValue is the text exactly as written, with no entity conversion. For example, given

<Name><![CDATA[http://myprogrammingnotes.com/index.php?p1=value1&p2=http%3A%2F%2Fgoogle.com&amp;p3=value3]]></Name>

the statement

$text = $dom->getElementsByTagName("Name")->item(0)->nodeValue;

sets $text to

http://myprogrammingnotes.com/index.php?p1=value1&p2=http%3A%2F%2Fgoogle.com&amp;p3=value3
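The CDATA case can be checked the same way. As before, this is only a sketch: the XML is loaded from an inline string with loadXML, and the <Root> wrapper element is an assumption.

```php
<?php
// Inline stand-in for test.xml; the <Root> wrapper is hypothetical.
$xml = '<Root><Name><![CDATA[http://myprogrammingnotes.com/index.php?p1=value1&p2=http%3A%2F%2Fgoogle.com&amp;p3=value3]]></Name></Root>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$text = $dom->getElementsByTagName("Name")->item(0)->nodeValue;

// Inside CDATA the bare & is legal and &amp; is kept literally.
var_dump($text === "http://myprogrammingnotes.com/index.php?p1=value1&p2=http%3A%2F%2Fgoogle.com&amp;p3=value3"); // bool(true)
```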

 
