#1
Hi all,

I'm in the process of setting up a PHP script that reads an HTML file, does a character conversion and then displays the contents of a single HTML tag as follows:

    // Convert the ISO-8859-1 file so that non-ASCII characters become HTML entities
    $str = mb_convert_encoding (file_get_contents ('aktuel.htm'), 'HTML-ENTITIES', 'ISO-8859-1');

    // Dump the converted string for inspection
    file_put_contents ('dmp.htm', $str);

    $dom = DOMDocument::loadHTML ($str);
    $elem = $dom->getElementsByTagName ('h5');
    if ($elem->length) {
        // Text of the first <h5>, shown as hex bytes
        $n = $elem->item (0)->nodeValue;
        var_dump (bin2hex ($n));
    }

What's interesting is that the source HTML file is properly ISO-8859-1 encoded (which the contents of "dmp.htm" verifies). The trouble starts when I retrieve the contents of the first <h5> tag that has an umlaut in it. In this case, the umlaut is screwed up: what used to be a "Ü" (capital U umlaut, ISO-8859-1 0xdc) has now become "Ãœ" (0xc3 0x9c, as the var_dump confirms).

Two things surprise me: that the character changes at all, and that the umlaut is not HTML-encoded as HTML-ENTITIES would suggest. I use PHP version 5.2.1 on a Linux box.

Any thoughts?

Cheers, Christoph
#2
On May 9, 1:16 pm, monochro...@gmail.com wrote:
> What surprises me are two things: that somehow the character changes
> and that the umlaut is not HTML-encoded as HTML-ENTITIES would suggest.
> I use PHP version 5.2.1 on a linux box.
>
> Any thoughts?
>
> Cheers, Christoph

After some :-) research, it turns out that the encoding of the contents of the first <h5> tag has actually changed to UTF-8, hence the strange byte sequence. This raises the question of whether UTF-8 is the default encoding for parsed HTML strings in the DOM package (given that the input was HTML-ENTITIES-conformant to begin with).

Is this a bug in DOMDocument or a feature?

Cheers, Christoph
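For what it's worth, this looks like a feature rather than a bug: the PHP manual states that the DOM extension uses UTF-8, so entities are decoded on parsing and `nodeValue` comes back as UTF-8 regardless of the input encoding. Below is a minimal sketch of reading the value and converting it back to ISO-8859-1. The helper name `first_h5_text` and the inline HTML string are made up for illustration (the string merely stands in for the entity-encoded contents of aktuel.htm), and the code assumes the dom and mbstring extensions are loaded on a reasonably current PHP.

```php
<?php
// Hypothetical helper: text of the first <h5>, exactly as DOM returns it (UTF-8).
function first_h5_text(string $html): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $nodes = $dom->getElementsByTagName('h5');
    return $nodes->length ? $nodes->item(0)->nodeValue : '';
}

// Assumed stand-in for the entity-encoded file contents.
$html = '<html><body><h5>&Uuml;ber</h5></body></html>';

// DOM decodes the entity and returns the text as UTF-8.
$utf8 = first_h5_text($html);

// Convert back to ISO-8859-1 if the rest of the page needs that encoding.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";   // c39c626572
echo bin2hex($latin1), "\n"; // dc626572
```

The first line of output shows the same 0xc3 0x9c pair as in the original post; after the conversion the umlaut is the single byte 0xdc again.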
Tags: behaviour, loadhtml, weird