Computer Webmaster Gaming Console Graphics Forum

Old 05-20-2007, 6:33 PM   #1
monochromec@gmail.com
 
Weird loadHTML behaviour

Hi all,

I'm in the process of setting up a PHP script that reads an HTML file,
does a character conversion and then displays the contents of a single
HTML tag as follows:

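// Convert the ISO-8859-1 source so non-ASCII characters become entities
$str = mb_convert_encoding (file_get_contents ('aktuel.htm'),
    'HTML-ENTITIES', 'ISO-8859-1');

// Dump the converted markup for inspection
file_put_contents ('dmp.htm', $str);

// Call loadHTML on an instance rather than statically
$dom = new DOMDocument ();
$dom->loadHTML ($str);
$elem = $dom->getElementsByTagName ('h5');
if ($elem->length) {
    // Hex-dump the text of the first <h5> element
    $n = $elem->item (0)->nodeValue;
    var_dump (bin2hex ($n));
}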

What's interesting is that the source HTML file is properly ISO-8859-1
encoded (which the contents of "dmp.htm" confirm). The trouble starts
when I retrieve the contents of the first <h5> tag, which has an umlaut
in it. The umlaut comes back mangled: what used to be a "Ü" (capital U
umlaut, ISO-8859-1 0xdc) is now the two-byte sequence 0xc3 0x9c, as the
var_dump confirms. Two things surprise me: that the bytes change at
all, and that the umlaut is not entity-encoded as HTML-ENTITIES would
suggest. I use PHP version 5.2.1 on a Linux box.
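
To make the byte values concrete, here is a minimal sketch of how the same character looks in each representation. It assumes the mbstring extension is loaded; the one-byte sample string and the variable names are made up for illustration and are not taken from aktuel.htm.

```php
<?php
// Hypothetical sample: "Ü" as a single ISO-8859-1 byte (0xdc).
$latin1 = "\xDC";

// Re-encoded as UTF-8, the same character takes two bytes.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8), "\n";    // c39c

// Written as a named HTML entity, it is plain ASCII.
$entity = htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1');
echo $entity, "\n";           // &Uuml;

// Decoding that entity with UTF-8 as the target charset
// gives the same two bytes again.
$decoded = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');
echo bin2hex($decoded), "\n"; // c39c
```

So 0xc3 0x9c is not a different character, just "Ü" in UTF-8, which is probably why it looks wrong when viewed as ISO-8859-1.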

Any thoughts?

Cheers, Christoph

 
Old 05-20-2007, 6:33 PM   #2
monochromec@gmail.com
 

On May 9, 1:16 pm, monochro...@gmail.com wrote:
> [...]
> The trouble starts
> when I retrieve the contents of the first <h5> tag that has an umlaut
> in it. In this case, the umlaut is screwed up - what used to be a
> "Ü" (capital U umlaut, ISO-8859-1 0xdc) has now become 0xc3 0x9c
> [...]


After some :-) research, it turns out that the encoding of the
contents of the first <h5> tag has actually changed to UTF-8, hence
the strange byte sequence. This raises the question of whether the
default encoding for parsed HTML strings in the DOM extension is UTF-8
(given that the input was HTML-ENTITIES-encoded to begin with). Is
this a bug in DOMDocument or a feature?
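
For what it's worth, PHP's DOM extension works with UTF-8 internally, so getting UTF-8 back from nodeValue is the expected behaviour rather than a bug. Here is a minimal sketch of reading the value and converting it back, assuming mbstring is loaded; the inline markup and variable names are invented for illustration.

```php
<?php
// Hypothetical stand-in for the entity-encoded file contents.
$html = '<html><body><h5>&Uuml;ber</h5></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// nodeValue comes back as UTF-8, with the entity already decoded.
$utf8 = $dom->getElementsByTagName('h5')->item(0)->nodeValue;
echo bin2hex($utf8), "\n";   // c39c626572

// Convert explicitly if the rest of the script expects ISO-8859-1.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
echo bin2hex($latin1), "\n"; // dc626572
```

Converting at the point of output, rather than before parsing, keeps the DOM handling in one encoding throughout.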

Cheers, Christoph

 
All times are GMT +1.

