GB2312 encoding in PHP


I have language text stored as variables in text files, which are
'included' by my PHP scripts (is there a better way?). However, I seem
to have a problem with the Simplified Chinese GB2312 encoding.

I thought that most foreign encoding mechanisms would avoid the use of
the quotation mark - however, I saved something out as GB2312 and now
effectively get parse errors (due to premature quotes which appear to
form parts of the Chinese characters themselves).

A few other websites I tested don't seem to have a problem sending me
mail in GB2312, though ... so they must have managed somehow.

Any ideas anyone?



Re: GB2312 encoding in PHP

I am not sure whether by 'included' you mean that you use
"include('somefile')". If so, then "readfile" (or "file_get_contents")
might be what you want to use instead.

If the text files don't, or shouldn't, contain any PHP code, then
"include" shouldn't be used, because the file will be evaluated as PHP,
which obviously can be a bad thing if it is not intended.
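
To illustrate the difference, here is a minimal sketch. The file name and its contents are made up for the example; the bytes \xC4\xE3\xBA\xC3 are two Chinese characters in GB2312, followed by a word in ordinary double quotes. Because "file_get_contents" only returns the bytes as a string, the quotes in the file never reach the PHP parser.

```php
<?php
// Hypothetical file name, used only for this example.
$path = sys_get_temp_dir() . '/body_gb2312.txt';

// Write a sample body: two GB2312 characters, then a quoted ASCII word.
file_put_contents($path, "\xC4\xE3\xBA\xC3 \"quoted\"\n");

// file_get_contents() returns the raw bytes as a string.
// Nothing in the file is evaluated as PHP, so quotes are harmless.
$body = file_get_contents($path);

// readfile($path) would instead send the same bytes straight to the output.
echo strlen($body), " bytes read\n";

// Clean up the temporary file.
unlink($path);
```

The trade-off is that a file read this way can only hold one piece of text, so variables such as a sender address would have to be defined somewhere else.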


If PHP gives you parse errors, then the above (include) might be the
problem. If your browser complains about the returned HTML, then using
"htmlentities()" could probably fix it for you.
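
A small sketch of the escaping step, under some assumptions: it uses "htmlspecialchars" rather than "htmlentities", since only the markup characters need escaping here, and it passes 'GB2312' explicitly as the charset so that two-byte characters are left alone. The function name escapeForHtml is made up for the example.

```php
<?php
// Hypothetical helper: escape GB2312 text for use inside an HTML page.
// The third argument tells PHP how the input bytes are encoded.
function escapeForHtml(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'GB2312');
}

// Two GB2312 characters, a quoted word and a stray HTML tag.
$raw = "\xC4\xE3\xBA\xC3 \"hi\" <b>";

// The Chinese bytes pass through; the quotes and angle brackets are escaped.
echo escapeForHtml($raw), "\n";
```

The page itself would still need to declare GB2312 as its charset for the browser to display the Chinese text correctly.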


Re: GB2312 encoding in PHP

Thanks for that.

I included the file using "include" - I did this because the file
contains several variables relevant to the mail I'm using it to send
out (so $from = "XXX"; $body = "Text body") ... the problem is when I
have Chinese text in the $body variable which happens to include
quotation marks in its composition.

I suppose I can try using file_get_contents instead, but that would
mean moving everything else out of the file and defining it
elsewhere. If it solves my problem I suppose that's acceptable - so
I'll give it a go.

Thanks for the suggestion.


Re: GB2312 encoding in PHP

IIRC, GB2312 doesn't use the 0-127 range for Chinese characters.
Both bytes of a two-byte character have their most significant
bit set. The quotation mark itself can be used in Chinese text though.
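
A quick way to check this for a given string is to look at the top bit of every byte. The sketch below assumes the input is GB2312 in its usual EUC-CN byte form; the helper name allHighBitSet is invented for the example.

```php
<?php
// Hypothetical helper: true if every byte has its most significant bit set,
// i.e. no byte falls in the ASCII range 0-127.
function allHighBitSet(string $bytes): bool
{
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        if ((ord($bytes[$i]) & 0x80) === 0) {
            return false; // found an ASCII-range byte
        }
    }
    return true;
}

// Two Chinese characters in GB2312: four bytes, all above 127.
$chinese = "\xC4\xE3\xBA\xC3";

var_dump(allHighBitSet($chinese));        // Chinese characters only
var_dump(allHighBitSet($chinese . '"'));  // plus a real ASCII quote, byte 0x22
```

If the check fails on text that is supposed to contain only Chinese characters, the ASCII-range bytes were typed or inserted as separate characters and are not halves of a Chinese character.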

Re: GB2312 encoding in PHP

I didn't think that GB2312 included quotation marks within the Chinese
characters either - I know that BIG5 does not have this problem, and
nor does UTF-8. However, what I tried doing was writing some text,
saving it out as GB2312, and then having PHP process it. When PHP
complained, I opened the Chinese text file in standard ISO-8859-1
mode and could visibly see a " included amongst the other garbled
characters in several locations. This is not visible when viewing the
Chinese - suggesting it was forming part of the characters themselves.

If what you say is true, that the 0-127 range is not used (which I can
believe, since I didn't expect it to be used either) - then I am
totally puzzled as to why quotation marks are appearing within my
Chinese text.



Re: GB2312 encoding in PHP

It's possible that the text is corrupted. An editor might simply toss
out the character if it sees that the second byte of a two-byte
character is incorrect. I know IE does that with UTF-8 text.
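
One way to test for that kind of corruption is to walk the bytes and flag any high-bit byte that is not followed by a valid second byte. This is only a rough sketch: it treats every byte above 127 as the first byte of a pair and accepts 0xA1-0xFE as the second byte, which is a simplification of the real GB2312 table. The function name findBrokenPairs is made up.

```php
<?php
// Hypothetical helper: return the offsets of first bytes whose second byte
// is missing or outside the range 0xA1-0xFE.
function findBrokenPairs(string $bytes): array
{
    $bad = [];
    $n = strlen($bytes);
    for ($i = 0; $i < $n; $i++) {
        if (ord($bytes[$i]) < 0x80) {
            continue; // plain ASCII byte, including a genuine quotation mark
        }
        // High-bit byte: the next byte must be a valid second byte.
        $second = ($i + 1 < $n) ? ord($bytes[$i + 1]) : -1;
        if ($second >= 0xA1 && $second <= 0xFE) {
            $i++; // valid pair, skip its second byte
        } else {
            $bad[] = $i; // broken pair starts here
        }
    }
    return $bad;
}

// Second character has lost its second byte; a quote sits in its place.
print_r(findBrokenPairs("\xC4\xE3\xBA\""));
```

An empty result means every two-byte character is intact, so any quotation marks in the file are ordinary ASCII quotes that need escaping before the text goes inside a PHP string.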

Curly quotation marks appear in the 128-255 range in the CP1251
codepage. It's possible that when the text was copied and pasted from
one application to another, the curly quotes were replaced by straight
ones.
