character conversion from MS Word to HTML


Here's a brief description of the problem. My organization has a
client who cuts and pastes information from Microsoft Word documents
into web-based forms, whose contents are then displayed on a website. I
wish to convert the special characters, such as ellipses and trademark
symbols (and whatever else Word might throw at us) into a proper HTML
entity (&trade;) or character reference (&#174;) if the entity does
not exist.

Before you make any suggestions, let me share a brief overview of my
previous attempts at a solution so neither of us wastes his time.
Right now, I'm using a combination of the character map returned by
get_html_translation_table(HTML_ENTITIES) and some kludgy code which
manually maps the UTF-8 byte sequence of an MS Word special character
to its HTML equivalent. For example,

$replace_array[chr(226).chr(128).chr(152)] = "&lsquo;";

I'd like to be able to do the above operation automatically / across
the board for wacky Word characters. I suspect I may need to use the
mbstring functions. If you have any advice, I'm happy to send helpful
folks some chocolate for their troubles.
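
For illustration, here is a minimal sketch of how the across-the-board conversion might look. It assumes the pasted text arrives as UTF-8 (which the chr(226).chr(128).chr(152) sequence above suggests) and that the mbstring extension is loaded. The function name word_to_html() is hypothetical. htmlentities() produces a named entity wherever the HTML 4.01 table has one, and mb_encode_numericentity() then turns whatever non-ASCII characters remain into numeric character references.

```php
<?php
// Hypothetical helper: the name and the two-step approach are assumptions,
// not an established recipe. Requires the mbstring extension.
function word_to_html(string $text): string
{
    // Step 1: named entities where one exists (&trade;, &hellip;, &lsquo;, ...).
    // ENT_SUBSTITUTE replaces invalid UTF-8 with U+FFFD instead of
    // returning an empty string.
    $text = htmlentities($text, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401, 'UTF-8');

    // Step 2: anything still outside ASCII has no named entity, so emit a
    // decimal reference such as &#8209;. Map format: [start, end, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($text, $map, 'UTF-8');
}
```

Calling word_to_html($_POST['body']) (the field name is just an example) would replace the hand-maintained $replace_array. Note that htmlentities() also escapes <, >, &, and quotes, so this only fits if the client is pasting plain text rather than markup. If the form might submit Windows-1252 instead of UTF-8, the input would need converting first, for example with mb_convert_encoding().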
