
Treating text copied from MS Word

Posted by +mrcakey on July 9, 2008, 7:03 am
I've built a MySQL database for a client, plus a web interface for adding,
editing and deleting records in it. When he adds stuff to the database he
copies the text from MS Word. I've tried various substitutions that I've found
hanging around the internet, but nothing's working for the "long dash" that
Word insists on converting normal hyphens to.

This morning I did a bin2hex to see exactly what was being sent from $_POST:

A – long dash -.

41 20 >>>e2 80 93<<< 20 6c 6f 6e 67 20 64 61 73 68 20 2d 2e 20 20

The offending character is the one I've highlighted. As far as I can tell,
it should be getting found by this -

"\xe2\x80\x93", // long dash

but it isn't, which makes me think there's something wrong with the code
I've copied. How to find the hex string? I've tried "\xe2\x80\x93" and
"\xe2x80x93" in addition, but to no avail.
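
For comparison, here is a minimal sketch of how PHP reads the different spellings of that search string. The variable names and the sample input are made up for illustration; the input is built from the same bytes as the bin2hex dump above. Only the double-quoted form with single backslashes produces the three bytes e2 80 93.

```php
<?php
// Different spellings of the search string (variable names are illustrative only).
$escaped = "\xe2\x80\x93";     // double quotes, single backslashes: bytes E2 80 93
$doubled = "\\xe2\\x80\\x93";  // doubled backslashes: literal text, no raw bytes
$single  = '\xe2\x80\x93';     // single quotes: \x escapes are not interpreted
$partial = "\xe2x80x93";       // only the first escape is an escape

echo strlen($escaped), ' ', bin2hex($escaped), "\n"; // 3 e28093
echo strlen($doubled), "\n";                         // 12
echo strlen($single), "\n";                          // 12
echo strlen($partial), "\n";                         // 7

// Sample input using the same bytes as the dump: "A", space, E2 80 93, ...
$input = "A \xe2\x80\x93 long dash -.";

echo str_replace($escaped, '-', $input), "\n";       // A - long dash -.
echo str_replace($doubled, '-', $input) === $input ? "unchanged" : "changed", "\n";
```

On its own, then, the single-backslash form does match the character in the dump.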

It's driving me scatty!

Any help much appreciated.

$search = array(
    chr(145), // Windows-1252 left single quote
    chr(146), // Windows-1252 right single quote
    chr(147), // Windows-1252 left double quote
    chr(148), // Windows-1252 right double quote
    chr(151), // Windows-1252 em dash
    chr(196),
    '?o', // left side double smart quote
    '?', // right side double smart quote
    '?~', // left side single smart quote
    '?T', // right side single smart quote
    '?', // ellipsis
    '?"', // em dash
    '?"', // en dash
    "\xe2\x80\xa6", // ellipsis
    "\xe2\x80\x93", // long dash
    "\xe2\x80\x94", // long dash
    "\xe2\x80\x9c", // double quote opening
    "\xe2\x80\x9d", // double quote closing
    "\xe2\x80\xa2" // dot used for bullet points
);
$replace = array(
    "'",
    "'",
    '"',
    '"',
    '-',
    '-',
    '"',
    '"',
    "'",
    "'",
    "&hellip;",
    "-",
    "-",
    '&hellip;',
    '-',
    '-',
    '"',
    '"',
    '*'
);
// Dump the raw bytes received, then apply the substitutions.
echo '<p>'.bin2hex( $_POST['short_desc'] ).'</p>';
$short_desc = str_replace($search, $replace, $_POST['short_desc']);
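
One possible cause, which nobody in the thread confirms: str_replace() works through array arguments in order, and chr(147) is the single byte 0x93, which is also the last byte of the UTF-8 sequence e2 80 93. If the input really is UTF-8, as the dump suggests, that byte gets swapped for a quote before the three-byte entry further down the list is ever tried. The sketch below uses a cut-down pair of entries and made-up variable names to show the effect of the ordering.

```php
<?php
// Same bytes as the bin2hex dump: "A", space, E2 80 93, " long dash -."
$input = "A \xe2\x80\x93 long dash -.";

// Single byte first (the order used in the posted array):
// chr(147) is 0x93, so it eats the last byte of E2 80 93.
$bytesFirst = str_replace(
    array(chr(147), "\xe2\x80\x93"),
    array('"', '-'),
    $input
);
echo bin2hex($bytesFirst), "\n"; // contains e28022, the dash entry never matched

// Multi-byte UTF-8 sequence first, single byte afterwards.
$utf8First = str_replace(
    array("\xe2\x80\x93", chr(147)),
    array('-', '"'),
    $input
);
echo $utf8First, "\n"; // A - long dash -.
```

If this is what is happening, moving the "\xe2..." entries ahead of the chr() entries should be enough for this character. Dropping the chr() entries altogether when the form data is known to be UTF-8 avoids the same collision for other characters, at the cost of no longer handling Windows-1252 input.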

+mrcakey



Posted by I V on July 10, 2008, 7:25 pm
On Wed, 09 Jul 2008 12:03:57 +0100, +mrcakey wrote:
> The offending character is the one I've highlighted. As far as I can
> tell, it should be getting found by this -
>
> "\xe2\x80\x93", // long dash

You want to use one backslash here, not two. But, rather than specifying
the search-and-replace yourself, it's probably easier to use
htmlentities(). You need to know what encoding your data has been sent in
(it looks, from your post, like you're receiving UTF-8), and then call it like so:

$short_desc = htmlentities($_POST['short_desc'], ENT_COMPAT, 'UTF-8');
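
As a rough illustration of what that call does, here is a small sketch using a hard-coded stand-in for $_POST['short_desc'] (the variable names are made up). It assumes the input is valid UTF-8, as the dump in the question suggests.

```php
<?php
// Stand-in for $_POST['short_desc'], built from the bytes in the dump.
$raw = "A \xe2\x80\x93 long dash -.";

// Convert every character that has a named HTML entity, reading the input as UTF-8.
$encoded = htmlentities($raw, ENT_COMPAT, 'UTF-8');

echo $encoded, "\n"; // A &ndash; long dash -.
```

Note that this stores HTML entities rather than plain hyphens, so it suits text that is only ever shown in a web page. The plain ASCII hyphen is left alone.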
