What chars to be escaped/not and when?

Hi all,

I am pretty new to PHP and am stuck on what I think is a generic
string handling problem.

I need to read and manipulate some HTML files, and some substring
searches fail even though the strings are clearly there (see the HTML
chunk I need to edit at the end of this email).

In particular, the following functions are "randomly" working for me:

1) str_replace:
$chunk = str_replace('<!-- /templates/patternFinder/freePattern.txt -->',
'',$chunk); ==> it works (finds and removes the string)

$chunk = str_replace('"20"><!-- /templates/patternFinder/freePattern.txt --><form method="post" ','',$chunk);
==> it doesn't work (does not find and remove the substring)

2) preg_match_all:
$pattern = '~<h6>(.*?)</h6><img src=(.*?) ~si';
if (preg_match_all($pattern, $chunk, $matches)>0) { ...
==> it works

$pattern = '~<h6>(.*?)</h6><img src="(.*?)" ~si';
if (preg_match_all($pattern, $chunk, $matches)>0) { ...
==> it doesn't work (it does not return any match)

I believe this has to do with characters that need to be escaped, but I
have still not worked out which characters need escaping and which do not.

I have also tried the addcslashes function, and switching between
single- and double-quoted string delimiters, without success.

I use PHP 5.2.3(?) with IIS locally.

I would really appreciate any help or reply and can provide more
information, if needed.

Thanks a lot.


HTML file chunk
<table width="100%"><tbody><tr><td colspan="3" valign="top"
align="left"><h6>Roll-Down Wristers</h6><img src="http://
www.lionbrand.com/stores/lionbrand/thumbs/81000ada.jpg" alt="Image of
Roll-Down Wristers" width="150" border="0"><br></td><td></td><td
valign="top" align="right" height="20"><table width="400" border="0"
cellspacing="0"><tbody><tr><td valign="top" align="right"
height="20"><!-- /templates/patternFinder/freePattern.txt --><form
method="post" name="kitform1922242962" action="http://
www.lionbrand.com/cgi-bin/patternBuyer.cgi"><input name="qty"
value="1" type="hidden"><input name="itemKey" value="1922242962"
type="hidden"><input name="store" value="/stores/eyarn"
type="hidden"><input name="kit" value="1" type="hidden"><input
name="transNum" id="tn1922242962" value="" type="hidden"><input
name="sourceItem" value="" type="hidden"><input name="su"
id="su1922242962" value="" type="hidden"><table style="border-color:
rgb(217, 203, 194); border-collapse: collapse;"
border="1"><tbody><tr><td class="B1" id="b11922242962"
onmouseover="bgOn('b11922242962','T3b');" onmouseout="bgOff
('b11922242962');" width="100"><a class="B1a" href="http://
www.lionbrand.com/patterns/81000AD.html?noImages=">Free Pattern</a></
td><td class="B1" id="b21922242962" onmouseover="bgOn
('b21922242962','T3b');" onmouseout="bgOff('b21922242962');"
width="100"><a class="B1a" href="javascript:
document.kitform1922242962.submit();">Buy Materials</a></td></tr></

Re: What chars to be escaped/not and when?

stefcollect@googlemail.com wrote:
When using preg you need to escape < > [ ] / " ' . *
There may be more, but that's all I can think of at the moment.
So your pattern

'~<h6>(.*?)</h6><img src=(.*?) ~si';
should be

'~\<h6\>(.*?)\<\/h6\>\<img src=(.*?) ~si';

However, I'm not sure using (.*?) would match correctly.

http://us2.php.net/manual/en/reference.pcre.pattern.syntax.php may help you

Re: What chars to be escaped/not and when?

trookat wrote:
@OP:  your escaping looks fine.

@trookat:  escaping all of those characters is not necessary here.

The angle brackets don't need to be escaped unless you're using them
as delimiters in the expression.  To escape PCRE regex chars, use
preg_quote().  The correct list of metacharacters is contained in the
resource you link at the end of your post.
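
A minimal sketch of both points, assuming "~" as the delimiter (as in
the OP's patterns).  The sample strings are shortened, hypothetical
stand-ins for the OP's markup, not the real file contents:

```php
<?php
// Angle brackets and slashes are left unescaped here, because "~"
// (not "/" or "<") is the delimiter.
preg_match('~<h6>(.*?)</h6>~', '<h6>Roll-Down Wristers</h6>', $m);
echo $m[1], "\n"; // Roll-Down Wristers

// For a literal string of unknown content, let preg_quote() do the
// escaping.  The second argument is the delimiter, so it is escaped too.
$needle = '<!-- /templates/patternFinder/freePattern.txt -->';
$quoted = preg_quote($needle, '~');

$chunk  = 'height="20"><!-- /templates/patternFinder/freePattern.txt --><form';
$result = preg_replace('~' . $quoted . '~', '', $chunk);
echo $result, "\n"; // height="20"><form
```

preg_quote() escapes more than is strictly needed (including < and >),
which is harmless: in PCRE a backslash before a non-alphanumeric
character just means that literal character.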

Whether (.*?) matches correctly depends on the rest of the pattern.
For the OP's first expression:

   $pattern = '~<h6>(.*?)</h6><img src=(.*?) ~si';

the second capture group will contain everything up to the next space,
which may or may not be right, depending on the path in the src attribute.

In the second expression:

   $pattern = '~<h6>(.*?)</h6><img src="(.*?)" ~si';

the second capture group will correctly contain everything up to
the next double quote.  It would be more efficient, though, to use:

   $pattern = '~<h6>(.*?)</h6><img src="([^"]+)" ~si';
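
To compare the three expressions side by side, here is a small sketch.
The $chunk string is a shortened, hypothetical version of the OP's
markup (the example.com URL is made up), so the real file may behave
differently:

```php
<?php
// Hypothetical, shortened sample of the markup in question.
$chunk = '<h6>Roll-Down Wristers</h6>'
       . '<img src="http://www.example.com/thumbs/81000ada.jpg" alt="Image" width="150">';

$patterns = array(
    'unquoted' => '~<h6>(.*?)</h6><img src=(.*?) ~si',
    'lazy'     => '~<h6>(.*?)</h6><img src="(.*?)" ~si',
    'negated'  => '~<h6>(.*?)</h6><img src="([^"]+)" ~si',
);

$results = array();
foreach ($patterns as $name => $pattern) {
    preg_match_all($pattern, $chunk, $matches);
    // $matches[2] holds every capture of the second group; take the first.
    $results[$name] = isset($matches[2][0]) ? $matches[2][0] : null;
    echo $name, ': ', var_export($results[$name], true), "\n";
}
```

On this sample all three match.  The unquoted version captures the
surrounding double quotes as part of the value, while the other two
capture only the URL.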

@OP:  again, at first glance, your escaping looks fine, so I'd check
your data.  Personally, I didn't want to read through your chunk of
whitespace-devoid markup, so you might want to verify that it really
contains the exact strings you are searching for.
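
One way to check the data, sketched under a loud assumption: that the
real file has a line break where the post shows a space between "<form"
and "method".  That is only a guess at the cause, and $chunk below is a
hypothetical, shortened sample:

```php
<?php
// ASSUMPTION: a "\n" sits between "<form" and "method" in the real data.
$chunk  = "height=\"20\"><!-- /templates/patternFinder/freePattern.txt --><form\n"
        . "method=\"post\" name=\"kitform\">";
$needle = '"20"><!-- /templates/patternFinder/freePattern.txt --><form method="post" ';

// str_replace() and strpos() compare bytes literally, so nothing in the
// needle needs escaping, but every whitespace character must match.
$found = strpos($chunk, $needle);
var_dump($found); // bool(false), because of the "\n"

// Collapse each run of whitespace to a single space, then try again.
$normalized = preg_replace('~\s+~', ' ', $chunk);
$cleaned    = str_replace($needle, '', $normalized);
echo $cleaned, "\n"; // height=name="kitform">
```

If strpos() returns false on the real data too, dumping the relevant
part of the string with var_dump() should show what actually sits
between the tags.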

