Removing empty tags

I've just started changing my processing over to HTML::HTML5::Parser,
so please bear with me on this.

I've been using a regex to remove empty tags, but I see one that's not
working so I assume there's either a typo, or an error in the logic.

I'm trying to convert this:

<span class="Apple-style-span" style="font-family: Arial, Verdana,
Helvetica, sans-serif; "><br></span>



It should also catch <span...></span> (with nothing inside), or
<span...> </span> (with a whitespace inside).

"class" and "style" can be anything (or non-existent), so I'm just
trying to remove <span, followed by anything (or nothing) to the first
">".

Here's what I'm using:

$text =~ s/<span[^>]*>\s*<\/span>/ /gi;
$text =~ s/<span[^>]*>(<br>)*<\/span>/$1/gi;

This doesn't appear to work, though. The string I posted above
actually came through verbatim, so the match must have failed.

Of course, I know that this would fail on nested <span></span> tags,
which is why I'm switching over to HTML::HTML5::Parser. But in the
meanwhile, why did this one not match?
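
A side note on the second pattern, offered as a sketch rather than a diagnosis of the failure: in Perl a quantified group such as (<br>)* leaves only its last repetition in $1, so a span holding several <br> tags would be reduced to a single <br>. Wrapping the whole run in one capture keeps all of them. The helper name strip_empty_spans is hypothetical and not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): the same two
# substitutions, except that the capture wraps the whole run of <br> tags.
sub strip_empty_spans {
    my ($text) = @_;

    # Empty or whitespace-only span becomes a single space.
    $text =~ s/<span[^>]*>\s*<\/span>/ /gi;

    # Span holding only <br> tags is replaced by those <br> tags.
    $text =~ s/<span[^>]*>((?:<br>)+)<\/span>/$1/gi;

    return $text;
}

print strip_empty_spans('<span class="x"><br><br></span>'), "\n";
print strip_empty_spans('a<span> </span>b'), "\n";
```

Using + instead of * in the second pattern also avoids an undefined $1 when there is no <br> at all; that case is already handled by the first substitution.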

Re: Removing empty tags

It works for me.

use warnings;
use strict;

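$_ = '<span class="Apple-style-span" style="font-family: Arial, Verdana,
Helvetica, sans-serif; "><br></span>';

# the two substitutions from the original post, applied to $_
s/<span[^>]*>\s*<\/span>/ /gi;
s/<span[^>]*>(<br>)*<\/span>/$1/gi;

print "$_\n";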

If you can post a short and complete program that we can run that
duplicates the problem you are having, then we can surely help
you fix it...

Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg0cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.

Re: Removing empty tags

That's really pretty much all there is! I'll paste the whole function
below; the only thing I'm leaving out is the part at the top where it
declares a few variables, logs the user in (which doesn't affect the
$text variable), and then prints the data to MySQL.

The data comes from a contenteditable, and when people paste things it
needs to be manipulated a bit, which is mostly what this function
does. I don't have a sample of raw content (I don't save it before it
runs through the function), but here's a sample of a complete string
that was printed (I left the content because I thought you guys might
get a kick out of it):

<span class="Apple-style-span" style="font-family: Arial, Verdana,
Helvetica, sans-serif; "><b>"We ALL got problems....If you're gonna be
dumb, ya gotta be tough."</b></span><br><br><span class="Apple-style-
span" style="font-family: Arial, Verdana, Helvetica, sans-serif;

And the function:

sub fixtext {
  my ($text) = @_;

  $text =~ s/&nbsp;/ /gi;

  # Convert <em> to <i> and <strong> to <b>, saves a few steps later
  $text =~ s/<em>(.*?)<\/em>/<i>$1<\/i>/gsi;
  $text =~ s/<strong>(.*?)<\/strong>/<b>$1<\/b>/gsi;

  # Strip Javascript
  $text =~ s/<script.*?>.*?<\/script>//gsi;
  $text =~ s/onmouseover=".*?"//gsi;
  $text =~ s/onclick=".*?"//gsi;

  ### Only Allow Specified Tags
  my $lt=chr(1);
  my $gt=chr(2);
    $text =~ s/<br>/$lt br $gt/gi;

    # \/? makes the slash optional so opening tags are protected as well
    $text =~ s/<(\/?)(div.*?)>/$lt$1$2$gt/gsi;
    $text =~ s/<(\/?)(span.*?)>/$lt$1$2$gt/gsi;

    $text =~ s/<(\/?)(table.*?)>/$lt$1$2$gt/gsi;
    $text =~ s/<(\/?)(tr.*?)>/$lt$1$2$gt/gsi;
    $text =~ s/<(\/?)(td.*?)>/$lt$1$2$gt/gsi;

    $text =~ s/<(\/?)(b|p)>/$lt$1$2$gt/gsi;
    $text =~ s/<(\/?)(u|i)>/$lt$1$2$gt/gsi;

    $text =~ s/<(\/?)(font.*?)>/$lt$1$2$gt/gsi;

    $text =~ s/<(\/?)(img.*?)>/$lt$1$2$gt/gsi;

    # delete all other tags
    $text =~ s/<.+?>//gs;

  $text =~ s/$lt/</g;
  $text =~ s/$gt/>/g;
  $text =~ s/< br >/<br>/gi;

  # Strip Word junk
  $text =~ s/Normal 0 false.*?}//gsi;
  $text =~ s/Normal 0 MicrosoftInternetExplorer4.*?}//gsi;
  $text =~ s/\/\* Style Definitions \*\/.*?}//gsi;
  $text =~ s/Normal\.dotm .*? false false//gsi;

  $text =~ s/white-space: nowrap;*//gsi;
  $text =~ s/style="(\s*)"//gsi;

  # Strip empty tags
  $text =~ s/<font[^>]*>\s*<\/font>/ /gi;
  $text =~ s/<font[^>]*>(<br>)*<\/font>/<br><br>/gi;

  $text =~ s/<span[^>]*>\s*<\/span>/ /gi;
  $text =~ s/<span[^>]*>(<br>)*<\/span>/$1/gsi;

  $text =~ s/<i>(\s*)<\/i>/$1/gi;
  $text =~ s/<b>(\s*)<\/b>/$1/gi;
  $text =~ s/<u>(\s*)<\/u>/$1/gi;

  $text =~ s/<div>\s*<\/div>/<br>/gi;
  $text =~ s/<div>(.*?)<\/div>/<br><br>$1/gsi;

  # Limit repeating characters
  $text =~ s/(.)\1{4,}/$1$1$1$1/g;   # runs of five or more collapse to four

  # Strip opening, trailing, or repeating whitespace, <br>
  $text =~ s/\s+/ /gs;
  $text =~ s/^\s+|\s+$//g;

  $text =~ s/(<br><br>)+/<br><br>/gi;
  $text =~ s/^(<br>)+|(<br>)+$//gi;

  return $text;
}

Re: Removing empty tags

On 24.02.2011 06:11, jwcarlton wrote:

We are not interested in whole long functions, only in the relevant
parts.

First: try the string you have posted. Your function will remove the
second span part!

And then: why don't you output the string before putting it in your
function? You need to look at the input!
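
One way to look at the input, as a minimal sketch: dump the string with control characters escaped before it goes into the function, so stray tabs, newlines or non-printing bytes become visible. The helper name show_raw and the sample string are assumptions for illustration; Data::Dumper ships with Perl.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: return the string as a double-quoted Perl literal,
# so tabs, newlines and other control characters show up as escapes.
sub show_raw {
    my ($label, $string) = @_;
    local $Data::Dumper::Useqq  = 1;   # escape control characters
    local $Data::Dumper::Terse  = 1;   # no "$VAR1 =" prefix
    local $Data::Dumper::Indent = 0;   # keep everything on one line
    return "$label: " . Dumper($string) . "\n";
}

# Made-up sample input; in the real program this would be the raw value
# that is about to be passed to fixtext.
my $input = "<span><br>\t<b></b></span>\n";
print STDERR show_raw('before fixtext', $input);
```

Logging the same string again after the function returns gives a before/after pair to compare.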

The solution is probably simple: you are doing a lot of replacements.
Assume the input is "<span><br><b></b></span>". Then you don't remove the
span, but later you remove the b. If you reverse the order, you would also
remove the span.

So you can try running the fixtext function more than once or try to
change the order of your 10000 replacements.
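
The ordering effect described above can be reproduced with a cut-down sketch. The three substitutions are copied from the "Strip empty tags" part of fixtext, in the same order; the helper names strip_once and strip_until_stable are hypothetical. Repeating the pass until the text stops changing is one way to "run it more than once" without guessing how many passes are needed.

```perl
use strict;
use warnings;

# Hypothetical helper: one pass over a subset of the "strip empty tags"
# rules, in the same order as in fixtext (span first, then b).
sub strip_once {
    my ($text) = @_;
    $text =~ s/<span[^>]*>\s*<\/span>/ /gi;
    $text =~ s/<span[^>]*>(<br>)*<\/span>/$1/gsi;
    $text =~ s/<b>(\s*)<\/b>/$1/gi;
    return $text;
}

# Hypothetical helper: repeat the pass until nothing changes any more.
sub strip_until_stable {
    my ($text) = @_;
    my $previous;
    do {
        $previous = $text;
        $text     = strip_once($text);
    } until $text eq $previous;
    return $text;
}

my $input = '<span><br><b></b></span>';
print strip_once($input), "\n";          # prints <span><br></span>
print strip_until_stable($input), "\n";  # prints <br>
```

In the first pass the span rules run while the empty <b></b> is still inside the span, so they do not match; only the second pass sees "<span><br></span>".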

- Wolf

Next time please try to post a short program that one can run without
changing/adding anything! Often writing such a short program will point
you to the problem so that you can solve it on your own.
