The best approach to simplify/clean-up html code

This is basically a repost of a request asked elsewhere; I'm reposting
because I don't like the C# solution given as the answer. I prefer Perl.

I.e., I'm looking for a tool to simplify HTML markup as much as
possible. If there isn't one out there already, I don't mind rolling up
my sleeves and writing one. I did find one, HTML::Clean::Human --

    This is an html syntax filter/reformatter.

    My initial temptation was to simply seek a solution such as html
to text. But then I realized html code may have links and other ephemera
that would be desirable to keep.

    What I want it to get rid of: all the stupid html things such as
inline font declarations etc.

    This code is useful if you edit html, but you have to do it maybe  
from already existing html that some whacko wysiwyg junk spat out. Run it  
through this and voila.

However, UTSL reveals that it cannot simplify html mark up like this:

  <a href=" "
style="color: rgb(7, 85, 215); text-decoration: none; font-family: 'DejaVu
Sans', 'Bitstream Vera Sans', sans-serif; font-size: 16px; font-style:
normal; font-variant: normal; font-weight: normal; letter-spacing:
normal; line-height: normal; orphans: auto; text-align: left;
text-indent: 0px; text-transform: none; white-space: normal; widows: auto;
word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color:
rgb(255, 255, 255);"><span class="srcversion" title="unstable">0.12</span></a>

So, what's the best approach to simplify/clean up HTML markup?
Is there a better way than the approach used in that module?

PS. My ultimate goal is to come up with something that satisfies the
restrictions on which HTML tags are allowed on Stack Exchange sites.


Re: The best approach to simplify/clean-up html code


How about HTML to Markdown, and then from Markdown back to HTML?

Moreover, Pandoc can also generate JSON, which can be post-processed by
Perl and then converted to any format you would like. See the section on
JSON filters in the Pandoc documentation.

dirty HTML -> (pandoc) -> JSON -> (your Perl script) -> JSON ->
(pandoc) -> clean HTML.
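
For illustration, the middle step of that pipeline might look roughly like
the sketch below. It is written in Python only to stay dependency-free;
the same tree walk should translate to Perl directly. The assumptions are
mine, not from the post: Span, Div and Link nodes are taken to carry a
pandoc attribute triple [id, classes, key-values] as the first item of
their "c" field (older pandoc versions have no attributes on Link, so
check the JSON your pandoc actually emits), and the names strip_attrs and
ATTR_FIRST are made up for this example.

```python
import json
import sys

# Node types assumed to hold an attribute triple as their first content
# item. This set is an assumption of the sketch, not a complete inventory.
ATTR_FIRST = {"Span", "Div", "Link"}


def strip_attrs(node):
    """Return a copy of a pandoc JSON tree with selected attributes blanked."""
    if isinstance(node, list):
        return [strip_attrs(item) for item in node]
    if isinstance(node, dict):
        cleaned = {key: strip_attrs(value) for key, value in node.items()}
        content = cleaned.get("c")
        if cleaned.get("t") in ATTR_FIRST and isinstance(content, list) and content:
            content[0] = ["", [], []]  # empty id, no classes, no key-values
        return cleaned
    return node  # strings, numbers, booleans, None pass through unchanged


def main():
    """Filter entry point: JSON document on stdin, cleaned JSON on stdout."""
    json.dump(strip_attrs(json.load(sys.stdin)), sys.stdout)
```

To use it as a filter, call main() at the bottom of the script and put the
script between a pandoc run that writes JSON (-t json) and one that reads
it (-f json).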

John Bokma                                                               j3b


Re: The best approach to simplify/clean-up html code

On Sun, 30 Mar 2014 12:11:51 -0600, John Bokma wrote:


Thanks for your answer John,  

No, pandoc is too big. I don't mind coding one myself; in fact I'd rather.
I'm just not sure whether to use a regex hack, like the one in my OP,
or to go the formal route: parse the HTML first, then simplify it.

Thanks all the same.  
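
To make the regex route concrete, here is a minimal sketch (in Python, so
it runs with no extra modules; the pattern itself is ordinary regex syntax
that Perl also accepts). The attribute list and the names
PRESENTATIONAL_ATTR and strip_presentational are illustrative assumptions,
not taken from any module mentioned in this thread.

```python
import re

# One style/class/align attribute with a quoted value, plus the whitespace
# in front of it. The attribute list is an illustrative assumption.
PRESENTATIONAL_ATTR = re.compile(
    r"""\s+(?:style|class|align)\s*=\s*(?:"[^"]*"|'[^']*')""",
    re.IGNORECASE,
)


def strip_presentational(html):
    """Delete every match of the pattern, wherever it occurs in the input."""
    return PRESENTATIONAL_ATTR.sub("", html)
```

The weakness shows up quickly: the pattern has no idea whether it is
inside a tag, so ordinary text that happens to contain style="..." gets
eaten as well, and unquoted attribute values are missed altogether. That
is the usual argument for the parsing route.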

Re: The best approach to simplify/clean-up html code


I've not looked at CPAN HTML parser modules.  Perhaps you might use
one, walk the tree, and either blacklist or whitelist elements and
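
A rough sketch of the whitelist idea, assuming nothing beyond a standard
event-style HTML parser. It is in Python (stdlib html.parser) purely so it
is self-contained; a CPAN parser would be driven in a similar way. The
ALLOWED table is a made-up example and is not Stack Exchange's actual
list, and the names Whitelister and simplify are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative whitelist: tag name -> attribute names to keep.
ALLOWED = {
    "a": {"href", "title"},
    "p": set(), "b": set(), "i": set(), "em": set(), "strong": set(),
    "code": set(), "pre": set(), "ul": set(), "ol": set(), "li": set(),
    "blockquote": set(),
}


class Whitelister(HTMLParser):
    """Re-emit only whitelisted tags and attributes; keep all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself, its text content still comes through
        kept = ""
        for name, value in attrs:
            if name in ALLOWED[tag]:
                kept += ' {}="{}"'.format(name, escape(value or "", quote=True))
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text arrives with entities decoded, so escape it again on output.
        self.out.append(escape(data, quote=False))


def simplify(html):
    parser = Whitelister()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

Fed the link from the original post, this keeps the a element with its
href and the text 0.12, and drops the style attribute and the span
wrapper. One known gap in the sketch: the contents of script and style
elements would come through as plain text, so a real tool should skip
those explicitly.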

Tim McDaniel,
