Tuesday, March 1, 2011

Regular expression to remove empty <span> tags

Hello, I would like to remove empty span tags like this one (containing only &nbsp; entities and spaces):

<span> &nbsp; &nbsp; &nbsp; </span>

I've tried with this regex, but it needs adjusting:

(<span>(&nbsp;|\s)*</span>)

preg_replace('#<span>(&nbsp;|\s)*</span>#si','<\\1>',$encoded);

Cheers.

From stackoverflow
  • .

    qr{<span[^>]*(/>|>\s*?</span>)}
    

    That should catch most of them (including XML-style self-closing tags, i.e. <span/>).

    But you really shouldn't use regex for HTML processing.

    Answer only relevant to the context of the question that was visible before the formatting errors were corrected

    PhiLho : Perl code to an (unspecified!) PHP request? :-)
    Kent Fredric : Yeah, I couldn't be stuffed with nasty quoting styles needed :/ user exercise to make the regex suited to their language :p
    nickf : I'm really getting tired of people saying that you shouldn't use regexes on any sort of XML or HTML. Sometimes using something like Beautiful Soup *really isn't appropriate*.
    Brad Gilbert : In this case it would be fine, as long as it never occurs inside quoted areas. That makes this very brittle, and I wouldn't use it, except in a pinch.
    Kent Fredric : @nickf: it's to combat the problem of millions of novices who use it as the first port of call and then XSS-exploit themselves.
  • Could you explain your solution if possible?

  • I suppose these spans are generated by some program, since they don't seem to have any attributes.
    I am perplexed as to why you need to put the whitespace they enclose between angle brackets, but then again I don't know the final purpose of the code.
    I think the solution is the one given by Kent: you have to make the match non-greedy. Since you use the dotall option (s), you will match everything between the first span and the last closing span!

    So the answer should look like:

    preg_replace('#<span>(&nbsp;|\s)*?</span>#si', '<$1>', $encoded);

    (untested)

    Scott Evernden : \s* and \s*? are equivalent
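
A minimal sketch of the two points raised in this answer, assuming a sample string modeled on the span in the question (the surrounding "a" and "b" are made up for illustration). It suggests that greedy and lazy quantifiers behave the same here, because the group can only consume &nbsp; and whitespace, and that the '<$1>' replacement is what leaves something behind:

```php
<?php
// Hypothetical input: the span from the question, wrapped in sample text.
$encoded = 'a<span> &nbsp; &nbsp; &nbsp; </span>b';

$greedy = '#<span>(&nbsp;|\s)*</span>#si';
$lazy   = '#<span>(&nbsp;|\s)*?</span>#si';

// '<$1>' re-inserts the last captured item (here a single space)
// between angle brackets instead of deleting the span.
$kept = preg_replace($lazy, '<$1>', $encoded);

// An empty replacement string deletes the whole match.
$removedGreedy = preg_replace($greedy, '', $encoded);
$removedLazy   = preg_replace($lazy, '', $encoded);

echo $kept, "\n";          // a< >b
echo $removedGreedy, "\n"; // ab
echo $removedLazy, "\n";   // ab
```

If the goal is only to drop the empty span, replacing with '' rather than '<$1>' is probably what is wanted.
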
  • Purpose: I'm trying to filter out content pasted directly from MS Word.

    P.S. I've tried the code above - the empty space still stays untouched...

  • The problem comes when the span gets nested like: <span><span> &nbsp; </span></span>

  • Translating Kent Fredric's regexp to PHP :

    preg_match_all('#<span[^>]*(?:/>|>(?:\s|&nbsp;)*</span>)#im', $html, $result);
    

    This will match :

    • self-closing spans
    • spans that extend over several lines, in any letter case
    • spans with attributes
    • spans containing non-breaking spaces

    Maybe you should think about including spans containing only
    as well...

    As usual, when it comes to tweaking a regexp, some tools come in handy:

    http://regex.larsolavtorvik.com/
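
The answer above only finds the spans with preg_match_all. As a rough sketch, the same pattern could be passed to preg_replace to remove them; the sample strings and the class name "x" below are invented for illustration:

```php
<?php
// Pattern copied from the answer above.
$pattern = '#<span[^>]*(?:/>|>(?:\s|&nbsp;)*</span>)#im';

// Hypothetical inputs, one per case listed in the answer,
// plus one span with real content that should survive.
$samples = [
    'self-closing' => 'x<span/>y',
    'attributes'   => 'x<span class="x"> </span>y',
    'multiline'    => "x<SPAN>\n &nbsp;\n</SPAN>y",
    'has content'  => 'x<span>keep me</span>y',
];

$results = [];
foreach ($samples as $label => $html) {
    // Replace every match with the empty string.
    $results[$label] = preg_replace($pattern, '', $html);
    echo $label, ': ', $results[$label], "\n";
}
```

The first three samples should come out as "xy", while the span with text is left alone.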

  • I've tried with this regex, but it needs adjusting:

    In what way does the regex in the original question fail?

    The problem comes when the span gets nested like: <span><span> &nbsp; </span></span>

    This is an example of why using regexes to parse HTML doesn't work particularly well. Depending on your regex flavor, this situation is either impossible to handle in a single pass or merely very difficult. I don't know PHP's regex engine well enough to say which category it falls into, but, if the only problem is that it takes out the inner <span> and leaves the outer one alone, then you may want to consider simply re-running your substitution repeatedly until it runs out of things to do.

  • A good site for working with regular expressions is http://regexlib.com/. Its tester and cheat sheet are very helpful.

  • If your only issue is nested span tags, you can run the search-and-replace with the regex you have in a loop until the regex no longer finds any matches.

    This may not be a very elegant solution, but it'll perform well enough.
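
One way the loop might look, as a sketch only: the function name strip_empty_spans and the $maxPasses guard are assumptions added for illustration, and the pattern is the one from the question with a non-capturing group. It relies on the fifth argument of preg_replace, which receives the number of replacements made:

```php
<?php
// Hypothetical helper: repeat the replacement until a pass changes nothing.
function strip_empty_spans(string $html, int $maxPasses = 50): string
{
    $pattern = '#<span>(?:&nbsp;|\s)*</span>#i';
    $passes = 0;
    do {
        // $count is filled in with the number of spans removed in this pass.
        $html = preg_replace($pattern, '', $html, -1, $count);
        $passes++;
    } while ($count > 0 && $passes < $maxPasses);

    return $html;
}

// The nested case from the question: the inner span goes first,
// which leaves the outer span empty for the next pass.
echo strip_empty_spans('a<span><span> &nbsp; </span></span>b'), "\n"; // ab
```

The pass limit is just a safety net; each pass removes at least one level of nesting, so the loop ends once nothing matches.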

  • Here is my solution to the nested-tag problem, still not complete but close...

    $test = "<span>   <span>&nbsp;  </span> test <span>&nbsp; <span>&nbsp;  </span>  </span> &nbsp;&nbsp; </span>";

    // Match any element whose body is only &nbsp; entities and whitespace.
    $pattern = '#<(\w+)[^>]*>(&nbsp;|\s)*</\1>#im';
    // Repeat until no empty element is left, so emptied parents are removed too.
    while (preg_match($pattern, $test)) {
        $test = preg_replace($pattern, '', $test);
    }
    

    For short $test strings this works fine. The problem comes when trying it on a long text. Any help would be appreciated...
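
The post does not say how the long-text case fails, so the following is only a guess at one possible cause: preg_replace returns null when PCRE hits an internal limit (for example the backtracking limit), and a loop that assigns that result back to $test would then lose the text. A sketch that checks for this, with a hypothetical function name and an assumed tweak (non-capturing group plus the possessive quantifier *+) to reduce backtracking:

```php
<?php
// Hypothetical helper; the name and the pattern tweak are assumptions.
function remove_empty_tags(string $html): string
{
    // (?:...)*+ does not capture and does not backtrack into the run
    // of &nbsp;/whitespace once it has been consumed.
    $pattern = '#<(\w+)[^>]*>(?:&nbsp;|\s)*+</\1>#i';
    do {
        $result = preg_replace($pattern, '', $html, -1, $count);
        if ($result === null) {
            // Surface the PCRE error instead of silently losing the text.
            throw new RuntimeException('PCRE error code: ' . preg_last_error());
        }
        $html = $result;
    } while ($count > 0);

    return $html;
}

$test = "<span>   <span>&nbsp;  </span> test <span>&nbsp; <span>&nbsp;  </span>  </span> &nbsp;&nbsp; </span>";
$clean = remove_empty_tags($test);
echo $clean, "\n";
```

If an exception shows up on the long text, the error code from preg_last_error() can be compared with constants such as PREG_BACKTRACK_LIMIT_ERROR to see which limit was reached.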
