Code Answer: Regular expression to remove empty <span> tags

Hello, I would like empty span tags like this one (filled with &nbsp; and spaces) to be removed:

<span> &nbsp; </span>

I've tried with this regex, but it needs adjusting:

(<span>(&nbsp;|\s)*</span>)

preg_replace('#<span>(&nbsp;|\s)*</span>#si','<\\1>',$encoded);

Cheers.

From stackoverflow

```
qr{<span[^>]*(/>|>\s*?</span>)}
```
That should get most of them (including XML-style self-closing tags, i.e. <span />).

But you really shouldn't use regex for HTML processing.
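For anyone who wants to follow that advice, here is a minimal sketch of the parser-based route using PHP's bundled DOM extension. The function name `removeEmptySpans` and the wrapper `<div>` are made up for this illustration, and the sketch assumes the input is an ASCII HTML fragment that uses entities such as `&nbsp;` rather than raw multibyte characters.

```php
<?php
// Hypothetical helper (not from the thread): drop <span> elements that hold
// nothing but whitespace and non-breaking spaces, using DOM instead of a regex.
function removeEmptySpans(string $html): string
{
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $doc = new DOMDocument();
    // Wrap the fragment so its top-level nodes can be found again afterwards.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nbsp  = "\u{00A0}"; // what &nbsp; becomes once parsed
    // A span with no child elements whose text is only whitespace / nbsp.
    $query = '//span[not(*) and normalize-space(translate(., "' . $nbsp . '", " ")) = ""]';

    // Repeat, because removing an inner span can leave its parent span empty.
    do {
        $nodes = $xpath->query($query);
        foreach ($nodes as $node) {
            $node->parentNode->removeChild($node);
        }
    } while ($nodes->length > 0);

    // Serialize the children of the wrapper <div> (the first div in the document).
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo removeEmptySpans('<p>a<span class="x"> &nbsp; </span>b<span>keep</span></p>'), "\n";
```

Unlike the regex, this does not care about attributes, letter case, or nesting depth, at the price of re-serializing the markup.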

Answer only relevant to the context of the question that was visible before the formatting errors were corrected

PhiLho : Perl code to an (unspecified!) PHP request? :-)

Kent Fredric : Yeah, I couldn't be stuffed with nasty quoting styles needed :/ user exercise to make the regex suited to their language :p

nickf : I'm really getting tired of people saying that you shouldn't use regexes on any sort of XML or HTML. Sometimes using something like Beautiful Soup *really isn't appropriate*.

Brad Gilbert : In this case it would be fine, as long as it never occurs inside quoted areas. That makes this very brittle, and I wouldn't use it, except in a pinch.

Kent Fredric : @nickf: it's to combat the problem of millions of novices who use it as the first port of call and then XSS-exploit themselves.

Could you explain your solution if possible?

I suppose these spans are generated by some program, since they don't seem to have any attributes.
I am perplexed as to why you need to put the space they enclose between angle brackets, but then again I don't know the final purpose of the code.
I think the solution is given by Kent: you have to make the match non-greedy. Since you use the dotall option (s), you will match everything between the first span and the last closing span!

So the answer should look like:

preg_replace('#<span>(&nbsp;|\s)*?</span>#si', '<$1>', $encoded);

(untested)

Scott Evernden : \s* and \s*? are equivalent
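Scott's point can be checked with a quick sketch. It assumes the question's pattern was `<span>(&nbsp;|\s)*</span>` and replaces with an empty string instead of `<$1>`; the sample `$html` is made up. Since the repeated group can only consume `&nbsp;` and whitespace (there is no `.` in the pattern, so the `s` flag has nothing to act on), the greedy version cannot run past a closing tag.

```php
<?php
// Made-up sample: two empty spans and one span with real content.
$html = '<p>a<span> &nbsp; </span>b<span>text</span>c<span>&nbsp;</span>d</p>';

// Greedy and lazy variants of the same pattern.
$greedy = preg_replace('#<span>(?:&nbsp;|\s)*</span>#i', '', $html);
$lazy   = preg_replace('#<span>(?:&nbsp;|\s)*?</span>#i', '', $html);

var_dump($greedy === $lazy); // bool(true)
echo $greedy, "\n";          // <p>ab<span>text</span>cd</p>
```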
Purpose: I'm trying to filter out directly pasted MS Word content.

P.S. I've tried the code above - the empty space still stays untouched...
The problem comes when the span gets nested like: <span> <span> &nbsp; </span> </span>

Translating Kent Fredric's regexp to PHP:
```
preg_match_all('#<span[^>]*(?:/>|>(?:\s|&nbsp;)*</span>)#im', $html, $result);
```
This will match:
- self-closing spans
- spans that run over multiple lines, whatever the letter case
- spans with attributes
- spans containing non-breaking spaces
Maybe you should think about including spans containing only <br /> as well...
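A possible way to cover that last case, as a sketch only: the extra `<br\s*/?>` alternative and the `\b` after `span` (so that other tag names starting with "span" are not caught) are my additions, not part of the pattern above, and the sample `$html` is made up.

```php
<?php
// Assumed extension: <br>, <br/> and <br /> also count as "empty" content.
$pattern = '#<span\b[^>]*(?:/>|>(?:\s|&nbsp;|<br\s*/?>)*</span>)#i';

$html = "<p>one<span class=\"x\">&nbsp;<br />\n</span>two<SPAN/>three<span>keep</span></p>";

// Replace every match with nothing instead of just collecting the matches.
$result = preg_replace($pattern, '', $html);
echo $result, "\n"; // <p>onetwothree<span>keep</span></p>
```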

As usual, when it comes to tweaking regexps, some tools are handy:

http://regex.larsolavtorvik.com/
I've tried with this regex, but it needs adjusting:

In what way does the regex in the original question fail?

The problem comes when the span gets nested like: <span> <span> &nbsp; </span> </span>

This is an example of why using regexes to parse HTML doesn't work particularly well. Depending on your regex flavor, this situation is either impossible to handle in a single pass or merely very difficult. I don't know PHP's regex engine well enough to say which category it falls into, but, if the only problem is that it takes out the inner <span> and leaves the outer one alone, then you may want to consider simply re-running your substitution repeatedly until it runs out of things to do.
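For what it's worth, PHP's preg functions are built on PCRE, which does support recursive patterns, so a single pass is possible for the simple case. The sketch below is one way to write it, not something from the thread: `(?R)` re-enters the whole pattern, so a span counts as empty if it contains only whitespace, `&nbsp;`, or other empty spans. The sample `$html` is made up.

```php
<?php
// A span is "empty" if it holds only whitespace, &nbsp;, or nested empty spans.
$pattern = '#<span\b[^>]*>(?:\s|&nbsp;|(?R))*</span>#i';

$html = '<p>a<span> <span>&nbsp;</span> <span></span> </span>b<span>keep <span> </span></span></p>';

$result = preg_replace($pattern, '', $html);
echo $result, "\n"; // <p>ab<span>keep </span></p>
```

Re-running a plain substitution in a loop is easier to read and debug; the recursive form mainly saves the extra passes.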
RegexLib (http://regexlib.com/) is simply a good site when working with regular expressions. Its tester and cheat sheet are very helpful.

If your only issue is nested span tags, you can run the search-and-replace with the regex you have in a loop until the regex no longer finds any matches.

This may not be a very elegant solution, but it'll perform well enough.

Here is my solution to the nested tags problem, still not complete but close...

$test = "<span>   <span>&nbsp;  </span> test <span>&nbsp; <span>&nbsp;  </span>  </span> &nbsp;&nbsp; </span>";

// Match any element that contains only &nbsp; entities and whitespace.
$pattern = '#<(\w+)[^>]*>(?:&nbsp;|\s)*</\1>#i';
// Repeat until a pass removes nothing, so elements emptied by an earlier pass are caught too.
do {
    $test = preg_replace($pattern, '', $test, -1, $count);
} while ($count > 0);

For short $test strings the function works OK. The problem comes when trying it with a long text. Any help will be appreciated...
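The post doesn't say how it fails on long text, so the following is a guess rather than a diagnosis. `preg_replace` returns `null` when PCRE gives up (for example on a backtrack limit), and a loop that assigns the result straight back to the input would then silently lose the text. The sketch below makes that failure visible and uses a possessive quantifier (`*+`) so the engine never backtracks into the run of filler. The function name `stripEmptyTags` and the generated `$long` input are made up for the example.

```php
<?php
// Hypothetical helper: same idea as the loop above, with error checking.
function stripEmptyTags(string $html): string
{
    // *+ is possessive: once the filler is consumed, PCRE does not backtrack into it.
    $pattern = '#<(\w+)[^>]*>(?:&nbsp;|\s)*+</\1>#i';
    do {
        $result = preg_replace($pattern, '', $html, -1, $count);
        if ($result === null) {
            // Surface the PCRE error instead of overwriting the text with null.
            throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
        }
        $html = $result;
    } while ($count > 0);
    return $html;
}

// Made-up long input: 5000 paragraphs, each with one empty span.
$long  = str_repeat('<p>text <span> &nbsp; </span></p>' . "\n", 5000);
$clean = stripEmptyTags($long);
var_dump(strpos($clean, '<span') === false); // bool(true)
```

If an exception does show up on real input, the error code can be compared against the `PREG_*_ERROR` constants to see which limit was hit.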


Tuesday, March 1, 2011
