Hello,
I would like empty span tags like this one (filled only with &nbsp; and spaces) to be removed:
<span>&nbsp; </span>
I've tried with this regex, but it needs adjusting:
(<span>(&nbsp;|\s)*</span>)
preg_replace('#<span>(&nbsp;|\s)*</span>#si','<\\1>',$encoded);
Cheers.
-
qr{<span[^>]*(/>|>\s*?</span>)}
That should catch most of them (including XML-style self-closing tags, i.e. <span />).
But you really shouldn't use regex for HTML processing.
Answer only relevant to the context of the question that was visible before the formatting errors were corrected
PhiLho : Perl code to an (unspecified!) PHP request? :-)
Kent Fredric : Yeah, I couldn't be stuffed with the nasty quoting styles needed :/ It's a user exercise to make the regex suit their language :p
nickf : I'm really getting tired of people saying that you shouldn't use regexes on any sort of XML or HTML. Sometimes using something like Beautiful Soup *really isn't appropriate*.
Brad Gilbert : In this case it would be fine, as long as it never occurs inside quoted areas. That makes this very brittle, and I wouldn't use it, except in a pinch.
Kent Fredric : @nickf: it's to combat the problem of millions of novices who use it as the first port of call and then XSS-exploit themselves.
-
Could you explain your solution if possible?
-
I suppose these spans are generated by some program, since they don't seem to have any attributes.
I am perplexed as to why you need to put the space they enclose between angle brackets, but then again I don't know the final purpose of the code.
I think the solution is given by Kent: you have to make the match non-greedy. Since you use the dotall option (s), you will match everything between the first span and the last closing span! So the answer should look like:
preg_replace('#<span>(&nbsp;|\s)*?</span>#si', '<$1>', $encoded);
(untested)
Scott Evernden : \s* and \s*? are equivalent
-
purpose: I'm trying to filter out directly pasted MS-WORD content.
P.S. I've tried the code above - the empty space still stays untouched...
-
The problem comes when the span gets nested like:
<span><span> </span></span>
-
Translating Kent Fredric's regexp to PHP :
preg_match_all('#<span[^>]*(?:/>|>(?:\s|&nbsp;)*</span>)#im', $html, $result);
This will match :
- autoclosing spans
- spans on multilines and whatever the case
- spans with attributes
- span with unbreakable spaces
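A minimal sketch of how the pattern above could be used for removal rather than just matching. The variable names and the sample markup are made up for illustration, and it assumes that non-breaking spaces appear as the literal &nbsp; entity rather than as a raw character:

```php
<?php
// Hypothetical sample input covering the four cases listed above,
// plus one span with real content that must survive.
$html = "a<span/>b<SPAN>\n \n</SPAN>c<span class=\"x\"> </span>d<span>&nbsp; &nbsp;</span>e<span>keep</span>";

// Same pattern as above; the 'm' flag is left out here because the
// pattern has no ^ or $ anchors for it to affect.
$pattern = '#<span[^>]*(?:/>|>(?:\s|&nbsp;)*</span>)#i';

// Replace every match with the empty string.
$clean = preg_replace($pattern, '', $html);

echo $clean, "\n"; // prints: abcde<span>keep</span>
```

A single pass like this does not handle nested empty spans; the outer span only becomes empty after the inner one has been removed.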
Maybe you should think about including spans containing only
as well... As usual, when it comes to tweaking regexps, some tools are handy:
-
I've tried with this regex, but it needs adjusting:
In what way does the regex in the original question fail?
The problem comes when the span gets nested like:
<span><span> </span></span>
This is an example of why using regexes to parse HTML doesn't work particularly well. Depending on your regex flavor, this situation is either impossible to handle in a single pass or merely very difficult. I don't know PHP's regex engine well enough to say which category it falls into, but, if the only problem is that it takes out the inner <span> and leaves the outer one alone, then you may want to consider simply re-running your substitution repeatedly until it runs out of things to do.
-
A good site when working with regular expressions is RegExLib: http://regexlib.com/. The tester and cheat sheet are very helpful.
-
If your only issue is nested span tags, you can run the search-and-replace with the regex you have in a loop until the regex no longer finds any matches.
This may not be a very elegant solution, but it'll perform well enough.
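The loop described above might look like the following sketch. The function name remove_empty_spans, the pass limit and the sample string are assumptions for illustration; the pattern is the one from the question, with an empty replacement and a non-capturing group:

```php
<?php
// Hypothetical helper: repeat the replacement until a pass removes nothing.
function remove_empty_spans(string $html, int $maxPasses = 50): string
{
    // A span whose content is only &nbsp; entities and whitespace.
    $pattern = '#<span>(?:&nbsp;|\s)*</span>#i';

    for ($pass = 0; $pass < $maxPasses; $pass++) {
        // $count receives the number of replacements made in this pass.
        $result = preg_replace($pattern, '', $html, -1, $count);
        if ($result === null || $count === 0) {
            break; // regex error, or nothing left to remove
        }
        $html = $result;
    }
    return $html;
}

echo remove_empty_spans('<p>x<span><span>&nbsp; </span></span>y</p>'), "\n"; // prints: <p>xy</p>
```

Each pass strips the innermost empty spans, so the number of passes needed grows with the nesting depth; the pass limit is only a safety net.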
-
Here is my solution to the nested-tag problem, still not complete but close...
$test = "<span> <span>&nbsp; </span> test <span>&nbsp; <span>&nbsp; </span> </span> &nbsp;&nbsp; </span>";
// Match any element that contains only &nbsp; entities and whitespace.
$pattern = '#<(\w+)[^>]*>(&nbsp;|\s)*</\1>#im';
// Repeat until a pass removes nothing, so newly emptied parents are caught too.
do {
    $test = preg_replace($pattern, '', $test, -1, $count);
} while ($count > 0);
For short $test strings the function works OK. The problem comes when trying it on a long text. Any help will be appreciated...