Thursday, May 5, 2011

Regular Expression to find the start and end of a list in HTML

I have a TextBox in a web page whose contents I parse and modify with JavaScript to format them as HTML. 90% of it works really well; the last main thing I'm trying to support is copying and pasting from a Word document. I've got that mostly working, I'm just stuck on finding lists and wrapping them in a UL tag.

So, using regular expressions, I'd like to find the list in this text:

<p>paragraph goes here

<li>goes here<br/>
<li>list item 2<br/>
<li>list item 3<br/>

<p>another paragraph

and wrap the <li> section with a <ul> tag. My regex-fu isn't that good, can someone help?

----- update -----

While I appreciate all the feedback basically indicating that I need to start from scratch with this issue, I do not have the time to do that. I completely understand that regex is not the ideal way to handle HTML formatting, but how I am using it now, it will handle most of what my users are looking to do. I only need a subset of HTML tags, not a full HTML editor.

The source of my content will be a user copying and pasting from a Word document about 99.9% of the time. I use regex to insert HTML tags into plain text. For the lists, I find the bullet character MS Word inserts into its copied text and replace that with the <LI> tag. I just want to make it more user-friendly by wrapping the <LI> tags in a <UL> tag.

I'll look into being able to end my tags properly. So, assuming they're properly ended, what would be the regex to wrap my list items with a <ul> tag?

thanks!

From stackoverflow
  • Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski

    1. Regular expressions and HTML are a particularly bad fit.

    2. This is 2009, use closing tags in your HTML. (That alone will help you, if you really want to regex your HTML.)

    3. If you've already got this page inside a browser, use the DOM! Let the browser parse the HTML for you (shove it into a hidden div if you must) and navigate the resulting DOM tree.

  • Don't parse HTML with regexes. Instead, use a real HTML parser.

    Sorry if my answer feels insubstantial, but this question is asked almost every day, and your requirements are (in my opinion) far too complicated for regular expressions.

    Also, none of your tags are closed. You should probably write that like this:

    <p>paragraph goes here</p>
    
    <li>goes here</li>
    <li>list item 2</li>
    <li>list item 3</li>
    
    <p>another paragraph</p>
    

    My HTML may be off, but you should really close all your tags.

  • I agree with James and Chris: in general it's really a lot better to use a proper parser, and I've seen people fail badly doing it any other way. (I'm assuming you don't have full control over the HTML input here; if you did, a shortcut like regex might work fine.)

    Let's assume you're using Java for the moment. If you know that your input is valid XHTML instead of HTML, you can use the Java API for XML Processing (JAXP), which comes with the Sun Java JDK. Then in a few lines you can parse your XHTML into a DOM tree and reach down to pick out the list's node and do whatever you like with it. There's a learning curve to JAXP, but it's well worth it.

    If you are using Groovy, there's XmlSlurper. Ruby has several good XML libraries. PHP has the XML Parser extension. Python has Beautiful Soup. Pretty much any modern language has good alternatives to choose from.

    Now based on your example, you don't have properly XML-ized XHTML, but wild-and-wooly HTML with unclosed tags and other nasties. If that's the case, you'll need to grab an HTML parser library, something on the order of HTMLParser. Good luck!

  • Assuming all elements have end tags, nobody got clever by adding spaces inside start or end tags, and some elements precede the list items, all you have to do is something like this (in Perl syntax, probably compatible with a PCRE library, minus the m// operator):

    m/(?<!li)>[^<]*<li/i
    

    to identify the first list item in a group. Exploded (with the x flag, for readability):

    m/
        (?<!li)> # the end of a start or end tag that isn't part of an li element
        [^<]*    # some non-angle-bracket characters -- in-between tag content
        <li      # the beginning of an li element
    /xi          # space insensitive, case insensitive (respectively)
    

    And then you could go through the next block, more confident that nothing is likely to sit between list items, until you reach its end, save that position, and use this pattern again.
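
    Since the question is about JavaScript, here is a minimal sketch of how that first pattern might be applied there. It assumes an engine with lookbehind support (ES2018 or later), and `findListStarts` is a hypothetical helper name, not part of any library. The trailing `<li` is turned into a lookahead so that each match ends exactly where a group of list items begins:

    ```javascript
    // Sketch only: returns the offsets at which each run of list items starts.
    // Assumes well-formed tags and no spaces inside start or end tags.
    function findListStarts(html) {
        // ">" not preceded by "li", then in-between text, then a lookahead for "<li".
        const pattern = /(?<!li)>[^<]*(?=<li)/gi;
        const starts = [];
        for (const match of html.matchAll(pattern)) {
            // The match ends right before "<li", so its end offset is the list start.
            starts.push(match.index + match[0].length);
        }
        return starts;
    }
    ```

    Each returned offset is a candidate position for inserting an opening `<ul>`. Like the Perl version, it would also fire on other tags whose names begin with "li", such as `<link>`.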


    Figuring out where it ends is trickier without a parser. You could use something like (this is abridged)

    m/(?<=<li).*?<(div|form|p)/is
    

    where you list all the non-inline elements, any of which will cause the li and ul to be closed and so end the overall list. But the other way for the list to close implicitly is for the container to close.
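
    For the unclosed markup in the question, the same idea might look like this in JavaScript. This is a sketch under assumptions: `BLOCK_TAGS` is an abridged, illustrative list, `wrapUnclosedLists` is a hypothetical name, the input is assumed to contain no `<ul>` or `<ol>` yet, and the pattern starts at `<li` itself instead of using a lookbehind. It does not handle the case where the list ends because its container closes.

    ```javascript
    // Abridged, assumed list of tags that end a list; extend it for your input.
    const BLOCK_TAGS = "div|form|p|h[1-6]|table";

    // Sketch only: wraps each run of unclosed <li> items in <ul>...</ul>.
    function wrapUnclosedLists(html) {
        // From "<li", lazily up to the next block-level start tag or the end of input.
        const run = new RegExp("<li\\b[\\s\\S]*?(?=<(?:" + BLOCK_TAGS + ")\\b|$)", "gi");
        return html.replace(run, (items) => {
            const body = items.trimEnd();              // the list items themselves
            const trailing = items.slice(body.length); // whitespace after the last item
            return "<ul>\n" + body + "\n</ul>" + trailing;
        });
    }
    ```

    Run against the sample from the question, this places `<ul>` before the first `<li>` and `</ul>` after the third item, leaving both paragraphs untouched.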


    If the list-item elements themselves are well-formed (have closing tags), then this might be sufficient for placing the list's closing tag:

    m{</li>\s*<(?!li)}i
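
    Putting the pieces together for the well-formed case, a single replace in JavaScript could do the wrapping. This is a rough sketch, not a general HTML solution: `wrapClosedLists` is a hypothetical name, and it assumes every `<li>` has a matching `</li>`, items are not nested, and no `<ul>` or `<ol>` wrapper exists yet.

    ```javascript
    // Sketch only: wraps each run of consecutive <li>...</li> items in a <ul>.
    function wrapClosedLists(html) {
        // One complete item, then any further items separated only by whitespace.
        const run = /<li\b[^>]*>[\s\S]*?<\/li>(?:\s*<li\b[^>]*>[\s\S]*?<\/li>)*/gi;
        return html.replace(run, (items) => "<ul>\n" + items + "\n</ul>");
    }
    ```

    Anything other than whitespace between two items ends the run, so two lists separated by a paragraph each get their own `<ul>`.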
    
