Thursday, April 14, 2011

Crunching xml with python

I need to remove whitespace between xml tags, e.g. if the original xml looks like:

<node1>
    <node2>
        <node3>foo</node3>
    </node2>
</node1>

I'd like the end result to be crunched down to a single line:

<node1><node2><node3>foo</node3></node2></node1>

Please note that I will not have control over the xml structure, so the solution should be generic enough to be able to handle any valid xml. Also the xml might contain CDATA blocks, which I'd need to exclude from this crunching and leave them as-is.

I have a couple of ideas so far: (1) parse the xml as text and look for the start and end of tags, < and >; (2) load the xml document, go node-by-node, and print out a new document by concatenating the tags.

I think either method would work, but I'd rather not reinvent the wheel here, so maybe there is a python library that already does something like this? If not, are there any issues/pitfalls to be aware of when rolling my own cruncher? Any recommendations?

EDIT: Thank you all for the answers/suggestions. Both Triptych's and Van Gale's solutions work for me and do exactly what I want. Wish I could accept both answers.

From stackoverflow
  • I'd use XSLT:

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>
        <xsl:strip-space elements="*"/>
    
        <xsl:template match="*">
            <xsl:copy>
                <xsl:copy-of select="@*" />
                <xsl:apply-templates />
            </xsl:copy>
        </xsl:template>
    </xsl:stylesheet>
    

    That should do the trick.

    In python you could use lxml (direct link to sample on homepage) to transform it.

    For some tests, use xsltproc, sample:

    xsltproc test.xsl  test.xml
    

    where test.xsl is the file above and test.xml your XML file.
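
The lxml route mentioned above could look roughly like the following. This is a minimal sketch, assuming lxml is installed; the stylesheet is the one from this answer, the sample XML is the one from the question, and the names XSLT_SOURCE, XML_SOURCE and crunched are made up for illustration.

```python
from lxml import etree

# The stylesheet from the answer, held in a string instead of test.xsl.
XSLT_SOURCE = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>
    <xsl:template match="*">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>"""

# The sample document from the question.
XML_SOURCE = """<node1>
    <node2>
        <node3>foo</node3>
    </node2>
</node1>"""

# Compile the stylesheet once, then apply it to the parsed document.
transform = etree.XSLT(etree.XML(XSLT_SOURCE))
result = transform(etree.XML(XML_SOURCE))

# str() serializes the result according to the xsl:output settings;
# strip() drops any trailing newline the serializer may add.
crunched = str(result).strip()
print(crunched)
```

Note that this stylesheet only has a template for elements, so comments and processing instructions in the input are not copied to the output.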

    David Zaslavsky : I know nothing about XSLT but if that does the job, it looks really cool ;-)
    Johannes Weiß : XSLT is really great for transforming XML, preferably into XML. It is indeed a Turing-complete functional programming language, but ordinary programming is (at least in XSLT 1.x) a bit of a pain, since function invocations are very verbose ;-)
    : Thanks, I will give it a try. At first glance it seems like it should do the trick.
  • Not a solution really but since you asked for recommendations: I'd advise against doing your own parsing (unless you want to learn how to write a complex parser) because, as you say, not all spaces should be removed. There are not only CDATA blocks but also elements with the xml:space="preserve" attribute, which correspond to things like <pre> in XHTML (where the enclosed whitespace actually has meaning), and writing a parser that is able to recognize those elements and leave the whitespace alone would be possible but unpleasant.

    I would go with your second approach, i.e. load the document with an existing XML parser and go node-by-node, printing the nodes out. That way you can easily identify which nodes you can strip the spaces out of and which you can't. There are some modules in the Python standard library, none of which I have ever used ;-) that could be useful to you... try xml.dom, or possibly xml.parsers.expat, though I'm not sure whether that one can do this.
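
As a rough illustration of the node-by-node idea, here is a minimal sketch using xml.dom.minidom from the standard library. The function names crunch and crunch_xml are hypothetical, and the policy is an assumption: only whitespace-only text nodes are removed, CDATA sections are left untouched, and nothing is removed inside an element that carries xml:space="preserve".

```python
from xml.dom import minidom, Node


def crunch(node, preserve=False):
    """Remove whitespace-only text nodes below `node`, in place."""
    if node.nodeType == Node.ELEMENT_NODE:
        # xml:space applies to the element and everything nested in it.
        space = node.getAttribute("xml:space")
        if space == "preserve":
            preserve = True
        elif space == "default":
            preserve = False
    # Iterate over a copy because children are removed while looping.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE:
            # CDATA sections have their own node type, so they never get here.
            if not preserve and not child.data.strip():
                node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            crunch(child, preserve)


def crunch_xml(text):
    """Parse `text`, crunch it, and return the root element as a string."""
    doc = minidom.parseString(text)
    crunch(doc.documentElement)
    return doc.documentElement.toxml()


sample = """<node1>
    <node2>
        <node3>foo  </node3>
    </node2>
    <node4><![CDATA[ I'm <CDATA>!  ]]></node4>
</node1>"""
print(crunch_xml(sample))
```

Text that contains anything other than whitespace is kept exactly as it is, so "foo  " keeps its trailing spaces.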

  • This is pretty easily handled with lxml (note: this particular feature isn't in ElementTree):

    from lxml import etree
    
    parser = etree.XMLParser(remove_blank_text=True)
    
    foo = """<node1>
        <node2>
            <node3>foo  </node3>
        </node2>
    </node1>"""
    
    bar = etree.XML(foo, parser)
    # tostring() returns bytes; decode() makes the output printable on Python 3 too
    print(etree.tostring(bar, pretty_print=False, with_tail=True).decode())
    

    Results in:

    <node1><node2><node3>foo  </node3></node2></node1>
    

    Edit: The answer by Triptych reminded me about the CDATA requirements, so the line creating the parser object should actually look like this:

    parser = etree.XMLParser(remove_blank_text=True, strip_cdata=False)
    
    : If CDATA is present then this method would HTML-encode everything inside the CDATA block, e.g. converting < into &lt; etc.
    : Works now with the changes to the line creating the parser.
  • Pretty straightforward with BeautifulSoup.

    This solution assumes it is ok to strip whitespace from the tail ends of character data.
    Example: <foo> bar </foo> becomes <foo>bar</foo>

    It will correctly ignore comments and CDATA.

    import BeautifulSoup
    
    s = """
    <node1>
        <node2>
            <node3>foo</node3>
        </node2>
        <node3>
          <!-- I'm a comment! Leave me be! -->
        </node3>
        <node4>
        <![CDATA[
          I'm CDATA!  Changing me would be bad!
        ]]>
        </node4>
    </node1>
    """
    
    soup = BeautifulSoup.BeautifulStoneSoup(s)
    
    for t in soup.findAll(text=True):
        if type(t) is BeautifulSoup.NavigableString:  # Ignores comments and CDATA
            t.replaceWith(t.strip())
    
    print soup
    
    Van Gale : I don't believe this is quite right because it will strip valid whitespace at the end of contents. But, it reminded me that my snippet does the wrong thing with CDATA so thanks for that! :)
    : Thanks! This does exactly what I wanted
    Johannes Weiß : But that does CHANGE the document! It's not an equal XML document anymore...
    Van Gale : @Johannes Weiß: exactly
