Tuesday, April 5, 2011

XmlDocument dropping encoded characters

My C# application loads XML documents using the following code:

XmlDocument doc = new XmlDocument();
doc.Load(path);

Some of these documents contain encoded characters, for example:

<xsl:text>&#10;</xsl:text>

I notice that when these documents are loaded, &#10; gets dropped.

My question: How can I preserve <xsl:text>&#10;</xsl:text>?

FYI - The XML declaration used for these documents:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
From stackoverflow
  • Are you sure the character is dropped? Character 10 is just a line feed, so it wouldn't exactly show up in your debugger window. It could also be treated as whitespace. Have you tried playing with the whitespace settings on your XmlDocument?


    If you need to preserve the encoding you only have two choices: a CDATA section, or reading the file as plain text rather than XML. I suspect you have absolutely no control over the documents that come into the system, which eliminates the CDATA option.

    Plain text rather than XML is probably distasteful as well, but it's all you have left. If you need to do validation or other processing, you could first load and verify the XML, and then concatenate your files using simple file streams as a separate step. Again: not ideal, but it's all that's left.
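
    The two-step idea above could be sketched roughly as follows. This is only an illustration under assumptions: the class name `TwoStepRead`, the helper `VerifyThenReadRaw`, and the temporary sample file are hypothetical, and the combining step is left out.

```csharp
using System;
using System.IO;
using System.Xml;

static class TwoStepRead
{
    // Hypothetical helper.
    // Step 1: parse the file only to confirm it is well-formed
    //         (XmlDocument.Load throws XmlException if it is not).
    // Step 2: return the file text as authored, so character
    //         references such as &#10; are never decoded.
    public static string VerifyThenReadRaw(string path)
    {
        var doc = new XmlDocument();
        doc.Load(path);                // verification only; the DOM is discarded
        return File.ReadAllText(path); // raw text, untouched by the parser
    }

    static void Main()
    {
        // Hypothetical sample input written to a temporary file.
        string path = Path.GetTempFileName();
        File.WriteAllText(path,
            "<xsl:text xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">&#10;</xsl:text>");

        string raw = VerifyThenReadRaw(path);
        Console.WriteLine(raw.Contains("&#10;")); // the reference survives in the raw text

        File.Delete(path);
    }
}
```

    Any edits to the smaller documents would then have to be made on the raw text rather than on the DOM, which is the cost of this approach.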

    Brian Singh : With `PreserveWhitespace = true;` I see it in the debug window (inner XML) and also when the file is saved, but it is unencoded. My app is an intermediary; it combines a number of smaller XML documents into a single larger document, so I need to preserve the encoded form.
    Brian Singh : I would just append them all together if there were not business requirements that dictate modifications to the smaller XML as the larger is constructed.
    Jon Skeet : I don't see why you need to preserve the encoded form - every XML parser should treat the two as being the same. Could you explain the requirement in more detail?
    Brian Singh : Joel – Correct. I have no control over the input documents. I am leaning towards using file streams and regular expressions to achieve what I need to do.
    Brian Singh : Jon – The purpose of my application is to automate a manual process done by the front-end team (creators of the XSLTs). I am taking files as input and generating files as output.
    Brian Singh : From the front-end team perspective I am simply building a larger XSLT using their existing smaller XSLTs and applying some rules on what gets included in the final file.
    Brian Singh : They are determining the correctness of the results based on the input they had provided me, and it makes them very nervous when things are missing (&#10; is only one example).
    Joel Coehoorn : I advise against regular expressions. They're just not suited for evaluating this kind of document. The two-step approach, while perhaps slower, will result in cleaner, simpler, more maintainable code.
  • &#10; is a linefeed - i.e. whitespace. The XML parser will load it in as a linefeed, and thereafter ignore the fact that it was originally encoded. The encoding is just part of the serialization of the data to text format - it's not part of the data itself.

    Now, XML sometimes ignores whitespace and sometimes doesn't, depending on context, API etc. As Joel says you may find that it's not missing at all - or you may find that using it with an API which allows you to preserve whitespace fixes the problem. I wouldn't be at all surprised to see it turned into an unencoded linefeed character when you output the data though.
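
    A small sketch of what this looks like with XmlDocument, assuming whitespace preservation is switched on. The class name `CharRefDemo` and the inline sample string are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Xml;

static class CharRefDemo
{
    // Hypothetical sample: a single xsl:text element holding &#10;.
    const string Sample =
        "<xsl:text xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">&#10;</xsl:text>";

    static XmlDocument Load()
    {
        // Keep whitespace-only content instead of discarding it.
        var doc = new XmlDocument { PreserveWhitespace = true };
        doc.LoadXml(Sample);
        return doc;
    }

    // The text content as the DOM sees it: the decoded linefeed character.
    public static string LoadText()
    {
        return Load().DocumentElement.InnerText;
    }

    // The document written back out as markup.
    public static string Serialize()
    {
        return Load().OuterXml;
    }

    static void Main()
    {
        Console.WriteLine(LoadText() == "\n");          // linefeed is present in the DOM
        Console.WriteLine(Serialize().Contains("&#10;")); // original spelling is not kept
    }
}
```

    The linefeed is kept as data, but the fact that it was originally written as a character reference is not recorded anywhere in the DOM.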

    Brian Singh : Yes - It is indeed an unencoded linefeed character once the data is output - unfortunately I need to keep the encoded form.
    U62 : Does doc.PreserveWhitespace = true; help?
    bobince : No, it won't. A conforming XML processor may not distinguish between a newline character and a character reference to code 10 in element content, full stop. (It's different in attribute values.) Why do you need to keep the encoded form?
    Brian Singh : bobince - see comments section in Joel Coehoorn answer
  • Maybe it would be better to keep the data in a <![CDATA[ ... ]]> section?

    http://www.w3schools.com/XML/xml_cdata.asp
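
    A rough sketch of how a CDATA section behaves with XmlDocument, assuming the input documents could be changed (which the asker says is not the case here). The class name `CDataDemo` and the sample string are hypothetical.

```csharp
using System;
using System.Xml;

static class CDataDemo
{
    // Hypothetical sample: the same element, with its content wrapped in CDATA.
    const string Sample =
        "<xsl:text xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"><![CDATA[&#10;]]></xsl:text>";

    static XmlDocument Load()
    {
        var doc = new XmlDocument { PreserveWhitespace = true };
        doc.LoadXml(Sample);
        return doc;
    }

    // Text content as seen by the DOM.
    public static string Text()
    {
        return Load().DocumentElement.InnerText;
    }

    // The document written back out as markup.
    public static string Serialize()
    {
        return Load().OuterXml;
    }

    static void Main()
    {
        Console.WriteLine(Text().Length);                             // five literal characters
        Console.WriteLine(Serialize().Contains("<![CDATA[&#10;]]>")); // CDATA section is kept
    }
}
```

    Note the trade-off: inside CDATA, `&#10;` is no longer a character reference but the five literal characters `&`, `#`, `1`, `0`, `;`, so a stylesheet written this way would emit that text rather than a linefeed.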
