Friday, May 6, 2011

C#: "Swedish" characters in XPath when parsing Latin1-encoded docs.

I have a set of HTML docs that I need to parse. They are encoded in Latin-1 (ISO-8859-1). I'm using the Html Agility Pack for "parsing".

I have an XPath query (containing Swedish characters) that I can't get to work. I suspect a mismatch between the encoding of the docs and the encoding Visual Studio uses to store the XPath query in the source file.

XPath query:

doc.DocumentNode.SelectNodes(@"//h2[text()='Företag']/../div//span[text()='Resultat:']/../div");

The XPath query works fine in the Firefox extension XPath Checker.
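One way to narrow down an encoding mismatch like this is to take the source file's own encoding out of the picture by writing the Swedish characters as `\u` escapes, and to check what the document bytes turn into under each candidate encoding. The following is only a minimal sketch under those assumptions: the class and method names are made up for illustration, and it uses an in-memory byte array rather than the real HTML files.

```csharp
using System;
using System.Text;

class EncodingCheck
{
    // Hypothetical helper: decode raw bytes using the named encoding.
    public static string Decode(byte[] bytes, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }

    static void Main()
    {
        // "Företag" written with an escape, so the result does not depend on
        // how the editor saved this source file.
        string expected = "F\u00F6retag";

        // Simulate the bytes as they would appear in a Latin-1 document
        // ('ö' is the single byte 0xF6).
        byte[] latin1Bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(expected);

        // Decoding with the matching encoding restores the text.
        Console.WriteLine(Decode(latin1Bytes, "iso-8859-1") == expected); // True

        // Decoding the same bytes as UTF-8 does not: 0xF6 is not valid UTF-8,
        // so it is replaced and the comparison fails.
        Console.WriteLine(Decode(latin1Bytes, "utf-8") == expected); // False
    }
}
```

If the second case is what happens when the docs are loaded, no XPath literal will match until the documents are read as Latin-1. As far as I know the Html Agility Pack offers a `Load` overload that takes an `Encoding`, but treat that as something to verify against the version in use. Note also that `\u` escapes are not processed inside verbatim (`@"..."`) strings, so the `@` prefix would have to be dropped from the query to use them.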

From Stack Overflow
  • Could you provide more sample code and some input XML document? From the information given I wrote a little sample program which just works as expected. Does the following work for you?

    Sample document:

    <?xml version="1.0" encoding="iso-8859-1"?>
    <doc>
      <test>Företag</test>
      <test>Hallå</test>
    </doc>
    

    C#

    using System;
    using System.Xml.XPath;
    
    class Program
    {
        static void Main(string[] args)
        {
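            // Load the sample; the XML parser honors the encoding named in the XML declaration.
            XPathDocument xpdoc = new XPathDocument(@"sample.xml");
            XPathNavigator nav = xpdoc.CreateNavigator();
            // Select every element that has a text node exactly equal to 'Företag'.
            XPathNodeIterator iter = nav.Select("//*[text() = 'Företag']");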
    
            while (iter.MoveNext())
            {
                Console.WriteLine(iter.Current.ToString());
            }
        }
    }
    

    Output

    Företag
    

    From the sample code given it seems that you are using the Microsoft.Windows.Design.Documents.Trees.DocumentNode class. However, the documentation states that this class is not intended to be used directly. May I ask what you are trying to do?

    Update: You might be facing an issue with whitespace normalization (which may be done by your Firefox add-in but not by your code). Have you tried changing your XPath by replacing the test text() = 'Företag' with normalize-space() = 'Företag', just to rule out additional leading or trailing whitespace?

    Tomalak : +1 I was also thinking of "normalize-space()".
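To illustrate the whitespace point from the update, here is a small sketch comparing the two tests on the same input. The padded markup and the `CountMatches` helper are assumptions made up for the example; it uses the same `System.Xml.XPath` classes as the sample above rather than the Html Agility Pack.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

class NormalizeSpaceDemo
{
    // Hypothetical helper: count the nodes an XPath expression selects
    // in an XML document held in a string.
    public static int CountMatches(string xml, string xpath)
    {
        XPathDocument doc = new XPathDocument(new StringReader(xml));
        XPathNavigator nav = doc.CreateNavigator();
        return nav.Select(xpath).Count;
    }

    static void Main()
    {
        // Same text as the sample document, but padded with line breaks and
        // spaces, as often happens in real HTML.
        string xml = "<doc><test>\n  F\u00F6retag \n</test><test>Hall\u00E5</test></doc>";

        // Exact comparison against the text node fails because of the padding.
        Console.WriteLine(CountMatches(xml, "//*[text() = 'F\u00F6retag']")); // 0

        // normalize-space() trims and collapses whitespace before comparing.
        Console.WriteLine(CountMatches(xml, "//*[normalize-space() = 'F\u00F6retag']")); // 1
    }
}
```

If the second query matches where the first does not, the problem is whitespace rather than encoding.<test>Hall\u00E5</test></doc>";
if (NormalizeSpaceDemo.CountMatches(xml, "//*[text() = 'F\u00F6retag']") != 0) throw new System.Exception("exact text() test should not match padded text");
if (NormalizeSpaceDemo.CountMatches(xml, "//*[normalize-space() = 'F\u00F6retag']") != 1) throw new System.Exception("normalize-space() test should match once");
</test>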
