Tuesday, April 5, 2011

Reading and editing HTML in .Net

Is there a .Net class for reading and manipulating HTML other than System.Windows.Forms.HtmlDocument?

If not, are there any open source libraries for this?

From stackoverflow
  • I would do something like this if it is XHTML-compliant:

    System.Xml.XmlDocument xDoc = new System.Xml.XmlDocument();
    xDoc.LoadXml(html);
    

    Then edit it that way. If it needs some cleaning up (conversion to XHTML), you can use HtmlTidy or Ntidy. For instance, with an HtmlTidy wrapper:

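    // HtmlTidy and HtmlTidyOptions come from the wrapper in the reference below,
    // not from the .NET Framework itself.
    string input = "<p>broken html<br <img src=test></div>";
    HtmlTidy tidy = new HtmlTidy();
    // Clean the broken markup and convert it to well-formed XHTML.
    string output = tidy.CleanHtml(input, HtmlTidyOptions.ConvertToXhtml);
    // The cleaned output can now be loaded as XML.
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(output);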
    

    StackOverFlow Reference

    EDIT: the input above will be converted to XHTML.
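
    Once the markup is loaded, "editing it that way" could look something like the sketch below. It is only an illustration: the PrefixLinks helper, the sample markup and the example.com prefix are made up for this example, and it assumes the input has no xmlns declaration (real XHTML in the http://www.w3.org/1999/xhtml namespace would need an XmlNamespaceManager for the XPath query to match).

    ```csharp
    using System;
    using System.Xml;

    class XhtmlEditSketch
    {
        // Hypothetical helper: prepends a prefix to the href of every link.
        public static string PrefixLinks(string xhtml, string prefix)
        {
            var doc = new XmlDocument();
            doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

            // Plain element names work here only because the input has no xmlns.
            foreach (XmlElement a in doc.SelectNodes("//a[@href]"))
            {
                a.SetAttribute("href", prefix + a.GetAttribute("href"));
            }

            return doc.OuterXml;
        }

        static void Main()
        {
            string xhtml = "<div><p>See <a href=\"/docs\">the docs</a>.</p></div>";
            Console.WriteLine(PrefixLinks(xhtml, "http://example.com"));
        }
    }
    ```

    Note that LoadXml rejects anything that is not well-formed, which is why the tidy step matters for ordinary HTML.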

    ChrisW : Surely that only works with XHTML: not with HTML.
    cgreeno : Y is this down voted? Is it not a valid option????
    hmcclungiii : I'd imagine it was down voted because the question had nothing to do with XML.
    cgreeno : YES but the question asks for other OPTIONS on how to manipulate HTML! XHTML is just a reformulation of HTML in XML.
    Cyril Gupta : I don't think it deserves a down vote, so I voted it up.
    hmcclungiii : Then he'll fall into the trap of XML validation among many other things, that I'd guess by his wording would be way more than he is bargaining for. Instead of manipulating straight HTML, you would suggest he "reformulate" it? Sorry, I just don't agree, and I think your CAPS are a bit rude.
    cgreeno : Reformulating it? XHTML is valid HTML as well, so by turning HTML into XHTML you would not only be manipulating the required data but also outputting something better. You may not agree, but it is a valid option.
    hmcclungiii : Oh, I didn't down vote it. Without knowing exactly what his purposes are, I would say that XHTML is overkill, to put it more simply.
  • Why don't you like System.Windows.Forms.HtmlDocument and Microsoft.mshtml?

    mdresser : Because it requires a reference to System.Windows.Forms, which isn't really appropriate for a class library or for ASP.NET.
  • You could use the MSHTML library. It is COM/ActiveX, but if you are using Visual Studio, it will create a managed wrapper for you automatically.

    ChrisW : Is the (unmanaged) MSHTML library the same thing as the (managed) System.Windows.Forms.HtmlDocument?
    ChrisW : I assumed that HtmlDocument is a managed wrapper around the unmanaged MSHTML ... you're saying this isn't so?
  • You can always use the LiteralControl:

    PlaceHolder.Controls.Add(new LiteralControl("<div>some html</div>"));
    
  • It seems that the best option for parsing HTML in .Net apps is the Html Agility Pack library, found on CodePlex. It provides full DOM access to the HTML and is very straightforward to use.
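
    As a rough sketch of what that looks like, the example below loads a fragment that is not well-formed, changes every link, and writes the markup back out. It assumes the HtmlAgilityPack assembly is referenced; the AddNoFollow helper and the sample markup are invented for illustration.

    ```csharp
    using System;
    using HtmlAgilityPack;

    class AgilityPackSketch
    {
        // Hypothetical helper: marks every link in the fragment as rel="nofollow".
        public static string AddNoFollow(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html); // tolerates markup that is not well-formed XML

            // SelectNodes returns null (not an empty list) when nothing matches.
            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
            {
                foreach (HtmlNode link in links)
                {
                    link.SetAttributeValue("rel", "nofollow");
                }
            }

            return doc.DocumentNode.OuterHtml;
        }

        static void Main()
        {
            string html = "<p>unclosed paragraph <a href=test>link</a>";
            Console.WriteLine(AddNoFollow(html));
        }
    }
    ```

    Unlike the XmlDocument route, no conversion to XHTML is needed first, and there is no dependency on System.Windows.Forms.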
