Thursday, March 3, 2011

Python Screenscraping

I want to create a REST API for TV listings in my country. While online aggregations of TV listings do exist, they're too tied to their presentation to be of any use to software developers.

To get hold of this information, I'm thinking of going to each source and scraping the relevant data. I've extracted similar information from HTML pages before, but it was an extremely cumbersome process. Do any Python features or libraries exist that would make this easier?

From stackoverflow
  • Beautiful Soup will save you a great deal of pain.

    ayaz : Seconded. BS is the first thing that naturally comes to mind.
    efotinis : I've also just recently discovered BeautifulSoup. Up until then I didn't know it was possible to fall in love with a piece of code... :b
  • Another option is to use lxml.html. I've occasionally found that it handles some pages better than BeautifulSoup (odd HTML comment corner cases), and the API may be more familiar if you've worked with XML. If BeautifulSoup does handle certain pages better, you can still use it while retaining the same interface by using the lxml.html.soupparser module.

    Prairiedogg : For all the good press BeautifulSoup gets in the Python community, I've found that 4 of the 6 sites I've scraped today make the latest version of BS choke, while lxml.html works perfectly. I could be doing something wrong, though, I reckon...
    Kurt : I'm finding the biggest problem is CSS/JavaScript insanity (www.ebay.com, for example, makes BeautifulSoup choke horribly: weird quoting, etc.). Slashdot is another such site: all their links start with // instead of http://.
  • While BeautifulSoup is a good piece of code, depending on what you are trying to extract from the web page, you may not need that much intelligence. The data you're looking for may be easily picked out by a regular expression, for example.

    Acorn : You're succumbing to the temptations of the dark god Cthulhu! http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
  • This isn't answering your question directly, but you may want to think about getting your data from another service, such as Schedules Direct. They provide XML, and they're the recommended data provider for xmltv.

  • Use mechanize to automate browsing, and BeautifulSoup to parse the HTML. (I do lots of stuff like what you described.)
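
To make the regular-expression suggestion above concrete, here is a minimal sketch that uses only the Python standard library. The HTML fragment, the class names, and the tv.example.com URL are all invented for illustration; real listing pages will look different. It also resolves links that start with // (the Slashdot case mentioned in the comments) via urljoin. This approach is only reasonable when the markup is very regular; for messy pages a real parser such as BeautifulSoup or lxml.html is the safer choice.

```python
import re
from urllib.parse import urljoin

# Hypothetical fragment of a TV listings page (invented for this example).
SAMPLE_HTML = """
<ul id="listings">
  <li class="show"><span class="time">18:00</span> <a href="//tv.example.com/news">Evening News</a></li>
  <li class="show"><span class="time">19:30</span> <a href="/shows/quiz">Quiz Hour</a></li>
</ul>
"""

# Assumed page URL, used to resolve relative and protocol-relative links.
BASE_URL = "http://tv.example.com/today"

# Matches one listing: a time span followed by a link with the show title.
# This pattern is tied to the made-up markup above and will not generalize.
SHOW_RE = re.compile(
    r'<span class="time">(?P<time>\d{2}:\d{2})</span>\s*'
    r'<a href="(?P<href>[^"]+)">(?P<title>[^<]+)</a>'
)


def extract_listings(html, base_url):
    """Return a list of dicts with the time, title, and absolute URL of each show."""
    listings = []
    for match in SHOW_RE.finditer(html):
        listings.append({
            "time": match.group("time"),
            "title": match.group("title").strip(),
            # urljoin turns "//host/path" and "/path" into absolute URLs.
            "url": urljoin(base_url, match.group("href")),
        })
    return listings


if __name__ == "__main__":
    for show in extract_listings(SAMPLE_HTML, BASE_URL):
        print(show["time"], show["title"], show["url"])
```

The resulting list of dicts could be serialized to JSON for a REST API. If the page's attribute order, quoting, or whitespace changes, the pattern silently stops matching, which is the main argument for using a parser instead.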
