Friday, April 8, 2011

Parsing an HTML file with selectorgadget.com

How can I use Beautiful Soup and SelectorGadget to scrape a website? For example, I have a page (a Newegg product) and I would like my script to return all of the specifications of that product (click on SPECIFICATIONS). By this I mean: Intel, Desktop, ......, 2.4GHz, 1066MHz, ......, 3 years limited.

After using SelectorGadget I get the string .desc

How do I use this?

Thanks :)

From stackoverflow
  • Inspecting the page, I can see that the specifications are placed in a div with the ID pcraSpecs:

    <div id="pcraSpecs">
      <script type="text/javascript">...</script>
      <TABLE cellpadding="0" cellspacing="0" class="specification">
        <TR>
          <TD colspan="2" class="title">Model</TD>
        </TR>
        <TR>
          <TD class="name">Brand</TD>
          <TD class="desc"><script type="text/javascript">document.write(neg_specification_newline('Intel'));</script></TD>
        </TR>
        <TR>
          <TD class="name">Processors Type</TD>
          <TD class="desc"><script type="text/javascript">document.write(neg_specification_newline('Desktop'));</script></TD>    
        </TR>
        ...
      </TABLE>
    </div>
    

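    desc is the class of the table cells that hold the values, so the SelectorGadget string .desc is a CSS class selector that matches those cells.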

    What you want to do is to extract the contents of this table.

    soup.find(id="pcraSpecs").findAll("td") should get you started.

  • Have you tried using Feedity (http://feedity.com) to create a custom RSS feed from any webpage?
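
As a rough sketch of the first answer's suggestion, the snippet below looks up the pcraSpecs div, keeps only the cells with class desc, and pulls each value out of the inline script, since in the markup shown the text is written by document.write rather than stored as plain cell text. It assumes the modern bs4 package (where findAll is spelled find_all) and that the live page still matches the excerpt above. The names extract_specs and VALUE_RE are made up for this illustration, and the sample markup is an abridged copy of the excerpt.

```python
import re

from bs4 import BeautifulSoup

# Abridged copy of the markup quoted in the answer (assumption: the
# live page has the same structure).
SAMPLE_HTML = """
<div id="pcraSpecs">
  <TABLE cellpadding="0" cellspacing="0" class="specification">
    <TR><TD colspan="2" class="title">Model</TD></TR>
    <TR>
      <TD class="name">Brand</TD>
      <TD class="desc"><script type="text/javascript">document.write(neg_specification_newline('Intel'));</script></TD>
    </TR>
    <TR>
      <TD class="name">Processors Type</TD>
      <TD class="desc"><script type="text/javascript">document.write(neg_specification_newline('Desktop'));</script></TD>
    </TR>
  </TABLE>
</div>
"""

# Hypothetical pattern: captures the quoted argument of the
# neg_specification_newline(...) call seen in the excerpt.
VALUE_RE = re.compile(r"neg_specification_newline\('(.*?)'\)")


def extract_specs(html):
    """Return the values of all .desc cells inside the pcraSpecs div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id="pcraSpecs")
    specs = []
    # class_="desc" is the Beautiful Soup equivalent of the .desc selector.
    for cell in container.find_all("td", class_="desc"):
        script = cell.find("script")
        # Read the script source when present, else fall back to cell text.
        if script is not None and script.string:
            source = script.string
        else:
            source = cell.get_text()
        match = VALUE_RE.search(source)
        specs.append(match.group(1) if match else source.strip())
    return specs


print(extract_specs(SAMPLE_HTML))  # ['Intel', 'Desktop']
```

To run this against the real product page, the HTML would first have to be downloaded and passed to extract_specs in place of SAMPLE_HTML.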
