Thursday, February 3, 2011

Get size of a file before downloading in Python

I'm downloading an entire directory from a web server. It works OK, but I can't figure out how to get the file size before downloading, so that I can tell whether a file was updated on the server. Can this be done the way it would be if I were downloading the file from an FTP server?

import urllib
import re

url = "http://www.someurl.com"

# Download the page locally
f = urllib.urlopen(url)
html = f.read()
f.close()

f = open ("temp.htm", "w")
f.write (html)
f.close()

# List only the .TXT / .ZIP files
fnames = re.findall(r'^.*<a href="(\w+(?:\.txt|\.zip))".*$', html, re.MULTILINE)

for fname in fnames:
    print fname, "..."

    f = urllib.urlopen(url + "/" + fname)

    #### Here I want to check the filesize to download or not ####
    file = f.read()
    f.close()

    f = open (fname, "w")
    f.write (file)
    f.close()
  • The size of the file is sent as the Content-Length header. Here is how to get it with urllib:

    >>> site = urllib.urlopen("http://python.org")
    >>> meta = site.info()
    >>> print meta.getheaders("Content-Length")
    ['16535']
    >>>
    From Jon Works
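
For readers on Python 3, where urllib.urlopen moved to urllib.request.urlopen, the same lookup might be sketched as below. It sends a HEAD request so that only the headers are transferred. The helper names parse_content_length and remote_size are made up for this illustration, and the sketch assumes the server answers HEAD requests and sends a Content-Length header, which not every server does.

```python
import urllib.request


def parse_content_length(headers):
    """Return Content-Length as an int, or None if absent or malformed."""
    value = headers.get("Content-Length")
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def remote_size(url, timeout=10):
    """Ask the server for headers only (HEAD) and read Content-Length."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_content_length(response.headers)
```

Calling remote_size(url + "/" + fname) would then give the size in bytes, or None when the server does not report one.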
  • Using the info() method of the object returned by urllib.urlopen(), you can get various information about the retrieved document. Here is an example that fetches the current Google logo:

    >>> import urllib
    >>> d = urllib.urlopen("http://www.google.co.uk/logos/olympics08_opening.gif")
    >>> print d.info()

    Content-Type: image/gif
    Last-Modified: Thu, 07 Aug 2008 16:20:19 GMT
    Expires: Sun, 17 Jan 2038 19:14:07 GMT
    Cache-Control: public
    Date: Fri, 08 Aug 2008 13:40:41 GMT
    Server: gws
    Content-Length: 20172
    Connection: Close

    It behaves like a dict, so to get the size of the file (as a string), use urllibobject.info()['Content-Length']:

    print f.info()['Content-Length']

    And to get the size of the local file (for comparison), you can use the os.stat() function:

    os.stat("/the/local/file.zip").st_size
    From dbr
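
The comparison described in this answer could be wrapped in a small helper along these lines. This is only a sketch in Python 3: needs_download is a hypothetical name, and remote_size is assumed to be the Content-Length value already converted to an int (or None if the server sent none).

```python
import os


def needs_download(remote_size, local_path):
    """Return True when the local copy is missing or its size differs."""
    if remote_size is None:
        # Size unknown (no Content-Length header): download to be safe.
        return True
    try:
        local_size = os.stat(local_path).st_size
    except FileNotFoundError:
        # No local copy yet.
        return True
    return local_size != remote_size
```

Equal sizes do not prove equal contents, so this is best treated as a cheap first check.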
  • Also, if the server you are connecting to supports it, look at ETags and the If-Modified-Since and If-None-Match headers.

    Using these takes advantage of the web server's caching rules: the server returns a 304 Not Modified status code if the content hasn't changed.

    From Jon Works
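
A conditional request of this kind might look roughly like the following in Python 3. It is a sketch, not code from the thread: conditional_headers and fetch_if_changed are hypothetical names, and it assumes the ETag and Last-Modified values from the previous download were saved somewhere and are passed back in.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from validators saved after an earlier download."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None, timeout=10):
    """Return (body, etag, last_modified), or None if the server answers 304."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # Not Modified: keep the local copy
        raise
```

urllib.request reports a 304 response as an HTTPError, which is why the unchanged case is handled in the except branch.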
  • @Jon: thanks for your quick answer. It works, but the file size reported by the web server is slightly smaller than the size of the downloaded file.

    Examples:

    Local Size   Server Size
    2.223.533    2.115.516
    664.603      662.121

    Does it have anything to do with the CR/LF conversion?

    From PabloG
  • Possibly. Can you run diff on the two copies and see a difference? Also, do you see the file size difference in the binary (.zip) files?

    Edit:

    This is where things like ETags come in handy. The server will tell you when something changes, so you don't have to download the complete file to figure it out.

    From Jon Works
  • I have reproduced what you are seeing:

    import urllib, os
    link = "http://python.org"
    print "opening url:", link
    site = urllib.urlopen(link)
    meta = site.info()
    print "Content-Length:", meta.getheaders("Content-Length")[0]

    f = open("out.txt", "r")
    print "File on disk:",len(f.read())
    f.close()


    f = open("out.txt", "w")
    f.write(site.read())
    site.close()
    f.close()

    f = open("out.txt", "r")
    print "File on disk after download:",len(f.read())
    f.close()

    print "os.stat().st_size returns:", os.stat("out.txt").st_size

    Outputs this:

    opening url: http://python.org
    Content-Length: 16535
    File on disk: 16535
    File on disk after download: 16535
    os.stat().st_size returns: 16861

    What am I doing wrong here? Is os.stat().st_size not returning the correct size?


    Edit: OK, I figured out what the problem was:

    import urllib, os
    link = "http://python.org"
    print "opening url:", link
    site = urllib.urlopen(link)
    meta = site.info()
    print "Content-Length:", meta.getheaders("Content-Length")[0]

    f = open("out.txt", "rb")
    print "File on disk:",len(f.read())
    f.close()


    f = open("out.txt", "wb")
    f.write(site.read())
    site.close()
    f.close()

    f = open("out.txt", "rb")
    print "File on disk after download:",len(f.read())
    f.close()

    print "os.stat().st_size returns:", os.stat("out.txt").st_size

    This outputs:

    $ python test.py
    opening url: http://python.org
    Content-Length: 16535
    File on disk: 16535
    File on disk after download: 16535
    os.stat().st_size returns: 16535

    Make sure you are opening both files for binary read/write.

    # open for binary write
    open(filename, "wb")
    # open for binary read
    open(filename, "rb")
    From Jon Works
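
The size gap seen earlier is consistent with newline translation in text mode. The Python 3 sketch below is not taken from the thread: it passes newline="\r\n" explicitly to mimic, on any platform, the "\n" to "\r\n" translation that text mode applies by default on Windows, and the payload is an invented stand-in for data read from the server.

```python
import os
import tempfile

payload = b"line one\nline two\n"  # 18 bytes, as if read from the server

with tempfile.TemporaryDirectory() as folder:
    text_path = os.path.join(folder, "text.txt")
    binary_path = os.path.join(folder, "binary.txt")

    # Text mode with newline="\r\n": every "\n" written becomes "\r\n".
    with open(text_path, "w", newline="\r\n") as handle:
        handle.write(payload.decode("ascii"))

    # Binary mode writes the bytes exactly as received.
    with open(binary_path, "wb") as handle:
        handle.write(payload)

    text_size = os.stat(text_path).st_size
    binary_size = os.stat(binary_path).st_size

print(len(payload), text_size, binary_size)
```

Here the text-mode file comes out two bytes larger than the payload, one extra byte per newline, while the binary-mode file matches it exactly.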
  • @Jon: you're right, I wasn't using "wb" when opening the local file for writing. Works like a charm! Thx

    From PabloG
