I need to write a program to download webpages: I give the software a webpage and it downloads all the files on that website.
I would also pass a depth level, that is, how many levels deep the software should follow links and download each file on the website.
I will develop this software in Java, and I also need to use concurrency.
Please tell me your opinion on how to do this.
Thanks for the help.
Thanks to everyone for the help.
I need to ask one more thing: how do I download a file from the website?
Thanks one more time. =D
-
Well, this is a bit hard to answer without knowing how much guidance you need, but here's an overview. :)
Java makes such applications quite easy, actually, since both HTTP requests and threading are readily available. My solution would probably involve a global stack containing new URLs and a farm of a fixed number of threads that pop URLs from the stack. I'd store the URLs as a custom object so that I could keep track of the depth.
I think your main issue here will be with sites that don't respond or don't follow the HTTP standard. I've noticed many times in similar applications that these sometimes don't time out properly, and eventually they end up blocking all the threads. Unfortunately I don't have any good solutions for that.
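To give a rough idea of the shape I have in mind, here is a minimal sketch; the class names, the use of LinkedBlockingDeque as the shared stack, and the pool of four threads are all just placeholders, and the actual download/parse step is left as a comment:

    import java.util.concurrent.BlockingDeque;
    import java.util.concurrent.LinkedBlockingDeque;

    // One unit of work: a URL plus the depth at which it was found.
    class CrawlTask {
        final String url;
        final int depth;
        CrawlTask(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    public class Crawler {
        // Shared "stack" of URLs still to visit; the deque is thread-safe.
        private final BlockingDeque<CrawlTask> frontier = new LinkedBlockingDeque<CrawlTask>();
        private final int maxDepth;

        public Crawler(String startUrl, int maxDepth) {
            this.maxDepth = maxDepth;
            frontier.push(new CrawlTask(startUrl, 0));
        }

        public void start(int workerCount) {
            for (int i = 0; i < workerCount; i++) {
                new Thread(new Runnable() {
                    public void run() {
                        CrawlTask task;
                        // Naive termination: a worker quits when the stack is empty,
                        // even though another worker may still be about to add links.
                        while ((task = frontier.poll()) != null) {
                            System.out.println("would fetch " + task.url + " at depth " + task.depth);
                            // download the page here and, if task.depth < maxDepth,
                            // push every link found with depth task.depth + 1
                        }
                    }
                }).start();
            }
        }

        public static void main(String[] args) {
            new Crawler("http://example.com/", 2).start(4);
        }
    }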
A few useful classes as a starting point:
http://java.sun.com/javase/6/docs/api/java/lang/Thread.html
http://java.sun.com/javase/6/docs/api/java/lang/ThreadGroup.html
http://java.sun.com/javase/6/docs/api/java/net/URL.html
http://java.sun.com/javase/6/docs/api/java/net/HttpURLConnection.html
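Since you also asked how to download a single file: with the URL and HttpURLConnection classes above, a minimal sketch looks roughly like this (the address and file name are placeholders; the timeouts are there because of the unresponsive sites mentioned earlier):

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class Download {
        // Saves the resource at the given address to a local file.
        static void saveUrl(String address, String fileName) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
            conn.setConnectTimeout(10000); // 10 s to establish the connection
            conn.setReadTimeout(10000);    // 10 s between reads
            InputStream in = conn.getInputStream();
            OutputStream out = new FileOutputStream(fileName);
            try {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            } finally {
                in.close();
                out.close();
                conn.disconnect();
            }
        }

        public static void main(String[] args) throws IOException {
            saveUrl("http://example.com/index.html", "index.html");
        }
    }
-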
I would look at these resources:
http://hc.apache.org/httpclient-3.x/
http://java.sun.com/javase/6/docs/api/java/util/concurrent/package-summary.html
http://java.sun.com/javase/6/docs/api/java/util/concurrent/locks/package-summary.html
-
I would have a look at the Java Executors package. You create a set of tasks (Runnables or Callables) and pass them to a suitably chosen Executor. You get a Future back, and you can then query it for its result.
The Executor will coordinate when each task is executed. Implementations exist for single-threaded executors, executors with a pool of threads, etc., so you don't need to worry (too much) about the threading intricacies. The concurrency utilities will look after this for you.
Apache HTTP Client will look after the HTTP querying for you.
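Put together, a rough sketch of that combination might look like the following; the fixed pool of five threads, the example URL, and having the HttpClient 3.x jar on the classpath are all assumptions here:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.methods.GetMethod;

    public class FetchWithExecutor {
        public static void main(String[] args) throws Exception {
            // A fixed pool of worker threads; the executor handles the scheduling.
            ExecutorService pool = Executors.newFixedThreadPool(5);

            // Each task fetches one page and returns its body.
            Future<String> result = pool.submit(new Callable<String>() {
                public String call() throws Exception {
                    HttpClient client = new HttpClient();
                    GetMethod get = new GetMethod("http://example.com/");
                    try {
                        client.executeMethod(get);
                        return get.getResponseBodyAsString();
                    } finally {
                        get.releaseConnection();
                    }
                }
            });

            // Future.get() blocks until the task has finished.
            System.out.println(result.get().length() + " characters downloaded");
            pool.shutdown();
        }
    }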
-
A very useful library for spiders and bots: htmlunit
lucas: While there are oodles of libraries to do this sort of thing, htmlunit has always been my favorite also, maybe combined with a different parser like tagsoup (html) or xom (xml).
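As an illustration, pulling all the links out of a page with htmlunit takes only a few lines; the URL is a placeholder and the htmlunit jars are assumed to be on the classpath:

    import java.util.List;

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class ListLinks {
        public static void main(String[] args) throws Exception {
            WebClient webClient = new WebClient();
            // Load the page and list every <a href="..."> it contains.
            HtmlPage page = webClient.getPage("http://example.com/");
            List<HtmlAnchor> anchors = page.getAnchors();
            for (HtmlAnchor anchor : anchors) {
                System.out.println(anchor.getHrefAttribute());
            }
        }
    }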