This post is all about the spider. It’s small but fun.
After finishing this web crawler, I learned one thing: the most important part of building a web crawler is not the code but the reverse-engineering process, that is, working out how to make your program behave like a web browser.
A static web page is easy to fetch, but what about a dynamic one? Nowadays, more and more websites use Ajax to meet their business requirements. If you just send a simple request to the site's server, you will find that the returned page is only part of the full page you see in the web browser.
Here is my demonstration. duitang.com is a fun image-sharing website. I love a cute virtual character named Mr.zhangcao, and the site has a lot of images of him. So I wanted to build a web crawler to download all of those images.
If you send a request to the main page and analyze the returned page (Python’s BeautifulSoup library is helpful for parsing HTML files), you will find some image links in the HTML. But you will only get 20 or so images, not all the images you can view in the web browser. When you drag the browser's scroll bar down, you will see more and more images being loaded.
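To make the first step concrete, here is a minimal sketch of pulling image links out of a page that has already been downloaded. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages. The names `ImageCollector` and `extract_image_urls` are made up for this illustration, and the real markup on duitang.com may well need a more specific selector than "every `<img>` tag".

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; skip <img> tags without src.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


def extract_image_urls(html):
    """Return the image URLs found in an HTML string, in document order."""
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    return parser.urls
```

Running this on the HTML returned by a plain request to the main page would give only the first batch of links, which is exactly the limitation described above.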
The network monitor in the browser's developer tools is a great help to anyone writing a web spider.
If you dig into the details of the HTTP traffic, you will find a special request that the browser sends to the server.
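Once you have spotted that request, the spider can replay it page by page. The sketch below shows the general idea only. The query parameters `start` and `limit`, the JSON-array response, and the helper names `build_page_url` and `crawl_all` are all assumptions for illustration; the real endpoint and parameters have to be read off the browser's network monitor.

```python
import json
from urllib.parse import urlencode


def build_page_url(endpoint, start, limit=24, **extra):
    """Build the URL for one page of results (parameter names are assumed)."""
    query = dict(extra, start=start, limit=limit)
    return endpoint + "?" + urlencode(query)


def crawl_all(endpoint, fetch, limit=24, max_pages=100):
    """Request page after page until the server returns an empty list.

    `fetch` is any callable that takes a URL and returns the response body
    as text, so the paging logic can be tried out without touching the network.
    """
    images = []
    for page in range(max_pages):
        url = build_page_url(endpoint, start=page * limit, limit=limit)
        items = json.loads(fetch(url))  # assumed: the body is a JSON array
        if not items:
            break
        images.extend(items)
    return images
```

In a real run, `fetch` could be a small wrapper around `urllib.request.urlopen` that also sends the same headers the browser sent, since some servers check them before answering Ajax requests.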