Web Crawler
Finished my first web crawler … =_=|
Here is all about the spider. It’s small but fun.
After finishing this web crawler, I learned that the most important part of building one is not the code but the reverse engineering: working out how to make your program behave like a web browser.
A static web page is easy to fetch, but what about a dynamic one? Nowadays more and more websites use Ajax to meet their business requirements. If you just send a simple request to the server of such a site, you will find that the returned page is only part of the full page you saw in the web browser.
Here is my demonstration.
duitang.com is a fun image website. I love a cute virtual character named Mr.zhangcao. There are a lot of images of that guy, so I wanted to build a web crawler to get them all.
If you send a request to the main page and analyze the returned page (Python’s BeautifulSoup library is helpful for parsing HTML files), you will find some image links in the HTML. But you will get only 20 or so images, not all the images you saw in the web browser. When you drag the browser’s scroll bar down, more and more images appear.
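As a rough illustration of that first step, here is a minimal sketch that pulls image links out of a returned page. It uses only the standard library's `html.parser` so that it runs without extra packages; BeautifulSoup would do the same job in fewer lines. The sample markup, the `example.com` URLs, and the names `ImageLinkParser` and `extract_image_links` are all made up for the example and are not taken from duitang.com.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL.
            self.links.append(urljoin(self.base_url, src))


def extract_image_links(html, base_url):
    """Return the absolute URLs of all images found in the HTML text."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical markup standing in for the page returned by the server.
SAMPLE_HTML = """
<html><body>
  <img src="/img/a.jpg">
  <img src="http://example.com/img/b.jpg">
  <div id="more"></div>
</body></html>
"""

if __name__ == "__main__":
    for link in extract_image_links(SAMPLE_HTML, "http://example.com/"):
        print(link)
```

The empty `div` in the sample stands for the part of the page that the browser fills in later with JavaScript, which is why a parser that sees only the first response finds so few images.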
Why doesn’t the returned page contain all the image links? The answer is Ajax (asynchronous JavaScript and XML), a set of web development techniques that use client-side web technologies to create asynchronous web applications. The browser fetches the remaining images with extra requests made after the first page has loaded.
The network monitor in the browser helps a programmer who is writing a web spider.
Dig into the details of the HTTP traffic and you will find a special request that the client browser sends to the server.
This is that special URL.
If the client requests that URL from the server, the server returns a JSON object. The remaining job is to parse the JSON object and extract the resource links.
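The idea can be sketched roughly as follows. The real URL and the real layout of the JSON are not reproduced here, so everything specific in this sketch is an assumption: the `{start}` paging parameter, the `data` / `items` / `img` / `more` field names, and the helper names `fetch_json`, `extract_links` and `crawl` are all hypothetical. You would replace them with whatever the network monitor shows for the site you are crawling.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and decode the response body as JSON."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_links(payload):
    """Pull image URLs out of one decoded response.

    Assumed (hypothetical) shape:
    {"data": {"items": [{"img": "..."}], "more": true}}
    """
    items = payload.get("data", {}).get("items", [])
    return [item["img"] for item in items if "img" in item]


def crawl(url_template, fetch=fetch_json, page_size=24, max_pages=50):
    """Request page after page until the server reports no more data."""
    links = []
    for page in range(max_pages):
        payload = fetch(url_template.format(start=page * page_size))
        links.extend(extract_links(payload))
        if not payload.get("data", {}).get("more"):
            break
    return links
```

Passing the download function in as `fetch` keeps the paging logic separate from the network, so it can be tried out with canned responses before pointing it at a real server.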
The important part of building a web crawler is not the code but finding the right URL, the one that leads to the information you actually want.
Here is a screenshot of what my web crawler got.
Download some PDF documents from a website
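For the PDF case, one possible approach is to collect every link on a page that points at a `.pdf` file and then save each one. This is only a sketch under that assumption, not the code behind the screenshot; the names `PdfLinkParser`, `find_pdf_links` and `download_all` and the `pdfs` target directory are invented for the example.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve


class PdfLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") if tag == "a" else None
        # Check the path only, so query strings do not hide the extension.
        if href and urlparse(href).path.lower().endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return the absolute URLs of all PDF links in the HTML text."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def download_all(links, target_dir="pdfs"):
    """Save each linked file into target_dir under its own file name."""
    os.makedirs(target_dir, exist_ok=True)
    for link in links:
        name = os.path.basename(urlparse(link).path)
        urlretrieve(link, os.path.join(target_dir, name))
```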
Photo by Jason Leaster. LiuYe Lake in ChangDe, HuNan, China.
Author: Jason Leaster
Source: http://jasonleaster.github.io
Link: http://jasonleaster.github.io/2016/02/19/web-crawler/
This article is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.