Python Crawler Note 2


Besides what we have seen in Note 1, we can add some details to our code. Sometimes we need to pretend to be a browser to obtain the content of a page; we can do this by adding headers:

import urllib2
url = 'http://drapor.me'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, headers=headers) # the second positional argument is POST data, so pass headers by keyword
response = urllib2.urlopen(request)
print response.read()

For more about headers, you can check this article.
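The snippets in these notes target Python 2; in Python 3, urllib2's request machinery moved to urllib.request. A minimal sketch of the same headers trick, verified offline by reading the header back from the Request object (the URL and User-Agent string are just the ones from above):

```python
# Python 3 version: urllib2 was split into urllib.request and urllib.error
import urllib.request

url = 'http://drapor.me'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib.request.Request(url, headers=headers)

# No network needed to check this: the Request object stores header
# names capitalized, e.g. 'User-agent'
print(request.get_header('User-agent'))
```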


And if you need to POST some data, such as a username and password, to a site, try this:

import urllib
import urllib2

values = {"username":"drapor","password":"*****"}

data = urllib.urlencode(values) # urlencode lives in urllib, not urllib2
url = "http://www.heibanke.com/lesson/crawler_ex01/" #This is a crawler game I found before which is quite interesting.
request = urllib2.Request(url,data)
response = urllib2.urlopen(request)
print response.read()
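In Python 3 the encoding helper moved to urllib.parse.urlencode, and urlopen() expects the POST body as bytes. A minimal sketch of just the encoding step, which needs no network:

```python
# Python 3 version: urlencode moved to urllib.parse
from urllib.parse import urlencode

values = {"username": "drapor", "password": "*****"}
data = urlencode(values)  # special characters are percent-encoded
print(data)

# In Python 3, urlopen() wants the body as bytes, e.g.:
# urllib.request.urlopen(url, data.encode('ascii'))
```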

If something goes wrong with the request, you can catch the error like this:

import urllib2

request = urllib2.Request('http://www.xxxxx.com')
try:
    urllib2.urlopen(request)
except urllib2.URLError, e:
    print e.reason
This will print the reason for the error.
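For reference, in Python 3 URLError lives in urllib.error and still carries a .reason attribute. A tiny sketch, constructing the error by hand so it runs without a network connection (a real failed urlopen() raises the same class):

```python
# Python 3 version: URLError moved to urllib.error
import urllib.error

# Built by hand here purely for illustration; urlopen() fills in
# .reason with the real cause (DNS failure, refused connection, ...)
err = urllib.error.URLError('connection refused')
print(err.reason)
```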


And if you need to use cookies, try this:

import urllib2
import cookielib

cookie = cookielib.CookieJar() # declare a CookieJar object to store cookies
handler = urllib2.HTTPCookieProcessor(cookie) # use urllib2's HTTPCookieProcessor to create a cookie handler
opener = urllib2.build_opener(handler) # build an opener from the handler
response = opener.open('http://drapor.me') # opener.open() works the same as urllib2.urlopen()

for item in cookie:
    print 'Name = '+item.name
    print 'Value = '+item.value
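In Python 3, cookielib was renamed http.cookiejar, and iterating a CookieJar works the same way. A sketch that fills the jar by hand so it runs without a network connection; the cookie name and value are made up for illustration, while a real opener.open() call fills the jar from the server's Set-Cookie headers:

```python
# Python 3 version: cookielib was renamed http.cookiejar
from http.cookiejar import Cookie, CookieJar

jar = CookieJar()

# A hand-built cookie (hypothetical name/value, just for the demo);
# http.cookiejar.Cookie takes every field explicitly.
cookie = Cookie(version=0, name='session', value='abc123',
                port=None, port_specified=False,
                domain='drapor.me', domain_specified=True,
                domain_initial_dot=False,
                path='/', path_specified=True,
                secure=False, expires=None, discard=True,
                comment=None, comment_url=None, rest={})
jar.set_cookie(cookie)

for item in jar:
    print('Name = ' + item.name)
    print('Value = ' + item.value)
```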



This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.