python3_02 spider: fetching web page content

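  This one took me a while to figure out, mainly because of encoding issues.
  Problem 1: The output was all raw bytes and the Chinese text came out garbled. After checking the docs, it turns out that urllib.request.urlopen returns a response object, and calling read() on it returns a bytes object (binary data), which has to be decoded before the Chinese characters display correctly.
  Problem 2: When decoding, I called decode() and got this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 101: invalid start byte. The cause is that decode() with no arguments interprets the bytes as UTF-8, while Chinese pages are commonly encoded in GB2312. So I tried passing the encoding explicitly with str(data, 'GB2312'), and it worked!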
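  Both problems can be reproduced without touching the network. The following is a minimal sketch, assuming a short sample string encoded as GB2312 stands in for the bytes returned by read(); the sample text and the variable names are illustrative only.

```python
# Sample bytes standing in for f.read(); the text is illustrative only.
raw = "百度一下".encode("gb2312")

# Problem 1: read() hands back bytes, not str.
print(type(raw))

# Problem 2: decode() defaults to UTF-8, which rejects these bytes.
try:
    raw.decode()
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print("UTF-8 decode failed:", exc.reason)

# Naming the real encoding works; str(raw, "gb2312") is equivalent.
text = raw.decode("gb2312")
print(text)
```

  The first byte of the sample is 0xb0, the same byte the error message above complains about.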
  Code walkthrough:

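# ------- spider.py -------
import urllib.request
from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    """Print every parser event so the structure of the page is visible."""
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)
    def handle_endtag(self, tag):
        print("End tag  :", tag)
    def handle_data(self, data):
        print("Data     :", data)
    def handle_comment(self, data):
        print("Comment  :", data)
    def handle_entityref(self, name):
        # Unknown entity names are printed as-is instead of raising KeyError.
        codepoint = name2codepoint.get(name)
        c = chr(codepoint) if codepoint is not None else "&" + name + ";"
        print("Named ent:", c)
    def handle_charref(self, name):
        # name is 'NNN' for decimal or 'xNNN' for hexadecimal references.
        if name.startswith(('x', 'X')):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)
    def handle_decl(self, data):
        print("Decl     :", data)

# The with-block closes the response even if parsing fails.
with urllib.request.urlopen("http://www.baidu.com") as f:
    if f.status == 200:
        # The strict argument was removed in Python 3.5. convert_charrefs=False
        # keeps handle_entityref/handle_charref firing; the default (True)
        # converts the references and passes them to handle_data instead.
        parser = MyHTMLParser(convert_charrefs=False)
        # read() returns bytes; decode them with the page's encoding
        # (GB2312 here; change it if the site serves something else).
        parser.feed(str(f.read(), 'GB2312'))
        parser.close()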
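
Hard-coding GB2312 only works while the page is actually served in that encoding. As a hedged sketch (not part of the original script), the helper below tries a declared charset first and then a list of fallbacks. The function name decode_body and the fallback list are assumptions made for illustration; get_content_charset() is the standard method on the response headers that reports the charset from the Content-Type header, or None when there is none.

```python
def decode_body(raw, declared=None, fallbacks=("utf-8", "gb18030")):
    """Decode bytes using the declared charset first, then each fallback.

    decode_body and its fallback list are illustrative, not a library API.
    gb18030 is a superset of GB2312, so it also covers GB2312 pages.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


# Possible use with the spider above (needs network access):
# with urllib.request.urlopen("http://www.baidu.com") as f:
#     text = decode_body(f.read(), f.headers.get_content_charset())
```

Trying UTF-8 before gb18030 is a deliberate choice in this sketch: UTF-8 decoding is strict enough that GB2312 text containing Chinese characters will usually be rejected, whereas the reverse order could silently turn a UTF-8 page into garbage.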

Reference:
handle_charref(self, name)
# Overridable — handle character reference
This method is called to process decimal and hexadecimal numeric character references of the form &#NNN; and &#xNNN;. For example, the decimal equivalent for &gt; is &#62;, whereas the hexadecimal is &#x3E;; in this case the method will receive '62' or 'x3E'.

handle_comment(self, data)
# Overridable — handle comment

handle_data(self, data)
# Overridable — handle data

handle_decl(self, decl)
# Overridable — handle declaration

handle_endtag(self, tag)
# Overridable — handle end tag

handle_entityref(self, name)
# Overridable — handle entity reference
This method is called to process a named character reference of the form &name; (e.g. &gt;), where name is a general entity reference (e.g. 'gt').

handle_starttag(self, tag, attrs)
# Overridable — handle start tag
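
To see which arguments handle_charref and handle_entityref receive, here is a small sketch that records the calls instead of printing them. The class name RefCollector and the sample HTML are made up for illustration. It assumes Python 3.5 or later, where convert_charrefs must be set to False for these two handlers to be called at all.

```python
from html.parser import HTMLParser


class RefCollector(HTMLParser):
    """Record every character/entity reference the parser reports."""

    def __init__(self):
        # With the default convert_charrefs=True the references would be
        # converted and delivered through handle_data instead.
        super().__init__(convert_charrefs=False)
        self.seen = []

    def handle_charref(self, name):
        self.seen.append(("charref", name))

    def handle_entityref(self, name):
        self.seen.append(("entityref", name))


collector = RefCollector()
collector.feed("<p>1 &#62; 0, 1 &#x3E; 0, 1 &gt; 0</p>")
collector.close()
print(collector.seen)
```

The three references in the sample all stand for the same character, but each reaches the handler in the form it was written: decimal, hexadecimal with a leading x, and by name.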
