Create a new project
Create a project named douban:
scrapy startproject douban
Generate the main spider file
cd douban
scrapy genspider douban_spider movie.douban.com
Decide what data to scrape, then write the items file (items.py)
Write the spider file
Test the crawl
scrapy crawl douban_spider
On Windows, the crawl fails with ModuleNotFoundError: No module named 'win32api'
Fix: install pywin32
For Python 3.6.4:
https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/pywin32-220.win-amd64-py3.6.exe/download
For other versions:
https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/
If the error persists after installing, also run:
pip install pypiwin32
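To confirm that the install took effect before re-running the crawl, a quick check along these lines may help. The helper name module_available is hypothetical; it uses only the standard library.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # win32api is the module named in the error message above.
    status = "found" if module_available("win32api") else "missing"
    print("win32api:", status)
```

Run it with the same interpreter that runs Scrapy; a "missing" result usually means pip installed the package into a different Python environment.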
Running scrapy crawl douban_spider again then fails with HTTP 403.
Set USER_AGENT in settings.py:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
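To check outside Scrapy whether the 403 is really caused by the User-Agent header, one option is to build the same request with the standard library. This is a sketch only: the helper name build_request is hypothetical, and the commented-out call needs network access.

```python
import urllib.request

# Same browser-like value as in settings.py above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Build a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = build_request("https://movie.douban.com/")
    # Uncomment to send the request (requires network access):
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.status)
    print(req.get_header("User-agent"))
```

If the request succeeds with this header but fails without it, the site is most likely filtering on the User-Agent.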
On Linux you may see:
ModuleNotFoundError: No module named '_sqlite3'
yum -y install sqlite*
Then recompile Python:
./configure --prefix='/usr/local/python3' --with-ssl
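After rebuilding, a short check like the following can confirm that the new interpreter was compiled with SQLite support. The function name sqlite_version is hypothetical.

```python
import sqlite3


def sqlite_version():
    """Open an in-memory database and return the SQLite library version."""
    conn = sqlite3.connect(":memory:")
    try:
        (version,) = conn.execute("SELECT sqlite_version()").fetchone()
        return version
    finally:
        conn.close()


if __name__ == "__main__":
    print("SQLite version:", sqlite_version())
```

If the import on the first line still raises ModuleNotFoundError, the SQLite development headers were probably not present when Python was configured and built.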