Things to keep in mind when passing custom arguments to a Python Scrapy spider

Category: Python


As for how to pass custom arguments into Scrapy, I searched Baidu for a long time and basically everything I found looked like this:

When launching a spider from the command line with crawl, add the -a option, for example:
scrapy crawl myspider -a category=electronics
Then write the spider like this:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
In other words, you just accept the incoming argument in the spider's constructor.
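Incidentally, you often do not even need to override the constructor for this: Scrapy's base Spider.__init__ copies every -a keyword argument onto the spider instance as an attribute. A minimal sketch (the 'default' fallback here is my own illustration, not part of the original example):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # the value given via -a category=electronics arrives as self.category,
        # because Spider.__init__ stores unknown keyword arguments as attributes
        category = getattr(self, 'category', 'default')  # 'default' is an illustrative fallback
        yield scrapy.Request('http://www.example.com/categories/%s' % category)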

==================================================================================

However, this approach never felt clean enough. Later, while browsing GitHub for a Baidu Tieba crawler, I finally found a much nicer way to pass custom arguments. First, look at how that author invokes his crawler:

scrapy run 仙剑五外传 -gs -p 5 12 -f thread_filter

This crawls pages 5 to 12 of the featured ("good") threads in the 仙剑五外传 tieba in "only see the original poster" mode; only the threads (and their contents) that pass the thread_filter function in filter.py are stored in the database.

The repository: https://github.com/Aqua-Dream/Tieba_Spider

Let me walk through how his argument passing works.
First, run.py in the commands folder (at the same level as the spiders folder), which acts as the entry point:
import scrapy.commands.crawl as crawl
from scrapy.exceptions import UsageError
from scrapy.commands import ScrapyCommand
import config
import filter

class Command(crawl.Command):
    def syntax(self):
        return " "

    def short_desc(self):
        return "Crawl tieba"

    def long_desc(self):
        return "Crawl baidu tieba data to a MySQL database."

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")
        parser.add_option("-p", "--pages", nargs=2, type="int", dest="pages", default=[],
                          help="set the range of pages you want to crawl")
        parser.add_option("-g", "--good", action="store_true", dest="good_only", default=False,
                          help="only crawl good threads and their posts and comments")
        parser.add_option("-f", "--filter", type="str", dest="filter", default="",
                          help='set function name in "filter.py" to filter threads')
        parser.add_option("-s", "--see_lz", action="store_true", dest="see_lz", default=False,
                          help='enable "only see lz" mode')

    def set_pages(self, pages):
        if len(pages) == 0:
            begin_page = 1
            end_page = 999999
        else:
            begin_page = pages[0]
            end_page = pages[1]
        if begin_page <= 0:
            raise UsageError("The number of begin page must not be less than 1!")
        if begin_page > end_page:
            raise UsageError("The number of end page must not be less than that of begin page!")
        self.settings.set('BEGIN_PAGE', begin_page, priority='cmdline')
        self.settings.set('END_PAGE', end_page, priority='cmdline')

    def run(self, args, opts):
        self.set_pages(opts.pages)
        self.settings.set('GOOD_ONLY', opts.good_only)
        self.settings.set('SEE_LZ', opts.see_lz)
        if opts.filter:
            try:
                opts.filter = eval('filter.' + opts.filter)
            except:
                raise UsageError("Invalid filter function name!")
        self.settings.set("FILTER", opts.filter)
        cfg = config.config()
        if len(args) >= 3:
            raise UsageError("Too many arguments!")

        for i in range(len(args)):
            if isinstance(args[i], bytes):
                args[i] = args[i].decode("utf8")

        self.settings.set('MYSQL_HOST', cfg.config['MYSQL_HOST'])
        self.settings.set('MYSQL_USER', cfg.config['MYSQL_USER'])
        self.settings.set('MYSQL_PASSWD', cfg.config['MYSQL_PASSWD'])

        tbname = cfg.config['DEFAULT_TIEBA']
        if len(args) >= 1:
            tbname = args[0]

        dbname = None
        if tbname in cfg.config['MYSQL_DBNAME'].keys():
            dbname = cfg.config['MYSQL_DBNAME'][tbname]
        if len(args) >= 2:
            dbname = args[1]
            cfg.config['MYSQL_DBNAME'][tbname] = dbname
        if not dbname:
            raise UsageError("Please input database name!")

        self.settings.set('TIEBA_NAME', tbname, priority='cmdline')
        self.settings.set('MYSQL_DBNAME', dbname, priority='cmdline')

        config.init_database(cfg.config['MYSQL_HOST'], cfg.config['MYSQL_USER'], cfg.config['MYSQL_PASSWD'], dbname)

        log = config.log(tbname, dbname, self.settings['BEGIN_PAGE'], opts.good_only, opts.see_lz)
        self.settings.set('SIMPLE_LOG', log)
        self.crawler_process.crawl('tieba', **opts.spargs)
        self.crawler_process.start()

        cfg.save()
At first glance this does look a bit complicated, but it is really just a modified version of crawl.py from Scrapy's own commands directory.
Now let me go over the key points:
a. Setting the parser options
parser.add_option("-p", "--pages", nargs=2, type="int", dest="pages", default=[],
                  help="set the range of pages you want to crawl")
parser.add_option("-g", "--good", action="store_true", dest="good_only", default=False,
                  help="only crawl good threads and their posts and comments")
parser.add_option("-f", "--filter", type="str", dest="filter", default="",
                  help='set function name in "filter.py" to filter threads')
parser.add_option("-s", "--see_lz", action="store_true", dest="see_lz", default=False,
                  help='enable "only see lz" mode')
This part defines the option arguments, i.e. it determines which arguments a call like

scrapy run 仙剑五外传 -gs -p 5 12 -f thread_filter

is allowed to pass in.
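To see what those add_option calls actually produce without starting Scrapy, here is a standalone sketch using plain optparse, which is the module these older Scrapy command classes build their parser on; the parser below only mimics the four custom options:

# Standalone sketch: parse the same command line with plain optparse.
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-p", "--pages", nargs=2, type="int", dest="pages", default=[])
parser.add_option("-g", "--good", action="store_true", dest="good_only", default=False)
parser.add_option("-f", "--filter", type="str", dest="filter", default="")
parser.add_option("-s", "--see_lz", action="store_true", dest="see_lz", default=False)

opts, args = parser.parse_args(["仙剑五外传", "-gs", "-p", "5", "12", "-f", "thread_filter"])
print(args)            # ['仙剑五外传']  -> positional arguments (the tieba name)
print(opts.pages)      # (5, 12)
print(opts.good_only)  # True
print(opts.see_lz)     # True
print(opts.filter)     # 'thread_filter'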
b. Storing the parsed arguments in the settings

self.settings.set('BEGIN_PAGE', begin_page, priority='cmdline')
self.settings.set('END_PAGE', end_page, priority='cmdline')
self.settings.set('GOOD_ONLY', opts.good_only)
self.settings.set('SEE_LZ', opts.see_lz)
priority is the setting's priority level: 'cmdline' is the highest of Scrapy's built-in priorities, so values set this way override whatever the project's settings.py provides.
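A tiny sketch of how those priorities behave, using scrapy.settings.Settings directly (the BEGIN_PAGE values are made up for illustration):

from scrapy.settings import Settings

s = Settings({'BEGIN_PAGE': 1})              # stored with 'project' priority (20)
s.set('BEGIN_PAGE', 5, priority='cmdline')   # 'cmdline' is 40, the highest built-in priority
s.set('BEGIN_PAGE', 3, priority='spider')    # 'spider' is 30 < 40, so this call is ignored
print(s.getint('BEGIN_PAGE'))                # -> 5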
c. Starting the tieba spider

self.crawler_process.crawl('tieba', **opts.spargs)
self.crawler_process.start()
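Note that opts.spargs starts out as a list of NAME=VALUE strings; the inherited crawl.Command converts it into a dict before run() is called, and crawl() forwards keyword arguments to the spider. Outside a custom command, the same idea looks roughly like this sketch (it assumes you run it inside a project that contains a spider named 'tieba'):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# get_project_settings() only works from inside a Scrapy project;
# 'tieba' and category are placeholders for your own spider and argument.
process = CrawlerProcess(get_project_settings())
process.crawl('tieba', category='electronics')  # same effect as -a category=electronics
process.start()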

But we are not done yet! The startup and configuration values still have to be wired up in settings.py and pipelines.py.
pipelines.py (the import lines below are added for completeness; the project is Python 2 code):

# imports this excerpt relies on (added for completeness; Python 2: note `unicode`)
from twisted.enterprise import adbapi
from urllib import quote
import MySQLdb
import MySQLdb.cursors

class TiebaPipeline(object):
    @classmethod  # is this called when the pipeline is initialized?
    def from_settings(cls, settings):
        # cls is this class itself, so cls(settings) is equivalent to TiebaPipeline(settings)
        return cls(settings)

    def __init__(self, settings):
        # read the values stored in the settings
        dbname = settings['MYSQL_DBNAME']
        tbname = settings['TIEBA_NAME']
        if not dbname.strip():
            raise ValueError("No database name!")
        if not tbname.strip():
            raise ValueError("No tieba name!")
        if isinstance(tbname, unicode):
            settings['TIEBA_NAME'] = tbname.encode('utf8')

        self.settings = settings

        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8mb4',
            cursorclass=MySQLdb.cursors.DictCursor,
            init_command='set foreign_key_checks=0'  # asynchronous writes conflict easily otherwise
        )

    def open_spider(self, spider):
        # set several attributes on the spider
        spider.cur_page = begin_page = self.settings['BEGIN_PAGE']
        spider.end_page = self.settings['END_PAGE']
        spider.filter = self.settings['FILTER']
        spider.see_lz = self.settings['SEE_LZ']
        start_url = "http://tieba.baidu.com/f?kw=%s&pn=%d" \
            % (quote(self.settings['TIEBA_NAME']), 50 * (begin_page - 1))
        if self.settings['GOOD_ONLY']:
            start_url += '&tab=good'

        spider.start_urls = [start_url]

    def close_spider(self, spider):
        # calls the log method of the log object created in config.py
        self.settings['SIMPLE_LOG'].log(spider.cur_page - 1)
A few key points:
1. How the pipeline gets the settings

@classmethod  # is this called when the pipeline is initialized?
def from_settings(cls, settings):
    # cls is this class itself, so cls(settings) is equivalent to TiebaPipeline(settings)
    return cls(settings)

This is how the pipeline pulls in the crawler's settings, including the parameters we stored earlier.
2. Passing the parameters into the spider

def open_spider(self, spider):
    # set several attributes on the spider
    spider.cur_page = begin_page = self.settings['BEGIN_PAGE']
    spider.end_page = self.settings['END_PAGE']
    spider.filter = self.settings['FILTER']
    spider.see_lz = self.settings['SEE_LZ']
    start_url = "http://tieba.baidu.com/f?kw=%s&pn=%d" \
        % (quote(self.settings['TIEBA_NAME']), 50 * (begin_page - 1))
    if self.settings['GOOD_ONLY']:
        start_url += '&tab=good'

    spider.start_urls = [start_url]

def close_spider(self, spider):
    # calls the log method of the log object created in config.py
    self.settings['SIMPLE_LOG'].log(spider.cur_page - 1)

In the pipeline class, open_spider and close_spider are callbacks invoked when the spider starts and when it finishes, respectively.
This is where the parameters are actually pushed into the spider itself.

settings.py
Finally, just wire it up in the configuration file and you're done!

ITEM_PIPELINES = {
    'ds1.pipelines.Ds1Pipeline': 4,
}

COMMANDS_MODULE = 'ds1.commands'
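For reference, the project layout this implies looks roughly like the sketch below (the ds1 names follow my own test project; the commands package needs an __init__.py so that COMMANDS_MODULE can import it, and the command name "run" comes from the file name run.py):

ds1/
├── scrapy.cfg
└── ds1/
    ├── __init__.py
    ├── settings.py       # ITEM_PIPELINES, COMMANDS_MODULE = 'ds1.commands'
    ├── pipelines.py      # the pipeline class registered above
    ├── commands/
    │   ├── __init__.py
    │   └── run.py        # the Command class shown earlier -> "scrapy run ..."
    └── spiders/
        └── ...           # the 'tieba' spider lives here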


Now you can launch the crawler with arguments like this:

scrapy run 仙剑五外传 -gs -p 5 12


Finally, a few things I noticed during my own testing:
1. When reading the settings (in pipelines.py), there seem to be two ways:

@classmethod
def from_settings(cls, settings):
    return cls(settings)

@classmethod
def from_crawler(cls, crawler):
    return cls(crawler.settings)
I tried both and they turn out to be equivalent here, since from_crawler simply hands the pipeline crawler.settings (when both are defined, Scrapy prefers from_crawler).

2. In the pipeline classes, open_spider and close_spider are the callbacks for spider startup and shutdown. But a project usually has several pipeline classes; what happens if each of them defines these callbacks?
After some debugging, the result is interesting: at startup, the spider.start_urls that takes effect is the one set last (in the order configured in settings.py), i.e. later pipelines overwrite earlier ones; at shutdown, however, the last close_spider to run belongs to the first pipeline class that was opened! In the Scrapy source, open_spider callbacks are collected in ITEM_PIPELINES order while close_spider callbacks are collected in reverse order, which explains both observations.
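A toy sketch (not the Tieba project's real code; the class and module names are made up) that reproduces this behaviour:

class FirstPipeline(object):
    def open_spider(self, spider):          # called first
        spider.start_urls = ['http://example.com/first']

    def close_spider(self, spider):         # called last
        print('FirstPipeline.close_spider')

class SecondPipeline(object):
    def open_spider(self, spider):          # called second, so this start_urls wins
        spider.start_urls = ['http://example.com/second']

    def close_spider(self, spider):         # called first
        print('SecondPipeline.close_spider')

# settings.py (hypothetical module path):
# ITEM_PIPELINES = {
#     'myproject.pipelines.FirstPipeline': 100,
#     'myproject.pipelines.SecondPipeline': 200,
# }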

3. In what order do the passed-in parameters actually take effect?
The spider class defines attributes, open_spider in pipelines.py can set attributes, and the spider itself can set things in start_requests, so which one runs first?
Tracking start_urls through the debugger, I found the parameter changes like this:

After the crawler starts, the start_urls defined on the spider class is read first; then, if pipelines.py implements open_spider and it assigns spider.start_urls, that assignment overwrites the class attribute; only after that does start_requests consume the value and the crawl proper begins.

In other words, the precedence is: start_requests > open_spider > spider class attribute.
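Here is a small illustration of those three places, with made-up spider and pipeline names (DemoPipeline is assumed to be registered in ITEM_PIPELINES):

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://example.com/from-class-attribute']  # lowest precedence

    def start_requests(self):
        # runs last: whatever self.start_urls holds by now (possibly overwritten
        # by a pipeline's open_spider) is what actually gets requested, and
        # anything set here would override both of the other sources
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True)

class DemoPipeline(object):
    def open_spider(self, spider):
        # runs after the spider object exists but before the body of
        # start_requests() is consumed, so this overwrites the class attribute
        spider.start_urls = ['http://example.com/from-open-spider']

So if you need a value that nothing can override, set it inside start_requests; if you only need a default that pipelines may adjust, a class attribute is enough.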
