python - How do I get stats from a scrapy run?

Tags: python mysql web-scraping scrapy

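I'm running a scrapy spider from an external file, following the example in the scrapy docs. I want to grab the stats provided by the Core API and store them in a mysql table once the crawl has finished.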

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from test.spiders.myspider import *
from scrapy.utils.project import get_project_settings
from test.pipelines import MySQLStorePipeline
import datetime

spider = MySpider()


def run_spider(spider):        
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()
    # Placeholder values for now; these should come from the crawl stats.
    mysql_insert = MySQLStorePipeline()
    mysql_insert.cursor.execute(
        'insert into crawler_stats(sites_id, start_time, end_time, page_scraped, finish_reason) '
        'values(%s, %s, %s, %s, %s)',
        (1, datetime.datetime.now(), datetime.datetime.now(), 100, 'test'))

    mysql_insert.conn.commit()

run_spider(spider)

How can I get stats values such as start_time, end_time, pages_scraped and finish_reason in the code above?

Accepted answer

Get them from the crawler.stats collector:

stats = crawler.stats.get_stats()

Example code (collecting the stats in the spider_closed signal handler):

def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # stats is a dictionary

    # write stats to the database here

    reactor.stop()


def run_spider(spider):        
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(callback, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()


run_spider(spider)
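
To fill in the "write stats to the database here" step, the stats dictionary has to be mapped onto the columns of the crawler_stats table from the question. Here is a minimal sketch of that mapping. The helper name stats_to_row is hypothetical, and using response_received_count for the page_scraped column is an assumption (item_scraped_count may be the better fit if you count items rather than responses). The keys start_time, finish_time and finish_reason are written by scrapy's CoreStats extension; the last two are set in its own spider_closed handler, so they may not be in the dictionary yet when your callback runs. That is why the sketch uses .get() and falls back to the reason argument the signal passes in.

```python
def stats_to_row(stats, reason, site_id=1):
    """Build the parameter tuple for the crawler_stats insert.

    stats   -- the dict returned by crawler.stats.get_stats()
    reason  -- the `reason` argument passed to the spider_closed handler
    site_id -- value for the sites_id column (hypothetical default)
    """
    return (
        site_id,
        stats.get("start_time"),                  # set when the spider opens
        stats.get("finish_time"),                 # None if not recorded yet
        stats.get("response_received_count", 0),  # assumed meaning of page_scraped
        stats.get("finish_reason", reason),       # fall back to the signal argument
    )
```

Inside callback, something like row = stats_to_row(spider.crawler.stats.get_stats(), reason) could then be passed as the parameters of the cursor.execute call from the question, in place of the hard-coded test values.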

Original question on Stack Overflow: https://stackoverflow.com/questions/27739380/
