python - How do you integrate Flask and Scrapy?

Tags: python flask scrapy

I am using scrapy to fetch data, and I want to display the results on a web page using the flask web framework. But I don't know how to call my spiders from the flask application. I tried using CrawlerProcess to call my spider, but I got an error like this:

ValueError
ValueError: signal only works in main thread

Traceback (most recent call last)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1836, in __call__
return self.wsgi_app(environ, start_response)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1820, in wsgi_app
response = self.make_response(self.handle_exception(e))
File "/Library/Python/2.7/site-packages/flask/app.py", line 1403, in handle_exception
reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1817, in wsgi_app
response = self.full_dispatch_request()
File "/Library/Python/2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
rv = self.dispatch_request()
File "/Library/Python/2.7/site-packages/flask/app.py", line 1461, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/Users/Rabbit/PycharmProjects/Flask_template/FlaskTemplate.py", line 102, in index
process = CrawlerProcess()
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 210, in __init__
install_shutdown_handlers(self._signal_shutdown)
File "/Library/Python/2.7/site-packages/scrapy/utils/ossignal.py", line 21, in install_shutdown_handlers
reactor._handleSignals()
File "/Library/Python/2.7/site-packages/twisted/internet/posixbase.py", line 295, in _handleSignals
_SignalReactorMixin._handleSignals(self)
File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1154, in _handleSignals
signal.signal(signal.SIGINT, self.sigInt)
ValueError: signal only works in main thread

My scrapy code looks like this:

class EPGD(Item):
    genID = Field()
    genID_url = Field()
    taxID = Field()
    taxID_url = Field()
    familyID = Field()
    familyID_url = Field()
    chromosome = Field()
    symbol = Field()
    description = Field()

class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery="+term+"&submit=Feeling+Lucky"]

    db = DB_Con()
    collection = db.getcollection(name, term)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url+map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url+map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            self.collection.update({"genID":item['genID']}, dict(item), upsert=True)
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')

        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i+1 < len(url_list[0]):
                    print url_list[0][i+1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i+1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"

My flask code looks like this:

@app.route('/', methods=['GET', 'POST'])
def index():
    process = CrawlerProcess()
    process.crawl(EPGD_spider)
    return redirect(url_for('details'))


@app.route('/details', methods = ['GET'])
def epgd():
    if request.method == 'GET':
        results = db['EPGD_test'].find()
        json_results= []
        for result in results:
            json_results.append(result)
        return toJson(json_results)

How can I call my scrapy spiders when using the flask web framework?

Best Answer

Adding an HTTP server in front of your spiders is not that easy. There are a couple of options.

1. Python subprocess

If you are really limited to Flask and cannot use anything else, the only way to integrate Scrapy with Flask is by launching an external process for every spider crawl, as the other answer recommends (note that your subprocess needs to be spawned in the proper Scrapy project directory).
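If your Flask app does not itself live inside the Scrapy project, one way to satisfy this is to pass cwd to the subprocess call. A minimal sketch, where the path is a placeholder you would replace with your own project location:

import subprocess

# run "scrapy crawl" from the directory that contains scrapy.cfg,
# so the spider and project settings can be found
subprocess.check_output(
    ['scrapy', 'crawl', 'dmoz', '-o', 'output.json'],
    cwd='/path/to/your/scrapy/project',  # placeholder: directory containing scrapy.cfg
)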

The directory structure for all the examples should look like this (I am using the dirbot test project):

> tree -L 1                                                                                                                                                              

├── dirbot
├── README.rst
├── scrapy.cfg
├── server.py
└── setup.py

Here is a code sample that launches Scrapy in a new process:

# server.py
import subprocess

from flask import Flask
app = Flask(__name__)

@app.route('/')
def hello_world():
    """
    Run spider in another process and store items in file. Simply issue command:

    > scrapy crawl dmoz -o "output.json"

    wait for this command to finish, and return the contents of output.json to the client.
    """
    spider_name = "dmoz"
    subprocess.check_output(['scrapy', 'crawl', spider_name, "-o", "output.json"])
    with open("output.json") as items_file:
        return items_file.read()

if __name__ == '__main__':
    app.run(debug=True)

Save the above as server.py and visit localhost:5000; you should see the scraped items.
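For a quick sanity check you can start the server and request the route from another terminal (assuming Flask's default port 5000):

> python server.py
> curl http://localhost:5000/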

2. Twisted-Klein + Scrapy

The other, better way is to use an existing project that integrates Twisted with Werkzeug and exposes an API similar to Flask, e.g. Twisted-Klein. Twisted-Klein allows you to run your spiders asynchronously in the same process as your web server. It is better in that it will not block on every request, and it allows you to simply return a Scrapy/Twisted Deferred from the HTTP route handler.

The following snippet integrates Twisted-Klein with Scrapy. Note that you need to create your own subclass of CrawlerRunner so that the crawler collects items and returns them to the caller. This option is a bit more advanced: you run the Scrapy spiders in the same process as the Python server, and items are not stored in a file but kept in memory (so there is no disk writing/reading as in the previous example). Most importantly, it is asynchronous and everything runs in one Twisted reactor.

# server.py
import json

from klein import route, run
from scrapy import signals
from scrapy.crawler import CrawlerRunner

from dirbot.spiders.dmoz import DmozSpider


class MyCrawlerRunner(CrawlerRunner):
    """
    Crawler object that collects items and returns output after finishing crawl.
    """
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        # create crawler (Same as in base CrawlerProcess)
        crawler = self.create_crawler(crawler_or_spidercls)

        # handle each item scraped
        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        # create Twisted Deferred that launches the crawl
        dfd = self._crawl(crawler, *args, **kwargs)

        # add callback - when crawl is done, call return_items
        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    """
    :param output: items scraped by CrawlerRunner
    :return: json with list of items
    """
    # this just turns items into dictionaries
    # you may want to use Scrapy JSON serializer here
    return json.dumps([dict(item) for item in output])


@route("/")
def schedule(request):
    runner = MyCrawlerRunner()
    spider = DmozSpider()
    deferred = runner.crawl(spider)
    deferred.addCallback(return_spider_output)
    return deferred


run("localhost", 8080)

Save the above to a file server.py and place it in your Scrapy project directory. Now open localhost:8080; it will launch the dmoz spider and return the scraped items to the browser as JSON.
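The endpoint can also be queried from the command line like any other HTTP service, for example:

> curl http://localhost:8080/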

3. ScrapyRT

There are some problems that arise when you try to put an HTTP app in front of your spiders. For instance, you sometimes need to handle spider logs (you may need them in some cases), you need to handle spider exceptions somehow, and so on. There are projects that let you add an HTTP API to your spiders in an easier way, e.g. ScrapyRT. This is an app that adds an HTTP server to your Scrapy spiders and handles all those problems for you (e.g. handling logging, handling spider errors, etc.).

So after installing ScrapyRT you only need to run:

> scrapyrt 

in your Scrapy project directory, and it will start an HTTP server listening for requests. You then visit http://localhost:9080/crawl.json?spider_name=dmoz&url=http://alfa.com and it should launch your spider crawling the given url.
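If you still want to serve the results through Flask, one option is to keep ScrapyRT running as a separate process and have a Flask view forward requests to it. A minimal sketch, assuming ScrapyRT is listening on its default port 9080 and that its JSON response contains an "items" list (check the ScrapyRT documentation for the exact response format); the route name and defaults here are made up for illustration:

# proxy_server.py - hypothetical helper, not part of ScrapyRT itself
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

SCRAPYRT_URL = "http://localhost:9080/crawl.json"  # assumes ScrapyRT's default port

@app.route('/crawl')
def crawl():
    # forward the spider name and start url to ScrapyRT and relay the scraped items
    params = {
        "spider_name": request.args.get("spider_name", "dmoz"),
        "url": request.args.get("url", "http://alfa.com"),
    }
    resp = requests.get(SCRAPYRT_URL, params=params)
    resp.raise_for_status()
    return jsonify(items=resp.json().get("items", []))

if __name__ == '__main__':
    app.run(debug=True)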

Disclaimer: I am one of the authors of ScrapyRT.

This question, "python - How do you integrate Flask and Scrapy?", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/36384286/
