I need to do some short, on-demand scraping and return the resulting data to my Django REST controller.
Trying to use Scrapy:
import scrapy
from scrapy.selector import Selector

from .models import Product


class MysiteSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://www.something.com/browse?q=dfd',
    ]
    allowed_domains = ['something.com']

    def parse(self, response):
        items_list = Selector(response).xpath('//li[@itemprop="itemListElement"]')
        for value in items_list:
            item = Product()
            item['picture_url'] = value.xpath('img/@src').extract_first()
            item['name'] = value.xpath('h2/text()').extract_first()
            item['price'] = value.xpath('p[contains(@class, "ad-price")]/text()').extract_first()
            yield item
The item model:
import scrapy


class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    picture_url = scrapy.Field()
    published_date = scrapy.Field(serializer=str)
According to Scrapy's architecture, items are handed off to the Item Pipeline
(https://doc.scrapy.org/en/1.2/topics/item-pipeline.html), which is meant for storing the data in a database, saving it to a file, and so on.
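For reference, a pipeline is just a plain class with a `process_item` hook that Scrapy calls for every yielded item. A minimal sketch of the file-storage case (the `products.jl` filename is my own example, not from the question):

```python
import json


class JsonWriterPipeline:
    """Minimal Scrapy item pipeline: appends each item to a JSON-lines file."""

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open('products.jl', 'w')

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields; must return the item
        # (or raise DropItem) so later pipelines can see it.
        self.file.write(json.dumps(dict(item)) + '\n')
        return item
```

It would be enabled through the `ITEM_PIPELINES` setting, but as the question notes, this path stores data as a side effect rather than returning it to a caller.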
But I'm stuck on this: how do I return the list of scraped items through a Django REST APIView?
Expected usage example:
from rest_framework.views import APIView
from rest_framework.response import Response

from .service.mysite_spider import MysiteSpider


class AggregatorView(APIView):
    mysite_spider = MysiteSpider()

    def get(self, request, *args, **kwargs):
        self.mysite_spider.parse()
        return Response('good')
Best answer
I haven't actually tested the integration with Django REST framework, but the following snippet lets you run a spider from a Python script and collect the yielded items so you can process them afterwards.
from scrapy import signals
from scrapy.crawler import Crawler, CrawlerProcess

from ... import MysiteSpider

items = []


def collect_items(item, response, spider):
    items.append(item)


crawler = Crawler(MysiteSpider)
crawler.signals.connect(collect_items, signals.item_scraped)

process = CrawlerProcess()
process.crawl(crawler)
process.start()  # the script will block here until the crawling is finished

# at this point, the "items" variable holds the scraped items
For the record, this works, but there may be better ways :-)
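One caveat when calling this from a long-lived Django process: Twisted's reactor cannot be restarted once `process.start()` returns, so a second request in the same process would fail. A common workaround is to run each crawl in a throwaway subprocess and ship the items back through a queue. A sketch under my own assumptions (the `run_in_subprocess` helper and the spider import path are hypothetical, not part of the answer):

```python
from multiprocessing import Process, Queue


def run_in_subprocess(worker):
    # Run `worker(queue)` in a child process and return whatever it puts
    # on the queue. Each call gets a fresh process, so the "reactor not
    # restartable" limitation never applies.
    queue = Queue()
    child = Process(target=worker, args=(queue,))
    child.start()
    result = queue.get()  # blocks until the worker reports back
    child.join()
    return result


def crawl_worker(queue):
    # Scrapy imports are deferred so this module loads without Scrapy.
    from scrapy import signals
    from scrapy.crawler import Crawler, CrawlerProcess
    from myapp.service.mysite_spider import MysiteSpider  # hypothetical path

    items = []

    def collect_items(item, response, spider):
        items.append(dict(item))  # plain dicts are picklable for the queue

    crawler = Crawler(MysiteSpider)
    crawler.signals.connect(collect_items, signals.item_scraped)

    process = CrawlerProcess()
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes
    queue.put(items)
```

In the view, the `get` handler could then be `return Response(run_in_subprocess(crawl_worker))`, at the cost of process-startup overhead on every request.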
Further reading:
On returning items from python/django/scrapy to the controller, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41390396/