I'm still new to Django, and I'm following this tutorial on how to integrate Scrapy and Django.
The problem is that when I try to use my own spider, it doesn't work at all. I have already tried the spider outside of Django and it works fine, so any help would be appreciated.
Here is my spider.py file:
import scrapy
from scrapy_splash import SplashRequest


class NewsSpider(scrapy.Spider):
    name = 'detik'
    allowed_domains = ['news.detik.com']
    start_urls = ['https://news.detik.com/indeks/all/?date=02/28/2018']

    def parse(self, response):
        urls = response.xpath("//div/article/a/@href").extract()
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_detail)
        # follow pagination link
        page_next = response.xpath("//a[@class = 'last']/@href").extract_first()
        if page_next:
            page_next = response.urljoin(page_next)
            yield scrapy.Request(url=page_next, callback=self.parse)

    def parse_detail(self, response):
        x = {}
        x['breadcrumbs'] = response.xpath("//div[@class='breadcrumb']/a/text()").extract(),
        x['tanggal'] = response.xpath("//div[@class='date']/text()").extract_first(),
        x['penulis'] = response.xpath("//div[@class='author']/text()").extract_first(),
        x['judul'] = response.xpath("//h1/text()").extract_first(),
        x['berita'] = response.xpath("normalize-space(//div[@class='detail_text'])").extract_first(),
        x['tag'] = response.xpath("//div[@class='detail_tag']/a/text()").extract(),
        x['url'] = response.request.url,
        return x
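One thing worth noting about parse_detail above: every assignment there ends with a trailing comma, and in Python a trailing comma wraps the right-hand value in a 1-tuple, so each field ends up stored as a tuple instead of a string or list. A minimal demonstration:

```python
# Demonstration: a trailing comma after an assignment creates a 1-tuple,
# exactly as in the parse_detail assignments above.
x = {}
x['judul'] = "Some headline",   # note the trailing comma
print(type(x['judul']))         # prints: <class 'tuple'>
print(x['judul'][0])            # prints: Some headline
```

Dropping the trailing commas makes each field hold the extracted value directly.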
Here is my pipelines file:
class DetikAppPipeline(object):
    def process_item(self, item, spider):
        item = detikNewsItem()
        self.items.append(item['breadcrumbs'])
        self.items.append(item['tanggal'])
        self.items.append(item['penulis'])
        self.items.append(item['judul'])
        self.items.append(item['berita'])
        self.items.append(item['tag'])
        self.items.append(item['url'])
        item.save()
Here is the models file in Django:
class detikNewsItem(models.Model):
    breadcrumbs = models.TextField()
    tanggal = models.TextField()
    penulis = models.TextField()
    judul = models.TextField()
    berita = models.TextField()
    tag = models.TextField()
    url = models.TextField()

    @property
    def to_dict(self):
        data = {
            'url': json.loads(self.url),
            'tanggal': self.tanggal
        }
        return data

    def __str__(self):
        return self.url
Best answer
Here is an example of how to write a Scrapy pipeline in a Django project:
from <YOUR_APP_NAME>.models import detikNewsItem


class DetikAppPipeline(object):
    def process_item(self, item, spider):
        d, created = detikNewsItem.objects.get_or_create(breadcrumbs=item['breadcrumbs'], url=item['url'])
        if created:
            d.tanggal = item['tanggal']
            d.penulis = item['penulis']
            d.judul = item['judul']
            d.berita = item['berita']
            d.tag = item['tag']
            d.save()
        return item
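For a pipeline like this to run at all, it also has to be registered in the Scrapy project's settings.py. A minimal sketch; the dotted module path below is an assumption, adjust it to your own project layout:

```python
# Scrapy project's settings.py -- 'detik_app.pipelines' is a hypothetical
# module path; replace it with the dotted path of your pipelines module.
ITEM_PIPELINES = {
    'detik_app.pipelines.DetikAppPipeline': 300,  # lower number = runs earlier
}
```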
By the way, you need to run Scrapy inside the Django environment. There are several ways to do this.
One way is to use the
django-extensions
module. You need to create two new files:
<DJANGO_PROJECT>/scripts/__init__.py
<DJANGO_PROJECT>/scripts/run_scrapy.py
You can then run the script with python manage.py runscript run_scrapy. The code inside run_scrapy.py:
from scrapy.cmdline import execute
execute(['run_scrapy.py', 'crawl', 'detik'])
Another way is to use a custom Django management command. You need to create a folder in the app with these files:
<folder_of_app>/management/commands/__init__.py
<folder_of_app>/management/commands/scrapy.py
The
scrapy.py
file should contain this code:
from scrapy.cmdline import execute
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = 'Run scrapy.'

    def add_arguments(self, parser):
        parser.add_argument('arguments', nargs='+', type=str)

    def handle(self, *args, **options):
        args = []
        args.append('scrapy.py')
        args.extend(options['arguments'])
        execute(args)
This lets you run Scrapy inside the Django environment like so:
python manage.py scrapy crawl detik
python manage.py scrapy shell 'https://news.detik.com/indeks/all/?date=02/28/2018'
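If you instead launch the spider directly with scrapy crawl detik rather than through manage.py, the Django environment has to be bootstrapped by hand before the pipeline imports any models. A common sketch; every path and module name below is hypothetical:

```python
# Top of the Scrapy project's settings.py (all names are placeholders):
import os
import sys

import django

sys.path.append('/path/to/your/django/project')  # hypothetical project path
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'mysite.settings')  # hypothetical
django.setup()  # loads Django apps so models can be imported in pipelines
```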
About python - Integrating Scrapy with Django: How to, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50637920/