python - How to scrape all the content of each link with scrapy?

Tags: python web-scraping scrapy web-crawler scrapy-spider

I am new to scrapy, and I want to extract all the content of each advertisement from this website. So I tried the following:

from scrapy.spiders import Spider
from craigslist_sample.items import CraigslistSampleItem

from scrapy.selector import Selector
class MySpider(Spider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    def parse(self, response):
        links = response.selector.xpath(".//*[@id='sortable-results']//ul//li//p")
        for link in links:
            content = link.xpath(".//*[@id='titletextonly']").extract()
            title = link.xpath("a/@href").extract()
            print(title,content)

Items:

# Define here the models for your scraped items

from scrapy.item import Item, Field

class CraigslistSampleItem(Item):
    title = Field()
    link = Field()

However, when I run the crawler, I get nothing:

$ scrapy crawl --nolog craig
[]
[]
[]
... (the same empty list, repeated for every result)

So, my question is: how can I iterate over each URL, follow each link, and scrape the content and the title? What is the best approach?

Best Answer

Your parse() gets empty lists because @id='titletextonly' only exists on each ad's detail page, not on the search-results page you are selecting from. The spider has to follow each listing's link and extract the title and body from the detail page itself.

To build a basic scrapy project, you can use the command:

scrapy startproject craig
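
For orientation, startproject generates a skeleton roughly like this (the exact set of files varies slightly between Scrapy versions):

craig/
    scrapy.cfg
    craig/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py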

Then add the spider and the items:

craig/spiders/spider.py

from scrapy import Spider, Request
from scrapy.selector import Selector
from urllib.parse import urljoin

from craig.items import CraigslistSampleItem

class CraigSpider(Spider):
    name = "craig"
    start_url = "https://sfbay.craigslist.org/search/npo"

    def start_requests(self):

        yield Request(self.start_url, callback=self.parse_results_page)


    def parse_results_page(self, response):

        sel = Selector(response)

        # Browse paging.
        page_urls = sel.xpath(""".//span[@class='buttons']/a[@class='button next']/@href""").getall()

        for page_url in page_urls + [response.url]:
            page_url = urljoin(self.start_url, page_url)

            # Yield a request for the next page of the list, with callback to this same function: self.parse_results_page().
            yield Request(page_url, callback=self.parse_results_page)

        # Browse items.
        item_urls = sel.xpath(""".//*[@id='sortable-results']//li//a/@href""").getall()

        for item_url in item_urls:
            item_url = urljoin(self.start_url, item_url)

            # Yield a request for each item page, with callback self.parse_item().
            yield Request(item_url, callback=self.parse_item)


    def parse_item(self, response):

        sel = Selector(response)

        item = CraigslistSampleItem()

        item['title'] = sel.xpath('//*[@id="titletextonly"]').extract_first()
        item['body'] = sel.xpath('//*[@id="postingbody"]').extract_first()
        item['link'] = response.url

        yield item
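
Note that the two XPaths in parse_item() return each element's outer HTML, tags included. If you only want the visible text, you can select text() nodes instead. Here is a minimal drop-in variant of parse_item(), assuming the same item class and page ids as above (you can verify the expressions first by running scrapy shell against one of the ad URLs):

    def parse_item(self, response):
        item = CraigslistSampleItem()
        # //text() collects every text node under the element; strip stray whitespace.
        item['title'] = response.xpath('//*[@id="titletextonly"]//text()').get(default='').strip()
        item['body'] = ' '.join(response.xpath('//*[@id="postingbody"]//text()').getall()).strip()
        item['link'] = response.url
        yield item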

craig/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items

from scrapy.item import Item, Field

class CraigslistSampleItem(Item):
    title = Field()
    body = Field()
    link = Field()

craig/settings.py

# -*- coding: utf-8 -*-

BOT_NAME = 'craig'

SPIDER_MODULES = ['craig.spiders']
NEWSPIDER_MODULE = 'craig.spiders'

ITEM_PIPELINES = {
   'craig.pipelines.CraigPipeline': 300,
}
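
If Craigslist starts rate-limiting the crawl, you can slow the spider down in the same settings.py. These are standard Scrapy settings, suggested here as optional additions rather than something the example requires:

# Wait between requests to reduce the chance of being rate-limited.
DOWNLOAD_DELAY = 1.0
# Let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True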

craig/pipelines.py

from scrapy import signals
from scrapy.exporters import CsvItemExporter

class CraigPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        # scrapy.xlib.pydispatch was removed from Scrapy; connect to the
        # spider_opened/spider_closed signals through the crawler instead.
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_ads.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

You can run the spider with the command:

scrapy runspider craig/spiders/spider.py

from the root directory of the project.

It should create a craig_ads.csv file in the root directory of the project.
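
Since this is a full project, scrapy crawl craig works from the project root as well. And if the custom pipeline exists only to produce the CSV, Scrapy's built-in feed export can do the same job without any pipeline code:

scrapy crawl craig -o craig_ads.csv

The -o option appends every scraped item to the given file and infers the export format from the file extension.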

This question, "python - How to scrape all the content of each link with scrapy?", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/40479789/

Related articles:

c# - HTML Agility Pack screen scraping XPATH not returning data

python - Getting Cython to work in Windows 7 with Eclipse PyDev

php - Bash script hangs when called from php or python

python-3.x - Unable to export my Pandas dataframe to Excel

html - How to fix 'for each' iteration with getELementsbyTagName?

web-scraping - scrapy - how to stop redirects (302)

python - Scrapy ignores allowed_domains?

python - Can any method be specified as the callback when constructing a Scrapy Request object?

python - Numpy array item order - even distribution of sequences

python - Pandas Join not giving exact results