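python - Scrapy spider to get information inside links

I have built a spider that scrapes the information on this page and follows the "Next page" link. At the moment, the spider only gets the information shown in the structure below.

The page is structured like this: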

Title 1
URL 1 ---------> If you click you go to one page with more information
Location 1

Title 2
URL 2 ---------> If you click you go to one page with more information
Location 2

Next page

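What I want is for the spider to also visit each URL link and get the full information. I think I have to write another rule that specifies this.

The spider's behaviour should be:

Go to URL 1 (get info)
Go to URL 2 (get info)
...
Next page

But I don't know how to implement it. Can someone guide me?

My spider code: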

import re

# Imports for the old Scrapy releases that still ship SgmlLinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector

# BcnItem is the project's Item subclass; its import is not shown here


class BcnSpider(CrawlSpider):
    name = 'bcn'
    allowed_domains = ['guia.bcn.cat']
    start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']

    # Follow only the links inside the paginator
    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(re.escape("index.php")),
                restrict_xpaths=("//div[@class='paginador']")),
            callback="parse_item",
            follow=True),
    )

    def parse_item(self, response):
        self.log("parse_item")
        sel = Selector(response)
        sites = sel.xpath("//div[@id='llista-resultats']/div")
        items = []
        # enumerate() supplies the running id for each result on the page
        for cont, site in enumerate(sites):
            item = BcnItem()
            item['id'] = cont
            item['title'] = u''.join(site.xpath('h3/a/text()').extract())
            item['url'] = u''.join(site.xpath('h3/a/@href').extract())
            item['when'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[1]/text()').extract())
            item['where'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[2]/span/a/text()').extract())
            item['street'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[3]/span/text()').extract())
            item['phone'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[4]/text()').extract())
            items.append(item)
        return items

Edit: After searching the internet, I found code that does this.

First I have to get all the links, then I have to call another parse method.

from scrapy.http import Request

def parse(self, response):
    # Get all URLs; _url stands for each link found on the listing page
    yield Request(url=_url, callback=self.parse_details)

def parse_details(self, response):
    # Detailed information of each page
    pass

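If you need Rules because the pages have a paginator, rename def parse to def parse_start_url and then call that method through the Rule. With this change you make sure that parsing starts from parse_start_url, and the code looks like this: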

rules = (
    Rule(
        SgmlLinkExtractor(
            allow=(re.escape("index.php")),
            restrict_xpaths=("//div[@class='paginador']")),
        callback="parse_start_url",
        follow=True),
)

def parse_start_url(self, response):
    # Get all URLs; _url stands for each link found on the listing page
    yield Request(url=_url, callback=self.parse_details)

def parse_details(self, response):
    # Detailed information of each page
    pass

That's all, folks.

Best answer

There is a simpler way to achieve this. Click "Next" on the page and read the new URL carefully:

http://guia.bcn.cat/index.php?pg=search&from=10&q=*:*&nr=10

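By looking at the GET data in the URL (everything after the question mark) and doing some testing, we find that these mean:

from=10 - the start index
q=*:* - the search query
nr=10 - the number of items to display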
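As a small illustration of reading those parameters, the query string can be split with the standard library. This is only a sketch; the variable names start and page_size are mine, and what each parameter means is taken from the description above.

```python
from urllib.parse import parse_qs, urlsplit

url = "http://guia.bcn.cat/index.php?pg=search&from=10&q=*:*&nr=10"

# parse_qs maps each parameter name to a list of its values
params = parse_qs(urlsplit(url).query)

start = int(params["from"][0])    # start index
page_size = int(params["nr"][0])  # number of items to display
query = params["q"][0]            # search query

print(query, start, page_size)
```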

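This is what I would do:

Set nr=100 or higher. (1000 works too, just make sure there is no timeout.)
Loop from from=0 to 34300. This is above the current number of entries. You may want to extract that value first.

Sample code: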

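entries = 34246
step = 100
# Round up to the next multiple of step so the last partial page is included
stop = entries - entries % step + step

# range() works on Python 2 and 3 (the original used the Python 2 xrange)
for x in range(0, stop, step):
    url = 'http://guia.bcn.cat/index.php?pg=search&from={}&q=*:*&nr={}'.format(x, step)
    # Loop over all entries, and open links if needed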
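The same loop can be wrapped in a reusable function. This is a sketch, and the names page_urls and BASE_URL are hypothetical. Stepping range() up to entries itself avoids requesting one extra empty page when the entry count happens to be an exact multiple of the step.

```python
from urllib.parse import urlencode

BASE_URL = "http://guia.bcn.cat/index.php"


def page_urls(entries, step=100):
    """Return one search URL per page of `step` results."""
    urls = []
    for start in range(0, entries, step):
        # safe="*:" keeps the q=*:* query readable instead of percent-encoding it
        query = urlencode(
            {"pg": "search", "from": start, "q": "*:*", "nr": step},
            safe="*:",
        )
        urls.append("{}?{}".format(BASE_URL, query))
    return urls


urls = page_urls(34246)
print(len(urls), urls[0])
```

Each URL can then be requested in turn, for example as the start_urls of a spider.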

Regarding python - Scrapy spider to get information inside links, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/21118518/
