python - Why are my defined Items not being populated and stored by Scrapy?

Tags: python html-parsing web-scraping web-crawler scrapy

Suppose I have the following website structure:

  1. Start URLs: http://thomas.loc.gov/cgi-bin/query/z?c107:H.R.%s: where %s is an index from 1 to 50 (a sample for illustration purposes).
  2. "First layer": the bill text, or links to multiple versions of it...
  3. "Second layer": the bill text, with a link to the "Printer Friendly" (plain-text) version.

The end goal of the script:

  1. Navigate the start URLs; parse the URL, title, and body; save them to a starts.txt file
  2. Extract the "first layer" links from the body of the start URLs; navigate to those links; parse the URL, title, and body; save them to a bills.txt file
  3. Extract the "second layer" links from the body of the "first layer" URLs; navigate to those links; parse the URL, title, and body; save them to a versions.txt file

Suppose I have the following script:

from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class StartItem(Item):
    url = Field()
    title = Field()
    body = Field()

class BillItem(Item):
    url = Field()
    title = Field()
    body = Field()

class VersionItem(Item):
    url = Field()
    title = Field()
    body = Field()

class Lrn2CrawlSpider(CrawlSpider):
    name = "lrn2crawl"
    allowed_domains = ["thomas.loc.gov"]
    start_urls = ["http://thomas.loc.gov/cgi-bin/query/z?c107:H.R.%s:" % bill for bill in xrange(000001,00050,00001) ### Sample of 40 bills; Total range of bills is 1-5767

    ]

    rules = (
            # Extract links matching the /query/D fragment (restricted to those inside the content body of the page),
            # follow them, and scrape the bill text via parse_bills.
            # Desired result: scrape all bill text & in the event that there are multiple versions, follow them & parse.
            Rule(SgmlLinkExtractor(allow=(r'/query/D'), restrict_xpaths=('//div[@id="content"]')), callback='parse_bills', follow=True),

            # Extract links in the body of a bill version & follow them.
            # Desired result: scrape all version text & in the event that there are multiple sections, follow them & parse.
            Rule(SgmlLinkExtractor(allow=(r'/query/C'), restrict_xpaths=('//table/tr/td[2]/a/@href')), callback='parse_versions', follow=True)
        )

    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)
        starts = hxs.select('//div[@id="content"]')
        scraped_starts = []
        for start in starts:
            scraped_start = StartItem() ### Start object defined previously
            scraped_start['url'] = response.url
            scraped_start['title'] = start.select('//h1/text()').extract()
            scraped_start['body'] = response.body
            scraped_starts.append(scraped_start)
            with open('starts.txt', 'a') as f:
                f.write('url: {0}, title: {1}, body: {2}\n'.format(scraped_start['url'], scraped_start['title'], scraped_start['body']))
        return scraped_starts

    def parse_bills(self, response):
        hxs = HtmlXPathSelector(response)
        bills = hxs.select('//div[@id="content"]')
        scraped_bills = []
        for bill in bills:
            scraped_bill = BillItem() ### Bill object defined previously
            scraped_bill['url'] = response.url
            scraped_bill['title'] = bill.select('//h1/text()').extract()
            scraped_bill['body'] = response.body
            scraped_bills.append(scraped_bill)
            with open('bills.txt', 'a') as f:
                f.write('url: {0}, title: {1}, body: {2}\n'.format(scraped_bill['url'], scraped_bill['title'], scraped_bill['body']))
        return scraped_bills

    def parse_versions(self, response):
        hxs = HtmlXPathSelector(response)
        versions = hxs.select('//div[@id="content"]')
        scraped_versions = []
        for version in versions:
            scraped_version = VersionItem() ### Version object defined previously
            scraped_version['url'] = response.url
            scraped_version['title'] = version.select('//h1/text()').extract()
            scraped_version['body'] = response.body
            scraped_versions.append(scraped_version)
            with open('versions.txt', 'a') as f:
                f.write('url: {0}, title: {1}, body: {2}\n'.format(scraped_version['url'], scraped_version['title'], scraped_version['body']))
        return scraped_versions
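
For context, a self-contained spider file like this can be run without a full Scrapy project using Scrapy's standard CLI (assuming the file is saved as lrn2crawl.py, a hypothetical name); scraped items are also echoed to the log:

scrapy runspider lrn2crawl.py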

This script seems to do everything I want, except navigate to the "second layer" links and parse the items (URL, title, and body) from those pages. In other words, Scrapy is not crawling or parsing my "second layer".

To restate my question more simply: why is Scrapy not populating my VersionItem and outputting it to the file I want, versions.txt?

Best Answer

The problem is in the restrict_xpaths setting of the second SgmlLinkExtractor: it points at @href attribute nodes rather than at a region of the document. restrict_xpaths should select the elements within which links are extracted, and an attribute node contains no <a> tags, so the extractor finds no links and parse_versions is never called. Change it to:

restrict_xpaths=('//div[@id="content"]',) 
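
For clarity, here is the second rule with only that one argument changed (a minimal sketch; everything else stays as in the original spider):

Rule(SgmlLinkExtractor(allow=(r'/query/C'),
                       restrict_xpaths=('//div[@id="content"]',)),
     callback='parse_versions', follow=True)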

Hope that helps.
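
As a side note: scrapy.contrib, SgmlLinkExtractor, and HtmlXPathSelector have been deprecated and removed in later Scrapy releases. On a modern Scrapy version the same corrected rule would be written with the generic LinkExtractor, roughly like this (a sketch, not part of the original answer):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Equivalent of the corrected rule on modern Scrapy
Rule(LinkExtractor(allow=r'/query/C',
                   restrict_xpaths='//div[@id="content"]'),
     callback='parse_versions', follow=True)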

Regarding "python - Why are my defined Items not being populated and stored by Scrapy?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/17795348/

Related articles:

python - Making code less complex and more readable

python - Finding numbers with a specific format

python - Logistic regression with spark ml (dataframes)

python - How to get a string from this html segment using python

python - How to find the text of <div><span>text</span></div> in beautifulsoup?

perl - Which Perl modules are suitable for data processing?

python - How to extend this search-and-replace python script to accept variables from the command line?

Python and Beautifulsoup 4 - Unable to filter by class?

mysql - Scrapy Pipeline can't insert into MySQL

python - AttributeError: module 'sys' has no attribute 'setdefaultencoding'