python - Extracting data from a detail page with Scrapy

Tags: python screen-scraping scrapy web-crawler

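I am trying to scrape the phone numbers of the agencies listed on this website:

List view: http://www.authoradvance.com/agencies/

Detail view: http://www.authoradvance.com/agencies/b-personal-management/

The phone number only appears on the detail page.

So is it possible to crawl the site through URLs like the detail view URL above and scrape the phone number?

My attempt at the code is: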

from scrapy.item import Item, Field

class AgencyItem(Item):
    Phone = Field()

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from agentquery.items import AgencyItem


class AgencySpider(CrawlSpider):
   name = "agency"
   allowed_domains = ["authoradvance.com"]
   start_urls = ["http://www.authoradvance.com/agencies/"]
   rules = (Rule(SgmlLinkExtractor(allow=[r'agencies/*$']), callback='parse_item'),)

   def parse_item(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select("//div[@class='section-content']")
       items = []
       for site in sites:
           item = AgencyItem()
           item['Phone'] = site.select('div[@class="phone"]/text()').extract()
           items.append(item)
       return(items)

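I then ran "scrapy scrapagency -o items.csv -t csv", and the result was 0 pages crawled.

What is wrong? Thanks in advance for your help!

Best answer

Only one link on the page matches your regular expression (agencies/*$):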

stav@maia:~$ scrapy shell http://www.authoradvance.com/agencies/
2013-04-24 13:14:13-0500 [scrapy] INFO: Scrapy 0.17.0 started (bot: scrapybot)

>>> SgmlLinkExtractor(allow=[r'agencies/*$']).extract_links(response)
[Link(url='http://www.authoradvance.com/agencies', text=u'Agencies', fragment='', nofollow=False)]

That is just a link back to the listing page itself, and that page has no div with the section-content class:

>>> fetch('http://www.authoradvance.com/agencies')
2013-04-24 13:15:22-0500 [default] DEBUG: Crawled (200) <GET http://www.authoradvance.com/agencies> (referer: None)

>>> hxs.select("//div[@class='section-content']")
[]

So your loop never iterates, and nothing is ever appended to items.

Change the regular expression to /agencies/.+ so that it matches the detail pages:

>>> len(SgmlLinkExtractor(allow=[r'/agencies/.+']).extract_links(response))
20
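The difference between the two patterns can be checked outside Scrapy with the standard `re` module. The following is a minimal sketch, assuming the link extractor applies each `allow` pattern as a search against the absolute URL; the helper name `matching` is made up for illustration, and the URLs are the ones quoted in this thread.

```python
import re

# Pattern from the question and pattern suggested in the answer.
OLD_PATTERN = re.compile(r'agencies/*$')
NEW_PATTERN = re.compile(r'/agencies/.+')

# URLs quoted in the thread: the listing page (with and without a
# trailing slash) and two detail pages.
urls = [
    "http://www.authoradvance.com/agencies",
    "http://www.authoradvance.com/agencies/",
    "http://www.authoradvance.com/agencies/agency-group",
    "http://www.authoradvance.com/agencies/b-personal-management/",
]


def matching(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere.

    Hypothetical helper; it assumes search semantics, not a full match.
    """
    return [url for url in candidates if pattern.search(url)]


# "/*" means "zero or more slashes", so the old pattern only accepts URLs
# that end right after "agencies" (optionally followed by slashes).
old_matches = matching(OLD_PATTERN, urls)

# ".+" requires at least one character after "/agencies/", which is what
# distinguishes a detail page from the listing page.
new_matches = matching(NEW_PATTERN, urls)

print(old_matches)
print(new_matches)
```

Under that assumption, the old pattern selects only the two listing URLs, while the new one selects only the two detail URLs.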

>>> fetch('http://www.authoradvance.com/agencies/agency-group')
2013-04-24 13:25:02-0500 [default] DEBUG: Crawled (200) <GET http://www.authoradvance.com/agencies/agency-group> (referer: None)

>>> hxs.select("//div[@class='section-content']")
[<HtmlXPathSelector xpath="//div[@class='section-content']" data=u'<div class="section-content">\n\t      <di'>,
 <HtmlXPathSelector xpath="//div[@class='section-content']" data=u'<div class="section-content"><div class='>]
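To see why the callback produced nothing on the listing page but can produce items on a detail page, the loop from parse_item can be mimicked with the standard library. This is only a sketch: both markup snippets, the phone number, and the function name `extract_phones` are invented for illustration (the real pages are not reproduced in this thread), and `xml.etree.ElementTree` supports only a subset of the XPath that Scrapy's selectors accept.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The listing page has no
# section-content div; the detail page has two, as in the shell output.
LISTING_PAGE = """
<html><body>
  <div class="listing"><a href="/agencies/agency-group">Agency Group</a></div>
</body></html>
"""

DETAIL_PAGE = """
<html><body>
  <div class="section-content">
    <div class="phone">555-0100</div>
  </div>
  <div class="section-content"><div class="address">1 Example St</div></div>
</body></html>
"""


def extract_phones(markup):
    """Mirror the loop in parse_item: one item per section-content div."""
    root = ET.fromstring(markup.strip())
    items = []
    for site in root.findall(".//div[@class='section-content']"):
        # Relative path: only direct child divs with class "phone".
        phones = [div.text for div in site.findall("div[@class='phone']")]
        items.append({"Phone": phones})
    return items


# No section-content div, so the loop body never runs.
listing_items = extract_phones(LISTING_PAGE)

# Two section-content divs, so two items; only the first has a phone.
detail_items = extract_phones(DETAIL_PAGE)

print(listing_items)
print(detail_items)
```

Note that the second item has an empty Phone list because the relative path is evaluated once per section-content div; if only divs that actually contain a phone number are wanted, the outer expression would need to be narrowed accordingly.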

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/16194303/
