python - Confusion about using XPath when scraping a website with Scrapy

Tags: python xpath web-scraping scrapy

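When trying to scrape certain elements of a website, I can't work out which part of the XPath I should select. In this case, I am trying to scrape all the websites linked in this article, for example this part of the XPath: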

data-track="Body Text Link: External" href="http://www.uspreventiveservicestaskforce.org/Page/Document/RecommendationStatementFinal/brca-related-cancer-risk-assessment-genetic-counseling-and-genetic-testing">

My spider runs, but it doesn't scrape anything!

My code is as follows:

import scrapy

from nymag.items import nymagItem


class nymagSpider(scrapy.Spider):
    name = 'nymag'
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['nymag.com']
    start_urls = ["http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html"]

    def parse(self, response):
        # I'm pretty sure the below line is the issue
        # (the XPath expression has to be passed as a string)
        links = response.xpath('//*[@id="primary"]/main/article/div/span')
        for link in links:
            item = nymagItem()
            # This might also be wrong - am trying to extract the href section
            item['link'] = link.xpath('a/@href').extract()
            yield item

Best answer

There is an easier way. Get all the a elements inside the article that have both a data-track and an href attribute:

In [1]: for link in response.xpath("//div[@id = 'primary']/main/article//a[@data-track and @href]"):
   ...:     print(link.xpath("@href").extract()[0])
   ...:
//nymag.com/tags/healthcare/
//nymag.com/author/Susan%20Rinkunas/
http://twitter.com/sueonthetown
http://www.facebook.com/sharer/sharer.php?u=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dfb-share-thecut
https://twitter.com/share?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dtwitter-share-thecut&via=TheCut
https://plus.google.com/share?url=http%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html
http://pinterest.com/pin/create/button/?url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dpinterest-share-thecut&description=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&media=http:%2F%2Fpixel.nymag.com%2Fimgs%2Ffashion%2Fdaily%2F2015%2F09%2F08%2F08-angelina-jolie.w750.h750.2x.jpg
whatsapp://send?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0A%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html&mid=whatsapp
mailto:?subject=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&body=I%20saw%20this%20on%20The%20Cut%20and%20thought%20you%20might%20be%20interested...%0A%0AShould%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0AIt's%20not%20a%20crystal%20ball.%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Demailshare%5Fthecut
... 
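
The output above mixes protocol-relative links (starting with //), ordinary http/https links, and non-web schemes such as whatsapp: and mailto:. If only web links are wanted, one possible post-processing step is sketched below. It uses only the standard library; the function name normalize_links and the shortened sample hrefs are illustrative assumptions, not part of the original answer.

```python
from urllib.parse import urljoin, urlparse

# The article URL from start_urls in the question; used as the base for relative links.
PAGE_URL = "http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html"


def normalize_links(hrefs, base_url=PAGE_URL):
    """Resolve each href against base_url and keep only http/https links.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    result = []
    for href in hrefs:
        # A protocol-relative href ("//host/path") inherits the scheme of base_url.
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme in ("http", "https"):
            result.append(absolute)
    return result


# A few hrefs taken from the output above (the last two are shortened here).
sample = [
    "//nymag.com/tags/healthcare/",
    "http://twitter.com/sueonthetown",
    "whatsapp://send?text=Should%20All%20Women",
    "mailto:?subject=Should%20All%20Women",
]

for url in normalize_links(sample):
    print(url)
# prints:
# http://nymag.com/tags/healthcare/
# http://twitter.com/sueonthetown
```

Inside a Scrapy callback, response.urljoin(href) performs the same resolution against the URL of the response, so the base URL would not need to be hard-coded there.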

Source: the original question, "Confusion about using XPath when scraping a website with Scrapy", on Stack Overflow: https://stackoverflow.com/questions/32504067/
