python - Scrapy: getting the incoming anchor text of a link

Tags: python web-crawler scrapy

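I want to capture the anchor text of the link that led to each crawled page. How can I get that incoming anchor text from the referring URL?

Thanks for your time!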

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from wallspider.items import Website


class mydomainSpider(CrawlSpider):
    name = "mydomain"
    allowed_domains = ["www.mydomain"]
    start_urls = ["http://www.mydomain/cp/133162",]

    rules = (
        Rule(SgmlLinkExtractor(allow=('133162',),
                               deny=('/ip/', 'search_sort=', 'ic=60_0',
                                     'customer_rating', 'special_offers')),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*')
        items = []

        for site in sites:
            item = Website()
            item['referer'] = response.request.headers.get('Referer')
            item['url'] = response.url
            item['title'] = site.xpath('/html/head/title/text()').extract()
            item['description'] = site.select('//meta[@name="Description"]/@content').extract()
            items.append(item)

        return items

Update: here is the new code I wrote based on everyone's suggestions:

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from wallspider.items import Website
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class anchorspider(CrawlSpider):
    name = "anchor"
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://www.mydomain.com/"]

    extractor = SgmlLinkExtractor()

    rules = (
        Rule(SgmlLinkExtractor(allow=('133162',),
                               deny=('/ip/', 'search_sort=', 'ic=60_0',
                                     'customer_rating', 'special_offers')),
             callback="parse_items", follow=True),
    )

    def parse_start_url(self, response):
        # Return the generated requests so that Scrapy schedules them.
        return self.parse_links(response)

    def parse_links(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a')
        for link in links:
            anchor_text = ''.join(link.select('./text()').extract())
            title = ''.join(link.select('./@title').extract())
            url = ''.join(link.select('./@href').extract())
            # Build a single dict; a second assignment would overwrite the first.
            meta = {'title': title, 'anchor_text': anchor_text}
            yield Request(url, callback = self.parse_page, meta=meta,)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = Website()
        item['anchor_text']=response.meta['anchor_text']
        item['url'] = response.url
        item['title'] = response.meta['title']
        item['referer'] = response.request.headers.get('Referer')
        item['description'] = hxs.select('//meta[@name="Description"]/@content').extract()

        return item

I get the following error: raise ValueError('Missing scheme in request url: %s' % self._url)
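
A likely cause, assuming the page contains relative or empty href values: Request only accepts absolute URLs, so an href such as /ip/42 or an empty string triggers this ValueError. The sketch below shows one way to resolve each href against the page URL before building a request. The helper name absolute_http_url and the sample paths are hypothetical; inside a spider, response.urljoin(href) performs the same resolution.

```python
from urllib.parse import urljoin, urlparse


def absolute_http_url(page_url, href):
    """Resolve href against page_url.

    Returns an absolute http(s) URL, or None when the href is empty or
    points at something a crawler cannot fetch (mailto:, javascript:, ...).
    """
    href = href.strip()
    if not href:
        return None
    candidate = urljoin(page_url, href)
    if urlparse(candidate).scheme in ("http", "https"):
        return candidate
    return None


# Hypothetical sample values, only to exercise the helper.
page = "http://www.mydomain.com/cp/133162"
resolved_root_relative = absolute_http_url(page, "/ip/42")
resolved_sibling = absolute_http_url(page, "133163")
skipped_mailto = absolute_http_url(page, "mailto:someone@example.com")
skipped_empty = absolute_http_url(page, "")
```

In parse_links, a request would then be yielded only when the helper returns a URL rather than None.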

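Best answer

The response object already carries this. For requests that a CrawlSpider Rule generates from extracted links, the anchor text is available as response.meta.get('link_text').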
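
As a minimal sketch of how a callback might read that value: the function below is written against a plain stand-in object so it can be tried outside a crawl. The name item_from_response, the stand-in response, and the sample anchor text are assumptions for illustration. Only response.url, response.meta and response.request.headers are taken from Scrapy's response interface.

```python
from types import SimpleNamespace


def item_from_response(response):
    """Collect URL, referer and incoming anchor text from a Scrapy-style response."""
    referer = response.request.headers.get("Referer")
    if isinstance(referer, bytes):
        # Scrapy returns header values as bytes on Python 3.
        referer = referer.decode("utf-8", errors="replace")
    return {
        "url": response.url,
        "referer": referer,
        # 'link_text' is set by CrawlSpider for requests produced by a Rule;
        # it is missing (None here) for start URLs and hand-built requests.
        "anchor_text": response.meta.get("link_text"),
    }


# Stand-in for a real response, using hypothetical data.
fake_response = SimpleNamespace(
    url="http://www.mydomain.com/cp/133162",
    meta={"link_text": "Wall Decor"},
    request=SimpleNamespace(headers={"Referer": b"http://www.mydomain.com/"}),
)
item = item_from_response(fake_response)
```

In the spider from the question, the same lookups would go directly into parse_items, which removes the need for the hand-written parse_links step.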

For more on "python - Scrapy: getting the incoming anchor text of a link", see the similar question on Stack Overflow: https://stackoverflow.com/questions/20482684/
