python - How to override/use cookies in scrapy

Tags: python scrapy

I want to scrape http://www.3andena.com/. The site starts in Arabic by default and stores the language setting in a cookie. If you try to reach the English version directly through the URL (http://www.3andena.com/home.php?sl=en), it causes a problem and returns a server error.

So I want to set the cookie value "store_language" to "en" and then start scraping the site with that cookie in place.

I am using a CrawlSpider with a few rules.

Here is the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log
from bkam.items import Product
from scrapy.http import Request
import re

class AndenaSpider(CrawlSpider):
  name = "andena"
  domain_name = "3andena.com"
  start_urls = ["http://www.3andena.com/Kettles/?objects_per_page=10"]

  product_urls = []

  rules = (
     # The following rule is for pagination
     Rule(SgmlLinkExtractor(allow=(r'\?page=\d+$'),), follow=True),
     # The following rule is for product details
     Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "products-dialog")]//table//tr[contains(@class, "product-name-row")]/td'), unique=True), callback='parse_product', follow=True),
     )

  def start_requests(self):
    yield Request('http://3andena.com/home.php?sl=en', cookies={'store_language':'en'})

    for url in self.start_urls:
        yield Request(url, callback=self.parse_category)


  def parse_category(self, response):
    hxs = HtmlXPathSelector(response)

    self.product_urls.extend(hxs.select('//td[contains(@class, "product-cell")]/a/@href').extract())

    for product in self.product_urls:
        yield Request(product, callback=self.parse_product)  


  def parse_product(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    item = Product()

    '''
    some parsing
    '''

    items.append(item)
    return items

SPIDER = AndenaSpider()

Here is the log:

2012-05-30 19:27:13+0000 [andena] DEBUG: Redirecting (301) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://3andena.com/home.php?sl=en>
2012-05-30 19:27:14+0000 [andena] DEBUG: Redirecting (302) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098>
2012-05-30 19:27:14+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/Kettles/?objects_per_page=10> (referer: None)
2012-05-30 19:27:15+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/B-and-D-Concealed-coil-pan-kettle-JC-62.html> (referer: http://www.3andena.com/Kettles/?objects_per_page=10)

Best Answer

Modify your code as follows:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, cookies={'store_language':'en'}, callback=self.parse_category)

Scrapy's Request object accepts an optional cookies keyword argument; see the Scrapy documentation on Request for details.
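A side note beyond the original answer: Scrapy's built-in cookie middleware stores the cookies a spider receives and resends them on later requests to the same site, so the store_language cookie set on the start requests should also be sent for the pages that the CrawlSpider rules follow. If you want to confirm that the cookie is really going out, a minimal sketch of a settings.py tweak (assuming the default CookiesMiddleware is in use, which it is unless COOKIES_ENABLED has been turned off):

# settings.py -- verification sketch, not part of the original answer
COOKIES_ENABLED = True   # default: keep per-spider cookies and resend them automatically
COOKIES_DEBUG = True     # log every Cookie header sent and Set-Cookie header received,
                         # so you can check that store_language=en is on each request

With COOKIES_DEBUG enabled, the crawl log shows the outgoing Cookie header next to each request, which makes it easy to see whether the language cookie survives the redirects shown in the log above.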

Regarding python - how to override/use cookies in scrapy, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/10667202/
