I'm using Scrapy to scrape data from this site. I need to call getlink from parse. A normal call does not work when getlink uses yield, and I get this error:
2015-11-16 10:12:34 [scrapy] ERROR: Spider must return Request, BaseItem, dict or None, got 'generator' in <GET https://www.coldwellbankerhomes.com/fl/miami-dade-county/kvc-17_1,17_3,17_2,17_8/incl-22/>
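Returning the result of getlink from parse works, but I need to execute some more code even after the return. I'm confused; any help would be much appreciated.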
# -*- coding: utf-8 -*-
from scrapy.spiders import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request,Response
import re
import csv
import time
from selenium import webdriver
class ColdWellSpider(BaseSpider):
    name = "cwspider"
    allowed_domains = ["coldwellbankerhomes.com"]
    #start_urls = [''.join(row).strip() for row in csv.reader(open("remaining_links.csv"))]
    #start_urls = ['https://www.coldwellbankerhomes.com/fl/boynton-beach/5451-verona-drive-unit-d/pid_9266204/']
    start_urls = ['https://www.coldwellbankerhomes.com/fl/miami-dade-county/kvc-17_1,17_3,17_2,17_8/incl-22/']

    def parse(self,response):
        #browser = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--load-images=false'])
        browser = webdriver.Firefox()
        browser.maximize_window()
        browser.get(response.url)
        time.sleep(5)
        #to extract all the links from a page and send request to those links
        #this works but even after returning i need to execute the while loop
        return self.getlink(response)
        #for clicking the load more button in the page
        while True:
            try:
                browser.find_element_by_class_name('search-results-load-more').find_element_by_tag_name('a').click()
                time.sleep(3)
                self.getlink(response)
            except:
                break

    def getlink(self,response):
        print 'hhelo'
        c = open('data_getlink.csv', 'a')
        d = csv.writer(c, lineterminator='\n')
        print 'hello2'
        listclass = response.xpath('//div[@class="list-items"]/div[contains(@id,"snapshot")]')
        for l in listclass:
            link = 'http://www.coldwellbankerhomes.com/'+''.join(l.xpath('./h2/a/@href').extract())
            d.writerow([link])
            yield Request(url = str(link),callback=self.parse_link)

    #callback function of Request
    def parse_link(self,response):
        b = open('data_parselink.csv', 'a')
        a = csv.writer(b, lineterminator='\n')
        a.writerow([response.url])
Best answer
Spider must return Request, BaseItem, dict or None, got 'generator'
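getlink() is a generator. You are trying to yield the generator object itself from the parse() generator, so Scrapy receives a generator where it expects a Request or an item.
Instead, you can/should iterate over the results of the getlink() call: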
def parse(self, response):
    browser = webdriver.Firefox()
    browser.maximize_window()
    browser.get(response.url)
    time.sleep(5)
    while True:
        try:
            for request in self.getlink(response):
                yield request
            browser.find_element_by_class_name('search-results-load-more').find_element_by_tag_name('a').click()
            time.sleep(3)
        except:
            break
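The difference between yielding a generator object and iterating over it can be reproduced without Scrapy. Below is a minimal sketch in plain Python 3; `get_links`, `parse_wrong`, and `parse_right` are hypothetical stand-ins for the spider methods, and plain strings stand in for `Request` objects.

```python
def get_links(page):
    """Hypothetical stand-in for getlink(): yields one item per link."""
    for link in page:
        yield "request:" + link


def parse_wrong(page):
    # Yields the generator object itself, which is the kind of value
    # Scrapy rejects with "got 'generator'".
    yield get_links(page)


def parse_right(page):
    # Iterates the inner generator and re-yields each item, so the code
    # after the loop still runs.
    for request in get_links(page):
        yield request
    # In Python 3 the loop above can also be written as:
    #     yield from get_links(page)
    yield "request:after-loop"


wrong = list(parse_wrong(["a", "b"]))
right = list(parse_right(["a", "b"]))
```

Here `wrong` holds a single generator object, while `right` holds the individual items followed by the one produced after the loop.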
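Also, I noticed that you have both self.getlink(response) and self.getlink(browser). The latter won't work because a webdriver instance has no xpath() method. You probably meant to make a Scrapy Selector out of the page source loaded by your webdriver-controlled browser, for example: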
selector = scrapy.Selector(text=browser.page_source)
self.getlink(selector)
You should also look at Explicit Waits with Expected Conditions instead of relying on artificial delays via time.sleep(), which are unreliable and slow.
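The idea behind an explicit wait is to poll a condition until it holds or a timeout expires, instead of sleeping for a fixed time. Selenium provides this through `WebDriverWait(...).until(...)` together with its `expected_conditions` module; the block below is not that API, only a standard-library sketch of the same polling idea. The names `wait_until` and `ready` are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Usage with a fake condition that becomes true on the third check.
calls = {"count": 0}


def ready():
    calls["count"] += 1
    return calls["count"] >= 3


value = wait_until(ready, timeout=2.0, poll_interval=0.01)
```

Unlike a fixed `time.sleep(5)`, the wait returns as soon as the condition is met and fails loudly if it never is.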
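Also, I'm not sure why you are writing to CSV manually instead of using the built-in Scrapy Items and Item Exporters. In addition, you are not closing the files properly and are not using the with context manager.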
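If the manual CSV writing is kept, a `with` block guarantees the file is closed even when an exception occurs. This is a minimal Python 3 sketch; the helper name `append_links`, the temporary directory, and the example.com URLs are assumptions for illustration only.

```python
import csv
import os
import tempfile


def append_links(path, links):
    """Append one link per row; the file is closed when the block exits."""
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for link in links:
            writer.writerow([link])


# Usage: write to a temporary file and read the rows back.
csv_path = os.path.join(tempfile.mkdtemp(), "data_getlink.csv")
append_links(csv_path, ["http://example.com/a", "http://example.com/b"])
with open(csv_path, newline="") as handle:
    rows = list(csv.reader(handle))
```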
Finally, try to catch more specific exceptions and avoid having a bare try/except block.
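A sketch of the pattern, using hypothetical stand-ins so that it runs without a browser: `NoMoreResults` and `click_load_more` are invented for this example. With Selenium, the exception to catch when an element is missing would typically be `selenium.common.exceptions.NoSuchElementException`.

```python
class NoMoreResults(Exception):
    """Hypothetical error raised when the 'load more' link is gone."""


def click_load_more(pages_left):
    # Hypothetical stand-in for the Selenium click.
    if pages_left <= 0:
        raise NoMoreResults("no load-more link on the page")
    return pages_left - 1


def load_all(pages_left):
    clicks = 0
    while True:
        try:
            pages_left = click_load_more(pages_left)
        except NoMoreResults:
            # Only the expected condition ends the loop; any other
            # exception (a typo, a network failure) still propagates.
            break
        clicks += 1
    return clicks


total_clicks = load_all(3)
```

A bare `except:` in the same place would also swallow `KeyboardInterrupt` and programming errors, silently ending the loop.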
Regarding Python returning multiple times, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33728743/