python - Looping over multiple URLs from a CSV file in Scrapy is not working

Tags: python web-scraping scrapy

I get an error when I try to run this loop; please help. I want to scrape multiple links from a CSV file, but I'm stuck at start_urls. I'm using Scrapy 2.5 and Python 3.9.7.

import scrapy
import pandas as pd


class PagedataSpider(scrapy.Spider):
    name = 'pagedata'
    allowed_domains = ['www.imdb.com']

    def start_requests(self):
        df = pd.read_csv('list1.csv')
        # list1.csv is a csv file with a column named 'link'
        # containing all the urls to loop over.
        urlList = df['link'].values.to_list()
        for i in urlList:
            yield scrapy.Request(url = i, callback=self.parse)

Error:

2021-11-09 22:06:45 [scrapy.core.engine] INFO: Spider opened
2021-11-09 22:06:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-11-09 22:06:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-11-09 22:06:45 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "C:\Users\Vivek\Desktop\Scrapy\myenv\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "C:\Users\Vivek\Desktop\Scrapy\moviepages\moviepages\spiders\pagedata.py", line 18, in start_requests
    urlList = df['link'].values.to_list()
AttributeError: 'numpy.ndarray' object has no attribute 'to_list'
2021-11-09 22:06:45 [scrapy.core.engine] INFO: Closing spider (finished)
2021-11-09 22:06:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.007159,
 'finish_reason': 'finished',

Best answer

The error you're getting is fairly straightforward: a NumPy array has no to_list method (the NumPy method is spelled tolist).

Instead, you can simply iterate over the column directly:

import scrapy
import pandas as pd


class PagedataSpider(scrapy.Spider):
    name = 'pagedata'
    allowed_domains = ['www.imdb.com']

    def start_requests(self):
        df = pd.read_csv('list1.csv')

        urls = df['link']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

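As a side note, if pandas is only being used to read the URL column, the standard-library csv module is enough. A minimal sketch, assuming the file has a 'link' header as in the question (the sample data here is hypothetical, standing in for the real list1.csv):

```python
import csv
import io

# Stand-in for open('list1.csv', newline='') in the spider;
# the two IMDb URLs are made-up sample rows.
sample = io.StringIO(
    "link\n"
    "https://www.imdb.com/title/tt0111161/\n"
    "https://www.imdb.com/title/tt0068646/\n"
)

with sample as f:
    # DictReader maps each row to {'link': <url>} using the header line
    urls = [row["link"] for row in csv.DictReader(f)]

print(urls)
```

Each url in urls could then be yielded as a scrapy.Request exactly as in the answer above.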
Regarding "python - Looping over multiple URLs from a CSV file in Scrapy is not working", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/69902187/
