Python: HTML scraper to extract the information between certain words from multiple pages that share the same base URL

Tags: python html screen-scraping bots

This is what I have so far:

import csv
import requests 
from bs4 import BeautifulSoup
from itertools import izip

grant_number = ['0901289','0901282','0901260']
#IMPORTANT NOTE: PLACE GRANT NUMBERS BETWEEN STRINGS WITH NO SPACES

start = 'this site'
end = 'Please report errors'
#start and end show the words that come right before the publication data
my_string = []
#my_string is an empty list for the publication data


for x in grant_number:      # Number of pages plus one 
    url = "http://nsf.gov/awardsearch/showAward?AWD_ID={}".format(x)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    soup_string = str(soup)
    my_string[int(x)] = soup_string[(soup_string.index(start)+len(start)):soup_string.index(end)]
with open('NSF.csv', 'wb') as f:
    #Default Filename is NSF.csv ; This can be changed by editing the first field after 'open('
    writer = csv.writer(f)
    writer.writerows(izip(grant_number, my_string))
#this imports the lists into a csv file with two columns, grant number on left, publication data on right

Python tells me this:

line 26, in my_string[int(x)] = soup_string[(soup_string.index(start)+len(start)):soup_string.index(end)] IndexError: list assignment index out of range

How can I fix this?

Best answer

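The problem is that my_string[int(x)] tries to assign to position int(x) of my_string (for example 901289, since each x is a grant number string from your grant_number list), but my_string is an empty list, so no such index exists. That is what "list assignment index out of range" means.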
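The failure can be reproduced in isolation. This is a minimal sketch with a made-up placeholder value rather than real page text: assigning to any index of an empty list raises the same IndexError, while append grows the list by one element.

```python
my_string = []  # starts empty, as in the question

try:
    # An empty list has no index 901289, so item assignment fails.
    my_string[901289] = "publication text"  # placeholder value
except IndexError as exc:
    error_message = str(exc)

# append() adds a new element at the end instead of overwriting an existing one.
my_string.append("publication text")
```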

You probably want to append to the initially empty list instead.

for x in grant_number:      # Number of pages plus one 
    url = "http://nsf.gov/awardsearch/showAward?AWD_ID={}".format(x)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    soup_string = str(soup)
    my_string.append(soup_string[(soup_string.index(start)+len(start)):soup_string.index(end)])
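Note that str.index raises ValueError when a marker is missing from a page, which would stop the loop partway through. As a hedged sketch (the helper names extract_between and rows_to_csv and the sample page contents are hypothetical, not part of the original code), the slicing can be moved into a function that returns None when either marker is absent, and the results can be paired with their grant numbers before being written out:

```python
import csv
import io


def extract_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return None
    return text[begin:stop]


def rows_to_csv(rows):
    """Serialize (grant_number, publication_text) pairs to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Hypothetical page contents standing in for the downloaded HTML.
pages = {
    "0901289": "header this site PUBLICATION A Please report errors footer",
    "0901282": "page without the expected markers",
}

# Pages where a marker is missing get an empty second column.
rows = [
    (number, extract_between(html, "this site", "Please report errors") or "")
    for number, html in pages.items()
]
csv_text = rows_to_csv(rows)
```

The question's code targets Python 2 (itertools.izip, file mode 'wb'). On Python 3 the built-in zip replaces izip, and the output file should be opened with open('NSF.csv', 'w', newline='') before passing it to csv.writer.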

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/37645988/
