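python - Extracting text from multiple web pages (URLs in a text file)

(Environment: Python 2.7 + BeautifulSoup 4.3.2)

I am using Python and BeautifulSoup to fetch the news headlines from this web page and the pages that follow it. I don't know how to make it follow the next pages automatically, so I put all the URLs into a text file, web list.txt.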

http://www.legaldaily.com.cn/locality/node_32245.htm
http://www.legaldaily.com.cn/locality/node_32245_2.htm
http://www.legaldaily.com.cn/locality/node_32245_3.htm

...

Here is what I have so far:

from bs4 import BeautifulSoup
import re
import urllib2
import urllib


list_open = open("web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")


i = 0
while i < len(line_in_list):
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html')
    news_list = soup.find_all(attrs={'class': "f14 blue001"})
    for news in news_list:
        print news.getText()
i + = 1

It fails with an error message saying the syntax is invalid.

What went wrong?

Best answer

i + = 1

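This is invalid syntax.

If you want to use the augmented assignment operator +=, there must be no space between the plus sign and the equals sign.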

i += 1
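
To see the difference without running the whole script, here is a minimal sketch that asks Python to compile both spellings. The helper name is_valid_syntax is hypothetical and only serves this illustration.

```python
# Hypothetical helper: report whether Python can compile a piece of source.
def is_valid_syntax(source):
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

print(is_valid_syntax("i = 0\ni += 1"))   # True: += is a single operator
print(is_valid_syntax("i = 0\ni + = 1"))  # False: "+ =" is two separate tokens
```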

The next error you will get is:

NameError: name 'url' is not defined

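That happens because url is never defined before it is used on the soup = line. You can fix this by iterating over the list of URLs directly instead of incrementing i.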

for url in line_in_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html')
    news_list = soup.find_all(attrs={'class': "f14 blue001"})
    for news in news_list:
        print news.getText()
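
One more thing the loop may trip over, assuming web list.txt ends with a newline or contains blank lines: split("\n") then yields empty strings, and passing an empty string to urllib2.urlopen would likely fail. The sketch below filters those out before the loop. The helper name clean_url_lines and the sample text are assumptions made for illustration, not part of the original script.

```python
# Hypothetical helper: keep only non-empty lines, with surrounding
# whitespace removed, so every item handed to the loop is a real URL.
def clean_url_lines(text):
    return [line.strip() for line in text.splitlines() if line.strip()]

# Sample text standing in for the contents of "web list.txt" (assumed layout:
# one URL per line, possibly followed by blank lines).
sample = (
    "http://www.legaldaily.com.cn/locality/node_32245.htm\n"
    "http://www.legaldaily.com.cn/locality/node_32245_2.htm\n"
    "\n"
)

urls = clean_url_lines(sample)
print(len(urls))  # 2
```

With this in place, the loop header would become for url in clean_url_lines(read_list): and the body would stay unchanged.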

This question, python - Extracting text from multiple web pages (URLs in a text file), comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/21274283/
