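python - Reading an HTML URL line by line in Python

I want to apply string operations to a web page, the same way I process an ordinary local file line by line: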

# Open the local copy and let the context manager close it afterwards.
with open("my_file.html", "r") as ins:
    for line in ins:
        if "/html/" in line and "thumbs" in line:
            print(line)

However, when I fetch the web page directly and decode it as UTF-8, I can no longer parse it line by line. Here is my code:

import urllib.request

fp = urllib.request.urlopen(base + ".html")
mystr = fp.read()
mystr = mystr.decode("utf-8")

for line in mystr:
    if "/html/" in line and "thumbs" in line:
        print(line)

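So what am I doing wrong here? Is it the way I decode the page after receiving it, the library I am using, the way I use strings, or something else?

Here is the output of cat my_file.html | head: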

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>

<script type="text/javascript">

  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-4477008-1']);
  _gaq.push(['_trackPageview']);

  (function() {

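Best answer

So what am I doing wrong here?

Iterating over a file object gives you the lines of the file, but iterating over a string gives you its individual characters (as strings of length 1).

You need to split the string back into lines yourself, for example with .splitlines().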
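The difference can be seen in a small sketch. It uses io.StringIO as a stand-in for an open file, and the sample text is invented for illustration:

```python
import io

text = "first line\nsecond line\n"

# Iterating over a str yields one-character strings.
chars = [c for c in text]

# Iterating over a file-like object yields whole lines (newline included).
lines = [line for line in io.StringIO(text)]

print(chars[:5])
print(lines)
```

This is why the loop in the question never matches: each "line" it sees is a single character, which cannot contain "/html/".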
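A minimal sketch of that fix, assuming the page is UTF-8 encoded. The helper name matching_lines and the sample bytes are hypothetical; the sample stands in for the result of fp.read() so the snippet runs without network access:

```python
def matching_lines(html_bytes, encoding="utf-8"):
    """Decode a downloaded page and return the lines containing both markers."""
    text = html_bytes.decode(encoding)
    return [
        line
        for line in text.splitlines()  # split the decoded string into lines
        if "/html/" in line and "thumbs" in line
    ]


# Made-up stand-in for the bytes returned by fp.read().
sample = (
    b"<html>\n"
    b'<a href="/html/a.html"><img src="thumbs/a.jpg"></a>\n'
    b'<a href="/other/b.html">b</a>\n'
    b"</html>\n"
)

for line in matching_lines(sample):
    print(line)
```

With a real page, the bytes from urllib.request.urlopen(base + ".html").read() could be passed to the helper in place of sample.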

A similar question about reading an HTML URL line by line in Python can be found on Stack Overflow: https://stackoverflow.com/questions/58529361/
