python - How to get text from HTML in Python

I want to capture some text from an HTML page using Python. For example:

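#!/usr/bin/env python3
from urllib.request import urlopen

# Fetch the page; avoid naming variables open/read, which shadows built-ins
response = urlopen('http://localhost/main.php')
# The page declares charset=windows-1252 in its <meta> tag
html = response.read().decode('windows-1252')
print(html)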

And this is the source of the target URL:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
<title>Untitled Document</title>
</head>

<body>
This is body!
</body>
</html>

What if I only want to get the text "This is body!" and nothing else? Please help me solve this!

As another example, suppose the HTML is replaced with this:

<table width=90% align=center>
  <tr>
    <td>The information available on this site freely accessible to the public</td>
  </tr>
</table>
<table class=adminlist border=0 width=90% cellpadding=3 cellspacing=0 align=center>
  <tr>
    <td rowspan=5 colspan=2><img src=images/Forum.png><br></td>
  </tr>
  <tr>
    <td><i><b>Phone</b></td><td>: +61 2 4446 5552</td>
  </tr>
  <tr>
    <td><i><b>Name</b></td><td>: Student</td>
  </tr>
  <tr>
    <td><i><b>Class</b></td>
    <td>: Summer</td>
  </tr>
  <tr>
    <td><i><b>Email</b></td>
    <td>: student@localhost.com</td>
  </tr>
</table>

I want to produce this output:

Phone : +61 2 4446 5552
Name  : Student
Class : Summer
Email : student@localhost.com

That is, capture only the core content of the HTML. :)

Best answer

Try Beautiful Soup.

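from bs4 import BeautifulSoup

...

# html holds the page source as a string
soup = BeautifulSoup(html, "html.parser")
# find() returns the first <body> tag; findAll() returns a list, which has no text accessor
print(soup.find("body").get_text(strip=True))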
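For the table variant of the question, one option that needs no third-party package is the standard library's html.parser module. This is a swapped-in technique, not what the answer proposes. The following is a minimal sketch, assuming every wanted row has exactly two cells and the second one starts with a colon, as in the question's HTML. The names CellTextParser, extract_pairs, format_pairs, and SAMPLE_HTML are made up for this illustration.

```python
from html.parser import HTMLParser

# Abbreviated copy of the table from the question (unquoted attributes kept as-is).
SAMPLE_HTML = """
<table class=adminlist border=0 width=90% cellpadding=3 cellspacing=0 align=center>
  <tr>
    <td rowspan=5 colspan=2><img src=images/Forum.png><br></td>
  </tr>
  <tr>
    <td><i><b>Phone</b></td><td>: +61 2 4446 5552</td>
  </tr>
  <tr>
    <td><i><b>Name</b></td><td>: Student</td>
  </tr>
  <tr>
    <td><i><b>Class</b></td>
    <td>: Summer</td>
  </tr>
  <tr>
    <td><i><b>Email</b></td>
    <td>: student@localhost.com</td>
  </tr>
</table>
"""


class CellTextParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False   # True while the parser is inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])            # a new row begins
        elif tag == "td":
            if not self.rows:               # tolerate a <td> with no <tr>
                self.rows.append([])
            self.rows[-1].append("")        # a new, empty cell begins
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data       # append text to the current cell


def extract_pairs(html):
    """Return (label, value) tuples for rows shaped like 'Label' / ': value'."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    pairs = []
    for row in parser.rows:
        cells = [cell.strip() for cell in row]
        if len(cells) == 2 and cells[1].startswith(":"):
            pairs.append((cells[0], cells[1][1:].strip()))
    return pairs


def format_pairs(pairs):
    """Left-align the labels so the colons line up."""
    width = max((len(label) for label, _ in pairs), default=0)
    return "\n".join(f"{label:<{width}} : {value}" for label, value in pairs)


print(format_pairs(extract_pairs(SAMPLE_HTML)))
```

Run on the sample, this prints the four "Phone", "Name", "Class", and "Email" lines in the layout the question asks for. Rows that do not match the two-cell pattern, such as the image row, are skipped rather than reported.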

Regarding python - how to get text from HTML in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/7673007/
