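I have downloaded a web page into an HTML file. What is the simplest way to get the content of that page? By content, I mean the strings that a browser would display.
To be clear:
Input: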
<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Output:
Page title This is paragraph one. This is paragraph two.
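Putting it together:
import re

# BeautifulSoup 4; the legacy BeautifulSoup 3 import was `from BeautifulSoup import BeautifulSoup`
from bs4 import BeautifulSoup

def removeHtmlTags(page):
    # Regex approach: delete anything that looks like a tag, tolerating quoted attribute values
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    # Parser approach: join every text node of the parsed document
    soup = BeautifulSoup(page, 'html.parser')
    return ''.join(soup.find_all(string=True))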
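For comparison with the two functions above, here is a minimal sketch that uses only the standard library's `html.parser` module. The names `TextExtractor` and `html_to_text` are hypothetical, chosen for illustration. Skipping `script` and `style` bodies and collapsing whitespace are assumptions about what "the strings a browser would display" should mean; this is not a full rendering model.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the bodies of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes as-is, then collapse whitespace runs to single spaces
        return " ".join("".join(self._chunks).split())


def html_to_text(page):
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return parser.text()


sample = """<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>"""

print(html_to_text(sample))
```

On the sample input from the question this should print the requested output line. Note that text nodes are concatenated without a separator, so adjacent elements with no whitespace between them in the source (for example `</title><p>`) would run together; handling block-level elements properly would need extra logic.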
Related
- Python HTML removal
- Extracting text from HTML file using Python
- What is a light python library that can eliminate HTML tags? (and only text)
- Remove HTML tags in AppEngine Python Env (equivalent to Ruby’s Sanitize)
- RegEx match open tags except XHTML self-contained tags (the famous "don't use regex to parse HTML" rant)
Source: a similar question on Stack Overflow, "python - How to get the content of an HTML page in Python": https://stackoverflow.com/questions/2416823/