I am learning how to parse and manipulate HTML with Beautiful Soup, like this:
from lxml.html import parse
import urllib2
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
url = 'some-url-here'
req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
parsed = urllib2.urlopen( req )
soup = BeautifulSoup(parsed)
for elem in soup.findAll(['script', 'style', 'i']):
    elem.extract()
for main_body in soup.findAll("div", {"role" : "main"}):
    print main_body.getText(separator=u' ')
The output still contains <i> tags, and I don't know how to get rid of them. How can I do that, and why doesn't the code above remove those tags?
Best answer
The actual problem is that you are using the deprecated BeautifulSoup 3. Install bs4 and everything works:
In [10]: import urllib2
In [11]: from bs4 import BeautifulSoup # bs4
In [12]: url = 'https://www.gwr.com/'
In [13]: req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
In [14]: parsed = urllib2.urlopen(req)
In [15]: soup = BeautifulSoup(parsed,"html.parser")
In [16]: tags = soup.find_all(['script','style','i'])
In [17]: print(len(tags))
25
In [18]: for elem in tags:
   ....:     elem.extract()
   ....:
In [19]: assert len(soup.find_all(['script','style','i'])) == 0
In [20]:
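For anyone who wants to check the behaviour without fetching a live page, here is a minimal sketch of the same approach, assuming Python 3 and the bs4 package. The SAMPLE_HTML string and the main_text helper are hypothetical names made up for this illustration; they are not part of the original question or answer. It uses decompose() rather than extract(), which should be equivalent here since the removed elements are not reused.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the fetched page (an assumption,
# not the real content of any URL mentioned above).
SAMPLE_HTML = """
<html><head><style>p { color: red; }</style></head>
<body>
<div role="main">
  <p>Hello <i>hidden</i> world</p>
  <script>var x = 1;</script>
</div>
</body></html>
"""


def main_text(html):
    """Return the text of each <div role="main"> with script/style/i removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove the unwanted elements together with their contents.
    for elem in soup.find_all(["script", "style", "i"]):
        elem.decompose()
    # Collect the remaining text of every <div role="main">.
    return [
        div.get_text(separator=" ", strip=True)
        for div in soup.find_all("div", {"role": "main"})
    ]


texts = main_text(SAMPLE_HTML)
print(texts)
```

Note that this removes the text inside the <i> elements as well. If the goal is to drop only the tags and keep their text, bs4's unwrap() method may be a better fit than extract() or decompose().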
For more on "python - BeautifulSoup does not remove i elements", see the related question on Stack Overflow: https://stackoverflow.com/questions/38684392/