Python: HTML processing for specific content with Beautiful Soup

Tags: python, html, parsing, beautifulsoup

I've decided to parse content from a website, for example: http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx

I want to parse the ingredients out into a text file. The ingredients are located in:

<div class="ingredients" style="margin-top: 10px;">

and within it, each ingredient is stored in:

<li class="plaincharacterwrap">

Someone was kind enough to provide code that uses regular expressions, but that gets messy when you modify it from one site to another. So I'd like to use Beautiful Soup instead, since it has a lot of built-in features, except I'm confused about how to actually do it.

Code:

import re
import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString
html = urllib2.urlopen("http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx")
soup = BeautifulSoup(html)

try:
    ingrdiv = soup.find('div', attrs={'class': 'ingredients'})
except IOError:
    print 'IO error'

Is this how you would start? I'd like to find the actual div class and then parse out all of the ingredients located within the li class.

Any help would be appreciated! Thanks!

Best Answer

import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    # find the ingredients <div>, then take the text of every <li> inside it
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]

    # write one ingredient per line
    fname = 'PorkChopsRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()

Result:

1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste



Follow-up reply to @eyquem:

from time import clock
import urllib
import re
import BeautifulSoup
import lxml.html

start = clock()
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
print "Loading took", (clock()-start), "s"

# by regex
start = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
print "Regex parse took", (clock()-start), "s"

# by BeautifulSoup
start = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
res2 = '\n'.join(s.getText().strip() for s in ingreds.findAll('li'))
print "BeautifulSoup parse took", (clock()-start), "s  - same =", (res2==res1)

# by lxml
start = clock()
lx = lxml.html.fromstring(data)
ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
res3 = '\n'.join(s.strip() for s in ingreds)
print "lxml parse took", (clock()-start), "s  - same =", (res3==res1)

gives:

Loading took 1.09091222621 s
Regex parse took 0.000432703726233 s
BeautifulSoup parse took 0.28126133314 s  - same = True
lxml parse took 0.0100940499505 s  - same = True

The regex is much faster (except when it's wrong); but if you consider loading the page and parsing it together, BeautifulSoup still only accounts for about 20% of the total runtime. If you really care about speed, I'd recommend lxml.
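
Putting that recommendation together with the file-writing step from the accepted answer, a minimal lxml-based sketch (same XPath as in the benchmark above; the save_ingredients name is just for illustration) could look like:

import urllib
import lxml.html

def save_ingredients(url, fname):
    # fetch the page and parse it with lxml's HTML parser
    data = urllib.urlopen(url).read()
    lx = lxml.html.fromstring(data)
    # same XPath as in the benchmark: the text of every <li> under the ingredients div
    ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
    with open(fname, 'w') as outf:
        outf.write('\n'.join(s.strip() for s in ingreds))

save_ingredients('http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx',
                 'PorkChopsRecipe.txt')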

A similar question about Python HTML processing for specific content with Beautiful Soup can be found on Stack Overflow: https://stackoverflow.com/questions/5615647/
