Python: HTML processing for specific content with Beautiful Soup

Tags: python, html, parsing, beautifulsoup

I've decided to parse content from a website, for example: http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx

I want to parse the ingredients out into a text file. The ingredients are located in:

<div class="ingredients" style="margin-top: 10px;">

and within it, each ingredient is stored in:

<li class="plaincharacterwrap">

Someone was kind enough to provide code that uses regular expressions, but that gets messy when you modify it from one site to another. So I'd like to use Beautiful Soup instead, since it has a lot of built-in features, except I'm confused about how to actually do it.

Code:

import re
import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString
html = urllib2.urlopen("http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx")
soup = BeautifulSoup(html)

try:
    ingrdiv = soup.find('div', attrs={'class': 'ingredients'})
except IOError:
    print 'IO error'

Is this how you would start? I'd like to find the actual div class and then parse out all of the ingredients located within the li class.

Any help would be appreciated! Thanks!

Best Answer

import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    # find the ingredients <div>, then take the text of every <li> inside it
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]

    # write one ingredient per line
    fname = 'PorkChopsRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()

Result:

1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste



Follow-up reply to @eyquem:

from time import clock
import urllib
import re
import BeautifulSoup
import lxml.html

start = clock()
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
print "Loading took", (clock()-start), "s"

# by regex
start = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
print "Regex parse took", (clock()-start), "s"

# by BeautifulSoup
start = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
res2 = '\n'.join(s.getText().strip() for s in ingreds.findAll('li'))
print "BeautifulSoup parse took", (clock()-start), "s  - same =", (res2==res1)

# by lxml
start = clock()
lx = lxml.html.fromstring(data)
ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
res3 = '\n'.join(s.strip() for s in ingreds)
print "lxml parse took", (clock()-start), "s  - same =", (res3==res1)

gives:

Loading took 1.09091222621 s
Regex parse took 0.000432703726233 s
BeautifulSoup parse took 0.28126133314 s  - same = True
lxml parse took 0.0100940499505 s  - same = True

The regex is much faster (except when it's wrong); but if you consider loading the page and parsing it together, BeautifulSoup still only accounts for about 20% of the total runtime. If you really care about speed, I'd recommend lxml.
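
Putting that recommendation together with the file-writing step from the accepted answer, a minimal lxml-based sketch (same XPath as in the benchmark above; the save_ingredients name is just for illustration) could look like:

import urllib
import lxml.html

def save_ingredients(url, fname):
    # fetch the page and parse it with lxml's HTML parser
    data = urllib.urlopen(url).read()
    lx = lxml.html.fromstring(data)
    # same XPath as in the benchmark: the text of every <li> under the ingredients div
    ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
    with open(fname, 'w') as outf:
        outf.write('\n'.join(s.strip() for s in ingreds))

save_ingredients('http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx',
                 'PorkChopsRecipe.txt')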

A similar question about Python HTML processing for specific content with Beautiful Soup can be found on Stack Overflow: https://stackoverflow.com/questions/5615647/
