python - BeautifulSoup crawling and extracting text from between <br>

Tags: python html python-2.7 beautifulsoup

My HTML looks like this:

 <br><a href="/drink12xy569.html">Alien Suicide</a>
 <br><a href="/drink792.html">All Jacked Up</a>
 <br><a href="/drink3805.html">All Night Hunter</a>
 <br><a href="/drink796.html">Alley Shooter</a>
 <br><a href="/drink10013.html">Alligator Sperm</a>
 <br><a href="/drink804.html">Almond Delight</a>
 <br><a href="/drink11135.html">Almond Gravy</a>
 <br><a href="/drink7519.html">Almond Joy #2</a>
 <br><a href="/drinks1r2563.html">Almond Kiss</a>
 <br><a href="/drink12xy578.html">Amaretto Pie</a>
 <br><a href="/drink11144.html">Amaretto Sourball</a>
 <br><a href="/drinkp15q144.html">Ambuco Cinnamon Shooter</a>
 <br><a href="/drink835.html">Amenie Mama</a>
 <br><a href="/drink7521.html">American Death</a>

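I need help extracting the titles between the <br> tags and printing them. After that, I need help writing this information, together with other information I have already extracted, to a text document that I can search through a GUI. I have separate pieces of code that I can combine at the end; I just need help with the concept.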

My BeautifulSoup crawl looks like this:

import urllib2
from bs4 import BeautifulSoup
url=[]
for i in range(28):
    url="http://www.drinksmixer.com/cat/3/"
    page = urllib2.urlopen("http://www.drinksmixer.com/cat/3/")
    soup = BeautifulSoup(page.read())
    links=soup.find_all('a')

for link in links:
    if "drink" in link ['href']:
        print link['href']
        print "****\n\n"
        url="http://drinksmixer.com"+link['href']
        page1=urllib2.urlopen(url)
        soup1=BeautifulSoup(page1.read())
        divs=soup1.find('div', {"class":"ingredients"})
        print divs.text.encode("utf-8")

My GUI looks like this:

import Tkinter
from Tkinter import *

def show_entry_fields():
   print("Shot Name: %s" % (e1.get()))

master = Tk()
Label(master, text="Shot Name").grid(row=0)

e1 = Entry(master)

e1.grid(row=0, column=1)

Button(master, text='Search', command=show_entry_fields).grid(row=3, column=1, sticky=W, pady=4)

mainloop( )

I just need help implementing a search over the information I have extracted.

Best answer

Designing a UI is not easy. Your code is almost fine. I split it into functions and added the basic search you asked for.

import urllib2
from bs4 import BeautifulSoup
import Tkinter
from Tkinter import *

e1 = None
links = []

def get_drinks():
    global links
    for i in range(28):
        # str(i) avoids a TypeError from str + int; the per-page URL pattern is an assumption
        url="http://www.drinksmixer.com/cat/3/" + str(i)
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read(), "html.parser")
        # extend, not append, so links stays a flat list of <a> tags
        links.extend(soup.find_all('a'))

def get_recipe(drink_name):
    print drink_name
    for link in links:
        # .get() skips anchors that have no href instead of raising KeyError
        if "drink" in link.get('href', '') and drink_name in link.contents:
            #print link['href']
            print "****\n\n"
            url="http://drinksmixer.com"+link['href']
            page1=urllib2.urlopen(url)
            soup1=BeautifulSoup(page1.read(), "html.parser")
            divs=soup1.find('div', {"class":"ingredients"})
            recipe = divs.text.encode("utf-8")
            return recipe

def show_entry_fields():
    drink_name = e1.get()
    print("Shot Name: %s" % drink_name)
    recipe = get_recipe(drink_name)
    print recipe # or better yet, popup
    # tkMessageBox.showinfo(drink_name, recipe)

def main():
    global e1
    master = Tk()
    Label(master, text="Shot Name").grid(row=0)
    e1 = Entry(master)
    e1.grid(row=0, column=1)
    Button(master, text='Search', command=show_entry_fields).grid(row=3, column=1, sticky=W, pady=4)
    mainloop()

if __name__ == "__main__":
    get_drinks()
    main()
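
The search in get_recipe only finds a drink when the typed name matches the link text exactly. As a rough sketch of the other half of the question (writing the extracted titles to a text file and searching them), something like the following might work. The file layout (one tab-separated title and href per line) and the helper names save_drinks, load_drinks and search_drinks are my own assumptions, not part of the original code; the sample pairs are copied from the HTML in the question. It uses only the standard library and should run on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def save_drinks(drinks, path):
    """Write (title, href) pairs to a text file, one tab-separated pair per line."""
    with io.open(path, "w", encoding="utf-8") as handle:
        for title, href in drinks:
            handle.write(u"%s\t%s\n" % (title, href))


def load_drinks(path):
    """Read the file back into a list of (title, href) pairs."""
    drinks = []
    with io.open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip(u"\n")
            if line:
                # Split on the first tab only: title on the left, href on the right
                title, href = line.split(u"\t", 1)
                drinks.append((title, href))
    return drinks


def search_drinks(drinks, query):
    """Return every (title, href) whose title contains query, ignoring case."""
    needle = query.strip().lower()
    if not needle:
        return []
    return [(title, href) for title, href in drinks if needle in title.lower()]


# Sample pairs copied from the HTML in the question
SAMPLE_DRINKS = [
    (u"Almond Delight", u"/drink804.html"),
    (u"Almond Gravy", u"/drink11135.html"),
    (u"Almond Joy #2", u"/drink7519.html"),
    (u"American Death", u"/drink7521.html"),
]

if __name__ == "__main__":
    # Hypothetical file location, chosen only for this demo
    path = os.path.join(tempfile.gettempdir(), "drinks.txt")
    save_drinks(SAMPLE_DRINKS, path)
    for title, href in search_drinks(load_drinks(path), "almond"):
        print("%s -> %s" % (title, href))
```

In the GUI, show_entry_fields could call search_drinks(load_drinks(path), e1.get()) and pass each matching href on to the recipe lookup, so a partial name such as "almond" lists every matching drink.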

Regarding python - BeautifulSoup crawling and extracting text from between <br>, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34163303/
