I want to scrape the following page: http://www.interzum.com/exhibitors-and-products/exhibitor-index/exhibitor-index-15.php
I want to loop through every exhibitor link and collect the contact details, then repeat this across all 77 pages.
I can extract the information I need from a single page, but when I build the function and the loop I keep getting errors, and I can't find a simple structure for looping over multiple pages.
Here is what I have so far in a Jupyter notebook:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
import time
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'http://www.interzum.com/exhibitors-and-products/exhibitor-index/exhibitor-index-15.php'
text = requests.get(url).text
page1 = BeautifulSoup(text, "html.parser")
def get_data(url):
    text = requests.get(url).text
    page2 = BeautifulSoup(text, "html.parser")
    title = page2.find('h1', attrs={'class':'hl_2'}).getText()
    content = page2.find('div', attrs={'class':'content'}).getText()
    phone = page2.find('div', attrs={'class':'sico ico_phone'}).getText()
    email = page2.find('a', attrs={'class':'sico ico_email'}).getText
    webpage = page2.find('a', attrs={'class':'sico ico_link'}).getText
    data = {'Name': [title],
            'Address': [content],
            'Phone number': [phone],
            'Email': [email],
            'Web': [web]
            }

df = pd.DataFrame()
for a in page1.findAll('a', attrs={'class':'initial_noline'}):
    df2 = get_data(a['href'])
    df = pd.concat([df, df2])
AttributeError: 'NoneType' object has no attribute 'getText'
I know the errors I keep getting are because I'm struggling with the syntax of the function and the loop.
What is the recommended structure?
Best Answer
Here is a version with some debugging applied.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'http://www.interzum.com/exhibitors-and-products/exhibitor-index/exhibitor-index-15.php'
text = requests.get(url).text
page1 = BeautifulSoup(text, "html.parser")
def get_data(url):
    text = requests.get(url).text
    page2 = BeautifulSoup(text, "html.parser")

    def text_of(tag):
        # .find() returns None when an element is missing, which caused the
        # original AttributeError; fall back to an empty string instead.
        return tag.getText(strip=True) if tag else ''

    title = text_of(page2.find('h1', attrs={'class':'hl_2'}))
    content = text_of(page2.find('div', attrs={'class':'content'}))
    phone = text_of(page2.find('div', attrs={'class':'sico ico_phone'}))
    email = text_of(page2.find('div', attrs={'class':'sico ico_email'}))
    webpage = text_of(page2.find('div', attrs={'class':'sico ico_link'}))
    return [title, content, phone, email, webpage]

rows = []
for a in page1.findAll('a', attrs={'class':'initial_noline'}):
    if 'kid=' not in a['href']:
        continue  # skip links that are not exhibitor detail pages
    print('http://www.interzum.com' + a['href'])
    rows.append(get_data('http://www.interzum.com' + a['href']))

# DataFrame.append does not modify df in place (and is gone in modern
# pandas); collect the rows first and build the frame once.
df = pd.DataFrame(rows, columns=['Name', 'Address', 'Phone number', 'Email', 'Web'])
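The code above still walks only the single index page loaded into `page1`; the question also asks how to repeat this across all 77 index pages. A minimal sketch of that outer loop, where `page_urls` and `crawl_all` are hypothetical helpers and the `page=` query parameter is an assumption (inspect the site's pagination links to find the real URL pattern), with `get_data` being the per-exhibitor function from the answer:

```python
# Sketch of the multi-page loop. The page-number pattern is an
# assumption; check the "next page" links on the real site first.
BASE = 'http://www.interzum.com/exhibitors-and-products/exhibitor-index/'

def page_urls(n_pages, pattern='exhibitor-index-15.php?page={}'):
    # One URL per index page, numbered 1..n_pages.
    return [BASE + pattern.format(i) for i in range(1, n_pages + 1)]

def crawl_all(get_data, n_pages=77):
    # Imports are local so the helpers above stay importable without
    # network access; `get_data` is the per-exhibitor scraper.
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    rows = []
    for url in page_urls(n_pages):
        index = BeautifulSoup(requests.get(url).text, "html.parser")
        for a in index.findAll('a', attrs={'class': 'initial_noline'}):
            if 'kid=' not in a['href']:
                continue  # skip links that are not exhibitor detail pages
            rows.append(get_data('http://www.interzum.com' + a['href']))
    return pd.DataFrame(
        rows,
        columns=['Name', 'Address', 'Phone number', 'Email', 'Web'])
```

Building the frame once from a list of rows also avoids the quadratic cost of concatenating inside the loop.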
On python - BeautifulSoup - getting text from multiple pages, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55090549/