python - Retrieving a dynamic website using webdriver, python, and beautifulsoup

Tags: python selenium selenium-webdriver webdriver beautifulsoup

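I am trying to change the default sort order (most relevant) to "newest", but something goes wrong and I cannot save the page. Any suggestions?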

#start webdriver to open the given product page via chrome browser
driver =webdriver.Chrome()
driver.get('http://www.homedepot.com/p/Husky-41-in-16-Drawer-Tool-Chest-and-Cabinet-Set-HOTC4016B1QES/205080371')
time.sleep(2)


#find the drop down list, then select the newest option and click  
m=driver.find_element_by_id("BVRRDisplayContentSelectBVFrameID")
m.find_element_by_xpath("//option[@value='http://homedepot.ugc.bazaarvoice.com/1999m/205080371/reviews.djs?format=embeddedhtml&sort=submissionTime']").click()
time.sleep(2)

#save the search result into the python
html = driver.page_source
file_object = open("samplereview.txt", "a")
file_object.write(str(html))
file_object.close( )
time.sleep(2)

soup=BeautifulSoup(html)

#quit from driver
driver.quit()

Best answer

You are missing two key Selenium-specific things:

  • Don't use time.sleep(); use Waits
  • Use the Select class when working with select/option elements; it provides a very nice abstraction
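
To make the first point concrete, here is a rough, standard-library-only sketch of the idea behind an explicit wait: poll a condition until it holds or a timeout expires, instead of pausing for a fixed time. The name `wait_until` and its parameters are hypothetical and chosen for illustration; this is not Selenium's implementation, and in real code WebDriverWait is the tool to use.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without the condition becoming truthy.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        # Sleep only briefly between checks, so the wait ends as soon
        # as the condition holds rather than after a fixed delay.
        time.sleep(poll_interval)
```

Unlike a fixed time.sleep(2), this returns as soon as the condition is satisfied and fails loudly if it never is.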

Here is the modified code (note how readable and short it is):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select

driver = webdriver.Chrome()
driver.get('http://www.homedepot.com/p/Husky-41-in-16-Drawer-Tool-Chest-and-Cabinet-Set-HOTC4016B1QES/205080371')

# waiting until reviews are loaded
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'BVRRDisplayContentSelectBVFrameID'))
)

select = Select(element)
select.select_by_visible_text('Newest')

Now I see the reviews sorted from newest to oldest.

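To parse the reviews, you don't necessarily need to pass the page source to BeautifulSoup for further processing; selenium itself is quite powerful at locating elements: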

reviews = []
# find_element(s)(By.XPATH, ...) replaces the find_element(s)_by_xpath helpers,
# which were removed in newer Selenium releases; By is imported above
for review in driver.find_elements(By.XPATH, '//span[@itemprop="review"]'):
    name = review.find_element(By.XPATH, './/span[@itemprop="name"]').text.strip()
    stars = review.find_element(By.XPATH, './/span[@itemprop="ratingValue"]').text.strip()
    description = review.find_element(By.XPATH, './/div[@itemprop="description"]').text.strip()

    reviews.append({
        'name': name,
        'stars': stars,
        'description': description
    })

print(reviews)

Prints:

[
    {'description': u'Very durable product. Worth the money. My husband loves it',
     'name': u'Excellent product',
     'stars': u'5.0'},

    {'description': u'I now have all my tools in one well organized box instead of several boxes and have a handy charging station for cordless tools on the top . Money well spent. Solid box!',
     'name': u'Great!',
     'stars': u'5.0'},

    ...
]
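
On the original problem of saving the page, here is a minimal sketch, assuming the failure came from writing the page source with `str(html)`. The question does not show the actual error, so this is a guess: on Python 2, `str()` on a unicode string containing non-ASCII characters raises UnicodeEncodeError. The helper name `save_page_source` is hypothetical. It writes the text with an explicit UTF-8 encoding, and uses mode "w" so repeated runs do not append duplicate copies.

```python
def save_page_source(driver, path):
    """Write driver.page_source to `path` as UTF-8 and return the HTML text.

    `driver` is assumed to be any object exposing a `page_source`
    attribute, such as a Selenium WebDriver instance.
    """
    html = driver.page_source  # Selenium returns text, not bytes
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html)
    return html
```

Called after the sort option has been selected, for example `html = save_page_source(driver, "samplereview.txt")`, the returned string can still be passed to `BeautifulSoup(html, "html.parser")` if that is preferred over Selenium's own locators.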

Regarding python - Retrieving a dynamic website using webdriver, python, and beautifulsoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27717411/
