python - Unable to get the actual markup from a page using BeautifulSoup

Tags: python selenium python-3.x web-scraping beautifulsoup

I am trying to scrape this URL using a combination of BeautifulSoup and Selenium:
http://starwood.ugc.bazaarvoice.com/3523si-en_us/115/reviews.djs?format=embeddedhtml&page=2&scrollToTop=true

I tried this code:

active_review_page_html = browser.page_source
active_review_page_html = active_review_page_html.replace('\\', "")
hotel_page_soup = BeautifulSoup(active_review_page_html)
print(hotel_page_soup)

But the data it returns looks like this:

;<span class="BVRRReviewText">Hotel accommodations and staff were fine ....

But I need to scrape that span from the page:

for review_div in hotel_page_soup.select("span .BVRRReviewText"):

How can I get the real markup from that URL?

Best answer

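First, the link you gave us is the wrong one: instead of the actual page you are trying to scrape, you gave us a link to a JS file that takes part in loading the page, and parsing that would be an unnecessary challenge.

Second, you don't need BeautifulSoup in this case. Selenium itself is good at locating elements and extracting their text or attributes, so no extra step is needed here.

Here is a working example that uses the actual page containing the reviews you want to get: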

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.get('http://www.starwoodhotels.com/sheraton/property/reviews/index.html?propertyID=115&language=en_US')

# wait for the reviews to load
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "span.BVRRReviewText")))

# get reviews
for review_div in driver.find_elements(By.CSS_SELECTOR, "span.BVRRReviewText"):
    print(review_div.text)
    print("---")

driver.quit()  # end the WebDriver session and close the browser

Prints:

This is not a low budget hotel . Yet the hotel offers no amenities. Nothing and no WiFi. In fact, you block the wifi that comes with my celluar plan. I am a part of 2 groups that are loyal to the Sheraton, Alabama A&M and the 9th Episcopal District AMEChurch but the Sheraton is not loyal to us.
---
We are a company that had (5) guest rooms at the hotel. Despite having a credit card on file for room and tax charges, my guest was charged the entire amount to her personal credit card. It has taken me (5) PHONE CALLS and my own time and energy to get this bill reversed. I guess leaving a message with information and a phone number numerous times is IGNORED at this hotel. You can guarantee that we will not return with our business. YOu may thank Kimerlin or Kimberly in your accounting office for her lack of personal service and follow through for the lost business in the future.
---
...

I have intentionally left pagination for you to handle; let me know if you get stuck.
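
One more note on the selector from the question, in case you still want to parse markup with BeautifulSoup. `span .BVRRReviewText` (with a space) is a descendant selector: it looks for elements with the class `BVRRReviewText` nested inside a `span`. `span.BVRRReviewText` (no space) matches the `span` that carries the class itself, which is what the Selenium example above uses. Below is a minimal sketch on a made-up HTML snippet; the sample markup and variable names are assumptions for illustration, not the real page:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page is far larger.
sample_html = """
<div>
  <span class="BVRRReviewText">Hotel accommodations and staff were fine.</span>
  <span class="BVRRReviewText">Great location.</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Descendant selector: elements with the class *inside* a span.
# The spans here have no child elements, so nothing matches.
with_space = soup.select("span .BVRRReviewText")

# Compound selector: span elements that themselves have the class.
without_space = soup.select("span.BVRRReviewText")

review_texts = [span.get_text(strip=True) for span in without_space]
print(len(with_space), review_texts)
```

On markup shaped like this, the selector with the space finds nothing, while the one without it returns both review spans.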

Regarding "python - Unable to get the actual markup from a page using BeautifulSoup", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/27134612/
