python - Unable to get the actual markup from a page using BeautifulSoup

I am trying to scrape this URL using BeautifulSoup together with Selenium:

http://starwood.ugc.bazaarvoice.com/3523si-en_us/115/reviews.djs?format=embeddedhtml&page=2&scrollToTop=true

I tried this code:

active_review_page_html  = browser.page_source
active_review_page_html = active_review_page_html.replace('\\', "")
hotel_page_soup = BeautifulSoup(active_review_page_html)
print(hotel_page_soup)

But the data it returns looks like this:

;&lt;span class="BVRRReviewText"&gt;Hotel accommodations and staff were fine ....

But I need to scrape that span from the page:

for review_div in hotel_page_soup.select("span .BVRRReviewText"):

How can I get the real markup from that URL?

Best answer

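First, the link you gave us is the wrong one. It is not the actual page you are trying to scrape but a js file that takes part in loading that page, and parsing it would be an unnecessary challenge.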
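
To illustrate why selecting the span finds nothing in that output, here is a minimal sketch using only the standard library. It assumes the escaped output in the question is representative: markup written as `&lt;span ...&gt;` is plain text to an HTML parser, not a tag. The `escaped` string is a hypothetical fragment shaped like that output, and `ReviewTextParser` and `extract_reviews` are illustrative names, not part of BeautifulSoup or selenium.

```python
from html import unescape
from html.parser import HTMLParser

# Hypothetical fragment, shaped like the escaped output shown in the question.
escaped = (
    '&lt;span class="BVRRReviewText"&gt;'
    "Hotel accommodations and staff were fine"
    "&lt;/span&gt;"
)


class ReviewTextParser(HTMLParser):
    """Collects the text inside <span class="BVRRReviewText"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a matching span
        self.reviews = []   # one string per matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "BVRRReviewText" in classes:
            if self.depth == 0:
                self.reviews.append("")
            self.depth += 1  # also track spans nested inside a match

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.reviews[-1] += data


def extract_reviews(markup):
    parser = ReviewTextParser()
    parser.feed(markup)
    parser.close()
    return parser.reviews


# Entities are treated as text, so no span element is found.
print(extract_reviews(escaped))
# After unescaping, the span is a real tag and its text is collected.
print(extract_reviews(unescape(escaped)))
```

Unescaping would only be the first hurdle with the real js file, which wraps the markup in JavaScript as well. That is why loading the actual page is the simpler route.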

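Second, you don't need BeautifulSoup in this case. selenium itself is good at locating elements and extracting their text or attributes, so no extra step is needed here.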

Here is a working example that uses the actual page containing the reviews you want to get:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.get('http://www.starwoodhotels.com/sheraton/property/reviews/index.html?propertyID=115&language=en_US')

# wait for the reviews to load
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "span.BVRRReviewText")))

# get reviews
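# find_elements_by_css_selector was removed in Selenium 4; find_elements(By.CSS_SELECTOR, ...) works in 3 and 4
for review_div in driver.find_elements(By.CSS_SELECTOR, "span.BVRRReviewText"):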
    print(review_div.text)
    print("---")

driver.close()

Prints:

This is not a low budget hotel . Yet the hotel offers no amenities. Nothing and no WiFi. In fact, you block the wifi that comes with my celluar plan. I am a part of 2 groups that are loyal to the Sheraton, Alabama A&M and the 9th Episcopal District AMEChurch but the Sheraton is not loyal to us.
---
We are a company that had (5) guest rooms at the hotel. Despite having a credit card on file for room and tax charges, my guest was charged the entire amount to her personal credit card. It has taken me (5) PHONE CALLS and my own time and energy to get this bill reversed. I guess leaving a message with information and a phone number numerous times is IGNORED at this hotel. You can guarantee that we will not return with our business. YOu may thank Kimerlin or Kimberly in your accounting office for her lack of personal service and follow through for the lost business in the future.
---
...

I am deliberately leaving pagination for you to handle. Let me know if you run into difficulties.

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/27134612/
