python - Scraping the content of multiple pages of a website with BeautifulSoup and Selenium

Tags: python, selenium, selenium-webdriver, beautifulsoup, screen-scraping

The website I want to scrape is:

http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061

I want to get the last page number of the link above so I can iterate over the pages; at the time of the screenshot it was 499.

(Screenshot: the last page number, which is what I currently get as my output.)

My code:

   from bs4 import BeautifulSoup
   from urllib.request import urlopen as uReq
   from selenium import webdriver
   import time
   from selenium.webdriver.common.by import By
   from selenium.webdriver.support.ui import WebDriverWait
   from selenium.webdriver.support import expected_conditions as EC
   from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

   firefox_capabilities = DesiredCapabilities.FIREFOX
   firefox_capabilities['marionette'] = True
   firefox_capabilities['binary'] = '/etc/firefox'

   driver = webdriver.Firefox(capabilities=firefox_capabilities)
   url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"

   driver.get(url)
   wait = WebDriverWait(driver, 10)
   soup = BeautifulSoup(driver.page_source, "lxml")
   containers = soup.findAll("ul", {"class": "pages table"})
   containers[0] = soup.findAll("li")
   li_len = len(containers[0])
   for item in soup.find("ul", {"class": "pages table"}):
       li_text = item.select("li")[li_len].text
       print("li_text : {}\n".format(li_text))
   driver.quit()

I need help finding the error in my code for getting the last page number. I would also appreciate an alternative solution and suggestions on how to achieve what I'm trying to do.
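For what it's worth, the snippet above has three problems: soup.findAll("li") on the line that reassigns containers[0] searches the whole document rather than just the pagination list; li_len indexes one position past the last element (valid indexes run 0 to li_len - 1); and iterating over soup.find("ul", ...) walks the tag's children, including whitespace text nodes, rather than its li elements. A minimal sketch of a corrected lookup, assuming the pagination ul is present in driver.page_source:

   soup = BeautifulSoup(driver.page_source, "lxml")
   container = soup.find("ul", {"class": "pages table"})
   all_li = container.findAll("li")
   # the last li in the pagination list holds the highest page number
   print("li_text : {}\n".format(all_li[-1].text))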

Best Answer

If you want to get the last page number of the link above, i.e. 499, so you can proceed, you can use either Selenium or Beautiful Soup as follows:


Selenium:

from selenium import webdriver

driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
driver.get(url)
# scroll the pagination block into view so it is rendered
element = driver.find_element_by_xpath("//div[@class='row pagination']//p/span[contains(.,'Reviews on Reliance Jio')]")
driver.execute_script("return arguments[0].scrollIntoView(true);", element)
# the last li in the pages list holds the highest page number
print(driver.find_element_by_xpath("//ul[@class='pagination table']/li/ul[@class='pages table']//li[last()]/a").get_attribute("innerHTML"))
driver.quit()

Console output:

499
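Note that the find_element_by_* helpers used above were removed in Selenium 4. A sketch of the same lookup with the current API, using the same XPath and assuming a recent Selenium that resolves geckodriver automatically via Selenium Manager:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()  # Selenium Manager locates geckodriver in Selenium 4.6+
driver.get("http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061")
# wait for the pagination list instead of scrolling manually
last_li = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, "//ul[@class='pagination table']/li/ul[@class='pages table']//li[last()]/a")
    )
)
print(last_li.get_attribute("innerHTML"))
driver.quit()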

Beautiful Soup:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
container = page_soup.find("ul", {"class": "pages table"})
all_li = container.findAll("li")
# the last li in the pagination list holds the highest page number
if all_li:
    content = all_li[-1].getText()
    print(content)

Console output:

499
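With the last page number in hand, the remaining pages can be fetched in a loop. A minimal sketch; the "-page-{n}" URL suffix below is a hypothetical pattern used only for illustration, so check the href of the site's actual pagination links before relying on it:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

base_url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
last_page = 499  # taken from the pagination lookup above

for n in range(1, last_page + 1):
    # hypothetical URL pattern -- verify against the real pagination hrefs
    url = base_url if n == 1 else "{}-page-{}".format(base_url, n)
    uClient = uReq(url)
    page_soup = soup(uClient.read(), "html.parser")
    uClient.close()
    # ...extract the review content needed from page_soup here...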

Regarding "python - Scraping the content of multiple pages of a website with BeautifulSoup and Selenium", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59458065/
