javascript - Python Selenium cannot get through links. Pastebin crawling

Tags: javascript python selenium web-crawler pastebin

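Hello, I am trying to extract all the links from the 10 pages of results that Pastebin returns for a search on "ssh".

Once the JavaScript has loaded, I can extract the first 10 links from the first page. I can then click a pagination link once and extract the next 10 links, but when I try to go to the third page I get an error.

Here is my code: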

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests
import re

links = []
driver = webdriver.Firefox()
driver.get("http://pastebin.com/search?q=ssh")

# wait for the search results to be loaded
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".gsc-result-info")))
for link in driver.find_elements_by_xpath("//div[@class='gs-title']/a[@class='gs-title']"):
        if link.get_attribute("href") != None:
            print link.get_attribute("href")
# get all search results links
for page in driver.find_elements_by_xpath("//div[@class='gsc-cursor-page']"):
    driver.implicitly_wait(10) # seconds
    page.click()

    for link in driver.find_elements_by_xpath("//div[@class='gs-title']/a[@class='gs-title']"):
        if link.get_attribute("href") != None:
            print link.get_attribute("href")

This is the output I am able to get, and the error I run into:

python pastebinselenium.py 
http://pastebin.com/u/ssh
http://pastebin.com/gsQWBEZP
http://pastebin.com/gfA12TWk
http://pastebin.com/udWMWdPR
http://pastebin.com/J55238CB
http://pastebin.com/DN2aHvRr
http://pastebin.com/f0rh66kU
http://pastebin.com/3zvY3DSm
http://pastebin.com/fqHVJGEm
http://pastebin.com/3aB7h0fm
http://pastebin.com/3uBAxXu3
http://pastebin.com/cxjRqeSh
http://pastebin.com/5nJPNr3Q
http://pastebin.com/qV0rPNfP
http://pastebin.com/zubt2Yc7
http://pastebin.com/jFrjWYpE
http://pastebin.com/DU7yqjQ1
http://pastebin.com/AFtWHmtE
http://pastebin.com/UVP5behK
http://pastebin.com/hP7XTyv1
Traceback (most recent call last):
  File "pastebinselenium.py", line 21, in <module>
    page.click()
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 74, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 457, in _execute
    return self._parent.execute(command, params)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 233, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: Element not found in the cache - perhaps the page has changed since it was looked up
Stacktrace:
    at fxdriver.cache.getElementAt (resource://fxdriver/modules/web-element-cache.js:9454)
    at Utils.getElementAt (file:///tmp/tmpzhZSEC/extensions/fxdriver@googlecode.com/components/command-processor.js:9039)
    at fxdriver.preconditions.visible (file:///tmp/tmpzhZSEC/extensions/fxdriver@googlecode.com/components/command-processor.js:10090)
    at DelayedCommand.prototype.checkPreconditions_ (file:///tmp/tmpzhZSEC/extensions/fxdriver@googlecode.com/components/command-processor.js:12644)
    at DelayedCommand.prototype.executeInternal_/h (file:///tmp/tmpzhZSEC/extensions/fxdriver@googlecode.com/components/command-processor.js:12661)
    at fxdriver.Timer.prototype.setTimeout/<.notify (file:///tmp/tmpzhZSEC/extensions/fxdriver@googlecode.com/components/command-processor.js:625)

I want to get 10 links from each of the 10 pages (100 in total), but I can only extract 20 =(

I also tried this:

wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".gsc-cursor-box")))

right before the click, but it did not work.

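Best answer

The idea is to click the pagination links in a loop, wait for the next page number to become the active one, and collect the links along the way. Implementation: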

from pprint import pprint

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.get("http://pastebin.com/search?q=ssh")

# wait for the search results to be loaded
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".gsc-result-info")))

# CSS selector for the result title links (the same on every page)
RESULT_LINKS = ".gsc-results .gs-result > .gsc-thumbnail-inside > .gs-title > a.gs-title"

links = [link.get_attribute("href") for link in driver.find_elements(By.CSS_SELECTOR, RESULT_LINKS)]
for page_number in range(2, 11):
    # look the pagination link up by its number on every iteration
    driver.find_element(By.XPATH, "//div[@class='gsc-cursor-page' and . = '%d']" % page_number).click()

    # wait until that number is marked as the current page
    wait.until(EC.visibility_of_element_located((By.XPATH, "//div[contains(@class, 'gsc-cursor-current-page') and . = '%d']" % page_number)))

    links.extend([link.get_attribute("href") for link in driver.find_elements(By.CSS_SELECTOR, RESULT_LINKS)])

print(len(links))
pprint(links)

Prints:

100
['http://pastebin.com/u/ssh',
 'http://pastebin.com/gsQWBEZP',
  ...
 'http://pastebin.com/vtBgrndi',
 'http://pastebin.com/WgXrebLq',
 'http://pastebin.com/Nxui56Gh',
 'http://pastebin.com/Qef0LZPR',
 'http://pastebin.com/yNUh1fRe',
 'http://pastebin.com/2j0d8FzL',
 'http://pastebin.com/g92A2jAq']
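
A note on why the loop in the question fails, with a sketch of the same approach wrapped in a function. In the question, all the pagination elements are located once, before the first click, and the traceback points at page.click() on a later pass. A likely reading is that the results widget re-renders after each click, so the references obtained earlier no longer point at live elements. Looking each pagination element up again just before clicking it avoids holding such references. The function and helper names below (collect_links, page_link_xpath, current_page_xpath) and the last_page and timeout parameters are illustrative assumptions, not part of the original answer; the selectors are copied from it. The Selenium imports sit inside the function only so that the two XPath helpers can be tried without a browser.

```python
# Selector for the result title links, copied from the answer above.
RESULT_LINKS_CSS = (
    ".gsc-results .gs-result > .gsc-thumbnail-inside > .gs-title > a.gs-title"
)


def page_link_xpath(page_number):
    """XPath for the pagination link that carries the given page number."""
    return "//div[@class='gsc-cursor-page' and . = '%d']" % page_number


def current_page_xpath(page_number):
    """XPath that matches once the given number is marked as the current page."""
    return (
        "//div[contains(@class, 'gsc-cursor-current-page') and . = '%d']"
        % page_number
    )


def collect_links(driver, last_page=10, timeout=10):
    """Collect result hrefs from pages 1..last_page of an already loaded search.

    Hypothetical helper: `driver` is expected to be a Selenium WebDriver that
    has already opened the search page and finished loading the first results.
    """
    # Imported here so the pure helpers above work without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)

    def hrefs_on_page():
        # Skip anchors without an href, as the question's code does.
        found = driver.find_elements(By.CSS_SELECTOR, RESULT_LINKS_CSS)
        hrefs = [a.get_attribute("href") for a in found]
        return [href for href in hrefs if href is not None]

    links = hrefs_on_page()
    for page_number in range(2, last_page + 1):
        # Locate the pagination element again on every pass instead of
        # reusing references collected before an earlier click.
        driver.find_element(By.XPATH, page_link_xpath(page_number)).click()
        wait.until(
            EC.visibility_of_element_located(
                (By.XPATH, current_page_xpath(page_number))
            )
        )
        links.extend(hrefs_on_page())
    return links
```

Assuming the page structure still matches these selectors, it would be called as links = collect_links(driver) after the initial wait for ".gsc-result-info" shown in the answer.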

Regarding "javascript - Python Selenium cannot get through links. Pastebin crawling", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36577338/
