python - 我如何遍历页面并使用 selenium 从每个页面获取数据?

标签 python loops selenium web-scraping

我想进行谷歌搜索并收集所有点击的链接,以便在收集所有链接后单击这些链接并从中提取数据。如何从每次点击中获取链接?

我尝试了多种解决方案,例如使用 for 循环和 while True 语句。我将在下面展示一些代码示例。我要么根本没有数据,要么只从 1 个网页获得数据(链接)。有人可以帮我弄清楚如何遍历谷歌搜索的每一页并获取所有链接,以便我可以继续抓取这些页面吗?我是使用 Selenium 的新手,所以如果代码没有多大意义,我很抱歉,我真的把自己弄糊涂了。

driver.get('https://www.google.com')
search = driver.find_element_by_name('q')
search.send_keys('condition')
sleep(0.5)
search.send_keys(Keys.RETURN)
sleep(0.5)

while True:
    try:
        urls = driver.find_elements_by_class_name('iUh30')
        for url in urls
        urls = [url.text for url in urls]

    sleep(0.5)

    element = driver.find_element_by_id('pnnext')
    driver.execute_script("return arguments[0].scrollIntoView();", element)
    sleep(0.5)
    element.click()
urls = driver.find_elements_by_class_name('iUh30')
urls = [url.text for url in urls]
sleep(0.5)

element = driver.find_element_by_id('pnnext')
driver.execute_script("return arguments[0].scrollIntoView();", element)
sleep(0.5)
element.click()
while True:
    next_page_btn = driver.find_element_by_id('pnnext')
    if len(next_page_btn) <1:
        print("no more pages left")
        break
    else: 
        urls = driver.find_elements_by_class_name('iUh30')
        urls = [url.text for url in urls]
    sleep(0.5)

    element = driver.find_element_by_id('pnnext')
    driver.execute_script("return arguments[0].scrollIntoView();", element)
    sleep(0.5)
    element.click()

我希望有一个来自 google 搜索的所有 url 的列表,这些 url 可以被 Selenium 打开,这样 Selenium 就可以从这些页面中获取数据。

我只从一个页面获取 url 列表。下一步(抓取那些页面)工作正常。但是由于这个限制,我只能得到 10 个结果,而我想查看所有结果。

最佳答案

试试下面的代码。我改变了一点。希望这对您有所帮助。

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

driver=webdriver.Chrome()
driver.get('https://www.google.com')
search = driver.find_element_by_name('q')
search.send_keys('condition')
search.submit()

while True:
    next_page_btn =driver.find_elements_by_xpath("//a[@id='pnnext']")
    if len(next_page_btn) <1:
        print("no more pages left")
        break
    else:
        urls = driver.find_elements_by_xpath("//*[@class='iUh30']")
        urls = [url.text for url in urls]
        print(urls)

    element =WebDriverWait(driver,5).until(expected_conditions.element_to_be_clickable((By.ID,'pnnext')))
    driver.execute_script("return arguments[0].scrollIntoView();", element)
    element.click()

输出:

['https://dictionary.cambridge.org/dictionary/english/condition', 'https://www.thesaurus.com/browse/condition', 'https://en.oxforddictionaries.com/definition/condition', 'https://www.dictionary.com/browse/condition', 'https://www.merriam-webster.com/dictionary/condition', 'https://www.collinsdictionary.com/dictionary/english/condition', 'https://en.wiktionary.org/wiki/condition', 'www.businessdictionary.com/definition/condition.html', 'https://en.wikipedia.org/wiki/Condition', 'https://www.definitions.net/definition/condition', '', '', '', '']
['https://www.thefreedictionary.com/condition', 'https://www.thefreedictionary.com/conditions', 'https://www.yourdictionary.com/condition', 'https://www.foxnews.com/.../woman-battling-rare-suicide-disease-says-chronic-pain-con...', 'https://youngminds.org.uk/find-help/conditions/', 'www.road.is/travel-info/road-conditions-and-weather/', 'https://roll20.net/compendium/dnd5e/Conditions', 'https://www.home-assistant.io/docs/scripts/conditions/', 'https://www.bhf.org.uk/informationsupport/conditions', 'https://www.gov.uk/driving-medical-conditions']
['https://immi.homeaffairs.gov.au/visas/already-have.../check-visa-details-and-condition...', 'https://www.d20pfsrd.com/gamemastering/conditions/', 'https://www.ofgem.gov.uk/licences-industry-codes-and.../licence-conditions', 'https://www.healthychildren.org/English/health-issues/conditions/Pages/default.aspx', 'https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_elements.html', 'https://www.ofcom.org.uk/phones-telecoms.../general-conditions-of-entitlement', 'https://www.rnib.org.uk/eye-health/eye-conditions', 'https://www.mdt.mt.gov/travinfo/map/mtmap_frame.html', 'https://www.mayoclinic.org/diseases-conditions', 'https://www.w3schools.com/python/python_conditions.asp']
['https://www.tremblant.ca/mountain-village/mountain-report', 'https://www.equibase.com/static/horsemen/horsemenareaCB.html', 'https://www.abebooks.com/books/rarebooks/...guide/.../guide-book-conditions.shtml', 'https://nces.ed.gov/programs/coe/', 'https://www.cdc.gov/wtc/conditions.html', 'https://snowcrows.com/raids/builds/engineer/engineer/condition/']
['https://www.millenniumassessment.org/en/Condition.html', 'https://ghr.nlm.nih.gov/condition', 'horsemen.ustrotting.com/conditions.cfm', 'https://lb.511ia.org/ialb/', 'https://www.nps.gov/deva/planyourvisit/conditions.htm', 'https://www.allaboutvision.com/conditions/', 'https://www.spine-health.com/conditions', 'https://www.tripcheck.com/', 'https://hb.511.nebraska.gov/', 'https://www.gamblingcommission.gov.uk/.../licence-conditions-and-codes-of-practice....']
['https://sports.yahoo.com/andrew-bogut-credits-beer-improved-022043569.html', 'https://ant.apache.org/manual/Tasks/conditions.html', 'https://www.disability-benefits-help.org/disabling-conditions', 'https://www.planningportal.co.uk/info/200126/applications/60/consent_types/12', 'https://www.leafly.com/news/.../qualifying-conditions-for-medical-marijuana-by-state', 'https://www.hhs.gov/healthcare/about-the-aca/pre-existing-conditions/index.html', 'https://books.google.co.uk/books?id=tRcHAAAAQAAJ', 'www.onr.org.uk/documents/licence-condition-handbook.pdf', 'https://books.google.co.uk/books?id=S0sGAAAAQAAJ']
['https://books.google.co.uk/books?id=KSjLDvXH6iUC', 'https://www.arcgis.com/apps/Viewer/index.html?appid...', 'https://www.trappfamily.com/trail-conditions.htm', 'https://books.google.co.uk/books?id=n_g0AQAAMAAJ', 'https://books.google.co.uk/books?isbn=1492586277', 'https://books.google.co.uk/books?id=JDjQ2-HV3l8C', 'https://www.newsshopper.co.uk/.../17529825.teenager-no-longer-in-critical-condition...', 'https://nbcpalmsprings.com/.../bicyclist-who-collided-with-minivan-hospitalized-in-cri...']
['https://www.stuff.co.nz/.../4yearold-christchurch-terrorist-attack-victim-in-serious-but-...', 'https://www.shropshirestar.com/.../woman-in-serious-condition-after-fall-from-motor...', 'https://www.expressandstar.com/.../woman-in-serious-condition-after-fall-from-motor...', 'https://www.independent.ie/.../toddler-rushed-to-hospital-in-serious-condition-after-hit...', 'https://www.nhsinform.scot/illnesses-and-conditions/ears-nose-and-throat/vertigo', 'https://www.rochdaleonline.co.uk/.../teenage-cyclist-in-serious-condition-after-collisio...', 'https://www.irishexaminer.com/.../baby-of-woman-found-dead-in-cumh-in-critical-cond...', 'https://touch.nihe.gov.uk/index/corporate/housing.../house_condition_survey.htm', 'https://www.nami.org/Learn-More/Mental-Health-Conditions', 'https://www.weny.com/.../update-woman-in-critical-but-stable-condition-after-being-s...']

关于python - 我如何遍历页面并使用 selenium 从每个页面获取数据?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/55381806/

相关文章:

python - django.core.exceptions 在 python 中运行 selenium 功能测试时出现 ImproperlyConfigured 错误

r - 迭代存储在 R 中 data.frame 中的列表

java - 提交表单时,Selenium 发送 key 无法使用 java

excel - 在excel VBA中通过 Selenium 为每个谷歌搜索创建新标签

python - 如果找不到与列表匹配的任何内容,如何追加到列表中?

python - 如何从 sys.path 中永久删除值?

python - 如何优化这些嵌套循环?

loops - 继续而不是 Perl 中的 'or die'

mysql - 从代码循环到 Sql 查询?

java - 无法创建新的远程 session - Selenium webdriver