python - Get all values with the same class name in Selenium

Tags: python selenium selenium-webdriver web-scraping web-crawler

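I want to get the title and URL of every article that shares the same class name. The problem is that my code prints the same single article over and over instead of all of them.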

from selenium import webdriver
driver = webdriver.Chrome(r'C:\Users\muhammad.usman\Downloads\chromedriver_win32\chromedriver.exe')
driver.get('https://www.aljazeera.com/news/')
# to get the current location ...
driver.current_url
button = driver.find_element_by_id('btn_showmore_b1_418')
driver.execute_script("arguments[0].click();", button)
content = driver.find_element_by_class_name('topics-sec-block')
print(content)
container = content.find_elements_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]')
print(container)
i=0
for i in range(0, 12):
    title = []
    url = []
    heading=container[i].find_element_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]/a/h2').text
    link = container[i].find_element_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]/a')
    title.append(heading)
    url.append(link.get_attribute('href'))
    print(title)
    print(url)
    i += 1
names = driver.find_elements_by_css_selector('div.topics-sec-item-cont')
for name in names:

    heading=name.find_element_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]/a/h2').text
    link = name.find_element_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]/a')
    print(heading)
    print(link.get_attribute('href'))

Best answer

Use Selenium together with BeautifulSoup:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
driver.get('https://www.aljazeera.com/news/')
# to get the current location ...
driver.current_url
button = driver.find_element_by_id('btn_showmore_b1_418')
driver.execute_script("arguments[0].click();", button)
content = driver.find_element_by_class_name('topics-sec-block')
print(content)

# Hand the rendered page to BeautifulSoup and select every article container
soup = BeautifulSoup(driver.page_source, 'html.parser')
container = soup.select('div.topics-sec-item-cont')

titleList = []
urlList = []
for item in container:
    # find() searches only inside this container, so each item yields its own data
    heading = item.find('h2').text
    link = item.find('a')['href']
    titleList.append(heading)
    urlList.append(link)
    print('HEADLINE: %s\nUrl: https://www.aljazeera.com%s\n' % (heading, link) + '-'*70 + '\n')

driver.close()

Output:

HEADLINE: Trump's Remain in Mexico policy endangers migrants headed to US
Url: https://www.aljazeera.com/news/2020/03/trumps-remain-mexico-policy-endangers-migrants-headed-200306102155930.html
----------------------------------------------------------------------

HEADLINE: India, South Korea report new coronavirus cases: Live updates
Url: https://www.aljazeera.com/topics/events/coronavirus-outbreak.html
----------------------------------------------------------------------

HEADLINE: Clashes between Greek police, migrants reported on Turkish border
Url: https://www.aljazeera.com/topics/subjects/refugees.html
----------------------------------------------------------------------

HEADLINE: Congo protests against unpaid pensions as gov't debt balloons
Url: https://www.aljazeera.com/topics/regions/africa.html
----------------------------------------------------------------------

HEADLINE: Is India prepared for coronavirus outbreak?
Url: https://www.aljazeera.com/topics/events/coronavirus-outbreak.html
----------------------------------------------------------------------

HEADLINE: India protest violence leaves thousands displaced
Url: https://www.aljazeera.com/topics/regions/asia.html
----------------------------------------------------------------------

HEADLINE: Guinea protests: One dead in anti-government demonstration
Url: https://www.aljazeera.com/topics/regions/africa.html
----------------------------------------------------------------------

HEADLINE: Brazil recalls diplomats, officials from Venezuela
Url: https://www.aljazeera.com/topics/country/brazil.html
----------------------------------------------------------------------

HEADLINE: US coronavirus: rise in cases in New York state
Url: https://www.aljazeera.com/topics/events/coronavirus-outbreak.html
----------------------------------------------------------------------

HEADLINE: Australia urged to take action amid rising violence against women
Url: https://www.aljazeera.com/topics/country/australia.html
----------------------------------------------------------------------

HEADLINE: Turkey, Russia announce ceasefire in Syria's Idlib
Url: https://www.aljazeera.com/topics/regions/middleeast.html
----------------------------------------------------------------------

HEADLINE: 'Good morning, Codogno!': A coronavirus radio station in Italy
Url: https://www.aljazeera.com/topics/country/italy.html
----------------------------------------------------------------------
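
The answer does not say why the question's loops repeat one article. A likely cause: in Selenium, an XPath that begins with `//` is evaluated against the whole document even when it is called on an element, so every `find_element_by_xpath('//div[...]/a/h2')` call returns the first match on the page. The first loop also re-creates `title` and `url` on each pass, so nothing accumulates. The sketch below stays in Selenium and starts each XPath with `.` so the lookup is scoped to its own container. `collect_articles` is a hypothetical helper name, the locator strings are the values of `By.XPATH` and `By.CSS_SELECTOR`, and the assumption that the headline `h2` sits directly inside the article's `a` comes from the question's own XPath and has not been checked against the live page.

```python
def collect_articles(driver):
    """Return one (title, url) pair per article container (hypothetical helper)."""
    # "css selector" and "xpath" are the string values of By.CSS_SELECTOR and By.XPATH.
    containers = driver.find_elements("css selector", "div.topics-sec-item-cont")

    articles = []  # created once, outside the loop, so results accumulate
    for container in containers:
        # The leading dot makes the XPath relative to this container.
        # Assumption: the headline <h2> is a direct child of the article's <a>.
        link = container.find_element("xpath", ".//a[h2]")
        heading = link.find_element("xpath", "./h2")
        articles.append((heading.text, link.get_attribute("href")))
    return articles
```

With a live session, calling `collect_articles(driver)` after the "show more" click should return one pair per container, provided the page structure matches the assumption above.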

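The answer's print statement prepends the site root to every `href`, which assumes each link is site-relative. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves relative links and leaves already-absolute ones unchanged. `BASE_URL` and `absolute_url` are illustrative names, not part of the original answer.

```python
from urllib.parse import urljoin

# Site root used to resolve relative links (taken from the URL in the answer).
BASE_URL = "https://www.aljazeera.com"


def absolute_url(href, base=BASE_URL):
    """Resolve href against base; an already-absolute href is returned as-is."""
    return urljoin(base, href)


print(absolute_url("/topics/regions/africa.html"))
# https://www.aljazeera.com/topics/regions/africa.html
print(absolute_url("https://www.aljazeera.com/topics/country/italy.html"))
# https://www.aljazeera.com/topics/country/italy.html
```

In the answer's loop this would replace the hard-coded prefix: store `absolute_url(link)` in `urlList` and print that value.
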
This content is based on a similar question found on Stack Overflow, "python - Get all values with the same class name in Selenium": https://stackoverflow.com/questions/60562525/
