python - Can't get data from every class named 'heading' on the page using Selenium

Tags: python python-3.x selenium selenium-webdriver web-scraping

Hi, I'm new to data scraping. Here I'm trying to scrape data from every element with the class name 'heading'. But even though I iterate with a for loop, my code only prints the first element.

Expected output - data scraped from every element with the class name 'heading', across all pages

Actual output - only the first element with the class name 'heading' is scraped, and the Next button is never even clicked.

The site I'm using for testing is here

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import pandas as pd
from openpyxl.workbook import Workbook


DRIVER_PATH = 'C:/Users/Aishwary/Downloads/chromedriver_win32/chromedriver'

driver = webdriver.Chrome(executable_path=DRIVER_PATH)

driver.get('https://www.fundoodata.com/citiesindustry/19/2/list-of-information-technology-(it)-companies-in-noida')

# get all classes which has heading as a class name 
company_names = driver.find_elements_by_class_name('heading')

# to store all companies names from heading class name
names_list = []

while True:

    try:
        for name in company_names: # iterate each name in all div classes named as heading
            text = name.text    # get text data from those elements
            names_list.append(text)
            print(text)
            # Click on next button to get data from next pages as well
            driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="main-container"]/div[2]/div[4]/div[2]/div[44]/div[1]/ul/li[7]/a'))))
            driver.find_element_by_xpath('//*[@id="main-container"]/div[2]/div[4]/div[2]/div[44]/div[1]/ul/li[7]/a').click()

    except (TimeoutException, WebDriverException) as e:
        print("Last page reached")
        break


driver.quit()

# Store those data in excel sheet
df = pd.DataFrame(names_list)
writer = pd.ExcelWriter('companies_names.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='List')
writer.save()
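A side note on the loop above: Next is clicked inside the for loop, so the browser navigates away while iterating the first page's names and the remaining WebElements go stale; the element list is also never re-fetched after a page change. Decoupled from Selenium, the intended control flow looks like this (the stub pages and names are made up):

```python
def scrape_pages(fetch_names, click_next):
    """Collect names page by page: read the whole page first,
    then advance exactly once; stop when there is no next page."""
    names = []
    while True:
        names.extend(fetch_names())   # read every heading on this page
        if not click_next():          # click Next once per page, not per name
            break
    return names

# Stub standing in for the browser: three made-up pages of names
pages = [['A Ltd', 'B Ltd'], ['C Ltd'], ['D Ltd', 'E Ltd']]
state = {'i': 0}

def fetch_names():
    return pages[state['i']]

def click_next():
    if state['i'] + 1 < len(pages):
        state['i'] += 1
        return True
    return False

result = scrape_pages(fetch_names, click_next)
print(result)  # ['A Ltd', 'B Ltd', 'C Ltd', 'D Ltd', 'E Ltd']
```

In the real script, `fetch_names` would re-run `driver.find_elements_by_class_name('heading')` on the current page, and `click_next` would click the Next button and return False once it is no longer present.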

Best Answer

This script will get all the company names from the page:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://www.fundoodata.com/citiesindustry/19/2/list-of-information-technology-(it)-companies-in-noida'

all_data = []
while True:
    print(url)

    # Fetch and parse the current page
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    # Every company name sits in a <div class="heading">
    for h in soup.select('div.heading'):
        all_data.append({'Name': h.text})
        print(h.text)

    # Follow the "Next" pagination link until there isn't one
    next_page = soup.select_one('a:contains("Next")')
    if not next_page:
        break

    url = 'https://www.fundoodata.com' + next_page['href']

df = pd.DataFrame(all_data)
print(df)

df.to_csv('data.csv')

Prints:

                              Name
0                   BirlaSoft Ltd
1             HCL Infosystems Ltd
2            HCL Technologies Ltd
3           NIIT Technologies Ltd
4          3Pillar Global Pvt Ltd
..                             ...
481  Innovaccer Analytics Pvt Ltd
482         Kratikal Tech Pvt Ltd
483          Sofocle Technologies
484    SquadRun Solutions Pvt Ltd
485   Zaptas Technologies Pvt Ltd

[486 rows x 1 columns]
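One detail in the script worth noting: the Next link's href comes back root-relative, so the script prefixes the site origin by string concatenation. The standard library's `urllib.parse.urljoin` handles both relative and already-absolute hrefs, which is slightly more robust (the example paths below are made up):

```python
from urllib.parse import urljoin

site = 'https://www.fundoodata.com'

# A root-relative href is resolved against the origin
print(urljoin(site, '/citiesindustry/19/2/page-2'))
# -> https://www.fundoodata.com/citiesindustry/19/2/page-2

# An already-absolute href passes through unchanged
print(urljoin(site, 'https://www.fundoodata.com/other'))
# -> https://www.fundoodata.com/other
```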

and saves data.csv (the answer showed a LibreOffice screenshot of the file).

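A small note on the saved file: `df.to_csv('data.csv')` also writes the DataFrame's integer index as an unnamed first column. Passing `index=False` keeps only the Name column (a minimal sketch with made-up names):

```python
import pandas as pd

df = pd.DataFrame([{'Name': 'A Ltd'}, {'Name': 'B Ltd'}])

# Without a path argument, to_csv returns the CSV text instead of writing a file
print(df.to_csv(index=False))
# Name
# A Ltd
# B Ltd
```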

Regarding python - can't get data from every class named 'heading' on the page using Selenium, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/63325463/
