python - How to store specific links in a list and then click them

Tags: python html list matrix web-scraping

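I have been following a tutorial on how to scrape the web page http://kanview.ks.gov/PayRates/PayRates_Agency.aspx (the tutorial itself is here: https://medium.freecodecamp.org/better-web-scraping-in-python-with-selenium-beautiful-soup-and-pandas-d6390592e251 ). The layout of that page is similar to the site I want to scrape: https://www.giiresearch.com/topics/TL11.shtml . My only problem is that the report title links on the giiresearch site do not follow any sequential pattern. For example, these two are from giiresearch: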

<a href="/report/an806147-fixed-mobile-convergence-from-challenger-operators.html">Fixed-Mobile Convergence from Challenger Operators: Case Studies and Analysis</a>

<a href="/annual/an378138-convergence-strategies.html">Convergence Strategies</a>

The links on the kanview site, by contrast, have ids that are numbered in order, for example:

<a id="MainContent_uxLevel2_JobTitles_uxJobTitleBtn_1" href="javascript:__doPostBack('ctl00$MainContent$uxLevel2_JobTitles$ctl03$uxJobTitleBtn','')">Academic Advisor</a>

<a id="MainContent_uxLevel2_JobTitles_uxJobTitleBtn_2" href="javascript:__doPostBack('ctl00$MainContent$uxLevel2_JobTitles$ctl04$uxJobTitleBtn','')">Academic Program Specialist</a>

This means I cannot use the approach from this line of the tutorial's code in my project:

python_button = driver.find_element_by_id('MainContent_uxLevel2_JobTitles_uxJobTitleBtn_' + str(x))

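I tried finding the elements by class name, but all the links share the same class name, "list_title", so the for loop only opens the first link and never moves on.

I think there should be a way to store the report title links in a list, so that I can open them one by one, retrieve more information about each report, and save it in an Excel sheet.

The goal of this project is to compile an Excel sheet of competitor reports, with details such as title, price, publisher, and publication date, for market analysis.

Here is my code: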

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
from tabulate import tabulate
import os

#launch url
url = "https://www.giiresearch.com/topics/TL11.shtml"

# create a new Chrome session
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.implicitly_wait(30)
driver.get(url)

#Selenium hands the page source to Beautiful Soup
soup_level1=BeautifulSoup(driver.page_source, 'lxml')


datalist = [] #empty list
x = 0 #counter

#Beautiful Soup finds all Job Title links on the agency page and the loop begins
for link in soup_level1.find_all("div", {"class": "list_title"}):

    #Selenium visits each Job Title page
    python_button = driver.find_elements_by_class_name('list_title')
    python_button.click() #click link

    #Selenium hands the source of the specific job page to Beautiful Soup
    soup_level2=BeautifulSoup(driver.page_source, 'lxml')

    #Beautiful Soup grabs the HTML table on the page
    table = soup_level2.find_all('table')[0]

    #Giving the HTML table to pandas to put in a dataframe object
    df = pd.read_html(str(table),header=0)

    #Store the dataframe in a list
    datalist.append(df[0])

    #Ask Selenium to click the back button
    driver.execute_script("window.history.go(-1)")

    #increment the counter variable before starting the loop over
    x += 1

    #end loop block

#loop has completed

#end the Selenium browser session
driver.quit()

#combine all pandas dataframes in the list into one big dataframe
result = pd.concat([pd.DataFrame(datalist[i]) for i in range(len(datalist))],ignore_index=True)

#convert the pandas dataframe to JSON
json_records = result.to_json(orient='records')

#pretty print to CLI with tabulate
#converts to an ascii table
print(tabulate(result, headers=["Report Title","Publisher","Published Date","Price"],tablefmt='psql'))

#get current working directory
path = os.getcwd()

#open, write, and close the file
f = open(path + "\\fhsu_payroll_data.json","w") #FHSU
f.write(json_records)
f.close()

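Best answer

You can store the report title links by using a CSS selector that targets the child a tag elements of the parent class (.list_title a). Although you could read each title from the page you visit, you can also collect (link.text, link.get_attribute('href')) tuples in a list comprehension and then unpack them into separate tuples that you can loop over.

Code: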

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://www.giiresearch.com/topics/TL11.shtml'
driver = webdriver.Chrome()
driver.get(url)
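# Wait up to 10 seconds for the report title links (child <a> elements of .list_title) to be present
wait = WebDriverWait(driver, 10)
anchors = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".list_title a")))
# Build (title, href) pairs, then unpack them into two separate tuples
titles, links = zip(*[(a.text, a.get_attribute('href')) for a in anchors])
# find_elements(By.CSS_SELECTOR, ...) replaces find_elements_by_css_selector, which was deprecated in Selenium 4 and later removed
dates = [item.text for item in driver.find_elements(By.CSS_SELECTOR, '.plist_dateinfo .plist_info_dd2')]
data = list(zip(dates, titles, links))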
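
The zip(*...) step in the code above may be easier to follow in isolation. The following is a minimal sketch that needs no browser; the two pairs are illustrative sample data built from the links quoted in the question, with the site's domain prepended on the assumption that get_attribute('href') returns absolute URLs.

```python
# Illustrative (title, href) pairs; in the real script these come from the
# list comprehension over the Selenium elements.
pairs = [
    ("Fixed-Mobile Convergence from Challenger Operators: Case Studies and Analysis",
     "https://www.giiresearch.com/report/an806147-fixed-mobile-convergence-from-challenger-operators.html"),
    ("Convergence Strategies",
     "https://www.giiresearch.com/annual/an378138-convergence-strategies.html"),
]

# zip(*pairs) transposes the list of pairs:
# one tuple holding every title, one tuple holding every href.
titles, links = zip(*pairs)

print(titles[1])   # Convergence Strategies
print(len(links))  # 2
```

Note that unpacking into two names raises ValueError if the list of pairs is empty, so this idiom relies on at least one link having been found.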

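You can then loop over the links tuple and call driver.get on each one.

You also have the title and date information in case you want to do something else with it. For example: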
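
One way that loop might look is sketched below. The function name scrape_reports and the extract callback are hypothetical helpers introduced here for illustration, not part of the original answer; the sketch assumes only that driver has a get(url) method, as a Selenium WebDriver does.

```python
def scrape_reports(driver, links, extract):
    """Visit each URL in `links` and collect one dict per page.

    `driver` is any object with a .get(url) method (for example a Selenium
    WebDriver). `extract` is a caller-supplied function that reads the
    currently loaded page from the driver and returns a dict of fields.
    """
    rows = []
    for url in links:
        driver.get(url)        # navigate directly; no click or back button needed
        row = extract(driver)  # pull whatever fields the page offers
        row["url"] = url       # remember which page the row came from
        rows.append(row)
    return rows
```

With the driver and links from the code above, a call could be as simple as scrape_reports(driver, links, lambda d: {"page_title": d.title}). Because every URL is stored up front, navigating away from the listing page does not invalidate anything, which avoids the stale-element problems that clicking and going back can cause.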

print(data[3])

gives

('January 30, 2019', 'Global Telco Converged Plans & Bundles Insights 2019: A Look at Bundling Strategies from Around the World', 'https://www.giiresearch.com/report/wise780587-global-telco-converged-plans-bundles-insights-look.html')
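
Since the question's goal is a spreadsheet, here is a hedged sketch of writing data out as CSV, which Excel can open. It uses only the standard library; the helper name rows_to_csv and the column headers are assumptions chosen for illustration, and the sample row is the one printed above.

```python
import csv
import io


def rows_to_csv(data, out):
    """Write (date, title, link) tuples to the file-like object `out` as CSV."""
    writer = csv.writer(out)
    writer.writerow(["Published Date", "Report Title", "Link"])  # header row
    writer.writerows(data)                                        # one line per report


# Sample row copied from the printed output; the real `data` list has one per report.
data = [(
    "January 30, 2019",
    "Global Telco Converged Plans & Bundles Insights 2019: A Look at Bundling Strategies from Around the World",
    "https://www.giiresearch.com/report/wise780587-global-telco-converged-plans-bundles-insights-look.html",
)]

# Write to an in-memory buffer here; pass an open file object to write to disk.
buffer = io.StringIO()
rows_to_csv(data, buffer)
print(buffer.getvalue())
```

To write a file instead, open it with open("reports.csv", "w", newline="", encoding="utf-8") (the file name is arbitrary) and pass the file object as out. If a native .xlsx file is needed, pandas' DataFrame.to_excel is another option, though it requires an Excel writer package such as openpyxl to be installed.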

Regarding "python - How to store specific links in a list and then click them", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55329789/
