python - How can I make my session.get() link a variable?

Tags: python web-scraping beautifulsoup

My goal is to scrape multiple profile links and then scrape specific data from each of those profiles.

Here is my code to get the multiple profile links (it should work fine):

from bs4 import BeautifulSoup
from requests_html import HTMLSession
import re

session = HTMLSession()
r = session.get('https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms')
r.html.render(sleep=5)  # render the JavaScript so the discussion links appear

soup = BeautifulSoup(r.html.html, 'html.parser')

# every link pointing to a user profile
profiles = soup.find_all(href=re.compile("/profile/kaid"))

for links in profiles:
    links_no_list = links.extract()
    text_link = links_no_list['href']
    text_link_nodiscussion = text_link[:-10]  # strip the trailing "discussion" part of the href
    final_profile_link = 'https://www.khanacademy.org' + text_link_nodiscussion
    print(final_profile_link)

And here is my code to get the specific data from a single profile (it should also work fine):

from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/Kkasparas/')
r.html.render(sleep=5)

soup = BeautifulSoup(r.html.html, 'html.parser')

# date joined / points / videos from the statistics table, if it exists
user_info_table = soup.find('table', class_='user-statistics-table')

if user_info_table is not None:
    dates, points, videos = [tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
else:
    dates = points = videos = 'NA'

# discussion counters (questions, votes, answers, ...)
user_socio_table = soup.find_all('div', class_='discussion-stat')

data = {}
for gettext in user_socio_table:
    category = gettext.find('span')
    category_text = category.text.strip()
    number = category.previousSibling.strip()
    data[category_text] = number

# make sure every expected key exists, even when the profile hides it
full_data_keys = ['questions', 'votes', 'answers', 'flags raised', 'project help requests', 'project help replies', 'comments', 'tips and thanks']
for header_value in full_data_keys:
    if header_value not in data.keys():
        data[header_value] = 'NA'

# date of the last activity, taken from the streak calendar
user_calendar = soup.find('div', class_='streak-calendar-scroll-container')

if user_calendar is not None:
    last_activity = user_calendar.find('span', class_='streak-cell filled')
    last_activity_date = last_activity['title']
else:
    last_activity_date = 'NA'


filename = "khanscrapetry1.csv"
f = open(filename, "w")
headers = "date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date\n"
f.write(headers)
f.write(dates + "," + points.replace(",", "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "," + last_activity_date + "\n")
f.close()

My question is: how can I automate my script? In other words: how can I merge these two scripts?

The goal is to create some kind of variable that holds a different profile link each time.

Then, for each profile link, fetch the specific data and write it to the csv file (one new row per profile).
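Roughly, the structure I have in mind is something like the sketch below, where scrape_profile() is a made-up placeholder for the single-profile script above, not real code:

# Hypothetical outline of the merged script; scrape_profile() stands in for
# the second script and just returns one placeholder csv row here.
def scrape_profile(profile_link):
    # ... run the single-profile scrape against profile_link ...
    return 'NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA\n'

profile_links = [
    'https://www.khanacademy.org/profile/Kkasparas/',
    # ... the links printed by the first script ...
]

with open('khanscrapetry1.csv', 'w') as f:
    f.write('date_joined,points,videos,questions,votes,answers,flags,project_request,project_replies,comments,tips_thx,last_date\n')
    for link in profile_links:
        f.write(scrape_profile(link))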

Best answer

This is fairly straightforward to do. Instead of printing the profile links, I store them in a list variable, then loop through that list, scrape each link, and write the results to the csv file. Some pages don't have all the details, so you have to handle those exceptions as well; in the code below they are marked as 'NA', following the convention already used in your code. One more note for the future: consider using Python's built-in csv module for reading and writing csv files.
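On that csv module note, here is a minimal sketch of what writing the rows with csv.DictWriter could look like, assuming one dict of scraped values per profile (the two dicts below are placeholder data, not real output):

import csv

# Sketch only: each dict stands in for the values scraped from one profile.
fieldnames = ['date_joined', 'points', 'videos', 'questions', 'votes', 'answers',
              'flags', 'project_request', 'project_replies', 'comments',
              'tips_thx', 'last_date']

rows = [
    {'date_joined': '6 years ago', 'points': '1527829', 'videos': '1123'},
    {'date_joined': 'NA'},  # a profile with hidden statistics
]

with open('khanscrapetry1.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval='NA')
    writer.writeheader()
    for row in rows:          # in the real script: one dict per scraped profile
        writer.writerow(row)  # missing keys are filled with 'NA' via restval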

Merged script

from bs4 import BeautifulSoup
from requests_html import HTMLSession
import re

# Step 1: collect the profile links into a list instead of printing them
session = HTMLSession()
r = session.get('https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms')
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.html, 'html.parser')
profiles = soup.find_all(href=re.compile("/profile/kaid"))
profile_list = []
for links in profiles:
    links_no_list = links.extract()
    text_link = links_no_list['href']
    text_link_nodiscussion = text_link[:-10]
    final_profile_link = 'https://www.khanacademy.org' + text_link_nodiscussion
    profile_list.append(final_profile_link)

# Step 2: open the csv file once and write the header
filename = "khanscrapetry1.csv"
f = open(filename, "w")
headers = "date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date\n"
f.write(headers)

# Step 3: scrape each profile and append one row per profile
for link in profile_list:
    print("Scraping ", link)
    session = HTMLSession()
    r = session.get(link)
    r.html.render(sleep=5)
    soup = BeautifulSoup(r.html.html, 'html.parser')
    user_info_table = soup.find('table', class_='user-statistics-table')
    if user_info_table is not None:
        dates, points, videos = [tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
    else:
        dates = points = videos = 'NA'
    user_socio_table = soup.find_all('div', class_='discussion-stat')
    data = {}
    for gettext in user_socio_table:
        category = gettext.find('span')
        category_text = category.text.strip()
        number = category.previousSibling.strip()
        data[category_text] = number
    full_data_keys = ['questions', 'votes', 'answers', 'flags raised', 'project help requests', 'project help replies', 'comments', 'tips and thanks']
    for header_value in full_data_keys:
        if header_value not in data.keys():
            data[header_value] = 'NA'
    user_calendar = soup.find('div', class_='streak-calendar-scroll-container')
    if user_calendar is not None:
        last_activity = user_calendar.find('span', class_='streak-cell filled')
        try:
            last_activity_date = last_activity['title']
        except TypeError:
            # profile has a calendar but no filled cell
            last_activity_date = 'NA'
    else:
        last_activity_date = 'NA'
    f.write(dates + "," + points.replace(",", "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "," + last_activity_date + "\n")
f.close()

Sample output from khanscrapetry1.csv

date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date
6 years ago,1527829,1123,25,100,2,0,NA,NA,0,0,Saturday Jun 4 2016
6 years ago,1527829,1123,25,100,2,0,NA,NA,0,0,Saturday Jun 4 2016
6 years ago,3164708,1276,164,2793,348,67,16,3,5663,885,Wednesday Oct 31 2018
6 years ago,3164708,1276,164,2793,348,67,16,3,5663,885,Wednesday Oct 31 2018
NA,NA,NA,18,NA,0,0,NA,NA,0,NA,Monday Dec 24 2018
NA,NA,NA,18,NA,0,0,NA,NA,0,NA,Monday Dec 24 2018
5 years ago,240334,56,7,42,6,0,2,NA,12,2,Tuesday Nov 20 2018
5 years ago,240334,56,7,42,6,0,2,NA,12,2,Tuesday Nov 20 2018
...

Regarding "python - How can I make my session.get() link a variable?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54890582/
