python - How to loop over multiple URLs in BeautifulSoup and convert the data into a DataFrame?

Tags: python pandas dataframe beautifulsoup

I have a list of URLs that I want to scrape data from. I can do it for a single URL like this:

url_list = ['https://www2.daad.de/deutschland/studienangebote/international-programmes/en/detail/4722/',
            'https://www2.daad.de/deutschland/studienangebote/international-programmes/en/detail/6318/']


from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www2.daad.de/deutschland/studienangebote/international-programmes/en/detail/4479/"
page = requests.get(url)

soup = BeautifulSoup(page.text, "html.parser")

info = soup.find_all("dl", {'class':'c-description-list c-description-list--striped'})

comp_info = pd.DataFrame()
cleaned_id_text = []
for i in info[0].find_all('dt'):
    cleaned_id_text.append(i.text)
cleaned_id__attrb_text = []
for i in info[0].find_all('dd'):
    cleaned_id__attrb_text.append(i.text)


df = pd.DataFrame([cleaned_id__attrb_text], columns=cleaned_id_text)
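The dt/dd pairing above can be tried offline, without any network calls, against a hypothetical HTML snippet that mirrors the page's markup (the labels and values below are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one page's <dl> block
html = """
<dl class="c-description-list c-description-list--striped">
  <dt>Degree</dt><dd>Master of Science</dd>
  <dt>Language</dt><dd>English</dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")
target = soup.find("dl", class_="c-description-list c-description-list--striped")

# Pair each <dt> label with its <dd> value
names = [dt.get_text(strip=True) for dt in target.find_all("dt")]
values = [dd.get_text(strip=True) for dd in target.find_all("dd")]
row = dict(zip(names, values))
```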

But I don't know how to do this for multiple URLs and append the data to the DataFrame. Each URL describes one course listing, so I want to build one DataFrame that holds the data from all the URLs... It would also be great if I could add each URL as a separate column in the DataFrame.

Best Answer

import requests
from bs4 import BeautifulSoup
import pandas as pd


numbers = [4722, 6318]


def Main(url):
    with requests.Session() as req:
        for num in numbers:
            r = req.get(url.format(num))
            soup = BeautifulSoup(r.content, 'html.parser')
            target = soup.find(
                "dl", class_="c-description-list c-description-list--striped")
            names = [item.text for item in target.find_all("dt")]
            data = [item.get_text(strip=True) for item in target.find_all("dd")]
            df = pd.DataFrame([data], columns=names)
            # append each page's row to the same CSV; write the header
            # only on the first iteration so it isn't repeated
            df.to_csv("data.csv", index=False, mode="a",
                      header=(num == numbers[0]))


Main("https://www2.daad.de/deutschland/studienangebote/international-programmes/en/detail/{}/")
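Appending to the CSV inside the loop works, but another common pattern is to collect one single-row DataFrame per page and concatenate them once at the end, which also gives you the combined frame in memory. A minimal sketch with made-up row values standing in for the scraped data:

```python
import pandas as pd

# Hypothetical parsed results: one single-row DataFrame per page
frames = [
    pd.DataFrame([["Master of Science", "English"]],
                 columns=["Degree", "Language"]),
    pd.DataFrame([["Bachelor of Arts", "German"]],
                 columns=["Degree", "Language"]),
]

# Concatenate once; ignore_index renumbers the rows 0..n-1
df = pd.concat(frames, ignore_index=True)
df.to_csv("data.csv", index=False)  # single header row, written once
```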

Updated per the user's request:

import requests
from bs4 import BeautifulSoup
import pandas as pd


def Main(urls):
    with requests.Session() as req:
        allin = []
        for url in urls:
            r = req.get(url)
            soup = BeautifulSoup(r.content, 'html.parser')
            target = soup.find(
                "dl", class_="c-description-list c-description-list--striped")
            names = [item.text for item in target.find_all("dt")]
            names.append("url")
            data = [item.get_text(strip=True) for item in target.find_all("dd")]
            data.append(url)  # keep the source URL as an extra column
            allin.append(data)
        df = pd.DataFrame(allin, columns=names)
        df.to_csv("data.csv", index=False)


urls = ['https://www2.daad.de/deutschland/studienangebote/international-programmes/en/detail/4722/',
        'https://www2.daad.de/deutschland/studienangebote/international-programmes/en/detail/6318/']
Main(urls)
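Note that `pd.DataFrame(allin, columns=names)` assumes every page exposes the same dt labels in the same order. If the pages can differ, one robust option is to build a dict per row and let pandas align the columns by label, filling gaps with NaN. A sketch with hypothetical rows:

```python
import pandas as pd

# Hypothetical rows: each page yields a {label: value} dict,
# and the label sets need not match between pages
rows = [
    {"Degree": "Master of Science", "Language": "English",
     "url": "https://example.com/4722/"},
    {"Degree": "Bachelor of Arts", "Duration": "6 semesters",
     "url": "https://example.com/6318/"},
]

# pandas aligns columns by dict key; missing values become NaN
df = pd.DataFrame(rows)
```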

Regarding "python - How to loop over multiple URLs in BeautifulSoup and convert the data into a DataFrame?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60908216/
