python - My list of years doesn't work with BeautifulSoup. Why?

Tags: python beautifulsoup

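I'm new to BeautifulSoup. Could someone take a look at the code below? I want to scrape data from a website, but it isn't working. I want to create a DataFrame that contains the total number of players who arrived each year, plus a column with the players' average age.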

The DataFrame comes out with repeated rows: [image: dataframe error]

My code:

import pandas as pd
import requests
from bs4 import BeautifulSoup


anos_list = list(range(2005, 2018))

anos_lista = []
valor_contratos_lista = []
idade_média_lista = []

    for ano_lista in anos_list:
        url = 'https://www.transfermarkt.com/flamengo-rio-de-janeiro/transfers/verein/614/saison_id/'+ str(anos_list) + ''
        page = requests.get(url, headers={'User-Agent': 'Custom5'})
        soup = BeautifulSoup(page.text, 'html.parser')

    tag_list = soup.tfoot.find_all('td')
    valor = (tag_list[0].string)
    idade = (tag_list[1].string)
    ano = ano_lista 

    valor_contratos_lista.append(valor)
    idade_media_lista.append(idade)
    anos_lista.append(ano)


flamengo_df = pd.DataFrame({'Ano': ano_lista,
         'Despesa com contratações':valor_contratos_lista,
                        'Média de idade': idade_média_lista
                       })
flamengo_df.to_csv('flamengo.csv', encoding = 'utf-8')

Best answer
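
A note on why the loop in the question likely misbehaves, judging only from the code as pasted: the URL is built with `str(anos_list)`, which appends the text of the entire list rather than a single year, so every request targets the same malformed address. In addition, the parsing and `append` calls sit outside the `for` body, `idade_media_lista` does not match the declared `idade_média_lista`, and the DataFrame receives `ano_lista` (a single year) instead of `anos_lista`. Below is a minimal sketch of the URL part only; `BASE_URL`, `wrong_url`, and `urls` are illustrative names that do not appear in the question.

```python
# Illustrative constant: the same address the question uses, without the year.
BASE_URL = ('https://www.transfermarkt.com/flamengo-rio-de-janeiro/'
            'transfers/verein/614/saison_id/')

anos_list = list(range(2005, 2018))

# What the question's loop builds: the whole list is turned into text,
# so the URL ends in "[2005, 2006, ..., 2017]" on every iteration.
wrong_url = BASE_URL + str(anos_list)

# One URL per year: convert the loop variable, not the list.
urls = [BASE_URL + str(ano) for ano in anos_list]
```

With one URL per year, the request, the parsing, and the three `append` calls all belong inside the loop body so that each season contributes exactly one row.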

Here is my approach:

Using Beautiful Soup + regular expressions:

import requests
from bs4 import BeautifulSoup
import re
import numpy as np

# Set min and max years as variables
min_year = 2005
max_year = 2019
year_range = list(range(min_year, max_year + 1))
base_url = 'https://www.transfermarkt.com/flamengo-rio-de-janeiro/transfers/verein/614/saison_id/'

# Begin iterating
records = []
for year in year_range:

    url = base_url+str(year)

    # get the page
    page = requests.get(url, headers={'User-Agent': 'Custom5'})
    soup = BeautifulSoup(page.text, 'html.parser')

    # I used the class of "responsive table"
    tables = soup.find_all('div',{'class':'responsive-table'})
    rows = tables[0].find_all('tr')
    cells = [row.find_all('td', {'class':'zentriert'}) for row in rows]

    # get variable names:
    variables = [x.text for x in rows[0].find_all('th')]
    variables_values = {x:[] for x in variables}
    # get values
    for row in rows:
        values = [' '.join(x.text.split()) for x in row.find_all('td')]
        values = [x for x in values if x!='']

        if len(variables)< len(values):
            values.pop(4)
            values.pop(2)  
        for k,v in zip(variables_values.keys(), values):
            variables_values[k].append(v)

    num_pattern = re.compile('[0-9,]+')
    to_float = lambda x: float(x) if x!='' else np.nan  # np.NAN alias was removed in NumPy 2.0
    get_nums = lambda x: to_float(''.join(num_pattern.findall(x)).replace(',','.'))

    # Add values to an individual record
    rec = {
        'Url':url,
        'Year':year,
        'Total Transfers':len(variables_values['Player']),
        'Avg Age': np.mean([int(x) for x in variables_values['Age']]),
        'Avg Cost': np.nanmean([get_nums(x) for x in variables_values['Fee'] if ('loan' not in x)]),
        'Total Cost': np.nansum([get_nums(x) for x in variables_values['Fee'] if ('loan' not in x)]),
    }

    # Store record
    records.append(rec)

After that, initialize the DataFrame. Note that some of the figures represent millions and need to be adjusted.

import pandas as pd

# Drop the URL
df = pd.DataFrame(records, columns=['Year','Total Transfers','Avg Age','Avg Cost','Total Cost'])

    Year  Total Transfers    Avg Age    Avg Cost  Total Cost
0   2005               26  22.038462    2.000000        2.00
1   2006               32  23.906250  240.660000     1203.30
2   2007               37  22.837838  462.750000     1851.00
3   2008               41  22.926829  217.750000      871.00
4   2009               31  23.419355  175.000000      350.00
5   2010               46  23.239130  225.763333     1354.58
6   2011               47  23.042553  340.600000     1703.00
7   2012               45  24.133333  345.820000     1037.46
8   2013               36  24.166667  207.166667      621.50
9   2014               37  24.189189  111.700000      335.10
10  2015               49  23.530612  413.312000     2066.56
11  2016               41  23.341463  241.500000      966.00
12  2017               31  24.000000  101.433333      304.30
13  2018               18  25.388889  123.055000      738.33
14  2019               10  25.300000         NaN        0.00
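
The adjustment mentioned above (some fees are expressed in millions, others are not) could be sketched as follows. This assumes fee strings shaped like `2,00 Mill. €` and `500 Th. €`; those suffixes, the helper name `parse_fee`, and the `MULTIPLIERS` table are assumptions made for illustration and should be checked against the text the page actually returns.

```python
import re

# Assumed unit suffixes; verify them against the real fee strings.
MULTIPLIERS = {'mill.': 1_000_000, 'th.': 1_000}

# First number in the text, with an optional decimal part (comma or dot).
NUMBER = re.compile(r'\d+(?:[.,]\d+)?')


def parse_fee(text):
    """Return the fee in euros, or None when the text holds no number."""
    match = NUMBER.search(text)
    if match is None:
        return None
    value = float(match.group().replace(',', '.'))
    lowered = text.lower()
    for suffix, factor in MULTIPLIERS.items():
        if suffix in lowered:
            return value * factor
    # No known suffix: treat the number as plain euros.
    return value
```

Applied to the `Fee` values before averaging, a helper like this would put every amount on a single euro scale, so `Avg Cost` and `Total Cost` no longer mix thousands with millions.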


Regarding "python - My list of years doesn't work with BeautifulSoup. Why?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56726630/
