python - Trouble merging scraped data with Pandas and numpy in Python

Tags: python pandas numpy web-scraping beautifulsoup

I'm trying to collect information from a number of different URLs and combine the data by year and golfer name. So far, I've been trying to write the information to CSV and then match rows up with pd.merge(), but that requires a unique name for every dataframe I want to merge. I tried using a numpy array, but I'm stuck on the final step of getting all the separate pieces of data merged together.

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import socket
import urllib.error
import pandas as pd
import urllib
import sqlalchemy
import numpy as np
from functools import reduce  # needed for the pd.merge() chain at the bottom

base = 'http://www.pgatour.com/'
inn = 'stats/stat'
end = '.html'
years = ['2017','2016','2015','2014','2013']

alpha = []
#all pages with links to tables
urls = ['http://www.pgatour.com/stats.html','http://www.pgatour.com/stats/categories.ROTT_INQ.html','http://www.pgatour.com/stats/categories.RAPP_INQ.html','http://www.pgatour.com/stats/categories.RARG_INQ.html','http://www.pgatour.com/stats/categories.RPUT_INQ.html','http://www.pgatour.com/stats/categories.RSCR_INQ.html','http://www.pgatour.com/stats/categories.RSTR_INQ.html','http://www.pgatour.com/stats/categories.RMNY_INQ.html','http://www.pgatour.com/stats/categories.RPTS_INQ.html']
for i in urls:
    data = urlopen(i)
    soup = BeautifulSoup(data, "html.parser")
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            alpha.append(base + link['href'][17:]) #may need adjusting
#data links
beta = []
for i in alpha:
    if inn in i:
        beta.append(i)
#no repeats
gamma = []
for i in beta:
    if i not in gamma:
        gamma.append(i)

#making list of urls with Statistic labels
jan = []
for i in gamma:
    try:
        data = urlopen(i)
        soup = BeautifulSoup(data, "html.parser")
        for table in soup.find_all('section',{'class':'module-statistics-off-the-tee-details'}):
            for j in table.find_all('h3'):
                y = j.get_text().replace(" ","").replace("-","").replace(":","").replace(">","").replace("<","").replace(")","").replace("(","").replace("=","").replace("+","")
                jan.append([i,str(y+'.csv')])
                print([i,str(y+'.csv')])
    except Exception as e:
        print(e)

# practice url
#jan = [['http://www.pgatour.com/stats/stat.02356.html', 'Last15EventsScoring.csv']]
#grabbing data
#write to csv
row_sp = []
rows_sp = []
title1 = []
title = []
for i in jan:
    try:
        with open(i[1], 'w+') as fp:
            writer = csv.writer(fp)
            for y in years:
                data = urlopen(i[0][:-4] + y + end)
                soup = BeautifulSoup(data, "html.parser")
                data1 = urlopen(i[0])
                soup1 = BeautifulSoup(data1, "html.parser")
                for table in soup1.find_all('table',{'id':'statsTable'}):
                    title.append('year')
                    for k in table.find_all('tr'):
                        for n in k.find_all('th'):
                            title1.append(n.get_text())
                            for l in title1:
                                if l not in title:
                                    title.append(l)
                    rows_sp.append(title)
                for table in soup.find_all('table',{'id':'statsTable'}):
                    for h in table.find_all('tr'):
                        row_sp = [y]
                        for j in h.find_all('td'):
                            row_sp.append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))  # note: replace("d","") also strips 'd' from names (e.g. 'Adam' -> 'Aam')
                        rows_sp.append(row_sp)
                        print(row_sp)
                        writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass

dfs = [df1, df2, df3]  # placeholder dataframe names; store dataframes in one list
df_merge = reduce(lambda left, right: pd.merge(left, right, on=['v1'], how='outer'), dfs)
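
For reference, the reduce pattern itself works once the dataframes exist; here's a minimal self-contained sketch (the toy dataframes and the key column 'v1' are placeholders, as in the snippet above):

from functools import reduce
import pandas as pd

# toy stand-ins for df1, df2, df3, all sharing the key column 'v1'
df1 = pd.DataFrame({'v1': [1, 2], 'a': ['x', 'y']})
df2 = pd.DataFrame({'v1': [1, 2], 'b': [10, 20]})
df3 = pd.DataFrame({'v1': [2, 3], 'c': [0.5, 0.7]})

dfs = [df1, df2, df3]
df_merge = reduce(lambda left, right: pd.merge(left, right, on=['v1'], how='outer'), dfs)
#    v1    a     b    c
# 0   1    x  10.0  NaN
# 1   2    y  20.0  0.5
# 2   3  NaN   NaN  0.7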

URL, statistic type, desired format
...and just everything in between
Trying to get the data onto one row:
URLs for the data below: ['http://www.pgatour.com/stats/stat.02356.html', 'http://www.pgatour.com/stats/stat.02568.html', ..., 'http://www.pgatour.com/stats/stat.111.html']

Statistic titles:
LAST 15 EVENTS - SCORING, SG: APPROACH-THE-GREEN, ..., SAND SAVE PERCENTAGE
year rankthisweek  ranklastweek   name         events   rating    rounds avg
2017 2             3             Rickie Fowler  10      8.8       62    .614
TOTAL SG:APP   MEASURED ROUNDS   .... %     # SAVES    # BUNKERS    TOTAL O/U PAR
26.386         43                ....70.37    76           108          +7.00

Best Answer

Update (per the comments)
This question is partly about a technical method (Pandas merge()), but it also seems like an opportunity to discuss a useful workflow for data collection and cleaning. As a result, I've added rather more detail and explanation than a coding solution strictly requires.

You can basically use the same approach as my original answer to get data from the different URL categories. I'd suggest keeping a dict of {url: data} pairs as you iterate over your URL list, and then building cleaned dataframes from that dict.

Setting up the cleaning involves a bit of drudgery, since you need to adjust for the different columns in each URL category. I've demonstrated a manual approach below, using only a few test URLs. But if you have thousands of distinct URL categories, you may need to think about how to collect and organize the column names programmatically (see the sketch after the cols dict below). That feels out of scope for this OP.

As long as you're sure that year and PLAYER NAME fields appear in every URL, the following merge should work. As before, let's hold off on any optimization of the scraping code, and assume for now that you don't need to write to CSV:

First, define the URL categories in urls. By a URL category, I mean that http://www.pgatour.com/stats/stat.02356.html will actually be used multiple times by inserting a series of years into the URL itself, e.g.: http://www.pgatour.com/stats/stat.02356.2017.html, http://www.pgatour.com/stats/stat.02356.2016.html. In this example, stat.02356.html is the URL category containing player data for multiple years.
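As a quick illustration of that naming scheme (a minimal sketch; the slicing mirrors the url[:-4] + y + end pattern used in the loop below):

# expand one URL category into its per-year URLs
category = 'http://www.pgatour.com/stats/stat.02356.html'
years = ['2017', '2016', '2015', '2014', '2013']
yearly_urls = [category[:-4] + y + '.html' for y in years]
# ['http://www.pgatour.com/stats/stat.02356.2017.html',
#  'http://www.pgatour.com/stats/stat.02356.2016.html', ...]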

import pandas as pd

# test urls given by OP
# note: each url contains >= 1 data fields not shared by the others
urls = ['http://www.pgatour.com/stats/stat.02356.html',
        'http://www.pgatour.com/stats/stat.02568.html',
        'http://www.pgatour.com/stats/stat.111.html']

# we'll store data from each url category in this dict.
url_data = {}

Now iterate over urls. Inside the urls loop, this code is exactly the same as in my original answer, which in turn came from the OP - only some variable names have been adjusted to reflect our new capture method.
for url in urls:
    print("url: ", url)
    url_data[url] = {"row_sp": [],
                     "rows_sp": [],
                     "title1": [],
                     "title": []}
    try:
        #with open(i[1], 'w+') as fp:
            #writer = csv.writer(fp)
        for y in years:
            current_url = url[:-4] + y + end
            print("current url is: ", current_url)
            data = urlopen(current_url)
            soup = BeautifulSoup(data, "html.parser")
            data1 = urlopen(url)
            soup1 = BeautifulSoup(data1, "html.parser")
            for table in soup1.find_all('table',{'id':'statsTable'}):
                url_data[url]["title"].append('year')
                for k in table.find_all('tr'):
                    for n in k.find_all('th'):
                        url_data[url]["title1"].append(n.get_text())
                        for l in url_data[url]["title1"]:
                            if l not in url_data[url]["title"]:
                                url_data[url]["title"].append(l)
                url_data[url]["rows_sp"].append(url_data[url]["title"])
            for table in soup.find_all('table',{'id':'statsTable'}):
                for h in table.find_all('tr'):
                    url_data[url]["row_sp"] = [y]
                    for j in h.find_all('td'):
                        url_data[url]["row_sp"].append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                    url_data[url]["rows_sp"].append(url_data[url]["row_sp"])
                    #print(row_sp)
                    #writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass

Now, for each key url in url_data, rows_sp contains the data of interest for that particular URL category.
Note that rows_sp is now actually url_data[url]["rows_sp"] as we iterate over url_data, but the next few code blocks come from my original answer and so use the old rows_sp variable name.
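For example, the captured rows for one of the test categories can be pulled out like this (using url_data from above):

rows_sp = url_data['http://www.pgatour.com/stats/stat.02356.html']["rows_sp"]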
# example rows_sp
[['year',
  'RANK THIS WEEK',
  'RANK LAST WEEK',
  'PLAYER NAME',
  'EVENTS',
  'RATING',
  'year',
  'year',
  'year',
  'year'],
 ['2017'],
 ['2017', '1', '1', 'Sam Burns', '1', '9.2'],
 ['2017', '2', '3', 'Rickie Fowler', '10', '8.8'],
 ['2017', '2', '2', 'Dustin Johnson', '10', '8.8'],
 ['2017', '2', '3', 'Whee Kim', '2', '8.8'],
 ['2017', '2', '3', 'Thomas Pieters', '3', '8.8'],
 ...
]

Writing rows_sp directly to a dataframe shows that the data aren't in exactly the right format:
pd.DataFrame(rows_sp).head()
      0               1               2               3       4       5     6  \
0  year  RANK THIS WEEK  RANK LAST WEEK     PLAYER NAME  EVENTS  RATING  year   
1  2017            None            None            None    None    None  None   
2  2017               1               1       Sam Burns       1     9.2  None   
3  2017               2               3   Rickie Fowler      10     8.8  None   
4  2017               2               2  Dustin Johnson      10     8.8  None   

      7     8     9  
0  year  year  year  
1  None  None  None  
2  None  None  None  
3  None  None  None  
4  None  None  None  

pd.DataFrame(rows_sp).dtypes
0    object
1    object
2    object
3    object
4    object
5    object
6    object
7    object
8    object
9    object
dtype: object

With a little cleanup, we can get rows_sp into a dataframe with proper numeric dtypes:
df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
df.columns = ["year","RANK THIS WEEK","RANK LAST WEEK",
              "PLAYER NAME","EVENTS","RATING",
              "year1","year2","year3","year4"]
df.drop(["year1","year2","year3","year4"], 1, inplace=True)
df = df.loc[df["PLAYER NAME"].notnull()]
df = df.loc[df.year != "year"]
num_cols = ["RANK THIS WEEK","RANK LAST WEEK","EVENTS","RATING"]
df[num_cols] = df[num_cols].apply(pd.to_numeric)

df.head()
   year  RANK THIS WEEK  RANK LAST WEEK     PLAYER NAME  EVENTS  RATING
2  2017               1             1.0       Sam Burns       1     9.2
3  2017               2             3.0   Rickie Fowler      10     8.8
4  2017               2             2.0  Dustin Johnson      10     8.8
5  2017               2             3.0        Whee Kim       2     8.8
6  2017               2             3.0  Thomas Pieters       3     8.8

Updated cleaning
Now that we have a series of URL categories to handle, each with a different set of fields to clean, the section above gets a little more complicated. If you only have a few pages, it may be feasible to just eyeball the fields for each category and store them manually, like this:
cols = {'stat.02568.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK', 
                                      'PLAYER NAME', 'ROUNDS', 'AVERAGE', 
                                      'TOTAL SG:APP', 'MEASURED ROUNDS', 
                                      'year1', 'year2', 'year3', 'year4'],
                           'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS', 
                                      'AVERAGE', 'TOTAL SG:APP', 'MEASURED ROUNDS',]
                          },
        'stat.111.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK', 
                                    'PLAYER NAME', 'ROUNDS', '%', '# SAVES', '# BUNKERS', 
                                    'TOTAL O/U PAR', 'year1', 'year2', 'year3', 'year4'],
                         'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
                                   '%', '# SAVES', '# BUNKERS', 'TOTAL O/U PAR']
                        },
        'stat.02356.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
                                      'PLAYER NAME', 'EVENTS', 'RATING', 
                                      'year1', 'year2', 'year3', 'year4'],
                           'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 
                                      'EVENTS', 'RATING']
                          }
       }
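
If there are too many categories to enumerate by hand, one option is to derive each cols entry from the header row already captured in url_data. This is a hedged sketch, not part of the original answer; in particular, the treat-everything-else-as-numeric heuristic is my assumption and may need adjusting per category:

def build_cols(url_data):
    """Derive the cols mapping from the scraped header rows."""
    cols = {}
    for url, data in url_data.items():
        page = url.split("/")[-1]
        header = data["rows_sp"][0]  # e.g. ['year', 'RANK THIS WEEK', ..., 'year', 'year']
        names, extras = [], 0
        for name in header:
            if name == 'year' and names:  # rename the repeated trailing 'year' headers
                extras += 1
                names.append('year%d' % extras)
            else:
                names.append(name)
        # heuristic: everything except the merge keys and the year extras is numeric
        numeric = [n for n in names if n != 'PLAYER NAME' and not n.startswith('year')]
        cols[page] = {'columns': names, 'numeric': numeric}
    return cols

cols = build_cols(url_data)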

Then you can loop over url_data again and store the results in a dfs collection:
dfs = {}

for url in url_data:
    page = url.split("/")[-1]
    colnames = cols[page]["columns"]
    num_cols = cols[page]["numeric"]
    rows_sp = url_data[url]["rows_sp"]
    df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
    df.columns = colnames
    df.drop(["year1","year2","year3","year4"], 1, inplace=True)
    df = df.loc[df["PLAYER NAME"].notnull()]
    df = df.loc[df.year != "year"]
    # tied ranks (e.g. "T9") mess up to_numeric; remove the tie indicators.
    df["RANK THIS WEEK"] = df["RANK THIS WEEK"].str.replace("T","")
    df["RANK LAST WEEK"] = df["RANK LAST WEEK"].str.replace("T","")
    df[num_cols] = df[num_cols].apply(pd.to_numeric)
    dfs[url] = df 

At this point, we're ready to merge all the different data categories on year and PLAYER NAME. (You could actually have merged iteratively inside the cleaning loop, but I've separated them here for illustrative purposes.)
master = pd.DataFrame()
for url in dfs:
    if master.empty:
        master = dfs[url]
    else:
        master = master.merge(dfs[url], on=['year','PLAYER NAME'])
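
One caveat (my note, not part of the original answer): merge() defaults to an inner join, so a player-year that is missing from any single category silently drops out of master. If you'd rather keep partial records, swap an outer join into the loop:

        master = master.merge(dfs[url], on=['year', 'PLAYER NAME'], how='outer')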

Now master contains the merged data for each player-year. Here's a view of the data using groupby():
master.groupby(["PLAYER NAME", "year"]).first().head(4)
                  RANK THIS WEEK_x  RANK LAST WEEK_x  EVENTS  RATING  \
PLAYER NAME year                                                       
Aam Hawin   2015                66              66.0       7     8.2   
            2016                80              80.0      12     8.1   
            2017                72              45.0       8     8.2   
Aam Scott   2013                45              45.0      10     8.2   

                  RANK THIS WEEK_y  RANK LAST WEEK_y  ROUNDS_x  AVERAGE  \
PLAYER NAME year                                                          
Aam Hawin   2015               136               136        95   -0.183   
            2016               122               122        93   -0.061   
            2017                56                52        84    0.296   
Aam Scott   2013                16                16        61    0.548   

                  TOTAL SG:APP  MEASURED ROUNDS  RANK THIS WEEK  \
PLAYER NAME year                                                  
Aam Hawin   2015       -14.805               81              86   
            2016        -5.285               87              39   
            2017        18.067               61               8   
Aam Scott   2013        24.125               44              57   

                  RANK LAST WEEK  ROUNDS_y      %  # SAVES  # BUNKERS  \
PLAYER NAME year                                                        
Aam Hawin   2015              86        95  50.96       80        157   
            2016              39        93  54.78       86        157   
            2017               6        84  61.90       91        147   
Aam Scott   2013              57        61  53.85       49         91   

                  TOTAL O/U PAR  
PLAYER NAME year                 
Aam Hawin   2015           47.0  
            2016           43.0  
            2017           27.0  
Aam Scott   2013           11.0  

You may want to do some further cleanup on the merged columns, since some fields are duplicated across the data categories (e.g. ROUNDS_x and ROUNDS_y). As far as I can tell, the duplicated field names seem to contain exactly the same information, so you could just drop the _y version of each one.
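
A minimal sketch of that last cleanup step, assuming the _x/_y pairs really are redundant:

# drop the _y duplicates and strip the _x suffix from the survivors
dupes = [c for c in master.columns if c.endswith('_y')]
master = master.drop(columns=dupes)
master.columns = [c[:-2] if c.endswith('_x') else c for c in master.columns]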

This question and answer on merging scraped data with Pandas and numpy in Python originally appeared on Stack Overflow: https://stackoverflow.com/questions/45697765/
