python - Looping through table rows with BeautifulSoup

Tags: python loops for-loop beautifulsoup tablerow

I need help looping through table rows and putting them into lists. On this page there are three tables, each with different statistics - http://www.fangraphs.com/statsplits.aspx?playerid=15640&position=OF&season=0&split=0.4

For example, each of the three tables contains a 2016 row, a 2017 row, and a Total row. I would like the following:

The first list  --> Table 1 - Row 1, Table 2 - Row 1, Table 3 - Row 1
The second list --> Table 1 - Row 2, Table 2 - Row 2, Table 3 - Row 2
The third list  --> Table 1 - Row 3, Table 2 - Row 3, Table 3 - Row 3

I know I obviously need to create the lists and use append; however, I am not sure how to make it loop through the first row of each table, then the second row of each table, and so on through every row (the number of rows will vary from case to case - this player happens to have 3).
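For illustration, once each table's rows have been collected into separate lists, the built-in zip function groups them row by row into exactly this structure (a minimal sketch; the variable names and placeholder rows here are hypothetical):

# Hypothetical placeholder rows; the real rows would hold the scraped cell text
table1_rows = [['2016', ...], ['2017', ...], ['Total', ...]]
table2_rows = [['2016', ...], ['2017', ...], ['Total', ...]]
table3_rows = [['2016', ...], ['2017', ...], ['Total', ...]]

# zip pairs up the i-th row of every table; list() turns each group back into a list
combined = [list(group) for group in zip(table1_rows, table2_rows, table3_rows)]
# combined[0] -> [Table 1 - Row 1, Table 2 - Row 1, Table 3 - Row 1]
# combined[1] -> [Table 1 - Row 2, Table 2 - Row 2, Table 3 - Row 2]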

Any help is greatly appreciated. The code is below:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

idList2 = ['15640', '9256']
splitList=[0.4,0.2,0.3,0.4]
for id in idList2:
    pos = 'OF'
    for split in splitList:
        url = 'http://www.fangraphs.com/statsplits.aspx?playerid=' + \
            str(id) + '&position=' + str(pos) + '&season=0&split=' + \
            str(split) + ''
        r = requests.get(url)

        for season in range(1,4):
            print(season)
            soup = BeautifulSoup(r.text, "html.parser")
            tableStats = soup.find("table", {"id" :  "SeasonSplits1_dgSeason" + str(season) + "_ctl00"})
            column_headers = [th.getText() for th in tableStats.findAll('th')]
            statistics = soup.find("table", {"id": "SeasonSplits1_dgSeason" + str(season) + "_ctl00"})
            tabledata = [td.getText() for td in statistics('td')]                         
            print(tabledata)

Best Answer

This will be my last attempt. It has everything you need. I have left a trail of markers showing where the table, the rows, and the columns are scraped. It all happens in the function extract_table(). Follow the trail markers and don't worry about any other code. Don't let the file size put you off; it is mostly documentation and spacing.

Trail markers: ### ... ###

Start at line 95, at the trail marker ### START HERE ###

from bs4 import BeautifulSoup as Soup
import requests
from urllib.parse import urlencode


###### GO TO LINE 95 ######


### IGNORE ###
def generate_urls (idList, splitList):
    """ Using and id list and a split list generate a list urls"""
    urls = []
    url = 'http://www.fangraphs.com/statsplits.aspx'

    for id in idList:
        for split in splitList:
            # The parameters used in creating the url
            url_payload = {'split': split, 'playerid': id, 'position': 'OF', 'season': 0}
            # Create the url and add it to the collection of urls
            urls += ['?'.join([url, urlencode(url_payload)])]
    return urls # Return the list of urls




### IGNORE ###
def extract_player_name (soup):
    """ Extract the player name from the browser title """
    # Browser title contains player name, strip all but name
    player_name = soup.title.text.strip('\r\n\t ')
    player_name = player_name.split(' \xbb')[0] # Split on ` »` (the \xbb character)
    return player_name



########## FINISH HERE ##########
def extract_table (table_id, soup):
    """ Extract data from a table, return the column headers and the table rows"""

    ### IMPORTANT: THIS CODE IS WHERE ALL THE MAGIC HAPPENS ### 
    # - First: Find lowest level tag of all the data we want (container).
    #
    # - Second: Extract the table column headers, requires minimal mining
    #
    # - Third: Gather a list of tags that represent the table's rows
    #
    # - Fourth: Loop through the list of rows 
    #      A): Mine all columns in the row

    ### IMPORTANT: Get A Reference To The Table ###
    # SCRAPE 1:
    table_tag = soup.find("table", {"id" : 'SeasonSplits1_dgSeason%d_ctl00' % table_id})            

    # SCRAPE 2: 
    columns = [th.text for th in table_tag.findAll('th')]

    # SCRAPE 3: 
    rows_tags = table_tag.tbody.findAll('tr') # All 'tr' tags in the table `tbody` tag are row tags

    ### IMPORTANT: Cycle Through Rows And Collect Column Data ###
    # SCRAPE 4:
    rows = [] # List of all table rows
    for row_tag in rows_tags:

        ### IMPORTANT: Mine All Columns In This Row || LOWEST LEVEL IN THE MINING OPERATION. ###
        # SCRAPE 4.A
        row = [col.text for col in row_tag.findAll('td')] # `td` represents a column in a row.

        rows.append (row) # Add this row to all the other rows of this table  

    # RETURN: The column header and the rows of this table
    return [columns, rows]



### Look Deeper ###
def extract_player (soup):
    """ Extract player data and store in a list. ['name', [columns, rows], [table2]]"""
    player = [] # A list store data in

    # player name is first in player list
    player.append (extract_player_name (soup))

    # Each table is a list entry
    for season in range(1,4): 
        ### IMPORTANT: No Table Related Data Has Been Mined Yet. START HERE ###
        ###     - Line: 37
        table = extract_table (season, soup) # `season` represents the table id 
        player.append(table) # Add this table (a list) to the player data list

    # Return the player list    
    return player


##################################################
################## START HERE ####################
##################################################
###
### OBJECTIVE: 
###
### - Follow the trail of important lines that extract the data
###     - Important lines will be marked as the following `### ... ###`
### 
### All this code really needs is a url and the `extract_table()` function.
###
### The `main()` function is where the journey starts
###
##################################################
##################################################



def main ():
    """ The main function is the core program code. """

    # Luckily the pages we will scrape all have the same layout making mining easier.    

    all_players = [] # A place to store all the data

    # Values used to alter the url when making requests to access more player statistics
    idList2 = ['15640', '9256']
    splitList=[0.4,0.2,0.3,0.4]

    # Instead of looping through variables that don't tell a story,
    # lets create a list of urls generated from those variables.
    # This way the code is self-explanatory and is human-readable.
    urls = generate_urls(idList2, splitList) # The creation of the url is not important right now

    # Lets scrape each url
    for url in urls:
        print(url)

        # First Step: get a web page via http request.
        response = requests.get (url)

        # Second step: use a parsing library to create a parsable object 
        soup = Soup(response.text, "html.parser") # Create a soup object (Once)

        ### IMPORTANT: Parsing Starts and Ends Here ###
        ###     - Line: 75
        # Final Step: Given a soup object, mine player data
        player = extract_player (soup)

        # Add the new entry to the list
        all_players += [player]

    return all_players





# If this script is being run, not imported, run the `main()` function.
if __name__ == '__main__':
    all_players = main ()

    print(all_players[0][0]) # Player List -> Name
    print(all_players[0][1]) # Player List -> Table 1
    print(all_players[0][2]) # Player List -> Table 2
    print(all_players[0][3]) # Player List -> Table 3

    print(all_players[0][3][0])       # Player List -> Table 3 -> Columns
    print(all_players[0][3][1])       # Player List -> Table 3 -> All Rows
    print(all_players[0][3][1][0])    # Player List -> Table 3 -> All Rows -> Row 1
    print(all_players[0][3][1][2])    # Player List -> Table 3 -> All Rows -> Row 3
    print(all_players[0][3][1][2][0]) # Player List -> Table 3 -> All Rows -> Row 3 -> Column 1
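
To get the row-wise grouping asked for in the question (Row 1 of every table together, then Row 2, and so on), the three tables returned for a single player can be zipped together. A minimal sketch, assuming the all_players structure built by main() above:

# Sketch: regroup one player's tables row by row.
# player[0] is the name; player[1], player[2] and player[3] are [columns, rows] pairs.
player = all_players[0]
row_lists = [table[1] for table in player[1:]] # the `rows` list of each table

# zip stops at the shortest table, so differing row counts are handled safely
for group in zip(*row_lists):
    # group is (Table 1 - Row i, Table 2 - Row i, Table 3 - Row i)
    print(list(group))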

On the topic of python - Looping through table rows with BeautifulSoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44686892/
