python - Scraping Wikipedia data with Python

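I am trying to retrieve three columns (NFL team, player name, college team) from the following Wikipedia page. I am new to Python and have been trying to use BeautifulSoup to get this done. I only need the rows for QBs, but I can't even get all the columns regardless of position. This is what I have so far; it outputs nothing, and I am not entirely sure why. I believe it is due to the a tags, but I don't know what to change. Any help would be greatly appreciated.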

wiki = "http://en.wikipedia.org/wiki/2008_NFL_draft"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

rnd = ""
pick = ""
NFL = ""
player = ""
pos = ""
college = ""
conf = ""
notes = ""

table = soup.find("table", { "class" : "wikitable sortable" })

#print table

#output = open('output.csv','w')

for row in table.findAll("tr"):
    cells = row.findAll("href")
    print "---"
    print cells.text
    print "---"
    #For each "tr", assign each "td" to a variable.
    #if len(cells) > 1:
        #NFL = cells[1].find(text=True)
        #player = cells[2].find(text = True)
        #pos = cells[3].find(text=True)
        #college = cells[4].find(text=True)
        #write_to_file = player + " " + NFL + " " + college + " " + pos
        #print write_to_file

    #output.write(write_to_file)

#output.close()

I know a lot of it is commented out; that is because I was trying to find where it breaks.

Best answer

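Here is what I would do:

Find the Player Selections section
Get the next wikitable using find_next_sibling()
Find all the tr tags inside it
For each row, find the td and th tags and get the desired cells by index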
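
As for why the loop in the question finds no cells: href is an attribute of a tags, not a tag name, so findAll("href") matches nothing. Calling .text on the whole list of results, rather than on each element, is a second problem. A minimal sketch of the difference, using a made-up one-row table (the HTML and variable names here are illustrative assumptions, not the real Wikipedia markup):

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table, used only to illustrate the lookup.
html = """
<table>
  <tr>
    <td><a href="/wiki/Atlanta_Falcons">Atlanta Falcons</a></td>
    <td><a href="/wiki/Matt_Ryan">Matt Ryan</a></td>
    <td>QB</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")

# "href" is not a tag name, so this search matches no elements.
no_matches = row.find_all("href")

# Search for <a> tags that carry an href attribute instead,
# then read the text and the attribute from each element in turn.
links = row.find_all("a", href=True)
link_texts = [a.get_text() for a in links]
link_targets = [a["href"] for a in links]

print(len(no_matches))
print(link_texts)
print(link_targets)
```
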

The code:

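# Reuses the `soup` object built in the question's code.
filter_position = 'QB'

# Take the parent of the span with id "Player_selections"; the table follows it as a sibling.
player_selections = soup.find('span', id='Player_selections').parent

# Walk the first wikitable after that element, skipping its header row.
for row in player_selections.find_next_sibling('table', class_='wikitable').find_all('tr')[1:]:
    cells = row.find_all(['td', 'th'])

    try:
        nfl_team, name, position, college = cells[3].text, cells[4].text, cells[5].text, cells[6].text
    except IndexError:
        # Rows with fewer than seven cells carry no player data.
        continue

    if position != filter_position:
        continue

    print nfl_team, name, position, college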
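
For readers on Python 3, the same steps can be sketched against a small inline table, so the example needs no network access. The HTML below is a simplified stand-in for the real page (an assumption: the actual Wikipedia markup is more involved), the column order is chosen to match the indices 3 to 6 used above, and find_players is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the "Player selections" section. Assumed column
# order: blank, round, pick, NFL team, player, position, college.
html = """
<h2><span id="Player_selections">Player selections</span></h2>
<table class="wikitable sortable">
  <tr><th></th><th>Rnd.</th><th>Pick</th><th>NFL team</th><th>Player</th><th>Pos.</th><th>College</th></tr>
  <tr><th></th><td>1</td><td>1</td><td>Miami Dolphins</td><td>Jake Long</td><td>OT</td><td>Michigan</td></tr>
  <tr><th></th><td>1</td><td>3</td><td>Atlanta Falcons</td><td>Matt Ryan</td><td>QB</td><td>Boston College</td></tr>
  <tr><td colspan="7">A row without player data</td></tr>
  <tr><th></th><td>1</td><td>18</td><td>Baltimore Ravens</td><td>Joe Flacco</td><td>QB</td><td>Delaware</td></tr>
</table>
"""


def find_players(soup, position):
    """Return (team, name, position, college) tuples for one position."""
    # Step 1: locate the element that contains the section anchor.
    heading = soup.find("span", id="Player_selections").parent
    # Step 2: the first wikitable that follows it as a sibling.
    table = heading.find_next_sibling("table", class_="wikitable")

    players = []
    # Step 3: every row except the header row.
    for row in table.find_all("tr")[1:]:
        # Step 4: collect td and th cells, then pick them by index.
        cells = row.find_all(["td", "th"])
        if len(cells) < 7:
            continue  # not a player row
        team, name, pos, college = (c.get_text(strip=True) for c in cells[3:7])
        if pos == position:
            players.append((team, name, pos, college))
    return players


soup = BeautifulSoup(html, "html.parser")
quarterbacks = find_players(soup, "QB")
for team, name, pos, college in quarterbacks:
    print(team, name, pos, college)
```

Checking the cell count up front plays the same role as the IndexError handler in the code above; either way, rows that are too short are skipped.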

Here is the output (filtered to quarterbacks only):

Atlanta Falcons Ryan, MattMatt Ryan† QB Boston College
Baltimore Ravens Flacco, JoeJoe Flacco QB Delaware
Green Bay Packers Brohm, BrianBrian Brohm QB Louisville
Miami Dolphins Henne, ChadChad Henne QB Michigan
New England Patriots O'Connell, KevinKevin O'Connell QB San Diego State
Minnesota Vikings Booty, John DavidJohn David Booty QB USC
Pittsburgh Steelers Dixon, DennisDennis Dixon QB Oregon
Tampa Bay Buccaneers Johnson, JoshJosh Johnson QB San Diego
New York Jets Ainge, ErikErik Ainge QB Tennessee
Washington Redskins Brennan, ColtColt Brennan QB Hawaiʻi
New York Giants Woodson, Andre'Andre' Woodson QB Kentucky
Green Bay Packers Flynn, MattMatt Flynn QB LSU
Houston Texans Brink, AlexAlex Brink QB Washington State
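
The doubled names in this output (for example "Ryan, MattMatt Ryan†") are consistent with each player cell holding a hidden sort key next to the visible link, although that is an inference from the output rather than something the answer states. Under that assumption, reading the text of the a tag inside the cell gives just the display name. A sketch with a made-up cell (the sortkey class and the display_name helper are illustrative assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical player cell: a hidden sort key, the visible link, and a marker.
cell_html = (
    '<td><span class="sortkey" style="display:none">Ryan, Matt</span>'
    '<a href="/wiki/Matt_Ryan">Matt Ryan</a>\u2020</td>'
)


def display_name(cell):
    """Prefer the link text; fall back to the full cell text if no link exists."""
    link = cell.find("a")
    if link is not None:
        return link.get_text(strip=True)
    return cell.get_text(strip=True)


cell = BeautifulSoup(cell_html, "html.parser").find("td")
raw_text = cell.get_text()       # sort key, link text, and marker run together
clean_name = display_name(cell)  # link text only

print(raw_text)
print(clean_name)
```
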

This question and answer on scraping Wikipedia data with Python come from a similar question found on Stack Overflow: https://stackoverflow.com/questions/27643738/
