python - XPath - Extracting table data with an irregular pattern

Tags: python xpath

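Extending an existing question and answer here, I am trying to extract each player's name together with his position. The desired output looks like this: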

playername, position
EJ Manuel, Quarterbacks
Tyrod Taylor, Quarterbacks
Anthony Dixon, Running backs
...

This is what I have so far:

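import requests
from lxml import html

# Download the page and parse it into an lxml element tree
tree = html.fromstring(requests.get("https://en.wikipedia.org/wiki/List_of_current_AFC_team_rosters").text)

# The second row of each roster table holds every position and player
for row in tree.xpath("//table[@class='toccolours']//tr[2]"):
    position = row.xpath(".//b/text()")
    players = row.xpath(".//ul/li/a/text()")
    print(position, players)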

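The code above produces the following, which is not the format I need: each table yields one flat list of all positions and one flat list of all players, so the link between a player and his position is lost.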

(['Quarterbacks', 'Running backs', 'Wide receivers', 'Tight ends', 'Offensive linemen', 'Defensive linemen', 'Linebackers', 'Defensive backs', 'Special teams', 'Reserve lists', 'Unrestricted FAs', 'Restricted FAs', 'Exclusive-Rights FAs'], ['EJ Manuel', 'Tyrod Taylor', 'Anthony Dixon', 'Jerome Felton', 'Mike Gillislee', 'LeSean McCoy', 'Karlos Williams', 'Leonard Hankerson', 'Marcus Easley', 'Marquise Goodwin', 'Percy Harvin', 'Dez Lewis', 'Walt Powell', 'Greg Salas', 'Sammy Watkins', 'Robert Woods', 'Charles Clay', 'Chris Gragg', "Nick O'Leary", 'Tyson Chandler', 'Ryan Groy', 'Seantrel Henderson', 'Cyrus Kouandjio', 'John Miller', 'Kraig Urbik', 'Eric Wood', 'T. J. Barnes', 'Marcell Dareus', 'Lavar Edwards', 'IK Enemkpali', 'Jerry Hughes', 'Kyle Williams', 'Mario Williams', 'Jerel Worthy', 'Jarius Wynn', 'Preston Brown', 'Randell Johnson', 'Manny Lawson', 'Kevin Reddick', 'Tony Steward', 'A. J. Tarpley', 'Max Valles', 'Mario Butler', 'Ronald Darby', 'Stephon Gilmore', 'Corey Graham', 'Leodis McKelvin', 'Jonathan Meeks', 'Merrill Noel', 'Nickell Robey', 'Sammy Seamster', 'Cam Thomas', 'Aaron Williams', 'Duke Williams', 'Dan Carpenter', 'Jordan Gay', 'Garrison Sanborn', 'Colton Schmidt', 'Blake Annen', 'Jarrett Boykin', 'Jonathan Dowling', 'Greg Little', 'Jacob Maxwell', 'Ronald Patrick', 'Cedric Reed', 'Cyril Richardson', 'Phillip Thomas', 'James Wilder, Jr.', 'Nigel Bradham', 'Ron Brooks', 'Alex Carrington', 'Cordy Glenn', 'Leonard Hankerson', 'Richie Incognito', 'Josh Johnson', 'Corbin Bryant', 'Stefan Charles', 'MarQueis Gray', 'Chris Hogan', 'Jordan Mills', 'Ty Powell', 'Bacarri Rambo', 'Cierre Wood'])
(['Quarterbacks', 'Running backs', 'Wide receivers', 'Tight ends', 'Offensive linemen', 'Defensive linemen', 'Linebackers', 'Defensive backs', 'Special teams', 'Reserve lists', 'Unrestricted FAs', 'Restricted FAs', 'Exclusive-Rights FAs'], ['Zac Dysert', 'Ryan Tannehill', 'Logan Thomas', 'Jay Ajayi', 'Jahwan Edwards', 'Damien Williams', 'Tyler Davis', 'Robert Herron', 'Greg Jennings', 'Jarvis Landry', 'DeVante Parker', 'Kenny Stills', 'Jordan Cameron', 'Dominique Jones', 'Dion Sims', 'Branden Albert', 'Jamil Douglas', "Ja'Wuan James", 'Vinston Painter', 'Mike Pouncey', 'Anthony Steen', 'Dallas Thomas', 'Billy Turner', 'Deandre Coleman', 'Quinton Coples', 'Terrence Fede', 'Dion Jordan', 'Earl Mitchell', 'Damontre Moore', 'Jordan Phillips', 'Ndamukong Suh', 'Charles Tuaau', 'Robert Thomas', 'Cameron Wake', 'Julius Warmsley', 'Jordan Williams', 'Neville Hewitt', 'Mike Hull', 'Jelani Jenkins', 'Terrell Manning', 'Chris McCain', 'Koa Misi', 'Zach Vigil', 'Walt Aikens', 'Damarr Aultman', 'Brent Grimes', 'Reshad Jones', 'Tony Lippett', 'Bobby McCain', 'Brice McCain', 'Tyler Patmon', 'Dax Swanson', 'Jamar Taylor', 'Matt Darr', 'John Denney', 'Andrew Franks', 'Louis Delmas', 'James-Michael Johnson', 'Rishard Matthews', 'Jacques McClendon', 'Lamar Miller', 'Matt Moore', 'Spencer Paysinger', 'Derrick Shelby', 'Kelvin Sheppard', 'Shelley Smith', 'Olivier Vernon', 'Michael Thomas', 'Brandon Williams', 'Shamiel Gary', 'Matt Hazel', 'Ulrick John', 'Jake Stoneburner'])
... 

Any suggestions?


Best answer

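You can do this with nested loops. Loop through the positions first, then, for each position, loop through the players that belong to it. The following::ul[1] step selects the first list that comes after the current position heading, and a[not(*)] keeps only links that have no child elements: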

# Loop through the positions
for b in tree.xpath("//table[@class='toccolours']//tr[2]//b"):
    # Get the text of the current position
    position = b.xpath("text()")[0]
    # Get the players that correspond to the current position
    for a in b.xpath("following::ul[1]/li/a[not(*)]"):
        # Get the text of the current player
        player = a.xpath("text()")[0]
        # Print the current position and player together
        print(position, player)

The last part of the output:

.....
('Reserve lists', 'Chris Watt')
('Reserve lists', 'Eric Weddle')
('Reserve lists', 'Tourek Williams')
('Practice squad', 'Alex Bayer')
('Practice squad', 'Isaiah Burse')
('Practice squad', 'Richard Crawford')
('Practice squad', 'Ben Gardner')
('Practice squad', 'Michael Huey')
('Practice squad', 'Keith Lewis')
('Practice squad', 'Chuka Ndulue')
('Practice squad', 'Tim Semisch')
('Practice squad', 'Brad Sorensen')
('Practice squad', 'Craig Watts')
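
To try the nested-loop approach without hitting the network, here is a minimal sketch that runs it against a small inline HTML snippet and prints the "playername, position" layout the question asks for. The SAMPLE markup, the image link inside it, and the helper name position_player_pairs are assumptions made for illustration; they only imitate the shape the XPath expressions expect and are not copied from the live Wikipedia page.

```python
from lxml import html

# Hypothetical, simplified stand-in for one roster table (not real page markup).
SAMPLE = """
<html><body>
<table class="toccolours">
  <tr><th>Example roster</th></tr>
  <tr>
    <td>
      <p><b>Quarterbacks</b></p>
      <ul>
        <li><a href="/wiki/EJ_Manuel">EJ Manuel</a></li>
        <li><a href="/wiki/Tyrod_Taylor">Tyrod Taylor</a></li>
      </ul>
      <p><b>Running backs</b></p>
      <ul>
        <li>
          <a href="/wiki/File:Icon.png"><img src="icon.png"/></a>
          <a href="/wiki/Anthony_Dixon">Anthony Dixon</a>
        </li>
      </ul>
    </td>
  </tr>
</table>
</body></html>
"""


def position_player_pairs(tree):
    """Return (player, position) tuples in document order."""
    pairs = []
    # Outer loop: every bold position heading in the second table row
    for b in tree.xpath("//table[@class='toccolours']//tr[2]//b"):
        position = b.xpath("string(.)").strip()
        # Inner loop: links in the first <ul> after this heading;
        # a[not(*)] skips links that wrap an element such as an image
        for a in b.xpath("following::ul[1]/li/a[not(*)]"):
            pairs.append((a.xpath("string(.)").strip(), position))
    return pairs


rows = position_player_pairs(html.fromstring(SAMPLE))

print("playername, position")
for player, position in rows:
    print(f"{player}, {position}")
```

Collecting the pairs in a list rather than printing inside the loop makes it straightforward to hand them to csv.writer or another consumer later.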

Regarding "python - XPath - Extracting table data with an irregular pattern", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34951156/
