Python web scraping: rows with incrementing ids

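I'm trying to scrape a page that contains a table listing each student (tr) and that student's data (td).

Each student listed in a tr has its own unique id attribute, and the id increases by 1 from one student to the next.

Example: 1234-1, 1234-2, 1234-3, and so on.

I tried to build the id by incrementing a count variable by 1. Also, the output only gives me the first td rather than all of the tds.

I'm very new to both Python and web scraping and can't figure out why this isn't working. Any help would be greatly appreciated.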

import csv
import requests
from bs4 import BeautifulSoup

url = '' # Has been left blank for a reason
response = requests.get(url)
html = response.content

count = 1

print ('-' * 30)

soup = BeautifulSoup(html, "html.parser")
table = soup.find('tr', attrs={'id': '1234-' + str(count)})

list_of_cells = []

while True:
    for cell in table.findAll('td'):
        text = cell.text.replace('\xa0', '')
        list_of_cells.append(text)
    list_of_cells.append(list_of_cells)

    student_name = list_of_cells[0]
    agent_id = list_of_cells[3].replace('-', '')

    total_hrs = list_of_cells[14]
    total_inc = list_of_cells[15]

    count += 1

    print (student_name, "| ", total_hrs, " ", total_inc)
else:
    print('Done')

Example of a tr in the table:

<tr height="17" id="1234-1" style="height:12.75pt;display:none">
  <td class="xl243045" height="17" style="height:12.75pt;border-top:none">
    <a href="48701">Student Name</a>
  </td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
</tr>

Best answer

Beautiful Soup lets you select elements by regular expression, so you could do this:

import re

# If you copy and paste this, be wary of the "-": it may not be the standard hyphen on a US keyboard. Make it match whatever is in the HTML.
# `soup` is the BeautifulSoup object built in the question's code.
students = soup.find_all("tr", id=re.compile(r'\d{4}-\d+'))
for student in students:
    cells = student.find_all("td")
    student_name = cells[0].find('a').text
    total_hrs = cells[14].text
    print("{0}|{1}".format(student_name, total_hrs))
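
To try the regular-expression approach without a live page, here is a minimal, self-contained sketch. The inline HTML, the student names, and the two-cell rows are assumptions made for illustration; the real page has many more cells per row. The pattern is anchored with `^` and `$` here so that only ids of exactly that shape match, which is a small variation on the pattern above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page has many more cells per row.
SAMPLE_HTML = """
<table>
  <tr id="header"><td>Name</td><td>Hours</td></tr>
  <tr id="1234-1"><td><a href="48701">Alice</a></td><td>10</td></tr>
  <tr id="1234-2"><td><a href="48702">Bob</a></td><td>12</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Match ids made of four digits, a hyphen, and one or more digits.
student_rows = soup.find_all("tr", id=re.compile(r"^\d{4}-\d+$"))

results = []
for row in student_rows:
    cells = row.find_all("td")
    name = cells[0].get_text(strip=True)
    hours = cells[1].get_text(strip=True)
    results.append((name, hours))

print(results)
```

Because the header row's id does not match the pattern, it is skipped without any extra checks.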

But I'd guess your table is probably just full of student rows. If so, this may make more sense and be easier to follow:

# Access the table that holds the rows, not the row itself (note the .parent)
table = soup.find('tr', attrs={'id': '1234-1'}).parent

# iterate over each of the rows (students)
for row in table.find_all("tr"):
    cells = row.find_all("td")
    student_name = cells[0].find('a').text
    total_hrs = cells[14].text
    print("{0}|{1}".format(student_name, total_hrs))
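
One caveat with iterating over every row: a header or spacer row would make `cells[0].find('a')` return `None` or make an index fail. The sketch below shows one possible guard. The sample HTML, the names, and the two-column layout are assumptions for illustration, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with a header row that has no link in it.
SAMPLE_HTML = """
<table>
  <tr><td>Name</td><td>Hours</td></tr>
  <tr id="1234-1"><td><a href="48701">Alice</a></td><td>10</td></tr>
  <tr id="1234-2"><td><a href="48702">Bob</a></td><td>12</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Start from a known row and step up to the element that contains all rows.
table = soup.find("tr", attrs={"id": "1234-1"}).parent

students = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    link = cells[0].find("a") if cells else None
    # Skip rows that do not look like student rows.
    if link is None or len(cells) < 2:
        continue
    students.append((link.get_text(strip=True), cells[1].get_text(strip=True)))

print(students)
```

On the real page the length check would need to cover the highest index you read (for example, at least 15 cells when reading `cells[14]`).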

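As an aside, relying on a student ID in the table may not be the best idea, since the set of students tends to change. It would probably be better to find something that identifies the table the students are in, rather than depending on a specific student ID inside that table.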
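
As a rough illustration of that idea, the sketch below locates the table by an attribute of the table itself. The `id="students"` attribute, the second `summary` table, and the names are all hypothetical; on the real page you would need to inspect the HTML to find whatever actually distinguishes the table (an id, a class, a caption, or its position).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables, only one of which lists students.
SAMPLE_HTML = """
<table id="summary"><tr><td>Total students</td><td>2</td></tr></table>
<table id="students">
  <tr id="1234-1"><td><a href="48701">Alice</a></td><td>10</td></tr>
  <tr id="1234-2"><td><a href="48702">Bob</a></td><td>12</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Identify the table by its own (assumed) id, not by any one student's row id.
table = soup.find("table", attrs={"id": "students"})

# The first td of each row holds the student's name.
names = [row.find("td").get_text(strip=True) for row in table.find_all("tr")]

print(names)
```

This keeps working when the first student leaves or the row ids are renumbered, as long as the table's own identifier stays the same.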

Regarding "Python web scraping: rows with incrementing ids", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44380256/
