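python - Parsing an HTML table with BeautifulSoup

I am trying to get the data for a given day from this timetable: click here

Using Beautiful Soup, I have been able to add the whole row for any given day (in this case Monday, or 'Mon') to a list with the following code: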

from BeautifulSoup import BeautifulSoup

day ='Mon'

with open('timetable.txt', 'rt') as input_file:
  html = input_file.read()
  soup = BeautifulSoup(html)
  #finds correct day tag
  starttag = soup.find(text=day).parent.parent
  print starttag
  nexttag = starttag
  row=[]
  x = 0
  #puts all td tags for that day in a list
  while x < 18:
    nexttag = nexttag.nextSibling.nextSibling
    row.append(nexttag)
    x += 1
print row

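As you can see, this returns a list of TD tags that make up the 'Mon' row of the timetable.

My problem is that I don't know how to parse or search the returned list any further to find the relevant information (COMP1740 etc.).

If I can work out how to search each element of the list for the module codes, I can then join them with another list of times to give a timetable for the day.

All help is welcome! (Including completely different approaches.)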

Best answer

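You can use regular expressions (i.e. pattern matching) to find information such as the class numbers.

I don't know how much experience you have with them, but Python includes an "re" module. The pattern you are looking for is "the four letters C-O-M-P followed by one or more digits". That gives the regular expression COMP\d+, where \d matches a single digit and the + after it means "one or more of the preceding item", matching as many as it can (four digits in this case).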
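To see what the pattern does on its own, here is a minimal sketch run against a few made-up strings (the cell text below is hypothetical, not taken from the real timetable):

```python
import re

# Same pattern as described above: the letters COMP followed by one or more digits.
code_pat = re.compile(r'COMP\d+')

# Hypothetical cell content, only for illustration.
sample = '<td>COMP1740 Lecture, Room 1.01</td>'

# search() returns the first match anywhere in the string, or None.
match = code_pat.search(sample)
print(match.group(0))

# findall() returns every non-overlapping match as a list of strings.
print(code_pat.findall('COMP1940 and COMP1550'))

# No digits after COMP, so there is no match here.
print(code_pat.search('COMPUTER'))
```

search() is enough when each cell holds at most one code; findall() is the safer choice if a cell might list several.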

from BeautifulSoup import BeautifulSoup
import re

day ='Mon'
codePat = re.compile(r'COMP\d+')

with open('timetable.txt', 'rt') as input_file:
  html = input_file.read()
  soup = BeautifulSoup(html)
  #finds correct day tag
  starttag = soup.find(text=day).parent.parent
#  print starttag
  nexttag = starttag
  row=[]
  x = 0
  #puts all td tags for that day in a list
  while x < 18:
    nexttag = nexttag.nextSibling.nextSibling
    found = codePat.search(repr(nexttag))
    if found:
      row.append(found.group(0))
    x += 1
print row

This gives me the output:

['COMP1940', 'COMP1550', 'COMP1740']
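Note that this list only holds the cells that contained a code, so the position of each code within the day is lost. If the codes are to be joined with a list of times, as the question describes, one option is to walk the cells and the times together. A minimal sketch, assuming the text of each cell has already been extracted; the cell texts, the times, and the function name schedule are all hypothetical:

```python
import re

CODE_PAT = re.compile(r'COMP\d+')

# Hypothetical data: the text of each cell in the 'Mon' row, in column order,
# and the start time of each column. Neither comes from the real timetable.
cell_texts = ['COMP1940 Lecture', '', '', 'COMP1550 Lab', '', 'COMP1740 Lecture']
times = ['09:00', '10:00', '11:00', '12:00', '13:00', '14:00']

def schedule(cell_texts, times):
    """Pair each time slot with the module code in that cell, skipping empty slots."""
    result = []
    for time, text in zip(times, cell_texts):
        found = CODE_PAT.search(text)
        if found:
            result.append((time, found.group(0)))
    return result

print(schedule(cell_texts, times))
```

This assumes one cell per time slot. If the real table uses colspan for sessions that run longer than one slot, the cells and times would no longer line up one to one and the pairing would need adjusting.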

Like I said, I don't know how familiar you are with regular expressions, so if you can describe the patterns you need, I can try to write them. Here's a good resource if you decide to do it yourself.

Source: this question on Stack Overflow: https://stackoverflow.com/questions/8370389/
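The code above is written for Python 2 and BeautifulSoup 3 (the "from BeautifulSoup import" line and the print statements). As a rough sketch of the same idea on Python 3 with the bs4 package, the following finds the row whose first cell is the day name and pulls the codes out of that row's text. The sample HTML and the function name codes_for_day are made up for illustration; the real timetable's markup may well differ, so treat the row-matching logic as an assumption.

```python
import re
from bs4 import BeautifulSoup

CODE_PAT = re.compile(r'COMP\d+')

# Hypothetical markup standing in for timetable.txt; the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Day</th><th>09:00</th><th>10:00</th><th>11:00</th></tr>
  <tr><td>Mon</td><td>COMP1940 Lecture</td><td></td><td>COMP1550 Lab</td></tr>
  <tr><td>Tue</td><td></td><td>COMP1740 Lecture</td><td></td></tr>
</table>
"""

def codes_for_day(html, day):
    """Return the module codes found in the table row whose first cell is `day`."""
    soup = BeautifulSoup(html, 'html.parser')
    for tr in soup.find_all('tr'):
        cells = tr.find_all(['td', 'th'])
        # Match on the first cell of the row, ignoring surrounding whitespace.
        if cells and cells[0].get_text(strip=True) == day:
            # Join the row's text with spaces so adjacent cells do not run together.
            return CODE_PAT.findall(tr.get_text(' '))
    return []

print(codes_for_day(SAMPLE_HTML, 'Mon'))
```

Selecting the row by its tr element avoids counting siblings with a fixed loop of 18 steps, at the cost of assuming that each day occupies exactly one table row.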
