python - Scraping nested tags

I know this kind of question comes up often, but I have been browsing and have not found a similar one.

<div class="casts">
    <table cellpadding="0" cellspacing="0">
        <tbody>
            <tr>
                <td class="">
                    <a class="cast">
                        <span class="title">
                            Nested data 1 
                            <span class="schedule">
                                Nested data 2
                            </span>
                        </span>
                    </a>
                </td>
            </tr>
        </tbody>
    </table>
</div>

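There are multiple td elements with the same structure, but I removed the rest for simplicity. Say I want to extract "Nested data 1" and "Nested data 2" from the spans. I use: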

finda = soup.find_all('a', attrs={'class':'cast'})

for var in finda:
  var2 = var.find_all('span')

Using:

var2[1]

I am able to extract all of "Nested data 2".

But I cannot extract "Nested data 1" on its own:

var2[0]

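returns both "Nested data 1" and "Nested data 2", because the outer span contains the inner one.

Best answer

This can be done fairly simply by iterating over the direct children of each span and keeping only the text nodes: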

stack.html:

<!DOCTYPE html>
<html lang="en">
<head>
  <title>StackO</title>
  <meta charset="utf-8">
</head>
<body>
  <div class="casts">
    <table cellpadding="0" cellspacing="0">
      <tbody>
        <tr>
          <td class="">
            <a class="cast">
              <span class="title">
                Nested data 1 
                <span class="schedule">
                  Nested data 2
                  <span class="moar-nesting">
                    Nested data 3
                  </span>
                </span>
                Nested data 4
              </span>
            </a>
          </td>
        </tr>
      </tbody>
    </table>
  </div>
</body>
</html>

Meanwhile, in IPython land...

In [1]: from bs4 import BeautifulSoup, NavigableString, Comment

In [2]: with open('stack.html', 'r') as f:
   ...:     markup = f.read()
   ...:

In [3]: soup = BeautifulSoup(markup, 'html.parser')

In [4]: casts = soup.find_all('a', attrs={'class': 'cast'})

In [5]: cast = casts[0]

In [6]: for span in cast.find_all('span'):
   ...:     for child in span.children:
   ...:         if isinstance(child, NavigableString) and not isinstance(child, Comment) and str(child).strip() != "":
   ...:             print('"{}"'.format(str(child).strip()))
   ...:
"Nested data 1"
"Nested data 4"
"Nested data 2"
"Nested data 3"

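Applied to the markup from the question, the same idea can be wrapped in a small helper. This is only a sketch: `own_text` is a hypothetical name, not part of Beautiful Soup, and it assumes each `a.cast` contains one `span.title` and one `span.schedule`, as in the sample.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Markup from the question, trimmed to a single cell.
MARKUP = """
<div class="casts">
  <a class="cast">
    <span class="title">
      Nested data 1
      <span class="schedule">
        Nested data 2
      </span>
    </span>
  </a>
</div>
"""


def own_text(tag):
    """Return the text sitting directly inside `tag`, ignoring nested tags.

    Hypothetical helper, not a Beautiful Soup API.
    """
    parts = []
    for child in tag.children:
        # Keep plain strings only; Comment subclasses NavigableString, so exclude it.
        if isinstance(child, NavigableString) and not isinstance(child, Comment):
            text = str(child).strip()
            if text:
                parts.append(text)
    return " ".join(parts)


soup = BeautifulSoup(MARKUP, "html.parser")

results = []
for cast in soup.find_all("a", attrs={"class": "cast"}):
    # Assumption: both spans exist in every cast link.
    title = cast.find("span", attrs={"class": "title"})
    schedule = cast.find("span", attrs={"class": "schedule"})
    results.append((own_text(title), own_text(schedule)))

print(results)
```

Each tuple in `results` pairs the title's own text with the schedule's own text, so the outer span no longer drags the inner span's text along with it.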

The original question, "python - Scraping nested tags", is on Stack Overflow: https://stackoverflow.com/questions/30084673/
