parsing - BeautifulSoup indexing

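I am trying to parse the links for genres and sub-genres from the IMDB page http://www.imdb.com/genre/?ref_=nv_ch_gr_3

So far I have been able to parse the main genre tags into something usable with the following code: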

table = soup.find_all("table", {"class": "genre-table"})

for item in table:
    for x in range(100):

        try:
            print(item.contents[x].find_all("h3"))
            print(len(item.contents[x].find_all("h3")))
        except:
            pass

My output is 11 lists, each containing two tags, like this:

[<h3><a href="http://www.imdb.com/genre/action/?ref_=gnr_mn_ac_mp">Action <span class="normal">»</span></a></h3>, <h3><a href="http://www.imdb.com/genre/adventure/?ref_=gnr_mn_ad_mp">Adventure <span class="normal">»</span></a></h3>]
2

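I understand this happens because the containers have "even" and "odd" classes and each container holds two h3 tags, but I never told it to distinguish between even and odd. Actually, I think I am answering my own question here: am I right that, because the tags sit inside an odd or even container class, bs4 puts them in a list just to display them, and it is up to me to separate them?

The second, more important question:

How do I put each h3 link and title into the data frame I have set up?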

df = pd.DataFrame(columns= ['Genre', 'Sub-Genre', 'Link'])

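I have tried:

for y in range(2):
    df.append({'Genre':'item.contents[x].find_all("h3"))[y].text)},     ignore_index = true)

Of course, this is nested inside the for loop over x (not on its own),
but it does not seem to work. Any ideas? Karma coming your way!

Best answer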

最佳答案

First of all, there is no need to find all the tables, since only the first one is needed:

table = soup.find("table", {'class': 'genre-table'})

Since every other item (starting from the first) is redundant, you can iterate over the table like this:

for item in list(table)[1::2]:
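To see why the `[1::2]` slice works, here is a minimal sketch on a small inline table. The HTML below is a hypothetical stand-in for the IMDB markup, not a copy of the real page. With `html.parser`, the newlines between rows show up as text nodes, so the children alternate between strings and row tags.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical miniature of the genre table; the real page's markup may differ.
html = """<table class="genre-table">
<tr class="odd"><td><h3><a href="/genre/action/">Action</a></h3></td></tr>
<tr class="even"><td><h3><a href="/genre/comedy/">Comedy</a></h3></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "genre-table"})

# Iterating over a Tag yields its direct children, whitespace strings included.
children = list(table)
kinds = [type(child).__name__ for child in children]

# Odd positions hold the <tr> tags; even positions hold the newline strings.
rows_by_slice = children[1::2]

# An alternative that does not depend on the whitespace layout.
rows_by_filter = [child for child in children if isinstance(child, Tag)]

print(kinds)
print([row["class"] for row in rows_by_slice])
```

Filtering on `Tag` (or calling `table.find_all("tr")`) may be the safer choice if the whitespace between rows is not guaranteed.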

After that, we can get the "h3" tags in each row and loop over them:

    row = item.find_all("h3")

    for col in row:

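Because the text of each "h3" element comes back in the form "Somegenre\xc2\xbb" (the trailing bytes are the UTF-8 encoding of the » character), I remove the span element before getting the text: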

        col.span.extract()
        link = col.a['href']
        genre = col.text.strip()
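The effect of removing the span can be checked in isolation. This is a small sketch using one of the h3 tags from the question's output as input; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# One <h3> element, copied from the output shown in the question.
html = (
    '<h3><a href="http://www.imdb.com/genre/action/?ref_=gnr_mn_ac_mp">'
    'Action <span class="normal">»</span></a></h3>'
)
h3 = BeautifulSoup(html, "html.parser").h3

text_before = h3.text.strip()  # the » from the <span> is still part of the text
h3.span.extract()              # detach the <span> from the tree
text_after = h3.text.strip()   # only the genre name remains
link = h3.a["href"]

print(repr(text_before))
print(repr(text_after))
print(link)
```

Note that `extract()` modifies the parsed tree, so the span is gone for any later lookups on the same soup.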

After that, just add the elements to the data frame by index:

        df.loc[len(df)]=[genre, None, link]
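As an alternative to growing the data frame one row at a time, the rows can be collected in a plain list and the frame built once at the end. The sketch below uses hypothetical `(genre, link)` pairs in place of the scraped values. It also avoids `DataFrame.append`, which returned a new frame instead of modifying `df` in place (one reason the attempt in the question had no visible effect) and was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical (genre, link) pairs standing in for the scraped values.
scraped = [
    ("Action", "http://www.imdb.com/genre/action/?ref_=gnr_mn_ac_mp"),
    ("Adventure", "http://www.imdb.com/genre/adventure/?ref_=gnr_mn_ad_mp"),
]

# Collect one dict per row; the sub-genre is not known yet, so it stays None.
rows = []
for genre, link in scraped:
    rows.append({"Genre": genre, "Sub-Genre": None, "Link": link})

# Build the frame once, fixing the column order explicitly.
df = pd.DataFrame(rows, columns=["Genre", "Sub-Genre", "Link"])
print(df)
```

Building the frame once tends to be faster than repeated `df.loc[len(df)] = ...` assignments, although for a handful of genres either approach is fine.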

Full code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

df = pd.DataFrame(columns=['Genre', 'Sub-Genre', 'Link'])

req = requests.get('http://www.imdb.com/genre/?ref_=nv_ch_gr_3')
soup = BeautifulSoup(req.content, 'html.parser')

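# Only the first genre table is needed.
table = soup.find("table", {'class': 'genre-table'})

# Every second child of the table is a row; the others are skipped.
for item in list(table)[1::2]:
    row = item.find_all("h3")

    for col in row:
        # Drop the <span> holding "»" so it does not end up in the genre text.
        col.span.extract()
        link = col.a['href']
        genre = col.text.strip()

        # Append a new row; the sub-genre is left empty for now.
        df.loc[len(df)] = [genre, None, link]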

Regarding parsing - BeautifulSoup indexing, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/37676044/
