python - Extracting data from an HTML table with bs4 and Python

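I have used Beautiful Soup before, but this time I need help. I cannot extract certain parts of the HTML; I always get "None" back.

I want to scrape details from the following page: https://www.coingecko.com/de?page=1

Everything that is visible is no problem, but if you hover over some of the numbers, a tooltip gives you more detailed information (which is what I want :-)).

This is what I have put together: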

from bs4 import BeautifulSoup as soup
import requests


url = "https://www.coingecko.com/de?page=1"
response = requests.get(url)                       # request the html
webpage = soup(response.content, "html.parser")    # parse the html


# extract the 'main block' with the information
developer = webpage.findAll("div", {"class": "percent"})

The variable "developer" is now a list, which I plan to iterate over with a for loop. Each element currently looks like this:

<div class="percent" data-toggle="tooltip" data-placement="right" data-html="true" title="" data-original-title="<div style=&quot;text-align: left;  font-size: 12px;&quot;>
    <table>
      <tbody>
        <tr>
          <td>Abonnenten <i class=&quot;fa fa-reddit&quot;></i></td>
          <td style=&quot;text-align: right&quot;>629925</td>                #I want this number

          ...

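Now I cannot extract the number 629925. Normally I would just write .text, but that does not work here because the number is not part of the element's text.

I then tried the following (and many variations), which also only returns [ ]: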

print(developer[0].findAll("td"))

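Can anyone explain to me how to extract it?

I also had a quick look at lxml, but I have never used it before and could not get it to work.

Any help is much appreciated.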

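Best answer

The desired number 629925 is part of HTML code that sits inside an element's attribute value. That is why searching the parsed page for "td" elements under the div finds nothing. You need to re-parse the HTML from the tooltip: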

from bs4 import BeautifulSoup
import requests


url = "https://www.coingecko.com/de?page=1"
response = requests.get(url)                       #request the html
soup = BeautifulSoup(response.content, "html.parser")                  #parse the html

percent_html_data = soup.select_one("td.community .percent")['title']
percent_soup = BeautifulSoup(percent_html_data, "html.parser")
data = {
    row.td.get_text(strip=True): row("td")[1].get_text()
    for row in percent_soup.find_all("tr")
}
print(data)

Prints:

{u'Facebook Likes': u'36370', u'Abonnenten': u'629925', u'Twitter Follower': u'627476'}
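
To see why findAll("td") on the outer div returned an empty list, here is a small offline sketch. The HTML string is a shortened, made-up stand-in for one cell of the page (an assumption, not the actual coingecko markup). It illustrates that the table exists only as text inside the title attribute until that text is parsed a second time.

```python
from bs4 import BeautifulSoup

# Made-up, shortened stand-in for one cell of the real page (assumption).
# The tooltip markup is entity-escaped inside the title attribute.
html = (
    '<div class="percent" title="'
    '&lt;table&gt;&lt;tr&gt;'
    '&lt;td&gt;Abonnenten&lt;/td&gt;&lt;td&gt;629925&lt;/td&gt;'
    '&lt;/tr&gt;&lt;/table&gt;'
    '">97%</div>'
)

outer = BeautifulSoup(html, "html.parser")
div = outer.find("div", {"class": "percent"})

# The outer parse sees no <td> children: the table is only an attribute string.
cells_in_outer = div.find_all("td")

# The parser has already unescaped the entities, so the attribute value
# is plain HTML text that can be parsed again.
inner = BeautifulSoup(div["title"], "html.parser")
cells_in_inner = [td.get_text(strip=True) for td in inner.find_all("td")]

print(cells_in_outer)
print(cells_in_inner)
```

The first list is empty and the second holds the label and the number, which matches the behaviour described in the question.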

You can extend the solution further to all rows of the table:

for row in soup.select("#gecko-table tr")[1:]:
    coin_name = row.select_one(".coin-content-name").get_text()
    percent_html_data = row.select_one("td.community .percent")['title']

    percent_soup = BeautifulSoup(percent_html_data, "html.parser")
    data = {
        row.td.get_text(strip=True): row("td")[1].get_text()
        for row in percent_soup.find_all("tr")
    }
    print(coin_name, data["Abonnenten"])

Prints:

(u'Bitcoin', u'629925')
(u'Ripple', u'135263')
(u'Ethereum', u'247487')
...
(u'BlackCoin', u'7659')
(u'Shift', u'1028')
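
The loop above assumes that every row has a "td.community .percent" element with a non-empty title attribute. A more defensive variant might look like the sketch below. The helper name tooltip_counts, the fallback to data-original-title (the attribute that holds the tooltip in the question's snippet, presumably moved there by the page's tooltip script), and the sample row are all assumptions for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup


def tooltip_counts(tag):
    """Return {label: int} parsed from a tooltip attribute, or {} if absent.

    Hypothetical helper: `tag` is a bs4 Tag (or None), such as the
    `.percent` div used in the answer.
    """
    if tag is None:
        return {}
    # Try `title` first (as in the answer), then `data-original-title`
    # (as seen in the question's snippet). Both names are assumptions
    # about the page at the time of writing.
    tooltip_html = tag.get("title") or tag.get("data-original-title")
    if not tooltip_html:
        return {}
    counts = {}
    for tr in BeautifulSoup(tooltip_html, "html.parser").find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 2:
            continue  # skip rows without a label/value pair
        value = cells[1].get_text(strip=True)
        if value.isdigit():
            counts[cells[0].get_text(strip=True)] = int(value)
    return counts


# Made-up sample row (assumption), with the tooltip entity-escaped.
sample = BeautifulSoup(
    '<table><tr><td class="community"><div class="percent" title="" '
    'data-original-title="&lt;table&gt;&lt;tr&gt;&lt;td&gt;Abonnenten'
    '&lt;/td&gt;&lt;td&gt;629925&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;">'
    '</div></td></tr></table>',
    "html.parser",
)
found = tooltip_counts(sample.select_one("td.community .percent"))
missing = tooltip_counts(sample.select_one("td.developer .percent"))
print(found, missing)
```

Returning an empty dict instead of raising keeps a long scraping loop running when a single row lacks the tooltip; callers can then use data.get("Abonnenten") rather than indexing directly.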

This question, python - Extracting data from an HTML table with bs4 and Python, originates from Stack Overflow: https://stackoverflow.com/questions/48133252/
