I have followed many tutorials on scraping JavaScript-rendered pages, but I still cannot extract the numbers from this table:
http://www.wsj.com/mdc/public/npage/2_3023_creditdervs.html
My latest attempt follows the Sentdex tutorial and uses the following code:
import bs4 as bs
import sys
import urllib.request
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl


class Page(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()  # blocks until Callable() quits the event loop

    def _on_load_finished(self):
        # toHtml() is asynchronous and returns None; the HTML arrives in Callable()
        self.toHtml(self.Callable)
        print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()


def main():
    page = Page('http://www.wsj.com/mdc/public/npage/2_3023_creditdervs.html')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    tableSup = soup.find_all("td", {"class": "col2 yellowBack"})
    print(tableSup)


if __name__ == '__main__':
    main()
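It looks like I am missing the mark. Everyone talks about scripts that are tied to text which shows up in the page source but then disappears from the BeautifulSoup tag text, yet I really cannot find the script associated with the values in the main table on the page above.
Any suggestions on where I should look?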
Best answer
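Note that the table you want to scrape is inside an iframe. An iframe is a separate document with its own URL, so you should send a request to that iframe URL and then scrape the table from the response. The iframe URL was found by simply inspecting the element. Sample code using requests looks like this: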
from bs4 import BeautifulSoup
import requests

# URL of the iframe that actually holds the table (found by inspecting the page)
iframe = "https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWQEqN6Sq2pe6I0o/TehV5qd"
html = requests.get(iframe).text
soup = BeautifulSoup(html, 'html.parser')
# Every cell in the column of interest
column = soup.find_all("td", {"class": "col2 yellowBack"})
values = [cell.string for cell in column]
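The iframe URL in this answer was found by manual inspection. As a rough sketch of doing the same lookup in code, the snippet below collects the src of every iframe tag in a piece of HTML and resolves it against the page URL. It uses only the standard library so it runs without extra packages; the names IframeCollector and find_iframe_urls, the sample markup, and the example.com URLs are all hypothetical. It only helps when the iframe tag is present in the static HTML; an iframe injected later by JavaScript would not be found this way.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeCollector(HTMLParser):
    """Collects the src attribute of every <iframe> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase and attrs as (name, value) pairs
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_iframe_urls(html, base_url):
    """Return absolute URLs for all iframes found in the given HTML string."""
    collector = IframeCollector()
    collector.feed(html)
    # Relative src values are resolved against the URL of the outer page
    return [urljoin(base_url, src) for src in collector.sources]


# Hypothetical markup standing in for the downloaded outer page
sample = '<html><body><iframe src="/embedded/table.html"></iframe></body></html>'
print(find_iframe_urls(sample, "http://example.com/page.html"))
```

Each returned URL could then be fetched with requests.get and parsed in the same way as the hard-coded iframe URL.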
You seem to be interested in the values in that column, so the values list built by the requests example is the desired output:
>>> values
['56.37', '107.75', 'n.a.', '95.99', 'n.a.', '56.00', '52.32', '234.85', '81.21', '40.72', '76.29', '19.90', 'n.a.', '92.41', '12.83', '62.19', '78.28', '60.51', '4995.58', '92.99', '67.56', '175.24', '58.71', '82.14', '57.75', '46.86', '22.95', '70.06', '150.16', '6793.46', '31.07', '34.31', '50.39']
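The scraped values are strings, and some cells hold 'n.a.' instead of a number. If numbers are needed, a minimal sketch of the conversion could look like the following; the helper name to_float and the choice of None for non-numeric cells are assumptions made for illustration, not part of the original answer.

```python
def to_float(cell):
    """Convert a scraped cell to float, or return None if it is not numeric."""
    try:
        return float(cell)
    except (TypeError, ValueError):
        # TypeError covers a cell that is None; ValueError covers text such as 'n.a.'
        return None


# First few entries copied from the output above
values = ['56.37', '107.75', 'n.a.', '95.99']
numbers = [to_float(v) for v in values]
print(numbers)  # [56.37, 107.75, None, 95.99]
```

Keeping None as a placeholder preserves the row positions, which matters if the column is later matched against the other columns of the table.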
Source: the Stack Overflow question on scraping dynamic text generated by JavaScript, https://stackoverflow.com/questions/45472524/