javascript - How to parse JavaScript-type HTML

Tags: javascript python html regex web-scraping

<script type="text/javascript">
var modelData = [{"Id":958,"Date":"20160428","Title":"Design","Description":"London Auction 28 April 2016","Department":"Design","Location":"LONDON","Permalink":"/auctions/auction/UK050116","Year":"2016","Image":"/Xigen/image.ashx?path=\\\\diskstation\\website\\Certificates\\UK050116\\UK050116.jpg\u0026width=308\u0026height=222","addThis":" addthis:url=\"https://www.example.com/auctions/auction/UK050116\" ","results_html":"\u003cli class=\"expandable past-auction-exp closed\"\u003e\u003ca href=\"#\"\u003eVIEW RESULTS\u003c/a\u003e\u003cdiv class=\"panel\" style=\"display:none\"\u003e\u003ca href=\"/auctions/auction/UK050116\"\u003eOnline\u003c/a\u003e\u003ca target=\"_blank\" href=\"/Xigen/file.ashx?path=\\\\diskstation\\website\\Media\\Auction\\auctionResultsFile_UK050116.pdf\"\u003ePDF\u003c/a\u003e\u003c/div\u003e\u003c/li\u003e","Download_catalog_html":"\u003cli class=\"expandable past-auction-exp closed\"\u003e\u003ca href=\"#\"\u003eCATALOGUES\u003c/a\u003e\u003cdiv class=\"panel\" style=\"display:none\"\u003e\u003ca target=\"_blank\" id=\"linkDownloadCatalog\" href=\"http://www.example.com/Xigen/file.ashx?path=\\\\diskstation\\website\\Certificates/UK050116/UK050116_catalog.pdf\"\u003eDownload Catalogue\u003c/a\u003e\u003ca href=\"/catalogues/buy\"\u003ePurchase Catalogue\u003c/a\u003e\u003c/div\u003e\u003c/li\u003e"}]</script>

I want to extract the date, title, and link from this. How can I parse it? I tried PyQt4 but could not get it to work either.

Best answer

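Assuming the data sits inside a script tag, you can parse the HTML with the BeautifulSoup module and locate the script element with the same regular expression that extracts the modelData value. Then, after fixing the modelData value so that json.loads() can load it, you will have a Python data structure that is easy to work with: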

import json
import re

from bs4 import BeautifulSoup

data = """
<script>
var modelData = [{"Id":958,"Date":"20160428","Title":"Design","Description":"London Auction 28 April 2016","Department":"Design","Location":"LONDON","Permalink":"/auctions/auction/UK050116","Year":"2016","Image":"/Xigen/image.ashx?path=\\\\diskstation\\website\\Certificates\\UK050116\\UK050116.jpg\u0026width=308\u0026height=222","addThis":" addthis:url=\"https://www.example.com/auctions/auction/UK050116\" ","results_html":"\u003cli class=\"expandable past-auction-exp closed\"\u003e\u003ca href=\"#\"\u003eVIEW RESULTS\u003c/a\u003e\u003cdiv class=\"panel\" style=\"display:none\"\u003e\u003ca href=\"/auctions/auction/UK050116\"\u003eOnline\u003c/a\u003e\u003ca target=\"_blank\" href=\"/Xigen/file.ashx?path=\\\\diskstation\\website\\Media\\Auction\\auctionResultsFile_UK050116.pdf\"\u003ePDF\u003c/a\u003e\u003c/div\u003e\u003c/li\u003e","Download_catalog_html":"\u003cli class=\"expandable past-auction-exp closed\"\u003e\u003ca href=\"#\"\u003eCATALOGUES\u003c/a\u003e\u003cdiv class=\"panel\" style=\"display:none\"\u003e\u003ca target=\"_blank\" id=\"linkDownloadCatalog\" href=\"http://www.example.com/Xigen/file.ashx?path=\\\\diskstation\\website\\Certificates/UK050116/UK050116_catalog.pdf\"\u003eDownload Catalogue\u003c/a\u003e\u003ca href=\"/catalogues/buy\"\u003ePurchase Catalogue\u003c/a\u003e\u003c/div\u003e\u003c/li\u003e"}]
</script>
"""

soup = BeautifulSoup(data, 'lxml')

pattern = re.compile(r"var modelData = (\[.*?\])", re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)

s = pattern.search(script.text).group(1).encode('unicode_escape')
while True:
    try:
        result = json.loads(s)   # try to parse...
        break                    # parsing worked -> exit loop
    except ValueError as e:  # json.loads signals malformed input with ValueError
        # "Expecting , delimiter: line 34 column 54 (char 1158)"
        # position of unexpected character after '"'
        unexp = int(re.findall(r'\(char (\d+)\)', str(e))[0])
        # position of unescaped '"' before that
        unesc = s.rfind(r'"', 0, unexp)
        s = s[:unesc] + r'\"' + s[unesc+1:]
        # position of corresponding closing '"' (+2 for inserted '\')
        closg = s.find(r'"', unesc + 2)
        s = s[:closg] + r'\"' + s[closg+1:]

item = result[0]
print(item["Id"])
print(item["Title"])

Prints (in this form the code works only on Python 2):

958
Design
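
On Python 3, str.encode() returns bytes, so the string slicing in the loop above does not carry over as written. The following is a minimal sketch of an alternative, not the answer's method. It assumes the page source reaches you with its backslash escapes intact, as in the question's markup, in which case the array is already valid JSON and no quote repair is needed. The helper name extract_model_data and the BASE_URL constant are hypothetical, and the sample is a shortened copy of the question's data. It uses only the standard library and reads the three fields the question asks for: date, title, and link.

```python
import json
import re
from datetime import datetime
from urllib.parse import urljoin

# Shortened copy of the question's markup. The raw string keeps the
# backslash escapes exactly as they appear in the page source.
html = r"""
<script type="text/javascript">
var modelData = [{"Id":958,"Date":"20160428","Title":"Design","Permalink":"/auctions/auction/UK050116","addThis":" addthis:url=\"https://www.example.com/auctions/auction/UK050116\" ","results_html":"\u003ca href=\"#\"\u003eVIEW RESULTS\u003c/a\u003e"}]</script>
"""

# Assumption: the site root, taken from the addThis field in the sample.
BASE_URL = "https://www.example.com"


def extract_model_data(page):
    """Return the value assigned to `var modelData` in the page source."""
    match = re.search(r"var\s+modelData\s*=\s*", page)
    if match is None:
        raise ValueError("modelData assignment not found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here, the closing </script> tag).
    value, _end = json.JSONDecoder().raw_decode(page, match.end())
    return value


item = extract_model_data(html)[0]
date = datetime.strptime(item["Date"], "%Y%m%d").date()
link = urljoin(BASE_URL, item["Permalink"])

print(date)           # 2016-04-28
print(item["Title"])  # Design
print(link)           # https://www.example.com/auctions/auction/UK050116
```

Using raw_decode instead of a non-greedy `\[.*?\]` pattern avoids cutting the array short if a string value ever contains a `]` character.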

Regarding "javascript - How to parse JavaScript-type HTML", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36962749/
