python - Extracting data from a script tag with BeautifulSoup in Python

Tags: python beautifulsoup deezer

I want to use BeautifulSoup in Python to extract the "SNG_TITLE" and "ART_NAME" values from the code inside a "script" tag. (The whole script is too long to paste here.)

<script>window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276641","UPLOAD_ID":0,"SNG_TITLE":"Heathens","ART_ID":"647650","PROVIDER_ID":"3","ART_NAME":"Twenty One Pilots","ARTISTS":[{"ART_ID":"647650","ROLE_ID":"0","ARTISTS_SONGS_ORDER":"1","ART_NAME":"Twenty One Pilots","ART_PICTURE":"259dcf52853363d79753ec301377645d","SMARTRADIO":"1","RANK":"487762","LOCALES":[],"__TYPE__":"artist"}],"ALB_ID":"13371165","ALB_TITLE":"Heathens","TYPE":0,"MD5_ORIGIN":"5cea723b83af1ff0a62d65d334b978d4","VIDEO":false,"DURATION":"195","ALB_PICTURE":"3dfc8c9e406cf1bba8ce0695a44a9b7e","ART_PICTURE":"259dcf52853363d79753ec301377645d","RANK_SNG":"967143","SMARTRADIO":"1","FILESIZE_AAC_64":0,"FILESIZE_MP3_64":"0","FILESIZE_MP3_128":"3135946","FILESIZE_MP3_256":0,"FILESIZE_MP3_320":"7839868","FILESIZE_FLAC":"21777150","FILESIZE":"3135946","GAIN":"-12","MEDIA_VERSION":"4","DISK_NUMBER":"1","TRACK_NUMBER":"1","VERSION":"","EXPLICIT_LYRICS":"0","RIGHTS":{"STREAM_ADS_AVAILABLE":true,"STREAM_ADS":"2000-01-01","STREAM_SUB_AVAILABLE":true,"STREAM_SUB":"2000-01-01"},"ISRC":"USAT21601930","DATE_ADD":1497886149,"HIERARCHICAL_TITLE":"","SNG_CONTRIBUTORS":{"mainartist":["Twenty One Pilots"],"engineer":["Adam Hawkins"],"mixer":["Adam Hawkins"],"masterer":["Chris Gehringer"],"drums":["Josh Dun"],"producer":["Mike Elizondo","Tyler Joseph"],"programmer":["Mike Elizondo","Tyler Joseph"],"vocals":["Tyler Joseph"],"writer":["Tyler Joseph"]},"LYRICS_ID":30553991,"__TYPE__":"song"},{"SNG_ID":"99976952","PRODUCT_TRACK_ID":"171067651","UPLOAD_ID":0,"SNG_TITLE":"Stressed Out","ART_ID":"647650","PROVIDER_ID":"3","ART_NAME":"Twenty One Pilots","ARTISTS":[{"ART_ID":"647650","ROLE_ID":"0","ARTISTS_SONGS_ORDER":"1","ART_NAME":"Twenty One Pilots", ...</script>

The idea of the code is to print the user name, plus the title and artist name of every song that can be found on the given page.

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.deezer.com/en/profile/1589856782/loved'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

user_name = soup.find(class_='user-name')
print(user_name.text)

This prints the user name.

for script in soup.find_all('script'):
    print(script.contents) 

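If I understand correctly, the script I need contains a dictionary, so I only need to find it and read its contents. The problem is that I don't know how to locate this particular script. It has no attributes or anything else that makes it unique. So I tried a loop that finds every script on the page and prints its contents, but I'm not sure how to proceed from there.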

How can I find only this specific script on the page? Is there a different way to access these values?

Best answer

The scripts do not change position in the page source, so you can count them and use an index to get the right one.

all_scripts[6]

A script's content is an ordinary string, so you can also use standard string functions to recognize the right one.

if '{"loved"' in script.text:

Code using both methods. I use [:100] to display only the first part of the string.

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.deezer.com/en/profile/1589856782/loved'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

all_scripts = soup.find_all('script')

print('--- first method ---')
print(all_scripts[6].text[:100])

print('--- second method ---')
for number, script in enumerate(all_scripts):
    if '{"loved"' in script.text:
        print(number, script.text[:100])

Result:

--- first method ---
window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276
--- second method ---
6 window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276
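
The second method can be wrapped in a small helper that does not depend on the script's position. The following is only a sketch: SAMPLE_HTML is a made-up, shortened stand-in for the real page, and find_script_containing is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page (assumption for illustration).
SAMPLE_HTML = """
<html><body>
<script src="app.js"></script>
<script>var unrelated = 1;</script>
<script>window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[]}}}</script>
</body></html>
"""


def find_script_containing(soup, marker):
    """Return the first <script> tag whose inline code contains marker, or None."""
    for script in soup.find_all('script'):
        # .string is None for scripts without inline code (e.g. src="...")
        code = script.string or ''
        if marker in code:
            return script
    return None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
state_script = find_script_containing(soup, 'window.__DZR_APP_STATE__')
missing = find_script_containing(soup, 'no such marker')

print(state_script.string[:40])
print(missing)
```

Searching for the variable name instead of '{"loved"' is a design choice: it should keep working on other tabs of the profile, assuming the page still assigns its data to window.__DZR_APP_STATE__.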

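EDIT: Once you have the right script, you can use slicing to keep only the JSON string and use the json module to convert it into a Python dictionary. Then you can get the data.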

import requests
from bs4 import BeautifulSoup
import json

base_url = 'https://www.deezer.com/en/profile/1589856782/loved'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

all_scripts = soup.find_all('script')

# [27:] skips the 27-character prefix 'window.__DZR_APP_STATE__ = ' so only the JSON remains
data = json.loads(all_scripts[6].get_text()[27:])

print('key:', data.keys())
print('key:', data['TAB'].keys())
print('key:', data['DATA'].keys())
print('---')

for item in data['TAB']['loved']['data']:
    print('ART_NAME:', item['ART_NAME'])
    print('SNG_TITLE:', item['SNG_TITLE'])
    print('---')

Result:

key: dict_keys(['TAB', 'DATA'])
key: dict_keys(['loved'])
key: dict_keys(['USER', 'FOLLOW', 'FOLLOWING', 'HAS_BLOCKED', 'IS_BLOCKED', 'IS_PUBLIC', 'CURATOR', 'IS_PERSONNAL', 'NB_FOLLOWER', 'NB_FOLLOWING'])
---
ART_NAME: Twenty One Pilots
SNG_TITLE: Heathens
---
ART_NAME: Twenty One Pilots
SNG_TITLE: Stressed Out
---
ART_NAME: Linkin Park
SNG_TITLE: Numb
---
ART_NAME: Three Days Grace
SNG_TITLE: Animal I Have Become
---
ART_NAME: Three Days Grace
SNG_TITLE: Painkiller
---
ART_NAME: Slipknot
SNG_TITLE: Before I Forget
---
ART_NAME: Slipknot
SNG_TITLE: Duality
---
ART_NAME: Skrillex
SNG_TITLE: Make It Bun Dem
---
ART_NAME: Skrillex
SNG_TITLE: Bangarang (feat. Sirah)
---
ART_NAME: Limp Bizkit
SNG_TITLE: Break Stuff
---
ART_NAME: Three Days Grace
SNG_TITLE: I Hate Everything About You
---
ART_NAME: Three Days Grace
SNG_TITLE: Time of Dying
---
ART_NAME: Three Days Grace
SNG_TITLE: I Am Machine
---
ART_NAME: Three Days Grace
SNG_TITLE: Riot
---
ART_NAME: Three Days Grace
SNG_TITLE: So What
---
ART_NAME: Three Days Grace
SNG_TITLE: Pain
---
ART_NAME: Three Days Grace
SNG_TITLE: Tell Me Why
---
ART_NAME: Three Days Grace
SNG_TITLE: Chalk Outline
---
ART_NAME: Three Days Grace
SNG_TITLE: Gone Forever
---
ART_NAME: Slipknot
SNG_TITLE: The Devil In I
---
ART_NAME: Linkin Park
SNG_TITLE: No More Sorrow
---
ART_NAME: Linkin Park
SNG_TITLE: Bleed It Out
---
ART_NAME: The Doors
SNG_TITLE: Roadhouse Blues
---
ART_NAME: The Doors
SNG_TITLE: Riders On The Storm
---
ART_NAME: The Doors
SNG_TITLE: Break On Through (To The Other Side)
---
ART_NAME: The Doors
SNG_TITLE: Alabama Song (Whisky Bar)
---
ART_NAME: The Doors
SNG_TITLE: People Are Strange
---
ART_NAME: My Chemical Romance
SNG_TITLE: Welcome to the Black Parade
---
ART_NAME: My Chemical Romance
SNG_TITLE: Teenagers
---
ART_NAME: My Chemical Romance
SNG_TITLE: Na Na Na [Na Na Na Na Na Na Na Na Na]
---
ART_NAME: My Chemical Romance
SNG_TITLE: Famous Last Words
---
ART_NAME: The Doors
SNG_TITLE: Soul Kitchen
---
ART_NAME: The Black Keys
SNG_TITLE: Lonely Boy
---
ART_NAME: Katy Perry
SNG_TITLE: I Kissed a Girl
---
ART_NAME: Katy Perry
SNG_TITLE: Hot N Cold
---
ART_NAME: Katy Perry
SNG_TITLE: E.T.
---
ART_NAME: Linkin Park
SNG_TITLE: Given Up
---
ART_NAME: My Chemical Romance
SNG_TITLE: Dead!
---
ART_NAME: My Chemical Romance
SNG_TITLE: Mama
---
ART_NAME: My Chemical Romance
SNG_TITLE: The Sharpest Lives
---
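
The fixed offset [27:] only works while the prefix stays exactly 'window.__DZR_APP_STATE__ = '. As a possible alternative, the sketch below splits on the first '=' and lets json.JSONDecoder.raw_decode read one JSON value, which also tolerates a trailing ';'. The script_text value is a shortened, made-up sample, and parse_assigned_json is a hypothetical helper name.

```python
import json

# Hypothetical, shortened stand-in for the text of the real script tag.
script_text = (
    'window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":['
    '{"SNG_TITLE":"Heathens","ART_NAME":"Twenty One Pilots"}]}}};'
)


def parse_assigned_json(text):
    """Parse the JSON value that follows the first '=' in a JS assignment."""
    _, sep, after = text.partition('=')
    if not sep:
        raise ValueError('no assignment found in script text')
    # raw_decode does not skip leading whitespace, so strip it first;
    # it stops at the end of the first JSON value and ignores what follows.
    obj, _end = json.JSONDecoder().raw_decode(after.lstrip())
    return obj


state = parse_assigned_json(script_text)
tracks = [(t['ART_NAME'], t['SNG_TITLE']) for t in state['TAB']['loved']['data']]
print(tracks)  # [('Twenty One Pilots', 'Heathens')]
```

With the real page, the text passed to parse_assigned_json would be the content of the script found earlier, for example all_scripts[6].get_text().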

For "python - Extracting data from a script tag with BeautifulSoup in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48030726/
