python - Skipping blank rows in BeautifulSoup

Tags: python beautifulsoup

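I am currently trying to scrape data with BeautifulSoup from 1001TrackLists, a website that lists the tracks in DJ mixes.

If a track in a mix has not been identified, 1001TrackLists leaves it in the data table as "ID - ID". That row shows up as a blank entry in the scraped markup and breaks my for loop.

How can I make Python skip the "blank" IDs in the tracklist and carry on scraping the data that comes after them?

My code so far: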


import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Chrome/51.0.2704.103'}
page_link = 'https://www.1001tracklists.com/tracklist/7mzt0y9/boddika-joy-orbison-rinse-fm-hessle-audio-cover-show-2014-01-16.html'
page_response = requests.get(page_link, headers=headers)
soup = bs(page_response.content, "html.parser")

tracknumbers = []
tracknames = []
artistnames = []
mixnames = []
dates = []


tracknames_scrape = soup.find_all("div", class_="tlToogleData", div=True)
artistnames_scrape = soup.find_all("meta", itemprop="byArtist")

for (i, track) in enumerate(tracknames_scrape):
    tracknumbers.append(i+1)
    trackname = track.meta['content']
    tracknames.append(trackname)
    print(str(i+1) + str(". ") + trackname)

At the moment I get every track back until I reach a blank entry, and then the following error is raised:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-de6ecd3caa59> in <module>
      1 for (i, track) in enumerate(tracknames_scrape):
      2     tracknumbers.append(i+1)
----> 3     trackname = track.meta['content']

TypeError: 'NoneType' object is not subscriptable

If I use a URL for a tracklist with no blank track IDs, the script works perfectly.

Best answer

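Use the following CSS selector to get the track names. It matches the name elements directly, so a row that has no such element (as the unidentified "ID - ID" rows appear not to) is never returned and never reaches the loop.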

import requests
from bs4 import BeautifulSoup as bs
headers = {'User-Agent': 'Chrome/51.0.2704.103'}
page_link  = 'https://www.1001tracklists.com/tracklist/7mzt0y9/boddika-joy-orbison-rinse-fm-hessle-audio-cover-show-2014-01-16.html'
page_response = requests.get(page_link, headers=headers)
soup = bs(page_response.content, "html.parser")

tracknumbers = []
tracknames = []
artistnames = []
mixnames = []
dates = []


# Select only the elements with itemprop="name" that are direct children of a track row.
tracknames_scrape = soup.select('div[itemprop="tracks"] > [itemprop="name"]')
# artistnames_scrape = soup.find_all("meta", itemprop="byArtist")

for i, track in enumerate(tracknames_scrape, start=1):
    tracknumbers.append(i)
    trackname = track['content']
    tracknames.append(trackname)
    print(f"{i}. {trackname}")

Output:

1. Soft Machine - Snodland
2. Craig Leon - The Customs Of The Age Disturbed
3. Seven Davis Jr. - Thanks
4. Gadi Mizrahi - I'll Set Your House
5. Baby Ford & The iFach Collective - Word For Word
6. Panzer Knacker - Rollin' On The Side Of Psycho
7. 69 - Poi Beats
8. Midi Rain - Shine (DJ Pierre Chicago House Mix)
9. Sunpeople - Check Your Buddha (Sven Väth Remix)
10. Eduardo De La Calle - Madhusudhana
11. Aardvarck - The Antdance
12. Boddika & Joy Orbison - In Here
13. Mike Parker - Lustrations Eight (Contours)
14. Peter Van Hoesen - Axis Mundi
15. Sleeparchive - Bleep 01
16. Conforce - When It Appeared
17. Brommage Dub - Fettwise
18. Matrixxman - Protocol
19. JuJu & Jordash - Powwow
20. Gesloten Cirkel - Yamagic
21. Mike Dehnert - Mischkaa
22. Jerome Sydenham & Joe Claussell - Rhythm
23. Ratchett Traxxx - Nut On U
24. Kenny Dope & Terry Hunter pres. Mass Destruction - No Hook
25. Radio Slave - Don't Stop No Sleep
26. Truncate - Focus
27. Maurizio - Domina (Maurizio Mix Edit)
28. Shed - Atmo - Action
29. AFX - Boxing Day
30. Boddika & Joy Orbison - More Maim
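
If you would rather keep the original loop over the rows, another option is to test for a missing element before subscripting it, since `track.meta` is `None` when a row has no `<meta>` child. The sketch below is only an illustration: `SAMPLE_HTML` and `extract_tracknames` are made-up names, and the markup is a guess based on the selectors used above, not the real page structure of 1001TrackLists.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the selectors used in this thread.
# The real 1001TrackLists page may be structured differently.
SAMPLE_HTML = """
<div class="tlToogleData" itemprop="tracks">
  <meta itemprop="name" content="Soft Machine - Snodland">
</div>
<div class="tlToogleData">
  <span>ID - ID</span>
</div>
<div class="tlToogleData" itemprop="tracks">
  <meta itemprop="name" content="Craig Leon - The Customs Of The Age Disturbed">
</div>
"""


def extract_tracknames(html):
    """Return the track names found in html, skipping rows without a name."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for row in soup.find_all("div", class_="tlToogleData"):
        meta = row.find("meta", itemprop="name")
        # Unidentified rows have no name element, so find() returns None.
        if meta is None or not meta.get("content"):
            continue
        names.append(meta["content"])
    return names


tracknames = extract_tracknames(SAMPLE_HTML)
for number, name in enumerate(tracknames, start=1):
    print(f"{number}. {name}")
```

As with the selector approach, the numbers here count identified tracks only. If you need each track's position in the mix, enumerate the rows themselves and keep that index.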

A similar question about "python - Skipping blank rows in BeautifulSoup" can be found on Stack Overflow: https://stackoverflow.com/questions/60065505/
