python - Using python/beautiful soup to scrape links from a website for a Kodi plugin

Tags: python web-scraping plugins beautifulsoup kodi

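The website I'm trying to scrape media links from (for a Kodi plugin) doesn't have much markup in the way of classes etc., but each link does follow a somewhat unique layout.

I've created the basic Kodi plugin from another working plugin, but I'm having trouble scraping the links with Python/BeautifulSoup. Other plugins key off classes, headers, and so on, but the site I'm trying to scrape doesn't use much of that.

I've tried various forums with no luck; most Kodi plugin forums are old and not very active. The guides I've looked at seem to jump from step 1 to step 1000 very quickly, and the examples they give aren't relevant. I've looked through about 30 different plugins that I thought would help, but I can't work it out.

The media links, episode titles, descriptions, and images I'm trying to scrape are listed at www.thisiscriminal.com/episodes.

The complete plugin, as far as I've got, is in the Github-repository

I can see them clearly listed in the page source (see code)

I basically just need to be able to parse a website, find the following parts for each episode, populate them as links on the Kodi plugin page, and then list the next one below. Any help would be greatly appreciated. I've been trying to do this for 3 days straight, and I'm both very glad and annoyed that I dropped out of the IT degree I started in 2002.

The site code I need to extract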

(episode image)
<img width="300" height="300" ...
https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art.png" ../>    

(episode title)
<h3><a href="https://thisiscriminal.com/episode-115-cecilia-5-24-19/">Cecilia</a></h3>

(episode number)
<h4>Episode #115</h4>

(episode link)
<p><a href="https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3"

(episode description)
</header>When Cecilia....</article>

Code

import requests
import re
from bs4 import BeautifulSoup

def get_soup(url):
    """
    @param: url of site to be scraped
    """
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    print "type: ", type(soup)
    return soup

get_soup("https://thisiscriminal.com/episodes")

def get_playable_podcast(soup):
    """
    @param: parsed html page
    """
    subjects = []

    for content in soup.find_all('a'):

        try:
            link = content.find('<p><a href="https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/')
            link = link.get('href')
            print "\n\nLink: ", link

            title = content.find('<h4>Episode ')
            title = title.get_text()

            desc = content.find('div', {'class': 'summary'})
            desc = desc.get_text()


            thumbnail = content.find('img')
            thumbnail = thumbnail.get('src')
        except AttributeError:
            continue


        item = {
                'url': link,
                'title': title,
                'desc': desc,
                'thumbnail': thumbnail
        }

        # need to check that item is not null here
        subjects.append(item)

    return subjects

2019-06-09 00:05:35.719 T:1916360240 ERROR: Control 55 in window 10502 has been asked to focus, but it can't
2019-06-09 00:05:41.312 T:1165988576 ERROR: EXCEPTION Thrown (PythonToCppException) : -->Python callback/script returned the following error<--
 - NOTE: IGNORING THIS CAN LEAD TO MEMORY LEAKS!
Error Type:
Error Contents: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
Traceback (most recent call last):
  File "/home/osmc/.kodi/addons/plugin.audio.abcradionational/addon.py", line 44, in
    desc = soup.get_text().replace('\xa0', ' ').replace('\n', ' ')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
-->End of Python script error report<--
2019-06-09 00:05:41.636 T:1130349280 ERROR: GetDirectory - Error getting plugin://plugin.audio.abcradionational/
2019-06-09 00:05:41.636 T:1916360240 ERROR: CGUIMediaWindow::GetDirectory(plugin://plugin.audio.abcradionational/) failed

Best answer

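The good news is that the page loads its content from a WP JSON source, which you can request with a simple XHR. The other answer seems to cover how to find it well.

You can then parse the information out of that JSON as needed. The text description is HTML inside the returned JSON, so you can pass it to bs4 and parse it as required. Example below. You can explore the JSON object for Cecilia here or, alternatively, paste the following into a JSON viewer: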

{'title': 'Cecilia', 'excerpt': {'short': 'When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another...', 'long': "When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don't.” Sponsors: Article Visit article.com/criminal to get $50 off your...", 'full': "When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don't.” Sponsors: Article Visit article.com/criminal to get $50 off your first purchase..."}, 'content': '<p data-pm-context="[]">When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don&#8217;t.”</p>\n<p data-pm-context="[]">Sponsors:</p>\n<p><strong>Article</strong> Visit <a href="http://article.com/criminal">article.com/criminal </a>to get $50 off your first purchase of $100 or more.</p>\n<p><a href="https://www.therealreal.com/"><strong>The Real Real</strong></a> Shop in-store, online, or download the app, and get 20% off select items with the promo code REAL.</p>\n<p><strong>Simplisafe</strong> Protect your home today and get free shipping at <a href="http://SimpliSafe.com/CRIMINAL">SimpliSafe.com/CRIMINAL</a></p>\n<p><strong>Squarespace</strong> Try <a href="http://Squarespace.com/criminal">Squarespace.com/criminal </a>for a free trial and when you’re ready to launch, use the offer code INVISIBLE to save 10% off your first purchase of a website or domain.</p>\n<p><strong>Sun Basket</strong> Go to <a href="http://sunbasket.com/criminal">sunbasket.com/criminal </a>to get up to $80 off today!</p>\n', 
'image': {'thumb': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-150x150.png', 'medium': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-300x300.png', 'large': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-1024x1024.png', 'full': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art.png'}, 'episodeNumber': '115', 'audioSource': 'https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3', 'musicCredits':"FALSE", 'id': 3129, 'slug': 'episode-115-cecilia-5-24-19', 'date': '2019-05-24 19:43:44', 'permalink': 'https://thisiscriminal.com/episode-115-cecilia-5-24-19/', 'next':"None", 'prev': {'slug': 'episode-114-philip-and-becky', 'title': 'Episode 114: Philip and Becky (5.10.2019)'}}
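
The sample object above already exposes the pieces the question asks for as plain fields, so mapping it onto the item dict from the question's code needs no HTML parsing. The following is a minimal sketch, assuming every post carries the same keys as the Cecilia sample. `post_to_item` and `posts_to_items` are hypothetical helper names, and taking the description from `excerpt['full']` and the thumbnail from `image['medium']` is only one possible choice.

```python
def post_to_item(post):
    """Map one episode object from the JSON feed to the item dict
    ('url', 'title', 'desc', 'thumbnail') built in the question's code."""
    return {
        'url': post['audioSource'],
        # Combining number and title is an illustrative choice, not the site's format.
        'title': u'Episode #{0}: {1}'.format(post['episodeNumber'], post['title']),
        'desc': post['excerpt']['full'],
        'thumbnail': post['image']['medium'],
    }


def posts_to_items(posts):
    """Convert a list of posts, skipping any entry without an audio link."""
    return [post_to_item(p) for p in posts if p.get('audioSource')]


# Trimmed copy of the Cecilia object shown above, used as sample input.
sample = {
    'title': 'Cecilia',
    'excerpt': {'full': 'When Cecilia Gentili was growing up in Argentina...'},
    'image': {'medium': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-300x300.png'},
    'episodeNumber': '115',
    'audioSource': 'https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3',
}

items = posts_to_items([sample, {'title': 'no audio', 'audioSource': ''}])
```

Each resulting dict has the same shape the question's `get_playable_podcast` was meant to return, so the rest of the plugin could consume it unchanged.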

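The request is a query-string URL, so you can change the number of items returned, and the response lists the total number of pages, so you know how many requests are needed to return everything.

If you look here

posts=1000&page=1

you can see the two parameters that can be altered accordingly.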
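
Walking through the pages with those two parameters could look roughly like the sketch below. The name of the total-pages field is not shown here, so the sketch does not rely on it; instead it stops at the first page that comes back with fewer posts than requested. That stopping rule, along with the helper names `fetch_page_http` and `fetch_all_posts` and the `max_pages` safety cap, is an assumption for illustration. It would stop too early if the server caps `posts` below the requested value.

```python
def fetch_page_http(page, per_page):
    """Fetch one page of episodes from the endpoint used in this answer."""
    import requests  # imported here so the paging logic below can run without it
    url = 'https://thisiscriminal.com/wp-json/criminal/v1/episodes'
    response = requests.get(url, params={'posts': per_page, 'page': page})
    response.raise_for_status()
    return response.json()


def fetch_all_posts(fetch_page, per_page=100, max_pages=50):
    """Collect posts page by page until a short or empty page is returned.

    fetch_page is any callable (page, per_page) -> decoded JSON dict with a
    'posts' list, which keeps the loop testable without network access.
    """
    posts = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page).get('posts') or []
        posts.extend(batch)
        if len(batch) < per_page:
            break  # fewer posts than asked for: assume this was the last page
    return posts
```

Against the live site this would be called as `fetch_all_posts(fetch_page_http)`. Passing the fetcher in as an argument is a design choice that lets the loop be tried with a stub first.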

import requests
from bs4 import BeautifulSoup as bs

# One request to the site's JSON endpoint; `posts` and `page` are query-string parameters.
r = requests.get('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=1000&page=1').json()

for post in r['posts']:
    title = post['title']
    # The description is HTML inside the JSON, so hand it to bs4 with an explicit parser.
    soup = bs(post['content'], 'html.parser')
    desc = soup.select_one('p').text  # or: soup.get_text().replace(u'\xa0', u' ').replace(u'\n', u' ')
    img = post['image']['full']
    episode_link = post['audioSource']  # sure this is what you wanted?
    episode_number = post['episodeNumber']
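
The traceback in the question points at the `get_text().replace('\xa0', ' ')` variant. Under Python 2, which Kodi add-ons ran on at the time, `get_text()` returns a unicode string while `'\xa0'` is a byte string, so Python tries to decode that byte as ASCII and fails at position 0. A minimal sketch of a helper that sidesteps this by using unicode literals throughout; `clean_text` is a hypothetical name, and decoding byte input as UTF-8 is an assumption.

```python
def clean_text(text):
    """Replace non-breaking spaces and newlines with plain spaces.

    Uses unicode literals (u'...') so no implicit ASCII decoding
    happens under Python 2; on Python 3 the prefix is a no-op.
    """
    if isinstance(text, bytes):
        # Assumption: byte input is UTF-8 encoded.
        text = text.decode('utf-8')
    return text.replace(u'\xa0', u' ').replace(u'\n', u' ')
```

In the loop above this would be used as `desc = clean_text(soup.get_text())`.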

Regarding python - Using python/beautiful soup to scrape links from a website for a Kodi plugin, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56486493/
