python - How to scrape the text inside a paragraph tag that also contains other tags, and then extract parts of that paragraph text?

Tags: python web-scraping beautifulsoup

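I want to scrape the information inside the paragraph tags.

The paragraphs also contain some other tags, as the markup below shows.

This is the HTML page to scrape: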

<div class="thecontent">
<p>Here&rsquo;s the schedule of matches for the weekend.</p>
<p>&nbsp;</p>
<p><strong>Saturday, August 17</strong></p>

<p>Achara vs. Buad, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> &mdash;&nbsp;Have enjoy it and celebrate it</p>

<p>pritos vs. baola, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> &mdash;&nbsp;Have enjoy it and celebrate it</p>


<p>timpao vs. quadrsa, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> &mdash;&nbsp;Have enjoy it and celebrate it</p>

<p><strong>Sunday, August 18</strong></p>



<p>Achara vs. timpao, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> &mdash;&nbsp;Have enjoy it and celebrate it</p>

<p>pritos vs. qaudra, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> &mdash;&nbsp;Have enjoy it and celebrate it</p>


<p>timpao vs. Buad, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> &mdash;&nbsp;Have enjoy it and celebrate it</p>
<p>&nbsp;</p>
<p><strong>Monday, August 19</strong></p>


<p>Achara vs. Buad, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> &mdash;&nbsp;Have enjoy it and celebrate it</p>
</p>
<p>&nbsp;</p></div></body></html>

I used the following Python code:

import bs4
import requests

# 'https://url' is the placeholder URL used in the question
getnwp = requests.get('https://url')
nwpcontent = getnwp.content
sp2 = bs4.BeautifulSoup(nwpcontent, 'html5lib')
pta = sp2.find('div', class_='thecontent').find_all('p')

# Print every paragraph whose text mentions a match ("vs")
for p in pta:
    text = p.get_text()
    if "vs" in text:
        print(text)

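From the information above, I only want to extract the matches between the teams and the dates on which they take place, along with the short message, as shown below: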

Saturday, August 17

Achara vs. timpao, — Have enjoy it and celebrate it

pritos vs. baola, — Have enjoy it and celebrate it

timpao vs. quadrsa, — Have enjoy it and celebrate it

Sunday, August 18

Achara vs. timpao, — Have enjoy it and celebrate it

pritos vs. qaudra, — Have enjoy it and celebrate it

timpao vs. Buad, — Have enjoy it and celebrate it

Monday, August 19

Achara vs. Buad, — Have enjoy it and celebrate it

In other words, I don't want the TV broadcast information (the text inside the anchor tags).

Best answer

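I don't know what the actual source looks like. As one example, you could remove the a tags and then target the paragraphs with :has and :not(:empty). This requires bs4 4.7.1+.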
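To see what those two pseudo-classes select, here is a minimal offline sketch. The HTML string is a hypothetical miniature of the question's markup (not the real page), and it is parsed with the built-in html.parser so that lxml is not needed; treat it as an illustration of the selectors, not as the exact code for the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the question's markup, for illustration only.
html = """
<div class="thecontent">
<p>Intro text</p>
<p><strong>Saturday, August 17</strong></p>
<p>Achara vs. Buad, <a href="#">ftv</a> &mdash; Enjoy</p>
<p></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# p:has(strong) matches paragraphs that contain a <strong> (the date headings).
headings = [p.get_text() for p in soup.select(".thecontent p:has(strong)")]

# "~ p:not(:empty)" matches the later sibling paragraphs that have some content,
# so the trailing <p></p> is skipped and the intro (an earlier sibling) is too.
following = [
    p.get_text()
    for p in soup.select(".thecontent p:has(strong) ~ p:not(:empty)")
]

print(headings)
print(following)
```

Note that a paragraph holding only a non-breaking space is not treated as empty by the selector, which is probably why a later check on the stripped text is still useful.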

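from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://worldsoccertalk.com/2019/08/16/epl-commentator-assignments-nbc-sports-gameweek-2-3/')
soup = bs(r.content, 'lxml')

# Remove every anchor so the broadcast links disappear from the text
for a in soup("a"):
    a.decompose()

# Select the date headings (paragraphs with a <strong> that is not a "SEE MORE"
# link) plus the non-empty paragraphs that follow them.
# Note: newer Soup Sieve releases spell :contains as :-soup-contains.
for i in soup.select('.thecontent p:has(strong:not(:contains("SEE MORE"))), .thecontent p:has(strong:not(:contains("SEE MORE"))) ~ p:not(:empty)'):
    data = i.text.strip()
    if data:
        if ' vs. ' in data:
            # Keep the fixture (before the first comma) and the note after the dash
            items = data.split(',')
            print(', '.join([items[0], items[-1].split('—')[1]]))
        else:
            print(data)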
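If you want to try the idea without hitting the network, the same steps can be sketched against the sample HTML from the question. This is a rough variant, not the code above: it uses plain find_all instead of the CSS selector, the helper name extract_schedule is made up for illustration, the sample is trimmed to two days, and it assumes every match line has the "fixture, links, dash, note" shape shown in the question.

```python
from bs4 import BeautifulSoup

# Sample adapted from the question (trimmed); not fetched from a live page.
html = """
<div class="thecontent">
<p>Here&rsquo;s the schedule of matches for the weekend.</p>
<p>&nbsp;</p>
<p><strong>Saturday, August 17</strong></p>
<p>Achara vs. Buad, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a> &mdash;&nbsp;Have enjoy it and celebrate it</p>
<p><strong>Sunday, August 18</strong></p>
<p>pritos vs. qaudra, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a> &mdash;&nbsp;Have enjoy it and celebrate it</p>
<p>&nbsp;</p>
</div>
"""


def extract_schedule(markup):
    """Return date headings and match lines with the broadcast links removed."""
    soup = BeautifulSoup(markup, "html.parser")
    container = soup.find("div", class_="thecontent")

    # Drop the anchors first so their text never reaches get_text().
    for a in container.find_all("a"):
        a.decompose()

    lines = []
    for p in container.find_all("p"):
        # Turn non-breaking spaces into normal spaces before stripping.
        text = p.get_text().replace("\xa0", " ").strip()
        if p.find("strong"):
            lines.append(text)  # date heading
        elif " vs. " in text:
            # Split at the dash: the fixture is before it, the note after it.
            match, _, note = text.partition("—")
            # Keep only the part before the first comma; the rest is leftover
            # punctuation from the removed links.
            lines.append(f"{match.split(',')[0].strip()}, — {note.strip()}")
    return lines


schedule = extract_schedule(html)
for line in schedule:
    print(line)
```

Checking for a strong tag is a simpler stand-in for the :has(...) selector; if the real page has other bold paragraphs (such as the "SEE MORE" links), the selector version filters them more precisely.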


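Source: this question, "python - How to scrape the text inside a paragraph tag that also contains other tags, and then extract parts of that paragraph text?", comes from Stack Overflow: https://stackoverflow.com/questions/57536962/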
