python - Remove specific content from a result using BeautifulSoup

Tags: python regex web-scraping beautifulsoup

import urllib2
from bs4 import BeautifulSoup

def get_description(link):
    redditFile = urllib2.urlopen(link)
    redditHtml = redditFile.read()
    redditFile.close()
    soup = BeautifulSoup(redditHtml)
    desc = soup.find('div', attrs={'class': 'op_gd14 FL'}).text
    return desc

This is the code that extracts the text from the following HTML:

    <div class="op_gd14 FL">
    <p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br>  
<a href="../../company-notices/nestleindia/notices/PEP02">Read all announcements in Prestige Estate</a>  </p><p>                                                </p>

</div>

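This result is fine for me. I just want to exclude the content of

<a href="../../company-notices/nestleindia/notices/PEP02">Read all announcements in Prestige Estate</a>

from the result, i.e. from desc in my script, if it is present, and leave the text unchanged if it is not. How can I do this?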

Best answer

You can use extract() to remove unwanted tags from the find() result:

descItem = soup.find('div', attrs={'class': 'op_gd14 FL'}) # get the DIV
for a_tag in descItem('a'):                                # remove <a> tags
    a_tag.extract()
return descItem.get_text()                                 # return the text
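For reference, the same idea can be sketched as a self-contained Python 3 function that works on an HTML string instead of a URL. The function name `get_description_from_html`, the shortened sample markup, and the choice of the built-in `html.parser` are assumptions made for illustration. It also swaps in `decompose()` for `extract()`, since the removed tags are not needed afterwards, and returns an empty string when the div is missing, which the original code does not handle.

```python
from bs4 import BeautifulSoup


def get_description_from_html(html):
    """Return the text of the 'op_gd14 FL' div with any <a> tags removed."""
    soup = BeautifulSoup(html, "html.parser")
    desc_item = soup.find("div", attrs={"class": "op_gd14 FL"})
    if desc_item is None:
        # Assumption: an absent div yields an empty description.
        return ""
    # find_all returns an empty list when there are no <a> tags,
    # so the loop is simply skipped in that case.
    for link in desc_item.find_all("a"):
        link.decompose()
    return desc_item.get_text().strip()


# Shortened, hypothetical sample modelled on the HTML in the question.
sample = """
<div class="op_gd14 FL">
<p>The AGM will be held on September 30, 2015.Source : BSE<br><br>
<a href="../../notices/PEP02">Read all announcements</a> </p>
</div>
"""

print(get_description_from_html(sample))
```

Because the loop runs zero times when no link is present, no separate "if present" check is needed for the `<a>` tag itself.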

Regarding python - Remove specific content from a result using BeautifulSoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32477042/
