python - BeautifulSoup getText from between <p>, not picking up subsequent paragraphs

Tags: python python-2.7 beautifulsoup

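First of all, I am a complete newbie when it comes to Python. However, I have written a piece of code to look at an RSS feed, open each link and extract the text from the article. This is what I have so far: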

from BeautifulSoup import BeautifulSoup
import feedparser
import re  # needed for the re.sub() calls further down
import urllib

# Dictionaries
links = {}
titles = {}

# Variables
n = 0

rss_url = "feed://www.gfsc.gg/_layouts/GFSC/GFSCRSSFeed.aspx?Division=ALL&Article=All&Title=News&Type=doc&List=%7b66fa9b18-776a-4e91-9f80-30195001386c%7d%23%7b679e913e-6301-4bc4-9fd9-a788b926f565%7d%23%7b0e65f37f-1129-4c78-8f59-3db5f96409fd%7d%23%7bdd7c290d-5f17-43b7-b6fd-50089368e090%7d%23%7b4790a972-c55f-46a5-8020-396780eb8506%7d%23%7b6b67c085-7c25-458d-8a98-373e0ac71c52%7d%23%7be3b71b9c-30ce-47c0-8bfb-f3224e98b756%7d%23%7b25853d98-37d7-4ba2-83f9-78685f2070df%7d%23%7b14c41f90-c462-44cf-a773-878521aa007c%7d%23%7b7ceaf3bf-d501-4f60-a3e4-2af84d0e1528%7d%23%7baf17e955-96b7-49e9-ad8a-7ee0ac097f37%7d%23%7b3faca1d0-be40-445c-a577-c742c2d367a8%7d%23%7b6296a8d6-7cab-4609-b7f7-b6b7c3a264d6%7d%23%7b43e2b52d-e4f1-4628-84ad-0042d644deaf%7d"

# Parse the RSS feed
feed = feedparser.parse(rss_url)

# view the entire feed, one entry at a time
for post in feed.entries:
    # Create variables from posts
    link = post.link
    title = post.title
    # Add the link to the dictionary
    n += 1
    links[n] = link

for k,v in links.items():
    # Download the article page that the link points to
    page = urllib.urlopen(v).read()
    page = str(page)
    soup = BeautifulSoup(page)

    # Find all of the text between paragraph tags and strip out the html
    page = soup.find('p').getText()

    # Strip ampersand codes and WATCH:
    page = re.sub('&\w+;','',page)
    page = re.sub('WATCH:','',page)

    # Print Page
    print(page)
    print(" ")

    # To stop after 3rd article, just whilst testing ** to be removed **
    if (k >= 3):
        break

This produces the following output:

>>> (executing lines 1 to 45 of "RSS_BeautifulSoup.py")
Total deposits held with Guernsey banks at the end of June 2012 increased 2.1% in sterling terms by £2.1 billion from the end of March 2012 level of £101 billion, up to £103.1 billion. This is 9.4% lower than the same time a year ago.  Total assets and liabilities increased by £2.9 billion to £131.2 billion representing a 2.3% increase over the quarter though this was 5.7% lower than the level a year ago.  The higher figures reflected the effects both of volume and exchange rate factors.

The net asset value of total funds under management and administration has increased over the quarter ended 30 June 2012 by £711 million (0.3%) to reach £270.8 billion.For the year since 30 June 2011, total net asset values decreased by £3.6 billion (1.3%).

The Commission has updated the warranties on the Form REG, Form QIF and Form FTL to take into account the Commission’s Guidance Notes on Personal Questionnaires and Personal Declarations.  In particular, the following warranty (varies slightly dependent on the application) has been inserted in the aforementioned forms,

>>> 

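The problem is that this is only the first paragraph of each article, but I need to display the entire article. Any help would be much appreciated.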

Best answer

You're very close!

# Find all of the text between paragraph tags and strip out the html
page = soup.find('p').getText()

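Using find (as you've noticed) stops after finding one result. You need find_all if you want all of the paragraphs. If the pages are formatted consistently (I only looked at one), you could also use something like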

soup.find('div',{'id':'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})

to zero in on the body of the article.
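To make the difference concrete, here is a minimal sketch of both suggestions. It assumes the newer bs4 package rather than the BeautifulSoup 3 import used in the question (BeautifulSoup 3 spells the method findAll). The sample HTML, the "article-body" id and the helper name all_paragraph_text are made up for illustration and are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for one downloaded article.
SAMPLE_HTML = """
<html><body>
<p>Site navigation</p>
<div id="article-body">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
</body></html>
"""


def all_paragraph_text(html, container_id=None):
    """Return the text of every <p>, optionally limited to one container div."""
    soup = BeautifulSoup(html, "html.parser")
    root = soup
    if container_id is not None:
        # find() returns only the first match, which is fine for a unique id.
        root = soup.find("div", {"id": container_id})
        if root is None:
            return ""
    # find_all() returns every matching tag, not just the first one.
    paragraphs = root.find_all("p")
    return "\n\n".join(p.get_text().strip() for p in paragraphs)


# Every paragraph on the page, including ones outside the article body.
print(all_paragraph_text(SAMPLE_HTML))
# Only the paragraphs inside the (hypothetical) article container.
print(all_paragraph_text(SAMPLE_HTML, container_id="article-body"))
```

Narrowing the search to the container first is probably the safer choice when a page also has paragraphs in its header, sidebar or footer, since find_all on the whole document would pick those up as well.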

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/12451997/
