python - Text information is not scraped correctly

Tags: python html beautifulsoup html-parsing

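I need to scrape the text contained in the HTML below. My code does not work when the text is split across several elements that share the same tag and class name. I need the text as a single list element rather than as two separate list elements. The code I wrote handles the case where there is no such split; in my case I need to scrape both pieces of text and append them to the list as one element.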

Sample HTML code (yields one list element), works correctly:

<DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">The board of Hillshire Brands has withdrawn its recommendation to acquire frozen foods maker Pinnacle Foods, clearing the way for Tyson Foods' $8.55bn takeover bid.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Last Monday Tyson won the bidding war for Hillshire, maker of Ball Park hot dogs, with a $63-a-share offer, topping rival poultry processor Pilgrim's Pride's $7.7bn bid.</SPAN></P>

Sample HTML code (yields two list elements):

<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2">&nbsp;News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&amp;A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>

Python code:

from itertools import groupby

from bs4 import BeautifulSoup
from lxml import html

# `response` is assumed to hold the page's HTML source as a string.
soup = BeautifulSoup(response, 'html.parser')  # parsed but not used below
tree = html.fromstring(response)
# One joined string per <div class="c5">, so two divs yield two list elements.
values = [[''.join(div.xpath('.//p[@class="c9"]//span[@class="c2"]//text()'))]
          for div in tree.xpath('//div[@class="c5"]') if div.getchildren()]
split_at = ','
textvalues = [list(g) for k, g in groupby(values, lambda x: x != split_at) if k]
list2 = [x for x in textvalues[0] if x]

def purify(items):
    # Recursively drop empty lists and empty strings.
    for i, sl in enumerate(items):
        if isinstance(sl, list):
            items[i] = purify(sl)
    return [i for i in items if i != [] and i != '']

list3 = purify(list2)
flattened = [val for sublist in list3 for val in sublist]

Current output:

["M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi","--Remaining text--"]

Expected sample output:

["M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi --Remaining text--"]

Please help me resolve the issue above.

Best answer

Something like this?

from bs4 import BeautifulSoup
a="""
<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2">&nbsp;News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&amp;A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
"""
# The parser is named explicitly; 'lxml' is an assumption, chosen because the
# question already uses lxml and the [1:] slice below relies on the HIGHLIGHT
# line being the first line of the extracted text.
l = BeautifulSoup(a, 'lxml').text.split('\n')
# Skip the HIGHLIGHT line and join everything else into one list element.
b = [' '.join(l[1:])]
print(b)

Output:

["M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi  Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago. But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food. Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.\xa0 "]
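If only the article body is wanted, selecting the elements by class avoids depending on where line breaks fall in the extracted text. The following is a minimal sketch of that idea, not the answerer's code: the helper name `collect_body_text` and the shortened `SAMPLE_HTML` are hypothetical, and it assumes the body text always sits in `span.c2` elements inside `p.c9` paragraphs inside `div.c5` blocks, as in the question's samples.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample with the same structure as the question's HTML.
SAMPLE_HTML = """
<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2">&nbsp;News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&amp;A simmers as producers swallow up brands.</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">But many were puzzled.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>
"""


def collect_body_text(markup):
    """Return a one-element list holding all body text joined by spaces."""
    soup = BeautifulSoup(markup, "html.parser")
    pieces = []
    # Only spans of class c2 inside p.c9 inside div.c5 count as body text,
    # so the HIGHLIGHT paragraph (class c6) is skipped without slicing.
    for span in soup.select("div.c5 p.c9 span.c2"):
        # Turn non-breaking spaces into plain spaces, then trim.
        text = span.get_text().replace("\xa0", " ").strip()
        if text:  # drop spans that held nothing but &nbsp;
            pieces.append(text)
    return [" ".join(pieces)]


result = collect_body_text(SAMPLE_HTML)
print(result)
```

Because the text from every matching `div` is gathered before the final join, the result is a single list element no matter how many `c5` blocks the page is split into.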

This page on "python - Text information is not scraped correctly" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/41382495/
