python-2.7 - Getting the text of all p elements in a div with BeautifulSoup

Tags: python-2.7 web-scraping beautifulsoup

I am trying to get the text (the content without the tags) of all p elements inside a given div:

import requests
from bs4 import BeautifulSoup

def getArticle(url):
    url = 'http://www.bbc.com/news/business-34421804'
    result = requests.get(url)
    c = result.content
    soup = BeautifulSoup(c)

    article = []
    article = soup.find("div", {"class":"story-body__inner"}).findAll('p')
    for element in article:
        article = ''.join(element.findAll(text = True))
    return article

The problem is that it returns only the content of the last paragraph. If I just use print instead, the code works perfectly:

    for element in article:
        print ''.join(element.findAll(text = True))
    return

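I want to call this function from elsewhere, so I need it to return the text rather than just print it. I searched Stack Overflow and googled a lot but found no answer, and I don't understand what the problem could be. I am using Python 2.7.9 and bs4. Thanks in advance!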

Best answer

The following code should work:

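import requests
from bs4 import BeautifulSoup

def getArticle(url):
    # Hard-coded as in the question; remove this line to use the url argument.
    url = 'http://www.bbc.com/news/business-34421804'
    result = requests.get(url)
    c = result.content
    # Name the parser explicitly so bs4 does not have to pick one itself.
    soup = BeautifulSoup(c, 'html.parser')

    # The accumulated text and the list of <p> elements get separate names.
    article_text = ''
    article = soup.find("div", {"class":"story-body__inner"}).findAll('p')
    for element in article:
        # Append each paragraph's text instead of overwriting the previous one.
        article_text += '\n' + ''.join(element.findAll(text = True))
    return article_text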
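
As a variation on the function above, here is a minimal sketch that collects the paragraph texts in a list and joins them once at the end. It parses an inline HTML string instead of fetching the BBC page, so it can be run offline. `SAMPLE_HTML`, `extract_paragraphs` and the `div_class` parameter are hypothetical names introduced only for this illustration; `get_text()` and `find_all()` are the bs4 methods used in place of `findAll(text = True)`. The sketch is written to run on both Python 2.7 and Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<div class="story-body__inner">
  <p>First <b>paragraph</b>.</p>
  <p>Second paragraph.</p>
</div>
<p>Outside the div.</p>
"""

def extract_paragraphs(html, div_class="story-body__inner"):
    """Return the text of every <p> inside the first matching div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"class": div_class})
    if container is None:
        # No such div on the page: return an empty string instead of raising.
        return ""
    # get_text() returns an element's text with all nested tags stripped.
    paragraphs = [p.get_text() for p in container.find_all("p")]
    return "\n".join(paragraphs)

article_text = extract_paragraphs(SAMPLE_HTML)
print(article_text)
```

Unlike the `+=` loop, joining with `'\n'` does not put a newline before the first paragraph, and the `None` check avoids an `AttributeError` when the div is missing from the page.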

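There were a few problems in your code:

The same variable name, "article", was used to store both the list of elements and the text.
The variable to be returned was reassigned on every iteration instead of being appended to, so only the last value remained in it.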
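
The two points above can be reproduced without any scraping. The sketch below uses a plain list of strings in place of the parsed p elements; `paragraphs`, `last_only` and `accumulated` are hypothetical names used only for this illustration.

```python
# Plain strings standing in for the text of each <p> element.
paragraphs = ["First paragraph.", "Second paragraph.", "Third paragraph."]

def last_only(items):
    # Mirrors the question: one name holds the list and then the text.
    article = items
    for element in article:
        # Rebinding the name does not stop the loop (it keeps iterating over
        # the original list), but each pass discards the previous value.
        article = element
    return article

def accumulated(items):
    # Mirrors the answer: a separate name that is appended to on each pass.
    article_text = ''
    for element in items:
        article_text += '\n' + element
    return article_text

print(last_only(paragraphs))
print(accumulated(paragraphs))
```

The print version in the question appears to work because each paragraph is written out before the name is rebound on the next pass.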

For more on python-2.7 - Getting the text of all p elements in a div with BeautifulSoup, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/32906238/
