python - Pulling text between two BeautifulSoup elements

Tags: python beautifulsoup html-parsing

The general problem has been asked and answered in a few places: http://www.resolvinghere.com/sof/18408799.shtml

How to get all text between just two specified tags using BeautifulSoup?

But when I try to implement it, I get very unwieldy strings.

My setup: I am trying to extract transcripts from presidential debates, and I figured I would start with this one: http://www.presidency.ucsb.edu/ws/index.php?pid=111500

I can isolate just the transcript:

transcript = soup.find_all("span", class_="displaytext")[0]

The transcript's formatting is not ideal. Every few lines of text there is a <p>, and a speaker change is indicated by a nested <b>. For example:

<p><b>TRUMP:</b> First of all, I have to say, as a businessman, I get along with everybody. I have business all over the world. [<i>booing</i>]</p>,
<p>I know so many of the people in the audience. And by the way, I'm a self-funder. I don't have — I have my wife and I have my son. That's all I have. I don't have this. [<i>applause</i>]</p>,
<p>So let me just tell you, I get along with everybody, which is my obligation to my company, to myself, et cetera.</p>,
<p>Obviously, the war in Iraq was a big, fat mistake. All right? Now, you can take it any way you want, and it took — it took Jeb Bush, if you remember at the beginning of his announcement, when he announced for president, it took him five days.</p>,
<p>He went back, it was a mistake, it wasn't a mistake. It took him five days before his people told him what to say, and he ultimately said, "It was a mistake." The war in Iraq, we spent $2 trillion, thousands of lives, we don't even have it. Iran has taken over Iraq, with the second-largest oil reserves in the world.</p>,
<p>Obviously, it was a mistake.</p>,
<p><b>DICKERSON:</b> So...</p>

But like I said, this is not a new problem. Define the start and end tags, iterate over the elements, and as long as current != next, add the text.
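The loop described above can be sketched as follows. This is a minimal, offline illustration and not necessarily the approach taken in the linked answers: the sample HTML is a shortened stand-in for the transcript, `text_between` is a hypothetical helper name, and the built-in `html.parser` is assumed so that nothing has to be fetched.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened stand-in for the transcript markup (assumption, not the live page).
SAMPLE_HTML = (
    "<span class='displaytext'>"
    "<p><b>TRUMP:</b> Obviously, it was a mistake.</p>"
    "<p>So let me just tell you.</p>"
    "<p><b>DICKERSON:</b> So...</p>"
    "</span>"
)


def text_between(start_tag, end_tag):
    """Collect the text nodes after start_tag, stopping at end_tag (hypothetical helper)."""
    pieces = []
    for element in start_tag.next_elements:
        if element is end_tag:
            break
        # Tags are skipped; their text shows up as separate NavigableString nodes.
        if isinstance(element, NavigableString) and element.strip():
            pieces.append(element.strip())
    return pieces


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
bold_tags = soup.find_all("b")
result = text_between(bold_tags[0], bold_tags[1])
print(result)
```

Because `next_elements` begins with the start tag's own text node, the first collected piece is the speaker label itself.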

So I am testing on a single element to get the details right.

startTag = transcript.find_all('b')[165]
endTag = transcript.find_all('b')[166]
content = []
content += startTag.string
content

The result I get is [u'R', u'U', u'B', u'I', u'O', u':'] instead of [u'RUBIO:'].

What am I missing?

Best answer

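The idea is to find all the b elements in the transcript, then take each b element's parent and walk through the paragraphs that follow it until reaching one that contains a b element. Implementation: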

from bs4 import BeautifulSoup, Tag
import requests

url = "http://www.presidency.ucsb.edu/ws/index.php?pid=111500"
response = requests.get(url)

soup = BeautifulSoup(response.content, "html5lib")
transcript = soup.find("span", class_="displaytext")
for item in transcript.find_all("b")[3:]:  # skipping first irrelevant parts
    # text that follows the speaker label inside the same paragraph
    part = [" ".join(sibling.get_text(strip=True) if isinstance(sibling, Tag) else sibling.strip()
                     for sibling in item.next_siblings)]
    # the following paragraphs belong to this speaker until one contains a <b>
    for paragraph in item.parent.find_next_siblings("p"):
        if paragraph.b:
            break

        part.append(paragraph.get_text(strip=True))

    print(item.get_text(strip=True))
    print("\n".join(part))
    print("-----")

Prints:

DICKERSON:
Good evening. I'm John Dickerson. This holiday weekend, as America honors our first president, we're about to hear from six men who hope to be the 45th. The candidates for the Republican nomination are here in South Carolina for their ninth debate, one week before this state holds the first-in-the-South primary.
George Washington ...
-----
DICKERSON:
Before we get started, candidates, here are the rules. When we ask you a question, you will have one minute to answer, and 30 seconds more if we ask a follow-up. If you're attacked by another candidate, you get 30 seconds to respond.
...
-----
TRUMP:
Well, I can say this. If the president, and if I were president now, ...
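
On the symptom in the question: `content += startTag.string` extends the list with every item of the right-hand iterable, and a string iterates character by character, which is why the result is a list of single letters. `list.append` adds the string as one item. A minimal sketch, using a plain `str` as a stand-in for the tag's `.string` value (BeautifulSoup's `NavigableString` is a `str` subclass, so it behaves the same way here):

```python
speaker = "RUBIO:"  # stand-in for startTag.string (assumption for illustration)

extended = []
extended += speaker       # same effect as extended.extend(speaker): one item per character

appended = []
appended.append(speaker)  # adds the whole string as a single item

print(extended)
print(appended)
```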

Regarding python - pulling text between two BeautifulSoup elements, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36358251/
