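python - Beautiful Soup: extract content based on type

Tags: python xml beautifulsoup xml-parsing

I want to extract question (type='q') and answer (type='a') pairs from the following XML format, with each pair as a single data point: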

<?xml version="1.0" encoding="us-ascii"?>
<transcript id="001" >
<body>
<section name="Q&amp;A">
      <speaker id="0">
        <plist>
          <p>Thank you. We'll now be conducting the question-and-answer session. <mark type="Operator Instructions" /> Thank you. Please go ahead with your question.</p>
        </plist>
      </speaker>
      <speaker id="3" type="q">
        <plist>
          <p>Good morning. First of all, Happy New Year.</p>
        </plist>
      </speaker>
      <speaker id="2" type="a">
        <plist>
          <p>Happy New Year, sir.</p>
        </plist>
      </speaker>
      <speaker id="3" type="q">
        <plist>
          <p>Thank you. How is your pain now?.</p>
        </plist>
      </speaker>
       <speaker id="2" type="a">
            <plist>
              <p>Oh, it's better now. I think i am healing.</p>
            </plist>
          </speaker>
</section>
</body>
</transcript>

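That is, the output should look like this: ['Good morning. First of all, Happy New Year. Happy New Year, sir.', "Thank you. How is your pain now?. Oh, it's better now. I think i am healing."]

Can someone help me do this with BeautifulSoup? My current code extracts all the <p> tags in the document, but the problem is that there are other sections (besides "Q&A") whose <p> tags get extracted as well.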

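from bs4 import BeautifulSoup

# handler is the XML input (defined elsewhere)
soup = BeautifulSoup(handler, "html.parser")
texts = []
# Collects the text of every <p> in the document, not only those under "Q&A"
for node in soup.find_all('p'):
    text = " ".join(node.find_all(string=True))
    #text = clean_text(text)
    texts.append(text)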

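Best answer

You can use find_all('speaker', type='q') and find_all('speaker', type='a') to find all the questions and all the answers, respectively. Then use zip to join each question with its corresponding answer.

Code: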

questions = soup.find_all('speaker', type='q')
answers = soup.find_all('speaker', type='a')

for q, a in zip(questions, answers):
    print(' '.join((q.p.text, a.p.text)))

Output:

Good morning. First of all, Happy New Year. Happy New Year, sir.
Thank you. How is your pain now?. Oh, it's better now. I think i am healing.

If you want the result in a list, you can use a list comprehension:

q_and_a = [' '.join((q.p.text, a.p.text)) for q, a in zip(questions, answers)]
print(q_and_a)
# ['Good morning. First of all, Happy New Year. Happy New Year, sir.',
#  "Thank you. How is your pain now?. Oh, it's better now. I think i am healing."]
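
Note that zip pairs the n-th question with the n-th answer across the whole document, so it relies on questions and answers strictly alternating, and it does not limit the search to the "Q&A" section. The following is a minimal sketch of one way to handle both points, assuming the input looks like the sample in the question. The function name extract_qa_pairs, the extra "Presentation" section, and the shortened sample are hypothetical and only there for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample; the "Presentation" section is hypothetical and is only
# here to show that <p> tags outside "Q&A" are ignored.
SAMPLE = """<transcript id="001">
<body>
<section name="Presentation">
  <speaker id="1"><plist><p>Welcome to the call.</p></plist></speaker>
</section>
<section name="Q&amp;A">
  <speaker id="0"><plist><p>Please go ahead with your question.</p></plist></speaker>
  <speaker id="3" type="q"><plist><p>Good morning. First of all, Happy New Year.</p></plist></speaker>
  <speaker id="2" type="a"><plist><p>Happy New Year, sir.</p></plist></speaker>
  <speaker id="3" type="q"><plist><p>Thank you. How is your pain now?.</p></plist></speaker>
  <speaker id="2" type="a"><plist><p>Oh, it's better now. I think i am healing.</p></plist></speaker>
</section>
</body>
</transcript>"""


def extract_qa_pairs(xml_text, section_name="Q&A"):
    """Return one 'question answer' string per q/a pair in the named section."""
    soup = BeautifulSoup(xml_text, "html.parser")

    # 'name' is also the first positional argument of find(), so the
    # attribute filter has to be passed through attrs.
    section = soup.find("section", attrs={"name": section_name})
    if section is None:
        return []

    pairs = []
    pending_question = None
    for speaker in section.find_all("speaker"):
        # Join all <p> texts of this speaker into one string.
        text = " ".join(p.get_text(" ", strip=True) for p in speaker.find_all("p"))
        kind = speaker.get("type")
        if kind == "q":
            # Remember the most recent question until an answer shows up.
            pending_question = text
        elif kind == "a" and pending_question is not None:
            pairs.append(f"{pending_question} {text}")
            pending_question = None
        # Speakers without a type (e.g. the operator) are skipped.
    return pairs
```

Calling extract_qa_pairs(SAMPLE) should give the same two strings as the list comprehension above. With this pairing rule, a question that is followed by another question is dropped, and an answer with no preceding question is ignored; whether that is the right behavior depends on the real transcripts.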

Regarding "python - Beautiful Soup: extract content based on type", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51147540/
