html - Preprocessing HTML to text for NLP

Tags: html text beautifulsoup nlp nltk

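I see that NLTK recommends using BeautifulSoup's get_text() to turn HTML into text for later NLP analysis. But it does not seem to work very well. In the example below, xyz and abc are joined into xyzabc, although they should stay separate. Is there a better preprocessing tool for converting HTML to text for NLP applications?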

$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1:

html_doc = "<h2>xyz</h2><p>abc</p>"

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.get_text())
$ ./main.py 
xyzabc

Best answer

I suggest using the html2text tool. Here is a test run from the command line:

$ html2text --ignore-links https://content.cultureandempire.com/chapter1.html 

  * Culture & Empire
  *   * __Introduction
  * __**1.** Preface 
  * __**2.** Chapter 1 - Magic Machines 
  * __**3.** Chapter 2 - Spheres of Light 
  * __**4.** Chapter 3 - Faceless Societies 
  * __**5.** Chapter 4 - Freedom in Chains 
  * __**6.** Chapter 5 - Eyes of the Spider 
  * __**7.** Chapter 6 - Wealth of Nations 
  * __**8.** Chapter 7 - March of the Kaiju 
  * __**9.** Chapter 8 - The Reveal 
  * __**10.** Postface 
  * __**11.** Appendix 1 
  *   * Published with GitBook 

#  __Culture & Empire

# Chapter 1. Magic Machines

> Far away, in a different place, a civilization called Culture had taken
seed, and was growing. It owned little except a magic spell called Knowledge.

In this chapter, I'll examine how the Internet is changing our society. It's
happening quickly. The most significant changes have occurred during just the
last 10 years or so. More and more of our knowledge about the world and other
people is transmitted and stored digitally. What we know and who we know are
moving out of our minds and into databases. These changes scare many people,
whereas in fact they contain the potential to free us, empowering us to
improve society in ways that were never before possible.

## From Bricks to Bits

Otherwise, you can use lxml.html.Element.text_content() or Python's textract.
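
For the lxml route, here is a small sketch under the assumption that lxml is installed. Note that text_content() on its own concatenates text nodes just as get_text() does in the question, so the sketch also joins the pieces from itertext() with a space. That joining step is an illustrative workaround, not part of the original answer, and the names root, joined, pieces and separated are hypothetical.

```python
import lxml.html

# The two-element snippet from the question.
html_doc = "<h2>xyz</h2><p>abc</p>"

# fromstring() wraps multiple top-level elements in a single parent element.
root = lxml.html.fromstring(html_doc)

# text_content() concatenates every text node with no separator.
joined = root.text_content()

# itertext() yields the text nodes one by one, so they can be joined explicitly.
pieces = [t.strip() for t in root.itertext() if t.strip()]
separated = " ".join(pieces)

print(joined)     # xyzabc
print(separated)  # xyz abc
```

itertext() also yields the contents of script and style elements, so on real pages those elements probably need to be removed first.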

Regarding html - Preprocessing HTML to text for NLP, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47907613/
