python - Extracting words from a text file

Tags: python awk

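I am working with a recursive neural network and need to process my input text file (which contains trees) to extract the words. The input file looks like this: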

(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))

(4 (4 (4 (2 The) (4 (3 gorgeously) (3 (2 elaborate) (2 continuation)))) (2 (2 (2 of) (2 ``)) (2 (2 The) (2 (2 (2 Lord) (2 (2 of) (2 (2 the) (2 Rings)))) (2 (2 '') (2 trilogy)))))) (2 (3 (2 (2 is) (2 (2 so) (2 huge))) (2 (2 that) (3 (2 (2 (2 a) (2 column)) (2 (2 of) (2 words))) (2 (2 (2 (2 can) (1 not)) (3 adequately)) (2 (2 describe) (2 (3 (2 (2 co-writer/director) (2 (2 Peter) (3 (2 Jackson) (2 's)))) (3 (2 expanded) (2 vision))) (2 (2 of) (2 (2 (2 J.R.R.) (2 (2 Tolkien) (2 's))) (2 Middle-earth))))))))) (2 .)))

As output, I want the list of words in a new text file:

The

Rock

is

destined

...

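(Ignore the blank lines between the words.)

I tried to do this in Python but could not find a solution. I also read that awk can be used for text processing, but I could not produce any working code. Thanks for your help.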

Best answer

You can use a regular expression!

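import re

# Excerpt of the first tree from the question; substitute the full file contents.
my_string = "(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the)"

# Capture the word that follows each "(<digit> " label.
pattern = r"\(\d\s+('?\w+)"
results = re.findall(pattern, my_string)
print(results)
# ['The', 'Rock', 'is', 'destined', 'to', 'be', 'the']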

Note that re.findall returns a list of matches, so if you want to print them all as a single sentence, you can use:

' '.join(results)

or any other separator you would like to use between the words instead of a space.

Breaking the regex pattern down, we have:
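Since the question asks for the words in a new text file, one per line, the same idea can be sketched with a newline as the separator. This is a minimal sketch, not part of the original answer: the helper names `extract_words` and `write_words` and the file names `trees.txt` and `words.txt` are hypothetical, and the input is assumed to be UTF-8.

```python
import re
from pathlib import Path

# The pattern from the answer, compiled once for reuse.
PATTERN = re.compile(r"\(\d\s+('?\w+)")


def extract_words(text):
    """Return every word the answer's pattern captures, in order of appearance."""
    return PATTERN.findall(text)


def write_words(in_path, out_path):
    """Read trees from in_path and write one extracted word per line to out_path."""
    words = extract_words(Path(in_path).read_text(encoding="utf-8"))
    Path(out_path).write_text("\n".join(words) + "\n", encoding="utf-8")
    return words


if __name__ == "__main__":
    # Hypothetical file names; adjust them to your own data.
    if Path("trees.txt").exists():
        write_words("trees.txt", "words.txt")
```

Because the pattern matches across the whole text, a file with one tree per line can be read in a single call without splitting it into lines first.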

# Written this way, the pattern must be used with the re.VERBOSE flag,
# which ignores the whitespace and comments inside it.
pattern = r"""
           \(           # match opening parenthesis
             \d         # match a single digit. If the numbers can be >9, use \d+
               \s+      # match one or more white space characters
                  (     # begin capturing group (only return stuff inside these parentheses)
                   '?   # match zero or one apostrophes (so we don't miss possessives)
                   \w+  # match one or more word characters (letters, digits, underscore)
                  )     # end capture group
           """
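The annotated form only works as a regex when it is compiled with `re.VERBOSE`. The sketch below does that, and also shows a limitation worth knowing: `\w+` stops at the first non-word character, so tokens such as `Jean-Claud` or `J.R.R.` are truncated and pure punctuation tokens are skipped. `LEAF_PATTERN` is a hypothetical alternative that is not part of the original answer; it assumes every word sits in a leaf of the form `(label token)`.

```python
import re

# The answer's pattern, compiled in verbose mode so the inline comments are ignored.
ANSWER_PATTERN = re.compile(
    r"""
    \(          # opening parenthesis
    \d          # a single-digit label
    \s+         # one or more whitespace characters
    ('?\w+)     # captured word, optionally starting with an apostrophe
    """,
    re.VERBOSE,
)

# Hypothetical alternative (an assumption, not from the answer): capture the
# whole leaf token, i.e. everything between the label and the closing parenthesis.
LEAF_PATTERN = re.compile(r"\(\d+\s+([^\s()]+)\)")

sample = "(2 (2 (2 Jean-Claud) (2 ,)) (2 (2 J.R.R.) (2 's)))"

print(ANSWER_PATTERN.findall(sample))  # ['Jean', 'J', "'s"]
print(LEAF_PATTERN.findall(sample))    # ['Jean-Claud', ',', 'J.R.R.', "'s"]
```

The alternative also returns punctuation tokens such as `,`, so filter those out afterwards if only words are wanted.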

For more on python - extracting words from a text file, see the similar question on Stack Overflow: https://stackoverflow.com/questions/50765499/
