python - How to create a 'word stream' and a 'document stream' in Python?

Tags: python stream nlp

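I want to take a bunch of text files and combine them all into two arrays: a "word stream" and a "document stream". This is done by counting the total number of word tokens in the corpus and then creating the arrays, where each entry in the word stream is the word associated with that token and the matching entry in the document stream is the document that word came from.

For example, if the corpus is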

Doc1: "The cat sat on the mat"
Doc2: "The fox jumped over the dog"

the word stream (WS) and document stream (DS) would look like this:

WS: 1 2 3 4 1 5 1 6 7 8 1 9
DS: 1 1 1 1 1 1 2 2 2 2 2 2 

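I'm not quite sure how to do this, so my question is essentially: how do I convert text files into an array of word tokens?

Best answer

Something like this? This is Python 3 code, but I think that only matters for the print statements. The comments contain some notes for future additions...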

strings = [ 'The cat sat on the mat',           # documents to process
            'The fox jumped over the dog' ]
docstream = []                                  # document indices
wordstream = []                                 # token indices
words = []                                      # tokens themselves

# Return an array of words in the given string. NOTE: this splits up by
# spaces, in real life you might want to split by multiple spaces, newlines,
# tabs, what you have. See regular expressions in the module 're' and
# 're.split(...)'
def tokenize(s):
    return s.split(' ')

# Look up a token in the 'words' list. If not present (yet), append it to
# 'words' and return the new position. NOTE: in real life you might want
# to fold cases so that 'The' and 'the' are treated the same.
def lookup_token(token):
    for i in range(len(words)):
        if words[i] == token:
            print('Found', token, 'at index', i)
            return i
    words.append(token)
    print('Appended', token, 'at index', len(words) - 1)
    return len(words) - 1

# Main starts here
for stringindex in range(len(strings)):
    print('Analyzing string:', strings[stringindex])
    tokens = tokenize(strings[stringindex])
    for t in tokens:
        print('Analyzing token', t, 'from string', stringindex)
        docstream.append(stringindex)
        wordstream.append(lookup_token(t))

# Done.
print(wordstream)
print(docstream)
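
Note that the code above numbers words and documents from 0 and treats 'The' and 'the' as different words, so its output will not match the WS/DS example in the question exactly. As a rough sketch of the additions mentioned in the comments (not part of the original answer), the variant below folds case, tokenizes with a regular expression, uses a dictionary instead of a linear search, and numbers everything from 1. The names build_streams and read_documents are made up for illustration, and treating every run of `\w` characters as a token is an assumption that may not suit every corpus.

```python
import re
from pathlib import Path


def tokenize(text):
    """Split text into lowercase tokens (runs of letters, digits, underscores)."""
    return re.findall(r"\w+", text.lower())


def build_streams(documents):
    """Return (wordstream, docstream, vocabulary) using 1-based indices."""
    word_ids = {}    # token -> 1-based word index, in order of first appearance
    wordstream = []  # one word index per token in the corpus
    docstream = []   # one document index per token in the corpus
    for doc_index, text in enumerate(documents, start=1):
        for token in tokenize(text):
            # setdefault only stores the new index if the token is unseen
            word_id = word_ids.setdefault(token, len(word_ids) + 1)
            wordstream.append(word_id)
            docstream.append(doc_index)
    vocabulary = list(word_ids)  # position i holds the word with index i + 1
    return wordstream, docstream, vocabulary


def read_documents(paths, encoding="utf-8"):
    """Hypothetical helper: read each text file into one string."""
    return [Path(p).read_text(encoding=encoding) for p in paths]


if __name__ == "__main__":
    docs = ["The cat sat on the mat", "The fox jumped over the dog"]
    ws, ds, vocab = build_streams(docs)
    print("WS:", *ws)  # WS: 1 2 3 4 1 5 1 6 7 8 1 9
    print("DS:", *ds)  # DS: 1 1 1 1 1 1 2 2 2 2 2 2
```

For a corpus stored on disk, something like build_streams(read_documents(list_of_paths)) should give the same kind of result, with documents numbered in the order the paths are given.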

Regarding "python - How to create a 'word stream' and a 'document stream' in Python?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/26763463/
