python - Making multiple search-and-replace more precise for a lemmatizer in Python

I am trying to build my own lemmatizer for Spanish in Python 2.7 using a lemmatization dictionary.

I want to replace every word in a given text with its lemma form. This is the code I have written so far.

def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text


my_text = 'Flojo y cargantes. Decepcionantes. Decenté decentó'
my_text_lower= my_text.lower()

lemmatize_list = 'ExampleDictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        depurated_line = line.rstrip()
        (val, key) = depurated_line.split("\t")
        lemmatize_word_dict[key] = val

txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt

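Below is a sample dictionary file containing the lemmatized forms used to replace the words in the input, my_text_lower. The sample dictionary is a tab-delimited, two-column file in which column 1 holds the value (the lemma) and column 2 holds the key to match (the inflected form).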

ExampleDictionary

flojo   floja
flojo   flojas
flojo   flojos
cargamento  cargamentos
cargante    cargantes
decepción   decepciones
decepcionante   decepcionantes
decentar    decenté
decentar    decentéis
decentar    decentemos
decentar    decentó

The output I want is:

flojo y cargante. decepcionante. decentar decentar

With these inputs (and the example phrase assigned to my_text in the code), my actual output is currently:

felitrojo y cargramarramarrartserargramarramarrunirdo. decepáginacionarrtícolitroargramarramarrunirdo. decentar decentar

At the moment I can't figure out what is wrong with the code.

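It seems to be replacing letters or chunks of each word, instead of recognizing the word, looking it up in the lemma dictionary, and then replacing it.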

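For example, the output above is what I get when I use the whole dictionary (more than 50,000 entries). The problem does not occur with my small sample dictionary, only with the full one, which makes me think it is "replacing" twice at some point?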

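Is there a Pythonic technique I'm missing that I could incorporate into this code to make my search-and-replace more precise, so that it identifies whole words to replace rather than chunks, and/or avoids any double replacement?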

Best answer

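Because you use text.replace, you can still match substrings inside longer words, and text that has already been replaced gets processed again by later entries. It is better to process the input one word at a time and build the output string word by word.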
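To illustrate the two failure modes described above, here is a minimal sketch. The function names and the two toy entries in pairs are hypothetical, chosen only to trigger the problem; they are not taken from the real dictionary. A list of pairs is used instead of a dict so that the replacement order is deterministic.

```python
def chained_replace(text, pairs):
    # Same strategy as the question's replace_all: each pair is applied to
    # the whole text, including text produced by earlier replacements.
    for form, lemma in pairs:
        text = text.replace(form, lemma)
    return text


def whole_word_replace(text, pairs):
    # Look each whitespace-separated word up exactly once.
    table = dict(pairs)
    return " ".join(table.get(word, word) for word in text.split())


# Hypothetical toy entries, not taken from the real dictionary.
pairs = [("es", "ser"), ("ser", "sido")]

print(chained_replace("es cargantes", pairs))     # sido cargantsido
print(whole_word_replace("es cargantes", pairs))  # ser cargantes
```

In the chained version, "es" is matched inside "cargantes", and the "ser" it produces is then rewritten again by the second entry. With tens of thousands of entries, many of them short, this kind of damage would be expected to compound, which is consistent with the garbled output in the question.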

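I've reversed your key/value pair (since you want to look up the word on the right and find the word on the left), and the main change is in replace_all: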

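# -*- coding: utf-8 -*-
import io
import re

def replace_all(text, dic):
    # Split into whole words and punctuation marks; re.UNICODE lets \w
    # match accented letters such as "é" when the text is unicode.
    tokens = re.findall(r"[\w']+|[.,!?;]", text, re.UNICODE)
    # Look each token up once; tokens missing from the dictionary are kept.
    return " ".join(dic.get(token, token) for token in tokens)

my_text = u'Flojo y cargantes. Decepcionantes. Decenté decentó'
my_text_lower = my_text.lower()

lemmatize_list = 'ExampleDictionary'
lemmatize_word_dict = {}
# Assumption: the dictionary file is UTF-8 encoded; io.open decodes it to unicode.
with io.open(lemmatize_list, encoding='utf-8') as f:
    for line in f:
        kv = line.split()
        if len(kv) == 2:  # skip blank or malformed lines
            lemmatize_word_dict[kv[1]] = kv[0]

txt = replace_all(my_text_lower, lemmatize_word_dict)
print(txt)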
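Joining tokens with spaces also puts a space before each punctuation mark. As a possible variation (not part of the original answer), the same one-lookup-per-word idea can be sketched with re.sub and a callback, which leaves spacing and punctuation untouched. The names WORD_RE and lemmatize_text are hypothetical, and the dictionary is written inline here, using entries from the sample file, so that the snippet is self-contained.

```python
# -*- coding: utf-8 -*-
import re

# Same notion of a "word" as in the answer: letters, digits, underscores
# and apostrophes. re.UNICODE makes \w cover accented letters.
WORD_RE = re.compile(r"[\w']+", re.UNICODE)


def lemmatize_text(text, lemma_dict):
    def lookup(match):
        word = match.group(0)
        # Each whole word is looked up once; unknown words are kept as-is.
        return lemma_dict.get(word, word)
    # Only the matched words are substituted; everything between them
    # (spaces, punctuation) is copied through unchanged.
    return WORD_RE.sub(lookup, text)


# Inline stand-in for the dictionary file (inflected form -> lemma).
lemma_dict = {
    u"flojas": u"flojo",
    u"cargantes": u"cargante",
    u"decepcionantes": u"decepcionante",
    u"decenté": u"decentar",
    u"decentó": u"decentar",
}

text = u"Flojo y cargantes. Decepcionantes. Decenté decentó".lower()
print(lemmatize_text(text, lemma_dict))
# flojo y cargante. decepcionante. decentar decentar
```

Because the replacement text is never scanned again, a lemma that happens to also be a key in the dictionary cannot trigger a second substitution.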

Regarding "python - Making multiple search-and-replace more precise for a lemmatizer in Python", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/35061133/
