python - How do I identify text template patterns in a dataset of strings?

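I am trying to find an efficient way to process a list of text records and identify the text templates commonly used in them, keeping only the fixed parts and abstracting the variable parts, while also counting how many records match each identified template.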

——

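My most successful attempt so far splits each text record into an array of words, compares it word by word against every other array of the same length, and writes the templates it finds to a list of templates.

As you might expect, it is far from perfect, and it struggles to run on datasets with more than 50,000 records.

I would like to know whether some text classification library, or a better logic, could do this more efficiently or faster; my current code is very naive...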

——

Here is my first attempt in Python, using very simple logic.

samples = ['Your order 12345 has been confirmed. Thank you',
'Your order 12346 has been confirmed. Thank you',
'Your order 12347 has been confirmed. Thank you',
'Your order 12348 has been confirmed. Thank you',
'Your order 12349 has been confirmed. Thank you',
'The code for your bakery purchase is 1234',
'The code for your bakery purchase is 1237',
'The code for your butcher purchase is 1232',
'The code for your butcher purchase is 1231',
'The code for your gardening purchase is 1235']

samples_split = [x.split() for x in samples]
identified_templates = []

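# Compare every record with every other record of the same word count
for words_list in samples_split:
    for words_list_ref in samples_split:
        # Skip records of a different length and exact duplicates
        if len(words_list) != len(words_list_ref) or words_list == words_list_ref:
            continue
        template = ''
        for i, word in enumerate(words_list):
            if word == words_list_ref[i]:
                template += ' ' + word  # fixed part, kept as-is
            else:
                template += ' %'        # differing word, abstracted
        identified_templates.append(template)

# Deduplicate the templates (every value is just 1, not a match count)
templates = dict()
for template in identified_templates:
    if template not in templates:
        templates[template] = 1

# Discard templates that contain three consecutive wildcards
templates_2 = dict()
for key, value in templates.items():
    if '% % %' not in key:
        templates_2[key] = 1

print(templates_2)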

Ideally, the code should take input such as:

- “Your order tracking number is 123” 
- “Thank you for creating an account with us” 
- “Your order tracking number is 888”
- “Thank you for creating an account with us” 
- “Hello Jim, what is your issue?”
- “Hello Jack, what is your issue?”

and output a list of templates along with the number of records each one matches:

- “Your order tracking number is {}”,2
- “Thank you for creating an account with us”,2
- “Hello {}, what is your issue?”,2
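One possible way to get from that input to that output without comparing every pair of records is to mask one word at a time and group records by the masked string. The sketch below is only an illustration under stated assumptions: the names `find_templates`, `placeholder`, and `members` are hypothetical, it only detects templates with a single variable slot, and it assumes the records do not already contain the placeholder text.

```python
import re
from collections import Counter, defaultdict


def find_templates(records, placeholder="{}"):
    """Return {template: number of matching records} (single-slot sketch)."""
    record_counts = Counter(records)

    # masked string -> {record text: the word that was masked out}
    members = defaultdict(dict)
    for text in record_counts:
        # Alternate runs of word and non-word characters, so that
        # "".join(tokens) reproduces the record exactly.
        tokens = re.findall(r"\w+|\W+", text)
        for i, token in enumerate(tokens):
            if not re.match(r"\w", token):
                continue  # never mask whitespace or punctuation
            key = "".join(tokens[:i] + [placeholder] + tokens[i + 1:])
            members[key][text] = token

    templates = {}
    covered = set()
    for key, matched in members.items():
        # A masked string is a template only if the slot really varies.
        if len(set(matched.values())) >= 2:
            templates[key] = sum(record_counts[t] for t in matched)
            covered.update(matched)

    # Records repeated verbatim are templates with no variable part.
    for text, n in record_counts.items():
        if n >= 2 and text not in covered:
            templates[text] = n
    return templates


records = [
    "Your order tracking number is 123",
    "Thank you for creating an account with us",
    "Your order tracking number is 888",
    "Thank you for creating an account with us",
    "Hello Jim, what is your issue?",
    "Hello Jack, what is your issue?",
]

for template, count in find_templates(records).items():
    print(template, count)
```

On the six example records this should report the three templates listed above, each with a count of 2. Because each record is visited once per word rather than once per other record, the work grows roughly linearly with the number of records; records that differ in two or more words at once would need a different approach.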

Best answer

You can try the code below. I hope the output matches your expectations.

import re
templates_2 = {}
samples = ['Your order 12345 has been confirmed. Thank you',
'Your order 12346 has been confirmed. Thank you',
'Your order 12347 has been confirmed. Thank you',
'Your order 12348 has been confirmed. Thank you',
'Your order 12349 has been confirmed. Thank you',
'The code for your bakery purchase is 1234',
'The code for your bakery purchase is 1237',
'The code for your butcher purchase is 1232',
'The code for your butcher purchase is 1231',
'The code for your gardening purchase is 1235']

# Replace every run of digits with a placeholder to get each record's template
identified_templates = [re.sub(r'[0-9]+', '{}', asample) for asample in samples]
unique_identified_templates = list(set(identified_templates))
# Count how many records produced each template
for atemplate in unique_identified_templates:
    templates_2[atemplate] = identified_templates.count(atemplate)
for k, v in templates_2.items():
    print(k, ':', v)

Output:

The code for your gardening purchase is {} : 1
Your order {} has been confirmed. Thank you : 5
The code for your bakery purchase is {} : 2
The code for your butcher purchase is {} : 2
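Since the question mentions trouble beyond 50,000 records, it may be worth noting that `list.count` rescans the whole list once per template. A variant using `collections.Counter` tallies the templates in a single pass; this is a minimal sketch of the same digit-replacement idea, not a different algorithm, and the name `template_counts` is just an illustrative choice.

```python
import re
from collections import Counter

samples = ['Your order 12345 has been confirmed. Thank you',
           'Your order 12346 has been confirmed. Thank you',
           'Your order 12347 has been confirmed. Thank you',
           'Your order 12348 has been confirmed. Thank you',
           'Your order 12349 has been confirmed. Thank you',
           'The code for your bakery purchase is 1234',
           'The code for your bakery purchase is 1237',
           'The code for your butcher purchase is 1232',
           'The code for your butcher purchase is 1231',
           'The code for your gardening purchase is 1235']

# Compile once; every run of digits becomes the placeholder "{}".
digits = re.compile(r'[0-9]+')

# Counter tallies each template as it is produced, in one pass.
template_counts = Counter(digits.sub('{}', s) for s in samples)

for template, count in template_counts.most_common():
    print(template, ':', count)
```

Like the code above, this only abstracts numbers, so variable words such as the names in "Hello Jim, what is your issue?" would still be treated as fixed text.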

Regarding "python - How do I identify text template patterns in a dataset of strings?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55672528/
