python - 最长公共(public)序列群

标签 python algorithm nlp pattern-matching

给定以下文本行

TOKYO-BLING.1 H02-AVAILABLE
TOKYO-BLING.1 H02-MIDDLING
TOKYO-BLING.1 H02-TOP
TOKYO-BLING.2 H04-USED
TOKYO-BLING.2 H04-AVAILABLE
TOKYO-BLING.2 H04-CANCELLED
WAY-VERING.1 H03-TOP
WAY-VERING.2 H03-USED
WAY-VERING.2 H03-AVAILABLE
WAY-VERING.1 H03-CANCELLED

我想做一些解析来生成比较合理的分组。上面的列表可以分组如下

TOKYO-BLING.1 H02-AVAILABLE
TOKYO-BLING.1 H02-MIDDLING
TOKYO-BLING.1 H02-TOP

TOKYO-BLING.2 H04-USED
TOKYO-BLING.2 H04-AVAILABLE
TOKYO-BLING.2 H04-CANCELLED

WAY-VERING.2 H03-USED
WAY-VERING.2 H03-AVAILABLE

WAY-VERING.1 H03-TOP
WAY-VERING.1 H03-CANCELLED

任何人都可以建议一种算法(或某种方法)来扫描给定数量的文本并计算出可以按上述方式对文本进行分组。显然每组都可以更进一步。我想我正在寻找一个很好的解决方案来查看短语列表并找出如何最好地按一些常见的字符串序列对它们进行分组。

最佳答案

这是一种方法:

  1. 对您的条目进行排序
  2. 确定每个条目之间公共(public)前缀的长度
  3. 通过在公共(public)前缀比前一个条目短的点分隔列表来对您的条目进行分组

示例实现:

def common_count(t0, t1):
  "returns the length of the longest common prefix"
  for i, pair in enumerate(zip(t0, t1)):
    if pair[0] != pair[1]:
      return i
  return i

def group_by_longest_prefix(iterable):
  "given a sorted list of strings, group by longest common prefix"
  longest = 0
  out = []

  for t in iterable:
    if out: # if there are previous entries 

      # determine length of prefix in common with previous line
      common = common_count(t, out[-1])

      # if the current entry has a shorted prefix, output previous 
      # entries as a group then start a new group
      if common < longest:
        yield out
        longest = 0
        out = []
      # otherwise, just update the target prefix length
      else:
        longest = common

    # add the current entry to the group
    out.append(t)

  # return remaining entries as the last group
  if out:
    yield out

示例用法:

text = """
TOKYO-BLING.1 H02-AVAILABLE
TOKYO-BLING.1 H02-MIDDLING
TOKYO-BLING.1 H02-TOP
TOKYO-BLING.2 H04-USED
TOKYO-BLING.2 H04-AVAILABLE
TOKYO-BLING.2 H04-CANCELLED
WAY-VERING.1 H03-TOP
WAY-VERING.2 H03-USED
WAY-VERING.2 H03-AVAILABLE
WAY-VERING.1 H03-CANCELLED
"""

T = sorted(t.strip() for t in text.split("\n") if t)

for L in group_by_longest_prefix(T):
  print L

这会产生:

['TOKYO-BLING.1 H02-AVAILABLE', 'TOKYO-BLING.1 H02-MIDDLING', 'TOKYO-BLING.1 H02-TOP']
['TOKYO-BLING.2 H04-AVAILABLE', 'TOKYO-BLING.2 H04-CANCELLED', 'TOKYO-BLING.2 H04-USED']
['WAY-VERING.1 H03-CANCELLED', 'WAY-VERING.1 H03-TOP']
['WAY-VERING.2 H03-AVAILABLE', 'WAY-VERING.2 H03-USED']

在此处查看实际效果:http://ideone.com/1Da0S

关于python - 最长公共(public)序列群,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/11263068/

相关文章:

Python HDFS 蛇咬 : Methods work only with print

python - 不同长度向量的余弦相似度?

python - 如何使 Tkinter 在 mac OS 上看起来好看或自然?

c# - 遍历任意维度的数组

algorithm - 如何找出 "Big theta"表示法中的常量

algorithm - 基于点的局部线性分割一组点的分割算法

python - 转移矩阵中的多个 ngram,概率不加到 1

python - GAE 数据存储使用 Python 过滤不返回任何数据

适用于 Maya 的 Python : Why can't I use a variable in concatination with a wildcard?