python - Searching for list items in another, longer list in Python

Tags: python regex list lookup

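I am new to this forum, so I apologize if this is a long question.

I am trying to create a generic keyword parser that takes a list of keywords and a list of text lines (which may be generated from a database or a free-form text file). I want to extract entities from the list of text lines based on the list of keywords, so as to produce three key outputs:

  1. The keyword that was mentioned
  2. The text line that mentions this keyword
  3. The number of times the keyword is mentioned in that text line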

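Below is a sample of the Python code I wrote for this. As you can see, I am trying to do this in three stages:

Stage 1 - Accept a reject pattern so that I can remove all known unwanted lines from the list of text lines

Stage 2 (parsing pass 1) - Perform an index-type search on the keywords to reduce the list of lines that need the full loop search

Stage 3 - Perform the full loop search.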

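Problem: Stage 3 (pass 2 in the code) is extremely inefficient. For example, with a keyword list of 4,500 elements and nearly 2 million text lines, the code runs for more than 24 hours.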
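
To put that scale in perspective, here is a rough count of the work the full loop search does, assuming one `re.findall` call per keyword for every text line, as in the nested loop of the code that follows. The variable names are illustrative only, and the line count is rounded up from "nearly 2 million".

```python
# Rough cost estimate for the nested loop (figures taken from the question).
num_keywords = 4_500      # keywords in the list
num_lines = 2_000_000     # "nearly 2 million" text lines, rounded up

# One re.findall call per (text line, keyword) pair.
total_calls = num_keywords * num_lines
print(f"{total_calls:,} regex searches")  # 9,000,000,000 regex searches
```

Each of those calls also rescans the whole line, so the total work grows with keywords × lines × line length.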

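Can anyone suggest a better way to do step 2? Or is there a better way to write the whole function?

I am a Python beginner, so apologies in advance if I am missing something obvious.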

##########################################################################################
# The keyWord parser conducts a 2 pass keyword lookup and parsing.
# Inputs:
#  keywordIDsList - Is a list of the IDs of the keyword (Standard declaration: keywordIDsList[]= Hash value of the keyWords)
#  KeywordDict - is the Dict of all the keywords and the associated ID.
#          (Standard declaration: keywordDict[keywordID]=(keywordID, keyWord) where keywordID is hash value in keywordIDsList)
#  valueIDsList - Is a list of the IDs of all the values that need to be parsed (Standard declaration: valueIDsList[]= Unique reference number of the values)
#  valuesDict - Is the Dict of all the value lines and the associated IDs.
#          (Standard declaration: valuesDict[uniqueValueKey]=(uniqueValueKey, valueText) where uniqueValueKey is the unique key in valueIDsList)
#  rejectPattern - A regular expression based pattern for rejecting columns with certain types of patterns. This is an optional field.
# Outputs:
#  parsedHashIDsList - Is the list of hash values generated for every successful parse result
#  parsedResultsDict - Is actual parsed value as parsedResultsDict[parsedHashID]=(uniqueValueKey, keywordID, frequencyResult)
#  successResultIDsList - list of all unique value references that were parsed successfully
#  rejectResultIDsList - list of all unique value references that were rejected
##########################################################################################

import re


def keywordParser(keywordIDsList, keywordDict, valueIDsList, valuesDict, rejectPattern):
    parsedResultsDict = {}
    parsedHashIDsList = []
    successResultIDsList = []
    rejectResultIDsList = []
    processListPass1 = []
    processListPass2 = []
    idxkeyWordDict = {}

    for keyID in keywordIDsList:
        keywordID, keyWord = keywordDict[keyID]
        idxkeyWordDict[keyWord] = (keywordID, keyWord)

    percCount = 1
    #    optional: if rejectPattern is provided then reject lines
    # ## Some python code for processing the reject patterns - this works fine

    #    Pass 1: Index based matching - partial code for index based search
    for valueID in processListPass1:
        valKey, valText = valuesDict[valueID]
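        try:
            # idxkeyWordDict values are (keywordID, keyWord) tuples
            keywordID, keyWordVal = idxkeyWordDict[valText]
        except KeyError:
            # no exact keyword match; defer this line to the full search in pass 2
            processListPass2.append(valueID)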

    percCount = 0

    #   Pass 2: Text based search and lookup - this part of the code is extremely inefficient

    for valueID in processListPass2:
        percCount += 1
        valKey, valText = valuesDict[valueID]
        valSuccess = 'N'
        for keyID in keywordIDsList:
            # keywordDict values are (keywordID, keyWord) tuples, as documented above
            keywordID, keyWordVal = keywordDict[keyID]
            keySearch = re.findall(keyWordVal, valText, re.DOTALL)
            if keySearch:
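                parsedHashID = hash(str(valueID) + str(keyID))
                # record the hash so the returned parsedHashIDsList is not left empty
                parsedHashIDsList.append(parsedHashID)
                parsedResultsDict[parsedHashID] = (valueID, keywordID, len(keySearch))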
                valSuccess = 'Y'
        if valSuccess == 'Y':
            successResultIDsList.append(valueID)
        else:
            rejectResultIDsList.append(valueID)

    return (parsedResultsDict, parsedHashIDsList, successResultIDsList, rejectResultIDsList)

Best answer

This is a perfect use case for the Aho-Corasick string matching algorithm. A similar use case is explained, with code examples in Python, in this blog post.
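
The answer names the algorithm but includes no code, so here is a minimal pure-Python sketch of Aho-Corasick applied to this problem. It assumes the keywords are literal strings rather than regular expressions. The function names `build_automaton` and `count_keywords` and the sample data are illustrative and do not come from the original post or from any library.

```python
from collections import deque, defaultdict


def build_automaton(keywords):
    """Build the trie, failure links and output lists for the keywords."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> keywords that end at this state

    # Phase 1: insert every distinct keyword into the trie.
    for word in dict.fromkeys(keywords):
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                goto.append({})
                fail.append(0)
                out.append([])
                nxt = len(goto) - 1
                goto[state][ch] = nxt
            state = nxt
        out[state].append(word)

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # inherit keywords that end at the fallback state
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def count_keywords(text, automaton):
    """Scan text once and count occurrences of every keyword in it."""
    goto, fail, out = automaton
    counts = defaultdict(int)
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            counts[word] += 1
    return dict(counts)


# Hypothetical sample data standing in for keywordDict and valuesDict.
keywords = ["he", "she", "hers", "his"]
lines = {1: "ushers", 2: "this is his", 3: "nothing"}

automaton = build_automaton(keywords)
results = {}
for line_id, text in lines.items():
    hits = count_keywords(text, automaton)
    if hits:
        results[line_id] = hits  # keyword -> count for this line

print(results)
```

Each line is scanned once regardless of how many keywords there are, instead of once per keyword. Two differences from the `re.findall` loop are worth keeping in mind: overlapping occurrences are all counted (`re.findall` counts only non-overlapping ones), and regex metacharacters in a keyword are matched literally.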

Regarding "python - Searching for list items in another, longer list in Python", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/20967794/
