python - How do I omit duplicates in pyparsing?

Tags: python parsing pyparsing

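OK, I finally have a grammar that captures all of my test cases, but I get one duplicate (case 3) and one false positive (case 6, "PATTERN 5"). Here are my test cases and my desired output.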

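I'm still very new to Python (though able to teach my child! Scary!), so I'm sure there is an obvious way to solve this, and I'm not even sure it's a pyparsing problem. Here is my current output: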

['01/01/01','S01-12345','20/111-22-1001',['GLEASON', ['5', '+', '4'], '=', '9']]
['02/02/02','S02-1234','20/111-22-1002',['GLEASON', 'SCORE', ':', ['3', '+', '3'], '=', '6']]
['03/02/03','S03-1234','31/111-22-1003',['GLEASON', 'GRADE', ['4', '+', '3'], '=', '7']]
['03/02/03','S03-1234','31/111-22-1003',['GLEASON', 'SCORE', ':', '7', '=', ['4', '+', '3']]]
['04/17/04','S04-123','30/111-22-1004',['GLEASON', 'SCORE', ':', ['3', '+', '4', '-', '7']]]
['05/28/05','S05-1234','20/111-22-1005',['GLEASON', 'SCORE', '7', '[', ['3', '+', '4'], ']']]
['06/18/06','S06-10686','20/111-22-1006',['GLEASON', ['4', '+', '3']]]
['06/18/06','S06-10686','20/111-22-1006',['GLEASON', 'PATTERN', '5']]
['07/22/07','S07-2749','20/111-22-1007',['GLEASON', 'SCORE', '6', '(', ['3', '+', '3'], ')']]

Here is the grammar:

from pyparsing import (Word, nums, Combine, Group, Optional, oneOf,
                       operatorPrecedence, opAssoc)

num = Word(nums)
arith_expr = operatorPrecedence(num,
    [
    (oneOf('-'), 1, opAssoc.RIGHT),
    (oneOf('* /'), 2, opAssoc.LEFT),
    (oneOf('+ -'), 2, opAssoc.LEFT),
    ])
accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
accessionNumber = Combine("S" + num + "-" + num)("accNum")
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
score = (Optional(oneOf('( [')) +
         arith_expr('lhs') +
         Optional(oneOf(') ]')) +
         Optional(oneOf('= -')) +
         Optional(oneOf('( [')) +
         Optional(arith_expr('rhs')) +
         Optional(oneOf(') ]')))
gleason = Group("GLEASON" + Optional("SCORE") + Optional("GRADE") + Optional("PATTERN") + Optional(":") + score)
patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
partMatch = patientData("patientData") | gleason("gleason")

And the output function:

lastPatientData = None 
for match in partMatch.searchString(TEXT):
    if match.patientData:
        lastPatientData = match
    elif match.gleason:
        if lastPatientData is None:
            print("bad!")
            continue
        # getParts()
        FOUT.write( "['{0.accDate}','{0.accNum}','{0.patientNum}',{1}]\n".format(lastPatientData.patientData, match.gleason))

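As you can see, the output isn't as good as it looks: I'm just writing to a file and faking some of the syntax. I've been struggling with how to get at pyparsing's intermediate results so that I can work with them. Should I just write this out and run a second script to find the duplicates?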

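Update, based on Paul McGuire's answer. The output of this function gets me down to one line per entry, but now I'm losing parts of the score. (Each Gleason score, conceptually, has the form primary + secondary = total. This is headed for a database, so pri, sec, and tot are separate PostgreSQL columns or, in the parser's output, comma-separated values.)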

accumPatientData = None
for match in partMatch.searchString(TEXT):
    if match.patientData:
        if accumPatientData is not None:
             #this is a new patient data, print out the accumulated
             #Gleason scores for the previous one
             writeOut(accumPatientData)
        accumPatientData = (match.patientData, [])
    elif match.gleason:
        accumPatientData[1].append(match.gleason)
if accumPatientData is not None:
    writeOut(accumPatientData)

So now the output looks like this:

01/01/01,S01-12345,20/111-22-1001,9
02/02/02,S02-1234,20/111-22-1002,6
03/02/03,S03-1234,31/111-22-1003,7,4+3
04/17/04,S04-123,30/111-22-1004,
05/28/05,S05-1234,20/111-22-1005,3+4
06/18/06,S06-10686,20/111-22-1006,,
07/22/07,S07-2749,20/111-22-1007,3+3

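I'd like to reach back in there, grab some of the elements I'm losing, rearrange them, work out the ones that are missing, and put them all back. Something like this pseudocode: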

def diceGleason(glrhs,gllhs)
    if glrhs.len() == 0:
        pri = gllhs[0]
        sec = gllhs[2]
        tot = pri + sec
        return [pri, sec, tot]
    elif glrhs.len() == 1:
        pri = gllhs[0]
        sec = gllhs[2]
        tot = glrhs
        return [pri, sec, tot]
    else:
        pri = glrhs[0]
        sec = glrhs[2]
        tot = gllhs
        return [pri, sec, tot]

Update 2: OK, Paul is great, but I'm dumb. After trying exactly what he said, I tried several ways to get at pri, sec, and tot, but I failed. I keep getting errors like this:

Traceback (most recent call last):
  File "Stage1.py", line 81, in <module>
    writeOut(accumPatientData)
  File "Stage1.py", line 47, in writeOut
    FOUT.write( "{0.accDate},{0.accNum},{0.patientNum},{1.pri},{1.sec},{1.tot}\n".format( pd, gleasonList))
AttributeError: 'list' object has no attribute 'pri'

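These AttributeErrors are what I keep getting. Clearly I don't understand what is going on in between (Paul, I have the book, I swear it's open in front of me, but I don't get it). Here is my script. Is something in the wrong place? Am I calling the results wrong?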

Best answer

I didn't make any changes to your parser, but I did make a few changes to your post-parsing code.

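You aren't really getting "duplicates". The problem is that you print out the current patient data every time you see a Gleason score, and some of your patient data records include multiple Gleason score entries. If I understand what you are trying to do, here is the pseudocode I would follow: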

accumulator = None
foreach match in (patientDataExpr | gleasonScoreExpr).searchString(source):

    if it's a patientDataExpr:
        if accumulator is not None:
            # we are starting a new patient data record, print out the previous one
            print out accumulated data
        initialize new accumulator with current match and empty list for gleason data

    else if it's a gleasonScoreExpr:
        add this expression into the current accumulator

# done with the for loop, do one last printout of the accumulated data
if accumulator is not None:
    print out accumulated data

This converts to Python pretty easily:

def printOut(patientDataTuple):
    pd,gleasonList = patientDataTuple
    print( "['{0.accDate}','{0.accNum}','{0.patientNum}',{1}]".format(
        pd, ','.join(''.join(gl.rhs) for gl in gleasonList)))

accumPatientData = None
for match in partMatch.searchString(TEXT):
    if match.patientData:
        if accumPatientData is not None:
            # this is a new patient data, print out the accumulated 
            # Gleason scores for the previous one
            printOut(accumPatientData)

        # start accumulating for a new patient data entry
        accumPatientData = (match.patientData, [])

    elif match.gleason:
        accumPatientData[1].append(match.gleason)
    #~ print match.dump()

if accumPatientData is not None:
    printOut(accumPatientData)

I don't think I'm dumping out the Gleason data correctly, but I think you can tweak it from here.
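To see the accumulation pattern in isolation, here is a minimal sketch that runs without pyparsing. The `matches` list and its `("patient", ...)` / `("gleason", ...)` tuples are hypothetical stand-ins for what `partMatch.searchString(TEXT)` yields, and `group_by_patient` is a name made up for this illustration. Unlike the loop above, it collects the records into a list instead of printing as it goes, so there is no final flush step.

```python
def group_by_patient(matches):
    """Collect every Gleason entry under the patient record that precedes it."""
    records = []
    current = None
    for kind, value in matches:
        if kind == "patient":
            # a new patient record starts; later scores attach to it
            current = (value, [])
            records.append(current)
        elif kind == "gleason":
            if current is None:
                # score seen before any patient data: nothing to attach it to
                continue
            current[1].append(value)
    return records


# hypothetical stand-in data modeled on cases 2 and 3 in the question
matches = [
    ("patient", "02/02/02,S02-1234,20/111-22-1002"),
    ("gleason", "3+3=6"),
    ("patient", "03/02/03,S03-1234,31/111-22-1003"),
    ("gleason", "4+3=7"),
    ("gleason", "7=4+3"),
]

rows = [",".join([patient] + scores)
        for patient, scores in group_by_patient(matches)]
for row in rows:
    print(row)
```

With this input, the two Gleason entries for the third case end up on a single row rather than producing two rows for the same patient.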

EDIT:

You can attach diceGleason as a parse action to gleason and get this behavior:

def diceGleasonParseAction(tokens):
    def diceGleason(glrhs,gllhs):
        if len(glrhs) == 0:
            pri = gllhs[0]
            sec = gllhs[2]
            #~ tot = pri + sec
            tot = str(int(pri)+int(sec))
            return [pri, sec, tot]
        elif len(glrhs) == 1:
            pri = gllhs[0]
            sec = gllhs[2]
            tot = glrhs
            return [pri, sec, tot]
        else:
            pri = glrhs[0]
            sec = glrhs[2]
            tot = gllhs
            return [pri, sec, tot]

    pri,sec,tot = diceGleason(tokens.gleason.rhs, tokens.gleason.lhs)

    # assign results names for later use
    tokens.gleason['pri'] = pri
    tokens.gleason['sec'] = sec
    tokens.gleason['tot'] = tot

gleason.setParseAction(diceGleasonParseAction)

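You just had one typo: you added pri and sec to get tot, but these are all strings, so you were adding '3' and '4' and getting '34'. Converting to ints to do the addition was all that was needed. Otherwise, I left diceGleason verbatim inside diceGleasonParseAction, to isolate your logic for inferring pri, sec, and tot from the mechanics of decorating the parsed tokens with new results names. Since the parse action doesn't return anything new, the tokens are updated in place and then carried along to be used later in your output method.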
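Both points can be sketched without pyparsing. The first lines show the string-concatenation slip. The rest is one possible shape for the output method once each Gleason entry carries `pri`, `sec`, and `tot`. The `SimpleNamespace` objects are stand-ins for the parsed results, and `format_rows` is a hypothetical helper, not part of the original script. It formats each entry of the list separately, because the list itself has no `pri` attribute, which is what the `AttributeError` in the question reports.

```python
from types import SimpleNamespace

# strings concatenate; convert to int before adding
pri, sec = "3", "4"
assert pri + sec == "34"
assert str(int(pri) + int(sec)) == "7"


def format_rows(patient_data_tuple):
    """Return one CSV line per Gleason entry for a (patient, gleason_list) pair."""
    pd, gleason_list = patient_data_tuple
    # format each entry on its own; the list has no .pri/.sec/.tot
    return [
        "{0.accDate},{0.accNum},{0.patientNum},{1.pri},{1.sec},{1.tot}".format(pd, gl)
        for gl in gleason_list
    ]


# stand-ins for the parsed patient data and one decorated Gleason entry
patient = SimpleNamespace(accDate="01/01/01", accNum="S01-12345",
                          patientNum="20/111-22-1001")
entry = SimpleNamespace(pri="5", sec="4", tot="9")

lines = format_rows((patient, [entry]))
print(lines[0])
```

Whether a patient with several Gleason entries should produce several rows or a single merged row depends on the database layout, so treat the one-row-per-entry choice here as an assumption.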

Regarding "python - How do I omit duplicates in pyparsing?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18475048/