python - Find all strings in source code that are not inside comments, using Python

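I have some C-like source code, and I am trying to extract every string literal in it into a list, excluding strings that appear inside comments. A string in this source code can contain any characters, including whitespace and even comment markers.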

Example:

// this is an inline comment with the name "Alpha 1"

string x = "Alpha 2";
/** this is a block comment with the string "Alpha 3" */
foo("Alpha 4");
string y = "Alpha /*  */ 5 // with comments";

Output:

["Alpha 2", "Alpha 4", "Alpha /*  */ 5 // with comments"]

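The problem is that I can't rely on a simple regex: a string may legitimately contain comment markers, and an inline or block comment may of course contain quoted strings.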

I use this to get all the strings in the code:

re.findall(r'\"(.+?)\"', code)

But it also returns the strings that are inside comments.
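
A minimal reproduction of the problem, assuming the sample above is stored in a variable named `code` (the name comes from the call above; the rest of the snippet is illustrative):

```python
import re

# The sample source from the question, stored in `code`.
code = '''
// this is an inline comment with the name "Alpha 1"

string x = "Alpha 2";
/** this is a block comment with the string "Alpha 3" */
foo("Alpha 4");
string y = "Alpha /*  */ 5 // with comments";
'''

# The regex pairs up quotes with no notion of comments,
# so quoted text inside comments is matched as well.
found = re.findall(r'\"(.+?)\"', code)
print(found)
```

The result includes "Alpha 1" and "Alpha 3", which come from comments and should not be there.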

Any help?

Best answer

If the language is as simple as you describe, I would write the parser by hand. I would still use a regex to tokenize the input.

Here you go:

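import re
from itertools import takewhile


def extract_strings(source):
    def consume(it, end):
        # Pull tokens from the shared iterator up to and including `end`;
        # the `end` token itself is dropped from the returned list.
        return list(takewhile(lambda x: x != end, it))

    # Split on the delimiters; the capturing group keeps them in the result.
    tokens = iter(re.split(r'''("|/\*|\*/|//|\n)''', source))
    strings = []
    for token in tokens:
        if token == '"':
            # Everything up to the closing quote is string content.
            strings.append(''.join(consume(tokens, '"')))
        elif token == '//':
            # Inline comment: skip the rest of the line.
            consume(tokens, '\n')
        elif token == '/*':
            # Block comment: skip up to the closing marker.
            consume(tokens, '*/')
    return strings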

data = '''
// this is an inline comment with the name "Alpha 1"

string x = "Alpha 2";
/** this is a block comment with the string "Alpha 3" */
foo("Alpha 4");
string y = "Alpha /*  */ 5 // with comments";
'''
print(extract_strings(data))
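
To see why consuming from the shared iterator works, it can help to look at the token stream on its own. The sketch below applies the same split pattern to a short made-up line (the input and the `TOKEN_RE` name are illustrative, not part of the answer):

```python
import re

# Same delimiter pattern as in extract_strings; the capturing group
# makes re.split keep the delimiters as separate tokens.
TOKEN_RE = re.compile(r'''("|/\*|\*/|//|\n)''')

sample = 'foo("a /* b"); // c "d"\n'
tokens = TOKEN_RE.split(sample)
print(tokens)
```

Once an opening `"` is seen, the parser consumes tokens up to the next `"`, so the `/*` inside the string is never treated as the start of a comment. Likewise, the quotes after `//` are swallowed while skipping to the newline. Note that this approach treats every `"` as a delimiter, so a language with backslash-escaped quotes would need extra handling.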

For "python - Find all strings in source code that are not inside comments, using Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47945819/
