compiler-construction - 直接编码与表驱动的词法分析器？

我是编译器构建世界的新手，我想知道直接编码与表驱动的词法分析器之间的区别是什么？

如果可能，请使用简单的源代码示例。

谢谢。

编辑:

在Engineering a Compiler书中，作者将词法分析器分为三类(3):表驱动，直接编码和手工编码。

最佳答案

我将假设您所说的“直接编码”是一个手写的词法分析器，而不是指作为词法分析器生成器的输出而生成的词法分析器。在这种情况下...(请参阅下文)。

... 表驱动的词法分析器是一个(通常自动生成的)程序，它使用某种查找表来确定要执行的操作。考虑与正则表达式ab*a(有意未最小化)相对应的finite automaton:

如果我们只考虑字符“a”和“b”，则可以在查找表中对其进行编码，如下所示:

#define REJECT -1

/* This table encodes the transitions for a given state and character. */
const int transitions[][] = {
    /* In state 0, if we see an a then go to state 1 (the 1).
     * Otherwise, reject input.
     */
    { /*a*/  1,  /*b*/  REJECT },
    { /*a*/  2,  /*b*/  3      },
    { /*a*/ -1,  /*b*/ -1      }, /* Could put anything here. */
    { /*a*/  2,  /*b*/  3      }
};

/* This table determines, for each state, whether it is an accepting state. */
const int accept[] = { 0, 0, 1, 0 };

现在我们只需要一个实际使用该表的驱动程序:

int scan(void) {
    char ch;
    int state = 0;

    while (!accept[state]) {
        ch = getchar() - 'a'; /* Adjust so that a => 0, b => 1. */
        if (transitions[state][ch] == REJECT) {
            fprintf(stderr, "invalid token!\n");
            return 0; /* Fail. */
        } else {
            state = transitions[state][ch];
        }
    }
    return 1; /* Success! */
}

当然，在真正的词法分析器中，每个 token 都将具有相应的接受状态，并且您将必须以某种方式修改接受表以包含 token 标识符。不过，我想强调两点:

表驱动的词法分析器不一定在DFA状态级别上运行。

我不建议手动编写表驱动的词法分析器，因为这是一个乏味且容易出错的过程。

一个手写的(由于缺少更好的名称)词法分析器通常不使用查找表。假设我们需要一种简单的类似于Lisp的语言的词法分析器，该语言具有括号，标识符和十进制整数:

enum token {
    ERROR,
    LPAREN,
    RPAREN,
    IDENT,
    NUMBER
};

enum token scan(void) {
    /* Consume all leading whitespace. */
    char ch = first_nonblank();
    if (ch == '(') return LPAREN;
    else if (ch == ')') return RPAREN;
    else if (isalpha(ch)) return ident();
    else if (isdigit(ch)) return number();
    else {
        printf("invalid token!\n");
        return ERROR;
    }
}

char first_nonblank(void) {
    char ch;
    do {
        ch = getchar();
    } while (isspace(ch));
    return ch;
}

enum token ident(void) {
    char ch;
    do {
        ch = getchar();
    } while (isalpha(ch));
    ungetc(ch, stdin); /* Put back the first non-alphabetic character. */
    return IDENT;
}

enum token number(void) {
    char ch;
    do {
        ch = getchar();
    } while (isdigit(ch));
    ungetc(ch, stdin); /* Put back the first non-digit. */
    return NUMBER;
}

像表驱动的词法分析器示例一样，该示例也不完整。一方面，它需要某种缓冲来存储作为 token 一部分的字符，例如IDENT和NUMBER。另一方面，它不能特别优雅地处理EOF。但是希望您能掌握要点。

编辑:根据Engineering a Compiler中的定义，直接编码的词法分析器基本上是两者的混合。与使用表格不同，我们使用代码标签来表示状态。让我们看看使用与以前相同的DFA的外观。

int scan(void) {
    char ch;

state0:
    ch = getchar();
    if (ch == 'a') goto state1;
    else { error(); return 0; }

state1:
    ch = getchar();
    if (ch == 'a') goto state2;
    else if (ch == 'b') goto state3;
    else { error(); return 0; }

state2:
    return 1; /* Accept! */

state3:
    ch = getchar();
    if (ch == 'a') goto state2;
    else if (ch == 'b') goto state3; /* Loop. */
    else { error(); return 0; }
}

在我的个人经历中，我上面概述的手写方法是“最佳”的方法来编写词法分析器。它可以在每种平台，每种语言上运行，您无需学习lex之类的古老工具，而且也许最重要的是，您可以灵活地扩展lexer，使其具有难以或无法通过工具实现的功能。例如，也许您希望您的词法分析器理解Python风格的块indentaiton，或者可能需要动态扩展某些 token 类。

关于compiler-construction - 直接编码与表驱动的词法分析器？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/27763544/

compiler-construction - 直接编码与表驱动的词法分析器？

上一篇：asp.net - 如何在Visual Studio 2013中允许父路径？

下一篇：unit-testing - 团队 build 练习来教授编写单元测试