c - Tokenizing a line without using strtok

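I am reading lines from a file and tokenizing them. A token is either delimited by whitespace or enclosed in quotes (for example: "to ken").

I wrote some code, but I am having trouble with pointers. I don't know how to store the tokens from a line, or rather how to set up pointers to them.

I was also advised to put a 0 after each token I "recognize", so that I know where it ends, and to store only a pointer to the start of each token in char *tokens[].

My current code: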

char *tokens[50];
int token_count;

int tokenize(char *line){
    token_count = 0;
    int n = 0;          

    while(line[n] != NULL || line[n] != '\n'){
        while(isspace(line[n++]));
        if(line[n] == '"'){
            while(line[++n] != '"' || line[n] != NULL){
                  /* set tokens[n] */
            }
        }
        else{
            while(!isspace(line[n++])){
                  /*set tokens[n] */
            }

        }

        n++;
    }

    tokens[token_count] = 0;

}

Best answer

You can step through a string by using its base pointer str and an index n, incrementing n as you go:

while (str[n] != '\0') n++;

Your task may be easier if you use a pointer instead:

while (*str != '\0') str++;

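A token can then be represented by the value of the pointer just before you read the token, that is, at the moment you hit a quote or a non-space character. This gives you the start of the token.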

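What about the length of the token? In C, strings are arrays of characters terminated by a null character. This means that your token contains the whole rest of the line, and therefore all subsequent tokens as well. You could place a '\0' after each token, but this has two drawbacks: it does not work on read-only string literals, and, depending on your token syntax, it is not always feasible. For example, the string a"b b"c should probably parse as three tokens, a, "b b" and c, but placing null characters after the tokens would break the tokenization, because the terminator for a would overwrite the opening quote of the next token.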
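The null-terminator approach can be sketched roughly as follows. This is only an illustration under stated assumptions: the line lives in a writable buffer (not a string literal), a quote is only recognized at the start of a token, and the name tokenize_inplace and its parameters are made up for this example.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: splits `line` in place by overwriting the character
 * that follows each token with '\0'. Stores up to `max` token start
 * pointers in `tokens` and returns the number stored, or -1 if a quote
 * is not closed. Assumes `line` is writable. */
int tokenize_inplace(char *line, char *tokens[], int max)
{
    char *p = line;
    int count = 0;

    while (count < max) {
        while (isspace((unsigned char)*p)) p++;   /* skip leading blanks */
        if (*p == '\0') break;

        if (*p == '"') {
            tokens[count++] = ++p;                /* token starts after the quote */
            while (*p && *p != '"') p++;
            if (*p == '\0') return -1;            /* quote not closed */
        } else {
            tokens[count++] = p;
            while (*p && !isspace((unsigned char)*p)) p++;
            if (*p == '\0') break;                /* token ends the line */
        }
        *p++ = '\0';                              /* terminate token, move on */
    }
    return count;
}
```

Because the terminator replaces the closing quote or the separating blank, the stored tokens are ordinary C strings, but the original line is destroyed in the process, and an input such as a"b b"c is not split into three tokens.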

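Another approach is to store each token as a pair: a pointer to its first character and a length. These tokens are no longer null-terminated, so you have to copy them into a temporary buffer if you want to use them with the standard C string functions.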
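As a rough sketch of what working with such pointer-and-length tokens might look like, the helpers below copy a token into a null-terminated buffer and compare a token with a C string without copying. The names token_dup and token_equals are hypothetical, and the struct simply mirrors the pointer-plus-length pair described above.

```c
#include <stdlib.h>
#include <string.h>

/* A token as a (start pointer, length) pair; not null-terminated. */
struct token {
    const char *str;
    int length;
};

/* Hypothetical helper: returns a freshly allocated, null-terminated copy
 * of the token, or NULL if allocation fails. The caller must free it. */
char *token_dup(const struct token *tk)
{
    char *s = malloc((size_t)tk->length + 1);

    if (s == NULL) return NULL;
    memcpy(s, tk->str, (size_t)tk->length);
    s[tk->length] = '\0';
    return s;
}

/* Hypothetical helper: compares a token with a C string without copying. */
int token_equals(const struct token *tk, const char *s)
{
    size_t len = strlen(s);

    return len == (size_t)tk->length && memcmp(tk->str, s, len) == 0;
}
```

A copy is only needed when a function insists on a terminated string; for comparisons and printing, the length alone is usually enough.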

Here is one way to do it.

#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>

struct token {
    const char *str;
    int length;
};

int tokenize(const char *p, struct token tk[], int n)
{
    const char *start;
    int count = 0;   

    while (*p) {
        while (isspace((unsigned char)*p)) p++;   /* cast avoids UB for negative char values */
        if (*p == '\0') break;

        start = p;
        if (*p == '"') {
            p++;
            while (*p && *p != '"') p++;
            if (*p == '\0') return -1;        /* quote not closed */            
            p++;
        } else {            
            while (*p && !isspace((unsigned char)*p) && *p != '"') p++;
        }

        if (count < n) {
            tk[count].str = start;
            tk[count].length = (int)(p - start);
        }
        count++;
    }

    return count;
}

void token_print(const struct token tk[], int n)
{
    int i;

    for (i = 0; i < n; i++) {
        printf("[%d] '%.*s'\n", i, tk[i].length, tk[i].str);
    }
}

#define MAX_TOKEN 10

int main()
{
    const char *line = "The \"New York\" Stock Exchange";
    struct token tk[MAX_TOKEN];
    int n;

    n = tokenize(line, tk, MAX_TOKEN);
    if (n > MAX_TOKEN) n = MAX_TOKEN;
    token_print(tk, n);    

    return 0;
}

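The start of each token is saved in a local variable and assigned to the token after scanning. When p points to the character just after the token, the expression: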

p - start

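gives you the length. (This is called pointer arithmetic.) The routine scans all tokens, but it assigns at most n of them so that it does not overflow the supplied buffer. It still returns the total number of tokens found, so the caller can tell when some were dropped.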
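Since the routine keeps counting even when the buffer is full, one possible use is a two-pass scheme: call it once with a capacity of 0 to learn the token count, then allocate exactly that many entries and call it again. The sketch below assumes this usage; tokenize_alloc is a hypothetical wrapper, and the scanner is repeated only so that the example compiles on its own.

```c
#include <ctype.h>
#include <stdlib.h>

struct token {
    const char *str;
    int length;
};

/* Same scanner as in the answer, repeated so this sketch compiles alone. */
int tokenize(const char *p, struct token tk[], int n)
{
    const char *start;
    int count = 0;

    while (*p) {
        while (isspace((unsigned char)*p)) p++;
        if (*p == '\0') break;

        start = p;
        if (*p == '"') {
            p++;
            while (*p && *p != '"') p++;
            if (*p == '\0') return -1;        /* quote not closed */
            p++;
        } else {
            while (*p && !isspace((unsigned char)*p) && *p != '"') p++;
        }

        if (count < n) {                      /* store only while there is room */
            tk[count].str = start;
            tk[count].length = (int)(p - start);
        }
        count++;
    }
    return count;
}

/* Hypothetical wrapper: the first pass counts, the second pass fills.
 * On return *count is the token count, or -1 on a syntax or allocation
 * error. Returns NULL when there is nothing to free. */
struct token *tokenize_alloc(const char *line, int *count)
{
    struct token *tk;
    int n = tokenize(line, NULL, 0);          /* capacity 0: tk is never written */

    *count = n;
    if (n <= 0) return NULL;                  /* error or empty line */

    tk = malloc((size_t)n * sizeof *tk);
    if (tk == NULL) {
        *count = -1;
        return NULL;
    }
    tokenize(line, tk, n);
    return tk;
}
```

The cost is scanning the line twice, which is usually acceptable for short input lines; the caller frees the returned array with free().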

This question, "c - Tokenizing a line without using strtok", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/24014715/
