c++ - Parsing noun phrases from parsed text using C/C++

I want to extract the noun phrases (tags NN, NNP, NNS, NNPS) from text that has already been part-of-speech tagged. For example:

Input sentence -
John/NNP
works/VBZ
in/IN
oil/NN
industry/NN
./.
Output: John Oil Industry

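I am confused about the logic: I need to search the string for tags such as /NN, /NNP, /NNS and /NNPS and print the word immediately before each one. What is the logic for parsing noun phrases in C or C++?

My own attempt is as follows: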

char* SplitString(char* str, char sep)
{
    return str;
}
main()
{
    char* input = "John/NNP works/VBZ in/IN oil/NN industry/NN ./.";
    char *output, *temp;
    char * field;
    char sep = '/NNP';
    int cnt = 1;
    output = SplitString(input, sep);

    field = output;
    for(temp = field; *temp; ++temp){ 
       if (*temp == sep){
          printf(" %.*s\n", temp-field, field);
          field = temp+1;
       }
    }
    printf("%.*s\n", temp-field, field);
}

My modified version is as follows:

#include <regex>
#include <iostream>

int main()
{
    const std::string s = "John/NNP works/VBZ in/IN oil/NNS industry/NNPS ./.";
    std::regex rgx("(\\w+)\/NN[P-S]{0,2}");
    std::smatch match;

    if (std::regex_search(s.begin(), s.end(), match, rgx))
        std::cout << " " << match[1] << '\n';
}

The only output I get is "John". The words with the other tags (/NNS, /NNPS) do not appear.

My second approach:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

char** str_split(char* a_str, const char a_delim)
{
    char** result = 0;
    size_t count = 0;
    char* tmp = a_str;
    char* last_comma = 0;
    char delim[2];
    delim[0] = a_delim;
    delim[1] = 0;

    /* Count how many elements will be extracted. */
    while (*tmp)
    {
        if (a_delim == *tmp)
        {
            count++;
            last_comma = tmp;
        }
        tmp++;
    }

    /* Add space for trailing token. */
    count += last_comma < (a_str + strlen(a_str) - 1);

    /* Add space for terminating null string so caller
       knows where the list of returned strings ends. */
    count++;

    result = malloc(sizeof(char*) * count);

    if (result)
    {
        size_t idx  = 0;
        char* token = strtok(a_str, delim);

        while (token)
        {
            assert(idx < count);
            *(result + idx++) = strdup(token);
            token = strtok(0, delim);
        }
        assert(idx == count - 1);
        *(result + idx) = 0;
    }

    return result;
}

int main()
{
    char text[] = "John/NNP works/VBZ in/IN oil/NN industry/NN ./.";
    char** tokens;

    //printf("INPUT SENTENCE=[%s]\n\n", text);

    tokens = str_split(text, ' ');

    if (tokens)
    {
        int i;
        for (i = 0; *(tokens + i); i++)
        {
            printf("[%s]\n", *(tokens + i));
            free(*(tokens + i));
        }
        printf("\n");
        free(tokens);
    }

    return 0;
}

The output is:

[John/NNP]
[works/VBZ]
[in/IN]
[oil/NN]
[industry/NN]
[./.]

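I only want the words tagged /NNP and /NN, i.e. John, oil and industry. How can I get that? Would a regular expression help? How can I use regular expressions in C the way I can in C++?

Best answer

If it is all about printing, try this approach. It uses a regular expression in a search function to look for the pattern /NN[A-Z]{0,3}, i.e. /NN followed by 0 to 3 uppercase letters, and captures the preceding word with the group (\\w+).

This is untested, though: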

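#include <iostream>
#include <regex>
#include <string>

int main()
{
    const std::string s = "John/NNP works/VBZ in/IN oil/NN industry/NN ./.";
    // Capture the word in front of /NN followed by 0 to 3 uppercase letters.
    const std::regex rgx("(\\w+)/NN[A-Z]{0,3}");

    // Walk over every match. Calling regex_search in a loop on the same,
    // unchanged string would return the first match ("John") forever.
    for (std::sregex_iterator it(s.begin(), s.end(), rgx), end; it != end; ++it)
        std::cout << "match: " << (*it)[1] << '\n';
}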

Regarding "c++ - Parsing noun phrases from parsed text using C/C++", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/33560880/
