R does not treat a data frame as character; cannot grep; incorrect use of as.character()?

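Edit, one day later: the answer to this question showed me that I needed to rewrite my code substantially. So the problem has essentially gone away, because I am no longer grepping inside a data frame. The current code, which is much cleaner, is below.

I am leaving the original question here, though, in case my learning process is useful to anyone.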

# 1. Find lines containing both "un" and "ɛ̃"
original_lines <- readLines('Test.txt')
lines_with_pattern <- grep('un.*ɛ̃', original_lines, value = TRUE)

# CHANGE PHONES TO FIND AND PHONES TO ADD

# 2. Duplicate the line in which the pattern occurs and change the relevant phoneme
# gsub() is vectorized, so no explicit loop is needed
modified_lines <- gsub("ɛ̃", "œ̃", lines_with_pattern)

# 3. Combine modified lines with original lines
all_lines <- c(original_lines, modified_lines)

# 4. Sort the lines alphabetically
sorted_lines <- sort(all_lines)

# 5. Write the sorted lines to a file
writeLines(sorted_lines, 'myfile.txt', sep = '\n')
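
The same five steps can be tried without the original Test.txt. What follows is only a minimal sketch: the three input lines are made-up stand-ins (not the real data), the phones are written as Unicode escapes so the script does not depend on the file's encoding, and tempfile() is used as a hypothetical output location.

```r
# Phones written as escapes: "ɛ̃" and "œ̃" (base letter + combining tilde).
eps <- "\u025b\u0303"
oe  <- "\u0153\u0303"

# Made-up tab-separated lines standing in for the contents of Test.txt.
original_lines <- c(
  paste0("lundi\tl ", eps, " d i"),
  paste0("matin\tm a t ", eps),
  paste0("defunt\td e f ", eps)
)

# 1. Keep lines where "un" is followed later on the line by the phone.
lines_with_pattern <- grep(paste0("un.*", eps), original_lines, value = TRUE)

# 2. Make a modified copy of each matching line.
modified_lines <- gsub(eps, oe, lines_with_pattern, fixed = TRUE)

# 3. and 4. Combine with the originals and sort.
sorted_lines <- sort(c(original_lines, modified_lines))

# 5. Write one line per element; the default separator is a real newline.
out_file <- tempfile(fileext = ".txt")
writeLines(sorted_lines, out_file, useBytes = TRUE)
written <- readLines(out_file, encoding = "UTF-8")
```

Here "matin" is left alone because it contains no "un", so the output holds the three original lines plus two modified copies.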

Original question

I am trying to grep a data frame whose rows consist of two tab-separated columns, e.g.

              V1            V2
17 nempruntèrent ɑ̃ p ʁ ɛ̃ t ɛ ʁ
18     vemprunté   ɑ̃ p ʁ ɛ̃ t e
19    fempruntée   ɑ̃ p ʁ ɛ̃ t e
20   wempruntées   ɑ̃ p ʁ ɛ̃ t e
21    2empruntés   ɑ̃ p ʁ ɛ̃ t e

(Excerpt: the last five rows of the data frame. The first column contains French-like dummy words; the second contains IPA transcriptions.)

test <- read.delim('Test.txt', header=FALSE)
print(test)

produces the printout shown above, so it looks as though R "knows" what is in the data frame.

But I want to grep for certain strings, so I tried

# 1. Find lines containing both "un" and "ɛ̃"
lines_with_pattern <- grep("un", test, value = TRUE)
print(lines_with_pattern)

This does not work.

The result of the grep above is named character(0). That means R did not find the characters it was looking for, so I tried

test <- read.delim('Test.txt', header=FALSE)
test <- as.character(test)

I think I am not using as.character() correctly, because that snippet produces, for example,

     V1 V2
 
[17,] NA NA
[18,] NA NA
[19,] NA NA
[20,] NA NA
[21,] NA NA

(again the last five lines of the output)

and print(test) then produces

[1] "c(17, 18, 12, 20, 1)"
[2] "c(8, 6, 6, 6, 6)"

(the last five numbers of each resulting vector)

and

lines_with_pattern <- grep("un", test, value = TRUE)
# value = TRUE, fixed = FALSE, useBytes = TRUE, invert = FALSE)
print(lines_with_pattern)

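produces character(0).

So: I do not understand the vectors that print(test) produces in the example above; the numbers do not seem to correspond to anything in the data. And my original question remains: what do I need to do to grep this data set?

Sorry for the long message and the newbie question, but thank you very much for your help!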

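Best answer

grep() cannot be used on a data frame as is. Moreover, data frames are not well suited to row-wise operations, which is what you seem to be interested in. In base R, I would use a combination of apply() (which implicitly converts the data to a character matrix) and grepl():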

which(apply(
  test,
  MARGIN = 1,
  FUN = function(x) any(grepl("un", x)) & any(grepl("ɛ̃", x))
))

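This gives you all rows in which both "un" and "ɛ̃" occur. which() is there to return row numbers rather than a logical vector (which would itself work perfectly well for subsetting).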
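
To make this concrete, here is a minimal, self-contained sketch on a toy data frame. The words and simplified transcriptions are made up, and the phones are written as Unicode escapes. The example assumes the columns were read in as factors, which would be consistent with the numeric strings that as.character(test) printed in the question, though the question does not confirm it. It also shows a column-wise alternative that is not part of the answer itself.

```r
# Phones written as escapes: "ɛ̃" and "œ̃" (base letter + combining tilde).
eps <- "\u025b\u0303"
oe  <- "\u0153\u0303"

# Toy stand-in for the data frame read from Test.txt (made-up rows).
test <- data.frame(
  V1 = c("lundi", "aucun", "matin", "defunt"),
  V2 = c(paste("l", eps, "d i"), paste("o k", oe),
         paste("m a t", eps), paste("d e f", eps)),
  stringsAsFactors = TRUE  # assumption: columns were read in as factors
)

# as.character() on a data frame works column by column, so it returns
# one string per column rather than the text of each cell.
per_column <- as.character(test)
length(per_column)  # 2, the number of columns

# Row-wise search as in the answer: apply() hands each row to FUN as a
# character vector, so grepl() sees the actual cell text.
hits <- which(apply(
  test,
  MARGIN = 1,
  FUN = function(x) any(grepl("un", x)) && any(grepl(eps, x, fixed = TRUE))
))
test[hits, ]  # rows 1 and 4 of the toy data

# Column-wise alternative when each pattern belongs to a known column.
hits_by_column <- which(
  grepl("un", as.character(test$V1)) &
    grepl(eps, as.character(test$V2), fixed = TRUE)
)
```

The column-wise form avoids the conversion to a matrix, but it only applies when you know which column each pattern should be searched in.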

A similar question about "R does not treat a data frame as character; cannot grep; incorrect use of as.character()?" can be found on Stack Overflow: https://stackoverflow.com/questions/76406125/
