sql - dplyr vs. dbplyr: filtering with whitespace

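This is partly related to my previous question. If I use dplyr to filter a data frame whose unique ids have trailing whitespace against ids without trailing whitespace, dplyr treats the space as a character, so no match occurs and the result is an empty data frame: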

library(tidyverse)
df <- tibble(a = c("hjhjh"), d = c(1))
df
# # A tibble: 1 x 2
#   a          d
#   <chr>  <dbl>
# 1 hjhjh      1

ids <- df %>% 
  select(a) %>% 
  pull()
ids
#[1] "hjhjh"

df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))
df_with_space
#quotation marks:
# # A tibble: 2 x 2
#   a            d
#   <chr>    <dbl>
# 1 "hjhjh "     1
# 2 "popopo"     2

#now filter
df_new <- df_with_space %>% 
  filter(a  %in% ids)
df_new
# no direct match made, empty dataframe
# A tibble: 0 x 2
# ... with 2 variables: a <chr>, d <dbl>
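
If the trailing whitespace carries no meaning, one local option is to compare on a trimmed copy of the column. The following is a minimal sketch on the same toy data, using base R's `trimws()` inside `filter()`; it assumes that only surrounding whitespace differs between the two sets of ids.

```r
library(dplyr)

ids <- c("hjhjh")
df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))

# Compare on the trimmed value, but leave the stored value untouched
df_trimmed_match <- df_with_space %>%
  filter(trimws(a) %in% ids)

df_trimmed_match$a
# [1] "hjhjh "
```

The row is matched, and the original value (including its trailing space) is kept in the output.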

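If I try the same operation with dbplyr against a SQL database, the filter ignores the whitespace but the whitespace is still included in the final output. Example code: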

library(dbplyr)
library(DBI)
library(odbc)
test_db <- dbConnect(odbc::odbc(),
                       Database = "test",
                       dsn = "SQL_server") 
db_df <- tbl(test_db, "testing")
db_df <- db_df %>% 
  filter(a  %in% ids) %>% 
  collect()
#quotation marks:
# # A tibble: 1 x 2
#   a            d
#   <chr>    <dbl>
# 1 "hjhjh "     1   #matches but includes the white space

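I'm not familiar with SQL. Is this expected? If so, when do you need to worry about (trailing) whitespace? I thought I would have to trim the whitespace first, which is very slow on a large database: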

# start again from the remote table (db_df above was already collected)
db_df <- tbl(test_db, "testing") %>% 
  mutate(a = str_trim(a, "both")) %>% 
  filter(a  %in% ids) %>% 
  collect()
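
To see what the trimming pipeline asks the database to do, the generated SQL can be inspected without a live connection. This is a sketch that assumes dbplyr's simulated SQL Server backend (`simulate_mssql()`) is a reasonable stand-in for the real connection; the exact SQL text depends on the dbplyr version, so no output is shown.

```r
library(dplyr)
library(dbplyr)

ids <- c("hjhjh")
dfx <- data.frame(a = c("hjhjh ", "popopo"), d = c(1, 2))

# Simulated SQL Server backend: SQL is generated, nothing is executed
trim_query <- tbl_lazy(dfx, con = simulate_mssql()) %>%
  mutate(a = str_trim(a, "both")) %>%
  filter(a %in% ids)

show_query(trim_query)
```

Because the comparison is then made on a computed value rather than on the stored column, the database generally cannot use an index on `a` for it, which is one plausible reason the trimmed version is slow on a large table.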

Thanks

Edit

Using show_query:

<SQL>
SELECT *
FROM `df`
WHERE (`a` IN ('hjhjh'))

I think this gives a reproducible scenario:

dfx <- data.frame(a = c("hjhjh ", "popopo"), d = c(1, 2))
dfx = tbl_lazy(dfx, con = simulate_mssql())
dfx %>% 
  filter(a  %in% ids) 
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`a` IN ('hjhjh'))

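Best answer

If you are connecting to SQL Server, then I can reproduce this. Personally, I would label it a "bug" and never rely on it...

There is no need to involve dbplyr here: the issue is in the underlying DBMS. dbplyr is just the messenger, so don't blame the messenger :-)

Setup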

consqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
conpg <- DBI::dbConnect(odbc::odbc(), ...)
conmar <- DBI::dbConnect(odbc::odbc(), ...)
conss <- DBI::dbConnect(odbc::odbc(), ...)
cons <- list(sqlite = consqlite, postgres = conpg, maria = conmar, sqlserver = conss)

df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))
for (thiscon in cons) {
  DBI::dbWriteTable(thiscon, "mytable", df_with_space)
}

Tests

lapply(cons, function(thiscon) {
  DBI::dbGetQuery(thiscon, "select * from mytable where a in ('hjhjh')")
})
# $sqlite
# [1] a d
# <0 rows> (or 0-length row.names)
# $postgres
# [1] a d
# <0 rows> (or 0-length row.names)
# $maria
#        a d
# 1 hjhjh  1
# $sqlserver
#        a d
# 1 hjhjh  1

lapply(cons, function(thiscon) {
  DBI::dbGetQuery(thiscon, "select * from mytable where a in ('popopo ')")
})
# $sqlite
# [1] a d
# <0 rows> (or 0-length row.names)
# $postgres
# [1] a d
# <0 rows> (or 0-length row.names)
# $maria
#        a d
# 1 popopo 2
# $sqlserver
#        a d
# 1 popopo 2

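SQL Server and MariaDB both "fail" in both test cases (a stored trailing space in the first, a trailing space in the search literal in the second); SQLite and Postgres fail in neither.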
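
The SQL Server and MariaDB results are consistent with a comparison rule in which the shorter string is padded with trailing spaces before the two are compared. The base R sketch below only imitates that rule to make the two test results easier to reason about. `pad_space_equal` is a hypothetical helper, not part of any package, and whether a given DBMS actually applies such a rule is an assumption here that depends on its collation and column type.

```r
# Hypothetical helper: compare two strings after right-padding the shorter
# one with spaces, imitating a "pad with spaces" comparison rule.
pad_space_equal <- function(x, y) {
  width <- max(nchar(x), nchar(y))
  pad_right <- function(s) paste0(s, strrep(" ", width - nchar(s)))
  pad_right(x) == pad_right(y)
}

pad_space_equal("hjhjh ", "hjhjh")    # TRUE:  stored value has the extra space
pad_space_equal("popopo", "popopo ")  # TRUE:  search literal has the extra space
pad_space_equal(" hjhjh", "hjhjh")    # FALSE: leading spaces still matter
"hjhjh " == "hjhjh"                   # FALSE: R's own comparison is exact
```

Under this rule only trailing spaces are ignored, which matches both test cases above.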

I did not see this in the SQL spec, so I don't know whether these are bugs, accidental or undocumented features, options, or something else.

Workaround

Sorry, I don't have one offhand. (Other than accepting this "feature" and doing additional filtering after the query.)
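
The "additional filtering after the query" idea could look roughly like the sketch below. `filter_exact` is a hypothetical helper, and it assumes the table has a character column named `a`, as in the examples above. A local tibble stands in for the rows that SQL Server returned, so the sketch runs without a database.

```r
library(dplyr)

# Hypothetical helper: `remote_tbl` is assumed to be a dbplyr lazy table
# (or a data frame) with a character column `a`.
filter_exact <- function(remote_tbl, ids) {
  remote_tbl %>%
    filter(a %in% ids) %>%  # database side: may also return padded matches
    collect() %>%           # bring the candidate rows into R
    filter(a %in% ids)      # R side: exact, whitespace-sensitive comparison
}

# Stand-in for the row SQL Server returned above for ids = "hjhjh"
returned_by_server <- tibble(a = c("hjhjh "), d = c(1))
exact <- filter_exact(returned_by_server, "hjhjh")
nrow(exact)
# [1] 0
```

The second `filter()` restores dplyr's local semantics, so the result no longer depends on how the DBMS treats trailing spaces.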

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/68998247/
