sql-server - RODBC和Microsoft SQL Server : Truncating Long Character Strings

标签 sql-server r truncate rhel rodbc

我正在尝试使用R/RODBC从Microsoft SQL Server数据库查询变量。 RODBC正在将字符串截断为8000个字符。

原始代码:截断为255个字符(根据RODBC文档) library(RODBC) con_string <- odbcConnect("DSN") query_string <- "SELECT text_var FROM table_name" dat <- sqlQuery(con_string, query_string, stringsAsFactors=FALSE)
部分解决方案:修改查询字符串以截断7999个字符后的文本。 library(RODBC) con_string <- odbcConnect("DSN") query_string <- "SELECT [text_var]=CAST(text_var AS VARCHAR(8000)) FROM table_name" dat <- sqlQuery(con_string, query_string, stringsAsFactors=FALSE)
该表/变量包含长达250,000个字符的文本字符串。我真的很想使用R中的所有文本。这可能吗?

@BrianRipley在以下文档的第18页上讨论了该问题(但没有解决方案):
https://cran.r-project.org/web/packages/RODBC/vignettes/RODBC.pdf

@nutterb讨论了GitHub上RODBCext包的类似问题:
https://github.com/zozlak/RODBCext/issues/6

在SO上看到了类似的讨论,但没有使用VARCHAR> 8000的RODBC解决方案。

RODBC sqlQuery() returns varchar(255) when it should return varchar(MAX)

RODBC string getting truncated

笔记:

  • R 3.3.2
  • Microsoft SQL Server 2012年
  • Linux RHEL 7.1的
  • 用于SQL Server的Microsoft ODBC驱动程序
  • 最佳答案

    由于这是Microsoft提供的ODBC驱动程序的限制,因此在更改驱动程序之前,几乎没有什么可做的。 @zozlak解释了为什么在您链接的GitHub问题中。

    我倾向于在需要时使用存储过程来解决此问题,但这通常需要为每个特定实例编写一个存储过程。在某个时候,我可能会想出一种更通用的在存储过程中执行此操作的方法,但是我发现在存储过程中构造查询的过程既繁琐又令人沮丧。

    因此,我花了一些时间来构建一个函数,该函数将执行涉及VARCHAR(MAX)变量的有限查询。这是一种蛮力方法,对于17000个字符的变量,将其导出为三个变量并将其粘贴到R中。这很粗糙,可能不是很有效,但是我想出了最好的解决方案。

    另一个限制是它不允许您在查询中重命名变量。您将被数据库中命名的变量所困扰。如果您只涉及几个表,那可能不是问题。在非常复杂的数据库中,这可能会出现问题。但是,至少与此相关,您可以仅查询带有少数必要ID的VARCHAR(MAX)变量,以在R中执行合并。

    正如GitHub问题中所讨论的那样,最好尽可能避免VARCHAR(MAX)。如果确实需要未知的长度,则VARBINARY(MAX)会更容易查询。

    例子

    来源(“https://gist.githubusercontent.com/nutterb/d2e050dada608bb6213e61d0f8471b65/raw/be8717f318b3e3087e7c26c9a8f9d0a582a5daef/query_varchar_max

    channel <- odbcDriverConnect(...)
    
    query_varchar_max(channel = channel,
                      id = c("idvar"),
                      varchar_max = c("varchar_max_var", "varchar_max_var2"),
                      from = "FROM dbo.table_name WHERE group = ?",
                      data = list(group = "A"))
    

    功能码
    #' @name query_varchar_max
    #' @title Query a VARCHAR(MAX) Variable from SQL Server
    #' 
    #' @description The RODBC driver to SQL Server (SQL Server Native Client 11.0)
    #'   reports the lenght of a VARCHAR(MAX) variable to be zero.  This presents 
    #'   difficulties in extracting long text values from the database. Often, the
    #'   ODBC will assume a length of 255 characters and truncate the text to that
    #'   many characters.  The approach taken here searches the VARCHAR(MAX) variables 
    #'   for the longest length, and extracts the data in segments to be pasted 
    #'   together in R.  
    #'   
    #' @param channel A valid ODBC channel to a SQL Server database.
    #' @param id A character vector of ID variables that may be used to merge the 
    #'   data from this query into another dataset.
    #' @param varchar_max a character vector of variable names that are to be 
    #'   treated as if they are VARCHAR(MAX) variables. 
    #' @param from A single character string providing the remainder of the query 
    #'   to be run, beginning with the \code{FROM} statement.
    #' @param stringsAsFactors \code{logical(1)}. Should character strings returned 
    #'   from the database be converted to factors?
    #' @param ... Additional arguments to \code{sqlExecute} when running the full 
    #'   query.
    #'   
    #' @details \code{query_varchar_max} operates by determining how many columns of up to
    #'   8000 characters each are required to export a complete VARCHAR(MAX) variable.
    #'   It then creates the necessary number of intermediate variables and queries the 
    #'   data using the SQL Server \code{SUBSTRING} command, extracting the VARCHAR(MAX)
    #'   variable in increments of 8000 characters. After completing the query, 
    #'   the intemediary variables are concatenated and removed from the data.
    #'   
    #'   The function makes accommodation for multi-part queries as far as [TABLE].[VARIABLE]
    #'   formats are concerned. It is not intended for use in [SCHEMA].[TABLE].[VARIABLE]
    #'   formats. This at least allows \code{from} to include joins for more complex 
    #'   queries.  Parameterized queries are also supported through \code{sqlExecute}.
    #'
    #' @export
    
    query_varchar_max <- function(channel, id, varchar_max, from, 
                                  stringsAsFactors = FALSE, ...)
    {
      coll <- checkmate::makeAssertCollection()
    
      checkmate::assert_class(x = channel,
                              classes = "RODBC",
                              add = coll)
    
      checkmate::assert_character(x = id,
                                  add = coll)
    
      checkmate::assert_character(x = varchar_max,
                                  add = coll)
    
      checkmate::assert_character(x = from,
                                  len = 1,
                                  add = coll)
    
      checkmate::assert_logical(x = stringsAsFactors,
                                len = 1,
                                add = coll)
    
      checkmate::reportAssertions(coll)
    
      varchar_max_len <-
        paste0(
          sprintf("MAX(LEN(%s)) AS len_%s", 
                  varchar_max,
                  sub("[.]", "_", varchar_max)),
          collapse = ", "
        )
    
      varchar_len <- 
        unlist(
          RODBCext::sqlExecute(
            channel = channel,
            query = sprintf("SELECT %s %s",
                            varchar_max_len,
                            from),
            fetch = TRUE
          )
        )
    
      varchar_max_cols <- 
        unlist(
          mapply(expand_varchar_max,
                 varchar_max,
                 varchar_len,
                 SIMPLIFY = FALSE)
        )
    
      Prelim <- 
        RODBCext::sqlExecute(
          channel = channel,
          query = sprintf("SELECT %s, %s %s",
                          paste0(id, collapse = ", "), 
                          paste0(varchar_max_cols, collapse = ", "),
                          from),
          fetch = TRUE,
          stringsAsFactors = stringsAsFactors,
          ...
        )
    
      var_stub_to_combine <-
        unique(
          sub(
            "(part)(\\d{1,3})", 
            "\\1",
            sub(".+AS ", "", varchar_max_cols)
          )
        )
    
      col_to_combine <- 
        lapply(var_stub_to_combine,
               grep,
               names(Prelim))
    
      Prelim[sub(".+[.]", "", varchar_max)] <-
        lapply(col_to_combine,
               function(col) apply(Prelim[col], 1, paste0, collapse = ""))
    
      Prelim[-unlist(col_to_combine)]
    
    }
    
    
    expand_varchar_max <- function(varchar_max, varchar_len)
    {
      nvar <- varchar_len %/% 8000 + 1
    
      var_list <- vector("character", length = nvar)
    
      for (i in seq_along(var_list))
      {
        var_list[i] <- 
          sprintf("SUBSTRING(%s, %s, %s) AS %s_part%s",
                  varchar_max,
                  1 + (i - 1) * 8000,
                  8000,
                  paste0(sub("[.]", "_", varchar_max)),
                  i)
      }
      var_list
    }
    

    关于sql-server - RODBC和Microsoft SQL Server : Truncating Long Character Strings,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/44249788/

    相关文章:

    mysql - 数据库设计 : Index Varchar

    r - 带有观星器的 LaTeX 表格的简短说明

    r - 更改数据框中的值名称

    sql - 现实世界中何时以及为何使用 TRUNC(DATE, [FORMAT])

    sql-server - 来自不同数据库的表之间的关系

    sql - 当条件满足时计算每个组内的行数 Sql Server

    sql-server - 使用 VBA 仅将新记录插入到 SQL 表中

    arrays - 如何将数组作为参数传递给 R 脚本命令行运行?

    java - 截断为整数...以及 if 语句确定它是否不是整数

    mysql - 如何重写我的 MySQL 更新语句以消除 "Truncated incorrect INTEGER value"警告?