Reading a zip file directly with read_csv from readr produces strange results

Tags: r readr

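I'm trying to read a zip file that contains a pipe-delimited text file directly from a URL. If I download the file and then read it from disk with read_csv, I have no problem. But if I use read_csv to read the URL directly, I get garbage in the resulting df. I can work around this by coding the download step and then reading the file, but it seems like it should work directly. Any clues as to what is going on here?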

library(readr)
url <- "https://www.rma.usda.gov/data/sob/sccc/sobcov_2018.zip"
df <- read_delim(url, delim='|',
                 col_names = c('year','stFips','stAbbr','coFips','coName',
                               'cropCd','cropName','planCd','planAbbr','coverCat',
                               'deliveryType','covLevel','policyCount','policyPremCount','policyIndemCount',
                               'unitsReportingPrem', 'indemCount','quantType', 'quantNet', 'companionAcres',
                               'liab','prem','subsidy','indem', 'lossRatio'))
#> Parsed with column specification:
#> cols(
#>   .default = col_character()
#> )
#> See spec(...) for full column specifications.
#> Warning in rbind(names(probs), probs_f): number of columns of result is not
#> a multiple of vector length (arg 1)
#> Warning: 7908 parsing failures.
#> # A tibble: 5 x 5
#>     row col   expected   actual        file
#>   <int> <chr> <chr>      <chr>         <chr>
#> 1     1 year  ""         embedded null 'https://www.rma.usda.gov/data/sob…
#> 2     1 <NA>  25 columns 1 columns     'https://www.rma.usda.gov/data/sob…
#> 3     2 <NA>  25 columns 4 columns     'https://www.rma.usda.gov/data/sob…
#> 4     3 <NA>  25 columns 2 columns     'https://www.rma.usda.gov/data/sob…
#> 5     4 year  ""         embedded null 'https://www.rma.usda.gov/data/sob…
#> See problems(...) for more details.
head(df)
#> # A tibble: 6 x 25
#>   year     stFips   stAbbr  coFips  coName cropCd cropName planCd planAbbr
#>   <chr>    <chr>    <chr>   <chr>   <chr>  <chr>  <chr>    <chr>  <chr>   
#> 1 "PK\u00… <NA>     <NA>    <NA>    <NA>   <NA>   <NA>     <NA>   <NA>    
#> 2 "K\xe6\… "\xf5\x… "\xc5\… "\xfa\… <NA>   <NA>   <NA>     <NA>   <NA>    
#> 3 "\xb0\x… "\xfd\x… <NA>    <NA>    <NA>   <NA>   <NA>     <NA>   <NA>    
#> 4 "j`/Q\x… "\x96\x… <NA>    <NA>    <NA>   <NA>   <NA>     <NA>   <NA>    
#> 5 "\xc0\x… <NA>     <NA>    <NA>    <NA>   <NA>   <NA>     <NA>   <NA>    
#> 6 "z\xe4\… "~y\xf5… <NA>    <NA>    <NA>   <NA>   <NA>     <NA>   <NA>    
#> # ... with 16 more variables: coverCat <chr>, deliveryType <chr>,
#> #   covLevel <chr>, policyCount <chr>, policyPremCount <chr>,
#> #   policyIndemCount <chr>, unitsReportingPrem <chr>, indemCount <chr>,
#> #   quantType <chr>, quantNet <chr>, companionAcres <chr>, liab <chr>,
#> #   prem <chr>, subsidy <chr>, indem <chr>, lossRatio <chr>

If I download the file first, I get the following output:
> url <- './data/sobcov_2018.zip'
> df <- read_delim(url, delim='|',
+                  col_names = c('year','stFips','stAbbr','coFips','coName',
+                                'cropCd','cropName','planCd','planAbbr','coverCat',
+                                'deliveryType','covLevel','policyCount','policyPremCount','policyIndemCount',
+                                'unitsReportingPrem', 'indemCount','quantType', 'quantNet', 'companionAcres',
+                                'liab','prem','subsidy','indem', 'lossRatio'))
Parsed with column specification:
cols(
  .default = col_integer(),
  stFips = col_character(),
  stAbbr = col_character(),
  coFips = col_character(),
  coName = col_character(),
  cropCd = col_character(),
  cropName = col_character(),
  planCd = col_character(),
  planAbbr = col_character(),
  coverCat = col_character(),
  deliveryType = col_character(),
  covLevel = col_double(),
  quantType = col_character(),
  lossRatio = col_double()
)
See spec(...) for full column specifications.
> head(df)
# A tibble: 6 x 25
   year stFips stAbbr coFips coName       cropCd cropName      planCd planAbbr coverCat deliveryType covLevel
  <int> <chr>  <chr>  <chr>  <chr>        <chr>  <chr>         <chr>  <chr>    <chr>    <chr>           <dbl>
1  2018 02     AK     999    "All Other … 9999   "All Other C… 01     "YP    … "A    "  RBUP            0.500
2  2018 02     AK     240    "Southeast … 9999   "All Other C… 90     "APH   … "A    "  RBUP            0.500
3  2018 02     AK     240    "Southeast … 9999   "All Other C… 90     "APH   … "A    "  RBUP            0.750
4  2018 02     AK     240    "Southeast … 9999   "All Other C… 90     "APH   … "C    "  RCAT            0.500
5  2018 02     AK     240    "Southeast … 9999   "All Other C… 02     "RP    … "A    "  RBUP            0.600
6  2018 02     AK     240    "Southeast … 9999   "All Other C… 02     "RP    … "A    "  RBUP            0.750
# ... with 13 more variables: policyCount <int>, policyPremCount <int>, policyIndemCount <int>,
#   unitsReportingPrem <int>, indemCount <int>, quantType <chr>, quantNet <int>, companionAcres <int>,
#   liab <int>, prem <int>, subsidy <int>, indem <int>, lossRatio <dbl>

Best answer

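readr can only handle gz-compressed files as remote sources, because there is no equivalent of base::gzcon() for other compression algorithms. See the GitHub issue discussing this and the improved documentation (also available in ?readr::datasource).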
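The workaround mentioned in the question (download first, then read from disk) can be sketched as below. This is a minimal sketch, not part of the original answer: `read_remote_zip` is a hypothetical helper name, and it assumes the archive contains a single delimited text file, as in the question.

```r
# Hypothetical helper (not from the original answer): download a remote zip
# archive to a temporary file, then let readr decompress it from disk.
read_remote_zip <- function(url, delim = "|", ...) {
  # The ".zip" extension lets readr recognise the local file as an archive.
  tmp <- tempfile(fileext = ".zip")
  # mode = "wb" keeps the binary archive intact on Windows.
  utils::download.file(url, tmp, mode = "wb", quiet = TRUE)
  # Extra arguments (e.g. col_names) are passed through to read_delim().
  readr::read_delim(tmp, delim = delim, ...)
}

# Usage with the URL from the question (requires network access):
# df <- read_remote_zip(
#   "https://www.rma.usda.gov/data/sob/sccc/sobcov_2018.zip",
#   col_names = FALSE
# )
```

Saving to a temporary file avoids keeping a copy of the archive in the project directory; the file is cleaned up when the R session ends.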

A similar question about reading a zip file directly with read_csv from readr was found on Stack Overflow: https://stackoverflow.com/questions/50415021/
