regex - R: grep() lookup in the world.cities dataset to geocode addresses

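I am trying to geocode some addresses stored in a character vector. I used the geocode() function from ggmap; however, it only resolved about 50% of my addresses. I would like to use a more basic approach: check whether a city name (from the world.cities data in the maps package) appears in my list of addresses and, if so, take the longitude and latitude from that lookup table. I will then try to clean up what comes back and supplement it with the other geocoding methods available in R (which call various external APIs). This is what I have coded so far: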

places <- c("Atlanta,Georgia", "My house, Paris, France", "Some Other House, Paris, Ontario, Canada", "Paris", "Oxford", "Oxford, USA")

library(maps)
data(world.cities)
ddd <- world.cities[world.cities$name %in% c("Paris","Oxford","New York"),]

is.integer0 <- function(x) {
  is.integer(x) && length(x) == 0L
}

# Compare every address with every candidate city name.
for (i in seq_along(places)) {
  for (j in seq_len(nrow(ddd))) {
    k <- ddd$name[j]
    # grep() returns integer(0) when the city name is not in the address.
    if (is.integer0(grep(k, places[i], perl = TRUE))) next
    if (!exists("zzz")) {
      zzz <- cbind(places[i], ddd[j, 1:5])
    } else {
      zzz <- rbind(zzz, cbind(places[i], ddd[j, 1:5]))
    }
  }
}

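The output is what I want (I will clean it up subjectively later). My problem is that my real data has about 8,000 addresses and the world.cities data has more than 40,000 cities, so the double for loop is rather slow. As with other tasks in R, I assume this can be vectorized with some member of the apply family, but I don't know how. Any ideas?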

### Output
                                      places[i]   name country.etc     pop    lat  long
28245                   My house, Paris, France  Paris      Canada   10570  43.20 -80.38
28246                   My house, Paris, France  Paris      France 2141839  48.86   2.34
282451 Some Other House, Paris, Ontario, Canada  Paris      Canada   10570  43.20 -80.38
282461 Some Other House, Paris, Ontario, Canada  Paris      France 2141839  48.86   2.34
282452                                    Paris  Paris      Canada   10570  43.20 -80.38
282462                                    Paris  Paris      France 2141839  48.86   2.34
27671                                    Oxford Oxford      Canada    1271  45.73 -63.87
27672                                    Oxford Oxford New Zealand    1816 -43.30 172.18
27673                                    Oxford Oxford          UK  157568  51.76  -1.26
276711                              Oxford, USA Oxford      Canada    1271  45.73 -63.87
276721                              Oxford, USA Oxford New Zealand    1816 -43.30 172.18
276731                              Oxford, USA Oxford          UK  157568  51.76  -1.26

After further data cleaning, what I really want is:

### Output
                                      places[i]   name country.etc     pop    lat  long
28246                   My house, Paris, France  Paris      France 2141839  48.86   2.34
282451 Some Other House, Paris, Ontario, Canada  Paris      Canada   10570  43.20 -80.38
282462                                    Paris  Paris      France 2141839  48.86   2.34
27673                                    Oxford Oxford          UK  157568  51.76  -1.26
276731                              Oxford, USA Oxford          NA      NA     NA     NA
                               Atlanta, Georgia     NA          NA      NA     NA     NA

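Basically, the logic is:

If the country also matches the place string, keep that row. Paris, France and Paris, Canada are examples.
If the place string is a single word, guess that it refers to the most populous city of that name, so Paris defaults to Paris, France and Oxford to Oxford, UK. This is because non-unique addresses are hard to geocode.
If the place string contains several words but the country does not match any of them, as in Oxford, USA, set everything except the city to NA. For these I will try geocode() and other services to get better information.
If the place is never found in the lookup dictionary, try to fill in everything with geocode() and the like (really I only need longitude/latitude). Atlanta, Georgia is an example of this.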
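
The rules above could be sketched roughly as follows. This is only an illustration under some assumptions: `cand` is a hypothetical table holding the candidate rows for three of the example places (values copied from the output above), `pick_match()` is a made-up helper name, and "single word" is taken to mean "no comma or space in the string". Places with no candidate at all (the last rule) never reach the helper and would be handed to geocode() separately.

```r
# Hypothetical candidate table: one row per (place, matching city) pair,
# with values copied from the lookup output shown in the question.
cand <- data.frame(
  place       = c("My house, Paris, France", "My house, Paris, France",
                  "Paris", "Paris",
                  "Oxford, USA", "Oxford, USA"),
  name        = c("Paris", "Paris", "Paris", "Paris", "Oxford", "Oxford"),
  country.etc = c("Canada", "France", "Canada", "France", "Canada", "UK"),
  pop         = c(10570, 2141839, 10570, 2141839, 1271, 157568),
  lat         = c(43.20, 48.86, 43.20, 48.86, 45.73, 51.76),
  long        = c(-80.38, 2.34, -80.38, 2.34, -63.87, -1.26),
  stringsAsFactors = FALSE
)

# pick_match() is a hypothetical helper: given all candidate rows for ONE
# place, return the single row that the rules select.
pick_match <- function(rows) {
  place <- rows$place[1]
  # Rule 1: keep a row whose country appears literally in the place string.
  hit <- unname(mapply(grepl, rows$country.etc, rows$place,
                       MoreArgs = list(fixed = TRUE)))
  if (any(hit)) return(rows[which(hit)[1], ])
  # Rule 2: a single-word place defaults to the most populous city.
  if (!grepl("[, ]", place)) return(rows[which.max(rows$pop), ])
  # Rule 3: several words but no country match: keep the city, blank the rest.
  out <- rows[1, ]
  out$country.etc <- NA_character_
  out[c("pop", "lat", "long")] <- NA_real_
  out
}

# Apply the helper to each place's candidates and stack the results.
result <- do.call(rbind, lapply(split(cand, cand$place), pick_match))
rownames(result) <- NULL
result
```

Taking only the first country hit and treating any space as a word break are simplifications; a multi-word city such as New York would need a different single-word test.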

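Any thoughts on the general approach and how to do this better in R? As mentioned above, the motivation is to see whether I can supplement what I already have (the roughly 50% of addresses that geocode() resolved).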

Best answer

This makes the city extraction more general (using regular-expression matching from stringr) and then merges the result with the world.cities data:

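library(stringr)

# Capture the last two comma-separated words of each address as city and state.
pattern <- ",* *([[:alpha:]]+) *, *([[:alpha:]]+) *$"

places_dat <- cbind(places, Reduce(rbind,
                lapply(str_match_all(places, pattern),
                  function(x) {
  # str_match_all() returns a zero-row matrix when nothing matches.
  if (length(x) == 0) {
    return(data.frame(city=NA, state=NA))
  } else {
    return(data.frame(city=x[,2], state=x[,3]))
  }
})))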

places_dat

##                                     places    city   state
## 1                          Atlanta,Georgia Atlanta Georgia
## 2                  My house, Paris, France   Paris  France
## 3 Some Other House, Paris, Ontario, Canada Ontario  Canada
## 4                                    Paris    <NA>    <NA>
## 5                                   Oxford    <NA>    <NA>
## 6                              Oxford, USA  Oxford     USA
## 

merge(places_dat, world.cities, by.x="city", by.y="name", all.x=TRUE)

##      city                                   places   state country.etc     pop    lat    long capital
## 1 Atlanta                          Atlanta,Georgia Georgia         USA  424096  33.76  -84.42       0
## 2   Paris                  My house, Paris, France  France      France 2141839  48.86    2.34       1
## 3   Paris                  My house, Paris, France  France      Canada   10570  43.20  -80.38       0
## 4 Ontario Some Other House, Paris, Ontario, Canada  Canada         USA  175805  34.05 -117.61       0
## 5  Oxford                              Oxford, USA     USA      Canada    1271  45.73  -63.87       0
## 6  Oxford                              Oxford, USA     USA New Zealand    1816 -43.30  172.18       0
## 7  Oxford                              Oxford, USA     USA          UK  157568  51.76   -1.26       0
## 8    <NA>                                    Paris    <NA>        <NA>      NA     NA      NA      NA
## 9    <NA>                                   Oxford    <NA>        <NA>      NA     NA      NA      NA

It still needs some filtering (perhaps with complete.cases as one step), but it should get you further along and be somewhat faster.
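
As one possible way to also catch the single-word places ("Paris", "Oxford") and cities that are not the second-to-last token, the addresses could be split on commas and every token looked up with one merge. This swaps the regex for plain token matching, so it is not the approach above, just a sketch. `cities` is a small stand-in for the world.cities rows used in the question (so the example runs without the maps package), and `long_tokens`, `candidates`, and `unmatched` are illustrative names.

```r
places <- c("Atlanta,Georgia", "My house, Paris, France",
            "Some Other House, Paris, Ontario, Canada",
            "Paris", "Oxford", "Oxford, USA")

# Stand-in for the Paris/Oxford rows of maps::world.cities; the values are
# copied from the outputs shown in the question.
cities <- data.frame(
  name        = c("Paris", "Paris", "Oxford", "Oxford", "Oxford"),
  country.etc = c("Canada", "France", "Canada", "New Zealand", "UK"),
  pop         = c(10570, 2141839, 1271, 1816, 157568),
  lat         = c(43.20, 48.86, 45.73, -43.30, 51.76),
  long        = c(-80.38, 2.34, -63.87, 172.18, -1.26),
  stringsAsFactors = FALSE
)

# Split every address on commas and trim whitespace: one row per token.
tokens <- strsplit(places, ",", fixed = TRUE)
long_tokens <- data.frame(
  places = rep(places, lengths(tokens)),
  token  = trimws(unlist(tokens)),
  stringsAsFactors = FALSE
)

# One merge replaces the nested loops; tokens that are not city names
# simply drop out because this is an inner join.
candidates <- merge(long_tokens, cities, by.x = "token", by.y = "name")

# Addresses without any candidate are left for geocode() or another service.
unmatched <- setdiff(places, candidates$places)

candidates
unmatched
```

With the real data, `cities` would be world.cities itself; country or state tokens that happen to also be city names would then show up as extra candidates and need the same filtering. Unlike grep(), this matches whole tokens only, so a city name embedded in a longer token is not found.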

A similar question about "regex - R: grep() lookup in the world.cities dataset to geocode addresses" can be found on Stack Overflow: https://stackoverflow.com/questions/25809636/
