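I am trying to use the rvest package to extract data on the locations of invasive plant species from the CABI Invasive Species Compendium.
After going through a few tutorials, I figured I should be able to scrape data from a table fairly easily. However, I keep running into trouble.
Say I want the location data for the species Brassica tournefortii. I should be able to use this code, which follows the technique outlined here, to get the details of the locations where the species has been recorded.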
library(rvest)
isc<-read_html("http://www.cabi.org/isc/datasheet/50069")
isc %>%
html_node("#toDistributionTable td:nth-child(1)") %>%
html_text()
However, when I run this code I get the error
Error: No matches
I am completely new to web scraping. Am I doing something horribly wrong?
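Best answer
First, I wish I could upvote you more. Finally, a scraping question that has nothing to do with $SPORTSBALL or $MONEY! :-)
That site is evil. It uses embedded namespaces that need to be dealt with, which also means using the xml2 package: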
library(rvest)
library(xml2)
isc <- read_html("http://www.cabi.org/isc/datasheet/50069")
ns <- xml_ns(isc)
xml_text(xml_find_all(isc, xpath="//div[@id='toDistributionTable']/table/tbody/tr/td[1]", ns))
## [1] "ASIA" "Azerbaijan"
## [3] "Bhutan" "China"
## [5] "-Tibet" "India"
## [7] "-Delhi" "-Indian Punjab"
## [9] "-Rajasthan" "-Uttar Pradesh"
## [11] "Iran" "Iraq"
## [13] "Israel" "Jordan"
## [15] "Kuwait" "Lebanon"
## [17] "Oman" "Pakistan"
## [19] "Qatar" "Saudi Arabia"
## [21] "Syria" "Turkey"
## [23] "Turkmenistan" "United Arab Emirates"
## [25] "Uzbekistan" "Yemen"
## [27] "AFRICA" "Algeria"
## [29] "Egypt" "Libya"
## [31] "Morocco" "South Africa"
## [33] "Tunisia" "NORTH AMERICA"
## [35] "Mexico" "USA"
## [37] "-Arizona" "-California"
## [39] "-Nevada" "-New Mexico"
## [41] "-Texas" "-Utah"
## [43] "SOUTH AMERICA" "Chile"
## [45] "EUROPE" "Belgium"
## [47] "Cyprus" "Denmark"
## [49] "France" "Greece"
## [51] "Ireland" "Italy"
## [53] "Spain" "Sweden"
## [55] "UK" "-England and Wales"
## [57] "-Scotland" "OCEANIA"
## [59] "Australia" "-Australian Northern Territory"
## [61] "-New South Wales" "-Queensland"
## [63] "-South Australia" "-Tasmania"
## [65] "-Victoria" "-Western Australia"
## [67] "New Zealand"
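To try the XPath expression without hitting the site, here is a minimal sketch that runs it against a small inline HTML string. The markup and cell values in `sample_html` are made up for illustration and only mimic the nesting the XPath above expects (a div with id `toDistributionTable` containing `table/tbody/tr/td`); the real page may be structured differently.

```r
library(xml2)

# Hypothetical stand-in for the page: same id and nesting as the XPath
# in the answer, with placeholder values in the second column.
sample_html <- '<html><body>
<div id="toDistributionTable">
  <table>
    <tbody>
      <tr><td>ASIA</td><td>placeholder</td></tr>
      <tr><td>Azerbaijan</td><td>placeholder</td></tr>
      <tr><td>-Tibet</td><td>placeholder</td></tr>
    </tbody>
  </table>
</div>
</body></html>'

# read_html() treats a string containing markup as a literal document.
doc <- read_html(sample_html)

# td[1] picks the first cell of every row, i.e. the first column.
cells <- xml_find_all(
  doc,
  "//div[@id='toDistributionTable']/table/tbody/tr/td[1]"
)
first_col <- xml_text(cells)
print(first_col)
```

Once the expression returns what you expect on a small sample like this, the same call can be pointed at the document returned by `read_html()` for the real URL.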
Regarding html - scraping data from an html table, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36652452/