For a university research project, I would like to scrape drug information provided by the Swiss government from:
http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue=
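The site does provide a robots.txt file; however, its content is freely available to the public, and I assume that scraping this data is not prohibited.
This is an update of this question, as I have made some progress.
What I have achieved so far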
# opens the first results page
# opens the first link as a table at the end of the page
library("rvest")
library("dplyr")
url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]
page<-rvest:::request_POST(pgsession,url,
body=list(
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=1,
`__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
`__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
`__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
`__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
`__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
`__EVENTARGUMENT`=""
),
encode="form")
Next step: getting the basic data
# makes a table of all results of the first page
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
html_table(fill=TRUE) %>%
bind_rows %>%
tibble()
Next step: getting the additional data
# gives the desired informations (=additional data) of the first drug (not yet very structured)
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
html_text
My problem:
# if I open the second search page
page<-rvest:::request_POST(pgsession,url,
body=list(
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=2,
`__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
`__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
`__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
`__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
`__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
`__EVENTARGUMENT`=""
),
encode="form")
Next step: getting the new basic data
# I get easily a table with the new results
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
html_table(fill=TRUE) %>%
bind_rows %>%
tibble()
However, if I try to get the new additional data, I again get the results from page 1:
# does not give the desired output:
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
html_text
What I am looking for: the detailed data of the first drug on page 2
Question: could it be that __VIEWSTATE is renewed during request_POST?
Best answer
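I think you are simply overthinking this. The issue is the xpath: essentially, the xpath you use for data extraction is the same on every page, namely //*[@id="ctl00_cphContent_gvwPreparations"]. The only component of your code that changes is txtPageNumber. In the code below I changed txtPageNumber to 3, i.e. txtPageNumber=3.
I would suggest focusing on a question like "how do I automate the page number for data extraction?". That way you do not have to change txtPageNumber by hand in: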
page<-rvest:::request_POST(pgsession,url,
body=list(
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=3,
`__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
`__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
`__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
`__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
`__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
`__EVENTARGUMENT`=""
),
encode="form")
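One way to avoid editing txtPageNumber by hand might be to build the POST body in a small function and call it once per page. The following is only a sketch: build_page_body is a hypothetical helper (it is not part of rvest), it assumes that form has the same shape as pgform above, and it assumes that the hidden fields read from the first page can be reused for every request, exactly as the code above already does. It has not been checked against the live site.

```r
# Hypothetical helper (not part of rvest): builds the POST body for one
# results page. `form` is assumed to look like `pgform` above, i.e. a list
# whose `fields` element holds the hidden ASP.NET inputs.
build_page_body <- function(form, page_number, page_size = "10") {
  hidden <- c("__VIEWSTATE", "__VIEWSTATEGENERATOR",
              "__VIEWSTATEENCRYPTED", "__EVENTVALIDATION")
  # Copy the hidden field values from the form, keeping their names.
  body <- lapply(hidden, function(name) form$fields[[name]]$value)
  names(body) <- hidden

  # Field names copied from the request shown above.
  prefix <- "ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$"
  body[[paste0(prefix, "txtPageNumber")]] <- page_number
  body[[paste0(prefix, "ddlPageSize")]] <- page_size
  body[["__EVENTTARGET"]] <- "ctl00$cphContent$gvwPreparations$ctl02$ctl00"
  body[["__EVENTARGUMENT"]] <- ""
  body
}

# Usage sketch (needs network access and the objects defined above):
# pages <- lapply(1:3, function(i) {
#   rvest:::request_POST(pgsession, url,
#                        body = build_page_body(pgform, i),
#                        encode = "form")
# })
```

Each element of pages could then be passed to read_html() in the same way as page.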
The following code worked for me:
library(rvest)
library(dplyr)
url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]
page<-rvest:::request_POST(pgsession,url,
body=list(
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=3,
`__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
`__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
`__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
`__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
`__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
`__EVENTARGUMENT`=""
),
encode="form")
# makes a table of all results of the requested page (page 3 here)
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
html_table(fill=TRUE) %>%
bind_rows %>%
tibble()
# A tibble: 11 x 1
.$`` $Präparat $`Galen. Form /~ $Packung $FAP $PP $SB $`Lim-Pkt` $Lim
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 21. Accolate Tabl 20 mg 60 Stk 29.75 50.55 "" "" ""
2 22. Accupaque Inj Lös 300 mg Plast F~ 32.00 53.10 "" "" ""
3 23. Accupaque Inj Lös 300 mg Plast F~ 61.15 86.60 "" "" ""
4 24. Accupaque Inj Lös 300 mg Plast F~ 120.~ 154.~ "" "" ""
5 25. Accupaque Inj Lös 350 mg Plast F~ 33.97 55.35 "" "" ""
6 26. Accupaque Inj Lös 350 mg Plast F~ 66.88 93.20 "" "" ""
7 27. Accupaque Inj Lös 350 mg Plast F~ 129.~ 164.~ "" "" ""
8 28. Accupro ~ Filmtabl 10 mg 30 Stk 8.56 18.00 "" "" ""
9 29. Accupro ~ Filmtabl 10 mg 100 Stk 26.60 46.90 "" "" ""
10 30. Accupro ~ Filmtabl 20 mg 30 Stk 14.02 28.35 "" "" ""
11 "Ein~ "Einträg~ "Einträge pro S~ "Einträ~ "Ein~ "Ein~ "Ein~ "Einträge~ "Ein~
# ... with 9 more variables: $`Swissmedic-Code` <chr>, $Zulassungsinhaberin <chr>,
# $Wirkstoff <chr>, $`BAG-Dossier` <chr>, $Aufnahme <chr>, $`Befr. AufnahmeBefr.
# Limitation` <chr>, $`O/G` <chr>, $`IT-Code` <chr>, $`ATC-Code` <chr>
# gives the desired informations of the first drug (not yet very structured)
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
html_text %>%
head(10)
[1] " PräparatGalen. Form / DosierungPackungFAPPPSBLim-PktLimSwissmedic-CodeZulassungsinhaberinWirkstoffBAG-DossierAufnahmeBefr. AufnahmeBefr. LimitationO/GIT-CodeATC-Code\r\n\t\t\t\t\r\n 21.\r\n \r\n Accolate\r\n \r\n Tabl 20 mg \r\n \r\n 60 Stk\r\n \r\n 29.75\r\n \r\n 50.55\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n 53750036\r\n \r\n AstraZeneca AG\r\n \r\n Zafirlukastum\r\n \r\n 17053\r\n \r\n 15.03.1998\r\n \r\n \r\n \r\n \r\n \r\n \r\n 03.04.50.\r\n \r\n R03DC01\r\n \r\n\t\t\t\t\r\n 22.\r\n \r\n Accupaque\r\n \r\n
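The html_text output above is one long string, which is hard to work with. As a rough sketch, it could be split into individual cell values with base R. split_cells is a hypothetical helper, and it assumes that cells are separated by "\r\n" plus surrounding whitespace, as in the output shown. Note that it drops empty cells, so column positions are lost; html_table remains the better choice when the column structure matters.

```r
# Hypothetical helper: splits the raw html_text string into non-empty,
# trimmed cell values. Empty cells are dropped, so this does not
# preserve column alignment.
split_cells <- function(raw_text) {
  cells <- trimws(strsplit(raw_text, "\r\n", fixed = TRUE)[[1]])
  cells[nzchar(cells)]
}

# Small made-up sample in the same shape as the output above.
sample_text <- " 21.\r\n \r\n Accolate\r\n \r\n Tabl 20 mg \r\n \r\n 60 Stk\r\n"
cells <- split_cells(sample_text)
print(cells)
```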
Regarding "R: scraping additional data after POST only works for the first page", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56068532/