R: Scraping additional data after POST only works for the first page

Tags: r web-scraping rvest

For a university research project, I want to scrape drug information published by the Swiss government from:

http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue=

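The site does provide a robots.txt file. However, its content is freely available to the public, so I assume that scraping this data is not prohibited.

This is an update of a previous question, since I have made some progress.

What I have achieved so far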

# opens the first results page 
# opens the first link as a table at the end of the page

library("rvest")
library("dplyr")


url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=1,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                             ),
                           encode="form")

Next: get the basic data
# makes a table of all results of the first page

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill=TRUE) %>% 
  bind_rows %>%
  tibble()

Next: get the additional data
# gives the desired information (= additional data) for the first drug (not yet very structured)

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
  html_text 

My problem:
# if I open the second search page

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=2,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                             ),
                           encode="form")

Next: get the new basic data
# I get easily a table with the new results

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill=TRUE) %>% 
  bind_rows %>%
  tibble()

However, if I try to get the new additional data, I get the results from page 1 again:
# does not give the desired output:

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
  html_text 

What I am looking for: the details of the first drug on page 2

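Questions:
  • Why do I get duplicate results? Is it because __VIEWSTATE may change during request_POST?
  • Is there a way to fix this?
  • Is there a better way to get both the basic data and the additional data? If so, how?

Best answer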

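    I think you are overthinking this. The issue is with the xpath: the xpath you use for data extraction is the same on every page, namely //*[@id="ctl00_cphContent_gvwPreparations"]. The only component of your code that changes is txtPageNumber. In the code below I set it to 3 (txtPageNumber=3).

    I suggest focusing on a different question: how can the page numbers be automated for data extraction? That way you do not have to change txtPageNumber by hand.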

    page<-rvest:::request_POST(pgsession,url,
                               body=list(
                                 `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=3,
                                 `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                                 `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                                 `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                                 `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                                 `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                                 `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                                 `__EVENTARGUMENT`=""
    
                               ),
                               encode="form")
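
One way to avoid editing `txtPageNumber` by hand is to wrap the request body in a small function and map it over the page numbers. This is a minimal sketch, not a tested scraper for this site: `make_page_body()` is a hypothetical helper, the field names are copied from the code above, and `hidden` holds stand-in values in place of the real `pgform$fields` entries.

```r
# Hypothetical helper: builds the POST body for one results page.
# `hidden` is a named list of the hidden ASP.NET field values
# (in the code above they come from pgform$fields$...$value).
make_page_body <- function(hidden, page_number, page_size = "10") {
  pager <- "ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$"
  c(
    setNames(list(page_number), paste0(pager, "txtPageNumber")),
    hidden[c("__VIEWSTATE", "__VIEWSTATEGENERATOR",
             "__VIEWSTATEENCRYPTED", "__EVENTVALIDATION")],
    setNames(list(page_size), paste0(pager, "ddlPageSize")),
    list(
      `__EVENTTARGET` = "ctl00$cphContent$gvwPreparations$ctl02$ctl00",
      `__EVENTARGUMENT` = ""
    )
  )
}

# Stand-in values (assumption); the real ones are long encoded strings.
hidden <- list(
  `__VIEWSTATE` = "vs",
  `__VIEWSTATEGENERATOR` = "gen",
  `__VIEWSTATEENCRYPTED` = "",
  `__EVENTVALIDATION` = "ev"
)

# One body per page; only the page number differs between them.
bodies <- lapply(1:3, function(p) make_page_body(hidden, p))
```

Each element of `bodies` could then be passed as `body=` to the same `request_POST()` call used above, for example inside another `lapply()`.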
    

    The following code worked for me:
    library(rvest)
    library(dplyr)
    
    
    url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
    pgsession<-html_session(url)
    pgform<-html_form(pgsession)[[1]]
    
    page<-rvest:::request_POST(pgsession,url,
                               body=list(
                                 `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=3,
                                 `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                                 `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                                 `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                                 `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                                 `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                                 `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                                 `__EVENTARGUMENT`=""
    
                               ),
                               encode="form")
    # makes a table of all results of the requested page (page 3 here)
    
    read_html(page) %>%
      html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
      html_table(fill=TRUE) %>% 
      bind_rows %>%
      tibble()
    
    # A tibble: 11 x 1
       .$``  $Präparat $`Galen. Form /~ $Packung $FAP  $PP   $SB   $`Lim-Pkt` $Lim 
       <chr> <chr>     <chr>            <chr>    <chr> <chr> <chr> <chr>      <chr>
     1 21.   Accolate  Tabl 20 mg       60 Stk   29.75 50.55 ""    ""         ""   
     2 22.   Accupaque Inj Lös 300 mg   Plast F~ 32.00 53.10 ""    ""         ""   
     3 23.   Accupaque Inj Lös 300 mg   Plast F~ 61.15 86.60 ""    ""         ""   
     4 24.   Accupaque Inj Lös 300 mg   Plast F~ 120.~ 154.~ ""    ""         ""   
     5 25.   Accupaque Inj Lös 350 mg   Plast F~ 33.97 55.35 ""    ""         ""   
     6 26.   Accupaque Inj Lös 350 mg   Plast F~ 66.88 93.20 ""    ""         ""   
     7 27.   Accupaque Inj Lös 350 mg   Plast F~ 129.~ 164.~ ""    ""         ""   
     8 28.   Accupro ~ Filmtabl 10 mg   30 Stk   8.56  18.00 ""    ""         ""   
     9 29.   Accupro ~ Filmtabl 10 mg   100 Stk  26.60 46.90 ""    ""         ""   
    10 30.   Accupro ~ Filmtabl 20 mg   30 Stk   14.02 28.35 ""    ""         ""   
    11 "Ein~ "Einträg~ "Einträge pro S~ "Einträ~ "Ein~ "Ein~ "Ein~ "Einträge~ "Ein~
    # ... with 9 more variables: $`Swissmedic-Code` <chr>, $Zulassungsinhaberin <chr>,
    #   $Wirkstoff <chr>, $`BAG-Dossier` <chr>, $Aufnahme <chr>, $`Befr. AufnahmeBefr.
    #   Limitation` <chr>, $`O/G` <chr>, $`IT-Code` <chr>, $`ATC-Code` <chr>
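
In the output above, all columns are printed nested under `.` (`# A tibble: 11 x 1`), which suggests that the `tibble()` call stored the whole data frame in a single column in this setup. `as_tibble()` is the function meant for converting an existing data frame. Row 11 also looks like the pager rather than a drug. The sketch below works on made-up data: the column names `Nr` and `Praeparat` and the pager text are placeholders, not the site's real values.

```r
library(dplyr)

# Made-up stand-in for the html_table() result (assumption).
raw <- data.frame(
  Nr = c("21.", "22.", "pager row"),
  Praeparat = c("Accolate", "Accupaque", "pager row"),
  stringsAsFactors = FALSE
)

clean <- raw %>%
  as_tibble() %>%
  # keep only rows whose first column is a running number such as "21."
  filter(grepl("^[0-9]+\\.$", Nr))

clean
```

Filtering on the numbering column avoids hard-coding the pager text, which is truncated in the printed output.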
    
    # gives the text content of the results table for this page (not yet very structured)
    
    read_html(page) %>%
      html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
      html_text %>%
      head(10)
    
    
    [1] " PräparatGalen. Form / DosierungPackungFAPPPSBLim-PktLimSwissmedic-CodeZulassungsinhaberinWirkstoffBAG-DossierAufnahmeBefr. AufnahmeBefr. LimitationO/GIT-CodeATC-Code\r\n\t\t\t\t\r\n                        21.\r\n                    \r\n                        Accolate\r\n                    \r\n                        Tabl 20 mg \r\n                    \r\n                        60 Stk\r\n                    \r\n                        29.75\r\n                    \r\n                        50.55\r\n                    \r\n                                                \r\n                    \r\n                        \r\n                    \r\n                      \r\n                    \r\n                        53750036\r\n                    \r\n                        AstraZeneca AG\r\n                    \r\n                        Zafirlukastum\r\n                    \r\n                        17053\r\n                    \r\n                        15.03.1998\r\n                    \r\n                        \r\n                        \r\n                    \r\n                        \r\n                    \r\n                        03.04.50.\r\n                    \r\n                        R03DC01\r\n                    \r\n\t\t\t\t\r\n                        22.\r\n                    \r\n                        Accupaque\r\n                    \r\n 
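
On the `__VIEWSTATE` question: ASP.NET WebForms pages send a fresh set of hidden fields (`__VIEWSTATE`, `__EVENTVALIDATION` and so on) with every response, so a follow-up POST is normally built from the most recent response rather than from the form of the first page. Whether this is what causes the repeated detail block here has not been verified. Below is a minimal sketch of reading the hidden fields out of a response. It uses an inline HTML snippet with invented values in place of the real page, and `hidden_fields()` is a hypothetical helper.

```r
library(rvest)

# Hypothetical helper: returns the hidden <input> values of a parsed page
# as a named list (name attribute -> value attribute).
hidden_fields <- function(doc) {
  inputs <- html_nodes(doc, "input[type='hidden']")
  as.list(setNames(html_attr(inputs, "value"), html_attr(inputs, "name")))
}

# Invented stand-in for a response body (assumption).
response_html <- '<html><body><form>
  <input type="hidden" name="__VIEWSTATE" value="state-after-page-2" />
  <input type="hidden" name="__EVENTVALIDATION" value="validation-after-page-2" />
  <input type="text" name="visible" value="ignored" />
</form></body></html>'

fresh <- hidden_fields(read_html(response_html))
fresh[["__VIEWSTATE"]]
```

With a real response, `hidden_fields(read_html(page))` would give the values to send along with the next request.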
    

    Regarding "R: Scraping additional data after POST only works for the first page", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56068532/
