html - R: Incomplete text when parsing a web page (HTML)

Tags: html r xml text-mining rvest

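I am trying to parse the plain text of several scientific articles for subsequent text analysis. So far I have been using an R script by Tony Breyal that is based on the packages RCurl and XML. It works for all of my target journals except those published on http://www.sciencedirect.com. When I try to parse an article from SD (and this is consistent across every journal I tested that I need to access through SD), the text object in R contains only the first part of the whole document. Unfortunately I am not very familiar with HTML, but I think the problem must lie in the SD HTML code, since the script works in all other cases.

I know that some journals are not open access, but I do have access rights, and the problem also occurs with open-access articles (see the example). Here is the code from GitHub: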

htmlToText <- function(input, ...) {
  ###--- PACKAGES ---###
  require(RCurl)
  require(XML)

  ###--- LOCAL FUNCTIONS ---###
  # Determine how to grab html for a single input element
  evaluate_input <- function(input) {
    # if input is a .html file
    if (file.exists(input)) {
      char.vec <- readLines(input, warn = FALSE)
      return(paste(char.vec, collapse = ""))
    }

    # if input is html text
    if (grepl("</html>", input, fixed = TRUE)) return(input)

    # if input is a URL, probably should use a regex here instead?
    if (!grepl(" ", input)) {
      # download SSL certificate in case of https problem
      if (!file.exists("cacert.perm")) download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.perm")
      return(getURL(input, followlocation = TRUE, cainfo = "cacert.perm"))
    }

    # return NULL if none of the conditions above apply
    return(NULL)
  }

  # convert HTML to plain text
  convert_html_to_text <- function(html) {
    doc <- htmlParse(html, asText = TRUE)
    text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
    return(text)
  }

  # format text vector into one character string
  collapse_text <- function(txt) {
    return(paste(txt, collapse = " "))
  }

  ###--- MAIN ---###
  # STEP 1: Evaluate input
  html.list <- lapply(input, evaluate_input)

  # STEP 2: Extract text from HTML
  text.list <- lapply(html.list, convert_html_to_text)

  # STEP 3: Return text
  text.vector <- sapply(text.list, collapse_text)
  return(text.vector)
}

And here is my code with an example article:

target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319"
temp.text <- htmlToText(target)

The unformatted text stops somewhere in the Methods section:

DNA was extracted using the MasterPure™ Yeast DNA Purification Kit (Epicentre, Madison, Wisconsin, USA) following the manufacturer's instructions.

Any suggestions or ideas?

P.S. I also tried html_text from rvest, with the same result.

Best answer

You can keep using your existing code and simply append ?np=y to the end of the URL, but this is a bit more compact:

library(rvest)
library(stringi)

target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319?np=y"

pg <- read_html(target)
pg %>%
  html_nodes(xpath=".//div[@id='centerContent']//child::node()/text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]") %>% 
  html_text() %>%   # extract the character content of the matched text nodes
  stri_trim() %>% 
  paste0(collapse=" ") %>% 
  write(file="output.txt")

Part of the output (the total for that article is >80K):

 Fungal Ecology Volume 22 , August 2016, Pages 61–72        175394|| Species richness 
 influences wine ecosystem function through a dominant species Primrose J. Boynton a , , , 
 Duncan Greig a , b a  Max Planck Institute for Evolutionary Biology, Plön, 24306, Germany 
 b  The Galton Laboratory, Department of Genetics, Evolution, and Environment, University 
 College London, London, WC1E 6BT, UK Received 9 November 2015, Revised 27 March 2016, 
 Accepted 15 April 2016, Available online 1 June 2016 Corresponding editor: Marie Louise
 Davey Abstract Increased species richness does not always cause increased ecosystem function. 
 Instead, richness can influence individual species with positive or negative ecosystem effects. 
 We investigated richness and function in fermenting wine, and found that richness indirectly 
 affects ecosystem function by altering the ecological dominance of Saccharomyces cerevisiae . 
 While S. cerevisiae generally dominates fermentations, it cannot dominate extremely species-rich 
 communities, probably because antagonistic species prevent it from growing. It is also diluted 
 from species-poor communities, 
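
The other option mentioned in this answer, keeping the existing code and only appending ?np=y to the URL, could be sketched roughly as follows. The helper name add_np_param is hypothetical and not part of the original answer, and the sketch assumes the URLs carry no #fragment. It only builds the URL string; it does not check whether ScienceDirect still honors the parameter.

```r
# Hypothetical helper (not from the original answer): append the np=y
# query parameter to each URL unless it is already present.
add_np_param <- function(url) {
  # TRUE where the URL already carries np=y as a query parameter
  has_np <- grepl("[?&]np=y(&|#|$)", url)
  # Use "&" when a query string already exists, otherwise start one with "?"
  sep <- ifelse(grepl("?", url, fixed = TRUE), "&", "?")
  ifelse(has_np, url, paste0(url, sep, "np=y"))
}

targets <- c(
  "http://www.sciencedirect.com/science/article/pii/S1754504816300319",
  "http://www.sciencedirect.com/science/article/pii/S1754504816300319?np=y"
)
np_targets <- add_np_param(targets)
print(np_targets)

# With htmlToText() from the question loaded, the call would then be:
# temp.text <- htmlToText(np_targets[1])
```

Because the helper is vectorized, it can be applied to a whole vector of article URLs before passing them to htmlToText(), which itself loops over its input with lapply.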

A similar question on this topic can be found on Stack Overflow: https://stackoverflow.com/questions/38347902/
