html - Reading a complex HTML file into R with rvest

Tags: html r rvest

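I'm new to R and to Stack Overflow, so please be gentle; I'll try to keep this post accurate. I'm working on a project that compares whole-exome sequencing (WES) results with proteomics data. Our WES facility delivers its data only as HTML files, so I need to read them into R to continue my work.

I tried to follow the DataCamp tutorial for rvest, but I think the HTML file may be too complex, because all I get is some text in between runs of \t\t\tn\n\t. I suspect the html_node selector is wrong?

Below is my R code, followed by the HTML, which I have shortened and in which the variants have been changed.

What I would like to get is a data frame with the same columns as in the HTML. As the example shows, some variants affect multiple transcripts; in those cases one row per transcript would be perfect, but it is not a must.

Thank you very much for your help!

Sebastian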

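library(tidyverse)
library(rvest)

htmlALL <- read_html("Example_html")

# Attempt: take the text of everything matching the .table class.
# div.table is the single wrapper around the whole table, so this returns
# one long string full of tabs and newlines rather than separate cells.
getDATA <- function(html) {
  html %>%
    html_nodes(".table") %>%
    html_text() %>%
    str_trim() %>%
    unlist()
}

df_html <- getDATA(htmlALL)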

<!DOCTYPE html
	PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
	 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
  <!-- add title in the browser tab bar -->
  <title>Homozygous variants of sample XXX </title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>


<!-- change style to look nice -->
<style type="text/css">


html { 
  text-align: center;
  vertical-align: middle;
  height: 100%;
  width: 100%;
}
body { 
  background: #eee url('http://i.imgur.com/eeQeRmk.png'); /* http://subtlepatterns.com/weave/ */
  font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;
  font-size: 62.5%;
  line-height: 1;
  color: #585858;
  padding: 22px 10px;
  padding-bottom: 55px;

}

::selection { background: #5f74a0; color: #fff; }
::-moz-selection { background: #5f74a0; color: #fff; }
::-webkit-selection { background: #5f74a0; color: #fff; }

br { display: block; line-height: 1.6em; } 

input, textarea { 
  -webkit-font-smoothing: antialiased;
  -webkit-text-size-adjust: 100%;
  -ms-text-size-adjust: 100%;
  -webkit-box-sizing: border-box;
  -moz-box-sizing: border-box;
  box-sizing: border-box;
  outline: none; 
}

blockquote, q { quotes: none; }
blockquote:before, blockquote:after, q:before, q:after { content: ''; content: none; }
strong, b { font-weight: bold; } 


h1 {
  font-weight: bold;
  font-size: 3.6em;
  line-height: 1.7em;
  margin-bottom: 10px;
  text-align: center;
}

h2 {
  font-weight: bold;
  font-size: 2.6em;
  line-height: 1.7em;
  margin-bottom: 10px;
  text-align: center;
}

/** big white sheet everything is on **/
.wrapper {
  display: block;
  width: 95%;
  background: #fff;
  margin: 0 auto;
  padding: 10px 17px 100px;
  box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  overflow-x: auto;
  overflow-y: visible;
}

/* smaller box the family information is on */
.info{
  display: block;
  width: 800px;
  background: #f2f2f2;
  margin: 0 auto;
  padding: 10px 17px 10px 10px;
  box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  font-size: 1.8em;
  margin-bottom: 10px;
}


/* this is what actually contains the info */
.table {
  display: table;
  margin: 0 auto;
  width: 99%;
  font-size: 1.2em;
  margin-bottom: 15px;
  border-collapse: collapse;
  overflow: visible;
}

/* one row of the variants */
.tablerow {
  display: table-row;
  overflow: visible;
  border: 1px solid gray;
  width: 100%;
}

/* headers are bigger and may in the future be clickable to sort accordingly */
.tableheader {
  display: table-cell;
  background: #f2f2f2;
  padding: 3px 10px;
  margin-bottom: 25px;
  font-size: 1.8em;
  box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
}

/* in the following each column gets specified to increase readability */

.position {
  display: table-cell;
  padding: 3px 10px;
  font-size: 1.4em;
  height: 100%;
  text-align: center;
  vertical-align: middle;
}

.variants {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
  overflow: visible;
  white-space: nowrap;
  
}

.stacked {
  display: table;
  height: 50%;
  width: 100%;

}

.center {
  display: table-cell;
  vertical-align: middle;
  width: 100%;
  padding: 0px 5px;
}


.consequences {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
  padding: 3px 10px;
}

.gene {
  display: table-cell;
  padding: 3px 15px;
  height: 100%;
  vertical-align: middle;
  font-size: 1.4em;
  font-weight: bold;
}

.transcripts {
  display: table-cell;
  vertical-align: middle;
  height: 100%;
}

.list {
  height: 100%;
  width: 100%;
  display: table;
  table-layout: fixed;
}
.row {
  display: table-row;
  overflow: visible;
  vertical-align: middle;
}
.entry {
  display: table-cell;
  vertical-align:middle;
  padding: 0% 1% 0% 1%;
  white-space: nowrap;
  text-overflow: ellipsis;
  overflow: hidden;
}

.cdspos {
  display: table-cell;
  vertical-align: middle;
  height: 100%;
}

.exon {
  display: table-cell;
  vertical-align: middle;
  height: 100%;
}



.hgvs {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}

.hgvs .list .row{
  display: table-row;
  vertical-align: middle;
}

.polyphen {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}
.polyphen .list .row{
  display: table-row;
  vertical-align: middle;
}

.sift {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}
.sift .list .row{
  display: table-row;
  vertical-align: middle;
}

.allelefreq {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}



/* Tooltip container */
.tooltip_gene, .tooltip_allelefrq ,.tooltip_qual{
    position: relative;
    display: inline-block;
    border-bottom: 1px dotted black; /* If you want dots under the hoverable text */
    
}



.tooltiptext{
    visibility: hidden;
    overflow: auto;
    min-width: 400px;
    background-color: #ffb380;
    color: black;
    text-align: left;
    padding: 5px 10px;
    border-radius: 6px;
    font-size: 12pt;
    font-weight: normal;
    
    /* Position the tooltip text - see examples below! */
    position: absolute;
    z-index:1;
    
    /* shadow */
    box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    
    opacity: 0.95;
    filter: alpha(opacity=95);

}

/* Tooltip text */
.tooltip_gene .tooltiptext {
    top: -5px;
    left: 105%;
 
}


/* Tooltip text */
.tooltip_allelefrq .tooltiptext {
    top: -5px;
    right: 105%;
    min-width: 120px;
    
 
}

/* Show the tooltip text when you mouse over the tooltip container */
.tooltip_allelefrq:hover .tooltiptext, .tooltip_gene:hover .tooltiptext {
    visibility: visible;
}


.clin {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
  padding: 0% 1% 0% 1%;
  white-space: nowrap;
  text-overflow: ellipsis;
  overflow: hidden;
}

</style>


<body>
  <div class="wrapper">
      <!-- add info about patients -->
      <h1>Homozygous variants of sample XXX</h1>
      <h2>Tue Jan 23 09:01:56 2018</h2>
      <div class="info">
	
	  Patient only<br>
	
      </div>
      <!-- variants table start -->
      <div class="table">
	<!-- table header start -->
	<div class="tablerow">
	  <div class="tableheader">
	    Position
	  </div>
	  <div class="tableheader">
	    Variant
	  </div>
	  <div class="tableheader">
	    Cons
	  </div>
	  <div class="tableheader">
	    Gene
	  </div>
	  <div class="tableheader">
	    Transcript
	  </div>
	  <div class="tableheader">
	    HGVSC
	  </div>
	  <div class="tableheader">
	    HGVSP
	  </div>
	  <div class="tableheader">
	    PolyPhen
	  </div>
	  <div class="tableheader">
	    SIFT
	  </div>
	  <div class="tableheader">
	    AF
	  </div>
	  <div class="tableheader">
	    Clin
	  </div>
	</div>
	<!-- table header stop -->
	<!-- var loop start -->
	
	  <div class="tablerow" >
	    <!-- position start -->
	    <div class="position">
	      <a href="http://gnomad.broadinstitute.org/region/1-117635467-117635507">1:117635487</a>
	    </div>
	    <!-- position stop -->
	    <!-- variants start -->
	    <div class="variants">
	      
		
		  G->T
		
	      
	    </div>
	    <!-- variants stop -->
	    <!-- consequences start -->
	    <div class="consequences" style="background: rgb(196, 197, 198);">
	      
		synonymous
	      
	    </div>
	    <!-- consequences stop -->
	    <!-- gene start -->
	    <div class="gene" >
	      
	      
	      
		
		  <div class="tooltip_gene">
		    <a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=TTF2" >
		      TTF2
		    </a>
		    <span class="tooltiptext">GeneCards Summary<hr>
TTF2 (Transcription Termination Factor 2) is a Protein Coding gene.
Diseases associated with TTF2 include Sexual Sadism and Narcissistic Personality Disorder.
Among its related pathways are Human Thyroid Stimulating Hormone (TSH) signaling pathway and Insulin secretion.
GO annotations related to this gene include hydrolase activity and DNA-dependent ATPase activity.
An important paralog of this gene is HLTF.</span>
		  </div>
		
	    </div>
	    <!-- gene stop -->
	    <!-- transcripts start -->
	    <div class="transcripts">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      <a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000369466">ENST00000369466
		      </a>
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- transcripts stop -->
	    <!-- exon start -->
	<!--    <div class="exon">
	      <div class="list">
		
	      </div>
	    </div>-->
	    <!-- exon stop -->
	    <!-- hgvsc start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.2940G>T
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsc stop -->
	    <!-- hgvsp start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.2940G>T(p.%3D)
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsp stop -->
	    <!-- polyphen start -->
	    <div class="polyphen">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- polyphen stop -->
	    <!-- sift start -->
	    <div class="sift">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- sift stop -->
	    <!--.allelefreq start -->
	    <div class="allelefreq">
	      
		
		  <div class="tooltip_allelefrq">
		    0.00000
		    <span class="tooltiptext">allele counts<hr>ht: <span style='float:right;'>0</span><br>hm: <span style='float:right;'>0</span><br>wt: <span style='float:right;'>0</span><hr>inhouse:<span style='float:right;'>0.00118</span></span>
		  </div>
		
	      
	    </div>
	    <!--.allelefreq stop -->
	    <!--.allelefreq start -->
	    <div class="clin">
	      
		
	      
	    </div>
	    <!--.allelefreq stop -->
	  </div>
	  <!-- table row stop-->
	
	 	
	  <div class="tablerow" >
	    <!-- position start -->
	    <div class="position">
	      <a href="http://gnomad.broadinstitute.org/region/1-149898435-149898475">1:149898455</a>
	    </div>
	    <!-- position stop -->
	    <!-- variants start -->
	    <div class="variants">
	      
		
		  
		      <a href="https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs143105666">G->A</a>
		  
		
	      
	    </div>
	    <!-- variants stop -->
	    <!-- consequences start -->
	    <div class="consequences" style="background: rgb(196, 197, 198);">
	      
		synonymous
	      
	    </div>
	    <!-- consequences stop -->
	    <!-- gene start -->
	    <div class="gene" >
	      
	      
	      
		
		  <div class="tooltip_gene">
		    <a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=SF3B4" >
		      SF3B4
		    </a>
		    <span class="tooltiptext">GeneCards Summary<hr>
SF3B4 (Splicing Factor 3b Subunit 4) is a Protein Coding gene.
Diseases associated with SF3B4 include Acrofacial Dysostosis 1, Nager Type and Acrofacial Dysostosis Syndrome Of Rodriguez.
Among its related pathways are mRNA Splicing - Major Pathway and Gene Expression.
GO annotations related to this gene include nucleic acid binding and nucleotide binding.
</span>
		  </div>
		
	    </div>
	    <!-- gene stop -->
	    <!-- transcripts start -->
	    <div class="transcripts">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      <a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000457312">ENST00000457312
		      </a>
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      <a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000271628">ENST00000271628
		      </a>
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- transcripts stop -->
	    <!-- exon start -->
	<!--    <div class="exon">
	      <div class="list">
		
	      </div>
	    </div>-->
	    <!-- exon stop -->
	    <!-- hgvsc start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.390C>A
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			c.519C>A
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsc stop -->
	    <!-- hgvsp start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.390C>A(p.%3D)
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			c.519C>A(p.%3D)
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsp stop -->
	    <!-- polyphen start -->
	    <div class="polyphen">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- polyphen stop -->
	    <!-- sift start -->
	    <div class="sift">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- sift stop -->
	    <!--.allelefreq start -->
	    <div class="allelefreq">
	      
		
		  <div class="tooltip_allelefrq">
		    0.00021
		    <span class="tooltiptext">allele counts<hr>ht: <span style='float:right;'>57</span><br>hm: <span style='float:right;'>0</span><br>wt: <span style='float:right;'>277082</span><hr>inhouse:<span style='float:right;'>0.00236</span></span>
		  </div>
		
	      
	    </div>
	    <!--.allelefreq stop -->
	    <!--.allelefreq start -->
	    <div class="clin">
	      
		
	      
	    </div>
	    <!--.allelefreq stop -->
	  </div>
	  <!-- table row stop-->
	 	
	<!-- var loop stop -->
      </div>
      <!-- variant table stop -->
    </div>
</body>
</html>

Best answer

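This is the best I can offer. Note that the output includes the "tooltip text" that pops up when you hover over the data in the Gene column (and, as the output shows, in the AF column as well).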

library(rvest)

# I saved your sample to my Desktop as test.html
pg = read_html('~/Desktop/test.html')

# count rows (including header):
n_rows = pg %>% html_nodes('div.tablerow') %>% length

# sprintf-friendly format to get the %d-th node matching
#   //div[@class="tablerow"] (equivalent to div.tablerow in CSS here,
#   since each row's class attribute is exactly "tablerow").
#   All of the /div after this are columns
xp_fmt = '//div[@class="tablerow"][%d]/div'

# div.tableheader nodes contain column names
col_names = pg %>% html_nodes(xpath = sprintf(xp_fmt, 1L)) %>% 
  html_text %>% trimws

# rows 2:n contain the actual data; gsub is
#   stripping leading/trailing whitespace and deleting
#   (not collapsing) any internal run of 2+ whitespace
#   characters, so multi-entry cells end up fused together
rows = lapply(2:n_rows, function(ii) {
  pg %>% html_nodes(xpath = sprintf(xp_fmt, ii)) %>% 
    html_text %>% gsub('^\\s+|\\s{2,}|\\s+$', '', .)
})

# can't forget those pesky factors
DF = as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(DF) = col_names
DF
#      Position Variant       Cons
# 1 1:117635487    G->T synonymous
# 2 1:149898455    G->A synonymous
#                                                                                                                                                                                                                                                                                                                                                                                                                                                     Gene
# 1 TTF2GeneCards Summary\nTTF2 (Transcription Termination Factor 2) is a Protein Coding gene.\nDiseases associated with TTF2 include Sexual Sadism and Narcissistic Personality Disorder.\nAmong its related pathways are Human Thyroid Stimulating Hormone (TSH) signaling pathway and Insulin secretion.\nGO annotations related to this gene include hydrolase activity and DNA-dependent ATPase activity.\nAn important paralog of this gene is HLTF.
# 2                                                       SF3B4GeneCards Summary\nSF3B4 (Splicing Factor 3b Subunit 4) is a Protein Coding gene.\nDiseases associated with SF3B4 include Acrofacial Dysostosis 1, Nager Type and Acrofacial Dysostosis Syndrome Of Rodriguez.\nAmong its related pathways are mRNA Splicing - Major Pathway and Gene Expression.\nGO annotations related to this gene include nucleic acid binding and nucleotide binding.
#                       Transcript            HGVSC
# 1                ENST00000369466        c.2940G>T
# 2 ENST00000457312ENST00000271628 c.390C>Ac.519C>A
#                            HGVSP PolyPhen SIFT
# 1               c.2940G>T(p.%3D)              
# 2 c.390C>A(p.%3D)c.519C>A(p.%3D)              
#                                                         AF
# 1       0.00000allele countsht: 0hm: 0wt: 0inhouse:0.00118
# 2 0.00021allele countsht: 57hm: 0wt: 277082inhouse:0.00236
#   Clin
# 1     
# 2     
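
The output above fuses multiple transcripts into one cell and keeps the tooltip text. A minimal sketch of one way to get one row per transcript without the tooltips follows. It assumes the layout of the sample file (tooltips inside span.tooltiptext, per-transcript values inside div.entry) and uses a small inline snippet as a hypothetical stand-in for the real file; the names cells_of and DF_long are made up for illustration.

```r
library(rvest)
library(xml2)

# Inline stand-in for the facility's file (assumption: same div/class layout)
html <- '
<div class="table">
  <div class="tablerow">
    <div class="tableheader">Position</div>
    <div class="tableheader">Gene</div>
    <div class="tableheader">Transcript</div>
    <div class="tableheader">HGVSC</div>
  </div>
  <div class="tablerow">
    <div class="position"><a href="#">1:149898455</a></div>
    <div class="gene"><div class="tooltip_gene"><a href="#"> SF3B4 </a>
      <span class="tooltiptext">GeneCards Summary: SF3B4 is a gene.</span></div></div>
    <div class="transcripts"><div class="list">
      <div class="row"><div class="entry"><a href="#">ENST00000457312 </a></div></div>
      <div class="row"><div class="entry"><a href="#">ENST00000271628 </a></div></div>
    </div></div>
    <div class="hgvs"><div class="list">
      <div class="row"><div class="entry"> c.390C>A </div></div>
      <div class="row"><div class="entry"> c.519C>A </div></div>
    </div></div>
  </div>
</div>'
pg <- read_html(html)

# Drop tooltip spans in place so their text never reaches html_text()
xml_remove(html_nodes(pg, "span.tooltiptext"))

rows <- html_nodes(pg, "div.tablerow")
col_names <- trimws(html_text(html_nodes(rows[[1]], "div.tableheader")))

# One list element per column: a vector of entries if the cell holds
# div.entry nodes, otherwise the trimmed text of the cell itself
cells_of <- function(row) {
  lapply(html_nodes(row, xpath = "./div"), function(cell) {
    entries <- html_nodes(cell, "div.entry")
    txt <- if (length(entries) > 0) html_text(entries) else html_text(cell)
    trimws(txt)
  })
}

# One data-frame row per transcript; single-valued columns are recycled
per_row <- lapply(seq_along(rows)[-1], function(i) {
  as.data.frame(setNames(cells_of(rows[[i]]), col_names),
                stringsAsFactors = FALSE)
})
DF_long <- do.call(rbind, per_row)
DF_long
```

For the inline snippet, DF_long has one row per transcript, with Position and Gene repeated on each. The recycling step relies on every multi-entry column in a row having the same number of entries; if the real file ever breaks that assumption, as.data.frame will stop with an error rather than misalign values.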

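Note that it makes no difference here, since all of your columns appear to be of type character, but a more sophisticated approach would convert the rows into a regular file (e.g. csv) and then use read.table (or better, fread) to read the text back in and auto-detect the column types.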
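
As a rough illustration of that idea, the sketch below uses a hypothetical all-character data frame (DF_chr, with an AF column that holds only the number, i.e. without tooltip text) and shows two base-R ways to let R guess the column types: type.convert() in place, and a round trip through a temporary CSV file. data.table::fread() could be used for the second step instead of read.csv().

```r
# Hypothetical all-character data frame shaped like part of DF above
DF_chr <- data.frame(
  Position = c("1:117635487", "1:149898455"),
  Cons     = c("synonymous", "synonymous"),
  AF       = c("0.00000", "0.00021"),
  stringsAsFactors = FALSE
)

# Option 1: let base R guess the type of each column in place
DF_typed <- type.convert(DF_chr, as.is = TRUE)

# Option 2: round-trip through CSV text and re-read with type detection
csv_file <- tempfile(fileext = ".csv")
write.csv(DF_chr, csv_file, row.names = FALSE)
DF_csv <- read.csv(csv_file, stringsAsFactors = FALSE)

str(DF_typed)
```

In both cases AF should come back as numeric while Position and Cons stay character.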

Regarding "html - Reading a complex HTML file into R with rvest", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52297953/
