r - Extracting a YouTube video description in R with rvest

Tags: r youtube rvest

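I am trying to extract a YouTube video description using rvest. I know it would be easier to just use the API, but the end goal is to become more familiar with rvest rather than merely to get the video description. This is what I have done so far: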

library(rvest)

# defining website
page <- "https://www.youtube.com/watch?v=4PqdqWWSHJY"

# setting XPath
Xp <- '/html/body/div[2]/div[4]/div/div[5]/div[2]/div[2]/div/div[2]/meta[2]'

# getting page
Website <- read_html(page)

# selecting the node and printing its description
Description <- html_nodes(Website, xpath = Xp)
html_attr(Description, name = "content")

This does point to the video description, but instead of the full description I get a string that is cut off after a few lines:
[1] "The Conservatives and Labour have been outlining their main pitch to voters. The Prime Minister Boris Johson in his first major speech of the campaign said a..."

The expected output would be the full description:
"The Conservatives and Labour have been outlining their main pitch to voters. The Prime Minister Boris Johnson in his first major speech of the campaign said a Conservative government would unite the country and "level up" the prospects for people with massive investment in health, better infrastructure, more police, and a green revolution. But he said the key issue to solve was Brexit. Meanwhile Labour vowed to outspend the Tories on the NHS in England. 

Labour leader Jeremy Corbyn has also faced questions over his position on allowing a second referendum on Scottish independence. Today at the start of a two-day tour of Scotland, he said wouldn't allow one in the first term of a Labour government but later rowed back saying it wouldn't be a priority in the early years. 

Sophie Raworth presents tonight's BBC News at Ten and unravels the day's events with the BBC's political editor Laura Kuenssberg, health editor Hugh Pym and Scotland editor Sarah Smith.


Please subscribe HERE: LINK"

Is there a way to get the full description with rvest?

Best answer

Since you said your focus is on learning, I have added some explanation after the code of how to get there.

Reproducible code:

library(rvest)
library(magrittr)
url <- "https://www.youtube.com/watch?v=4PqdqWWSHJY"
url %>% 
  read_html %>% 
  html_nodes(xpath = "//*[@id = 'eow-description']") %>% 
  html_text

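Explanation:

1. Locate the element

There are several ways to approach this. A common first step is to right-click the target element in the browser and choose "Inspect element". The inspector then shows the markup of the element you clicked, including its id.

Next, you can try to extract the data using that id: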
url %>% 
      read_html %>% 
      html_nodes(xpath = "//*[@id = 'description']")

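Unfortunately, that does not work in your case: the id shown in the browser inspector does not appear in the document that read_html downloads.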
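
A quick way to notice this kind of failure is to check the length of the returned node set. The sketch below is only an illustration: it uses a small inline HTML string as a stand-in for the downloaded page (an assumption, so that it runs offline), not the real YouTube markup.

```r
library(rvest)

# Inline stand-in for the downloaded document (an assumption for illustration).
downloaded <- read_html(
  '<html><body><p id="eow-description">Full description text.</p></body></html>'
)

# The id seen in the browser inspector is not present in this document,
# so the XPath query matches nothing.
missing_nodes <- html_nodes(downloaded, xpath = "//*[@id = 'description']")

# An empty node set has length 0; html_text() on it returns character(0).
length(missing_nodes)
```

A length of 0 means the selector did not match anything in the document R actually received, which is the cue to inspect that document rather than the browser view.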

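2. Make sure you have the right source

So you have to make sure that the target data is actually in the document you loaded. You can see this in the browser's network activity, or, if you want to check from within R, I wrote a small function for that: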
showHtmlPage <- function(doc){
  tmp <- tempfile(fileext = ".html")
  doc %>% toString %>% writeLines(con = tmp)
  tmp %>% browseURL(browser = rstudioapi::viewer)
}

Usage:
url %>% read_html %>% showHtmlPage

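You will see that the target data is in fact in the document you downloaded, so you can stick with rvest. Next, you have to find the XPath (or CSS selector), ...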
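
If the RStudio viewer is not available, roughly the same check can be done as a plain text search over the serialized document. This is a minimal sketch under assumptions: `contains_text` is a hypothetical helper (not part of rvest), and the inline HTML stands in for the downloaded page so the example runs offline.

```r
library(rvest)

# Hypothetical helper: TRUE if the serialized document contains the fragment.
contains_text <- function(doc, fragment) {
  # as.character() on an xml_document returns its markup as a single string
  grepl(fragment, as.character(doc), fixed = TRUE)
}

# Inline stand-in for the downloaded page (an assumption for illustration).
downloaded <- read_html(
  '<html><body><p id="eow-description">The Conservatives and Labour have been outlining their main pitch to voters.</p></body></html>'
)

contains_text(downloaded, "The Conservatives and ")
contains_text(downloaded, "text that is not on the page")
```

With a real page you would pass the result of `read_html(url)` instead of the inline string.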

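3. Find the target tag in the downloaded document

You can search for the tag that contains the text you are looking for:
doc <- url %>% read_html
doc %>% html_nodes(xpath = "//*[contains(text(), 'The Conservatives and ')]")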

The output will be:
{xml_nodeset (1)}
[1] <p id="eow-description" class="">The Conservatives and Labour have ....

From this you can see that you are looking for a tag with the id eow-description.
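
The text search and the final id-based extraction can be tried together on a small example. The sketch below uses an inline HTML string as a stand-in for the downloaded page (an assumption, so it runs offline), and it also shows the CSS id selector as an alternative to the XPath expression.

```r
library(rvest)

# Inline stand-in for the downloaded page (an assumption for illustration).
sample_doc <- read_html(
  '<html><body><div id="other">Unrelated text</div><p id="eow-description" class="">The Conservatives and Labour have been outlining their main pitch to voters.</p></body></html>'
)

# Find the element whose own text contains a known fragment and read its id.
hit <- html_nodes(sample_doc, xpath = "//*[contains(text(), 'The Conservatives and ')]")
found_id <- html_attr(hit, "id")

# Select the same element by id, once with XPath and once with a CSS selector.
by_xpath <- html_text(html_nodes(sample_doc, xpath = "//*[@id = 'eow-description']"))
by_css   <- html_text(html_nodes(sample_doc, css = "#eow-description"))

found_id
identical(by_xpath, by_css)
```

Both selectors address the same node here; which one to use is largely a matter of preference.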

Regarding r - Extracting a YouTube video description in R with rvest, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58861911/
