python - How to scrape an embedded script on a web page in Python

Tags: python html xpath web-scraping html-parsing

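For example, I have the web page http://www.amazon.com/dp/1597805483.

I want to use XPath to scrape the sentence "Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime."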

import requests
from lxml import html

url = "http://www.amazon.com/dp/1597805483"
page = requests.get(url)
tree = html.fromstring(page.text)
feature_bullets = tree.xpath('//*[@id="iframeContent"]/div/text()')
print(feature_bullets)

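The code above returns nothing. The reason is that the DOM the browser renders differs from the raw page source, so an XPath taken from the browser does not match. But I don't know how to work out the XPath from the source.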

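Best answer

A lot goes into constructing the page you are scraping, and part of it is built by JavaScript in the browser rather than served as plain HTML.

As for the description specifically, the underlying HTML is constructed inside a JavaScript function: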

<script type="text/javascript">

    P.when('DynamicIframe').execute(function (DynamicIframe) {
        var BookDescriptionIframe = null,
                bookDescEncodedData = "%3Cdiv%3E%3CB%3EA%20Fantastic%20Anthology%20Combining%20the%20Love%20of%20Science%20Fiction%20with%20Our%20National%20Pastime%3C%2FB%3E%3CBR%3E%3CBR%3EOf%20all%20the%20sports%20played%20across%20the%20globe%2C%20none%20has%20more%20curses%20and%20superstitions%20than%20baseball%2C%20America%26%238217%3Bs%20national%20pastime.%3Cbr%3E%3CBR%3E%3CI%3EField%20of%20Fantasies%3C%2FI%3E%20delves%20right%20into%20that%20superstition%20with%20short%20stories%20written%20by%20several%20key%20authors%20about%20baseball%20and%20the%20supernatural.%20%20Here%20you%27ll%20encounter%20ghostly%20apparitions%20in%20the%20stands%2C%20a%20strangely%20charming%20vampire%20double-play%20combination%2C%20one%20fan%20who%20can%20call%20every%20shot%20and%20another%20who%20can%20see%20the%20past%2C%20a%20sad%20alternate-reality%20for%20the%20game%27s%20most%20famous%20player%2C%20unlikely%20appearances%20on%20the%20field%20by%20famous%20personalities%20from%20Stephen%20Crane%20to%20Fidel%20Castro%2C%20a%20hilariously%20humble%20teenage%20phenom%2C%20and%20much%20more.%20In%20this%20wonderful%20anthology%20are%20stories%20from%20such%20award-winning%20writers%20as%3A%3CBR%3E%3CBR%3EStephen%20King%20and%20Stewart%20O%26%238217%3BNan%3Cbr%3EJack%20Kerouac%3CBR%3EKaren%20Joy%20Fowler%3CBR%3ERod%20Serling%3CBR%3EW.%20P.%20Kinsella%3CBR%3EAnd%20many%20more%21%3CBR%3E%3CBR%3ENever%20has%20a%20book%20combined%20the%20incredible%20with%20great%20baseball%20fiction%20like%20%3CI%3EField%20of%20Fantasies%3C%2FI%3E.%20This%20wide-ranging%20collection%20reaches%20from%20some%20of%20the%20earliest%20classics%20from%20the%20pulp%20era%20and%20baseball%27s%20golden%20age%2C%20all%20the%20way%20to%20material%20appearing%20here%20for%20the%20first%20time%20in%20a%20print%20edition.%20Whether%20you%20love%20the%20game%20or%20just%20great%20fiction%2C%20these%20stories%20will%20appeal%20to%20all%2C%20as%20the%20writers%20in%20this%20anthology%20bring%20great%20storytelling%20of%20the%20strange%20and%20supernatural%20to%20the%20plate%2C%20inning%20after%20inning.%3CBR%3E%3C%2Fdiv%3E",
                bookDescriptionAvailableHeight,
                minBookDescriptionInitialHeight = 112,
                options = {};
    ...

</script>

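The idea here is to get the script tag's text, extract the description value with a regular expression, unquote (percent-decode) the HTML, parse it with lxml.html and get the .text_content():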

import re
from urllib.parse import unquote  # Python 3; on Python 2 this was "from urlparse import unquote"

from lxml import html
import requests

url = "http://rads.stackoverflow.com/amzn/click/1597805483"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})
tree = html.fromstring(page.content)

# Locate the <script> element that defines bookDescEncodedData
script = tree.xpath('//script[contains(., "bookDescEncodedData")]')[0]
# Capture the percent-encoded string literal assigned to the variable
match = re.search(r'bookDescEncodedData = "(.*?)",', script.text)
if match:
    # Percent-decode the fragment, parse it as HTML, and print its text
    description_html = html.fromstring(unquote(match.group(1)))
    print(description_html.text_content())

Prints:

A Fantastic Anthology Combining the Love of Science Fiction with Our National Pastime. 
Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime.Field of Fantasies delves right into that superstition with short stories written by several key authors about baseball and the supernatural.  
Here you'll encounter ghostly apparitions in the stands, a strangely charming vampire double-play combination, one fan who can call every shot and another who can see the past, a sad alternate-reality for the game's most famous player, unlikely appearances on the field by famous personalities from Stephen Crane to Fidel Castro, a hilariously humble teenage phenom, and much more. 
In this wonderful anthology are stories from such award-winning writers as:Stephen King and Stewart O’NanJack KerouacKaren Joy FowlerRod SerlingW. P. KinsellaAnd many more!Never has a book combined the incredible with great baseball fiction like Field of Fantasies. 
This wide-ranging collection reaches from some of the earliest classics from the pulp era and baseball's golden age, all the way to material appearing here for the first time in a print edition. Whether you love the game or just great fiction, these stories will appeal to all, as the writers in this anthology bring great storytelling of the strange and supernatural to the plate, inning after inning.
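
The regex and percent-decoding steps can be tried without touching the network. The sketch below is a minimal illustration that uses only the standard library. SAMPLE_SCRIPT is a shortened, made-up stand-in for the real script text, and extract_encoded_description is a hypothetical helper name, not something provided by the page or by lxml.

```python
import re
from html import unescape
from urllib.parse import unquote

# Hypothetical, shortened stand-in for the text of the page's <script> tag.
SAMPLE_SCRIPT = '''
P.when('DynamicIframe').execute(function (DynamicIframe) {
    var BookDescriptionIframe = null,
            bookDescEncodedData = "%3Cdiv%3E%3CB%3EA%20Title%3C%2FB%3E%3CBR%3EAmerica%26%238217%3Bs%20pastime.%3C%2Fdiv%3E",
            options = {};
});
'''


def extract_encoded_description(script_text):
    """Return the percent-decoded HTML fragment, or None if the variable is absent."""
    match = re.search(r'bookDescEncodedData\s*=\s*"(.*?)"', script_text)
    if match is None:
        return None
    return unquote(match.group(1))


fragment = extract_encoded_description(SAMPLE_SCRIPT)
# After percent-decoding, the fragment is HTML that still contains
# character references such as &#8217;.
print(fragment)
# html.unescape resolves those references, which an HTML parser
# would otherwise do while parsing the fragment.
print(unescape(fragment))
```

Percent-decoding only undoes the URL encoding. The remaining character references are resolved by the HTML parser, which is why the answer's code passes the decoded string to html.fromstring before reading the text.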

A similar solution, but using BeautifulSoup:

import re
from urllib.parse import unquote  # Python 3; on Python 2 this was "from urlparse import unquote"

from bs4 import BeautifulSoup
import requests

url = "http://rads.stackoverflow.com/amzn/click/1597805483"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})
soup = BeautifulSoup(page.content, 'html.parser')  # name the parser explicitly

# x is None for <script> tags without inline text, so guard before testing membership
script = soup.find('script', string=lambda x: x and 'bookDescEncodedData' in x)
# .string returns the script's inline source
match = re.search(r'bookDescEncodedData = "(.*?)",', script.string)
if match:
    description_html = BeautifulSoup(unquote(match.group(1)), 'html.parser')
    print(description_html.text)

Alternatively, you can take a high-level approach and use a real browser with the help of selenium:

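from selenium import webdriver
from selenium.webdriver.common.by import By

url = "http://rads.stackoverflow.com/amzn/click/1597805483"

driver = webdriver.Firefox()
driver.get(url)

# The description is rendered inside an iframe, so switch into it first.
# find_element(By.ID, ...) replaces find_element_by_id, which Selenium 4 removed.
iframe = driver.find_element(By.ID, 'bookDesc_iframe')
driver.switch_to.frame(iframe)

print(driver.find_element(By.ID, 'iframeContent').text)

driver.close()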

This produces better-formatted output:

A Fantastic Anthology Combining the Love of Science Fiction with Our National Pastime

Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime.

Field of Fantasies delves right into that superstition with short stories written by several key authors about baseball and the supernatural. Here you'll encounter ghostly apparitions in the stands, a strangely charming vampire double-play combination, one fan who can call every shot and another who can see the past, a sad alternate-reality for the game's most famous player, unlikely appearances on the field by famous personalities from Stephen Crane to Fidel Castro, a hilariously humble teenage phenom, and much more. In this wonderful anthology are stories from such award-winning writers as:

Stephen King and Stewart O’Nan
Jack Kerouac
Karen Joy Fowler
Rod Serling
W. P. Kinsella
And many more!

Never has a book combined the incredible with great baseball fiction like Field of Fantasies. This wide-ranging collection reaches from some of the earliest classics from the pulp era and baseball's golden age, all the way to material appearing here for the first time in a print edition. Whether you love the game or just great fiction, these stories will appeal to all, as the writers in this anthology bring great storytelling of the strange and supernatural to the plate, inning after inning.
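
The lxml and BeautifulSoup versions run lines together because the <BR> tags in the fragment carry no text of their own. As a rough sketch of one way to get closer to the browser's formatting without a browser, the decoded fragment can be walked with the standard library's HTMLParser, emitting a newline for each <br>. BreakPreservingText and fragment_to_text are hypothetical names introduced here, and the sample fragment is shortened for illustration.

```python
from html.parser import HTMLParser


class BreakPreservingText(HTMLParser):
    """Collect text from an HTML fragment, emitting a newline for each <br>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BR> and <br> both arrive as "br".
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def fragment_to_text(fragment):
    """Return the fragment's text with <br> tags turned into line breaks."""
    parser = BreakPreservingText()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)


# Shortened, illustrative fragment in the same shape as the decoded description.
sample = "<div><B>A Title</B><BR><BR>Stephen King<br>Jack Kerouac<BR>And many more!<BR></div>"
print(fragment_to_text(sample))
```

This only handles <br>. Block-level tags such as <p> or <div> would need similar treatment if the fragment used them for layout.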

Regarding python - How to scrape an embedded script on a web page in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/26680590/
