python - XPath: getting information from the next sibling tag with Scrapy

Tags: python html xpath scrapy

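I'm trying to get to grips with Scrapy, and I'm currently trying to extract information from an etymology site: http://www.etymonline.com. For now, I just want the words and their origin descriptions. This is how a typical block of HTML on etymonline looks: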

<dt>
  <a href="/index.php?term=address&allowed_in_frame=0">address (n.)</a>
  <a href="http://dictionary.reference.com/search?q=address" class="dictionary" title="Look up address at Dictionary.com">
    <img src="graphics/dictionary.gif" width="16" height="16" alt="Look up address at Dictionary.com" title="Look up address at Dictionary.com"/>
  </a>
</dt>
<dd>
  1530s, "dutiful or courteous approach," from <a href="/index.php?term=address&allowed_in_frame=0" class="crossreference">address</a> (v.) and from French <span class="foreign">adresse</span>. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).
</dd>

The word is contained in the <dt> tag, and its description is in the next sibling, the <dd> tag. To get the list of words on a page such as http://www.etymonline.com/index.php?l=a&p=9&allowed_in_frame=0, I can write word = sel.xpath('//dl/dt/a/text()').extract().

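I then tried to iterate over this list of words and extract the related information with this line of code: info = selInfo.xpath("//dl/dt[a='"+word[i]+"']/following-sibling::dd"). But this doesn't seem to work. Any ideas?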

Best answer

To get from a <dt> to the <dd> that follows it, you can use the following-sibling axis; you're right about that.

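following-sibling::dd selects all the dd elements that come after the context node. So you need to restrict the XPath to only the first one, using the positional predicate [1].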

For each dt element you get with //dl/dt, you select following-sibling::dd[1].

Here is a sample scrapy shell session for the term "address":

$ scrapy shell "http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none"
...
2014-11-26 10:34:53+0100 [default] DEBUG: Crawled (200) <GET http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f1396cc6950>
[s]   item       {}
[s]   request    <GET http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none>
[s]   response   <200 http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none>
[s]   settings   <scrapy.settings.Settings object at 0x7f1397399bd0>
[s]   spider     <Spider 'default' at 0x7f13966c05d0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: for dt in response.xpath('//dl/dt'):
   ...:     print "Word:", dt.xpath('string(a)').extract()
   ...:     print "Definition:", dt.xpath('string(following-sibling::dd[1])').extract()
   ...:     print
   ...:
Word: [u'address (n.)']
Definition: [u'1530s, "dutiful or courteous approach," from address (v.) and from French adresse. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).']

Word: [u'addressee (n.)']
Definition: [u'1810; see address (v.) + -ee.']

Word: [u'address (v.)']
Definition: [u'early 14c., "to guide or direct," from Old French adrecier "go straight toward; straighten, set right; point, direct" (13c.), from Vulgar Latin *addirectiare "make straight," from Latin ad "to" (see ad-) + *directiare, from Latin directus "straight, direct" (see direct (v.)). Late 14c. as "to set in order, repair, correct." Meaning "to write as a destination on a written message" is from mid-15c. Meaning "to direct spoken words (to someone)" is from late 15c. Related: Addressed; addressing.']

Word: [u'salutatorian (n.)']
Definition: [u'1841, American English, from salutatory "of the nature of a salutation," here in the specific sense "designating the welcoming address given at a college commencement" (1702) + -ian. The address was originally usually in Latin and given by the second-ranking graduating student.']

...

Word: [u'reverend (adj.)']
Definition: [u'early 15c., "worthy of respect," from Middle French reverend, from Latin reverendus "(he who is) to be respected," gerundive of revereri (see reverence). As a form of address for clergymen, it is attested from late 15c.; earlier reverent (late 14c. in this sense). Abbreviation Rev. is attested from 1721, earlier Revd. (1690s). Very Reverend is used of deans, Right Reverend of bishops, Most Reverend of archbishops.']

Word: [u'nun (n.)']
Definition: [u'Old English nunne "nun, vestal, pagan priestess, woman devoted to religious life under vows," from Late Latin nonna "nun, tutor," originally (along with masc. nonnus) a term of address to elderly persons, perhaps from children\'s speech, reminiscent of nana (compare Sanskrit nona, Persian nana "mother," Greek nanna "aunt," Serbo-Croatian nena "mother," Italian nonna, Welsh nain "grandmother;" see nanny).']


In [2]: 

Regarding "python - XPath: getting information from the next sibling tag with Scrapy", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27139318/
