python - How to get elements using lxml

Tags: python parsing xpath lxml

https://bankchart.kz/spravochniki/reytingi_cbr/2/2019/7

How can I get the text from each column, that is, from the last three blocks with class <div class = "col-currency-rate"> inside every <div class = "row">? I have the table rows, but what do I do next?

>>> tree.xpath('//div[@class="table-currency"]/div[@class="row"]')
[<Element div at 0x7fcac2a47ba8>, <Element div at 0x7fcac2a47c00>, <Element div at 0x7fcac2a47c58>, <Element div at 0x7fcac2a47cb0>, <Element div at 0x7fcac2a47d08>, <Element div at 0x7fcac2a47d60>, <Element div at 0x7fcac2a47db8>, <Element div at 0x7fcac2a47e10>, <Element div at 0x7fcac2a47e68>, <Element div at 0x7fcac2a47ec0>, <Element div at 0x7fcac2a47f18>, <Element div at 0x7fcac2a47f70>, <Element div at 0x7fcac2a47fc8>, <Element div at 0x7fcac2a4e050>, <Element div at 0x7fcac2a4e0a8>, <Element div at 0x7fcac2a4e100>, <Element div at 0x7fcac2a4e158>, <Element div at 0x7fcac2a4e1b0>, <Element div at 0x7fcac2a4e208>, <Element div at 0x7fcac2a4e260>, <Element div at 0x7fcac2a4e2b8>, <Element div at 0x7fcac2a4e310>, <Element div at 0x7fcac2a4e368>, <Element div at 0x7fcac2a4e3c0>, <Element div at 0x7fcac2a4e418>, <Element div at 0x7fcac2a4e470>, <Element div at 0x7fcac2a4e4c8>, <Element div at 0x7fcac2a4e520>]
>>> len(tree.xpath('//div[@class="table-currency"]/div[@class="row"]'))
28

html

<div class="table-currency">
    <div class="row"><div class="col col-currency">
    2.&nbsp; &nbsp;
    <img rel="nofollow" src="https://st6.prosto.im/cache/st6/1/0/5/5/1055/1055.jpg" width="16" height="16" alt="">
    <a target="_blank" href="/spravochniki/reytingi_banka/2/1057">
    ForteBank
    </a></div><div class="col col-headery col-currency-rate"><p>Активы банков, тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост за июль 2019 года,  тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост с начала 2019 года,  тыс. тенге</p></div><div class="col col-currency-rate"><p>1 985 956 865</p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+89 298 547</p><p></p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+390 999 868</p><p></p></div></div>

    <div class="row"><div class="col col-currency">
    3.&nbsp; &nbsp;
    <img rel="nofollow" src="https://st6.prosto.im/cache/st6/1/0/9/5/1095/1095.png" width="16" height="16" alt="">
    <a target="_blank" href="/spravochniki/reytingi_banka/2/1076">
    Сбербанк России
    </a></div><div class="col col-headery col-currency-rate"><p>Активы банков, тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост за июль 2019 года,  тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост с начала 2019 года,  тыс. тенге</p></div><div class="col col-currency-rate"><p>1 983 840 092</p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+88 853 745</p><p></p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+119 145 827</p><p></p></div></div>
</div>


Best answer

A more involved solution using specific XPath expressions:

from lxml import html
import requests

url  = 'https://bankchart.kz/spravochniki/reytingi_cbr/2/2019/7'
doc = html.document_fromstring(requests.get(url).content)

for row in doc.xpath('//div[@class="table-currency"]/div[@class="row"]'):
    bank_name = row.xpath('descendant::a/text()')[0].strip()
    print(bank_name)
    for cur_rate in row.xpath('div[contains(@class, "col-currency-rate")][position() > last() - 3]'):
        print('-', cur_rate.text_content())
    print()

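Details:

  • descendant::a/text() - XPath that extracts the text nodes of the a element that is a child/descendant of the current row
  • div[contains(@class, "col-currency-rate")][position() > last() - 3] - XPath that selects the div elements whose class attribute contains the given value, then keeps only the last three of them (last() is the position of the last matching element, so position() > last() - 3 is true only for the final three positions). The header cells also carry the col-currency-rate class, which is why the positional filter is needed.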
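The two XPath expressions described above can be tried without a network request. The following is a minimal sketch, assuming a trimmed copy of one row from the question's HTML; SAMPLE, extract_rows and the "header 1..3" placeholder texts are illustrative names and values, not part of the original answer.

```python
from lxml import html

# Trimmed copy of one row from the question's HTML.
# The header texts are shortened placeholders (assumption for brevity).
SAMPLE = """
<div class="table-currency">
  <div class="row">
    <div class="col col-currency">
      <a target="_blank" href="/spravochniki/reytingi_banka/2/1057">
      ForteBank
      </a></div>
    <div class="col col-headery col-currency-rate"><p>header 1</p></div>
    <div class="col col-headery col-currency-rate"><p>header 2</p></div>
    <div class="col col-headery col-currency-rate"><p>header 3</p></div>
    <div class="col col-currency-rate"><p>1 985 956 865</p></div>
    <div class="col col-currency-rate"><p></p><p class="arrow-up">+89 298 547</p><p></p></div>
    <div class="col col-currency-rate"><p></p><p class="arrow-up">+390 999 868</p><p></p></div>
  </div>
</div>
"""


def extract_rows(markup):
    """Return a list of (bank_name, [last three cell texts]) tuples."""
    doc = html.document_fromstring(markup)
    rows = []
    for row in doc.xpath('//div[@class="table-currency"]/div[@class="row"]'):
        # First text node of the link inside the row, without padding.
        name = row.xpath('descendant::a/text()')[0].strip()
        # Six divs match contains(...); the second predicate keeps
        # positions 4, 5 and 6 of that filtered set.
        cells = row.xpath(
            'div[contains(@class, "col-currency-rate")][position() > last() - 3]'
        )
        rows.append((name, [cell.text_content().strip() for cell in cells]))
    return rows


result = extract_rows(SAMPLE)
print(result)
```

Because the second predicate is applied to the node set already filtered by the first one, last() here counts only the divs whose class contains col-currency-rate, not every child of the row.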

Output:

Народный банк Казахстана
- 8 729 518 087
- +101 401 107
- -190 957 466

ForteBank
- 1 985 956 865
- +89 298 547
- +390 999 868

Сбербанк России
- 1 983 840 092
- +88 853 745
- +119 145 827

Kaspi Bank
- 1 907 391 103
- +12 378 770
- +233 318 909

Банк ЦентрКредит
- 1 495 599 542
- +34 795 443
- -14 202 851

АТФБанк
- 1 314 405 536
- +1 661 967
- -19 558 254

First Heartland Jýsan Bank
- 1 217 617 065
- +52 641 777
- -553 564 176

Жилстройсбербанк Казахстана
- 1 148 974 349
- +7 721 823
- +261 041 394

Евразийский банк
- 1 040 820 999
- -25 910 447
- -25 911 373

Ситибанк Казахстан
- 758 117 020
- +48 724 924
- +82 877 576

Банк "Bank RBK"
- 618 310 738
- +21 856 874
- +62 626 834

Альфа-Банк
- 504 777 556
- +17 401 839
- +51 157 130

Altyn Bank («Народный банк Казахстана»)
- 421 018 633
- -20 058 555
- +33 720 048

Нурбанк
- 408 442 557
- +7 065 511
- -18 282 545

Хоум Кредит энд Финанс Банк
- 372 901 871
- -2 127 105
- +33 983 288

Банк Китая в Казахстане
- 324 386 349
- +11 609 880
- +4 997 316

Банк ВТБ
- 184 247 490
- +5 800 194
- +40 725 927

First Heartland Bank (Банк ЭкспоКреди)
- 173 058 018
- -17 261 535
- +16 047 168

Торгово-промышленный Банк Китая в Алматы
- 140 792 847
- +6 365 348
- -26 137 736

Банк Kassa Nova
- 133 910 512
- +954 985
- +4 039 523

Tengri Bank (Punjab National Bank)
- 133 721 602
- +1 136 896
- -485 570

Азия Кредит Банк
- 99 659 306
- -3 790 116
- -21 420 844

Capital Bank Kazakhstan
- 85 702 895
- -3 165 322
- +4 469 187

KZI Bank (Казахстан Зират Интернешнл)
- 65 240 704
- -3 412 060
- -126 750

Шинхан Банк Казахстан
- 43 323 406
- -7 588 366
- +722 399

Исламский Банк "Al-Hilal"
- 30 562 279
- +2 411 098
- -1 430 198

Заман-Банк
- 22 969 984
- -168 105
- +5 544 675

Национальный Банк Пакистана
- 4 705 084
- -20 113
- -131 233
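
The values printed above are strings with spaces as thousands separators and an optional sign. If they need to be used as numbers, something like the following sketch may help; parse_amount is a hypothetical helper, not part of the original answer, and the handling of a typographic minus sign is an assumption about what the page might contain.

```python
def parse_amount(text):
    """Convert a string such as '+89 298 547' to an int."""
    # str.split() without arguments splits on any whitespace, including
    # non-breaking spaces, so joining the pieces removes all separators.
    compact = "".join(text.split())
    # Assumption: normalise a typographic minus (U+2212) if one appears.
    compact = compact.replace("\u2212", "-")
    return int(compact)


samples = ["8 729 518 087", "+101 401 107", "-190 957 466"]
amounts = [parse_amount(s) for s in samples]
print(amounts)
```

int() accepts a leading + or - sign, so no separate sign handling is required for the values shown in the output.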

Regarding "python - How to get elements using lxml", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57305572/
