Python and Selenium: identification of elements in 'tables within tables' without using a keyword

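I want to scrape this page with Selenium.

The information I want to extract is in two tables.

First, I need all the information in the "General Information" table: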

Name           RTD-1
Sequence       RCICTRGFCRCLCRRGVC
Class          Primate
Average Mass   2081.56
Monoisotopic Mass   2079.91
m/z M+H         2080.92
ProteinType     Wild type
Parent  
Organism        Macaca mulatta (rhesus monkey)
Notes           Theta-defensin.
Cyclic          Yes

I can extract that table easily with the following code:

from selenium import webdriver
import os
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time
import sys
import re
import requests

options = Options()
options.binary_location=r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_argument("--headless")
driver = webdriver.Chrome(options=options,executable_path='/mnt/c/Users/kela/Desktop/selenium/chromedriver.exe')
driver.get('http://www.cybase.org.au/index.php?page=card&table=protein&id=85')

def parse_field(field_str):
    # find_elements_* returns an empty list when nothing matches, whereas
    # find_element_* raises, so the '-' fallback is reachable here.
    fields = driver.find_elements_by_xpath(field_str)
    if fields:
        return fields[0].text.strip()
    return '-'


# Build an XPath that finds the cell containing the label, then takes the next cell.
field = lambda x: "//tbody/tr/td[contains(.,'" + x + "')]/following::td[1]"
field_list = ['Name','Sequence','Class','Average Mass','Monoisotopic Mass','m/z M+H','ProteinType','Parent','Organism','Notes','Cyclic']
for i in field_list:
    text_to_return = parse_field(field(i))
    print(text_to_return)

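For the table with the "Assays" legend, I only want to extract the names of the assays (in this example, Anti-bacterial and Membrane-binding assay) and the UIDs of the papers embedded in the a href (call this uid_list).

I found code like this here (and variants of it) that sort of works: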

assay = driver.find_element_by_xpath("//legend[contains(.,'Assay')]//..//td[contains(text(),'.')]/following::td[1]").text

In this case, the output is the publication 'Tang YQ et al. (1999) Science 286:498-502'.

I have tried other variants of that line. For example, if I change it to:

assay = driver.find_element_by_xpath("//legend[contains(.,'Assay')]//..//td[contains(text(),'Anti-bacterial')]/following::td[1]").text

The output is:

RTD-1-3 shown to possess anti-bacterial activity [...] Tran D et al. (2002) J Biol Chem 277:3079-84
RTD-1 found to possess anti-microbial activity greater than the acyclic analogue. [...] Tang YQ et al. (1999) Science 286:498-502

You can see that the specific table I want appears to be a table within a table.

How can I adapt this line of code to return only:

Antibacterial     ...and the set of 'list_UIDs' (i.e. the number in the href) for these rows
Membrane-binding assay   ...and the set of 'list_UIDs' (i.e. the number in the href) for these rows.

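There is one important caveat: the assay names (i.e. Anti-bacterial and Membrane-binding assay) are not constant between pages. A different page may have completely different assay names. This is where I am stuck: how do I return the assays and UIDs without identifying the text by a specific word (e.g. Anti-bacterial)?

Edit 1: Following the suggestion below, I tried this: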

##second box
assaynames = []
assays = driver.find_elements_by_css_selector("#main > fieldset:nth-child(4) > table > tbody > tr > td > table > tbody > tr > td")
i = 0
for assay in assays:
        if i==0 or i%2==0:
                assaynames.append(assay.text)
                i+=1


print(assaynames)

The output is:

RTD-1
RCICTRGFCRCLCRRGVC
Primate
2081.56
2079.91
2080.92
Wild type

Macaca mulatta (rhesus monkey)
Theta-defensin.
Yes
['Anti-bacterial']
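
A possible explanation for the single name in the last line of the output: in the loop above, `i+=1` is indented under the `if`, so the counter stops advancing after the first cell and the condition is never true again. The sketch below reproduces this with a plain list standing in for the `assays` elements. The cell texts are made up for illustration, and it assumes the cells alternate between an assay name and its details.

```python
# Made-up cell texts standing in for the elements returned by the selector.
cells = ["Anti-bacterial", "details A", "Membrane-binding assay", "details B"]

# Counter advanced only inside the `if`, as in the loop above.
names_stuck = []
i = 0
for cell in cells:
    if i % 2 == 0:
        names_stuck.append(cell)
        i += 1

# Counter advanced on every iteration (enumerate does the counting).
names_all = []
for i, cell in enumerate(cells):
    if i % 2 == 0:
        names_all.append(cell)

print(names_stuck)  # ['Anti-bacterial']
print(names_all)    # ['Anti-bacterial', 'Membrane-binding assay']
```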

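I am just wondering how to extract all of the assays in the Assays box? So, in this case, the information for both Anti-bacterial and Membrane-binding assay? The other part missing from the suggestion below is extracting the PMID references (see the original question).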

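Best answer

Your solution could look like this.

I fetch the table cells; the elements at even positions (index modulo 2 equals zero) are your assay names: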

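assaynames = []
assays = driver.find_elements_by_css_selector("#main > fieldset:nth-child(4) > table > tbody > tr > td > table > tbody > tr > td")
i = 0
for assay in assays:
    # Cells at even positions (0, 2, 4, ...) hold the assay names.
    if i % 2 == 0:
        assaynames.append(assay.text)
    # Advance the counter for every cell, not only the matching ones.
    i += 1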
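
The loop above collects only the names. A row-by-row variant could return each name together with the UIDs from the links in the same row, without matching on any assay name. The following is a minimal sketch under stated assumptions, not a tested solution for this page: `collect_assays` and `last_number` are hypothetical helper names, the row XPath assumes the "Assays" fieldset has the same nested-table layout as the CSS selector above, and the UID is assumed to be the last run of digits in each href. The locator strategy is passed as the plain string "xpath", which is the value of `By.XPATH`.

```python
import re

# Assumed layout, mirroring the CSS selector in the answer:
# fieldset > table > tbody > tr > td > table > tbody > tr
ROW_XPATH = ("//fieldset[legend[contains(., 'Assay')]]"
             "/table/tbody/tr/td/table/tbody/tr")


def last_number(href):
    """Return the last run of digits in href, or None if there is none."""
    numbers = re.findall(r"\d+", href or "")
    return numbers[-1] if numbers else None


def collect_assays(driver):
    """Map each assay name to the list of UIDs linked from its row."""
    results = {}
    for row in driver.find_elements("xpath", ROW_XPATH):
        # Direct child cells only: first cell is the name, second the details.
        cells = row.find_elements("xpath", "./td")
        if len(cells) < 2:
            continue
        name = cells[0].text.strip()
        # All links anywhere inside the details cell, including nested tables.
        links = cells[1].find_elements("xpath", ".//a")
        uids = [last_number(link.get_attribute("href")) for link in links]
        results[name] = [uid for uid in uids if uid is not None]
    return results
```

With a live session, `collect_assays(driver)` would be called after `driver.get(...)`. If the page layout differs from the assumed one, `ROW_XPATH` is the part to adjust.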

Regarding Python and Selenium: identification of elements in 'tables within tables' without using a keyword, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57757735/
