python - Opening a connection and getting a response takes too much time

Tags: python sparql sparqlwrapper

I wrote a Python script that uses SPARQL to query this endpoint for some information about genes. The script works as follows:

Get genes
Foreach gene:
    Get proteins
        Foreach protein
            Get the protein function
            .....
    Get Taxons
    ....
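In code, the nested workflow above might look roughly like this with SPARQLWrapper (the endpoint URL, predicate, and function names here are illustrative, not the actual script):

```python
from textwrap import dedent

ENDPOINT = "https://sparql.example.org/sparql"  # hypothetical endpoint URL


def build_protein_query(gene_uri):
    """Build the per-gene protein query (the predicate is illustrative)."""
    return dedent("""\
        SELECT ?protein WHERE {
          <%s> <http://example.org/encodes> ?protein .
        }""") % gene_uri


def run_query(endpoint, query):
    """Send one query and return the parsed JSON results.

    Each call opens a fresh HTTP connection, which is the per-query
    overhead the profile below attributes to urlopen/connect.
    """
    from SPARQLWrapper import SPARQLWrapper, JSON  # third-party package
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()


def extract_genes(gene_uris):
    """Mirror of the nested loop: one request per gene (then per protein, etc.)."""
    info = {}
    for gene in gene_uris:
        results = run_query(ENDPOINT, build_protein_query(gene))
        info[gene] = [b["protein"]["value"]
                      for b in results["results"]["bindings"]]
    return info
```

With 60,000+ iterations, the cost of `run_query` opening a new connection each time dominates, as the profile below shows.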

But the script takes far too long to run. I profiled it with pyinstrument and got the following results:

  39.481 <module>  extracting_genes.py:10
  `- 39.282 _main  extracting_genes.py:750
     |- 21.629 create_prot_func_info_dico  extracting_genes.py:613
     |  `- 21.609 get_prot_func_info  extracting_genes.py:216
     |     `- 21.596 query  build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:780
     |        `- 21.596 _query  build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:750
     |           `- 21.588 urlopen  urllib2.py:131
     |              `- 21.588 open  urllib2.py:411
     |                 `- 21.588 _open  urllib2.py:439
     |                    `- 21.588 _call_chain  urllib2.py:399
     |                       `- 21.588 http_open  urllib2.py:1229
     |                          `- 21.588 do_open  urllib2.py:1154
     |                             |- 11.207 request  httplib.py:1040
     |                             |  `- 11.207 _send_request  httplib.py:1067
     |                             |     `- 11.205 endheaders  httplib.py:1025
     |                             |        `- 11.205 _send_output  httplib.py:867
     |                             |           `- 11.205 send  httplib.py:840
     |                             |              `- 11.205 connect  httplib.py:818
     |                             |                 `- 11.205 create_connection  socket.py:541
     |                             |                    `- 9.552 meth  socket.py:227
     |                             `- 10.379 getresponse  httplib.py:1084
     |                                `- 10.379 begin  httplib.py:431
     |                                   `- 10.379 _read_status  httplib.py:392
     |                                      `- 10.379 readline  socket.py:410
     |- 6.045 create_gene_info_dico  extracting_genes.py:323
     |  `- 6.040 ...
     |- 3.957 create_prots_info_dico  extracting_genes.py:381
     |  `- 3.928 ...
     |- 3.414 create_taxons_info_dico  extracting_genes.py:668
     |  `- 3.414 ...
     |- 3.005 create_prot_parti_info_dico  extracting_genes.py:558
     |  `- 2.999 ...
     `- 0.894 create_prot_loc_info_dico  extracting_genes.py:504
        `- 0.893 ...

Basically, I execute many queries a great number of times (60,000+), so my understanding is that opening a connection and getting a response happens over and over, which slows down execution.

Does anyone know how to fix this?

Best answer

As @Stanislav pointed out, the urllib2 used by SPARQLWrapper doesn't support persistent connections, but I found a way to keep the connection alive, using the setUseKeepAlive() function defined in SPARQLWrapper/Wrapper.py.

I first had to install the keepalive package:

pip install keepalive

It reduced the execution time by almost 40%.

def get_all_genes_uri(endpoint, the_offset):
    sparql = SPARQLWrapper(endpoint)
    sparql.setUseKeepAlive() # <--- Added this line
    sparql.setQuery("""
        #My_query
    """)
    ....

And got the following results:

  24.673 <module>  extracting_genes.py:10
  `- 24.473 _main  extracting_genes.py:750
     |- 12.314 create_prot_func_info_dico  extracting_genes.py:613
     |  `- 12.068 get_prot_func_info  extracting_genes.py:216
     |     |- 11.428 query  build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:780
     |     |  `- 11.426 _query  build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:750
     |     |     `- 11.353 urlopen  urllib2.py:131
     |     |        `- 11.353 open  urllib2.py:411
     |     |           `- 11.339 _open  urllib2.py:439
     |     |              `- 11.338 _call_chain  urllib2.py:399
     |     |                 `- 11.338 http_open  keepalive/keepalive.py:343
     |     |                    `- 11.338 do_open  keepalive/keepalive.py:213
     |     |                       `- 11.329 _reuse_connection  keepalive/keepalive.py:264
     |     |                          `- 11.280 getresponse  httplib.py:1084
     |     |                             `- 11.262 begin  httplib.py:431
     |     |                                `- 11.207 _read_status  httplib.py:392
     |     |                                   `- 11.204 readline  socket.py:410
     |     `- 0.304 __init__  build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:261
     |        `- 0.292 resetQuery  build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:301
     |           `- 0.288 setQuery  build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:516
     |- 4.894 create_gene_info_dico  extracting_genes.py:323
     |  `- 4.880 ...
     |- 2.631 create_prots_info_dico  extracting_genes.py:381
     |  `- 2.595 ...
     |- 1.933 create_taxons_info_dico  extracting_genes.py:668
     |  `- 1.923 ...
     |- 1.804 create_prot_parti_info_dico  extracting_genes.py:558
     |  `- 1.780 ...
     `- 0.514 create_prot_loc_info_dico  extracting_genes.py:504
        `- 0.510 ...

Honestly, the execution time is still not as fast as I would like, so I'll keep looking for other things I can do.
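One further option worth trying (a sketch, not part of the original answer) is to cut the number of round trips altogether by batching many genes into a single query with a SPARQL `VALUES` clause. The predicate and variable names below are hypothetical:

```python
def build_batched_protein_query(gene_uris):
    """Fetch proteins for many genes in one request via a VALUES clause.

    One query for N genes replaces N single-gene queries, so the
    connection and round-trip overhead is paid once per batch instead
    of once per gene. The predicate here is illustrative only.
    """
    values = " ".join("<%s>" % uri for uri in gene_uris)
    return """
        SELECT ?gene ?protein WHERE {
          VALUES ?gene { %s }
          ?gene <http://example.org/encodes> ?protein .
        }
    """ % values
```

The results then come back with a `?gene` column, so they can be grouped client-side into the same per-gene dictionaries the script builds now. Batch size would need tuning against the endpoint's query-length and result-size limits.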

Regarding python - opening a connection and getting a response takes too much time, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51591872/
