python - Extracting data from a URL result with a special format

Tags: python parsing url

I have a URL:
http://somewhere.com/relatedqueries?limit=2&query=seedterm

Changing its two inputs, limit and query, produces the data I want. limit is the maximum number of terms to return, and query is the seed term.

The URL returns a text result formatted like this:
oo.visualization.Query.setResponse({version:'0.5',reqId:'0',status:'ok',sig:'1303596067112929220',table:{cols:[{id:'score',label:'Score ',type:'number',pattern:'#,##0.###'},{id:'query',label:'Query',type:'string',pattern:''}],rows:[{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm1'}]},{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm2' }]}],p:{'totalResultsCount':'7727'}}});

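I want to write a Python script that takes two arguments (the limit and the query seed), fetches the data from the web, parses the result, and returns a list of the new terms, in this case ['newterm1', 'newterm2'].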

I would appreciate some help, especially with fetching the URL, since I have never done that before.

Best answer

It sounds like you can break this problem down into several subproblems.

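Subproblems

A few problems need to be solved before writing the complete script:

  1. Forming the request URL: create a configured request URL from a template
  2. Retrieving the data: actually make the request
  3. Unwrapping JSONP: the returned data appears to be JSON wrapped in a JavaScript function call
  4. Traversing the object graph: navigate the result to find the pieces of information you need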

Forming the request URL

This is just simple string formatting.

url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
url = url_template.format(limit=2, seedterm='seedterm')

Python 2 Note

str.format requires Python 2.6 or later. On older versions, use the string formatting operator (%) instead.

url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
url = url_template % dict(limit=2, seedterm='seedterm')
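
One caveat with plain string formatting is that seedterm goes into the URL verbatim, so a seed term containing spaces or characters such as & would produce a malformed query string. Below is a minimal Python 3 sketch that percent-encodes the parameters with urllib.parse.urlencode. The function name build_url and the constant BASE_URL are illustrative only, not part of the original answer.

```python
from urllib.parse import urlencode

BASE_URL = 'http://somewhere.com/relatedqueries'  # endpoint from the question

def build_url(limit, seedterm):
    """Return the request URL with both parameters percent-encoded."""
    query_string = urlencode({'limit': limit, 'query': seedterm})
    return '{}?{}'.format(BASE_URL, query_string)

# A seed term with a space is encoded instead of breaking the URL
print(build_url(2, 'seed term'))
```

For a plain seed term such as 'seedterm', this produces the same URL as the template above.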

Retrieving the data

You can use the built-in urllib.request module for this.

import urllib.request
data = urllib.request.urlopen(url) # url from previous section

This returns a file-like object named data. You can also use a with statement here:

with urllib.request.urlopen(url) as data:
    pass  # do processing here

Python 2 Note

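Import urllib2 instead of urllib.request. The object returned by urllib2.urlopen is not a context manager, so close it explicitly (or wrap it in contextlib.closing) rather than using it directly in a with statement.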
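
In Python 3, read() returns bytes rather than str, so the response has to be decoded before any string processing. Below is a hedged sketch of a small fetch helper. The name fetch_text, the 10-second timeout, and the UTF-8 fallback are assumptions for illustration, not something the server is known to require.

```python
import urllib.request

def fetch_text(url, timeout=10):
    """Fetch url and return the response body as a str."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the response headers if there is one;
        # falling back to UTF-8 is an assumption.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset)
```

Calling fetch_text(url) with the URL from the previous section would then return the raw JSONP text as a string.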

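Unwrapping JSONP

The result you pasted looks like JSONP. Assuming the wrapping function being called (oo.visualization.Query.setResponse) does not change, we can simply strip off this method call.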

result = data.read().decode('utf-8')  # read() returns bytes in Python 3; UTF-8 is assumed

prefix = 'oo.visualization.Query.setResponse('
suffix = ');'

if result.startswith(prefix) and result.endswith(suffix):
    result = result[len(prefix):-len(suffix)]
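
If the callback name might vary between requests, a slightly more general approach is to match any function-call wrapper with a regular expression rather than a fixed prefix. This is a minimal sketch: unwrap_jsonp is a hypothetical helper name, and the pattern assumes the whole response is a single call of the form name(payload) with an optional trailing semicolon.

```python
import re

# identifier (dots allowed), "(", payload, ")", optional ";"
JSONP_PATTERN = re.compile(r'\s*[\w.$]+\s*\((.*)\)\s*;?\s*', re.DOTALL)

def unwrap_jsonp(text):
    """Return the payload inside a JSONP wrapper, or text unchanged."""
    match = JSONP_PATTERN.fullmatch(text)
    return match.group(1) if match else text
```

Text that is not wrapped in a function call is returned as-is, so the helper can be applied unconditionally.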

Parsing JSON

The resulting result string is just JSON data. Parse it with the built-in json module.

import json

result_object = json.loads(result)
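
One caveat: json.loads only accepts strict JSON, with double-quoted keys and strings. The sample pasted in the question uses unquoted keys and single-quoted strings, which json.loads rejects with a ValueError. If the server really responds in that form, one possible workaround is to rewrite the text into strict JSON first. The sketch below is a fragile heuristic that assumes string values contain no quote characters and no key-like text, and parse_lenient is a hypothetical helper name.

```python
import json
import re

# An unquoted key is an identifier that follows "{" or "," and precedes ":".
UNQUOTED_KEY = re.compile(r'(?<=[{,])\s*([A-Za-z_]\w*)\s*:')

def parse_lenient(text):
    """Parse strict JSON, or a JavaScript-style object literal as a fallback."""
    try:
        return json.loads(text)
    except ValueError:
        quoted = UNQUOTED_KEY.sub(r'"\1":', text)  # quote bare keys
        quoted = quoted.replace("'", '"')          # single -> double quotes
        return json.loads(quoted)
```

Strict JSON takes the first path unchanged, so the fallback only runs when the normal parse fails.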

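Traversing the object graph

Now you have a result_object representing the JSON response. The object itself is a dict with keys such as version, reqId, and so on. Based on your question, here is what you need to do to create your list.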

# Get the rows in the table, then get the second column's value for
# each row
terms = [row['c'][1]['v'] for row in result_object['table']['rows']]
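
Rather than hard-coding the column position, the column can be looked up by its id in the table's cols list, which keeps the code working if the server reorders columns. This is a small sketch using a trimmed-down copy of the response from the question. extract_column is a hypothetical helper name.

```python
def extract_column(result_object, column_id='query'):
    """Return the 'v' value of the column with the given id, for every row."""
    table = result_object['table']
    column_ids = [col['id'] for col in table['cols']]
    index = column_ids.index(column_id)  # raises ValueError if the id is absent
    return [row['c'][index]['v'] for row in table['rows']]

# Trimmed-down version of the parsed response shown in the question
sample = {
    'table': {
        'cols': [{'id': 'score'}, {'id': 'query'}],
        'rows': [
            {'c': [{'v': 0.99}, {'v': 'newterm1'}]},
            {'c': [{'v': 0.99}, {'v': 'newterm2'}]},
        ],
    }
}
print(extract_column(sample))  # ['newterm1', 'newterm2']
```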

Putting it all together

#!/usr/bin/env python3

"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python3 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import urllib.request
import json
import sys

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'

    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]

    # Deserialize JSON to Python objects
    result_object = json.loads(result)

    # Get the rows in the table, then get the second column's value
    # for each row
    return [row['c'][1]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
    url = url_template.format(limit=limit, seedterm=seedterm)

    try:
        with urllib.request.urlopen(url) as data:
            # read() returns bytes; decoding as UTF-8 is an assumption
            result = data.read().decode('utf-8')
    except OSError:
        # urllib.error.URLError is a subclass of OSError
        print('Could not request data from server', file=sys.stderr)
        sys.exit(E_OPERATION_ERROR)

    return parse_result(result)

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print(term)

if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        error_message = '''{} limit seedterm

limit must be an integer'''.format(sys.argv[0])
        print(error_message, file=sys.stderr)
        sys.exit(E_INVALID_PARAMS)

    exit(main(limit, seedterm))

Python 2.7 version

#!/usr/bin/env python2.7

"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python2.7 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import urllib2
import json
import sys

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'

    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]

    # Deserialize JSON to Python objects
    result_object = json.loads(result)

    # Get the rows in the table, then get the second column's value
    # for each row
    return [row['c'][1]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
    url = url_template % dict(limit=limit, seedterm=seedterm)

    try:
        # urllib2 responses are not context managers, so close explicitly
        data = urllib2.urlopen(url)
        try:
            result = data.read()
        finally:
            data.close()
    except IOError:
        # urllib2.URLError is a subclass of IOError
        sys.stderr.write('%s\n' % 'Could not request data from server')
        sys.exit(E_OPERATION_ERROR)

    return parse_result(result)

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print term

if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        error_message = '''{} limit seedterm

limit must be an integer'''.format(sys.argv[0])
        sys.stderr.write('%s\n' % error_message)
        sys.exit(E_INVALID_PARAMS)

    exit(main(limit, seedterm))

This page, python - Extracting data from a URL result with a special format, is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/4056375/
