google-trends - Automatically extracting CSV files from Google Trends

Tags: google-trends

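pyGTrends does not seem to work: it raises errors in Python.

pyGoogleTrendsCsvDownloader seems to work and logs in, but after 1-3 requests (per day!) it reports that the quota is exhausted, even though manual downloads with the same login and IP work flawlessly.

Bottom line: neither works. Searching Stack Overflow turns up many questions from people trying to pull CSVs from Google, but I could not find a workable solution...

Thanks in advance to anyone who can help. How should the code be changed? Do you know of another solution that works?

Here is the code of pyGoogleTrendsCsvDownloader.py: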

    import httplib
    import urllib
    import urllib2
    import re
    import csv
    import lxml.etree as etree
    import lxml.html as html
    import traceback
    import gzip
    import random
    import time
    import sys

    from cookielib import Cookie, CookieJar
    from StringIO import StringIO


    class pyGoogleTrendsCsvDownloader(object):
        '''
        Google Trends Downloader

        Recommended usage:
            from pyGoogleTrendsCsvDownloader import pyGoogleTrendsCsvDownloader
            r = pyGoogleTrendsCsvDownloader(username, password)
            r.get_csv(cat='0-958', geo='US-ME-500')
        '''

        def __init__(self, username, password):
            '''
            Provide login and password to be used to connect to Google Trends.
            All immutable system variables are also defined here.
            '''
            # The amount of time (in secs) that the script should wait before making a request.
            # This can be used to throttle the downloading speed to avoid hitting servers too hard.
            # It is further randomized.
            self.download_delay = 0.25

            self.service = "trendspro"
            self.url_service = "http://www.google.com/trends/"
            self.url_download = self.url_service + "trendsReport?"

            self.login_params = {}
            # These headers are necessary, otherwise Google will flag the request at your account level
            self.headers = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'),
                            ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
                            ("Accept-Language", "en-gb,en;q=0.5"),
                            ("Accept-Encoding", "gzip, deflate"),
                            ("Connection", "keep-alive")]
            self.url_login = ('https://accounts.google.com/ServiceLogin?service=' + self.service +
                              '&passive=1209600&continue=' + self.url_service +
                              '&followup=' + self.url_service)
            self.url_authenticate = 'https://accounts.google.com/accounts/ServiceLoginAuth'
            self.header_dictionary = {}

            self._authenticate(username, password)

        def _authenticate(self, username, password):
            '''
            Authenticate to Google:
            1 - make a GET request to the Login webpage so we can get the login form
            2 - make a POST request with email, password and login form input values
            '''
            # Make sure we get CSV results in English
            ck = Cookie(version=0, name='I4SUserLocale', value='en_US',
                        port=None, port_specified=False,
                        domain='www.google.com', domain_specified=False, domain_initial_dot=False,
                        path='/trends', path_specified=True,
                        secure=False, expires=None, discard=False,
                        comment=None, comment_url=None, rest=None)

            self.cj = CookieJar()
            self.cj.set_cookie(ck)
            self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cj))
            self.opener.addheaders = self.headers

            # Get all of the login form input values
            find_inputs = etree.XPath("//form[@id='gaia_loginform']//input")
            try:
                resp = self.opener.open(self.url_login)

                # The server may send the page gzip-compressed
                if resp.info().get('Content-Encoding') == 'gzip':
                    buf = StringIO(resp.read())
                    f = gzip.GzipFile(fileobj=buf)
                    data = f.read()
                else:
                    data = resp.read()

                xmlTree = etree.fromstring(data, parser=html.HTMLParser(recover=True, remove_comments=True))

                # Copy every named form field (including hidden ones) into the login parameters
                for field in find_inputs(xmlTree):
                    name = field.get('name')
                    if name:
                        name = name.encode('utf8')
                        value = field.get('value', '').encode('utf8')
                        self.login_params[name] = value
            except Exception:
                print("Exception while parsing: %s\n" % traceback.format_exc())

            self.login_params["Email"] = username
            self.login_params["Passwd"] = password

            params = urllib.urlencode(self.login_params)
            self.opener.open(self.url_authenticate, params)

        def get_csv(self, throttle=False, **kwargs):
            '''
            Download CSV reports
            '''
            # Randomized download delay
            if throttle:
                r = random.uniform(0.5 * self.download_delay, 1.5 * self.download_delay)
                time.sleep(r)

            params = {
                'export': 1
            }
            params.update(kwargs)
            params = urllib.urlencode(params)

            r = self.opener.open(self.url_download + params)

            # Make sure everything is working ;)
            # Note: this only checks that a file attachment came back. Any response
            # without a Content-Disposition header is reported as a quota error,
            # so other failures (for example a failed login) may look the same.
            if not r.info().has_key('Content-Disposition'):
                print("You've exceeded your quota. Continue tomorrow...")
                sys.exit(0)

            if r.info().get('Content-Encoding') == 'gzip':
                buf = StringIO(r.read())
                f = gzip.GzipFile(fileobj=buf)
                data = f.read()
            else:
                data = r.read()

            # Name the output file after the query parameters, e.g. trends_cat-0-958_geo-US-ME-500.csv
            filename = 'trends_%s.csv' % '_'.join(['%s-%s' % (key, value) for (key, value) in kwargs.items()])
            with open(filename, 'w') as myFile:
                myFile.write(data)
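
The script above repeats the same gzip handling in `_authenticate` and `get_csv`, and it treats any response without a `Content-Disposition` header as a quota error. As a minimal sketch, written for Python 3 rather than the Python 2 of the original script, both checks could be factored into small helpers. The names `decode_body` and `is_csv_attachment` are hypothetical and do not come from the original script, and the sample payload is made up.

```python
import gzip
import io


def decode_body(raw, content_encoding=None):
    """Return the response body as bytes, decompressing it if it is gzip-encoded."""
    if content_encoding == "gzip":
        with gzip.GzipFile(fileobj=io.BytesIO(raw)) as f:
            return f.read()
    return raw


def is_csv_attachment(headers):
    """Return True if the headers announce a file download.

    `headers` is any mapping of header names to values; names are
    compared case-insensitively.
    """
    return any(name.lower() == "content-disposition" for name in headers)


if __name__ == "__main__":
    # Made-up payload standing in for a downloaded report.
    payload = b"week,interest\n2013-01-06,42\n"
    compressed = gzip.compress(payload)
    print(decode_body(compressed, "gzip") == payload)
    print(is_csv_attachment({"Content-Disposition": "attachment; filename=report.csv"}))
```

Keeping the attachment check separate from the error message makes it easier to report "no CSV was returned" without assuming that the quota is the cause.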

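Best answer

I don't know Python, but I may have a solution. I am currently doing the same thing in C#. I don't get the .csv file; instead, I build a custom URL in code, download the HTML it returns, and save it to a text file (also in code). Line 12 of that HTML contains all the information needed to create the chart shown on Google Trends. It also contains a lot of unnecessary text that has to be stripped out, but either way you end up with the same result: the Google Trends data. I posted a more detailed answer to my own question here: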

Downloading .csv file from Google Trends
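
The approach described in this answer (build a URL in code, download the HTML, save it to a text file, then pull the data out of one line) might look roughly like the following Python 3 sketch. The base URL, the query parameter names, and the helper names `build_url`, `save_page` and `nth_line` are placeholders for illustration only: the answer does not give the actual URL format, and the line number 12 comes from the answer and may not match what Google serves today.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base_url, **params):
    """Append URL-encoded query parameters (sorted by name) to base_url."""
    return base_url + "?" + urlencode(sorted(params.items()))


def save_page(url, path):
    """Download url and save the raw response body to path (needs network access)."""
    with urlopen(url) as resp, open(path, "wb") as out:
        out.write(resp.read())


def nth_line(text, line_number):
    """Return the 1-based line_number-th line of text, or None if it is missing."""
    lines = text.splitlines()
    if 1 <= line_number <= len(lines):
        return lines[line_number - 1]
    return None


if __name__ == "__main__":
    # Placeholder URL and parameters; not the real Google Trends endpoint format.
    url = build_url("https://example.com/trends/report", q="python", geo="US")
    print(url)
    # Made-up page content standing in for the downloaded HTML.
    sample = "\n".join("line %d" % i for i in range(1, 21))
    print(nth_line(sample, 12))
```

The line returned by `nth_line` would still need the cleanup the answer mentions before the chart data could be used.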

Regarding "google-trends - Automatically extracting CSV files from Google Trends", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/14772235/
