go - Output of a GET request differs from View Source

Tags: go get

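I am trying to extract match data from whoscored.com. When I view the page source in Firefox, I find a large JSON string on line 816 that contains the data I want for that match ID. My goal is to end up with this JSON.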

To do this, I tried downloading each page at https://www.whoscored.com/Matches/ID/Live, where ID is the ID of the match. I wrote a small Go program that sends a GET request for each ID up to a certain point:

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "os"
)

// http://www.whoscored.com/Matches/614052/Live is the match for
// Everton vs Manchester
const match_address = "http://www.whoscored.com/Matches/"

// the max id we get
const max_id = 300
const num_workers = 10

// match_fetch downloads the page for the given match id and saves it to disk
func match_fetch(matchid int) {
    url := fmt.Sprintf("%s%d/Live", match_address, matchid)

    resp, err := http.Get(url)
    if err != nil {
        fmt.Println(err)
        return
    }

    // if we successfully got a response, read the
    // whole body into memory
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        fmt.Println(err)
        return
    }

    // write the body to a file under ./match_data (the directory must already exist)
    pwd, _ := os.Getwd()
    filepath := fmt.Sprintf("%s/match_data/%d", pwd, matchid)
    err = ioutil.WriteFile(filepath, body, 0644)
    if err != nil {
        fmt.Println(err)
        return
    }
}

// data type to send to the workers,
// last means this job is the last one
// matchid is the match id to be fetched
// a matchid of -1 means don't fetch a match
type job struct {
    last    bool
    matchid int
}

func create_worker(jobs chan job) {
    for {
        next_job := <-jobs
        if next_job.matchid != -1 {
            match_fetch(next_job.matchid)
        }
        if next_job.last {
            return
        }
    }
}

func main() {
    // do the Everton match as a reference
    match_fetch(614052)

    var joblist [num_workers]chan job
    var v int

    for i := 0; i < num_workers; i++ {
        job_chan := make(chan job)
        joblist[i] = job_chan
        go create_worker(job_chan)
    }
    for i := 0; i < max_id; i = i + num_workers {
        for index, c := range joblist {
            if i+index < max_id {
                v = i + index
            } else {
                v = -1
            }
            c <- job{false, v}
        }
    }
    for _, c := range joblist {
        c <- job{true, -1}
    }
}

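The code seems to work, in that it fills a directory called match_data with HTML. The problem is that this HTML is completely different from what I get in the browser! Here is the part that I think is responsible (from the body of the GET request to http://www.whoscored.com/Matches/614052/Live):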

(function() { 

var z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D227374617274223B7661722074696D696E673D6E65772041727261792833293B77696E646F772E6F6E756E6C6F61643D66756E6374696F6E28297B74696D696E675B325D3D22723A222B286E6577204461746528292E67657454696D6528292D74293B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B69662877696E646F772E584D4C4874747052657175657374297B7868723D6E657720584D4C48747470526571756573747D656C73657B7868723D6E657720416374697665584F626A65637428224D6963726F736F66742E584D4C4854545022297D7868722E6F6E726561647973746174656368616E67653D66756E6374696F6E28297B737769746368287868722E72656164795374617465297B6361736520303A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374206E6F7420696E697469616C697A656420223B627265616B3B6361736520313A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2073657276657220636F6E6E656374696F6E2065737461626C6973686564223B627265616B3B6361736520323A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374207265636569766564223B627265616B3B6361736520333A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2070726F63657373696E672072657175657374223B627265616B3B6361736520343A7374617475733D22636F6D706C657465223B74696D696E675B315D3D22633A222B286E6577204461746528292E67657454696D6528292D74293B6966287868722E7374617475733D3D323030297B706172656E742E6C6F636174696F6E2E72656C6F616428297D627265616B7D7D3B74696D696E675B305D3D22733A222B286E6577204461746528292E67657454696D6528292D74293B7868722E6F70656E2822474554222C222F5F496E63617073756C615F5265736F757263653F535748414E45444C3D313536343032333530343538313538333938362C31373139363833393832313930303534313833392C31333935303737313737393531363432383234342C3132363636222C66616C7365293B7868
722E73656E64286E756C6C297D63617463682863297B7374617475732B3D6E6577204461746528292E67657454696D6528292D742B2220696E6361705F6578633A20222B633B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval('String.fromCharCode('+z+')'));})();
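The script above turns each pair of hex digits in b into a character code and then evals the resulting text. To read what it does without running it, the same decoding can be done in Go. This is a minimal sketch, assuming the payload is plain ASCII; the helper name decodeHexScript is hypothetical, and only the first 12 bytes of b are used here to keep the example short.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// decodeHexScript (hypothetical helper) converts a string of two-digit hex
// pairs into the text it encodes. This mirrors the parseInt(..., 16) and
// String.fromCharCode loop in the page, assuming the payload is ASCII.
func decodeHexScript(hexStr string) (string, error) {
	raw, err := hex.DecodeString(hexStr)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

func main() {
	// The first 24 hex digits of the b string from the response body.
	const prefix = "7472797B766172207868723B"

	js, err := decodeHexScript(prefix)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(js) // prints: try{var xhr;
}
```

Decoding further into the string shows requests to a path beginning with /_Incapsula_Resource followed by a page reload, which suggests the body is a challenge page served in place of the match page rather than the match page itself.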

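I think this happens because the JavaScript in the page fetches the DOM and edits it into what I see in View Source. How can I get Go to run the JavaScript? Is there a library that does this? Better still, can I get the JSON directly from the server?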

Best answer

This can be done with https://godoc.org/github.com/sourcegraph/webloop#View.EvaluateJavaScript. Read their main example at https://github.com/sourcegraph/webloop

In general, what you need is a "headless browser".

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/36474438/
