go - How to parse an HTTP GET response in Golang

Tags: go html-parsing html-escape-characters

I get the following kind of response from a URL I request, and I need to parse it to get the HTML I want.

this=ajax({"htmlInfo":"SOME-HTML", "otherInfo": "Blah Blah", "moreInfo": "Bleh Bleh"})

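As shown above, there are three key-value pairs, and I need to get "SOME-HTML" out of them. How can I do that? The main problem is that "SOME-HTML" contains escape characters. Here is the kind of response that comes back: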

\u003Cdiv class=\u0022container columns-2\u0022\u003E\n\n \u003Csection class=\u0022col-main\u0022\u003E\n \r\n\u003Cdiv class=\u0027visor-article-list list list-view-recent\u0027 \u003E\r\n\u003Cdiv class=\u0027grid_item visor-article-teaser list_default\u0027 \u003E\n \u003Ca class=\u0027grid_img\u0027 href=\u0027/manUnited-is-the-best\u0027\u003E\n \u003Cimg src=\u0022http://www.xyz.com/sites//files/styles/w400h22

Can anyone help me with this? I'm not sure how to approach it.

Thanks in advance.

Best answer

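The simplest approach is to extract the JSON and then decode it into a struct. The \uXXXX sequences are JSON Unicode escapes (for example, \u003C is "<"), and encoding/json decodes them for you.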

package main

import (
    "encoding/json"
    "fmt"
    "regexp"
)

// Data follows the structure of the JSON data in the response
type Data struct {
    HTMLInfo  string `json:"htmlInfo"`
    OtherInfo string `json:"otherInfo"`
    MoreInfo  string `json:"moreInfo"`
}

func main() {
    // input is an example of the raw response data. It's probably a []byte if
    // you got it from io.ReadAll(resp.Body) (ioutil.ReadAll before Go 1.16).
    input := []byte(`this=ajax({"htmlInfo":"\u003Cdiv class=\u0022container columns-2\u0022\u003E\n\n \u003Csection class=\u0022col-main\u0022\u003E\n \r\n\u003Cdiv class=\u0027visor-article-list list list-view-recent\u0027 \u003E\r\n\u003Cdiv class=\u0027grid_item visor-article-teaser list_default\u0027 \u003E\n \u003Ca class=\u0027grid_img\u0027 href=\u0027/manUnited-is-the-best\u0027\u003E\n \u003Cimg src=\u0022http://example.com/sites//files/styles/w400h22", "otherInfo": "Blah Blah", "moreInfo": "Bleh Bleh"})`)

    // First we want to extract the data json using regex with a capture group.
    dataRegex, err := regexp.Compile("ajax\\((.*)\\)")
    if err != nil {
        fmt.Println("regex failed to compile:", err)
        return
    }

    // FindSubmatch should return two matches:
    // 0: The full match
    // 1: The contents of the capture group (what we want)
    matches := dataRegex.FindSubmatch(input)
    if len(matches) != 2 {
        fmt.Println("incorrect number of match results:", len(matches))
        return
    }
    dataJSON := matches[1]

    // Since the data is in JSON format, we can unmarshal it into a struct.  If
    // you don't care at all about the fields other than "htmlInfo", you can
    // omit them from the struct.
    data := &Data{}
    if err := json.Unmarshal(dataJSON, data); err != nil {
        fmt.Println("failed to unmarshal data json:", err)
        return
    }

    // You now have access to the "htmlInfo" property
    fmt.Println("HTML INFO:", data.HTMLInfo)
}

This produces:

HTML INFO: <div class="container columns-2">

 <section class="col-main">

<div class='visor-article-list list list-view-recent' >
<div class='grid_item visor-article-teaser list_default' >
 <a class='grid_img' href='/manUnited-is-the-best'>
 <img src="http://example.com/sites//files/styles/w400h22
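
As a variation on the answer above, the wrapper can also be stripped without a regular expression. This may be worth considering because in Go's regexp syntax `.` does not match a newline by default, so the pattern above would presumably fail on a body that contains literal line breaks. The sketch below assumes the payload is everything between the first "(" and the last ")" of the body; `extractHTMLInfo` is a hypothetical helper name, not part of the original answer, and the sample body is made up for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// extractHTMLInfo is a hypothetical helper. It assumes the response looks like
// prefix(JSON), takes everything between the first "(" and the last ")",
// and decodes only the "htmlInfo" field.
func extractHTMLInfo(body []byte) (string, error) {
	start := bytes.IndexByte(body, '(')
	end := bytes.LastIndexByte(body, ')')
	if start < 0 || end <= start {
		return "", errors.New("no wrapped JSON payload found")
	}

	// Fields that are not declared here are simply ignored by encoding/json.
	var payload struct {
		HTMLInfo string `json:"htmlInfo"`
	}
	if err := json.Unmarshal(body[start+1:end], &payload); err != nil {
		return "", fmt.Errorf("decoding payload: %w", err)
	}
	return payload.HTMLInfo, nil
}

func main() {
	// Made-up sample body in the same shape as the response in the question.
	body := []byte(`this=ajax({"htmlInfo":"\u003Cp class=\u0022x\u0022\u003Ehi\u003C/p\u003E","otherInfo":"Blah Blah"})`)

	html, err := extractHTMLInfo(body)
	if err != nil {
		fmt.Println("extract failed:", err)
		return
	}
	fmt.Println(html) // prints: <p class="x">hi</p>
}
```

In real use, `body` would be whatever was read from `resp.Body`. The helper returns an error instead of printing, so the caller decides how to handle a malformed response.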

Regarding "go - How to parse an HTTP GET response in Golang", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36534609/
