I am trying to use gocolly's parallelism setting to limit the maximum number of URLs scraped at one time.
With the code pasted below, I get this output:
Visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=eYSGmF
Visiting https://www.google.com/search?q=MtYvWU
Visiting https://www.google.com/search?q=yMDfIa
Visiting https://www.google.com/search?q=sQuKLv
Done visiting https://www.google.com/search?q=MtYvWU
Done visiting https://www.google.com/search?q=GrkZmM
Done visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=yMDfIa
Done visiting https://www.google.com/search?q=sQuKLv
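This suggests that the visits are not being held back by the configured maximum number of threads. When I add more URLs, they are all sent at once, which gets me banned by the server.
How can I configure the library to get the following output: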
Visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=MtYvWU
Done visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=MtYvWU
Visiting https://www.google.com/search?q=yMDfIa
Done visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=yMDfIa
Visiting https://www.google.com/search?q=sQuKLv
Done visiting https://www.google.com/search?q=sQuKLv
Here is the code:
package main

import (
	"fmt"
	"math/rand"
	"time"

	"github.com/gocolly/colly"
)

const (
	letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
	URL         = "https://www.google.com/search?q="
)

// RandStringBytes returns a channel that yields five random strings of
// length n and is then closed.
func RandStringBytes(n int) chan string {
	out := make(chan string)
	go func() {
		// Close the channel when done so the range loop in main terminates.
		defer close(out)
		for i := 1; i <= 5; i++ {
			b := make([]byte, n)
			for j := range b {
				b[j] = letterBytes[rand.Intn(len(letterBytes))]
			}
			out <- string(b)
		}
	}()
	return out
}

func main() {
	c := RandStringBytes(6)
	collector := colly.NewCollector(
		colly.AllowedDomains("www.google.com"),
		colly.Async(true),
		colly.UserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"),
	)
	// Limit rule for the target domain.
	collector.Limit(&colly.LimitRule{
		DomainRegexp: "www.google.com",
		Parallelism:  2,
		RandomDelay:  5 * time.Second,
	})
	collector.OnResponse(func(r *colly.Response) {
		url := r.Ctx.Get("url")
		fmt.Println("Done visiting", url)
	})
	collector.OnRequest(func(r *colly.Request) {
		r.Ctx.Put("url", r.URL.String())
		fmt.Println("Visiting", r.URL.String())
	})
	collector.OnError(func(r *colly.Response, err error) {
		fmt.Println(err)
	})
	for w := range c {
		collector.Visit(URL + w)
	}
	// Block until all asynchronous requests have finished.
	collector.Wait()
}
Best answer
OnRequest is called before the request is actually sent to the server, so your debug statement is misleading: fmt.Println("Visiting", r.URL.String()) should be fmt.Println("Preparing request for:", r.URL.String()).
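To illustrate why a log line printed in a pre-send callback can appear for every URL at once while the actual sends stay bounded, here is a minimal standard-library sketch. It does not use colly and is not how colly is implemented; runLimited, the "Sending:" message, and the 10 ms simulated work are all hypothetical names and values chosen for the illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runLimited is a hypothetical helper (not part of colly). Every job logs a
// "Preparing" event as soon as its goroutine starts, but only `parallelism`
// jobs may be in the "sending" phase at the same time. It returns the event
// log and the highest number of jobs that were in flight simultaneously.
func runLimited(jobs []string, parallelism int) ([]string, int) {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		events   []string
		inFlight int
		peak     int
	)
	slots := make(chan struct{}, parallelism) // counting semaphore

	// record appends an event and adjusts the in-flight counter atomically.
	record := func(msg string, delta int) {
		mu.Lock()
		defer mu.Unlock()
		inFlight += delta
		if inFlight > peak {
			peak = inFlight
		}
		events = append(events, msg)
	}

	for _, job := range jobs {
		wg.Add(1)
		go func(job string) {
			defer wg.Done()
			// Logged before any slot is acquired, like a pre-send callback.
			record("Preparing request for: "+job, 0)
			slots <- struct{}{} // blocks while `parallelism` jobs are in flight
			record("Sending: "+job, +1)
			time.Sleep(10 * time.Millisecond) // simulated network round trip
			record("Done visiting "+job, -1)
			<-slots
		}(job)
	}
	wg.Wait()
	return events, peak
}

func main() {
	jobs := []string{"GrkZmM", "eYSGmF", "MtYvWU", "yMDfIa", "sQuKLv"}
	events, peak := runLimited(jobs, 2)
	for _, e := range events {
		fmt.Println(e)
	}
	fmt.Println("peak requests in flight:", peak)
}
```

The exact interleaving varies from run to run, but the peak never exceeds the parallelism value, no matter how early the "Preparing" lines show up.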
I found your question interesting, so I set up a local test case with Python's http.server, as follows:
$ cd $(mktemp -d) # make temp dir
$ for n in {0..99}; do touch $n; done # make 100 empty files
$ python3 -m http.server # start up test server
Then I modified the code above:
package main

import (
	"fmt"
	"strconv"
	"time"

	"github.com/gocolly/colly"
)

// URL points at the local Python test server.
const URL = "http://127.0.0.1:8000/"

func main() {
	collector := colly.NewCollector(
		colly.AllowedDomains("127.0.0.1:8000"),
		colly.Async(true),
		colly.UserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"),
	)
	// Fixed Delay instead of RandomDelay, so the pauses are predictable.
	collector.Limit(&colly.LimitRule{
		DomainRegexp: "127.0.0.1:8000",
		Parallelism:  2,
		Delay:        5 * time.Second,
	})
	collector.OnResponse(func(r *colly.Response) {
		url := r.Ctx.Get("url")
		fmt.Println("Done visiting", url)
	})
	// OnRequest fires before the request is sent, hence the reworded message.
	collector.OnRequest(func(r *colly.Request) {
		r.Ctx.Put("url", r.URL.String())
		fmt.Println("Creating request for:", r.URL.String())
	})
	collector.OnError(func(r *colly.Response, err error) {
		fmt.Println(err)
	})
	// Queue the 100 files created for the test server.
	for i := 0; i < 100; i++ {
		collector.Visit(URL + strconv.Itoa(i))
	}
	collector.Wait()
}
Note that I changed RandomDelay to a regular Delay, which makes the test case easier to reason about, and I changed the debug statement in OnRequest.
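As a rough illustration of why a fixed Delay is easier to reason about than RandomDelay, the sketch below computes a wait time under the assumption (taken from the field descriptions in colly's LimitRule documentation) that Delay is a fixed wait and RandomDelay is an extra randomized duration added on top of it. nextWait is a hypothetical helper written for this example, not a colly function.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextWait is a hypothetical helper (not part of colly). It returns the fixed
// delay plus a random extra in the half-open range [0, randomDelay).
// Assumption: this mirrors how Delay and RandomDelay combine in a limit rule.
func nextWait(delay, randomDelay time.Duration) time.Duration {
	if randomDelay <= 0 {
		return delay // no random component: the wait is always the same
	}
	return delay + time.Duration(rand.Int63n(int64(randomDelay)))
}

func main() {
	// With only a fixed delay, every wait is identical.
	fmt.Println("fixed: ", nextWait(5*time.Second, 0))
	// With only a random delay, each wait differs from run to run.
	fmt.Println("random:", nextWait(0, 5*time.Second))
}
```

With a fixed delay the pauses in the test output line up at regular intervals, which is what makes the behavior easy to check by eye.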
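Now if you go run this file, you will see:
1. It immediately prints Creating request for: http://127.0.0.1:8000/ + a number, 100 times
2. It prints Done visiting http://127.0.0.1:8000/ + a number, twice
3. The Python HTTP server prints 2 GET requests, one for each number from #2
4. It pauses for 5 seconds
5. Steps #2 - #4 repeat for the remaining numbers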
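The same kind of check can be made from the server's side instead of by reading logs. The following is a minimal sketch using only Go's standard library; it does not involve colly. It assumes a throwaway httptest server and a simple semaphore-limited client, and measurePeak, holdMillis, and the "ok" response body are hypothetical names chosen for the example. It omits the 5-second delay step and only measures how many requests the server sees at once.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// measurePeak is a hypothetical helper. It starts a local test server,
// fetches n URLs with at most `parallelism` requests in flight, and returns
// how many requests the server handled and the most it saw at the same time.
func measurePeak(n, parallelism, holdMillis int) (served, peak int) {
	var (
		mu       sync.Mutex
		inFlight int
	)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		served++
		inFlight++
		if inFlight > peak {
			peak = inFlight
		}
		mu.Unlock()

		// Hold the request open briefly so overlapping requests are visible.
		time.Sleep(time.Duration(holdMillis) * time.Millisecond)

		mu.Lock()
		inFlight--
		mu.Unlock()
		fmt.Fprintln(w, "ok")
	}))
	defer srv.Close()

	slots := make(chan struct{}, parallelism) // client-side concurrency limit
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			slots <- struct{}{}
			defer func() { <-slots }()
			resp, err := http.Get(fmt.Sprintf("%s/%d", srv.URL, i))
			if err != nil {
				return
			}
			defer resp.Body.Close()
			io.Copy(io.Discard, resp.Body)
		}(i)
	}
	wg.Wait()

	mu.Lock()
	defer mu.Unlock()
	return served, peak
}

func main() {
	served, peak := measurePeak(10, 2, 20)
	fmt.Println("requests served:", served)
	fmt.Println("peak concurrent requests seen by server:", peak)
}
```

If the reported peak stays at or below the configured parallelism, the limiter is doing its job regardless of what the client-side log lines suggest.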
So it looks to me like colly is behaving as expected. If you are still running into unexpected rate-limiting errors, consider verifying that your limit rule is matching the domain.
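One quick way to sanity-check a DomainRegexp pattern outside of colly is to test it against the host of a target URL with Go's regexp package. This is only a sketch: ruleMatches is a hypothetical helper, and it assumes the pattern is matched against the URL's host (including any port), which should be confirmed against the colly version in use.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// ruleMatches is a hypothetical helper (not part of colly). It reports
// whether the regular expression `pattern` matches the host part of rawURL.
// Assumption: the limit rule is applied to the URL host, port included.
func ruleMatches(pattern, rawURL string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	return re.MatchString(u.Host), nil
}

func main() {
	checks := []struct{ pattern, target string }{
		{"www.google.com", "https://www.google.com/search?q=GrkZmM"},
		{"127.0.0.1:8000", "http://127.0.0.1:8000/5"},
		// An anchored, escaped pattern does not match a different host.
		{`^www\.google\.com$`, "https://google.com/search?q=GrkZmM"},
	}
	for _, c := range checks {
		ok, err := ruleMatches(c.pattern, c.target)
		fmt.Println(c.pattern, "vs", c.target, "->", ok, err)
	}
}
```

Note that an unescaped "." in a regular expression matches any character, so a pattern such as "www.google.com" is looser than it looks; escaping the dots and anchoring the pattern makes the rule stricter.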
This question about Go (limiting gocolly to processing a bounded number of URLs at a time) comes from Stack Overflow: https://stackoverflow.com/questions/51093663/