analysis - Wikipedia pageviews analysis

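I have been working on an analysis of Wikipedia pageviews. This is my first project with this much data, and I am a bit lost. When I download a file from the link and unzip it, I can see that it has a table-like structure, with rows like this: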

1   |  2                             |3|4

en.m The_Beatles_in_the_United_States 2 0

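I am having trouble working out what exactly each column contains. My guesses:

Language version and additional information (.m = mobile?)

Article name

The last two columns worry me the most. The last one contains only "0" values, and I have no idea what it represents. I assume the third one shows the number of views, but I am not sure.

I would be grateful if someone could help me understand what exactly each column contains, or recommend some reading on the topic. Thanks!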

Best answer

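After spending more time on this, I finally found the solution. I am posting it in case someone runs into the same problem in the future. Wikipedia explains what can be found in the database. The explanations are hard to find, but you can read about the topic here and here.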

Based on this, you can see that each row has the following structure:

domain_code
page_title
count_views
total_response_size (no longer maintained)
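
To make the four-field layout concrete, here is a minimal Python sketch that splits one row into named fields. It assumes the fields are separated by single spaces (as in the sample row from the question); the `PageviewRow` type and the `parse_pageview_line` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class PageviewRow(NamedTuple):
    """One row of a pageviews file (hypothetical helper type)."""
    domain_code: str
    page_title: str
    count_views: int
    total_response_size: int


def parse_pageview_line(line: str) -> PageviewRow:
    # Assumption: fields are space-separated. Titles use underscores
    # instead of spaces, so a plain split should yield four fields.
    domain_code, page_title, count_views, response_size = line.split()
    return PageviewRow(domain_code, page_title, int(count_views), int(response_size))


# The sample row from the question.
row = parse_pageview_line("en.m The_Beatles_in_the_United_States 2 0")
print(row.domain_code, row.page_title, row.count_views)
```

A row that does not split into exactly four fields raises a `ValueError` here, which is probably what you want while exploring the data, since it points at lines that break the assumption.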

Some explanation of each column:

Column 1:

Domain name of the request, abbreviated. (...) Domain_code now can also be an abbreviation for mobile and zero domain names, in which case .m or .zero is inserted as second part of the domain name (just like with full domain name). E.g. 'en.m.v' stands for "en.m.wikiversity.org".

Column 2:

For page-level files, it holds the title of the unnormalized part after /wiki/ -in the request Url (E.g.: Main_Page Berlin). For project-level files, it is - .

Column 3:

The number of times this page has been viewed in the respective hour.

Column 4:

The total response size caused by the requests for this page in the respective hour. If I understand it correctly response size is discontinued due to low accuracy. That's why there are only 0s. The pagecounts and projectcounts files also include total response byte sizes at their respective aggregation level, but this was dropped from the pageviews and projectviews files because it wasn't very accurate.
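
Since column 3 is an hourly count and column 4 is always 0, one simple use of the data is to add up column 3 over several hourly files and ignore column 4. The following is only a sketch under that reading: `total_views` is a hypothetical helper, and the second and third rows are made-up sample data, not taken from a real file.

```python
from collections import Counter
from typing import Iterable


def total_views(lines: Iterable[str]) -> Counter:
    """Sum count_views per (domain_code, page_title) across rows."""
    totals: Counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip rows that do not match the four-field layout
        domain_code, page_title, count_views, _response_size = parts
        totals[(domain_code, page_title)] += int(count_views)
    return totals


# First row is the sample from the question; the others are invented.
hour_1 = ["en.m The_Beatles_in_the_United_States 2 0"]
hour_2 = [
    "en.m The_Beatles_in_the_United_States 5 0",
    "en Main_Page 10 0",
]
totals = total_views(hour_1 + hour_2)
print(totals[("en.m", "The_Beatles_in_the_United_States")])
```

Keying on both the domain code and the title keeps desktop and mobile views of the same article separate; drop the domain code from the key if you want them merged.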

I hope someone finds this useful.

Regarding analysis - Wikipedia pageviews analysis, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51217168/
