python - Free implementation of counting user sessions from a web server log?

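Web server log analyzers (e.g. Urchin) often report a number of "sessions". A session is defined as a series of page visits/clicks made by one individual within a limited, continuous time period. These periods are identified using the IP address, often with supplementary information such as user agent and operating system, together with a session timeout threshold such as 15 or 30 minutes.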

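For certain web sites and applications, a user can be logged in and/or tracked with a cookie, which means the server can know precisely when a session begins. I'm not talking about that, but about inferring sessions heuristically ("session reconstruction") when the web server does not track them.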

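I could write some code, e.g. in Python, to try to reconstruct sessions based on the criteria above, but I'd rather not reinvent the wheel. I'm looking at log files of around 400K lines, so I'd have to be careful to use a scalable algorithm.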

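My goal is to extract a list of unique IP addresses from a log file and, for each IP address, infer the number of sessions from that log. Absolute precision and accuracy are not necessary... a pretty good estimate is fine.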

Based on this description:

a new request is put in an existing session if two conditions are valid:

the IP address and the user-agent are the same of the requests already
inserted in the session,

the request is done less than fifteen minutes after the last request inserted.

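In theory it would be simple to write a Python program that builds a dictionary (keyed by IP) of dictionaries (keyed by user agent) whose values are pairs: (number of sessions, time of the latest request in the latest session).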
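To make that structure concrete, here is a minimal Python 3 sketch of the nested dictionary and the rule quoted above. The function name `count_sessions`, the tuple layout of its input, and the sample data are assumptions made for illustration; it also assumes the requests arrive in chronological order, as they normally do in a log file.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=15)  # threshold from the quoted rule


def count_sessions(requests, timeout=SESSION_TIMEOUT):
    """Return {ip: total inferred sessions} for time-ordered requests.

    `requests` is an iterable of (ip, user_agent, timestamp) tuples
    (a hypothetical layout chosen for this sketch).
    """
    # state[ip][user_agent] = [session_count, time_of_last_request]
    state = defaultdict(dict)
    for ip, user_agent, when in requests:
        entry = state[ip].get(user_agent)
        if entry is None:
            # First request from this IP / user agent pair: first session.
            state[ip][user_agent] = [1, when]
            continue
        if when - entry[1] >= timeout:
            # Gap reached the timeout: count a new session.
            entry[0] += 1
        entry[1] = when
    # Add up the per-user-agent counts for each IP.
    return {ip: sum(count for count, _ in agents.values())
            for ip, agents in state.items()}


t0 = datetime(2010, 9, 21, 23, 0, 0)
sample = [
    ("128.123.114.141", "Firefox", t0),
    ("128.123.114.141", "Firefox", t0 + timedelta(minutes=5)),
    ("128.123.114.141", "Firefox", t0 + timedelta(minutes=40)),
    ("128.123.114.141", "Opera", t0 + timedelta(minutes=41)),
]
print(count_sessions(sample))  # {'128.123.114.141': 3}
```

Memory grows with the number of distinct IP / user agent pairs rather than with the number of log lines, and each line costs a couple of dictionary lookups, so a 400K-line file should be well within reach of this approach.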

But I would rather use an existing implementation if one is available; otherwise I risk spending a lot of time tuning performance.

FYI, in case anyone asks for sample input, here is a line from our log file (sanitized):

#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status 
2010-09-21 23:59:59 215.51.1.119 GET /graphics/foo.gif - 80 - 128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) http://www.mysite.org/blarg.htm 200 0 0
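As a rough illustration of how such a line could be read, the Python 3 sketch below uses the `#Fields:` header to name the columns instead of hard-coding positions. The function name `parse_fields_log` is hypothetical, and the sketch assumes that values never contain spaces, which holds for the sample line because the user agent is encoded with `+`.

```python
def parse_fields_log(lines):
    """Yield one dict per request, keyed by the names in the #Fields header."""
    field_names = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line; only the #Fields directive matters here.
            if line.startswith("#Fields:"):
                field_names = line[len("#Fields:"):].split()
            continue
        values = line.split()
        if len(values) != len(field_names):
            continue  # skip malformed lines rather than guess at them
        yield dict(zip(field_names, values))


sample_log = [
    "#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port "
    "cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus "
    "sc-win32-status",
    "2010-09-21 23:59:59 215.51.1.119 GET /graphics/foo.gif - 80 - "
    "128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;"
    "+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) "
    "http://www.mysite.org/blarg.htm 200 0 0",
]
records = list(parse_fields_log(sample_log))
print(records[0]["c-ip"], records[0]["date"], records[0]["time"])
```

Looking fields up by name keeps the parser working if the server is reconfigured to log a different set of columns, at the cost of building one small dict per line.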

Best answer

OK, in the absence of any other answer, here is my Python implementation. I'm not a Python expert. Suggestions for improvement are welcome.

#!/usr/bin/env python

"""Reconstruct sessions: Take a space-delimited web server access log
including IP addresses, timestamps, and User Agent,
and output a list of the IPs, and the number of inferred sessions for each."""

## Input looks like:
# Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status
# 2010-09-21 23:59:59 172.21.1.119 GET /graphics/foo.gif - 80 - 128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) http://www.site.org//baz.htm 200 0 0

import datetime
import operator

infileName = "ex100922.log"
outfileName = "visitor-ips.csv"

ipDict = {}

def inputRecords():
    infile = open(infileName, "r")

    recordsRead = 0
    progressThreshold = 100
    # The paper quoted below uses fifteen minutes; this script uses thirty.
    sessionTimeout = datetime.timedelta(minutes=30)

    for line in infile:
        if (line[0] == '#'):
            continue
        else:
            recordsRead += 1

            fields = line.split()
            # print "line of %d records: %s\n" % (len(fields), line)
            if (recordsRead >= progressThreshold):
                print("Read %d records" % recordsRead)
                progressThreshold *= 2

            # http://www.dblab.ntua.gr/persdl2007/papers/72.pdf
            #   "a new request is put in an existing session if two conditions are valid:
            #    * the IP address and the user-agent are the same of the requests already
            #      inserted in the session,
            #    * the request is done less than fifteen minutes after the last request inserted."

            theDate, theTime = fields[0], fields[1]
            newRequestTime = datetime.datetime.strptime(theDate + " " + theTime, "%Y-%m-%d %H:%M:%S")

            ipAddr, userAgent = fields[8], fields[9]

            if ipAddr not in ipDict:
                ipDict[ipAddr] = {userAgent: [1, newRequestTime]}
            else:
                if userAgent not in ipDict[ipAddr]:
                    ipDict[ipAddr][userAgent] = [1, newRequestTime]
                else:
                    ipdipaua = ipDict[ipAddr][userAgent]
                    if newRequestTime - ipdipaua[1] >= sessionTimeout:
                        ipdipaua[0] += 1
                    ipdipaua[1] = newRequestTime
    infile.close()
    return recordsRead

def outputSessions():
    outfile = open(outfileName, "w")
    outfile.write("#Fields: IPAddr Sessions\n")
    recordsWritten = len(ipDict)

    # ipDict[ip] is { userAgent1: [numSessions, lastTimeStamp], ... }
    for ip, val in ipDict.items():
        # Sum the session counts across all user agents seen for this IP.
        totalSessions = sum(v2[0] for v2 in val.values())
        outfile.write("%s\t%d\n" % (ip, totalSessions))

    outfile.close()
    return recordsWritten

recordsRead = inputRecords()

recordsWritten = outputSessions()

print("Finished session reconstruction: read %d records, wrote %d\n" % (recordsRead, recordsWritten))

Update: this took 39 seconds to read and process 342K records and write 21K records. That is fast enough for my purposes. Apparently 3/4 of that time was spent in strptime()!
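If the strptime() cost ever matters, one possible workaround is to build the datetime directly from fixed-width slices of the date and time fields. This is a Python 3 sketch with a hypothetical helper name, `parse_timestamp`; it assumes every line uses exactly the YYYY-MM-DD and HH:MM:SS layout shown in the sample, and it has not been timed against the figures above.

```python
import datetime


def parse_timestamp(date_text, time_text):
    """Build a datetime from 'YYYY-MM-DD' and 'HH:MM:SS' by slicing."""
    return datetime.datetime(
        int(date_text[0:4]), int(date_text[5:7]), int(date_text[8:10]),
        int(time_text[0:2]), int(time_text[3:5]), int(time_text[6:8]),
    )


# Check that the shortcut agrees with strptime on the sample timestamp.
fast = parse_timestamp("2010-09-21", "23:59:59")
slow = datetime.datetime.strptime("2010-09-21 23:59:59", "%Y-%m-%d %H:%M:%S")
print(fast == slow)  # True
```

The trade-off is that slicing does not check the separators, although the `datetime` constructor still raises `ValueError` for out-of-range values such as month 13.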

Regarding "python - Free implementation of counting user sessions from a web server log?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/3773840/
