python - Incremental entropy computation for non-real-valued data

Tags: python python-3.x math python-3.5 entropy

I have a set of data containing IDs, timestamps, and identifiers. I have to go through it, computing the entropy and saving some other links in the data. At each step, more identifiers are added to the identifiers dictionary, and I have to recompute the entropy and append it. I have a huge amount of data, and the program gets stuck because of the growing number of identifiers and the entropy computation after every step. I read the following solution, but it deals with data consisting of numbers: Incremental entropy computation

I copied two functions from that page, but the incremental entropy calculation gives values at every step that differ from the classical full entropy calculation. Here is my code:

from math import log
# ---------------------------------------------------------------------#
# Functions copied from  https://stackoverflow.com/questions/17104673/incremental-entropy-computation
# maps x to -x*log2(x) for x>0, and to 0 otherwise
h = lambda p: -p*log(p, 2) if p > 0 else 0

# entropy of union of two samples with entropies H1 and H2
def update(H1, S1, H2, S2):
    S = S1+S2
    return 1.0*H1*S1/S+h(1.0*S1/S)+1.0*H2*S2/S+h(1.0*S2/S)

# compute entropy using the classic equation
def entropy(L):
    n = 1.0*sum(L)
    return sum([h(x/n) for x in L])
# ---------------------------------------------------------------------#
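As a quick sanity check of the two helpers above (repeated here so the snippet runs standalone): merging two disjoint samples should reproduce the entropy of their union, e.g. two samples of size 2 with entropy 1 bit each combine to 2 bits. Note that `update()` is only exact when the two samples share no labels, which is one reason it drifts from the classical computation on overlapping identifier streams.

```python
from math import log

# Same helpers as in the question, repeated so this snippet is self-contained
h = lambda p: -p * log(p, 2) if p > 0 else 0

def update(H1, S1, H2, S2):
    # Entropy of the union of two samples (with disjoint label sets)
    S = S1 + S2
    return 1.0*H1*S1/S + h(1.0*S1/S) + 1.0*H2*S2/S + h(1.0*S2/S)

def entropy(L):
    n = 1.0 * sum(L)
    return sum(h(x / n) for x in L)

# Counts [1, 1] have entropy 1 bit; the union [1, 1, 1, 1] has 2 bits
print(update(entropy([1, 1]), 2, entropy([1, 1]), 2))  # 2.0
print(entropy([1, 1, 1, 1]))                           # 2.0
```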
# Below is the input data (Actually I read it from a csv file)
input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
          ["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
          ["7","2008-01-06T02:13:00Z","x,y"]]
total_identifiers = {} # To store the occurrences of identifiers. Values show the number of occurrences
all_entropies = []  # Classical way of calculating entropy at every step
updated_entropies = []  # Incremental way of calculating entropy at every step
for item in input_data:
    temp = item[2].split(",")
    identifiers_sum = sum(total_identifiers.values())  # Sum of all identifiers
    old_entropy = 0 if all_entropies[-1:] == [] else all_entropies[-1]  # Get previous entropy calculation
    for identifier in temp:
        S_new = len(temp)  # sum of new samples
        temp_dictionaty = {a:1 for a in temp}  # Store current identifiers and their occurrence
        if identifier not in total_identifiers:
            total_identifiers[identifier] = 1
        else:
            total_identifiers[identifier] += 1
    current_entropy = entropy(total_identifiers.values())  # Entropy for current set of identifiers
    updated_entropy = update(old_entropy, identifiers_sum, current_entropy, S_new)
    updated_entropies.append(updated_entropy)

    entropy_value = entropy(total_identifiers.values())  # Classical entropy calculation for comparison. This step becomes too expensive with big data
    all_entropies.append(entropy_value)

print(total_identifiers)
print('Sum of Total Identifiers: ', identifiers_sum)  # Gives 12 while the sum is 14 ???
print("All Classical Entropies:     ", all_entropies)  # print for comparison
print("All Updated Entropies:       ", updated_entropies)

Another problem is that when I print the sum of total_identifiers, it gives 12 instead of 14! (Since the data is very large, I read the actual file line by line and write the results directly to disk, storing nothing in memory except the identifiers dictionary.)

Best Answer

The code above uses Theorem 4; it seems to me you want to use Theorem 5 instead (from the next paragraph of the same paper).

Note, however, that if the number of identifiers really is the problem, then the incremental approach below won't work either: at some point the dictionary will become too large.

Below you can find a proof-of-concept Python implementation that follows the description in Updating Formulas and Algorithms for Computing Entropy and Gini Index from Time-Changing Data Streams.

import collections
import math
import random


def log2(p):
    return math.log(p, 2) if p > 0 else 0


CountChange = collections.namedtuple('CountChange', ('label', 'change'))


class EntropyHolder:
    def __init__(self):
        self.counts_ = collections.defaultdict(int)

        self.entropy_ = 0
        self.sum_ = 0

    def update(self, count_changes):
        r = sum([change for _, change in count_changes])

        residual = self._compute_residual(count_changes)

        self.entropy_ = self.sum_ * (self.entropy_ - log2(self.sum_ / (self.sum_ + r))) / (self.sum_ + r) - residual

        self._update_counts(count_changes)

        return self.entropy_

    def _compute_residual(self, count_changes):
        r = sum([change for _, change in count_changes])
        residual = 0

        for label, change in count_changes:
            p_new = (self.counts_[label] + change) / (self.sum_ + r)
            p_old = self.counts_[label] / (self.sum_ + r)

            residual += p_new * log2(p_new) - p_old * log2(p_old)

        return residual

    def _update_counts(self, count_changes):
        for label, change in count_changes:
            self.sum_ += change
            self.counts_[label] += change

    def entropy(self):
        return self.entropy_



def naive_entropy(counts):
    s = sum(counts)
    return sum([-(r/s) * log2(r/s) for r in counts])


if __name__ == '__main__':
    print(naive_entropy([1, 1]))
    print(naive_entropy([1, 1, 1, 1]))

    entropy = EntropyHolder()
    freq = collections.defaultdict(int)
    for _ in range(100):
        index = random.randint(0, 5)
        entropy.update([CountChange(index, 1)])
        freq[index] += 1

    print(naive_entropy(freq.values()))
    print(entropy.entropy())
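Applied to the question's data, the same Theorem-5 update can be condensed into a single standalone function (the name `incremental_update` is hypothetical; this sketch assumes labels are distinct within each batch, as they are in each row of the question's input):

```python
import math
from collections import defaultdict

def log2(p):
    return math.log(p, 2) if p > 0 else 0

def incremental_update(H, total, counts, batch):
    """Return the entropy after adding one occurrence of each label in
    `batch`. `total` is the old overall count; `counts` is mutated."""
    r = len(batch)
    residual = 0.0
    for label in batch:
        p_new = (counts[label] + 1) / (total + r)
        p_old = counts[label] / (total + r)
        residual += p_new * log2(p_new) - p_old * log2(p_old)
    for label in batch:
        counts[label] += 1
    return total * (H - log2(total / (total + r))) / (total + r) - residual

counts = defaultdict(int)
H, n = 0.0, 0
for row in ["foo,bar", "bar,blup", "foo,bar", "foo,xy",
            "foo,bar", "foo,bar", "x,y"]:
    batch = row.split(",")
    H = incremental_update(H, n, counts, batch)
    n += len(batch)

print(n)  # 14 identifiers in total
print(H)  # matches the classical entropy of the final counts
```

At every step `H` equals the classical entropy of the counts seen so far, without rescanning the whole dictionary.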

Regarding python - incremental entropy computation for non-real-valued data, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48601396/
