How can I find the entropy of English using isolated symbol probabilities?
Best answer
If we define "isolated symbol probabilities" in the way this SO answer does, we would have to do the following:

1. Obtain a representative sample (corpus) of English text.
2. Count the occurrences of each symbol (letter or space) in the corpus.
3. Divide each symbol's count by the total number of symbols to get its probability p.
4. Compute the entropy as the sum of -p * log2(p) over all symbols, giving bits per character.

Warning: isolated symbol probabilities treat every character as independent of its neighbours. Because English has strong inter-symbol dependencies (after "q" you almost always see "u"), this estimate, roughly 4 bits per character, is only an upper bound; Shannon's experiments, which take context into account, put the true entropy of English closer to 1 bit per character.

Code example:
Here is some Python code implementing the procedure above. It normalises the text to lower case and strips punctuation and any other non-letter, non-whitespace characters. It assumes you have assembled a representative corpus of English text and are supplying it (encoded as ASCII) on STDIN.
import re
import sys
from math import log

# Function to compute the base-2 logarithm of a floating point number.
def log2(number):
    return log(number) / log(2)

# Function to normalise the text: collapse every run of
# non-letter characters into a single space.
cleaner = re.compile('[^a-z]+')
def clean(text):
    return cleaner.sub(' ', text)

# Dictionary for symbol (letter and space) counts
letter_frequency = {}

# Read and normalise the input text
text = clean(sys.stdin.read().lower().strip())

# Count symbol frequencies
for letter in text:
    if letter in letter_frequency:
        letter_frequency[letter] += 1
    else:
        letter_frequency[letter] = 1

# Calculate entropy: H = -sum over all symbols of p * log2(p)
entropy_sum = 0.0
for letter in letter_frequency:
    probability = float(letter_frequency[letter]) / len(text)
    entropy_sum += probability * log2(probability)

# Output
sys.stdout.write('Entropy: %f bits per character\n' % (-entropy_sum))
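As a quick sanity check of the entropy formula (not the full STDIN program, just the calculation), the same computation can be run on a short in-memory string; `sample` here is a made-up example, and realistic per-character figures for English only emerge from a large corpus:

```python
from collections import Counter
from math import log2

def isolated_symbol_entropy(text):
    """Entropy in bits per character from isolated symbol probabilities."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Four equally likely symbols give exactly log2(4) = 2 bits per character.
sample = 'abcd'
print(isolated_symbol_entropy(sample))  # 2.0
```

A uniform distribution over N symbols always yields log2(N) bits, which is a handy way to confirm the implementation before running it on real text.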
Regarding nlp - how to find the entropy of the English language, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/9604460/
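Because isolated symbol probabilities ignore inter-symbol dependencies, conditioning on the previous character always yields an estimate at most as large, and on real text usually strictly smaller. A minimal sketch comparing the two (the sample string is made up; a real comparison needs a large corpus):

```python
from collections import Counter
from math import log2

def unigram_entropy(text):
    """Bits per character, treating each symbol as independent."""
    n = len(text)
    return -sum((c / n) * log2(c / n) for c in Counter(text).values())

def conditional_entropy(text):
    """H(next char | previous char), estimated from bigram counts."""
    pair_counts = Counter(zip(text, text[1:]))
    prev_counts = Counter(text[:-1])
    n = len(text) - 1
    h = 0.0
    for (prev, cur), count in pair_counts.items():
        p_pair = count / n               # joint probability p(prev, cur)
        p_cond = count / prev_counts[prev]  # conditional p(cur | prev)
        h -= p_pair * log2(p_cond)
    return h

sample = 'the cat sat on the mat the cat sat on the mat '
print(unigram_entropy(sample), conditional_entropy(sample))
```

On this repetitive sample the conditional estimate is much lower than the unigram one, which is exactly the effect the warning above describes.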