python - Quality of discrete data

Tags: python pandas statistics

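I have a table of containers of various types (df_1). I have another table listing what they contain (df_2). I would like to evaluate which rows of df_1 are more likely to have been classified as their true type, based on whether their contents are typical of that type of container.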

import pandas as pd

df_1 = pd.DataFrame({'Container': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                     'Type': ['Box', 'Bag', 'Bin', 'Bag', 'Bin', 'Box', 'Bag', 'Bin', 'Bin', 'Bin']})

df_2 = pd.DataFrame({'Container': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 9],
                     'Item': ['Ball', 'Ball', 'Brain', 'Ball', 'Ball', 'Baloon', 'Brain', 'Ball', 'Ball', 'Baloon', 'Brain', 'Ball', 'Ball', 'Baloon', 'Brain', 'Ball', 'Ball', 'Baloon', 'Bomb', 'Ball', 'Ball', 'Baloon', 'Brain', 'Ball', 'Ball', 'Bomb', 'Bum']})

Best answer

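The following approach considers whether the contents of each container are typical of its type. It gives equal weight to the presence of items that are found in other containers of the same type (positive) and the presence of items that are not found in those containers (negative). It ignores how often an item is found in the other containers. It also ignores whether the contents are typical of another type of container. I think this approach will scale.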
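To make the scoring rule concrete before the full code, here is a minimal sketch of it for a single container. The function name typicality_score and its arguments are hypothetical and are not part of the answer's code; it only illustrates "typical items minus atypical items".

```python
def typicality_score(items, reference_items):
    """Count items seen in the reference set minus items not seen in it."""
    reference = set(reference_items)
    typical = sum(item in reference for item in items)
    atypical = len(items) - typical
    return typical - atypical


# Container 1 is a Box; the only other Box is container 6.
container_1 = ['Ball', 'Ball', 'Brain', 'Ball']
other_boxes = ['Ball', 'Ball', 'Baloon']

score = typicality_score(container_1, other_boxes)
print(score)  # 3 typical items (Ball) minus 1 atypical item (Brain) = 2
```

Each item in the container counts once, so three copies of Ball contribute +3, while how many times Ball appears in the other containers makes no difference.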

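import numpy as np
import pandas as pd

# List of how typical the contents of each container are given the type of container
x = []

# Join each container's type onto its items (left join, so empty containers are kept)
df_J = df_1.set_index('Container').join(df_2.set_index('Container'))
df_J['Container'] = df_J.index
df_J.index = range(len(df_J.index))
# Helper column of ones, summed later to count items
df_J['Thing'] = 1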

# Type of each container, as a dict mapping container -> type
Di_Q_B = dict(zip(df_1.Container, df_1.Type))

# Compare each container against all of the other containers
for Container in df_1.Container:

    # Test data: Everything in the container (copied so that it can be modified safely)
    Te_C = df_2[df_2['Container'] == Container].copy()
    del Te_C['Container']

    # Everything in all of the other containers
    Tr_C = df_J[df_J['Container'] != Container]

    # Training data: Everything in all of the other containers of that type
    Tr_E = Tr_C[Tr_C['Type'] == Di_Q_B[Container]]

    # Table of how many of each item is in each container
    S_Tr = pd.pivot_table(Tr_E, values='Thing', index=Tr_E.Container, columns='Item', aggfunc='sum').fillna(0)

    # Table of whether each item is in each container
    Q_Tr = S_Tr.apply(np.sign)

    # Table of how many containers in the training data contain each item
    X_Tr = Q_Tr.sum(axis=0)
    Y_Tr = pd.DataFrame(X_Tr)

    # Table of whether any containers in the training data contain each item
    Z_Tr = Y_Tr.apply(np.sign)

    # List of which items are in the training data
    Train = list(Z_Tr.index)

    # Identify which of the items in the container are typical
    Te_C['Typical'] = Te_C['Item'].map(lambda a: a in Train)

    # Count how many typical items are in the container
    u = Te_C['Typical'].sum()

    # Count how many atypical items are in the container
    v = len(Te_C.index) - u

    # Gauge how typical the contents of the container are (giving equal weight to typical and atypical items)
    w = u - v
    x.append(w)

# How typical the contents of each container are given the type of container
df_1['Pa_Ty'] = x

This gives the resulting df_1, with the score for each container in the new Pa_Ty column.
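As a cross-check, the same leave-one-out scoring can be sketched more compactly with sets instead of pivot tables. This is only an illustrative rewrite under the same assumptions as the code above (equal weight for typical and atypical items, frequency in other containers ignored); the names contents, types, peers and result are introduced here and do not appear in the original code.

```python
import pandas as pd

df_1 = pd.DataFrame({'Container': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                     'Type': ['Box', 'Bag', 'Bin', 'Bag', 'Bin',
                              'Box', 'Bag', 'Bin', 'Bin', 'Bin']})
df_2 = pd.DataFrame({'Container': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4,
                                   4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 9],
                     'Item': ['Ball', 'Ball', 'Brain', 'Ball', 'Ball', 'Baloon',
                              'Brain', 'Ball', 'Ball', 'Baloon', 'Brain', 'Ball',
                              'Ball', 'Baloon', 'Brain', 'Ball', 'Ball', 'Baloon',
                              'Bomb', 'Ball', 'Ball', 'Baloon', 'Brain', 'Ball',
                              'Ball', 'Bomb', 'Bum']})

# List of items held by each container (containers with no items are absent)
contents = df_2.groupby('Container')['Item'].apply(list)
# Type of each container, indexed by container
types = df_1.set_index('Container')['Type']

scores = []
for container, ctype in types.items():
    own_items = contents.get(container, [])

    # All other containers of the same type
    peers = types[(types == ctype) & (types.index != container)].index

    # Every distinct item seen in those other containers
    reference = set()
    for peer in peers:
        reference.update(contents.get(peer, []))

    # Typical items count +1 each, atypical items count -1 each
    typical = sum(item in reference for item in own_items)
    scores.append(typical - (len(own_items) - typical))

result = df_1.assign(Pa_Ty=scores)
print(result)
```

An empty container (container 10 here) scores 0 under this rule, since it holds neither typical nor atypical items.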

Regarding python - Quality of discrete data, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57255055/
