python - Remove rows with non-integer or out-of-range values from a pandas DataFrame

I have a DataFrame of imported data. However, the imported data may be incorrect, so I'm trying to get rid of the bad entries. An example DataFrame:

    user    test1    test2    other
0   foo       1        7       bar
1   foo       2        9       bar
2   foo       3;as     5       bar
3   foo       3        5       bar

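I want to clean up the columns test1 and test2. I'd like to remove values that are not in a specified range, as well as values that contain a string because of some error (like the entry 3;as above). I do this by defining a dictionary of acceptable values: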

values_dict = {
    'test1' : [1,2,3],
    'test2' : [5,6,7],
}

and a list of the column names I want to clean:

headers = ['test1', 'test2']

My current code:

# Remove string entries
for i in headers:
    df[i] = pd.to_numeric(df[i], errors='coerce')
    df[i] = df[i].fillna(0).astype(int)

# Remove unwanted values
for i in values_dict:
    df[i] = df[df[i].isin(values_dict[i])]

But this doesn't seem to remove the bad values and produce the desired DataFrame:

    user    test1    test2    other
0   foo       1        7       bar
1   foo       3        5       bar

Thanks for your help!

Best answer

You can do something like this: use np.logical_and to build an AND condition from multiple columns, and use it to subset the DataFrame:

import numpy as np  # pd.np was removed in pandas 2.0; import NumPy directly

headers = ['test1', 'test2']
df[np.logical_and(*(pd.to_numeric(df[col], errors='coerce').isin(values_dict[col]) for col in headers))]

#  user  test1  test2   other
#0  foo      1      7     bar
#3  foo      3      5     bar
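
For reference, here is a self-contained sketch of the same approach that can be run as-is. The DataFrame is rebuilt by hand from the question's example (storing test1 as strings is an assumption about how the bad entry was imported), and the final integer conversion and reset_index step are additions to match the desired output, not part of the original answer:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question; test1 is assumed to hold
# strings because of the bad "3;as" entry.
df = pd.DataFrame({
    "user": ["foo", "foo", "foo", "foo"],
    "test1": ["1", "2", "3;as", "3"],
    "test2": [7, 9, 5, 5],
    "other": ["bar", "bar", "bar", "bar"],
})
values_dict = {"test1": [1, 2, 3], "test2": [5, 6, 7]}
headers = ["test1", "test2"]

# One boolean mask per column: unparseable entries become NaN and fail isin().
masks = [pd.to_numeric(df[col], errors="coerce").isin(values_dict[col])
         for col in headers]
keep = np.logical_and(*masks)  # works because there are exactly two masks

cleaned = df[keep].copy()
# Assumption: store the cleaned columns as integers, as the question's code attempted.
for col in headers:
    cleaned[col] = pd.to_numeric(cleaned[col]).astype(int)
cleaned = cleaned.reset_index(drop=True)  # renumber rows 0, 1 as in the desired output
print(cleaned)
```

Without reset_index, the surviving rows keep their original labels 0 and 3.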

Breaking it down:

[pd.to_numeric(df[col], errors='coerce').isin(values_dict[col]) for col in headers]

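This first converts each column of interest to a numeric type (entries that cannot be parsed become NaN), then checks whether each value is in the accepted set for that column; this produces one boolean Series per column: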

#[0     True
# 1     True
# 2    False
# 3     True
# Name: test1, dtype: bool, 
# 0     True
# 1    False
# 2     True
# 3     True
# Name: test2, dtype: bool]

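To require that the conditions hold for all columns at once, we need an AND operation, which can be built with numpy.logical_and; the * unpacks the per-column conditions as arguments. Note that numpy.logical_and takes exactly two input arrays, so the unpacking works here because headers has exactly two columns.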
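
For more than two columns, one option (not part of the original answer) is np.logical_and.reduce, which ANDs any number of masks together. A minimal sketch, where the test3 column and its accepted values are hypothetical and added only to illustrate a third condition:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a third column, test3, to show more than two conditions.
df = pd.DataFrame({
    "test1": ["1", "2", "3;as", "3"],
    "test2": [7, 9, 5, 5],
    "test3": [0, 0, 0, 1],
})
values_dict = {"test1": [1, 2, 3], "test2": [5, 6, 7], "test3": [0, 1]}

# One boolean mask per column listed in values_dict.
masks = [pd.to_numeric(df[col], errors="coerce").isin(vals)
         for col, vals in values_dict.items()]

# reduce() applies logical_and across the whole list, however long it is.
keep = np.logical_and.reduce(masks)
result = df[keep]
print(result)
```

Iterating over values_dict directly also removes the need for a separate headers list, at the cost of cleaning every column that has an entry in the dictionary.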

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/44297368/
