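I have a dataset, "app_metadata.csv", with three columns: item_id, category, and description. item_id is an integer, category is a string, and description is a string. I load the dataset with the following code: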
app_metadata_df = pd.read_csv(app_metadata_csv_path)
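However, the dataset contains corrupted data; for example, there is a row where item_id is text rather than a number. I want to drop the rows with invalid item_id values and convert the item_id column to int. Here is what I tried. First I call pd.to_numeric with errors='coerce':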
app_metadata_df.loc["item_id"] = pd.to_numeric(app_metadata_df["item_id"], errors='coerce')
Then I drop the NA values:
app_metadata_df.loc["item_id"] = app_metadata_df.loc["item_id"].dropna()
Finally I call astype(int) to convert the data type to int:
app_metadata_df.loc["item_id"] = app_metadata_df["item_id"].astype(int)
However, it throws the following error:
invalid literal for int() with base 10: 'So'
It looks like to_numeric did not convert some invalid values to NaN. Why does this happen, and how can I fix it?
Best answer
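Try this. Note that app_metadata_df.loc["item_id"] addresses a row labelled "item_id", not the item_id column, so the code in the question most likely never overwrites the column and astype(int) still sees the original strings. Assign to app_metadata_df['item_id'] instead: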
import pandas as pd

app_metadata_df = pd.read_csv(app_metadata_csv_path)
# Invalid values such as 'So' become NaN.
app_metadata_df['item_id'] = pd.to_numeric(app_metadata_df["item_id"], errors='coerce')
# Drop the NaN rows; reset_index() keeps the old index as an "index" column.
app_metadata_df = app_metadata_df[app_metadata_df['item_id'].notna()].reset_index()
app_metadata_df["item_id"] = app_metadata_df["item_id"].astype(int)
>>> print(app_metadata_df)
       index   item_id            category  \
0          0  593676.0  HEALTH_AND_FITNESS
1          1  601235.0                GAME
2          2  860079.0       COMMUNICATION
3          3   64855.0       VIDEO_PLAYERS
4          4  597756.0             MEDICAL
...      ...       ...                 ...
98577  98594  683377.0               TOOLS
98578  98595  862905.0             FINANCE
98579  98596  165878.0     MUSIC_AND_AUDIO
98580  98597  683417.0         PHOTOGRAPHY
98581  98598  703224.0                GAME

                                             description
0      Abs Workout, designed by professional fitness ...
1      The best building game on android is free to d...
2      Tamil Actress Stickers app has 200 + Tamil her...
3      The simplest VLC Remote you'll ever find. Peri...
4      This is the official mobile app of the Nationa...
...                                                  ...
98577  endoscope app for android an app to connect wi...
98578  Acerca de esta app<br>La App OCA está pensada ...
98579  This app provides free downloading of audio sh...
98580  <b>Water Paint : Colour Effect</b><br><br>Want...
98581  DIAMOND CRUSH with spectacular graphics and ex...

[98582 rows x 4 columns]
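The extra index column in this output comes from reset_index(), which by default keeps the old index as a new column. If that column is not wanted, passing drop=True discards it. A minimal sketch, assuming a small made-up frame (the sample values below are hypothetical stand-ins, not the real dataset):

```python
import pandas as pd

# Hypothetical sample data standing in for app_metadata.csv.
df = pd.DataFrame({
    "item_id": ["593676", "So", "601235"],
    "category": ["HEALTH_AND_FITNESS", "GAME", "GAME"],
})

# Invalid entries become NaN, then the NaN rows are filtered out.
df["item_id"] = pd.to_numeric(df["item_id"], errors="coerce")
valid = df[df["item_id"].notna()]

# Default: the old index is kept as a column named "index".
with_index = valid.reset_index()
# drop=True: the old index is discarded and a fresh 0..n-1 index is used.
without_index = valid.reset_index(drop=True)

print(list(with_index.columns))     # ['index', 'item_id', 'category']
print(list(without_index.columns))  # ['item_id', 'category']
```

Keeping the old index can still be useful for tracing which source rows were dropped, so which variant to use depends on what happens downstream.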
>>> print(app_metadata_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98582 entries, 0 to 98581
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   index        98582 non-null  int64
 1   item_id      98582 non-null  int32
 2   category     98582 non-null  object
 3   description  98582 non-null  object
dtypes: int32(1), int64(1), object(2)
memory usage: 2.6+ MB
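As for why the original attempt failed: with a single label, .loc looks that label up in the row index, not in the columns, so writing through .loc["item_id"] does not replace the item_id column. The sketch below illustrates this on a tiny in-memory CSV (the three rows are hypothetical stand-ins for the real file) and then applies the column-wise version of the same steps:

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for app_metadata.csv; one item_id is text.
csv_text = (
    "item_id,category,description\n"
    "593676,HEALTH_AND_FITNESS,Abs Workout\n"
    "So,GAME,broken row\n"
    "601235,GAME,building game\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# .loc with one label searches the row index (0, 1, 2 here), so this lookup fails.
try:
    df.loc["item_id"]
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True

# Column-wise version: df["item_id"] (or df.loc[:, "item_id"]) targets the column.
df["item_id"] = pd.to_numeric(df["item_id"], errors="coerce")
df = df.dropna(subset=["item_id"]).reset_index(drop=True)
df["item_id"] = df["item_id"].astype(int)

print(row_lookup_failed)       # True
print(df["item_id"].tolist())  # [593676, 601235]
```

Here dropna(subset=["item_id"]) is used in place of the notna() mask; both remove the same rows, and the subset argument keeps rows that are missing values only in other columns.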
This post, "python - Pandas to_numeric(errors='coerce') does not convert invalid values to NaN", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/73764054/