python - Pandas to_numeric(errors='coerce') does not convert invalid values to NaN

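I have a dataset, "app_metadata.csv", with three columns: item_id, category, and description. item_id is an integer, category is a string, and description is a string. I load the dataset with the following code: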

app_metadata_df = pd.read_csv(app_metadata_csv_path)

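However, the dataset contains corrupted data; for example, there is a row whose item_id is text rather than a number. I want to drop the rows with invalid item_id values and convert the item_id column to int. Here is what I tried. First, I call pd.to_numeric with errors='coerce':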

app_metadata_df.loc["item_id"] = pd.to_numeric(app_metadata_df["item_id"], errors='coerce')

Then I drop the NA values:

app_metadata_df.loc["item_id"] = app_metadata_df.loc["item_id"].dropna()

Finally, I call astype(int) to convert the data type to int:

app_metadata_df.loc["item_id"] = app_metadata_df["item_id"].astype(int)

However, it throws the following error:

invalid literal for int() with base 10: 'So'

It looks like to_numeric did not convert some of the invalid values to NaN. Why does this happen, and how can I fix it?

Best answer

Try this:

app_metadata_df = pd.read_csv(app_metadata_csv_path)

app_metadata_df['item_id'] = pd.to_numeric(app_metadata_df["item_id"], errors='coerce')
app_metadata_df = app_metadata_df[app_metadata_df['item_id'].notna()].reset_index()

app_metadata_df["item_id"] = app_metadata_df["item_id"].astype(int)
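A likely reason the original attempt failed can be sketched on a small made-up frame (the data and the names `df`, `coerced`, and `broken` are illustrative assumptions, not taken from the question). `to_numeric` does coerce the bad value, but `.loc["item_id"]` with a single label addresses a row, not the column, so the `item_id` column is never overwritten:

```python
import pandas as pd

# Hypothetical toy data standing in for app_metadata.csv;
# 'So' mimics the corrupted item_id from the error message.
df = pd.DataFrame({
    "item_id": ["593676", "So", "601235"],
    "category": ["HEALTH_AND_FITNESS", "GAME", "COMMUNICATION"],
})

# to_numeric itself behaves as documented: the bad value becomes NaN.
coerced = pd.to_numeric(df["item_id"], errors="coerce")
print(coerced.isna().tolist())  # [False, True, False]

# .loc["item_id"] = ... targets a ROW labelled "item_id".
# Because no such row exists, pandas adds one; the column is left untouched.
broken = df.copy()
broken.loc["item_id"] = coerced
print("item_id" in broken.index)   # True: a new row label appeared
print(broken["item_id"].iloc[1])   # So: the column still holds the raw string
```

Since the column still contains `'So'`, a later `astype(int)` on it fails exactly as in the question. Assigning with `df["item_id"] = ...` (or `df.loc[:, "item_id"] = ...`) addresses the column instead.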

`>>> print(app_metadata_df)`

       index   item_id            category  \
0          0  593676.0  HEALTH_AND_FITNESS   
1          1  601235.0                GAME   
2          2  860079.0       COMMUNICATION   
3          3   64855.0       VIDEO_PLAYERS   
4          4  597756.0             MEDICAL   
...      ...       ...                 ...   
98577  98594  683377.0               TOOLS   
98578  98595  862905.0             FINANCE   
98579  98596  165878.0     MUSIC_AND_AUDIO   
98580  98597  683417.0         PHOTOGRAPHY   
98581  98598  703224.0                GAME   

                                             description  
0      Abs Workout, designed by professional fitness ...  
1      The best building game on android is free to d...  
2      Tamil Actress Stickers app has 200 + Tamil her...  
3      The simplest VLC Remote you'll ever find. Peri...  
4      This is the official mobile app of the Nationa...  
...                                                  ...  
98577  endoscope app for android an app to connect wi...  
98578  Acerca de esta app<br>La App OCA está pensada ...  
98579  This app provides free downloading of audio sh...  
98580  <b>Water Paint : Colour Effect</b><br><br>Want...  
98581  DIAMOND CRUSH with spectacular graphics and ex...  

[98582 rows x 4 columns]

`>>> print(app_metadata_df.info())`

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98582 entries, 0 to 98581
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   index        98582 non-null  int64 
 1   item_id      98582 non-null  int32 
 2   category     98582 non-null  object
 3   description  98582 non-null  object
dtypes: int32(1), int64(1), object(2)
memory usage: 2.6+ MB
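A possible variation, shown on a small made-up frame (the data and the name `clean` are assumptions for illustration): `dropna(subset=...)` drops only rows whose `item_id` is missing, `reset_index(drop=True)` avoids keeping the old index as an extra `index` column, and `astype("int64")` pins the integer width, since plain `int` can map to `int32` on some platforms, as in the output above.

```python
import pandas as pd

# Hypothetical toy data; 'So' stands in for the corrupted item_id.
df = pd.DataFrame({
    "item_id": ["593676", "So", "601235"],
    "category": ["HEALTH_AND_FITNESS", "GAME", "COMMUNICATION"],
})

# Coerce invalid ids to NaN, assigning to the column (not to a row via .loc[label]).
df["item_id"] = pd.to_numeric(df["item_id"], errors="coerce")

# Drop rows with a missing item_id and renumber without keeping the old index.
clean = df.dropna(subset=["item_id"]).reset_index(drop=True)

# Use an explicit 64-bit integer dtype instead of the platform-dependent int.
clean["item_id"] = clean["item_id"].astype("int64")

print(clean)
print(clean.dtypes)
```

Whether to keep the old index is a design choice: the answer's plain `reset_index()` preserves the original row positions in an `index` column, which can help trace dropped rows back to the CSV.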

Regarding python - Pandas to_numeric(errors='coerce') does not convert invalid values to NaN, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/73764054/
