python - How do I handle non-ASCII characters from a CSV when using json.loads in Python?

I looked at several answers, including this one, but none of them seems to answer my question.

Here are some sample rows from the CSV:

_id category
ObjectId(56266da778d34fdc048b470b)  [{"group":"Home","id":"53cea0be763f4a6f4a8b459e","name":"Cleaning Services","name_singular":"Cleaning Service"}]
ObjectId(56266e0c78d34f22058b46de)  [{"group":"Local","id":"5637a1b178d34f20158b464f","name":"Balloon Dí©cor","name_singular":"Balloon Dí©cor"}]

Here is my code:

import csv
import sys

from sys import argv
import json


def ReadCSV(csvfile):
    with open('newCSVFile.csv', 'wb') as g:
        filewriter = csv.writer(g)  # , delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)

        with open(csvfile, 'rb') as f:
            reader = csv.reader(f)  # create reader object
            next(reader)  # skip first row

            for row in reader:  # go through all the rows
                listForExport = []  # list that will hold two items: id and list of categories

                # ID section
                vendorId = str(row[0])  # pull the raw vendor id out of the first column of the csv
                vendorId = vendorId[9:33]  # slice to remove the ObjectId label and parentheses
                listForExport.append(vendorId)  # add vendor ID as first item in list


                # categories section
                tempCatList = []  # temporary list of categories for second item in listForExport

                #this is line 41 where the error stems
                categories = json.loads(row[1]) #create's a dict with the categoreis from a given row

                for names in categories:  # loop through the categories, reading the key 'name'

                    print names['name']

Here is what I get:

Cleaning Services
Traceback (most recent call last):
  File "csvtesting.py", line 57, in <module>
    ReadCSV(csvfile)
  File "csvtesting.py", line 41, in ReadCSV
    categories = json.loads(row[1]) #create's a dict with the categoreis from a given row
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-10: invalid continuation byte

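So the code extracts the first category, Cleaning Services, but fails as soon as it reaches a non-ASCII character.

How do I deal with this? I would be happy to simply drop all non-ASCII items.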

Best answer

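Since you open the input csv file in rb mode, I assume you are using Python 2.x. The good news is that the csv part is fine, because the csv reader reads plain bytes without trying to interpret them. The json module, however, insists on decoding the text to unicode, and it uses utf8 by default. Since your input file is not utf8 encoded, it chokes and raises a UnicodeDecodeError.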
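The failure can be reproduced without any CSV file. This is a minimal sketch, assuming Python 3.6 or later (where json.loads accepts bytes and decodes them as UTF-8 unless it detects UTF-16 or UTF-32) and assuming the cell was saved as Latin-1. The sample bytes and the helper name try_load are made up for illustration and do not come from the question's file.

```python
import json

# Hypothetical cell content: well-formed JSON text, but encoded as
# Latin-1 instead of UTF-8 (an assumption about the real file).
raw = '[{"name": "Balloon D\u00e9cor"}]'.encode("latin-1")


def try_load(data):
    """Return the parsed JSON, or the name of the decoding error raised."""
    try:
        return json.loads(data)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


result = try_load(raw)
print(result)
```

Pure ASCII input goes through the same helper without trouble, which matches the question: the first row parses and the second one does not.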

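Latin1 has a nice property: the unicode value of any byte is simply the value of that byte, so decoding is guaranteed to succeed on any input. Whether the result makes sense depends on whether the actual encoding is Latin1...

So you could do: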

categories = json.loads(row[1], encoding="Latin1")
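For readers on Python 3, a rough equivalent is to decode the bytes explicitly before parsing, since the encoding keyword of json.loads was removed in Python 3.9. The sketch below assumes Python 3 and a Latin-1 encoded cell; the sample bytes are invented for illustration.

```python
import json

# Each of the 256 possible byte values decodes under Latin-1
# to the code point with the same numeric value.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
same_values = [ord(ch) for ch in decoded] == list(range(256))

# Hypothetical cell, assumed to be Latin-1 encoded.
raw = '[{"name": "Balloon D\u00e9cor"}]'.encode("latin-1")

# Decode first, then hand text (not bytes) to json.loads.
categories = json.loads(raw.decode("latin-1"))
names = [item["name"] for item in categories]
print(names)
```

If the file is really in some other single-byte encoding, this still runs without error but the accented characters come out wrong, so it is worth checking a few decoded names by eye.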

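Alternatively, if you want to ignore non-ASCII characters, you can first convert the byte string to unicode while ignoring errors, and only then load the json: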

categories = json.loads(row[1].decode('ascii', errors='ignore'))     # ignore all non ascii characters
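The effect of dropping the offending bytes can be seen in a small sketch. It assumes Python 3 and a made-up Latin-1 encoded cell; none of the sample data comes from the question's file.

```python
import json

# Hypothetical cell, assumed to be Latin-1 encoded.
raw = '[{"name": "Balloon D\u00e9cor"}]'.encode("latin-1")

# Decode as ASCII and silently discard every byte above 0x7F.
text = raw.decode("ascii", errors="ignore")

names = [item["name"] for item in json.loads(text)]
print(names)
```

The JSON stays parseable because its structural characters (brackets, braces, quotes, colons, commas) are all ASCII, but the stored name loses its accented letter, so this only suits cases where that loss is acceptable.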

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/44771837/
