I looked at several answers, including this one, but none of them seems to answer my question.
Here are some sample rows from the CSV:
_id category
ObjectId(56266da778d34fdc048b470b) [{"group":"Home","id":"53cea0be763f4a6f4a8b459e","name":"Cleaning Services","name_singular":"Cleaning Service"}]
ObjectId(56266e0c78d34f22058b46de) [{"group":"Local","id":"5637a1b178d34f20158b464f","name":"Balloon Dí©cor","name_singular":"Balloon Dí©cor"}]
Here is my code:
import csv
import sys
from sys import argv
import json

def ReadCSV(csvfile):
    with open('newCSVFile.csv', 'wb') as g:
        filewriter = csv.writer(g)  #, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        with open(csvfile, 'rb') as f:
            reader = csv.reader(f)  # create reader object
            next(reader)  # skip first row
            for row in reader:  # go through all the rows
                listForExport = []  # initialize list that will have two items: id and list of categories
                # ID section
                vendorId = str(row[0])  # pull the raw vendor id out of the first column of the csv
                vendorId = vendorId[9:33]  # slice to remove the ObjectId label and parentheses
                listForExport.append(vendorId)  # add vendor ID as first item in list
                # categories section
                tempCatList = []  # temporary list of categories for second item in listForExport
                # this is line 41 where the error stems
                categories = json.loads(row[1]) #create's a dict with the categoreis from a given row
                for names in categories:  # loop through the category names using the key 'name'
                    print names['name']
This is what I get:
Cleaning Services
Traceback (most recent call last):
  File "csvtesting.py", line 57, in <module>
    ReadCSV(csvfile)
  File "csvtesting.py", line 41, in ReadCSV
    categories = json.loads(row[1]) #create's a dict with the categoreis from a given row
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-10: invalid continuation byte
So the code extracts the first category, Cleaning Services, but fails as soon as it reaches the non-ASCII characters.
How should I handle this? I am happy to drop all non-ASCII items.
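Best answer
Since you open the input CSV file in rb mode, I assume you are using Python 2.x. The good news is that the csv part is fine, because the csv reader reads plain bytes without trying to interpret them. The json module, however, insists on decoding the text to unicode and uses utf8 by default. Because your input file is not utf8 encoded, it chokes and raises UnicodeDecodeError.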
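Latin1 has a nice property: the unicode value of any byte is simply the value of that byte, so you are guaranteed to be able to decode anything. Whether the result makes sense depends on whether the actual encoding really is Latin1...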
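As a rough illustration of that property, here is a small sketch in Python 3 syntax (an assumption on my part, since the question uses Python 2). The bytes \xed\xa9 in raw_name are also an assumption, inferred from the error position 9-10 and from how the sample row renders; they are not taken from the actual file.

```python
# Assumed raw bytes for the failing name (inferred, not read from the file).
raw_name = b"Balloon D\xed\xa9cor"

# UTF-8 rejects this byte sequence, which is what json.loads ran into.
try:
    raw_name.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Latin-1 maps every byte 0..255 to the code point with the same number,
# so decoding cannot fail, whatever the real encoding was.
all_bytes = bytes(range(256))
code_points = [ord(ch) for ch in all_bytes.decode("latin-1")]
decoded_name = raw_name.decode("latin-1")

print(utf8_failed)                      # True
print(code_points == list(range(256)))  # True
print(ascii(decoded_name))              # escaped form of the decoded text
```

The decoded text is not necessarily what the author of the data meant; it is only guaranteed not to raise.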
So you could do:
categories = json.loads(row[1], encoding="Latin1")
Or, if you want to ignore the non-ASCII characters, you can first convert the byte string to unicode while ignoring errors, and only then load the json:
categories = json.loads(row[1].decode('ascii', errors='ignore'))  # ignore all non-ASCII characters
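For readers on Python 3, where the encoding argument of json.loads was removed in 3.9, both options can be sketched by decoding the bytes yourself before parsing. The helper name load_categories, its drop_non_ascii flag, and the sample cell (including the \xed\xa9 bytes) are hypothetical, made up for illustration only.

```python
import json

def load_categories(raw, drop_non_ascii=False):
    """Parse one raw category cell (bytes) into a list of dicts.

    Hypothetical helper: decodes as Latin-1 by default, or as ASCII
    with undecodable bytes silently dropped when drop_non_ascii is True.
    """
    if drop_non_ascii:
        text = raw.decode("ascii", errors="ignore")
    else:
        text = raw.decode("latin-1")
    return json.loads(text)

# Assumed sample cell, modelled on the second row in the question.
cell = b'[{"group":"Local","name":"Balloon D\xed\xa9cor"}]'

kept = load_categories(cell)
stripped = load_categories(cell, drop_non_ascii=True)

print(ascii(kept[0]["name"]))  # non-ASCII characters kept, shown escaped
print(stripped[0]["name"])     # Balloon Dcor
```

Dropping bytes is lossy (the name becomes "Balloon Dcor"), so the Latin-1 route may be preferable if the names are needed later.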
Regarding python - How do I handle non-ASCII characters in a CSV when using json.loads in Python?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44771837/