python - How to fix "UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>"?

Tags: python unicode file-io sqlite decode

I am trying to get a Python 3 program, run from the Spyder IDE, to process a text file full of information. However, reading the file fails with the following error:

  File "<ipython-input-13-d81e1333b8cd>", line 77, in <module>
    parser(f)

  File "<ipython-input-13-d81e1333b8cd>", line 18, in parser
    data = infile.read()

  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>

The program code is as follows:

import os

os.getcwd()

import glob
import re
import sqlite3
import csv

def parser(file):

    # Open a TXT file. Store all articles in a list. Each article is an item
    # of the list. Split articles based on the location of such string as
    # 'Document PRN0000020080617e46h00461'

    articles = []
    with open(file, 'r') as infile:
        data = infile.read()
    start = re.search(r'\n HD\n', data).start()
    for m in re.finditer(r'Document [a-zA-Z0-9]{25}\n', data):
        end = m.end()
        a = data[start:end].strip()
        a = '\n   ' + a
        articles.append(a)
        start = end

    # In each article, find all used Intelligence Indexing field codes. Extract
    # content of each used field code, and write to a CSV file.

    # All field codes (order matters)
    fields = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA', 'CY', 'LP',
              'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC', 'IPD', 'PUB', 'AN']

    for a in articles:
        used = [f for f in fields if re.search(r'\n   ' + f + r'\n', a)]
        unused = [[i, f] for i, f in enumerate(fields) if not re.search(r'\n   ' + f + r'\n', a)]
        fields_pos = []
        for f in used:
            f_m = re.search(r'\n   ' + f + r'\n', a)
            f_pos = [f, f_m.start(), f_m.end()]
            fields_pos.append(f_pos)
        obs = []
        n = len(used)
        for i in range(0, n):
            used_f = fields_pos[i][0]
            start = fields_pos[i][2]
            if i < n - 1:
                end = fields_pos[i + 1][1]
            else:
                end = len(a)
            content = a[start:end].strip()
            obs.append(content)
        for f in unused:
            obs.insert(f[0], '')
        obs.insert(0, file.split('/')[-1].split('.')[0])  # insert Company ID, e.g., GVKEY
        # print(obs)
        cur.execute('''INSERT INTO articles
                       (id, hd, cr, wc, pd, et, sn, sc, ed, pg, la, cy, lp, td, ct, rf,
                       co, ina, ns, re, ipc, ipd, pub, an)
                       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', obs)

# Write to SQLITE
conn = sqlite3.connect('factiva.db')
with conn:
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS articles')
    # Mirror all field codes, except that 'IN' becomes 'ina' because IN is a reserved SQL keyword
    cur.execute('''CREATE TABLE articles
                   (nid integer primary key, id text, hd text, cr text, wc text, pd text,
                   et text, sn text, sc text, ed text, pg text, la text, cy text, lp text,
                   td text, ct text, rf text, co text, ina text, ns text, re text, ipc text,
                   ipd text, pub text, an text)''')
    for f in glob.glob('*.txt'):
        print(f)
        parser(f)

# Write to CSV to feed Stata
with open('factiva.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    with conn:
        cur = conn.cursor()
        cur.execute('SELECT * FROM articles WHERE hd IS NOT NULL')
        colname = [desc[0] for desc in cur.description]
        writer.writerow(colname)
        for obs in cur.fetchall():
            writer.writerow(obs)

Best answer

As you can see from https://en.wikipedia.org/wiki/Windows-1252, the byte 0x9D is not defined in CP1252.
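A minimal sketch of why that byte fails. It assumes the file is really UTF-8, which the question does not confirm; in UTF-8 the right double quotation mark (U+201D) is encoded as the bytes E2 80 9D, so it is one plausible source of a stray 0x9D.

```python
# Bytes of a right double quotation mark in UTF-8: b'\xe2\x80\x9d'
raw = "\u201d".encode("utf-8")

# Decoding those bytes as cp1252 fails, because 0x9D has no mapping there.
try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError as exc:
    cp1252_failed = True
    failing_byte = raw[exc.start]  # the byte the codec could not map

# Decoding with the encoding the bytes were written in works.
decoded = raw.decode("utf-8")
```

If the same bytes decode cleanly as UTF-8, that is a strong hint about the file's real encoding.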

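The "error" is in your open call: you did not specify an encoding, so Python falls back on the system's locale encoding, which on Windows is typically a legacy code page such as cp1252 (as your traceback shows). In general, if the file you are reading may not have been created on the same machine, it is better to specify the encoding explicitly.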
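A sketch of how to check the fallback encoding and how to override it. The sample file, its path, and its contents are made up for illustration, and UTF-8 is an assumption about the data, not something the question states.

```python
import locale
import os
import tempfile

# The encoding open() falls back on in text mode when none is given
# (for example 'cp1252' on many Windows installations).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Hypothetical sample file containing curly quotes, written as UTF-8 bytes.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write("He said \u201chello\u201d\n".encode("utf-8"))

# Passing encoding= makes the read independent of the machine's locale.
with open(path, "r", encoding="utf-8") as infile:
    data = infile.read()
```

In the question's parser function, the equivalent change would be adding encoding='utf-8' to the existing open(file, 'r') call, provided the input files really are UTF-8.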

I suggest also adding an encoding to the open call that writes the CSV. Being explicit really is better.
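A minimal sketch of the same idea on the writing side. The rows and the output file name are placeholders, and UTF-8 is an assumed choice; whatever reads the CSV afterwards has to be told the same encoding.

```python
import csv
import os
import tempfile

# Placeholder rows standing in for the query results.
rows = [["nid", "hd"], [1, "He said \u201chello\u201d"]]
out_path = os.path.join(tempfile.mkdtemp(), "factiva_sample.csv")

# Explicit encoding on the writer side, alongside newline='' as csv recommends.
with open(out_path, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(rows)

# Read it back with the same encoding to confirm the round trip.
with open(out_path, "r", newline="", encoding="utf-8") as csvfile:
    read_back = list(csv.reader(csvfile))
```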

I don't know the original file's encoding, but opening it with encoding='utf-8' is usually a good choice (it is typically the default on Linux and macOS).
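When the encoding is genuinely unknown, one possible approach is to try a strict UTF-8 read first and fall back to a lossy read. The helper name read_text and the sample bytes below are hypothetical; errors='replace' substitutes U+FFFD for undecodable bytes, so it loses information and is only a stopgap.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Hypothetical helper: strict read first, lossy read as a fallback."""
    try:
        with open(path, "r", encoding=encoding) as infile:
            return infile.read()
    except UnicodeDecodeError:
        # Undecodable bytes become U+FFFD instead of raising an exception.
        with open(path, "r", encoding=encoding, errors="replace") as infile:
            return infile.read()


# Sample file mixing an invalid byte (0xE9 alone) with valid UTF-8 (E2 80 9D).
path = os.path.join(tempfile.mkdtemp(), "mixed.txt")
with open(path, "wb") as f:
    f.write(b"caf\xe9 \xe2\x80\x9d")

text = read_text(path)
```

Searching the result for "\ufffd" afterwards shows where the file contained bytes that did not fit the assumed encoding.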

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/49562499/
