python - Parsing unicode in tweets with Python

Tags: python unicode utf-8

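I have tried all sorts of ways to read the tweets in this file (sample below). The unicode character Victory Hand (U+270C) does not seem to want to parse. Here is a sample of the data.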

399491624029274112,Kyle aka K-LO,I unlocked 2 Xbox Live achievements in WWE 2K14! http://t.co/wRIxZTjYWg,False,0,Raptr,,,,,2013,11,10,11,0,0,0,0,1,0,0,0,0,0
399491626584014848,Dots Group LLC,GeekWire Radio: Amazon vs. author  Xbox One first take  and favorite iPad apps - GeekWire http://t.co/jbbryoHpHe,False,0,IFTTT,,,,,2013,11,10,11,0,0,0,0,1,0,0,0,0,2
399491630149169152,BETTINGGENIUS!,RT @xJohn69: Sergio Ramos giveaway!; XBOX + PS3; ; -RT; -Follow me and @NeillWagers; -S/Os appreciated; ; Goodluck http://t.co/D997faGSB5,False,0,Twitter for iPad,,,,,2013,11,10,11,0,1,1,0,1,0,0,0,0,2
399491635735953408,Princess of TV,Toy Story of Terror is amaze balls. Thanks Xbox for the free NowTV #disneyweekend,False,0,Twitter for iPhone,,,,,2013,11,10,11,0,2,0,0,1,0,0,0,0,2
399491654136369152,Sam Hambre,'9 Things You Should Know Before Buying a PlayStation 4'  http://t.co/Q3Ma1R83cF,False,0,Buffer,,,,,2013,11,10,11,0,7,0,1,0,0,0,0,0,0
399491655780167680,Rhi ✌,@Escape2theMoon that's done what? im not on rn obvs i dont even have access to an xbox :c ?,False,0,web,399490703761223680,Escape2theMoon,1404625770,,2013,11,10,11,0,7,0,0,1,0,0,0,0,0

You can see the Victory Hand in the second field of the last tweet.

What I want to do is build one long string from all of the tweets. It should be simple, yet I can't even get this script to work:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
import csv

current_file = codecs.open("C:/myfile.csv", encoding="utf-8")
data = csv.reader(current_file, delimiter=",")

tweets = ""

for record in data:
    tweets = tweets + " " + record[2].encode('utf-8', errors='replace')

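I have tried many permutations of importing, encoding, concatenating, converting to unicode, and so on, but I can't get past the Victory Hand. The error I always get is: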

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-114-fd9b136abd74> in <module>()
----> 1 for record in data:
      2     tweets = tweets + ' ' + record[2].encode('utf-8', 'replace')

UnicodeEncodeError: 'ascii' codec can't encode character u'\u270c' in position 23: ordinal not in range(128)

What am I doing wrong? How can I concatenate all of these tweets into one string without running into unicode problems?

Best answer

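The problem is with csv.reader, which (in Python 2) tries to convert the unicode input back to ASCII. That is why the traceback points at the `for record in data:` line rather than at the `encode` call. A note from the csv docs: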

This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
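The failure can be reproduced without the csv module at all. The sketch below is written in Python 3 syntax, and its explicit `encode("ascii")` is only a stand-in (an assumption for illustration) for the implicit conversion that Python 2's csv.reader applies to each unicode line. The sample line is shortened from the last row of the data. The error position comes out as 23, the same as in the traceback, which matches the hand's offset in the whole line rather than in `record[2]`.

```python
# Python 3 syntax; the explicit encode() stands in for the implicit
# ASCII conversion that Python 2's csv.reader applies to each line.
line = "399491655780167680,Rhi \u270c,@Escape2theMoon that's done what?"

try:
    line.encode("ascii")
    error_position = None
except UnicodeEncodeError as exc:
    # .start is the index of the first character that could not be encoded.
    error_position = exc.start
    print(exc)

print(error_position)

# UTF-8 can represent U+270C, so encoding the same line as UTF-8 succeeds.
utf8_line = line.encode("utf-8")
```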

As suggested there, you can use this recipe from the docs examples:

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

With the unicode_csv_reader helper, your code might look like this (slightly modified to use a `with` block and `join` instead of a loop):

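import codecs
from operator import itemgetter

tweets_fname = "C:/myfile.csv"

# Decode the file as UTF-8 so the reader receives unicode lines.
with codecs.open(tweets_fname, encoding="utf-8") as current_file:
    data = unicode_csv_reader(current_file, delimiter=",")
    # Field 2 of each row holds the tweet text.
    tweets = u' '.join(map(itemgetter(2), data))
    # Encode once at the end, if a byte string is needed.
    encoded_tweets = tweets.encode('utf-8', 'replace')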
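The recipe above is specific to Python 2. If Python 3 is an option, the csv module reads text directly, so the encode/decode round trip is not needed. A minimal sketch, assuming Python 3 and a UTF-8 encoded file; `join_tweets` and its `column` parameter are hypothetical names introduced here, not part of the original answer:

```python
import csv


def join_tweets(path, column=2):
    """Join one column of a UTF-8 CSV file into a single string."""
    # newline="" is the form the csv documentation recommends for files
    # handed to csv.reader; encoding="utf-8" decodes the bytes up front.
    with open(path, encoding="utf-8", newline="") as handle:
        return " ".join(row[column] for row in csv.reader(handle))
```

Calling `join_tweets("C:/myfile.csv")` would return the tweet text of every row separated by single spaces, as a `str` that can be encoded later if bytes are needed.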

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/20137321/
