python - How do I handle different types of encoding in Python?

Tags: python python-2.7 encoding

I am using this short script in Python 2.7 to call the VirusTotal API (the point of this API is to upload a file so that it is scanned on the VirusTotal site):

def scanAFile(fileToScan):
    host = "www.virustotal.com"
    selector = "https://www.virustotal.com/vtapi/v2/file/scan"
    fields = [("apikey", myPublicKey)]
    file_to_send = open(fileToScan, "rb").read()
    files = [("file", fileToScan, file_to_send)]
    json = postfile.post_multipart(host, selector, fields, files)

    return simplejson.loads(json)

I found that every file I want to upload seems to need a different encoding; otherwise I get this error:

Traceback (most recent call last):
  File "/home/user/PythonDev/20150617_WW/agent_vt.py", line 139, in <module>
    scanQueue()
  File "/home/user/PythonDev/20150617_WW/agent_vt.py", line 76, in scanQueue
    jsonScan = scanAFile(fileToScan) #todo if file not found skip
  File "/home/user/PythonDev/20150617_WW/agent_vt.py", line 37, in scanAFile
    json = postfile.post_multipart(host, selector, fields, files)
  File "/home/user/PythonDev/20150617_WW/postfile.py", line 13, in post_multipart
    content_type, body = encode_multipart_formdata(fields, files)
  File "/home/user/PythonDev/20150617_WW/postfile.py", line 45, in encode_multipart_formdata
    body = CRLF.join(L)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

The file postfile.py is provided by VirusTotal on their website. This is the function where the encoding problem occurs:

def encode_multipart_formdata(fields, files):
    """
    fields is a sequence of (name, value) elements for regular form fields.
    files is a sequence of (name, filename, value) elements for data to be uploaded as files
    Return (content_type, body) ready for httplib.HTTP instance
    """
    BOUNDARY = '----------ThIs_Is_tHe_bouNdaRY_$'
    CRLF = '\r\n'
    L = []
    for (key, value) in fields:
        L.append('--' + BOUNDARY)
        L.append('Content-Disposition: form-data; name="%s"' % key)
        L.append('')
        L.append(value)
    for (key, filename, value) in files:
        L.append('--' + BOUNDARY)
        L.append('Content-Disposition: form-data; name="%s"; filename="%s"' % (key, filename))
        L.append('Content-Type: %s' % get_content_type(filename))
        L.append('')
        L.append(value)
    L.append('--' + BOUNDARY + '--')
    L.append('')
    body = CRLF.join(L)
    content_type = 'multipart/form-data; boundary=%s' % BOUNDARY
    return content_type, body
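
The traceback ends at `CRLF.join(L)`, which suggests that `L` holds a mix of text and raw bytes. The sketch below is a small, hedged illustration of that mechanism rather than the actual VirusTotal code: `try_join` is a hypothetical helper, and the assumption is that one of the values (for example the API key or the file name) is a `unicode` object while the file content is a byte string. On Python 2.7 the mixed join triggers an implicit ASCII decode of the bytes and raises `UnicodeDecodeError`; on Python 3 the same mix is rejected with a `TypeError`.

```python
def try_join(parts):
    """Join parts with CRLF and report the outcome instead of raising.

    Hypothetical helper, used only to illustrate the failure mode.
    """
    try:
        return "ok", b"\r\n".join(parts)
    except (UnicodeDecodeError, TypeError) as exc:
        # Python 2.7: UnicodeDecodeError (implicit ASCII decode of the bytes)
        # Python 3:   TypeError (text and bytes cannot be joined together)
        return type(exc).__name__, None


# All parts are byte strings: the join succeeds.
all_bytes = try_join([b"Content-Type: image/jpeg", b"\xff\xd8\xff"])

# One part is text (assumed here to be a header built from a unicode value):
# the join fails as soon as it meets the non-ASCII byte 0xff.
mixed = try_join([u"Content-Type: image/jpeg", b"\xff\xd8\xff"])

print(all_bytes[0])
print(mixed[0])
```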

As a temporary workaround, I added this code at the top of postfile.py:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

But having to update this every time is annoying. Is there a way to solve this properly?

Best answer

Try this library for encoding detection: http://github.com/chardet/chardet

pip install chardet

Then use it:

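import chardet
import postfile
import simplejson

def scanAFile(fileToScan):
    # myPublicKey is the API key defined elsewhere in the asker's script.
    host = "www.virustotal.com"
    selector = "https://www.virustotal.com/vtapi/v2/file/scan"
    fields = [("apikey", myPublicKey)]
    # Read the raw bytes first: chardet.detect() expects file content, not a path.
    with open(fileToScan, "rb") as f:
        raw = f.read()
    code = chardet.detect(raw)
    # code['encoding'] can be None (e.g. for binary files); decoding then fails.
    file_to_send = raw.decode(code['encoding'])
    files = [("file", fileToScan, file_to_send)]
    json = postfile.post_multipart(host, selector, fields, files)

    return simplejson.loads(json)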
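
Since the upload is arbitrary binary data, another way to avoid the implicit ASCII decode may be to keep every part of the request body as a byte string and never decode the file content at all. The following is only a sketch under that assumption, not the `postfile.py` shipped by VirusTotal: `to_bytes` and `encode_multipart_bytes` are hypothetical names, and `mimetypes.guess_type` stands in for the original `get_content_type` helper, whose body is not shown in the question.

```python
import mimetypes

BOUNDARY = "----------ThIs_Is_tHe_bouNdaRY_$"


def to_bytes(value, encoding="utf-8"):
    """Return value as a byte string, encoding text if necessary."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


def encode_multipart_bytes(fields, files):
    """Build a multipart/form-data body made only of byte strings.

    fields: sequence of (name, value) pairs for regular form fields.
    files:  sequence of (name, filename, data) triples; data is expected
            to be the raw bytes read with open(path, "rb").
    """
    boundary = to_bytes(BOUNDARY)
    parts = []
    for key, value in fields:
        parts.append(b"--" + boundary)
        parts.append(b'Content-Disposition: form-data; name="'
                     + to_bytes(key) + b'"')
        parts.append(b"")
        parts.append(to_bytes(value))
    for key, filename, data in files:
        # Fall back to a generic type when the extension is unknown.
        mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        parts.append(b"--" + boundary)
        parts.append(b'Content-Disposition: form-data; name="' + to_bytes(key)
                     + b'"; filename="' + to_bytes(filename) + b'"')
        parts.append(b"Content-Type: " + to_bytes(mime))
        parts.append(b"")
        parts.append(data)  # raw file bytes, appended without decoding
    parts.append(b"--" + boundary + b"--")
    parts.append(b"")
    body = b"\r\n".join(parts)
    content_type = "multipart/form-data; boundary=" + BOUNDARY
    return content_type, body


# Example with a text API key and a payload that starts with byte 0xff.
content_type, body = encode_multipart_bytes(
    [("apikey", u"my-key")],
    [("file", u"sample.bin", b"\xff\xd8\xff\x00")],
)
```

With this approach no encoding has to be guessed per file, because the file content is passed through untouched; only the short header values are encoded.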

Regarding "python - How do I handle different types of encoding in Python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31539355/
