python - 如何在 python 3.6 中取消转义字符？

我对如何在 python 中转义字符有点困惑。我正在使用 BeautifulSoup 解析一些 HTML，当我检索文本内容时，它看起来像这样:

\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support

我希望它看起来像这样:

State-of-the-art security and 100% uptime SLA. Outstanding support

下面是我的代码:

    self.__page = requests.get(url)
    self.__soup = BeautifulSoup(self.__page.content, "lxml")
    self.__page_cleaned = self.__removeTags(self.__page.content) #remove script and style tags
    self.__tree = html.fromstring(self.__page_cleaned) #contains the page html in a tree structure
    page_data = {}
    page_data["content"] =  self.__tree.text_content()

如何删除那些编码的反斜杠字符？我到处都找过了，但没有任何效果。

最佳答案

您可以使用编解码器模块将这些转义序列转换为正确的文本。

import codecs

s = r'\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support'

# Convert the escape sequences
z = codecs.decode(s, 'unicode-escape')
print(z)
print('- ' * 20)

# Remove the extra whitespace
print(' '.join(z.split()))

输出

    [several blank lines here]
 



State-of-the-art security and 100% uptime SLA. 



Outstanding support
- - - - - - - - - - - - - - - - - - - - 
State-of-the-art security and 100% uptime SLA. Outstanding support

codecs.decode(s, 'unicode-escape') 函数非常通用。它可以处理简单的反斜杠转义，例如换行符和回车序列(\n 和 \r)，但其主要优势是处理 Unicode 转义序列，例如 \u00a0，这只是一个不间断空格字符。但是，如果您的数据中有其他 Unicode 转义，例如外来字母字符或表情符号，它也会处理它们。

<小时/>

正如 Evpok 在评论中提到的，如果文本字符串包含实际的 Unicode 字符以及 Unicode \u 或 \U<，则此不会工作 转义序列。

来自codecs docs :

unicode_escape

Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default.

另请参阅 codecs.decode 的文档.

关于python - 如何在 python 3.6 中取消转义字符？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/47104742/

python - 如何在 python 3.6 中取消转义字符？

上一篇：python - 将分类值的行放入 pandas 的列中

下一篇：python - Python 中的有序字典