I'm a bit confused about how to unescape characters in Python. I'm parsing some HTML with BeautifulSoup, and when I retrieve the text content it looks like this:
\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support
I want it to look like this:
State-of-the-art security and 100% uptime SLA. Outstanding support
Here is my code:
self.__page = requests.get(url)
self.__soup = BeautifulSoup(self.__page.content, "lxml")
self.__page_cleaned = self.__removeTags(self.__page.content) #remove script and style tags
self.__tree = html.fromstring(self.__page_cleaned) #contains the page html in a tree structure
page_data = {}
page_data["content"] = self.__tree.text_content()
How do I remove those encoded backslash characters? I've looked everywhere, but nothing has worked.
Best answer
You can use the codecs module to convert those escape sequences into the proper text.
import codecs
s = r'\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support'
# Convert the escape sequences
z = codecs.decode(s, 'unicode-escape')
print(z)
print('- ' * 20)
# Remove the extra whitespace
print(' '.join(z.split()))
Output
[several blank lines here]
State-of-the-art security and 100% uptime SLA.
Outstanding support
- - - - - - - - - - - - - - - - - - - -
State-of-the-art security and 100% uptime SLA. Outstanding support
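The last step works because str.split() with no arguments splits on any run of Unicode whitespace, and Python counts the no-break space (U+00A0) as whitespace. A minimal sketch of just that step, where normalize_whitespace is an illustrative name and not something from the question's code:

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (including U+00A0) into one space."""
    # split() with no arguments also drops leading and trailing whitespace.
    return ' '.join(text.split())

# Real characters here, not backslash escapes.
raw = '\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support'
print(normalize_whitespace(raw))
```

If the string you get back already holds real characters, and the backslashes only show up when you look at its repr, this step alone may be enough.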
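The codecs.decode(s, 'unicode-escape') function is very versatile. It can handle simple backslash escapes, such as the newline and carriage return sequences (\n and \r), but its main strength is handling Unicode escape sequences such as \u00a0, which is just a no-break space character. If your data contains other Unicode escapes, for example for non-English letters or emoji, it will handle those too.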
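As a small illustration of the other escape forms, here is a sketch with a made-up sample string (not data from the question) that mixes a \u escape, a \U escape and a plain \t:

```python
import codecs

# Raw string, so these are literal backslash sequences, not real characters.
s = r'caf\u00e9 \U0001f600 tab\there'

# Every escape is replaced by the character it names.
z = codecs.decode(s, 'unicode-escape')
print(z)
```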
As Evpok mentions in the comments, this won't work if the text string contains actual non-ASCII Unicode characters as well as Unicode \u or \U escape sequences.
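One commonly used workaround for that mixed case, which is not part of the original answer, is to encode the text to Latin-1 with the backslashreplace error handler first, so that every character ends up either as a Latin-1 byte or as an escape, and only then decode. A minimal sketch, with decode_escapes as a hypothetical helper name and a made-up sample string:

```python
import codecs

def decode_escapes(text):
    """Decode backslash escapes in text that may also hold real non-ASCII characters."""
    # Characters outside Latin-1 become \uXXXX or \UXXXXXXXX escapes; Latin-1
    # characters become single bytes, which unicode-escape reads back as Latin-1.
    return text.encode('latin-1', 'backslashreplace').decode('unicode-escape')

# Real 'é' and '€' next to a literal backslash-u sequence.
mixed = 'café €5' + r'\u00a0' + 'ok'

naive = codecs.decode(mixed, 'unicode-escape')  # the é and € come out mangled
fixed = decode_escapes(mixed)
print(repr(naive))
print(repr(fixed))
```

Like the plain codecs.decode call, this still treats every backslash in the text as the start of an escape, so a stray or trailing backslash in the data will raise an error.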
From the codecs docs:
unicode_escape
Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default.
See also the documentation for codecs.decode.
Regarding "python - How do I unescape characters in Python 3.6?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47104742/