I'm writing a Python program that crawls websites and builds an index of them. When I run my current code I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 0: character maps to <undefined>
I'm not sure why this error occurs, but I believe it's caused by my regular expressions. I decode the text and then run it through several regexes to strip out links, brackets, hex values, and so on.
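For reference, the 'charmap' codec is what `print()` uses on a Windows console (cp1252), and U+200B (a zero-width space) has no mapping there, so the error fires when the string is *printed*, not when it is decoded. A minimal sketch reproducing the failure and one common workaround:

```python
# U+200B (zero-width space) has no slot in cp1252, the Windows console's
# default 'charmap' codec, so encoding it raises UnicodeEncodeError.
text = "\u200bHello"

try:
    text.encode("cp1252")  # same mapping print() uses on a cp1252 console
except UnicodeEncodeError as exc:
    print("encoding failed:", exc)

# Common workaround: drop (or replace) unmappable characters before printing.
safe = text.encode("cp1252", errors="ignore").decode("cp1252")
print(safe)  # "Hello" - the zero-width space is gone
```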
if isinstance(page_contents, bytes):  # bytes to string
    c = page_contents.decode('utf-8')
else:
    c = page_contents
if isinstance(c, bytes):
    print('page not converted to string')
## the regex route
c = re.sub(r'\\n|\\r|\\t', ' ', c)                              # remove escaped newlines, tabs
c = re.sub(r"\\'", "'", c)                                      # replace \' with '
c = re.sub(r'<script.*?script>', ' ', c, flags=re.DOTALL)       # remove scripts
c = re.sub(r'<!\[CDATA\[.*?\]\]', ' ', c, flags=re.DOTALL)      # remove CDATA ?redundant
c = re.sub(r'<link.*?link>|<link.*?>', ' ', c, flags=re.DOTALL) # remove links
c = re.sub(r'<style.*?style>', ' ', c, flags=re.DOTALL)         # remove styles
c = re.sub(r'<.*?>', ' ', c, flags=re.DOTALL)                   # remove HTML tags
c = re.sub(r'\\x..', ' ', c)                                    # remove escaped hex values
c = re.sub(r'<!--|-->', ' ', c)                                 # remove comment markers
c = re.sub(r'<|>', ' ', c)                                      # remove stray angle brackets
c = re.sub(r'&.*?;|#.*?;', ' ', c)                              # remove HTML entities
page_text = re.sub(r'\s+', ' ', c)                              # collapse whitespace runs into single spaces
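To see what a pipeline like this does, here is a minimal sketch of a few of the steps above run against a hypothetical sample page (the sample HTML is illustrative, not from the original program; regex-based HTML stripping is fragile in general, and an HTML parser is more robust):

```python
import re

# Hypothetical sample page to run the cleaning steps against.
sample = "<html><script>var x = 1;</script><p>Hello &amp; welcome</p></html>"

c = re.sub(r'<script.*?script>', ' ', sample, flags=re.DOTALL)  # drop scripts
c = re.sub(r'<.*?>', ' ', c, flags=re.DOTALL)                   # drop remaining tags
c = re.sub(r'&.*?;', ' ', c)                                    # drop HTML entities
c = re.sub(r'\s+', ' ', c).strip()                              # collapse whitespace

print(c)  # "Hello welcome"
```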
I then split the document into individual words, and sort and process them. But something goes wrong when I print the results: the loop prints the data for the first URL (document) correctly, but when it moves on to the second one the output is wrong.
docids.append(url)
docid = str(docids.index(url))

##### stemming and other processing goes here #####
# page_text is the initial content, transformed to words
words = page_text
# Send document to stemmer
stemmed_doc = stem_doc(words)

# add the vocab counts and postings
for word in stemmed_doc.split():
    if word in vocab:
        vocab[word] += 1
    else:
        vocab[word] = 1
    if word not in postings:
        postings[word] = [docid]
    elif docid not in postings[word]:
        postings[word].append(docid)
    print('make_index3: docid=', docid, ' word=', word, ' count=', vocab[word], ' postings=', postings[word])
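The vocab/postings bookkeeping above can be sketched in isolation like this, using a hypothetical two-document corpus and skipping the stemming step:

```python
# Minimal sketch of the vocab/postings bookkeeping, run on a
# hypothetical two-document corpus (stemming omitted for brevity).
vocab = {}     # word -> total occurrence count across all documents
postings = {}  # word -> list of docids whose documents contain the word

docs = {"0": "apple banana apple", "1": "banana cherry"}
for docid, text in docs.items():
    for word in text.split():
        vocab[word] = vocab.get(word, 0) + 1
        if word not in postings:
            postings[word] = [docid]
        elif docid not in postings[word]:
            postings[word].append(docid)

print(vocab)     # {'apple': 2, 'banana': 2, 'cherry': 1}
print(postings)  # {'apple': ['0'], 'banana': ['0', '1'], 'cherry': ['1']}
```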
I'd like to know whether this error is caused by an incorrect regex or by something else.
Solved

I added the expression

c = re.sub(r'[\W_]+', ' ', c)

which replaces every run of non-alphanumeric characters with a single space.
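This works because U+200B is not a word character, so `[\W_]+` swallows it along with punctuation. A minimal check:

```python
import re

# U+200B (zero-width space) is not matched by \w, so [\W_]+ removes it
# together with underscores and punctuation.
raw = "foo\u200bbar_baz!"
cleaned = re.sub(r'[\W_]+', ' ', raw)
print(cleaned)  # "foo bar baz "
```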
Best Answer

The problem you're running into looks like an encoding issue, not a regex issue. Have you tried changing

c = page_contents.decode('utf-8')

to use a different encoding, for example:

c = page_contents.decode('latin-1')

?
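For context on why latin-1 is a safe fallback: it maps every possible byte value 0x00-0xFF to a code point, so decoding can never raise. A small sketch with a hypothetical byte string:

```python
# 'café' encoded as latin-1; the lone 0xE9 byte is invalid as UTF-8.
data = b"caf\xe9"

try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")

print(data.decode("latin-1"))           # café - latin-1 never fails
print(data.decode("utf-8", "replace"))  # caf� - another fallback, keeps UTF-8
```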
This question about removing all HTML data with Python regexes was originally asked on Stack Overflow: https://stackoverflow.com/questions/33524332/