I'm writing a Python program that crawls websites and builds an index of them. When I run my current code I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 0: character maps to <undefined>
I'm not sure why this error occurs, but I believe it's caused by my regular expressions. I decode the text and then run it through several regexes to strip out links, brackets, hex values, and so on.
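For reference, the 'charmap' codec is what `print()` uses on a Windows console (cp1252), and U+200B (a zero-width space) has no mapping there, so the error fires when the string is *printed*, not when it is decoded. A minimal sketch reproducing the failure and one common workaround:

```python
# U+200B (zero-width space) has no slot in cp1252, the Windows console's
# default 'charmap' codec, so encoding it raises UnicodeEncodeError.
text = "\u200bHello"

try:
    text.encode("cp1252")  # same mapping print() uses on a cp1252 console
except UnicodeEncodeError as exc:
    print("encoding failed:", exc)

# Common workaround: drop (or replace) unmappable characters before printing.
safe = text.encode("cp1252", errors="ignore").decode("cp1252")
print(safe)  # "Hello" - the zero-width space is gone
```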
if isinstance(page_contents, bytes):  # bytes to string
    c = page_contents.decode('utf-8')
else:
    c = page_contents
if isinstance(c, bytes):
    print('page not converted to string')
## the regex route
c = re.sub(r'\\n|\\r|\\t', ' ', c)                              # remove escaped newlines, tabs
c = re.sub(r"\\'", "'", c)                                      # replace \' with '
c = re.sub(r'<script.*?script>', ' ', c, flags=re.DOTALL)       # remove scripts
c = re.sub(r'<!\[CDATA\[.*?\]\]', ' ', c, flags=re.DOTALL)      # remove CDATA ?redundant
c = re.sub(r'<link.*?link>|<link.*?>', ' ', c, flags=re.DOTALL) # remove links
c = re.sub(r'<style.*?style>', ' ', c, flags=re.DOTALL)         # remove styles
c = re.sub(r'<.*?>', ' ', c, flags=re.DOTALL)                   # remove HTML tags
c = re.sub(r'\\x..', ' ', c)                                    # remove escaped hex values
c = re.sub(r'<!--|-->', ' ', c)                                 # remove comment markers
c = re.sub(r'<|>', ' ', c)                                      # remove stray angle brackets
c = re.sub(r'&.*?;|#.*?;', ' ', c)                              # remove HTML entities
page_text = re.sub(r'\s+', ' ', c)                              # collapse whitespace runs into single spaces
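To see what a pipeline like this does, here is a minimal sketch of a few of the steps above run against a hypothetical sample page (the sample HTML is illustrative, not from the original program; regex-based HTML stripping is fragile in general, and an HTML parser is more robust):

```python
import re

# Hypothetical sample page to run the cleaning steps against.
sample = "<html><script>var x = 1;</script><p>Hello &amp; welcome</p></html>"

c = re.sub(r'<script.*?script>', ' ', sample, flags=re.DOTALL)  # drop scripts
c = re.sub(r'<.*?>', ' ', c, flags=re.DOTALL)                   # drop remaining tags
c = re.sub(r'&.*?;', ' ', c)                                    # drop HTML entities
c = re.sub(r'\s+', ' ', c).strip()                              # collapse whitespace

print(c)  # "Hello welcome"
```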
I then split the document into individual words, and sort and process them. But something goes wrong when I print the results: the loop prints the data for the first URL (document) correctly, but when it moves on to the second one the output is wrong.
docids.append(url)
docid = str(docids.index(url))

##### stemming and other processing goes here #####
# page_text is the initial content, transformed to words
words = page_text
# Send document to stemmer
stemmed_doc = stem_doc(words)

# add the vocab counts and postings
for word in stemmed_doc.split():
    if word in vocab:
        vocab[word] += 1
    else:
        vocab[word] = 1
    if word not in postings:
        postings[word] = [docid]
    elif docid not in postings[word]:
        postings[word].append(docid)
    print('make_index3: docid=', docid, ' word=', word, ' count=', vocab[word], ' postings=', postings[word])
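The vocab/postings bookkeeping above can be sketched in isolation like this, using a hypothetical two-document corpus and skipping the stemming step:

```python
# Minimal sketch of the vocab/postings bookkeeping, run on a
# hypothetical two-document corpus (stemming omitted for brevity).
vocab = {}     # word -> total occurrence count across all documents
postings = {}  # word -> list of docids whose documents contain the word

docs = {"0": "apple banana apple", "1": "banana cherry"}
for docid, text in docs.items():
    for word in text.split():
        vocab[word] = vocab.get(word, 0) + 1
        if word not in postings:
            postings[word] = [docid]
        elif docid not in postings[word]:
            postings[word].append(docid)

print(vocab)     # {'apple': 2, 'banana': 2, 'cherry': 1}
print(postings)  # {'apple': ['0'], 'banana': ['0', '1'], 'cherry': ['1']}
```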
I'd like to know whether this error is caused by an incorrect regex or by something else.
Solved

I added the expression

c = re.sub(r'[\W_]+', ' ', c)

which replaces every run of non-alphanumeric characters with a single space.
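This works because U+200B is not a word character, so `[\W_]+` swallows it along with punctuation. A minimal check:

```python
import re

# U+200B (zero-width space) is not matched by \w, so [\W_]+ removes it
# together with underscores and punctuation.
raw = "foo\u200bbar_baz!"
cleaned = re.sub(r'[\W_]+', ' ', raw)
print(cleaned)  # "foo bar baz "
```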
Best Answer

The problem you're running into looks like an encoding issue, not a regex issue. Have you tried changing

c = page_contents.decode('utf-8')

to use a different encoding, for example:

c = page_contents.decode('latin-1')

?
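For context on why latin-1 is a safe fallback: it maps every possible byte value 0x00-0xFF to a code point, so decoding can never raise. A small sketch with a hypothetical byte string:

```python
# 'café' encoded as latin-1; the lone 0xE9 byte is invalid as UTF-8.
data = b"caf\xe9"

try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")

print(data.decode("latin-1"))           # café - latin-1 never fails
print(data.decode("utf-8", "replace"))  # caf� - another fallback, keeps UTF-8
```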
This question about removing all HTML data with Python regexes was originally asked on Stack Overflow: https://stackoverflow.com/questions/33524332/