python - Why can't a Python regular expression handle a formatted HTML string?

Tags: python regex

from bs4 import BeautifulSoup
import urllib
import re

soup = urllib.urlopen("http://atlanta.craigslist.org/cto/")
soup = BeautifulSoup(soup)
souped = soup.p
print souped
m = re.search("\\$.",souped)
print m.group(0)

I can download and print the HTML just fine, but the script always breaks as soon as I add the last two lines.

I get this error:

Traceback (most recent call last):
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 323, in RunScript
    debugger.run(codeObject, __main__.__dict__, start_stepping=0)
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\__init__.py", line 60, in run
    _GetCurrentDebugger().run(cmd, globals,locals, start_stepping)
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\debugger.py", line 655, in run
    exec cmd in globals, locals
  File "C:\Users\Zack\Documents\Scripto.py", line 1, in <module>
    from bs4 import BeautifulSoup
  File "C:\Python27\lib\re.py", line 142, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

Thanks a lot!

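Best answer

You probably need re.search("\\$.", str(souped)). soup.p is a BeautifulSoup Tag object, not a string, and re.search only accepts a string or buffer as its second argument, which is why it raises the TypeError.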
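As a minimal sketch of why the conversion matters, the snippet below reproduces the error and the fix without touching the network. FakeTag, find_price and the sample markup are hypothetical stand-ins invented for illustration; FakeTag only mimics the one relevant property of a bs4 Tag, namely that it prints like HTML but is not itself a string.

```python
import re


class FakeTag(object):
    """Hypothetical stand-in for a bs4 Tag: renders as markup, but is not a str."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


def find_price(tag):
    """Return the first dollar amount found in the tag's markup, or None."""
    # str(tag) turns the object into plain text that the re module accepts.
    match = re.search(r"\$\d+", str(tag))
    return match.group(0) if match else None


# Made-up sample markup, standing in for the first <p> of the listing page.
souped = FakeTag('<p class="row"><a href="#">1999 Honda Civic</a> $2500</p>')

# Passing the object directly fails, just like in the question.
try:
    re.search(r"\$.", souped)
    raised = False
except TypeError:
    raised = True

price = find_price(souped)
print(raised, price)
```

With a real BeautifulSoup Tag the same str() call applies. If only the visible text is wanted rather than the markup, the Tag's get_text() method is another option. Checking the match for None before calling group(0) also avoids an AttributeError when the paragraph contains no price.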

For "python - Why can't a Python regular expression handle a formatted HTML string?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/9446260/
