I'm using both XPath and BeautifulSoup to scrape web pages. XPath needs a tree as input, while BeautifulSoup needs a soup. Here is the code that builds the tree and the soup:
import requests
from bs4 import BeautifulSoup
from lxml import html

# get tree
def get_tree(url):
    r = requests.get(url)
    tree = html.fromstring(r.content)
    return tree

# get soup
def get_soup(url):
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data)
    return soup
Both functions call requests.get(url), so the response is what I want to store ahead of time. Here is my first attempt in Python:
import requests
url = "http://www.nytimes.com/roomfordebate/2013/10/28/should-you-bribe-your-kids"
r = requests.get(url)
f = open('html','wb')
f.write(r)
That gives me this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be convertible to a buffer, not Response
Here is a second attempt that stores the response content instead, but reading it back raises an error:
import requests
from lxml import html
url = "http://www.nytimes.com/roomfordebate/2013/02/13/when-divorce-is-a-family-affair"
r = requests.get(url)
c = r.content
outfile = open("html", "wb")
outfile.write(c)
outfile.close()
infile = open("html", "rb")
tree = html.fromstring(infile)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/lxml/html/__init__.py", line 662, in fromstring
    start = html[:10].lstrip().lower()
TypeError: 'file' object has no attribute '__getitem__'
How can I fix this?
Best answer
infile = open("html", "rb")  # this is a file object, not a string
html.fromstring() expects a string, so you need to read the file with read() first, not just open it :-)
infile = open("html", "rb")
infile=infile.read()
tree = html.fromstring(infile)
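
Putting the pieces together, the following is a minimal sketch of one way to download a page once and then build both the tree and the soup from the stored copy. The function names (save_page, load_bytes, get_tree_from_file, get_soup_from_file) are hypothetical, and passing "html.parser" to BeautifulSoup is an assumption made here to pin the parser; neither comes from the original question. It also addresses the first error: a file opened in "wb" mode needs bytes, so write r.content rather than the Response object itself.

```python
import requests
from bs4 import BeautifulSoup
from lxml import html


def save_page(url, path):
    """Download url once and store the raw response body at path."""
    r = requests.get(url)
    r.raise_for_status()
    # r is a Response object; r.content is the body as bytes,
    # which is what a file opened in "wb" mode can write.
    with open(path, "wb") as f:
        f.write(r.content)


def load_bytes(path):
    """Read the stored page back as bytes."""
    with open(path, "rb") as f:
        return f.read()


def get_tree_from_file(path):
    """Build an lxml tree (for XPath) from the stored page."""
    return html.fromstring(load_bytes(path))


def get_soup_from_file(path):
    """Build a BeautifulSoup object from the stored page."""
    return BeautifulSoup(load_bytes(path), "html.parser")
```

With this layout, save_page(url, "html") is called once, and get_tree_from_file("html") and get_soup_from_file("html") can then be called as often as needed without hitting the network again.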
Source: "python - storing HTML in Python", a similar question on Stack Overflow: https://stackoverflow.com/questions/26810651/