I've written a script in Python using two different links (one with pagination, one without) to see whether my script can fetch all the next-page links. If there is no pagination option, the script should print the line No pagination found.
I applied a @check_pagination decorator to check whether pagination exists, and I want to keep this decorator in my scraper.
I've achieved the above with the following:
import requests
from bs4 import BeautifulSoup

urls = [
    "https://www.mobilehome.net/mobile-home-park-directory/maine/all",
    "https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all"
]

def check_pagination(f):
    def wrapper(lead):
        if not lead.pages:
            print('No pagination found')
        return f(lead)
    return wrapper

class LinkScraper:
    def __init__(self, url):
        self.url = url
        self.home_page = requests.get(self.url).text
        self.soup = BeautifulSoup(self.home_page, "lxml")
        self.pages = [item.text for item in self.soup.find('div', {'class': 'pagination'}).find_all('a')][:-1]

    @check_pagination
    def __iter__(self):
        for p in self.pages:
            link = requests.get(f'{self.url}/page/{p}')
            yield link.url

for url in urls:
    d = [page for page in LinkScraper(url)]
    print(d)
Now I'd like to do the same thing without using a class, keeping the decorator in the script to check for pagination, but it seems I've made a mistake somewhere with the decorator, which is why it doesn't print No pagination found even when a link has no pagination. Any help fixing this would be greatly appreciated.
import requests
from bs4 import BeautifulSoup

urls = [
    "https://www.mobilehome.net/mobile-home-park-directory/maine/all",
    "https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all"
]

def check_pagination(f):
    def wrapper(*args, **kwargs):
        if not f(*args, **kwargs):
            print("No pagination found")
        return f(*args, **kwargs)
    return wrapper

def get_base(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    return [item.text for item in soup.find('div', {'class': 'pagination'}).find_all('a')][:-1]

@check_pagination
def get_links(num):
    link = requests.get(f'{url}/page/{num}')
    return link.url

if __name__ == '__main__':
    for url in urls:
        links = [item for item in get_base(url)]
        for link in links:
            print(get_links(link))
Best answer

Just apply the decorator to get_base instead. In your version, the decorated get_links always returns a URL string, which is truthy, so not f(*args, **kwargs) can never be true and the message is never printed. get_base is the function that returns the (possibly empty) list of page numbers, so that is where the check belongs. Note also that your wrapper called f twice, making two requests per call; the version below calls it once and stores the result:
def check_pagination(f):
    def wrapper(*args, **kwargs):
        result = f(*args, **kwargs)
        if not result:
            print("No pagination found")
        return result
    return wrapper

@check_pagination
def get_base(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    return [item.text for item in soup.find('div', {'class': 'pagination'}).find_all('a')][:-1]

def get_links(num):
    link = requests.get(f'{url}/page/{num}')
    return link.url

if __name__ == '__main__':
    for url in urls:
        links = [item for item in get_base(url)]
        for link in links:
            print(get_links(link))
Regarding "python - Unable to handle two links having different pagination using a decorator", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53638221/