python - Get a list of all pagination URLs from the links in a txt file with Python requests

Tags: python list function python-requests

Hi all, I want to define a function in Python that takes the links from a txt file and, for each one, gets a list of all the pagination URLs at the bottom of the page.

Here is an example of what I need to accomplish.

Input link

http://www.apartmentguide.com/apartments/Alabama/Hartselle/

Desired output

www.apartmentguide.com/apartments/Alabama/Hartselle/?page=2
www.apartmentguide.com/apartments/Alabama/Hartselle/?page=3
www.apartmentguide.com/apartments/Alabama/Hartselle/?page=4
www.apartmentguide.com/apartments/Alabama/Hartselle/?page=5
www.apartmentguide.com/apartments/Alabama/Hartselle/?page=6
www.apartmentguide.com/apartments/Alabama/Hartselle/?page=7
www.apartmentguide.com/apartments/Alabama/Hartselle/?page=8
www.apartmentguide.com/apartments/Alabama/Hartselle/?page=9

And so on, up to however many pages each input URL has.

This is the function I have written so far, but it doesn't work, and I'm not very good at Python.

import requests
#from bs4 import BeautifulSoup
from scrapy import Selector as Se
import urllib2


lists = open("C:\Users\Administrator\Desktop\\3.txt","r")
read_list = lists.read()
line = read_list.split("\n")


def get_links(line):
    for each in line:
        r = requests.get(each)
        sel = Se(text=r.text, type="html")
        next_ = sel.xpath('//a[@class="next sprite"]//@href').extract()
        for next_1 in next_:
            next_2 = "http://www.apartmentguide.com"+next_1
            print next_2
        get_links(next_1)

get_links(line)

Best answer

Here are two ways to do it.

import mechanize

import requests
from bs4 import BeautifulSoup, SoupStrainer
import urlparse

import pprint

#-- Mechanize --
br = mechanize.Browser()

def get_links_mechanize(root):
    links = []
    br.open(root)

    for link in br.links():
        try:
            if dict(link.attrs)['class'] == 'page':
                links.append(link.absolute_url)
        except:
            pass
    return links


#-- Requests / BeautifulSoup / urlparse --
def get_links_bs(root):
    links = []
    r = requests.get(root)

    for link in BeautifulSoup(r.text, parse_only=SoupStrainer('a')):
        if link.has_attr('href') and link.has_attr('class') and 'page' in link.get('class'):
            links.append(urlparse.urljoin(root, link.get('href')))

    return links


#with open("C:\Users\Administrator\Desktop\\3.txt","r") as f:
#    for root in f:
#        links = get_links(root) 
#        # <Do something with links>
root = 'http://www.apartmentguide.com/apartments/Alabama/Hartselle/'

print "Mech:"
pprint.pprint( get_links_mechanize(root) )
print "Requests/BS4/urlparse:"
pprint.pprint( get_links_bs(root) )

One of them uses mechanize -- it is a bit smarter about URLs, but it is a lot slower and may be overkill depending on what else you are doing.

The other uses requests to fetch the page (urllib2 would be sufficient), BeautifulSoup to parse the markup, and urlparse to turn the relative URLs in the page you listed into absolute URLs.
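
For reference, urljoin just resolves an href against the page it was found on; a minimal illustration (the href value here is only an example of a root-relative pagination link, not taken from the live page):

import urlparse   # urllib.parse in Python 3

root = 'http://www.apartmentguide.com/apartments/Alabama/Hartselle/'
href = '/apartments/Alabama/Hartselle/?page=2'   # example relative href

print(urlparse.urljoin(root, href))
# http://www.apartmentguide.com/apartments/Alabama/Hartselle/?page=2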

Note that both of these functions return the following list:

['http://www.apartmentguide.com/apartments/Alabama/Hartselle/?page=2',
 'http://www.apartmentguide.com/apartments/Alabama/Hartselle/?page=3',
 'http://www.apartmentguide.com/apartments/Alabama/Hartselle/?page=4',
 'http://www.apartmentguide.com/apartments/Alabama/Hartselle/?page=5',
 'http://www.apartmentguide.com/apartments/Alabama/Hartselle/?page=2',
 'http://www.apartmentguide.com/apartments/Alabama/Hartselle/?page=3',
 'http://www.apartmentguide.com/apartments/Alabama/Hartselle/?page=4',
 'http://www.apartmentguide.com/apartments/Alabama/Hartselle/?page=5']

There are duplicates in it. You can eliminate the duplicates by changing

return links

to

return list(set(links))

in whichever method you choose.
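
Note that set() also discards the page order; if the order matters, an order-preserving variant could be used instead (a sketch, assuming links is the list built in either function above):

def dedupe_keep_order(links):
    # Keep only the first occurrence of each URL, preserving page order
    seen = set()
    unique = []
    for url in links:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique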

Edit:

I noticed that the functions above only return links for pages 2-5; you have to navigate through those pages to see that there are actually 10 pages.

A completely different approach would be to scrape the "root" page for the number of results, work out how many pages that implies, and then build the links from that.

Since there are 20 results per page, figuring out how many pages there are is straightforward; consider:

import requests, re, math, pprint

def scrape_results(root):
    links = []
    r = requests.get(root)

    mat = re.search(r'We have (\d+) apartments for rent', r.text)
    num_results = int(mat.group(1))                     # 182 at the moment
    num_pages = int(math.ceil(num_results/20.0))        # ceil(182/20) => 10

    # Construct links for pages 1-10
    for i in range(num_pages):
        links.append("%s?page=%d" % (root, (i+1)))

    return links

pprint.pprint(scrape_results(root))

This would be the fastest of the 3 approaches, but possibly more error-prone.
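
In particular, the page count comes from a regular expression over the page text, so it fails if the site rewords that sentence. A small sketch of a guard against that (count_pages is a hypothetical helper, not part of the answer above):

import requests, re, math

def count_pages(root, per_page=20):
    # Hypothetical helper: return the number of result pages, falling back
    # to 1 if the "We have N apartments for rent" text is missing or reworded.
    r = requests.get(root)
    mat = re.search(r'We have (\d+) apartments for rent', r.text)
    if mat is None:
        return 1
    return int(math.ceil(int(mat.group(1)) / float(per_page)))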

Edit 2:

Maybe something like this:

import re, math, pprint
import requests, urlparse
from bs4 import BeautifulSoup, SoupStrainer

def get_pages(root):
    links = []
    r = requests.get(root)

    mat = re.search(r'We have (\d+) apartments for rent', r.text)
    num_results = int(mat.group(1))                     # 182 at the moment
    num_pages = int(math.ceil(num_results/20.0))        # ceil(182/20) => 10

    # Construct links for pages 1-10
    for i in range(num_pages):
        links.append("%s?page=%d" % (root, (i+1)))

    return links

def get_listings(page):
    links = []
    r = requests.get(page)

    for link in BeautifulSoup(r.text, parse_only=SoupStrainer('a')):
        if link.has_attr('href') and link.has_attr('class') and link.has_attr('data-listingid') and 'name' in link.get('class'):
            links.append(urlparse.urljoin(page, link.get('href')))   # resolve against the current page, not the global root

    return links

root='http://www.apartmentguide.com/apartments/Alabama/Hartselle/'
listings = []
for page in get_pages(root):
    listings += get_listings(page)

pprint.pprint(listings)
print(len(listings))
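
To tie this back to the original question (a txt file with one root URL per line), a minimal sketch of a driver loop, assuming the same file path used in the question:

all_listings = []
with open(r"C:\Users\Administrator\Desktop\3.txt", "r") as f:
    for line in f:
        root = line.strip()          # one root URL per line
        if not root:
            continue                 # skip blank lines
        for page in get_pages(root):
            all_listings += get_listings(page)

pprint.pprint(all_listings)
print(len(all_listings))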

Regarding "python - Get a list of all pagination URLs from the links in a txt file with Python requests", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/28893762/
