这是我第一次使用 Python 和网络抓取。一直在四处寻找,仍然无法得到我需要做的事情。
以下是我通过 Chrome 使用的元素的打印屏幕。
我想做的是,我试图从所选城市名称中获取公寓名称和地址。
import requests
from bs4 import BeautifulSoup
#url = 'http://www.homestead.ca/apartments-for-rent/'
rootURL = 'http://www.homestead.ca'
response = requests.get(rootURL)
html = response.content
soup = BeautifulSoup(html,'lxml')
dropdown_list = soup.select(".primary .child-pages a")
#city_names=[dropdown_list_value.text for dropdown_list_value in dropdown_list]
#print (city_names)
cityLinks=[rootURL + dropdown_list_value['href'] for dropdown_list_value in dropdown_list]
for cityLinks_select in dropdown_list: #Looping each city from the Apartment drop down list
print ('Selecting city:',cityLinks_select.text)
cityResponse = requests.get(cityLinks)
cityHtml = cityResponse.content
citySoup = BeautifulSoup(cityHtml,'lxml')
community_list = soup.select(".extended-search .property-container a[h2 h3]")
get and print the apartment link
get and print the apartment name
get and print the address of the apartment
最佳答案
正如我所说,一些数据是动态创建的,如果我们查看源本身,我们会看到:
<div class="content">
<div class="title-container">
<h2 class="building-name"><%= building.get('name') %></h2>
<h3 class="address"><%= building.get('address').address %></h3>
</div>
<div class="rent">
<h4 class="sub-title">Rent from</h4>
<% if (building.get('statistics').suites.rates.min !== 'undefined') { %>
<% $min_rate = commaSeparateNumber(parseInt(building.get('statistics').suites.rates.min)); %>
<span class="rent-value">$<%= $min_rate %></span>
<% } %>
</div>
我们只能从源头得到建筑物名称、地址和电话号码:
cityLinks = [rootURL + dropdown_list_value['href'] for dropdown_list_value in dropdown_list]
# you need to iterate over the joined urls
for city in cityLinks: # Looping each city from the Apartment drop down list
cityResponse = requests.get(city)
cityHtml = cityResponse.content
citySoup = BeautifulSoup(cityHtml, 'lxml')
# all the info we can parse is inside the div class="building-info"
for div in citySoup.select("div.building-info"):
print(div.select_one("h1.building-name").text.strip())
print(div.select_one("h2.location").text.strip())
print(div.select_one("div.contact-container div.phone").text.strip())
如果我们模拟一个ajax请求,我们可以得到json格式的所有数据:
import requests
from bs4 import BeautifulSoup
from pprint import pprint as pp
rootURL = 'http://www.homestead.ca'
response = requests.get(rootURL)
html = response.content
soup = BeautifulSoup(html, 'lxml')
dropdown_list = soup.select(".primary .child-pages a")
cityLinks = (rootURL + dropdown_list_value['href'] for dropdown_list_value in dropdown_list)
# params for our request
params = {"show_promotions": "true",
"show_custom_fields": "true",
"client_id": "6",
"auth_token": "sswpREkUtyeYjeoahA2i",
"min_bed": "-1",
"max_bed": "100",
"min_bath": "0",
"max_bath": "10",
"min_rate": "0",
"max_rate": "4000",
"keyword": "false",
"property_types": "low-rise-apartment,mid-rise-apartment,high-rise-apartment,luxury-apartment,townhouse,house,multi-unit-house,single-family-home,duplex,tripex,semi",
"order": "max_rate ASC, min_rate ASC, min_bed ASC, max_bath ASC",
"limit": "50",
"offset": "0",
"count": "false"}
for city in cityLinks: # Looping each city from the Apartment drop down list
with requests.Session() as s:
r= s.get(city)
# we need to parse the city_id for out next request to work
soup = BeautifulSoup(r.content)
city_id = soup.select_one("div.hidden.search-data")["data-city-id"]
# update params with the city id
params["city_id"] = city_id
js = s.get("http://api.theliftsystem.com/v2/search", params=params).json()
pp(js)
现在我们得到如下数据:
[{u'address': {u'address': u'325 North Park Street',
u'city': u'Brantford',
u'city_id': 332,
u'country': u'Canada',
u'country_code': u'CAN',
u'intersection': u'',
u'neighbourhood': u'',
u'postal_code': u'N3R 2X4',
u'province': u'Ontario',
u'province_code': u'ON'},
u'availability_count': 6,
u'availability_status': 1,
u'availability_status_label': u'Available Now',
u'building_header': u'',
u'client': {u'email': u'bcadieux@homestead.ca',
u'id': 6,
u'name': u'Homestead Land Holdings',
u'phone': u'613-546-3146',
u'website': u'www.homestead.ca'},
u'contact': {u'alt_extension': u'',
u'alt_phone': u'',
u'email': u'rentals@homestead.ca',
u'extension': u'',
u'fax': u'(519) 752-6855',
u'name': u'',
u'phone': u'519-752-3596'},
u'details': {u'features': u'',
u'location': u'',
u'overview': u"Located on North Park Street and Memorial Avenue,this quiet building is within walking distance of the following: - Zehrs Plaza, North Park Plaza, Shoppers Drug Mart, Zehrs Grocery Store, Zellers, Pet Store, Party Supply Store, furniture store, variety store, Black's Photography, paint shop and veterinary clinic\xa0 - Restaurants and coffee shops\xa0 - Wayne Gretzky Recreational Arena\xa0 - Medical Clinic,Shoppers Home Health Care Clinic and Pharmacy\xa0 - Catholic Elementary School\xa0 - On bus route ",
u'suite': u''},
u'geocode': {u'distance': None,
u'latitude': u'43.1703624',
u'longitude': u'-80.2605725'},
u'id': 309,
u'matched_beds': [u'0', u'1', u'2'],
u'matched_suite_names': [u'Bachelor', u'One Bedroom', u'Two Bedroom'],
u'min_availability_date': u'',
u'name': u'North Park Tower',
u'office_hours': u'',
u'parking': {u'additional': u'', u'indoor': u'', u'outdoor': u''},
u'permalink': u'http://www.homestead.ca/apartments/325-north-park-street-brantford',
u'pet_friendly': True,
u'photo': u'1443018148_2.jpg',
u'photo_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/full/1443018148_2.jpg',
u'promotion': {u'featured': 0},
u'property_type': u'High-rise-apartment',
u'statistics': {u'suites': {u'bathrooms': {u'average': 1.0,
u'max': 1.0,
u'min': 1.0},
u'bedrooms': {u'average': u'1.0',
u'max': 2,
u'min': 0},
u'rates': {u'average': 950.0,
u'max': 1275.0,
u'min': 625.0},
u'square_feet': {u'average': 0.0,
u'max': u'0.0',
u'min': u'0.0'}}},
u'thumbnail_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/256/1443018148_2.jpg',
u'website': {u'description': u'', u'title': u'', u'url': u''}},
{u'address': {u'address': u'661 West Street',
u'city': u'Brantford',
u'city_id': 332,
u'country': u'Canada',
u'country_code': u'CAN',
u'intersection': u'',
u'neighbourhood': u'',
u'postal_code': u'N3R 6W9',
u'province': u'Ontario',
u'province_code': u'ON'},
u'availability_count': 6,
u'availability_status': 1,
u'availability_status_label': u'Available Now',
u'building_header': u'',
u'client': {u'email': u'bcadieux@homestead.ca',
u'id': 6,
u'name': u'Homestead Land Holdings',
u'phone': u'613-546-3146',
u'website': u'www.homestead.ca'},
u'contact': {u'alt_extension': u'',
u'alt_phone': u'',
u'email': u'rentals@homestead.ca',
u'extension': u'',
u'fax': u'(519) 751-0379',
u'name': u'',
u'phone': u'519-751-3867'},
u'details': {u'features': u'',
u'location': u'',
u'overview': u'Located in the North end of Brantford, Westgate Tower is in an area that resembles a city within a city. There are a variety of banks, grocery stores, drug stores, malls, a wide selection of fast food, fine dining restaurants and an after hours medical centre, within waking distance.',
u'suite': u''},
u'geocode': {u'distance': None,
u'latitude': u'43.1733242',
u'longitude': u'-80.2482991'},
u'id': 310,
u'matched_beds': [u'0', u'1', u'2'],
u'matched_suite_names': [u'Bachelor', u'One Bedroom', u'Two Bedroom'],
u'min_availability_date': u'',
u'name': u'Westgate Apartments',
u'office_hours': u'',
u'parking': {u'additional': u'', u'indoor': u'', u'outdoor': u''},
u'permalink': u'http://www.homestead.ca/apartments/661-west-street-brantford',
u'pet_friendly': True,
u'photo': u'1443017488_1.jpg',
u'photo_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/full/1443017488_1.jpg',
u'promotion': {u'featured': 0},
u'property_type': u'High-rise-apartment',
u'statistics': {u'suites': {u'bathrooms': {u'average': 1.0,
u'max': 1.0,
u'min': 1.0},
u'bedrooms': {u'average': u'1.0',
u'max': 2,
u'min': 0},
u'rates': {u'average': 975.0,
u'max': 1300.0,
u'min': 650.0},
u'square_feet': {u'average': 0.0,
u'max': u'0.0',
u'min': u'0.0'}}},
u'thumbnail_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/256/1443017488_1.jpg',
u'website': {u'description': u'', u'title': u'', u'url': u''}},
{u'address': {u'address': u'321 Fairview Drive',
u'city': u'Brantford',
u'city_id': 332,
u'country': u'Canada',
u'country_code': u'CAN',
u'intersection': u'',
u'neighbourhood': u'',
u'postal_code': u'N3R 2X6',
u'province': u'Ontario',
u'province_code': u'ON'},
u'availability_count': 8,
u'availability_status': 1,
u'availability_status_label': u'Available Now',
u'building_header': u'',
u'client': {u'email': u'bcadieux@homestead.ca',
u'id': 6,
u'name': u'Homestead Land Holdings',
u'phone': u'613-546-3146',
u'website': u'www.homestead.ca'},
u'contact': {u'alt_extension': u'',
u'alt_phone': u'',
u'email': u'rentals@homestead.ca',
u'extension': u'',
u'fax': u'(519) 752-6855',
u'name': u'',
u'phone': u'519-752-3596'},
u'details': {u'features': u'',
u'location': u'',
u'overview': u'Dornia Manor is a quiet, ninety-two unit apartment building located in the North end of Brantford. We offer one, two and three bedroom units and one penthouse suite. The building is located in close proximity to many major services such as banking, shopping, health services, recreational facilities, beauty shops, dry cleaners, schools and churches. There is a bus stop at the front door and highway 403 is within minutes.',
u'suite': u''},
u'geocode': {u'distance': None,
u'latitude': u'43.1706331',
u'longitude': u'-80.2584034'},
u'id': 308,
u'matched_beds': [u'1', u'2', u'3'],
u'matched_suite_names': [u'One Bedroom', u'Two Bedroom', u'Three Bedroom'],
u'min_availability_date': u'',
u'name': u'Dornia Manor',
u'office_hours': u'',
u'parking': {u'additional': u'', u'indoor': u'', u'outdoor': u''},
u'permalink': u'http://www.homestead.ca/apartments/321-fairview-drive-brantford',
u'pet_friendly': True,
u'photo': u'1443017947_1.jpg',
u'photo_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/full/1443017947_1.jpg',
u'promotion': {u'featured': 0},
u'property_type': u'High-rise-apartment',
u'statistics': {u'suites': {u'bathrooms': {u'average': 1.375,
u'max': 2.0,
u'min': 1.0},
u'bedrooms': {u'average': u'2.25',
u'max': 3,
u'min': 1},
u'rates': {u'average': 1124.5,
u'max': 1350.0,
u'min': 899.0},
u'square_feet': {u'average': 0.0,
u'max': u'0.0',
u'min': u'0.0'}}},
u'thumbnail_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/256/1443017947_1.jpg',
u'website': {u'description': u'', u'title': u'', u'url': u''}}]
这会为您提供网址、卧室和几乎所有您可能想要的东西。列表中的每个字典都是一个列表,您只需要使用键访问即可拉取您想要的数据,例如:
for dct in js:
add = dct["address"]
print(add["city"])
print(add["postal_code"])
print(add["province"])
print(dct["permalink"])
会给你:
Brantford
N3R 2X4
Ontario
http://www.homestead.ca/apartments/325-north-park-street-brantford
Brantford
N3R 6W9
Ontario
http://www.homestead.ca/apartments/661-west-street-brantford
Brantford
N3R 2X6
Ontario
http://www.homestead.ca/apartments/321-fairview-drive-brantford
联系信息在 dct["contact"]
下,统计在 = dct["statistics"]
下:
for dct in js:
contact = dct["contact"]
print(contact)
stats = dct["statistics"]
print(stats["suites"])
这会给你:
{u'alt_phone': u'', u'fax': u'(519) 752-6855', u'name': u'', u'alt_extension': u'', u'phone': u'519-752-3596', u'extension': u'', u'email': u'rentals@homestead.ca'}
{u'rates': {u'max': 1275.0, u'average': 950.0, u'min': 625.0}, u'bedrooms': {u'max': 2, u'average': u'1.0', u'min': 0}, u'bathrooms': {u'max': 1.0, u'average': 1.0, u'min': 1.0}, u'square_feet': {u'max': u'0.0', u'average': 0.0, u'min': u'0.0'}}
{u'alt_phone': u'', u'fax': u'(519) 751-0379', u'name': u'', u'alt_extension': u'', u'phone': u'519-751-3867', u'extension': u'', u'email': u'rentals@homestead.ca'}
{u'rates': {u'max': 1300.0, u'average': 975.0, u'min': 650.0}, u'bedrooms': {u'max': 2, u'average': u'1.0', u'min': 0}, u'bathrooms': {u'max': 1.0, u'average': 1.0, u'min': 1.0}, u'square_feet': {u'max': u'0.0', u'average': 0.0, u'min': u'0.0'}}
{u'alt_phone': u'', u'fax': u'(519) 752-6855', u'name': u'', u'alt_extension': u'', u'phone': u'519-752-3596', u'extension': u'', u'email': u'rentals@homestead.ca'}
{u'rates': {u'max': 1350.0, u'average': 1124.5, u'min': 899.0}, u'bedrooms': {u'max': 3, u'average': u'2.25', u'min': 1}, u'bathrooms': {u'max': 2.0, u'average': 1.375, u'min': 1.0}, u'square_feet': {u'max': u'0.0', u'average': 0.0, u'min': u'0.0'}}
您可以将所有这些放在一起以获得您需要的任何东西。你可以调整参数,如果你在 chrome 工具或 Firebug 中检查请求,实际上还有更多。
关于python - 使用 Python 进行网页抓取 - 选择 div、h2 和 h3 类,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/37994270/