python - Scraping an IMDb page with BeautifulSoup

Tags: python html web-scraping beautifulsoup html-parsing

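I am new to web scraping, Python, and BeautifulSoup, and I am having trouble getting my code to work.

I want to scrape the URL http://m.imdb.com/feature/bornondate to get:

  • Celebrity name
  • Celebrity image
  • Profession
  • Best work

for the ten celebrities on that page. I am not sure what I am doing wrong.

Here is my code: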

import urllib2
from bs4 import BeautifulSoup

url = 'http://m.imdb.com/feature/bornondate'

test_url = urllib2.urlopen(url)
readHtml = test_url.read()
test_url.close()

soup = BeautifulSoup(readHtml)
# Using it track the number of Actor
count = 0
# Fetching the value present within tag results
person = soup.findChildren('section', 'posters list')
# Changing the person into an iterator
iterperson = iter(person[0].findChildren('a'))

# Finding 'a' in iterperson. Every 'a' tag contains information of a person
for a in iterperson:
    imgSource = a.find('img')['src'].split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    person = a.findChildren('div', 'label')
    title = person[0].find('span', 'title').contents[0]
    ##profession = person[0].find('div', 'detail').contents[0].split(,)
    ##bestWork = person[0].find('div', 'detail').contents[1].split(,)

    print '*******************************IMDB People Born Today***********************************'
    # Printing the S.No of the person
    print 'S.No. --> ',
    count += 1
    print count
    # Printing the title/name of the person
    print 'Title --> ' + title
    # Printing the Image Source of the person
    print 'Image Source --> ', imgSource
    # Printing the Profession of the person
    ##print 'Profession --> ', profession
    # Printing the Best work of the person
    ##print 'Best Work --> ', bestWork

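Right now nothing is printed. Also, if this is too vague, could you explain how to get just the celebrity names?

If it helps, here is the HTML for the first celebrity: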

<section class="posters list">
<h1>March 7</h1>

    <a href="/name/nm0186505/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div></a>

Best answer

First of all, IMDb's "Conditions of Use" explicitly prohibit screen scraping:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

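Consider exploring the IMDb JSON API instead of the web-scraping approach.

Your current problem is that the list of people born on a specific date is loaded through a separate call to the IMDb API, with JavaScript logic involved. The HTML that urllib2 downloads therefore does not contain the posters your code is looking for.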
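To illustrate the point, here is a minimal sketch, assuming the `bs4` package is installed. The `static_html` string is a made-up stand-in for what a plain HTTP fetch might return before any JavaScript has run (the real response may look different), and `rendered_html` is a trimmed copy of the snippet from the question with the `img` tag left out. The helper `poster_names` is a hypothetical name used only for this example. The same selector finds nothing in the first string and one name in the second, which is also roughly how you would extract just the celebrity names once you have the rendered markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page before JavaScript fills in the posters.
static_html = '<section class="posters list"><h1>March 7</h1></section>'

# Rendered markup, trimmed from the snippet in the question (img tag omitted).
rendered_html = (
    '<section class="posters list"><h1>March 7</h1>'
    '<a href="/name/nm0186505/" class="poster ">'
    '<div class="label"><span class="title">Bryan Cranston</span>'
    '<div class="detail">Actor, "Ozymandias"</div></div></a>'
    '</section>'
)


def poster_names(html):
    """Return the names inside section.posters, or [] if no posters exist."""
    soup = BeautifulSoup(html, 'html.parser')
    names = []
    for a in soup.select('section.posters a.poster'):
        title = a.select_one('span.title')
        if title is not None:
            names.append(title.get_text(strip=True))
    return names


static_names = poster_names(static_html)
rendered_names = poster_names(rendered_html)
print(static_names)
print(rendered_names)
```

Returning an empty list instead of indexing into the result makes the "nothing was loaded" case visible without raising an exception.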

The easiest option now is to switch to the selenium browser automation tool. A working example using the headless PhantomJS browser:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("http://m.imdb.com/feature/bornondate")

# waiting for posters to load
wait = WebDriverWait(driver, 10)
posters = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters")))

# extracting the data poster by poster
for a in posters.find_elements_by_css_selector('a.poster'):
    img = a.find_element_by_tag_name('img').get_attribute('src').split('._V1.')[0] + '._V1_SX214_AL_.jpg'

    person = a.find_element_by_css_selector('div.detail').text
    title = a.find_element_by_css_selector('span.title').text

    print img, person, title

Prints:

http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1_SX214_AL_.jpg Actor, "Ozymandias" Bryan Cranston
http://ia.media-imdb.com/images/M/MV5BNjUxNjcxMjE4N15BMl5BanBnXkFtZTgwNDk4NjA2MzE@._V1_SX214_AL_.jpg Actress, "Karla" Laura Prepon
http://ia.media-imdb.com/images/M/MV5BMTQ4MzM1MDAwMV5BMl5BanBnXkFtZTcwNTU4NzQwMw@@._V1_SX214_AL_.jpg Actress, "The Mummy" Rachel Weisz
http://ia.media-imdb.com/images/M/MV5BMjE0Mjg0NzE2Nl5BMl5BanBnXkFtZTcwMDE1MTkxMw@@._V1_SX214_AL_.jpg Actor, "Jarhead" Peter Sarsgaard
http://ia.media-imdb.com/images/M/MV5BMTMyOTYzODQ5MF5BMl5BanBnXkFtZTcwMjE3MDgzMQ@@._V1_SX214_AL_.jpg Actress, "Blades of Glory" Jenna Fischer
http://ia.media-imdb.com/images/M/MV5BMzE2OTAwNzM0Ml5BMl5BanBnXkFtZTcwNzE1MDg0Mw@@._V1_SX214_AL_.jpg Actress, "Tangled" Donna Murphy
http://ia.media-imdb.com/images/M/MV5BMTI0OTMzMzE0N15BMl5BanBnXkFtZTcwMjI1MzYyMQ@@._V1_SX214_AL_.jpg Actor, "How the Grinch Stole Christmas" T.J. Thyne
http://ia.media-imdb.com/images/M/MV5BNzczODkyNzY4OV5BMl5BanBnXkFtZTcwNTU0NjQzMQ@@._V1_SX214_AL_.jpg Actor, "Home Alone" John Heard
http://ia.media-imdb.com/images/M/MV5BMTg4MjU2MzA2OV5BMl5BanBnXkFtZTgwOTIxMjc4MjE@._V1_SX214_AL_.jpg Actress, "Beerfest" Audrey Marie Anderson
http://ia.media-imdb.com/images/M/MV5BMTQyOTc5NzA0M15BMl5BanBnXkFtZTYwODQ2MjYz._V1_SX214_AL_.jpg Producer, "Kick-Ass" Matthew Vaughn
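
The question asks for the profession and the best work as separate fields, while the output above shows them combined in the `div.detail` text. Below is a small sketch of how they could be separated, assuming the text always follows the `Profession, "Title"` pattern seen in the output. The function names `split_detail` and `large_image_url` are hypothetical helpers introduced for illustration; the second one simply wraps the image URL rewrite already used in the code above.

```python
def split_detail(detail):
    """Split 'Profession, "Best Work"' into (profession, best_work).

    Assumes the first ', ' separates the two parts; if there is no comma,
    best_work comes back as an empty string.
    """
    profession, _, best_work = detail.partition(', ')
    return profession.strip(), best_work.strip().strip('"')


def large_image_url(src):
    """Replace the thumbnail size suffix after '._V1.' with a larger one."""
    return src.split('._V1.')[0] + '._V1_SX214_AL_.jpg'


profession, best_work = split_detail('Actor, "Ozymandias"')
print(profession)
print(best_work)

thumb = ('http://ia.media-imdb.com/images/M/'
         'MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@'
         '._V1._CR0,0,1369,2019_SX40_SY59.jpg')
print(large_image_url(thumb))
```

Using `partition` rather than `split(',')` keeps titles that themselves contain a comma in one piece.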

Regarding python - Scraping an IMDb page with BeautifulSoup, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28912004/
