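python - Optimizing my Python scraper

Tags: python web-scraping beautifulsoup

This is a long-winded question, and I probably just need someone to point me in the right direction. I am building a web scraper to grab basketball player information from ESPN's website. The URL structure is pretty simple: each player card has a specific id in the URL. To obtain the information, I wrote a loop from 1 to ~6000 to grab the players from their database. My question is whether there is a more efficient way to do this.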

from bs4 import BeautifulSoup
import requests

player_id = []  # Empty list to store player ids
age = []  # Empty list to store player ages

BASE = 'http://espn.go.com/nba/player/stats/_/id/'  # Base structure of player card URL


def get_age(BASE):  # Collects the age of every player id that exists
    for i in range(1, 6000):  # Loop over candidate player ids
        BASE_U = BASE + str(i) + '/'  # Create URL for player
        r = requests.get(BASE_U)
        soup = BeautifulSoup(r.text, 'html.parser')  # Name the parser explicitly
        # Prior to this step, I had to print out the soup object and look through the HTML in order to find the tag that contained my desired information
        # Get age of players
        age_tables = soup.find_all('ul', class_="player-metadata")  # Grabs all text in the metadata tag
        p = str(age_tables)  # Turns text into a string
        # At this point I had to look at all the text in the p object and determine a way to capture the age info
        if "Age: " not in p:  # Player ID doesn't exist, so go to the next one to avoid an error
            continue
        start = p.index("Age: ") + len("Age: ")  # Gets the location of the player's age
        end = p[start:].index(")") + start
        player_id.append(i)  # Adds player id to player_id list
        age.append(p[start:end])  # Adds player's age to age list


get_age(BASE)

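Any help, even a small amount, would be much appreciated, even if it just points me in the right direction rather than being a direct solution.

Thanks, Ben

Best answer

This is like a port scanner in network security: multithreading will make your program much faster. Each request spends most of its time waiting on the network, so running the requests on several threads lets those waits overlap instead of adding up one after another.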
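
One way to sketch the multithreading idea is with a thread pool from the standard library. This is only an illustration under assumptions, not the answerer's code: `scrape_all`, `fetch_age` and the default of 20 workers are hypothetical names and values chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(player_ids, fetch_age, max_workers=20):
    """Call fetch_age(player_id) for every id on a pool of worker threads.

    fetch_age is a hypothetical per-player function: it should download and
    parse one player page and return the age, or None if the id does not exist.
    Returns a dict mapping player_id -> age for the ids that exist.
    """
    player_ids = list(player_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order; the calls themselves overlap,
        # so a thread waiting on the network does not hold up the others.
        results = pool.map(fetch_age, player_ids)
        return {pid: age for pid, age in zip(player_ids, results) if age is not None}
```

With the code from the question, `fetch_age` would be the body of the `for` loop moved into a function that takes `i` and returns `p[start:end]` (or `None` when "Age: " is missing), called as `scrape_all(range(1, 6000), fetch_age)`. The worker count is arbitrary; a site may throttle or block too many simultaneous requests, so a smaller value may be safer. Note that an exception raised inside `fetch_age` is re-raised when its result is read.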

Regarding "python - Optimizing my Python scraper", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30960302/
