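python - Beautiful Soup web scrape into MySQL

Tags: python mysql

The code so far downloads the data and prints it to the screen, but how do I put the printed material into a SQL database? If I wanted to put the data into a CSV file, it seems Python (on a good day) creates the file automatically. When transferring to MySQL, I assume I would have to create a database beforehand to receive the data. My question is how to get the data from the scrape into the database, skipping the CSV step entirely. In anticipation, I have already downloaded the PyMySQL library. Any suggestions are appreciated.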

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen
from bs4 import BeautifulSoup

# Fetch the chart page and print every artist
html = urlopen("http://www.officialcharts.com/charts/singles-chart/19800203/7501/")
bsObj = BeautifulSoup(html, "html.parser")
nameList = bsObj.findAll("div", {"class": "artist"})
for name in nameList:
    print(name.get_text())

# Fetch the same page again and print every title
html = urlopen("http://www.officialcharts.com/charts/singles-chart/19800203/7501/")
bsObj = BeautifulSoup(html, "html.parser")
nameList = bsObj.findAll("div", {"class": "title"})
for name in nameList:
    print(name.get_text())

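Best answer

There are a couple of things to address here.

The PyMySQL docs are very good at getting you up and running.

Before putting these values into a database, you need to grab them in a way that keeps each artist associated with its song title. Right now you are getting two separate lists, one of artists and one of songs, with no way to associate them. You will want to iterate over the title-artist class to do this.

I would do it like this: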

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen
from bs4 import BeautifulSoup

# Webpage connection
html = urlopen("http://www.officialcharts.com/charts/singles-chart/19800203/7501/")

# Grab title-artist classes and iterate
bsObj = BeautifulSoup(html, "html.parser")
recordList = bsObj.findAll("div", {"class": "title-artist"})

# Now iterate over recordList to grab title and artist
for record in recordList:
    title = record.find("div", {"class": "title"}).get_text().strip()
    artist = record.find("div", {"class": "artist"}).get_text().strip()
    print(artist + ': ' + title)

This prints the title and artist for each iteration of the recordList loop.
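One thing the loop above takes for granted is that every title-artist block contains both a title and an artist; if `find` returns `None`, calling `.get_text()` on it raises an `AttributeError`. A minimal sketch of a more defensive version follows. The helper names `text_of` and `extract_pairs` are made up for illustration, and the code only assumes objects that expose `find` and `get_text` the way BeautifulSoup tags do.

```python
def text_of(tag):
    """Return the stripped text of a tag, or None when find() found nothing."""
    return tag.get_text().strip() if tag is not None else None


def extract_pairs(record_list):
    """Build (artist, title) tuples, skipping records that miss either part."""
    pairs = []
    for record in record_list:
        title = text_of(record.find("div", {"class": "title"}))
        artist = text_of(record.find("div", {"class": "artist"}))
        if artist and title:
            pairs.append((artist, title))
    return pairs
```

Calling `extract_pairs(recordList)` would give a plain list of tuples, which is also a convenient shape to hand to a database cursor later.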

To insert these values into a MySQL database, I created a table called artist_song with the following:

CREATE TABLE `artist_song` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `artist` varchar(255) COLLATE utf8_bin NOT NULL,
  `song` varchar(255) COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
  AUTO_INCREMENT=1;
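If you would rather create the table from the script than from a MySQL client, something along these lines might work. This is only a sketch: `ensure_table` is a hypothetical helper, the `IF NOT EXISTS` clause is my addition so the script can be re-run, and the database itself (top_40 in this answer) is assumed to exist already.

```python
# Same columns as the CREATE TABLE statement above; IF NOT EXISTS is an
# addition so that running the script twice does not fail.
CREATE_ARTIST_SONG = """
CREATE TABLE IF NOT EXISTS `artist_song` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `artist` varchar(255) COLLATE utf8_bin NOT NULL,
  `song` varchar(255) COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
"""


def ensure_table(connection):
    """Run the DDL on an open connection, e.g. one from pymysql.connect()."""
    with connection.cursor() as cursor:
        cursor.execute(CREATE_ARTIST_SONG)
    connection.commit()
```

You would call `ensure_table(connection)` once, right after connecting and before the first INSERT.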

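This is not the cleanest way of doing it, but the idea is sound. We want to open a connection to the MySQL database (I have called the database top_40) and insert an artist/title pair for each iteration of the recordList loop: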

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen
from bs4 import BeautifulSoup
import pymysql.cursors


# Webpage connection
html = urlopen("http://www.officialcharts.com/charts/singles-chart/19800203/7501/")

# Grab title-artist classes and store in recordList
bsObj = BeautifulSoup(html, "html.parser")
recordList = bsObj.findAll("div", {"class": "title-artist"})

# Create database connection
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='password',
                             db='top_40',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

# Create a pymysql cursor and iterate over each title-artist record.
# This runs an INSERT statement for each artist/title pair, then commits
# the transaction after reaching the end of the list. pymysql does not
# have autocommit enabled by default. After committing, the finally
# block closes the database connection.
try:
    with connection.cursor() as cursor:
        for record in recordList:
            title = record.find("div", {"class": "title"}).get_text().strip()
            artist = record.find("div", {"class": "artist"}).get_text().strip()
            sql = "INSERT INTO `artist_song` (`artist`, `song`) VALUES (%s, %s)"
            cursor.execute(sql, (artist, title))
    connection.commit()
finally:
    connection.close()
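As a variation on the insert loop above, the pairs can be collected first and sent in one batch with `cursor.executemany`, rolling back if anything goes wrong. This is a rough sketch rather than a drop-in replacement: `insert_pairs` is a hypothetical helper, and it expects an already-open connection plus an iterable of (artist, title) tuples.

```python
INSERT_SQL = "INSERT INTO `artist_song` (`artist`, `song`) VALUES (%s, %s)"


def insert_pairs(connection, pairs):
    """Insert (artist, title) tuples in one batch; roll back on failure.

    Returns the number of rows handed to the driver.
    """
    # Drop incomplete pairs, since both columns are declared NOT NULL.
    rows = [(artist, title) for artist, title in pairs if artist and title]
    if not rows:
        return 0
    try:
        with connection.cursor() as cursor:
            cursor.executemany(INSERT_SQL, rows)
        connection.commit()
    except Exception:
        # Undo the partial batch, then let the caller see the error.
        connection.rollback()
        raise
    return len(rows)
```

The caller still owns the connection, so the `try`/`finally` with `connection.close()` stays where it is. Keeping the values in the second argument, rather than formatting them into the SQL string, leaves the escaping to the driver.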

Edit: Per my comment, I think it is clearer to iterate over the table rows instead:

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen
from bs4 import BeautifulSoup
import pymysql.cursors


# Webpage connection
html = urlopen("http://www.officialcharts.com/charts/singles-chart/19800203/7501/")

bsObj = BeautifulSoup(html, "html.parser")

# Only rows that carry a chart position are actual chart entries
rows = bsObj.findAll('tr')
for row in rows:
    if row.find('span', {'class': 'position'}):
        position = row.find('span', {'class': 'position'}).get_text().strip()
        artist = row.find('div', {'class': 'artist'}).get_text().strip()
        track = row.find('div', {'class': 'title'}).get_text().strip()
        # position, artist and track can now be inserted as in the previous snippet
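To store the chart position as well, the scraped strings need a little cleanup before they go into an INSERT. The sketch below is one way to do it under an assumption that is not part of the answer: that an integer `position` column has been added to artist_song. Both `to_chart_row` and `INSERT_WITH_POSITION` are hypothetical names.

```python
import re

# Assumes a `position` INT column was added to artist_song (not in the
# CREATE TABLE statement shown earlier).
INSERT_WITH_POSITION = (
    "INSERT INTO `artist_song` (`position`, `artist`, `song`) "
    "VALUES (%s, %s, %s)"
)


def to_chart_row(position, artist, track):
    """Normalise one scraped row into a (position, artist, track) tuple."""
    match = re.search(r"\d+", position)  # first run of digits in the text
    if match is None:
        raise ValueError("no chart position in %r" % position)

    def collapse(text):
        # Squash runs of whitespace and newlines into single spaces.
        return " ".join(text.split())

    return int(match.group()), collapse(artist), collapse(track)
```

Inside the row loop this would be used as `cursor.execute(INSERT_WITH_POSITION, to_chart_row(position, artist, track))`, followed by a single `connection.commit()` after the loop.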

This post on "python - Beautiful Soup web scrape into MySQL" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/32036119/
