python - Extracting tags in tandem with Python + BeautifulSoup to create a list of lists

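I am fairly new to Python and BeautifulSoup, and I would appreciate some guidance on how to accomplish the following.

I have the HTML of a web page, which is structured as follows:

1) A block of code, contained in a tag, that holds all the image names (Name1, Name2, Name3).

2) A block of code, contained in tags, that holds the image URLs.

3) The date that appears on the web page. I have put it into the 'date' variable (it has already been extracted).

From this code I am trying to extract a list of lists of the form [['image1','url1', 'date'], ['image2','url2','date']], which I will later convert into dictionaries (via dict(zip(labels, values))) and insert into a MySQL table.

All I have managed so far is to extract two separate lists, one holding all the image names and one holding all the URLs. Any ideas on how to get what I am after?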

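A few points to keep in mind:

1) The number of images and names varies, but they always match 1:1.

2) The date always appears exactly once.

P.S. Also, if there is a more elegant way to extract the data with bs4, please let me know!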

from bs4 import BeautifulSoup
name = []
url = []
date = '2017-10-12'

text = '<div class="tabs"> <ul><li> NAME1</li><li> NAME2</li><li> NAME3</li> </ul> <div><div><div class="img-wrapper"><img alt="" src="www.image1.com/1.jpg" title="image1.jpg"></img> </div> <center><a class="button print" href="javascript: w=window.open("www.image1.com/1.jpg); w.print();"> Print</a> </center></div><div> <div class="img-wrapper"><img alt="" src="www.image2.com/2.jpg" title="image2.jpg"></img> </div> <center><a class="button print" href="javascript: w=window.open("www.image2.com/2.jpg"); w.print();">Print</a> </center></div><div> <div class="img-wrapper"><img alt="" src="www.image1.com/3.jpg" title="image3.jpg"></img></div> <center><a class="button print" href="javascript: w=window.open("www.image1.com/3.jpg"); w.print();"> Print</a> </center></div> </div></div>'
soup = BeautifulSoup(text, 'lxml')
# print(soup.prettify())

# Collect the image URLs from each img-wrapper div
for imgz in soup.find_all('div', attrs={'class': 'img-wrapper'}):
    for imglinks in imgz.find_all('img', src=True):
        url.append(imglinks['src'])

# Collect the image names from the list items
for ultag in soup.find_all('ul'):
    for litag in ultag.find_all('li'):
        name.append(litag.text)

print(url)
print(name)

Best answer

Here is another possible way to pull the URLs and names:

url = [tag.get('src') for tag in soup.find_all('img')]
name = [tag.text.strip() for tag in soup.find_all('li')]

print(url)
# ['www.image1.com/1.jpg', 'www.image2.com/2.jpg', 'www.image1.com/3.jpg']

print(name)
# ['NAME1', 'NAME2', 'NAME3']

As for building the final list, here is something functionally similar to what @t.m.adam suggested:

print([pair + [date] for pair in list(map(list, zip(url, name)))])
# [['www.image1.com/1.jpg', 'NAME1', '2017-10-12'],
#  ['www.image2.com/2.jpg', 'NAME2', '2017-10-12'],
#  ['www.image1.com/3.jpg', 'NAME3', '2017-10-12']]

Note that map is used less often these days, and in some places its use is discouraged outright.

Alternatively:
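
For comparison, the same rows can be built with a plain list comprehension and no map at all. This is only a minimal sketch: it assumes url, name, and date hold the values shown above, and redefines them so the snippet runs on its own. The variable name rows is made up for illustration.

```python
# Sample values copied from the output above so the snippet is self-contained.
url = ['www.image1.com/1.jpg', 'www.image2.com/2.jpg', 'www.image1.com/3.jpg']
name = ['NAME1', 'NAME2', 'NAME3']
date = '2017-10-12'

# zip silently stops at the shorter input, so check the 1:1 assumption first.
assert len(url) == len(name), "expected exactly one name per image URL"

# Each (URL, name) pair becomes one inner list with the shared date appended.
rows = [[u, n, date] for u, n in zip(url, name)]
print(rows)
```

Swapping u and n inside the inner list gives the name-first order used in the question.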

n = len(url)
print(list(map(list, zip(url, name, [date] * n))))
# [['www.image1.com/1.jpg', 'NAME1', '2017-10-12'], ['www.image2.com/2.jpg', 'NAME2', '2017-10-12'], ['www.image1.com/3.jpg', 'NAME3', '2017-10-12']]
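
The question also mentions turning each row into a dictionary with dict(zip(labels, values)) before inserting into MySQL. A rough sketch of that step might look like the following. The labels list is hypothetical (the question does not name the columns), and rows simply repeats the output shown above.

```python
# Rows as produced above: [url, name, date].
rows = [
    ['www.image1.com/1.jpg', 'NAME1', '2017-10-12'],
    ['www.image2.com/2.jpg', 'NAME2', '2017-10-12'],
    ['www.image1.com/3.jpg', 'NAME3', '2017-10-12'],
]

# Hypothetical column labels; replace them with the real table's column names.
labels = ['url', 'name', 'date']

# Pair each label with the value at the same position in the row.
records = [dict(zip(labels, row)) for row in rows]

print(records[0])
# {'url': 'www.image1.com/1.jpg', 'name': 'NAME1', 'date': '2017-10-12'}
```

The order of labels has to match the order of values in each row, since zip pairs them purely by position.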

Source question on Stack Overflow: https://stackoverflow.com/questions/46709709/
