How to scrape links to image, text, and audio file URLs in Python

Tags: python python-3.x xpath web-scraping

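I am trying to scrape data from the following URL ( http://www.ancient-hebrew.org/m/dictionary/1000.html ).

Each Hebrew word's section starts with image URLs, followed by two pieces of text: the actual Hebrew word and its pronunciation. For example, the first entry on the page is "img1 img2 img3 אֶלֶף e-leph". After downloading the HTML with wget, the Hebrew words are Unicode.

I am trying to collect this information in order, so that I get the image files first, then the Hebrew word, then the pronunciation. Finally, I want to find the URL of the audio file.

Also, the row for each word seems to start with an <A tag.

I am new to web scraping, so the following is as far as I got.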

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = '1000.html'

try:
    page = urlopen(url)
except:
    print("Error opening the URL")

soup = BeautifulSoup(page, 'html.parser')

content = soup.find('<!--501-1000-->', {"<A Name= "})

images = ''
for i in content.findAll('*.jpg'):
    images = images + ' ' +  i.text

with open('scraped_text.txt', 'w') as file:
    file.write(images)


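As you can see, my code doesn't really do the job. In the end, I want to get the information for every word on the page and save it as a text file or a JSON file, whichever is easier.

For example: Image: URLsOfImages, Hebrew word: txt, Pronunciation: txt, URLtoAudio: txt

and so on for the next word.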

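Best answer

I wrote a script that should help you. It gathers all the information you asked for. I saved the output as a text file rather than JSON because of the Hebrew letters: with the json module's default settings they are written as escaped \uXXXX sequences rather than readable characters. I know you posted this question a while ago, but I came across it today and decided to give it a try. Anyway, here it is: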

import requests
from bs4 import BeautifulSoup
import re


url = 'http://www.ancient-hebrew.org/m/dictionary/1000.html'
page = requests.get(url)
#Stops early on an HTTP error instead of parsing an error page
page.raise_for_status()

soup = BeautifulSoup(page.text, 'html.parser')

def images():
    #Gathers all the images (this includes unwanted gifs)
    imgs = soup.find_all('img')

    #Gets the src attribute to form the full url
    srcs = [img['src'] for img in imgs]
    base_url = 'https://www.ancient-hebrew.org/files/'

    imgs = {}
    section = 0
    #Goes through each source of all the images
    for src in srcs:
        #Checks if it is a gif, these act as a separator
        if src.endswith('.gif'):
            #If it is a gif, change sections (acts as separator)
            section += 1
        else:
            #If it is a letter image, use regex to extract the part of src we want and form full url
            actual_link = re.search(r'files/(.+\.jpg)', src)
            #Skips any image whose src does not match the expected pattern
            if actual_link:
                imgs.setdefault(section, []).append(base_url + actual_link.group(1))
    return imgs

def hebrew_letters():
    #Gets hebrew letters and strips whitespace (the text is kept in the order it appears in the html)
    h_letters = [h_letter.text.strip() for h_letter in soup.find_all('font', attrs={'face': 'arial'})]
    return h_letters

def english_letters():
    #Gets english letters by regex, this part was difficult because these letters are not surrounded by tags in the html
    letters = ''.join(str(content) for content in soup.find('table', attrs={'width': '90%'}).td.contents)
    search_text = re.finditer(r'/font>\s+(.+?)\s+<br/>', letters)
    e_letters = [letter.group(1) for letter in search_text]
    return e_letters

def get_audio_urls():
    #Gets all the mp3 hrefs for the audio part
    base_url = 'https://www.ancient-hebrew.org/m/dictionary/'
    links = soup.find_all('a', href=re.compile(r'\d+\s*\.mp3$'))
    audio_urls = [base_url+link['href'].replace('\t','') for link in links]
    return audio_urls

def main():
    #Gathers scraped data
    imgs = images()
    h_letters = hebrew_letters()
    e_letters = english_letters()
    audio_urls = get_audio_urls()

    #Encodes data into utf-8 (due to hebrew letters) and saves it to text file
    with open('scraped_hebrew.txt', 'w', encoding='utf-8') as text_file:
        for img, h_letter, e_letter, audio_url in zip(imgs.values(), h_letters, e_letters, audio_urls):
            text_file.write('Image Urls: ' + ' - '.join(im for im in img) + '\n')
            text_file.write('Hebrew Letters: ' + h_letter + '\n')
            text_file.write('English Letters: ' + e_letter + '\n')
            text_file.write('Audio Urls: ' + audio_url + '\n\n')


if __name__ == '__main__':
    main()
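
The script above builds absolute URLs by joining a hard-coded base to part of each src or href. One possible alternative, sketched here, is to let urllib.parse.urljoin from the standard library resolve each link against the page URL. The helper name absolute_url and the sample src and href values are made up for illustration and are not taken from the actual page.

```python
from urllib.parse import urljoin

PAGE_URL = 'http://www.ancient-hebrew.org/m/dictionary/1000.html'

def absolute_url(link, page_url=PAGE_URL):
    # Remove all whitespace first (the script above strips tabs from hrefs)
    cleaned = ''.join(link.split())
    # Resolve the relative link against the URL of the page it was found on
    return urljoin(page_url, cleaned)

# Hypothetical src and href values, for illustration only
print(absolute_url('../../files/example-letter.jpg'))
print(absolute_url('audio/1\t.mp3'))
```

Whether this yields the same URLs as the hard-coded bases depends on how the page writes its links, so it is worth comparing a few results by hand.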

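Since the question also asks about JSON, here is a minimal sketch of how the four scraped lists might be combined into one record per word and serialized. It assumes the lists line up one-to-one. The names build_records, to_json and save_json are hypothetical, the word and pronunciation in the sample data come from the question, and the file names are placeholders. Passing ensure_ascii=False to json.dumps keeps the Hebrew letters readable.

```python
import json

def build_records(image_groups, hebrew_words, pronunciations, audio_urls):
    # zip() stops silently at the shortest input, so check the lengths first
    lengths = {len(image_groups), len(hebrew_words), len(pronunciations), len(audio_urls)}
    if len(lengths) != 1:
        raise ValueError('scraped lists have different lengths')
    return [
        {'images': imgs, 'hebrew': heb, 'pronunciation': pron, 'audio': audio}
        for imgs, heb, pron, audio in zip(image_groups, hebrew_words, pronunciations, audio_urls)
    ]

def to_json(records):
    # ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX escapes
    return json.dumps(records, ensure_ascii=False, indent=2)

def save_json(records, path):
    # An explicit UTF-8 encoding avoids depending on the platform default
    with open(path, 'w', encoding='utf-8') as json_file:
        json_file.write(to_json(records))

# Sample data: one word with placeholder file names
records = build_records(
    [['img1.jpg', 'img2.jpg', 'img3.jpg']],
    ['אֶלֶף'],
    ['e-leph'],
    ['1.mp3'],
)
text = to_json(records)
print(text)
```

In the script above, this would amount to calling build_records(list(imgs.values()), h_letters, e_letters, audio_urls) inside main() and then save_json on the result. The length check matters because a single missing entry would otherwise shift every later word out of alignment.
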
For more on how to scrape links to image, text, and audio file URLs in Python, see the similar question on Stack Overflow: https://stackoverflow.com/questions/56531176/
