python - How to scrape web news and combine the paragraphs of each article

Tags: python web-scraping beautifulsoup requests web-crawler

I am scraping news articles from https://nypost.com/search/China+COVID-19/page/2/?orderby=relevance. I use a for loop to fetch the content of each article, but I cannot combine the paragraphs of each article into one piece. My goal is to store each article as a single string, with all the strings collected in the myarticle list.

When I print(myarticle[0]), it gives me all of the articles at once. I expect it to give me a single article.
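
In other words, the goal is a flat list with one string per article, something like:

    myarticle = ['full text of article 1 ...', 'full text of article 2 ...']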

Any help would be greatly appreciated!

            for pagelink in pagelinks:
                #get page text
                page = requests.get(pagelink)
                #parse with BeautifulSoup
                soup = bs(page.text, 'lxml')
                containerr = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
                articletext = containerr.find_all('p')
                for paragraph in articletext:
                    #get the text only
                    text = paragraph.get_text()
                    paragraphtext.append(text)
                    
                #combine all paragraphs into an article
                thearticle.append(paragraphtext)
            # join paragraphs to re-create the article 
            myarticle = [''.join(article) for article in thearticle]
    
    print(myarticle[0])

For clarity, the full code is attached below:

import requests
from bs4 import BeautifulSoup as bs
from time import time, sleep
from random import randint
from warnings import warn
from IPython.display import clear_output

def scrape(url):
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
    request = 0
    urls = [f"{url}{x}" for x in range(1,2)]
    params = {
       "orderby": "relevance",
    }
    pagelinks = []
    title = []
    thearticle = []
    paragraphtext = []
    for page in urls:
        response = requests.get(url=page,
                                headers=user_agent,
                                params=params) 
        # controlling the crawl-rate
        start_time = time() 
        #pause the loop
        sleep(randint(8,15))
        #monitor the requests
        request += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} request/s'.format(request, request/elapsed_time))
        clear_output(wait = True)

        #throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(request, response.status_code))

        #Break the loop if the number of requests is greater than expected
        if request > 72:
            warn('Number of request was greater than expected.')
            break


        #parse the content
        soup_page = bs(response.text, 'lxml') 
        #select all the articles for a single page
        containers = soup_page.findAll("li", {'class': 'article'})
        
        #scrape the links of the articles
        for i in containers:
            url = i.find('a')
            pagelinks.append(url.get('href'))
        #scrape the titles of the articles
        for i in containers:
            atitle = i.find(class_ = 'entry-heading').find('a')
            thetitle = atitle.get_text()
            title.append(thetitle)
            for pagelink in pagelinks:
                #get page text
                page = requests.get(pagelink)
                #parse with BeautifulSoup
                soup = bs(page.text, 'lxml')
                containerr = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
                articletext = containerr.find_all('p')
                for paragraph in articletext:
                    #get the text only
                    text = paragraph.get_text()
                    paragraphtext.append(text)
                    
                #combine all paragraphs into an article
                thearticle.append(paragraphtext)
            # join paragraphs to re-create the article 
            myarticle = [''.join(article) for article in thearticle]
    
    print(myarticle[0])
print(scrape('https://nypost.com/search/China+COVID-19/page/'))

Accepted answer

You keep appending to the same existing list [], so it just keeps growing; you need to clear it on every pass through the loop.
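
The effect is that every entry ends up referencing the same ever-growing accumulator, which is why print(myarticle[0]) shows everything. A minimal sketch of the pattern (the names here are illustrative, not from the question):

    shared = []
    results = []
    for batch in (['a', 'b'], ['c']):
        for item in batch:
            shared.append(item)
        results.append(shared)  # every entry points at the same growing list
    print(results)  # [['a', 'b', 'c'], ['a', 'b', 'c']], not [['a', 'b'], ['c']]

In your code, this part: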

    articletext = containerr.find_all('p')
    for paragraph in articletext:
        #get the text only
        text = paragraph.get_text()
        paragraphtext.append(text)

    #combine all paragraphs into an article
    thearticle.append(paragraphtext)
# join paragraphs to re-create the article 
myarticle = [''.join(article) for article in thearticle]

should be something like this:

    articletext = containerr.find_all('p')
    thearticle = [] # clear from the previous loop
    paragraphtext = [] # clear from the previous loop
    for paragraph in articletext:
        #get the text only
        text = paragraph.get_text()
        paragraphtext.append(text)

    thearticle.append(paragraphtext)
    myarticle.append(thearticle)
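
Note that with this fix each myarticle entry is a nested list of the form [['paragraph 1', 'paragraph 2', ...]], so to get one plain string per article you would still join the inner list, for example (assuming myarticle is initialized to [] before the outer loop):

    # myarticle[i] == [[p1, p2, ...]]; join the inner list into a single string
    articles_as_strings = [''.join(entry[0]) for entry in myarticle]
    print(articles_as_strings[0])  # one article only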

But you can simplify it even further to:

    article = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
    myarticle.append(article.get_text())
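
Putting it together, a minimal end-to-end sketch of this simplified approach (the selectors mirror the ones in the question, while the headers, guards, and get_text() arguments are additions; nypost.com's markup may have changed since the question was asked):

    import requests
    from bs4 import BeautifulSoup as bs

    headers = {'user-agent': 'Mozilla/5.0'}
    search_url = 'https://nypost.com/search/China+COVID-19/page/1/?orderby=relevance'
    search_soup = bs(requests.get(search_url, headers=headers).text, 'lxml')

    # collect the article links from the search results, as in the question
    pagelinks = [li.find('a').get('href')
                 for li in search_soup.find_all('li', {'class': 'article'})
                 if li.find('a') is not None]

    myarticle = []
    for pagelink in pagelinks:
        soup = bs(requests.get(pagelink, headers=headers).text, 'lxml')
        article = soup.find('div', class_=['entry-content', 'entry-content-read-more'])
        if article is not None:  # skip pages missing the expected container
            # get_text() flattens the whole div into a single string in one call
            myarticle.append(article.get_text(separator=' ', strip=True))

    if myarticle:
        print(myarticle[0])  # one article, not all of them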

The original question, python - How to scrape web news and combine the paragraphs of each article, can be found on Stack Overflow: https://stackoverflow.com/questions/61888917/
