python - Strange formatting when writing HTML to CSV

Tags: python html csv beautifulsoup


I have a Python script that finds the text inside certain tags in an HTML document, and I need to write that text to a CSV file in the following format:




from bs4 import BeautifulSoup
import re
import csv

with open('xx01_med_dansk') as fp:
    soup = BeautifulSoup(fp, 'html5lib')
#    print(soup.prettify())

    with open('dk_snip.csv', 'w') as f:
        wr = csv.writer(f)

        var1 = soup.find('li', text = re.compile('Scan vendor:'), attrs = {'class' : 'property_name'})
        var2 = soup.find('li', text = re.compile('Vendor ID:'), attrs = {'class' : 'property_name'})

        vendor = var1.find_next('li')
        final = vendor.string

        vend_id = var2.find_next('li')
        final2 = vend_id.text

        for dk_desc in soup.find_all(re.compile("textarea")):
            final3 = dk_desc.text

        to_csv = final+final2+final3

I'm not sure how to format the data so that it is written to the CSV file correctly.

HTML file:

<!DOCTYPE html>
<html lang="en">
      <li class="property_name">
       <label for="id_194-description">
      <li class="property_value">
       <textarea class="mceNoEditor" cols="40" id="id_194-description" name="194-description" rows="10" style="width:100%">According to its version, the installation of Oracle Database on the remote host is no longer supported.

Lack of support implies that no new security patches for the product will be released by the vendor. As a result, it is likely to contain security vulnerabilities.</textarea>
      <li class="property_name">
       <label for="id_194-consequence">
      <li class="property_value">
       <textarea class="mceNoEditor" cols="40" id="id_194-consequence" name="194-consequence" rows="10" style="width:100%">The remote host is running an unsupported version of a database server.</textarea>
      <li class="property_name">
       <label for="id_194-solution">
      <li class="property_value">
       <textarea class="mceNoEditor" cols="40" id="id_194-solution" name="194-solution" rows="10" style="width:100%">Upgrade to a version of Oracle Database that is currently supported.</textarea>
      <li class="property_name">
       <label for="id_194-cve_id">
        Cve id:
      <li class="property_value">
       <textarea class="mceNoEditor" cols="40" id="id_194-cve_id" maxlength="8192" name="194-cve_id" rows="10" style="width:100%; height:80px"></textarea>
      <input id="id_194-override" name="194-override" type="hidden" value="11953"/>
      <input id="id_194-priority" name="194-priority" type="hidden"/>
      <li class="property_name">
       Vulnerability priority
      <li class="property_value">
       <select name="prio_194">
        <option selected="selected" value="0">
       : Oracle Database Unsupported (Nessus)
      <li class="property_name">
      <li class="property_value">
       <input type="submit" value="Save vulnerability changes"/>
    <br style="clear:both"/>
   <div class="box">
     Related vulnerabilities
     Oracle Database Unsupported (Nessus)
     <li class="property_name">
     <li class="property_value">
      According to its version, the installation of Oracle Database on the remote host is no longer supported.
      Lack of support implies that no new security patches for the product will be released by the vendor. As a result, it is likely to contain security vulnerabilities.
     <li class="property_name">
     <li class="property_value">
      The remote host is running an unsupported version of a database server.
     <li class="property_name">
     <li class="property_value">
      Upgrade to a version of Oracle Database that is currently supported.
    <br style="clear:both"/>
   <div class="box">
     Create new snippet
    <form action="/report/vulnerabilityEditor/?
								model=snippet" method="POST">
      <li class="property_name">
       <label for="id_language">
      <li class="property_value">
       <select id="id_language" name="language" style="width:100%">
        <option selected="" value="1">
         Danish (DK)
        <option value="2">
         English (EN)
        <option value="3">
         Icelandic (IS)
      <input id="id_vulnerability" name="vulnerability" type="hidden" value="194"/>
      <li class="property_name">
       <label for="id_title">
      <li class="property_value">
       <input id="id_title" maxlength="100" name="title" style="width:100%" type="text"/>
      <li class="property_name">
       <label for="id_recommendation">
      <li class="property_value">
       <input id="id_recommendation" maxlength="255" name="recommendation" style="width:100%" type="text"/>
      <li class="property_name">
       <label for="id_snippet">
      <li class="property_value">
       <textarea cols="40" id="id_snippet" name="snippet" rows="10" style="width:100%"></textarea>
      <li class="property_name">
       Scan type
      <li class="property_value">
       <select multiple="multiple" name="scan_type" size="6" style="width:100%">
        <option selected="selected" value="5">
         COMPANY PCI
        <option selected="selected" value="7">
        <option selected="selected" value="8">
         Firewall Audit
        <option selected="selected" value="6">
         Penetration Test
        <option selected="selected" value="9">
         WIFI Test
        <option selected="selected" value="10">
         APP Test
        <option selected="selected" value="1">
         External Security Analysis
        <option selected="selected" value="2">
         Internal Security Analysis
        <option selected="selected" value="3">
         Web Application Test
        <option selected="selected" value="4">
         Host Discovery Analysis
       -- Use ctrl to mark multiple types
      <li class="property_name">
      <li class="property_value">
       <input type="submit" value="Save new snippet"/>
     <br style="clear:both;"/>
   <div class="box">
     Edit snippets
    <input id="property_vulnerability_id" type="hidden" value="194"/>
    <input id="property_url_filter_snippets" type="hidden" value="/report/filterSnippets/"/>
     <li class="property_name">
     <li class="property_value">
      <select id="language" name="language">
       <option value="0">
       <option value="1">
       <option value="2">
       <option value="3">
     <li class="property_name">
      Scan Type
     <li class="property_value">
      <select id="scantype" name="scantype">
       <option value="0">
       <option value="5">
       <option value="7">
       <option value="8">
        Firewall Audit
       <option value="6">
        Penetration Test
       <option value="9">
        WIFI Test
       <option value="10">
        APP Test
       <option value="1">
        External Security Analysis
       <option value="2">
        Internal Security Analysis
       <option value="3">
        Web Application Test
       <option value="4">
        Host Discovery Analysis
    <br style="clear:both;"/>
    <div class="snippet">
     <form action="/report/vulnerabilityEditor/?action=edit&amp;id=194&amp;sid=1290&amp;model=snippet" method="POST">
      <input id="id_1290-vulnerability" name="1290-vulnerability" type="hidden" value="194"/>
       <li class="property_name">
        <label for="id_1290-language">
       <li class="property_value">
        <select id="id_1290-language" name="1290-language" style="width:100%">
         <option value="1">
          Danish (DK)
         <option selected="" value="2">
          English (EN)
         <option value="3">
          Icelandic (IS)
       <li class="property_name">
        <label for="id_1290-title">
       <li class="property_value">
        <input id="id_1290-title" maxlength="100" name="1290-title" style="width:100%" type="text" value="Oracle Database Unsupported"/>
       <li class="property_name">
        <label for="id_1290-recommendation">
       <li class="property_value">
        <input id="id_1290-recommendation" maxlength="255" name="1290-recommendation" style="width:100%" type="text" value="Upgrade to a version of Oracle Database that is currently supported."/>
       <li class="property_name">
        <label for="id_1290-snippet">
       <li class="property_value">
        <a href="https://cyberopswiki/index.php/How_to:_Add_figure_number_in_snippet" target="_blank">
         How to: Add figure number in snippet.
       <li class="property_value">
        <textarea cols="40" id="id_1290-snippet" name="1290-snippet" rows="10" style="width:100%">&lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; text-align: justify; line-height: normal;"&gt;&lt;span lang="EN-US" style="font-size: 10pt;"&gt;It has been detected, that the installed version of Oracle Application Server is&amp;nbsp;&lt;strong&gt;XXXX.&amp;nbsp;&lt;/strong&gt;This version is known to be vulnerable to a number of unspecified vulnerabilities, categorized as 'urgent'.&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; text-align: justify; line-height: normal;"&gt;&lt;span lang="EN-US" style="font-size: 10pt;"&gt;As this version is no longer supported for this platform, updates or patches may no longer be released, which have the consequence that vulnerabilities can not be patched, leaving the system vulnerable.&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; text-align: justify; line-height: normal;"&gt;&lt;span lang="EN-US" style="font-size: 10pt;"&gt;In version there are, according to more than 54 vulnerabilities which affects the installed version.&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; text-align: center; line-height: normal;" align="center"&gt;&lt;strong&gt;&lt;em&gt;&lt;span lang="EN-US" style="font-size: 8pt;"&gt;Figure 1: &lt;/span&gt;&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;&lt;span lang="EN-US" style="font-size: 8pt;"&gt;Oracle Application Server version.&lt;/span&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; text-align: justify; line-height: normal;"&gt;&lt;span lang="EN-US" style="font-size: 10pt;"&gt;More information on these vulnerabilities can be found at:&amp;nbsp;&lt;/span&gt;&lt;span style="font-size: 10pt;"&gt;&lt;a href=""&gt;&lt;span lang="EN-US" style="color: blue; mso-ansi-language: EN-US;"&gt;;/span&gt;&lt;/a&gt;&lt;a href=""&gt;&lt;span lang="EN-US" style="color: blue; mso-ansi-language: EN-US;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span lang="EN-US" style="font-size: 10pt;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; text-align: justify; line-height: normal;"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; text-align: justify; line-height: normal;"&gt;&lt;span lang="EN-US" style="font-size: 10pt;"&gt;It is recommended that the installed version is updated as soon as possible to the latest version.&lt;/span&gt;&lt;/p&gt;</textarea>
       <li class="property_name">
        Scan type
       <li class="property_value">
        <select multiple="multiple" name="scan_type" size="6" style="width:100%">
         <option selected="selected" value="5">
          COMPANY PCI
         <option selected="selected" value="7">
         <option selected="selected" value="8">
          Firewall Audit
         <option selected="selected" value="6">
          Penetration Test
         <option selected="selected" value="9">
          WIFI Test
         <option selected="selected" value="10">
          APP Test
         <option selected="selected" value="1">
          External Security Analysis
         <option selected="selected" value="2">
          Internal Security Analysis
         <option selected="selected" value="3">
          Web Application Test
         <option selected="selected" value="4">
          Host Discovery Analysis
        -- Use ctrl to mark multiple types
       <li class="property_name">
       <li class="property_value">
        <input type="submit" value="Update snippet"/>
     <br style="clear:both;"/>
    <div class="snippet">
     <form action="/report/vulnerabilityEditor/?action=edit&amp;id=194&amp;sid=172&amp;model=snippet" method="POST">
      <input id="id_172-vulnerability" name="172-vulnerability" type="hidden" value="194"/>
       <li class="property_name">
        <label for="id_172-language">
       <li class="property_value">
        <select id="id_172-language" name="172-language" style="width:100%">
         <option selected="" value="1">
          Danish (DK)
         <option value="2">
          English (EN)
         <option value="3">
          Icelandic (IS)
       <li class="property_name">
        <label for="id_172-title">
       <li class="property_value">
        <input id="id_172-title" maxlength="100" name="172-title" style="width:100%" type="text" value="Forældet Oracle Application Server 10g"/>
       <li class="property_name">
        <label for="id_172-recommendation">
       <li class="property_value">
        <input id="id_172-recommendation" maxlength="255" name="172-recommendation" style="width:100%" type="text"/>
       <li class="property_name">
        <label for="id_172-snippet">
       <li class="property_value">
        <a href="https://cyberopswiki/index.php/How_to:_Add_figure_number_in_snippet" target="_blank">
         How to: Add figure number in snippet.
       <li class="property_value">
        <textarea cols="40" id="id_172-snippet" name="172-snippet" rows="10" style="width:100%">&lt;p style="font-size: 13px;"&gt;Det konstateret, at den installerede version af Oracle Application Server er&amp;nbsp;&lt;strong&gt;XXXX.&amp;nbsp;&lt;/strong&gt;Denne version indeholder flere kendte samt uspecificeret s&amp;aring;rbarheder, der kategoriseres som v&amp;aelig;rende 'yderst kritiske' og 'kritiske'.&lt;/p&gt;
&lt;p style="font-size: 13px;"&gt;Da der ikke l&amp;aelig;ngere komme opdateringer til denne platform, vil disse s&amp;aring;rbarheder ikke blive udbedret, hvorfor systemet er meget udsat.&lt;/p&gt;
&lt;p style="font-size: 13px;"&gt;I version findes der if&amp;oslash;lge ikke mindre end 54 s&amp;aring;rbarheder, der ber&amp;oslash;rer denne version. Mere information om disse findes p&amp;aring; adressen&amp;nbsp;&lt;a href=""&gt;;/a&gt;&lt;a href=""&gt;&amp;nbsp;&lt;/a&gt;.&lt;/p&gt;
&lt;p style="font-size: 13px;"&gt;Det anbefales leverand&amp;oslash;ren af software l&amp;oslash;sningen kontakts, s&amp;aring; der hurtigst muligt kan opgraderes til en nyere, supporteret version.&amp;nbsp;&lt;/p&gt;</textarea>

Following Martin's suggestion, I changed the code as follows:

from bs4 import BeautifulSoup
import re
import csv
import glob

def get_danish(text):
    return re.compile(r'\b({0})\b'.format(text), flags=re.IGNORECASE).search

with open('dk_snip.csv', 'w', newline='') as f_out:
    csv_out = csv.writer(f_out)
#    csv_out.writerow(["Nessus", "ID", "Descrip"])

    for filename in glob.glob('/home/rj/Documents/snip/snips/*'):
        print("Processing:", filename)

        with open(filename) as f_in:
            soup = BeautifulSoup(f_in, 'html5lib')

            var1 = soup.find('li', text = re.compile('Scan vendor:'), attrs = {'class' : 'property_name'})
            var2 = soup.find('li', text = re.compile('Vendor ID:'), attrs = {'class' : 'property_name'})

            vendor = var1.find_next('li').get_text(strip=True)
            vend_id = var2.find_next('li').get_text(strip=True)

#    rows = [[vendor, vend_id, dk_desc.get_text(strip=True)] for dk_desc in soup.find_all("textarea")[:3]]

            for textarea in soup.find_all("textarea"):
                desc = textarea.get_text(strip=True)

                if get_danish('dette'):
                    csv_out.writerows([vendor, vend_id, desc])


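The tags you are getting need to be removed. One way to do that is with BeautifulSoup's .get_text(strip=True) function.

I'm assuming you want to repeat the Nessus and ID values for each textarea. The following shows how this could be done: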

from bs4 import BeautifulSoup
import csv
import re
import glob
import random

def get_language(text):
    # This will need to be added using another library - currently random
    return random.choice(["en", "dk"])

with open('dk_snip.csv', 'w', newline='') as f_out:
    csv_out = csv.writer(f_out)
    csv_out.writerow(["Nessus", "ID", "Text"])

    for filename in glob.glob('*.html'):        # search all HTML files in the current folder
        print("Processing:", filename)

        with open(filename) as f_in:
            soup = BeautifulSoup(f_in, 'html5lib')

            # Locate the label cells, then read the value from the <li> that follows each one
            var1 = soup.find('li', text=re.compile('Scan vendor:'), attrs = {'class' : 'property_name'})
            var2 = soup.find('li', text=re.compile('Vendor ID:'), attrs = {'class' : 'property_name'})

            nessus = var1.find_next('li').get_text(strip=True)
            vendor_id = var2.find_next('li').get_text(strip=True)    # renamed so the built-in id() is not shadowed

            for textarea in soup.find_all("textarea"):
                desc = textarea.get_text(strip=True)

                # One CSV row per textarea whose text is detected as Danish
                if get_language(desc) == 'dk':
                    csv_out.writerow([nessus, vendor_id, desc])

This would give you the following output CSV file:

Nessus,55786,"According to its version, the installation of Oracle Database on the remote host is no longer supported.

Lack of support implies that no new security patches for the product will be released by the vendor. As a result, it is likely to contain security vulnerabilities."
Nessus,55786,The remote host is running an unsupported version of a database server.
Nessus,55786,Upgrade to a version of Oracle Database that is currently supported.
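
Each data row above comes from a single writerow call with a three-item list. The modified script in the question calls writerows on the same kind of flat list, which treats every string as a row of its own and splits it into single-character cells. A minimal sketch of the difference, using io.StringIO in place of a real file and made-up sample values:

```python
import csv
import io

row = ["Nessus", "55786", "Upgrade"]   # sample values, for illustration only

# writerow: the list is ONE row, and each item becomes one cell.
buf_one = io.StringIO()
csv.writer(buf_one).writerow(row)
single = buf_one.getvalue()

# writerows: the list is treated as MANY rows. A string is itself an
# iterable, so each character of each string lands in its own cell.
buf_many = io.StringIO()
csv.writer(buf_many).writerows(row)
split = buf_many.getvalue()

print(single)
print(split)
```

If several rows really are collected before writing, writerows expects a list of lists, for example [[nessus, id, desc1], [nessus, id, desc2]].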

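Note: because your text contains newlines, the CSV writer automatically encloses those cells in double quotes. The file will still load correctly into another package.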
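
To see that the quoting is harmless, a cell with embedded newlines can be written and read back. This is a small sketch with a made-up sample text, again using io.StringIO instead of a file on disk:

```python
import csv
import io

text = "First paragraph.\n\nSecond paragraph."   # sample multi-line cell

# Write one row; the writer quotes the third cell because it contains newlines.
buf = io.StringIO()
csv.writer(buf).writerow(["Nessus", "55786", text])
raw = buf.getvalue()

# Read it back: the quoted cell comes out as a single field,
# with its embedded newlines intact.
rows = list(csv.reader(io.StringIO(raw, newline="")))

print(raw)
print(rows)
```

When reading from a real file, opening it with newline='' (as the script does for writing) keeps the embedded line breaks from being altered.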

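The script searches for all matching files in the current folder. For each textarea it calls get_language(), which still needs to be implemented using another library. If dk (or whatever else is needed) is detected, the row is added to the CSV file.

If your textarea contains HTML, you may need to process it further with another call to BeautifulSoup: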
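
Until a proper language-detection library is plugged in, a rough stand-in for get_language() could count a few common Danish words. The word list and the min_hits threshold below are assumptions chosen for illustration, not a reliable detector. Note that the compiled pattern has to be applied to the text itself; a bare reference to its search method is always truthy.

```python
import re

# Hypothetical marker list: a handful of common Danish words that are
# unlikely to appear as whole words in English text.
DANISH_WORDS = ("og", "det", "der", "ikke", "af", "som", "denne", "disse")
_DANISH_RE = re.compile(r"\b(?:{})\b".format("|".join(DANISH_WORDS)), re.IGNORECASE)

def get_language(text, min_hits=2):
    """Return 'dk' if at least min_hits marker words occur in text, else 'en'."""
    hits = len(_DANISH_RE.findall(text))
    return "dk" if hits >= min_hits else "en"

print(get_language("Det konstateret, at den installerede version af Oracle Application Server er XXXX."))
print(get_language("The remote host is running an unsupported version of a database server."))
```

Short texts, or texts that mix both languages, can easily be misclassified by a heuristic like this, so a dedicated library is still the better long-term choice.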

soup_desc = BeautifulSoup(desc, 'html5lib')

for text in soup_desc.stripped_strings:
    print(text)    # each fragment of visible text, with the markup removed
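
The same idea can be tried without any third-party dependency. The sketch below swaps BeautifulSoup for the standard library's html.parser.HTMLParser; the TextOnly class, the strip_html helper and the sample fragment are hypothetical names and data introduced only for this example.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect the text nodes of an HTML fragment and drop the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)   # turn &aelig; etc. into characters
        self.parts = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.parts.append(stripped)

def strip_html(fragment):
    """Return the visible text of fragment, with pieces joined by single spaces."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    return " ".join(parser.parts)

# Made-up fragment in the style of the snippet textareas
desc = '<p style="font-size: 13px;">Da der ikke <strong>l&aelig;ngere</strong> komme opdateringer</p>'
print(strip_html(desc))
```

The resulting plain string can then be passed to csv_out.writerow() in place of the raw desc value.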
