python - Save data as new lines, but in a single cell (lxml, Python)

Tags: python beautifulsoup lxml

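I want the data to look like this...

“Basic jersey

Does what it says on the tin

Main: 100% Cotton.”

in a single cell, but the data I get looks like this...

“Basic jersey Does what it says on the tin Main: 100% Cotton.”

Here is the HTML: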

<div class="about-me">
    <h4>ABOUT ME</h4>
    <span><div>Basic jersey</div><div>Does what it says on the tin</div><br>Main: 100% Cotton.</span>
</div>

Here is my code:

from selenium import webdriver
from lxml import html
import pandas as pd
import collections, os
from bs4 import BeautifulSoup

def Save_to_Csv(data):
    filename = 'data.csv'
    df = pd.DataFrame(data)
    df.set_index('Title', drop=True, inplace=True)
    # Append without a header row if the file exists; otherwise create it with one.
    if os.path.isfile(filename):
        df.to_csv(filename, mode='a', sep=",", header=False, encoding='utf-8')
    else:
        df.to_csv(filename, sep=",", encoding='utf-8')

with open('urls.txt', 'r') as f:
    links = [link.strip() for link in f]
driver = webdriver.Chrome()
for url in links:
    driver.get(url)
    source = driver.page_source
    tree = html.fromstring(source)
    data = BeautifulSoup(source, 'html.parser')
    imgtag = data.find_all('li', attrs={'class':'image-thumbnail'})
    image = []
    for imgsrc in imgtag:
        image.append(imgsrc.img['src'].replace('?$S$&wid=40&fit=constrain', '?$XXL$&wid=513&fit=constrain'))
    title = tree.xpath('string(.//div/h1)')
    price = tree.xpath('string(.//span[@class="current-price"])')
    sku = tree.xpath('string(.//div[@class="product-code"]/span)')
    aboutme = tree.xpath('string(.//div[@class="about-me"]/span)')

    foundings = collections.OrderedDict()
    foundings['Title'] = [title]
    foundings['Price'] = [price]
    foundings['Product_Code'] = [sku]
    foundings['About_Me'] = [aboutme]
    foundings['Image'] = [image]
    Save_to_Csv(foundings)

    print(title, price, sku, aboutme, image)
driver.close()

Best answer

Using the HTML you provided, you can solve this with the stripped_strings generator, as follows:

from bs4 import BeautifulSoup

html = """
<div class="about-me">
    <h4>ABOUT ME</h4>
    <span><div>Basic jersey</div><div>Does what it says on the tin</div><br>Main: 100% Cotton.</span>
</div>"""

soup = BeautifulSoup(html, "html.parser")

print('\n'.join(soup.span.stripped_strings))

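This yields each text fragment with its surrounding whitespace stripped, and the fragments are then joined together with newlines: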

Basic jersey
Does what it says on the tin
Main: 100% Cotton.
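
Since the question's code already parses the page with lxml, the same idea can be sketched there with `itertext()` instead of the `string(...)` XPath function, which concatenates all text with no separator. The helper name `text_lines` and the placeholder title are hypothetical. The standard `csv` module stands in for the question's pandas call so the round trip can be checked in memory. As far as I know, pandas' `to_csv` applies the same minimal quoting by default, but treat that as an assumption to verify.

```python
import csv
import io

from lxml import html

SNIPPET = """
<div class="about-me">
    <h4>ABOUT ME</h4>
    <span><div>Basic jersey</div><div>Does what it says on the tin</div><br>Main: 100% Cotton.</span>
</div>"""


def text_lines(element):
    """Join the element's non-empty text fragments with newlines (hypothetical helper)."""
    fragments = (text.strip() for text in element.itertext())
    return "\n".join(fragment for fragment in fragments if fragment)


tree = html.fromstring(SNIPPET)
span = tree.xpath('//div[@class="about-me"]/span')[0]
about_me = text_lines(span)

# Write a header and one row; the csv module wraps the About_Me field in
# quotes because it contains newline characters.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Title", "About_Me"])
writer.writerow(["placeholder title", about_me])  # placeholder, not real data

# Read the CSV text back to confirm the three lines are still one cell.
rows = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
print(rows[1][1])
```

In the question's loop, `aboutme` could be built the same way from the matching `span` element, and the multi-line string would then go into the `About_Me` column unchanged.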

For more on "python - Save data as new lines, but in a single cell (lxml, Python)", see the similar question on Stack Overflow: https://stackoverflow.com/questions/51986817/