python - Renaming files based on DataFrame contents using Python and Pandas

Tags: python pandas

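I am trying to read an xlsx file, compare all the reference numbers in a column with the files in a folder and, where they correspond, rename each file to the email address associated with its reference number.

The Excel file contains the following fields: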

 Reference     EmailAddress
   1123        bob.smith@yahoo.com
   1233        john.drako@gmail.com
   1334        samuel.manuel@yahoo.com
   ...         .....

My folder applicantCVs contains only doc files named after the values in the Reference column (for example, 1233.doc).

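How can I compare the contents of the applicantCVs folder with the Reference field in the Excel file and, where there is a match, rename each file to the corresponding email address?

Here is what I have tried so far: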

import os
import pandas as pd

dfOne = pd.read_excel('Book2.xlsx', na_values=['NA'], usecols = "A:D")
references = dfOne['Reference']

emailAddress = dfOne['EmailAddress']

cleanedEmailList = [x for x in emailAddress if str(x) != 'nan']

print(cleanedEmailList)
excelArray = []
filesArray = []

for root, dirs, files in os.walk("applicantCVs"):
    for filename in files:
        print(filename)  # original file name with extension, e.g. 1233.doc
        reworkedFile = os.path.splitext(filename)[0]
        filesArray.append(reworkedFile)

for entry in references:
    excelArray.append(str(entry))

for i in excelArray:
    if i in filesArray:
        print(i, "corresponds to the file names")

I compare the reference numbers with the folder contents and print the ones that match:

for i in excelArray:
    if i in filesArray:
        print(i, "corresponds to the file names")

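I tried renaming with os.rename(filename, cleanedEmailList), but that does not work because cleanedEmailList is a list of email addresses, not a single file name.

How can I match and rename the files?

Update: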

from pathlib import Path

import pandas as pd

dfOne = pd.read_excel('Book2.xlsx', na_values=['NA'], usecols="A:D")

# Drop incomplete rows first; converting before dropna would turn NaN into the string 'nan'
valid = dfOne.dropna(subset=["Reference", "EmailAddress"])
# Assumes the references are whole numbers; int() avoids keys such as '1123.0'
keys = valid["Reference"].astype(int).astype(str)

references = dict(zip(keys, valid["EmailAddress"]))
print(references)
files = Path("applicantCVs").glob("*")

for file in files:
    new_name = references.get(file.stem, file.stem)
    file.rename(file.with_name(f"{new_name}{file.suffix}"))

Best answer

Based on the sample data:

Reference     EmailAddress
   1123        bob.smith@yahoo.com
   1233        john.drako@gmail.com
   nan         jane.smith#example.com
   1334        samuel.manuel@yahoo.com

First, build a dict with the references as keys and the new names as values:

references = dict(df.dropna(subset=["Reference","EmailAddress"]).set_index("Reference")["EmailAddress"])
{'1123': 'bob.smith@yahoo.com',
 '1233': 'john.drako@gmail.com',
 '1334': 'samuel.manuel@yahoo.com'}

Note that the references here are str. If they are not strings in your original data, you can convert them with astype(str).
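One thing to watch for when converting: if the Reference column contains a missing value, pandas stores the whole column as float, so a plain astype(str) gives '1123.0' rather than '1123' and no file stem would match. The sketch below is one possible way around that, assuming the references are whole numbers. The sample rows and the helper name build_reference_map are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data standing in for Book2.xlsx. The missing reference
# makes pandas store the Reference column as float64.
df = pd.DataFrame(
    {
        "Reference": [1123, 1233, None, 1334],
        "EmailAddress": [
            "a@example.com",
            "b@example.com",
            "c@example.com",
            "d@example.com",
        ],
    }
)


def build_reference_map(frame):
    """Map reference strings such as '1123' to email addresses."""
    # Drop incomplete rows before converting, so NaN never becomes the string 'nan'.
    complete = frame.dropna(subset=["Reference", "EmailAddress"])
    # Assumption: references are whole numbers, so int() removes the float '.0'.
    keys = complete["Reference"].astype(int).astype(str)
    return dict(zip(keys, complete["EmailAddress"]))


references = build_reference_map(df)
print(references)
```

If the references in your file can contain letters or leading zeros, the int conversion would not be appropriate; reading the column as text in the first place is probably the safer route in that case.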

Then use pathlib.Path to find all the files in the data directory:

from pathlib import Path

files = list(Path("../data/renames").glob("*"))
[WindowsPath('../data/renames/1123.docx'),
 WindowsPath('../data/renames/1156.pptx'),
 WindowsPath('../data/renames/1233.txt')]

The renaming then becomes very simple:

for file in files:
    new_name = references.get(file.stem, file.stem)
    file.rename(file.with_name(f"{new_name}{file.suffix}"))

references.get looks up the new file name; if the stem has no entry, it falls back to the original stem, so the file keeps its name.

[WindowsPath('../data/renames/1156.pptx'),
 WindowsPath('../data/renames/bob.smith@yahoo.com.docx'),
 WindowsPath('../data/renames/john.drako@gmail.com.txt')]
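For a real folder you may want the loop to be a little more defensive. The following is a minimal sketch, not the answer's exact code: it skips files with no matching reference, refuses to overwrite an existing file, and reports what it renamed. The function name rename_by_reference, the temporary folder, and the example.com addresses are assumptions made for the demonstration.

```python
import tempfile
from pathlib import Path


def rename_by_reference(folder, references):
    """Rename files whose stem is a key in `references`; return (old, new) name pairs."""
    renamed = []
    # sorted() materialises the listing first, so renames do not disturb the iteration.
    for file in sorted(Path(folder).iterdir()):
        new_stem = references.get(file.stem)
        if not file.is_file() or new_stem is None:
            continue  # no matching reference: leave the entry alone
        target = file.with_name(f"{new_stem}{file.suffix}")
        if target.exists():
            continue  # never overwrite an existing file
        file.rename(target)
        renamed.append((file.name, target.name))
    return renamed


# Demonstration in a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1123.docx", "1156.pptx", "1233.txt"):
        (Path(tmp) / name).touch()
    mapping = {"1123": "a@example.com", "1233": "b@example.com"}
    result = rename_by_reference(tmp, mapping)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(result)
print(remaining)
```

Returning the list of renamed pairs makes it easy to log the changes or to undo them later, which is worth having before running this against the only copy of the applicants' files.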

Regarding python - Renaming files based on DataFrame contents using Python and Pandas, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55922680/