python - Renaming files based on DataFrame contents using Python and Pandas

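I am trying to read an xlsx file, compare every reference number in one of its columns against the files in a folder, and, where they match, rename each file to the email address associated with that reference number.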

The Excel file contains the following fields:

 Reference     EmailAddress
   1123        bob.smith@yahoo.com
   1233        john.drako@gmail.com
   1334        samuel.manuel@yahoo.com
   ...         .....

My folder applicantCVs contains only doc files named after the values in the Reference column (for example, 1233.doc).

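How can I compare the contents of the applicantCVs folder with the Reference field in the Excel file and, where there is a match, rename each file to the corresponding email address?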

This is what I have tried so far:

import os
import pandas as pd

dfOne = pd.read_excel('Book2.xlsx', na_values=['NA'], usecols = "A:D")
references = dfOne['Reference']

emailAddress = dfOne['EmailAddress']

cleanedEmailList = [x for x in emailAddress if str(x) != 'nan']

print(cleanedEmailList)
excelArray = []
filesArray = []

for root, dirs, files in os.walk("applicantCVs"):
    for filename in files:
        print(filename) #Original file name with type 1233.doc
        reworkedFile = os.path.splitext(filename)[0]
        filesArray.append(reworkedFile)

for entry in references:
    excelArray.append(str(entry))

for i in excelArray:
    if i in filesArray:
        print(i, "corresponds to the file names")

I compare the reference names with the folder contents and print them when they are the same:

for i in excelArray:
    if i in filesArray:
        print(i, "corresponds to the file names")

I tried to rename the files with os.rename(filename, cleanedEmailList), but that does not work because cleanedEmailList is a list of email addresses, not a single file name.

How can I match and rename the files?

Update:

import pandas as pd
from pathlib import Path

dfOne = pd.read_excel('Book2.xlsx', na_values=['NA'], usecols="A:D")

# Drop incomplete rows first; calling astype(str) before dropna would turn NaN into the string 'nan'
dfOne = dfOne.dropna(subset=["Reference", "EmailAddress"])
# Assumes the references are whole numbers: int() avoids keys such as '1123.0'
dfOne["Reference"] = dfOne["Reference"].astype(int).astype(str)

# Map each reference (as a string) to its email address
references = dict(dfOne.set_index("Reference")["EmailAddress"])
print(references)
files = Path("applicantCVs").glob("*")

for file in files:
    # Fall back to the current stem when the reference is not in the spreadsheet
    new_name = references.get(file.stem, file.stem)
    file.rename(file.with_name(f"{new_name}{file.suffix}"))

Best answer

Based on the sample data:

Reference     EmailAddress
   1123        bob.smith@yahoo.com
   1233        john.drako@gmail.com
   nan         jane.smith#example.com
   1334        samuel.manuel@yahoo.com

First, build a dict with the references as keys and the new names as values:

references = dict(df.dropna(subset=["Reference","EmailAddress"]).set_index("Reference")["EmailAddress"])

{'1123': 'bob.smith@yahoo.com',
 '1233': 'john.drako@gmail.com',
 '1334': 'samuel.manuel@yahoo.com'}

Note that the references here are str. If they are not strings in your original data, you can convert them with astype(str).
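A minimal sketch of that conversion, under some assumptions: the DataFrame is built inline instead of being read from Excel, the email addresses are hypothetical placeholders, and the references are assumed to be whole numbers. A numeric column with a missing value is stored as float, so converting straight to str would give keys like '1123.0'. Dropping the incomplete rows first and going through int avoids that.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet; the missing reference makes the column float
df = pd.DataFrame({
    "Reference": [1123, 1233, None, 1334],
    "EmailAddress": ["first@example.com", "second@example.com",
                     "third@example.com", "fourth@example.com"],
})

# Drop rows with a missing reference or email before converting anything
clean = df.dropna(subset=["Reference", "EmailAddress"])

# float -> int -> str, so 1123.0 becomes '1123' and can match a file stem
clean = clean.assign(Reference=clean["Reference"].astype(int).astype(str))

# Reference string -> email address
references = dict(zip(clean["Reference"], clean["EmailAddress"]))
print(references)
```

The row without a reference is gone from the mapping, and every key is a plain string that can be compared with file.stem.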

Then use pathlib.Path to find all the files in the data directory:

files = Path("../data/renames").glob("*")

[WindowsPath('../data/renames/1123.docx'),
 WindowsPath('../data/renames/1156.pptx'),
 WindowsPath('../data/renames/1233.txt')]

Renaming then becomes very simple:

for file in files:
    new_name = references.get(file.stem, file.stem)
    file.rename(file.with_name(f"{new_name}{file.suffix}"))

references.get looks up the new file name and falls back to the original stem if no entry is found, so unmatched files keep their names.

[WindowsPath('../data/renames/1156.pptx'),
 WindowsPath('../data/renames/bob.smith@yahoo.com.docx'),
 WindowsPath('../data/renames/john.drako@gmail.com.txt')]
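Since a bulk rename is hard to undo, it can help to preview it first. The following is only a sketch of one way to wrap the loop above: rename_by_reference and its dry_run flag are hypothetical names, not part of the answer, and the demo uses a temporary folder with placeholder email addresses. It skips files that have no matching reference and never overwrites an existing file.

```python
from pathlib import Path
import tempfile


def rename_by_reference(folder, references, dry_run=True):
    """Return the (old, new) path pairs; rename on disk only if dry_run is False."""
    planned = []
    # sorted() materialises the listing, so renames cannot affect the iteration
    for file in sorted(Path(folder).iterdir()):
        if not file.is_file() or file.stem not in references:
            continue  # leave unmatched entries alone
        target = file.with_name(f"{references[file.stem]}{file.suffix}")
        if target.exists():
            continue  # never overwrite an existing file
        planned.append((file, target))
        if not dry_run:
            file.rename(target)
    return planned


# Demo in a throwaway folder with hypothetical data
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1123.docx", "1156.pptx", "1233.txt"):
        (Path(tmp) / name).touch()
    refs = {"1123": "first@example.com", "1233": "second@example.com"}

    preview = rename_by_reference(tmp, refs)       # dry run: nothing changes on disk
    rename_by_reference(tmp, refs, dry_run=False)  # perform the renames
    result = sorted(p.name for p in Path(tmp).iterdir())

print(result)
```

Running the dry run first and inspecting the returned pairs shows which files would change before anything is touched.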

This question, python - Renaming files based on DataFrame contents using Python and Pandas, comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/55922680/
