python - Output .docx document using repository in Palantir Foundry

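Since the Foundry documentation is rather patchy and doesn't really provide an answer: is it somehow possible to use a Foundry code repository (the python-docx library is available and used) with a df as input to produce Word documents (.docx) as output? I thought that a combination of the transform input/output and the python-docx document.save() functionality might work, but I couldn't come up with a proper solution.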

    from pyspark.sql import functions as F
    from transforms.api import transform, transform_df, Input, Output
    import os, docx
    import pandas as pd
    
    @transform(
        output = Output("some_folder/"),
        source_df = Input(""),
    )
    
    def compute(source_df, output):
        df = source_df.dataframe()
        test = df.toPandas()
        document = docx.Document()
        document.add_paragraph(str(test.loc[1, 1]))
        document.save('test.docx')
        output.write_dataframe(df)

This code of course doesn't work, but I'm hoping there is a workable solution (ideally one that can produce multiple .docx files as output).

Best answer

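Your best bet is to use Spark to distribute the file generation across the executors. The transform below generates one Word document per row and stores it in a dataset container, which is recommended over using Compass (Foundry's folder system). Browse to the dataset to download the underlying files.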

    from transforms.api import transform, Output
    import pandas as pd
    import docx

    '''
    # ====================================================== #
    # === [DISTRIBUTED GENERATION OF FILESYSTEM OUTPUTS] === #
    # ====================================================== #

    Description
    -----------
    Writes one .docx file per row of a source Spark dataframe into the output dataset's filesystem

    Strategy
    --------
    1. Create a dummy Spark dataframe with a primary key and some text
    2. Apply a function to each row (rdd.foreach) that opens the output filesystem
       and writes a .docx with the contents of the text column

    '''


    @transform(
        output=Output("ri.foundry.main.dataset.7e0f243f-e97f-4e05-84b3-ebcc4b4a2a1c")
    )
    def compute(ctx, output):
        # Generate dummy data: 'name' becomes the file name, 'content' the document body
        pdf = pd.DataFrame({'name': ['docx_1', 'docx_2'], 'content': ['doc1 content', 'doc2 content']})
        data = ctx.spark_session.createDataFrame(pdf)

        # Write one .docx file per row of df into the output dataset
        def strings_to_doc(df, transform_output):
            rdd = df.rdd

            def generate_files(row):
                filename = row['name'] + '.docx'

                # Open a binary file in the output dataset and save the document into it
                with transform_output.filesystem().open(filename, 'wb') as worddoc:
                    doc = docx.Document()
                    doc.add_heading(row['name'])
                    doc.add_paragraph(row['content'])
                    doc.save(worddoc)

            # foreach runs generate_files on the executors, once per row
            rdd.foreach(generate_files)

        return strings_to_doc(data, output)
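
The per-row writing logic can be tried outside Foundry before putting it in a transform. The following is only a minimal sketch under stated assumptions: `InMemoryFileSystem`, `render_docx` and `write_row` are hypothetical names that are not part of the answer or of any Foundry API, and the stand-in filesystem mimics nothing beyond the `open(path, mode)` call used above. It relies on python-docx's `Document.save()` accepting a file-like object as well as a path.

```python
import io


class _CapturingBuffer(io.BytesIO):
    """In-memory file that stores its bytes in a dict when it is closed."""

    def __init__(self, store, path):
        super().__init__()
        self._store = store
        self._path = path

    def close(self):
        # Capture the contents once, before the buffer is released.
        if not self.closed:
            self._store[self._path] = self.getvalue()
        super().close()


class InMemoryFileSystem:
    """Hypothetical stand-in for output.filesystem(); only open() is mimicked."""

    def __init__(self):
        self.files = {}

    def open(self, path, mode="wb"):
        # The mode argument is accepted for compatibility but ignored here.
        return _CapturingBuffer(self.files, path)


def render_docx(heading, body, stream):
    """Write a Word document with one heading and one paragraph to stream."""
    import docx  # python-docx, imported lazily so the rest runs without it

    doc = docx.Document()
    doc.add_heading(heading)
    doc.add_paragraph(body)
    doc.save(stream)  # save() takes a path or a file-like object


def write_row(row, filesystem, render=render_docx):
    """Write one file named after row['name'] and return its file name."""
    filename = row["name"] + ".docx"
    with filesystem.open(filename, "wb") as stream:
        render(row["name"], row["content"], stream)
    return filename
```

Calling `write_row({"name": "docx_1", "content": "doc1 content"}, InMemoryFileSystem())` locally should leave the document bytes under `files["docx_1.docx"]`, provided python-docx is installed. Inside a transform, the same function could presumably be handed to `rdd.foreach` with `output.filesystem()` in place of the stand-in, as `generate_files` does above.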

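A pandas UDF will also work if you prefer the input as a pandas dataframe, but you are then forced to define a schema, which is inconvenient for this use case.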

The original question, "Output .docx document using repository in Palantir Foundry", can be found on Stack Overflow: https://stackoverflow.com/questions/71079515/
