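Hello experts,
I have run into a problem and need a solution. Any help would be appreciated.
I have a dynamic frame created from XML files stored in S3.
The frame has a nested field, "ReceiptNumber". The schema of the dynamic frame is shown below: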
root
|-- Receipt: struct
| |-- Front: struct
| | |-- FrontNumber: string
| | |-- CountryorTerritoryCode: string
| | |-- TaxId: string
| |-- ReceiptAmount: double
| |-- ReceiptCurrencyCode: string
| |-- ReceiptDateCCYYMMDD: int
| |-- ReceiptNumber: double
| |-- TaxVarianceAmount: double
| |-- TransferDetails: array
| | |-- element: struct
| | | |-- BillCategoryCode: string
| | | |-- BillCategoryDetailCode: string
| | | |-- Porting: array
| | | | |-- element: struct
| | | | | |-- AddressDetails: struct
| | | | | | |-- ConsigneeAddress: struct
| | | | | | | |-- Address: struct
| | | | | | | | |-- AddressText2: string
| | | | | | | | |-- CityName: string
| | | | | | | | |-- CountryorTerritoryCode: string
| | | | | | | | |-- PostalCode: string
| | | | | | | | |-- StateCode: string
| | | | | | | | |-- StreetAddress: string
| | | | | | | |-- Addressee: struct
| | | | | | | | |-- Name: string
| | | | | | | |-- Attention: struct
| | | | | | | | |-- Name: string
| | | | | | |-- SenderAddress: struct
| | | | | | | |-- Address: struct
| | | | | | | | |-- CityName: string
| | | | | | | | |-- CountryorTerritoryCode: string
| | | | | | | | |-- PostalCode: string
| | | | | | | | |-- StateCode: string
| | | | | | | | |-- StreetAddress: string
| | | | | | | |-- Addressee: struct
| | | | | | | | |-- Name: string
| | | | | | | |-- Attention: struct
| | | | | | | | |-- Name: string
| | | | | | |-- ThirdPartyAddress: struct
| | | | | | | |-- Address: struct
| | | | | | | | |-- CityName: string
| | | | | | | | |-- CountryorTerritoryCode: string
| | | | | | | | |-- PostalCode: string
| | | | | | | | |-- StreetAddress: string
| | | | | | | |-- Addressee: struct
| | | | | | | | |-- Name: string
| | | | | | | |-- Attention: struct
| | | | | | | | |-- Name: string
| | | | | |-- BillOptionCode: string
| | | | | |-- LeadPortingNumber: string
| | | | | |-- Package: array
| | | | | | |-- element: struct
| | | | | | | |-- BillDetails: struct
| | | | | | | | |-- Bill: array
| | | | | | | | | |-- element: struct
| | | | | | | | | | |-- BillInformation: array
| | | | | | | | | | | |-- element: struct
| | | | | | | | | | | | |-- BasisCurrencyCode: string
| | | | | | | | | | | | |-- BasisValue: double
| | | | | | | | | | | | |-- BilldUnitQuantity: int
| | | | | | | | | | | | |-- CurrencyCode: string
| | | | | | | | | | | | |-- DescriptionCode: string
| | | | | | | | | | | | |-- DescriptionOfBills: string
| | | | | | | | | | | | |-- ExemptionAmount: double
| | | | | | | | | | | | |-- IncentiveAmount: double
| | | | | | | | | | | | |-- NetAmount: double
| | | | | | | | | | | | |-- TaxIndicator: double
| | | | | | | | | | |-- ClassificationCode: string
| | | | | | | |-- ContainerType: string
| | | | | | | |-- MiscellaneousDetails: struct
| | | | | | | | |-- MiscellaneousLineItems: struct
| | | | | | | | | |-- LineItem: struct
| | | | | | | | | | |-- LineNumber: int
| | | | | | | | | | |-- LineText: string
| | | | | | | |-- PackageBillableKeyedDimensions: struct
| | | | | | | | |-- Height: double
| | | | | | | | |-- Length: double
| | | | | | | | |-- Width: double
| | | | | | | |-- PackageDimension: struct
| | | | | | | | |-- Height: double
| | | | | | | | |-- Length: double
| | | | | | | | |-- UnitOfMeasure: string
| | | | | | | | |-- Width: double
| | | | | | | |-- PackageKeyedDimensions: struct
| | | | | | | | |-- Height: double
| | | | | | | | |-- Length: double
| | | | | | | | |-- UnitOfMeasure: string
| | | | | | | | |-- Width: double
| | | | | | | |-- PackageQuantity: struct
| | | | | | | | |-- ActualQuantity: struct
| | | | | | | | | |-- Quantity: int
| | | | | | | |-- PackageWeight: struct
| | | | | | | | |-- ActualWeight: struct
| | | | | | | | | |-- UnitOfMeasure: string
| | | | | | | | | |-- Weight: double
| | | | | | | | |-- BilledWeight: struct
| | | | | | | | | |-- UnitOfMeasure: string
| | | | | | | | | |-- Weight: double
| | | | | | | | |-- BilledWeightType: double
| | | | | | | |-- TrackingNumber: string
| | | | | | | |-- Zone: int
| | | | | |-- PayerRoleCd: int
| | | | | |-- PickUpRecordNumber: long
| | | | | |-- PortingReferences: struct
| | | | | | |-- Reference: array
| | | | | | | |-- element: struct
| | | | | | | | |-- ReferenceNumber: string
| | | | | | | | |-- Sequence: int
| | | | | |-- TransferDateCCYYMMDD: int
| |-- TypeCode: string
| |-- TypeDetailCode: double
Before writing the dynamic frame, I want to change the field "ReceiptNumber" to string type, like this:
....
....
| |-- ReceiptCurrencyCode: string
| |-- ReceiptDateCCYYMMDD: int
| |-- ReceiptNumber: string
| |-- TaxVarianceAmount: double
....
....
Can this be done with apply_mapping?
Is there any other solution?
Best answer
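In the end, I solved this with a somewhat different approach.
To recap: I have a Glue ETL job written as a Python script.
It processes XML files. After processing, the schema looks like the one shown in the question above.
I wanted to change the type of one of its nodes, "ReceiptNumber", from its inferred numeric type (double in the schema above) to string.
First, I created a dynamic frame from the S3 files as usual: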
d0 = glueContext.create_dynamic_frame.from_options( connection_type = "s3", connection_options={"paths": [s3_path]}, format = "xml", format_options={"rowTag": "ReceiptDetails"}, transformation_ctx = "d0")
Then I converted the dynamic frame to a PySpark DataFrame:
df = d0.toDF()
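Next, I used the function from the following post to modify the nested struct field and its type:
Pyspark: How to Modify a Nested Struct Field
Using that function, I built a new_schema, applied it as shown below, and converted the result back into a new dynamic frame: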
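from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import from_json, to_json
# Serialize the Receipt struct to JSON, then parse it back using the modified schema
df = df.withColumn("Receipt_json", to_json("Receipt")).drop("Receipt")
df = df.withColumn("Receipt", from_json("Receipt_json", new_schema)).drop("Receipt_json")
d0 = DynamicFrame.fromDF(df, glueContext, "d0")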
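The answer does not show how new_schema was built; it relies on the function from the linked post. As one possible sketch (not the linked function), the dictionary form of a Spark schema can be edited directly. The helper name retype_field and the two-field sample schema are hypothetical, and the sketch only walks struct nesting, which is enough for "ReceiptNumber" because it sits directly under "Receipt".

```python
import copy


def retype_field(schema_json, path, new_type):
    """Return a copy of a Spark schema dict with the field at `path` retyped.

    `schema_json` is the dict form of a struct type, i.e. the shape that
    StructType.jsonValue() returns. `path` is a list of field names.
    Only struct nesting is walked; arrays are not descended into.
    """
    result = copy.deepcopy(schema_json)  # leave the caller's schema untouched
    node = result
    for depth, name in enumerate(path):
        if not isinstance(node, dict) or node.get("type") != "struct":
            raise KeyError(f"'{'.'.join(path[:depth])}' is not a struct")
        match = next((f for f in node["fields"] if f["name"] == name), None)
        if match is None:
            raise KeyError(f"field '{'.'.join(path[:depth + 1])}' not found")
        if depth == len(path) - 1:
            match["type"] = new_type  # retype the target field
        else:
            node = match["type"]  # step into the nested struct
    return result


# Hypothetical, heavily trimmed stand-in for the Receipt struct's schema.
receipt_json = {
    "type": "struct",
    "fields": [
        {"name": "ReceiptDateCCYYMMDD", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "ReceiptNumber", "type": "double", "nullable": True, "metadata": {}},
    ],
}
new_schema_json = retype_field(receipt_json, ["ReceiptNumber"], "string")

# In the Glue job (not executed here) this would be used roughly as:
#   from pyspark.sql.types import StructType
#   receipt_json = df.schema["Receipt"].dataType.jsonValue()
#   new_schema = StructType.fromJson(
#       retype_field(receipt_json, ["ReceiptNumber"], "string"))
```

Working on a copy of the schema dict keeps the original DataFrame schema available for comparison, and every other field, including the deeply nested arrays, is carried over unchanged.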
From the new dynamic frame, in which "ReceiptNumber" is now a string, I generated a JSON schema:
import json
receiptSchema = d0.schema()
withReceiptSchema = json.dumps(receiptSchema.jsonValue())
Finally, I re-created the dynamic frame from the XML source using the new schema and wrote it out as a JSON file:
d0 = glueContext.create_dynamic_frame.from_options( connection_type = "s3", connection_options={"paths": [s3_path]}, format = "xml", format_options={"withSchema": withReceiptSchema, "rowTag": "ReceiptDetails"}, transformation_ctx = "d0")
# write the data, read with the schema above, to a JSON file
glueContext.write_dynamic_frame.from_options(frame = d0, connection_type = "s3", connection_options = {"path": s3_write_path}, format = "json")
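To confirm the change took effect, the written output can be spot-checked. The following is a minimal sketch, assuming the output is newline-delimited JSON with one record per line; the helper name field_types and the sample records are made up for illustration and do not come from the original job.

```python
import json


def field_types(json_lines, path):
    """Collect the Python type names found at `path` across JSON Lines records."""
    seen = set()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        value = json.loads(line)
        for key in path:
            # A missing key or non-object value resolves to None
            value = value.get(key) if isinstance(value, dict) else None
        seen.add(type(value).__name__)
    return seen


# Hypothetical records standing in for lines read from the output file.
sample_output = [
    '{"Receipt": {"ReceiptNumber": "1234567", "ReceiptAmount": 10.5}}',
    '{"Receipt": {"ReceiptNumber": "7654321", "ReceiptAmount": 3.0}}',
]
types_found = field_types(sample_output, ["Receipt", "ReceiptNumber"])
```

If every record reports str for Receipt.ReceiptNumber, the values were written as strings; float or int would indicate that the numeric type was still in effect when the file was written.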
I hope this answer helps anyone who runs into a similar error or obstacle when working with AWS Glue jobs.
Source: "python-3.x - Changing the data type of a specific column of a dynamic frame in awsglue", a question on Stack Overflow: https://stackoverflow.com/questions/71592277/