go - Apache Beam left join in Go

Tags: go google-cloud-platform google-cloud-dataflow apache-beam

Is there a simple way to perform a left join of two PCollections in Go? As far as I can tell, SQL joins are only available in the Java SDK.

package main

import (
    "context"
    "flag"

    "github.com/apache/beam/sdks/v2/go/pkg/beam"
    "github.com/apache/beam/sdks/v2/go/pkg/beam/log"
    "github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
)

type customer struct {
    CustID int
    FName  string
}

type order struct {
    OrderID int
    Amount  int
    Cust_ID int
}

func main() {

    flag.Parse()
    beam.Init()

    ctx := context.Background()

    p := beam.NewPipeline()
    s := p.Root()

    var custList = []customer{
        {1, "Bob"},
        {2, "Adam"},
        {3, "John"},
        {4, "Ben"},
        {5, "Jose"},
        {6, "Bryan"},
        {7, "Kim"},
        {8, "Tim"},
    }

    var orderList = []order{
        {123, 100, 1},
        {125, 30, 3},
        {128, 50, 7},
    }

    custPCol := beam.CreateList(s, custList)

    orderPCol := beam.CreateList(s, orderList)

    // Left join custPCol with orderPCol.
    // Expected result:
    // CustID | FName   |OrderID| Amount
    //     1  | Bob     |   123 | 100
    //     2  | Adam    |       |
    //     3  | John    |   125 |  30
    //     4  | Ben     |       |
    //     5  | Jose    |       |
    //     6  | Bryan   |       |
    //     7  | Kim     |   128 |  50
    //     8  | Tim     |       |

    if err := beamx.Run(ctx, p); err != nil {
        log.Exitf(ctx, "Failed to execute job: %v", err)
    }

}

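I would like to join these two PCollections and then do further processing on the result. I have seen the documentation for CoGroupByKey, but I could not turn its output into the shape that a normal SQL join would produce.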

Any suggestions?

Best answer

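Try something like this, which passes the orders as a side input and returns the first matching order for each customer: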

type resultType struct {
    CustID  int
    FName   string
    OrderID int
    Amount  int
}

result := beam.ParDo(s, func(c customer, iterOrder func(*order) bool) resultType {
    var o order

    for iterOrder(&o) {
        if c.CustID == o.Cust_ID {
            return resultType{
                CustID:  c.CustID,
                FName:   c.FName,
                OrderID: o.OrderID,
                Amount:  o.Amount,
            }
        }
    }

    return resultType{
        CustID: c.CustID,
        FName:  c.FName,
    }
}, custPCol, beam.SideInput{Input: orderPCol})
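
The snippet above stops at the first matching order, so a customer with several orders yields only one row. A minimal sketch of a variant that emits one row per matching order is shown below. It assumes the Go SDK accepts a DoFn of the shape (element, side-input iterator, emitter), and it is exercised here without Beam: `leftJoinFn`, `sliceIter`, and `joinOne` are hypothetical names introduced for illustration, and the second order for Bob is made-up sample data.

```go
package main

import "fmt"

type customer struct {
	CustID int
	FName  string
}

type order struct {
	OrderID int
	Amount  int
	Cust_ID int
}

type resultType struct {
	CustID  int
	FName   string
	OrderID int
	Amount  int
}

// leftJoinFn emits one row per matching order, or a single row with zero
// OrderID and Amount when the customer has no orders at all.
func leftJoinFn(c customer, iterOrder func(*order) bool, emit func(resultType)) {
	matched := false
	var o order
	for iterOrder(&o) {
		if o.Cust_ID != c.CustID {
			continue
		}
		matched = true
		emit(resultType{c.CustID, c.FName, o.OrderID, o.Amount})
	}
	if !matched {
		emit(resultType{CustID: c.CustID, FName: c.FName})
	}
}

// sliceIter is a hypothetical stand-in for a side-input iterator,
// backed by a plain slice so the function can be tried without a runner.
func sliceIter(orders []order) func(*order) bool {
	i := 0
	return func(o *order) bool {
		if i >= len(orders) {
			return false
		}
		*o = orders[i]
		i++
		return true
	}
}

// joinOne runs leftJoinFn for one customer and collects the emitted rows.
func joinOne(c customer, orders []order) []resultType {
	var rows []resultType
	leftJoinFn(c, sliceIter(orders), func(r resultType) { rows = append(rows, r) })
	return rows
}

func main() {
	orders := []order{{123, 100, 1}, {124, 20, 1}, {125, 30, 3}}
	fmt.Println(joinOne(customer{1, "Bob"}, orders))  // [{1 Bob 123 100} {1 Bob 124 20}]
	fmt.Println(joinOne(customer{2, "Adam"}, orders)) // [{2 Adam 0 0}]
}
```

In a pipeline, the function would presumably be passed to beam.ParDo in place of the anonymous function, with the same custPCol and beam.SideInput{Input: orderPCol} arguments. Note that every customer scans the whole order list, so this approach fits best when the order collection is small.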

Or, if you want to use CoGroupByKey:

custWithKeyPCol := beam.ParDo(s, func(c customer) (int, customer) {
    return c.CustID, c
}, custPCol)

orderWithKeyPCol := beam.ParDo(s, func(o order) (int, order) {
    return o.Cust_ID, o
}, orderPCol)

resultPCol := beam.CoGroupByKey(s, custWithKeyPCol, orderWithKeyPCol)

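// fmt must be added to the import block.
beam.ParDo0(s, func(custID int, custIter func(*customer) bool, orderIter func(*order) bool) {
    var c customer
    var o order
    // Assumes one customer per CustID: orderIter is read only once.
    for custIter(&c) {
        matched := false
        for orderIter(&o) {
            matched = true
            fmt.Println(custID, c.FName, o.OrderID, o.Amount)
        }
        if !matched {
            // Left join: keep customers that have no orders.
            fmt.Println(custID, c.FName)
        }
    }
}, resultPCol)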
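
To make the join semantics concrete, here is a minimal sketch in plain Go, without Beam, of what the CoGroupByKey version is meant to compute: orders are grouped by customer key, then each customer is paired with its orders or kept alone when it has none. The names `row`, `HasOrder`, and `leftJoin` are assumptions made for this illustration and do not come from the Beam SDK.

```go
package main

import "fmt"

type customer struct {
	CustID int
	FName  string
}

type order struct {
	OrderID int
	Amount  int
	Cust_ID int
}

// row is one line of the joined output. HasOrder separates "no order"
// from a real order whose OrderID or Amount happens to be zero.
type row struct {
	CustID   int
	FName    string
	OrderID  int
	Amount   int
	HasOrder bool
}

// leftJoin keeps every customer and attaches each of its orders, if any.
func leftJoin(customers []customer, orders []order) []row {
	// Group orders by key, which is the role CoGroupByKey plays in the pipeline.
	byCust := make(map[int][]order)
	for _, o := range orders {
		byCust[o.Cust_ID] = append(byCust[o.Cust_ID], o)
	}

	var rows []row
	for _, c := range customers {
		matches := byCust[c.CustID]
		if len(matches) == 0 {
			// Unmatched left-side record: emit it with empty order fields.
			rows = append(rows, row{CustID: c.CustID, FName: c.FName})
			continue
		}
		for _, o := range matches {
			rows = append(rows, row{c.CustID, c.FName, o.OrderID, o.Amount, true})
		}
	}
	return rows
}

func main() {
	customers := []customer{{1, "Bob"}, {2, "Adam"}, {3, "John"}}
	orders := []order{{123, 100, 1}, {125, 30, 3}}
	for _, r := range leftJoin(customers, orders) {
		fmt.Printf("%+v\n", r)
	}
}
```

An explicit flag such as HasOrder is one way to represent the NULL columns of a SQL left join, since Go structs have no null value for int fields. A pointer field or a separate output collection would be alternatives.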

Regarding "go - Apache Beam left join in Go", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/75319138/
