OrientDB ETL: loading CSV with vertices in one file and edges in another

Tags: orientdb orientdb2.2 orientdb-etl

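I have some data in two CSV files: one contains the vertices and the other contains the edges between them. I am working out how to set this up with ETL and am close, but not quite there. It mostly works, but my edges have properties and I am not sure they are being loaded correctly. This question was helpful, but I am still missing something...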

Here is my data:

vertices.csv:

label,data,date
v01,0.1234,2015-01-01
v02,0.5678,2015-01-02
v03,0.9012,2015-01-03

edges.csv:

u,v,weight,date
v01,v02,12.4,2015-06-17
v02,v03,17.9,2015-09-14

I import my vertices with this:

commonVertices.json:

{
"begin": [ 
             { "let": { "name":       "$filePath",  
                        "expression": "$fileDirectory.append($fileName)" 
                      } 
             }
         ],
"config": { "log": "info"},
"source": { "file": { "path": "$filePath" } },
"extractor": { "csv": { "ignoreEmptyLines": true,
                        "nullValue": "N/A",
                        "dateFormat": "yyyy-mm-dd"
                      }
             },
"transformers": [
                    { "vertex": { "class": "myVertex" } },
                    { "code":   { "language": "Javascript",
                                  "code":     "print('    Current record: ' + record); record;" }
                    }
                ],
"loader": { "orientdb": {
            "dbURL": "plocal:my_orientdb",
            "dbType": "graph",
            "batchCommit": 1000,
            "classes": [ { "name": "myVertex", "extends": "V" }
                       ],
            "indexes": []
            }
          }
}

vertices.json:

{ "config": { "log":           "info",
              "fileDirectory": "./",
              "fileName":      "vertices.csv"
            }
}

commonEdges.json:

{
    "begin": [
        { "let": { "name": "$filePath",
                   "expression": "$fileDirectory.append($fileName)"
                 }
        }
    ],

    "config": { "log": "info"
              },

    "source": { "file": { "path": "$filePath" } },

    "extractor": { "csv": { "ignoreEmptyLines": true,
                            "nullValue": "N/A",
                            "dateFormat": "yyyy-mm-dd"
                          }
                 },

    "transformers": [
            { "merge":  { "joinFieldName": "u", "lookup": "myVertex.label" } },
            { "edge":   { "class":         "myEdge",
                          "joinFieldName": "v",
                          "lookup":        "myVertex.label",
                          "direction":     "out",
                          "unresolvedLinkAction": "NOTHING"
                        }
            },
            { "field": { "fieldNames": ["u", "v"], "operation": "remove" } }
        ],

    "loader": {
        "orientdb": {
            "dbURL": "plocal:my_orientdb",
            "dbType": "graph",
            "batchCommit": 1000,
            "useLightweightEdges": false,
            "classes": [
                { "name": "myEdge",   "extends": "E" }
            ],
            "indexes": []
        }
    }
}

edges.json:

{
    "config": {
        "log": "info",
        "fileDirectory": "./",
        "fileName": "edges.csv"
    }
}

I run it with oetl.sh like this:

$ oetl.sh vertices.json commonVertices.json
$ oetl.sh edges.json commonEdges.json

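Everything runs, but I am new to OrientDB, so maybe the properties are getting into my edges and I am just not seeing them. When I query the edges, the weight and date fields do not show up: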

orientdb {db=my_orientdb}> SELECT FROM myEdge
+----+-----+------+-----+-----+
|#   |@RID |@CLASS|out  |in   |
+----+-----+------+-----+-----+
|0   |#33:0|myEdge|#25:0|#26:0|
|1   |#34:0|myEdge|#26:0|#27:0|
+----+-----+------+-----+-----+

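The vertex table contains the [weight] field from edges.csv, and the [date] field is being corrupted in an odd way. The day of the month is overwritten by the day from the edges.csv file, which is undesirable, but what is strange to me is that the month itself does not change: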

orientdb {db=my_orientdb}> SELECT FROM myVertex
+----+-----+--------+------+-------------------+-----+------+----------+---------+
|#   |@RID |@CLASS  |data  |date               |label|weight|out_myEdge|in_myEdge|
+----+-----+--------+------+-------------------+-----+------+----------+---------+
|0   |#25:0|myVertex|0.1234|2015-01-17 00:06:00|v01  |12.4  |[#33:0]   |         |
|1   |#26:0|myVertex|0.5678|2015-01-14 00:09:00|v02  |17.9  |[#34:0]   |[#33:0]  |
|2   |#27:0|myVertex|0.9012|2015-01-03 00:01:00|v03  |      |          |[#34:0]  |
+----+-----+--------+------+-------------------+-----+------+----------+---------+

I am sure this is probably a simple tweak; any help would be great!

Best answer

Use edgeFields in the edge transformer to bind the properties to the edge. Example:

 "transformers": [
            { "merge":  { "joinFieldName": "u", "lookup": "myVertex.label" } },
            { "edge":   { "class":         "myEdge",
                          "joinFieldName": "v",
                          "lookup":        "myVertex.label",
                          "edgeFields": { "weight": "${input.weight}", "date": "${input.date}" },
                          "direction":     "out",
                          "unresolvedLinkAction": "NOTHING"
                        }
            },
            { "field": { "fieldNames": ["u", "v"], "operation": "remove" } }
        ],
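
A side note on the garbled dates shown in the question, separate from edgeFields: the values look like what Java's SimpleDateFormat produces when a pattern uses lowercase mm (minutes) where uppercase MM (month) was intended. The sketch below assumes the csv extractor's dateFormat is interpreted as a SimpleDateFormat pattern; DatePatternDemo and reparse are hypothetical names used only for illustration. If that assumption holds, changing dateFormat to yyyy-MM-dd in both extractor blocks would likely fix the date values.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical demo class; not part of OrientDB or its ETL module.
class DatePatternDemo {

    // Parse `text` with `pattern`, then render the result with an unambiguous pattern.
    static String reparse(String pattern, String text) {
        try {
            Date parsed = new SimpleDateFormat(pattern).parse(text);
            return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse " + text + " with " + pattern, e);
        }
    }

    public static void main(String[] args) {
        // Lowercase mm means minutes: "06" lands in the minutes field and the
        // month is never set, so it stays at its default, January.
        System.out.println(reparse("yyyy-mm-dd", "2015-06-17")); // 2015-01-17 00:06:00

        // Uppercase MM means month, which is what the CSV data needs.
        System.out.println(reparse("yyyy-MM-dd", "2015-06-17")); // 2015-06-17 00:00:00
    }
}
```

The first printed value matches the date stored on vertex v01 in the question (2015-01-17 00:06:00), which is consistent with this explanation.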

Hope this helps.

Source: Stack Overflow, https://stackoverflow.com/questions/38628356/
