c++ - How do I write data in Avro with the C++ interface when a field is nullable?

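First, I did search on this question. I found an answer for the C interface and an answer for Java, but none for C++. Unfortunately, the methods called in the C example do not exist in the C++ API, so I can't simply mimic the answer given in that particular Stack Overflow thread.
I'm attempting something that ought to be fairly simple. Yet after an hour or two I have only managed to get close to an answer without actually finding it. For simplicity, I reduced the record I'm trying to write to a single field. That field is a string that may be null, which in Avro means the field is optional. The null aspect of the field is expressed through an Avro union, where by convention the null type comes first in the field's schema.
Here is what I've learned so far from a great deal of trial and error: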

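1. You need a templated codec_traits struct containing an encoder and a decoder for the record you want to write. This is usually defined in a header somewhere.

2. If you are loading the schema from a file, you need to define that schema in JSON format in a separate file.

3. In your C++ code, you declare an avro::DataFileWriter using the schema you loaded and the record type from the header above. You then populate a local record with data and call the write() method.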

Should be simple enough. Not so much, as it turns out. For the details behind the list above, here is the code I'm currently using:

Header:

    #ifndef RECURSIVE_HH
    #define RECURSIVE_HH
    
    #include "Specific.hh"
    #include "Encoder.hh"
    #include "Decoder.hh"
    
    namespace recursive_record
    {
       struct recursive_data
       {
          std::string   fstring;
    
       };
    }
    
    namespace avro
    {
       template<> struct codec_traits<recursive_record::recursive_data>
       {
          static void encode( Encoder& e, const recursive_record::recursive_data& v )
          {
             avro::encode( e, v.fstring );
    
          }
    
          static void decode( Decoder& d, recursive_record::recursive_data& v )
          {
             avro::decode( d, v.fstring );
    
          }
       };
    }
    
    #endif /* RECURSIVE_HH */

JSON schema file:

    {
        "type": "record",
        "name": "Root",
        "fields": [
            {
                "name": "fstring",
                "type": [
                    "null",
                    "string"
                ]
            }
        ]
    }

Main C++ file (note that I've truncated this file for brevity, so some of the included headers are not used, or rather not seen to be used, in the code below):

    #include <exception>
    #include <fstream>
    #include <string>
    
    #include "recursive.h"
    #include "Encoder.hh"
    #include "Decoder.hh"
    #include "Generic.hh"
    #include "GenericDatum.hh"
    #include "ValidSchema.hh"
    #include "DataFile.hh"
    #include "Types.hh"
    #include "Compiler.hh"
    #include "Stream.hh"
    
    avro::ValidSchema loadSchema(const char* filename)
    {
        std::ifstream ifs(filename);
        avro::ValidSchema result;
        avro::compileJsonSchema(ifs, result);
        return result;
    }
    
    
    int main( int argc, char** argv )
    {
       /**********************************************************************************
                                  AVRO WRITER EXAMPLE
       **********************************************************************************/
       try
       {
          //Filename definitions skipped for brevity
    
          avro::ValidSchema          recursiveSchema = loadSchema( schemaFilename );
          avro::DataFileWriter<recursive_record::recursive_data>   dfw( filename, recursiveSchema );
          recursive_record::recursive_data       record;
          record.fstring = std::string("First string");
    
          dfw.write( record );
          dfw.close();
    
       }
       catch( const std::exception& e )
       {
          // Log a message
          return -1;
    
       }
    }

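"So what's the problem?" you might ask. Well, the file is actually written successfully; at least the code doesn't crash and an Avro data file is produced. So far so good. But if you try to read the file, you get the following error: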

    AVRO read error: vector::_M_range_check: __n (which is 12) >= this->size() (which is 2)

Wha-??? Yeah. Been at this all afternoon.
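
One plausible reading of that error, assuming the reader follows the Avro specification's binary encoding: a union value starts with a zigzag varint holding the branch index, while a plain string starts with a zigzag varint holding its length. The hand-written encode() above emits only the string, so a reader expecting the ["null", "string"] union would take the length of "First string" as a branch index into a 2-branch union. The sketch below reproduces that arithmetic with standard C++ only; writeLength, readLong, encodePlainString and branchIndexSeenByUnionReader are illustrative names, not part of the Avro C++ API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append a non-negative length as an Avro-style long (zigzag, then varint).
void writeLength(Bytes& out, std::int64_t value) {
    assert(value >= 0 && "lengths are never negative");
    std::uint64_t z = static_cast<std::uint64_t>(value) << 1;  // zigzag of n >= 0 is 2n
    while (z >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((z & 0x7F) | 0x80));
        z >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(z));
}

// Read one zigzag varint starting at `pos`, advancing `pos` past it.
std::int64_t readLong(const Bytes& in, std::size_t& pos) {
    std::uint64_t z = 0;
    int shift = 0;
    while (true) {
        assert(pos < in.size() && shift < 64 && "truncated or oversized varint");
        const std::uint8_t byte = in[pos++];
        z |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) break;
        shift += 7;
    }
    // Undo the zigzag mapping.
    return static_cast<std::int64_t>(z >> 1) ^ -static_cast<std::int64_t>(z & 1);
}

// What the question's encode() effectively writes: length, then the bytes,
// with no union branch index in front.
Bytes encodePlainString(const std::string& s) {
    Bytes out;
    writeLength(out, static_cast<std::int64_t>(s.size()));
    out.insert(out.end(), s.begin(), s.end());
    return out;
}

// What a reader expecting a union would take as the branch index.
std::int64_t branchIndexSeenByUnionReader(const std::string& s) {
    const Bytes bytes = encodePlainString(s);
    std::size_t pos = 0;
    return readLong(bytes, pos);
}
```

Under these assumptions, branchIndexSeenByUnionReader("First string") yields 12, which would match the `__n (which is 12)` in the message, with `size() (which is 2)` being the number of branches in the union.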
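After a lot of experimentation, I found that the problem lies in the nullable aspect of the given field. I also noticed that if I remove the nullable option from the schema, so that the schema becomes this: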

    {
        "type": "record",
        "name": "Root",
        "fields": [
            {
                "name": "fstring",
                "type": "string"
            }
        ]
    }

and change nothing else, then the new Avro data file is not only written successfully but also read back successfully, like so:

    [rh6lgn01][1881] MY_EXAMPLES/generate_recursive$ recursive
    schema=recursive.json
    file=./DATA/recursive.avro
    recursiveSchema valid = true
    ReadFile(): Type = record
    ProcessRecord(): New record found.  Field count = 1
    ProcessRecord(): {
    ProcessRecord():   Field 0: type = string
    ProcessDatum():   Field 0: value = First string (length= 12)
    ProcessRecord(): }
    rowCount = 1
    
    AVRO Writing and Reading Complete
    [rh6lgn01][1882] MY_EXAMPLES/generate_recursive$

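I had some hope when I read the Java question. One answer pointed out that in Java there is a @Nullable annotation that you can attach to a field in the record. Here is the link to that question:
Storing null values in avro files
There is of course no such mechanism in C++. I did find the following lines in the Types.hh header that seemed relevant: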

    /// define a type to identify Null in template functions
    struct AVRO_DECL Null { };

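However, I couldn't make heads or tails of how to use it in a similar way. So either I'm missing something or it serves a different purpose. I fear the former but suspect the latter.
Here is the link to the Stack Overflow C question and its answer, for completeness:
Write nullable item to avro record in Avro C
I'm using version 1.9.2 of the Avro C++ library, running on a GNU/Linux machine (which shouldn't matter, but for completeness).
I'll keep plugging away and searching for an answer, but if anyone has done this before and can shed some light, I'd appreciate it.
Thanks!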

Best answer

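OK, after playing with this into the early hours of the morning and all day today, I finally figured it out. So I thought I'd post the answer to my own question in case anyone else is searching for the same information. I'll try to be concise, but if you aren't into details, I suggest you stop reading now.
In the end I found two ways to solve the problem. Both produce the same result: the ability to write data to a field/column in an Avro data file where that field has been declared optional in the schema, that is, its type is a union with null. I'll begin with the approach that relates most closely to the one I described in my original question. Then I'll provide an alternative solution and close with an observation or two. Note that in both approaches the JSON schema stays exactly as you read it in my initial post. The only items that change are the header and the body of the code. See my initial post for the schema.
So, the first approach. As with my first attempt, this approach involves custom encoders and decoders (as in the header file in my original post), a JSON schema (mine lives in a separate file), and then the main body of code. To keep it short: the problem was in the header, as I suspected. To fix it, you should avoid writing that header yourself for anything beyond the most basic scenarios, such as those shown in the examples that ship with the Avro C++ distribution. Instead, let the Avro tool called "avrogencpp" do the heavy lifting of creating the custom encoder/decoder.
The reason I suggest that choice is simply that the code avrogencpp generates in that header is convoluted, to say the least. Once you read and understand it, the content makes sense, but for records with more than a handful of fields the length becomes rather unwieldy for a human. So let the machines do what they do best. Anyway, here is the command I used: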

    avrogencpp -i recursive.json -o recursive.h -n recursive_namespace

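The result is a header file that contains a struct definition named "Root", which (not coincidentally) matches the name of the top-level or outermost record defined in my unchanged schema. I can therefore write the following in the body of the code: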

      avro::ValidSchema          recursiveSchema = loadSchema( schemaFilename );
      avro::DataFileWriter<recursive_namespace::Root>   dfw( filename, recursiveSchema );
      recursive_namespace::Root  record;
      // snipped for brevity
      record.fstring.set_string( "String set via direct record value assignment" );
      dfw.write( record );
      dfw.close();

This succeeds, as the output shows:

    [rh6lgn01][2174] MY_EXAMPLES/generate_recursive$ recursive
    schema=recursive.json
    file=./DATA/recursive.avro
    recursiveSchema valid = 1
    ReadFile(): Enter
    ReadFile(): Type = record
    ProcessRecord(): New record found.  Field count = 1
    ProcessRecord(): {
    ProcessRecord():   Field 0: type = string
    ProcessDatum():   Field 0: value = String set via direct record value assignment (length = 45)
    ProcessRecord(): }
    rowCount = 1
    -----------------------
    
    AVRO Writing and Reading Complete
    [rh6lgn01][2175] MY_EXAMPLES/generate_recursive$
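
To illustrate what a union-aware codec has to put on the wire, here is a minimal sketch of the byte layout for a ["null", "string"] field, assuming the Avro specification's binary encoding (branch index first, then the value of that branch). It uses only the standard library; encodeNullableString and writeIndexOrLength are hypothetical helpers, and std::optional merely stands in for the union wrapper type that avrogencpp generates.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append a non-negative number as an Avro-style long (zigzag, then varint).
void writeIndexOrLength(Bytes& out, std::int64_t value) {
    assert(value >= 0 && "branch indexes and lengths are never negative");
    std::uint64_t z = static_cast<std::uint64_t>(value) << 1;  // zigzag of n >= 0 is 2n
    while (z >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((z & 0x7F) | 0x80));
        z >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(z));
}

// Hypothetical helper (not Avro API): encode one value of ["null", "string"].
Bytes encodeNullableString(const std::optional<std::string>& value) {
    Bytes out;
    if (!value.has_value()) {
        writeIndexOrLength(out, 0);  // branch 0 = "null"; null has no payload
        return out;
    }
    writeIndexOrLength(out, 1);  // branch 1 = "string"
    writeIndexOrLength(out, static_cast<std::int64_t>(value->size()));
    out.insert(out.end(), value->begin(), value->end());
    return out;
}
```

With this layout, "First string" encodes as 0x02 (branch 1), 0x18 (length 12), followed by the 12 characters, and a null encodes as the single byte 0x00. The hand-written header in the question would produce the same bytes minus the leading branch index.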

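That's it for the first approach. Now for the second. This one uses the GenericDatum class, similar to the question and answer in this Stack Overflow thread:
How to read data from AVRO file using C++ interface?
To some extent, one could argue that the benefit of this approach is that you don't need custom encoders/decoders, and hence no avrogencpp tool either. While that's true, I have to admit I wonder about the performance of the generic "interface" in Avro; it just seems like it might be a bit slower than the direct route. It can, however, read any file, so it is more flexible. But I digress. Back to the solution. The only code you need is in the main body. Admittedly, what I'm about to present is trimmed down to the bare essentials to demonstrate the approach, so in real life you would need to flesh it out to cover other types and so on. But it conveys the idea, and that's all you need. Here it is: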

      avro::DataFileWriter<avro::GenericDatum>   writer( filename, schema );
      avro::GenericDatum    datum( schema );

      if( avro::AVRO_RECORD == datum.type() )
      {
         avro::GenericRecord  &record = datum.value<avro::GenericRecord>();
         for( uint32_t i = 0; i < record.fieldCount(); i++ )
         {
            avro::GenericDatum &fieldDatum = record.fieldAt( i );

            // So if the datum is a union, then it's likely that
            // the datum is an optional field.  We'd need to flesh
            // this out considerably to ensure that this was indeed
            // the case, but for brevity reasons, this will work:
            if( true == fieldDatum.isUnion() )
            {
                // Assuming the well-known Avro convention of the null
                // being first in the optional "syntax", then merely
                // jump to the second field which has the "actual type"
                // that the field/column is supposed to represent.
                // Again, this is in dire need of fleshing-out...
                fieldDatum.selectBranch( 1 );
                switch( fieldDatum.type() )
                {
                    case avro::AVRO_STRING:
                    {
                       std::string &newValue = fieldDatum.value<std::string>();
                       newValue = "New string set via switching branches in the union";
                       break;
                    }
                }
            }
         }

         // Write the record once, after all of its fields are populated.
         writer.write( datum );
      }
      writer.close();

This variant produces the following:

    [rh6lgn01][2177] MY_EXAMPLES/generate_recursive$ recursive
    schema=recursive.json
    file=./DATA/recursive.avro
    Top level was a record
    The record had 1 fields.
    Field datum was a union = true
    Field datum 0 was a union.  Current branch = 0
    Field datum 0 is now a string.  Current branch = 1
    ReadFile(): Enter
    ReadFile(): Type = record
    ProcessRecord(): New record found.  Field count = 1
    ProcessRecord(): {
    ProcessRecord():   Field 0: type = string
    ProcessDatum():   Field 0: value = New string set via switching branches in the union (length = 50)
    ProcessRecord(): }
    rowCount = 1
    -----------------------
    
    AVRO Writing and Reading Complete
    [rh6lgn01][2178] MY_EXAMPLES/generate_recursive$
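
The loop above hard-codes branch 1. A slightly more defensive variant would first look up which branch is the non-null one and pass that to selectBranch(). The sketch below shows only that lookup, on a stand-in for the union schema (a vector of branch type names in schema order); UnionBranches, nonNullBranch and branchToSelect are hypothetical names, and obtaining the branch types from a real avro::GenericDatum is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Stand-in for a union schema: the branch type names, in schema order.
using UnionBranches = std::vector<std::string>;

// Return the index of the non-null branch if the union has exactly two
// branches and exactly one of them is "null"; otherwise return nothing.
std::optional<std::size_t> nonNullBranch(const UnionBranches& branches) {
    if (branches.size() != 2) return std::nullopt;
    const bool firstIsNull = branches[0] == "null";
    const bool secondIsNull = branches[1] == "null";
    if (firstIsNull == secondIsNull) return std::nullopt;  // none or both are null
    return firstIsNull ? std::size_t{1} : std::size_t{0};
}

// Convenience wrapper for callers that treat any other shape as a bug.
std::size_t branchToSelect(const UnionBranches& branches) {
    const std::optional<std::size_t> index = nonNullBranch(branches);
    assert(index.has_value() && "field is not a simple nullable union");
    return *index;
}
```

For the schema in the question this gives branch 1, the same value the loop uses, but it would also cope with a schema that lists "string" before "null".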

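So this is a satisfactory solution as well.
For my part, I'll probably go with the latter approach because it seems "cleaner". More precisely, I already use the generic "interface" to read Avro files, so using it again for writing seems more consistent. I also prefer the second approach because it doesn't require avrogencpp. YMMV.
I hope this answer helps someone in the future. Good luck!
Jerry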

Related question on Stack Overflow: https://stackoverflow.com/questions/62785894/
