marklogic - mlcp will not load a large number of files from a directory

See the EDIT below.

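We use MarkLogic Content Pump (mlcp) to load data into an ML8 database. We have a development environment where everything works, and a production environment where mlcp never gets past counting the number of files to process.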

We have 2.1 million JSON documents to load.

On the dev server (ML8 + CentOS6) we see:

15/07/13 13:19:35 INFO contentpump.ContentPump: Hadoop library version: 2.0.0-alpha
15/07/13 13:19:35 INFO contentpump.LocalJobRunner: Content type is set to MIXED.  The format of the  inserted documents will be determined by the MIME  type specification configured on MarkLogic Server.
15/07/13 13:19:35 WARN util.KerberosName: Kerberos krb5 configuration not found, setting default realm to empty
15/07/13 13:23:06 INFO input.FileInputFormat: Total input paths to process : 2147329
15/07/13 13:24:08 INFO contentpump.LocalJobRunner:  completed 0%
15/07/13 13:34:43 INFO contentpump.LocalJobRunner:  completed 1%
15/07/13 13:43:42 INFO contentpump.LocalJobRunner:  completed 2%
15/07/13 13:51:15 INFO contentpump.LocalJobRunner:  completed 3%

The job completes normally and the data loads fine.

Now, with the same data on a different machine, the prod server (ML8 + CentOS7), we get:

15/07/14 17:02:21 INFO contentpump.ContentPump: Hadoop library version: 2.6.0
15/07/14 17:02:21 INFO contentpump.LocalJobRunner: Content type is set to MIXED.  The format of the  inserted documents will be determined by the MIME  type specification configured on MarkLogic Server.

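Apart from the different OS, we also have a newer version of mlcp on the prod server (the log reports Hadoop library version 2.6.0 instead of 2.0.0-alpha). If we use the same command to import a directory with only 2000 files, it works on prod...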

The job gets stuck while counting the number of files to process...

What could be the problem?

START EDIT: We put mlcp in DEBUG mode and tested with a small sample zip file (sample2.zip).

Result:

[ashraf@77-72-150-125 ~]$ mlcp.sh import -host localhost -port 8140 -username ashraf -password duurz44m -input_file_path /home/ashraf/sample2.zip -input_compressed true  -mode local -output_uri_replace  "\".*,''\"" -output_uri_prefix incoming/linkedin/ -output_collections incoming,incoming/linkedin -output_permissions slush-dikw-node-role,read
15/07/16 16:36:31 DEBUG contentpump.ContentPump: Command: IMPORT
15/07/16 16:36:31 DEBUG contentpump.ContentPump: Arguments: -host localhost -port 8140 -username ashraf -password duurz44m -input_file_path /home/ashraf/sample2.zip -input_compressed true -mode local -output_uri_replace ".*,''" -output_uri_prefix incoming/linkedin/ -output_collections incoming,incoming/linkedin -output_permissions slush-dikw-node-role,read 
15/07/16 16:36:31 INFO contentpump.ContentPump: Hadoop library version: 2.6.0
15/07/16 16:36:31 DEBUG contentpump.ContentPump: Running in: localmode
15/07/16 16:36:31 INFO contentpump.LocalJobRunner: Content type is set to MIXED.  The format of the  inserted documents will be determined by the MIME  type specification configured on MarkLogic Server.
15/07/16 16:36:32 DEBUG contentpump.LocalJobRunner: Thread pool size: 4
15/07/16 16:36:32 INFO input.FileInputFormat: Total input paths to process : 1
15/07/16 16:36:33 DEBUG contentpump.LocalJobRunner: Thread Count for Split#0 : 4
15/07/16 16:36:33 DEBUG contentpump.CompressedDocumentReader: Starting file:/home/ashraf/sample2.zip
15/07/16 16:36:33 DEBUG contentpump.MultithreadedMapper: Running with 4 threads
15/07/16 16:36:33 DEBUG mapreduce.ContentWriter: Connect to localhost
15/07/16 16:36:33 DEBUG mapreduce.ContentWriter: Connect to localhost
15/07/16 16:36:33 DEBUG mapreduce.ContentWriter: Connect to localhost
15/07/16 16:36:33 DEBUG mapreduce.ContentWriter: Connect to localhost
15/07/16 16:36:34 INFO contentpump.LocalJobRunner:  completed 0%
15/07/16 16:36:39 INFO contentpump.LocalJobRunner:  completed 100%
2015-07-16 16:39:11.483 WARNING [19] (AbstractRequestController.runRequest): Error parsing HTTP headers: Premature EOF, partial header line read: ''
15/07/16 16:39:12 DEBUG contentpump.CompressedDocumentReader: Closing file:/home/ashraf/sample2.zip
15/07/16 16:39:12 INFO contentpump.LocalJobRunner: com.marklogic.contentpump.ContentPumpStats: 
15/07/16 16:39:12 INFO contentpump.LocalJobRunner: ATTEMPTED_INPUT_RECORD_COUNT: 1993
15/07/16 16:39:12 INFO contentpump.LocalJobRunner: SKIPPED_INPUT_RECORD_COUNT: 0
15/07/16 16:39:12 INFO contentpump.LocalJobRunner: Total execution time: 160 sec

Only the first JSON file ends up in the database. Are the rest being dropped or lost?
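One thing that may be worth ruling out, although nothing in this thread confirms it: the -output_uri_replace argument in the command above uses the pattern .*, which would match an entire URI. If mlcp applies the replacement before adding -output_uri_prefix (an assumption here), every document could end up with the same URI and overwrite the previous one, which would look like only one file being loaded. The sketch below only simulates that idea with Python's re module. The function simulate_uri and the entry names are hypothetical, mlcp itself uses Java regular expressions, and the extra quoting in the actual command may make the pattern behave differently.

```python
import re


def simulate_uri(entry_name, pattern, replacement, prefix):
    """Approximate a URI rewrite: regex replace first, then add the prefix.

    The ordering (replace before prefix) is an assumption about mlcp,
    not something confirmed in this thread.
    """
    return prefix + re.sub(pattern, replacement, entry_name)


# Hypothetical entry names standing in for the files inside the zip.
entries = ["a/profile-1.json", "a/profile-2.json", "b/profile-3.json"]

# A pattern that matches the whole name maps every entry to the same URI.
collapsed = {simulate_uri(e, r".*", "", "incoming/linkedin/") for e in entries}
print(collapsed)

# A pattern that strips only the leading directory keeps the URIs distinct.
distinct = {simulate_uri(e, r"^.*/", "", "incoming/linkedin/") for e in entries}
print(len(distinct))
```

If the real import behaves like the first case, checking which URIs actually exist in the database after the run would show it quickly.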

Could newlines in the JSON files be the problem?

(AbstractRequestController.runRequest): Error parsing HTTP headers: Premature EOF, partial header line read: ''

Any hints would be great.

Hugo

Best answer

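I really can't tell what is going on. I think support would be interested in this case. Could you send them, or me, an email with more details (and perhaps the files as well)?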

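As a workaround: it should not be hard to use the same MLCP version on the prod server as you use on dev. Just put it next to the other version (or wherever you like) and make sure your settings reference it (hint: in Roxy you have the mlcp-home setting).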

You could also consider zipping the JSON documents and using the -input_compressed option.
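A minimal sketch of that zipping step, assuming the JSON files sit under a single source directory: it packs them into numbered archives so that mlcp has far fewer input paths to enumerate. The function name zip_in_batches, the archive naming, and the batch size of 10000 are illustrative choices, not anything mlcp requires.

```python
import zipfile
from pathlib import Path


def zip_in_batches(src_dir, out_dir, batch_size=10000):
    """Pack the *.json files under src_dir into numbered zip archives.

    Returns the list of archive paths that were written.
    """
    src = Path(src_dir)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Sort so that the batches are reproducible between runs.
    files = sorted(p for p in src.rglob("*.json") if p.is_file())
    archives = []
    for start in range(0, len(files), batch_size):
        archive = out / f"batch-{start // batch_size:05d}.zip"
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in files[start:start + batch_size]:
                # Store paths relative to src_dir so entry names stay unique.
                zf.write(path, arcname=path.relative_to(src).as_posix())
        archives.append(archive)
    return archives
```

Each resulting archive could then be passed to mlcp through -input_file_path together with -input_compressed true, as in the command shown in the question.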

HTH!

Source: Stack Overflow, https://stackoverflow.com/questions/31433976/
