neo4j - Cypher 加载 CSV 急切且 Action 持续时间长

我正在加载一个 85K 行的文件 - 19M，
服务器有 2 个内核，14GB RAM，运行 centos 7.1 和 oracle JDK 8
使用以下服务器配置可能需要 5-10 分钟 :

dbms.pagecache.memory=8g                  
cypher_parser_version=2.0  
wrapper.java.initmemory=4096  
wrapper.java.maxmemory=4096

挂载在/etc/fstab 中的磁盘:

UUID=fc21456b-afab-4ff0-9ead-fdb31c14151a /mnt/neodata            
ext4    defaults,noatime,barrier=0      1  2

将此添加到/etc/security/limits.conf:

*                soft      memlock         unlimited
*                hard      memlock         unlimited
*                soft      nofile          40000
*                hard      nofile          40000

将此添加到/etc/pam.d/su

session         required        pam_limits.so

将此添加到/etc/sysctl.conf:

vm.dirty_background_ratio = 50
vm.dirty_ratio = 80

通过运行禁用日志:

 sudo e2fsck /dev/sdc1
 sudo tune2fs /dev/sdc1
 sudo tune2fs -o journal_data_writeback /dev/sdc1
 sudo tune2fs -O ^has_journal /dev/sdc1
 sudo e2fsck -f /dev/sdc1
 sudo dumpe2fs /dev/sdc1

除此之外，
运行分析器时，我得到了很多“渴望”，我真的不明白为什么:

 PROFILE LOAD CSV WITH HEADERS FROM 'file:///home/csv10.csv' AS line
 FIELDTERMINATOR '|'
 WITH line limit 0
 MERGE (session :Session { wz_session:line.wz_session })
 MERGE (page :Page { page_key:line.domain+line.page }) 
   ON CREATE SET page.name=line.page, page.domain=line.domain, 
 page.protocol=line.protocol,page.file=line.file


Compiler CYPHER 2.3

Planner RULE

Runtime INTERPRETED

+---------------+------+---------+---------------------+--------------------------------------------------------+
| Operator      | Rows | DB Hits | Identifiers         | Other                                                  |
+---------------+------+---------+---------------------+--------------------------------------------------------+
| +EmptyResult  |    0 |       0 |                     |                                                        |
| |             +------+---------+---------------------+--------------------------------------------------------+
| +UpdateGraph  |    9 |       9 | line, page, session | MergeNode; Add(line.domain,line.page); :Page(page_key) |
| |             +------+---------+---------------------+--------------------------------------------------------+
| +Eager        |    9 |       0 | line, session       |                                                        |
| |             +------+---------+---------------------+--------------------------------------------------------+
| +UpdateGraph  |    9 |       9 | line, session       | MergeNode; line.wz_session; :Session(wz_session)       |
| |             +------+---------+---------------------+--------------------------------------------------------+
| +ColumnFilter |    9 |       0 | line                | keep columns line                                      |
| |             +------+---------+---------------------+--------------------------------------------------------+
| +Filter       |    9 |       0 | anon[181], line     | anon[181]                                              |
| |             +------+---------+---------------------+--------------------------------------------------------+
| +Extract      |    9 |       0 | anon[181], line     | anon[181]                                              |
| |             +------+---------+---------------------+--------------------------------------------------------+
| +LoadCSV      |    9 |       0 | line                |                                                        |
+---------------+------+---------+---------------------+--------------------------------------------------------+

所有标签和属性都有索引/约束
谢谢您的帮助
利奥尔

最佳答案

贺利欧，

我们试图在这里解释 Eager Loading:

Marks 的原始博客文章在这里:http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/

Rik 试图用更简单的术语来解释它:

http://blog.bruggen.com/2015/07/loading-belgian-corporate-registry-into_20.html

试图理解“Eager Operation”

我之前读过这个，但直到安德烈斯再次向我解释后才真正理解它:在所有正常操作中，Cypher 延迟加载数据。例如，请参阅手册中的此页面 - 它基本上只是在执行操作时尽可能少地加载到内存中。这种懒惰通常是一件非常好的事情。但它也会给你带来很多麻烦——正如迈克尔向我解释的那样:

"Cypher tries to honor the contract that the different operations within a statement are not affecting each other. Otherwise you might up with non-deterministic behavior or endless loops. Imagine a statement like this:
MATCH (n:Foo) WHERE n.value > 100 CREATE (m:Foo {m.value = n.value + 100});

If the two statements would not be isolated, then each node the CREATE generates would cause the MATCH to match again etc. an endless loop. That's why in such cases, Cypher eagerly runs all MATCH statements to exhaustion so that all the intermediate results are accumulated and kept (in memory).

Usually with most operations that's not an issue as we mostly match only a few hundred thousand elements max.

With data imports using LOAD CSV, however, this operation will pull in ALL the rows of the CSV (which might be millions), execute all operations eagerly (which might be millions of creates/merges/matches) and also keeps the intermediate results in memory to feed the next operations in line.

This also disables PERIODIC COMMIT effectively because when we get to the end of the statement execution all create operations will already have happened and the gigantic tx-state has accumulated."

这就是我的负载 csv 查询的情况。 MATCH/MERGE/CREATE 导致将一个急切的管道添加到执行计划中，并且它有效地禁用了我的操作批处理“使用定期提交”。显然，即使使用看似简单的 LOAD CSV 语句，也有不少用户遇到了这个问题。很多时候你可以避免它，但有时你不能。”

关于neo4j - Cypher 加载 CSV 急切且 Action 持续时间长，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/31788513/

neo4j - Cypher 加载 CSV 急切且 Action 持续时间长

上一篇：csv - 在 Spark 中高效聚合多个 CSV

下一篇：angularjs - 使用 angularjs 的地理编码服务