java - 为什么 Apache Spark 在客户端执行过滤器

作为 apache spark 的新手，在 Spark 上获取 Cassandra 数据时遇到了一些问题。

List<String> dates = Arrays.asList("2015-01-21","2015-01-22");
CassandraJavaRDD<A> aRDD = CassandraJavaUtil.javaFunctions(sc).
                    cassandraTable("testing", "cf_text",CassandraJavaUtil.mapRowTo(A.class, colMap)).
                    where("Id=? and date IN ?","Open",dates);

此查询未过滤 cassandra 服务器上的数据。当这个 java 语句正在执行时，它会占用内存并最终抛出 spark java.lang.OutOfMemoryError 异常。查询应该过滤掉 cassandra 服务器上的数据，而不是客户端上提到的 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md .

虽然我在 cassandra cqlsh 上使用过滤器执行查询，但它的性能很好，但在没有过滤器(where 子句)的情况下执行查询会超时，这是预期的。所以很明显，spark 没有在客户端应用过滤器。

SparkConf conf = new SparkConf();
            conf.setAppName("Test");
            conf.setMaster("local[8]");
            conf.set("spark.cassandra.connection.host", "192.168.1.15")

为什么在客户端应用过滤器，以及如何改进在服务器端应用过滤器。

我们如何在windows平台的cassandra集群之上配置spark集群？？

最佳答案

没有将 Cassandra 与 Spark 一起使用，通过阅读您提供的部分(感谢)，我看到:

Note: Although the ALLOW FILTERING clause is implicitly added to the generated CQL query, not all predicates are currently allowed by the Cassandra engine. This limitation is going to be addressed in the future Cassandra releases. Currently, ALLOW FILTERING works well with columns indexed by secondary indexes or clustering columns.

我很确定(但尚未测试)不支持“IN”谓词:请参阅 https://github.com/datastax/spark-cassandra-connector/blob/24fbe6a10e083ddc3f770d1f52c07dfefeb7f59a/spark-cassandra-connector-java/src/main/java/com/datastax/spark/connector/japi/rdd/CassandraJavaRDD.java#L80

因此，您可以尝试将 where-clause 限制为 Id(假设有一个二级索引)并对日期范围使用 spark 过滤。

关于java - 为什么 Apache Spark 在客户端执行过滤器，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/31141998/

java - 为什么 Apache Spark 在客户端执行过滤器

上一篇：java - 当应用程序在后台时停止服务

下一篇：java - 无法使用具有构建器模式的 jackson 3+ 实例化 POJO