I am just reading records from a Solr node; my code only reads records within a given date range. I checked that it works for 50K records, but when I tried 100K I hit the GC overhead limit.
My Scala code looks like this:
def querySolr(core: String, selectQuery: String, server: SolrClient,
              pageNum: Int, pageStart: Int, pageSize: Int): (Long, SolrDocumentList) = {
  val query = new SolrQuery()   // SolrQuery's constructor takes a query string, not a core name
  query.setQuery(selectQuery)
  query.setStart(pageStart)     // Solr's start parameter is a 0-based offset
  query.setRows(pageSize)
  val response: QueryResponse = server.query(query)
  val results: SolrDocumentList = response.getResults
  val total = results.getNumFound
  (total, results)
}
def pageCalc(page: Int, pageSize: Int, totalItems: Long): (Int, Long, Long) = {
  val from = (page - 1) * pageSize                         // 0-based offset for Solr's start
  val to = totalItems min (from + pageSize)                // exclusive end of this page
  val totalPages = (totalItems + pageSize - 1) / pageSize  // ceiling division
  (from, to, totalPages)
}
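Note that Solr's start parameter is 0-based: the first record of page n sits at offset (n - 1) * pageSize, and the page count is a ceiling division. A self-contained sketch of that arithmetic (the helper names here are illustrative, not from the original code):

```scala
// Offset of the first record on a 1-based page number, for Solr's 0-based start
def startOffset(page: Int, pageSize: Int): Int = (page - 1) * pageSize

// Number of pages needed to cover totalItems records (ceiling division)
def totalPages(totalItems: Long, pageSize: Int): Long =
  (totalItems + pageSize - 1) / pageSize

// Example: 2500 records in pages of 1000
assert(startOffset(1, 1000) == 0)     // page 1 starts at offset 0, not 1
assert(startOffset(3, 1000) == 2000)
assert(totalPages(2500, 1000) == 3)
assert(totalPages(2000, 1000) == 2)   // exact multiple: no extra page
```

A 1-based "from" passed to query.setStart would silently skip the first record and shift every subsequent page by one.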
def getRecordsFromSolr(core: String, solrhost: String, userName: String, password: String,
                       query: String): List[SolrDocument] = {
  val startTime = System.nanoTime()
  val solrPort = 8983
  val url = s"https://$solrhost:$solrPort/solr/$core"
  // Trust self-signed certificates and skip hostname verification
  val builder: SSLContextBuilder = new SSLContextBuilder()
  builder.loadTrustMaterial(null, new TrustSelfSignedStrategy())
  val sslsf: SSLConnectionSocketFactory = new SSLConnectionSocketFactory(
    builder.build(), SSLConnectionSocketFactory.ALLOW_ALL_HOSTNAME_VERIFIER)
  // Basic-auth credentials for the Solr endpoint
  val credsProvider: CredentialsProvider = new BasicCredentialsProvider()
  credsProvider.setCredentials(
    new AuthScope(solrhost, solrPort),
    new UsernamePasswordCredentials(userName, password))
  val httpclient: CloseableHttpClient = HttpClients.custom()
    .setSSLSocketFactory(sslsf)
    .setDefaultCredentialsProvider(credsProvider)
    .build()
  val server: SolrClient = new HttpSolrClient(url, httpclient)
  logger.info("solr connection completed")
  val pageSize = 1000
  var pageNum = 1
  var nextPage: (Int, Long, Long) = (0, 1000, 0)
  var offset: Long = 0
  var totalResult = querySolr(core, query, server, pageNum, 0, pageSize)
  var total = totalResult._1
  var results: List[SolrDocument] = totalResult._2.toList
  while (total > offset) {
    offset += pageSize
    pageNum += 1
    nextPage = pageCalc(pageNum, pageSize, total)
    totalResult = querySolr(core, query, server, pageNum, nextPage._1, pageSize)
    total = totalResult._1
    results = results ++ totalResult._2.toList   // every page is retained in memory
  }
  results   // the while loop evaluates to Unit, so the accumulated list must be returned explicitly
}
java.lang.OutOfMemoryError: GC overhead limit exceeded
How can I avoid this memory leak? I have tried 8 GB per core, and the table contains millions of records.
With 60K records I get the following error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 0:0 was 18311053 bytes, which exceeds max allowed: spark.akka.frameSize (10485760 bytes) - reserved (204800 bytes). Consider increasing spark.akka.frameSize or using broadcast variables for large values.
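The Spark message itself names the two options: raise spark.akka.frameSize or broadcast large values instead of shipping them inside tasks. A hedged example of the first (the value 128 is an arbitrary illustration, not from the original post; spark.akka.frameSize is in MB and applies to Spark 1.x, which still used Akka):

```
spark-submit --conf spark.akka.frameSize=128 ...
```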
Best Answer
An OutOfMemoryError typically occurs when reading a Solr response that is too large.
So the solution is to minimize the Solr response:
- limit the row count (the rows parameter)
- limit the list of returned fields (the fl parameter). Fields containing large indexed documents (e.g. PDFs) in particular can grow very large.
If that does not help, I suggest analyzing your Solr response: work out the actual Solr query and execute it in a browser.
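For example, a trimmed-down query of this shape can be pasted into a browser to inspect the response size (host, core, and field names here are hypothetical; substitute your own):

```
https://solrhost:8983/solr/mycore/select
  ?q=timestamp:[2016-01-01T00:00:00Z TO 2016-01-31T23:59:59Z]
  &fl=id,timestamp
  &rows=1000
  &sort=id asc
  &cursorMark=*
```

rows caps the page size, fl restricts each document to the listed fields, and cursorMark=* together with a sort on the uniqueKey field enables Solr's cursor-based deep paging (Solr 4.7+), which avoids the growing cost of large start offsets; each response returns a nextCursorMark to pass into the following request.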
Regarding "java - How to fix GC overhead limit exceeded in Scala code", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35377485/