Presto集群中某节点丢失上报心跳状态(服务未宕机),查看该节点日志发现以上gc清理日志.
[510515.508s][warning][gc,alloc] http-worker-159944: Retried waiting for GCLocker too often allocating 11648 words
[510515.508s][warning][gc,alloc] page-buffer-client-callback-8474: Retried waiting for GCLocker too often allocating 19073 words
[510515.508s][warning][gc,alloc] http-worker-153295: Retried waiting for GCLocker too often allocating 14450 words
[510515.508s][warning][gc,alloc] http-worker-160657: Retried waiting for GCLocker too often allocating 15585 words
[510515.508s][warning][gc,alloc] http-worker-159719: Retried waiting for GCLocker too often allocating 20110 words
[510515.508s][warning][gc,alloc] http-worker-161471: Retried waiting for GCLocker too often allocating 13867 words
[510515.508s][warning][gc,alloc] http-worker-159190: Retried waiting for GCLocker too often allocating 7651 words
[510561.503s][warning][gc,alloc] task-processor-82635: Retried waiting for GCLocker too often allocating 18700 words
[510561.503s][warning][gc,alloc] task-processor-81526: Retried waiting for GCLocker too often allocating 18910 words
[510561.504s][warning][gc,alloc] task-processor-81315: Retried waiting for GCLocker too often allocating 15505 words
[510561.504s][warning][gc,alloc] task-processor-81319: Retried waiting for GCLocker too often allocating 15505 words
通过日志和最近的查询历史来看有大对象分配逻辑导致清理gc变慢,gc1的大文件清理和分配逻辑如下详细文件GitHub:
for (uint try_count = 1, gclocker_retry_count = 0; /* we'll return */; try_count += 1) {
bool should_try_gc;
uint gc_count_before;
{
MutexLockerEx x(Heap_lock);
result = humongous_obj_allocate(word_size);
if (result != NULL) {
size_t size_in_regions = humongous_obj_size_in_regions(word_size);
g1_policy()->add_bytes_allocated_in_old_since_last_gc(size_in_regions * HeapRegion::GrainBytes);
return result;
}
should_try_gc = !GCLocker::needs_gc();
gc_count_before = total_collections();
}
if (should_try_gc) {
bool succeeded;
result = do_collection_pause(word_size, gc_count_before, &succeeded,
GCCause::_g1_humongous_allocation);
if (result != NULL) {
assert(succeeded, "only way to get back a non-NULL result");
log_trace(gc, alloc)("%s: Successfully scheduled collection returning " PTR_FORMAT,
Thread::current()->name(), p2i(result));
return result;
}
if (succeeded) {
log_trace(gc, alloc)("%s: Successfully scheduled collection failing to allocate "
SIZE_FORMAT " words", Thread::current()->name(), word_size);
return NULL;
}
log_trace(gc, alloc)("%s: Unsuccessfully scheduled collection allocating " SIZE_FORMAT "",
Thread::current()->name(), word_size);
} else {
// Failed to schedule a collection.
if (gclocker_retry_count > GCLockerRetryAllocationCount) {
log_warning(gc, alloc)("%s: Retried waiting for GCLocker too often allocating "
SIZE_FORMAT " words", Thread::current()->name(), word_size);
return NULL;
}
GCLocker::stall_until_clear();
gclocker_retry_count += 1;
}
}
代码if (gclocker_retry_count > GCLockerRetryAllocationCount)
标记了gc清理异常信息
需要注意的是GCLockerRetryAllocationCount的默认值是2,表示当分配中的垃圾回收次数超过这个阈值之后则直接失败。
为了保证gc的正常清理,需要调整此参数,根据cpu的核数调整至100(或者cpu的核数等值)
调整方式: 修改jvm增加以下配置
-XX:+UnlockDiagnosticVMOptions
-XX:GCLockerRetryAllocationCount=100
-XX:+UnlockDiagnosticVMOptions
一定要添加否则GCLockerRetryAllocationCount
将不会生效