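I'm trying to create a procedural game using Metal, and I'm using an octree-based block approach to implement level of detail.
The approach I'm using has the CPU create the octree nodes for the terrain, and then a compute shader creates their meshes on the GPU. Each mesh is stored in a vertex buffer and an index buffer inside the block object that is used for rendering.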
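All of this seems to work well, but I hit performance problems very early when rendering the blocks. Currently I gather an array of blocks to draw and submit it to my renderer, which creates an MTLParallelRenderCommandEncoder, then creates an MTLRenderCommandEncoder for each block, and then submits the work to the GPU.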
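By the looks of it, around 50% of the CPU time is spent creating the MTLRenderCommandEncoder for each block. At the moment I'm just creating a simple 8-vertex cube mesh for each block, in a 4x4x4 array of blocks, and even at this early stage I'm dropping to around 50fps. (In fact, it seems there can only be up to 63 MTLRenderCommandEncoder instances in each MTLParallelRenderCommandEncoder, so it's not quite a full 4x4x4.)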
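I've read that the point of the MTLParallelRenderCommandEncoder is to create each MTLRenderCommandEncoder in a separate thread, but I haven't had much luck getting that to work. Multithreading also wouldn't get around the cap of 63 blocks being rendered at most.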
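I feel that somehow consolidating the vertex and index buffers of each block into one or two larger buffers for submission would help, but I'm not sure how to do that without lots of memcpy() calls, or whether it would improve efficiency at all.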
Here is my code, which takes an array of nodes and draws them:
func drawNodes(nodes: [OctreeNode], inView view: AHMetalView){
    // For control of several rotating buffers
    dispatch_semaphore_wait(displaySemaphore, DISPATCH_TIME_FOREVER)

    makeDepthTexture()
    updateUniformsForView(view, duration: view.frameDuration)

    let commandBuffer = commandQueue.commandBuffer()
    let optDrawable = layer.nextDrawable()
    guard let drawable = optDrawable else{
        return
    }

    let passDescriptor = MTLRenderPassDescriptor()
    passDescriptor.colorAttachments[0].texture = drawable.texture
    passDescriptor.colorAttachments[0].clearColor = MTLClearColorMake(0.2, 0.2, 0.2, 1)
    passDescriptor.colorAttachments[0].storeAction = .Store
    passDescriptor.colorAttachments[0].loadAction = .Clear
    passDescriptor.depthAttachment.texture = depthTexture
    passDescriptor.depthAttachment.clearDepth = 1
    passDescriptor.depthAttachment.loadAction = .Clear
    passDescriptor.depthAttachment.storeAction = .Store

    let parallelRenderPass = commandBuffer.parallelRenderCommandEncoderWithDescriptor(passDescriptor)

    // Currently 63 nodes as a maximum
    for node in nodes{
        // This line is taking up around 50% of the CPU time
        let renderPass = parallelRenderPass.renderCommandEncoder()
        renderPass.setRenderPipelineState(renderPipelineState)
        renderPass.setDepthStencilState(depthStencilState)
        renderPass.setFrontFacingWinding(.CounterClockwise)
        renderPass.setCullMode(.Back)

        let uniformBufferOffset = sizeof(AHUniforms) * uniformBufferIndex
        renderPass.setVertexBuffer(node.vertexBuffer, offset: 0, atIndex: 0)
        renderPass.setVertexBuffer(uniformBuffer, offset: uniformBufferOffset, atIndex: 1)
        renderPass.setTriangleFillMode(.Lines)
        renderPass.drawIndexedPrimitives(.Triangle, indexCount: AHMaxIndicesPerChunk, indexType: AHIndexType, indexBuffer: node.indexBuffer, indexBufferOffset: 0)
        renderPass.endEncoding()
    }

    parallelRenderPass.endEncoding()
    commandBuffer.presentDrawable(drawable)
    commandBuffer.addCompletedHandler { (commandBuffer) -> Void in
        self.uniformBufferIndex = (self.uniformBufferIndex + 1) % AHInFlightBufferCount
        dispatch_semaphore_signal(self.displaySemaphore)
    }
    commandBuffer.commit()
}
Best answer
You noted:
"I've read that the point of the MTLParallelRenderCommandEncoder is to create each MTLRenderCommandEncoder in a separate thread..."
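And you're right. What you're doing is creating, encoding with, and ending each command encoder sequentially; nothing here happens in parallel, so MTLParallelRenderCommandEncoder is doing nothing for you. If you eliminated the parallel encoder and just created an encoder with renderCommandEncoderWithDescriptor(_:) on each pass through the for loop, you'd get roughly the same performance... which is to say, you'd still have the same performance problem, because of the overhead of creating all those encoders.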
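To make the contrast concrete: if encoding were actually spread across threads, one plausible arrangement is a handful of sub-encoders, each handling a contiguous batch of nodes, rather than one encoder per node. The sketch below shows only the batching step in plain Swift. The function name makeBatches and the parameter maxBatches are illustrative assumptions, not part of Metal or of the code in the question.

```swift
// Hypothetical helper (not a Metal API): split `items` into at most
// `maxBatches` contiguous batches whose sizes differ by at most one.
func makeBatches<T>(_ items: [T], maxBatches: Int) -> [ArraySlice<T>] {
    guard !items.isEmpty, maxBatches > 0 else { return [] }

    // Never create more batches than there are items.
    let batchCount = min(maxBatches, items.count)
    let baseSize = items.count / batchCount
    let remainder = items.count % batchCount

    var batches: [ArraySlice<T>] = []
    var start = 0
    for index in 0..<batchCount {
        // The first `remainder` batches each take one extra item.
        let size = baseSize + (index < remainder ? 1 : 0)
        batches.append(items[start ..< start + size])
        start += size
    }
    return batches
}

// Intended use (assumption): create one sub-encoder per batch from the
// parallel encoder, in order, on a single thread; then let each worker
// thread encode its own batch into its own sub-encoder and end it.
```

With 64 nodes and four worker threads this yields four batches of 16, so only four sub-encoders are created per frame instead of 64.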
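So, if you're going to encode sequentially, just reuse the same encoder. You should also reuse as much other shared state as possible. Here's a quick pass at a possible refactoring (untested):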
let passDescriptor = MTLRenderPassDescriptor()

// call this once before your render loop
func setup() {
    makeDepthTexture()

    passDescriptor.colorAttachments[0].clearColor = MTLClearColorMake(0.2, 0.2, 0.2, 1)
    passDescriptor.colorAttachments[0].storeAction = .Store
    passDescriptor.colorAttachments[0].loadAction = .Clear
    passDescriptor.depthAttachment.texture = depthTexture
    passDescriptor.depthAttachment.clearDepth = 1
    passDescriptor.depthAttachment.loadAction = .Clear
    passDescriptor.depthAttachment.storeAction = .Store

    // set up render pipeline state and depthStencil state
}
func drawNodes(nodes: [OctreeNode], inView view: AHMetalView) {
    updateUniformsForView(view, duration: view.frameDuration)

    // Set up completed handler ahead of time
    let commandBuffer = commandQueue.commandBuffer()
    commandBuffer.addCompletedHandler { _ in // unused parameter
        self.uniformBufferIndex = (self.uniformBufferIndex + 1) % AHInFlightBufferCount
        dispatch_semaphore_signal(self.displaySemaphore)
    }

    // Semaphore should be tied to drawable acquisition
    dispatch_semaphore_wait(displaySemaphore, DISPATCH_TIME_FOREVER)
    guard let drawable = layer.nextDrawable() else {
        // The command buffer is never committed on this path, so its completed
        // handler will not run; signal here to avoid leaking a semaphore slot.
        dispatch_semaphore_signal(displaySemaphore)
        return
    }

    // Set up the one part of the pass descriptor that changes per-frame
    passDescriptor.colorAttachments[0].texture = drawable.texture

    // Get one render command encoder and reuse it for every node
    let renderPass = commandBuffer.renderCommandEncoderWithDescriptor(passDescriptor)
    renderPass.setTriangleFillMode(.Lines)
    renderPass.setRenderPipelineState(renderPipelineState)
    renderPass.setDepthStencilState(depthStencilState)

    for node in nodes {
        // Update offsets and draw
        let uniformBufferOffset = sizeof(AHUniforms) * uniformBufferIndex
        renderPass.setVertexBuffer(node.vertexBuffer, offset: 0, atIndex: 0)
        renderPass.setVertexBuffer(uniformBuffer, offset: uniformBufferOffset, atIndex: 1)
        renderPass.drawIndexedPrimitives(.Triangle, indexCount: AHMaxIndicesPerChunk, indexType: AHIndexType, indexBuffer: node.indexBuffer, indexBufferOffset: 0)
    }

    renderPass.endEncoding()
    commandBuffer.presentDrawable(drawable)
    commandBuffer.commit()
}
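Then, profile with Instruments to see what further performance issues, if any, you might be having. There's a great WWDC 2015 session that shows several of the common "gotchas", how to diagnose them in profiling, and how to fix them.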
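On the question's idea of merging the per-block buffers, which the refactoring above does not cover: one possible approach is to give every block a fixed-size slot inside one large vertex buffer and one large index buffer, and to pass the slot's byte offsets through the offset: and indexBufferOffset: parameters that the draw code already uses. Because the meshes are generated on the GPU, the compute shader could in principle write straight into a block's slot, so no memcpy() would be involved. The sketch below shows only the slot bookkeeping in plain Swift. BlockSlots and all of its members are hypothetical names, and the per-block sizes are assumed to already satisfy whatever offset alignment the target platform requires.

```swift
// Hypothetical bookkeeping type (not a Metal API): hands out fixed-size
// slots inside one shared vertex buffer and one shared index buffer.
struct BlockSlots {
    let vertexBytesPerBlock: Int
    let indexBytesPerBlock: Int
    let capacity: Int
    private var freeSlots: [Int]

    init(vertexBytesPerBlock: Int, indexBytesPerBlock: Int, capacity: Int) {
        self.vertexBytesPerBlock = vertexBytesPerBlock
        self.indexBytesPerBlock = indexBytesPerBlock
        self.capacity = capacity
        // Stored in reverse so that popLast() hands out slot 0 first.
        self.freeSlots = Array((0..<capacity).reversed())
    }

    // Total sizes the two shared buffers would need (assumption: both
    // are allocated once, up front).
    var totalVertexBytes: Int { return capacity * vertexBytesPerBlock }
    var totalIndexBytes: Int { return capacity * indexBytesPerBlock }

    // Returns a free slot index, or nil when every slot is in use.
    mutating func acquire() -> Int? {
        return freeSlots.popLast()
    }

    // Returns a slot to the pool once its block is no longer drawn.
    mutating func release(_ slot: Int) {
        freeSlots.append(slot)
    }

    // Byte offset of a slot's vertices inside the shared vertex buffer.
    func vertexOffset(forSlot slot: Int) -> Int {
        return slot * vertexBytesPerBlock
    }

    // Byte offset of a slot's indices inside the shared index buffer.
    func indexOffset(forSlot slot: Int) -> Int {
        return slot * indexBytesPerBlock
    }
}
```

Whether this actually improves frame time is something to confirm by profiling; its main effect would be fewer buffer objects to manage and bind, not fewer draw calls.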
Regarding swift - Metal block rendering, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34047447/