Choosing a binding model #19
@Kangz thank you for the analysis! I agree that basing the binding model on the Vulkan API is the most reasonable choice. I'm not convinced that we should actually be diverging from it.
According to SO, D3D12 can use only one heap per class (samplers vs. non-samplers) at any given time, and switching heaps within a command list is not recommended. And yes, Vulkan doesn't have a similar restriction on its descriptor pools, since the binding API operates with descriptor sets (
That sounds a bit complicated.
I confirmed with our D3D dev that this statement is incorrect. In D3D12, you can have one descriptor heap bound to a command list at a time. You can switch as many times as you like, though you may get a GPU wait-for-idle on some implementations.
D3D12 devs tell me they decided it wasn't worth forcing drivers to pack parameters in the order they are specified just to support inheritance between signatures. The frequency with which applications change signatures is relatively low, and even then the cost of simply rebinding all root parameters on a signature isn't huge. @kvark's approach of having two total heaps would work if the max non-sampler heap size of 1 million (for Tier 3 hardware; the driver can support larger but only 1 million is guaranteed) is enough to hold all descriptor sets ever made. A more flexible approach that was recommended was to expose pools/heaps that must be bound. For D3D12, you would allocate a descriptor heap + sampler heap per NXT pool/heap. The application is then limited to only binding sets from the currently bound pool/heap. With this, an app could choose to have a pool per command recording thread, double/triple/etc. buffer across frames, or use multiple pools per command list if desired. This could be handy for componentized software.
I think it's not so much about the number of sets "ever made" as it is about the sets allocated at any particular point. I'm curious whether this limitation is close to what real applications do, or plan to do in the future. Edit: 1M is the max number of descriptors (for tier-3 HW), not descriptor sets.
That appears to be a mix of
The approach maps rather seamlessly to both Vk and D3D12 and accommodates a large part of use cases, except these:
I like the idea, however I'm concerned that the latter restriction would hit Vulkan users a bit too hard. Another approach (proposed by @msiglreith) is to have both heaps and pools. The user would create a heap, then create a pool from the heap space. Heaps would have to be bound during command list construction, like they are in D3D12, whereas pools would function as they do in Vk. This is a little bizarre and more complex, but it seems to capture the limitations of both APIs a bit more flexibly: we'd have the heaps backed by nothing in the Vk backend, and pools backed by shims in the D3D12 backend, with more of the original use-cases seamlessly translating to the new API. I don't have a specific preference at this point, just providing the approach for our consideration. I'd love to just have a single heap for samplers and a single heap for non-samplers, hidden as an implementation detail, but this essentially comes down to the chances of hitting the 1M descriptor limitation.
I believe this is a limitation of what the hardware is capable of. Some hardware can do more, but 1 million is the minimum guaranteed.
Do you have evidence that Vulkan users actively take advantage of this flexibility? What do those developers do in their D3D12 backend, if they have one?
Can you please go into more detail what you mean by "pools backed by shims"? If pools in Vulkan are analogous to heaps in D3D12, it doesn't make sense to me to expose both of them. I must be missing something.
It would be nice if developers didn't have to worry about this implementation detail. We could certainly start with this approach and revisit if people hit the limits in preview builds. However, I worry about the complexity of the hiding when I read things like "we'll solve this problem by building a compacting GC". The design we come up with also needs to work well for applications that are made up of individual components that are not aware of the resources/descriptors allocated by other components. Think animation library or particle system.
@RafaelCintron
I don't have hard evidence, and I'd love to hear what seasoned Vulkan/D3D12 devs have to say about this limitation. What I have in mind is the aforementioned "individual components" (animation library, texture manager, particle system, etc) having their own descriptor pools (and creating resources on separate threads, possibly), but then mixing those resources in draw calls (e.g. an animated mesh with a texture). That scenario doesn't seem to be allowed by the approach you described, if I understand it correctly.
They are not fully analogous, otherwise we'd have a portability solution on our hands :) With the nested approach I described (where pools are produced from and owned by heaps), a D3D12 backend would treat a descriptor pool as just a slice view into a descriptor heap (what I called a "shim").
Thanks @RafaelCintron for the clarifications on D3D12's descriptor heaps and lack of "inheritance"; I added the descriptor heap info to the first comment. I agree with most of the points above, and making modular applications work well will be very important. To make things equally efficient on both APIs, the two solutions we have for now are the pool-in-heap approach and hiding the descriptor management details entirely.
My preference goes to having the details hidden because I believe it could be made efficient and it enables modular apps. That's the approach we are taking in NXT, and we'll be able to measure there. On D3D12, for a command buffer, we plan to use descriptor heaps as some sort of LRU cache and rotate them in the command list (if needed) when they are full. For Vulkan, problems would come from descriptor pool fragmentation, so I'm thinking of only using linear allocation and compacting pools once in a while. Up to now we hadn't really looked at Metal because it has an old-style binding model, but the docs for Metal 2 just came online and they have a concept similar to descriptor sets / descriptor tables. The gist of it is that they are backed by regular buffers (called indirect argument buffers), and the restriction is that you have to encode in the command list which resources or heaps need to be resident. I'll update the first comment with info about it. It looks like the pool-in-heap solution doesn't provide the API with useful info for a Metal 2 backend, which is another advantage of hiding the details.
Are you completely discarding the approach of only exposing descriptor pools but not heaps? It's also efficient on both APIs; it just has the descriptor limit, the importance of which is yet to be clarified. This approach is especially appealing to me if we can find room to extend it beyond the descriptor count limit in the future, so that the MVP (minimum viable product) we ship is kept simple.
I think we can still make this work with the pool-in-heap model (and pools-only, for that matter). A bit of logic just needs to be implemented for the Metal backend without API changes:
So essentially we can have "Descriptor pool := MTLBuffer", and "Descriptor set := slice of indirect MTLBuffer". The problem with the approach I described is that Apple doesn't expose the layout of the indirect buffer, so in the worst case we'd have to rewrite a buffer for different shaders even if no data has been changed (*). I'd love to hear from Apple folks on how they expect users to share indirect buffers between shaders.
I did discard it without explanation, sorry. The problem with this is that if you choose a descriptor limit and use a single backing heap in D3D12 for all descriptor pools, then you still risk fragmentation, and are going to need some compaction to fit stuff inside that single heap. The only API it would help is Vulkan, and this is the API where compaction looks the cheapest: just take several half-filled pools and copy to a new one that's just the right size. Also enforcing this limit on applications isn't good for modular apps etc.
This is minor: what you pass to the commandEncoder::useHeaps method are resource heaps, not descriptor heaps. The first comment wasn't clear about this so I updated it. I hadn't realized you could bind Metal 2 IABs at an offset; the same GC happening on Vulkan would be cheap to do there too, and if the 500,000 descriptor limit is per-process it becomes even more important to hide this limit, as multiple pages could be competing. (It could be per MTLDevice though.)
To give Apple's opinion, as @Kangz noted, Metal 2 has moved slightly in the direction of D3D12/Vulkan with Argument Buffers. So I'm OK with going in the direction @Kangz recommends. I also prefer the suggestion to hide the implementation details and have the API create the descriptor sets.
Here is an update from our experiments. First of all, we successfully implemented the idea I presented above, which results in a clean mapping of Vulkan descriptor sets/pools to Metal. We came up with the following limitations to make this work:
I don't consider these a problem given that we are going to generate the Metal shaders ourselves. Mapping to D3D12 efficiently is still a big question. We currently have a dual scheme of both heaps and pools (where pools are allocated from the heaps), but are looking into reducing that API surface. @RafaelCintron are the limitations of
@kvark, which
@RafaelCintron oh, sorry about being unclear.
@kvark, D3D devs tell me this is due to hardware limitations. On at least one IHV, you can only bind one descriptor heap at a time, with a maximum size of 1 million. If an API wants to pretend that multiple can be bound at a time, that means this IHV would potentially have to relocate descriptors from multiple application heaps into the one hardware heap at draw time in the worst case (e.g. they didn't get lucky and the app's heaps didn't all fit into one hardware heap). An app can certainly make a large descriptor heap in DX12 (a million entries is the largest size guaranteed to work, assuming memory is available), and then dedicate subregions of that heap to individual "pools" that the application separately suballocates out of. The problem is, say you have pools A, B and C, each 500k descriptors. Only 2 will fit in a max 1-million-entry heap. If you use A and B together, that works fine. Then if you want to use B and C together, you have to switch to another heap and likely copy all of B from the AB heap to a new BC heap. Assuming you use much smaller "pools" relative to descriptor heap size, and most pools fit in one heap, the number of times you'd have to switch heaps and shuffle descriptors around may not be huge. But D3D12 lets you see the cost of doing so.
This has been resolved for a while. |
Actually, leaving this open since it's an investigation we might want to refer back to. |
The ship has sailed on this. It's already part of the spec. It can still be referred to, even if it's closed.
Where did we/the-ship arrive?
To the Vulkan binding model (without push constants for now)
This is the detailed investigation I promised to put on GitHub last meeting. It explains at a high level the binding models of all 3 explicit APIs, and lays out why we believe Vulkan's binding model should be used as a starting point.
See also these slides which contain more info about bindless vs. fixed-function and the Vulkan binding model, and our initial investigation for NXT (with some outdated code samples).
The binding models of explicit APIs
Metal’s binding model
Metal’s binding model is basically the same as OpenGL’s and D3D11’s: there are per-stage global arrays (tables) of textures, buffers and samplers. Binding of shader variables to resources is done by tagging arguments to the main function with an index into the tables.
There are functions on the encoders (aka command buffers) to update ranges of the tables, with one update function for each stage, for each table type (and sometimes more than one, mostly for convenience). The update to the tables is immediate and the driver handles any synchronization that might be needed transparently. For example:
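A minimal sketch of what such calls look like, written with the metal-cpp C++ wrappers rather than Objective-C so that all examples in this write-up stay in one language; `encoder`, `uniforms`, `albedo` and `linearSampler` are hypothetical objects created elsewhere.

```cpp
#include <Metal/Metal.hpp>

// Sketch: update the per-stage argument tables on a render command encoder.
void BindDrawState(MTL::RenderCommandEncoder* encoder,
                   MTL::Buffer* uniforms, MTL::Texture* albedo,
                   MTL::SamplerState* linearSampler) {
    // Vertex stage: buffer table slot 0.
    encoder->setVertexBuffer(uniforms, /*offset*/ 0, /*index*/ 0);
    // Fragment stage: texture table slot 0 and sampler table slot 0.
    encoder->setFragmentTexture(albedo, 0);
    encoder->setFragmentSamplerState(linearSampler, 0);
    // The updates are visible to every subsequent draw on this encoder.
    encoder->drawPrimitives(MTL::PrimitiveTypeTriangle, NS::UInteger(0), NS::UInteger(3));
}
```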
D3D12’s binding model
D3D12’s binding model is more geared towards “bindless” GPUs in which resources aren’t set in registers in fixed-function hardware and referenced by their index, but instead described by a descriptor living in GPU memory and referenced by their descriptor’s GPU virtual address.
Things look as if the shaders only had access to one global UBO by default, called the “root signature”, which is an array of elements that can each be one of three things: root constants (inline 32-bit values), root descriptors (the GPU virtual address of a buffer, stored directly in the root signature), or descriptor tables (ranges of descriptors living in a descriptor heap).
A shamelessly re-used slide.
Individual elements of the root signature can be updated directly in command lists, with updates appearing immediately to subsequent draws / dispatches. For example:
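What follows is a hedged sketch of such updates (hypothetical root parameter indices and handles, assuming a command list that is already recording):

```cpp
#include <d3d12.h>

// Sketch: update the three kinds of root parameters directly on a command list.
void SetRootArguments(ID3D12GraphicsCommandList* cmdList,
                      D3D12_GPU_VIRTUAL_ADDRESS perDrawConstants,
                      D3D12_GPU_DESCRIPTOR_HANDLE materialTable) {
    // Root parameter 0: root constants, written inline into the command list.
    const float tint[4] = {1.0f, 0.5f, 0.25f, 1.0f};
    cmdList->SetGraphicsRoot32BitConstants(0, 4, tint, 0);

    // Root parameter 1: a root CBV, i.e. just a GPU virtual address.
    cmdList->SetGraphicsRootConstantBufferView(1, perDrawConstants);

    // Root parameter 2: a descriptor table pointing into the bound descriptor heap.
    cmdList->SetGraphicsRootDescriptorTable(2, materialTable);

    cmdList->DrawInstanced(3, 1, 0, 0);
}
```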
Using a root signature in shaders is done as shown below. The root signature layout is described and gives each “binding” a register index; these register indices are then bound to variable names, and finally the “main” function is tagged with the root signature layout. The extra level of indirection through register indices seems to be there to help with porting D3D11 shaders.
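The original code sample isn't preserved here, so the following is an illustrative sketch: the HLSL is kept in a C++ raw string (to stay in one language across examples) and compiled with D3DCompile; the root signature string, resource names and register assignments are made up for the example.

```cpp
#include <d3d12.h>
#include <d3dcompiler.h>  // link against d3dcompiler.lib

// HLSL source: registers give each binding an index, and the entry point is
// tagged with a root signature describing how those registers are fed.
static const char kPixelShaderHlsl[] = R"hlsl(
#define ExampleRootSig \
    "RootConstants(num32BitConstants=4, b0), " \
    "CBV(b1), " \
    "DescriptorTable(SRV(t0, numDescriptors=2), visibility=SHADER_VISIBILITY_PIXEL), " \
    "StaticSampler(s0)"

cbuffer Tint    : register(b0) { float4 tint; };        // fed by root constants
cbuffer PerDraw : register(b1) { float4x4 transform; }; // fed by a root CBV
Texture2D    albedo     : register(t0);                 // fed by a descriptor table
SamplerState linearSamp : register(s0);                 // static sampler

[RootSignature(ExampleRootSig)]
float4 main(float2 uv : TEXCOORD0) : SV_Target {
    return albedo.Sample(linearSamp, uv) * tint;
}
)hlsl";

HRESULT CompilePixelShader(ID3DBlob** outBlob) {
    ID3DBlob* errors = nullptr;
    HRESULT hr = D3DCompile(kPixelShaderHlsl, sizeof(kPixelShaderHlsl) - 1, "ps.hlsl",
                            nullptr, nullptr, "main", "ps_5_1", 0, 0, outBlob, &errors);
    if (errors != nullptr) errors->Release();
    return hr;
}
```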
On the API side, a corresponding ID3D12RootSignature object must be created that represents the layout of the root signature. This is needed because the actual layout on GPU might not match what was declared, to allow the driver to optimize things or do some emulation on hardware that isn’t bindless enough.
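As an illustration (not from the original post), here is a sketch of building a small ID3D12RootSignature with one set of root constants, one root CBV and one SRV descriptor table; `device` is assumed to exist and error handling is omitted.

```cpp
#include <d3d12.h>

ID3D12RootSignature* CreateExampleRootSignature(ID3D12Device* device) {
    // One range of two SRVs starting at register t0, referenced through a table.
    D3D12_DESCRIPTOR_RANGE srvRange = {};
    srvRange.RangeType = D3D12_DESCRIPTOR_RANGE_TYPE_SRV;
    srvRange.NumDescriptors = 2;
    srvRange.BaseShaderRegister = 0;

    D3D12_ROOT_PARAMETER params[3] = {};
    params[0].ParameterType = D3D12_ROOT_PARAMETER_TYPE_32BIT_CONSTANTS;
    params[0].Constants = {/*ShaderRegister*/ 0, /*RegisterSpace*/ 0, /*Num32BitValues*/ 4};
    params[1].ParameterType = D3D12_ROOT_PARAMETER_TYPE_CBV;
    params[1].Descriptor = {/*ShaderRegister*/ 1, /*RegisterSpace*/ 0};
    params[2].ParameterType = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE;
    params[2].DescriptorTable = {/*NumDescriptorRanges*/ 1, &srvRange};

    D3D12_ROOT_SIGNATURE_DESC desc = {};
    desc.NumParameters = 3;
    desc.pParameters = params;

    // The declared layout is serialized; the driver builds its own real layout from it.
    ID3DBlob* blob = nullptr;
    D3D12SerializeRootSignature(&desc, D3D_ROOT_SIGNATURE_VERSION_1, &blob, nullptr);
    ID3D12RootSignature* rootSignature = nullptr;
    device->CreateRootSignature(0, blob->GetBufferPointer(), blob->GetBufferSize(),
                                IID_PPV_ARGS(&rootSignature));
    blob->Release();
    return rootSignature;
}
```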
When compiling an ID3D12PipelineState, the root signature must be provided so that the compiled shader code can be specialized for the actual layout of the root signature on the GPU. Then ID3D12GraphicsCommandList::SetGraphicsRootSignature must(?) be called before any update to the root table or any call to ID3D12GraphicsCommandList::SetPipelineState, so that the command list too knows what the actual layout on the GPU is.
Having the same root signature used to compile multiple pipelines ensures the driver doesn’t have to shuffle data around when pipelines are changed, since they are guaranteed to access descriptors through the same layout.
More info on root signatures can be found in the MSDN docs.
One thing we didn’t mention yet is how to populate a descriptor heap with data. The MSDN docs aren’t super clear, but my understanding is that there are two types of descriptor heaps: shader-visible heaps, which can be bound on a command list and read by the GPU, and non-shader-visible (CPU-only) heaps that serve as staging space for descriptors.
ID3D12Device::CopyDescriptors can be used to copy descriptors between heaps, or descriptors can be created directly in the heaps. Then, to be able to use a heap with, for example, SetGraphicsRootDescriptorTable, the heap must be bound to the current command list with ID3D12GraphicsCommandList::SetDescriptorHeaps. Limitations are that only one sampler heap and one SRV/UAV/CBV descriptor heap can be bound at a time. Also, heaps cannot be arbitrarily large on some hardware, with sampler heaps capped at 2048 descriptors and SRV/UAV/CBV heaps at a million descriptors. Switching heaps can cause a wait-for-idle on some hardware.
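A sketch of this workflow under those constraints: create a shader-visible CBV/SRV/UAV heap, write a descriptor into it, then bind the heap and point a root descriptor table at it (hypothetical names, no error handling, heap lifetime not shown).

```cpp
#include <d3d12.h>

void BindTextureTable(ID3D12Device* device, ID3D12GraphicsCommandList* cmdList,
                      ID3D12Resource* texture) {
    // A shader-visible heap: the GPU can read descriptors out of it directly.
    D3D12_DESCRIPTOR_HEAP_DESC heapDesc = {};
    heapDesc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
    heapDesc.NumDescriptors = 256;
    heapDesc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;
    ID3D12DescriptorHeap* heap = nullptr;
    device->CreateDescriptorHeap(&heapDesc, IID_PPV_ARGS(&heap));

    // Write a descriptor directly into the heap; CopyDescriptors could instead
    // be used to copy from a CPU-only staging heap.
    device->CreateShaderResourceView(texture, nullptr,
                                     heap->GetCPUDescriptorHandleForHeapStart());

    // Only one CBV/SRV/UAV heap (plus one sampler heap) may be bound at a time.
    ID3D12DescriptorHeap* heaps[] = {heap};
    cmdList->SetDescriptorHeaps(1, heaps);
    cmdList->SetGraphicsRootDescriptorTable(2, heap->GetGPUDescriptorHandleForHeapStart());
}
```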
One restriction worth pointing out is that descriptor heaps cannot be created to contain both samplers and other descriptors at the same time.
Vulkan’s binding model
Vulkan’s binding model is essentially a subset of D3D12’s and more opaque to the application. This is because Vulkan needs to run on mobile GPUs that are more fixed-function than what D3D12 targets.
Another shameless slide reuse:
This is similar to D3D12’s root signature except that: descriptor sets can only contain descriptors (there is no equivalent of root descriptors), root constants are replaced by a small, separately-set push constant range, only a small number of descriptor sets can be bound at once (at least 4 are guaranteed), and the sets themselves are opaque objects allocated from pools rather than offsets into an application-visible heap.
Binding shader variables to specific locations in a descriptor set is done as shown below (the example also shows push constants) (from this slide):
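The slide itself isn’t reproduced here; the following is a sketch of what such GLSL declarations look like, kept in a C++ string constant to stay in one language across the examples. Names are hypothetical, and the source would be compiled to SPIR-V offline (e.g. with glslang).

```cpp
// Vulkan-flavoured GLSL: resources are addressed by (set, binding) pairs, and
// small frequently-changing data goes into a push constant block.
static const char kFragmentGlsl[] = R"glsl(
#version 450

layout(set = 0, binding = 0) uniform PerFrame { mat4 viewProj; } perFrame;
layout(set = 1, binding = 0) uniform texture2D albedoTex;
layout(set = 1, binding = 1) uniform sampler   linearSampler;

layout(push_constant) uniform PushConstants { vec4 tint; } pc;

layout(location = 0) in  vec2 uv;
layout(location = 0) out vec4 color;

void main() {
    color = texture(sampler2D(albedoTex, linearSampler), uv) * pc.tint;
}
)glsl";
```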
On the API side, like in D3D12, a layout object needs to be created and used during pipeline creation (vkCreateGraphicsPipelines). This object is a VkPipelineLayout (vkCreatePipelineLayout) and is made of multiple VkDescriptorSetLayouts (vkCreateDescriptorSetLayout). Then descriptor sets are used in a command buffer (without any need for synchronization) via vkCmdBindDescriptorSets, and push constants are set with vkCmdPushConstants.
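A sketch of those API calls with hypothetical handles (error handling omitted); the layout roughly matches set 0 of the GLSL sketch above.

```cpp
#include <vulkan/vulkan.h>

VkPipelineLayout CreateLayouts(VkDevice device, VkDescriptorSetLayout* outSetLayout) {
    // One uniform buffer at binding 0 of the set.
    VkDescriptorSetLayoutBinding binding = {};
    binding.binding = 0;
    binding.descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
    binding.descriptorCount = 1;
    binding.stageFlags = VK_SHADER_STAGE_ALL_GRAPHICS;

    VkDescriptorSetLayoutCreateInfo setInfo = {VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO};
    setInfo.bindingCount = 1;
    setInfo.pBindings = &binding;
    vkCreateDescriptorSetLayout(device, &setInfo, nullptr, outSetLayout);

    // The pipeline layout combines the set layouts and the push constant ranges.
    VkPushConstantRange pushRange = {VK_SHADER_STAGE_FRAGMENT_BIT, 0, 16};
    VkPipelineLayoutCreateInfo layoutInfo = {VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO};
    layoutInfo.setLayoutCount = 1;
    layoutInfo.pSetLayouts = outSetLayout;
    layoutInfo.pushConstantRangeCount = 1;
    layoutInfo.pPushConstantRanges = &pushRange;

    VkPipelineLayout pipelineLayout = VK_NULL_HANDLE;
    vkCreatePipelineLayout(device, &layoutInfo, nullptr, &pipelineLayout);
    return pipelineLayout;
}

void RecordBindings(VkCommandBuffer cmd, VkPipelineLayout layout, VkDescriptorSet perFrameSet) {
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
                            /*firstSet*/ 0, 1, &perFrameSet, 0, nullptr);
    const float tint[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_FRAGMENT_BIT, 0, sizeof(tint), tint);
}
```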
Descriptor sets are allocated from a VkDescriptorPool, which is similar to a D3D12 descriptor heap, except that the VkDescriptorSets returned are opaque and can only be written to or copied via specialized API functions (not a general GPU copy like in D3D12(?)).
More info about this in the specification’s section on descriptor sets.
Metal 2
Metal 2 adds the indirect argument buffer (IAB) concept: opaque-layout buffers containing constants and descriptors that can be bound all at once. An IAB is essentially a descriptor set.
In addition to allocating, populating and using IABs, applications must specify on the encoders which resources or resource heaps need to be resident for the draw. Also, it looks like Metal allocates all descriptors in a single descriptor heap and has the same type of limitation as D3D12, but for the whole app (500,000 descriptors max, 2048 samplers max).
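A rough sketch of that flow, using the metal-cpp wrappers; all names are hypothetical, the `[[id(n)]]` slots are assumed to match the shader, and this is only meant to illustrate IAB encoding plus explicit residency, not a definitive implementation.

```cpp
#include <Metal/Metal.hpp>

void EncodeArguments(MTL::Function* fragmentFn, MTL::RenderCommandEncoder* encoder,
                     MTL::Buffer* argBuffer, MTL::Texture* albedo) {
    // Build an encoder for the argument buffer declared at buffer index 0.
    MTL::ArgumentEncoder* argEncoder = fragmentFn->newArgumentEncoder(0);
    argEncoder->setArgumentBuffer(argBuffer, /*offset*/ 0);
    argEncoder->setTexture(albedo, /*index*/ 0);  // slot [[id(0)]] in the shader

    // The draw only sees the IAB contents, so residency must be declared explicitly.
    encoder->useResource(albedo, MTL::ResourceUsageRead);
    encoder->setFragmentBuffer(argBuffer, 0, 0);
    argEncoder->release();
}
```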
Why use Vulkan’s binding model
During our NXT investigation we found that the Vulkan binding model would be the best one to use, as converting from Vulkan to D3D12 or Metal doesn’t add much overhead, while converting from D3D12 or Metal to Vulkan would be expensive (and we are not smart enough to design a whole new binding model). Also, for reasons, NXT uses the term “bind groups” instead of “descriptor sets”.
Vulkan to Metal
When the pipeline layout is compiled we can do “register allocation” of the descriptors in their respective tables. Then the shaders can be changed to refer to the position in the table instead of the (set, location) pair as in Vulkan. Finally, when a descriptor set is bound in the command buffers, we just need to call the necessary MTLGraphicsCommandEncoder::set<Textures/Buffers/Samplers> at most once each (the arguments can have been precomputed at descriptor set creation).
Vulkan to D3D12
Essentially we would only use special-case root signatures that look like the Vulkan binding model presented above. Then all operations on descriptor sets and push constants map naturally to their D3D12 counterparts.
One thing is that it looks like D3D12 can only use a limited number of descriptor heaps in a command list while (AFAIK) Vulkan doesn’t have this limitation. In NXT we plan to solve this mismatch by not exposing descriptor pools and doing our own descriptor set bookkeeping (even in the Vulkan backend): with some sort of compacting GC for descriptor sets we should be able to keep the number of heaps small.
Another issue is that D3D12 has a separate sampler heap, which can be solved by having potentially two root descriptor tables per descriptor set that contains samplers. The hope is that in most applications samplers aren’t present in many places in a pipeline layout.
Metal to Vulkan
In Metal, each pipeline essentially has its own pipeline layout / root signature layout, so in Vulkan new descriptor sets would have to be created at each pipeline change. Things could work a bit better on D3D12, as the D3D11-on-D3D12 layer shows (Metal’s and D3D11’s binding models are the same). However, this would prevent a large class of optimizations applications can do by reusing parts of the root table or descriptor sets.
D3D12 to Vulkan
There are many small mismatches because D3D12 is more flexible and explicit than Vulkan:
Inheritance
Another advantage of using Vulkan is that it has very good inheritance semantics for descriptor sets: if you set a pipeline with a different pipeline layout, but with descriptor set layouts matching up to position N, then you don’t need to bind these descriptor sets again. This could translate to not having to reset the whole tables in Metal, and only changing a couple root table descriptors in D3D12.
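A small sketch of what this looks like in Vulkan (hypothetical handles, and a single shared pipeline layout for simplicity, which is trivially compatible with itself): switching pipelines only requires re-binding the sets that actually changed.

```cpp
#include <vulkan/vulkan.h>

void DrawTwoMaterials(VkCommandBuffer cmd, VkPipelineLayout layout,
                      VkPipeline pipelineA, VkPipeline pipelineB,
                      VkDescriptorSet perFrame, VkDescriptorSet materialA,
                      VkDescriptorSet materialB) {
    VkDescriptorSet sets[] = {perFrame, materialA};
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineA);
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
                            /*firstSet*/ 0, 2, sets, 0, nullptr);
    vkCmdDraw(cmd, 3, 1, 0, 0);

    // Compatible layout: set 0 (per-frame data) stays bound; only set 1 changes.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineB);
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
                            /*firstSet*/ 1, 1, &materialB, 0, nullptr);
    vkCmdDraw(cmd, 3, 1, 0, 0);
}
```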