* Add short duration texture cache
This texture cache takes textures that lose their last pool reference and keeps them alive until the next frame, or until an incompatible overlap removes it. This is done since under certain circumstances, a texture's reference can be wiped from a pool despite it still being in use - though typically the reference will return when rendering the next frame.
While this may slightly increase texture memory usage when quickly going through a bunch of temporary textures, it's still bounded due to the overlap removal rule.
This greatly increases performance in Hyrule Warriors: Age of Calamity. It may positively affect some UE4 games which dip framerate severely under certain circumstances.
* Small optimization
* Don't forget this.
* Add short cache dictionary
* Address feedback
* Address some feedback
* Add MVK basics.
* Use appropriate output attribute types
* 4kb vertex alignment, bunch of fixes
* Add reduced shader precision mode for mvk.
* Disable ASTC on MVK for now
* Only request robustnes2 when it is available.
* It's just the one feature actually
* Add triangle fan conversion
* Allow NullDescriptor on MVK for some reason.
* Force safe blit on MoltenVK
* Use ASTC only when formats are all available.
* Disable multilevel 3d texture views
* Filter duplicate render targets (on backend)
* Add Automatic MoltenVK Configuration
* Do not create color attachment views with formats that are not RT compatible
* Make sure that the host format matches the vertex shader input types for invalid/unknown guest formats
* FIx rebase for Vertex Attrib State
* Fix 4b alignment for vertex
* Use asynchronous queue submits for MVK
* Ensure color clear shader has correct output type
* Update MoltenVK config
* Always use MoltenVK workarounds on MacOS
* Make MVK supersede all vendors
* Fix rebase
* Various fixes on rebase
* Get portability flags from extension
* Fix some minor rebasing issues
* Style change
* Use LibraryImport for MVKConfiguration
* Rename MoltenVK vendor to Apple
Intel and AMD GPUs on moltenvk report with the those vendors - only apple silicon reports with vendor 0x106B.
* Fix features2 rebase conflict
* Rename fragment output type
* Add missing check for fragment output types
Might have caused the crash in MK8
* Only do fragment output specialization on MoltenVK
* Avoid copy when passing capabilities
* Self feedback
* Address feedback
Co-authored-by: gdk <gab.dark.100@gmail.com>
Co-authored-by: nastys <nastys@users.noreply.github.com>
* Change AggregateType to include vector type counts
* Replace VariableType uses with AggregateType and delete VariableType
* Support new local vector types on SPIR-V and GLSL
* Start using vector outputs for texture operations
* Use vectors on more texture operations
* Use vector output for ImageLoad operations
* Replace all uses of single destination texture constructors with multi destination ones
* Update textureGatherOffsets replacement to split vector operations
* Shader cache version bump
Co-authored-by: Ac_K <Acoustik666@gmail.com>
* Vulkan: Don't flush commands when creating most sync
When the WaitForIdle method is called, we create sync as some internal GPU method may read back written buffer data. Some games randomly intersperse compute dispatch into their render passes, which result in this happening an unbounded number of times depending on how many times they run compute.
Creating sync in Vulkan is expensive, as we need to flush the current command buffer so that it can be waited on. We have a limited number of active command buffers due to how we track resource usage, so submitting too many command buffers will force us to wait for them to return to the pool.
This PR allows less "important" sync (things which are less likely to be waited on) to wait on a command buffer's result without submitting it, instead relying on AutoFlush or another, more important sync to flush it later on.
Because of the possibility of us waiting for a command buffer that hasn't submitted yet, any thread needs to be able to force the active command buffer to submit. The ability to do this has been added to the backend multithreading via an "Interrupt", though it is not supported without multithreading.
OpenGL drivers should already be doing something similar so they don't blow up when creating lots of sync, which is why this hasn't been a problem for these games over there.
Improves Vulkan performance on Xenoblade DE, Pokemon Scarlet/Violet, and Zelda BOTW (still another large issue here)
* Add strict argument
This is technically a separate concern from whether the sync is a host syncpoint.
* Remove _interrupted variable
* Actually wait for the invoke
This is required by AMD GPUs, and also may have caused some issues on other GPUs.
* Remove unused using.
* I don't know why it added these ones.
* Address Feedback
* Fix typo
* Add conversion for 16 bit RGBA formats (not supported in Rosetta)
* Rebase fix
Rebase fix
* Forgot to remove this
* Fix RGBA16 format conversion
* Add RGBA4 -> RGBA8 conversion
* Handle host stride alignment
* Address Feedback Part 1
* Can't count
* Don't zero out rgb when alpha is 0
* Separate RGBA4 and 5-bit component formats
Not sure of a better way to name them...
* Add A1B5G5R5 conversion
* Put this in the right place.
* Make format naming consistent for capabilities
* Change method names
* Generic Math Update
Updated Several functions in Ryujinx.Common/Utilities/BitUtils to use generic math
* Updated BitUtil calls
* Removed Whitespace
* Switched decrement
* Fixed changed method calls.
The method calls were originally changed on accident due to me relying too much on intellisense doing stuff for me
* Update Ryujinx.Common/Utilities/BitUtils.cs
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
We have a conversion from LDG on the compute shader to a special constant buffer binding that's used to exceed hardware limits on compute, but it was only running if the byte offset could be identified. The fallback that checks all of the bindings at runtime only checks the storage buffers.
This PR adds checking ube ranges to the LoadGlobal fallback. This extends the changes in #4011 to only check ube entries which are accessed by the shader.
Fixes particles affected by the wind in The Legend of Zelda: Breath of the Wild. May fix other weird issues with compute shaders in some games.
Try a bunch of games and drivers to make sure they don't blow up loading constants willynilly from searchable buffers.
* Make all structs readonly when applicable. It should reduce amount of needless defensive copies
* Make structs with trivial boilerplate equality code record structs
* Remove unnecessary readonly modifiers from TextureCreateInfo
* Make BitMap structs readonly too
* GPU: Use lazy checks for specialization state
This PR adds a new class, the SpecializationStateUpdater, that allows elements of specialization state to be updated individually, and signal the state is checked when it changes between draws, instead of building and checking it on every draw. This also avoids building spec state when
Most state updates have been moved behind the shader state update, so that their specialization state updates make it in before shaders are fetched.
Downside: Fields in GpuChannelGraphicsState are no longer readonly. To counteract copies that might be caused this I pass it as `ref` when possible, though maybe `in` would be better? Not really sure about the quirks of `in` and the difference probably won't show on a benchmark.
The result is around 2 extra FPS on SMO in the usual spot. Not much right now, but it will remove costs when we're doing more expensive specialization checks, such as fragment output type specialization for macos. It may also help more on other games with more draws.
* Address Feedback
* Oops
* GPU: Swap bindings array instead of copying
Reduces work on UpdateShaderState. Now the cost is a few reference moves for arrays, rather than copying data.
Downside: bindings arrays are no longer readonly.
* Micro optimisation
* Add missing docs
* Address Feedback
* Track buffer migrations and flush source on incomplete copy
Makes sure that the modified range list is always from the latest iteration of the buffer, and flushes earlier iterations of a buffer if the data has not been migrated yet.
* Cleanup 1
* Reduce cost for redundant signal checks on Vulkan
* Only inherit the range list if there are pending ranges.
* Fix OpenGL
* Address Feedback
* Whoops
* Ensure that vertex attribute buffer index is valid on GPU
* Remove vertex buffer validation code from OpenGL
* Remove some fields that are no longer necessary
This replacement is meant to be done with the original identified byteOffset, not the one assigned later on by the below conditionals (that already has the constant offset added, for instance).
This fixes videos being pixelated in Xenoblade 3, and other regressions that might have happened since #3847.
* GAL: Send all buffer assignments at once rather than individually
The `(int first, BufferRange[] ranges)` method call has very significant performance implications when the bindings are spread out, which they generally always are in Vulkan. This change makes it so that these methods are only called a maximum of one time per draw.
Significantly improves GPU thread performance in Pokemon Scarlet/Violet.
* Address Feedback
Removed SetUniformBuffers(int first, ReadOnlySpan<BufferRange> buffers)
* GPU: Access non-prefetch command buffers directly
Saves allocating new arrays for them constantly - they can be quite small so it can be very wasteful. About 0.4% of GPU thread in SMO, but was a bit higher in S/V when I checked.
Assumes that non-prefetch command buffers won't be randomly clobbered before they finish executing, though that's probably a safe bet.
* Small change while I'm here
* Address feedback
I did this on ncbuffer2 when we were using it for LDN 3, but I noticed that it can apply to the current buffer manager too, and it's an easy performance win.
The only buffer access that can come from another thread is the overlap search for buffers that have been unmapped. Everything else, including modifications, come from the main GPU thread. That means we only need to lock the range list when it's being modified, as that's the only time where we'll cause a race with the unmapped handler.
This has a significant performance improvements in situations where FIFO is high, like the other two PRs. Joined together they give a nice boost (73.6 master -> 79 -> 83 fps in SMO).
A quick fix to prevent reading the wrong value of Count when reregistering ranges for a new target buffer. Buffer flushes from another thread can modify the range list when the lock isn't active, which can change the count.
This prevents some crashes in Pokemon Scarlet/Violet. It's probably likely that buffer migration during flush is causing some other issues in this game, but this at least prevents the crashing.
* Prune ForceDirty and CheckModified caches on unmap
Since we're now using this for modified checks on the HLE indirect draw method, I'm worried that leaving these to forever gather cache entries isn't the best idea for performance in the long term, and it could keep old buffer objects alive for longer than they should be.
This PR adds the ability to prune invalid entries before checking these caches, and queues it whenever gpu memory is unmapped. It also aligns modified checks to the page size, as I figured it would be possible for a huge number of overlapping over a game's runtime.
This prevents Super Mario Odyssey from having 10s of thousands of entries in the modified cache in Metro Kingdom, and them duplicating when entering and leaving a building (should be cleared, as they were unmapped).
* Address Feedback
The type in the `texOp` in the textureSize instruction doesn't have the exact type on SPIR-V (for example, it is missing the Array flag). This PR gives it the proper type before giving it to the unscaling helper.
This fixes the ground textures being broken on Pokemon Scarlet/Violet when scaling. It wasn't finding the texture, so the descriptor index it provided was -1...
* Eliminate CB0 accesses
Still some work to do, decouple from hle?
* Forgot the important part somehow
* Fix and improve alignment test
* Address Feedback
* Remove some complexity when checking storage buffer alignment
* Update Ryujinx.Graphics.Shader/Translation/Optimizations/GlobalToStorage.cs
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
* Implement HLE macro for DrawElementsIndirect
* Shader cache version bump
* Use GL_ARB_shader_draw_parameters extension on OpenGL
* Fix DrawIndexedIndirectCount on Vulkan when extension is not supported
* Implement DrawIndex
* Alignment
* Fix some validation errors
* Rename BaseIds to DrawParameters
* Fix incorrect index buffer and vertex buffer size in some cases
* Add HLE macros for DrawArraysInstanced and DrawElementsInstanced
* Perform a regular draw when indirect data is not modified
* Use non-indirect draw methods if indirect buffer was not GPU modified
* Only check if draw parameters match if the shader actually uses them
* Expose Macro HLE setting on GUI
* Reset FirstVertex and FirstInstance after draw
* Update shader cache version again since some people already tested this
* PR feedback
Co-authored-by: riperiperi <rhy3756547@hotmail.com>