Scientists have built a framework that gives generative AI systems like DALL·E 3 and Stable Diffusion a major boost by condensing them into smaller models — without compromising their quality.
While I think the realism of some models is fantastic and the flexibility of others is great, it is starting to feel like we're reaching a plateau on quality. Most of the white papers I've seen posted lately are about speed, or some alternate way of doing what ControlNet or inpainting can already do.
Have you seen the SD3 preview images? They’re looking seriously impressive.
Well, when it’s fast enough you can do it in real time. How about making old games look like they looked to you as a child?
There’s way more to a game’s look than textures though. Arguably ray tracing will have a greater impact than textures. Not to mention, for retro games, you could just generate the textures beforehand, no need to do it in real time.
I meant putting the whole image through AI. Not just the textures. Tell it how you want it to look and suddenly a grizzled old Mario is jumping on a realistic turtle with blood splattering everywhere.
There is no single "whole" image when talking about a video game. It's a combination of dynamic layers carefully interacting with each other.
You can take any individual texture and make it look different or more realistic, and it may work with some interactions, but it might end up breaking the game, especially if hitboxes depend on the texture.
We may see AI remakes of video games at some point, but it will require the AI to reprogram them from scratch.
Now, when we talk about movies and other linear media, I expect to see this technology relatively soon.
Of course there is. Once everything is done, a whole image is sent to the display. That's how FSR 1 can work without explicit game support.
What I meant is that the final image is dynamic, so every player may have a unique configuration, which makes it harder for the AI to understand what's going on.
Using the final render of each frame would cause a lot of texture bleeding, for example when a red character stands in front of a red background. Or when jumping on top of an animal: you may get wild frames where the body shape drastically changes, or where the character is suddenly realistically riding the animal, then petting it the next frame, only for it to die on frame three, all because every frame is processed as its own work.
Upscaling final renders is indeed possible, but mostly because it doesn't change the general shapes all that much. Small artifacts are also very common here, but they're often not noticeable to the human eye and don't affect a modern game.
In older games, especially Mario-era titles where hitboxes are pixel-dependent, you'd either get a very confusing game with tons of clipping, because the game doesn't account for the new textures, or the game abides by the new textures and the gameplay changes with them (see the sketch after this comment).
Source: I have studied game development and recreated Mario-era games as part of assignments; currently I am self-studying the technical specifics of how machine learning and generative algorithms operate.
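To make that hitbox point concrete, here is a minimal sketch of the pixel-perfect collision test many sprite-era games rely on. The boolean-mask representation is a simplification assumed for illustration; the principle (collision wherever opaque pixels of two sprites overlap) is the classic one:

```python
# Minimal sketch of pixel-perfect collision, as used in many sprite-era games.
# Sprites are 2D boolean masks here (True = opaque pixel); the representation
# is simplified for illustration, the overlap test itself is the classic one.

def pixels_collide(mask_a, pos_a, mask_b, pos_b):
    """True if any opaque pixel of sprite A overlaps an opaque pixel of B."""
    ax, ay = pos_a
    bx, by = pos_b
    # Intersect the two sprites' bounding boxes first.
    left = max(ax, bx)
    top = max(ay, by)
    right = min(ax + len(mask_a[0]), bx + len(mask_b[0]))
    bottom = min(ay + len(mask_a), by + len(mask_b))
    if left >= right or top >= bottom:
        return False  # bounding boxes don't even touch
    # Inside the overlap, collide only where BOTH sprites are opaque.
    for y in range(top, bottom):
        for x in range(left, right):
            if mask_a[y - ay][x - ax] and mask_b[y - by][x - bx]:
                return True
    return False

# A wider repainted sprite changes the mask, and therefore the gameplay:
mario = [[True, True], [True, True]]    # hypothetical 2x2 sprite
goomba = [[True, True], [True, True]]
print(pixels_collide(mario, (0, 0), goomba, (2, 0)))  # False: no overlap
print(pixels_collide(mario, (0, 0), goomba, (1, 0)))  # True: columns overlap
```

Swap in an AI-repainted sprite with a different silhouette and the masks change, and with them the collisions; in games like these, the gameplay is literally encoded in the pixels.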
Those are valid points, but nothing there is insurmountable with even a little bit of advancement.
For example, this is a relatively old example from 2021, before DALL·E 2, Stable Diffusion, or any of the video-consistency models were out: https://www.youtube.com/watch?v=P1IcaBn3ej0
Is this perfect? No, there are artifacts, and the quality just matches the driving dataset it was trained on, but this was an old, specialized example built on a completely different (now archaic) architecture. The newer image-generation models are much better, and the frame-to-frame consistency models are getting better by the week; some of them are nearly there (obviously not in real time).
About the red-on-red bleed and background-separation issues: for 3D-rendered games it's relatively straightforward to get not just the color but also depth and normal maps from the buffer (especially assuming this is done with the blessing of the game developers, game engines, DirectX, or other APIs). I don't know if you follow the developments, but with ControlNet and Stable Diffusion, for example, it is trivial to add color, depth, normal-map, pose, line-outline, and many other constraints on the generated image. So if the character is wearing red over a red background, the two are separated by the depth map, the generated image will have the same depth, and their surface normals will differ. You can use whatever aspects of the input game you like as constraints on the generation.
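For anyone who hasn't tried it, depth-conditioned generation is already only a few lines with the diffusers library. A minimal sketch, assuming a depth map exported from the game's G-buffer and saved to disk (the checkpoints are the commonly used public ones; the filenames and prompt are placeholders):

```python
# Minimal sketch: depth-conditioned generation with ControlNet (diffusers).
# In a real pipeline the depth map would come straight from the graphics API
# instead of a file on disk.
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

depth = Image.open("frame_depth.png")  # placeholder: per-frame depth dump
image = pipe(
    "photorealistic city street, evening light",
    image=depth,  # the depth constraint keeps red-on-red shapes separated
    num_inference_steps=20,
).images[0]
image.save("frame_out.png")
```

Even with red on red, the depth channel keeps the character and the background apart during generation.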
I am not saying we can do this right now. The generation speeds for high-quality images, plus all the other required tools in the workflow (from understanding the context and caption generation/latent-space matching, to getting these color/depth/normal constraints, to maintaining temporal consistency over the previous N frames, to generating the final image, and doing it all fast enough) obviously involve a ton of challenges. But it is quite possible, and I fully expect to see working non-realtime demos within a year, or a couple of years at the most.
–
In 2D games it may be harder due to pixelation. As you said, there are upscaling algorithms that work more or less well to increase the resolution somewhat, though nothing photorealistic, obviously. There are also methods like this one, which uses segmentation algorithms to split the images and generate new ones with AI generators: https://github.com/darvin/X.RetroGameAIRemaster
To be honest, to render 2D games in a different style you can do much better even now. Most retro games have their sprites and backgrounds already extracted from the original games; you can just upscale once (by the producer, or as fan edits), and then you don't even need to worry about real-time generation. I wanted to upscale Laura Bow 2 this way, for example. One random example I just found: https://www.youtube.com/watch?v=kBFMKroTuXE
Replacing the sprites and backgrounds won't make them truly photorealistic with dynamic drop shadows and lighting changes, but once the sprites are at a high enough resolution, you can feed them into full-frame re-generation, frame by frame (a rough sketch of that kind of pass follows below). Then again, I probably don't want Guybrush to look like Gabriel Knight 2 or the other FMV games, so I'm not sure about that angle.
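As a rough illustration of that last step, a low-strength img2img pass over an already-upscaled frame changes the style while keeping the shapes. A sketch under the same assumptions as above (real diffusers pipeline; filenames, prompt, and strength value are placeholders):

```python
# Minimal sketch: restyling an upscaled frame with img2img (diffusers).
# Low strength preserves the original shapes, so what the player sees still
# roughly matches the old hitboxes; high strength drifts toward a new game.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

frame = Image.open("frame_upscaled.png").convert("RGB")  # placeholder input
out = pipe(
    prompt="hand-painted adventure game background, detailed, atmospheric",
    image=frame,
    strength=0.35,  # restyle, don't re-imagine
    num_inference_steps=30,
).images[0]
out.save("frame_restyled.png")
```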
First of all thank you for the detailed reply.
+10 for "randomly" linking Latent Vision. That's the dude who made IPAdapter for Stable Diffusion, which is hands-down revolutionary for my ComfyUI workflows.
I actually fully agree on all the 3D stuff; I remember that GTA video.
My comment was reflecting the following idea, ahem:
"putting the whole image through AI. Not just the textures. Tell it how you want it to look and suddenly a grizzled old Mario is jumping on a realistic turtle with blood splattering everywhere." -bjoern_tantau
But on the topic of modern 3D, I expect we can go very far. Generate high-quality models of objects, then generate from those a low-poly version plus a consistent prompt, to be used by the game engine's AI during development and live gaming. Including ray-tracing matrices (not RTX, but similar, used for detection), which I admittedly coded exactly once, to demonstrate for an exam, and barely understand. What I'm trying to say is that some clever people will figure out how to calculate collisions and interactions using low-poly models plus AI.
I am very impressed by the X.RetroGameAIRemaster, but I think it may also depend on the game.
In these older games, the consistency of the gameplay is core to their identity, even before the graphics. Hitbox detection is pixel-based, which is core gameplay and influences difficulty. Hardware limitations, in a way, also become part of the gameplay design.
You can upscale and give many of them fancy textures, maybe even layers of textures, modded items, and accessibility cheats.
But the premise: "Not just the textures. Tell it how you want it to look and suddenly a grizzled old Mario is jumping on a realistic turtle with blood splattering everywhere."
An AI can cook something up like that, but it will be a new, distinct Mario game if you change that much of what's happening on screen.
Anyway, I am tired and probably sound more like a lunatic the longer I talk, so again, thanks for the good read.
When the output of something is the average of the inputs, it will naturally be mediocre. It will always look like the output of a committee, by the nature of how it is formed.
Certain artists stand out because they are different from everyone else, and that is why they are celebrated. M.C. Escher has a certain style that when run through AI looks like a skilled high school student doing their best impression of M.C. Escher.
Now, as a tool to inspire, AI is pretty good at creating mashups of multiple things really fast. Those could be used by an actual artist to create something engaging. Most AI output reminds me of Photoshop battles.
Who says the output is an average?
I agree that narrow models and LoRAs trained on a specific style can never be as good as the original, but I also think that is the lamest, least creative way to generate.
It's much more fun to use general-purpose models and crack the settings to generate exactly what you want, the way you want.
That’s maybe because we’ve reached the limits of what the current architecture of models can achieve on the current architecture of GPUs.
To create significantly better models without a fundamentally new approach, you have to increase the model size. And if all the accelerators accessible to you only offer, say, 24 GB, you can't grow infinitely. At least not within a reasonable timeframe.
Will increasing the model actually help? Right now we’re dealing with LLMs that literally have the entire internet as a model. It is difficult to increase that.
Making a better way to process said model would be a much more substantive achievement. So that when particular details are needed it’s not just random chance that it gets it right.
That is literally a complete misinterpretation of how models work.
You don't "have the Internet as a model"; you train a model using large amounts of data. That does not mean that the model contains any of the actual data. State-of-the-art models are somewhere in the billions of parameters. If you have, say, 50B parameters, each stored as a 64-bit/8-byte double (which is way, way too much precision), you get something like 400 GB of data. That's a lot, but the Internet is slightly larger than that.
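The back-of-the-envelope math, for anyone who wants to check it (the 50B figure is illustrative, and real models are usually stored at 16-bit or lower precision, not as 64-bit doubles):

```python
# Rough memory footprint of a model: parameters x bytes per parameter.
def model_size_gb(params: float, bytes_per_param: int) -> float:
    return params * bytes_per_param / 1e9

for bytes_per_param, precision in [(8, "float64"), (2, "float16"), (1, "int8")]:
    print(f"50B params @ {precision}: {model_size_gb(50e9, bytes_per_param):.0f} GB")

# 50B params @ float64: 400 GB  <- the deliberately generous estimate above
# 50B params @ float16: 100 GB
# 50B params @ int8: 50 GB
```

Either way the result is orders of magnitude smaller than "the entire internet", and it also shows why a 24 GB consumer card caps the model size you can grow to, as the earlier comment said.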
It's an exaggeration, but it's not far off, given that Google literally has all of the web parsed at least once a day.
Reddit just sold off AI harvesting rights on all of its content to Google.
The problem is no longer model size. The problem is interpretation.
You can ask almost anyone on earth a simple, deterministic math problem and you'll get the right answer almost all of the time, because they understand the principles behind it.
Until you can show deterministic understanding in AI, you have a glorified chat bot.
It is far off. It’s like saying you have the entire knowledge of all physics because you skimmed a textbook once.
Interpretation is also a problem that can be solved; current models already understand quite a lot of nuance, subtext, and implicit context.
But you're moving the goalposts here. We started at "models don't get better, we're at a plateau," and now you're aiming for perfection.
You’re building beautiful straw men. They’re lies, but great job.
I said originally that we need to improve the interpretation of the model by AI, not just have even bigger models that will invariably have the same flaw as they do now.
Deterministic reliability is the end goal of that.
"Making a better way to process said model would be a much more substantive achievement. So that when particular details are needed it's not just random chance that it gets it right."
Where exactly did you write anything about interpretation? Getting "details right" by better processing? I would hardly call that "interpretation"; that's just being wrong faster.