Stable Diffusion

4297 readers
12 users here now

Discuss matters related to our favourite AI Art generation technology

founded 1 year ago
MEGATHREAD (lemmy.dbzer0.com)
submitted 1 year ago by [email protected] to c/[email protected]
 
 

This is a copy of the /r/stablediffusion wiki, to help people who need access to that information.


Howdy and welcome to r/stablediffusion! I'm u/Sandcheeze and I have collected these resources and links to help you enjoy Stable Diffusion, whether you are here for the first time or looking to add more customization to your image generations.

If you'd like to show support, feel free to send us kind words or check out our Discord. Donations are appreciated, but not necessary as you being a great part of the community is all we ask for.

Note: The community resources provided here are not endorsed, vetted, nor provided by Stability AI.

# Stable Diffusion

Local Installation

Active Community Repos/Forks to install on your PC and keep it local.

Online Websites

Websites with usable Stable Diffusion right in your browser. No need to install anything.

Mobile Apps

Stable Diffusion on your mobile device.

Tutorials

Learn how to improve your skills with Stable Diffusion, whether you're a beginner or an expert.

Dream Booth

How to train a custom model, plus resources on doing so.

Models

Models specially trained towards certain subjects and/or styles.

Embeddings

Tokens trained on specific subjects and/or styles.

Bots

Either bots you can self-host, or bots you can use directly on various websites and services such as Discord, Reddit, etc.

3rd Party Plugins

SD plugins for programs such as Discord, Photoshop, Krita, Blender, Gimp, etc.

Other useful tools

# Community

Games

  • PictionAIry : (Video|2-6 Players) - The image guessing game where AI does the drawing!

Podcasts

Databases or Lists

Still updating this with more links as I collect them all here.

FAQ

How do I use Stable Diffusion?

  • Check out our guides section above!

Will it run on my machine?

  • Stable Diffusion requires a GPU with at least 4 GB of VRAM to run locally, but much beefier graphics cards (10-, 20-, 30-series Nvidia cards) are needed to generate high-resolution or high-step-count images. Alternatively, anyone can run it online through DreamStudio or by hosting it on their own GPU compute cloud server. (A quick way to check your own machine is sketched after this list.)
  • Only Nvidia cards are officially supported.
  • AMD support is available here unofficially.
  • Apple M1 Chip support is available here unofficially.
  • Intel-based Macs currently do not work with Stable Diffusion.
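A minimal sketch of such a check, assuming PyTorch with CUDA support is installed (this is not an official Stability AI tool, just an illustration):

```python
# Report whether a CUDA-capable GPU is visible and how much VRAM it has.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    print("Meets the 4 GB minimum." if vram_gb >= 4 else "Below the 4 GB minimum.")
else:
    print("No CUDA-capable GPU detected; consider DreamStudio or a cloud GPU instead.")
```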

How do I get a website or resource added here?

If you have a suggestion for a website or a project to add to our list, or if you would like to contribute to the wiki, please don't hesitate to reach out to us via modmail or message me.

 
 

Details: https://github.com/Nerogar/OneTrainer/blob/master/docs/RamOffloading.md

  • Flux LoRA training on 6GB GPUs (at 512px resolution)
  • Flux Fine-Tuning on 16GB GPUs (or even less) +64GB of RAM
  • SD3.5-M Fine-Tuning on 4GB GPUs (at 1024px resolution)
submitted 3 days ago* (last edited 3 days ago) by [email protected] to c/[email protected]
 
 

Highlights for 2024-10-29

  • Support for all SD3.x variants
    SD3.0-Medium, SD3.5-Medium, SD3.5-Large, SD3.0-Large-Turbo
  • Allow on-the-fly quantization using bitsandbytes during model load
    Load any variant of SD3.x or FLUX.1 and apply quantization during load, without the need for pre-quantized models (a sketch of the idea follows this list)
  • Allow for custom model URL in standard model selector
    Can be used to specify any model from HuggingFace or CivitAI
  • Full support for torch==2.5.1
  • New wiki articles: Gated Access, Quantization, Offloading
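For readers who want to try the same idea outside of SD.Next, here is a minimal sketch using the diffusers library directly (assumptions: diffusers >= 0.31 with bitsandbytes installed, and gated access to FLUX.1-dev; SD.Next wires all of this up internally through its UI):

```python
# On-the-fly 4-bit quantization at load time: the transformer is quantized
# while being loaded, so no pre-quantized checkpoint is required.
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant,
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keep VRAM usage down on smaller GPUs

image = pipe("a watercolor fox in the snow", num_inference_steps=28).images[0]
image.save("fox.png")
```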

Plus tons of smaller improvements and cumulative fixes reported since last release

README | CHANGELOG | WiKi | Discord

 
 

Abstract

We propose Framer for interactive frame interpolation, which targets producing smoothly transitioning frames between two images as per user creativity. Concretely, besides taking the start and end frames as inputs, our approach supports customizing the transition process by tailoring the trajectory of some selected keypoints. Such a design enjoys two clear benefits. First, incorporating human interaction mitigates the issue arising from numerous possibilities of transforming one image to another, and in turn enables finer control of local motions. Second, as the most basic form of interaction, keypoints help establish the correspondence across frames, enhancing the model to handle challenging cases (e.g., objects on the start and end frames are of different shapes and styles). It is noteworthy that our system also offers an "autopilot" mode, where we introduce a module to estimate the keypoints and refine the trajectory automatically, to simplify the usage in practice. Extensive experimental results demonstrate the appealing performance of Framer on various applications, such as image morphing, time-lapse video generation, cartoon interpolation, etc. The code, the model, and the interface will be released to facilitate further research.

Paper: https://arxiv.org/abs/2410.18978

Code: https://github.com/aim-uofa/Framer

Project Page: https://aim-uofa.github.io/Framer/#comparison_with_baseline_container

 
 

Highlights for 2024-10-23

A month later and with nearly 300 commits, here is the latest SD.Next update!

Workflow highlights

  • Reprocess: New workflow options that let you generate at lower quality and then
    reprocess only selected images at higher quality, or generate without hires/refine and then reprocess with hires/refine,
    and you can pick any previous latent from the auto-captured history!
  • Detailer: Fully built-in detailer workflow with support for all standard models
  • Built-in model analyzer
    See all details of your currently loaded model, including components, parameter count, layer count, etc.
  • Extract LoRA: Load any LoRA(s) and generate as usual,
    and once you like the results, simply extract a combined LoRA for future use!

New models

What else?

  • Tons of work on dynamic quantization that can be applied on-the-fly during model load to any model type (you do not need to use pre-quantized models)
    Supported quantization engines include BitsAndBytes, TorchAO, Optimum.quanto, NNCF compression, and more...
  • Auto-detection of the best available device/dtype settings for your platform and GPU reduces the need for manual configuration
    Note: This is a breaking change to default settings, and it's recommended to check your preferred settings after upgrading
  • Full rewrite of sampler options, now far more streamlined, with tons of new options to tweak scheduler behavior
  • Improved LoRA detection and handling for all supported models
  • Several Flux.1 optimizations and new quantization types

Oh, and we've compiled a full table listing the top 30 popular text-to-image generative models (how many have you tried?),
with their respective parameters and an architecture overview: Models Overview

And there are also other goodies like multiple XYZ grid improvements, additional Flux ControlNets, additional Interrogate models, better LoRA tags support, and more...
README | CHANGELOG | WiKi | Discord

 
 

Abstract

Significant advancements have been made in the field of video generation, with the open-source community contributing a wealth of research papers and tools for training high-quality models. However, despite these efforts, the available information and resources remain insufficient for achieving commercial-level performance. In this report, we open the black box and introduce Allegro, an advanced video generation model that excels in both quality and temporal consistency. We also highlight the current limitations in the field and present a comprehensive methodology for training high-performance, commercial-level video generation models, addressing key aspects such as data, model architecture, training pipeline, and evaluation. Our user study shows that Allegro surpasses existing open-source models and most commercial models, ranking just behind Hailuo and Kling. Code: this https URL , Model: this https URL , Gallery: this https URL .

Paper: https://arxiv.org/abs/2410.15458

Code: https://github.com/rhymes-ai/Allegro (coming soon)

Weights: https://huggingface.co/rhymes-ai/Allegro

Project Page: https://huggingface.co/blog/RhymesAI/allegro

 
 

Abstract

Recently, large-scale diffusion models have made impressive progress in text-to-image (T2I) generation. To further equip these T2I models with fine-grained spatial control, approaches like ControlNet introduce an extra network that learns to follow a condition image. However, for every single condition type, ControlNet requires independent training on millions of data pairs with hundreds of GPU hours, which is quite expensive and makes it challenging for ordinary users to explore and develop new types of conditions. To address this problem, we propose the CtrLoRA framework, which trains a Base ControlNet to learn the common knowledge of image-to-image generation from multiple base conditions, along with condition-specific LoRAs to capture distinct characteristics of each condition. Utilizing our pretrained Base ControlNet, users can easily adapt it to new conditions, requiring as few as 1,000 data pairs and less than one hour of single-GPU training to obtain satisfactory results in most scenarios. Moreover, our CtrLoRA reduces the learnable parameters by 90% compared to ControlNet, significantly lowering the threshold to distribute and deploy the model weights. Extensive experiments on various types of conditions demonstrate the efficiency and effectiveness of our method. Codes and model weights will be released at this https URL.

Paper: https://arxiv.org/abs/2410.09400

Code: https://github.com/xyfJASON/ctrlora

Weights: https://huggingface.co/xyfJASON/ctrlora/tree/main

 
 

Abstract

Diffusion models, such as Stable Diffusion, have made significant strides in visual generation, yet their paradigm remains fundamentally different from autoregressive language models, complicating the development of unified language-vision models. Recent efforts like LlamaGen have attempted autoregressive image generation using discrete VQVAE tokens, but the large number of tokens involved renders this approach inefficient and slow. In this work, we present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image to a level comparable with state-of-the-art diffusion models like SDXL. By incorporating a comprehensive suite of architectural innovations, advanced positional encoding strategies, and optimized sampling conditions, Meissonic substantially improves MIM's performance and efficiency. Additionally, we leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution. Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images. Extensive experiments validate Meissonic's capabilities, demonstrating its potential as a new standard in text-to-image synthesis. We release a model checkpoint capable of producing 1024×1024 resolution images.

Paper: https://arxiv.org/abs/2410.08261

Code: https://github.com/viiika/Meissonic

Model: https://huggingface.co/MeissonFlow/Meissonic

 
 

The megathread mentions Diffusion Toolkit, although this is a Windows-only tool.

There is also Breadboard; however, I consider it abandoned, and it lacks some features like rating/scoring.

My hacky tool and why I want something better

I've been using a hacky Python script to interpret prompts and other PNG Info metadata as tags and insert them into booru-like software, which lets me search and sort by any of those tags (including a prompt keyword, seed, steps, and my own rating scores). This tool was useful in a lot of ways when using tag-style prompting, but as I move towards natural-language prompts with newer models, tag-based media software will make it harder to search and to compare prompts between images. Also, my hack was hacky and somewhat manual to use; images wouldn't automatically be imported when generated.
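For context, the core of such a hack is usually just reading the PNG text metadata that most SD frontends embed. A minimal sketch, assuming AUTOMATIC1111-style images that store generation settings in a "parameters" text chunk (other frontends use different keys and formats, and the parsing here is deliberately simplified):

```python
# Read prompt/seed/steps from an AUTOMATIC1111-style "parameters" PNG text chunk.
from PIL import Image

def read_sd_metadata(path: str) -> dict:
    raw = Image.open(path).info.get("parameters", "")
    meta = {"prompt": raw.split("Negative prompt:")[0].strip()}  # simplified parsing
    for field in ("Steps", "Sampler", "Seed", "Model"):
        marker = f"{field}: "
        if marker in raw:
            meta[field.lower()] = raw.split(marker, 1)[1].split(",")[0].strip()
    return meta

print(read_sd_metadata("00001-1234567890.png"))  # hypothetical filename
```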


So I'd like to start using a purpose-made tool instead, but I'm struggling to find any other options. I'd rather know if a good tool exists before I start rebuilding my duct-tape conveyor belt.

 
 

The image shows the list of prompt items before/after running 'remove duplicates' on a subset of the Adam Codd Hugging Face repo of Civitai prompts: https://huggingface.co/datasets/AdamCodd/Civitai-2m-prompts/tree/main

The tool I'm building "searches" existing prompts similar to a given text or image.

Like the common CLIP interrogator, but better.

Link to notebook here: https://huggingface.co/datasets/codeShare/fusion-t2i-generator-data/blob/main/Google%20Colab%20Jupyter%20Notebooks/fusion_t2i_CLIP_interrogator.ipynb

For pre-encoded references, I can recommend experimenting with setting the START_AT parameter to values of 10000-100000 for added variety.

//---//

Removing duplicates from civitai prompts results in a 90% reduction of items!

Pretty funny IMO.

It shows the human tendency to stick to the same type of words when prompting.

I'm no exception; I prompt the same way all the time, which is why I'm building this tool so that I don't need to think about it.

If you wish to search this set, you can use the notebook above.

Unlike the typical pharmapsychotic CLIP interrogator, I pre-encode the text corpus ahead of time.
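For anyone curious what that pre-encoding approach looks like in practice, here is a minimal sketch using the transformers CLIP implementation (the model name, example prompts, and query image are illustrative assumptions; the linked notebook handles the full corpus and quantization):

```python
# Pre-encode a prompt corpus with CLIP once, then rank it against a query image
# by cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")  # 768-dim projections
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Tiny stand-in corpus; the real corpus is pre-encoded once and stored on disk.
prompts = ["masterpiece, 1girl, solo", "a watercolor fox in the snow", "cyberpunk city at night"]

with torch.no_grad():
    text_in = processor(text=prompts, return_tensors="pt", padding=True, truncation=True)
    text_feats = model.get_text_features(**text_in)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    img_in = processor(images=Image.open("query.png"), return_tensors="pt")
    img_feat = model.get_image_features(**img_in)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

scores = (text_feats @ img_feat.T).squeeze(1)  # cosine similarity per prompt
for i in scores.argsort(descending=True).tolist():
    print(f"{scores[i].item():.3f}  {prompts[i]}")
```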

//---//

Additionally, I'm using quantization on the text corpus to store the encodings as unsigned integers (torch.uint8) instead of float32, using this formula: q = round(x / scale) + zero_point.

For the CLIP encodings, I use a scale of 0.0043.

A typical zero_point value for a given encoding can be 0, 30, 120, or around 250.

The TL;DR is that you divide the float32 value by 0.0043, round it to the closest integer, and then increase the zero_point value until all values within the encoding are above 0.

This allows us to accurately store the values as unsigned integers (torch.uint8).

This conversion reduces the file size to less than 1/4th of its original size.

When it is time to calculate with them, you do the same process in reverse.
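As a rough sketch, that quantize/dequantize round-trip looks like this (the 0.0043 scale is the value quoted above, not a universal constant, and the stand-in encoding is illustrative):

```python
# uint8 round-trip: q = round(x / scale) + zero_point, with zero_point chosen
# so that every quantized value is >= 0.
import torch

SCALE = 0.0043  # scale quoted above for the CLIP text encodings

def quantize(x: torch.Tensor, scale: float = SCALE):
    q = torch.round(x / scale)
    zero_point = int((-q.min()).clamp(min=0).item())  # shift so all values are non-negative
    q = (q + zero_point).clamp(0, 255).to(torch.uint8)
    return q, zero_point

def dequantize(q: torch.Tensor, zero_point: int, scale: float = SCALE) -> torch.Tensor:
    return (q.to(torch.float32) - zero_point) * scale

emb = (torch.rand(768) - 0.5) * 0.5   # stand-in for a 768-dim CLIP text encoding
q, zp = quantize(emb)
restored = dequantize(q, zp)
print(q.dtype, zp, (emb - restored).abs().max().item())  # uint8, shift used, small error
```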

For more info related to quantization, see the pytorch docs: https://pytorch.org/docs/stable/quantization.html

//---//

I also have a 1.6-million-item fanfiction tag set loaded from https://archiveofourown.org/

It's mostly character names.

They are listed as fanfic1 and fanfic2 respectively.

//---//

ComfyUI users should know that random choice {item1|item2|...} exists as a built-in feature.

//--//

Upcoming plans are to include a visual representation of the text_encodings as colored cells within a 16x16 grid.

A color is an RGB value (3 integer values) within a given range, and 3 x 16 x 16 = 768, which happens to be the dimension of the CLIP encoding.

EDIT: Added it now
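The reshaping itself is simple; a minimal sketch, assuming the encoding has already been quantized to uint8 as described above:

```python
# Render a 768-dim encoding as a 16x16 RGB grid (3 * 16 * 16 = 768).
import torch
from PIL import Image

encoding = torch.randint(0, 256, (768,), dtype=torch.uint8)  # stand-in for a quantized encoding
grid = encoding.reshape(16, 16, 3).numpy()                   # 16 x 16 cells, each an RGB triple
img = Image.fromarray(grid, mode="RGB")
img.resize((256, 256), resample=Image.NEAREST).save("encoding_grid.png")  # upscale for visibility
```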

//---//

That's all for this update.

 
 

Abstract

World models constitute a promising approach for training reinforcement learning agents in a safe and sample-efficient manner. Recent world models predominantly operate on sequences of discrete latent variables to model environment dynamics. However, this compression into a compact discrete representation may ignore visual details that are important for reinforcement learning. Concurrently, diffusion models have become a dominant approach for image generation, challenging well-established methods modeling discrete latents. Motivated by this paradigm shift, we introduce DIAMOND (DIffusion As a Model Of eNvironment Dreams), a reinforcement learning agent trained in a diffusion world model. We analyze the key design choices that are required to make diffusion suitable for world modeling, and demonstrate how improved visual details can lead to improved agent performance. DIAMOND achieves a mean human normalized score of 1.46 on the competitive Atari 100k benchmark; a new best for agents trained entirely within a world model. To foster future research on diffusion for world modeling, we release our code, agents and playable world models at https://github.com/eloialonso/diamond.

Paper: https://arxiv.org/pdf/2405.12399

Code: https://github.com/eloialonso/diamond/tree/csgo

Project Page: https://diamond-wm.github.io/

 
 

Abstract

Advanced diffusion models like RPG, Stable Diffusion 3 and FLUX have made notable strides in compositional text-to-image generation. However, these methods typically exhibit distinct strengths for compositional generation, with some excelling in handling attribute binding and others in spatial relationships. This disparity highlights the need for an approach that can leverage the complementary strengths of various models to comprehensively improve the composition capability. To this end, we introduce IterComp, a novel framework that aggregates composition-aware model preferences from multiple models and employs an iterative feedback learning approach to enhance compositional generation. Specifically, we curate a gallery of six powerful open-source diffusion models and evaluate their three key compositional metrics: attribute binding, spatial relationships, and non-spatial relationships. Based on these metrics, we develop a composition-aware model preference dataset comprising numerous image-rank pairs to train composition-aware reward models. Then, we propose an iterative feedback learning method to enhance compositionality in a closed-loop manner, enabling the progressive self-refinement of both the base diffusion model and reward models over multiple iterations. Theoretical proof demonstrates the effectiveness and extensive experiments show our significant superiority over previous SOTA methods (e.g., Omost and FLUX), particularly in multi-category object composition and complex semantic alignment. IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation. Code: this https URL

Paper: https://arxiv.org/abs/2410.07171

Code: https://github.com/YangLing0818/IterComp

Hugging Face Repo: https://huggingface.co/comin/IterComp

Safetensor SDXL Model: https://civitai.com/models/840857/itercomp

 
 

I want to buy a new GPU, mainly for SD. The machine-learning space is moving quickly, so I want to avoid buying a brand-new card only for a fresh model or tool to come out and put it behind the times again. On the other hand, I also want to avoid needlessly spending extra thousands of dollars pretending I can get a 'future-proof' card.

I'm currently interested in SD and training LoRAs (etc.). From what I've heard, the general advice is just to go for maximum VRAM.

  • Is there any extra advice I should know about?
  • Is NVIDIA vs. AMD a critical decision for SD performance?

I'm a hobbyist, so a couple of seconds difference in generation or a few extra hours for training isn't going to ruin my day.

Some example prices in my region, to give a sense of scale:

  • 16GB AMD: $350
  • 16GB NV: $450
  • 24GB AMD: $900
  • 24GB NV: $2000

edit: prices are for new cards; I haven't explored the pros and cons of used GPUs

 
 