Abstract: As text-to-image generation based on diffusion models continues to evolve, both image quality and diversity have improved. However, issues such as missing subjects and attribute confusion in multi-subject prompts remain unresolved. This paper proposes a graph-constrained dynamic attention method for text-to-image generation that enhances the generative capability of Stable Diffusion under conditions involving multiple subjects and attributes. The approach first introduces a scene graph generator based on a graph attention network, which extracts object nodes and semantic relationships from CLIP text embeddings to produce signals that impose structural layout constraints. A dynamic attention gating module is then embedded in the U-Net architecture; this module is timestep-aware and adaptively adjusts attention weights, turning the implicit denoising tendency into explicit attention scheduling that incorporates the scene graph constraints. Experiments on the CUB and COCO datasets show that, compared with mainstream methods, the approach achieves improvements on metrics including FID, IS, and CLIP-Score.
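The timestep-aware gating described in the abstract can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the paper's implementation: the module name, the sigmoid gate produced by a small MLP over the timestep, and the precomputed scene-graph layout prior are all hypothetical stand-ins for the components the abstract names.

```python
import torch
import torch.nn as nn


class DynamicAttentionGate(nn.Module):
    """Illustrative timestep-aware gate for cross-attention maps.

    A small MLP maps the diffusion timestep to a per-head gate in (0, 1),
    which blends the model's attention map with a layout prior assumed to
    be derived from scene-graph constraints.
    """

    def __init__(self, num_heads: int, t_dim: int = 64):
        super().__init__()
        self.t_embed = nn.Sequential(
            nn.Linear(1, t_dim), nn.SiLU(), nn.Linear(t_dim, num_heads)
        )

    def forward(self, attn: torch.Tensor, layout_prior: torch.Tensor,
                t: torch.Tensor) -> torch.Tensor:
        # attn, layout_prior: (batch, heads, queries, keys); both row-normalized.
        gate = torch.sigmoid(self.t_embed(t.float().view(-1, 1)))  # (batch, heads)
        gate = gate.view(-1, attn.size(1), 1, 1)
        # Blend: a large gate follows the structural prior (early, noisy steps);
        # a small gate lets the model's own attention dominate (late refinement).
        mixed = gate * layout_prior + (1.0 - gate) * attn
        # Renormalize so each query row remains a valid attention distribution.
        return mixed / mixed.sum(dim=-1, keepdim=True).clamp_min(1e-8)


if __name__ == "__main__":
    torch.manual_seed(0)
    b, h, q, k = 2, 8, 16, 77  # 77 = CLIP text token length
    attn = torch.softmax(torch.randn(b, h, q, k), dim=-1)
    prior = torch.softmax(torch.randn(b, h, q, k), dim=-1)
    t = torch.randint(0, 1000, (b,))
    out = DynamicAttentionGate(num_heads=h)(attn, prior, t)
    print(out.shape)
```

Making the gate a function of the timestep is what turns the "implicit denoising tendency" into an explicit schedule: the same module yields different structural pressure at different stages of sampling without any per-step hand-tuning.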