Text-to-image generation algorithm based on an improved diffusion model combined with conditional control
Author:
Affiliation:

1. Shenyang University of Technology; 2. School of Science, Shenyang University of Technology, Shenyang, Liaoning; 3. School of Information and Computing Science, Northern University for Nationalities, Yinchuan, Ningxia

Fund Project:

National Natural Science Foundation of China (11861003); Liaoning Provincial Department of Education Basic Research Project for Higher Education Institutions (LJKZ0157)

    Abstract:

    A novel text-to-image generation method based on the diffusion model is proposed to address the low image fidelity, cumbersome generation procedures, and limited applicability to specific task scenarios of existing text-to-image generation methods. The method takes the current mainstream diffusion model as its backbone network and designs a new residual block structure that effectively improves generation performance. A CBAM (Convolutional Block Attention Module) attention module is added to the noise estimation network to strengthen the model's ability to extract key information from images and further raise the quality of generated images. Finally, the model is combined with a conditional control network to achieve text-to-image generation for specified poses. Qualitative and quantitative analyses, as well as ablation experiments, were conducted on the CelebA-HQ dataset against the leading methods KNN-Diffusion, CogView2, textStyleGAN, and Simple Diffusion. According to the evaluation metrics and generation results, the method effectively improves the quality of text-generated images, with an average decrease of 36.4% in FID and average increases of 11.4% in IS and 3.9% in structural similarity. Combined with the conditional control network, the task of generating images with directed poses from text is achieved.
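    To make the pipeline concrete: below is a minimal PyTorch sketch of the standard CBAM module (Woo et al., ECCV 2018) wired into a generic residual block of the kind used in diffusion noise-estimation U-Nets. The paper's exact block design is not reproduced here, so the layer layout, normalization settings, and all class names (CBAM, ResBlockCBAM) are illustrative assumptions, not the authors' implementation.

        import torch
        import torch.nn as nn

        class CBAM(nn.Module):
            # Convolutional Block Attention Module: channel attention followed
            # by spatial attention, following Woo et al. (ECCV 2018).
            def __init__(self, channels, reduction=16, spatial_kernel=7):
                super().__init__()
                # Shared MLP (1x1 convs) for channel attention.
                self.mlp = nn.Sequential(
                    nn.Conv2d(channels, channels // reduction, 1, bias=False),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(channels // reduction, channels, 1, bias=False))
                # Single conv over stacked channel-wise avg/max maps.
                self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                         padding=spatial_kernel // 2, bias=False)

            def forward(self, x):
                # Channel attention: sigmoid(MLP(avg-pool) + MLP(max-pool)).
                avg = x.mean(dim=(2, 3), keepdim=True)
                mx = x.amax(dim=(2, 3), keepdim=True)
                x = x * torch.sigmoid(self.mlp(avg) + self.mlp(mx))
                # Spatial attention over channel-wise statistics.
                s = torch.cat([x.mean(dim=1, keepdim=True),
                               x.amax(dim=1, keepdim=True)], dim=1)
                return x * torch.sigmoid(self.spatial(s))

        class ResBlockCBAM(nn.Module):
            # Hypothetical residual block with CBAM on the residual branch;
            # channel count must be divisible by the GroupNorm groups (8 here).
            def __init__(self, channels):
                super().__init__()
                self.body = nn.Sequential(
                    nn.GroupNorm(8, channels), nn.SiLU(),
                    nn.Conv2d(channels, channels, 3, padding=1),
                    nn.GroupNorm(8, channels), nn.SiLU(),
                    nn.Conv2d(channels, channels, 3, padding=1),
                    CBAM(channels))

            def forward(self, x):
                return x + self.body(x)

    Because the block preserves the input shape (for example, ResBlockCBAM(64) maps a 1x64x32x32 tensor to the same shape), such a block can stand in for a plain residual block anywhere in the noise-estimation network.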
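    The "conditional control network" for pose-directed generation follows the ControlNet paradigm (see reference [26]). As a rough illustration of that workflow only, not the authors' model, the sketch below drives a public OpenPose ControlNet checkpoint through Hugging Face's diffusers library; the checkpoint identifiers and the pose-image path are assumptions made for this example.

        import torch
        from PIL import Image
        from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

        # Load a publicly available OpenPose ControlNet (assumed checkpoint).
        controlnet = ControlNetModel.from_pretrained(
            "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
        pipe = StableDiffusionControlNetPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
            torch_dtype=torch.float16).to("cuda")

        # Hypothetical pose-skeleton image (e.g. an OpenPose rendering).
        pose_map = Image.open("pose_skeleton.png")

        # The text prompt supplies content; the pose map constrains layout.
        image = pipe("a smiling woman with long black hair",
                     image=pose_map,
                     num_inference_steps=30).images[0]
        image.save("pose_conditioned_sample.png")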

    References
    [1] Zhu X, Goldberg A B, Eldawy M, et al. A text-to-picture synthesis system for augmenting communication[C]// Proceedings of the AAAI Conference on Artificial Intelligence. British Columbia: AAAI, 2007, 7: 1590-1595.
    [2] Cao Yin, Qin Junping, Gao Tong, et al. A two-stage method for generating high-quality images from text based on generative adversarial networks[J]. Journal of Zhejiang University (Engineering Science), 2024, 58(4): 674-683.
    [3] Michalczak M, Ligas M. Short-term prediction of UT1-UTC and LOD via Dynamic Mode Decomposition and combination of least-squares and vector autoregressive model[J]. Reports on Geodesy and Geoinformatics, 2024, 117(1): 45-54.
    [4] Yi X, Tang L, Zhang H, et al. Diff-IF: multi-modality image fusion via diffusion model with fusion knowledge prior[J]. Information Fusion, 2024, 110: 102450.
    [5] Liu Yusheng, Xiao Xuezhong. High-fidelity image editing based on diffusion model fine-tuning[J/OL]. [2024-03-14] [2024-06-05]. http://kns.cnki.net/kcms/detail/51.1307.TP.20240312.1538.012.html.
    [6] Hao Wenyue, Cai Huaiyu, Zuo Tingtao, et al. Self-supervised pre-trained IVUS image segmentation method based on diffusion model[J/OL]. [2024-03-14] [2024-06-05]. http://kns.cnki.net/kcms/detail/31.1690.TN.20240220.1057.058.html.
    [7] Qian Feng, Hu Guiming, Zhu Neng, et al. Image rain removal method based on improved diffusion model[J]. Journal of Chongqing University of Technology (Natural Science), 2024, 38(1): 59-66.
    [8] Zeng Y, Chen X, Zhang Y, et al. Dense-U-Net: densely connected convolutional network for semantic segmentation with a small number of samples[C]// Nanjing University of Science and Technology (China), 2019: 67-69.
    [9] Wu F, Qi Z. Multi-layer stacks of GaN/n-Al0.5GaN self-assembled quantum dots grown by metal-organic chemical vapor deposition[C]// Central China Normal University (China); Huazhong Institute of Electro-Optics (China), 2019: 84-92.
    [10] Han J, Liu J. HFGAN-CN: T2I model via text-image hierarchical attention fusion[C]// Proceedings of the 34th China Control and Decision Conference (5). 2022: 6. DOI: 10.26914/c.cnkihy.2022.025346.
    [11] Li B, Qi X, Lukasiewicz T, et al. Controllable text-to-image generation[J]. Advances in Neural Information Processing Systems, 2019, 32(18): 2065-2075.
    [12] Tan H, Liu X, Li X, et al. Semantics-enhanced adversarial nets for text-to-image synthesis[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Long Beach: IEEE, 2019: 10501-10510.
    [13] Tan H, Liu X, Li X, et al. Semantics-enhanced adversarial nets for text-to-image synthesis[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Long Beach: IEEE, 2019: 10501-10510.
    [14] Zhang H, Yin W, Fang Y, et al. ERNIE-ViLG: unified generative pre-training for bidirectional vision-language generation[EB/OL]. [2023-07-01] [2024-05-31]. https://arxiv.org/abs/2112.15283.
    [15] Ding M, Yang Z, Hong W, et al. CogView: mastering text-to-image generation via transformers[J]. Advances in Neural Information Processing Systems, 2021, 34(18): 19822-19835.
    [16] Ding M, Zheng W, Hong W, et al. CogView2: faster and better text-to-image generation via hierarchical transformers[EB/OL]. [2022-05-27] [2024-06-08]. https://arxiv.org/pdf/2204.14217.
    [17] Sohl-Dickstein J, Weiss E, Maheswaranathan N, et al. Deep unsupervised learning using nonequilibrium thermodynamics[C]// International Conference on Machine Learning. Lille: PMLR, 2015: 2256-2265.
    [18] Nichol A Q, Dhariwal P, Ramesh A, et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models[C]// International Conference on Machine Learning. Long Beach: IEEE, 2022: 16784-16804.
    [19] Sehwag V, Hazirbas C, Gordo A, et al. Generating high fidelity data from low-density regions using diffusion models[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022: 11492-11501.
    [20] Austin J, Johnson D D, Ho J, et al. Structured denoising diffusion models in discrete state-spaces[C]// Advances in Neural Information Processing Systems. 2021: 17981-17993.
    [21] Jolicoeur-Martineau A, Piché-Taillefer R, Tachet des Combes R, et al. Adversarial score matching and improved sampling for image generation[J]. arXiv: 2009.05475, 2020.
    [22] Kim D, Na B, Kwon S J, et al. Maximum likelihood training of implicit nonlinear diffusion models[J]. arXiv: 2205.13699, 2022.
    [23] Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 10684-10695.
    [24] Shichao Z, Hangchi S, Shukai D, et al. Position adaptive residual block and knowledge complement strategy for point cloud analysis[J]. Artificial Intelligence Review, 2024, 57(5): 19-23.
    [25] Mekruksavanich S, Jitpattanakul A. Deep residual network with a CBAM mechanism for the recognition of symmetric and asymmetric human activity using wearable sensors[J]. Symmetry, 2024, 16(5): 67-73.
    [26] Qin Z. A multimodal diffusion-based interior design AI with ControlNet[J]. Journal of Artificial Intelligence Practice, 2024, 7(1): 25-27.
    [27] Sheynin S, Ashual O, Polyak A, et al. KNN-Diffusion: image generation via large-scale retrieval[EB/OL]. [2022-10-02] [2024-06-04]. https://arxiv.org/pdf/2204.02849.
    [28] Ding M, Zheng W, Hong W, et al. CogView2: faster and better text-to-image generation via hierarchical transformers[EB/OL]. [2022-05-27] [2024-06-04]. https://arxiv.org/pdf/2204.14217.
    [29] Zhang Y, Lu H. Deep cross-modal projection learning for image-text matching[C]// Proceedings of the European Conference on Computer Vision. Long Beach: IEEE, 2018: 686-701.
    [30] Hoogeboom E, Heek J, Salimans T. Simple diffusion: end-to-end diffusion for high resolution images[EB/OL]. [2023-01-26] [2024-06-07]. https://doi.org/10.48550/arXiv.2301.11093.

History
  • Received: June 19, 2024
  • Revised: December 24, 2024
  • Accepted: February 27, 2025
