Text-to-image generation algorithm based on an improved diffusion model combined with conditional control
Author:
Affiliation:

1. Shenyang University of Technology; 2. School of Science, Shenyang University of Technology, Shenyang, Liaoning; 3. School of Information and Computing Science, Northern University for Nationalities, Yinchuan, Ningxia

Fund Project:

National Natural Science Foundation of China (11861003); Liaoning Provincial Department of Education Basic Research Project for Higher Education Institutions (LJKZ0157)

    Abstract:

    A novel text-to-image generation method based on the diffusion model is proposed to address the low image fidelity, cumbersome generation procedures, and limited applicability to specific task scenarios of existing text-to-image generation methods. The method takes the current mainstream diffusion model as its backbone network and designs a new residual block structure that effectively improves generation performance. A CBAM (Convolutional Block Attention Module) attention module is added to the noise estimation network to strengthen the model's ability to extract key information from images and further raise the quality of generated images. Finally, the model is combined with a conditional control network to achieve text-to-image generation for specified poses. Qualitative and quantitative analyses, as well as ablation experiments, were conducted on the CelebA-HQ dataset against the leading methods KNN-Diffusion, CogView2, textStyleGAN, and Simple Diffusion. According to the evaluation metrics and generation results, the method effectively improves the quality of text-generated images, with an average decrease of 36.4% in FID and average increases of 11.4% in IS and 3.9% in structural similarity. Combined with the conditional control network, the task of generating images with directed poses from text is achieved.
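    To make the pipeline concrete: below is a minimal PyTorch sketch of the standard CBAM module (Woo et al., ECCV 2018) wired into a generic residual block of the kind used in diffusion noise-estimation U-Nets. The paper's exact block design is not reproduced here, so the layer layout, normalization settings, and all class names (CBAM, ResBlockCBAM) are illustrative assumptions, not the authors' implementation.

        import torch
        import torch.nn as nn

        class CBAM(nn.Module):
            # Convolutional Block Attention Module: channel attention followed
            # by spatial attention, following Woo et al. (ECCV 2018).
            def __init__(self, channels, reduction=16, spatial_kernel=7):
                super().__init__()
                # Shared MLP (1x1 convs) for channel attention.
                self.mlp = nn.Sequential(
                    nn.Conv2d(channels, channels // reduction, 1, bias=False),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(channels // reduction, channels, 1, bias=False))
                # Single conv over stacked channel-wise avg/max maps.
                self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                         padding=spatial_kernel // 2, bias=False)

            def forward(self, x):
                # Channel attention: sigmoid(MLP(avg-pool) + MLP(max-pool)).
                avg = x.mean(dim=(2, 3), keepdim=True)
                mx = x.amax(dim=(2, 3), keepdim=True)
                x = x * torch.sigmoid(self.mlp(avg) + self.mlp(mx))
                # Spatial attention over channel-wise statistics.
                s = torch.cat([x.mean(dim=1, keepdim=True),
                               x.amax(dim=1, keepdim=True)], dim=1)
                return x * torch.sigmoid(self.spatial(s))

        class ResBlockCBAM(nn.Module):
            # Hypothetical residual block with CBAM on the residual branch;
            # channel count must be divisible by the GroupNorm groups (8 here).
            def __init__(self, channels):
                super().__init__()
                self.body = nn.Sequential(
                    nn.GroupNorm(8, channels), nn.SiLU(),
                    nn.Conv2d(channels, channels, 3, padding=1),
                    nn.GroupNorm(8, channels), nn.SiLU(),
                    nn.Conv2d(channels, channels, 3, padding=1),
                    CBAM(channels))

            def forward(self, x):
                return x + self.body(x)

    Because the block preserves the input shape (for example, ResBlockCBAM(64) maps a 1x64x32x32 tensor to the same shape), such a block can stand in for a plain residual block anywhere in the noise-estimation network.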
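    The "conditional control network" for pose-directed generation follows the ControlNet paradigm (see reference [26]). As a rough illustration of that workflow only, not the authors' model, the sketch below drives a public OpenPose ControlNet checkpoint through Hugging Face's diffusers library; the checkpoint identifiers and the pose-image path are assumptions made for this example.

        import torch
        from PIL import Image
        from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

        # Load a publicly available OpenPose ControlNet (assumed checkpoint).
        controlnet = ControlNetModel.from_pretrained(
            "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
        pipe = StableDiffusionControlNetPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
            torch_dtype=torch.float16).to("cuda")

        # Hypothetical pose-skeleton image (e.g. an OpenPose rendering).
        pose_map = Image.open("pose_skeleton.png")

        # The text prompt supplies content; the pose map constrains layout.
        image = pipe("a smiling woman with long black hair",
                     image=pose_map,
                     num_inference_steps=30).images[0]
        image.save("pose_conditioned_sample.png")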

    References
    [1] Zhu X, Goldberg A B, Eldawy M, et al. A text-to-picture synthesis system for augmenting communication[C]// Proceedings of the AAAI Conference on Artificial Intelligence. British Columbia: AAAI, 2007, 7: 1590-1595.
    [2] Cao Yin, Qin Junping, Gao Tong, et al. A two-stage method for generating high-quality images from text based on generative adversarial networks[J]. Journal of Zhejiang University (Engineering Science), 2024, 58(4): 674-683.
    [3] Michalczak M, Ligas M. Short-term prediction of UT1-UTC and LOD via Dynamic Mode Decomposition and combination of least-squares and vector autoregressive model[J]. Reports on Geodesy and Geoinformatics, 2024, 117(1): 45-54.
    [4] Yi X, Tang L, Zhang H, et al. Diff-IF: multi-modality image fusion via diffusion model with fusion knowledge prior[J]. Information Fusion, 2024, 110: 102450.
    [5] Liu Yusheng, Xiao Xuezhong. High-fidelity image editing based on diffusion model fine-tuning[J/OL]. [2024-03-14] [2024-06-05]. http://kns.cnki.net/kcms/detail/51.1307.TP.20240312.1538.012.html.
    [6] Hao Wenyue, Cai Huaiyu, Zuo Tingtao, et al. Self-supervised pre-trained IVUS image segmentation method based on diffusion model[J/OL]. [2024-03-14] [2024-06-05]. http://kns.cnki.net/kcms/detail/31.1690.TN.20240220.1057.058.html.
    [7] Qian Feng, Hu Guiming, Zhu Neng, et al. Image rain removal method based on improved diffusion model[J]. Journal of Chongqing University of Technology (Natural Science), 2024, 38(1): 59-66.
    [8] Zeng Y, Chen X, Zhang Y, et al. Dense-U-Net: densely connected convolutional network for semantic segmentation with a small number of samples[C]// Nanjing University of Science and Technology (China), 2019: 67-69.
    [9] Wu F, Qi Z. Multi-layer stacks of GaN/n-Al0.5GaN self-assembled quantum dots grown by metal-organic chemical vapor deposition[C]// Central China Normal University (China); Huazhong Institute of Electro-Optics (China), 2019: 84-92.
    [10] Han J, Liu J. HFGAN-CN: T2I model via text-image hierarchical attention fusion[C]// Proceedings of the 34th China Control and Decision Conference (5). 2022: 6. DOI: 10.26914/c.cnkihy.2022.025346.
    [11] Li B, Qi X, Lukasiewicz T, et al. Controllable text-to-image generation[J]. Advances in Neural Information Processing Systems, 2019, 32(18): 2065-2075.
    [12] Tan H, Liu X, Li X, et al. Semantics-enhanced adversarial nets for text-to-image synthesis[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Long Beach: IEEE, 2019: 10501-10510.
    [13] Tan H, Liu X, Li X, et al. Semantics-enhanced adversarial nets for text-to-image synthesis[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Long Beach: IEEE, 2019: 10501-10510.
    [14] Zhang H, Yin W, Fang Y, et al. ERNIE-ViLG: unified generative pre-training for bidirectional vision-language generation[EB/OL]. [2023-07-01] [2024-05-31]. https://arxiv.org/abs/2112.15283.
    [15] Ding M, Yang Z, Hong W, et al. CogView: mastering text-to-image generation via transformers[J]. Advances in Neural Information Processing Systems, 2021, 34(18): 19822-19835.
    [16] Ding M, Zheng W, Hong W, et al. CogView2: faster and better text-to-image generation via hierarchical transformers[EB/OL]. [2022-05-27] [2024-06-08]. https://arxiv.org/pdf/2204.14217.
    [17] Sohl-Dickstein J, Weiss E, Maheswaranathan N, et al. Deep unsupervised learning using nonequilibrium thermodynamics[C]// International Conference on Machine Learning. Lille: PMLR, 2015: 2256-2265.
    [18] Nichol A Q, Dhariwal P, Ramesh A, et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models[C]// International Conference on Machine Learning. Long Beach: IEEE, 2022: 16784-16804.
    [19] Sehwag V, Hazirbas C, Gordo A, et al. Generating high fidelity data from low-density regions using diffusion models[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022: 11492-11501.
    [20] Austin J, Johnson D D, Ho J, et al. Structured denoising diffusion models in discrete state-spaces[C]// Advances in Neural Information Processing Systems. 2021: 17981-17993.
    [21] Jolicoeur-Martineau A, Piché-Taillefer R, Tachet des Combes R, et al. Adversarial score matching and improved sampling for image generation[J]. arXiv: 2009.05475, 2020.
    [22] Kim D, Na B, Kwon S J, et al. Maximum likelihood training of implicit nonlinear diffusion models[J]. arXiv: 2205.13699, 2022.
    [23] Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 10684-10695.
    [24] Shichao Z, Hangchi S, Shukai D, et al. Position adaptive residual block and knowledge complement strategy for point cloud analysis[J]. Artificial Intelligence Review, 2024, 57(5): 19-23.
    [25] Mekruksavanich S, Jitpattanakul A. Deep residual network with a CBAM mechanism for the recognition of symmetric and asymmetric human activity using wearable sensors[J]. Symmetry, 2024, 16(5): 67-73.
    [26] Qin Z. A multimodal diffusion-based interior design AI with ControlNet[J]. Journal of Artificial Intelligence Practice, 2024, 7(1): 25-27.
    [27] Sheynin S, Ashual O, Polyak A, et al. KNN-Diffusion: image generation via large-scale retrieval[EB/OL]. [2022-10-02] [2024-06-04]. https://arxiv.org/pdf/2204.02849.
    [28] Ding M, Zheng W, Hong W, et al. CogView2: faster and better text-to-image generation via hierarchical transformers[EB/OL]. [2022-05-27] [2024-06-04]. https://arxiv.org/pdf/2204.14217.
    [29] Zhang Y, Lu H. Deep cross-modal projection learning for image-text matching[C]// Proceedings of the European Conference on Computer Vision. Long Beach: IEEE, 2018: 686-701.
    [30] Hoogeboom E, Heek J, Salimans T. Simple diffusion: end-to-end diffusion for high resolution images[EB/OL]. [2023-01-26] [2024-06-07]. https://doi.org/10.48550/arXiv.2301.11093.

History
  • Received: June 19, 2024
  • Revised: December 24, 2024
  • Accepted: February 27, 2025
