Abstract: A novel text-to-image generation method based on the diffusion model is proposed to address the low image fidelity, cumbersome generation operations, and limited applicability to specific task scenarios of existing text-to-image generation methods. The method takes the current mainstream diffusion model as its backbone and designs a new residual block structure that effectively improves generation performance. A CBAM (Convolutional Block Attention Module) is added to improve the noise estimation network, enhancing the model's ability to extract key information from images and further improving the quality of the generated images. Finally, a conditional control network is combined with the model to achieve text-to-image generation for specific poses. Qualitative and quantitative analyses, as well as ablation experiments, were conducted on the CelebA-HQ dataset against the leading methods KNN-Diffusion, CogView2, TextStyleGAN, and SimpleDiffusion. The evaluation metrics and generation results show that the method effectively improves the quality of text-generated images, with an average decrease of 36.4% in FID and average increases of 11.4% and 3.9% in IS and structural similarity (SSIM), respectively. Combined with the conditional control network, the task of generating text-conditioned images with directed actions is achieved.
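The CBAM module mentioned in the abstract applies channel attention followed by spatial attention to a feature map. As an illustration only, the following is a minimal NumPy sketch of that mechanism; the weight shapes, reduction ratio, and 7x7 kernel size are assumptions for demonstration, not the paper's actual configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """Channel attention: avg- and max-pooled vectors through a shared MLP.

    x: feature map of shape (C, H, W)
    w1: (C, C // r), w2: (C // r, C) -- shared MLP weights, r = reduction ratio
    """
    avg = x.mean(axis=(1, 2))                       # (C,) global average pooling
    mx = x.max(axis=(1, 2))                         # (C,) global max pooling
    att = sigmoid(np.maximum(avg @ w1, 0.0) @ w2 +  # shared MLP on both branches,
                  np.maximum(mx @ w1, 0.0) @ w2)    # summed, then sigmoid -> (C,)
    return x * att[:, None, None]                   # reweight each channel

def spatial_attention(x, kernel):
    """Spatial attention: conv over stacked channel-wise avg and max maps.

    x: (C, H, W); kernel: (2, k, k) convolution weights (k odd, e.g. 7)
    """
    C, H, W = x.shape
    feat = np.stack([x.mean(axis=0), x.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    p = k // 2
    padded = np.pad(feat, ((0, 0), (p, p), (p, p)))   # same-size padding
    out = np.zeros((H, W))
    for i in range(H):                                # naive 2D correlation
        for j in range(W):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return x * sigmoid(out)[None, :, :]               # reweight each position

def cbam(x, w1, w2, kernel):
    """CBAM: channel attention, then spatial attention."""
    return spatial_attention(channel_attention(x, w1, w2), kernel)
```

Because both attention maps are sigmoid-valued (in (0, 1)), the module rescales features without changing the tensor shape, so it can be dropped into a residual block of the noise estimation network.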