Date of Award


Publication Type


Degree Name



Computer Science

First Advisor

I. Ahmad

Second Advisor

M. Khalid

Third Advisor

B. Boufama


Keywords

Diffusion models, Generative Adversarial Networks, Generative models, Text-to-image generation



Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.


Abstract

The development of deep learning algorithms has tremendously advanced computer vision applications, image processing methods, Artificial Intelligence, and Natural Language Processing. One such application is image synthesis: the creation of new images from text. Recent techniques for text-to-image synthesis offer an intriguing yet straightforward conversion from text to image and have become a popular research topic. Synthesizing images from text descriptions has practical and creative applications in computer-aided design, multimodal learning, digital art creation, and more. Non-Fungible Tokens (NFTs) are a form of digital art traded as tokens across the globe, and text-to-image generators let anyone with enough creativity develop digital art that can be used as NFTs. They can also be beneficial for the development of synthetic datasets. A Generative Adversarial Network (GAN) is a generative model that can generate new data using a training set. Diffusion models are another type of generative model: they add random noise to the data, learn to reverse the diffusion process, and can then create desired data samples from noise. This thesis compares both models to determine which is better at producing images that match a given description. We have implemented the Vector-Quantized GAN (VQGAN) + Contrastive Language-Image Pre-training (CLIP) model, which combines the VQGAN and CLIP machine learning techniques to create images from text input. The diffusion model we have implemented is Guided Language to Image Diffusion for Generation and Editing (GLIDE). For both models, we use text input from the MS-COCO dataset. This thesis is an attempt to assess and compare the images generated from text by both models using metrics such as Inception Score (IS) and Fréchet Inception Distance (FID). The Semantic Object Accuracy (SOA) score is another metric we use, one that also considers the caption used during the image generation process.
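The forward (noising) half of the diffusion process described above can be sketched in a few lines. The closed-form sampling of a noised image at step t follows the standard DDPM formulation; the linear noise schedule below is illustrative and not necessarily the one GLIDE uses.

```python
import numpy as np

def forward_diffusion(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) in closed form.

    x0:    clean data (any numpy array, e.g. an image)
    t:     timestep index into the schedule
    betas: noise schedule, one variance per step
    """
    alphas = 1.0 - betas
    alpha_bar = np.prod(alphas[: t + 1])  # cumulative product up to step t
    noise = np.random.randn(*x0.shape)
    # The signal coefficient shrinks toward 0 and the noise coefficient
    # grows toward 1 as t increases, so late steps are nearly pure noise.
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

# Illustrative linear schedule over 1000 steps (an assumption, not GLIDE's)
betas = np.linspace(1e-4, 0.02, 1000)
x0 = np.ones((8, 8))                         # toy "image"
x_noisy = forward_diffusion(x0, 999, betas)  # almost pure Gaussian noise
```

A trained diffusion model learns the reverse of this map, predicting the noise so that samples can be drawn by starting from pure noise and denoising step by step.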
We compute and compare the results for each label in the MS-COCO dataset. Through analysis of the results obtained, we highlight potential causes of the models' failures to generate faithful images. Our experimental results indicate that the GLIDE model outperforms VQGAN-CLIP on our task of generating images from text.
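As an illustration of the FID metric used in the comparison, the following sketch computes the Fréchet distance between two sets of Inception feature vectors. It assumes the (N, D) activation arrays have already been extracted; in practice D = 2048 features come from a pretrained Inception-v3 network, which is not shown here.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """Fréchet Inception Distance between two sets of feature activations.

    FID = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 * sqrt(C1 @ C2)),
    treating each feature set as a multivariate Gaussian.
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov1 = np.cov(feats_real, rowvar=False)
    cov2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):  # round-off can leave a tiny imaginary part
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))   # stand-in for Inception features
fake = real + 5.0                  # a clearly shifted distribution
print(fid(real, real) < fid(real, fake))
```

Identical feature sets give a distance of (numerically) zero, and the score grows as the generated distribution drifts from the real one, which is why lower FID indicates better image quality.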