Exploring the Architecture and Application of Transformer Models

Authors: Zhairui Shen, Tianwei Wang

Advisor: Vitaly Ford

Department of Computer Science and Mathematics, Arcadia University

Introduction

Background: Transformer models have significantly advanced natural language processing (NLP) and computer vision tasks through their self-attention mechanism. Our research extends beyond traditional text processing to automatic text-to-image and text-to-video content generation.

Research Objective: Optimize the architecture of Transformer models to improve the accuracy of text-to-image and text-to-video generation, while accelerating training and reducing computational costs.

Methodology

Data Preparation

  • Tokenization using a pre-trained BERT model.
  • Normalizing and resizing images to 256x256 resolution.
  • Handling missing or noisy data by filtering out inconsistent samples (see the sketch after this list).
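
A minimal sketch of these preparation steps, assuming the Hugging Face transformers tokenizer, torchvision, and Pillow; the model name, caption length, and validity checks are illustrative assumptions rather than the project's exact settings.

    # Illustrative data-preparation sketch (assumed libraries: transformers, torchvision, Pillow).
    from transformers import BertTokenizerFast
    from torchvision import transforms
    from PIL import Image

    # Text: tokenize captions with a pre-trained BERT tokenizer (model name assumed).
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    def encode_caption(caption: str, max_length: int = 64):
        return tokenizer(
            caption,
            padding="max_length",
            truncation=True,
            max_length=max_length,
            return_tensors="pt",
        )

    # Images: resize to 256x256 and normalize channels to roughly [-1, 1].
    image_transform = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])

    # Filtering: drop samples with empty captions or unreadable image files.
    def is_valid_sample(caption: str, image_path: str) -> bool:
        if not caption or not caption.strip():
            return False
        try:
            Image.open(image_path).verify()
        except Exception:
            return False
        return True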

Model Optimization Techniques

  • Self-Attention: Allows the model to focus on different parts of the input sequence when generating images.
  • Sparse Attention: Improves computational efficiency by restricting attention to smaller subsets of tokens.
  • Positional Encoding: Preserves input sequence order to keep the generated images aligned with the text (see the sketch after this list).
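
The sketch below illustrates these three mechanisms in PyTorch: scaled dot-product self-attention, a simple local-window form of sparse attention, and the sinusoidal positional encoding of Vaswani et al. (2017). Tensor shapes and the window size are illustrative assumptions, not the configuration used in our experiments.

    import math
    import torch

    def self_attention(q, k, v, mask=None):
        # q, k, v: (batch, seq_len, d_model); standard scaled dot-product attention.
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

    def local_window_mask(seq_len, window=4):
        # Sparse attention via a local window: each token attends only to
        # neighbors within `window` positions, reducing the effective cost.
        idx = torch.arange(seq_len)
        return (idx[None, :] - idx[:, None]).abs() <= window

    def positional_encoding(seq_len, d_model):
        # Fixed sinusoidal encoding added to embeddings so token order is retained.
        pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    # Usage with toy shapes: batch of 2 sequences, 16 tokens, 32-dimensional embeddings.
    x = torch.randn(2, 16, 32) + positional_encoding(16, 32)
    out = self_attention(x, x, x, mask=local_window_mask(16))  # (2, 16, 32)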

Training Acceleration Techniques

  • Mixed Precision Training for faster convergence and reduced memory consumption.
  • Distributed Training across multiple GPUs to accelerate training (see the sketch after this list).
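
A hedged sketch of both techniques using PyTorch's built-in automatic mixed precision and DistributedDataParallel; the model, optimizer, and data loader are placeholders rather than our actual training setup.

    import torch

    def train_one_epoch(model, optimizer, loader, device):
        # Mixed precision: run the forward pass in reduced precision where safe,
        # and scale the loss so small gradients do not underflow in float16.
        scaler = torch.cuda.amp.GradScaler()
        model.train()
        for texts, images in loader:
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():
                loss = model(texts.to(device), images.to(device))
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

    # Distributed training: after torch.distributed.init_process_group() has been
    # called (e.g. when launched with torchrun), wrap the model so gradients are
    # averaged across GPUs each step:
    #   model = torch.nn.parallel.DistributedDataParallel(model.to(device), device_ids=[local_rank])
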
Results

Quantitative Evaluation

In this graph, we compare PSNR, SSIM, and Inception Score across 10 epochs of training.


The PSNR values after optimization show consistent improvement, peaking at 30 dB, indicating better image quality. The SSIM values rise from 0.6 to 0.9, demonstrating improved structural similarity in the images.

PSNR Graph Comparison
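
For reference, PSNR is derived from the mean squared error between a generated image and its reference (PSNR = 10 · log10(MAX^2 / MSE)), while SSIM compares local luminance, contrast, and structure. Below is a minimal evaluation sketch, assuming scikit-image and uint8 images; our full evaluation pipeline is not shown here.

    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def evaluate_pair(generated: np.ndarray, reference: np.ndarray):
        # Both images: uint8 arrays of shape (256, 256, 3) in the same color space.
        psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
        ssim = structural_similarity(reference, generated, channel_axis=-1, data_range=255)
        return psnr, ssim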

The FID before optimization starts high at 140, showing minimal improvement over epochs. After optimization, it decreases significantly from 120 to 60, indicating better perceptual quality in the generated images.

FID and Loss Comparison
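
FID compares Inception-v3 feature statistics of generated and real images, so lower scores mean the two distributions are closer. The sketch below uses torchmetrics' FrechetInceptionDistance as one off-the-shelf option; this is an assumption on our part rather than the exact tool used in the project, and it requires the torch-fidelity package to be installed.

    from torchmetrics.image.fid import FrechetInceptionDistance

    def compute_fid(real_loader, fake_loader):
        # Loaders yield uint8 image batches of shape (N, 3, H, W).
        fid = FrechetInceptionDistance(feature=2048)
        for real_batch in real_loader:
            fid.update(real_batch, real=True)    # accumulate statistics of real images
        for fake_batch in fake_loader:
            fid.update(fake_batch, real=False)   # accumulate statistics of generated images
        return fid.compute()                     # lower is better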

In this figure, we compare GPU usage and image diversity over the 10 epochs.

GPU usage before optimization hovers around 65-75%, while after optimization it rises to 80-95%, reflecting better utilization of GPU resources during training.

Image Diversity Score also shows a marked improvement. Pre-optimization scores are between 2.5 and 3.5, while post-optimization scores increase to 4.0-5.0, indicating greater variation and creativity in generated images.

GPU and Image Diversity Comparison
Summary

Key Findings

  • Optimization techniques reduced computational cost by 30%, maintaining high accuracy.
  • Mixed precision training accelerated the process by 50%, with no loss in model performance.
  • Distributed training allowed for faster convergence with large datasets.

Challenges Encountered

Balancing performance and computational cost was challenging, especially during GAN integration for image generation.

Future Directions

Advanced attention mechanisms like adaptive attention could further improve efficiency in text-to-image tasks. Additionally, research into text-to-video generation is crucial for achieving coherent video sequences based on textual descriptions.

References
  1. Vaswani, A., et al. (2017). Attention is All You Need. arXiv preprint arXiv:1706.03762.
  2. Koh, J., Park, S., & Song, J. (2024). Improving Text Generation on Images with Synthetic Captions.
  3. Khan, F. (2023). Solving Transformer by Hand: A Step-by-Step Math Example.