Exploring the Architecture and Application of Transformer Models

Authors: Zhairui Shen, Tianwei Wang

Advisor: Vitaly Ford

Department of Computer Science and Mathematics, Arcadia University

Introduction

Background: Transformer models have significantly advanced natural language processing (NLP) and computer vision tasks through their self-attention mechanism. Our research extends beyond traditional text processing to automatic text-to-image and text-to-video content generation.

Research Objective: Optimize the architecture of Transformer models to improve the accuracy of text-to-image and text-to-video generation, while accelerating training and reducing computational costs.

Methodology

Data Preparation

  • Tokenization using a pre-trained BERT model.
  • Normalizing and resizing images to 256x256 resolution.
  • Handling missing or noisy data by filtering out inconsistent samples (see the sketch after this list).
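
A minimal sketch of these preparation steps, assuming the Hugging Face transformers tokenizer, torchvision, and Pillow; the model name, caption length, and validity checks are illustrative assumptions rather than the project's exact settings.

    # Illustrative data-preparation sketch (assumed libraries: transformers, torchvision, Pillow).
    from transformers import BertTokenizerFast
    from torchvision import transforms
    from PIL import Image

    # Text: tokenize captions with a pre-trained BERT tokenizer (model name assumed).
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    def encode_caption(caption: str, max_length: int = 64):
        return tokenizer(
            caption,
            padding="max_length",
            truncation=True,
            max_length=max_length,
            return_tensors="pt",
        )

    # Images: resize to 256x256 and normalize channels to roughly [-1, 1].
    image_transform = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])

    # Filtering: drop samples with empty captions or unreadable image files.
    def is_valid_sample(caption: str, image_path: str) -> bool:
        if not caption or not caption.strip():
            return False
        try:
            Image.open(image_path).verify()
        except Exception:
            return False
        return True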

Model Optimization Techniques

  • Self-Attention: Allows the model to focus on different parts of the input sequence when generating images.
  • Sparse Attention: Improves computational efficiency by restricting attention to smaller subsets of tokens.
  • Positional Encoding: Preserves input sequence order to keep the generated images aligned with the text (see the sketch after this list).
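
The sketch below illustrates these three mechanisms in PyTorch: scaled dot-product self-attention, a simple local-window form of sparse attention, and the sinusoidal positional encoding of Vaswani et al. (2017). Tensor shapes and the window size are illustrative assumptions, not the configuration used in our experiments.

    import math
    import torch

    def self_attention(q, k, v, mask=None):
        # q, k, v: (batch, seq_len, d_model); standard scaled dot-product attention.
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

    def local_window_mask(seq_len, window=4):
        # Sparse attention via a local window: each token attends only to
        # neighbors within `window` positions, reducing the effective cost.
        idx = torch.arange(seq_len)
        return (idx[None, :] - idx[:, None]).abs() <= window

    def positional_encoding(seq_len, d_model):
        # Fixed sinusoidal encoding added to embeddings so token order is retained.
        pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    # Usage with toy shapes: batch of 2 sequences, 16 tokens, 32-dimensional embeddings.
    x = torch.randn(2, 16, 32) + positional_encoding(16, 32)
    out = self_attention(x, x, x, mask=local_window_mask(16))  # (2, 16, 32)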

Training Acceleration Techniques

  • Mixed Precision Training for faster convergence and reduced memory consumption.
  • Distributed Training across multiple GPUs to accelerate training (see the sketch after this list).
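
A hedged sketch of both techniques using PyTorch's built-in automatic mixed precision and DistributedDataParallel; the model, optimizer, and data loader are placeholders rather than our actual training setup.

    import torch

    def train_one_epoch(model, optimizer, loader, device):
        # Mixed precision: run the forward pass in reduced precision where safe,
        # and scale the loss so small gradients do not underflow in float16.
        scaler = torch.cuda.amp.GradScaler()
        model.train()
        for texts, images in loader:
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():
                loss = model(texts.to(device), images.to(device))
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

    # Distributed training: after torch.distributed.init_process_group() has been
    # called (e.g. when launched with torchrun), wrap the model so gradients are
    # averaged across GPUs each step:
    #   model = torch.nn.parallel.DistributedDataParallel(model.to(device), device_ids=[local_rank])
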
Results

Quantitative Evaluation

In this graph, we compare PSNR, SSIM, and Inception Score across 10 epochs of training.


The PSNR values after optimization show consistent improvement, peaking at 30 dB, indicating better image quality. The SSIM values rise from 0.6 to 0.9, demonstrating improved structural similarity in the images.

PSNR Graph Comparison
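
For reference, PSNR is derived from the mean squared error between a generated image and its reference (PSNR = 10 · log10(MAX^2 / MSE)), while SSIM compares local luminance, contrast, and structure. Below is a minimal evaluation sketch, assuming scikit-image and uint8 images; our full evaluation pipeline is not shown here.

    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def evaluate_pair(generated: np.ndarray, reference: np.ndarray):
        # Both images: uint8 arrays of shape (256, 256, 3) in the same color space.
        psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
        ssim = structural_similarity(reference, generated, channel_axis=-1, data_range=255)
        return psnr, ssim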

The FID before optimization starts high at 140, showing minimal improvement over epochs. After optimization, it decreases significantly from 120 to 60, indicating better perceptual quality in the generated images.

FID and Loss Comparison
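
FID compares Inception-v3 feature statistics of generated and real images, so lower scores mean the two distributions are closer. The sketch below uses torchmetrics' FrechetInceptionDistance as one off-the-shelf option; this is an assumption on our part rather than the exact tool used in the project, and it requires the torch-fidelity package to be installed.

    from torchmetrics.image.fid import FrechetInceptionDistance

    def compute_fid(real_loader, fake_loader):
        # Loaders yield uint8 image batches of shape (N, 3, H, W).
        fid = FrechetInceptionDistance(feature=2048)
        for real_batch in real_loader:
            fid.update(real_batch, real=True)    # accumulate statistics of real images
        for fake_batch in fake_loader:
            fid.update(fake_batch, real=False)   # accumulate statistics of generated images
        return fid.compute()                     # lower is better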

In this figure, we compare GPU usage and image diversity over the 10 epochs.

GPU usage before optimization hovers around 65-75%, while after optimization it rises to 80-95%, reflecting better utilization of GPU resources during training.

Image Diversity Score also shows a marked improvement. Pre-optimization scores are between 2.5 and 3.5, while post-optimization scores increase to 4.0-5.0, indicating greater variation and creativity in generated images.

GPU and Image Diversity Comparison
Summary

Key Findings

  • Optimization techniques reduced computational cost by 30%, maintaining high accuracy.
  • Mixed precision training accelerated the process by 50%, with no loss in model performance.
  • Distributed training allowed for faster convergence with large datasets.

Challenges Encountered

Balancing performance and computational cost was challenging, especially during GAN integration for image generation.

Future Directions

Advanced attention mechanisms like adaptive attention could further improve efficiency in text-to-image tasks. Additionally, research into text-to-video generation is crucial for achieving coherent video sequences based on textual descriptions.

References
  1. Vaswani, A., et al. (2017). Attention is All You Need. arXiv preprint arXiv:1706.03762.
  2. Koh, J., Park, S., & Song, J. (2024). Improving Text Generation on Images with Synthetic Captions.
  3. Khan, F. (2023). Solving Transformer by Hand: A Step-by-Step Math Example.