Vision Transformers (ViT) are neural network architectures that compete with, and in some cases exceed, classical convolutional neural networks (CNNs) in computer vision tasks. ViT's versatility and performance are best understood through a backward analysis. In this study, we aim to identify, analyse and extract the key elements of ViT by tracing them back to the origin of Transformer neural architectures (TNA). We highlight the benefits and constraints of the Transformer architecture, as well as the foundational role of self-attention and multi-head attention mechanisms.
We now understand why self-attention might be all we need. Our interest in the TNA has driven us to consider self-attention as a computational primitive. This generic computation framework provides flexibility in the tasks that the Transformer can perform. Having gained a solid grasp of Transformers, we went on to analyse their vision-applied counterpart, namely ViT, which is roughly a transposition of the initial Transformer architecture to an image-recognition and image-processing context.
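To make the idea of self-attention as a computational primitive concrete, the following is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is an illustration under stated assumptions, not code from the thesis: the helper names (`softmax`, `matmul`, `transpose`, `self_attention`), the toy tokens and the identity projection matrices are all hypothetical choices made for readability.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x holds n tokens of dimension d (n x d); w_q, w_k, w_v are d x d_k
    projection matrices. Returns the n x d_k outputs and the n x n weights.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # pairwise query-key similarities
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights  # each output is a weighted mix of values

# Toy input (assumption for illustration): 3 tokens of dimension 2,
# with identity matrices standing in for learned projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(tokens, identity, identity, identity)
```

Each row of `weights` sums to one, so every output token is a convex combination of the value vectors. Nothing in the computation depends on the tokens being words; in ViT the same operation is applied to embedded image patches.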
When it comes to computer vision, convolutional neural networks are considered the go-to paradigm. Because of their proclivity for vision, we naturally seek to understand how ViT compares with CNNs. It seems that their inner workings are rather different.
CNNs are built with a strong inductive bias, a design choice that helps them perform well in vision tasks. ViTs have a weaker inductive bias and must learn comparable structure (such as convolution-like filters) from sufficiently large amounts of data. This makes Transformer-based architectures rather data-hungry, but also more adaptable.
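One commonly cited component of the convolutional inductive bias is translation equivariance: shifting the input shifts the output by the same amount, because the same filter weights are reused at every position. The sketch below checks this property on a toy one-dimensional signal with circular boundaries. The function names, signal and kernel are hypothetical and chosen only for illustration; the thesis itself may discuss inductive bias in different terms.

```python
def circular_conv(signal, kernel):
    """Slide one shared kernel over the signal, wrapping around at the edges."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]

def shift(signal, s):
    """Cyclically shift the signal to the right by s positions."""
    n = len(signal)
    return [signal[(i - s) % n] for i in range(n)]

# Toy data (assumption for illustration): a short integer signal and a
# two-tap difference filter.
signal = [0, 1, 3, 2, 0, 0]
kernel = [1, -1]

# Filtering a shifted signal gives the same result as shifting the filtered signal.
filtered_then_shifted = shift(circular_conv(signal, kernel), 2)
shifted_then_filtered = circular_conv(shift(signal, 2), kernel)
equivariant = filtered_then_shifted == shifted_then_filtered
```

A convolutional layer has this behaviour by construction, whereas a self-attention layer over patch tokens is not constrained to it and has to pick up any such regularity from data.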
Finally, we describe potential enhancements to the Transformer, with a focus on possible architectural extensions. We discuss some exciting learning approaches in machine learning. The analysis in our final part leads us to ponder the flexibility of Transformer-based neural architectures. We argue that this feature might possibly be linked to their Turing-completeness.
Table of Contents
- Chapter 1
- Introduction
- Purpose Statement
- Approach
- Natural Language Processing (NLP)
- Computer Vision (CV)
- Chapter 2
- Transformer
- Transformer - Building Blocks
- Transformer - Workflow
- Transformers - Digest
- Vision Transformer (ViT)
- Key Ideas
- ViT in CNN Realm
- ViT - State of the Art (SOTA)
- ViT and CNN: A Shared Vision?
- Chapter 3
- Perspectives for Transformers and ViTs
- Selected Learning Paradigms
- Model Soups - Ensemble Learning
- Multimodal Learning
- Self-Supervised Learning
- Other Approaches and Open Question
- Beyond Transformers?
- Personal Path Of Exploration
- Conclusion
Objectives and Key Themes
This master's thesis examines the architecture and functionality of Vision Transformers (ViT), a type of neural network that leverages the Transformer architecture to achieve state-of-the-art results in computer vision tasks. The study aims to understand the strengths and limitations of ViT by tracing their origins and analyzing key elements such as self-attention mechanisms. The objective is to show how these models compete with, and can outperform, traditional convolutional neural networks (CNNs) in the domain of computer vision. Key themes include:
- The role of Transformer neural architectures (TNA) in computer vision
- Analysis of the strengths and limitations of ViT compared to CNNs
- Exploration of key elements within ViT, including self-attention mechanisms
- Discussion of potential future enhancements and research directions for Transformer-based architectures
- Examination of the potential link between Transformer-based neural architectures and Turing-completeness
Chapter Summaries
- Chapter 1 introduces the topic of computer vision and the role of deep learning. It highlights the limitations of traditional CNNs in achieving human-like generalization capabilities and explores the potential of Transformer architectures as a solution. The chapter introduces ViT, a Transformer-based neural network specifically designed for image processing.
- Chapter 2 delves into the Transformer architecture, explaining its building blocks, workflow, and key concepts. It also analyzes the application of Transformer models in the field of computer vision, specifically discussing ViT and its performance compared to CNNs.
- Chapter 3 explores various perspectives for Transformers and ViTs, including advanced learning paradigms like ensemble learning, multimodal learning, and self-supervised learning. It also discusses potential future directions for research in this area.
Keywords
The key terms and concepts explored in this thesis include Vision Transformers (ViT), Transformer neural architectures (TNA), self-attention mechanisms, computer vision, convolutional neural networks (CNNs), deep learning, state-of-the-art (SOTA), Turing-completeness, and advanced learning paradigms like ensemble learning, multimodal learning, and self-supervised learning.
Quote paper
Tolga Topal (Author), 2022, What Fuels Transformers in Computer Vision? Unraveling ViT's Advantages, Munich, GRIN Verlag, https://www.grin.com/document/1437625