An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Vision Transformer (ViT) Method
- Applies the Transformer, a model originally used in natural language processing, to vision
- Image patches -> Transformer Encoder
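
The "image patches -> Transformer Encoder" step can be illustrated with a minimal sketch of the patching stage only. It assumes NumPy, a 224x224 RGB input (an illustrative size, not taken from these notes), and the 16x16 patch size from the title. The function name `image_to_patches` is hypothetical. The full ViT model also applies a learned linear projection and position embeddings before the encoder, which are omitted here.

```python
import numpy as np


def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    i.e. a sequence of "words" that a Transformer encoder could consume.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


# Dummy input; the 224x224x3 size is an assumption for illustration.
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = image_to_patches(image)
print(tokens.shape)  # (196, 768): 14 x 14 patches, each 16 * 16 * 3 values
```

Each row of `tokens` plays the role of one word in a sentence, so the image becomes a sequence of 196 tokens.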