ViT
ViT backbone based on An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
ViT (Vision Transformer) has no stage configuration and is therefore not compatible with neck modules; currently, only the FC head is supported. When using ViT for classification tasks, users can choose whether to use a classification token, and the settings of the transformer encoder can be configured flexibly.
Field list
| Field | Description |
|---|---|
| name | (str) Name must be "vit" to use the ViT backbone. |
| params.patch_size | (int) Size of each image patch; every patch is embedded as a single token. |
| params.attention_channels | (int) Embedding dimension of the encoder. |
| params.num_blocks | (int) Number of self-attention blocks in the encoder. |
| params.num_attention_heads | (int) Number of heads in the multi-head attention. |
| params.attention_dropout_prob | (float) Dropout probability in the attention block. |
| params.ffn_intermediate_channels | (int) Intermediate dimension of the feed-forward network inside the attention block. |
| params.ffn_dropout_prob | (float) Dropout probability of the feed-forward network inside the attention block. |
| params.use_cls_token | (bool) Whether to use the classification token. |
| params.vocab_size | (int) Maximum number of tokens supported by the positional encoding. |
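Below is a minimal sketch of how these fields might appear in a YAML model configuration. The surrounding keys (model.architecture.backbone) and the specific values (loosely modeled on a small ViT variant with 224x224 inputs) are illustrative assumptions, not prescribed defaults.

```yaml
model:
  architecture:
    backbone:
      name: vit                        # must be "vit" to select this backbone
      params:
        patch_size: 16                 # 224x224 input -> (224/16)^2 = 196 patch tokens
        attention_channels: 192        # embedding dimension of the encoder
        num_blocks: 12                 # number of self-attention blocks
        num_attention_heads: 3         # heads per multi-head attention layer
        attention_dropout_prob: 0.0
        ffn_intermediate_channels: 768 # hidden size of the feed-forward network
        ffn_dropout_prob: 0.0
        use_cls_token: True            # prepend a classification token
        vocab_size: 197                # 196 patch tokens + 1 classification token
```

In this sketch, setting use_cls_token to True adds one token to the sequence, so vocab_size must cover the patch tokens plus the classification token (196 + 1 = 197 for a 224x224 input with patch_size 16).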