
MixTransformer

MixTransformer backbone based on SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers.

We provide the MixTransformer encoder (MiT), the backbone of SegFormer, as a freely usable backbone module. The transformer encoder of each stage can be configured independently, so any of the MiT-b0 through MiT-b5 variants can be expressed.

Compatibility matrix

| Supporting necks | Supporting heads | torch.fx | NetsPresso |
|---|---|---|---|
| FPN, YOLOPAFPN | FC, ALLMLPDecoder, AnchorDecoupledHead, AnchorFreeDecoupledHead | Supported | Supported |
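
As in the original SegFormer, the MixTransformer backbone is most commonly paired with the ALLMLPDecoder head listed above. The snippet below is a minimal sketch of that pairing; the `head` block and its name string (shown as `all_mlp_decoder`) are assumptions here, so check the head documentation for the exact keys.

```yaml
model:
  architecture:
    backbone:
      name: mixtransformer
      # params and stage_params as in the configuration examples below
    head:
      name: all_mlp_decoder  # assumed name string for the ALLMLPDecoder head; see its documentation
```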

Field list

| Field | Description |
|---|---|
| `name` | (str) Name must be "mixtransformer" to use the MixTransformer backbone. |
| `params.ffn_intermediate_expansion_ratio` | (int) Expansion ratio used to compute the intermediate dimension of the feed-forward network. |
| `params.ffn_act_type` | (str) Activation function for the feed-forward network in the transformer block. Supported activation functions are described [here]. |
| `params.ffn_dropout_prob` | (float) Dropout probability for the feed-forward network in the transformer block. |
| `params.attention_dropout_prob` | (float) Dropout probability for the attention in the transformer block. |
| `stage_params[n].num_blocks` | (int) The number of transformer blocks in the stage. |
| `stage_params[n].sequence_reduction_ratio` | (int) Sequence reduction ratio applied to the keys and values in the efficient multi-head attention. |
| `stage_params[n].attention_channels` | (int) Embedding dimension of the transformer blocks in the stage. |
| `stage_params[n].embedding_patch_sizes` | (int) Kernel size of the convolution layer in the overlapping patch embedding. |
| `stage_params[n].embedding_strides` | (int) Stride of the convolution layer in the overlapping patch embedding. |
| `stage_params[n].num_attention_heads` | (int) The number of heads in the multi-head attention. |

Model configuration examples

MiT-b0
```yaml
model:
  architecture:
    backbone:
      name: mixtransformer
      params:
        ffn_intermediate_expansion_ratio: 4
        ffn_act_type: "gelu"
        ffn_dropout_prob: 0.0
        attention_dropout_prob: 0.0
      stage_params:
        -
          num_blocks: 2
          sequence_reduction_ratio: 8
          attention_channels: 32
          embedding_patch_sizes: 7
          embedding_strides: 4
          num_attention_heads: 1
        -
          num_blocks: 2
          sequence_reduction_ratio: 4
          attention_channels: 64
          embedding_patch_sizes: 3
          embedding_strides: 2
          num_attention_heads: 2
        -
          num_blocks: 2
          sequence_reduction_ratio: 2
          attention_channels: 160
          embedding_patch_sizes: 3
          embedding_strides: 2
          num_attention_heads: 5
        -
          num_blocks: 2
          sequence_reduction_ratio: 1
          attention_channels: 256
          embedding_patch_sizes: 3
          embedding_strides: 2
          num_attention_heads: 8
```
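
MiT-b1 (illustrative)

The other variants follow the same layout with different per-stage widths and depths. The following is a sketch of a MiT-b1-style configuration; the per-stage settings (64/128/320/512 channels, two blocks per stage, heads 1/2/5/8) follow the values reported in the SegFormer paper rather than a configuration shipped with this documentation, so verify them against your checkpoint or recipe before use.

```yaml
model:
  architecture:
    backbone:
      name: mixtransformer
      params:
        ffn_intermediate_expansion_ratio: 4
        ffn_act_type: "gelu"
        ffn_dropout_prob: 0.0
        attention_dropout_prob: 0.0
      stage_params:
        -
          num_blocks: 2
          sequence_reduction_ratio: 8
          attention_channels: 64
          embedding_patch_sizes: 7
          embedding_strides: 4
          num_attention_heads: 1
        -
          num_blocks: 2
          sequence_reduction_ratio: 4
          attention_channels: 128
          embedding_patch_sizes: 3
          embedding_strides: 2
          num_attention_heads: 2
        -
          num_blocks: 2
          sequence_reduction_ratio: 2
          attention_channels: 320
          embedding_patch_sizes: 3
          embedding_strides: 2
          num_attention_heads: 5
        -
          num_blocks: 2
          sequence_reduction_ratio: 1
          attention_channels: 512
          embedding_patch_sizes: 3
          embedding_strides: 2
          num_attention_heads: 8
```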