Link: https://arxiv.org/abs/2201.08377v1

First each image is represented as a 4D tensor \(X \in \mathbb{R}^{ T \times H \times W \times C}\), where \(T\) is the additional temporal dimension. So for images, we can take\(T\) to be 1. Here \(H\)denotes height, \(W\) denotes width and \(C\) denotes the number of channels.

This is then divided into sub tensors of form \(\mathbb{R}^{t \times h \times w \times c}\). These patches are then mapped into an embedding of size \(d\) individually using a linear layer with normalization applied to it. Then these embeddings are fed into the Swin transformer architecture.

This is then optimized with respect to the cross-entropy loss using stochastic gradient descent. Despite the architecture is simple, it works quite effectively on non-specific generalized modality inference.