Compare each token with its aligned counterpart by passing the concatenated pair through a feed-forward network $g$:
$V_{A,i} = g([a_i, \beta_i])$, $V_{B,j} = g([b_j, \alpha_j])$
Aggregate via sum pooling, collapsing each variable-length sequence into a fixed-size vector:
$v_A = \sum_i V_{A,i}$, $v_B = \sum_j V_{B,j}$
Classify by feeding the concatenated pooled vectors through a final feed-forward network $h$:
$\hat{y} = h([v_A, v_B])$
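The full attend–compare–aggregate–classify pipeline can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the weight matrices are random stand-ins, `mlp` is a generic two-layer ReLU network playing the role of both $g$ and $h$, and all dimensions (`d`, `hidden`, `classes`) are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def mlp(x, W1, W2):
    # Two-layer feed-forward net with ReLU; stands in for g and h.
    return np.maximum(x @ W1, 0) @ W2

# Hypothetical sizes: d-dim token embeddings, sequences of different lengths.
d, len_a, len_b, hidden, classes = 8, 5, 7, 16, 3
A = rng.normal(size=(len_a, d))  # token embeddings of sentence A
B = rng.normal(size=(len_b, d))  # token embeddings of sentence B

# Attend: soft alignment from dot-product scores e_ij = a_i . b_j.
E = A @ B.T                       # (len_a, len_b)
beta = softmax(E, axis=1) @ B     # (len_a, d): B-subphrase aligned to each a_i
alpha = softmax(E, axis=0).T @ A  # (len_b, d): A-subphrase aligned to each b_j

# Compare: V_{A,i} = g([a_i, beta_i]), V_{B,j} = g([b_j, alpha_j]).
Wg1, Wg2 = rng.normal(size=(2 * d, hidden)), rng.normal(size=(hidden, hidden))
V_A = mlp(np.concatenate([A, beta], axis=1), Wg1, Wg2)   # (len_a, hidden)
V_B = mlp(np.concatenate([B, alpha], axis=1), Wg1, Wg2)  # (len_b, hidden)

# Aggregate: sum pooling yields fixed-size vectors regardless of length.
v_A = V_A.sum(axis=0)
v_B = V_B.sum(axis=0)

# Classify: y_hat = h([v_A, v_B]).
Wh1, Wh2 = rng.normal(size=(2 * hidden, hidden)), rng.normal(size=(hidden, classes))
logits = mlp(np.concatenate([v_A, v_B]), Wh1, Wh2)
y_hat = int(np.argmax(logits))
```

Note that nothing in the pipeline requires `len_a == len_b`: the attention matrix `E` mediates between the two lengths, and sum pooling erases them entirely before classification.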
Key insight: attention provides a soft alignment between the two sequences, so sequences of different lengths can be compared without any recurrence.
Parikh et al. (2016), "A Decomposable Attention Model for Natural Language Inference", EMNLP.