
For more than a decade, there has been an extensive research effort on how to effectively use recurrent neural networks and attention. While recurrent models aim to compress the data into a fixed-size memory (called the hidden state), attention allows the model to attend to the entire context window, capturing the direct dependencies between all tokens. This more accurate modeling of dependencies, however, comes at a quadratic cost, limiting the model to a fixed-length context. In this talk, we first review recent advances in deep learning architectures and their (dis)advantages, and then discuss a new perspective on Transformers and recurrent neural networks, providing new insights for designing more powerful large language models (LLMs) in the future.
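
As a rough illustration of the trade-off described above (not part of the talk itself), the sketch below contrasts a plain recurrent update, which folds the sequence into a fixed-size hidden state, with a single attention step, whose pairwise score matrix grows quadratically with sequence length. All variable names and dimensions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                      # sequence length, model dimension
x = rng.standard_normal((n, d))  # token embeddings (hypothetical input)

# Recurrent model: O(n) time, O(d) memory. The hidden state has a fixed size,
# so information about earlier tokens must survive repeated compression.
W_h = rng.standard_normal((d, d)) * 0.1
W_x = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
for t in range(n):
    h = np.tanh(h @ W_h + x[t] @ W_x)    # new state overwrites the old one

# Attention: every token attends to every other token, capturing direct
# dependencies, but the score matrix is n x n -- quadratic in length.
q, k, v = x, x, x                        # single head, no projections, for brevity
scores = q @ k.T / np.sqrt(d)            # (n, n) pairwise similarities
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v                        # each output row mixes the whole context

print(h.shape, out.shape)                # (d,) fixed-size state vs (n, d) outputs
```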

PhD student @ Cornell University