To a Developer Like Me, Without Any Math Formulas
Me: “Hey bro, just curious to learn: I’ve looked at this picture of how transformers work so many times, but I still don’t get it. Can you help me understand?” I showed him the diagram below.
DS Friend: “Sure! What part is confusing you?”
Me: “Well, pretty much everything: encoding and decoding. How exactly do they work? But mostly the Q, K, V matrices and the whole attention thing. And also, what’s with multi-head attention?”
DS Friend: “Alright. Let’s break it down step by step, starting with what you already know.”
Me: “I think I understand embedding: turning words into vectors so the model can process them. Plus some basic neural network knowledge, both learned from you, and that’s all, haha. But how does the transformer ‘understand’ the meaning of a sentence? For example, this one: ‘transformer is a process of LLM’. Getting a bunch of vectors doesn’t seem to extract the ‘meaning’ of the sentence, right?”
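To make “turning words into vectors” concrete, here’s a minimal sketch of an embedding lookup. Everything in it is a toy assumption of mine, not from the conversation: the vocabulary, the names (`vocab`, `embed`, `d_model`), the dimension of 8, and the random table. Real models learn these vectors during training and use hundreds of dimensions.

```python
import numpy as np

# Toy vocabulary built from the example sentence (illustrative only).
vocab = {"transformer": 0, "is": 1, "a": 2, "process": 3, "of": 4, "llm": 5}

d_model = 8  # embedding size; real models use hundreds of dimensions
rng = np.random.default_rng(0)

# In a real model this table is *learned* during training;
# here it is random just to show the mechanics.
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Map each word to its row in the embedding table."""
    return np.stack([embedding_table[vocab[t]] for t in tokens])

sentence = "transformer is a process of llm".split()
vectors = embed(sentence)
print(vectors.shape)  # (6, 8): one 8-number vector per word
```

Notice that at this point the vectors carry no information about word order, which is exactly the gap the next step fills.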
Embedding and Positional Encoding
DS Friend: “Yes. After embedding the words into vectors, the transformer needs to know their position in the sentence because ‘transformer is’ and ‘is transformer’ are different. That’s where…