Shivkanwer Singh Sidhu

Welcome to part 2 of the series on the “Inner Workings of LLMs for Developers”! In Part 1, we took our first crucial step into understanding how machines process language by exploring the classic Bag-of-Words (BoW) model.

We learned how BoW provides a simple yet effective way to convert unstructured text into numerical vectors that a machine can understand. By treating text as an unordered collection of words and simply counting their frequencies, we were able to create a structured representation suitable for tasks like spam filtering and basic document classification. However, we also hit a wall. We discovered that BoW’s simplicity is its biggest limitation. Since it has no concept of word order or semantic meaning, it is blind to the fact that “price” and “cost” are similar, and it can’t tell the difference between “This movie was not good” and “This movie was good, not terrible”.

2013 and 2014

It became clear that a plain vector representation of words was not enough for machines to grasp the most fundamental property of language: words exist in a complex web of relationships. We needed a methodology that could capture both their meaning and their context.

Word Embeddings

If you ask someone which word is more similar to “doctor”, “nurse” or “patient”, most people would say “nurse” makes more sense, since doctors and nurses are both medical professionals. But how do we teach a computer that specific relationship, especially when all three words appear together so often? That’s where word embeddings came into the picture.

The conceptual breakthrough that paved the way for word embeddings did not come from computer science, but from linguistics. In the 1950s, linguists such as J.R. Firth and Zellig Harris formulated what is now known as the Distributional Hypothesis. The hypothesis is elegantly summarized by Firth’s famous dictum: “You shall know a word by the company it keeps”. The idea is that the meaning of a word is not an intrinsic property but is defined by the contexts in which it appears. Words that consistently show up in similar linguistic environments are likely to have similar meanings.

Although “doctor,” “nurse,” and “patient” all keep company with each other, a machine that analyzes millions of sentences can spot subtle differences in the patterns surrounding them.

It might learn that “doctor” and “nurse” often appear in similar contexts like:

  • “…consulted with the nurse.” / “…consulted with the doctor.”
  • “The doctor’s shift is over.” / “The nurse’s shift is over.”

In contrast, the context for “patient” is consistently different:

  • “The doctor treated the patient.” (not “The doctor treated the nurse.”)
  • “The patient was admitted by the nurse.”

By recognizing these distinct patterns, a computer can deduce that doctors and nurses share a similar role, while a patient’s role is different. This is precisely what a word embedding is designed to capture. It translates these learned relationships into mathematical form by representing each word as a vector. This vector encodes the word’s meaning in such a way that words with similar contexts, like “doctor” and “nurse,” end up positioned closer together in the resulting vector space.

To make this concrete, let’s imagine we have word embeddings for several words such as “doctor”, “surgeon”, “nurse”, “teacher”, “student”, and “car”. We can represent these embeddings in a table, where each row is a word and each column is a dimension. A dimension is just one piece of information about a word’s meaning, represented by a number: a feature or attribute that describes a data point. For example, to describe a car, we could use 3 dimensions/attributes: [speed, price, safety].

A table representing word embeddings for various words, including 'doctor', 'surgeon', 'nurse', 'teacher', 'student' and 'car'. Each word is associated with numerical scores across different conceptual dimensions like 'Medical Pro', 'Is a Location', 'Is a Vehicle', 'Is an Emotion', and 'Education'.

Table: Word embeddings for sample words

As shown in the table, each word is represented by scores across different conceptual dimensions. Words with similar meanings have similar scores. For example, “doctor”, “nurse” and “surgeon” all have high positive scores (e.g., 0.9, 0.8) for the “Medical Pro” dimension, grouping them together. In contrast, a word like “car” scores high on the “Is a Vehicle” dimension but negatively on “Medical Pro,” placing it in a completely different semantic category. This allows a computer to mathematically understand that words with similar vectors are similar in meaning.

💡 Remember in real embeddings, the dimensions don’t have explicit human-readable names like “Medical Pro”, “Is a Location”, as represented in the table. The model learns these dimensions automatically as abstract mathematical properties during the training process. So, a word’s embedding is simply its set of coordinates across hundreds of abstract dimensions (often 300 or more). Each coordinate represents a feature of the word’s meaning that the AI model has learned on its own by analyzing patterns in massive amounts of text.
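To see how “words with similar vectors are similar in meaning” plays out mathematically, here is a minimal Python sketch that compares hypothetical 3-dimensional embeddings using cosine similarity (the vectors below are made up for illustration; real embeddings have hundreds of learned dimensions):

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, roughly: [medical-ness, education-ness, vehicle-ness].
embeddings = {
    "doctor": np.array([0.9, 0.2, -0.4]),
    "nurse":  np.array([0.8, 0.1, -0.3]),
    "car":    np.array([-0.5, -0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Close to 1.0 means the vectors point in the same direction (similar meaning)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["doctor"], embeddings["nurse"]))  # high (~0.99): similar meaning
print(cosine_similarity(embeddings["doctor"], embeddings["car"]))    # negative (~-0.81): unrelated
```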

If we were to plot these embeddings on a 2D graph, we would see that the similar words tend to cluster together.

A 2D scatter plot showing various words clustered in different regions. Words like 'doctor', 'nurse' and 'surgeon' are grouped closely together, distinct from words like 'teacher' and 'student', which form another cluster. The word 'Car' is located far from all other clusters.

Figure: Visualizing word embeddings in a 2D space

💡 Word embeddings are sometimes also referred to as vector embeddings. Are they the same? A vector embedding is the broader term used in AI for representing any entity (a word, a user, a sentence, an image) as a numeric vector, whereas word embeddings are one specific type of vector embedding, specialized for representing the meaning of words.

Now that we understand what word embeddings are, the question is how do we generate them? This is where Word2Vec comes into the picture.

Word2Vec

Word2Vec (short for Word to Vector), introduced by a team of researchers at Google in 2013, was one of the first widely successful techniques for converting words into vectors, capturing their meaning and their relationship with surrounding text in the form of word embeddings.

Before the model can start generating word embeddings, it must first be trained on enormous amounts of text data, such as books, blog posts, articles, and Wikipedia. By analyzing which words frequently appear near each other, it starts to learn patterns about the language. But before the model can learn, the raw text must be turned into a structured dataset suitable for training. Word2Vec accomplishes this with a “sliding window” approach: a window of fixed size moves across the text and, at each position, generates one or more training samples.

A diagram illustrating the sliding window approach used in Word2Vec. A window of a fixed size moves across a sentence, generating training dataset samples.

Figure: The sliding window approach

Let’s understand this with an example. Consider the sentence: “The nurse assisted the doctor with the patient’s treatment at the hospital.” Suppose this sentence is part of a larger training corpus. We’ll use it to walk through how the Word2Vec model generates training samples and eventually learns useful word embeddings.

Step 1: Pre-processing

Just like BoW, the first step is to clean the training data to reduce the complexity of the vocabulary and allow the model to focus on learning meaningful semantic signals.

A table showing the pre-processed and tokenized version of the sentence: 'The nurse assisted the doctor with the patient's treatment at the hospital.'

Pre-processed and tokenized sentence for Word2Vec training

💡 In this example, the vocabulary size comes out to be 9, however in reality it can be thousands or even millions. The vocabulary isn’t built from a single sentence but from a massive collection of text (a “corpus”) like all of Wikipedia or a large portion of the internet.
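If you are curious what this pre-processing might look like in code, here is a minimal Python sketch (the exact cleaning rules, for example how possessives and punctuation are handled, vary between implementations):

```python
import re

sentence = "The nurse assisted the doctor with the patient's treatment at the hospital."

def preprocess(text):
    """Lowercase, strip possessives and punctuation, then split on whitespace."""
    text = text.lower()
    text = re.sub(r"'s\b", "", text)       # patient's -> patient
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation and digits
    return text.split()

tokens = preprocess(sentence)
vocab = sorted(set(tokens))

print(tokens)      # ['the', 'nurse', 'assisted', 'the', 'doctor', 'with',
                   #  'the', 'patient', 'treatment', 'at', 'the', 'hospital']
print(len(vocab))  # 9 unique words
```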

Step 2: Creating training dataset

With the corpus transformed into a clean sequence of tokens, the next stage involves extracting training instances. For this, Word2Vec proposes two primary architectures:

  • Continuous Bag-of-Words (CBOW)
  • Skip-gram

Let’s understand each of these architectures in detail.

Continuous Bag-of-Words (CBOW)

The CBOW architecture operates on the principle of predicting a target word based on its surrounding context words. It essentially asks the question: “Given these surrounding words, what is the most likely word in the middle?” The context words are treated as a “bag” of words, meaning their order is not considered in the standard implementation. To define the context, CBOW relies on the “sliding window” technique, which systematically moves through the token sequence to define a local context for each word.

An illustration showing the Continuous Bag-of-Words (CBOW) model in action. The model takes context words 'the', 'man', 'his', 'son' as input and predicts the target word in the middle, in this case 'loves'.

CBOW: Given the context words "the", "man", "his", "son", predict the most likely middle word

In the above example, a window size of 2 means that for any given word, we consider up to two words to its left and up to two words to its right as its context. The total potential size of the context for any given word is therefore four.

Let’s apply CBOW to the tokenized output from Step 1 ['the', 'nurse', 'assisted', 'the', 'doctor', 'with', 'the', 'patient', 'treatment', 'at', 'the', 'hospital'] considering a sliding window size of 2 to obtain a training dataset.

Iteration 1:

For the first iteration, the context words are “nurse” and “assisted” (window size 2), and the target word is “the”. Since there are no words before the first “the”, we only take the two words that come after it.

An image showing the first iteration of the CBOW training process, where 'nurse' and 'assisted' are context words and 'the' is the target word.

CBOW Training dataset - Iteration 1

Iteration 2:

The context words are now “the” (left boundary) along with “assisted” and “the” (right boundary), and the target word is “nurse”.

An image showing the second iteration of the CBOW model for the sentence 'the nurse assisted the doctor with the patient’s treatment at the hospital'. The target word 'nurse' is shown in the center. Its context words are 'the' (left boundary), and 'assisted' and 'the' (right boundary), all within a sliding window of size 2.

CBOW Training dataset - Iteration 2

Iteration 3:

Context words are “the”, “nurse” (left boundary) and “the”, “doctor” (right boundary) and the target word is “assisted”.

An image showing the third iteration of the CBOW model for the sentence 'the nurse assisted the doctor with the patient’s treatment at the hospital'. The target word 'assisted' is shown in the center. Its context words are 'the', 'nurse' (left boundary), and 'the', 'doctor' (right boundary), all within a sliding window of size 2.

CBOW Training dataset - Iteration 3

We continue to iterate over the tokenized input until we reach the end of the input data. The final training dataset looks like this:

A table showing the training dataset for the CBOW model with a window size of 2. Each row contains context words and a target word.

CBOW Training dataset
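If you'd like to see this pair generation in code, here is a minimal Python sketch that reproduces the (context words, target word) samples from the iterations above (the function name is illustrative):

```python
def cbow_pairs(tokens, window=2):
    """Generate (context_words, target_word) training samples for CBOW."""
    samples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after the target
        samples.append((left + right, target))
    return samples

tokens = ['the', 'nurse', 'assisted', 'the', 'doctor', 'with',
          'the', 'patient', 'treatment', 'at', 'the', 'hospital']

for context, target in cbow_pairs(tokens)[:3]:
    print(context, '->', target)
# ['nurse', 'assisted'] -> the
# ['the', 'assisted', 'the'] -> nurse
# ['the', 'nurse', 'the', 'doctor'] -> assisted
```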

Skip-gram

The Skip-gram architecture works in the opposite direction. Instead of using the context to predict the target, it uses the target word to predict its surrounding context words. It asks, “Given this central word, what are the words likely to be found in its vicinity?” This approach generates more training samples for the same window position.

A Skip-gram model diagram showing a target word 'love' at the center, with arrows pointing outwards to predict context words 'I', 'to', 'you', and 'it'.

Skip-gram: Given the target word "love", predict the most likely context words

Let’s apply Skip-gram to the tokenized output from Step 1 ['the', 'nurse', 'assisted', 'the', 'doctor', 'with', 'the', 'patient', 'treatment', 'at', 'the', 'hospital'] considering a sliding window size of 2 to obtain a training dataset.

Iteration 1:

For the first iteration, the target word (i.e. input word) is “the”. Since the size of the sliding window is 2, the model creates two training samples, one for each context word - “nurse”, “assisted”. For the first word in the sentence, the sliding window can only look forward, not backward.

A Skip-gram model diagram illustrating the first iteration of training sample generation. The target word 'the' is at the center, with arrows pointing to the right to predict the context words 'nurse' and 'assisted' based on a window size of 2.

Skip-gram: Training dataset - Iteration 1

Iteration 2:

For the next iteration, the target word shifts to “nurse”. This time there is one context word, “the”, on the left of the target word and two context words, “assisted” and “the”, on the right, so the model creates a total of three training samples.

A Skip-gram model diagram illustrating the second iteration of training sample generation. The target word 'nurse' is at the center, with an arrow pointing left to predict 'the' and arrows pointing right to predict 'assisted' and 'the' based on a window size of 2.

Skip-gram: Training dataset - Iteration 2

Iteration 3:

The target word is “assisted” with two context words on each side so the model creates a total of four training samples.

A Skip-gram model diagram illustrating the third iteration of training sample generation. The target word 'assisted' is at the center, with arrows pointing left to predict 'the' and 'nurse', and arrows pointing right to predict 'the' and 'doctor' based on a window size of 2.

Skip-gram: Training dataset - Iteration 3

Skip-gram continues to iterate over the input until it reaches the end of the sentence. The final training dataset looks like this:

A table showing the final training dataset generated by the Skip-gram model. It lists input-output word pairs and their corresponding labels.

Skip-gram Training dataset

You will notice that Skip-gram produces more training samples than CBOW for the same text and window size.
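Here is the equivalent sketch for Skip-gram, which turns each window position into several (input word, context word) pairs (again, the function name is illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (input_word, context_word) training samples for Skip-gram."""
    samples = []
    for i, target in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                samples.append((target, tokens[j]))
    return samples

tokens = ['the', 'nurse', 'assisted', 'the', 'doctor', 'with',
          'the', 'patient', 'treatment', 'at', 'the', 'hospital']

pairs = skipgram_pairs(tokens)
print(pairs[:5])    # [('the', 'nurse'), ('the', 'assisted'),
                    #  ('nurse', 'the'), ('nurse', 'assisted'), ('nurse', 'the')]
print(len(pairs))   # noticeably more samples than the CBOW sketch produces
```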

Step 3: Training Process

The Word2Vec model, at its core, is a simple neural network that learns by manipulating two key matrices:

  • Embedding Matrix
  • Context Matrix

Let’s understand the two matrices in detail.

Embedding Matrix (E)

The Embedding Matrix is the primary data structure where the final word embeddings of a trained model are stored. Let’s understand the structure of this matrix.

A table showing the structure of an Embedding Matrix (E). Each row represents a word from the vocabulary, and each column represents a dimension, illustrating how words are vectorized.

Table: Embedding Matrix (E)

Each row of this matrix corresponds to a word in our vocabulary (obtained in Step 1) and holds its vector representation. Each column corresponds to a dimension, an abstract mathematical property that captures some aspect of the relationships between words. Initially, since the model is untrained, this matrix is populated with random numbers because the model lacks any prior knowledge of word relationships.

During the training process, this matrix is used to look up the vector for each input word.

Context Matrix (C)

The Context Matrix is used for generating predictions during the training process. It also contains the vector for every word in the vocabulary. However, its structure is different from the Embedding Matrix.

A table showing the structure of a Context Matrix (C). Each row represents a dimension, and each column represents a word from the vocabulary, illustrating how words are represented for context prediction.

Table: Context Matrix (C)

In the Context Matrix, the rows correspond to the same dimensions (d) as in the Embedding Matrix, while each word in the vocabulary (v) is represented as a column. Like the Embedding Matrix, the Context Matrix is initialized with random numbers at the start of the training process.

During the training process (which we’ll look at next), the interaction between a row of the Embedding Matrix (representing an input word) and a column of the Context Matrix (representing a potential output word) generates a score that indicates how likely that pair is to appear together.
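As a rough sketch, the two matrices could be set up and used to score a word pair like this (the embedding dimension, initialization, and variable names are illustrative):

```python
import numpy as np

vocab = ['assisted', 'at', 'doctor', 'hospital', 'nurse',
         'patient', 'the', 'treatment', 'with']
word_to_id = {w: i for i, w in enumerate(vocab)}

V, d = len(vocab), 50                    # vocabulary size and embedding dimension (d is illustrative)
rng = np.random.default_rng(0)

E = rng.normal(scale=0.1, size=(V, d))   # Embedding Matrix: one row per word
C = rng.normal(scale=0.1, size=(d, V))   # Context Matrix: one column per word

# Score for the pair ("the", "nurse"): a row of E dotted with a column of C.
score = E[word_to_id['the']] @ C[:, word_to_id['nurse']]
print(score)                             # a small random number, since the model is untrained
```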

Next, let’s understand the training process for Word2Vec using the Skip-gram training dataset as an example:

A table showing the final training dataset generated by the Skip-gram model. It lists input-output word pairs and their corresponding labels.

Skip-gram Training dataset

Let’s take the first sample from the training dataset and pass it to the model:

A table with two columns, 'Input' and 'Output', showing 'the' as input and 'nurse' as output.

First input sample data for training Word2Vec model

The model’s task is to take the word “the” and predict its likely neighbor.

Step 1: Lookup Vector

The model retrieves the current vector for the input word “the” from the Embedding Matrix (E): E[“the”] = [0.182, -0.463, 0.771, -0.023, …, 0.314]

Step 2: Score and Predict

This step predicts the most likely neighbor for the input word “the” by asking a simple question: ‘How similar is “the” to every other word in our vocabulary?’

To find the answer, the model takes the input word’s vector and compares it against every column in the Context Matrix, using mathematical operations like the dot product and softmax, to come up with a list of probabilities showing how likely each word is to be the correct neighbor. The final probabilities would look something like this:

A table showing a list of words from a vocabulary (e.g., doctor, hospital, nurse, patient) along with their predicted probability scores, indicating the likelihood of each word being a neighbor to a given input word.

Probability scores for each word in the vocabulary

The model assigns the highest probability to the word “hospital” as the most likely neighbor of the input word “the”. This is an incorrect prediction; we know from the sample dataset (i.e. the ground truth) that the correct prediction should be “nurse”. This is expected at this stage, since the model is still untrained.

💡 As you may have noticed above, the model calculates the probability scores for all the words in the vocabulary. Imagine if the size of the vocabulary is 50k, 500k or in millions, performing this calculation for every single input word would become a computational bottleneck, making the training process incredibly slow and inefficient. Clearly, a smarter way is needed to train the model. More on this later.

Step 3: Calculate Error

In this step, the model’s prediction is compared to the actual output word from the training pair to calculate the error vector. The error is the gap between what the model predicted and what it should have predicted. The model gave “hospital” a very high score and the correct word, “nurse” a very low score. This error signal tells us exactly how to update the numbers in our matrices to make the correct prediction more likely next time.

Step 4: Update

This step represents the learning aspect of the training process. The model learns by adjusting its embeddings based on the error found in the previous step.

It makes the following adjustments to the Context Matrix (C):

  • The output vector for the correct word, C["nurse"], is nudged to become more similar to the input vector E["the"].
  • Every other column in the Context Matrix (C["hospital"], C["assisted"], C["car"], etc.) is nudged to become less similar to the input vector E["the"].

Additionally, the following adjustment is made to the Embedding Matrix (E): The vector E["the"] is updated to become more similar to C["nurse"] while also being pushed away from all other vectors, with the strongest push coming from C["hospital"].

This completes a single training step. The Embedding Matrix is updated for the input word based on the prediction error. The Context Matrix guides this adjustment, drawing the input vector closer to the correct target word and further from incorrect ones. Consequently, the Embedding Matrix becomes slightly smarter than before. By repeating this process millions of times, the Embedding Matrix develops into a robust and precise language representation.
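Putting Steps 1 through 4 together, here is a minimal sketch of one full-softmax training step on the pair (“the”, “nurse”), using the toy vocabulary from our example (the learning rate, initialization, and function names are illustrative; real implementations add many optimizations):

```python
import numpy as np

vocab = ['assisted', 'at', 'doctor', 'hospital', 'nurse',
         'patient', 'the', 'treatment', 'with']
word_to_id = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 50
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, d))    # Embedding Matrix (V x d)
C = rng.normal(scale=0.1, size=(d, V))    # Context Matrix  (d x V)

def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def train_step(E, C, input_word, output_word, lr=0.05):
    i, o = word_to_id[input_word], word_to_id[output_word]

    v_in = E[i]                            # Step 1: look up the input word's vector
    probs = softmax(v_in @ C)              # Step 2: one probability per vocabulary word

    error = probs.copy()                   # Step 3: predicted distribution minus
    error[o] -= 1.0                        #         the one-hot ground truth

    grad_in = C @ error                    # gradient for the input word's row in E
    grad_C = np.outer(v_in, error)         # gradient for the Context Matrix

    E[i] -= lr * grad_in                   # Step 4: nudge E["the"] toward C["nurse"], away from the rest
    C -= lr * grad_C                       #         nudge C["nurse"] toward E["the"], other columns away

train_step(E, C, 'the', 'nurse')           # one training step for a single sample
```

Note that Step 2 scores, and Step 4 updates, the entire Context Matrix for every single sample, which is exactly the bottleneck discussed next.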

But what about the computational bottleneck?

In the steps above, I mentioned that the model calculates a probability score for every single word in the vocabulary. Imagine a vocabulary with 50,000 words and 300 dimensions. For every single training sample, the model would have to:

  1. Calculate 50,000 individual scores.
  2. Update 15 million weights (50,000 x 300) in the Context Matrix based on those scores.

This is a massive computational bottleneck that makes training on large datasets prohibitively slow.

The Solution: Reframe the problem statement with Negative Sampling

Instead of asking a huge multi-choice question (“Which of these 50,000 words is the correct one?”), negative sampling changes the task to a much simpler Yes or No question:

“For this given pair of words, are they actually neighbors or not?”

This is a brilliant simplification. Instead of one giant, slow calculation, we do several tiny, fast ones.

So, let’s understand how Negative Sampling works.

Step 1: Select the Positive Sample

First, we take our ground truth pair. This is our “positive” example, the one we want the model to learn is correct. We give it a label of 1.

A table with three columns, 'Input', 'Output' and 'label', showing 'the' as input, 'nurse' as output and 1 as label

Positive pair indicating the ground truth

Step 2: Add Negative Samples

Next, we pick random “negative” or “noise” words from the vocabulary. These are the words that we know are not the neighbors of the input word and we give them a label of 0.

A table showing three negative samples. Each row consists of the input word 'the', a negative output word (e.g., 'phone', 'hospital', 'art'), and a label '0'

Examples of negative samples used in the training process

Step 3: Score and Predict

This step is now transformed. Instead of doing a massive number of calculations for every single input word, we have a small, manageable task with just 4 samples (1 positive + 3 negative). The model calculates the probability scores for only these 4 samples.

Since the model is untrained, its predictions are essentially random. For example, it might predict “phone” as a neighbor for the input word “the”. The model then compares this output to the correct label, and the resulting mismatch is used to calculate an error and drive the learning process.

Step 4: Update

The update (or learning) is now extremely efficient and targeted.

  • Positive Update: The input vector E["the"] and the output vector C["nurse"] are nudged closer together, to increase their similarity score.
  • Negative Updates: The input vector E["the"] is nudged away from the output vectors for our three negative samples: C["phone"], C["hospital"] and C["art"].

This optimization is what makes training Word2Vec on massive vocabularies feasible.
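Here is the same training step re-sketched with negative sampling. Notice that only the Context Matrix columns for the positive word and the few sampled noise words are touched, instead of the entire matrix (values and names are illustrative):

```python
import numpy as np

vocab = ['art', 'assisted', 'at', 'doctor', 'hospital', 'nurse',
         'patient', 'phone', 'the', 'treatment', 'with']
word_to_id = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 50
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, d))    # Embedding Matrix (V x d)
C = rng.normal(scale=0.1, size=(d, V))    # Context Matrix  (d x V)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step_negative_sampling(E, C, input_word, output_word, negatives, lr=0.05):
    i = word_to_id[input_word]
    v_in = E[i].copy()                    # input word's vector from the Embedding Matrix
    grad_in = np.zeros(d)

    # One positive pair (label 1) plus a handful of random negative pairs (label 0).
    pairs = [(word_to_id[output_word], 1.0)] + [(word_to_id[w], 0.0) for w in negatives]

    for j, label in pairs:
        pred = sigmoid(v_in @ C[:, j])    # "are these two words neighbors?" -> yes/no probability
        err = pred - label                # gap between prediction and ground truth
        grad_in += err * C[:, j]
        C[:, j] -= lr * err * v_in        # update only this single column

    E[i] -= lr * grad_in                  # nudge E["the"] toward C["nurse"], away from the noise words

train_step_negative_sampling(E, C, 'the', 'nurse', negatives=['phone', 'hospital', 'art'])
```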

Repeating this process many times pulls the vectors for related words like “doctor” and “nurse” closer together, while pushing unrelated words like “car” and “phone” further apart. The final result of the training process is an Embedding Matrix that acts as a powerful knowledge base, capturing each word’s underlying meaning, context, and semantic relationships, and serving as a foundational building block for nearly all modern Natural Language Processing (NLP) tasks.

Challenges with Word2Vec

The Polysemy Problem: One word, many meanings

The biggest challenge with Word2Vec is that it generates one single, static vector for each unique word in its vocabulary. This vector is shaped during training by all the different contexts in which the word appeared, effectively blending them into a single representation.

In our example, the model was trained on the sentence: “The nurse assisted the doctor with the patient’s treatment at the hospital.” During the training process, the only context the model ever saw for the word “doctor” was purely medical. It appeared alongside words like: nurse, assisted, patient, treatment and hospital. Because of this, the training process created exactly one vector for the word “doctor” which is very close to the vectors for other medical terms.

Now, let’s give our trained model a new sentence that uses the word “doctor” in a completely different sense:

“She is a doctor of philosophy, and her thesis on ancient Rome was brilliant.”

In this sentence, “doctor” refers to a Ph.D., an academic who holds the highest university degree. The true semantic neighbors of “doctor” in this context are words like professor, university, thesis, academic, and history.

When the model processes this new sentence, it looks up the word “doctor” in its Embedding Matrix. It has no choice but to retrieve the one and only vector it has, the one that represents a medical doctor. The model is trying to understand a sentence about academic achievement using a vector that represents medical practice. This is a fundamental mismatch. It has no mechanism to adapt the meaning of “doctor” based on the surrounding words like “philosophy” or “thesis.” This inability to distinguish between different senses of a word is the essence of the problem.
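You can see the root of the problem in the lookup step itself: the same key always returns the same row of the Embedding Matrix, regardless of which sentence the word came from. A tiny illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
word_to_id = {"doctor": 0, "nurse": 1, "philosophy": 2, "thesis": 3}
E = rng.normal(size=(len(word_to_id), 50))   # stand-in for a trained Embedding Matrix

# Sentence A: "The nurse assisted the doctor ... at the hospital."
# Sentence B: "She is a doctor of philosophy ..."
vec_a = E[word_to_id["doctor"]]              # the lookup never sees the surrounding words
vec_b = E[word_to_id["doctor"]]
print(np.array_equal(vec_a, vec_b))          # True: one static vector for both senses
```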

Blind to Order and Syntax

A second, equally critical flaw of static, window-based models like Word2Vec is their general disregard for word order and syntax. Word2Vec’s training process uses a context window, treating the words surrounding a target word as a “bag of words”. While it learns which words appear near each other, it does not effectively encode the grammatical structure or the precise positional relationships between them.  

The classic example “Man bites dog” versus “Dog bites man” illustrates this perfectly. Both sentences contain the exact same words, and in a small context window, the co-occurrence patterns are identical. A model like Word2Vec would see “dog,” “bites,” and “man” in close proximity in both cases and would struggle to distinguish the profound difference in meaning. This inability to capture how meaning is constructed from the arrangement of words is a severe limitation for any task that requires more than a surface-level understanding of text.
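You can verify this word-order blindness with the pair-generation logic sketched earlier: both sentences produce exactly the same set of Skip-gram training pairs, so the model receives identical signals from them.

```python
def skipgram_pairs(tokens, window=2):
    """Collect the set of (input_word, context_word) pairs for a token sequence."""
    pairs = set()
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.add((target, tokens[j]))
    return pairs

print(skipgram_pairs("man bites dog".split()) == skipgram_pairs("dog bites man".split()))
# True: identical co-occurrence pairs, despite the opposite meanings
```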

So, how do we create word embeddings that are not static, but dynamic and context-aware?

We’ll answer that question in Part 3.
