Security Research

Inside the LLM | Understanding AI & the Mechanics of Modern Attacks

Phil Stokes

January 13, 2026

Executive Summary

  • Assessing AI security risks requires understanding how prompts are transformed inside the model and how these transformations create security gaps.
  • This post focuses on the initial stages of the LLM pipeline, including tokenization, embedding, and attention, to clarify how the model interprets input and where vulnerabilities arise.
  • We show how prompts can bypass traditional keyword filters and exploit architectural behaviors like context window limits.
  • We explain how the Query-Key-Value mechanism allows engineered token sequences to hijack model focus, overriding built-in safety guardrails.

Overview

LLMs are now widely used across enterprise environments for everything from internal workflows and customer support to automated documentation and data analysis. While these systems offer huge productivity gains, they also create potential attack surfaces, particularly where organizations do not have control over the input, such as in public-facing chatbots that could be manipulated through crafted prompts.

Even simple inputs can influence how these models behave. By examining how text is transformed inside the model, from tokens to embeddings and through attention mechanisms, we can see where attackers might exploit these processes. This includes techniques such as prompt injection, jailbreaking, and adversarial suffix attacks.

Looking at components such as the context window, attention mechanisms, and token embeddings, this post explores how inputs are processed and why certain sequences can override intended behavior. This understanding should help analysts and security teams to recognize how LLM systems can be exploited in their environments.

The Taxonomy of Intelligence

To understand the attack surface, it can be helpful to locate LLMs within the broader hierarchy of artificial intelligence. The following terms are often used interchangeably within security research and threat intelligence reports, but they represent distinct architectural layers:

  • Artificial Intelligence (AI): The broad discipline of creating systems capable of performing tasks characteristic of biological intelligence, such as reasoning, learning, and perception.
  • Machine Learning (ML): A subset of AI focused on algorithms that learn patterns from data rather than being explicitly programmed.
  • Deep Learning (DL): A specialized subset of ML using multi-layered Neural Networks to model complex patterns. This is the engine of modern AI.
  • Large Language Models (LLMs): Deep Learning models trained on massive datasets with a single mathematical objective: to predict the next token (or tokens) in a sequence.

Much of the discussion around these topics tends to anthropomorphise how AI works, but an LLM does not literally “know” the capital of France: it calculates that “Paris” is the most likely token to follow a sequence such as “The capital of France is…”.

This probabilistic generation is one of the primary causes of “hallucinations”: the confident but incorrect assertions familiar to even casual users of LLMs. The same disconnect between token generation and semantic meaning also enables the attack vectors we discuss below.

The Inference Pipeline | High-Level Architecture

With that in mind, let’s explore how these models operate by tracing the end-to-end data flow.

When a user sends a prompt, the data traverses five distinct stages, powered by the Transformer architecture: the “T” in GPT (Generative Pre-trained Transformer). First introduced by Google in 2017, Transformers utilize parallelization and “self-attention” mechanisms to process sequences of text at scale.

  • Tokenization: Raw input text is split into atomic units, known as tokens, which are then mapped to discrete integers.
  • Embedding: The discrete integers are converted into long numeric arrays, or vectors, known as embeddings. This numeric array essentially represents the token’s semantic meaning. The embedding for “hacker,” for instance, would be mathematically closer to the embeddings for terms like “attack” or “exploit” than to a dissimilar term like “chair.”
  • Positional Encoding: A unique vector is added to each token’s embedding to give the model a sense of word order and grammatical dependencies.
  • Attention: The model calculates how strongly each token relates to every other token through a process called self-attention.
  • Decoding: The model predicts the probability of the next token. The selected token ID is then converted back to text.

This post examines the first four stages, where the disconnect between human semantics and machine representation enables specific attacks.
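The attention stage above reduces to a handful of matrix operations. Below is a minimal toy sketch of scaled dot-product self-attention, with made-up random inputs, invented dimensions, a single head, and no training; it shows only the mechanics, not any real model. Each token's output is a weighted mix of all the value vectors, with the weights derived from query-key similarity:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) token embeddings (after positional encoding).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv      # project into query/key/value spaces
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # how strongly each token relates to every other
    weights = softmax(scores)             # each row is an attention distribution summing to 1
    return weights @ V, weights           # output: weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8           # toy sizes, far smaller than any real model
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)                          # (4, 8)
print(weights.sum(axis=-1))               # each row sums to 1
```

Because the attention weights are computed purely from the input, an engineered token sequence can dominate these distributions, which is the mechanism behind the focus-hijacking attacks described in the Executive Summary.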

1. Tokenization & Filter Evasion

Neural networks cannot process raw text strings, so the first layer of abstraction is a process known as tokenization, which converts raw text into atomic units of processing.

While it may be intuitive to assume tokens map to words, modern architectures commonly utilize subword-based tokenization such as Byte Pair Encoding (BPE). This algorithm builds a vocabulary of variable-length units, including whole words, sub-words, and individual characters, by merging the most frequent sequences found in the model’s training data.

Compare a standard security log entry with how a model might tokenize it:


Input:
"EventID: 4688 | Image: C:\Windows\System32\powershell.exe | Command: -ExecutionPolicy Bypass"

Tokens:
["EventID", ":", " 4688", " |", " Image", ":", " C", ":\\", "Windows", "\\", "System32", "\\", "powershell", ".exe", " |", " Command", ":", " -", "Execution", "Policy", " Bypass"]

Tokenization is deterministic but distinct from linguistic morphology, such as decomposition into elements like roots and suffixes. Algorithms like BPE are statistical rather than grammatical, merging characters based solely on frequency in the training dataset, not semantic meaning. While ["powershell", ".exe"] aligns with human logic, the model might split “powershell” into ["power", "shell", ".exe"] or even smaller units such as ["pow", "er", "sh", "ell", ".", "e", "x", "e"] depending on the specific vocabulary established during the model training phase.
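To see why BPE merges are statistical rather than grammatical, the toy sketch below repeatedly merges the most frequent adjacent pair of symbols. The four-word "corpus" and its frequencies are invented for illustration and correspond to no real model's training data: after four merges, the frequent word "shell" collapses into a single symbol while the rarer "chair" remains individual characters.

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny made-up "training corpus": word -> frequency, split into characters.
words = {tuple("shell"): 10, tuple("shells"): 5, tuple("power"): 8, tuple("chair"): 2}
for _ in range(4):
    words = merge_pair(words, most_frequent_pair(words))
print(list(words))  # "shell" is one symbol; "power" and "chair" are still characters
```

The merge order is driven entirely by frequency counts, so which strings end up as single tokens is an artifact of the training corpus, not of linguistic structure.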

This disconnect between human language structure and machine statistics makes filter bypass possible.

Attack Vector | Filter Bypass

Tokenization boundaries can hide malicious payloads when security filters and the model operate at different representation levels.

For example, a static keyword blocklist might check input as plain text strings and block the string “powershell”. However, if the LLM processes the input as tokens like ["power", "shell"], the filter might fail to trigger against the prompt.

Adversaries actively optimize prompts to exploit these boundaries, utilizing techniques such as Adversarial Tokenization. The model reassembles the semantic meaning while the filter only sees fragmented syntax.
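A minimal sketch of that mismatch follows. The tokenizer and blocklist here are invented for illustration and do not correspond to any real model or security product; the point is only that a filter scanning individual tokens never sees a blocked string the tokenizer has split apart:

```python
BLOCKLIST = {"powershell"}

def naive_token_filter(tokens):
    # Flags the prompt only if a blocked word appears as a single token.
    return any(tok.lower() in BLOCKLIST for tok in tokens)

def toy_tokenizer(text):
    # Hypothetical subword split: "powershell" -> "power" + "shell".
    tokens = []
    for word in text.split():
        if word.lower() == "powershell":
            tokens += [word[:5], word[5:]]
        else:
            tokens.append(word)
    return tokens

prompt = "run powershell with bypass"
tokens = toy_tokenizer(prompt)
print(tokens)                              # ['run', 'power', 'shell', 'with', 'bypass']
print(naive_token_filter(tokens))          # False: no single token matches the blocklist
print("powershell" in "".join(tokens))     # True: the model still sees the full sequence
```

The filter operates on fragments while the model attends across them, so the semantic payload survives even though the syntactic signature does not.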

2. Embedding & Gradient-Based Attacks

Once tokenized, each token is mapped to a discrete integer known as a Token ID. For example:


"The"      → 464
"analyst"  → 18291
"security" → 12961

[...]
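A small sketch of what happens next, with made-up dimensions and hand-crafted vectors (a real embedding matrix is learned during training, and these values are purely illustrative): the Token ID simply indexes a row of the embedding matrix, and semantic similarity corresponds to vector proximity, typically measured with cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(50_000, 768))   # toy (vocab_size, d_model) embedding matrix
the_vec = E[464]                     # embedding lookup is just a row index by Token ID
print(the_vec.shape)                 # (768,)

# Hand-crafted 3-d "embeddings" so that related terms point in similar directions.
embeddings = {
    "hacker":  np.array([0.9, 0.8, 0.1]),
    "exploit": np.array([0.8, 0.9, 0.2]),
    "chair":   np.array([0.1, 0.0, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 for parallel vectors, near 0 for unrelated ones.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["hacker"], embeddings["exploit"]))  # close to 1.0
print(cosine(embeddings["hacker"], embeddings["chair"]))    # much smaller
```

Because these vectors live in a continuous space, gradient-based methods can search it directly for inputs that steer the model, which is what distinguishes the attacks in this section from simple keyword evasion.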

