LLM vector and embedding risks and how to defend against them

As large language model (LLM) applications mature, the line between model performance and model vulnerability continues to blur.

While vector embeddings have become foundational to Retrieval-Augmented Generation (RAG), recommendation systems, and semantic search, their improper handling introduces new attack surfaces that can compromise both LLM behavior and user data.

In this third post in our blog series exploring the Open Worldwide Application Security Project (OWASP) Top 10 for Large Language Model Applications, we focus on “Vector and Embedding Weaknesses,” a risk category covering how subtle manipulation of the vector space can lead to data poisoning, behavior modification, and data leakage.

What Are Vector and Embedding Weaknesses?

Vector embeddings are numerical representations of text and other content, arranged so that semantically similar items sit close together. This lets LLM applications reason about similarity and relevance. Embeddings are typically generated from user inputs or external documents, then matched against a vector store to augment responses, a technique central to RAG.
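To make the matching step concrete, here is a minimal sketch of similarity-based retrieval. It assumes a toy in-memory store with hand-written three-dimensional vectors; `TOY_STORE`, `retrieve`, and the document IDs are hypothetical names for illustration, and a real system would obtain vectors from an embedding model and use a dedicated vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve(query_vec, store, top_k=2):
    """Return the IDs of the top_k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical toy embeddings, not produced by any real model.
TOY_STORE = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "privacy-notice": [0.0, 0.2, 0.9],
}

query = [0.8, 0.2, 0.1]  # stand-in for an embedded user question
print(retrieve(query, TOY_STORE))
```

Whatever ranks highest here is what gets handed to the LLM as context, which is why anything that can influence the stored vectors can also influence the answer.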

However, these embeddings are vulnerable:

  • Malicious inputs can be crafted to poison the embedding space, misleading LLMs into returning incorrect or adversarial results.

  • Attackers may insert embedding collisions, where crafted text shares near-identical vector values with legitimate content.

  • Poor hygiene in vector storage or indexing can lead to data exposure, especially when embeddings encode sensitive information.
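As one illustration of guarding against the collision scenario above, the following sketch flags a new document at ingestion time when its vector is nearly identical to an existing entry while its text differs. This is an assumed heuristic, not a method described by OWASP or the original post: `find_suspected_collisions`, the `EXISTING` store, and the 0.98 threshold are all hypothetical, and legitimate paraphrases would also trip it, so a match is better routed to review than rejected outright.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_suspected_collisions(new_text, new_vec, store, threshold=0.98):
    """Return IDs of stored entries whose vector nearly matches new_vec
    even though their text differs from new_text."""
    suspects = []
    for doc_id, (text, vec) in store.items():
        if text.strip().lower() == new_text.strip().lower():
            continue  # exact duplicates are a separate concern
        if cosine_similarity(new_vec, vec) >= threshold:
            suspects.append(doc_id)
    return suspects


# Hypothetical store: doc_id -> (text, toy embedding).
EXISTING = {
    "kb-001": ("Password resets require identity verification.",
               [0.70, 0.70, 0.10]),
    "kb-002": ("Invoices are issued monthly.", [0.10, 0.20, 0.95]),
}

suspicious = find_suspected_collisions(
    "Password resets never require verification.",
    [0.71, 0.69, 0.10],  # nearly the same vector as kb-001
    EXISTING,
)
print(suspicious)
```

The threshold is a tuning decision: set too low, it buries reviewers in false positives, and set too high, it misses crafted near-matches.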

In short, embedding vulnerabilities undermine the trustworthiness of the retrieval pipeline itself, which RAG relies on to ground LLM responses in factual data.

Real-World Risks: From Semantic Poisoning to Data Leaks

The OWASP LLM Top 10 highlights several real-world examples of how vector and embedding weaknesses can manifest:

  • Hidden instructions in embedded content: Attackers can insert invisible prompts, such as white text on white backgrounds, into documents submitted to systems powered by RAG. When these documents are embedded and later retrieved, the hidden text can manipulate the LLM into producing biased ...

*** This is a Security Bloggers Network syndicated blog from 2024 Sonatype Blog authored by Aaron Linskens. Read the original post at: https://www.sonatype.com/blog/llm-vector-and-embedding-risks-and-how-to-defend-against-them