Keeping Large Language Models on a Leash: A Short Summary of Prompt Injection Fix Recommendations

Mateusz Wojciechowski
A hound on a leash that represents the constraints that we put on Large Language Models.

Some time ago, our team engaged in an extensive penetration test of an application that utilized an LLM. We conducted this part of the pentest using the OWASP Top 10 for LLM, which I have described in this article.

A few of the discovered vulnerabilities were connected to the app’s usage of AI, most notably Prompt Injections. This sparked a vigorous internal discussion about what we should put in a list of recommendations for fixing these issues.

The results of this discussion were a starting point in writing this blog post.

Vulnerabilities

Aside from the usual web app vulnerabilities, the application we pentested contained some specifics related to using an LLM. For example, the application failed to restrict the AI to only engage in topics related to its business purpose. But the most serious issues were Direct and Indirect Prompt Injections.

The situation was particularly severe because it enabled influencing the output of LLMs for other users.

Fixes

A big part of our job as penetration testers is putting together recommendations on how to fix the issues we have found. Usually, it is pretty straightforward. If you find an XSS, the remediation is to encode data according to its context properly. If you have an SQL injection, you use the parametrization of user input. You know the drill. However, in the case of LLMs broadly and Prompt Injection vulnerabilities in particular, things aren’t that simple. No remediation short of just ceasing to use LLMs altogether will provide bulletproof security. Preparing the Prompt Injection fix recommendations is much trickier and requires more customization to fit the business case.

The landscape of Prompt Injection remediation is vast, colorful, and everchanging. We have multiple, completely different approaches and philosophies and a dazzling diversity within them. The main methods currently proposed to mitigate these vulnerabilities or lower their risk are:

  • Input pre-processing
  • Guardrails
  • Design patterns
  • Mixture of Experts
  • Prompt Engineering
  • Finetuning
  • Reducing the impact

That’s a lot of different mitigations for the same vulnerability, isn’t it? For the time being, though, this might be a good thing, as we have more flexibility in picking and choosing fixes. In the next few articles,

I will dive into them a lot more, so stay tuned for that. For now, I will just briefly characterize them for you.

Input pre-processing

This method depends on transforming the user-supplied input to make creating an adversarial prompt more difficult. The main categories here are paraphrasing and retokenization. These can mean rewriting user input using another language model to be more concise or differently worded, character or word substitutions, or breaking up LLM tokens into smaller parts.

All of these actions should, in theory, preserve the benign instructions while disrupting adversarial ones.

Guardrails

Guardrails are a family of methods for monitoring the LLMs’ inputs and outputs. This is done to detect Prompt Injection attacks or their results. Those methods might be as simple as using good old regexes or much more complex.

An interesting example is utilizing another LLM to judge the harmfulness of the user input.

Design patterns

This method relies on using multiple models with different levels of permissions organized in specific design patterns. Well-structured and defined data is passed between them to fence the model with access to high-stake actions from the untrusted input.

This one, in my opinion, has the potential to someday become a more general solution to Prompt Injection.

Mixture of Experts

This category uses multiple different models or a few instances of the same model. Those can generate multiple outputs and combine them or vote over the validity of a singular output.

This forces the attacker to fool the most models with their payload instead of just one.

Prompt Engineering

A family of methods that has a common theme of tampering with the LLM’s prompt. This might mean structuring system prompts in a certain way or encapsulating user queries within it. Another interesting method is enforcing a strict instruction hierarchy where system instructions have the highest priority.

This is done to ensure that the model stays on track with its task and is less prone to Prompt Injection.

Finetuning

This is an additional learning phase that a model undergoes. It is meant to make it harder to derail the LLM from its intended purpose. The learning is usually task-specific, and this approach increases the probability that the model will stick to this task.

This approach requires additional computation and data preparation so it might become quite expensive to conduct.

Reducing the impact

This is not a method of preventing Prompt Injection from occurring but of limiting its impact when someone exploits it. Those measures are highly application-specific, but in general, they include building a high wall between the LLM and any private data, employing data parsing and sanitization before using the model’s output in other systems, and fencing the LLM from high-stakes actions, as it cannot be trusted with them.

Sadly, you must treat LLM’s output as potentially malicious, just like you would treat user input.

Problems

All of those methods have a significant, immediately apparent drawback. They don’t provide absolute security, just some degree of probability that an attack will not succeed. And as Simon Willison has rightly pointed out (and I’m sure most of you will agree with him) – 99% is a failing grade in application security. It doesn’t mean that nobody should use LLMs for anything – pretty much every system is exploitable in some way, and it does not stop us from developing them. It does mean, however, that not every use case is feasible for LLMs right now from a security standpoint. Any high-stakes agentic behavior like buying and selling, modifying data, making appointments, etc., is off the limits for now due to the risk of exploitation.

To put it bluntly, you can contend with a 1 in 100 risk of a hallucination or misinformation but not of a fraudulent purchase or changing your password.

Conclusions

Preparing a list of clear, effective, and actionable recommendations for vulnerabilities we found might be the most challenging part of our jobs as penetration testers. This is immediately and intensely apparent regarding AI penetration testing. The pace of change in this field and the level of complexity far surpasses most of the regular application security. There are a lot of unsolved problems and vulnerabilities that have not been remedied. Staying up to date with current best practices is a difficult task, but I will do my best to help you in this journey.

See you soon in the next blog post, in which I will delve deeper into the specifics of one of the Prompt Injection defenses mentioned here.

Mateusz Wojciechowski
Head of AI | Offensive Security Engineer

Is your company secure online?

Join our list of satisfied customers and safeguard your company’s data!

Trust us and leave your contact details. Our team will contact you to discuss the details and prepare a tailor-made offer for you. Full discretion and confidentiality of your data are guaranteed.

Willing to ask a question immediately? Visit our Contact page.