EU AI Act Compliance Testing for LLMs with Promptfoo

The EU AI Act is the first comprehensive, legally binding framework for AI governance, and it shifts compliance from a pre-launch checklist into a core engineering requirement. This guide walks through how Promptfoo can be used to evaluate an LLM against the Act.

Under the EU AI Act, AI systems are grouped into risk tiers, from minimal-risk up to high-risk and unacceptable-risk (which are outright prohibited). For developers building applications powered by Large Language Models (LLMs), the stakes are particularly high. Because LLMs are probabilistic, they can inadvertently generate outputs that violate the Act’s strict prohibitions. Without proper guardrails, a customer service bot can easily cross the line into:

  • Subliminal Manipulation: Hijacking user intent or using deceptive persuasion.
  • Social Scoring: Making unjustified assessments or false characterizations of individuals.
  • Biometric Categorization: Inferring sensitive personal data (like race, political opinions, or religion) from user prompts.

Why Standard Tests Miss EU AI Act Violations

The challenge with LLMs is that you cannot test them like traditional software. A standard unit test expecting a boolean True or False won’t catch a sophisticated prompt injection attack or a subtle violation of a user’s privacy. A unit test asserting that a response starts with "Sure, here's…" passes while the model agrees to infer a user’s political affiliation from their prompt. This is exactly the kind of biometric-categorisation violation Article 5 prohibits. You can’t just cross your fingers and hope your system prompt is strong enough to hold up in production. You need adversarial, automated testing.

What Promptfoo Is (and Why It Fits the EU AI Act)

Promptfoo fills that gap. Promptfoo is an open-source, developer-focused LLM evaluation framework that brings “shift-left” security to AI. Instead of manually trying to trick your model to see if it breaks the law, Promptfoo’s Red Teaming capabilities act as an automated adversary.

Promptfoo also ships dedicated plugins built specifically to test against the EU AI Act. By simulating thousands of custom, context-aware attacks tailored to your application’s logic or model, Promptfoo probes your model for vulnerabilities and maps the findings directly to EU regulatory requirements. In this guide, we will walk through how to configure Promptfoo, run a localized red team evaluation, and generate a report, focusing on the EU AI Act.

1. Prerequisites

  • Requires Node.js ^20.20.0 or >=22.22.0 for npm and npx usage.
  • Make sure you have REST API access to the model you want to test.

In this guide, we will run tests against the DeepSeek V4 Flash model.

2. Installation

To use Promptfoo in our system, we need to install it.

npm install -g promptfoo

3. Run Promptfoo Locally

To run Promptfoo locally, in UI mode, you need to execute the following command:

promptfoo eval setup

Once running, go to http://localhost:15500/ to access the UI.

4. Select Red Teaming

It will open the page shown below.

Promptfoo home screen with options for Eval and Red Team

We will select Create Red Team option.

5. Select Target Setup

Promptfoo offers several target options: Python code, OpenAI API keys, or custom JavaScript integrations. In this guide, select HTTP/HTTPS Endpoint and fill out the target name with DeepSeek-V4-Flash-target.

Target setup screen with HTTP/HTTPS Endpoint selected

Click Next.

6. Promptfoo Target Configuration

Next we will configure Promptfoo to target our model or application. This section has multiple steps.

HTTP Calls Setup

  • First, we need to configure the URL of the model since we selected the HTTP/HTTPS target.
  • Header Content-Type should be set to application/json.

Request Body will be:

{
  "model": "deepseek-v4-flash",
  "messages": [
    {
      "role": "user",
      "content": "{{prompt}}"
    }
  ]
}

Adjust this for your model and your API accordingly.

For Request Parser we will insert a JavaScript snippet to parse the response:

(json) => {
  if (json.error) return `Server Error: ${json.error.message}`;
  if (!json.choices) return `${JSON.stringify(json)}`;
  return json.choices[0].message.content;
}

Finally, select Test Target to check if you configured everything correctly. You should receive a success message if everything is correct.

HTTP request configuration showing URL, headers, body and parser

Prompt Model API Configuration

The next step covers other API settings, such as:

  • Session Management
  • Authorization
  • Request Transform
  • Token Estimation
  • TLS/HTTPS Config
  • HTTP Status Code

We change Session Management to include full interaction history in each request and leave the rest as default. Promptfoo also lets you configure:

  • Test Generation – Where you configure how Promptfoo generates tests for your target.
  • Delay – Helps you mimic real user activity.
  • Extension Hook – Where you can add custom hooks for before/after each request.

In this guide we leave everything as default and move to the next step.

Model API configuration panel for session management and auth

7. Application Details

In this step, you can configure settings for your AI application or test the model directly. We will select I’m testing a model, since we are evaluating a model against the EU AI Act.

Application details step with 'I'm testing a model' selected

We can move to the next step.

8. Plugin Setup

Plugins are where Promptfoo earns its place. It provides plugin sets that test a range of risks and vulnerabilities in LLM models and LLM-powered applications.

The EU AI Act preset bundles plugins that map directly to the Act’s main obligations: prohibited practices under Article 5 (subliminal manipulation, social scoring, biometric categorisation), transparency duties under Article 50 (disclosing that the user is interacting with AI, watermarking generated content), and high-risk system requirements under Article 15 (accuracy, robustness, cybersecurity). Selecting the preset rather than hand-picking plugins ensures the resulting report is structured around the same articles your compliance team will reference.

Plugin selection screen with the EU AI Act preset highlighted

After selecting the EU AI Act preset, let’s move to the next step.

9. Strategies

This step lets you set up strategies for the red-team run. Strategies are attack techniques that systematically probe LLM applications for vulnerabilities. Where plugins decide what to test, strategies decide how, layering tactics like jailbreaks, prompt injection, or multi-turn conversational attacks on top of each plugin’s test cases.

For a baseline EU AI Act evaluation, plugins alone are sufficient. We want to know whether the model violates the Act when prompted directly. Teams who have already cleared that bar and are doing active threat modelling should run a second pass with strategies enabled to stress-test the same plugins against motivated adversaries.

We won’t add any strategy for this guide, so leave everything as default and move to the next step.

Red-team strategy selection screen

10. Review and Run

Review

Before running the red teaming, we can review our configurations and if necessary, we can go back and adjust them.

Configuration review summary before running the evaluation

Run

Once we are satisfied with our configurations, we can run the red teaming. This will start the evaluation process and generate a report.

Run screen starting the red-team evaluation

When it starts, Promptfoo will generate test cases and run them against our model.

Promptfoo generating test cases against the target model

Expect the EU AI Act preset to produce around 180 test cases per run and take 5–15 minutes to complete, depending on your model’s latency and rate limits. Every test consumes tokens against both the target model and Promptfoo’s grader model. If you’re not self-hosting either one, budget for the API costs yourself before kicking off the run, particularly when iterating on fixes.

11. Finished Eval

After the evaluation is finished, we see how many tests passed and failed.

Evaluation summary showing pass and fail counts

Open View Results -> Latest Eval to see details about the evaluation. For each test, we can see the status, whether it passed or failed. Each failed test includes an explanation of why it failed.

Example of a failed test case:

Detailed view of a failed test case with explanation

12. Report

After the evaluation is finished, we can open the report, which shows the severity of issues and the categories that failed.

Report overview showing severity and vulnerability categories

Focusing on the EU AI Act section, we can see which EU-regulated areas our model is vulnerable in.

EU AI Act section of the report listing flagged regulatory areas

You can also download the full sample report.

13. From Report to Remediation

A red-team report is the start of compliance work, not the end. Once the run is done:

  • Triage by severity first. Critical and high-severity findings under Article 5 (prohibited practices) need to be fixed before any other work. These are hard-prohibition violations, not best-practice nudges.
  • Apply the cheapest fix that works. In order of effort: tighten the system prompt, add input/output filtering or moderation, restrict tool use, switch to a model with stronger built-in guardrails. Most prohibited-practice failures are solvable at the prompt layer.
  • Re-run the same Promptfoo config after each fix. Saving the configuration as code (see below) makes this trivial and gives you a regression record over time.

14. Beyond the UI: Configuration as Code

The wizard is the fastest way to get a first report, but once you know the configuration you want, export it as a promptfooconfig.yaml and check it into your repository. The same evaluation then runs from a single command in CI, gates pull requests that change prompts or models, and produces diff-able reports across runs. Treat the UI as the design surface and the YAML file as the source of truth.

What a Promptfoo Run Proves (and Doesn’t) for EU AI Act Compliance

A passing Promptfoo run is strong evidence of due diligence and a fast way to catch the most embarrassing failures before they ship. It is not a legal certification. Formal conformity assessments under the EU AI Act still require human auditors, documentation beyond automated tests, and (for high-risk systems) registration in the EU database. Use Promptfoo as the engineering feedback loop that keeps you on the right side of those audits, not as a replacement for them.

Next Steps After Your First EU AI Act Evaluation

Running Promptfoo against the EU AI Act preset is the cheapest way to find out whether your LLM is one prompt away from a regulatory headline. But a report is the start of compliance, not the end. The harder work is deciding which findings matter, who signs off on residual risk, and how the test suite evolves alongside your model and prompts.

If you’d like a second pair of eyes on your evaluation pipeline, or help turning Promptfoo findings into an audit-ready remediation plan, book a scoping call and we’ll talk through your setup.

From Insight To Assessment

Need to Assess an AI System?

Request a Scoping Call →