EU AI ACT · ARTICLE 10 · DATA GOVERNANCE · ENFORCEMENT 02.08.2026

Art. 10 AI Act: 7 data governance mistakes that cost €15M penalty

Piotr Reder · aiactaudit.pl 05 May 2026 · ~14 min read

Most articles about EU AI Act focus on risk classification (Annex III) or GPAI obligations. Those are flashy topics. But when your system falls into high-risk and an audit comes knocking, the first question you'll get is not "how much did it cost" or "do you have human oversight". The first question is: show me the training data documentation. That's Art. 10 AI Act.

Out of the 22 requirements in Art. 9-15 for high-risk systems, Art. 10 Data Governance is the most-overlooked section in EU SMB SaaS audits I've seen. Consultants sell fluff "AI ethics workshop". Meanwhile European AI Office and EU notified bodies expect concrete documentation of data lineage, bias testing, data quality controls. Missing any = €15M penalty per Art. 99(4).

TL;DR

Art. 10 requires 7 things: data lineage (where and when), data minimization (why these and not others), bias detection + mitigation (with concrete metrics), sensitive data justification (Art. 10(5) — special category data only if necessary), separation of training/validation/test sets (no contamination), data quality controls (completeness, accuracy), documented procedures (auditable trail). 60% of EU SMB SaaS skip at least 3 of these 7. Each gap = potential €15M penalty for high-risk systems.

What is Art. 10? (overview)

Art. 10 AI Act covers data governance and management practices for high-risk AI systems (Annex III). It's enforceable from 02.08.2026 for most high-risk obligations, but this article is the "data foundation" — without Art. 10 compliance there's no point in talking about Art. 14 (human oversight) or Art. 15 (accuracy). No data, no nothing.

Six subsections of Art. 10:

Art. 10(1) — high-risk systems trained with data must be developed based on training/validation/test sets meeting quality criteria in (2)-(5)
Art. 10(2) — data governance practices must be in place: design choices, data collection processes, data prep operations, formulation of relevant assumptions, prior assessment of availability/quantity/suitability, examination for biases, identification of gaps
Art. 10(3) — training/validation/test sets must be relevant, sufficiently representative, free of errors, complete in view of intended purpose
Art. 10(4) — sets must take into account characteristics of geographical, behavioural, contextual, functional setting
Art. 10(5) — special category personal data (GDPR Art. 9 — race, health, biometrics, etc.) only if strictly necessary for bias detection/correction, with appropriate safeguards
Art. 10(6) — appropriate measures to detect, prevent, mitigate biases

Penalty per Art. 99(4): €15M or 3% global annual turnover, whichever is higher (lower for SMB per Art. 99(6)).

Who must comply? (scope)

Art. 10 applies to providers of high-risk systems. If your system falls into one of the 8 Annex III areas and your company is a provider (developing/training/placing on market), Art. 10 is mandatory. Decision tree for Annex III.

Full scope:

Provider high-risk AI system = full Art. 10 obligations
Deployer high-risk AI system = lighter Art. 26 obligations + cooperation with provider on Art. 10 issues
Distributor / Importer high-risk AI system = checks whether provider met Art. 10 (Art. 25-27)

For SMB SaaS the most common case: you're the provider of your high-risk system (you wrote it), even if you use GPAI API from OpenAI/Anthropic for underlying inference. Provider vs deployer in GPAI context.

7 data governance mistakes for EU SMB SaaS

List based on audits I've seen (initial benchmark) + analysis of public AI compliance reports from 2025-2026 (Anthropic, OpenAI, Mistral compliance disclosures). Sorted from most to least common.

Mistake #1 — Missing training data lineage

What it is: you don't know where your training data came from. "From the internet", "from customers", "scraped". That's NOT data lineage per Art. 10(2)(b).

What Art. 10 requires: for each category of training data — source identification (URL / API / vendor / contract reference), collection date range, collection method (scrape / purchase / user-submitted), licensing posture (PD / CC / proprietary / consent-based), volume metrics (count + size).

Frequency in EU SMB: ~80% of SaaS using training data have no formal data lineage. An Excel file with URLs is not enough.

Penalty risk: 🔴 high. Audit starts with this question. No lineage = audit fail = potentially €15M.

Fix: data lineage doc with 5 fields per source: name + URL/contract + dates + license + volume. For scraping — robots.txt compliance log. Maintain per release. 4-8h initial work, 2h/release ongoing.

Mistake #2 — Bias detection without metrics

What it is: you say "we test for bias" but you don't have concrete numbers. "We checked it works well across groups" — that does not satisfy Art. 10(2)(f) and Art. 10(6).

What Art. 10 requires: identifiable bias metrics per protected category (gender, age, ethnicity, disability, geography per Art. 10(4)). Standard frameworks: demographic_parity_difference, equalized_odds_difference, predictive_parity_ratio, disparate_impact_ratio. Plus baseline measurement + post-mitigation re-measurement.

Frequency in EU SMB: ~70% of audits show "we tested for bias" without concrete metric values or group breakdowns.

Penalty risk: 🔴 high. Audit looks for numbers. No numbers = audit fail.

Fix: pick 2-3 standard metrics (e.g. demographic parity + equalized odds), measure baseline on test set with protected attributes (or proxy if sensitive data not collected), document in technical documentation Annex IV. Tools: Aequitas (open source), IBM AI Fairness 360, Microsoft Fairlearn. 2-3 days of data scientist work.

Mistake #3 — No data minimization in collection

What it is: you collect everything you can, "just in case". Art. 10(2)(d) requires relevant assumptions about information data is supposed to measure and represent + Art. 10(3) requires relevance/representativeness. Crosswalks with GDPR Art. 5(1)(c) data minimization principle.

What Art. 10 requires: documented justification for each feature/column in training set — what it's supposed to measure/represent. Features without clear purpose = audit red flag. From the perspective of Art. 10 + GDPR overlap, "we collect everything" is a legal nightmare.

Frequency in EU SMB: ~60% of SaaS collect features that aren't used in the final model or have no clear documented purpose.

Penalty risk: 🟠 medium-high. Penalty under Art. 99(4) AI Act + GDPR Art. 83(5) double-stacking (max GDPR €20M / 4% turnover).

Fix: data dictionary doc — list of all features in training set with 3 fields: purpose (why you collect), used in model (yes/no — if no, why still collecting), retention policy (how long kept). Drop features without clear purpose. 1-2 days of work.

Mistake #4 — Sensitive data without Art. 10(5) justification

What it is: you collect or use GDPR Art. 9 special category data (health, biometrics, race, religion, political views, orientation, union membership) in training but you don't have a strict necessity justification per Art. 10(5).

What Art. 10(5) requires: sensitive data only when "strictly necessary" for bias detection / correction, with appropriate safeguards (encryption, pseudonymization, retention limits, access controls, no transmission to third parties without consent). Plus DPIA documentation per GDPR Art. 35.

Frequency in EU SMB: ~30% of SaaS using behavioral data have elements of sensitive data (e.g. health-tech, fertility apps, mental health). Most without formal Art. 10(5) justification.

Penalty risk: 🔴 very high. Sensitive data violations from AI Act + GDPR double penalty, plus class action exposure (consumer rights).

Fix: if you use sensitive data — dedicated Art. 10(5) Justification doc with: (1) what you collect specifically, (2) why strictly necessary (alternatives explored), (3) safeguards in place, (4) retention timeline, (5) DPIA reference. If possible — don't collect (proxy or synthetic alternatives). 2-3 days work + legal review €1-2k.

Mistake #5 — Validation/test sets confused with training

What it is: you don't have proper separation of training / validation / test sets per Art. 10(1). Either you have only train + test (without validation), or you use test set for hyperparameter tuning (= contamination).

What Art. 10 requires: three distinct sets: training (model fits weights), validation (hyperparameter tuning, model selection), test (final evaluation only — touched once). Plus documentation of when data was partitioned, what seed, what ratio.

Frequency in EU SMB: ~50% of startup ML pipelines have improper separation. Often "test set leak" through time-series cross-validation issues or by using test set for model selection.

Penalty risk: 🟠 medium. Audit may show that accuracy claims are overstated due to data leakage = misleading documentation = Art. 99(4) violation.

Fix: proper 70/15/15 split (or time-based split for time series), document in data prep doc, lock test set in separate file/encryption, pull only at final evaluation. Tools: scikit-learn train_test_split with random_state, MLflow or Weights & Biases for experiment tracking. 4-8h of work.

Mistake #6 — No data quality controls (errors/missing/duplicates)

What it is: you don't have a procedural check whether training data is "free of errors and complete" per Art. 10(3). Nobody checked missing values, duplicates, outliers, encoding issues, label noise.

What Art. 10 requires: documented data quality assessment with metrics: completeness rate (% missing values per feature), duplicate rate, label accuracy (manual sample audit, e.g. 100 records), encoding consistency, outlier detection (with disposition log: kept / removed / corrected).

Frequency in EU SMB: ~75% of startup ML produces models without formal data quality report. "It works in tests" is insufficient documentation.

Penalty risk: 🟠 medium. Audit may require re-training if quality issues undermine safety claims.

Fix: data quality report per training run: pandas profiling or Great Expectations for automatic report, manual label audit on 100-200 samples (with second annotator for agreement), outlier disposition log. Auto-include in technical documentation Annex IV. 1-2 days initial, 0.5 day per release.

Mistake #7 — No documented data governance procedure

What it is: everything about data is done ad-hoc. Each ML engineer has their own way. There's no procedure document. Art. 10(2) explicitly requires "data governance and management practices" — that's a procedural requirement, not just technical.

What Art. 10 requires: written procedure document covering: data sourcing approval, quality gates before training, bias testing checklist, sensitive data handling, validation protocols, retention/deletion schedule, incident response (e.g. discovered bias post-deployment), version control. Roles + responsibilities (who is data steward).

Frequency in EU SMB: ~85% don't have a formal data governance procedure document. Tribal knowledge, no audit trail.

Penalty risk: 🟡 medium-low individually, but combined with mistakes #1-6 = audit fail compounding.

Fix: 5-10 page Data Governance Procedure doc — template available in our sample audit. Ownership: one person (CTO / Lead ML / Compliance Lead). Review quarterly. 1-2 days of initial draft work.

Decision tree — does your AI have a data governance gap?

┌─ Is your AI high-risk per Annex III?
│   (recruitment, credit, education, healthcare, biometrics,
│    infrastructure, law enforcement, migration, courts)
│
├─ NO → Art. 10 doesn't apply. You can skip this article.
│
└─ YES → Art. 10 mandatory. Continue:

  Q1: Do you have documented training data lineage
      (source / dates / licenses / volumes per category)?
      ├─ NO → 🔴 GAP #1. High audit failure risk.
      └─ YES → Q2

  Q2: Do you have bias metrics with concrete values
      per protected category (gender/age/ethnicity/etc.)?
      ├─ NO → 🔴 GAP #2. High risk.
      └─ YES → Q3

  Q3: Does each feature in training data have
      documented purpose and used-in-model status?
      ├─ NO → 🟠 GAP #3. Medium risk (+ GDPR exposure).
      └─ YES → Q4

  Q4: If you use sensitive data (GDPR Art. 9),
      do you have Art. 10(5) Strict Necessity doc + DPIA?
      ├─ Use sensitive data, no doc → 🔴 GAP #4. Critical.
      ├─ Don't use sensitive data → Skip to Q5
      └─ Have doc → Q5

  Q5: Do you have proper train/validation/test separation
      without data leakage?
      ├─ NO → 🟠 GAP #5.
      └─ YES → Q6

  Q6: Do you have data quality report
      (completeness/duplicates/label audit) per training run?
      ├─ NO → 🟠 GAP #6.
      └─ YES → Q7

  Q7: Do you have Data Governance Procedure document
      with named roles + responsibilities?
      ├─ NO → 🟡 GAP #7.
      └─ YES → ✅ Art. 10 likely compliant.
                  Audit recommended for legal certainty.

Of 7 gaps, if you have 3 or more = high probability of audit failure. If 1-2 = manageable, fix before enforcement deadline 02.08.2026. If 0 = you're in the top 5% of EU SMB SaaS in data governance.

GDPR crosswalk — where Art. 10 AI Act overlaps with GDPR

Art. 10 AI Act does not replace GDPR — it operates on top of it. Cross-references every compliance lead must understand:

Art. 10 AI Act	GDPR equivalent	Implication
(2)(a) design choices	Art. 25 privacy by design	Documented design decisions required by both
(2)(b) data collection	Art. 6 lawful basis + Art. 5(1)(b) purpose limitation	Collection must have lawful basis + AI purpose
(2)(c) data prep	Art. 5(1)(c) data minimization	Pre-processing must minimize, not maximize
(2)(f) bias examination	Art. 22 automated decision-making	Bias = non-discrimination obligations
(3) error-free / complete	Art. 5(1)(d) accuracy principle	Inaccurate data = GDPR violation also
(5) sensitive data	Art. 9 special category data	Strict necessity test = GDPR Art. 9(2)(g) substantial public interest
(6) bias mitigation	Art. 5(1)(a) fairness principle	Bias = unfairness, GDPR violation

Practical implication: if you're already GDPR-compliant with DPO and DPIA culture, Art. 10 AI Act is 60% done. The missing 40% is AI-specific bits (bias metrics, training/validation separation, data lineage in ML pipeline context).

If you're NOT GDPR-compliant — you have a bigger problem than AI Act. GDPR enforcement is long live, max penalty €20M / 4% turnover.

SMB-friendly framework — 5 steps to compliance

Full Art. 10 compliance is ~6-8 weeks for EU SMB SaaS. You can phase it:

Step 1 — Data inventory (1-2 days)

List of all training data sources used in high-risk system. Per source: name, URL/contract, date acquired, license, volume, modality. Excel/Notion OK to start, formalize in docs in summer.

Step 2 — Data dictionary + minimization (2-3 days)

Per feature: purpose, used in model (Y/N), retention. Drop unused features. Check sensitive data — if present, pivot to Step 5.

Step 3 — Data quality report + train/val/test separation (3-5 days)

Pandas profiling or Great Expectations for automated quality checks. Manual label audit on 100-200 records. Lock test set. Document.

Step 4 — Bias metrics baseline (3-5 days)

Pick 2-3 metrics (demographic parity + equalized odds + predictive parity). Run on test set with protected attributes (real or proxy). Document baseline values. Plan re-measurement post-mitigation.

Step 5 — Sensitive data + DPIA (5-10 days if applicable)

If you use GDPR Art. 9 data: strict necessity justification + DPIA + safeguards. Legal review €1-2k. If you don't use — skip.

Step 6 — Procedure document + governance roles (2-3 days)

5-10 page Data Governance Procedure with named roles, escalation paths, review cadence. Sign-off by CEO/CTO.

Total effort: ~3-4 weeks of focused work for SMB without sensitive data, ~6-8 weeks with sensitive data + DPIA. Ongoing cost: ~10% of team capacity for maintenance + per-release updates.

Common pitfalls for SaaS using LLM API

"We use GPT-4 / Claude API, so we don't train our own model — Art. 10 doesn't apply?"

Half-true. Art. 10 covers training data for your high-risk system. If LLM API is only inference in your pipeline (input → API → output), training data for your "system" is prompts + few-shot examples + system prompts + RAG documents you collect or use. Art. 10 applies to those.

Concrete mapping:

Few-shot examples in system prompt = training-like data for your system. Need data lineage + bias check.
RAG corpus (knowledge base, vector DB) = training-like data. Need licensing review + sensitive data screening.
User feedback / ratings used for prompt tuning or model selection = training data per Art. 10.
Fine-tuning data if you use OpenAI fine-tuning or LoRA on open-source = full training data per Art. 10. You may become a GPAI provider in some cases.

Misconception: "API-based AI = we don't train = no Art. 10". Reality: Art. 10 covers data driving system behavior, not just model weights. Your prompts + RAG + few-shot examples are data that trains system behavior. Audit will look for them.

Penalties Art. 99(4) — €15M / 3% global turnover

Violation of Art. 10 (as part of Art. 9-15 high-risk obligations) is subject to Art. 99(4): €15 million or 3% global annual turnover, whichever is higher (per Art. 99(6) lower-of for SME — which per Recommendation 2003/361/EC = <250 employees and <€50M turnover).

Historical signal: AccessiBe FTC settlement $1M in 2025 (US, accessibility AI false claims). EU enforcement on high-risk AI Act doesn't have precedent yet, but GDPR enforcement is predictive: Meta €1.2B (2023), Amazon €746M (2021) — EU regulators are not afraid of large fines.

First wave of AI Act enforcement expected Q4 2026 - Q1 2027 (3-6 months post-deadline 02.08.2026). Targets: media-prominent companies, or those with public complaints (CV screening, predictive lending).

Check your risk exposure — penalty calculator calculates for your specific numbers (revenue, scale, sensitive data presence).

Action items for EU SMB SaaS — checklist

I have a high-risk AI system (Annex III). Today (or this week) I'll:

📋 Data inventory — list of all training/RAG/few-shot sources (1 day)
🎯 Data dictionary — purpose + used-in-model + retention per feature (1-2 days)
📊 Bias metrics baseline — 2-3 standard metrics on test set (3-5 days)
🔍 Sensitive data audit — do we use GDPR Art. 9 data? if yes → DPIA + Art. 10(5) doc (5-10 days)
✂️ Train/val/test separation — proper split, locked test set (4-8h)
📈 Data quality report — completeness/duplicates/label audit (1-2 days)
📝 Data Governance Procedure — 5-10 page doc with named roles (1-2 days)
🔄 Quarterly review cadence — calendar entries for each training run / release

Approximate total: 3-4 weeks of focused effort (more if sensitive data). You can do this internally or order an audit from us (€799 founding price, ~14 days delivery).

Check your Art. 10 exposure

Penalty calculator computes potential fine for your specific numbers (revenue, employees, sensitive data presence, current compliance level). Plus a 5-question quiz gives you a precise gap diagnostic per Art. 10 subsection.

Open Penalty Calculator →

Vs consultants selling "AI ethics workshop"

Warning. EU AI Act consultants often sell "AI ethics workshop" for €5-15k. The workshop is fluff: discussions about "responsible AI principles", "ethics framework", "stakeholder engagement". Audit doesn't look at this.

Audit looks at:

Documented data lineage
Quantified bias metrics
Annex IV technical documentation
Logging trail per Art. 12
Human oversight mechanism per Art. 14
Accuracy/robustness measurements per Art. 15

If your consultant sells "ethics" without concrete data lineage docs, bias measurement reports, technical Annex IV — find someone else. EU compliance is data + documentation, not philosophy.

Check the sample audit — shows exactly what an audit contains (data lineage tab, bias metrics tab, gap analysis per Art. 9-15, action plan).

Key takeaways

Art. 10 is the most-overlooked section in EU SMB SaaS audits — 60% of startups have 3+ gaps
7 mistakes ranked: data lineage > bias metrics > minimization > sensitive data > train/val separation > quality controls > procedure doc
GDPR crosswalk: if you're GDPR-compliant, you're 60% done with Art. 10
Penalty €15M / 3% turnover per Art. 99(4) for high-risk Art. 10 violations (lower-of for SMB)
3-4 weeks of focused work for full compliance for SMB without sensitive data, 6-8 weeks with sensitive data
API-based AI doesn't exempt from Art. 10 — prompts + RAG + few-shot is "training data" per audit
Consultants selling "ethics workshop" won't help in audit — they require data + documentation, not philosophy
Enforcement Q4 2026 - Q1 2027, media-prominent firms first targets

Disclaimer: this article is informational, NOT legal advice. Specific implications for your system require legal opinion from EU AI Act-specialized lawyer, ideally combined with technical audit. Audit ordering available — 100% money-back in 30 days if we don't find at least 3 actionable findings.