AI Act Data Readiness.

What PE acquirers must validate about AI training data before close. Consent basis, data provenance, bias documentation, and the GDPR intersection that standard diligence does not cover.

Data Provenance: Where the Training Data Came From

Article 10 of the AI Act requires that training, validation, and testing datasets be subject to appropriate data governance and management practices. For PE acquirers, the first practical question is whether the target company can document where its training data originated. In most cases, the answer is no.

Training data in portfolio companies typically accumulates over years. Customer interaction logs, behavioral tracking data, transaction records, third-party data enrichment feeds, scraped web data, and user-generated content all end up in training pipelines without formal provenance tracking. The data engineering team knows the technical source (which database, which API). They rarely know the legal source: under what authority was this data collected, for what stated purpose, and does that purpose extend to model training?

Provenance documentation under the AI Act is not optional for high-risk systems. It requires records of data origin, collection methodology, data preparation steps, labeling procedures, and any data cleaning or augmentation techniques applied. The absence of this documentation is not just a compliance gap. It is a signal to acquirers that the AI asset's foundation may be unstable. If you cannot trace the data, you cannot validate its legal basis, and if you cannot validate its legal basis, the model built on it carries regulatory risk that compounds over time.
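Where this documentation does not exist, deal teams often end up reconstructing it dataset by dataset. Below is a minimal sketch of what a per-dataset provenance record could capture, following the Article 10 fields listed above; the ProvenanceRecord class and its field names are illustrative, not a prescribed AI Act schema:

```python
# A minimal sketch of a per-dataset provenance record, assuming the
# Article 10 documentation fields named above. Field names are
# illustrative, not a prescribed schema.
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    dataset_name: str
    origin: str                 # e.g. "CRM interaction logs, 2017-2024"
    collection_method: str      # how the data was gathered
    stated_purpose: str         # purpose declared at collection time
    legal_basis: str            # GDPR basis: consent, legitimate interest, ...
    preparation_steps: list[str] = field(default_factory=list)  # cleaning, augmentation
    labeling_procedure: str = "undocumented"

    def gaps(self) -> list[str]:
        """Return the fields an acquirer would flag as missing."""
        missing = []
        if self.legal_basis in ("", "unknown"):
            missing.append("legal basis")
        if self.labeling_procedure == "undocumented":
            missing.append("labeling procedure")
        if not self.preparation_steps:
            missing.append("preparation steps")
        return missing
```

The point of structuring the record this way is that the legal fields (stated purpose, legal basis) sit next to the technical ones: a record that can name the source database but not the lawful basis makes the gap visible at a glance.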

The GDPR Consent Intersection

The AI Act and GDPR are separate regulatory instruments enforced by different authorities. But they share a common dependency: personal data. When an AI system is trained on personal data collected under GDPR, the AI Act's data governance requirements layer on top of GDPR's lawful basis requirements. This creates a dual-compliance obligation that most companies have not addressed.

The most common failure pattern involves consent scope. A company collects user data under a GDPR-compliant consent framework that specifies purposes like "service improvement" or "personalization of user experience." The data science team later uses this data to train a machine learning model. Whether "personalization" as a consent purpose covers model training is a legal question that companies frequently assume away rather than answer. Under GDPR's purpose limitation principle, using data for a purpose not contemplated at the time of collection requires a new legal basis.

For PE acquirers, this intersection creates a specific diligence requirement. The review must examine not just whether GDPR consent exists, but whether that consent's scope extends to the specific AI training use case. When it does not, the remediation options are limited: re-consent data subjects (expensive and incomplete, since many will not respond), identify an alternative legal basis such as legitimate interest (possible but requires a balancing test and documentation that likely does not exist), or retrain the model on a properly consented dataset (the most expensive option, potentially requiring new data collection).

Field observation: We reviewed a B2B SaaS platform that trained its lead scoring model on seven years of customer interaction data. The original consent notice covered "analytics and service optimization." The company's DPO had never assessed whether model training fell within that purpose. The GDPR balancing test for legitimate interest had never been conducted. The training dataset contained data from EU subjects across 14 member states. Remediation required a full re-consent campaign and a parallel legitimate interest assessment, with a 9-month timeline and an estimated cost of $1.8M.

Bias Documentation and Assessment

Article 10(2)(f) requires that training datasets be examined for possible biases that could lead to discrimination. This is not an aspirational goal. It is a regulatory requirement for high-risk AI systems, and the documentation burden is explicit. Companies must identify potential sources of bias, assess their impact on system outputs, and implement measures to address identified biases.

In practice, bias assessment is the weakest area of AI compliance across PE portfolio companies. Most data science teams are aware of bias as a concept. Few have conducted formal bias audits against the categories the AI Act cares about: protected characteristics under EU anti-discrimination law, including race, gender, age, disability, religion, and sexual orientation. The assessment must be documented, the mitigation measures must be described, and the residual bias must be quantified.
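Quantifying residual bias implies a concrete metric. The AI Act does not prescribe one; the sketch below uses demographic parity difference, the gap in positive-outcome rates across groups, as one illustrative way to put a number on residual bias. The outcome and group data are hypothetical:

```python
# A minimal sketch of quantifying residual bias with one common fairness
# metric, demographic parity difference: the gap in positive-outcome
# rates across groups. Illustrative only; the AI Act names no metric.
def demographic_parity_difference(outcomes: list[int], groups: list[str]) -> float:
    """Max difference in positive-outcome rate between any two groups."""
    rates = {}
    for g in set(groups):
        selected = [o for o, grp in zip(outcomes, groups) if grp == g]
        rates[g] = sum(selected) / len(selected)
    return max(rates.values()) - min(rates.values())

# Hypothetical model decisions scored against a protected characteristic.
outcomes = [1, 0, 1, 1, 0, 0, 1, 0]
groups   = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_difference(outcomes, groups))  # 0.5: rate 0.75 vs 0.25
```

Whatever metric a target company chose, the diligence question is the same: was it computed per protected characteristic, is the result documented, and is the residual figure defended somewhere in writing.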

For acquirers, the absence of bias documentation creates two risks. First, regulatory risk: a high-risk AI system without bias documentation is non-compliant. Second, reputational and litigation risk: an AI system that produces discriminatory outputs (even unintentionally) creates liability that extends beyond regulatory fines to civil claims and public exposure. The cost of a retrospective bias audit is relatively modest compared to the cost of discovering discriminatory outputs post-close.

Practical Validation Steps for Deal Teams

Data readiness assessment during pre-LOI review should follow a structured sequence. Start with inventory: identify every AI system that processes personal data and catalog the training datasets used. For each dataset, request provenance documentation. If it does not exist, that finding alone tells you the compliance gap is material.
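As a rough illustration of the inventory step, continuing the hypothetical ProvenanceRecord sketch from the provenance section, a catalog keyed by AI system can surface missing documentation mechanically:

```python
# A hedged sketch of the inventory step: catalog each AI system's training
# datasets and surface missing provenance fields. ProvenanceRecord is the
# illustrative record defined in the provenance section; all entries here
# are hypothetical.
inventory = {
    "lead_scoring_model": [
        ProvenanceRecord(
            dataset_name="crm_interactions_2017_2024",
            origin="internal CRM export",
            collection_method="automatic event logging",
            stated_purpose="analytics and service optimization",
            legal_basis="consent",
        ),
    ],
}

for system, datasets in inventory.items():
    for record in datasets:
        if record.gaps():
            print(f"{system}/{record.dataset_name}: missing {', '.join(record.gaps())}")
```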

Next, examine the consent architecture. Pull the consent notices and privacy policies that were active during the data collection periods. Map the stated purposes against the actual data uses, specifically model training. Flag every instance where training use is not clearly covered by the stated purpose. This is your legal basis gap analysis.
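The mapping exercise can be triaged mechanically before legal review. Below is a minimal sketch, assuming hand-compiled sets of stated purposes and actual uses per dataset; the set comparison is a screening aid, and whether a stated purpose actually covers training remains a legal judgment, not string matching:

```python
# A minimal sketch of the legal basis gap analysis. Dataset names and
# purpose labels are hypothetical; a flag here is a candidate purpose
# limitation gap for legal review, not a conclusion.
CONSENT_PURPOSES = {
    "crm_interactions_2017_2024": {"service improvement", "personalization"},
    "support_tickets_2019_2024": {"service improvement", "model training"},
}

ACTUAL_USES = {
    "crm_interactions_2017_2024": {"model training", "personalization"},
    "support_tickets_2019_2024": {"model training"},
}

for dataset, uses in ACTUAL_USES.items():
    uncovered = uses - CONSENT_PURPOSES.get(dataset, set())
    if uncovered:
        print(f"{dataset}: uses not covered by stated purpose: {sorted(uncovered)}")
```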

Then assess bias documentation. Request any bias assessments, fairness metrics, or impact analyses conducted on the training data or model outputs. For high-risk systems, the AI Act requires this documentation to exist. Its absence is a compliance finding. Its presence requires validation that the assessment methodology meets Article 10 standards.

Finally, estimate remediation cost. For each gap identified, scope the remediation: re-consent campaigns, legitimate interest assessments, bias audits, data provenance reconstruction, and model retraining. Aggregate these costs into a total data readiness investment figure. That figure belongs in the deal model alongside the AI asset valuation. The net of the two is the actual value of the AI capability to the acquirer.
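The aggregation arithmetic itself is simple. A hedged sketch with hypothetical placeholder numbers:

```python
# A sketch of the roll-up: per-gap remediation estimates aggregated into
# a data readiness investment figure, netted against the AI asset
# valuation. All figures are hypothetical placeholders.
remediation_costs = {
    "re-consent campaign": 1_200_000,
    "legitimate interest assessment": 150_000,
    "retrospective bias audit": 250_000,
    "provenance reconstruction": 200_000,
}

data_readiness_investment = sum(remediation_costs.values())
ai_asset_valuation = 6_000_000  # hypothetical deal-model input

print(f"Data readiness investment: ${data_readiness_investment:,}")
print(f"Net AI capability value:   ${ai_asset_valuation - data_readiness_investment:,}")
```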

Next Step

Validate the training data before the LOI.

We audit AI training data provenance, consent coverage, and bias documentation. The output is a quantified data readiness gap with remediation cost estimates.

Request a Briefing →