Chapter 3: Data: The First and Last Attack Surface


The Dataset Nobody Owned

A financial services firm spent eighteen months building an AI system to detect fraudulent transactions. The model was sophisticated—trained on years of historical data, fine-tuned with expert feedback, validated against known fraud patterns. The security review was thorough: the model ran in an isolated environment, inference requests were authenticated, outputs were logged and auditable. By every conventional measure, the system was secure.

Six months after deployment, the fraud detection rate started declining. Not dramatically—just a slow erosion that the operations team attributed to evolving fraud tactics. They scheduled a model refresh, planning to retrain on newer data.

During the retraining preparation, a data engineer noticed something odd. The historical transaction data they'd been using had been "enriched" by a third-party service that matched transactions with merchant category codes. That enrichment happened automatically, through a pipeline that predated the AI project. Nobody on the AI team knew it existed.

The enrichment service had been compromised eight months earlier. For half a year, a small percentage of transactions had been subtly mislabeled: category codes were altered so that legitimate transactions resembled known fraud patterns, and fraudulent ones blended into ordinary commerce. The original training data was fine. But every subsequent data refresh had pulled from the poisoned pipeline. The model wasn't failing because fraud tactics had changed. It was failing because it had been trained to fail.

The breach wasn't in the model. It wasn't in the inference infrastructure. It was in a data pipeline that nobody on the security team had ever reviewed, because it wasn't "the AI part."

This chapter is about why data is where AI security begins and ends—and why most organizations are blind to the risks that flow through their data before it ever reaches a model.


Why Data Is the Dominant Risk Vector

When organizations think about AI security, they think about models. When they think about AI attacks, they think about prompt injection, jailbreaks, adversarial inputs. These are real concerns. They're also the concerns that dominate vendor pitches, conference talks, and security tool marketing.

But here's what actually matters: the model is a function of its data. Every capability the model has, every pattern it recognizes, every output it generates—all of it derives from data. The model doesn't know anything that wasn't in its training data. It can't reason about concepts it never saw. Its biases are data biases. Its blindspots are data blindspots.

If you control the data, you control the model. Not in some abstract philosophical sense—in a concrete, operational sense. An attacker who can influence training data can influence model behavior more reliably than an attacker who tries to manipulate the model at runtime. A poisoning attack embedded during training persists in every subsequent inference until the model is retrained on clean data. A prompt injection attack affects one interaction.

This is why data is the first attack surface: it shapes everything that comes after. And it's the last attack surface because data leakage through model outputs is often the ultimate breach objective. Attackers want data. Models that have seen sensitive data can leak sensitive data. The data flows in, and the data flows out.

Most organizations treat data security and AI security as separate concerns. They have data governance teams and they have AI security initiatives, and the two barely talk. This is an architectural failure. AI security without data security is theater. You're guarding the vault door while leaving the loading dock open.

The real problem is that AI systems consume data at scale, from sources that traditional data governance never contemplated, through pipelines that nobody owns end-to-end. A model trained on "company data" might actually be trained on data from dozens of systems, external enrichment services, third-party datasets, and synthetic augmentation—each with its own provenance, each with its own trust assumptions, each with its own potential for compromise.

If you cannot trace your data end-to-end, you cannot secure your AI. This isn't a recommendation. It's a statement of architectural reality.


The Data Lifecycle in AI Systems

To secure data in AI systems, you need to understand how data actually moves. The lifecycle is more complex than most teams realize, with security-relevant decisions at every stage.

Data Collection and Ingestion

Every AI system starts with data collection. This might be obvious—a team decides to train a model and gathers a dataset. Or it might be invisible—data accumulates in systems that later get repurposed for AI training.

The ingestion stage is where provenance begins. Where did this data come from? Who provided it? Under what terms? With what quality guarantees? These questions seem administrative, but they're foundational to security.

Consider the sources that feed typical enterprise AI systems:

Internal operational data flows from production systems—transaction logs, customer interactions, sensor readings. This data wasn't collected for AI training. It was collected for operations. The consent frameworks, retention policies, and access controls were designed for operational use, not model training. Repurposing this data for AI doesn't automatically transfer those protections.

Third-party datasets come from vendors, partners, or public sources. A dataset purchased for model training might have been assembled from sources you can't verify. The vendor's data collection practices become your data collection practices, whether you know what they are or not.

User-generated content arrives through applications—feedback, corrections, conversations with AI systems. This data is particularly sensitive because it often contains information users shared in a specific context, not expecting it to become training data for future models.

Synthetic and augmented data gets generated to expand training sets. Synthetic data seems safe—it's not "real" data. But synthetic data generated from real data inherits properties of that real data. Augmentation that transforms sensitive records doesn't necessarily remove their sensitivity.

Web-scraped data pulls information from public sources. "Public" doesn't mean "safe for any use." Scraped data might include copyrighted material, personal information posted with expectations of limited distribution, or content from sites with terms prohibiting this use.

Each source has different trust properties. Internal operational data might be high-integrity but contain sensitive information. Third-party datasets might be lower sensitivity but uncertain provenance. User-generated content might be both sensitive and low-integrity. Treating all data the same—as just "training data"—collapses these distinctions in ways that create risk.
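
One lightweight way to keep these distinctions from collapsing is to attach trust metadata to every source at ingestion time and refuse data from sources nobody has reviewed. The sketch below is a minimal illustration, not a prescribed schema; the source names, levels, and registry mechanism are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Sensitivity(Enum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    REGULATED = 4

class Integrity(Enum):
    UNVERIFIED = 1            # e.g., web-scraped or user-submitted content
    VENDOR_ATTESTED = 2       # third-party claims, not independently checked
    INTERNALLY_VALIDATED = 3  # produced and verified by systems you control

@dataclass(frozen=True)
class DataSource:
    name: str
    owner: str                # team accountable for this source
    sensitivity: Sensitivity
    integrity: Integrity
    terms_of_use: str         # consent or contractual basis for training use

# Hypothetical registry: every pipeline input must appear here before
# it is allowed to feed a training dataset.
SOURCE_REGISTRY = {
    "crm_customer_interactions": DataSource(
        "crm_customer_interactions", "data-platform",
        Sensitivity.CONFIDENTIAL, Integrity.INTERNALLY_VALIDATED,
        "collected for operations; training use approved 2024-Q1"),
    "merchant_category_enrichment": DataSource(
        "merchant_category_enrichment", "vendor-integrations",
        Sensitivity.INTERNAL, Integrity.VENDOR_ATTESTED,
        "vendor contract; no redistribution"),
}

def require_registered(source_name: str) -> DataSource:
    """Refuse to ingest from sources that were never reviewed."""
    if source_name not in SOURCE_REGISTRY:
        raise ValueError(f"Unregistered data source: {source_name}")
    return SOURCE_REGISTRY[source_name]
```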

The ingestion stage is also where poisoning attacks enter the pipeline. If an attacker can influence what data gets collected—by compromising a source system, manipulating a scraping target, or exploiting a data submission interface—they can poison the model at its foundation.

Data Storage and Management

Once collected, data needs to live somewhere. AI projects create new storage requirements that often don't fit neatly into existing data architecture.

Raw data repositories hold data as collected, before any transformation. These repositories are often treated as staging areas, with less rigorous access control than production systems. But raw data is often the most sensitive—it hasn't been through any sanitization or filtering.

Feature stores hold processed data optimized for model training. Feature engineering transforms raw data into model-ready formats, but those transformations might not remove sensitivity. A feature derived from salary data is still salary-related data, even if the original values aren't visible.

Vector databases store embeddings—numerical representations of data used for similarity search and retrieval. Embeddings feel abstract and mathematical, which leads teams to treat them as non-sensitive. This is dangerous. Embeddings can be inverted—at least partially—to recover information about the original data. A vector database of document embeddings is, in a meaningful sense, a database of document information.

Model artifacts themselves are a form of data storage. A trained model contains a compressed representation of its training data. Models can memorize training examples, especially rare or unusual ones. The model file isn't just code—it's a data artifact with data sensitivity.

Conversation and interaction logs accumulate from AI systems in production. These logs capture prompts, responses, retrieved context, and user corrections. They're invaluable for debugging and improvement. They're also a growing repository of potentially sensitive information that users shared with the AI.

Each storage location needs appropriate controls—access management, encryption, retention policies, audit logging. But AI projects often create these storage systems outside normal data governance processes. A data scientist spins up a vector database to experiment with embeddings. An ML engineer creates a feature store to accelerate training. These systems persist, accumulate data, and eventually become critical infrastructure—but they started as experiments that nobody reviewed.

Data Transformation and Enrichment

Data rarely goes directly from collection to training. It gets transformed, cleaned, normalized, enriched, and augmented. Each transformation is a potential point of failure or compromise.

Cleaning and normalization processes fix data quality issues—handling missing values, standardizing formats, removing duplicates. These processes can introduce errors at scale. A normalization rule that works for most records might corrupt specific cases. Cleaning logic that removes "invalid" entries might systematically exclude certain populations.

Enrichment adds information from external sources—the merchant category codes in our opening scenario. Enrichment creates dependencies on external systems. Those systems have their own security posture. If an enrichment service is compromised, every record it touches becomes suspect.

Aggregation and anonymization attempt to reduce sensitivity by combining or obscuring individual records. But aggregation doesn't always work. Small groups can be re-identified. Anonymization can be defeated with auxiliary information. A dataset that looks anonymized might leak individual information when combined with other data the attacker has.
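
A concrete example of why aggregation alone is not enough: before releasing grouped statistics, or using them as features, you can at least enforce a minimum group size, a k-anonymity-style check. This is a sketch with made-up field names and an arbitrary threshold, and it addresses only the most direct small-group risk.

```python
from collections import Counter

K_MIN = 10  # minimum group size before an aggregate is considered releasable

def unsafe_groups(records, quasi_identifiers, k=K_MIN):
    """Return quasi-identifier combinations shared by fewer than k records.

    Small groups are re-identifiable when combined with auxiliary data,
    even after names and direct identifiers have been stripped.
    """
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return {group: n for group, n in counts.items() if n < k}

# Hypothetical usage: age band + postcode prefix + treatment code
records = [
    {"age_band": "40-49", "postcode_prefix": "SW1", "treatment": "T17"},
    {"age_band": "40-49", "postcode_prefix": "SW1", "treatment": "T17"},
    {"age_band": "70-79", "postcode_prefix": "EC2", "treatment": "T99"},
]
flagged = unsafe_groups(records, ["age_band", "postcode_prefix", "treatment"])
# Both combinations fall below k, so neither aggregate should be released as-is.
```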

Labeling and annotation adds the ground truth that supervised learning requires. Labels might come from automated systems, human annotators, or inference from other data. Each source has different reliability. Human labeling, in particular, involves people who see raw data—potentially sensitive data—to do their work.

Augmentation and synthesis expand datasets artificially. Augmented data inherits properties from its source data. Synthetic data generated by models might reproduce patterns from training data, including sensitive patterns. The line between "synthetic" and "derived from real" is blurrier than it appears.

The transformation stage is where data lineage becomes critical. If you can trace a training record back to its sources and through its transformations, you can understand what's actually in your training data. If you can't trace it, you're training on a black box.

Data for Training

Training is the most obviously AI-specific stage, but its data security implications are often overlooked because the focus is on model development, not data protection.

Data selection determines what subset of available data goes into training. Selection criteria have security implications. If you train on data from a specific time period, you inherit whatever was happening during that period. If you train on data from specific systems, you inherit the biases and potential compromises of those systems.

Train/validation/test splits divide data for model development. The split methodology matters. If the splits aren't truly independent—if information leaks between them—validation results become unreliable. More importantly, any of these splits might include sensitive data that shouldn't be used for any purpose.

Data loading moves data from storage into training infrastructure. This is a privilege escalation moment. Training infrastructure needs access to potentially sensitive data at scale. Whatever identity runs the training job has read access to the training data. If that identity is overprivileged—which it often is, to avoid friction during development—training becomes a broad data access point.
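
A minimal sketch of the alternative to broad read access: give each training job an explicit allowlist of approved dataset paths and refuse everything else. The paths and the allowlist mechanism here are hypothetical, and in practice the storage platform's access policy should enforce the same boundary rather than relying on code alone.

```python
from pathlib import Path

# Datasets this specific training job has been approved to read.
APPROVED_DATASETS = {
    Path("/data/approved/fraud_transactions_2024"),
    Path("/data/approved/merchant_categories_v3"),
}

def open_training_dataset(path: str) -> Path:
    """Resolve a requested path and reject anything outside the approved set."""
    resolved = Path(path).resolve()
    if not any(resolved == p or p in resolved.parents for p in APPROVED_DATASETS):
        raise PermissionError(f"Training job is not approved to read {resolved}")
    return resolved  # hand off to the actual data loader from here

# open_training_dataset("/data/approved/fraud_transactions_2024/part-0001.parquet")  # allowed
# open_training_dataset("/data/raw/hr_salaries/2023.csv")  # raises PermissionError
```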

Checkpointing and serialization save model state during training. Checkpoints are model artifacts. They contain information derived from training data. Checkpoint storage needs the same protection as final model storage.

Training is also when poisoning attacks become embedded. If poisoned data made it through collection and transformation, training crystallizes it into model weights. After training, the poison is no longer in a data file you can examine—it's distributed across millions of parameters in ways that are difficult to detect.

Data at Inference Time

Once deployed, AI systems continue to process data—now in production, at scale, with real users.

User inputs are data. Every prompt to an LLM, every image uploaded for classification, every query to a recommendation system is data entering the system. User inputs might contain sensitive information. They might contain attacks. They might contain information the user didn't intend to provide.

Retrieved context in RAG systems pulls additional data into the model's processing. The retrieval query is based on user input, but what gets retrieved depends on what's in the retrieval store. A user asking a simple question might trigger retrieval of sensitive documents they shouldn't see.
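
One architectural response, sketched below under assumed metadata: store an access label with every indexed document and filter retrieval results against the requesting user's entitlements before anything reaches the model. The document structure and group names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RetrievedDoc:
    doc_id: str
    text: str
    classification: str        # label attached when the document was indexed
    allowed_groups: frozenset  # groups entitled to see this document

def authorize_retrieval(docs, user_groups):
    """Drop retrieved documents the requesting user is not entitled to see.

    Runs after similarity search and before anything is placed into the
    model's context window, so unauthorized content cannot leak via the answer.
    """
    return [d for d in docs if d.allowed_groups & user_groups]

# Hypothetical usage
docs = [
    RetrievedDoc("d1", "Quarterly fraud-rule thresholds...", "confidential",
                 frozenset({"fraud-ops"})),
    RetrievedDoc("d2", "Public FAQ on disputed charges...", "public",
                 frozenset({"everyone"})),
]
visible = authorize_retrieval(docs, user_groups=frozenset({"everyone"}))
# Only d2 survives; d1 never enters the prompt, so the model cannot summarize it.
```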

Model outputs are derived data. An LLM's response is generated based on its training data, the current prompt, and any retrieved context. If any of those sources contained sensitive information, the output might contain sensitive information.

Logging and telemetry capture production activity. Inference logs—prompts, responses, metadata—are data that requires protection. These logs often receive less attention than training data, even though they might contain more directly sensitive information (actual user queries rather than historical patterns).

Feedback and corrections flow back from users and systems. This feedback might influence future training, creating a loop where production data becomes training data. The security implications compound: a vulnerability in production data handling becomes a vulnerability in training data integrity.

Inference-time data flows are high-volume and real-time, which creates pressure to minimize security overhead. But these flows are also where sensitive data is most likely to appear—users share information in prompts that they would never share in structured forms.


Data Lineage: The Foundational Control

If there's one architectural principle that matters more than any other for AI data security, it's data lineage. Lineage is the ability to trace data from its origins, through all transformations, to its current state and uses.

Why is lineage foundational? Because without it, you cannot answer basic security questions:

  • What data trained this model? If you can't answer this, you can't assess what the model might leak or how it might be biased.
  • Where did this training record come from? If you can't answer this, you can't verify its integrity or provenance.
  • What happened to this data between collection and use? If you can't answer this, you can't identify where poisoning might have occurred.
  • Who has accessed this data? If you can't answer this, you can't scope a breach or investigate an incident.
  • What models use data from this source? If you can't answer this, you can't assess the impact when a source is compromised.

Lineage isn't a nice-to-have for compliance. It's the foundation for every other data security control. You can't enforce retention policies on data you can't trace. You can't respond to deletion requests for data you can't find. You can't assess poisoning risk in pipelines you can't see.

What does practical lineage look like?

Source tracking records where each data element originated. Not just "from the CRM" but "from the CRM's customer_interactions table, extracted via the nightly_export job, on this date, with this schema version."

Transformation tracking records what happened to data after collection. Every join, filter, aggregation, enrichment, and normalization should be logged. When data is derived or computed, the derivation logic should be captured.

Usage tracking records where data was used. Which training runs included this data? Which models were trained on datasets containing this record? Which inference requests retrieved this document?

Access tracking records who touched the data. This includes human access (data scientists exploring datasets) and system access (pipelines reading and writing).
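
What this looks like in its smallest possible form is a lineage record carried alongside each dataset and appended to at every stage. The field names below are illustrative; real systems typically rely on a lineage service or metadata catalog rather than a hand-rolled structure like this one.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageEvent:
    stage: str      # "source", "transform", "training_use", or "access"
    actor: str      # pipeline job or human identity
    detail: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class DatasetLineage:
    dataset_id: str
    events: list = field(default_factory=list)

    def record(self, stage: str, actor: str, detail: str) -> None:
        self.events.append(LineageEvent(stage, actor, detail))

# Hypothetical pipeline run
lineage = DatasetLineage("fraud_training_2024_06")
lineage.record("source", "nightly_export", "crm.customer_interactions, schema v12")
lineage.record("transform", "enrich_job", "merchant category codes from vendor feed")
lineage.record("training_use", "train-run-0142", "fraud-detector v7 fine-tune")
lineage.record("access", "jsmith", "ad-hoc exploration in a notebook")

# "What touched this data?" becomes a query instead of an investigation.
for e in lineage.events:
    print(e.at.isoformat(), e.stage, e.actor, e.detail)
```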

Building lineage is expensive. It requires instrumentation across data pipelines, storage overhead for lineage metadata, and discipline to maintain it as systems evolve. Most organizations don't have it, or have it only partially.

But the cost of not having lineage is higher. When a data source is compromised, you need to know what downstream systems are affected. When a model behaves unexpectedly, you need to understand what it was trained on. When a user requests data deletion, you need to find everywhere their data exists. Without lineage, you're operating blind.


Data Poisoning: The Attack That Persists

Data poisoning is the modification of training data to influence model behavior. Unlike prompt injection, which manipulates a single interaction, poisoning embeds malicious influence into the model's learned parameters. The attack persists across all future inferences.

Poisoning attacks vary in sophistication and goals:

Availability attacks aim to degrade model performance generally. By introducing noise or contradictory examples, attackers make the model less accurate across the board. The fraud detection decline in our opening scenario is an availability attack—the model became less effective at its primary task.

Targeted attacks aim to change model behavior for specific inputs while preserving general performance. An attacker might want a spam classifier to allow messages from their domain, while still catching spam from everyone else. These attacks are harder to detect because overall metrics look normal.

Backdoor attacks insert hidden triggers that activate malicious behavior. A model might perform normally unless it sees a specific pattern, at which point it produces attacker-controlled outputs. These attacks are particularly dangerous because they can pass extensive testing—the trigger never appears in the test data.

Where do poisoning attacks enter? Anywhere data flows into training:

Direct data modification is the most obvious vector. If attackers can access training data storage, they can modify records directly. This requires storage access, which should be controlled—but training data repositories often have weaker access controls than production databases.

Source system compromise poisons data before it's collected for training. The enrichment service in our opening scenario was compromised at the source. The AI team never touched poisoned data directly—they just consumed it from an upstream system they trusted.

Crowdsourced poisoning exploits human labeling and feedback. If training data includes human-provided labels, attackers can join the labeling process and inject incorrect labels. This is especially relevant for systems that train on user feedback—attackers can provide systematically biased feedback.

Supply chain poisoning targets third-party datasets or pretrained models. If you fine-tune a model poisoned by its original trainer, you inherit the poison. If you train on a purchased dataset with injected examples, those examples become part of your model.

Detection is difficult because poisoning happens before training, and the model's behavior after training is what you can observe. By the time you notice the effect, the cause is baked into parameters you can't easily inspect.

Defenses exist but are imperfect. Data validation checks incoming data against expected distributions—but attackers who understand the expected distribution can poison within those bounds. Robust training techniques aim to reduce the influence of outliers—but targeted attacks can use many small modifications rather than few large ones. Model testing looks for unexpected behaviors—but backdoors only activate with triggers that testers don't know to include.
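
The simplest form of data validation is a statistical comparison between an incoming batch and reference data you already trust, with drift flagged for human review before retraining proceeds. The sketch below uses made-up feature names and an arbitrary threshold, and, as noted above, it catches clumsy poisoning and accidental corruption rather than a careful adversary who stays within the expected distribution.

```python
import statistics

def drift_report(reference, incoming, z_threshold=3.0):
    """Flag features whose incoming mean drifts far from the reference mean.

    `reference` and `incoming` map feature name -> list of numeric values.
    A non-empty result should block the data refresh pending review.
    """
    flags = {}
    for feature, ref_values in reference.items():
        ref_mean = statistics.mean(ref_values)
        ref_std = statistics.stdev(ref_values)
        if ref_std == 0:
            continue
        z = abs(statistics.mean(incoming[feature]) - ref_mean) / ref_std
        if z > z_threshold:
            flags[feature] = round(z, 2)
    return flags

# Hypothetical check before a scheduled retraining
reference = {"txn_amount": [12.0, 40.0, 25.0, 31.0, 18.0]}
incoming = {"txn_amount": [410.0, 380.0, 395.0, 405.0, 390.0]}
print(drift_report(reference, incoming))  # flags "txn_amount" with a large z-score
```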

The most reliable defense is architectural: minimize the attack surface for poisoning by controlling data provenance, limiting data sources, and maintaining lineage that lets you audit what went into training.


Data Leakage: The Risk That Compounds

If poisoning is data flowing maliciously into AI systems, leakage is data flowing inappropriately out. And leakage risks are everywhere.

Training Data Extraction

Models can memorize their training data, especially unusual or repeated examples. This isn't a bug—it's how neural networks work. But it means that data present in training might be recoverable from the model.

Extraction attacks query models systematically to elicit memorized content. A model trained on email data might complete prompts in ways that reveal actual emails. A model trained on code repositories might generate code that includes API keys present in training.

The risk scales with model size and training data sensitivity. Larger models have more capacity for memorization. Training data with personally identifiable information, financial records, or proprietary content creates extraction targets.

Fine-tuned models are particularly vulnerable. Fine-tuning on a small, specific dataset drives the model to learn that content thoroughly—which often means memorizing it. A model fine-tuned on your company's internal documents might be able to reproduce those documents.
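
One inexpensive mitigation is to scan a fine-tuning corpus for obvious secrets and identifiers before training, so material that is trivially memorizable and maximally damaging never reaches the model. The patterns below are deliberately narrow illustrations; a real scanner would use a maintained rule set, and this complements rather than replaces proper data classification.

```python
import re

SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "card_like_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_document(doc_id, text):
    """Return the names of patterns that match, so the document can be
    excluded from the corpus or redacted before fine-tuning."""
    hits = [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]
    return doc_id, hits

# Hypothetical corpus drawn from internal docs and repositories
corpus = {
    "wiki/onboarding.md": "Welcome! Ping alice@example.com for VPN access.",
    "repo/deploy.sh": "export AWS_KEY=AKIAABCDEFGHIJKLMNOP",
}
for doc_id, text in corpus.items():
    _, hits = scan_document(doc_id, text)
    if hits:
        print(f"exclude or redact {doc_id}: {hits}")
```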

Inference-Time Leakage

Even without training data extraction, models leak information through their outputs.

Prompt leakage occurs when system prompts or instructions are exposed in responses. If you've told the model "never discuss competitors" in a system prompt, an adversary might be able to extract that instruction and learn your competitive sensitivities.

Context leakage in RAG systems exposes retrieved documents through responses. The model summarizes what it retrieved, and that summary might include information from documents the user shouldn't see. This is the retrieval authorization problem from Chapter 2, manifesting as data leakage.

Membership inference determines whether specific data was used in training. Even if the attacker can't extract the data, knowing it was used for training might be sensitive—imagine knowing that a specific patient's records were used to train a medical AI.

Model inversion reconstructs properties of training data from model outputs. Even partial reconstruction—learning that training data had certain statistical properties—might reveal sensitive information about the population in that data.

The Accumulation Problem

Data leakage risks compound over time. Each individual query might leak a small amount of information. But AI systems handle millions of queries. Information that seems minimal in one response becomes significant when aggregated across many interactions.

This is particularly dangerous for adversaries playing a long game. A patient attacker can query systematically, collecting fragments that individually seem harmless but combine into sensitive reconstruction.

Traditional data loss prevention focuses on blocking sensitive data in transit. But models transform and recombine data. The output might not contain any direct copy of sensitive input—but it might contain information derived from that input. DLP tools that look for literal credit card numbers won't catch a model that describes "a typical customer with high credit limits and recent large purchases in the luxury goods category."


Common Mistakes Organizations Make

Treating Training Data Like Application Data

Most organizations have mature processes for securing application databases. They understand access control, encryption, backup, and audit logging for production systems. But training data often lives in different infrastructure—data lakes, ML platforms, research environments—that doesn't inherit those protections.

Teams treat training data as "just analytics" or "just for development." The data isn't serving production traffic, so it must be lower risk. This reasoning ignores that training data shapes a model that will serve production traffic. The sensitivity of training data persists in model behavior.

What this misses: Training data is production data because it determines production model behavior. Security controls should match the sensitivity of the data, not the perceived importance of the system storing it.

Assuming Anonymization Solves the Problem

Organizations often anonymize data before using it for training, believing this eliminates privacy risk. Remove names, generalize locations, mask identifiers—now it's safe for AI.

But anonymization is harder than it looks, and models are good at learning patterns that de-anonymization techniques can exploit. A model trained on "anonymized" healthcare data might still learn patterns that identify rare conditions or unusual treatment histories. Combined with auxiliary information, that's re-identification.

Even properly anonymized aggregate data can be problematic. If a model learns that "customers in region X with behavior Y have outcome Z," that pattern might reveal information about specific individuals when applied to small groups.

What this misses: Anonymization reduces direct identification risk, but models learn correlations that can enable indirect identification. Differential privacy provides stronger guarantees but is difficult to implement for complex AI training.
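
To make the differential-privacy idea concrete, here is a toy example for a single count query: noise drawn from a Laplace distribution, scaled to the query's sensitivity, bounds how much any one individual's presence can shift the released number. This illustrates the mechanism only; it is nowhere near a recipe for differentially private model training, and the epsilon value is arbitrary.

```python
import random

def dp_count(true_count: int, epsilon: float = 0.5) -> float:
    """Release a count with Laplace noise calibrated to sensitivity 1.

    One individual changes a count by at most 1, so Laplace noise with
    scale 1/epsilon gives epsilon-differential privacy for this single
    query. Answering many queries consumes more of the privacy budget.
    """
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

print(dp_count(42))  # e.g. 43.7; the exact value varies from run to run
```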

Trusting Third-Party Data Without Verification

Organizations purchase datasets, use public data, and consume data from partners without adequate verification. The vendor has good reputation. The data looks clean. What's the risk?

The risk is that you're importing trust assumptions you can't verify. Where did the vendor get this data? How was it labeled? Has it been validated for accuracy? Could it have been tampered with? These questions often have no satisfactory answers.

This is supply chain risk applied to data. Just as you shouldn't trust software dependencies without verification, you shouldn't trust data dependencies without verification. But organizations that carefully vet code dependencies will ingest data with minimal scrutiny.

What this misses: Third-party data is a trust boundary. Data from external sources should be validated, isolated, and monitored differently from internally generated data.

Building Pipelines Without Lineage

Data pipelines for AI are often built quickly, by data scientists and ML engineers focused on model development. Lineage tracking adds complexity. In the sprint to ship a model, nobody has time to instrument every transformation.

The result is pipelines that work—data flows in, models come out—but that can't answer basic provenance questions. When something goes wrong, there's no way to trace what data was involved or what happened to it.

This isn't laziness. It's optimization under pressure. But it creates technical debt that compounds. Each new pipeline without lineage makes the overall system harder to secure. Eventually, the organization has AI systems trained on data they can't account for.

What this misses: Lineage is infrastructure, not overhead. It should be built into data platforms from the start, not added as an afterthought.

Logging Everything (Or Nothing)

Some organizations log aggressively—every prompt, every response, every intermediate step. They'll figure out what's useful later. The result is log storage containing sensitive information with minimal access control, creating a secondary data leakage risk.

Other organizations log minimally—just enough to debug immediate problems. When incidents occur, there's insufficient information to investigate. What data was exposed? Who accessed it? The logs don't say.

Both extremes fail. Comprehensive logging without a retention policy and access controls creates liability. Minimal logging without security context leaves you blind.

What this misses: AI system logging requires security-aware design. What to log, how long to retain it, who can access it, and how to protect it should be decisions, not defaults.
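
A sketch of what security-aware logging can mean in practice: decide at write time which fields are captured, redact obvious identifiers from prompts and responses, and stamp each record with a retention class that drives automated deletion. Field names, patterns, and retention classes here are hypothetical.

```python
import json
import re
from datetime import datetime, timezone

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def redact(text: str) -> str:
    """Mask obvious identifiers before the text ever reaches storage."""
    return CARD.sub("[card]", EMAIL.sub("[email]", text))

def inference_log_record(user_id, prompt, response, model_version):
    """Build an inference log entry that is safer to retain and query."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user_id,              # pseudonymous ID, not an email address
        "model": model_version,
        "prompt": redact(prompt),
        "response": redact(response),
        "retention_class": "90d",     # consumed by downstream deletion jobs
    })

print(inference_log_record(
    "u-4821",
    "My card 4111 1111 1111 1111 was charged twice",
    "I can help you dispute that charge...",
    "assistant-v3",
))
```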


Architectural Questions to Ask

Data Source Questions

  • Can you enumerate every source that contributes data to your AI training pipelines?
  • For each source, do you know the provenance of the data and the terms under which you can use it?
  • If a data source were compromised, do you have a process to identify affected models and datasets?
  • Are there data sources feeding AI systems that predate your AI program (legacy pipelines now repurposed)?

Why these matter: You can't secure sources you don't know about. Shadow data pipelines are common in AI systems built on existing data infrastructure.

Data Flow Questions

  • Can you trace a specific training record from its origin through all transformations to the model that used it?
  • Do you know everywhere a piece of sensitive data might exist—raw storage, feature stores, vector databases, model weights?
  • When data is enriched from external sources, do those sources appear in your security reviews?
  • Can you identify all models trained on data from a specific time period (relevant for time-bounded compromise)?

Why these matter: Data moves through AI systems in complex ways. Leakage and poisoning can occur at any point. If you can't trace the flow, you can't secure it.

Data Access Questions

  • Who has access to training data, and are those access rights regularly reviewed?
  • Do training jobs run with least-privilege access, or do they have broad read access for convenience?
  • When data scientists explore datasets, is that access logged and auditable?
  • Are there copies of training data in development environments, personal workstations, or experiment storage?

Why these matter: Training data access is often broadly granted for productivity. This creates data exposure risk that doesn't exist for production database access.

Data Sensitivity Questions

  • Is training data classified for sensitivity, or is it all treated the same?
  • Do you know whether training data contains PII, confidential business information, or regulated data?
  • When models are trained on mixed-sensitivity data, is the resulting model classified at the highest level?
  • Are there models in production that you can't confirm were trained only on approved data?

Why these matter: Sensitivity classification should flow from data through training to models. A model trained on sensitive data is a sensitive artifact.

Data Integrity Questions

  • How would you detect if training data had been subtly modified?
  • Do you validate training data against expected statistical properties before use?
  • If an upstream data source changed its schema or semantics, would you know before training on it?
  • Are there automated checks for training data quality that would flag potential poisoning?

Why these matter: Poisoning attacks succeed because integrity is assumed, not verified. Detection requires knowing what data should look like.

Data Retention Questions

  • Do you have retention policies for training data, and are they enforced?
  • When a data subject requests deletion, can you identify and remove their data from training sets?
  • Do intermediate artifacts (checkpoints, feature stores, experiment logs) have retention policies?
  • If you needed to retrain a model without specific data, could you identify what to exclude?

Why these matter: Data rights and regulations require the ability to delete data. AI systems that can't delete create compliance liability.


Key Takeaways

  • Data is the dominant attack surface for AI systems. The model is a function of its data. Control the data, and you control the model. Poisoning attacks embed malicious influence that persists across all inferences. Leakage risks extract sensitive information at scale. Securing the model while ignoring data security is architectural theater.

  • Data lineage is the foundational control. If you can't trace data from source through transformation to model, you can't secure your AI. You can't assess poisoning risk in pipelines you can't see. You can't respond to incidents involving data you can't find. Lineage isn't overhead—it's infrastructure.

  • Training data requires production-grade security. Training data determines production model behavior. It deserves the same access controls, encryption, audit logging, and monitoring as production databases. Treating training data as "just analytics" ignores that its sensitivity persists into deployed models.

  • Anonymization is necessary but not sufficient. Models learn patterns that can enable re-identification even from properly anonymized data. Differential privacy provides stronger guarantees but is difficult to implement. Assume that training data sensitivity persists into model behavior.

  • Third-party data is a trust boundary. Data from external sources carries provenance you can't verify and integrity you can't guarantee. Supply chain risk applies to data just as it applies to code. Validate, isolate, and monitor external data differently from internal sources.


Data is where AI security begins. Get data security wrong, and every subsequent control is built on a compromised foundation. Get it right, and you've established the lineage and visibility that make everything else possible. The next chapter examines what happens to data when it becomes a model—the training and fine-tuning process that most organizations treat as a black box.
