Prompt-injection campaigns targeting RAG pipelines
Three observed patterns from Q2 incidents, and the retrieval-layer controls that materially reduced blast radius.
Across Q2, we tracked three recurring prompt-injection patterns in production RAG systems: poisoned documents in shared drives, indirect injection via summarized web content, and injection through user-provided file uploads that slipped past lax content filters. Prompt injection is shifting from a research curiosity to operational campaigns, with adversaries targeting the retrieval layer as the path of least resistance into systems whose model layer is otherwise hardened.
The shift is significant because the retrieval layer is, in most architectures, the least-instrumented part of the AI stack. Source-trust scoring is uncommon, provenance tagging is inconsistent, and the assumption that internal documents are trustworthy is widespread. That assumption was always optimistic and is now actively wrong: any document a user can edit is a document an attacker can edit.
Pattern one: poisoned shared-drive documents
The most common pattern observed was poisoned documents in collaboration platforms — internal wikis, shared drives, and ticketing systems — that were ingested into production retrieval indexes without source-trust differentiation. The injection payloads were embedded in low-visibility sections of legitimate-looking documents and triggered when a user's query retrieved the document for context.
The mitigation that worked was source-trust scoring at the retrieval layer, with documents from low-trust sources excluded from contexts that could trigger consequential actions. The scoring did not have to be sophisticated; a simple distinction between curated knowledge bases and open user-editable spaces materially reduced the attack surface.
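The two-tier distinction described above can be sketched in a few lines. This is a minimal illustration, not the implementation any of the observed programs used: the source names, the `Trust` tiers, the `Chunk` type, and the `consequential` flag are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    OPEN = 0      # user-editable spaces: wikis, shared drives, tickets
    CURATED = 1   # reviewed knowledge bases


# Hypothetical mapping from source system to trust tier.
SOURCE_TRUST = {
    "kb": Trust.CURATED,
    "wiki": Trust.OPEN,
    "shared_drive": Trust.OPEN,
    "tickets": Trust.OPEN,
}


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    source: str
    text: str


def trust_of(chunk: Chunk) -> Trust:
    # Unknown sources default to the lowest tier.
    return SOURCE_TRUST.get(chunk.source, Trust.OPEN)


def filter_for_context(chunks: list, consequential: bool) -> list:
    """Drop low-trust chunks when the context can trigger consequential actions."""
    if not consequential:
        return list(chunks)
    return [c for c in chunks if trust_of(c) >= Trust.CURATED]
```

Defaulting unknown sources to the lowest tier means a newly connected repository stays out of consequential contexts until someone classifies it.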
The detection signal that surfaced these campaigns earliest was a sudden change in the diversity of documents being retrieved for a narrow class of queries. The poisoned documents were authored specifically to rank highly for the queries the attacker wanted to influence, which produced a retrieval pattern that looked anomalous against the historical baseline. Programs that were monitoring the retrieval layer for these baseline shifts caught the campaigns days or weeks earlier than programs that were only monitoring the model output.
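One way to watch for this kind of baseline shift is to compare the entropy of the retrieved-document distribution in the current window against historical windows for the same query class. The sketch below is an assumption about how such a check could look; the windowing, the use of Shannon entropy as the diversity measure, and the z-score threshold are illustrative choices, not details from the incidents.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def retrieval_entropy(doc_ids):
    """Shannon entropy (bits) of the retrieved-document distribution."""
    counts = Counter(doc_ids)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_diversity_anomaly(baseline_windows, current_window, z_threshold=3.0):
    """Flag the current window if its entropy is far from the baseline.

    baseline_windows: list of lists of doc IDs, one list per historical window
    current_window:   list of doc IDs retrieved in the window under test
    """
    history = [retrieval_entropy(w) for w in baseline_windows]
    mu, sigma = mean(history), pstdev(history)
    current = retrieval_entropy(current_window)
    if sigma == 0:
        # A perfectly flat baseline: any measurable change is notable.
        return abs(current - mu) > 1e-9
    return abs(current - mu) / sigma > z_threshold
```

A collapse in entropy is what one would expect if a single planted document starts winning every retrieval for a narrow query class; a real deployment would likely also track which document IDs are new to the class.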
Pattern two: indirect injection via web summaries
Agents that browse the web and summarize content for users were targeted by adversaries who placed injection payloads on pages the agent was likely to visit. The payloads instructed the agent to take actions on behalf of the user — share credentials, send specific messages, retrieve and exfiltrate specific data — using authority the user had granted for unrelated reasons.
The mitigation that worked was treating retrieved web content as untrusted by construction: in any context that includes web content, the agent's tool-call permissions were automatically downgraded to read-only. Users saw the downgrade as additional confirmation prompts for sensitive actions taken in such contexts, which produced minor friction and substantial risk reduction.
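A rough sketch of that downgrade follows, assuming each context item carries an origin label and each tool declares whether it is read-only. The `Tool` and `ContextItem` types, the `"web"` origin label, and the decision strings are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    read_only: bool


@dataclass(frozen=True)
class ContextItem:
    origin: str  # e.g. "user", "internal_kb", "web" (assumed labels)
    text: str


UNTRUSTED_ORIGINS = {"web"}


def decide(tool, context, user_confirmed=False):
    """Return "allow" or "needs_confirmation" for a proposed tool call."""
    tainted = any(item.origin in UNTRUSTED_ORIGINS for item in context)
    if not tainted or tool.read_only:
        return "allow"
    # Context contains web content and the tool has side effects:
    # proceed only with an explicit confirmation from the user.
    return "allow" if user_confirmed else "needs_confirmation"
```

The decision depends only on what is in the context, not on what the model says about it, so an injected instruction cannot talk its way out of the downgrade.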
The variant of this pattern that proved most difficult to mitigate was injection through content that was itself summarized by another agent. The downstream agent never saw the injection payload directly; it received a summary that preserved the injection's intent in a form that survived the summarization. The defense against this variant is to treat the provenance of summarized content as a first-class property, with the trust level of a summary derived from the trust level of its source, recursively. Programs that have not built provenance into their summarization pipeline are vulnerable to this variant even when their direct-injection defenses are strong.
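The recursive rule can be illustrated as follows: a summary inherits the lowest trust level found anywhere in its source tree. The `Content` type and the integer trust scale are assumptions for this sketch; a real pipeline would need to persist these links alongside the summaries.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Content:
    content_id: str
    trust: Optional[int] = None             # set on primary sources only
    sources: Tuple["Content", ...] = ()     # set on summaries


def effective_trust(item: Content) -> int:
    """Trust of a summary is the minimum trust of its sources, recursively."""
    if not item.sources:
        # Primary content with no declared trust is treated as untrusted (0).
        return item.trust if item.trust is not None else 0
    return min(effective_trust(s) for s in item.sources)


# Usage: a summary of a summary still carries the web page's trust level.
web_page = Content("web-1", trust=0)
kb_article = Content("kb-1", trust=2)
first_summary = Content("sum-1", sources=(web_page, kb_article))
second_summary = Content("sum-2", sources=(first_summary,))
print(effective_trust(second_summary))
```

Taking the minimum is deliberately conservative: one low-trust input taints the whole summary, which matches the observation that injected intent can survive summarization.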
Pattern three: file-upload bypasses
User-provided file uploads that bypassed content filters were used to deliver injection payloads in formats the filters did not parse: text in images that was later extracted by OCR, PDF metadata, document comments, and spreadsheet formulas. The filters had been designed for a different threat model and missed the injection content even when they were nominally inspecting the file.
The mitigation that worked was a separate ingestion path for user-uploaded content, with stricter trust scoring, mandatory provenance tags, and a model-side awareness that content from this path is potentially adversarial. The architectural separation produced cleaner decisions at the model layer than any amount of filter tuning had produced previously.
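A separate ingestion path might look something like the sketch below, where every uploaded chunk is forced to carry a provenance tag and the lowest trust level, and is labeled when rendered into a prompt. The type names, the `user_upload` channel label, and the `<untrusted>` wrapper are illustrative assumptions; the wrapper is a label for the model, not a control on its own.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    channel: str       # which ingestion path produced the chunk
    uploader: str      # identity of the submitting user
    ingested_at: str   # ISO 8601 timestamp, UTC
    trust: int         # 0 = potentially adversarial


@dataclass(frozen=True)
class IngestedChunk:
    text: str
    provenance: Provenance


def ingest_upload(text, uploader, channel="user_upload"):
    """Ingest user-supplied content; provenance is mandatory, trust is fixed low."""
    if not uploader:
        raise ValueError("provenance requires an uploader identity")
    stamp = datetime.now(timezone.utc).isoformat()
    return IngestedChunk(text, Provenance(channel, uploader, stamp, trust=0))


def render_for_prompt(chunk):
    """Wrap the chunk so model-side instructions can refer to it as data."""
    return (
        f"<untrusted channel={chunk.provenance.channel!r}>\n"
        f"{chunk.text}\n"
        f"</untrusted>"
    )
```

Because the trust level is set by the path rather than by a filter verdict, a payload in an unparsed format still arrives marked as low-trust.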
The harder version of this pattern involves files uploaded through indirect channels — email attachments that get processed by an AI triage system, support tickets with attached screenshots that get summarized for the responding agent, customer-submitted documents that get extracted into a structured data pipeline. Each of these channels introduces a path through which adversarial content can reach a model that was not explicitly designed to handle it. Programs that have inventoried all of these channels and applied consistent trust scoring across them are noticeably better defended than programs that have hardened only the obvious upload paths.
Controls that worked
Across the three patterns, the controls that materially reduced blast radius were source-trust scoring at the retrieval layer, mandatory provenance tags on every retrieved chunk, and tool-call allow-lists derived from the originating user's role. None are exotic; all required treating the retriever as a security boundary rather than as a utility. The teams that had already made that shift were positioned to detect and respond; the teams that had not are now doing the architectural work under time pressure.
The control that produced the largest single risk reduction across the three patterns was the user-derived tool-call allow-list. Even when an injection succeeded in steering the model toward a malicious action, the action was blocked at the tool layer because the user on whose behalf the agent was operating did not have the authority to perform it. The allow-list converted what would have been a successful exploitation into a logged anomaly, which produced both the immediate containment and the signal needed to investigate.
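In outline, a user-derived allow-list is a check at the tool layer that ignores what the model wants and asks only what the originating user's role permits. The role names, tool names, and logger name below are hypothetical.

```python
import logging

# Hypothetical role-to-tool mapping; in practice this would be derived
# from the organization's existing authorization system.
ROLE_ALLOWED_TOOLS = {
    "support_agent": {"search_kb", "read_ticket"},
    "finance_admin": {"search_kb", "issue_refund"},
}

logger = logging.getLogger("toolcall.audit")


class ToolCallDenied(Exception):
    """Raised when a tool call exceeds the originating user's authority."""


def authorize(role: str, tool: str) -> None:
    """Allow the call only if the user's role permits the tool; log otherwise."""
    allowed = ROLE_ALLOWED_TOOLS.get(role, set())
    if tool not in allowed:
        # The blocked call is both the containment and the detection signal.
        logger.warning("blocked tool call: role=%s tool=%s", role, tool)
        raise ToolCallDenied(f"{role} may not call {tool}")
```

An agent runtime would call `authorize` before dispatching any tool call, so a successful injection surfaces as a logged denial rather than a completed action.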
What to instrument now
Programs that have not yet experienced one of these campaigns can prepare by instrumenting three layers: the retrieval layer, so that the baseline of which documents are being retrieved for which queries is visible and anomalies surface quickly; the tool-call layer, so that the boundary between intended and unintended actions is enforced rather than implied; and the prompt layer, so that the provenance of every component of the assembled prompt is queryable when an incident requires reconstruction. The instrumentation is the foundation that the response capability rests on; programs that build it before the incident are positioned to handle the incident in hours rather than weeks.
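For the prompt layer, one possible shape is to emit a provenance record alongside every assembled prompt. The sketch below assumes each component has a role and a source identifier; the field names and the choice of a SHA-256 content hash are illustrative.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptComponent:
    role: str        # e.g. "system", "retrieved", "user" (assumed labels)
    source_id: str   # document ID, template version, or session ID
    text: str


def assemble(components):
    """Return the prompt text plus a provenance record for later reconstruction."""
    prompt = "\n\n".join(c.text for c in components)
    record = [
        {
            "role": c.role,
            "source_id": c.source_id,
            # Hash rather than raw text, so the record can be retained
            # without duplicating potentially sensitive content.
            "sha256": hashlib.sha256(c.text.encode("utf-8")).hexdigest(),
        }
        for c in components
    ]
    return prompt, record


def find_by_source(record, source_id):
    """Query a stored record for every component that came from one source."""
    return [entry for entry in record if entry["source_id"] == source_id]
```

Stored per conversation turn, such records let an investigator ask which prompts included a given document once it is identified as poisoned.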
Detection signals worth tuning for
The detection signals that surfaced these campaigns earliest were not the ones the security team had originally instrumented. The original instrumentation focused on the model output — toxic content, policy violations, obvious refusal patterns — and the campaigns we observed largely avoided producing those signals because the adversary wanted the model to act, not to misbehave visibly. The signals that worked were on the retrieval side and the tool-call side: anomalous diversity in retrieved documents for narrow query classes, unusual sequences of tool calls within a single conversation, and tool calls whose arguments diverged from the historical distribution for the same tool under the same role.
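As one example of a tool-call-side signal, unusual sequences can be approximated by flagging tool-to-tool transitions that were rare or absent in historical conversations. The bigram model and the `min_count` threshold below are assumptions chosen for simplicity, and the tool names are hypothetical.

```python
from collections import Counter


def bigrams(calls):
    """Consecutive (previous_tool, next_tool) pairs in one conversation."""
    return list(zip(calls, calls[1:]))


def build_baseline(conversations):
    """Count how often each tool-to-tool transition occurred historically."""
    counts = Counter()
    for calls in conversations:
        counts.update(bigrams(calls))
    return counts


def unseen_transitions(baseline, calls, min_count=1):
    """Transitions in this conversation seen fewer than min_count times before."""
    return [bg for bg in bigrams(calls) if baseline[bg] < min_count]
```

A baseline kept per role would bring this closer to the third signal described above, since what counts as a normal sequence differs by role.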
The teams that detected the campaigns quickly were the teams whose detection engineering function treated the AI stack as a first-class detection surface rather than as a black box behind the existing monitoring. The detection engineers wrote and tuned rules against the retrieval and tool-call telemetry the same way they had tuned rules against network and endpoint telemetry for the previous decade. The skill transfer was natural; the missing piece was the management decision to treat the AI surface as worth the detection engineering investment at all.
The detection function also needs to share signals with the model and prompt teams, because some of the most useful corrective actions are upstream changes to the prompts or retrieval policies that make the same campaign harder to execute next time. Programs that treat detection as a security-only function tend to catch the campaigns and let the underlying attack pattern recur; programs that share the signals back to engineering tend to close the underlying conditions and force the adversary to find a new approach.
Red-team programs adapted for RAG
Internal red teams are extending their practice to cover the retrieval-layer attack surface. The exercises that have produced the most insight have been the ones designed by people who have either run the campaigns from the attacker side or defended against them in production. Generic AI red-team exercises that focus on jailbreaking the model layer tend to miss the retrieval pathway entirely, which means the defending team gets practice on the wrong layer and walks away with a misleading sense of readiness.
The exercises that work involve planting realistic poisoned documents in the test estate, running them through the same ingestion pipelines as legitimate content, and measuring how long it takes for the detection function to surface the anomaly and for the response function to contain it. The metrics from the exercises feed directly into the prioritization of the next quarter of instrumentation work. Programs that run these exercises quarterly find that their detection and response times improve measurably from one exercise to the next; programs that run them annually or not at all find that their readiness drifts in the wrong direction between exercises.
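The two measurements are simple to record consistently from one exercise to the next. The sketch below assumes three timestamps per planted document; the `Exercise` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Exercise:
    planted_at: datetime                 # poisoned document entered ingestion
    detected_at: Optional[datetime]      # detection function raised the anomaly
    contained_at: Optional[datetime]     # response function contained it


def time_to_detect(ex: Exercise) -> Optional[timedelta]:
    """Planting to detection; None if the plant was never detected."""
    if ex.detected_at is None:
        return None
    return ex.detected_at - ex.planted_at


def time_to_contain(ex: Exercise) -> Optional[timedelta]:
    """Detection to containment; None if either step did not happen."""
    if ex.detected_at is None or ex.contained_at is None:
        return None
    return ex.contained_at - ex.detected_at
```

Keeping undetected plants as `None` rather than dropping them preserves the misses, which are arguably the most useful input to the next quarter's prioritization.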
The exercises also produce a useful side effect: they educate the rest of the engineering organization about the threat model in a way that documentation does not. An engineer who has watched a planted document propagate through the retrieval pipeline into a production conversation understands the threat in a way that no slide deck would convey, and that engineer is more likely to design the next AI feature with the threat in mind. The training value of the exercises is one of the underrated reasons to run them regularly.
Working with content owners
The retrieval-layer defenses depend on cooperation from the content owners — the teams that maintain the wikis, the document stores, the support knowledge bases, and the other repositories that feed the retrieval index. The cooperation has to be earned, because the controls security needs to impose can look like friction to the content owners' workflows. Programs that frame the cooperation as a partnership and that absorb some of the operational cost on the security side tend to get sustainable cooperation; programs that frame it as a mandate tend to get formal compliance and informal workarounds that defeat the controls.
