security: hide defense mechanism from user-facing prompt display

Split system prompt and user message into public/private versions:
- Private versions (sent to LLM): include delimiter tags, anti-injection
  instructions, and 'never reveal' directives
- Public versions (shown to user via 'Show prompt'): clean prompt
  without any defense details, raw user text without tag wrappers

The user never sees:
- The ###### delimiter tags wrapping their input
- The instruction to ignore embedded instructions
- The instruction to never reveal the system prompt
- The instruction not to acknowledge delimiter tags

This prevents an attacker from learning the defense mechanism
and crafting injections that work around it.
This commit is contained in:
2026-04-12 23:42:31 -04:00
parent 96155fda36
commit 85dec4908f
4 changed files with 88 additions and 56 deletions

View File

@@ -126,6 +126,36 @@
fuchsia: "#D63384",
orange: "#F39C12",
indigo: "#5B2C6F",
dustyrose: "#966464",
dustypink: "#966482",
dustypeach: "#966E5A",
dustycoral: "#96645A",
dustyblush: "#8C6E8C",
dustyviolet: "#786496",
dustylavender: "#826EA0",
dustyblue: "#6478A0",
dustyslate: "#6E788C",
dustysky: "#507896",
dustyteal: "#468282",
dustycyan: "#3C828C",
dustymint: "#50826E",
dustysage: "#5A825A",
dustygreen: "#508264",
dustyemerald: "#46826E",
dustyseafoam: "#468278",
dustyolive: "#6E8250",
dustylime: "#6E823C",
dustygold: "#8C7846",
dustyamber: "#966E46",
dustymustard: "#8C783C",
dustyyellow: "#82783C",
dustyorange: "#966446",
dustyclay: "#8C6450",
dustyterra: "#8C5A46",
dustywine: "#96646E",
dustyberry: "#96648C",
dustymagenta: "#965A82",
dustyplum: "#8C648C",
};
const animationStyles = [