🤖 AI researchers show that the "black box" nature of large language models (LLMs) like OpenAI's ChatGPT and Google's Bard makes it tough to truly delete sensitive data. Here's why:
- LLMs are pre-trained on massive text datasets, then fine-tuned to produce coherent outputs.
- Deleting specific files from the training data doesn't remove what the model has already learned from them.
- Guardrails like hard-coded prompts and reinforcement learning from human feedback (RLHF) can suppress outputs, but don't actually delete the underlying info.
- State-of-the-art methods like Rank-One Model Editing (ROME) still leave facts extractable 29-38% of the time.
- Researchers developed new defense methods, but admit they may always be playing catch-up to attack methods. 🕵️‍♂️
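To see why guardrails suppress rather than delete, here's a minimal toy sketch (all names and data are hypothetical, and the "model" is a stand-in dictionary, not a real LLM): a filter blocks sensitive strings at the output layer, but the knowledge itself remains intact underneath.

```python
# Hypothetical sketch: a hard-coded guardrail censors outputs, but the
# underlying model still "knows" the fact -- nothing is deleted.

BLOCKED_TERMS = {"jane doe's ssn"}  # hypothetical sensitive phrase

def model_generate(prompt: str) -> str:
    # Stand-in for an LLM whose weights have memorized sensitive data.
    memorized = {"what is jane doe's ssn?": "jane doe's ssn is 123-45-6789"}
    return memorized.get(prompt.lower(), "i don't know")

def guarded_generate(prompt: str) -> str:
    output = model_generate(prompt)
    # The guardrail only censors the surface string...
    if any(term in output.lower() for term in BLOCKED_TERMS):
        return "[redacted]"
    return output

# The guarded path blocks the direct query:
print(guarded_generate("What is Jane Doe's SSN?"))  # prints "[redacted]"
# ...but anything that bypasses or jailbreaks the filter reaches
# the memorized fact, because it was never removed from the model:
print(model_generate("what is jane doe's ssn?"))  # prints the SSN
```

This is why attacks that rephrase queries or probe a model's internal states can still extract "deleted" facts: the guardrail sits in front of the knowledge instead of erasing it.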