Looni Lab

Contextual Integrity, Privacy & Differential Privacy for Language

We study privacy for language models through contextual integrity: information flow norms rather than fixed redaction rules. A privacy violation happens when information crosses a contextual boundary in a way that breaches social expectations; whether it does depends on the recipient, the purpose, and the downstream consequences. We build benchmarks such as ConfAIde and CIMemories that test whether models respect these norms, and we find that violations compound over long, multi-turn interactions. On the differential privacy side, we work on methods that add formal guarantees to text while preserving rare phrasings and individual style, by operating over semantic or parse-tree representations instead of raw tokens.
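To make the flow-norm framing concrete, here is a minimal Python sketch. It is an illustration only, not the method behind ConfAIde or CIMemories: the `Flow` fields, the `ALLOWED_FLOWS` set, and the `violates` helper are hypothetical names, and real norms are far richer than an allow-list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One information flow; every field name here is an illustrative assumption."""
    info_type: str   # what is shared, e.g. "diagnosis"
    context: str     # where the information was originally disclosed
    recipient: str   # who receives it
    purpose: str     # why it is being shared


# Hypothetical norms: (info_type, context, recipient, purpose) flows deemed appropriate.
ALLOWED_FLOWS = {
    ("diagnosis", "clinic", "doctor", "treatment"),
    ("diagnosis", "clinic", "pharmacist", "dispensing"),
}


def violates(flow: Flow, allowed=ALLOWED_FLOWS) -> bool:
    """Return True when the flow matches none of the listed norms."""
    key = (flow.info_type, flow.context, flow.recipient, flow.purpose)
    return key not in allowed


to_doctor = Flow("diagnosis", "clinic", "doctor", "treatment")
to_employer = Flow("diagnosis", "clinic", "employer", "hiring")

print(violates(to_doctor))    # same information, appropriate flow
print(violates(to_employer))  # same information, different recipient and purpose
```

A fixed redaction rule keyed on the information type would treat both flows identically; the flow-based check separates them by recipient and purpose.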

Selected Papers

Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory · ICLR 2024 (Spotlight)
CIMemories: A Compositional Benchmark for Contextual Integrity of Persistent Memory in LLMs · ICLR 2026
PRIVASIS: Synthesizing the Largest "Public" Private Dataset from Scratch · ICML 2026
Operationalizing Data Minimization for Privacy-Preserving LLM Prompting · ICLR 2026
Privacy-Preserving Domain Adaptation of Semantic Parsers · ACL 2023
Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation · ICLR 2024

Open Problems: Multi-agent and multi-person settings with conflicting norms. Long-horizon interactions where individually benign disclosures aggregate into violations. User-level DP for conversational data that keeps personalization, and DP pattern extraction that lets researchers study sensitive interaction data without raw access. [Write-up] · [Technical report]

Memorization & Membership Inference

We treat memorization as a window into learning dynamics: what gets encoded, when during training, and how it relates to the pretraining distribution. We find that regurgitation tracks n-gram frequency and that extractable verbatim recall requires repetition, so apparent one-shot memorization is usually reconstruction of frequent or templated patterns. Over half of memorized content comes from general language-modeling ability rather than sequence-specific weights, which is part of why unlearning often cannot remove it without hurting overall quality. We also show that ordinary downstream finetuning can reactivate verbatim recall of copyrighted text that earlier alignment had suppressed (Alignment Whack-a-Mole).
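As a toy illustration of what "tracks n-gram frequency" can mean, the sketch below scores a sequence by the average corpus count of its n-grams. This is an assumption-laden stand-in, not the measurement used in our papers: the corpus is made up, and `ngram_counts` and `mean_ngram_frequency` are hypothetical helper names.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mean_ngram_frequency(sequence, counts, n):
    """Average corpus count of the sequence's n-grams (0.0 if shorter than n)."""
    grams = [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]
    if not grams:
        return 0.0
    return sum(counts[g] for g in grams) / len(grams)


# Toy corpus (made up): a templated phrase repeated three times plus one unique sentence.
corpus = ("all rights reserved . " * 3 + "the heron crossed the frozen lake .").split()
counts = ngram_counts(corpus, n=2)

templated = "all rights reserved .".split()
unique = "the heron crossed the frozen lake .".split()

print(mean_ngram_frequency(templated, counts, n=2))  # 3.0
print(mean_ngram_frequency(unique, counts, n=2))     # 1.0
```

Under this toy score the templated phrase looks far more "frequent" than the unique sentence, which is the kind of signal the paragraph above associates with regurgitation.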

Selected Papers

Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in LLMs · ICML 2026 MemFM Workshop (Oral)
Membership Inference Attacks against Language Models via Neighbourhood Comparison · ACL 2023
Quantifying Privacy Risks of Masked Language Models Using Membership Inference Attacks · EMNLP 2022
Memorization Dynamics in Knowledge Distillation for Language Models · Preprint 2026

Open Problems: Predicting which sequences will be memorized before training finishes. Understanding the memorization, capacity, and competence triad, since models are most absorbent around 10 to 20 percent into training. Connecting memorization to contamination detection and unlearning, and explaining why suppressed recall resurfaces after finetuning.

AI for Science (Chemistry, Drug Discovery & RL for Reasoning)

Building on work with the FAIR Chemistry group at Meta, we study whether LLM agents can do end-to-end small-molecule drug design: reasoning over targets, proposing structures, and optimizing candidates over many steps. On our SMDD-Bench, even the strongest frontier model solves only about 40 percent of tasks, and agents often reward-hack the oracles (gaming the structure predictor or brute-forcing ADMET calls) rather than showing molecular intuition.

A central direction is reinforcement learning for science: learning good representations of scientific structure and building RL on top of them. We find that RL-trained models traverse hierarchical knowledge better than SFT or distilled models, and that training on synthetic graph-traversal tasks transfers to unrelated retrieval benchmarks. We are interested in many forms of RL here, including agentic and test-time RL for discovery, as well as agentic verification workflows such as RefGrader for grading math proofs.

Selected Papers

SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks? · ICML 2026 GenBio Workshop (leaderboard at smddbench.com)
Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs · 2025
RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows · NeurIPS 2025 Workshop (MATH-AI)

Open Problems: Agentic and test-time RL for scientific discovery. Reusable scientific representations that RL can build on. Synthesis-aware design that plans routes, not just structures. 3D pocket reasoning and interaction-point prediction. Oracles and rewards that resist gaming while staying faithful to real chemistry.

AI & Mental Health

People increasingly bring their hardest moments to AI systems, which makes the privacy and safety of mental-health AI a priority for us. With support from an OpenAI Mental Health Research Grant (co-PI with Adam Perer, CMU), we study how safety and mental-health systems handle sensitive disclosures. We find that safety classifiers leak the most at the decision boundary, where crisis and mental-health queries tend to sit, so the inputs we most want to protect are the easiest to infer. We also want to study usage and escalation patterns from sensitive logs without exposing raw conversations.

Selected Papers

Boundary-targeted Membership Inference Attacks on Safety Classifiers · A. Hughes, A. Goldberg, P. Jha, A. Perer, N. Aletras, N. Mireshghallah (Under Review, NeurIPS 2026)

Open Problems: Privacy-preserving study of sensitive interaction data such as crisis lines and mental-health chatbots. Safety classifiers that do not leak membership at the boundary. Evaluating escalation and intervention quality without compromising confidentiality. Aligning mental-health AI with clinical norms and contextual-integrity expectations.

Value Diversity & Pluralistic Alignment

Most alignment optimizes toward a single response, the mean of annotator preferences, which erases minority viewpoints and stylistic diversity. We focus on the distributional side: how model outputs relate to the full spectrum of human variation. For example, a model knows a coin is fifty-fifty but will simulate ten tosses as eight heads; models learn facts about distributions without learning to materialize them. This connects to pretraining frequency and cuts across alignment, copyright, and personalization.
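For a sense of how unrepresentative the coin-toss outcome above is, the exact binomial probabilities can be computed directly. This is a worked illustration rather than an experiment from the papers below; `prob_heads` is a hypothetical helper name.

```python
from math import comb


def prob_heads(k, n, p=0.5):
    """Exact binomial probability of k heads in n independent tosses."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_exactly_8 = prob_heads(8, 10)
p_at_least_8 = sum(prob_heads(k, 10) for k in range(8, 11))

print(f"P(exactly 8 heads in 10)  = {p_exactly_8:.4f}")   # 0.0439
print(f"P(at least 8 heads in 10) = {p_at_least_8:.4f}")  # 0.0547
```

A single run with eight heads is possible for a fair coin, so the concern is systematic skew across many simulated runs, not any one sample.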

Selected Papers

A Roadmap to Pluralistic Alignment · ICML 2024
Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability · ICLR 2026
AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models · ICLR 2025 (Oral)

Open Problems: Training models to materialize distributional diversity rather than memorize distribution statistics. Probabilistic preference modeling that represents the spectrum rather than point estimates. Measuring diversity that matters for downstream capabilities versus noise. Data scarcity for minority viewpoints and rare preferences.
