Members
- Bill Watson (co-advised with Sauvik Das)
- Xiaoyu (Nicholas) Wu (co-advised with Steven Wu)
- Renfei Zhang
- Kevin Han (co-advised with Amir Barati Farimani)
- Bhuvan Chandra Koduru (Master's)
Visiting & Collaborators
Research Themes
We study privacy for language models through contextual integrity: norms of information flow rather than fixed redaction rules. A privacy violation happens when information crosses a contextual boundary in a way that breaches social expectations; whether a given flow does so depends on the recipient, the purpose, and the downstream consequences. We build benchmarks like ConfAIde and CIMemories that test whether models respect these norms, and we find that violations compound over long, multi-turn interactions. On the differential privacy side, we work on methods that add formal guarantees to text while preserving rare phrasings and individual style, by operating over semantic or parse-tree representations instead of raw tokens.
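As a toy illustration of how a flow norm differs from a redaction rule, the sketch below models a flow as a tuple and checks it against a small set of allowed flows. Everything here is hypothetical: the `Flow` fields, the `ALLOWED_FLOWS` entries, and `is_appropriate` are illustrative names, not the format used by ConfAIde or CIMemories, and downstream consequences are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One information flow: what is shared, by whom, with whom, and why."""
    info_type: str
    sender: str
    recipient: str
    purpose: str


# Hypothetical norms: the flows this toy context treats as appropriate.
ALLOWED_FLOWS = {
    ("health_condition", "patient", "doctor", "treatment"),
    ("health_condition", "doctor", "specialist", "treatment"),
}


def is_appropriate(flow: Flow) -> bool:
    """Return True if the flow matches one of the allowed norms."""
    key = (flow.info_type, flow.sender, flow.recipient, flow.purpose)
    return key in ALLOWED_FLOWS


# Same information type, different recipient and purpose.
to_doctor = Flow("health_condition", "patient", "doctor", "treatment")
to_employer = Flow("health_condition", "doctor", "employer", "hiring")

print(is_appropriate(to_doctor))
print(is_appropriate(to_employer))
```

A fixed redaction rule keyed on `health_condition` alone would treat both flows identically; the norm-based check separates them by recipient and purpose.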
Selected Papers
- Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory · ICLR 2024 (Spotlight)
- CIMemories: A Compositional Benchmark for Contextual Integrity of Persistent Memory in LLMs · ICLR 2026
- PRIVASIS: Synthesizing the Largest "Public" Private Dataset from Scratch · ICML 2026
- Operationalizing Data Minimization for Privacy-Preserving LLM Prompting · ICLR 2026
- Privacy-Preserving Domain Adaptation of Semantic Parsers · ACL 2023
- Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation · ICLR 2024
We treat memorization as a window into learning dynamics: what gets encoded, when during training, and how it relates to the pretraining distribution. We find that regurgitation tracks n-gram frequency and that extractable verbatim recall requires repetition, so apparent one-shot memorization is usually reconstruction of frequent or templated patterns. Over half of memorized content comes from general language-modeling ability rather than sequence-specific weights, which is part of why unlearning is hard to achieve without hurting overall quality. We also show that ordinary downstream finetuning can reactivate verbatim recall of copyrighted text that earlier alignment had suppressed (Alignment Whack-a-Mole).
Selected Papers
- Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in LLMs · ICML 2026 MemFM Workshop (Oral)
- Membership Inference Attacks against Language Models via Neighbourhood Comparison · ACL 2023
- Quantifying Privacy Risks of Masked Language Models Using Membership Inference Attacks · EMNLP 2022
- Memorization Dynamics in Knowledge Distillation for Language Models · Preprint 2026
Building on work with the FAIR Chemistry group at Meta, we study whether LLM agents can do end-to-end small-molecule drug design: reasoning over targets, proposing structures, and optimizing candidates over many steps. On our SMDD-Bench, even the strongest frontier model solves only about 40 percent of tasks, and agents often reward-hack the oracles (gaming the structure predictor or brute-forcing ADMET calls) rather than showing molecular intuition.
A central direction is reinforcement learning for science: learning good representations of scientific structure and building RL on top of them. We find that RL-trained models traverse hierarchical knowledge better than SFT or distilled models, and that training on synthetic graph-traversal tasks transfers to unrelated retrieval benchmarks. We are interested in many forms of RL here, including agentic and test-time RL for discovery, as well as agentic verification workflows such as RefGrader for grading math proofs.
Selected Papers
- SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks? · ICML 2026 GenBio Workshop (leaderboard at smddbench.com)
- Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs · 2025
- RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows · NeurIPS 2025 Workshop (MATH-AI)
People increasingly bring their hardest moments to AI systems, which makes the privacy and safety of mental-health AI a priority for us. With support from an OpenAI Mental Health Research Grant (co-PI with Adam Perer, CMU), we study how safety and mental-health systems handle sensitive disclosures. We find that safety classifiers are most vulnerable to membership inference at the decision boundary, where crisis and mental-health queries tend to sit, so the inputs we most want to protect are the easiest to infer. We also want to study usage and escalation patterns from sensitive logs without exposing raw conversations.
Selected Papers
- Boundary-targeted Membership Inference Attacks on Safety Classifiers · A. Hughes, A. Goldberg, P. Jha, A. Perer, N. Aletras, N. Mireshghallah (Under Review, NeurIPS 2026)
Most alignment optimizes toward a single response, the mean of annotator preferences, which erases minority viewpoints and stylistic diversity. We focus on the distributional side: how model outputs relate to the full spectrum of human variation. For example, a model knows a coin is fifty-fifty but will simulate ten tosses as eight heads; models learn facts about distributions without learning to materialize them. This connects to pretraining frequency and cuts across alignment, copyright, and personalization.
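The arithmetic behind the coin example can be sketched as follows. A single run of ten fair tosses with eight heads is not impossible (probability 45/1024, roughly 4.4 percent); the gap shows up in how head counts are distributed across many simulated runs. The helper names and the use of total variation distance are illustrative assumptions, one possible way to quantify the gap rather than a metric taken from a specific paper.

```python
from math import comb


def binomial_pmf(k: int, n: int = 10, p: float = 0.5) -> float:
    """Probability of exactly k heads in n independent tosses with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def total_variation(head_counts: list[int], n: int = 10, p: float = 0.5) -> float:
    """Total variation distance between the empirical distribution of head
    counts (one count per simulated run of n tosses) and Binomial(n, p)."""
    runs = len(head_counts)
    empirical = [head_counts.count(k) / runs for k in range(n + 1)]
    return 0.5 * sum(
        abs(empirical[k] - binomial_pmf(k, n, p)) for k in range(n + 1)
    )


# Hypothetical model behaviour: every simulated run comes back with 8 heads.
always_eight = [8] * 20

print(binomial_pmf(8))
print(total_variation(always_eight))
```

A distance near 0 means the simulated runs look like a fair coin; a source that always returns eight heads sits close to the maximum of 1.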
Selected Papers