Ulyana Piterbarg

up2021 [at] cims.nyu.edu

[Twitter] [Google Scholar] [Github] [CV]

Papers


BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games


Davide Paglieri, Bartłomiej Cupiał*, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel

Preprint, 2024
arXiv / project page / code /

We introduce BALROG, a benchmark designed to assess the agentic capabilities of LLMs and VLMs through a set of increasingly challenging games. Frontier models like Claude 3.5 Sonnet and GPT-4o achieve <2% progress on the hardest game in BALROG, the notorious roguelike dungeon-crawler NetHack.


Training Language Models on Synthetic Edit Sequences Improves Code Synthesis


Ulyana Piterbarg, Lerrel Pinto, Rob Fergus

Preprint, 2024
arXiv / project page / code /

There are infinitely many ways to write a program. Training autoregressive LMs to natively synthesize programs as sequences of diffs improves the trade-off between generation quality and inference-time compute. We show that repeatedly sampling solutions to coding problems from small language models supervised fine-tuned on synthetic program diff sequences yields benchmark coverage competitive with GPT-4 and GPT-4-Omni, at a total cost similar to generating a single completion from the best open-source LLMs like Llama 3.1 405B.
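
For a concrete sense of the representation, here is a toy sketch of encoding a program as a sequence of unified diffs with Python's difflib. It only illustrates the data format; the paper's actual synthetic edit-sequence generation procedure may differ.

```python
import difflib

# Toy sketch: encode a program as a sequence of unified diffs between
# successive versions of its source (illustrative only; the paper's
# synthetic-data procedure may differ).
versions = [
    "",
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n",
]

edit_sequence = []
for prev, curr in zip(versions, versions[1:]):
    diff_lines = difflib.unified_diff(
        prev.splitlines(), curr.splitlines(), lineterm=""
    )
    edit_sequence.append("\n".join(diff_lines))

# An LM trained on such data learns to emit programs edit by edit
# instead of in a single left-to-right pass.
print("\n\n".join(edit_sequence))
```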


diff History for Neural Language Agents


Ulyana Piterbarg, Lerrel Pinto, Rob Fergus

41st International Conference on Machine Learning (ICML), 2024
arXiv / project page / code /

We propose that long-context LMs for decision-making, interaction, and/or world-modeling be trained on demonstration data consisting of textual observation deltas, or diff history. Diff history can be thought of as a textual analogue of optical flow. Using diff history, we show that even tiny LMs (~120M parameters) can be efficiently tuned into highly competitive and robust agents for hard decision-making settings like the video game NetHack, where they match the state of the art despite being tuned on 1800x less data than prior work.
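
As a rough illustration of the idea (not the paper's exact prompt format), the sketch below keeps the first textual observation in full and replaces each later observation with its delta from the previous one, computed with Python's difflib; the function name and placeholder text are hypothetical.

```python
import difflib

def diff_history_prompt(observations):
    """Minimal sketch: keep the first textual observation in full and replace
    each later observation with its delta from the previous one.
    (The paper's exact diff format, instruction text, and truncation differ.)"""
    chunks = [observations[0]]
    for prev, curr in zip(observations, observations[1:]):
        delta = "\n".join(
            difflib.unified_diff(prev.splitlines(), curr.splitlines(), lineterm="")
        )
        chunks.append(delta if delta else "<no change>")
    return "\n\n".join(chunks)

# Example with two toy NetHack-style text frames.
frames = ["HP: 12  Dlvl: 1\nYou see a door.", "HP: 11  Dlvl: 1\nYou see a door."]
print(diff_history_prompt(frames))
```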


NetHack is Hard to Hack


Ulyana Piterbarg, Lerrel Pinto, Rob Fergus

37th Conference on Neural Information Processing Systems (NeurIPS), 2023
arXiv / project page / code /

Neural policy learning methods struggle in long-horizon tasks, especially in open-ended environments with multi-modal observations, such as the popular dungeon-crawler game, NetHack. In the NeurIPS 2021 NetHack Challenge, symbolic agents outperformed neural approaches by over four times in median game score. In this paper, we delve into the reasons behind this performance gap and present an extensive study on neural policy learning for NetHack. Our investigations produce a state-of-the-art neural agent. However, we also demonstrate that scaling up supervised learning is insufficient to bridge the performance gap with the best symbolic models or even the top human players.


Capturing missing physics in climate model parameterizations using neural differential equations


Ali Ramadhan, John C Marshall, Andre Nogueira Souza, Xin Kai Lee, Ulyana Piterbarg, Adeline Hillier, Gregory LeClaire Wagner, Christopher Rackauckas, Chris Hill, Jean-Michel Campin, Raffaele Ferrari

Earth and Space Science Open Archive (ESSOAR), 2022
arXiv / code /

We explore how neural differential equations (NDEs) may be trained on highly resolved fluid-dynamical models of unresolved scales, providing an ideal framework for data-driven parameterizations in climate models. NDEs overcome some of the limitations of traditional neural networks (NNs) in fluid-dynamical applications in that they can readily incorporate conservation laws and boundary conditions and are stable when integrated over time. We advocate a “residual” approach, in which the NN is used to improve upon an existing parameterization by representing the residual fluxes that are not captured by the base parameterization.
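
The residual idea can be sketched schematically as follows; the paper's models are implemented in Julia, and the function names and toy linear “network” here are stand-ins rather than the actual parameterization.

```python
import numpy as np

# Schematic illustration of the "residual" approach (names and the toy
# linear "network" below are stand-ins, not the paper's implementation).

def base_parameterization(T, dz, kappa=1e-4):
    # Diffusive flux from an existing, imperfect closure.
    return -kappa * np.gradient(T, dz)

def nn_residual_flux(T, weights):
    # Stand-in for a small neural network that predicts the flux the base
    # closure misses; here just a single linear layer.
    return weights @ T

def dTdt(T, dz, weights):
    # Total flux = base closure + learned residual; the tendency is minus
    # the divergence of that total flux.
    total_flux = base_parameterization(T, dz) + nn_residual_flux(T, weights)
    return -np.gradient(total_flux, dz)

# The NDE is then integrated in time and `weights` fit so that trajectories
# match the highly resolved simulation.
```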


Abstract strategy learning underlies flexible transfer in physical problem solving


Kelsey R. Allen, Kevin A. Smith, Ulyana Piterbarg, Robert Chen, Josh B. Tenenbaum

42nd Annual Meeting of the Cognitive Science Society, 2020


What do people learn when they repeatedly try to solve a set of related problems? Across three exploratory physical problem-solving experiments, participants consistently learned strategies rather than generically better world models. Our results suggest that people can make use of limited experience to learn abstract strategies that go beyond simple model-free policies and are instead object-oriented, adaptable, and parameterized by model-based variables such as weight.


Unpublished Work



Biped Locomotion from Human Demonstrations with Motion Imitation via RL

MIT 6.832: Underactuated Robotics (graduate) (2021-05-17)

I experimented with reinforcement learning as a basis for learning bipedal locomotion skills from human demonstrations, using data from the CMU Motion Capture Database as the demonstration source and the NASA Valkyrie (R5) robot as the target system for motion imitation. This work draws from Peng et al. 2020 and Xie et al. 2019.



Experiments with Quasi-Geostrophic Flows

MIT Ferrari Lab (2021-01-10)

Under the supervision of Andre N. Souza, I experimented with turbulent regimes of quasigeostrophic flows, numerically approximated using Julia.



A Bayesian Approach to Modeling Infection-Based Social Distancing in the SARS-CoV-2 Pandemic

MIT IDS.147/15.077: Statistical Learning and Data Mining (graduate) (2020-05-20)

This project centers on a simple extension of the classical SIR compartmental model of disease transmission that parameterizes infection-based social distancing policy: the feedback SIR (fSIR) model proposed by Dr. Elisa Franco. I fit this model with the probabilistic programming package PyMC3 to real infection statistics from four countries with sharply contrasting responses to the pandemic, yielding posteriors that made it possible to perform a rough numerical comparison of policy efficacy.
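
A minimal sketch of this kind of feedback mechanism is below; the specific damping form 1 / (1 + k·I) and the parameter values are assumptions for illustration, not the exact fSIR equations or the fitted posteriors.

```python
from scipy.integrate import solve_ivp

# Schematic feedback SIR (fSIR): transmission is damped as the infected
# fraction rises, standing in for infection-based social distancing.
# The damping form 1 / (1 + k * I) is assumed here purely for illustration.
def fsir(t, y, beta, gamma, k):
    S, I, R = y
    transmission = beta * S * I / (1.0 + k * I)
    return [-transmission, transmission - gamma * I, gamma * I]

sol = solve_ivp(
    fsir,
    t_span=(0.0, 180.0),
    y0=[0.999, 0.001, 0.0],  # initial susceptible, infected, recovered fractions
    args=(0.3, 0.1, 50.0),   # beta, gamma, k (illustrative values)
    dense_output=True,
)
print(sol.y[1].max())  # peak infected fraction under the assumed feedback
```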



Exploring strategy learning in the “Tools Environment”

MIT 9.66/9.660/6.804: Computational Cognitive Science (2018-12-10)

I worked with Kevin A. Smith and Kelsey R. Allen to design and run a preliminary study evaluating the Virtual Tools Game as a testbed for behavioral experiments on abstract strategy learning in humans. I also designed a hierarchical Bayesian mixture model to identify the abstract strategies learned by study participants directly from data.



Investigating Option-Conditional Value Prediction in Reinforcement Learning

EPFL Life Sciences Summer Research Program Colloquium (2018-08-15)

Supervised by Johanni Brea and Wulfram Gerstner, I investigated the efficacy of option-conditional value prediction in reinforcement learning (RL) by adapting the Value Prediction Network (VPN) to tabular environments and by implementing the algorithm from Oh et al.'s original paper, training with a combination of temporal-difference search (TD search) and n-step Q-learning.
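
For reference, a minimal sketch of the n-step Q-learning target used as one training ingredient is below; the option-conditional value-prediction model itself is not shown, and the function is illustrative rather than taken from the project code.

```python
# Schematic n-step Q-learning target: discounted sum of the next n rewards
# plus a bootstrapped value at the n-th step (illustrative only).
def n_step_target(rewards, bootstrap_q, gamma=0.99):
    """rewards: the next n rewards; bootstrap_q: max_a Q(s_{t+n}, a)."""
    target = bootstrap_q
    for r in reversed(rewards):
        target = r + gamma * target
    return target

# Example: 3-step return with a bootstrapped value of 1.0.
print(n_step_target([0.0, 0.0, 1.0], bootstrap_q=1.0))
```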



Anamorphic Entrance Sculpture Prototype

AMNH Exhibitions Department (2017-06-10)

I co-designed the prototype of an anamorphic entrance sculpture to the American Museum of Natural History special exhibition “Our Senses: An Immersive Experience.” The final sculpture was included in the New York Times feature on the exhibition, “This Exhibition Will Help You Make Sense of Your Senses”.