EUREKA: Human-Level Reward Design via Coding Large Language Models


With the advancements Large Language Models have made in recent years, it is unsurprising that these LLM frameworks excel as semantic planners for sequential high-level decision-making tasks. However, developers still find it challenging to tap the full potential of LLMs for learning complex low-level manipulation tasks. Despite their capabilities, today's Large Language Models require considerable domain expertise and careful prompt engineering to learn even simple skills, leaving a significant gap between their performance and human-level dexterity.

To bridge this gap, researchers from Nvidia, CalTech, UPenn, and other institutions have introduced EUREKA, an LLM-powered human-level reward design algorithm. EUREKA harnesses several capabilities of LLM frameworks, including code-writing, in-context improvement, and zero-shot generation, to perform evolutionary optimization over reward code. These reward functions, combined with reinforcement learning, enable agents to learn complex skills and perform manipulation tasks.

In this article, we will examine the EUREKA framework from a development perspective, exploring its architecture, how it works, and the results it achieves in generating reward functions, which, as the developers claim, outperform those written by humans. We will also look at how the EUREKA framework paves the way for a new approach to RLHF (Reinforcement Learning from Human Feedback) by enabling gradient-free in-context learning. Let's get started.

Today, state-of-the-art LLM frameworks like GPT-3 and GPT-4 deliver outstanding results when serving as semantic planners for sequential high-level decision-making tasks, but developers are still looking for ways to enhance their performance when it comes to learning low-level manipulation tasks such as dexterous pen spinning. Developers have also observed that reinforcement learning can achieve impressive results in dexterous manipulation and other domains, provided the reward functions are carefully constructed by human designers, since these reward functions supply the learning signals for favorable behaviors. Compared to real-world reinforcement learning tasks with sparse rewards, which make it difficult for the model to learn useful patterns, shaping these rewards provides the necessary incremental learning signals. However, reward functions, despite their importance, are extremely challenging to design, and sub-optimal designs often lead to unintended behaviors.

To tackle these challenges and maximize the efficiency of reward design, EUREKA, or Evolution-driven Universal REward Kit for Agent, aims to make the following contributions.

  1. Achieve human-level performance in designing reward functions. 
  2. Effectively solve manipulation tasks without manual reward engineering. 
  3. Generate more human-aligned and more performant reward functions by introducing a new gradient-free in-context learning approach in place of the traditional RLHF (Reinforcement Learning from Human Feedback) method. 

There are three key algorithmic design choices that the developers have opted for to enhance EUREKA's generality: environment as context, evolutionary search, and reward reflection. First, the EUREKA framework takes the environment source code as context to generate executable reward functions in a zero-shot setting. Next, the framework performs an evolutionary search to substantially improve the quality of its rewards, proposing a batch of reward candidates in every iteration or epoch and refining the ones it finds most promising. In the third and final stage, the framework uses reward reflection to make the in-context improvement of rewards more effective, a process that ultimately enables targeted and automated reward editing by using a textual summary of reward quality based on policy training statistics. The following figure gives a brief overview of how the EUREKA framework works, and in the upcoming section, we will discuss its architecture and workings in greater detail. 

EUREKA: Model Architecture and Problem Setting

The primary aim of reward shaping is to return a shaped, or dense, reward function in place of a ground-truth reward function that is difficult to optimize directly, such as a sparse reward. Furthermore, designers can only access these ground-truth reward functions through queries, which is why the EUREKA framework opts for reward generation, a program synthesis setting based on the RDP, or Reward Design Problem. 

The Reward Design Problem, or RDP, is a tuple consisting of a world model (with a state space, an action space, and a transition function), a space of reward functions, a learning algorithm, and a fitness function. The learning algorithm optimizes a given reward by producing a policy for the resulting MDP, or Markov Decision Process, and the fitness function produces a scalar evaluation of any policy that can only be accessed through policy queries. The primary goal of the RDP is to output a reward function such that the resulting policy achieves the maximum fitness score. In EUREKA's problem setting, the developers specify every component of the Reward Design Problem using code. Furthermore, given a string that specifies the details of the task, the objective of the reward generation problem is to generate reward function code that maximizes the fitness score. 
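As a rough formalization (the notation below is a light paraphrase of the paper's setup, not a verbatim reproduction), the objective can be written as:

```latex
% RDP: P = <M, R, A_M, F>, where M = (S, A, T) is the world model,
% R is the space of reward functions, A_M maps a reward R to a trained policy,
% and F is the fitness function, accessible only through policy queries.
\max_{R \in \mathcal{R}} \; F\big(\mathcal{A}_M(R)\big)
```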

Moving along, at its core, there are three fundamental algorithmic components in the EUREKA framework: environment as context (generating executable rewards in a zero-shot setting), evolutionary search (iteratively proposing and refining reward candidates), and reward reflection (enabling fine-grained improvement of rewards). The pseudo code for the algorithm is illustrated in the following image, and a rough sketch of the loop in code form is given below. 
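Here is a minimal sketch of that outer loop. It assumes the callables passed in (an LLM sampler, an RL trainer, a fitness function, and a reflection summarizer) are supplied by the user; none of these names come from EUREKA's actual codebase.

```python
from typing import Any, Callable, Dict, Tuple

def eureka_loop(
    env_source: str,
    task_description: str,
    sample_reward_code: Callable[[str], str],          # e.g. a call to a coding LLM
    train_policy: Callable[[str], Tuple[Any, Dict]],   # RL training on a reward candidate
    fitness: Callable[[Any], float],                   # task fitness function F
    summarize: Callable[[Dict], str],                  # reward reflection (text summary)
    iterations: int = 5,
    samples: int = 16,
) -> str:
    """Illustrative sketch of EUREKA's outer loop; all callables are placeholders."""
    best_code, best_score, feedback = "", float("-inf"), ""
    for _ in range(iterations):
        # Environment as context: raw environment code, task, and feedback form the prompt.
        prompt = f"{env_source}\n\nTask: {task_description}\n\nFeedback: {feedback}"
        # Evolutionary search: several i.i.d. samples per iteration.
        for _ in range(samples):
            code = sample_reward_code(prompt)
            try:
                policy, stats = train_policy(code)   # may raise if the generated code is buggy
            except Exception:
                continue
            score = fitness(policy)
            if score > best_score:
                best_code, best_score = code, score
                feedback = summarize(stats)          # reward reflection feeds the next round
    return best_code
```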

Environment as Context

Currently, LLM frameworks need environment specifications as inputs for designing rewards, whereas the EUREKA framework proposes to feed the raw environment source code (with the reward code removed) directly as context, allowing the LLM to take the world model as context. This approach has two major benefits. First, LLM frameworks intended for coding are trained on code written in existing programming languages like C, C++, Python, and Java, which is the fundamental reason they produce better code outputs when allowed to compose code directly in the syntax and style they were originally trained on. Second, the environment source code usually reveals the semantics of the environment and the variables that are fit or ideal for use when composing a reward function for the specified task. On the basis of these insights, the EUREKA framework instructs the LLM to return executable Python code directly, with the help of only formatting tips and generic reward design guidelines. 
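A rough illustration of what such a prompt might look like is shown below; the wording and function name are assumptions made for this article, not EUREKA's released prompt templates.

```python
# Illustrative environment-as-context prompt; the wording here is made up
# for this article and is not EUREKA's actual prompt template.
def build_reward_prompt(env_source: str, task_description: str) -> str:
    return (
        "You are a reward engineer. Below is the environment source code, "
        "with the original reward code removed.\n\n"
        f"{env_source}\n\n"
        f"Task: {task_description}\n\n"
        "Write an executable Python reward function for this task. "
        "Return only code, expose each reward component as a separate term, "
        "and use only variables that appear in the environment code."
    )
```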

Evolutionary Search

The inclusion of evolutionary search in the EUREKA framework aims to present a natural solution to the sub-optimality and execution-error challenges mentioned before. In each iteration or epoch, the framework samples several independent outputs from the Large Language Model, and provided the generations are all i.i.d., the probability that every reward function in an iteration is buggy decreases exponentially as the number of samples increases. 
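A quick way to see this: if each i.i.d. sample has some probability q of being executable, then the chance that every one of the K samples drawn in an iteration is buggy shrinks geometrically with K:

```latex
P(\text{all } K \text{ samples buggy}) = (1 - q)^{K}
```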

In the next step, the EUREKA framework uses the best executable reward function from the previous iteration to perform an in-context reward mutation, and then proposes a new and improved reward function on the basis of textual feedback. By combining the in-context improvement and instruction-following capabilities of Large Language Models, the EUREKA framework is able to specify the mutation operator as a text prompt that suggests how to use the textual summary of policy training to modify existing reward code. 
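A sketch of what such a mutation prompt could contain is given below; again, the prompt text and function name are illustrative assumptions rather than EUREKA's actual mutation operator.

```python
# Hypothetical in-context mutation prompt; illustrative only.
def build_mutation_prompt(env_source: str, task_description: str,
                          best_reward_code: str, reward_reflection: str) -> str:
    return (
        f"{env_source}\n\n"
        f"Task: {task_description}\n\n"
        "Here is the best reward function found so far:\n"
        f"{best_reward_code}\n\n"
        "Here is a summary of how its components behaved during policy training:\n"
        f"{reward_reflection}\n\n"
        "Propose a new, improved reward function that addresses these observations."
    )
```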

Reward Reflection

To ground in-context reward mutations, it is essential to assess the quality of the generated rewards and, more importantly, put that assessment into words. The simple strategy would be to provide the numerical fitness score as the reward evaluation, but while the task fitness function serves as a holistic ground-truth metric, it lacks credit assignment and provides no information about why a reward function works or does not work. So, to provide a more targeted and fine-grained reward diagnosis, the framework proposes to use automated feedback that summarizes the policy training dynamics in text. Furthermore, the reward functions in the EUREKA framework are asked to expose their components individually, allowing the framework to track the scalar value of every reward component at policy checkpoints throughout the entire training phase.
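A minimal sketch of such a summary, assuming each reward component's scalar value is logged at fixed policy checkpoints (the function and output format below are illustrative, not EUREKA's exact feedback format):

```python
from typing import Dict, List

# Illustrative reward-reflection summary; the checkpoint layout and wording
# are assumptions, not EUREKA's exact feedback format.
def reward_reflection(component_history: Dict[str, List[float]],
                      task_metric_history: List[float]) -> str:
    lines = [f"task metric at checkpoints: {task_metric_history}"]
    for name, values in component_history.items():
        lines.append(f"reward component '{name}' at checkpoints: {values}")
    return "\n".join(lines)
```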

Although the reward reflection procedure followed by the EUREKA framework is simple to construct, it is essential thanks to the algorithm-dependent nature of reward optimization. This means that the effectiveness of a reward function is directly influenced by the choice of reinforcement learning algorithm, and with a change in hyperparameters, the same reward may perform differently even with the same optimizer. Thanks to this detailed feedback, the EUREKA framework is able to edit rewards more effectively and selectively while synthesizing reward functions that are in better synergy with the reinforcement learning algorithm. 

Training and Baseline

There are two major training components of the EUREKA framework: Policy Learning and Reward Evaluation Metrics.

Policy Learning

The final reward function for every individual task is optimized with the same reinforcement learning algorithm, using the same set of hyperparameters that were fine-tuned to make the human-engineered rewards function well. 
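In code form, this evaluation protocol boils down to holding the training configuration fixed and only swapping the reward code; the hyperparameter names and values below are placeholders rather than the paper's actual settings.

```python
from typing import Any, Callable, List

# Placeholder protocol: the RL algorithm and its hyperparameters stay fixed
# (tuned for the human-written rewards); only the reward code changes.
FIXED_CONFIG = {"algo": "PPO", "learning_rate": 3e-4, "epochs": 5}  # illustrative values

def final_evaluation(reward_codes: List[str],
                     train: Callable[[str, dict], Any],
                     fitness: Callable[[Any], float]) -> List[float]:
    """Train every candidate reward with the same config and report its fitness."""
    return [fitness(train(code, FIXED_CONFIG)) for code in reward_codes]
```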

Reward Evaluation Metrics

As the task metric varies in scale and semantic meaning across tasks, the EUREKA framework reports the human normalized score, a metric that provides a holistic measure of how the framework performs against expert human-engineered rewards with respect to the ground-truth metrics. 
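Concretely, the reported score takes roughly the following form, where a value of 0 means no better than the sparse reward and a value of 1 or above means matching or exceeding the human-engineered reward:

```latex
\text{human normalized score} = \frac{\text{Method} - \text{Sparse}}{\left|\,\text{Human} - \text{Sparse}\,\right|}
```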

Moving along, there are three primary baselines: L2R, Human, and Sparse. 

L2R

L2R is a dual-stage Large Language Model prompting solution that generates templated rewards. First, an LLM framework fills in a natural language template describing the environment and task, and then a second LLM framework converts this "motion description" into code that writes a reward function by calling a set of manually written reward API primitives. 
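A sketch of this two-stage pipeline is shown below; the `llm` callable, prompt wording, and primitive names are assumptions for illustration, and the real L2R baseline uses its own templates and hand-written reward API.

```python
from typing import Callable

# Illustrative L2R-style two-stage prompting; prompts and primitive names are made up.
def l2r_reward(task_description: str, llm: Callable[[str], str]) -> str:
    # Stage 1: describe the desired motion in natural language.
    motion = llm(f"Describe the robot motion needed to accomplish: {task_description}")
    # Stage 2: translate the motion description into calls to reward API primitives.
    return llm(
        "Translate the following motion description into a reward function that "
        "only calls the provided reward API primitives "
        "(e.g. hypothetical set_joint_target, minimize_distance):\n" + motion
    )
```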

Human

The Human baseline consists of the original reward functions written by reinforcement learning researchers, and thus represents the outcome of expert human reward engineering. 

Sparse

The Sparse baseline is identical to the fitness functions, which are used to evaluate the quality of the rewards the framework generates. 

Results and Outcomes

To analyze the performance of the EUREKA framework, we will evaluate it on different parameters including its performance against human rewards, improvement in results over time, generating novel rewards, enabling targeted improvement, and working with human feedback. 

EUREKA Outperforms Human Rewards

The following figure illustrates the aggregate results over different benchmarks, and as can be clearly observed, the EUREKA framework either outperforms or performs on par with human-level rewards on both the Dexterity and Isaac tasks. In comparison, the L2R baseline delivers similar performance on low-dimensional tasks, but when it comes to high-dimensional tasks, the performance gap is substantial. 

Consistently Improving Over Time

One of the major highlights of the EUREKA framework is its ability to constantly improve and enhance its performance over time with each iteration, and the results are demonstrated in the figure below. 

As can be clearly seen, the framework consistently generates better rewards with each iteration, and it improves on and eventually surpasses the performance of human rewards, thanks to its in-context evolutionary reward search approach. 

Generating Novel Rewards

The novelty of the EUREKA framework's rewards can be assessed by computing the correlation between human and EUREKA rewards across all Isaac tasks. These correlations are then plotted on a scatter plot against the human normalized scores, with each point representing an individual EUREKA reward for a single task. As can be clearly seen, the EUREKA framework predominantly generates weakly correlated reward functions that outperform the human reward functions. 
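One simple way to estimate such a correlation (the exact procedure in the paper may differ) is to evaluate both reward functions on the same batch of sampled states and compute the Pearson correlation of the resulting values:

```python
import numpy as np

def reward_correlation(eureka_values: np.ndarray, human_values: np.ndarray) -> float:
    """Pearson correlation between two reward signals evaluated on shared samples."""
    return float(np.corrcoef(eureka_values, human_values)[0, 1])
```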

Enabling Targeted Improvement

To evaluate the importance of reward reflection in the reward feedback, the developers evaluated an ablation: a EUREKA framework without reward reflection, which reduces the feedback prompt to contain only snapshot values. When running the Isaac tasks, the developers observed that without reward reflection, the EUREKA framework suffered a drop of about 29% in the average normalized score. 

Working with Human Feedbacks

To readily incorporate a wide array of human inputs and generate more human-aligned and performant reward functions, the EUREKA framework, in addition to automated reward design, also introduces a new gradient-free in-context learning approach to Reinforcement Learning from Human Feedback. Two significant observations emerged. 

  1. EUREKA can benefit from and improve upon human reward functions. 
  2. Using human feedback for reward reflection induces aligned behavior. 

The above figure shows how the EUREKA framework achieves a substantial boost in performance and efficiency when using human reward initialization, regardless of the quality of the human rewards, suggesting that the quality of the base rewards does not have a significant impact on the framework's in-context reward improvement abilities. 

The above figure illustrates how the EUREKA framework can not only induce more human-aligned policies, but also modify rewards by incorporating human feedback. 

Final Thoughts

In this article, we have talked about EUREKA, an LLM-powered human-level reward design algorithm that harnesses various capabilities of LLM frameworks, including code-writing, in-context improvement, and zero-shot generation, to perform evolutionary optimization over reward code. The reward code, combined with reinforcement learning, can then be used to learn complex skills or perform manipulation tasks. Without human intervention or task-specific prompt engineering, the framework delivers human-level reward generation capabilities on a wide array of tasks, and its major strength lies in learning complex tasks with a curriculum learning approach. 

Overall, the substantial performance and versatility of the EUREKA framework indicate that combining evolutionary algorithms with large language models might result in a scalable and general approach to reward design, an insight that might be applicable to other open-ended search problems. 
