MetaWorld: A Compact Benchmark for Multi-Task Tabletop Manipulation

November 09, 2025 · Tags: Benchmark, Manipulation, Multi-Task Learning

MetaWorld is a MuJoCo-based benchmark for multi-task and meta-reinforcement learning in robotic manipulation. Its main contribution is a set of 50 distinct manipulation task families under a shared robot interface, which makes task diversity, rather than goal variation within a single task, the main object of study.

Figure: MetaWorld task overview from the paper's source package, showing the 50-task manipulation suite and the train/test split idea for meta-learning.

Figure: Task examples across multiple tabletop manipulation families, from the paper source. Many tasks share one interface while still requiring different manipulation skills.

What the Paper Contributes

MetaWorld was designed to move meta-RL and multi-task RL beyond narrow task families. Instead of only varying a goal location inside one environment, it includes qualitatively different skills such as reaching, pushing, pick-place, opening doors, operating drawers, pressing buttons, inserting pegs, and opening windows.

The common evaluation modes are:

Split   Purpose
ML1     Few-shot adaptation to new goal variations within one task family
MT10    Multi-task learning over 10 task families
MT50    Multi-task learning over all 50 task families
ML10    Meta-learning with 10 training task families and 5 held-out test families
ML45    The harder meta-learning setting, with 45 training task families and 5 held-out test families

The paper’s key point is that a method should not only learn a single manipulation skill. It should reuse structure across many related but distinct manipulation problems.

How to Use It

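import metaworld

# MT1 wraps a single task family; the name must match the installed Metaworld version.
benchmark = metaworld.MT1("door-lock-v3")
env = benchmark.train_classes["door-lock-v3"]()  # instantiate the environment class
task = benchmark.train_tasks[0]                  # one sampled goal/object configuration
env.set_task(task)
obs, info = env.reset()  # Gymnasium-style reset returns (observation, info)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()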

For fair comparison, report the benchmark split, task list, seed, horizon, success definition, and whether evaluation uses state observations or rendered images.
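One way to keep horizon and success definition unambiguous is to fix them inside the evaluation loop. The sketch below assumes a Gymnasium-style environment whose `reset` accepts a `seed` keyword, whose `step` returns a five-tuple, and whose `info` dict carries a `success` flag. The names `run_episode`, `StubEnv`, and `zero_policy` are hypothetical, and "success at any step within the horizon" is just one possible criterion, not necessarily the paper's protocol.

```python
def run_episode(env, policy, horizon, seed=None):
    """Roll out one episode; success means info["success"] fired at any step."""
    obs, info = env.reset(seed=seed)
    succeeded = False
    for _ in range(horizon):
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        succeeded = succeeded or bool(info.get("success", 0.0))
        if terminated or truncated:
            break
    return succeeded


class StubEnv:
    """Hypothetical stand-in for a real environment, used only to run the sketch."""

    def __init__(self, succeed_at):
        self.succeed_at = succeed_at
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return [0.0], {}

    def step(self, action):
        self.t += 1
        info = {"success": float(self.t >= self.succeed_at)}
        return [float(self.t)], 0.0, False, False, info


def zero_policy(obs):
    """Placeholder policy that ignores the observation."""
    return [0.0]


print(run_episode(StubEnv(succeed_at=3), zero_policy, horizon=5))   # True
print(run_episode(StubEnv(succeed_at=10), zero_policy, horizon=5))  # False
```

Swapping the stub for a real environment leaves the loop unchanged. The point is that the horizon and the success rule live in one place and can be quoted verbatim in the report.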

Practical Usage Notes

“MetaWorld” is not a single setting. MT and ML splits answer different questions, and a custom four-task subset should not be compared as if it were MT50 or ML45.
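A small guard can keep run labels honest. The sketch below is illustrative only: `NAMED_SPLITS` and `run_label` are hypothetical names, and the check compares task-family counts alone, so a match is necessary but not sufficient. The task list still has to equal the official one before a named label is justified.

```python
# (train families, held-out test families) for the named multi-family splits.
NAMED_SPLITS = {
    "MT10": (10, 0),
    "MT50": (50, 0),
    "ML10": (10, 5),
    "ML45": (45, 5),
}


def run_label(mode, n_train, n_test=0):
    """Label a run by family counts; mode is "MT" or "ML"."""
    for name, counts in NAMED_SPLITS.items():
        if name.startswith(mode) and counts == (n_train, n_test):
            return name
    return f"custom {mode} subset ({n_train} train / {n_test} test families)"


print(run_label("MT", 50))     # MT50
print(run_label("MT", 4))      # custom MT subset (4 train / 0 test families)
print(run_label("ML", 45, 5))  # ML45
```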

For a clean report, include:

  • exact task family names and version suffixes;
  • MT or ML split, plus whether the run is a custom subset;
  • observation type, action normalization, horizon, seed count, and success definition;
  • per-task success in addition to the averaged score;
  • whether the experiment is about multitask learning, meta-learning, or task/client heterogeneity.
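The checklist above can be captured in a small record that travels with the results. This is a minimal sketch, not a standard format: `RunReport` and its field names are hypothetical, and the numbers in the usage example are placeholders rather than measured results.

```python
import json
from dataclasses import asdict, dataclass
from statistics import mean


@dataclass
class RunReport:
    split: str                # e.g. "MT10", or a custom-subset label
    purpose: str              # multitask, meta-learning, or heterogeneity study
    task_names: list          # exact family names including version suffix
    observation: str          # "state" or "image"
    action_normalization: str
    horizon: int
    seeds: list
    success_definition: str
    per_task_success: dict    # task name -> success rate in [0, 1]

    def __post_init__(self):
        # Every listed task needs its own score, and no extra tasks may appear.
        if set(self.task_names) != set(self.per_task_success):
            raise ValueError("per_task_success must cover exactly task_names")

    def mean_success(self):
        """Unweighted average over tasks, reported alongside per-task scores."""
        return mean(self.per_task_success.values())

    def to_json(self):
        payload = asdict(self)
        payload["mean_success"] = self.mean_success()
        return json.dumps(payload, indent=2, sort_keys=True)


# Placeholder values for illustration only; these are not real results.
report = RunReport(
    split="custom MT subset (2 families)",
    purpose="multitask",
    task_names=["reach-v3", "push-v3"],
    observation="state",
    action_normalization="actions clipped to [-1, 1]",
    horizon=500,
    seeds=[0, 1, 2],
    success_definition="info['success'] at any step within the horizon",
    per_task_success={"reach-v3": 1.0, "push-v3": 0.5},
)
print(report.to_json())
```

Keeping the averaged score as a derived value, rather than a stored field, prevents it from drifting out of sync with the per-task numbers.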

MetaWorld is also useful for federated or distributed-learning prototypes because task heterogeneity is controllable. But the claim should stay scoped: strong MetaWorld performance does not establish language grounding, photorealistic perception, or real-robot deployment.
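As a rough illustration of controllable heterogeneity, the sketch below assigns task families to clients with a single knob. It is an assumption-laden toy, not a method from the paper: `assign_tasks`, the client naming, and the example task names are hypothetical. With `tasks_per_client` equal to the number of families, every client sees every task; with a value of 1, clients are maximally heterogeneous.

```python
import random


def assign_tasks(task_names, n_clients, tasks_per_client, seed=0):
    """Give each client a random subset of task families.

    Smaller tasks_per_client means less overlap between clients,
    i.e. more task heterogeneity across the federation.
    """
    if not 1 <= tasks_per_client <= len(task_names):
        raise ValueError("tasks_per_client must be between 1 and len(task_names)")
    rng = random.Random(seed)  # seeded so the partition is reproducible
    return {
        f"client_{i}": sorted(rng.sample(task_names, tasks_per_client))
        for i in range(n_clients)
    }


# Example task names; check them against the installed Metaworld version.
TASKS = ["reach-v3", "push-v3", "pick-place-v3", "door-lock-v3"]

for client, tasks in assign_tasks(TASKS, n_clients=3, tasks_per_client=2).items():
    print(client, tasks)
```

Reporting the seed and `tasks_per_client` alongside the results lets a reader reconstruct exactly which client saw which task families.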

When To Use It

Use MetaWorld when the experiment needs a clean multi-task manipulation testbed, task/client heterogeneity, policy adaptation, or quick simulation feedback.

It is a good middle ground between very small control tasks such as PushT and larger scene-rich benchmarks such as RoboCasa.

Limits

MetaWorld is not a language-rich VLA benchmark by default, and it is not a real-world dataset. It also does not capture the visual complexity of household scenes. Use it for controlled manipulation learning, not as the only proof of embodied intelligence.

Paper Source

This note was revised from the paper and its LaTeX source package: Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning.