News Release

Offline pre-trained multi-agent decision transformer

Peer-Reviewed Publication

Beijing Zhongke Journal Publishing Co. Ltd.

Overview of the pipeline for pre-training the general policy and fine-tuning it online

image: This figure gives an overview of the proposed method: offline pre-training with supervised learning followed by online fine-tuning with MARL algorithms.

Credit: Beijing Zhongke Journal Publishing Co. Ltd.

Multi-agent reinforcement learning (MARL) algorithms play an essential role in solving complex decision-making tasks by learning from the interaction data between computerized agents and (simulated) physical environments. However, learning a policy from experience demands highly sample-efficient algorithms, because computing resources are limited and collecting interaction data is costly. Furthermore, even in domains where online interaction is feasible, researchers might still prefer to utilize previously collected data. In addition, a policy trained on one scenario usually cannot perform well on another, even within the same task. A universal policy is therefore critical for saving training time in general reinforcement learning (RL) tasks.


Notably, recent advances in supervised learning have shown that the effectiveness of learning methods is maximized when they are given very large modelling capacity and trained on very large, diverse datasets. The surprising effectiveness of large, generic models supplied with large amounts of training data has spurred the community to search for ways to scale up RL models and thereby boost their performance. Towards this end, the Decision Transformer was one of the first models to verify that conventional (offline) RL problems can be solved by generative trajectory modelling. Recasting decision-making problems as sequence modelling problems has opened a new avenue for solving RL tasks. Crucially, it enables RL systems to be trained on diverse datasets in much the same manner as supervised learning, which is often instantiated through offline RL techniques. Offline RL methods have recently attracted tremendous attention because they allow agents to apply self-supervised or unsupervised RL methods when online data collection is infeasible. The researchers argue that this is particularly important for MARL, since online exploration in multi-agent settings is often impractical, whereas learning with unsupervised or meta-learned outcome-driven objectives from offline data remains possible. However, it is not yet clear whether the effectiveness of sequence modelling with transformer architectures also extends to MARL problems.
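The core idea of generative trajectory modelling can be sketched as follows. A trajectory is flattened into a token sequence that a causal transformer models autoregressively; in the original Decision Transformer, each timestep contributes a return-to-go, a state, and an action token. The function names and token layout below are illustrative assumptions, not the paper's actual implementation.

```python
def returns_to_go(rewards):
    """Suffix sums of the reward sequence: R_t = sum of rewards from t onward."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return rtg[::-1]

def to_token_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) triples into the conditional
    sequence that a causal transformer is trained to model autoregressively."""
    rtg = returns_to_go(rewards)
    tokens = []
    for R, s, a in zip(rtg, states, actions):
        tokens.extend([("rtg", R), ("state", s), ("action", a)])
    return tokens
```

Conditioning on the return-to-go is what turns pure sequence prediction into goal-directed behaviour: at test time, prompting the model with a high target return asks it to generate actions consistent with that outcome.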


In this paper, the researchers propose the multi-agent decision transformer (MADT), an architecture that casts MARL as conditional sequence modelling. Their goal is to determine whether MADT can learn a generalized policy through pre-training on offline datasets, and whether that policy can then be transferred effectively to downstream environments, known or unknown. As a case study, they focus on a well-known MARL benchmark, the StarCraft multi-agent challenge (SMAC), and demonstrate that multiple SMAC tasks can be solved with one big sequence model. Their contributions are as follows. They propose a series of transformer variants for offline MARL that leverage the sequential modelling power of the attention mechanism; in particular, they validate the sample efficiency and transferability of the pre-trained sequential model in this challenging multi-agent environment. They build a dataset spanning different skill levels and different variations of SMAC scenarios. Experimental results on SMAC tasks show that MADT enjoys fast adaptation and superior performance by learning one big sequence model. The main challenges in offline pre-training followed by online fine-tuning are the out-of-distribution problem and the training-paradigm mismatch; the researchers tackle both with the sequential model and by pre-training the global critic model offline.
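In the offline phase, conditional sequence modelling reduces to a supervised objective: a single shared policy is trained to reproduce the dataset actions of every agent via cross-entropy, with unavailable actions masked out. The sketch below illustrates that objective under stated assumptions (a generic callable policy and a per-agent batch layout); it is not the paper's code.

```python
import math

def masked_log_softmax(logits, avail):
    """Log-softmax restricted to available actions; unavailable ones get -inf."""
    masked = [l if ok else float("-inf") for l, ok in zip(logits, avail)]
    m = max(masked)
    z = m + math.log(sum(math.exp(x - m) for x in masked if x != float("-inf")))
    return [x - z for x in masked]

def bc_loss(policy, batch):
    """Behaviour-cloning objective: average cross-entropy between the shared
    policy's action distribution and the dataset actions. The same `policy`
    is applied to every agent's (observation, action) pair, i.e. parameter
    sharing across agents."""
    total, count = 0.0, 0
    for obs, action, avail in batch:   # one entry per agent per timestep
        logp = masked_log_softmax(policy(obs), avail)
        total -= logp[action]
        count += 1
    return total / count
```

For example, a policy that is uniform over two available actions incurs a loss of ln 2 per step, the usual cross-entropy of an uninformed binary choice.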


Regarding methodology, the researchers show how the transformer is applied within their offline pre-training MARL framework. First, they introduce the typical paradigm and computation process of multi-agent reinforcement learning and of attention-based models. They then present an offline MARL method in which the transformer sequentially maps each agent's local observations to actions on the offline dataset via parameter sharing, feeding the hidden representation into the MADT to minimize a cross-entropy loss. Next, they describe how online MARL is integrated with MADT to form the full framework for training a universal MARL policy. To accelerate online learning, the pre-trained model is loaded as a component of the MARL algorithm, and the policy is learned from the experience stored in the latest replay buffer collected from the online environment. To train a universal policy that adapts quickly to other tasks, they bridge the gaps between scenarios in the observation spaces, action spaces, and available-action masks, respectively.
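One common way to bridge scenarios with different observation and action dimensions, consistent with the alignment step described above, is to pad everything to a fixed scenario-agnostic width and let the availability mask forbid actions that do not exist in a given map. The helper names and the zero-padding scheme below are illustrative assumptions, not necessarily the paper's exact mechanism.

```python
def pad_obs(obs, target_dim):
    """Zero-pad a local observation vector to a fixed width shared by all
    scenarios, so one policy network can consume any map's observations."""
    assert len(obs) <= target_dim, "observation exceeds the unified width"
    return obs + [0.0] * (target_dim - len(obs))

def pad_avail(avail, target_actions):
    """Extend the available-action mask to the unified action count; actions
    absent from this scenario are marked unavailable and can never be chosen."""
    assert len(avail) <= target_actions, "mask exceeds the unified action count"
    return avail + [False] * (target_actions - len(avail))
```

With this alignment in place, trajectories from maps with different unit counts and action sets can be mixed in one training buffer, which is what makes a single universal policy feasible.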


In the experiments, the researchers consider three settings: offline MARL, online MARL initialized with the pre-trained model, and few-shot or zero-shot offline learning. In the offline setting, they verify their method by pre-training the policy and testing it directly on the corresponding maps. To demonstrate the capacity of the pre-trained policy on the original or new scenarios, they fine-tune it in the online environment. Experimental results show that MADT-offline outperforms state-of-the-art methods in offline MARL, while MADT-online improves sample efficiency across multiple scenarios. Moreover, the universal MADT trained on multi-task data with MADT-online generalizes well to each scenario in a few-shot or even zero-shot setting.


To the researchers' knowledge, this is the first work to demonstrate the effectiveness of offline pre-training and of sequence modelling with transformer architectures in the context of MARL. This work is supported by the Strategic Priority Research Program of the Chinese Academy of Sciences and the National Natural Science Foundation of China.


See the article:

Offline Pre-trained Multi-agent Decision Transformer

Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.