by Haytham ElFadeel - [email protected], Stan Peshterliev
2020
Research done while @ Meta Inc.
The goal of this project was to push the state-of-the-art performance of transformer models, focusing on Question Answering and NLI, while simultaneously addressing some of the known issues of transformer models: robustness (small paraphrases lead to a loss of accuracy), generalization (a model trained on one dataset performs poorly on another dataset for the same task), and answerability (models tend to answer even when the context contains no answer). We aim to do this without any increase in inference computation.
This work does not target new pre-training objectives or methods, such as RoBERTa and ELECTRA; instead, it builds on top of ELECTRA and explores other directions: multi-task pretraining and knowledge distillation.
We selected ELECTRA (Clark et al., 2019) as our starting (‘baseline’) model since it represents the current state of the art. That said, our implementation differs in one aspect: ELECTRA’s question-answering module predicts the answer start and end positions jointly and has an ‘answerability’ classifier, whereas we use a simplified question-answering module that predicts the answer start and end positions independently. We train the model to predict position ‘0’ (the CLS token) when the question is not answerable, similar to the original BERT model. Our experiments show that the impact of this simplification on performance is marginal to non-existent, while it reduces the number of parameters by more than 4 million.
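To make the simplified head concrete, here is a minimal sketch under our assumptions (class and variable names are illustrative, not the project’s actual code): a single linear layer produces independent start and end logits per token, and unanswerable questions are trained to point at position 0 (the CLS token).

```python
import torch.nn as nn

class SimpleQAHead(nn.Module):
    """Illustrative simplified QA head: independent start/end prediction."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # One projection producing two logits per token: start and end.
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, encoder_states, start_positions=None, end_positions=None):
        # encoder_states: [batch, seq_len, hidden_size] from the shared encoder.
        logits = self.qa_outputs(encoder_states)          # [batch, seq_len, 2]
        start_logits, end_logits = logits.unbind(dim=-1)  # two [batch, seq_len] tensors

        if start_positions is None:
            return start_logits, end_logits

        # For unanswerable questions the gold start and end positions are 0 (CLS).
        loss_fct = nn.CrossEntropyLoss()
        loss = (loss_fct(start_logits, start_positions) +
                loss_fct(end_logits, end_positions)) / 2
        return loss, start_logits, end_logits
```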

Transformer models are trained in a self-supervised manner to predict a masked token in context, to predict whether one sentence follows another, or, in the case of ELECTRA, to distinguish “real” input tokens from “fake” input tokens generated by another neural network.
Those objectives allow the models to learn about language structure and semantics, but they lack aspects of human language learning: humans learn to use language by listening and performing multiple tasks (e.g., expressing emotion, making statements, asking and answering questions), and they transfer knowledge learned from one task to another to help learn new tasks. Also, language does not exist on its own: it exists in a physical world that contains physical objects, interactions, general and common-sense knowledge, and a variety of sensory input (e.g., vision, sound) - but that is a topic for another day.
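For reference, a minimal sketch of ELECTRA-style replaced-token detection as described above (this is a simplification under our assumptions, not the actual ELECTRA code; `discriminator` is a hypothetical callable returning one logit per token):

```python
import torch.nn as nn

def rtd_loss(discriminator, input_ids, replaced_ids, attention_mask):
    # replaced_ids: input_ids with some tokens swapped for generator samples.
    # Label is 1 where a token was replaced, 0 where it is the original.
    labels = (replaced_ids != input_ids).float()                 # [batch, seq_len]
    token_logits = discriminator(replaced_ids, attention_mask)   # [batch, seq_len]
    per_token = nn.BCEWithLogitsLoss(reduction="none")(token_logits, labels)
    # Average only over non-padding positions.
    mask = attention_mask.float()
    return (per_token * mask).sum() / mask.sum()
```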
Multi-task pretraining is not a new idea; it has been proposed several times (Caruana, 1997; Zhang and Yang, 2017; Liu et al., 2019). The goal of our MT-Pretraining is to teach the model a diverse set of realistic tasks to help it understand language better and generalize better.
Our model architecture is identical to ELECTRA. The encoding layers are shared across all tasks. Each task (a task could span multiple datasets) has its own output heads.
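A minimal sketch of this setup, assuming an ELECTRA-style encoder and illustrative task names (the head definitions here are ours, not the project’s exact configuration):

```python
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Illustrative multi-task model: shared encoder, per-task output heads."""

    def __init__(self, encoder, hidden_size: int, num_nli_labels: int = 3):
        super().__init__()
        # Encoding layers shared across all tasks (and all datasets of a task).
        self.encoder = encoder
        # One output head per task.
        self.heads = nn.ModuleDict({
            "qa": nn.Linear(hidden_size, 2),                # start/end logits per token
            "nli": nn.Linear(hidden_size, num_nli_labels),  # sentence-level classification
        })

    def forward(self, task, input_ids, attention_mask):
        states = self.encoder(input_ids, attention_mask)    # [batch, seq_len, hidden]
        if task == "qa":
            return self.heads["qa"](states)                 # [batch, seq_len, 2]
        # Sentence-level tasks classify from the CLS token representation.
        return self.heads[task](states[:, 0])               # [batch, num_labels]
```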