Winograd Schema Challenge

Introduction

In preparation to attend the retirement of my beloved Linguistics mentor and professor along with her eminant students, many luminaries in the academic world, the neurotic me took this as an excuse to dive deep into the world of NLP and learn what’s happening in this exciting field. Today, I’ll share an update of my learnings with you, particularly about Winograd Schema Challenge named after another mentor of mine.

Reverse Timeline through NLP

Let’s begin in 2017, specifically Transformers, made a significant impact. Before that, in the 2010s, probabilistic models like neural networks, RNNs, GRUs, and LSTMs took the center stage. Moving further back, in the 1990s, statistical models utilizing n-gram co-occurrence were prevalent. And in the pre-90s era, rule-based systems ruled the domain.

“You shall know a word by the company it keeps” - J.R. Firth, Linguist

This quote captures the essence of why statistical and probabilistic models work well when building language models. These models consider the nearby neighbors, syntax, rules of parts of speech, and semantic relationships within the context. However, it’s essential to clarify that language models are not designed for true understanding. Instead, they focus on predicting the probability of the next word in a given sequence.

Upon the introduction of Transformers, models are grouped into 3 classes:

Encoder-Decoder Models: Within the realm of NLP, there are different architectures for language models. The encoder-decoder models, also known as many-to-many (Seq2Seq), excel in tasks such as machine translation, text summarization, and question answering. Examples of these models include T5 and BART.
Encoder-Only Models: On the other hand, encoder-only models employ many-to-one architecture. They’re great for sentiment analysis and classification tasks. Example of encoder-only models include BERT, RoBERTa, and their numerous variants. These models also employ masked (cloze) language modeling techniques.
Decoder-Only Models: Finally, decoder-only models utilize a one-to-many architecture, primarily focusing on text generation. GPT, GPT-2, and GPT-3 are renowned examples of these models. They employ causal (autoregressive) language modeling.

Winograd Schema Challenge

The recent exponential growth of large language models has been beenbreathtalking. The Turing Test has been used to identify artificial intelligence, though its validity has faced criticism. The Loebner Prize, an annual competition held from 1990 to 2016, aimed to find the most human-like AI through text-based conversations but was criticized by NLP practitioners as a publicity stunt. In The Most Human Human, Brian Christian describes his participation in a Turing Test challenge, striving to “defend the pride of humanity.” He engaged in instant-message chats with strangers, trying to convince a judging panel of his human identity. Despite being human himself, Christian recognized the challenge of distinguishing between humans and AI, blurring the lines between the two. These experiences shed light on the ongoing fascination with creating AI that convincingly mimics human behavior, even as the Turing Test and its implementations remain subjects of debate and critique within the field.

Thus, Winograd Schema Challenge (WSC) was proposed an alternative to the Turing Test meant as an assessment of common sense reasoning and intelligence. Winograd schemas are pairs of sentences that differ in a few words and contain ambiguous references resolved differently in each sentence.

For example, consider the original Winograd Schema: “The town council members refused to give the angry demonstrators a permit because they feared/advocated violence.” The question is, who feared/advocated violence? The answer could be either the town council members or the angry demonstrators, depending on the word choice.

The Winograd Schema Challenge consists of 273 pairs of sentences, each differing by just 1-2 words. The challenge requires knowledge and common sense reasoning to accurately determine the correct interpretation of the ambiguous references in the sentences.

While the pairs of sentences are easily understandable for humans,
They cannot be easily solved using simple techniques.
Designed to be “Google-proof” with no obvious statistical patterns to exploit.

The success or failure of a program in the challenge is readily apparent to non-experts, making it a straightforward evaluation measure for assessing the capabilities of AI systems.

Datasets & Benchmarks

Benchmarks like GLUE and SuperGLUE have been introduced to provide more comprehensive evaluations. To further explore the realm of common sense reasoning, several related datasets have been developed. These include the Winograd Natural Language Inference Dataset (WNLI) and SuperGLUE’s WSC. Other datasets like Definite Pronoun Resolution (DPR), Pronoun Disambiguation Problem (PDP), WinoGender & WinoBias, WinoGrande, WinoFlexi, Winventor, WinoWhy & WinoLogic, and WinoMT & Wino-X expand the scope and depth of common sense evaluations. Moreover, these datasets have been extended to different languages such as French, Portuguese, Japanese, Chinese, Hindi, Slovene, and Hebrew. However, some common pitfalls include examples that can be too easy or too ambiguous, making it difficult to gauge the true capabilities of a model. Additionally, leakage from large training data can inadvertently influence a model’s performance, impacting the integrity of the evaluation.

Defeat of Winograd Schema Challenge

By 2019, a number of pre-trained transformer-based language models achieved better than 90% accuracy on WSC. The Defeat of Winograd Schema Challenge reviews the history and assesses its significance. Some criticisms include:

Small initial dataset, which limits its usefulness for training and fine-tuning
The evaluation criteria have been considered lenient, potentially allowing for less robust solutions.
The manual creation of high-quality examples can be difficult and time-consuming
Artifacts present in the datasets lead to potential leakage from large training datasets

Some would argue that pronoun disambiguation alone does not provide a reliable measure of a model’s overall grasp of commonsense reasoning. Nevertheless, Winograd Schemas have played a fascinating role in development and evaluation AI model’s language understanding and intelligence.

Resources

Cover Photo by Hans-Peter Gauster on Unsplash

← Previous Post Next Post →