Paper Reading: Sharp Nearby, Fuzzy Faraway
Paper: Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context
Overall introduction
This paper has an attractive name. The title itself gives away the research target: investigating the role that context plays in an LSTM language model. More specifically, the authors take an ablation-style approach, i.e., removing or altering a component of the system and observing the effect, and study the increase in perplexity when the context is manipulated in various ways.
Motivations
There are two general motivations for this research. First, although neural language models have consistently outperformed n-gram language models, it remains unclear how NLMs actually use the context. Second, much of the prior work on LSTMs, such as capturing syntactic structure and modeling semantic compositionality, stays at the sentence level; those settings reveal little about the part the broader context plays. The second motivation is therefore to complement prior work with a richer understanding of the role of context.
Research questions
Stemming from the two motivations above, the authors pose three research questions. In terms of token count, how much context do neural language models actually use? Within this effective range, are nearby (local) and faraway (global) contexts represented differently? How does a neural caching mechanism help NLMs take advantage of the context?
To approach these three questions, the researchers propose three methodologies. For the first question, they vary the context length to determine how many tokens actually affect model performance. For the second, they perturb the local and global contexts separately and study their impact on the LSTM model. For the third, they test the LSTM's copying ability by dropping and replacing target words that appear in nearby or long-range context, both with and without an external copy mechanism.
Experiment settings, mismatch and metrics
Before diving into the findings, it helps to understand the experimental setup, including the model, the evaluation metric, and the train/test mismatch. Since this research aims to discover the role of context rather than beat the SOTA, the authors choose a standard LSTM language model. Because they study the role of context by ablation, they manipulate the context only at test time, never during training, thus creating a train/test mismatch. This is a deliberate choice: if the model had already seen reordered or masked contexts during training, the perplexity gap at test time would not be convincing evidence. Model performance is measured by perplexity (PP), equivalently expressed through the negative log-likelihood (NLL) loss:
$$NLL = -\frac{1}{T}\sum_{t=1}^{T}\log P(w_t \mid w_{t-1},\cdots,w_1)$$
$$PP = \exp(NLL)$$
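To make the metric concrete, here is a minimal sketch of how perplexity follows from per-token log-probabilities. The list of log-probabilities is a stand-in for whatever the evaluated model produces; it is not the paper's code.

```python
import math

def perplexity(token_log_probs):
    """Compute perplexity from per-token log-probabilities log P(w_t | w_<t).

    `token_log_probs` is a hypothetical list of natural-log probabilities,
    one per token in the evaluation corpus.
    """
    T = len(token_log_probs)
    nll = -sum(token_log_probs) / T   # average negative log-likelihood
    return math.exp(nll)              # PP = exp(NLL)

# Toy usage: three tokens assigned probabilities 0.5, 0.25, 0.125
print(perplexity([math.log(0.5), math.log(0.25), math.log(0.125)]))  # 4.0
```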
Dataset information
The researchers use two commonly used language modeling datasets, the Penn Treebank (PTB) and Wikitext-2 (Wiki). PTB has roughly 0.9M training tokens and a 10K vocabulary; Wiki has roughly 2.1M training tokens and a 33K vocabulary.
Findings & Conclusions
To explore how the number of context tokens affects model performance (RQ1), the authors run four sets of experiments. In the first, they directly test how the number of context tokens affects perplexity and find an effective context range of roughly 150-200 tokens on both datasets, meaning that context beyond this range can be considered non-effective for the model. In the second, they vary the LSTM's hyperparameters to check whether the 150-200 token range shifts; different hyperparameter settings do not change the effective context size. In the third, they split tokens into frequent and infrequent words and find that infrequent words need more context than frequent ones. In the fourth, they split tokens into content and function words. Content words are typically nouns, verbs, adjectives and adverbs, while function words carry mostly grammatical rather than lexical meaning, such as determiners and prepositions. They find that content words need more context than function words.
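As a rough illustration of the first experiment, the sketch below truncates the test-time context to the last n tokens and measures the average NLL. The `model.log_prob(context, target)` interface is an assumption of mine, not the authors' implementation.

```python
def truncation_ablation(model, tokens, context_sizes=(5, 20, 50, 200, 1000)):
    """Estimate how much loss increases when the test-time context is truncated.

    `model.log_prob(context, target)` is an assumed interface returning
    log P(target | context). The model is trained on full contexts, so
    truncation happens only at evaluation time (the train/test mismatch).
    """
    results = {}
    for n in context_sizes:
        nll, count = 0.0, 0
        for t in range(1, len(tokens)):
            context = tokens[max(0, t - n):t]   # keep only the last n tokens
            nll -= model.log_prob(context, tokens[t])
            count += 1
        results[n] = nll / count                # average NLL at this context size
    return results  # per the paper, NLL should flatten once n exceeds ~150-200
```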
The conclusions of the last two settings may not be intuitive, so I will try to explain my reading. Infrequent words and content words carry more specific semantic information and are harder to predict from the immediately preceding words alone, so the model benefits from a longer context when predicting them. Frequent words and function words, by contrast, are largely predictable from the nearby context and are therefore less affected by removing the more distant context.
As for the second research question, the researchers design two experimental setups to distinguish the effect of the local context from that of the global context. The first setup manipulates the context locally, shuffling word order within a window of 20 words, and finds that such local perturbations have a negligible effect once they occur more than 20 tokens before the target. The second setup is similar, but instead reverses the entire global context; this order information stops mattering beyond a distance of roughly 50 tokens.
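To make the perturbation setups concrete, here is a hedged sketch of scrambling either the nearby or the faraway portion of a test-time context. The `boundary` split and the mode names are my own assumptions, not the paper's code.

```python
import random

def perturb_context(context, boundary=20, mode="shuffle_far"):
    """Perturb the word order of a test-time context (a list of tokens).

    `boundary` splits the context into a nearby window (the last `boundary`
    tokens) and the remaining faraway portion; which part is perturbed
    mirrors the paper's local/global setups but is assumed here.
    """
    near, far = context[-boundary:], context[:-boundary]
    if mode == "shuffle_near":      # local perturbation: scramble recent words
        near = random.sample(near, len(near))
    elif mode == "shuffle_far":     # global perturbation: scramble distant words
        far = random.sample(far, len(far))
    elif mode == "reverse_far":     # reverse the entire faraway context
        far = far[::-1]
    return far + near
```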
For the third research question, the authors look into the impact of the cache mechanism. They find that LSTMs on their own can recall words seen in the nearby context, and that the cache mechanism additionally enables them to recall words from the long-range context.
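The external copy mechanism here is a neural cache in the spirit of Grave et al. (2017). Below is a minimal NumPy sketch of that idea; `theta` and `lam` are assumed hyperparameter values, not the paper's settings.

```python
import numpy as np

def cache_distribution(h_t, past_hiddens, past_words, vocab_size, theta=0.3):
    """A minimal sketch of a continuous-cache pointer.

    Scores each past position by the similarity of its hidden state to the
    current one, then places probability mass on the word emitted there.
    `theta` flattens or sharpens the distribution (value assumed).
    """
    scores = theta * np.array([np.dot(h_t, h_i) for h_i in past_hiddens])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    p_cache = np.zeros(vocab_size)
    for w, prob in zip(past_words, weights):
        p_cache[w] += prob                 # copy mass onto previously seen words
    return p_cache

def interpolate(p_model, p_cache, lam=0.1):
    """Mix the LSTM's next-word distribution with the cache distribution."""
    return (1 - lam) * p_model + lam * p_cache
```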
Future work
The conclusions above can shed light on future work. For example, understanding the different impact of content and function words could help researchers refine word-dropout strategies. Moreover, since this paper explores the role of context at the token level, future work could look into sentence-level contexts, which would significantly expand the context range.