Pretrained large language models have revolutionized Natural Language Processing in the past few years. Since the release of BERT in 2018, their size has increased dramatically, both in terms of the volume of training data and the number of model parameters. While BERT had a relatively modest 340 million parameters, newer models such as PaLM have a parameter count in the hundreds of billions (that’s billions with a B). In the September meeting of our AI Reading Group we discussed the paper ‘On the dangers of Stochastic Parrots: Can Language Models be too big?’ by Emily Bender and colleagues from 2021, which addresses what the authors see as the risks of these big language models. Bender et al. mention four areas where they see large language models doing potential harm and ask whether ever larger models are really inevitable or necessary.


Environmental & Financial Costs

The first category of risks discussed in the paper is the environmental and financial cost of large language models. The CO2 generated by training these models contributes to climate change. While an average human is responsible for about 5t of CO2 per year, training a big transformer model emits roughly 280t of CO2, which is about what 56 people emit in a year. The effects of climate change disproportionately impact marginalized communities, who pay the price for training these large (usually) English language models. In contrast, they themselves rarely benefit from language technology in their own languages. The huge amount of compute needed to train a large language model is also extremely expensive, which leads to inequitable access to resources and limits who can afford to develop these models. Both environmental and financial costs are to some extent issues of computational efficiency. The authors suggest a shift away from the current focus on small improvements on benchmarks and towards an approach that highlights energy-performance trade-offs.


Problematic Training Data

Another issue with large language models is what the authors call ‘unfathomable training data’. These models are trained on vast datasets of text of questionable quality scraped from the internet. A large dataset by no means guarantees diversity; in fact, internet data overrepresents younger users and users from developed countries. Moreover, data from marginalized communities is underrepresented, both because these groups are less likely to use the scraped sites and because the filters applied to the scraped text, which often screen for certain ‘bad words’, tend to remove data from online spaces for marginalized communities.
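
To make the filtering problem concrete, here is a minimal sketch in Python of a ‘bad word’ filter. It is an illustration only, not the filter used for any real dataset: the blocklist entry `badword`, the function names and the three example documents are all made up for this post.

```python
import re

# Hypothetical blocklist: "badword" is a placeholder for a real slur or swear word.
BLOCKLIST = {"badword"}


def passes_filter(document, blocklist=BLOCKLIST):
    """Return True if none of the document's words is on the blocklist."""
    tokens = re.findall(r"[a-z']+", document.lower())
    return not any(token in blocklist for token in tokens)


def filter_corpus(documents):
    """Keep only the documents that pass the blocklist filter."""
    return [doc for doc in documents if passes_filter(doc)]


# Toy corpus (invented examples).
corpus = [
    "A recipe for lentil soup.",
    "You are a badword.",  # abusive use of the word
    "Our community has reclaimed the term badword with pride.",  # in-group use
]

kept = filter_corpus(corpus)
for doc in kept:
    print(doc)
```

Both the abusive sentence and the in-group one are dropped, since a word list cannot see context. Only the soup recipe survives.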

The data a model is trained on is static: it is scraped from the web at some point in time and therefore encodes the societal views of that time. However, societal views and values change. Since frequent retraining of language models is just not feasible, this leads to “value-lock”, meaning that the views encoded in the model through its training data do not change in tandem with those of society.

It is also well known that language models exhibit bias. This is a result of the bias encoded in the training data: for example, certain identity markers for marginalized groups (such as ‘black woman’) might occur more frequently in negative contexts in the training data, creating a bias that the model then reproduces and that affects the output quality.

The authors suggest an approach to deal with ‘unfathomable training data’ that emphasizes curation, documentation, and accountability. They argue that instead of using random data scraped from the web, the training data for language models should be carefully curated to avoid issues with diversity and bias. The data sets should also be carefully documented to allow for accountability. Currently, training data is undocumented, so we do not really know in detail what these models are trained on.


Research Trajectories

A lot of researcher time and effort is being spent on the development of ever larger language models with a focus on improving performance on benchmarks. But the question remains whether we are actually learning anything about machine understanding this way. While languages are systems that pair form and meaning, the training data for language models is only form (text), with no connection to meaning. If we move away from the mindset that bigger is necessarily better when it comes to language models, researchers might be able to use their resources to make progress on actual natural language understanding.


Real World Risks of Harm

The authors emphasize that language models aren’t really models of language in the way humans understand it. Instead, they are sequence prediction models that combine linguistic forms from their training data without meaning. They are ‘stochastic parrots’, producing form without meaning. However, when the output seems fluent and coherent, we as humans mistake this output for meaningful text and assign communicative intent. Since it looks so much like human language, model output might be perceived as authoritative even though models frequently produce text that is simply not true.

The synthetic texts generated by language models reproduce and amplify the biases in the training data. Being exposed to these texts can be harmful to readers whose identities are targeted by these biases. The synthetic texts can also reinforce biases in humans exposed to them and in future models that include them as training data.



This paper was presented and discussed in the September meeting of our AI Reading Group. Following a brief presentation of the paper, there was an interesting discussion about the concerns raised by the authors. Among other things, we discussed how the risks of language models are often framed in a way that makes it sound like the proposed ethical framework should be applied to models. However, since language models are fundamentally non-agentive, it is really a question of researcher and user ethics. We also discussed how a tool that generates language is different from other tools since we as humans tend to imagine a mind behind the text output. The recent story of a Google engineer who became convinced that the company’s new chatbot LaMDA (‘Language Model for Dialogue Applications’) is sentient illustrates this point quite powerfully.

If you would like to learn more about AI basics, trends, ethics and everything in between, consider joining the AI Reading Group on the BCP. We meet once a month for one hour. A volunteer spends 10-20 minutes presenting a paper that we all voted on beforehand, and then we discuss the shortcomings and potential of the paper, related work and anything AI. You can join the AI Reading Group on BCP here.


The image for this post was generated by the text2image model Midjourney from the prompt ‘stochastic parrots’. You can read more about text2image models in this post.