How Can We Measure (Artificial) Intelligence?
In the October meeting of our AI Reading Group we discussed the paper "On the Measure of Intelligence" by François Chollet (2019), in which the author describes today's deep-learning systems as still very limited, as they are essentially skill programs trained on large amounts of data. Even when these systems are capable of solving entire batteries of tasks, they cannot truly be called intelligent because their designers know what tasks they will face, and current benchmarks are designed to measure only the related skills. To measure intelligence, he argues, we need other benchmarks, ones that focus on generalization and skill-acquisition efficiency. In his paper, Chollet proposes a new formal definition of intelligence based on algorithmic information theory, which describes intelligence as efficiency in acquiring skills. It considers scope, generalization difficulty, prior knowledge, and experience as critical factors in characterizing intelligent systems. The Abstraction and Reasoning Corpus (ARC) is a set of tasks developed on the basis of this formal definition to provide a more comprehensive testing environment for evaluating AI systems.
The paper consists of just under 60 pages of mainly written text. To make the content accessible to a broader audience, I have tried in this blog article to set out what I consider to be the most important points for a general understanding.
Context and history
Need for an actionable definition and measure of intelligence
The promise of the field of AI is and has always been to develop machines that possess intelligence comparable to that of humans. However, current systems are still very limited in their abilities, and the question of what we even mean when we talk about intelligence still doesn't have a satisfying answer. To make progress towards the promise, we need a reliable way to measure that progress, which requires precise, quantitative definitions and measures of intelligence. Common-sense dictionary definitions of intelligence are not helpful for this purpose, as they are not actionable, explanatory or measurable. The same goes for methods like the Turing Test and its variants, which outsource the task to human judges who themselves have no clear definitions or evaluation protocols.
Two divergent views of intelligence
Regarding the term intelligence, there is still no scientific consensus among researchers on a single definition. However, a common characterization of intelligence includes two aspects: task-specific abilities ("achieving goals") and generality and adaptation ("in a wide range of environments"). These two characterizations also relate closely to two opposing views of intelligence. One view sees the mind as a relatively static assembly of special-purpose mechanisms developed by evolution, only capable of learning what it is programmed to acquire. In another view, the mind is a general purpose "blank slate" capable of turning arbitrary experience into knowledge and skills that could be directed to any problem.
The evolutionary psychology view of human nature posits that much of the cognitive function is due to special-purpose adaptations. In other words, the human brain has evolved to be good at certain tasks because those skills were necessary for survival. This view gave rise to definitions of intelligence and evaluation protocols focused on task-specific performance. The problem with this approach is that it lacks generality. AI systems that are narrowly focused on task performance can often outperform humans on those specific tasks. However, they lack the flexibility and adaptability of humans when it comes to general problem solving. As a result, the focus on task-specific performance has led to a striking lack of generality in AI.
In contrast, some researchers have taken the position that intelligence consists of the general ability to acquire new skills through learning, an ability that can be directed to a wide range of previously unknown problems, perhaps, even to any problem. This view of intelligence reflects another long-standing view of human nature that has strongly influenced the history of cognitive science and contrasts with the view of evolutionary psychology: a vision of the mind as a flexible, adaptive, highly general process that transforms experience into behavior, knowledge and skill.
AI evaluation: from measuring skills to measuring broad abilities
The success of artificial intelligence relies on systems that can perform well described tasks at a high level, as measured by a skill-based metric. This focus on task-specific performance often leads to tradeoffs in other areas, such as robustness and flexibility. Therefore, there is a need to go beyond skill-based evaluation to assess these other important attributes. The goal is to build systems with a higher grade of generalization, the ability to deal with situations that are different from those previously encountered. The spectrum of generalization reflects the organization of human cognitive abilities as described in the theories of cognitive psychology.
Psychometrics is a branch of psychology that deals with intelligence testing and the assessment of skills. Modern intelligence tests are designed to be reliable, valid, standardized, and free of bias. Remarkably, in parallel to psychometrics, there has been recent and increasing interest across the field of AI in using batteries of tasks to measure general abilities rather than specific skills. However, these benchmarks are still gameable because the test systems can practice specifically for the target tasks or use task-specific prior knowledge inherited from the system developers.
An alternative approach is to use the insights of psychometrics on skill assessment and test design to develop new types of benchmarks specifically designed to assess broad skills in AI systems. Interest in developing flexible systems and generality is growing, but the AI community has not paid much attention to psychometric assessment. There are several positive developments, including a growing awareness of the need for generalization in the evaluation of RL algorithms and interest in benchmarks for data efficiency and multitasking. However, there are also several negatives, including problems with the robustness of deep learning models, the lack of reproducibility of research results, and the little attention given to the study of capabilities beyond local generalization.
A new perspective
In 1997, IBM's Deep Blue beat Garry Kasparov at chess, leading researchers to realize that they had not learned much about human cognition by developing an artificial chess master. From the perspective of modern AI research, it is obvious that a static chess program based on minimax and tree search does not provide information about human intelligence. But what about a program that is not human-programmed but trained to perform a task based on data? A learning machine may well be intelligent: learning is necessary for adapting to new information and acquiring new skills. But programming through exposure to a large amount of data is no guarantee of generalization or intelligence. Hard-coding prior knowledge into a system is not the only way to artificially "buy" performance on a particular task without creating generalization capability. There is another way: add more training data.
It is well known that different individuals have different degrees of cognitive ability. These differences suggest that cognition is a multidimensional object, hierarchically structured with a single general factor - the g-factor. So the question arises: how general is human intelligence? Is the g-factor universal? Would it apply to every task in the universe?
This question is of great importance when it comes to artificial intelligence - if there is such a thing as universal intelligence and human intelligence is a realization of that intelligence, then reverse engineering the brain might be the shortest path to it. However, a closer look reveals that human abilities are not universal in an absolute sense but rather specialized for tasks that were prevalent during evolution. For example, humans are not designed for long term planning or large working memory beyond what was necessary for survival in the ancestral environment. In addition, there is a dimensional bias in which humans excel at 2D navigation tasks but struggle with 3D or higher tasks because they were not evolutionarily prepared for them. Thus, "general intelligence" should not be viewed as a binary trait that is either present or absent. Instead, it is on a spectrum determined by 1) the scope of application and 2) the efficiency with which new skills are learned.
Advances in developmental psychology have taught us that the human mind is not merely a collection of special purpose programs hard-coded by evolution. The large majority of the skills and knowledge we possess are acquired during our lifetimes rather than innate. Simultaneously, the mind is not a single, general purpose "blank slate" system capable of learning anything from experience. It is therefore proposed that an actionable test of human-like general intelligence should be founded on innate human knowledge priors. These priors should be made as close as possible to innate human knowledge priors as we understand them, and they should be explicitly and exhaustively described.
Defining intelligence: a formal synthesis
The intelligence of a system is defined as its skill acquisition efficiency over a scope of tasks with respect to priors, experience, and generalization difficulty. This definition encompasses meta-learning priors, memory, and fluid intelligence. The formalism presented in this paper is regarded as useful for research on broad cognitive abilities and can serve as a foundation for new general intelligence benchmarks.
A high-intelligence system is defined as one that is able to generate high-skill solution programs for tasks of high generalization difficulty using little experience and prior knowledge. The measure of intelligence here is tied to the choice of the domain of application (space of tasks and value function over tasks). Optionally, it may also be tied to a choice of sufficient skill levels across the tasks in the scope (sufficient case). Skill is not a property of an intelligent system but a property of the output artefact of the intelligence process (a skill program). High skill is not synonymous with high intelligence; they are entirely different concepts. Intelligence must involve learning and adaptation, i.e. operationalizing information gained from experience to deal with future uncertainties. Intelligence is not curve-fitting: a system that merely produces the simplest possible skill program consistent with known data points could, by this definition, perform well only on tasks that do not present generalization difficulties. An intelligent system must produce behavioral programs that account for future uncertainties.
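The intuition behind this definition can be illustrated with a deliberately simplified toy calculation in Python. This is not Chollet's formula, which is stated in terms of algorithmic information theory and cannot be computed this directly. The sketch only assumes that, for each task, we somehow had numbers for generalization difficulty, priors and experience, and it rewards difficulty handled per unit of information consumed. The class `TaskRecord`, the function `toy_efficiency` and all numbers are my own assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskRecord:
    """Hypothetical per-task numbers; the units are arbitrary."""
    generalization_difficulty: float  # how far test situations lie from training ones
    priors: float                     # prior knowledge built into the system
    experience: float                 # training experience consumed
    weight: float = 1.0               # relative importance of the task in the scope


def toy_efficiency(records: List[TaskRecord]) -> float:
    """Weighted average of difficulty handled per unit of priors plus experience."""
    total_weight = sum(r.weight for r in records)
    if total_weight <= 0:
        raise ValueError("need at least one task with positive weight")
    score = 0.0
    for r in records:
        cost = r.priors + r.experience
        if cost <= 0:
            raise ValueError("priors + experience must be positive")
        score += r.weight * r.generalization_difficulty / cost
    return score / total_weight


# Two made-up systems that reach the same skill on the same task,
# but one of them needs far more training experience to get there.
frugal = [TaskRecord(generalization_difficulty=0.8, priors=1.0, experience=3.0)]
data_hungry = [TaskRecord(generalization_difficulty=0.8, priors=1.0, experience=7.0)]

print(toy_efficiency(frugal) > toy_efficiency(data_hungry))
```

Both systems end up with the same skill, yet the toy score ranks the frugal one higher because it needed less experience. That is the shift in perspective the paper argues for: the final skill level alone says nothing about how intelligent the process that produced it was.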
Besides the information efficiency described above (prior efficiency and experience efficiency with respect to generalization difficulty) of intelligent systems, there are several other alternatives that could be incorporated into the definition. These are metrics like computational efficiency (skill programs that have minimal computational resource consumption and intelligent systems that use minimal computational resources to generate skill programs), time efficiency (minimize latency), energy efficiency (minimize the amount of energy expended) and risk efficiency (encourage safe curricula).
The described framework provides a formal way to reason about the intuitive concepts of "generalization difficulty", "intelligence as skill-acquisition efficiency", and what it means to control for priors and experience when evaluating intelligence. Its main value is that it offers a perspective shift in how we understand and evaluate flexible or general artificial intelligence, with several practical consequences for research directions and evaluation methods.
Consequences for research directions include a focus on developing broad or general purpose abilities rather than pursuing skill alone, an interest in program synthesis, and an interest in curriculum development. Consequences for evaluation methods include: taking into account generalization difficulty when developing a test set, using metrics that can discard solutions that rely on shortcuts, and rigorously characterizing any intelligent system by asking questions about its scope, potential, priors, skill-acquisition efficiency, etc.
Evaluating intelligence in this light
When comparing the intelligence of different systems, it is important to ensure that the comparison is fair. This means that the systems being compared must share the scope of tasks and have comparable levels of potential skill. The comparison should focus on the efficiency with which the system achieves the same level of skill as a human expert. In addition, it is recommended that only systems with similar prior knowledge be compared.
According to the conclusions of the paper with regard to the properties that a candidate benchmark of human-like general intelligence should possess, such an ideal intelligence benchmark should
- describe its scope of application and its own predictiveness with regard to this scope.
- be reliable.
- seek to measure broad abilities and developer-aware generalization.
- control for the amount of experience leveraged by test-taking systems during training.
- explicitly and exhaustively describe the set of priors it assumes.
- work for both humans and machines fairly by only assuming the same priors as possessed by humans.
A benchmark proposal: the ARC dataset
In the last part, Chollet introduces the Abstraction and Reasoning Corpus (ARC), a dataset intended to serve as a benchmark for the kind of general intelligence defined in the previous sections. He describes the new benchmark as follows:
"ARC can be seen as a general artificial intelligence benchmark, as a program synthesis benchmark, or as a psychometric intelligence test. It is targeted at both humans and artificially intelligent systems that aim at emulating a human-like form of general fluid intelligence."
ARC has the following top-level goals:
- Stay close to psychometric intelligence tests; in particular, the tasks should be solvable by humans without specific knowledge or training.
- Focus on developer-aware generalization rather than task-specific skill.
- Focus on measuring a qualitatively "broad" form of generalization by featuring highly abstract tasks that must be understood by a test-taker using very few examples.
- Quantitatively control for experience by providing only a fixed amount of training data for each task and by featuring only tasks that do not lend themselves well to artificially generated new data.
- Explicitly describe the complete set of priors it assumes. These should be close to innate human prior knowledge.
A test-taker is said to solve a task when, upon seeing the task for the first time, they are able to produce the correct output grid for all test inputs in the task (this includes picking the dimensions of the output grid). For each test input, the test-taker is allowed three trials (this holds for all test-takers, either humans or AI). Only exact solutions (all cells match the expected answer) can be called correct.
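To make this scoring rule concrete, here is a minimal Python sketch. It assumes the JSON layout used in the ARC GitHub repository, where each task has a "train" and a "test" list of input/output pairs and every grid is a list of rows of integers. The names `solves_task`, `propose_outputs` and `mirror_solver`, as well as the toy task itself, are my own and purely illustrative; they are not part of the paper or the repository.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]
Pair = Dict[str, Grid]            # {"input": grid, "output": grid}
Task = Dict[str, List[Pair]]      # {"train": [...], "test": [...]}

# Hypothetical solver interface: given the demonstration pairs and one test
# input, return candidate output grids in order of preference.
Solver = Callable[[List[Pair], Grid], List[Grid]]

MAX_TRIALS = 3  # three trials per test input, as described above


def solves_task(task: Task, propose_outputs: Solver) -> bool:
    """Return True only if every test input is answered exactly within MAX_TRIALS."""
    for pair in task["test"]:
        candidates = propose_outputs(task["train"], pair["input"])[:MAX_TRIALS]
        # Exact match: same grid dimensions and the same value in every cell.
        if not any(candidate == pair["output"] for candidate in candidates):
            return False
    return True


# Toy task (made up for illustration): mirror each row left to right.
toy_task: Task = {
    "train": [{"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]}],
    "test": [{"input": [[3, 0, 0]], "output": [[0, 0, 3]]}],
}


def mirror_solver(train: List[Pair], grid: Grid) -> List[Grid]:
    """Naive solver: first guess 'unchanged', second guess 'mirrored'."""
    return [grid, [list(reversed(row)) for row in grid]]


print(solves_task(toy_task, mirror_solver))
```

Here the naive solver gets its first guess wrong and its second guess right, so the task counts as solved. A solver that needed a fourth guess, or that got a single cell wrong, would fail the task: there is no partial credit.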
This paper was presented and discussed at the October meeting of our AI Reading Group. After a short presentation of the paper, there was an interesting discussion about what intelligence really means and about how, in retrospect, certain abilities that we associate with intelligent thinking and acting (e.g., being able to play chess very well) turned out not to be sufficient to call a system intelligent. Concerns were raised that a benchmark focusing more on adaptability and generalization might fare no differently: even then, it might be possible to reach the goal through shortcuts or sheer computational power. At the end of the day, we can only measure skill in solving a given task. Nevertheless, the group agreed that the work on the topic and the conception of the ARC Challenge are important steps in the right direction.
Paper, "On the Measure of Intelligence" by François Chollet: https://arxiv.org/abs/1911.01547
GitHub repository, The Abstraction and Reasoning Corpus (ARC): https://github.com/fchollet/ARC
Figures were taken from the paper; visuals were generated with Midjourney v4.
Full disclosure: I used AI writing tools to create extractive summaries of some parts of the paper. This was part of an experiment to evaluate the usefulness of these tools for these types of tasks. However, there was a lot of human intelligence involved to verify the results and prevent the publication of incorrect information. If you still notice something in this regard, please send me a short message.