Turing tests in Chinese rooms: What does it mean for AI to outperform humans

Share

TLDR;

  • Reports that AI beat humans on certain benchmarks or very specialised tasks don’t mean that AI is actually better at those tasks than any individual human.
  • They certainly don’t mean that AI is approaching the task with any of the same understanding of the world people do.
  • People actually perform 100% on the tasks when administered individually under ideal conditions (no distraction, typical cognitive development, enough time, etc.) They will start making errors only if we give them too many tasks in too short a time.
  • This means that just adding more of these results will NOT cumulatively approach general human cognition.
  • But it may mean that AI can replace people on certain tasks that were previously mistakenly thought to require general human intelligence.
  • All tests of artificial intelligence suffer from Goodart’s law.
  • A test more closely resembling an internship or an apprenticeship than a gameshow may be a more effective version of the Imitation Game.
  • Worries about ‘superintelligence’ are very likely to be irrelevant because they are based on an unproven notion of arbitrary scalability of intelligence and ignore limits on computability.

Reports of my intelligence have been greatly exaggerated

Over the last few years, there have been various pronouncements about AI being better than humans at various tasks such as image recognition, speech transcription, or even translation. And that’s not even taking into account bogus winners of the Turing test challenge. To make things worse, there’s always the implication that this is means machine learning is getting closer to human learning and artificial intelligence is only a step away from going general.

All of those reports were false. Every. Single. One. How do we know this? Well, because none of them were followed by “and therefore we have decided to replace all humans doing job X with machine learning algorithms”. But even if this were the case, it still would not necessarily mean that the algorithm outperformers humans at the task. Just that it can outperform them at the task when it is repeated time after time and the algorithm ends up making fewer mistakes because, unlike people, it does not get tired, distracted, or simply ticks the wrong box.

But even if the aggregate number of errors is lower for a machine learning algorithm, it may still not make sense to use it because it makes qualitatively different errors. Errors that are more random and unpredictable are worse than more systematic errors that can be corrected for. Also, because AI has no metacognitive mechanisms to identify its errors by doing a ‘sense check’. This often makes correcting AI-generated transcripts difficult to correct because it makes errors that don’t make intuitive sense.

Pattern matching in radiology and law

The closest machine learning has gotten to outperforming humans doing real jobs is in radiology. (I’m discounting games like Go, here.) But even here it only equalled the performance of the best experts. However, this could easily be enough. But interpreting X-Rays is an extremely specialised task that requires lots of training and has a built-in error rate. It is a pattern recognition exercise, not a general reasoning exercise. All the general reasoning about the results of the X Rays still has to be delegated to the human physician.

In a similar instance, AI could notice inconsistencies in complex contracts better than lawyers. Again, this is very plausible, but again this was a pattern-matching exercise with a machine pitted against human distractability and stamina. Definitely impressive, useful, and not something expected even a few years ago. But not in any meaningful ways replacing the lawyer any more than a form to draw up a contract I downloaded from the internet does.

This is definitely a case where an AI can significantly augment what an unassisted human can do. And while it will not replace radiologists or lawyers as a category, it could certainly greatly decrease their numbers.

Machine learning to the test

So on very specialised tasks involving complex pattern recognition, we could say that AI can genuinely outperform humans.

But in all the instances involving language and reasoning tasks, even if an AI beats humans on a test, it does not actually ‘outperform’ them on the task. That’s because tests are always imperfect proxies for the competence they measure.

For example, native speakers often don’t get 100% on English proficiency tests and can even do worse than non-native speakers in certain contexts. Why? Three reasons: 1. They can imagine contexts not expected of non-native speakers. 2. The non-native speakers have been practicing taking these tests a lot so they make fewer formal mistakes.

We are facing exactly the same problems when comparing machine learning and human performance based on tests designed to evaluate machine learning. Humans are the native speakers and they perform 100% on all the tasks in their daily lives. But their performance seems less than perfect in test conditions.

BLEU and overblown claims about Machine Translation

Sometimes the problem is with a poorly designed test. This is the case with the common measure of machine translation called BLEU (Bi-Lingual Evaluation Understudy). BLEU essentially measures how many similar words or word pairs there are in the translation by machine when compared to a reference corpus of human translations. It is obvious that this is not a good metric of quality of translation. It can easily assign a lower score to a good translation and a high score to a patently bad one. For instance, it would not notice that the translation missed a ‘not’ and gave the opposite meaning.

What human translators do is translate whole texts NOT sentences. This sometimes means they drop things, add things, rearrange things. This involves a lot of judgment and therefore no two translations are ever the same. And outside trivial cases they’re never perfect. But a reliable translator can make sure they convey the key message and they could provide footnotes to explain where this was not possible. Machine learning can get surprisingly good at translating texts by brute force. But it is NOT reliable because it operates with no underlying understanding of the overall meaning of the text.

That’s why we can easily dismiss Microsoft’s claim that their English-to-Chinese interpreter outperformed human translators. That is only because they used the BLEU metric to make this claim rather than professional translators evaluating the quality of AI output against that of other professional translators on any test. And since Microsoft has yet to announce that it is no longer using human interpreters when its executives visit China, we can safely assume that this ‘outperform’ is not real.

Now, could a machine translation ever get good enough to replace human translators? Possibly. But it is still very far from that for texts of any complexity. Transformers are very promising at improving the quality of the translation but they still only match patterns. To translate you need to make quite rich inferences and we’re nowhere near this.

GLUE and machine understanding come unstuck

Speaking of inferences. How good is AI at making those? Awful. Here we have another metric to look at: GLUE! Unlike BLEU which is a really bad representation of the quality of translation, GLUE (General Language Understanding Evaluation) is a really good representation of human intelligence. If you wanted to know what are the components of human intelligence, you could do a lot worse than look at the GLUE test.

But the GLUE leaderboard has a human benchmark and it comes 4th with 87.1% score. This puts it 1.4% behind the leader which is Facebook at 88.5%. So, it’s done. AI has not only reached human level of reasoning, it has surpassed them! Of course, not. Apart from the fact that we don’t know how much of a difference in reasoning ability 1% is, this tells us nothing about human ability to reason when compared to that of a machine learning model. Here’s why.

How people and machines make errors

I would argue that a successful machine learning algorithm does not actually outperform humans on these tasks even if it got 100%. Because humans also get 100% but they also devised the test.

Isn’t this a contradiction? How can humans get 100% if they consistently score in the mid-80s when given the test. Well, humans designed the test and the correctness criteria. And a machine learning algorithm must match the best human on every single answer to equal them. The benchmark here is just an average of many people over many answers and does not just reflect the human ability to reason but also the human ability to take tests.

Let’s explain by comparing what it means when a human makes an error on a test and when a machine does. There are three sources of human error: 1. Erroneous choice when knowing the right answer (ie clicking a when meaning to click b), 2. Lack of attention (ie choosing a because we didn’t spend enough time reading the task to choose correctly), 3. Overinterpretation (providing context in our head that makes the incorrect answer make sense).

These benchmarks are not Mensa tests, they measure what all people with typical linguistic and cognitive development can do. Let’s take the Windograd Schema test as an example. Here’s an often-quoted example:

The trophy didn’t fit into the suitcase because itwas too big.
The trophy didn’t fit into the suitcase because itwas too small.

It is very possible that out of 100 people, 5 would get this wrong because they click the wrong answer, 10 because they didn’t process the sentence structure correctly and 1 because they constructed a scenario in their head in which it is normal for suitcases to be smaller than the thing in them (as in Terry Pratchett’s books).

But not a single one got it wrong because they thought that a thing can be bigger than the thing it fits in.

Now, when a machine learning model gets it wrong, it does it because it miscalculated a probability based on an opaque feature set it constructs from lots of examples. When you get 2 people together, they can always figure out the right answer and discuss why they did it wrong. No machine learning algorithm can do that.

This becomes even more obvious when we take an example from the actual GLUE benchmark:

Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight.

So what does the ‘they’ refer to here? The obvious candidate here is ‘trains’. But it is easy to imagine that a person could click the option where ‘puffs of black smoke’ or even ‘Maude and Dora’ are the antecedent. That’s because both of those can be ‘seen’ and could theoretically cause horses to run away. If this is the 10th sentence I’m parsing in a go, I may easily shortcut the rather complex syntactic processing. I can even see someone choosing “whistles” even though they cannot “come in sight” but are a very strong candidate for causing horses to run away. But nobody would choose ‘horses’ unless they misclicked. A machine learning algorithm very easily could do this simply because ‘they’ and ‘horses’ match grammatically.

But all of this is actually irrelevant, because of how the ML algorithms are tested. They are given multiple pairs or sentences and asked to say 1 or 0 on whether they match or not. So some candidate sentences above are “Horses ran away when the trains came in sight.”, “Horses ran away when Maude and Dora came in sight.” or “Horses ran away when the whistles came in sight.” What it does NOT do is ask “Which of the words in the sentence does ‘they’ refer to?” Because the ML model has no understanding of such questions. You would have to train it for that task separately or just write a sequential algorithm to process these questions.

What people running these contests also cannot do is ask the model to explain their choice in a way that would show some understanding. There is a lot of work being done on interpretability, but this just spits out a bunch of parameters that have to be interpreted by people. Game, set and match to humans.

Chinese room revisited

But let’s also think about what it means for a neural network model to get things right. This brings us back to Searl’s famous Chinese room argument. Every single choice a model makes has assigned a probability and even quite ridiculous choices have a non-zero chance of being right in the model. Let’s look at another common example:

The animal didn’t cross the road because it was too busy.

Here it is sensible to assign it to ‘road’ because it makes the most sense but one could imagine a context in which we could make it refer to ‘the animal’. Animals can be thought of as busy and we can imagine that this could be a reason for not crossing the road. But we know with 100% certainty that it does not refer to ‘the’ or even ‘cross’. Yet, a neural model has no such assurance. It may never choose ‘the’ in practice as the antecedent for ‘it’ but it will never completely discount it, either.

So, even if the model got everything right. We could hardly think of it as making human-like inferences unless it could label certain antecedents as having 0% probability and others (much rarer) as having 100%. (Note: Programming it to change 10% to 0% or 90% to 100% does not count.)

This feels like a very practical expression of Searl’s Chinese room argument albeit in a weak form. Neural networks pose a challenge to Searl because their algorithmic guts are not as exposed as those of the expert systems of Searl’s time. But we can still see echoes of their lack of actual human-like reasoning in their scores.

Is a test of artificial intelligence possible under Goodhart’s Law?

I once attended a conference on AI risk where a skeptic said he wasn’t going to worry “until an AI could do Winograd schemas”. This referred to a test of common sense and linguistic ambiguity that AIs have long been famously bad at. NowMicrosoft claims to have developed a new AI that is comparable to humans on this measure. (Scott Alexander)

This post was inspired by the above remark by Scott Alexander. I wanted to explain why even the Winograd challenge being conquered is not enough in and of itself.

AI proponents constantly complain of sceptics’ shifting standards. When AI achieves a benchmark, everybody scrambles to find something else that could be required of it before it gets a pass. And I admit that I may have made a claim similar to that of the AI researcher quoted by Scott Alexander when I was writing about the Winograd schemas.

But the problem here is not that machines became intelligent and everybody is scrambling to deny the reality. The problem is that they got better at passing the test in ways that nobody envisioned when the test was designed. All this while taking no steps towards actual intelligence. Although with a possible increase in practical utility.

This is the essence of Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.” The Winograd Schema Challenge seemed so perfect. Yet, I can imagine a machine learning getting good at passing the challenge but still not actually having any of the cognition necessary to really deal with the tasks in real life. In the same way that IBM Watson got really good at Jeopardy but failed at everything else.

None of this is to say that machine learning could not get good enough at performing many tasks that were previously thought to require generalised cognitive capacity. But when machines actually achieve human-level artificial intelligence, we will know. It will not be that hard to tell. But it will not likely happen just because we’re doing more of the same.

The problem with the Turing test or imitation game is not that it cannot produce reliable results on any one run of it. The problem is that if any single test becomes not only the measure but also a target, it is very much possible to focus on passing the test on the surface while bypassing the underlying abilities the test is meant to measure. But the problem is not just with the individual tests but rather in the illusion that we can design a test that will determine AGI level performance simply by reaching an arbitrary threshold.

The current Turing test winners won by misdirection that hid the fact that they refused to answer the questions. This could be fixed by requiring that Grice’s cooperative principle maxims are observed (especially quality and relevance) but even then, I could see a system trained to deal with a single time-bound conversation pass without any underlying intelligence.

As Scott Aaronson showed, it is possible to defeat a current level AI system simply by asking ‘What is bigger a shoebox or Mount Everest’. But once a pattern of questioning becomes known, it becomes a target and therefore a bad measure.

Similar things happen with all standardised aptitude tests designed so that they cannot be studied for. Job interview techniques designed to get interviewees to reveal their inner strengths and weaknesses. All of these immediately spawn industries of prep schools, instructional guides, etc. That makes them less useful over time (assuming they were all that useful to start with).

Towards a test by Critical Turing Internship

That’s why the Turing test cannot be a ‘test’ in the traditional sense. At the very least, it cannot be a single test.

History and a lot of human-computer interaction research has also shown that people are very bad at administering the Turing test (or playing the imitation game). But this is paradoxically because they’re very good the very thing the machines have been failing at: meaning making. Because we almost never encounter meaningless symbols but often encounter incomplete ones, we are conditioned to always infer some sort of meaning from any communication. And it is difficult if not impossible to turn it off.

Every time we see a bit of language we automatically imbue it with some meaning. So, any Turing tester must not only be trained in the principles of cognition but also to discard their own linguistic instincts. We don’t know what it will take for a machine to become truly intelligent but we do know that humans are notoriously bad at telling machines apart from other humans. We simply cannot entrust this sort of thing to such feeble foundations.

As I said above, I suspect that by the time machines do achieve human-level performance on these tasks, it will be obvious. We probably won’t need such a test. Assuming we get there which is not a given. But if a test were needed, it could look something like this.

To replace the Turing test, I would like to propose a sort of Turing Internship. We don’t entrust critical tasks in fields like medicine to people who just passed a test but require they prove ourselves in a closely supervised context. In the same way, we should not trust any AI system based on a benchmark.

Any proposed human-level AI system can be placed in multiple real contexts with several well-informed human supervisors who would monitor its performance for a period of weeks or months to allow for any tricks to be exposed. For example, most people after a few weeks with Alexa, Google Assistant or Siri, get a clear picture of its strengths and limitations. Five minutes with Alexa may make you feel like the singularity is here. Five months will firmly convince you that it is nowhere in sight.

But at the moment, we don’t need this. We don’t need months or weeks to evaluate AI for human-level intelligence. We need minutes. I estimate that we would not need to use this kind of AI internship for another 50 years but likely for much much longer. We are too obssessed with the rapid progress of some basic technologies but ignore many examples of stagnation. My favourite here is the Roomba which has been on the market for 17 years now and has hardly progressed at all. Equally, the current NLP technologies have made massive strides in utility but have not progressed towards anything that could be meaningfully described as understanding.

That is not to say that tests like GLUE or even BLUE are completely useless. They can certainly help us compare ML approaches (up to a point). They’re just useless for comparing human performance with those of machine-generated models.

Note on Nick Bostrom and Superintelligence

One obvious objection to the Turing Internship idea is that if human-level AI is the last step before Bostrom’s ‘Superintelligence’, unleashing it in any real context would be extremely dangerous.

If you believe in this ‘demon in the machine’ option, there’s nothing I can do to convince you. But I personally don’t find Superintelligence in any way persuasive. The reason is that most of the scenarios described are computationally infeasible in the first place. Bostrom does not mention the issue of computability and things like P=NP almost at all. And he completely ignores questions of nonlinear complexity.

It is hard to judge whether a ‘superintelligent’ system could take over the world. But could it predict the weather 20 days out with 1% tolerance of temperature estimates in any location? The answer is most likely not. There may not be enough atoms in the universe to compute the weather arbitrarily precisely more than a few days in advance. Could it predict earthquakes? Could it run an economy more efficiently than an open market relying on price signals? The answers to all those questions are most likely no. Not because the superintelligence is not super enough but because these may not be problems that can be solved by adding ‘more’ intelligence. Assuming that ‘intelligence’ is a linearly scalable property in the first place. It may well be like body size, after a certain amount of increase, it would just collapse onto itself.

Superintelligence requires a conspiracy theorist’s mindset. Not that people who believe are conspiracy theorists. But they assume that complexity can be conquered with intelligence. They don’t believe that humans are ‘smart’ enough to control everything. But they believe that it is inherently possible. Everything we know about complexity, suggests that this is not the case. And that is why I’m not worried.

Fruit loops and metaphors: Metaphors are not about explaining the abstract through concrete but about the dynamic process of negotiated sensemaking

Share

Note: This is a slightly edited version of a post that first appeared on Medium. It elaborates and exemplifies examples I gave in the more recent posts on metaphor and explanation and understanding.

One of the less fortunate consequences of the popularity of the conceptual metaphor paradigm (which is also the one I by and large work with on this blog) is the highlighting of the embodied metaphor at the expenses of others. This gives the impression that metaphors are there to explain more abstract concepts in terms of more concrete ones.

Wikipedia: “Conceptual metaphors typically employ a more abstract concept as target and a more concrete or physical concept as their source. For instance, metaphors such as ‘the days [the more abstract or target concept] ahead’ or ‘giving my time’ rely on more concrete concepts, thus expressing time as a path into physical space, or as a substance that can be handled and offered as a gift.“

And it is true that many of the more interesting conceptual metaphors that help us frame the fundamentals of language are projections from a concrete domain to one that we think of as more abstract. We talk about time in terms of space, emotions in terms of heat, thoughts in terms of objects, conversations as physical interactions, etc. We can even deploy this aspect of metaphor in a generative way, for instance when we think of electrons as a crowd of little particles.

But I have come to view this as a very unhelpful perspective on what metaphor is and how it works. Instead, going back to Lakoff’s formulation in Women, Fire, and Dangerous Things, I’d like to propose we think of a metaphor as a principle that helps us give structure to our mental models (or frames). But unlike Lakoff, I like to think of these as an incredibly dynamic and negotiated process rather than as a static part of our mental inventory. And I like to use conceptual intergation or blending as way of thinking about the underlying cogntivive processes.

Metaphor does two things: 1. It helps us (re)structure one conceptual domain by projecting another conceptual domain onto it and 2. In the process of 1, it creates a new conceptual domain that is a blend of the two source domains.

We do not really understand one domain in terms of another through metaphor. We ‘understand’ both domains in different ways. And this helps us create new perspectives which are themselves conceptual domains that can be projected or projected into. (As described by Fauconnier and Turner in The Way We Think).

This makes sense when we look at some of the conventional examples used to illustrate metaphors. “The man is a lion” does not help us understand lesser known or more abstract ‘man’ by using the better known or more concrete ‘lion’. No, we actually know a lot more about men and the specific man we’re thus describing than we do about lions. We are just projecting the domain of ‘lions’ including the conventionalised schemas of bravery and fierceness onto a particular man.

This perspective depends on our conventionalised way of projecting these 2 domains. Comparison between languages illustrates this further. The Czech framing of lions is essentially the same as English but the projection into people also maps lion’s vigour into work to mean ‘hard working’. So you can say “she works as a lion”, meaning she works hard. But in the age of documentaries about lions, a joke subverting the conventionalised mapping also appeared and people sometimes say “I work like a lion. I roar and go take a nap.” This is something that could only emerge as more became conventionally known about lions.

But even more embodied metaphors do not always go in a predictable direction. We often structure affective states in terms of the physical world or bodily states. We talk about ‘being in love’ or ‘love hitting a rocky patch’ or ‘breaking hearts’ (where metonymy also plays a role). But does that really mean that we somehow know less about love than we know about travelling on roads? Love is conventionally seen as less concrete than roads or hearts but here we allow ourselves to be mislead by traditional terminology. The domain of ‘love’ is richly structured and does not ‘feel’ all that abstract to the participants. (I’d prefer to think of ‘love’ as a non-prototypical noun; more prototypical than ‘rationalisation’ but less prototypical than ‘cat’).

Which is why ‘love’ can also be used as the source domain. We can say things like “The camera loves him.” and it is clear what we mean by it. We can talk about physical things “being in harmony” with each other and thus helping us understand them in different ways despite harmony being supposedly more abstract than the things being in harmony.

The conceptual domains that enter into metaphoric relationships are incredibly rich and multifaceted (nothing like the dictionaries or encyclopedias we often model linguistic meaning after). And the most important point of unlikeness is their dynamic nature. They are constantly adapting to the context of the listeners and speakers, never exactly the same from use to use. We have a rich inventory of them at our disposal but by reaching into it, we are also constantly remaking it.

We assume that the words we use have some meanings but it is us who has the meanings. The words and other structures just carry the triggers we use to create meanings in the process of negotiation with the world and our interlocutors.

But this sounds much more mysterious and ineffable than it actually is. These things are completely mundane and they are happening every time we open our mouths or our minds. Here’s a very simple but nevertheless illuminating illustration of the process.

Not too long ago, there were two TV shows that had some premise similarities (Psych and The Mentalist). One of them came out a year earlier and its creators were feeling like their premise was copied by the other one. And they used the following analogy:

“When you go to the cereal aisle in a grocery store, and you see Fruit Loops there. If you look down on the bottom, there’s something that looks just like Fruit Loops, and it’s in a different bag, and it’s called Fruity Loop-Os.” 

I was watching both shows at the time but their similarity did not jump out at me. But as soon as I read that comparison it was immediately clear to me what the speaker was trying to say. I could automatically see the projection between the two domains. But even though it seemed the cereal domain was more specific, it actually brought a lot more with it than the specificity of cereal boxes and their placement on store shelves. What it brought over was the abstract relationship between them in quality and value but also many cultural scripts and bits of propositional knowledge associated with cereal brands and their copycats.

But there was even more to it than that. The metaphor does not stop at its first outing (it’s kind of like mushrooms and their  in this way). Whenever, I see a powerful analogy or generative metaphor on the internet, I always look for the comments where people try to reframe it and create new meanings. Something I have been calling ‘frame negotiation’. Take almost any salient metaphoric domain projection and you will find that it is only a part in a process of negotiated sense making. This goes beyond people simply disagreeing with each other’s metaphors. It includes the marshalling of complex structuring conceptual phenomena from schemas, rich images, scenarios, scripts, to propositions, definitions, taxonomies and conventionalised collocations.

This blog post and its comments contain almost all of them: . First, the post author spends three paragraphs (from third on), comparing the two shows and finding similarities and differences. This may not seem like anything interesting but it reveals that the conceptual blends compressed in the cereal analogy are completely available and can be discussed as if it was a literal statement of fact.

Next, the commenters, who have much less space, return to debating the proposition by recompressing it into more metaphors. These are the first four comments in full:

  1. Anonymous said… They’re not totally different. It’s more like comparing Fruit Loops to Fruit Squares which happen to taste like beef.
  2.  said… I think a better comparison would Corn Flakes and Frosted Flakes. Both are made with the same cereal, but one’s sweeter (Psych).
  3.  said… Sweeter as in more comedy oriented? They are vastly different shows that are different on so many levels.
  4. Anonymous said… nikki could not be more right with the corn flakes and frosties analogy

Here we see the process of sense making in action. The metaphoric projection is used as one of several structuring devices around which frames are made. Comment 1 opens the the process by bringing in the idea of reframing through other analogs in the cereal domain. 2. continues that process by offering an alternative. 3. challenges the very idea of using these two domains and 4. agrees with 2 as if this were a literal statement but also referring to the metalinguistic tool being used.

The subsequent comments return to comparing the two shows . Some by offering propositions and scenarios, others by marshalling a new analogy.

 said… The reason the Mentalist feels like House is because house is a modern day medical version of Homes as in Holmes Sherlock. Also both Psych and The Mentalist are both Holmsian in creation. That being said I love the wit and humor of psych

Again, there is no evidence of the concrete/abstract duality or even one between less and better known domains. It is all about making sense of the domains in both cognitive and affective ways. Some domains have very shallow projections (partial mappings) such as cornflakes and frosty flakes, others have very deep mappings such as Sherlock Holmes. They are not providing new information or insight in the way we traditionally think of them. Nor are they providing an explanation to the uninitiated. They are giving new structure to the existing knowledge and thus recreating what is known.

The reason I picked such a seemingly mundane example is because all of this is mundane and it’s all part of the same process. One of my disagreements with much of metaphor application is the overlooking of the ‘boring’ bits surrounding the first time a metaphor is used. But metaphors are always a part of a complex textual and discursive patterns and while they are not parasitic on the literal as was the traditional slight against them, they are also not the only thing that goes on when people make sense.