The Uncanny Valley of Voice Recognition

February 9, 2015

Ever since Star Trek first aired, we’ve held unreasonable expectations of computers. Not only do we expect them to work for free, but we expect them to listen to all of our problems.

Hey Siri, what’s the temp?
Calling “mom”.
No. Cancel.
…
Siri, cancel. Stop. Siri stop. Siri eat a dick.
Hello?
Hi mom! Uh, I’ve missed you!

We’ve reached the Uncanny Valley of voice recognition, and everyone except these fucking computers knows what I’m talking about.

Canniness

The Uncanny Valley is a term that originated from the computer animation industry. In 1992, while finishing A Bug’s Life, Pixar had to build a digital valley for Buzz Lightyear to drive his Ford® F-150™ pickup through on the way to the hospital so he could get a vasectomy. Pixar’s staff found that the valley looked uncanny, meaning it looked good, but not perfect. They ended up illustrating a crate of Campbell’s® Tomato Soup™ in the corner to make it feel a bit more canny.

The concept of The Uncanny Valley was born. If an emulation doesn’t quite match up to how a behavior works in the real world, humans can easily pick up on this discrepancy and they’ll get pissed off and start an angry hashtag on Twitter.

This never worked until now

Look, we all grew up with those ads in our Yahoo! Internet Life magazine subscriptions for that Dragon’s Den voice recognition software thing that never fucking worked but they still inevitably got testimonials from lawyers in blue power shirts who beamed through their dorky microphone headsets while they happily dictated like a robot the winning litigation strategy for their client who nonetheless got caught soliciting a federal agent for counterfeit laundry detergent and also some sexual favors but they’ll get off because they “have a fraternity brother in Justice who can pull some strings”.

We laughed at that back then because we had some real dope inventions at the time: like, keyboards and shit. Have you tried one? You can put words on a screen real fast compared to flapping words through your oral meatflaps. It’s also beneficial while you’re writing your steamy 50 Shades of Grey fan fiction in your public library. I mean, to make it less creepy you could try whispering the words to your computer instead of loudly dictating them, but are you sure that doesn’t make it more creepy?

Voice recognition used to be a gimmick.

Things done changed

The last year or two have seen some real mass-market changes, though. Apple has Siri, Google has OK Google or something, Amazon has an entire standalone device in Echo, and Microsoft is pushing for voice control in Xbox One with Kinect.

The difference is that these things finally do some cool stuff. We’re not dictating litigation strategy to our secretaries; we’re interacting with our devices in real ways. It kinda blew my mind that I can walk into my living room, say “Xbox on”, and my Xbox turns on, my TV gets switched on, the input gets changed over to my Apple TV, and it’s all ready to watch by the time I reach my chair.

Voice intelligence

The problem is we’re at this uncanny valley. We want to talk to our devices like humans, but they still act like toddlers wearing headphones who only speak Portuguese or something.

If I’m playing music in the background, my Xbox has a tough time identifying what I’m saying. It’s not a mistake a human would readily make.

If you ever end up in bed with someone (congrats!) with both your iPhones plugged into the wall and one of you wakes up and asks “Hey Siri what’s the weather like today”, you now have two iPhones (plus any iPads lying around) dutifully responding at the same time. iPhones understand words, but they don’t understand you. It’s not a mistake a human would readily make.

When voice recognition works, it’s great, but when you have to repeat yourself or it just doesn’t understand you, the level of frustration feels much higher than with other software. Why can’t you understand me, you dumb robot?

Words as UI

Part of this frustration is that the user interface itself is less standardized than the desktop or mobile device UI you’re used to. Even the basic terminology can feel pretty inconsistent if you’re jumping back and forth between platforms.

Siri aims to be completely conversational: “Do you think the freshman Congressman from California’s Twelfth deserved to sit on HUAC, and how did that impact his future relationship with J. Edgar?”

Xbox One is basically an oral command line interface, of the form:

Xbox <verb> (direct object)

For example: “Xbox, go to Settings”. But this is not the case if it’s in “active listening” mode, in which case you drop the “Xbox” and attempt to address it conversationally (“go to Settings”). But you can’t really converse with it, because it’s functionally less capable than Siri or Google Now. The context switching is a little frustrating. On the other hand, since it’s so cut-and-dried, there’s less of an uncanny valley here because I don’t personify my Xbox as much as I do Siri; my Xbox just responds to commands. Funny how a different voice UI here results in a totally different experience.
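To make the "oral command line" idea a bit more concrete, here’s a minimal sketch of how you might parse that Xbox <verb> (direct object) shape once the speech has already been turned into text. None of this is how the Xbox actually does it: the verb list, the `parse_command` name, and the `active_listening` flag are all made up for illustration.

```python
import re

WAKE_WORD = "xbox"

# Hypothetical verbs, invented for this sketch (not a real Xbox command list).
KNOWN_VERBS = ("go to", "turn off", "on", "play", "pause")


def parse_command(utterance, active_listening=False):
    """Return (verb, direct_object) for a recognized command, else None."""
    words = re.findall(r"[a-z0-9']+", utterance.lower())

    # Outside active listening, the wake word has to come first.
    if not active_listening:
        if not words or words[0] != WAKE_WORD:
            return None
        words = words[1:]

    # Try multi-word verbs first so "go to" is matched as a whole.
    for verb in sorted(KNOWN_VERBS, key=lambda v: len(v.split()), reverse=True):
        verb_words = verb.split()
        if words[:len(verb_words)] == verb_words:
            rest = words[len(verb_words):]
            return verb, " ".join(rest) if rest else None

    return None  # not a command we know; no attempt at conversation


print(parse_command("Xbox, go to Settings"))
print(parse_command("go to Settings", active_listening=True))
```

The rigid grammar is the whole point: anything that doesn’t fit the pattern is simply ignored, which is a lot less ambitious than trying to hold a conversation.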

Amazon Echo’s UI is similar to Siri’s conversational form, although you’re almost always going to invoke it by saying Alexa (whereas you can bypass Hey Siri by holding down the Home button and talking normally — a beneficial side effect of having the device in-hand).

There are good reasons for all these inconsistencies. Xbox, for example, benefits from clear, directed dialogue because there are fewer functions you require once you’re sitting in front of a TV. But it’s these inconsistencies that are frustrating as you jump back and forth between devices. And we’re only going to scale this up, particularly at home, because again, when this all works it’s awesome. I want to control entire workflows in my home by voice: “hey I’m heading out” might turn down my thermostat, turn off my lights, and check that my oven’s turned off.
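As a rough sketch of that kind of workflow, you could imagine one phrase fanning out into a list of actions. Everything here is hypothetical: the `ROUTINES` table, the action names, and `actions_for` are placeholders, and no real thermostat, light, or oven API is being called.

```python
# Hypothetical phrase-to-actions table; the action names are placeholders.
ROUTINES = {
    "i'm heading out": [
        "turn down thermostat",
        "turn off lights",
        "check oven is off",
    ],
}


def actions_for(utterance):
    """Look up the actions for a spoken phrase, or return an empty list."""
    # Normalize case, curly apostrophes, and surrounding whitespace.
    phrase = utterance.lower().replace("\u2019", "'").strip()
    # Drop a leading "hey" and any trailing punctuation.
    if phrase.startswith("hey "):
        phrase = phrase[len("hey "):]
    phrase = phrase.rstrip(".!")
    return ROUTINES.get(phrase, [])


for action in actions_for("hey I'm heading out"):
    print(action)
```

An exact-match lookup like this is exactly the kind of thing that feels dumb the moment you say "I'm off" instead, which is pretty much the uncanny valley problem in miniature.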

It took decades before computing settled on the standard concepts of the GUI: the desktop metaphor, overlapping windows, scrollbars, and so on. Hopefully voice UI catches up and standardizes, too.

it’s gon b creepy

Voice recognition, if it ever crosses the chasm of the uncanny valley, is going to have to get much smarter. And I don’t mean preprogrammed jokes about where to hide a dead body, but our voice assistants are going to have to start learning about us. Building a relationship with us. Knowing us.

And that’s going to be creepy. Especially if we don’t trust who’s on the other end of the line. Maybe.

I think the reason people liked Her (2013) so much was that it didn’t seem all that creepy. It seemed like you’re gaining a friend. And it’s going to be weird at first, since it’s going to need to be always-on, always listening, and always learning from you. But if we can ever jump past this uncanny valley, that’s where we’ll basically build AI, for all intents and purposes, and we’re going to have a friend following us around. And it’s going to make life better for us.

Well, depending on which science fiction you watch. We could all end up depressed or die, too. Hug a fan of Black Mirror or Transcendence (2014) today.

We’re probably still a long, long way from crossing the Valley to our utter doom or sublime utopia, since computers are hard and voice recognition is apparently really hard. (Or at least that’s what I assume; I just do fake computer science like User.find() so I wouldn’t know, myself.) We’re going to have to deal with this uncanny valley in the meantime. That’s a little frustrating, but hey, at least we don’t have to wear boom microphone headsets anymore.

Exciting stuff is afoot.