CHI ’24 Paper: ‘Apple’s Knowledge Navigator: Why Doesn’t That Conversational Agent Exist Yet?’
I’m a massive fan of Apple’s 1987 Knowledge Navigator concept video. Like other tech nerds, I often filter technology advancements through the lens of that vision: How close are we to that future?
Much of what it anticipates has come to pass in the nearly four decades since—video streaming, touchscreens, globally connected computers, wireless networking, and more.
Even some portions of the most fantastical and oft-discussed aspect of the video—the human-like digital assistant, Phil—are possible today; for example, Phil’s ability to summarize vast amounts of data, understand the spoken word, or speak in a voice that’s virtually indistinguishable from a human’s.
However, the core of the video—where a professor has a human-like conversation with his digital assistant, which can anticipate needs and act autonomously on the professor’s behalf—well, we’re not quite there yet.
This fascinating research paper (PDF, video summary) attempts to answer the questions I’ve often asked myself: Why aren’t we there yet? What’s preventing us from having a “conversational agent” like Phil? Is it purely technological limitations, or are there other issues at play?
What I enjoyed about this paper was the systematic approach the authors took to identify the nature of the interactions between the professor and Phil: What is Phil’s role at any given moment? Is it proactive, interruptive, collaborative, or passive?
The researchers looked at every verbal exchange between the professor and his digital assistant, then identified what those exchanges represent and how various concerns—or “constraints”—are preventing, or at least delaying, the implementation and adoption of conversational agents today.
The authors applied three theoretical frameworks to analyze the interactions between the professor and Phil:
[T]he Distributed Cognition for Teamwork (DiCoT) model, the Human-Agent Team Game Analysis Framework, and Flows of Power (FoP) framework. These frameworks enabled a thorough examination of the cognitive dynamics, human-agent interactions, and power relations within the video.
Using these frameworks, the researchers captured “dialogue, actions, and agent capabilities” and identified “events” that were:
[…] feasible and common today, feasible and not common today, or not feasible today. Feasibility was determined by comparing the demonstrated agent capabilities to those of widely adopted agents like Apple's Siri and to current trends in HCI [Human Computer Interaction] research and development. These characterizations were then used to consider why the Phil agent differs from today's personal digital assistants.
From this effort, they identified
[…] a list of 26 agent capabilities, such as “Knowledge of contacts and relationships” (e.g., Mike's mother) and “Can accurately extract data from a publication” (e.g., Phil summarizes the results of an academic paper using a graph).
Those 26 agent capabilities were condensed into nine broad capabilities—knowledge of user history, knowledge of the user, advanced analytic skills, and so on. For each of those, they focused on two actionable categories (“currently feasible but not common today” and “not currently feasible”).
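Just to make that coding scheme concrete for myself, here’s a rough Python sketch of how I picture it. The two capability names come from the paper’s examples quoted above, but the broad-category and feasibility assignments are my guesses, not the authors’ actual coding:

```python
from dataclasses import dataclass
from enum import Enum

class Feasibility(Enum):
    FEASIBLE_AND_COMMON = "feasible and common today"
    FEASIBLE_NOT_COMMON = "feasible but not common today"
    NOT_FEASIBLE = "not feasible today"

@dataclass
class Capability:
    name: str             # one of the 26 specific capabilities
    broad_category: str   # one of the nine condensed categories
    feasibility: Feasibility

# Illustrative entries only; the broad categories and feasibility values
# here are my guesses, not the paper's actual coding.
capabilities = [
    Capability("Knowledge of contacts and relationships",
               "Knowledge of the user",
               Feasibility.FEASIBLE_AND_COMMON),
    Capability("Can accurately extract data from a publication",
               "Advanced analytic skills",
               Feasibility.FEASIBLE_NOT_COMMON),
]

# The authors focus on the two "actionable" buckets.
actionable = [c for c in capabilities
              if c.feasibility is not Feasibility.FEASIBLE_AND_COMMON]
```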
For me, these “agent capabilities” and their feasibility were the most intriguing part of the study. When Apple announced Apple Intelligence last June, I did a very naïve version of this with their demos, writing:
A friend sends you a Message with his new address. You say to Siri "add this address to his contact card". Siri knows what's on your screen, what an address is, how to get and format an address, what "this" address refers to, what a "contact card" is, who "his" means, what it means to "add to contact card", and how to add it to the Contacts app.

You're picking your mom up from the airport. You ask Siri "what time is my mom's flight landing?" Siri knows who "my mom" is, what flight she's on (because of an email she sent earlier), and when it will land (because it can access real-time flight tracking). You follow up with "what's our lunch plan?" Siri knows "our" means you and your mom, when "lunch" is, that it was discussed in a Message thread, and that it's today. Finally, you ask "how long will it take us to get there from the airport?". Siri knows who "us" is, where "there" is, which airport is being referenced, where you are now, and real-time traffic conditions.
I wish I were more familiar with the frameworks this paper used. They do a great job of clearly identifying agent behavior and responsibility.
Back to the paper… The nine broad capabilities were then:
[…] tagged with constraints that restrict their adoption or development […]. Some were based on the user, such as trust or privacy, and some were based on available technology itself. The authors used categories similar to those used in previous studies of barriers to technology adoption to group the constraints into three user-centered categories (privacy, social and situational, trust and perceived reliability), and one technology category.
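Continuing my sketch from above, the constraint tagging might look something like this. Again, the particular tags and groupings are my own illustration, not the authors’ data:

```python
from enum import Enum

class Constraint(Enum):
    PRIVACY = "privacy"
    SOCIAL_AND_SITUATIONAL = "social and situational"
    TRUST_AND_RELIABILITY = "trust and perceived reliability"
    TECHNOLOGY = "technology"

# Each broad capability is tagged with the constraints restricting its
# adoption or development. These specific tags are hypothetical examples,
# not taken from the paper.
constraint_tags = {
    "Knowledge of the user": {Constraint.PRIVACY,
                              Constraint.TRUST_AND_RELIABILITY},
    "Advanced analytic skills": {Constraint.TECHNOLOGY},
}
```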
Those “constraints” are effectively reasons why it may be difficult—or impossible—to develop and deploy a “conversational agent” today. A few reasons, from my perspective:
My takeaway from the paper is that while (much) improved technology is a necessary component to enable conversational agents, it is not sufficient. Overcoming the technical hurdles does not immediately bring us the levels of human-digital assistant engagement we see in Knowledge Navigator. Even if there’s an unexpected leap forward on the technology side, the other three constraints remain significant barriers to the introduction and eventual adoption of a Phil-level agent.