Designing AI for Multilingual Clarity

* Specific details have been omitted or modified for proprietary reasons.

Role:

Research lead; designed and led a mixed-method field evaluation combining live gate observation, post-demo interviews, semi-structured interviews, post-use micro-feedback, and a service agent survey.

Informed adoption of AI in frontline customer-service workflows

Partners:

Airport Operations
Gates team

Scope:

Field evaluation in a live airport gate environment; first deployment of AI by airport customer-facing employees at United for customer service; presented to senior leadership across Airport Operations and the Gates team.

Informed adoption of the tool by gate agents in live operations.

Summary

Gate agents routinely support passengers whose English proficiency varies, often at moments when timing, documentation, and boarding direction are operationally consequential. Today, that support depends heavily on informal workarounds — nearby multilingual coworkers, personal translation apps, gestures, repeated explanation, escalation. The result is inconsistent customer experience and avoidable burden on agents during peak gate operations.

I led the field evaluation of an AI-based multilingual translation tool piloted at the gate to understand whether it could support faster, clearer, safer interactions between agents and non-English-speaking passengers — the first deployment of AI by airport customer-facing employees at United for customer service.

The research showed that AI translation is most valuable when it reduces short, high-frequency clarification work without forcing agents to abandon operational flow. It also showed that adoption depends less on enthusiasm for AI than on access speed, trust thresholds, mode fit, and clear boundaries for when translation support is safe enough to use. The tool is now used by gate agents in live operations, and the work has shaped how the broader organization thinks about adopting AI into frontline workflows.

Problem Space

International gate operations concentrate several forms of pressure into a small window: boarding timing, documentation checks, wayfinding, seat and group questions, last-minute rebooking concerns, and passenger anxiety. When a passenger doesn't speak English comfortably, even simple clarifications can become difficult to resolve quickly — especially when agents are already managing boarding groups, announcements, delays, document verification, and crowd movement.

Language support in this environment has historically been inconsistent. Agents rely on whichever nearby coworker happens to speak the passenger's language, on personal translation apps, on gestures, on paper documents, on repeated explanation. These methods can work in isolated cases, but they're not reliable as an operational system. They vary by flight, shift, agent tenure, language, crowd density, and the availability of informal help.

The challenge wasn't simply whether AI could translate words correctly. At the gate, translation has to be fast enough to fit live operations, accurate enough to avoid unsafe or misleading instructions, and simple enough that agents will use it at the moment of need rather than defaulting to faster informal workarounds. The key question was whether the tool could become a practical part of gate work, rather than an additional system agents had to manage.

That made the pilot a study of operational fit as much as translation quality. A tool that performs well in a demo can still fail in live conditions if it requires too many steps, struggles with noise, produces phrasing agents don't trust, or doesn't align with the difference between a short boarding clarification and a more nuanced customer conversation. Whether the technology worked and whether agents used it were two separate questions, and both needed to be answered before scale.

Approach

The research was conducted as a two-part field evaluation in an airport gate environment, beginning with a soft-launch briefing and post-demo discussion, followed by live gate evaluation during the pilot. The study focused on gate agents working international and domestic flights, with particular attention to high-diversity routes, peak boarding windows, documentation-related conversations, and moments where agents would otherwise have relied on informal translation support.

The study combined five inputs: post-demo interviews with gate agents (n=10) immediately following the soft-launch briefing; contextual observations during live gate operations (n=8 sessions) capturing 47 translation-relevant interactions; post-use micro-feedback captures (n=52) collected when operationally feasible; semi-structured interviews with agents and leads (n=15); and an all-agent pulse survey (n=32 responses).

The design choice that mattered most was structuring the study to capture both anticipated and actual adoption. The post-demo phase established what agents thought the tool would do for them — which workflows looked promising, which modes seemed usable, where they expected breakdowns. The live observation phase tested those expectations against real conditions. Adoption questions in frontline operations are usually answered too early, before the gap between demo behavior and live behavior becomes visible. Pairing the two phases meant the research could surface that gap directly — and explain it.

What We Found

The Tool Was Strongest Where Translation Functioned as Operational Clarification

The clearest fit was short, high-frequency gate work: confirming a gate location, explaining boarding-group timing, directing a passenger to a document check, clarifying where to stand, repeating a procedural instruction in the passenger's preferred language. These interactions were already bounded. Agents knew what needed to be communicated, and the passenger's need could usually be resolved in one or two turns.

In these cases, AI translation reduced the interpretive work agents normally carry. Rather than improvising with gestures, asking another employee for help, or typing into a personal app, agents could communicate a clear operational message and check whether the passenger understood.

The weaker fit appeared in more complex or sensitive interactions. Documentation conversations, policy explanations, irregular operations, and emotionally charged customer issues required more than translation. They required judgment, interpretation of airline policy, escalation awareness, and sensitivity to what the passenger might infer from the message. In those moments, agents were more cautious, and rightly so.

The implication wasn't that AI translation should replace multilingual service support. It was that the tool's strongest value is as a clarification layer for recurring, operationally bounded moments where speed and consistency matter most.

Adoption Depended on Whether the Tool Preserved Gate Momentum Under Load

Agents didn't evaluate the tool in isolation. They evaluated it against the pace of gate work. During slower periods, agents were willing to experiment — open the tool, select a language, try Conversational mode, assess the result. During active boarding or peak crowding, the threshold changed. A small amount of friction — opening the app, finding the right phrase, confirming the language, waiting for output, adjusting volume — could be enough to push agents back toward gestures, quick English repetition, or informal help.

The pattern was consistent: the more operational pressure increased, the less tolerance agents had for setup work. The tool had to be available at the moment of need, not merely available somewhere in the ecosystem.

This matters because gate operations aren't evenly paced. A tool can appear usable in training and still fail in the moments where it could create the most value. In high-pressure conditions, agents don't want another system to manage. They want the interaction to move forward.

Scale readiness depends heavily on workflow integration. Persistent access, faster launch points, default language shortcuts for common routes, easier mode selection, clearer audio and text delivery — these matter as much as translation quality itself. For agents, usability isn't a property of the interface alone. It's a property of the interface under boarding pressure.

Quick Help and Conversational Mode Supported Different Kinds of Trust

Agents treated the two modes as meaningfully different tools. Quick Help — preset phrases — was understood as faster, safer, and better suited to high-pressure moments because the content was already bounded. Preset phrases reduced the risk that the agent would say something imprecise, and they fit recurring questions agents already knew how to answer.

Conversational mode carried a different value. It was more flexible and better suited to passenger-specific questions, but it required agents to place more trust in speech recognition, translation accuracy, tone, and the tool's ability to handle multi-turn exchanges. That made it more useful when the passenger's need couldn't be anticipated in a phrase list — and more vulnerable to noise, misrecognition, or phrasing that didn't sound operationally safe.

Quick Help supports consistency and speed when the airline already knows the message. Conversational mode supports flexibility when the passenger's need can't be anticipated. The phrase library should be treated as an operational asset, shaped around the highest-frequency passenger questions by route, boarding phase, and language context. Conversational mode requires training around how to phrase inputs, when to verify, and when to stop using AI and escalate.

Trust Was Shaped by Boundaries, Not Accuracy Alone

Agents weren't only asking whether the translation was right. They were asking whether it was safe to use in a specific operational moment. A translation could be mostly accurate and still feel risky if it sounded too formal, omitted context, mishandled airline terminology, or left room for misunderstanding around documentation, boarding eligibility, or next steps.

This matters because gate agents are accountable for the passenger interaction even when AI produces the translation. If a customer misunderstands boarding timing, document requirements, or where to go next, the operational consequences still return to the agent and the airline. Accuracy scores are necessary but insufficient. The more important operational question is whether the tool helps agents produce messages that are clear, polite, contextually appropriate, and safe within the constraints of gate work.

Training therefore becomes a central part of the product experience. Agents need more than a demo. They need clear guidance on where the tool is appropriate, how to verify passenger understanding, and what to do when the translation feels wrong.

What Changed

The study showed that the tool has value and that its two modes have distinct situational advantages.

The tool is now used by gate agents in live operations, applied in the bounded, high-frequency moments the research identified as the strongest fit. The framing differs from our other day-of-travel studies, but the underlying logic is the same: the value isn't in the tool's nominal capability but in whether the tool fits the moment of need cleanly enough to be used at all.

This was the first deployment of AI by airport customer-facing employees at United for customer service — a meaningful organizational first that complements the work establishing CX research on AI from the customer side (covered in a separate case study on AI in travel planning). Together, these two studies represent the first behavioral evaluations of AI in customer-facing contexts at the company, on both the customer-experience and employee-experience sides. The findings here have shaped how the broader organization thinks about adopting AI into frontline workflows: the principle that AI in operational service environments succeeds or fails on workflow integration rather than on raw capability.

The work was presented to senior leadership across Airport Operations and the Gates team, with broader Customer Experience leadership engagement. The framework — operational fit, momentum preservation, mode-specific trust, boundary-defined safety — is being used to evaluate further AI integrations in airport customer-service workflows. It clarifies what scale readiness actually requires (faster access, real-gate phrase libraries, clear use and escalation boundaries, audio and text delivery designed for noise) and what training has to address before AI tools can move from pilot to standard practice.

The deeper takeaway, for AI in frontline service environments more broadly, is that adoption is governed by the same constraints that shape other day-of-travel tools: timing, trust, clarity, fit within physical workflow. Agents will use AI when it helps them preserve forward motion. They'll avoid it when it adds interpretation work, slows the interaction, or leaves them uncertain about whether the output is safe enough to stand behind. The strategic opportunity isn't to make agents "use AI." It's to design the tools so they can resolve common passenger needs more consistently, while retaining the judgment and fallback options required in high-stakes service moments.