When a video, film, e-learning course, or marketing campaign crosses a language border, the translation work involved is rarely one-dimensional. Two distinct layers of text demand attention: the words spoken aloud by people on screen, and the text that appears visually — in graphics, lower thirds, title cards, menus, and UI elements. These two categories, broadly known as dialogue translation and on-screen text (OST) translation, follow different rules, require different skill sets, and serve different purposes in the viewer’s experience.
Yet the terminology surrounding these two disciplines is frequently misunderstood, used interchangeably, or conflated in project briefs. What exactly is the difference between a subtitle and a caption? Is a voice-over the same as dubbing? Where does transcreation fit in? And why does it matter whether a lower-third graphic is treated as OST or handled as part of the subtitle track?
This glossary cuts through the confusion. Whether you are a content producer briefing a localisation vendor, a marketing manager commissioning multilingual video content, or an e-learning developer preparing materials for an international audience, this guide gives you the precise vocabulary to communicate clearly, scope projects accurately, and deliver polished multilingual content. Where relevant, we also highlight how professional localisation services address the unique demands of each category.
What Is On-Screen Text (OST)?
On-screen text refers to any written text that is embedded into, overlaid on, or displayed within the visual frame of a video — text that is seen rather than heard. This includes title cards at the opening of a film, lower-third name identifiers in a documentary, promotional text in an advertisement, menu text in a software tutorial, on-screen instructions in an e-learning module, or even the writing on a whiteboard visible in a training video.
OST translation is a visually driven discipline. The translator must work not only with the meaning of the source text but also with spatial constraints — the translated text must physically fit within the same graphic space allocated to the original. In many cases, this requires close coordination with desktop publishing and typesetting specialists who reformat the visual layout for the target language, especially when scripts like Arabic, Japanese, or Thai are involved. Languages that expand significantly in translation (such as German or French relative to English) require particular care to avoid text overflow or redesign.
What Is Dialogue Translation?
Dialogue translation is the process of converting spoken language in a video or audio production into either a written or spoken equivalent in the target language. Depending on the production format and audience, dialogue translation can take the form of subtitles (written text displayed on screen in sync with the spoken audio), dubbing (replacement of the original spoken audio with a recorded translation), or voice-over (a translated narration layered over the original audio). Each approach serves different purposes and suits different content types.
Unlike OST translation, dialogue translation is heavily time-constrained. Subtitles, for example, must be timed precisely to the rhythm of speech. Dubbed audio must match lip movements closely enough to avoid distracting synchronisation gaps. Even voice-over narration must stay roughly within the duration of the original audio segment. These constraints mean that dialogue translators often work within a discipline known as audiovisual translation (AVT), which is a distinct specialisation within the broader language translation services field.
Complete Glossary: On-Screen Text and Dialogue Translation Terms
The following definitions cover the core vocabulary used across both disciplines. They are grouped by category, but note that many terms overlap in practice, particularly in large-scale video localisation projects.
On-Screen Text Terms
On-Screen Text (OST): Any visible text appearing within the video frame that is not part of the subtitle or caption track. OST is typically baked into the video file or added as a graphic overlay during post-production. Common examples include title cards, lower-thirds, banners, credits, chyrons, watermarks, and text that appears as part of a screen recording or product interface.
Lower Third: A graphic element positioned in the lower portion of the screen, typically used to display speaker names, titles, locations, or contextual information. In multilingual productions, lower-thirds must be redesigned in the target language, which often requires collaboration between translators and graphic designers or DTP specialists.
Title Card: A full-screen or semi-overlay text element used to introduce segments, chapters, speakers, or key messages in a video. Title card translation must account for both meaning and visual impact — particularly in branded content where font style and layout are tightly controlled.
Text Overlay: Any text superimposed over video footage, regardless of position. This is a broad term that encompasses lower-thirds, title cards, call-to-action banners, and instructional prompts. Translating text overlays often requires DTP rework to maintain design consistency across languages.
Burned-in Text (Hardcoded Text): Text that is permanently embedded into the video file and cannot be separated from the visual layer. Translating burned-in text requires editing the original source file or recreating the graphic in the target language, making it more labour-intensive than translating text from a separate file layer.
UI Text (User Interface Text): In screen-capture videos, software tutorials, or app demonstrations, UI text refers to the menu labels, button text, navigation items, and error messages that appear within the recorded interface. UI text translation frequently requires close coordination with software and digital localisation teams.
Interstitial Text: Text frames or cards inserted between scenes or segments of a video to provide context, transition information, or narrative breaks. Common in documentary films and social media content, interstitial text translation must maintain the tone and pacing intended by the original editors.
Dialogue Translation Terms
Subtitles: Written translations of spoken dialogue displayed as text at the bottom of the screen, synchronised to the timing of the audio. Subtitles assume the viewer can hear the audio but does not understand the spoken language. They differ from captions in purpose and scope. Subtitle files are typically delivered in formats such as SRT, VTT, or ASS and are stored separately from the video file, allowing different language tracks to be switched easily.
Captions: Text displayed on screen that represents the spoken audio in the same language, primarily designed for deaf or hard-of-hearing viewers. Unlike subtitles, captions also capture non-speech audio cues — for example, [applause], [phone ringing], or [music playing] — to convey the full acoustic environment of the scene. When captions are translated into another language for a foreign audience, they technically become subtitles.
Closed Captions (CC): Captions that can be toggled on or off by the viewer using player controls. Closed captions are embedded as a separate data stream in the video file rather than burned into the picture. They are the standard accessibility requirement for broadcast television, streaming platforms, and many educational video platforms.
Open Captions (OC): Captions that are permanently visible and cannot be turned off by the viewer, similar to burned-in subtitles. Open captions are often used in social media videos where auto-play with muted audio is common, ensuring the content remains accessible regardless of viewer settings.
SDH Subtitles (Subtitles for the Deaf and Hard of Hearing): A hybrid format that combines the translation function of subtitles with the descriptive elements of captions — including speaker identification and audio cues — for foreign-language deaf and hard-of-hearing audiences. SDH subtitles are increasingly required on international streaming platforms.
Dubbing (Lip Sync Dubbing): A form of dialogue translation where the original spoken audio is entirely replaced with a recorded performance in the target language. High-quality dubbing attempts to match the lip movements of the on-screen speakers — a process called lip sync — which places significant constraints on sentence length and rhythm. Dubbing is the dominant localisation format in countries such as Germany, France, Italy, Spain, and much of Latin America.
Voice-Over: A target-language narration recorded over the original audio, which is typically reduced in volume rather than removed entirely. Voice-over is commonly used in documentaries, corporate videos, news broadcasts, and instructional content. Unlike dubbing, voice-over does not attempt to match lip movements, but it should roughly align with the duration of the original speech segments.
Free Commentary: A looser form of voice-over in which the target-language narrator summarises or paraphrases the source content rather than translating it verbatim. Often used in news packages and live broadcasting where a word-for-word translation would be impractical within time constraints.
Transcription: The process of converting spoken audio into written text, typically in the same language as the original recording. Transcription is often the first step before translation — a transcript of the source dialogue is produced and then translated before being adapted into subtitles or a dubbing script. Professional transcription services are essential for maintaining accuracy at this foundational stage.
Subtitle Script / Spotting File: A timed document that specifies exactly which subtitle line appears on screen, when it starts (in-cue), and when it ends (out-cue). Subtitle scripts are the working document from which subtitle files are generated. The process of assigning timing to subtitle lines is called spotting or cueing.
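As a rough illustration of how a spotting file relates to a delivered subtitle file, the sketch below turns a list of timed cues into SRT-style text. The `Cue` class and `to_srt` function are hypothetical names invented for this example, and the cue times are assumed to already be in the `HH:MM:SS,mmm` form that SRT files use. Real subtitling tools handle many more details (styling, positioning, validation).

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One spotted subtitle line (hypothetical structure for illustration)."""
    in_cue: str   # start time, assumed to be "HH:MM:SS,mmm"
    out_cue: str  # end time, same format
    text: str     # subtitle text; may contain a line break


def to_srt(cues: list[Cue]) -> str:
    """Render cues as SRT text: index, timing line, text, blank separator."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        blocks.append(f"{index}\n{cue.in_cue} --> {cue.out_cue}\n{cue.text}\n")
    # Joining with a newline leaves one blank line between cue blocks.
    return "\n".join(blocks)


cues = [
    Cue("00:00:01,000", "00:00:03,500", "Welcome to the course."),
    Cue("00:00:04,000", "00:00:06,000", "Let's get started."),
]
srt_text = to_srt(cues)
```

Because the timing lives in a text file like this rather than in the picture, a wording or timing fix only requires regenerating the file, not re-rendering the video.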
Lip Sync Script: The adapted translation script prepared for dubbing, specifically crafted to synchronise with the mouth movements and breath patterns of the on-screen actors. Writing a lip sync script is a specialised skill distinct from standard translation, often requiring a translator with acting or scriptwriting experience.
Shared and Overlapping Terms
Audiovisual Translation (AVT): The umbrella discipline covering all forms of translation applied to audio and visual media, including subtitling, dubbing, voice-over, and OST translation. AVT professionals are trained to work within the unique constraints of time, space, and synchrony that distinguish media translation from document translation.
Localisation: A broader process that adapts content — including dialogue, on-screen text, cultural references, humour, colour choices, and formats — to suit a specific target market. Localisation goes beyond direct translation to ensure that the entire viewing experience feels native to the target audience. This is especially critical for advertising, entertainment, and e-learning content. Full localisation services address both OST and dialogue layers simultaneously.
Transcreation: A creative form of translation used when the goal is to replicate the emotional impact and intent of the source content rather than its literal meaning. Commonly applied to marketing slogans, humour, brand taglines, and culturally sensitive dialogue. Transcreation may apply to both on-screen graphic text and spoken dialogue in promotional video content.
Source Language / Target Language: The source language is the original language in which the content was created. The target language is the language into which it is being translated. In multilingual video projects, a single source video may have multiple target languages, each requiring separate OST rework and separate dialogue translation tracks.
Time Code: A numerical reference system used to identify specific moments within a video, formatted as hours:minutes:seconds:frames (e.g., 00:02:15:10). Time codes are the universal reference system for both subtitle spotting and OST placement in post-production, ensuring that text appears and disappears precisely on schedule.
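To make the time code format concrete, here is a minimal sketch that converts an hours:minutes:seconds:frames value into seconds and then into the millisecond-based timestamp used in SRT files. The function names are made up for this example. It assumes a non-drop-frame time code and a frame rate of 25 frames per second by default; the correct frame rate depends on the project and should be confirmed with the post-production team.

```python
def timecode_to_seconds(timecode: str, fps: int = 25) -> float:
    """Convert a non-drop-frame HH:MM:SS:FF time code to seconds.

    The default of 25 fps is an assumption; pass the project's real rate.
    """
    hours, minutes, seconds, frames = (int(part) for part in timecode.split(":"))
    if not 0 <= frames < fps:
        raise ValueError(f"frame value {frames} is out of range for {fps} fps")
    return hours * 3600 + minutes * 60 + seconds + frames / fps


def seconds_to_srt_timestamp(total_seconds: float) -> str:
    """Format seconds as an SRT-style HH:MM:SS,mmm timestamp."""
    total_ms = round(total_seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    seconds, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


# The example time code from the definition above, at the assumed 25 fps.
position = timecode_to_seconds("00:02:15:10", fps=25)
stamp = seconds_to_srt_timestamp(position)
```

At 25 fps, frame 10 falls 0.4 seconds into the second, so the same frame number would map to a different moment at another frame rate.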
Reading Speed: In subtitling, reading speed refers to how much text viewers can comfortably read in a given time, usually measured in characters per second (CPS) or words per minute (WPM). Common guidelines recommend approximately 17 characters per second for general audiences, though this varies by platform and target demographic. OST elements face similar constraints: if a graphic is only visible for two seconds, the translated text must be short enough to be readable within that window.
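A reading speed check can be sketched in a few lines. The example below is illustrative only: the function names are hypothetical, the 17 characters-per-second limit is the guideline figure mentioned above, and the counting convention (spaces count, line breaks do not) is an assumption, since platforms define this differently.

```python
def characters_per_second(text: str, duration_seconds: float) -> float:
    """Return the reading speed needed to read `text` in the given time.

    Assumed convention: spaces count as characters, line breaks do not.
    """
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    visible = text.replace("\n", "")
    return len(visible) / duration_seconds


def exceeds_reading_speed(text: str, duration_seconds: float,
                          max_cps: float = 17.0) -> bool:
    """Return True if the line is too long to read comfortably in time."""
    return characters_per_second(text, duration_seconds) > max_cps


line = "Welcome to the training module."
ok_at_two_seconds = not exceeds_reading_speed(line, 2.0)
too_fast_at_one_and_a_half = exceeds_reading_speed(line, 1.5)
```

Here the 31-character line needs 15.5 CPS when shown for two seconds, which is within the limit, but it exceeds the limit if the display time drops to 1.5 seconds. The same check can be applied to an OST graphic by using the time the graphic stays on screen as the duration.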
Text Expansion / Contraction: The phenomenon whereby translated text is longer (expansion) or shorter (contraction) than the source text. German and Finnish tend to expand relative to English; Chinese and Japanese tend to contract. Text expansion is a critical consideration in both OST design — where graphic space is fixed — and in subtitle timing, where longer text requires longer display duration.
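The effect of text expansion on a fixed graphic space can be estimated before any design work begins. The sketch below is a simplified illustration with hypothetical function names and an assumed character budget. It counts characters rather than measuring rendered width, so it is only a first-pass filter: actual fit depends on the font, the script, and the layout.

```python
def expansion_ratio(source: str, target: str) -> float:
    """Return target length divided by source length (1.0 means no change)."""
    if not source:
        raise ValueError("source text must not be empty")
    return len(target) / len(source)


def find_overflows(translations: dict[str, str], max_chars: int) -> list[str]:
    """Return the source strings whose translations exceed the budget."""
    return [src for src, tgt in translations.items() if len(tgt) > max_chars]


# English UI labels and their German translations.
labels = {"Settings": "Einstellungen", "Save": "Speichern"}

ratio = expansion_ratio("Settings", "Einstellungen")
# Assumed budget: the graphic has room for 10 characters.
needs_rework = find_overflows(labels, max_chars=10)
```

In this example "Einstellungen" is 13 characters against 8 for "Settings", a ratio of 1.625, so it is flagged for DTP rework under the assumed 10-character budget, while "Speichern" fits.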
Proofreading and Quality Review: The final stage of any translation workflow, during which an independent reviewer checks the translated text for linguistic accuracy, consistency, formatting compliance, and cultural appropriateness. In video localisation, this review should cover both the subtitle file and all OST elements, checked against the final rendered video. Professional proofreading services are especially valuable at this stage to catch errors before content is distributed.
Master Template / Style Guide: A reference document that specifies consistent translation choices, terminology, tone, character limits, and formatting rules for a recurring series or brand. Maintaining a master template ensures that OST elements and dialogue translations remain consistent across episodes, campaign videos, or platform versions.
OST vs Dialogue Translation: Core Differences at a Glance
While both OST and dialogue translation serve the shared goal of making video content accessible and meaningful to a new language audience, they differ in four key dimensions.
- Medium: OST is purely visual; dialogue translation is primarily auditory (dubbing, voice-over) or read in sync with audio (subtitles).
- Space constraints: OST is constrained by the graphic frame in the video design; subtitles are constrained by line length and on-screen duration; dubbing is constrained by syllable count and lip movement timing.
- Post-production involvement: OST translation almost always requires DTP or graphic design rework; dialogue translation typically requires only a text file or re-recorded audio.
- Error correction: OST errors are often more visually prominent and harder to correct after final render; subtitle errors can usually be corrected by updating the text file without re-rendering the video.
When Do You Need Both?
Most professional multilingual video productions require both OST translation and dialogue translation. A corporate training video, for example, might feature subtitles for the spoken narration (dialogue translation) alongside translated lower-thirds identifying speakers, and translated title cards introducing each module (OST translation). An advertisement might be dubbed for broadcast while all graphic text overlays are recreated in the target language.
Failing to address both layers is one of the most common mistakes in video localisation. A perfectly subtitled documentary that retains all its English lower-third identifiers will still feel incomplete to a Japanese or Arabic viewer. Conversely, a video with beautifully redesigned OST graphics but mistimed or poorly phrased subtitles will undermine viewer trust. For brands distributing content across multiple markets, such as the Asia Pacific region, where website translation is often needed alongside video localisation, treating both layers as equally important is essential to maintaining brand quality.
Professional Considerations for Each Type
Engaging a professional translation partner for video content means ensuring that the scope of work explicitly covers both OST and dialogue layers. When briefing a translation project, content producers should clarify which elements fall under each category, provide source files for all graphic elements (not just a transcript), specify the target reading speed for subtitles, and identify any text that is burned in versus text delivered as a separate overlay layer.
Quality assurance should include a final video review pass — not just a document review — to confirm that all translated text renders correctly within the visual frame, that subtitle timing feels natural at normal playback speed, and that no source-language text has been inadvertently left untranslated. A rigorous review process that covers both translation accuracy and visual presentation is the hallmark of a full-service localisation partner rather than a simple translation vendor.
Conclusion
The distinction between on-screen text translation and dialogue translation is not just a matter of terminology — it reflects genuinely different workflows, constraints, and quality criteria. Understanding both disciplines, and the vocabulary that surrounds them, helps content teams scope projects accurately, brief vendors clearly, and deliver multilingual video content that feels truly native in every target market.
Whether you are working with subtitles, dubbing scripts, lower-third graphics, or full OST redesigns, the goal is the same: every word on screen and every word spoken should feel as though it was created specifically for your target audience. With the right glossary in hand and the right translation partner alongside you, that standard is entirely achievable.
Need Expert Help with Video Localisation or On-Screen Text Translation?
Translated Right works with brands across Singapore and the Asia Pacific region to deliver precise, culturally appropriate translations for every layer of your video content — from subtitle scripts and dubbing files to OST graphic rework and typesetting. With over 5,000 certified translators across 50+ languages and a rigorous four-stage quality assurance process, we ensure your content lands exactly as intended in every market.






