Audio Dubbing, Subtitles, and Visual Synchronization Alignment in Video Translation | pyVideoTrans (pyvideotrans.com, github.com/jianchang512/pyvideotrans)

Aligning dubbed audio, subtitles, and visuals in video translation has always been a technical challenge, because grammatical structures and speech rates vary greatly between languages. When a sentence is translated into another language, its character count and speaking time change, so the translated dubbing no longer matches the duration of the original audio, and subtitles, audio, and visuals drift out of alignment.

Typical symptoms: the character in the original video has finished speaking, but the dubbing is only halfway through; or the next line in the original video has already started while the dubbing is still delivering the previous one.

Character Count Changes Due to Translation

For example, after translating the following Chinese sentences into English, their length and syllable count change significantly, and the corresponding audio duration also changes:

Chinese: 得国最正莫过于明
English: There is no country more upright than the Ming Dynasty
Chinese: 我一生都在研究宇宙
English: I have been studying the universe all my life
Chinese: 北京圆明园四只黑天鹅疑被流浪狗咬死
English: Four black swans in Beijing's Yuanmingyuan Garden suspected of being bitten to death by stray dogs

As can be seen, after translating Chinese subtitles into English subtitles and dubbing them, the dubbing duration usually exceeds the original Chinese audio duration. To solve this problem, the following strategies are typically adopted:

Several Coping Strategies

Increase Dubbing Speech Rate: Theoretically, as long as there is no upper limit on speech rate, the dubbing can always be made to fit the subtitle's duration. For example, if the original audio lasts 1 second and the dubbing lasts 3 seconds, raising the dubbing speech rate to 300% synchronizes them. However, this makes the audio sound rushed and unnatural, and because each segment needs a different factor, the speed varies from line to line, which lowers overall quality.
Simplify the Translation: Reduce dubbing duration by shortening the translated text. For example, translating "我一生都在研究宇宙" into the more concise "Cosmology is my life's work." Although this method yields the best results, it requires modifying subtitles sentence by sentence, which is very inefficient.
Adjust Silence Between Subtitles: If there is silence between two subtitle segments in the original audio, part of this silence can be reduced or removed to bridge the duration gap. For example, if there is 2 seconds of silence between two subtitles in the original audio, and the translated first subtitle is 1.5 seconds longer than the original, the silence can be shortened to 0.5 seconds, allowing the dubbing time for the second subtitle to align with the original audio timing. However, not all subtitles have sufficient adjustable silence between them, limiting the applicability of this method.
Remove Silence Before/After Dubbing: Silence is often retained before and after dubbing. Removing this silence can effectively shorten the dubbing duration.
Slow Down Video Playback: If speeding up the dubbing alone is not enough, it can be combined with slowing down the video. For example, suppose the original audio for a subtitle segment lasts 1 second but the dubbing lasts 3 seconds. We can shorten the dubbing to 2 seconds (a 1.5x speed-up) while halving the playback speed of the corresponding video segment (extending it to 2 seconds), which brings the two into sync.
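The arithmetic behind the worked examples above can be sketched in a few lines of Python. The function names are hypothetical and only illustrate the calculations; they are not taken from pyVideoTrans.

```python
def speedup_factor(dub_s: float, slot_s: float) -> float:
    """Rate multiplier that fits a dub of dub_s seconds into slot_s seconds."""
    if dub_s <= 0 or slot_s <= 0:
        raise ValueError("durations must be positive")
    # A dub that already fits is left at normal speed.
    return max(1.0, dub_s / slot_s)


def remaining_silence(gap_s: float, overrun_s: float) -> float:
    """Silence left in a gap after the previous dub overruns into it."""
    return max(0.0, gap_s - overrun_s)


def split_evenly(dub_s: float, slot_s: float) -> tuple[float, float, float]:
    """Meet in the middle: return (target_s, audio_speed, video_speed)."""
    if dub_s <= 0 or slot_s <= 0:
        raise ValueError("durations must be positive")
    if dub_s <= slot_s:
        return slot_s, 1.0, 1.0
    target_s = (dub_s + slot_s) / 2
    # Audio is sped up to reach target_s; video is slowed down to reach it.
    return target_s, dub_s / target_s, slot_s / target_s


print(speedup_factor(3.0, 1.0))     # 3.0, i.e. 300%
print(remaining_silence(2.0, 1.5))  # 0.5
print(split_evenly(3.0, 1.0))       # (2.0, 1.5, 0.5)
```

Splitting the difference evenly is only one possible choice; any target duration between the slot length and the dub length trades audio distortion against video slowness.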

Each of the above methods has its pros and cons and cannot perfectly solve all problems. To achieve optimal synchronization, manual fine-tuning is usually required, but this contradicts the goal of software automation. Therefore, video translation software typically combines several of these strategies to strive for the best possible outcome.
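One common way to apply a speed-up or slow-down in practice is with ffmpeg's `atempo` (audio tempo) and `setpts` (video timestamps) filters. The sketch below only builds the filter strings; it assumes ffmpeg is the tool in use, which this article does not state, and the helper names are hypothetical. Older ffmpeg builds accept `atempo` values only between 0.5 and 2.0, so larger factors are chained.

```python
def atempo_chain(speed: float) -> str:
    """Build an ffmpeg audio filter string that changes tempo by `speed`.

    Factors outside [0.5, 2.0] are split into several chained stages
    whose product equals `speed`.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    stages = []
    while speed > 2.0:
        stages.append(2.0)
        speed /= 2.0
    while speed < 0.5:
        stages.append(0.5)
        speed /= 0.5
    stages.append(speed)
    return ",".join(f"atempo={s:.6g}" for s in stages)


def setpts_slowdown(video_speed: float) -> str:
    """Build an ffmpeg video filter that plays at `video_speed` (0.5 = half speed)."""
    if video_speed <= 0:
        raise ValueError("video_speed must be positive")
    # Stretching timestamps by 1/speed slows playback by that factor.
    return f"setpts={1.0 / video_speed:.6g}*PTS"


print(atempo_chain(1.5))     # atempo=1.5
print(atempo_chain(3.0))     # atempo=2,atempo=1.5
print(setpts_slowdown(0.5))  # setpts=2*PTS
```

The resulting strings would be passed to ffmpeg as `-filter:a` and `-filter:v` arguments respectively.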

Implementation in Video Translation Software

In software, these strategies are typically controlled through the following settings:

Main Interface Settings:

Dubbing Speed-Up: Setting used to automatically speed up the dubbing so that its duration is shortened to match the subtitles.

Video Slow-Down: Setting used to automatically reduce video playback speed to match dubbing duration.

Remove Silence Between Subtitles: Optional, and available when neither Dubbing Speed-Up nor Video Slow-Down is selected. It removes the silence between subtitles.

Align Subtitle & Audio: When neither Dubbing Speed-Up nor Video Slow-Down is selected, the audio and subtitles are not aligned. Selecting this option forces subtitle and audio alignment.

Dubbing Speech Rate: Setting used to globally accelerate dubbing.

Secondary Recognition: Select this when embedding single subtitles. After dubbing is complete, it runs speech-to-text on the dubbed file to regenerate the subtitles, ensuring precise alignment between subtitles and dubbing.
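To illustrate what forcing subtitle and audio alignment can mean, here is a minimal sketch that re-times each subtitle cue to the span its dubbed clip actually occupies. It is one plausible interpretation, not the documented behavior of the Align Subtitle & Audio option; the `Cue` type and `retime_to_dub` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


def retime_to_dub(cues: list[Cue], dub_durations: list[float]) -> list[Cue]:
    """Shift each cue so it spans exactly the time its dubbed clip occupies."""
    if len(cues) != len(dub_durations):
        raise ValueError("need one dub duration per cue")
    out = []
    cursor = 0.0  # end time of the previously placed dub
    for cue, dur in zip(cues, dub_durations):
        # Keep the original start unless the previous dub is still playing.
        start = max(cue.start, cursor)
        cursor = start + dur
        out.append(Cue(start, cursor, cue.text))
    return out


cues = [Cue(0.0, 1.0, "first"), Cue(3.0, 4.0, "second")]
for cue in retime_to_dub(cues, [3.5, 1.0]):
    print(cue.start, cue.end, cue.text)
```

In this example the first dub runs 3.5 seconds, so the second cue is pushed from 3.0 to 3.5 seconds. The subtitles stay aligned with the dubbing but drift from the original picture, which is the trade-off when neither speed-up nor slow-down is used.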

Advanced Options Settings (Menu Bar -- Tools/Options -- Advanced Options -- Subtitle Audio Visual Alignment):

Maximum Audio Speed-Up Factor / Video Slow-Down Factor: These limit how far the dubbing may be accelerated and the video slowed down, to prevent audio distortion or excessively slow video playback.
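The effect of such limits can be sketched as a simple clamp: speed up the audio as far as its cap allows, then stretch the video as far as its cap allows, and report whatever overrun remains. The function name, parameter names, default values of 2.0, and the audio-first ordering are all assumptions made for illustration, not the software's actual defaults or algorithm.

```python
def plan_segment(dub_s: float, slot_s: float,
                 max_audio_speedup: float = 2.0,
                 max_video_slowdown: float = 2.0) -> tuple[float, float, float]:
    """Return (audio_speed, video_stretch, leftover_s) for one subtitle segment.

    audio_speed   : tempo multiplier applied to the dub (>= 1.0)
    video_stretch : factor by which the video segment is lengthened (>= 1.0)
    leftover_s    : seconds of dub that still do not fit after both caps
    """
    if dub_s <= 0 or slot_s <= 0:
        raise ValueError("durations must be positive")
    if dub_s <= slot_s:
        return 1.0, 1.0, 0.0  # the dub already fits

    # Step 1: speed up the audio, but not beyond the cap.
    audio_speed = min(dub_s / slot_s, max_audio_speedup)
    dub_after = dub_s / audio_speed

    # Step 2: stretch the video to cover the rest, again within the cap.
    video_stretch = min(dub_after / slot_s, max_video_slowdown)
    slot_after = slot_s * video_stretch

    return audio_speed, video_stretch, max(0.0, dub_after - slot_after)


print(plan_segment(3.0, 1.0, max_audio_speedup=1.5))  # (1.5, 2.0, 0.0)
print(plan_segment(6.0, 1.0))                         # (2.0, 2.0, 1.0)
```

A nonzero leftover means the caps were too tight for that segment, so the overrun has to be absorbed some other way, for example by borrowing silence from the following gap or shortening the translation.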

By flexibly applying the above settings, video translation software can automate the synchronization of subtitles and dubbing as much as possible, improving translation efficiency.
