
Speech Recognition to Text Tool

Open-source repository: https://github.com/jianchang512/stt

This is an offline, locally run speech-recognition-to-text tool based on the open-source openai-whisper model. It recognizes human speech in video/audio files and converts it to text, with output in JSON format, SRT subtitle format with timestamps, or plain text. It can serve as a self-deployed alternative to OpenAI's speech recognition API or Baidu Speech Recognition, with accuracy roughly on par with OpenAI's official API.
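For context on the SRT output: openai-whisper returns the transcription as a list of timed segments, and an SRT file is just those segments numbered and rendered with HH:MM:SS,mmm timestamps. The sketch below illustrates that mapping; it is not this tool's actual code, only an assumed example of how such a conversion can look, using the start/end/text fields that whisper's transcribe() returns.

```python
# Illustrative only: convert whisper-style segments to SRT text.
def segments_to_srt(segments):
    """segments: list of dicts with 'start'/'end' in seconds and 'text',
    as produced by openai-whisper's transcribe()."""
    def ts(seconds):
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```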

After deployment or download, double-click start.exe to automatically open the local web page in your browser.

Drag and drop, or click to select, the audio/video file to recognize, then choose the spoken language, the output text format, and the model to use (the base model is built-in). Click "Start Recognition"; when recognition finishes, the result is displayed on the page in the selected format.

The entire process requires no internet connection, runs completely locally, and can be deployed on an intranet.

The openai-whisper open-source model comes in base/small/medium/large/large-v3 variants. The base model is built-in. From base to large-v3, recognition accuracy improves, but so do the compute and memory requirements. You can download other models as needed and place them in the models directory.
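Internally the tool relies on the openai-whisper Python package; its load_model() function takes a model name and a download_root directory, which is how a .pt file placed in models/ gets picked up. A minimal sketch, assuming the models live in ./models (the project's own loading code may differ):

```python
import whisper

# If models/base.pt already exists, whisper loads it directly instead of downloading.
model = whisper.load_model("base", download_root="./models")

# language is optional; omit it to let whisper auto-detect the spoken language.
result = model.transcribe("example.wav", language="zh")

print(result["text"])         # plain-text transcription
print(result["segments"][0])  # first timed segment, with start/end in seconds
```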

Download address for all models

Pre-compiled Windows Version Usage

  1. Open the Releases page and download the pre-compiled files.

  2. After downloading, extract the files to a location, e.g., E:/stt.

  3. Double-click start.exe and wait for the browser window to open automatically.

  4. Click the upload area on the page, find the audio or video file you want to recognize in the pop-up window, or directly drag and drop the audio/video file into the upload area. Then select the spoken language, text output format, and the model to use. Click "Start Recognition Now". Wait a moment, and the recognition result will be displayed in the selected format in the bottom text box.

  5. If the machine has an NVIDIA GPU and the CUDA environment is correctly configured, CUDA acceleration will be used automatically.
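In PyTorch-based tools like this one, GPU detection usually amounts to a torch.cuda.is_available() check; the snippet below is a quick way to confirm what your own machine will use (the project's exact detection logic may differ).

```python
import torch

# True only if an NVIDIA GPU and a CUDA-enabled PyTorch build are both present.
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Recognition will run on:", device)
```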

Source Code Deployment (Linux/Mac/Windows)

  1. Requires Python 3.9 to 3.11.

  2. Create an empty directory, e.g., E:/stt. Open a command prompt window in this directory by typing cmd in the address bar and pressing Enter.

    Use git to pull the source code to the current directory: git clone git@github.com:jianchang512/stt.git .

  3. Create a virtual environment: python -m venv venv.

  4. Activate the environment. On Windows: %cd%/venv/scripts/activate. On Linux and Mac: source ./venv/bin/activate.

  5. Install dependencies: pip install -r requirements.txt. If you encounter version conflict errors, please run pip install -r requirements.txt --no-deps.

  6. On Windows, extract ffmpeg.7z and place ffmpeg.exe and ffprobe.exe in the project root directory. On Linux and Mac, go to the ffmpeg official website to download the corresponding version of ffmpeg, extract it, and place the ffmpeg and ffprobe binary programs in the project root directory.

  7. Download the model archives as needed. After downloading, place the xx.pt file from the archive into the models folder in the project root directory. (A small sanity-check sketch follows this list.)

  8. Execute python start.py and wait for the local browser window to open automatically.
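Before the first run, a small sanity check like the one below can confirm that the ffmpeg binaries and at least one model file are where the steps above put them. This is a hypothetical helper, not part of the project; the paths assume the layout described in steps 6 and 7.

```python
import shutil
from pathlib import Path

root = Path(".").resolve()  # assumed project root, e.g. E:/stt

# Step 6: ffmpeg/ffprobe should sit in the project root (Windows) or be on PATH (Linux/Mac).
for tool in ("ffmpeg", "ffprobe"):
    in_root = list(root.glob(tool + "*"))
    on_path = shutil.which(tool)
    status = "in project root" if in_root else ("on PATH" if on_path else "MISSING")
    print(f"{tool}: {status}")

# Step 7: the downloaded .pt model files should be in models/ (the base model ships with the tool).
models = sorted(p.name for p in (root / "models").glob("*.pt"))
print("models/*.pt:", models if models else "none found")
```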