November 21, 2024

CrisperWhisper: Enhanced Speech Recognition with Precise Time Stamps

```html CrisperWhisper: Precise Speech Recognition with Improved Timestamps

CrisperWhisper is an advanced variant of OpenAI's Whisper, developed for fast, precise, and verbatim speech-to-text (STT) with accurate word-level timestamps. Whereas the original Whisper tends to omit filler words and pauses, CrisperWhisper transcribes every word exactly as spoken, including filler words, pauses, stutters, and incomplete sentences, so that the transcript captures the details of spoken language rather than a cleaned-up version of it.

Key Features of CrisperWhisper

CrisperWhisper is characterized by several key features:

  • Accurate Word-Level Timestamps: The model delivers precise timestamps, even around interruptions and pauses, thanks to an adjusted tokenizer and a custom attention loss used during training.
  • Verbatim Transcription: Every spoken word is reproduced exactly, including filler words like "um" and "uh".
  • Filler Word Detection: Filler words are detected and transcribed accurately.
  • Reduced Hallucinations: Hallucinated words, that is, words that were never spoken, are kept to a minimum, which improves accuracy.

Improved Performance Compared to Whisper

In the authors' reported benchmarks, CrisperWhisper significantly outperforms Whisper Large v3, particularly on datasets with a verbatim transcription style such as AMI and TED-LIUM. The improvement shows in both transcription accuracy and segmentation. Particularly noteworthy is the precise capture of filler words and pauses, which matter for analyzing speech patterns and cognitive processes.

Applications and Integration

CrisperWhisper can be used in various applications, including:

  • Qualitative Speech Research: Detailed analysis of speech patterns, filler words, and pauses.
  • Clinical Speech Analysis: Diagnosis of speech disorders and evaluation of therapy progress.
  • Automatic Subtitling: Creation of accurate subtitles for videos and audio files.
  • Improved Voice Assistants: Development of voice assistants that respond to natural spoken language.

The model can be used through common frameworks such as Hugging Face Transformers and faster-whisper, which makes it flexible to deploy. A Streamlit app is also available for interactive use.

Mindverse and CrisperWhisper

For Mindverse, a German company that develops AI-powered content tools, CrisperWhisper represents a valuable addition to the portfolio. The precise speech recognition technology can be integrated into various Mindverse solutions, such as chatbots, voicebots, AI search engines, and knowledge systems. This allows Mindverse customers to benefit from improved accuracy and efficiency in processing speech data.

Bibliography

  • Wagner, L., Thallinger, B., Zusag, M. (2024). CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions. INTERSPEECH 2024.
  • https://github.com/nyrahealth/CrisperWhisper/blob/main/README.md
  • https://replicate.com/collectiveai-team/crisperwhisper/readme
  • https://www.gradio.app/guides/real-time-speech-recognition
  • https://arxiv.org/html/2408.16589v1
  • https://openai.com/index/whisper/
  • https://www.isca-archive.org/interspeech_2024/zusag24_interspeech.html
  • https://github.com/SYSTRAN/faster-whisper
  • https://hub.docker.com/r/liquidinvestigations/openai-whisper-gradio
```
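
The word-level timestamps and verbatim filler words described in this article lend themselves to simple downstream analysis, for example in speech research. The following is a minimal Python sketch, not part of CrisperWhisper itself. It assumes the transcript is available as a list of word chunks, each with a "text" and a (start, end) "timestamp" in seconds, which is the shape the Hugging Face Transformers speech recognition pipeline returns when word timestamps are requested. The helper names (normalize, find_fillers, find_pauses), the filler vocabulary, the 0.5-second pause threshold, and the sample data are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Assumed filler vocabulary; adjust it to the tokens your model actually emits.
FILLERS = {"uh", "um"}


def normalize(token: str) -> str:
    """Lowercase a token and strip surrounding whitespace, brackets, and punctuation."""
    return token.strip().strip("[](){}.,!?;:\"'").lower()


def find_fillers(chunks: List[Dict]) -> List[Dict]:
    """Return the word chunks whose normalized text is a known filler."""
    return [c for c in chunks if normalize(c["text"]) in FILLERS]


def find_pauses(chunks: List[Dict], min_gap: float = 0.5) -> List[Tuple[float, float]]:
    """Return (start, end) spans of silence between consecutive words.

    A pause is the gap between the end of one word and the start of the next,
    reported only when it lasts at least `min_gap` seconds.
    """
    pauses = []
    for prev, cur in zip(chunks, chunks[1:]):
        gap_start = prev["timestamp"][1]
        gap_end = cur["timestamp"][0]
        # Skip pairs with a missing timestamp instead of guessing a value.
        if gap_start is None or gap_end is None:
            continue
        if gap_end - gap_start >= min_gap:
            pauses.append((gap_start, gap_end))
    return pauses


# Hand-written sample data (hypothetical, not real model output).
sample_chunks = [
    {"text": " So", "timestamp": (0.00, 0.20)},
    {"text": " um", "timestamp": (0.35, 0.60)},
    {"text": " I", "timestamp": (1.40, 1.50)},
    {"text": " think", "timestamp": (1.55, 1.90)},
    {"text": " [UH]", "timestamp": (2.00, 2.30)},
    {"text": " yes.", "timestamp": (3.10, 3.40)},
]

fillers = find_fillers(sample_chunks)
pauses = find_pauses(sample_chunks, min_gap=0.5)

print("fillers:", [c["text"].strip() for c in fillers])
print("pauses:", pauses)
```

Matching on normalized text keeps the filler check independent of whether fillers appear as plain words or as bracketed tokens. The pause threshold is a tuning parameter and should be chosen for the analysis at hand.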