The problem:

Sometimes people send me voice messages over Discord, but I’m not always in a place where I can listen to the audio.

The solution:

Display a transcript of the voice message

Overview:

I’m familiar with a model called “Whisper” released by OpenAI (the same people that brought you ChatGPT), and the model is open source! Whisper is fantastic at taking audio and returning a transcription. I want to hook into Discord, so a Chrome extension is the first tool I reach for here. The extension will inject my script and CSS into the Discord web page, allowing custom functionality. I’ll interact with my Whisper model running on RunPod serverless containers, and that’s it for the most part.

The stack one more time:

RunPod Serverless (host)
Chrome Extension (UI / frontend)
Whisper (AI Model for audio => text)

Details:

Let’s set up an API first.

RunPod is fantastic. If you’ve never heard of it before, they’re setting new standards for AI and other GPU-heavy applications. Basically, they have a community network of GPUs that you can rent by the hour, plus highly specialized machines for serverless functions with <200ms cold starts and 32 GB of VRAM. An hourly GPU usually costs me ~$0.20/hr and serverless ~$0.00026/s. RunPod runs Docker containers, which makes it pretty versatile.

So: scan through some Docker repos for one pre-configured with Whisper, add some customization so the API has token verification, and my backend is pretty much complete, and I’ll spend very little money running it. I have a few dedicated servers I like to run personal projects on, but they’re not powerful enough for a speedy Whisper response, so it makes sense to rely on RunPod here. It takes ~10s to transcribe 3m of audio, so at ~$0.00026/s a heavy run should cost me about $0.0026. It also helps that I was already familiar with this model and RunPod configuration from a past project, where I built a tool to transcribe YouTube videos with the intention of potentially creating vectors for LLMs.
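Multiplying it out with the figures quoted in this post (~$0.00026 per second of serverless time, ~10 seconds of compute for 3 minutes of audio), here is a quick sketch of the per-run cost. The function name is just for illustration, and actual RunPod pricing depends on the GPU type you pick.

```typescript
// Rough per-run cost estimate. Both numbers come from the post itself,
// not from a current RunPod price list.
const PRICE_PER_SECOND_USD = 0.00026;

// Serverless billing is per second of compute, so cost is seconds * rate.
function estimateCostUsd(
  computeSeconds: number,
  pricePerSecond: number = PRICE_PER_SECOND_USD,
): number {
  return computeSeconds * pricePerSecond;
}

// ~10 seconds of GPU time to transcribe a 3 minute voice message.
const heavyRun = estimateCostUsd(10);
console.log(`$${heavyRun.toFixed(4)}`); // prints "$0.0026"
```

In other words, a few hundred long voice messages still come in under a dollar.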

The “frontend” was less exciting. I turned on Chrome’s developer mode, which allows custom extensions to be loaded, then wrote a quick injection script to modify the DOM, adding a little circle button to the end of each audio message. When clicked, it finds the source address of the audio and sends that to a background script. The background script sends the audio link to the Whisper API we made and polls for completion. After completion, it hands the transcription back to the injected page script, which updates the DOM again with the transcript.
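The “send the link, then poll for completion” step in the background script could look roughly like the sketch below. It assumes the standard RunPod serverless routes (`/run` to submit a job, `/status/<id>` to check on it) and a Bearer token header. The `{ audio: url }` input shape, the option names, and the endpoint URL are placeholders, since the real worker and my custom token check may expect something different.

```typescript
// Sketch of a submit-then-poll loop, NOT the exact code from the extension.
type JobStatus =
  | "IN_QUEUE" | "IN_PROGRESS" | "COMPLETED"
  | "FAILED" | "CANCELLED" | "TIMED_OUT";

interface JobResponse {
  id: string;
  status: JobStatus;
  output?: unknown; // transcript payload; shape depends on the worker
}

interface HttpInit {
  method: string;
  headers: Record<string, string>;
  body?: string;
}

// Minimal fetch-like function so the loop can be exercised with a fake.
type HttpJson = (url: string, init: HttpInit) => Promise<JobResponse>;

interface TranscribeOptions {
  baseUrl: string;         // placeholder, e.g. "https://api.runpod.ai/v2/<endpoint-id>"
  apiKey: string;
  pollIntervalMs?: number; // default: 1000
  maxPolls?: number;       // default: 60
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function transcribe(
  audioUrl: string,
  http: HttpJson,
  opts: TranscribeOptions,
): Promise<unknown> {
  const headers = {
    "Content-Type": "application/json",
    Authorization: `Bearer ${opts.apiKey}`,
  };
  const maxPolls = opts.maxPolls ?? 60;

  // 1. Submit the job; the audio link is passed along, not the audio bytes.
  let job = await http(`${opts.baseUrl}/run`, {
    method: "POST",
    headers,
    body: JSON.stringify({ input: { audio: audioUrl } }), // assumed input shape
  });

  // 2. Poll until the job completes, fails, or we give up.
  for (let polls = 0; ; polls++) {
    if (job.status === "COMPLETED") return job.output;
    if (job.status !== "IN_QUEUE" && job.status !== "IN_PROGRESS") {
      throw new Error(`Job ${job.id} ended with status ${job.status}`);
    }
    if (polls >= maxPolls) {
      throw new Error(`Job ${job.id} not finished after ${maxPolls} polls`);
    }
    await sleep(opts.pollIntervalMs ?? 1000);
    job = await http(`${opts.baseUrl}/status/${job.id}`, { method: "GET", headers });
  }
}

// Adapter that plugs the real fetch into the loop.
const fetchJson: HttpJson = async (url, init) => {
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`HTTP ${res.status} from ${url}`);
  return (await res.json()) as JobResponse;
};
```

In the extension, the background script would call something like `transcribe(src, fetchJson, options)` when the injected script sends it an audio link, then message the result back to the page. Passing the HTTP function in as a parameter is only there to keep the loop easy to test without a network.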

Improvements:

This was really more of a “Proof of Concept” and personal tool than a project I could share easily. There’s tons that could be improved before something like this could hit the market. I would need user authentication, registration, and a pay structure to bill users for their API usage.

I think it could be useful for more than just Discord audio. Maybe have it understand Facebook, Instagram, Snapchat, YouTube, or maybe just audio embedded in any web page. That would be a task, but not unreasonable. If the only reason to use it were Discord voice messages, I fear that either the niche would be too small, or the chance of Discord implementing their own transcription service would be too high. They could wrap it into their “Nitro” premium service and make any commercial extension that’s too focused on just Discord obsolete.

The UI could use some updating. I think color indicators, instead of text, would be a good way to show which processing stage the audio is in.
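As a rough idea of what that might look like, here is a small sketch that maps each processing stage to a button color. The stage names and hex values are made up for illustration; the extension doesn’t define them today.

```typescript
// Hypothetical processing stages for a single voice message.
type Stage = "idle" | "queued" | "transcribing" | "done" | "failed";

// One color per stage; the values are arbitrary picks, not Discord's palette.
const STAGE_COLORS: Record<Stage, string> = {
  idle: "#99aab5",         // grey: nothing requested yet
  queued: "#f1c40f",       // yellow: job submitted, waiting for a worker
  transcribing: "#3498db", // blue: Whisper is running
  done: "#2ecc71",         // green: transcript is displayed
  failed: "#e74c3c",       // red: something went wrong
};

function stageColor(stage: Stage): string {
  return STAGE_COLORS[stage];
}

// In the injected script this would be applied to the circle button, e.g.
//   button.style.backgroundColor = stageColor("transcribing");
console.log(stageColor("done")); // prints "#2ecc71"
```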

Thoughts?

I use this extension regularly, and enjoy having it. I made it to solve a problem in my own life, which is one of my greatest motivators for creating. It’s really incredible how easy it is to interact with these AI models and solve problems that seemed much more difficult before their existence.

Discord Transcription