Over the past few months, I’ve been training a new mindset, a boss mentality, and I took it upon myself to level up my public speaking. I brainstormed three ways to do so: Authenticity, Enunciation, and Electric Torture!
For a less technical summary of my project, check out my YouTube video. Here, I’ll explain how everything actually works :)
The core function of this project is detecting when the user says a specific word (in this case, “Um”) and then forwarding that detection elsewhere. This is very similar to how Siri works, which I’ll break down later.
Using an Arduino, we can switch any electronic device on and off. We will be using both a siren and a TENS unit, a therapeutic device repurposed to electrocute me.
Audio Processing With AI
The first step is to detect when we say a particular word, meaning that the device will constantly be listening in on your conversation. Without a doubt, that raises privacy concerns, yet Siri does the same!
Not only is Siri listening to you, but it’s constantly turning your audio into a stream of waveform samples, 16 thousand per second, to be exact!
The conversion from audio to useful data starts with a spectrum analysis. This changes the waveform into a sequence of frames, each describing the sound spectrum of around a hundredth of a second. From here, we take 20 of the frames we just created (0.2 seconds of audio) and feed them to an acoustic model. This model uses a Deep Neural Network (DNN), which turns the acoustic patterns into a confidence score for how closely the audio matches the dataset of recordings of people saying “Hey Siri.”
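To make the numbers concrete, here is a rough Python sketch of the framing step. It is not Siri’s actual code: the function names are my own, the frame length of 0.01 seconds is an assumption (it is what 20 frames in 0.2 seconds works out to), and the naive DFT is only a stand-in for a real spectrum analysis.

```python
import cmath
import math

SAMPLE_RATE = 16_000      # samples per second, the figure quoted above
FRAME_SECONDS = 0.01      # assumed frame length: 0.2 s / 20 frames
FRAMES_PER_WINDOW = 20    # 20 frames = 0.2 seconds of audio
FRAME_SIZE = round(SAMPLE_RATE * FRAME_SECONDS)  # 160 samples per frame


def split_into_frames(samples):
    """Cut the waveform into back-to-back frames, dropping any leftover tail."""
    count = len(samples) // FRAME_SIZE
    return [samples[i * FRAME_SIZE:(i + 1) * FRAME_SIZE] for i in range(count)]


def magnitude_spectrum(frame):
    """Naive DFT: how strongly each frequency bin is present in one frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(total))
    return spectrum


def sliding_windows(frames):
    """Yield every run of 20 consecutive frames, the chunk handed to the model."""
    for start in range(len(frames) - FRAMES_PER_WINDOW + 1):
        yield frames[start:start + FRAMES_PER_WINDOW]


# Half a second of a 1 kHz test tone standing in for microphone audio.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE // 2)]
frames = split_into_frames(tone)
windows = list(sliding_windows(frames))
spectrum = magnitude_spectrum(frames[0])
peak_bin = spectrum.index(max(spectrum))  # each bin is SAMPLE_RATE / FRAME_SIZE = 100 Hz wide
```

With these assumed settings, half a second of audio becomes 50 frames of 160 samples each, and the 1 kHz tone shows up as a peak in bin 10 of the spectrum.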
The DNN isn’t anything special. Just like any other DNN, it takes multiple inputs and runs them through matrix multiplications and logistic nonlinearities. That’s right, a human can do all of this! Just not thousands of times per second 😉
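To show how little magic is involved, here is a toy forward pass in plain Python. The layer sizes and every weight below are made up for illustration; a real acoustic model is far bigger and its weights come from training, but the arithmetic inside each layer has this shape.

```python
import math


def logistic(x):
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases):
    """One layer: a matrix-vector multiplication, plus a bias, then the logistic squash."""
    return [
        logistic(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the inputs through each layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


# Made-up weights: 3 inputs -> 2 hidden units -> 1 output score.
TOY_LAYERS = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.2, -0.7]], [-0.3]),
]

score = forward([0.9, 0.1, 0.4], TOY_LAYERS)[0]  # a single number between 0 and 1
```

Every step here is a multiply, an add, or one call to `exp`, which is why you really could do it by hand if you had the patience.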
The main concern with a hands-free voice assistant is that it must be quick, require little power, and not take too many resources away from the phone. We do this by only listening for the trigger phrase (like “Hey Google”) instead of fully processing everything that is said. Devices like the iPhone have a specific chip for this purpose called the Always On Processor (AOP). It has access to the phone’s microphone and redirects the signal from the standard amplifier.
We have many goals in a wake word detection model. To summarize the process, we group the frames into 0.2-second windows, run them through the small DNN, confirm with a larger DNN, and then start Siri. This process happens more times per day than you’ll pick up your phone in the next decade. 🤯
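The two-stage idea can be sketched in a few lines of Python. Everything here is hypothetical: the function name, the thresholds of 0.5 and 0.8, and the toy scoring functions are placeholders I picked so the example runs, not values from any real assistant.

```python
def detect_wake_word(window, small_model, large_model,
                     first_threshold=0.5, second_threshold=0.8):
    """Cheap check first; only pay for the bigger model if the cheap one fires."""
    if small_model(window) < first_threshold:
        return False
    return large_model(window) >= second_threshold


# Toy stand-ins for the two DNNs, just so the cascade has something to call.
large_calls = []


def toy_small(window):
    return sum(window) / len(window)


def toy_large(window):
    large_calls.append(window)  # track how often the expensive model runs
    return max(window)


windows = [[0.1, 0.2], [0.6, 0.7], [0.9, 0.95]]
results = [detect_wake_word(w, toy_small, toy_large) for w in windows]
```

The point of the design is that the expensive model only runs on windows the cheap model already flagged. In this toy run, the first window never reaches `toy_large` at all.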
Now we know when the user says our wake word, but so what? This is when we start taking action. With a couple of cheap electronics, we can do anything from watering a plant to shocking people with nearly fatal amounts of electricity!
I personally used an Arduino as my microcontroller, as a Raspberry Pi is overkill, and there is plenty of documentation for Arduinos.
I won’t go in-depth as it’s quite simple, but we are essentially constantly making API calls to a web server hosted wherever the AI is running. If we get a response that says someone said “Um,” we trigger the relay, which switches the power on using an electromagnet. This then starts our overpowered, modified TENS unit, and power runs through the electrodes!
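Here is the polling logic sketched in Python rather than Arduino code, so treat it as an outline of the loop and not my actual firmware. The URL, the `{"detected": true}` response shape, the timings, and the function names are all assumptions for the example. The relay is passed in as a plain function so the loop can be tried without any hardware attached.

```python
import json
import time
import urllib.request

DETECTION_URL = "http://localhost:5000/detections"  # hypothetical endpoint


def fetch_detection(url=DETECTION_URL):
    """Ask the server whether an "Um" was heard (assumes a JSON reply like {"detected": true})."""
    with urllib.request.urlopen(url, timeout=2) as response:
        return bool(json.load(response).get("detected", False))


def poll_and_shock(fetch, set_relay, polls, pulse_seconds=0.5, interval=0.1,
                   sleep=time.sleep):
    """Poll `polls` times; close the relay briefly whenever a detection comes back."""
    shocks = 0
    for _ in range(polls):
        if fetch():
            set_relay(True)    # relay closed: power reaches the TENS unit
            sleep(pulse_seconds)
            set_relay(False)   # relay open again
            shocks += 1
        sleep(interval)
    return shocks


# Dry run with canned answers and no real waiting, so nothing gets zapped.
fake_answers = iter([False, True, False, True])
relay_log = []
shocks = poll_and_shock(lambda: next(fake_answers), relay_log.append,
                        polls=4, sleep=lambda seconds: None)
```

To hit a real server, you would pass `fetch_detection` as `fetch` and replace `relay_log.append` with whatever actually drives the relay pin.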
Although this wasn’t the most useful or impactful project, I learned a lot and had some fun! In summary, I now understand exactly how voice-activated devices work, how to connect ML models to IoT, and lastly, how to remove filler words from my speech ;)