I'm sure this works, but from my recent experience you need your STT on a machine more powerful than a Pi atm. Tiny models are just not accurate enough, and the bigger ones need more than the Pi has to give any sort of decent response time. Given how far this has come in the last two years, I look forward to where it'll be in two more.
One of the largest improvements imo has been microWakeWord and the ability to run the hotword detection “on device”, but I believe it only runs on ESP32 devices, so it's not an option if you want everything on a Pi.
I spent a little bit of time getting a fully local voice pipeline set up in Home Assistant last month, and I’d say it is near perfect (after adding a few additional community integrations), with the exception of the microphones on current hardware. I look forward to the next HA voice device from Nabu Casa.