Developer’s Log: Adventures in Voice ID and Text-to-Speech

Hey there, fellow code enthusiasts! Today, I want to share a little journey I embarked on while tinkering with voice ID and text-to-speech technology. It’s been quite the adventure, filled with moments of discovery, problem-solving, and the occasional “aha!” moment that makes programming so rewarding.

So, picture this: I’m sitting at my desk, surrounded by the usual developer’s workspace chaos - a mix of gadgets, adapters, and various odds and ends that seem to accumulate mysteriously. My wife isn’t the biggest fan of my “organized clutter,” but hey, that’s where the magic happens, right?

I decided to dive into improving my Rubber Ducky project, a nifty little tool I’ve been working on. The goal was to enhance its voice capabilities, specifically by making it easier to select different voices without having to remember complex alphanumeric IDs. You know how it goes - you create something cool, but then realize there’s always room for improvement.

The first step was to update the voice selection process. Until now, choosing a voice meant passing the v flag a long, unmemorable alphanumeric ID, which sent me back to the source file every time to look it up. I wanted to pass simple, human-readable names instead. It’s one of those quality-of-life improvements that can make a big difference in usability.

As I started coding, I found myself in that familiar state of “rubber ducking” - talking through the problem out loud. It’s funny how often this technique leads to solutions, isn’t it? There’s something about vocalizing your thoughts that helps clarify the path forward.

I began by updating the `voices` variable to be a hash, mapping friendly names to the actual voice IDs. This approach allows users to input something like “me” or “android” instead of a cryptic string. It’s a small change, but one that makes the tool much more intuitive to use.
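Here’s a minimal sketch of that mapping, not the actual Rubber Ducky code. The constant names, the placeholder IDs, and the `resolve_voice` helper are all assumptions made for illustration; real voice IDs are opaque alphanumeric strings.

```ruby
# Placeholder IDs (assumed for this sketch; real IDs are opaque strings).
ME_ID      = "PLACEHOLDER_ME_VOICE_ID"
ANDROID_ID = "PLACEHOLDER_ANDROID_VOICE_ID"

# Friendly name => voice ID. Keys are symbols.
VOICES = {
  me: ME_ID,
  android: ANDROID_ID
}.freeze

# Hypothetical helper: look up the ID for a friendly name and
# fail loudly with the list of valid names when it is unknown.
def resolve_voice(name)
  VOICES.fetch(name) do
    raise ArgumentError,
          "unknown voice #{name.inspect}; expected one of: #{VOICES.keys.join(', ')}"
  end
end
```

Using `fetch` with a block instead of `VOICES[name]` means a typo in the name raises an error right away rather than quietly passing `nil` along as the voice ID.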

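Of course, as with any coding session, I ran into a few hiccups along the way. The existing random sampling feature picks a voice at the top of the speak method whenever no voice is passed in, and with `voices` now a hash, that sample has to be drawn from the keys rather than from a list of IDs. It’s always a balancing act, isn’t it? Improving one aspect while making sure you don’t break another.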
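A rough sketch of how sampling can keep working once the voices live in a hash. `SketchClient` and `voice_id_for` are made-up names for illustration, the IDs are placeholders, and nothing here talks to a real text-to-speech service.

```ruby
# Hypothetical stand-in for the client; no network calls are made.
class SketchClient
  VOICES = {
    me: "PLACEHOLDER_ME_VOICE_ID",
    android: "PLACEHOLDER_ANDROID_VOICE_ID"
  }.freeze

  # Returns the voice ID that would be sent along with the text.
  # When no name is given, a random name is sampled from the keys,
  # then resolved through the same lookup as an explicit name.
  def voice_id_for(voice_name = nil)
    voice_name ||= VOICES.keys.sample
    VOICES.fetch(voice_name)
  end
end
```

Sampling a key and then resolving it keeps one code path for both cases: an explicit name and a random pick end up in the same `fetch`.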

One interesting challenge was deciding where in the code to convert the input string to a symbol. Command-line arguments always arrive as strings, while the hash keys are symbols. Converting deep inside the speaker class would have worked, but a conversion buried there is easy to lose track of later. I opted to convert in the script itself, right where the argument comes in, so the rest of the code can count on receiving a symbol.
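To illustrate converting at the command-line boundary, here is a small sketch using Ruby’s standard `OptionParser`. The `parse_options` function, the `--voice` long form, and the shape of the options hash are assumptions for this example, not the real script.

```ruby
require "optparse"

# Parses argv and returns an options hash. The voice name is turned
# into a Symbol here, where the string first enters the program, so
# nothing further down the chain has to convert it again.
def parse_options(argv)
  options = { voice_id: nil }
  OptionParser.new do |opts|
    opts.on("-v", "--voice NAME", "Voice name, e.g. me or android") do |name|
      options[:voice_id] = name.to_sym
    end
  end.parse(argv)
  options
end
```

Called as `parse_options(["-v", "android"])`, this hands back a symbol under `:voice_id`; when the flag is omitted the value stays `nil`, which leaves room for the random sample to kick in later.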

As I worked, I found myself thinking about future improvements. There’s always more to do, isn’t there? I started jotting down ideas for additional features - things like the ability to toggle post creation on and off, or creating different profiles for various output styles. It’s exciting to see a project grow and evolve like this.

After implementing the changes, it was time for the moment of truth - testing. I created a small test script, choosing to try out the “android” voice. There’s always a mix of anticipation and nervousness when you run that first test, isn’t there?

The result was… interesting, to say the least. The generated text wasn’t entirely accurate: the test recording was only about ten seconds long, so the model filled in the gaps and mentioned Amazon Polly, which I hadn’t actually used. Still, it showcased the potential of the system. It’s a reminder that AI-generated content, while impressive, still needs human oversight and editing.

This little adventure in coding reminded me of why I love development. It’s these small improvements, these little victories, that make the process so satisfying. Sure, there’s still work to be done - refactoring here, optimizing there - but that’s part of the fun, isn’t it?

For those of you thinking about diving into similar projects, I say go for it! Start small, be patient with yourself, and don’t be afraid to talk to your rubber duck (or whatever inanimate object you prefer). You never know where your coding adventures might lead you.

As I wrap up this log entry, I’m already thinking about the next steps. Maybe I’ll explore that Node.js rebuild I’ve been considering, or perhaps I’ll dive into creating that React Native front-end. The possibilities are endless, and that’s what makes development so exciting.

Remember, fellow coders, every line of code is a step forward. Sometimes you’re sprinting, sometimes you’re stumbling, but you’re always moving. Keep experimenting, keep learning, and most importantly, keep having fun with it.

Until next time, happy coding!

Original transcript

Just doing a little improvement on Rubber Ducky. And honestly I came to this little improvement, decided to document it, just to share a little bit of rubber ducky. Parts of it are mess, parts of it need some refactoring, lots of it needs tests. But it’s been about 30 minutes here, actually change, making a change based on me attempting a rubber duck. You’re not attempting one. Attempting to come up with a title for something else. I wanted to share as a post and as I was writing that out, that command out, and I was trying to come up with a title, I came up with a title and then I go on to, oh yeah, I do want to use eleven labs for the voice. And as remarkable as it is to, well, let me step back. I wanted to use my voice again. Well, these are all voice ids. As you can see here. I’ve added a number of voices. The first one is this one here that I created. And in fact I’m going to have to, I’m going to change that to me id as the constant name here. And what I wanted to do, because in the command here I would, as I have it written right now, I would be passing in the v flag and passing in the string of that alphanumeric, or actually, yes, alphanumeric, uppercase, lowercase, combination string is that id that unique id and pass that in. And as you can imagine, obviously I’m not going to remember G capital AI, blah blah blah. So of course I’m always coming back to this file. And a while back, as I was writing and expanding on this and adding these voices, testing them out, and also what you’re going to see as a result of this rubber ducky itself that will become this post is that you can see I’m sampling for fun. There’s some fun voices in there. There are some more serious voices, british voices, etcetera, some weird voices I created and on and on.
But to get to my point, what I am going to do here today is something that as I was expanding on this, I realized I was going to likely quickly want to do, and that is to be able to pass in maybe a type in the command, something where I can specify by id name. So we may have it in enumerable, a dictionary, a hash, some sort of key value in the code itself to map these key names, keywords to the ids, to be able to pass it at the front. So one thing I can imagine doing for that would be just flag that specifies, well, I can update the v flag, which is the flag for the voice id to pass in a name. So like for my voice, it would be me. And I think we can then do a little bit of not too much crazy metaprogramming to make it easy on us mapping those to the constants that already exist. So I’m going to try that for right now. Let’s go into our posts and these tests. This will be an interesting play here if you think about it. Right now I’m recording the video to capture the screen I am recording with rubber Ducky, which will end up as this actual post. And the video I will because especially at 30 minutes, it gets quite heavy and just processing, converting it to webm and then even it’s heavy for, well, for GitHub to add these clips. And of course, you know, I go down the path of thinking of managing those. But sometimes you have to think ahead. Do I want to have that problem? Anyway? So with respect to that, realize most of this is visual. If I decide to share, well, I will share the clip of this actual, the raw audio of me speaking to, well, myself speaking to you in the future. Hopefully. If you’re listening to this right now, look at your speakers and realize that across time I have reached to you and I hope someone’s listening. But this raw audio is being recorded. And of course, as I’ve mentioned before, that will become a transcript and that will become also a rendered created post in style. As you see, I’m going to repeat myself a lot and that’s okay. 
But what’s fun here is as I test out, I’ll be creating little tests and I’m going to create them in the posts directory so that I can maybe use and pull them in. It could be kind of a fun meta experiment. It could also get messy. Regardless, I’m going for it. So yeah, we’re going to test some things out. Let’s get started. Here we have the eleven labs client that I wrote. This is responsible for communication with eleven labs to pass it, the text to be spoken, and gets back the audio data and saves that to a file. And that will be another episode to improve this. Don’t laugh too hard. It’s pretty, pretty basic right now and could use a little help. It’s not too terrible, but for right now it’s working well. And you know, one big thing I think that we’ll want to look at in the future is optimizing. When I mean, like this for example, is probably, well, the thing is that passing to eleven labs, you’re passing text. So really the only area in which the length of this recording matters is passing it to AssemblyAI. And that’s where yet another improvement I want to make is we want to update the SoX recorder to trim out silence, etcetera. I want to reduce the file size there. And I also want to utilize for a time based recording instead of specifying a time that you could record but then stay silent for however many seconds that it will then automatically stop. So there’s plenty of path to go down and I’m going to mention them and try not to get pulled off. But the whole point of this too is to be conversational, to just rubber ducky, as it were. So that’s what I’m doing. I’m rubber ducking. Rubber ducking. I’m ducking. Who knows what that really means deep down. But yeah. So let’s start with our rubber ducky script and we will go to our voice id. And right now, hey look, it’s just passing in. Do we want to update this to be voice? We’ll keep a voice id. I mean, the name is still an id for that voice. It’s just another abstracted layer.
It’s an id for the id for that voice. So we will keep it that way. So I don’t think the script will need to change because it will come in, it will go into our YT transcriber and if I recall really we don’t really, we are passing the voice id down and this is another refactor, I think that will greatly improve this as it grows is we have things where like we have to, if we want to specify at the command line level, hey, I want to use this particular voice when you speak it back to me. We have a chain here where I believe voice id just gets passed down to next object. Next object until we get to our speaker class in which the voice id is used for. Yeah, it’s just passed the eleven labs client. So yeah, so at this point, like we can just say, yeah, it’s passed in. It’s going to be a string of me, for example, in this case of trying to test this out. So then we can expect in the eleven labs client that where is it? Yes. So our speak and this is really just a direct pass in. We could, I like the sampling. So I’m just going to keep it this way. And then the voice id where we actually use it. Here’s what we need to do is because invoice id right now is a string of me and we want to say, yeah, how about what if we had a method that was voices to which we passed in that voice id. Wow, voice looks weird to me. Now we pass that voice id in and that gives us the actual map to the actual string of the voice id. And we’re just going to plop that in, in place of voice id here for now. Okay. And then we are going to go ahead and add that method. Let’s just put it here for the moment. And that is going to be, where did it go? Ah yes, voices. Oh, actually this is what we need to do instead of a method. And in fact I’m wondering at this point if we shouldn’t. Is that a rails active record and is it a gem even or. No, it’s built in, but it is active record that I was called me to think about that, to manage. But we’re not doing a rails app and we don’t want a bunch of bloat. 
We’re not storing anything yet at this point. So let’s stick to in this case, actually, let’s not even make this a method. All we really need to do, a simple way to do this instead is to, oh yeah, well, what we can do is update voices and let’s say, oh man. What I’m thinking here, and let me just rubber duck you through it, what I’m thinking here now is that we update voices to be a hash. So then we get the key value, or we can get the key me to the constant MeID, which gets the value for the iD, etcetera. However, then we need to rethink how we’re doing the sampling. If we’re doing a random sample, we’re using that as a resetter at the beginning of the speak method so that we can not pass a voice id in. So if it’s nil, then it will get a sample. However, then yes, passing in the abstracted like me voice id. So if that’s passed in only me, that gets passed to the voices, etcetera. I mean, you know, something I could do is we could just add, let’s do this right now. So since we’re going off the keys here instead, we’re going to say voices, keys, sample, and then the voices, the voice id. Oh, also should we, because we were wanting these to be symbols. So we’re going to need to think about above here. So let’s go back to our speaker and say in our speaker that we go ahead and let’s pass that in to_sym here. And I have some, I have some leaky feelings about what I am doing here to try to smooth this transition. So I mean, obviously we pass in the command line argument they’re going to be strings. Actually, you know what, what makes the most sense instead of doing it? The speaker in that case, because we’re really making an adaptation for, as I just rubber duckies there we are making an accommodation, a translation from the command line, which, as I stated, the arguments, they’re always going to be strings. Our program doesn’t even want the voice id until now at the bottom or at the end or at this point it passes it through, etcetera.
We’re not going to deal with that problem right now. But what it does, what I recognize from that where a rubber ducky is helping again, just talking through these things and that is that. Well, then that means at that point, let’s not get it here because, you know, and here’s something start to recognize as you do these things over and over again. Is that like, yeah, I could fix it there. But what I recognize immediately is that that’s easily, easily buried. And later on, say something switches and we want strings for some reason and we go over and over again and it’s like, God, somewhere it keeps changing to a symbol and, you know, not too hard to find a to_sym to act for that, etcetera. But yes, so that could get buried. It could become kind of a nuisance later, nothing major, but you never know. So my point then is that we go from our script, which is going to take it into the string, but that’s at the point where we should just change it. Okay, so I mean, we could, let’s do it here at this point, we know we’re always going to want that to be to_sym, okay. And my goodness. Okay. So that guarantees we have a sym coming in. And so the symbol me comes in, it gets set as a voice id. Now, then it’s going to get the voices by that, the voices value by that key. So now we need to update this to be a hash and update it with the keys and whatever keys we want because that is what the string will be. That is the argument we pass in with our command line. And I did mention I was going to try some sort of meta to get the constants and all that. But I like this better. It’s a little bit more clear what’s going on. I was going to like get the const names and down case strings, all that stuff, but nope, let’s not do that. Don’t worry about it. Okay. And this will be really fun for the last few minutes here. To test this out with another rubber ducky and hopefully it works out. We can hear the result of a small clip. Alright, so now what we’re gonna do, we have this.
Let’s go and build it. So let’s see here. We’re going to just go ahead and make sure my scripts are updated. We’re going to run oops, our builder here, duckbuild and we’ll build. I should have updated the version number. I can still do that. Let’s go to the gemspec and we’re just going to do a very minor update. Version, update, version and blah. I’m about done here. 30 minutes is about good for getting some. This is great. So yeah, we’re going to build this version and test it out. And actually we’re going to do that in the posts here. So let’s see here. We’re going to do just a little short one, let’s say about 10 seconds. Title test, voice id by name and. Oh yes. And so we’re going to make sure it’s a eleven labs speaker and we want to give it a voice of me. Actually, you know what, for fun, I think we should test out one of these other ones. Let’s see. Let’s try. Let’s try Android. How about that? That might be fun. Okay, so let’s see if this works. Okay, so now we’re testing with passing a voice id by name, in this case Android, into our command line and yeah, let’s see what the result is there. Let’s make sure my volume is up so you can hear it. It’ll be interesting too to hear more voices going back in. This is also going to create, it may have to write out the final minutes. Who knows how long this will end up being because, yeah, again, right now. Okay, side note, another feature we’re going to do, we want to be able to flag on and off, like whether we want to post out of this or not because yeah, we’re going to have things like this where like we just want to make a little small snippet. There’s so many different features and flags to go on and I think at some point I need to start a document and like a table of these features and start prioritizing them because this actually has a lot of potential for variety of uses.
But yeah, again, a little thing like that right now by default and without any control, it’s because of the train that I have been running to get it to this point. Some things like this, it’s like, well, it’s brand new so I wrote it to just do it right. So it’s writing a story or a developer log post the way it’s stylized right now. Another thing, I want to create profiles that we can pass in for different styles from the command line of rubber ducky. So you can say rubber ducky. Here’s a blog post based on the transcript written in the style you requested. Developers log adventures in text to speech hey there, fellow code wranglers and curious minds. Today I want to share a little adventure I had while tinkering voice id thing to speak technology. It’s been quite the journey, filled with moments of triumph, frustration, and plenty of coffee fueled coding sessions. So picture this. I’m sitting at my desk surrounded by empty energy drink cans and half eaten snacks. Don’t judge, we’ve all been there. When I decide to dive into the world. Okay, I’m pausing that for a second. Yes, my desk is cluttered, but not with empty energy drinks. If anything. Maybe. And I just moved it to the kitchen, maybe one empty sparkling water can. And what was the other one? Oh, snack wrappers or whatever. Yeah, maybe. But like, I have a trash can here now. Like, yeah, don’t judge. But no, I have plenty of my own mess. Various things for artwork and utilities, cable management stuff, you know, little odds and ends, adapters, wire wraps, etcetera. A few steps in here. Like my wife doesn’t like clutter. It’s not her favorite place. I grew up in the south mostly, where, I don’t know, I put things up on the wall, whatever trinket I think of. Anyhow, I digress. The mess is different. But yes, I was right. Or my Android voice was right. Of text to speech. Why? Well, why not? It’s not like I had any pressing plans to binge watch another series or reorganize my sock drawer for the umpteenth time. 
First things first, I had to choose a text to speech API. After some research and maybe a few coin flips, I settled on Amazon Polly. Now I know what you’re thinking. Okay, see there, now there’s a, not a hallucination, as they call it, but a guess that, because what was it? Yeah, it was a ten second clip. So I mean, in that clip I didn’t make a AssemblyAI, so it filled in. And as you can imagine, like, I mean, I could go and edit that, post that and go. There’s going to be a lot of cases where I do that, where I’m going to have to edit the text before I post it. If I want it to be accurate. Like I hadn’t heard of Amazon Polly I might check that out but like so it, it guessed and I may add to the prompt something, you know that’s the cool thing. Like look if you want to see and don’t judge but right now I just have this literally in a sandbox. I’m in the midst of building up the node side of rubber ducky essentially I was going to rebuild it in node and we’ll see how that goes. But, and the reasons for that is because I want to build a react, react native front end to this at some point. But yes anyhow this right here the kind of descriptor in this text for we got about 20 seconds here but this is for Claude to style it. So yeah we can update that and tell it I don’t know, you can be like I don’t know, be more sunny, be more salty. Don’t I could say maybe don’t fill in specific.