Finding sound in books, or my data science project. Week 5
ANOUK DYUSSEMBAYEVA | OCTOBER, 10 / 2020
I started a project in hopes of creating software that can analyze how much sound there is in a certain book, and whether it could add sound to the text in real time to produce an audio drama on the go. Here is what I did during this week to get closer to my goal. To read the introductory article, click here.
Photo by Lukas from Pexels
In the previous article, I was trying to create a Graphical User Interface (GUI), where everyone would be able to upload their book and discover how much sound it has. I was experiencing some issues with tkinter, which opened my eyes to the bigger picture: I was deviating from my original goal. "Creating this GUI won't get me closer to developing software that would add actual sound to text," I thought.

What would, on the other hand, is using a free sound bank's API. I found a few free sound bank APIs, but I was clueless about how to work with them. Back in 2019, I tried to master API implementation, but, as you may know, every API is different and has its own tricks. Furthermore, when I looked back at my current code, it was centered around text, not audio. This left me completely puzzled: the more I pondered how exactly I would attach sound to text, the more dumbfounded I was.

In fact, I was quite lost, and reached out to Kuanysh Abeshev, the Dean of School of Engineering at Almaty Management University, to ask for advice. Originally, I contacted him to ask a quick question about how I should solve the problem with tkinter. While on Zoom, however, the conversation took a 180-degree turn when I said I didn't know how to continue with my idea.
Kuanysh asked me how I was going to incorporate the API sounds into text, and I realized that it was impossible. That meant there first needs to be an audio version of the book, on top of which the sounds will be layered. He suggested I record myself reading an excerpt of the novel and use an algorithm to transform the speech to text. At this point, all of my previous work with PDFs seemed useless. However, speech to text is not always accurate, so double-checking the transcript against the PDF or text version of the book still makes a lot of sense.

Once that is accomplished, I would have to run my code that compares the sound words list I created earlier to the audio (by then it would be in text format), and only then can I begin integrating sound bank APIs.
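To make the comparison step more concrete, here is a minimal sketch of how a transcript could be checked against a sound words list. It is not my actual project code: the function name `find_sound_words`, the three-word list, and the sample sentence are all made up for illustration.

```python
import re

def find_sound_words(transcript, sound_words):
    """Return (word_position, word) pairs for every sound word in the transcript."""
    wanted = {w.lower() for w in sound_words}
    # Lowercase the text and split it into plain words, dropping punctuation
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return [(i, tok) for i, tok in enumerate(tokens) if tok in wanted]

# Hypothetical inputs: a tiny sound words list and an invented sentence
sound_words = ["hiss", "roar", "whisper"]
transcript = "The hose began to hiss, and the fire answered with a roar."

hits = find_sound_words(transcript, sound_words)
print(hits)  # [(4, 'hiss'), (11, 'roar')]
```

Keeping the word positions, and not just a count, could be useful later, since a sound effect eventually has to be placed at a specific point in the audio.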

I felt relieved, since I was starting to get a sense of the direction this project was going to take. Grabbing my iPhone, I recorded myself reading a paragraph from Fahrenheit 451, as a nod to the very first book I analyzed.

Like with everything I do when it comes to data science, issues arose even before I moved on to the implementation stage. Because I was using Voice Memos, the audio was in mp4 format, but for the speech recognition to work, the file has to be wav. I thought this wasn't going to be a problem and just renamed the file from "Fahrenheit 451.mp4" to "Fahrenheit 451.wav", assuming this was going to cut it. It didn't: renaming a file changes only its extension, not how the audio inside is encoded, so nothing in my Jupyter Notebook worked. I had to use the Online Audio Converter to make it a real WAV file.
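In hindsight, a quick check would have shown that my renamed file was not a WAV file at all. Here is a small sketch using only Python's standard library: a genuine WAV file begins with the bytes "RIFF", followed by four size bytes, followed by "WAVE". The helper names `looks_like_wav` and `write_silent_wav` and the file names are my own inventions for this example, and the "renamed" file is just a few stand-in bytes, not a real recording.

```python
import os
import tempfile
import wave

def looks_like_wav(path):
    """Return True if the file starts with a RIFF/WAVE header."""
    with open(path, "rb") as f:
        header = f.read(12)
    return len(header) == 12 and header[:4] == b"RIFF" and header[8:12] == b"WAVE"

def write_silent_wav(path, seconds=1, rate=8000):
    """Write a short, silent, mono 16-bit WAV file for the demo."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * rate * seconds)

folder = tempfile.mkdtemp()

# A real WAV file, written by the wave module
real_path = os.path.join(folder, "real.wav")
write_silent_wav(real_path)

# Stand-in for a file that was only renamed: the name says .wav, the bytes do not
renamed_path = os.path.join(folder, "renamed.wav")
with open(renamed_path, "wb") as f:
    f.write(b"\x00\x00\x00\x18ftypM4A \x00\x00\x00\x00")

real_is_wav = looks_like_wav(real_path)
renamed_is_wav = looks_like_wav(renamed_path)
print(real_is_wav, renamed_is_wav)  # True False
```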

Surprisingly, the code from the tutorial I found on PythonCode was immaculate: it was able to print out every word I said accurately. The only thing was that it only did that for the first minute of the audio.
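import speech_recognition as sr

r = sr.Recognizer()
audio = "Fahrenheit 451.wav"

# Open the WAV file, read it into memory, and send it to Google's web speech API
with sr.AudioFile(audio) as source:
    audio_data = r.record(source)
    text = r.recognize_google(audio_data)
    print(text)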
The code in my Jupyter Notebook
After some Google research, I learned that this code really does transcribe only about the first minute of any given audio file. If your audio is longer, there is one way to proceed: divide the file into chunks and apply the speech-to-text algorithm to every chunk. The chunks can be created either by setting a specific time after which a new chunk starts (for example, every ten seconds) or by paying attention to pauses in the audio (every time the narrator is silent). The latter makes more sense, because when you are dividing an audio file by a fixed time length, you can cut a word in half or break a sentence, leading to loss of context and meaning.
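To illustrate the difference between the two chunking strategies, here is a toy sketch. It is not the code from the tutorials I found: instead of a real audio file, it works on a plain list of loudness values, and the function names `split_fixed` and `split_at_pauses`, the threshold, and the numbers are all assumptions made up for the example. Real audio libraries offer ready-made versions of the pause-based approach.

```python
def split_fixed(samples, size):
    """Cut the samples into chunks of a fixed length, ignoring their content."""
    return [samples[i:i + size] for i in range(0, len(samples), size)]

def split_at_pauses(samples, threshold=0.05, min_silence=3):
    """Cut the samples wherever at least `min_silence` quiet values occur in a row."""
    chunks = []
    current = []    # samples of the chunk being built
    quiet_run = 0   # number of consecutive quiet samples seen so far
    for s in samples:
        if abs(s) < threshold:
            quiet_run += 1
            if current:
                current.append(s)
                if quiet_run == min_silence:
                    # The pause is long enough: close the chunk without the pause
                    chunks.append(current[:-quiet_run])
                    current = []
        else:
            quiet_run = 0
            current.append(s)
    if current:
        # Close the last chunk, dropping any short quiet tail
        chunks.append(current[:len(current) - quiet_run])
    return chunks

# Two "phrases" separated by a long pause; the first contains a short pause
samples = [0.5, 0.6, 0, 0.4, 0, 0, 0, 0.7, 0.8, 0]

print(split_fixed(samples, 3))
# [[0.5, 0.6, 0], [0.4, 0, 0], [0, 0.7, 0.8], [0]]
print(split_at_pauses(samples))
# [[0.5, 0.6, 0, 0.4], [0.7, 0.8]]
```

The fixed-length version cuts the first phrase in two, while the pause-based version keeps each phrase together and tolerates the short pause inside it.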

Now, here's a story to top the article off. The code for processing longer audio that I discovered on the first website ran without errors. The only thing was that it wasn't producing any output at all. I was frustrated and confused, but not ready to give up, so I ended up scouring what seemed like 50 other websites (it was probably around 20). All of them had a somewhat similar solution, but none of them got me any closer to finding the right formula.

The search lasted two days, yet I simply couldn't find the answer. Feeling exhausted, I chose to look back at the very first code. And I realized that it was a function. Of course it wasn't producing any output: the code only defined the function, and I still had to call it with an input, the audio file! When I did just that, the machine sprang to life, and in less than a minute I got the complete transcription of my audio into text.
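For anyone who might fall into the same trap, here is a tiny sketch of what happened. It is not the tutorial's code: `transcribe_chunks` and `fake_recognizer` are hypothetical names, and the fake recognizer simply upper-cases text so the example runs without any audio or speech library.

```python
def transcribe_chunks(chunks, recognize):
    """Run `recognize` on each chunk and join the pieces into one transcript."""
    pieces = []
    for chunk in chunks:
        text = recognize(chunk)
        if text:  # skip chunks where nothing was recognized
            pieces.append(text.strip())
    return " ".join(pieces)

# Running only the definition above prints nothing, which is exactly what confused me.
# The output appears once the function is called with an input.

def fake_recognizer(chunk):
    """Stand-in for a real speech-to-text call (an assumption for this demo)."""
    return chunk.upper()

transcript = transcribe_chunks(["it was a pleasure", "to burn"], fake_recognizer)
print(transcript)  # IT WAS A PLEASURE TO BURN
```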