Finding sound in books, or my data science project. Week 2
ANOUK DYUSSEMBAYEVA | SEPTEMBER 8, 2020
I started a project in hopes of creating software that can analyze how much sound there is in a given book, and whether it could add sound to the text in real time to produce an audio drama on the go. Here is what I did this week to get closer to my goal. To read the introductory article, click here.
Photo by Tim Gouw from Pexels
The very first book I chose to analyze was Fahrenheit 451, because my intuition told me it probably has a lot of sound. The results I achieved last week showed that the novel has 123 sounds, which seemed abnormally low to me. Looking back at the function I used to count the sound words, len(set(keywords) & set(word)), I realized that it only counts each word once.
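To see why a set intersection undercounts, here is a minimal sketch with made-up word lists standing in for the real keyword list and the extracted text:

```python
# Toy example: 'keywords' stands in for the sound-word list and
# 'word' for the tokens pulled out of the PDF. These are not the
# project's real lists, just an illustration.
keywords = ["buzz", "hum", "whisper"]
word = ["buzz", "buzz", "buzz", "hum", "quiet"]

# A set keeps each element only once, so the intersection counts
# unique matching words, not occurrences.
unique_matches = len(set(keywords) & set(word))
print(unique_matches)  # 2 -- 'buzz' counts once despite appearing three times
```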

Taking the first word that I saw from the list, 'buzz', I searched for it in the PDF out of curiosity. It came as no surprise that it appeared at least three times before I became too bored to scroll through the rest of the text. Because my hypothesis turned out to be true, I knew this issue should be the first on the fix list.

After trying all the variations of Google searches you can imagine, I got the advice to use loops. When I first heard this, I felt very embarrassed — this was one of the very first topics that we covered in our data science course last year. Too often we try to search for answers by applying gigantic formulas when all we really need to do is use our basic skill set.

At the same time, I noticed that the text still had words starting with capital letters, which could hinder the analysis by skipping some words that produce sound. While I knew which function to use, text.lower(), I had no idea where to put it; through trial and error, placing it inside the code that extracts the text from the PDF into a readable format finally worked.
FRN451 = open('F451.pdf', 'rb')
F451 = PyPDF2.PdfFileReader(FRN451)

# Discerning the number of pages will allow us to parse through all the pages.
num_pages = F451.numPages
count = 0
text = ""

# The while loop will read each page.
while count < num_pages:
    pageObj = F451.getPage(count)
    count += 1
    text += pageObj.extractText()
    text = text.lower()

# This if statement checks whether the above library returned words.
# It's needed because PyPDF2 cannot read scanned files.
if text != "":
    text = text

# If the above returns False, we run the OCR library textract to
# convert scanned/image-based PDF files into text.
else:
    text = textract.process(fileurl, method='tesseract', language='eng')
The code in my Jupyter Notebook
I also wanted to update my list of sound words with more musical terms, including composition names such as concerto and instrument names like violin.
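Extending the keyword list is a one-liner; the terms below are illustrative placeholders, not the project's actual additions:

```python
# Hypothetical sketch of growing the sound-word list with musical
# vocabulary; the real lists in the project are longer.
keywords = ["buzz", "hum", "whisper"]
composition_terms = ["concerto", "sonata", "symphony"]
instrument_names = ["violin", "piano", "flute"]

keywords.extend(composition_terms + instrument_names)
print(len(keywords))  # 9
```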

In order to make a loop, I first had to transform the result we received, set(keywords) & set(word), from a set into a list using list().
combination = set(keywords) & set(word) 
a = list(combination) 

for i in a:
    print(keywords.count(i)) 
Essentially, the loop goes through every sound word we found earlier and prints how many times it appears in the original book text. The next step was to add all of those numbers together to get the total count of sound words in Fahrenheit 451.

While this took much longer than expected, I used an empty list to which each output of the loop is added with the .append() function. Because the counts were stored in a list, I could simply sum them to get a total of 1015. This means there are 1015 words that could make sound in Fahrenheit 451, a result far more believable than 123.
wordfreq = []
for i in a:
    wordfreq.append(keywords.count(i)) 
    
print(wordfreq) #prints the table of individual values
print(sum(wordfreq)) #prints the total of sound words
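As an aside, the same count-and-sum can be written more compactly with collections.Counter from the standard library. This is only a sketch with placeholder lists (book_words and sound_words are stand-ins, not the project's actual variables):

```python
from collections import Counter

# 'book_words' stands in for the tokenised book text and
# 'sound_words' for the curated keyword list.
book_words = ["buzz", "buzz", "hum", "quiet", "buzz", "hum"]
sound_words = ["buzz", "hum", "whisper"]

sound_set = set(sound_words)
# Count only the tokens that are in the sound-word list.
counts = Counter(w for w in book_words if w in sound_set)
total = sum(counts.values())

print(counts)  # Counter({'buzz': 3, 'hum': 2})
print(total)   # 5
```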
Even though I seemed to have arrived at a believable answer, when looking back at the words shared between the text and the list, I spotted a few terms that couldn't make sound at all: air, happening, touch, rather, among others. Because I compiled the list from words found on different websites, some of those lists contained terms irrelevant to the goal at hand. After deleting such outliers and running all of the above commands once again, the total number of sound words in the book became 558.
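Pruning the outliers before recounting can be done with a list comprehension; the outlier set below uses the examples named above, while the keyword list itself is illustrative:

```python
# Remove terms that cannot actually make sound from the keyword
# list before rerunning the counts. The keyword list here is a
# placeholder; the outliers are the ones mentioned in the text.
keywords = ["buzz", "hum", "air", "touch", "rather", "whisper"]
outliers = {"air", "happening", "touch", "rather"}

keywords = [w for w in keywords if w not in outliers]
print(keywords)  # ['buzz', 'hum', 'whisper']
```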

Next week, I will be comparing different books to determine how much sound each has and whether there are any correlations between authors and the amount of sound words in their books, as well as other things.

Subscribe to the newsletter and stay updated for any advances in my data science project!