Finding sound in books, or my data science project. Week 3
ANOUK DYUSSEMBAYEVA | SEPTEMBER 15, 2020
I started a project in hopes of creating software that can analyze how much sound there is in a given book, and eventually add sound to the text in real time to produce an audio drama on the go. Here is what I did this week to get closer to my goal. To read the introductory article, click here.
As you already know, the first book I analyzed was Fahrenheit 451, because there were a lot of sound words to be found just by looking at the first page. Although I've solved issues such as sound terms not being counted when they repeat throughout the book, it turned out that I was nowhere near perfection. Yet.

I made the text all lower case so that 'Buzz' and 'BUZZ' would be interpreted the same as 'buzz'. However, I had no clue how to make the words in the text stick together instead of randomly breaking. This sounds confusing, so here's an example: when the text was printed, words such as 'and' were split into 'an' and 'd', which caused them to be read as two separate terms.

After some time, and some Stack Overflow, I decided to alter my loop, where all the text was being read through and printed, and add the text.replace() method. The name speaks for itself: I replaced every '\n' (a newline, i.e. a jump to the next line) and every '  ' (a big double space) with an empty string, so the broken pieces join back together.

import PyPDF2
import textract

fileurl = 'fahrenheit451.pdf'  # placeholder path to the PDF
FRN451 = open(fileurl, 'rb')   # PyPDF2 needs the file opened in binary mode

F451 = PyPDF2.PdfFileReader(FRN451)

num_pages = F451.numPages
count = 0
text = ""

# read every page, lower-case the text, and strip the spurious breaks
while count < num_pages:
    pageObj = F451.getPage(count)
    count += 1
    text += pageObj.extractText()
    text = text.lower()
    text = text.replace('\n', '')   # rejoin words split across lines
    text = text.replace('  ', '')   # remove the big double spaces

# if nothing was extracted, the PDF is likely scanned images, so fall
# back to OCR; textract returns bytes, hence the decode
if text == "":
    text = textract.process(fileurl, method='tesseract',
                            language='eng').decode('utf-8')
The code in my Jupyter Notebook
Once this function was added, the text started looking more put together, and there were no more word breaks. However, a couple of words at the start got joined together, but that only happened twice and didn't impact sound words, as far as I know. There were also hyphens connecting two words, but those were supposed to be there, since the meaning would change if the words were separated.

I ran the same code used in the previous weeks to split the text into individual words and keep punctuation marks from affecting the results, and the number of sound words in Fahrenheit 451 became 580. Fixing the word breaks revealed more sound, since last week's total was 558.
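For anyone curious, the splitting-and-counting step boils down to something like this (a minimal sketch: the sound-word list is a tiny stand-in for my real one, and the cleanup is simplified):

import string

# tiny stand-in for my actual list of sound words
sound_words = {'buzz', 'hiss', 'crackle', 'whistle', 'roar'}

# strip punctuation so 'buzz,' and 'buzz' count as the same word
cleaned = text.translate(str.maketrans('', '', string.punctuation))

# split on whitespace and count every occurrence of a sound word
total = sum(1 for word in cleaned.split() if word in sound_words)
print(total)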

The time came to try the algorithm on other novels. Surprisingly, one of the most challenging parts of this whole project is finding a PDF that can actually be interpreted. As I found, most books in this format are just pictures of the text, which makes them impossible to read with code meant for working with actual text.
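One quick way to screen candidates (a rough sketch, not code from my notebook, and the path is a placeholder) is to check whether the first page yields any extractable text at all:

import PyPDF2

def has_text_layer(path):
    # if the first page extracts to an empty string, the PDF is most
    # likely a scan and would need OCR instead
    reader = PyPDF2.PdfFileReader(open(path, 'rb'))
    return reader.getPage(0).extractText().strip() != ""

print(has_text_layer('some_book.pdf'))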

Opening one website after another, I finally found a book other than Fahrenheit 451: The Hunger Games, one of my all-time favorites.

While the algorithm had no problem printing the text of the book, there were strange symbols: every time there was supposed to be an apostrophe, there was a ™ sign instead, which I found quite peculiar. Feeling thankful that the .replace() function exists, I quickly used it to solve this minor problem. This time, however, there were hyphens splitting words like 'fin-gers', and I haven't come up with a solution for that yet (I sketch one idea after the code below).

# same loop as before, with the new replacement for the ™ symbol
THG = PyPDF2.PdfFileReader(open('the_hunger_games.pdf', 'rb'))  # placeholder path

num_pages = THG.numPages
count = 0
text = ""

while count < num_pages:
    pageObj = THG.getPage(count)
    count += 1
    text += pageObj.extractText()
    text = text.lower()
    text = text.replace('\n', '')
    text = text.replace('  ', '')
    text = text.replace("™", "'")   # this PDF's apostrophes come out as ™
As I've learned more about the PyPDF2 library, which is used to read PDF documents, I've come to understand that it has its downsides, too: the main one is that it often has trouble printing out the text exactly as it appears in the original file. I can't complain though, because this leaves room for experimentation and new functions to try.

There was no need to change the rest of the algorithm, as everything ran smoothly and gave a total of 875 sound words for the first book of The Hunger Games trilogy.

The second book I chose to analyze was 1984, a classic I read back in my sophomore year of high school. While in a bookstore, my dad stumbled upon the novel and decided to read a couple of pages to see how much sound he would discover; he told me there really wasn't that much of it in the book.

As usual, the PDF didn't return the text in its original form: there were random symbols that didn't mean anything scattered throughout the novel, and some words had letters removed. Where the PDF had the word 'gestures', the Jupyter Notebook output read 'g-˜˚˛˝˝ures'. Once again, there were hyphens in the middle of words, which is also something I couldn't solve.
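To gauge how bad the damage is, one quick diagnostic (a sketch, not something from my notebook) is to count the characters that fall outside the expected alphabet:

import string

# everything we expect to see in lower-cased English prose
allowed = set(string.ascii_lowercase + string.digits +
              string.punctuation + string.whitespace)

junk = [ch for ch in text if ch not in allowed]
print(len(junk), 'unexpected characters, e.g.', sorted(set(junk))[:10])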

Nevertheless, the current total is 748 sound words in 1984, which seems too low for such a long novel.

Next week, I will be working on creating a prototype, which, if successful, would give everyone an opportunity to interact with the algorithm and see how much sound there is in their favorite books.

Subscribe to the newsletter and stay updated on the advances in my data science project!