Turning PDFs into audiobooks

Using textract and Text-To-Speech models to create audiobooks from PDFs

Recently, I have been looking at speech-to-text models, and they surprised me with their accuracy and speed.
It occurred to me that models specialized in the reverse direction, text-to-speech, should have improved just as much.
That is how I got the idea of building a simple tool to convert PDFs into audiobooks.
In the tutorial below I will show you how I built it.

To run this code yourself, you will first need to install the dependencies:
`pip install textract transformers sentencepiece 'datasets[audio]' jupyterlab scipy`

Next we will need to download our book in PDF format.
I recommend using Google for that; for the purpose of this experiment I used the book "Ender's Game".
Okay, once you have downloaded your book, let us make all the necessary imports and load the text from the PDF.
Make sure you run the code inside a JupyterLab notebook.
You can start it with `jupyter-lab`, but first make sure you have created an IPython kernel, because we want to be able to hear some samples.
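If you have not set one up before, registering the environment where you installed the packages above as a kernel looks something like this (the kernel name here is just an example):

```bash
# Register the current environment as an IPython kernel, then start JupyterLab
pip install ipykernel
python -m ipykernel install --user --name pdf-audiobook
jupyter-lab
```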

```python
import numpy as np
import typing
import re
import textract
import torch
from datasets import load_dataset
from IPython.display import Audio
from scipy.io import wavfile
from tqdm import tqdm
from transformers import pipeline

# Extract the raw text from the PDF; textract returns bytes, which we decode later
long_text = textract.process("/path/to/your/YOUR_BOOK.pdf", method='pdfminer')
```

Cool, so now we will have to split our text into smaller chunks, since the transformer models have an upper limit on input text length.
We can do that with a simple function:

```python
def create_chunks(text: str, max_chunk_len: int = 100) -> typing.List[str]:
    """Split text on whitespace into chunks of at most ~max_chunk_len characters."""
    chunks = []
    chunk = ""
    for part in re.split(r"\s+", text):
        if len(part) + len(chunk) < max_chunk_len:
            chunk += " " + part
        else:
            chunks.append(chunk)
            chunk = part
    if len(chunk) > 0:
        chunks.append(chunk)
    return chunks

chunks = create_chunks(long_text.decode("utf-8").replace("\n", " ... "))
print(len(chunks))
```

We are using a maximum chunk length of 100 characters, a reasonable default that worked well during my experimentation.
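If you want to sanity-check the chunking before synthesising anything, a quick peek (just a sketch using the `chunks` list from above) could look like this:

```python
# Longest chunk in characters, plus a peek at the first chunk
print(max(len(c) for c in chunks))
print(chunks[0])
```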
Alright, now that we have our preprocessed chunks, let us do the magic! We are going to create a short audio sample first, so that we can test our code.

```python
synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

combined_speech = None
sampling_rate = None
for i in tqdm(range(10)):
    # Synthesise one chunk at a time and concatenate the resulting audio
    speech = synthesiser(chunks[i], forward_params={"speaker_embeddings": speaker_embedding})
    if combined_speech is None:
        combined_speech = speech["audio"]
        sampling_rate = speech["sampling_rate"]
    else:
        combined_speech = np.concatenate([combined_speech, speech["audio"]])

Audio(combined_speech, rate=sampling_rate)
```

Great success! We managed to convert a PDF into audio! In order to do this for the entire book we will need to adjust our code a bit.
Firstly, we will need to iterate over all chunks, and we will also want to write our output to a `.wav` file.

```python
synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

combined_speech = None
sampling_rate = None
for i in tqdm(range(len(chunks))):
    # Synthesise one chunk at a time and concatenate the resulting audio
    speech = synthesiser(chunks[i], forward_params={"speaker_embeddings": speaker_embedding})
    if combined_speech is None:
        combined_speech = speech["audio"]
        sampling_rate = speech["sampling_rate"]
    else:
        combined_speech = np.concatenate([combined_speech, speech["audio"]])

wavfile.write("audiobook.wav", sampling_rate, combined_speech)
```
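Before worrying about the file size, it can be handy to check how long the generated audio actually is. A rough estimate, using the `combined_speech` array and `sampling_rate` from the loop above:

```python
# Number of samples divided by samples-per-second gives the duration
duration_seconds = len(combined_speech) / sampling_rate
print(f"Audiobook length: {duration_seconds / 3600:.2f} hours")
```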

The audiobook will be pretty large, since our `.wav` file is not compressed.
We can use third-party tools to compress it, but we will not do that as part of this experiment.
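If you do want a smaller file, one common approach (assuming you have `ffmpeg` installed; the bitrate below is just an example) is to convert the `.wav` into an `.mp3`:

```bash
ffmpeg -i audiobook.wav -b:a 64k audiobook.mp3
```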
To summarize, it is pretty cool that we can use our machines to read to us.
Consider a near future where we will be able not only to create stories using GPT-like models, but also to create entire video games using generative AI.
Our results will rely mostly on our ability to be creative and to find a good way to convey our style to the AI. It is both fascinating and exciting!