Text to Speech using OpenAI’s API

We are almost there with talking computers.
Text to Speech
OpenAI
Published: February 14, 2024

OpenAI has a pretty good overview of their text-to-speech API. It’s neither free nor cheap enough to use on impulse, the way ChatGPT 3.5 is, but, in my opinion, it is the best available model.

from openai import OpenAI

This assumes that you have an OpenAI API key and have stored it as an environment variable. The client looks for OPENAI_API_KEY by default.
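Before making any calls, it can help to confirm that the key is actually visible to Python. This is a minimal sketch, assuming the key is stored under OPENAI_API_KEY; the helper name api_key_is_set is made up for illustration.

```python
import os

def api_key_is_set(env=os.environ, name="OPENAI_API_KEY"):
    # Hypothetical helper: True if the variable exists and is not blank.
    return bool(env.get(name, "").strip())

if not api_key_is_set():
    print("OPENAI_API_KEY is not set, so the client will not be able to authenticate.")
```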

def stt(text, voice, mp3fn, model="tts-1-hd"):
    # Call the text-to-speech API and save the audio as an MP3 file.

    client = OpenAI()  # picks up the API key from the environment
    response = client.audio.speech.create(
        model=model,  # model="tts-1" is cheaper and pretty close in quality.
        voice=voice,
        input=text,
    )

    response.stream_to_file(mp3fn)

graph = (
    "Sociology is the last of the great sciences. It is only a little more than "
    "a generation old, and, as yet, its principles are not quite definite. So that "
    "among any large number of people who call themselves sociologists, one might "
    "find as many shades of opinion as he would among the large number of persons "
    "who call themselves Christians. Unlike biology, or astronomy, or mathematics, "
    "there is as yet no definite set of fundamental principles upon which all "
    "sociologists agree."
) # cite: Wright, Richard R. 1911. "The Negro Problem" https://www.crisisopportunity.org/articles/negro_problem.html


stt(graph, "shimmer", 'shimmer_sample.mp3')
/var/folders/6c/yvlqyrq97gz6xg8h66c__jgc0000gn/T/ipykernel_6745/4001769771.py:11: DeprecationWarning: Due to a bug, this method doesn't actually stream the response content, `.with_streaming_response.method()` should be used instead
  response.stream_to_file(mp3fn)

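Note: The DeprecationWarning is new. As the message says, stream_to_file doesn’t actually stream the response when called this way, and the library now recommends .with_streaming_response instead. The MP3 still gets saved.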
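The warning points at .with_streaming_response. Here is a minimal sketch of that approach, assuming a recent 1.x version of the openai package. The name tts_streaming is made up, and the client is passed in as an argument (an assumption of this sketch, not how the function above works) so that it can be tried with a stand-in object.

```python
def tts_streaming(client, text, voice, mp3fn, model="tts-1-hd"):
    # Hypothetical variant of the function above; `client` is an OpenAI() instance.
    # The with-block keeps the HTTP response open while the audio is written to disk.
    with client.audio.speech.with_streaming_response.create(
        model=model,
        voice=voice,
        input=text,
    ) as response:
        response.stream_to_file(mp3fn)

# Usage (needs the openai package and an API key):
# from openai import OpenAI
# tts_streaming(OpenAI(), graph, "shimmer", "shimmer_sample.mp3")
```

Called this way, the same stream_to_file call should no longer trigger the DeprecationWarning.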

Sample with the Shimmer voice

from IPython.display import Audio

Audio('shimmer_sample.mp3')

The only other voices I like are Onyx and Echo.

stt(graph, "onyx", 'onyx_sample.mp3')
Audio('onyx_sample.mp3')
stt(graph, "echo", 'echo_sample.mp3')
Audio('echo_sample.mp3')

The HD model costs $0.030 per 1,000 characters. At roughly six characters per word, spaces included, that works out to about 18 cents for 1,000 words. The non-HD model costs half that.
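The arithmetic is easy to check before sending a long text. This is a rough sketch using the per-character prices quoted in this post, which may have changed since; the names PRICE_PER_1K_CHARS and estimate_cost are made up for illustration.

```python
# Prices per 1,000 characters as quoted in this post (February 2024).
PRICE_PER_1K_CHARS = {"tts-1-hd": 0.030, "tts-1": 0.015}

def estimate_cost(text, model="tts-1-hd"):
    # Hypothetical helper: pricing is per character, so count characters.
    return len(text) / 1000 * PRICE_PER_1K_CHARS[model]

# 1,000 four-letter words plus a space each = 5,000 characters.
sample = "word " * 1000
print(f"HD: ${estimate_cost(sample):.2f}, non-HD: ${estimate_cost(sample, 'tts-1'):.3f}")
```

Running estimate_cost(graph) on the paragraph above would give the cost of each voice sample.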

For fun, I used ChatGPT to write an introduction to a W.E.B. Du Bois short story, used TTS to produce audio versions of the intro and the story in different voices, and then used pydub to splice the two together.
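The splicing step can be sketched roughly as follows. This is not necessarily the exact code used for the file below: the function name, the one-second pause, and the input file names in the usage line are all assumptions. pydub also needs ffmpeg installed to read and write MP3s.

```python
def splice_mp3s(intro_fn, story_fn, out_fn, gap_ms=1000):
    # Hypothetical helper. pydub is imported inside the function so the
    # sketch can be loaded even where pydub is not installed.
    from pydub import AudioSegment

    intro = AudioSegment.from_mp3(intro_fn)
    story = AudioSegment.from_mp3(story_fn)
    pause = AudioSegment.silent(duration=gap_ms)  # short silence between the two
    combined = intro + pause + story              # "+" concatenates segments
    combined.export(out_fn, format="mp3")
    return len(combined)                          # total length in milliseconds

# Usage (intro.mp3 and story.mp3 are placeholder names):
# splice_mp3s("intro.mp3", "story.mp3", "on_being_crazy_hd.mp3")
```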

Content warning: The story contains racial epithets, which Du Bois often used when writing dialogue for white racists. Relatedly, whatever content filters exist for ChatGPT don’t seem to apply to OpenAI’s TTS model.

Audio('on_being_crazy_hd.mp3')