Audio Transcription with Speaker Identification

This is my version of the WhisperPlus demo, trying to get it to work on my Mac.

Prior to running this, create a new Python environment with Python 3.11. I don’t normally create new environments, but this library needed one to get all the dependencies to play nicely together.

conda create --name whisperplus python==3.11 notebook pip
conda activate whisperplus
jupter notebook

pip install whisperplus -U -q

Note: you may need to restart the kernel to use updated packages.

Some warnings show up after running the commands, but the don’t seem to have much impact.

from whisperplus import SpeechToTextPipeline, download_and_convert_to_mp3

/Users/nealcaren/anaconda3/envs/whisperplus/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
/Users/nealcaren/anaconda3/envs/whisperplus/lib/python3.11/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/Users/nealcaren/anaconda3/envs/whisperplus/lib/python3.11/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/Users/nealcaren/anaconda3/envs/whisperplus/lib/python3.11/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/Users/nealcaren/anaconda3/envs/whisperplus/lib/python3.11/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")

Transcription without speaker identification.

model = 'openai/whisper-small' # the smallest and quickest model, but less accurate than the others like  whisper-large-v3 

audio_path = 'sample/quiton_baxter_interview_sample.mp3'
pipeline = SpeechToTextPipeline(model_id=model)
transcript = pipeline(audio_path, model, "english")

print(transcript)

2024-02-14 11:28:36,149 - INFO - Loading model...
2024-02-14 11:28:38,246 - INFO - Model loaded successfully.
2024-02-14 11:28:38,285 - INFO - Using device: mps
2024-02-14 11:28:38,961 - INFO - Transcribing audio...

 Hello, this is Chris McGinnis. Today is Saturday, February 23rd, and I'm interviewing Mr. Quinton Baker at his home in Hillsborough, North Carolina. This tape is a continuing series of interviews that contribute to the Gay and Lesbian Southern History Project, which is part of the Southern Oral History Program at UNC Chapel Hill. This project is currently focusing on the history of gay men, lesbians, bisexual and transgender history in Chapel Hill and the Triangle area over the 20th century. This tape will be stored in the Southern Historical Collection, which is located in Wilson Library on the campus of the University of North Carolina and Chapel Hill. The number for this tape is 02.23.02-QB.1. Here we go. Well, first off, Quentin, just to, this is a general question I ask everybody, tell me a little bit about where you were born, where you grew up, and just a general synopsis of the early years. The early years. The early years. I was born in Greenville, North Carolina, and I spent the first 18, 18 years there. I was born in a family of four children. I'm the youngest of four. My parents were laborers. My mother was the domestic, my father was a laborer. We lived in town at that time. Greenville was about 21,000 people. What did your father do? Did he work in textile mill? My father did various jobs. He worked in a furniture store. He sometimes worked in the fields, he worked in the tobacco factory, so that there was never one job, there was a serious variety of jobs. He even learned to repair televisions while he was working for...

Now with speaker identification, powered by PyAnnote’s Speaker Diarization.

from whisperplus import (
    ASRDiarizationPipeline,
    download_and_convert_to_mp3,
    format_speech_to_dialogue,
)

model = 'openai/whisper-small' 
audio_path = 'sample/quiton_baxter_interview_sample.mp3'
device = "mps"  # "mps" if you are on a modern Mac, "cuda" if have a GPU (the fastest option), or "cpu" (the slowest option). 

pipeline = ASRDiarizationPipeline.from_pretrained(
    asr_model=model,
    diarizer_model="pyannote/speaker-diarization",
    use_auth_token='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', # This is mine. Get your own at https://huggingface.co/pyannote/speaker-diarization
    chunk_length_s=30,
    device=device,
)

output_text = pipeline(audio_path, 
                       num_speakers=2, 
                       min_speaker=1, 
                       max_speaker=2)


dialogue = format_speech_to_dialogue(output_text)
print(dialogue)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-02-14 11:37:44,135 - INFO - Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.2.0.post0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../../../../.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`
2024-02-14 11:37:44,224 - INFO - Fetch hyperparams.yaml: Using existing file/symlink in /Users/nealcaren/.cache/torch/pyannote/speechbrain/hyperparams.yaml.
2024-02-14 11:37:44,224 - INFO - Fetch custom.py: Delegating to Huggingface hub, source speechbrain/spkrec-ecapa-voxceleb.

Model was trained with pyannote.audio 0.0.1, yours is 3.1.0. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.2.0. Bad things might happen unless you revert torch to 1.x.

2024-02-14 11:37:44,498 - INFO - Fetch embedding_model.ckpt: Using existing file/symlink in /Users/nealcaren/.cache/torch/pyannote/speechbrain/embedding_model.ckpt.
2024-02-14 11:37:44,499 - INFO - Fetch mean_var_norm_emb.ckpt: Using existing file/symlink in /Users/nealcaren/.cache/torch/pyannote/speechbrain/mean_var_norm_emb.ckpt.
2024-02-14 11:37:44,500 - INFO - Fetch classifier.ckpt: Using existing file/symlink in /Users/nealcaren/.cache/torch/pyannote/speechbrain/classifier.ckpt.
2024-02-14 11:37:44,501 - INFO - Fetch label_encoder.txt: Using existing file/symlink in /Users/nealcaren/.cache/torch/pyannote/speechbrain/label_encoder.ckpt.
2024-02-14 11:37:44,502 - INFO - Loading pretrained files for: embedding_model, mean_var_norm_emb, classifier, label_encoder

Speaker 1:  Hello, this is Chris McGinnis. Today is Saturday, February 23rd, and I'm interviewing Mr. Quinton Baker at his home in Hillsborough, North Carolina. This tape is a continuing series of interviews that contribute to the Gay and Lesbian Southern History Project, which is part of the Southern Oral History Program at UNC Chapel Hill. This project is currently focusing on the history of gay men, lesbians, bisexual and transgender history in Chapel Hill and the Triangle area over the 20th century. This tape will be stored in the Southern Historical Collection, which is located in Wilson Library on the campus of the University of North Carolina and Chapel Hill. The number for this tape is 02.23.02-QB.1. Here we go. Well, first off, Quentin, just to, this is a general question I ask everybody, tell me a little bit about where you were born, where you grew up, and just a general synopsis
Speaker 2:  of the early years. The early years. I was born in Greenville, North Carolina, and I spent the first 18, 18 years there. I was born in a family of four children. I'm the youngest of four. My parents were laborers. My mother was the domestic, my father was a laborer. We lived in town at that time. Greenville was about 21,000 people.
Speaker 1:  What did your father do? Did he work in textile mill?
Speaker 2:  My father did various jobs. He worked in a furniture store. He sometimes worked in the fields, he worked in the tobacco factory, so that there was never one job, there was a serious variety of jobs. He even learned to repair televisions while he was working for...

The transcription took less than 30 seconds (using the small model) for the 2-minute interview on my M2 MacBook Air with 8GBs of memory. In contrast, the full diarization took almost an hour, so you still might want to use Google Colab and a GPU.