Audio transcription with Whisper from R
Last week, OpenAI released version 2 of Whisper, an updated neural net that approaches human-level robustness and accuracy on speech recognition. You can now call a C/C++ inference engine directly from R, which allows you to transcribe .wav audio files.
To make this easy to do in R, BNOSAC created an R wrapper around the whisper.cpp code. The R package is available at https://github.com/bnosac/audio.whisper and can be installed as follows.
remotes::install_github("bnosac/audio.whisper")
The following code shows how to transcribe an example 16-bit .wav file containing a fragment of a speech by JFK. The file ships with the package under samples/jfk.wav.

library(audio.whisper)
# Load the "tiny" Whisper model
model <- whisper("tiny")
# Path to the JFK sample audio included in the package
path  <- system.file(package = "audio.whisper", "samples", "jfk.wav")
trans <- predict(model, newdata = path, language = "en", n_threads = 2)
trans

$n_segments
[1] 1

$data
 segment         from           to text
       1 00:00:00.000 00:00:11.000 And so my fellow Americans ask not what your country can do for you ask what you can do for your country.

$tokens
 segment     token token_prob
       1       And  0.7476438
       1        so  0.9042299
       1        my  0.6872202
       1    fellow  0.9984470
       1 Americans  0.9589157
       1       ask  0.2573057
       1       not  0.7678108
       1      what  0.6542882
       1      your  0.9386917
       1   country  0.9854987
       1       can  0.9813995
       1        do  0.9937403
       1       for  0.9791515
       1       you  0.9925495
       1       ask  0.3058807
       1      what  0.8303462
       1       you  0.9735528
       1       can  0.9711444
       1        do  0.9616748
       1       for  0.9778513
       1      your  0.9604713
       1   country  0.9923630
       1         .  0.4983074
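The token table lends itself to simple post-processing. The following is a minimal sketch in base R, assuming that token_prob can be read as a rough per-token confidence. The data frame is typed in by hand from the first rows of the output above so the snippet runs without the package, and the 0.5 cut-off is an arbitrary value chosen for illustration. In practice you would start from trans$tokens instead.

```r
# First rows of the token table from the JFK example, copied by hand
# (in real use: tokens <- trans$tokens).
tokens <- data.frame(
  segment    = 1L,
  token      = c("And", "so", "my", "fellow", "Americans", "ask", "not"),
  token_prob = c(0.7476438, 0.9042299, 0.6872202, 0.9984470,
                 0.9589157, 0.2573057, 0.7678108),
  stringsAsFactors = FALSE
)

# Hypothetical cut-off, not a value recommended by the package.
threshold <- 0.5

# Tokens the model was least sure about: candidates for manual review.
low_conf <- tokens[tokens$token_prob < threshold, ]
print(low_conf)

# Mean token probability per segment as a crude quality indicator.
seg_quality <- aggregate(token_prob ~ segment, data = tokens, FUN = mean)
print(seg_quality)
```

Flagging low-probability tokens this way is only a heuristic for deciding where to listen again. It does not say anything about whether the token is actually wrong.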
Another example is based on a Micro Machines commercial from the 1980s.
I've always wanted a transcription of the performances of Francis E. Dec available on UbuWeb Sound, such as this one: https://www.ubu.com/media/sound/dec_francis/Dec-Francis-E_rant1.mp3. This is how you can now do that from R.
library(av)
# Download the mp3 and convert it to a 16 kHz .wav file
download.file(url = "https://www.ubu.com/media/sound/dec_francis/Dec-Francis-E_rant1.mp3",
              destfile = "rant1.mp3", mode = "wb")
av_audio_convert("rant1.mp3", output = "output.wav", format = "wav", sample_rate = 16000)

# Transcribe 30 seconds of audio, starting 7 seconds in (both given in milliseconds),
# and request a timestamp for each token
trans <- predict(model, newdata = "output.wav", language = "en",
                 duration = 30 * 1000, offset = 7 * 1000,
                 token_timestamps = TRUE)
trans

$n_segments
[1] 11

$data
 segment         from           to text
       1 00:00:07.000 00:00:09.000 Look at the picture.
       2 00:00:09.000 00:00:11.000 See the skull.
       3 00:00:11.000 00:00:13.000 The part of bone removed.
       4 00:00:13.000 00:00:16.000 The master race Frankenstein radio controls.
       5 00:00:16.000 00:00:18.000 The brain thoughts broadcasting radio.
       6 00:00:18.000 00:00:21.000 The eyesight television. The Frankenstein earphone radio.
       7 00:00:21.000 00:00:25.000 The threshold brain wash radio. The latest new skull reforming.
       8 00:00:25.000 00:00:28.000 To contain all Frankenstein controls.
       9 00:00:28.000 00:00:31.000 Even in thin skulls of white pedigree males.
      10 00:00:31.000 00:00:34.000 Visible Frankenstein controls.
      11 00:00:34.000 00:00:37.000 The synthetic nerve radio, directional and an alloop.

$tokens
 segment   token token_prob   token_from     token_to
       1    Look  0.4281234 00:00:07.290 00:00:07.420
       1      at  0.9485379 00:00:07.420 00:00:07.620
       1     the  0.9758387 00:00:07.620 00:00:07.940
       1 picture  0.9734664 00:00:08.150 00:00:08.580
       1       .  0.9688568 00:00:08.680 00:00:08.910
       2     See  0.9847929 00:00:09.000 00:00:09.420
       2     the  0.7588121 00:00:09.420 00:00:09.840
       2   skull  0.9989663 00:00:09.840 00:00:10.310
       2       .  0.9548351 00:00:10.550 00:00:11.000
       3     The  0.9914295 00:00:11.000 00:00:11.170
       3    part  0.9789217 00:00:11.560 00:00:11.600
       3      of  0.9958754 00:00:11.600 00:00:11.770
       3    bone  0.9759618 00:00:11.770 00:00:12.030
       3 removed  0.9956936 00:00:12.190 00:00:12.710
       3       .  0.9965582 00:00:12.710 00:00:12.940
...
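With token_timestamps = TRUE, each token comes with a start and end time formatted as HH:MM:SS.mmm. A minimal sketch of turning those strings into seconds so that token durations can be computed is given below. The helper timestamp_to_seconds is a hypothetical name introduced here, not part of audio.whisper, and the data frame is copied by hand from the first segment of the output above so the snippet runs on its own.

```r
# Hypothetical helper: convert "HH:MM:SS.mmm" strings to seconds.
timestamp_to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] * 3600 + p[2] * 60 + p[3]
  }, numeric(1))
}

# Tokens of the first segment, copied by hand from the output above
# (in real use: tokens <- trans$tokens).
tokens <- data.frame(
  segment    = 1L,
  token      = c("Look", "at", "the", "picture", "."),
  token_from = c("00:00:07.290", "00:00:07.420", "00:00:07.620",
                 "00:00:08.150", "00:00:08.680"),
  token_to   = c("00:00:07.420", "00:00:07.620", "00:00:07.940",
                 "00:00:08.580", "00:00:08.910"),
  stringsAsFactors = FALSE
)

# Numeric start, end and duration (in seconds) for each token.
tokens$start    <- timestamp_to_seconds(tokens$token_from)
tokens$end      <- timestamp_to_seconds(tokens$token_to)
tokens$duration <- tokens$end - tokens$start

print(tokens[, c("token", "start", "end", "duration")])
```

Numeric times like these are easier to work with than the formatted strings, for example when aligning tokens with other annotations of the same recording.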
We may put the package on CRAN in the near future. For now, it is only available at https://github.com/bnosac/audio.whisper.
Get in touch if you are interested in this and let us know what you plan to use it for.