Tacotron-2 Audio Synthesis

Mar 28, 2020


When people talk or sing, different muscles are used, including some in the mouth and throat. Like other muscles in the human body, the ones that help us speak can suffer fatigue, strain and injury when overused.

In Feb 2018, a Google team published the paper Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, in which they presented a neural text-to-speech model that learns to synthesise speech directly from (text, audio) pairs.

System setup

git clone https://github.sydney.edu.au/TechLab/tacotron.git
  1. Build the Docker image from inside the cloned /tacotron directory
docker build -t nginx/tacotron2 .

  2. Run the built Docker image (nginx/tacotron2)
docker run --gpus all -it -p 8888:8888 nginx/tacotron2
cd tacotron2/

git submodule init; git submodule update --remote --merge

python waveglow/convert_model.py waveglow_256channels.pt waveglow_256channels_new.pt

jupyter notebook --ip=0.0.0.0 --no-browser --allow-root &

sed -i -- 's,DUMMY,LJSpeech-1.1/wavs,g' filelists/*.txt


The TechLab solution uses Tacotron 2, based on NVIDIA's PyTorch implementation of the paper Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (J. Shen et al.).

A deep dive on the audio with LibROSA

Install libraries

Firstly, let's install and import libraries such as librosa, matplotlib and numpy.

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

Load an audio file and plot the waveform

# Load audio file
filename = 'output/chunk2.mp3'
y, sr = librosa.load(filename)

# Trim silent edges
speech, _ = librosa.effects.trim(y)

# Plot the waveform (waveshow replaces waveplot, which was removed in librosa 0.10)
librosa.display.waveshow(speech, sr=sr)

Plot the Mel spectrogram

# Mel spectrogram
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000)

plt.figure(figsize=(10, 4))
S_dB = librosa.power_to_db(S, ref=np.max)
librosa.display.specshow(S_dB, x_axis='time', y_axis='mel', sr=sr, fmax=8000)  # pass the same fmax as above; the default is sr/2
plt.colorbar(format='%+2.0f dB')
plt.title('Mel-frequency spectrogram')
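
To make the decibel conversion in the plot above less opaque, here is a minimal NumPy sketch of what librosa.power_to_db(S, ref=np.max) computes, assuming librosa's documented defaults (amin=1e-10, top_db=80). The function name power_to_db_sketch and the toy input are ours, not part of librosa.

```python
import numpy as np


def power_to_db_sketch(S, amin=1e-10, top_db=80.0):
    """Convert a power spectrogram to dB relative to its peak value.

    Hypothetical helper that mirrors librosa.power_to_db(S, ref=np.max).
    """
    S = np.asarray(S, dtype=float)
    ref = np.max(S)
    # 10 * log10(S / ref), with both terms floored at amin to avoid log(0)
    log_spec = 10.0 * np.log10(np.maximum(amin, S))
    log_spec -= 10.0 * np.log10(np.maximum(amin, ref))
    # Clip everything that lies more than top_db below the peak
    return np.maximum(log_spec, log_spec.max() - top_db)


# Toy 2x2 "spectrogram" (made-up values, not real audio)
power = np.array([[1.0, 0.1], [0.01, 0.0]])
db = power_to_db_sketch(power)
print(db)
```

The peak maps to 0 dB and every factor of 10 in power costs 10 dB, which is why the colour bar in the plot runs from 0 dB down to -80 dB.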

Transfer Learning using a pre-trained model


  • At the time of its publication, Tacotron 2 was one of the most successful sequence-to-sequence models for text-to-speech.
  • The experiments delivered by TechLab
  • Since we only had an audio file of around 30 minutes, the dataset we could derive from it was small. The appropriate approach in this case is to start from the pre-trained Tacotron 2 model published by NVIDIA, which was trained on the LJ Speech dataset, and then fine-tune it on our small dataset.
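
For fine-tuning, the NVIDIA implementation reads its training examples from plain-text filelists with one wav_path|transcript entry per line (these are the files under filelists/ that the sed command in the setup patches). A minimal sketch of writing such filelists for a small custom dataset could look like this; the clip names, transcripts, output file names and the 90/10 split are hypothetical placeholders, not the values TechLab used.

```python
import random
import tempfile
from pathlib import Path


def write_filelists(entries, out_dir, val_fraction=0.1, seed=0):
    """Write train/validation filelists in `wav_path|transcript` format.

    entries: list of (wav_path, transcript) tuples.
    Returns the (train, val) lists that were written.
    """
    entries = list(entries)
    random.Random(seed).shuffle(entries)
    n_val = max(1, int(len(entries) * val_fraction))
    val, train = entries[:n_val], entries[n_val:]

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, subset in (("train_filelist.txt", train), ("val_filelist.txt", val)):
        lines = [f"{path}|{text}" for path, text in subset]
        (out_dir / name).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return train, val


# Hypothetical clips cut from the 30-minute recording
clips = [(f"wavs/chunk{i}.wav", f"transcript of chunk {i}") for i in range(20)]
demo_dir = tempfile.mkdtemp()  # throwaway output directory for the demo
train, val = write_filelists(clips, demo_dir)
print(len(train), len(val))
```

Holding out even a small validation list matters here, because validation loss is the only early signal of the overfitting that a dataset of this size invites.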

In our experiments, there are several things to note:
- Sampling rate differences: transfer learning did not work well when the sampling rate of the custom audio differed from that of the LJ Speech dataset. Always convert your own dataset to the same sampling rate as the dataset used to train the pre-trained model.
- With a larger dropout rate, say 0.4-0.5, memory usage surged and mini-batch training stopped after several epochs. Similar scenarios are discussed on GitHub and on other platforms such as Stack Overflow, where people tend to attribute them to the PyTorch architecture. The real cause remains unclear.
- A smaller batch size (e.g. 8 or 16) led to severe overfitting.
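
Because a mismatched sampling rate was the first pitfall, it can help to check every clip before training. The sketch below uses only Python's standard wave module (so it handles PCM WAV files only) to flag files whose rate differs from 22,050 Hz, the sampling rate of LJ Speech. The helper names and the demo files are hypothetical.

```python
import tempfile
import wave
from pathlib import Path

TARGET_SR = 22050  # sampling rate of the LJ Speech recordings


def find_mismatched(wav_dir, target_sr=TARGET_SR):
    """Return {file name: sample rate} for WAV files not at target_sr."""
    mismatched = {}
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wf:
            sr = wf.getframerate()
        if sr != target_sr:
            mismatched[path.name] = sr
    return mismatched


def _write_silence(path, sr, seconds=0.1):
    """Write a short mono 16-bit silent WAV file (demo data only)."""
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sr)
        wf.writeframes(b"\x00\x00" * int(sr * seconds))


# Demo: one file at the target rate, one at 44.1 kHz
demo_wavs = Path(tempfile.mkdtemp())
_write_silence(demo_wavs / "ok.wav", 22050)
_write_silence(demo_wavs / "bad.wav", 44100)
mismatched = find_mismatched(demo_wavs)
print(mismatched)
```

Files that fail the check can be resampled when they are loaded, for example with librosa.load(path, sr=22050), and then written back out before building the filelists.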


However, the main problem we faced was severe overfitting caused by the small size of the dataset. In general, several techniques can reduce overfitting and improve convergence:

  • Enlarge the dataset by getting more data from Jason;
  • Enlarge the dataset by using augmentation techniques on the audio sample provided (see “Data Augmentation for Audio” below);
  • Other machine learning/deep learning techniques, e.g. regularization, dropout, early stopping and a bigger batch size (the TechLab team cannot apply all of the techniques listed here because of the limits of the GPUs we have).
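
Of the techniques in the last bullet, early stopping is the cheapest to add to a training loop. Below is a minimal, framework-independent sketch. The EarlyStopping class, the patience value and the validation losses are illustrative assumptions and are not part of the NVIDIA code.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` checks."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, val_loss):
        """Record one validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Made-up validation losses: improving at first, then overfitting
val_losses = [0.90, 0.70, 0.62, 0.63, 0.66, 0.71, 0.80]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In a real run the checkpoint saved at the best validation loss would be the one kept for synthesis, not the last one written before stopping.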

Future work

Data Augmentation for Audio

  • Noise injection - add random values to the data (may help reduce overfitting)
  • Shifting time - shift the audio left or right by a random number of seconds (our team implemented this method to create more data samples)
  • Changing pitch
  • Changing speed
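
The first two augmentations in the list above can be sketched with NumPy alone. This is an illustration under our own assumptions (function names, noise level, maximum shift and the synthetic test tone are all hypothetical) and not the implementation the team used. For changing pitch and speed, librosa provides librosa.effects.pitch_shift and librosa.effects.time_stretch.

```python
import numpy as np


def add_noise(y, noise_level=0.005, rng=None):
    """Noise injection: add Gaussian noise scaled to the signal's peak."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(len(y))
    return y + noise_level * np.max(np.abs(y)) * noise


def shift_time(y, sr, max_shift_s=0.5, rng=None):
    """Time shift: move the audio left or right by a random offset, padding with silence."""
    rng = np.random.default_rng() if rng is None else rng
    max_shift = int(sr * max_shift_s)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(y, shift)  # np.roll returns a copy, so y is untouched
    # Silence the samples that wrapped around to the other end
    if shift > 0:
        shifted[:shift] = 0.0
    elif shift < 0:
        shifted[shift:] = 0.0
    return shifted


# Demo on a synthetic 1-second 440 Hz tone instead of a real recording
sr = 22050
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
rng = np.random.default_rng(0)
noisy = add_noise(tone, rng=rng)
shifted = shift_time(tone, sr, rng=rng)
print(noisy.shape, shifted.shape)
```

Both functions keep the clip length unchanged, so each augmented copy can reuse the transcript of the clip it was derived from.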

- https://www.webmd.com/rheumatoid-arthritis/why-am-i-losing-my-voice

Lydia Gu

Innovation Engineer
