Finding various parameters (energy, zero-crossing rate, autocorrelation, pitch) of a speech signal in voiced and unvoiced regions
We use several Python libraries to compute these parameters for a voiced and an unvoiced region, compare the results between the two, and draw conclusions from the observed differences.
Plotting the time-domain waveform
# The matplotlib, scipy and numpy packages are used here
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile
from scipy import signal
fs, data = wavfile.read('Problem_3.wav')
duration = len(data)/fs
# wavfile.read returns the file's native sample rate; we resample to 8 kHz for more visual clarity
fs = 8000
samps = int(duration * fs)
data = signal.resample(data, samps)
# Creating the time vector (one entry per resampled sample)
time = np.arange(len(data)) / fs
%matplotlib inline
plt.plot(time,data)
plt.xlabel('Time in seconds')
plt.ylabel('Amplitude')
plt.title('Problem - 3')
plt.show()
Finding the maximum and minimum pitch frequencies in a voiced region
Pitch is defined only for voiced regions, since it is the rate of vibration of the vocal folds. We therefore select a particular voiced region and find the maximum and minimum pitch frequencies within it. We use the “parselmouth” library to calculate these frequencies and to plot the pitch contour in the selected region.
import parselmouth
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
snd = parselmouth.Sound("Problem_3.wav")
snd_part = snd.extract_part(from_time=5.3, to_time=5.45, preserve_times=True)
The selected voiced region is the segment 5.30–5.45 seconds. We first plot the time-domain waveform for this region.
plt.figure()
# Rescale parselmouth's normalised samples back to the recording's integer amplitude range
# (30853 is assumed to be the peak sample value of this file)
plt.plot(snd_part.xs(), snd_part.values.T * 30853)
plt.xlim([snd_part.xmin, snd_part.xmax])
plt.xlabel("time [s]")
plt.ylabel("amplitude")
plt.title("Time domain plot for the selected segment")
plt.show()
Now, we plot the pitch contour for this segment
def plot_pitch(pitch):
    pitch_values = pitch.selected_array['frequency']
    # Replace unvoiced samples by NaN so they are not plotted
    pitch_values[pitch_values == 0] = np.nan
    plt.plot(pitch.xs(), pitch_values, 'o', markersize=5, color='w')
    plt.plot(pitch.xs(), pitch_values, 'o', markersize=4)
    plt.grid(True)
    plt.ylim(0, pitch.ceiling / 2)
    plt.ylabel("Pitch in Hz")
    plt.xlabel("time in seconds")

# Getting the sound segment's pitch
pitch = snd_part.to_pitch()
plt.figure()
#Passing the parameter to the function plot_pitch to plot the pitch contour
plot_pitch(pitch)
plt.xlim([snd_part.xmin, snd_part.xmax])
plt.title('Pitch Contour for the voiced region')
plt.show()
We calculate the maximum and the minimum pitch frequencies in the segment
pitch_values = pitch.selected_array['frequency']
pitch_values[pitch_values==0] = np.nan
print("The maximum pitch frequency is = {maximum} Hz, at time t = {time} seconds".format(maximum=np.nanmax(pitch_values), time=pitch.xs()[np.nanargmax(pitch_values)]))
print("The minimum pitch frequency is = {minimum} Hz, at time t = {time} seconds".format(minimum=np.nanmin(pitch_values), time=pitch.xs()[np.nanargmin(pitch_values)]))
The maximum pitch frequency is = 118.33134099729776 Hz, at time t = 5.425 seconds
The minimum pitch frequency is = 110.61746467065444 Hz, at time t = 5.3950000000000005 seconds
The maximum and minimum pitch frequencies are found to be around 118 Hz and 110 Hz, occurring at 5.425 and 5.395 seconds respectively.
The region of the maximum and minimum pitch frequency is highlighted in the time domain plot.
plt.figure()
plt.plot(snd.xs(), snd.values.T)
plt.xlim([snd_part.xmin, snd_part.xmax])
plt.xlabel("time [s]")
plt.ylabel("amplitude")
plt.axvspan(5.425,5.427, color = 'red', alpha = 0.5, label = "Max-Pitch")
plt.axvspan(5.395,5.397, color = 'green' , alpha = 0.5, label = "Min-Pitch" )
plt.legend(bbox_to_anchor =(1.75, 1.15), ncol = 2)
plt.title('Time Domain Plot')
plt.show()
pitch = snd.to_pitch()
plt.figure()
plot_pitch(pitch)
plt.title('Pitch Plot')
plt.xlim([snd_part.xmin, snd_part.xmax])
plt.show()
Fundamental Frequency
The fundamental frequency, or F0, is the frequency at which the vocal folds vibrate in voiced sounds. It can be identified in the produced sound, which is quasi-periodic: the pitch period is the fundamental period of the signal, i.e. the inverse of the fundamental frequency.[5] Here, our pitch (fundamental) frequency is approximately 114 Hz. We can also find the pitch frequency by taking the inverse of the time difference between successive peaks of the autocorrelation.
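To make the autocorrelation-based estimate concrete, here is a minimal sketch (the segment and sampling rate fs are assumed to come from the resampled signal above) that picks the strongest autocorrelation peak within a plausible pitch range and inverts its lag:

```python
import numpy as np

def pitch_from_autocorr(segment, fs, fmin=50.0, fmax=500.0):
    # Estimate F0 as the inverse of the lag of the strongest
    # autocorrelation peak within a plausible pitch range
    segment = np.asarray(segment, dtype=np.float64)
    segment = segment - segment.mean()
    # Full autocorrelation, keeping non-negative lags only
    ac = np.correlate(segment, segment, mode='full')[len(segment) - 1:]
    lag_min = int(fs / fmax)  # shortest plausible pitch period
    lag_max = int(fs / fmin)  # longest plausible pitch period
    peak_lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / peak_lag
```

For a synthetic 114 Hz tone sampled at 8 kHz this should return approximately 114 Hz (the peak lag is 70 samples, and 8000/70 ≈ 114.3), consistent with the pitch found by parselmouth above.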
Unvoiced Region (Segment: 8.49–8.51 seconds)
import math
start = math.ceil(8.49 * fs)
stop = math.ceil(8.51 * fs)
Energy
Low energy is observed in this region, the maximum being about 1220 J, so the segment is concluded to be unvoiced.
Energy = np.square(data, dtype= np.float64)
%matplotlib inline
plt.xlabel('Time (in secs)')
plt.ylabel('Energy (in Joules)')
plt.plot(time[start:stop], Energy[start:stop])
plt.show()
print("Maximum Energy = {e} J".format( e = np.amax(Energy[start:stop])))
Maximum Energy = 1220.9602063061272 J
Zero-Crossings
We use the librosa library to count the zero-crossings in the segment.
import librosa , librosa.display
Zooming in and plotting the unvoiced segment (8.49–8.51 seconds)
%matplotlib inline
plt.xlabel('Time [s]')
plt.ylabel('Amplitude')
plt.plot(time[start:stop], data[start:stop])
plt.title('Unvoiced Region Plot')
plt.show()
Total Zero crossings in the given segment
zcr = np.asarray([int(x) for x in librosa.zero_crossings(data[start:stop])])
print("Total zero crossings in the frame is =", sum(zcr))
Total zero crossings in the frame is = 31
As observed above, the total number of zero-crossings in the time frame (8.49–8.51 seconds) is 31, which is high for such a short segment. So it is concluded to be an unvoiced region.
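As a cross-check on librosa, the same count can be obtained with plain NumPy by counting sign changes between consecutive samples (a sketch; its handling of samples that are exactly zero may differ slightly from librosa's):

```python
import numpy as np

def count_zero_crossings(x):
    # Count sign changes between consecutive samples; a plain-NumPy
    # cross-check for librosa.zero_crossings
    signs = np.signbit(np.asarray(x))
    return int(np.sum(signs[1:] != signs[:-1]))
```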
Autocorrelation
from scipy import signal
autocorr = signal.correlate(data[start:stop], data[start:stop], mode='same', method='auto')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation ')
plt.plot(autocorr)
plt.show()
This segment is observed to be highly uncorrelated: there is only one peak, at the centre, and the secondary peaks are very low in amplitude. No quasi-stationarity is visible. This is expected for an unvoiced region, because the region contains random, noise-like excitation.
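One way to quantify "highly uncorrelated" is the ratio of the strongest autocorrelation peak outside the main lobe to the zero-lag value; the function below is an illustrative sketch, not part of the analysis above:

```python
import numpy as np

def periodicity_ratio(segment):
    # Ratio of the strongest autocorrelation peak outside the main lobe
    # to the zero-lag value: close to 1 for quasi-periodic (voiced)
    # frames, much smaller for noise-like (unvoiced) frames
    segment = np.asarray(segment, dtype=np.float64)
    segment = segment - segment.mean()
    ac = np.correlate(segment, segment, mode='full')[len(segment) - 1:]
    if ac[0] <= 0:
        return 0.0
    # Step over the main lobe by skipping to the first negative value
    first_neg = np.argmax(ac < 0)
    start = first_neg if first_neg > 0 else 1
    return float(np.max(ac[start:]) / ac[0])
```

Applied to the two segments analysed here, this ratio should come out near zero for the unvoiced frame and much larger for the voiced frame.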
Voiced Region Segment (9.51–9.53 seconds)
start = math.ceil(9.51 * fs)
stop = math.ceil(9.53 * fs)
Energy
Energy = np.square(data, dtype= np.float64)
%matplotlib inline
plt.xlabel('Time (in secs)')
plt.ylabel('Energy (in Joules)')
plt.plot(time[start:stop], Energy[start:stop])
plt.title('Energy vs Time plot')
plt.show()
print("Maximum Energy = {e} J".format( e = np.amax(Energy[start:stop])))
Maximum Energy = 83929083.58535491 J
The maximum energy is observed to be 8.39 × 10⁷ J, which is very large, so the segment is concluded to be a voiced region.
Zero Crossings
Zooming in and plotting the voiced segment (9.51–9.53 seconds)
%matplotlib inline
plt.xlabel('Time [s]')
plt.ylabel('Amplitude ')
plt.plot(time[start:stop], data[start:stop])
plt.title('Voiced Region Plot')
plt.show()
Total Zero crossings in the given segment
zcr = np.asarray([int(x) for x in librosa.zero_crossings(data[start:stop])])
print("Total zero crossings in the frame is =", sum(zcr))
Total zero crossings in the frame is = 20
As observed above, the total number of zero-crossings in the time frame (9.51–9.53 seconds) is 20, which is low. So it is concluded to be a voiced region.
Autocorrelation
autocorr = signal.correlate(data[start:stop], data[start:stop], mode='same', method='auto')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation ')
plt.plot(autocorr)
plt.show()
This segment is observed to be highly correlated: the autocorrelation values are very high, and due to quasi-stationarity and periodicity the two secondary dominant peaks are also clearly visible. Hence, the segment is concluded to be voiced.
Comments :
Among the three methods for finding and characterising voiced and unvoiced regions, the autocorrelation method seems the most accurate. The zero-crossing rate is easily affected by the presence of noise, and the energy can be distorted by loud noise, which may make an unvoiced region more energetic than a voiced one. In order to avoid these pitfalls, it is better to use the autocorrelation method. The large difference between the peak autocorrelation values of voiced and unvoiced regions also makes the decision much easier.
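The comparison above can be sketched as a toy majority-vote classifier over the three cues. The energy and zero-crossing thresholds below are illustrative assumptions, not values tuned on this recording:

```python
import numpy as np

def classify_frame(frame):
    # Majority vote over the three cues discussed above; thresholds
    # are illustrative assumptions only
    frame = np.asarray(frame, dtype=np.float64)
    # Cue 1: short-time energy (voiced frames are loud)
    energetic = np.mean(frame ** 2) > 1e6
    # Cue 2: zero-crossing count (unvoiced frames cross zero often)
    signs = np.signbit(frame)
    few_crossings = np.sum(signs[1:] != signs[:-1]) < 0.15 * len(frame)
    # Cue 3: autocorrelation periodicity (voiced frames are quasi-periodic)
    centred = frame - frame.mean()
    ac = np.correlate(centred, centred, mode='full')[len(centred) - 1:]
    periodic = ac[0] > 0 and np.max(ac[len(ac) // 8:]) / ac[0] > 0.4
    votes = int(energetic) + int(few_crossings) + int(periodic)
    return 'voiced' if votes >= 2 else 'unvoiced'
```

Because it is a majority vote, a single cue corrupted by noise cannot flip the decision on its own, which is the robustness argument made above.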
Sources:
- https://parselmouth.readthedocs.io/en/stable/index.html
- https://towardsdatascience.com/extract-features-of-music-75a3f9bc265d
- https://towardsdatascience.com/how-i-understood-what-features-to-consider-while-training-audio-files-eedfb6e9002b
- https://docs.scipy.org/doc/scipy/reference/index.html
- https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-73003-5_775