Preface
MATLAB Speech Signal Analysis and Synthesis (Second Edition) is a painstaking work by Song Zhiyong, a senior figure at the Institute of Acoustics, Chinese Academy of Sciences. It is a good entry point for students interested in speech signal processing who hope to go on to research in speech signal analysis, processing and synthesis.
Speech signal processing is an important branch of digital signal processing, and this book presents many digital signal processing methods together with their MATLAB functions. The book consists of 10 chapters. Chapters 1-4 introduce the basic analysis methods and tools of speech signal processing and the corresponding MATLAB functions. Chapters 5-9 cover speech signal preprocessing and feature extraction, including removal of trend terms, basic noise reduction, endpoint detection, pitch extraction and formant extraction; building on the basic processing methods, a variety of extraction techniques and the corresponding MATLAB programs are given. Chapter 10 combines the detected parameters to introduce speech synthesis as well as speed and tone modification of speech signals; time domain pitch synchronous overlap-add (TD-PSOLA) synthesis is also introduced, with the corresponding MATLAB programs. Appendix A presents methods and ideas for debugging complex programs. The book can serve as supplementary reading for senior undergraduates, postgraduates, and researchers or engineers working on speech signal processing, and as a reference for those engaged in signal processing research and applications more generally.
My graduate advisor's main research area was speech signal processing. Although my own thesis topic was digital image processing, speech and image processing are, as people say, members of the same family. I admit I coasted through my advisor's graduate lectures on the wavelet transform, but in his later course on the design and engineering application of speech signal processing I did pick up some of the tricks of the trade, and I finished first in my group in the end-of-course test; my advisor even handed out 300 yuan of meal money as encouragement.
Picking up speech recognition again, I have started from Mr. Song's book to review the basics. These posts mainly walk through the source code chapter by chapter; this one covers the 14 simulation examples of Chapter 10. Without further ado, let's begin!
1. Data and function path setting
Functions called frequently throughout the book (self-written functions as well as functions from other toolkits) have been collected in the basic_tbx toolbox; add this toolbox to the working path (via Set Path) before running any of the book's programs.
To run the EMD processing examples, put the EMD toolbox on the working path.
To run the main-body extension pitch detection, put the pitch_ztlib toolbox on the working path.
To run the time domain pitch synchronous overlap-add (TD-PSOLA) speech synthesis, put the PSOLA_lib toolbox on the working path.
To use the speech data provided with the book, you should also put the speech_signal folder on the working path.
All of the functions and programs in the book were debugged under MATLAB R2009a. (I use MATLAB R2015b; a few functions have since been replaced, so I modify the code where needed to get it running — a sketch of the typical substitutions follows.)
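For reference, this is a minimal sketch of the substitutions I make on newer MATLAB releases (the file name here is just a placeholder). Note also that calling audioplayer on its own, as some of the listings below do, creates a player object but does not actually start playback:

% wavread/wavplay were removed from recent MATLAB releases
% [x, fs] = wavread('example.wav');          % old style used in the book
[x, fs] = audioread('example.wav');          % current replacement

% wavplay(x, fs);                            % old playback call
sound(x, fs);                                % simple replacement
% or, with an explicit player object:
% p = audioplayer(x, fs); playblocking(p);   % play()/playblocking() must be called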
The search path can be set as follows:
Open MATLAB and, on the Home tab, click Set Path.
Add all of the folders listed above to the MATLAB search path.
Save the path, then run the simulations. (The same folders can also be added from the command line, as sketched below.)
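A minimal command-line sketch: the toolbox folder names follow the book, while the root folder is an assumed placeholder that you should adjust to wherever the book's material was unpacked.

% Add the book's toolboxes to the MATLAB search path
root = 'D:\speech_book';                   % placeholder location
addpath(fullfile(root, 'basic_tbx'));      % helper functions used throughout the book
addpath(fullfile(root, 'EMD'));            % needed only for the EMD examples
addpath(fullfile(root, 'pitch_ztlib'));    % main-body extension pitch detection
addpath(fullfile(root, 'PSOLA_lib'));      % TD-PSOLA synthesis
addpath(fullfile(root, 'speech_signal'));  % speech data supplied with the book
savepath;                                  % optional: keep the setting across sessions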
2. MATLAB simulation 1: overlap-add speech synthesis
% pr10_1_1
% Compare conv (direct convolution) with the overlap-add convolution function conv_ovladd1
clear all; clc; close all;
y=load('data1.txt')';                   % read in data
M=length(y);                            % data length
t=0:M-1;
h=fir1(100,0.125);                      % design FIR filter
x=conv(h,y);                            % digital filtering with the conv function
x=x(51:1050);                           % take the filter output without delay
z=conv_ovladd1(y,h,256);                % convolution by the overlap-add method
% plotting
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)-140)]);
plot(t,y,'k');
title('Original waveform of signal')
xlabel('Sample point'); ylabel('amplitude');
%xlabel('n'); ylabel('Am');
figure(2)
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)-100)]);
hline1 = plot(t,z,'k','LineWidth',1.5);
hline2 = line(t,x,'LineWidth',5,'Color',[.6 .6 .6]);
set(gca,'Children',[hline1 hline2]);
title('Comparison of direct convolution and overlap-add convolution')
xlabel('Sample point'); ylabel('amplitude');
legend('convolution conv','overlap-add convolution',2)
ylim([-0.8 1]);
function z=conv_ovladd1(x,h,L)
% Convolution by the overlap-add method
x=x(:)';                          % convert x to a row vector
NN=length(x);                     % length of x
N=length(h);                      % length of h
N1=L-N+1;                         % segment length for x
x1=[x zeros(1,L)];
H=fft(h,L);                       % FFT of h, giving H
J=fix((NN+L)/N1);                 % number of blocks
y=zeros(1,NN+2*L);                % initialization
for k=1 : J                       % process each segment
    xx=zeros(1,L);
    MI=(k-1)*N1;                  % start position of segment i in x, minus 1
    nn=1:N1;
    xx(nn)=x1(nn+MI);             % take one segment xi
    XX=fft(xx,L);                 % FFT
    YY=XX.*H;                     % convolution by multiplication
    yy=ifft(YY,L);                % inverse FFT, giving yi
    % overlap-add between adjacent segments
    if k==1                       % the first block needs no overlap-add
        for j=1 : L
            y(j)=y(j)+real(yy(j));
        end
    elseif k==J                   % the last block only overlap-adds N1 points
        for j=1 : N1
            y(MI+j)=y(MI+j)+real(yy(j));
        end
    else                          % from block 2 on, every block is overlap-added
        for j=1 : L
            y(MI+j)=y(MI+j)+real(yy(j));
        end
    end
end
nn=floor(N/2);
z=y(nn+1:NN+nn);                  % ignore the delay; output has the same length as the input x
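As a quick sanity check of conv_ovladd1 (assuming the Signal Processing Toolbox, which provides fir1 and fftfilt, is available): fftfilt also performs overlap-add FIR filtering, but it does not strip the 50-sample group delay that conv_ovladd1 removes at its end, so the delay is compensated manually here.

y  = load('data1.txt');                 % same data as in pr10_1_1
h  = fir1(100, 0.125);
z1 = conv_ovladd1(y, h, 256);           % overlap-add routine from the book
z2 = fftfilt(h, [y(:); zeros(50,1)]);   % built-in overlap-add filtering
z2 = z2(51:end);                        % discard the 50-sample delay
max(abs(z1(:) - z2(:)))                 % should be on the order of machine precision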
3. MATLAB simulation 2: overlap-save speech synthesis
% pr10_1_2
% Compare conv (direct convolution) with the overlap-save convolution function conv_ovlsav1
clear all; clc; close all;
y=load('data1.txt');                    % read in data
M=length(y);                            % data length
t=0:M-1;
h=fir1(100,0.125);                      % design FIR filter
x=conv(h,y);                            % digital filtering with the conv function
x=x(51:1050);                           % take the filter output without delay
z=conv_ovlsav1(y,h,256);                % convolution by the overlap-save method
% plotting
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)-140)]);
plot(t,y,'k');
title('Original waveform of signal')
xlabel('Sample point'); ylabel('amplitude');
figure(2)
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)-100)]);
hline1 = plot(t,z,'k','LineWidth',1.5);
hline2 = line(t,x,'LineWidth',5,'Color',[.6 .6 .6]);
set(gca,'Children',[hline1 hline2]);
title('Comparison of direct convolution and overlap-save convolution')
xlabel('Sample point'); ylabel('amplitude');
legend('direct convolution conv','overlap-save convolution',2)
ylim([-0.8 1]);
function y=conv_ovlsav1(x,h,L)
% Convolution by the overlap-save method
x=x(:)';
Lenx=length(x);                   % length of x
N=length(h);                      % length of h
N1=N-1;
M=L-N1;
H=fft(h,L);                       % FFT of h, giving H
x=[zeros(1,N1), x, zeros(1,L-1)]; % zero-pad before and after
K=floor((Lenx+N1-1)/M);           % number of frames
Y=zeros(K+1,L);                   % initialization
for k=0 : K                       % process each frame
    Xk=fft(x(k*M+1:k*M+L));       % FFT
    Y(k+1,:)=real(ifft(Xk.*H));   % convolution by multiplication
end
Y=Y(:,N:L)';                      % keep only the last M points of each frame
nm=fix(N/2);
y=Y(nm+1:nm+Lenx)';               % ignore the delay, flatten to one dimension; output has the same length as the input x
4. MATLAB simulation 3: linear-proportional overlap-add speech synthesis
% pr10_1_3
clear all; clc; close all;
load Labdata1                           % read in the experimental data and the extension data
N=length(xx);                           % data length
time=(0:N-1)/fs;                        % time scale
T=10091-8538+1;                         % length of the missing-data interval
x1=xx(1:8537);                          % data before the gap
x2=xx(10092:29554);                     % data after the gap
y1=ydata(:,1);                          % extension data 1
xx1=[x1; y1; x2];                       % synthesis with extension data 1
y2=ydata(:,2);                          % extension data 2
xx2=[x1; y2; x2];                       % synthesis with extension data 2
% overlap-add extension data 1 and extension data 2 in linear proportion
Wt1=(0:T-1)'/T;                         % ramp window function w1
Wt2=(T-1:-1:0)'/T;                      % ramp window function w2
y=y1.*Wt2+y2.*Wt1;                      % linear-proportional overlap-add
xx3=[x1; y; x2];                        % synthesized data
% plotting
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),pos(4)-240]);
plot(time,xx,'k'); axis([0 29.6 -15 10]);
title('Waveform of original signal'); xlabel('time/s'); ylabel('amplitude')
line([8.537 8.537],[-15 10],'color','k','linestyle','-');
line([10.092 10.092],[-15 10],'color','k','linestyle','--');
figure(2)
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),pos(4)+50]);
subplot 221; plot(time,xx1,'k'); axis([9.5 10.5 -10 5]);
line([10.091 10.091],[-15 10],'color','k','linestyle','-.');
title('Waveform synthesized after the first segment extension'); xlabel(['time/s' 10 '(a)']); ylabel('amplitude')
subplot 222; plot(time,xx2,'k'); axis([8 9.5 -10 5]);
line([8.538 8.538],[-15 10],'color','k','linestyle','-.');
title('Waveform synthesized after the second segment extension'); xlabel(['time/s' 10 '(b)']); ylabel('amplitude')
subplot 212; plot(time,xx3,'k'); axis([0 29.6 -15 10]);
line([8.537 8.537],[-15 10],'color','k','linestyle','-');
line([10.092 10.092],[-15 10],'color','k','linestyle','--');
title('Waveform synthesized by linear-proportional overlap-add'); xlabel(['time/s' 10 '(c)']); ylabel('amplitude')
5. MATLAB simulation 4: speech synthesis by the spectrum parameter method
% pr10_3_1
clear all; clc; close all;
filedir=[];                             % set the path
filename='colorcloud.wav';              % set the file name
fle=[filedir filename];                 % assemble the full path and file name
% [x, fs, bits] = wavread(fle);         % read in the data file (older MATLAB)
[x, fs] = audioread(fle);               % read in the data file
x=x-mean(x);                            % remove the DC component
x=x/max(abs(x));                        % amplitude normalization
xl=length(x);                           % data length
time=(0:xl-1)/fs;                       % time scale
p=12;                                   % LPC order is 12
wlen=200; inc=80;                       % frame length and frame shift
msoverlap = wlen - inc;                 % overlap length per frame
y=enframe(x,wlen,inc)';                 % framing
fn=size(y,2);                           % number of frames
% speech analysis: LPC coefficients and prediction error of each frame
for i=1 : fn
    u=y(:,i);                           % take one frame
    A=lpc(u,p);                         % compute the LPC coefficients
    aCoeff(:,i)=A;                      % store in the aCoeff array
    errSig = filter(A,1,u);             % prediction error sequence
    resid(:,i) = errSig;                % store in the resid array
end
% speech synthesis: the synthesized frames are joined into a continuous speech signal
for i=1:fn
    A = aCoeff(:,i);                    % prediction coefficients of this frame
    residFrame = resid(:,i);            % prediction error of this frame
    synFrame = filter(1, A', residFrame);   % excite with the prediction error to synthesize speech
    outspeech((i-1)*inc+1:i*inc)=synFrame(1:inc);   % store the data in overlap-save fashion
    if i==fn                            % for the last frame, append the samples beyond inc
        outspeech(fn*inc+1:(fn-1)*inc+wlen)=synFrame(inc+1:wlen);
    end
end
ol=length(outspeech);
if ol<xl                                % zero-pad outspeech to the same length as x
    outspeech=[outspeech zeros(1,xl-ol)];
end
% playback
% wavplay(x,fs);
playblocking(audioplayer(x,fs));        % audioplayer alone does not start playback
pause(1)
% wavplay(outspeech,fs);
playblocking(audioplayer(outspeech,fs));
% plotting
subplot 211; plot(time,x,'k');
xlabel(['time/s' 10 '(a)']); ylabel('amplitude'); ylim([-1 1.1]);
title('Original speech signal')
subplot 212; plot(time,outspeech,'k');
xlabel(['time/s' 10 '(b)']); ylabel('amplitude'); ylim([-1 1.1]);
title('Synthetic speech signal')
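A small self-contained illustration of why this works (random data, not from the book; assumes the Signal Processing Toolbox's lpc): within a single frame, filtering with A(z) and then with 1/A(z) from zero initial state is an exact round trip, which is why the frame-by-frame analysis-synthesis in pr10_3_1 reproduces the original signal almost exactly.

u  = randn(200,1);                  % stand-in for one 200-sample frame
A  = lpc(u, 12);                    % 12th-order LPC coefficients
e  = filter(A, 1, u);               % prediction error (analysis)
u2 = filter(1, A, e);               % excite 1/A(z) with the error (synthesis)
max(abs(u - u2))                    % ~ machine precision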
6. MATLAB simulation 5: speech synthesis from linear prediction coefficients and prediction error
% % pr10_4_1 clear all; clc; close all; filedir=[]; % Set the path of the data file filename='colorcloud.wav'; % Set the name of the data file fle=[filedir filename] % The string that makes up the path and file name % [xx,fs]=wavread(fle); % read file [xx,fs]=audioread(fle); % read file xx=xx-mean(xx); % Remove DC component x=xx/max(abs(xx)); % normalization N=length(x); % Data length time=(0:N-1)/fs; % Time scale wlen=240; % Frame length inc=80; % Frame shift overlap=wlen-inc; % Overlap length tempr1=(0:overlap-1)'/overlap; % Oblique triangular window function w1 tempr2=(overlap-1:-1:0)'/overlap; % Oblique triangular window function w2 n2=1:wlen/2+1; % Subscript value of positive frequency wind=hanning(wlen); % Window function X=enframe(x,wind,inc)'; % Framing fn=size(X,2); % Number of frames T1=0.1; r2=0.5; % Endpoint detection parameters miniL=10; % Minimum number of frames of a segment mnlong=5; % Minimum frame number of vowel body ThrC=[10 15]; % threshold p=12; % LPC Order frameTime=frame2time(fn,wlen,inc,fs); % Calculate the timescale for each frame for i=1 : fn % Calculate the linear prediction coefficient and gain of each frame u=X(:,i); [ar,g]=lpc(u,p); AR_coeff(:,i)=ar; Gain(i)=g; end % pitch detection [Dpitch,Dfreq,Ef,SF,voiceseg,vosl,vseg,vsl,T2]=... Ext_F0ztms(xx,fs,wlen,inc,T1,r2,miniL,mnlong,ThrC,0); tal=0; % Initialize leading zero zint=zeros(p,1); for i=1:fn; ai=AR_coeff(:,i); % Get page i Prediction coefficient of frame sigma_square=Gain(i); % Get page i Gain coefficient of frame sigma=sqrt(sigma_square); if SF(i)==0 % Wordless frame excitation=randn(wlen,1); % Generate white noise [synt_frame,zint]=filter(sigma,ai,excitation,zint); % Synthesizing speech with white noise else % Talking frame PT=round(Dpitch(i)); % Take periodic value exc_syn1 =zeros(wlen+tal,1); % Initialization pulse generation area exc_syn1(mod(1:tal+wlen,PT)==0) = 1; % A pulse is generated at the position of the pitch period with an amplitude of 1 exc_syn2=exc_syn1(tal+1:tal+inc); % Calculate frame shift inc Number of pulses in the interval index=find(exc_syn2==1); excitation=exc_syn1(tal+1:tal+wlen);% The excitation pulse source of this frame if isempty(index) % Frame shift inc There is no pulse in the interval tal=tal+inc; % Calculate the leading zero point of the next frame else % Frame shift inc There are pulses in the interval eal=length(index); % How many pulses are there in the calculation tal=inc-index(eal); % Calculate the leading zero point of the next frame end gain=sigma/sqrt(1/PT); % gain [synt_frame,zint]=filter(gain, ai,excitation,zint); % Speech synthesis using impulse end if i==1 % If frame 1 output=synt_frame; % Overlapping addition is not required,Preserve composite data else M=length(output); % Processing synthetic data by overlapping and adding in linear proportion output=[output(1:M-overlap); output(M-overlap+1:M).*tempr2+... 
synt_frame(1:overlap).*tempr1; synt_frame(overlap+1:wlen)]; end end ol=length(output); % Output output Extended to and input signal xx Equal length if ol<N output1=[output; zeros(N-ol,1)]; else output1=output(1:N); end bn=[0.964775 -3.858862 5.788174 -3.858862 0.964775]; % Filter coefficient an=[1.000000 -3.928040 5.786934 -3.789685 0.930791]; output=filter(bn,an,output1); % High pass filtering output=output/max(abs(output)); % Amplitude normalization % Pronunciation through sound card,Compare original speech and synthetic speech % wavplay(x,fs); audioplayer(x,fs); pause(1) % wavplay(output,fs); audioplayer(output,fs); %Mapping figure(1) pos = get(gcf,'Position'); set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)+85)]) subplot 411; plot(time,xx,'k'); axis([0 max(time) -1 1.2]); xlabel('(a)'); title('Signal waveform'); ylabel('amplitude') subplot 412; plot(frameTime,Ef,'k'); hold on axis([0 max(time) 0 1.2]); plot(frameTime,T2,'k','linewidth',2); line([0 max(time)],[T1 T1],'color','k','linestyle','-.'); title('Energy entropy ratio diagram'); axis([0 max(time) 0 1.2]); ylabel('amplitude') xlabel('(b)'); text(3.2,T1+0.05,'T1'); for k=1 : vsl line([frameTime(vseg(k).begin) frameTime(vseg(k).begin)],... [0 1.2],'color','k','Linestyle','-'); line([frameTime(vseg(k).end) frameTime(vseg(k).end)],... [0 1.2],'color','k','Linestyle','--'); if k==vsl Tx=T2(floor((vseg(k).begin+vseg(k).end)/2)); end end text(2.65,Tx+0.05,'T2'); subplot 413; plot(frameTime,Dpitch,'k'); axis([0 max(time) 0 110]);title('Pitch period');ylabel('Sample value') xlabel( '(c)'); subplot 414; plot(frameTime,Dfreq,'k'); axis([0 max(time) 0 250]);title('Pitch frequency');ylabel('frequency/Hz') xlabel(['time/s' 10 '(d)']); figure(2) subplot 211; plot(time,x,'k'); title('Original speech waveform'); axis([0 max(time) -1 1.1]); xlabel('time/s'); ylabel('amplitude') subplot 212; plot(time,output,'k'); title('Synthetic speech waveform'); axis([0 max(time) -1 1.1]); xlabel('time/s'); ylabel('amplitude')
7. MATLAB simulation 6: pitch and formant speech synthesis 1
% pr10_5_1
clear all; clc; close all;
F0 = [700 900 2600]; Bw = [130 70 160]; fs=8000;
[An,Bn]=formant2filter4(F0,Bw,fs);      % call the function to compute the filter coefficients
for k=1 : 4                             % frequency response curves of the four second-order bandpass filters
    A=An(:,k);                          % get the filter coefficients
    B=Bn(k);
    fprintf('B=%5.6f A=%5.6f %5.6f %5.6f\n',B,A);
    [H(:,k),w]=freqz(B,A);              % compute the response curve
end
freq=w/pi*fs/2;                         % frequency axis scale
% plotting
plot(freq,abs(H(:,1)),'k',freq,abs(H(:,2)),'k',freq,abs(H(:,3)),'k',freq,abs(H(:,4)),'k');
axis([0 4000 0 1.05]); grid;
line([F0(1) F0(1)],[0 1.05],'color','k','linestyle','-.');
line([F0(2) F0(2)],[0 1.05],'color','k','linestyle','-.');
line([F0(3) F0(3)],[0 1.05],'color','k','linestyle','-.');
line([3500 3500],[0 1.05],'color','k','linestyle','-.');
title('Response curves of the second-order bandpass filters for three formants and one fixed frequency')
ylabel('amplitude'); xlabel('frequency/Hz')
8. MATLAB simulation 7: pitch and formant speech synthesis 2
% pr10_5_2
clear all; clc; close all;
Fs=8000;                                % sampling frequency
Fs2=Fs/2;
fp=60; fs=20;                           % passband and stopband edge frequencies
wp=fp/Fs2; ws=fs/Fs2;                   % convert to normalized frequencies
Rp=1; Rs=40;                            % passband ripple and stopband attenuation
[n,Wn]=cheb2ord(wp,ws,Rp,Rs);           % compute the filter order
[b1,a1]=cheby2(n,Rs,Wn,'high');         % compute the filter coefficients
fprintf('b=%5.6f %5.6f %5.6f %5.6f %5.6f\n',b1);
fprintf('a=%5.6f %5.6f %5.6f %5.6f %5.6f\n',a1);
fprintf('\n');
[db,mag,pha,grd,w]=freqz_m(b1,a1);      % frequency response of the high-pass filter
a=[1 -0.99];
db1=freqz_m(1,a);                       % frequency response of the low-pass filter
A=conv(a,a1);                           % coefficients of the cascaded filter
B=b1;
db2=freqz_m(B,A);
fprintf('B=%5.6f %5.6f %5.6f %5.6f %5.6f\n',B);
fprintf('A=%5.6f %5.6f %5.6f %5.6f %5.6f %5.6f\n',A);
% plotting
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),pos(4)+100]);
subplot 221; plot(w/pi*Fs2,db1,'k');
title('Amplitude-frequency response of the low-pass filter')
ylim([-50 0]); ylabel('amplitude/dB'); xlabel(['frequency/Hz' 10 '(a)']);
subplot 222; plot(w/pi*Fs2,db,'k');
title('Amplitude-frequency response of the high-pass filter')
axis([0 500 -50 5]); ylabel('amplitude/dB'); xlabel(['frequency/Hz' 10 '(b)']);
subplot 212; semilogx(w/pi*Fs2,db2,'k');
title('Amplitude-frequency response of the bandpass filter');
ylabel('amplitude/dB'); xlabel(['frequency/Hz' 10 '(c)']);
axis([10 3500 -40 5]); grid
9. MATLAB simulation 8: pitch and formant speech synthesis 3
% % pr10_5_3 clear all; clc; close all; filedir=[]; % Set the path of the data file filename='colorcloud.wav'; % Set the name of the data file fle=[filedir filename] % The string that makes up the path and file name % [xx,fs]=wavread(fle); % read file [xx,fs]=audioread(fle); % read file xx=xx-mean(xx); % Remove DC component x1=xx/max(abs(xx)); % normalization x=filter([1 -.99],1,x1); % Pre aggravation N=length(x); % Data length time=(0:N-1)/fs; % Time scale of signal wlen=240; % Frame length inc=80; % Frame shift overlap=wlen-inc; % Overlap length tempr1=(0:overlap-1)'/overlap; % Oblique triangular window function w1 tempr2=(overlap-1:-1:0)'/overlap; % Oblique triangular window function w2 n2=1:wlen/2+1; % Subscript value of positive frequency wind=hanning(wlen); % Window function X=enframe(x,wlen,inc)'; % Framing fn=size(X,2); % Number of frames Etemp=sum(X.*X); % Calculate the energy per frame Etemp=Etemp/max(Etemp); % Energy normalization T1=0.1; r2=0.5; % Endpoint detection parameters miniL=10; % Minimum number of frames of a segment mnlong=5; % Minimum frame number of vowel body ThrC=[10 15]; % threshold p=12; % LPC Order frameTime=frame2time(fn,wlen,inc,fs); % Time scale corresponding to each frame Doption=0; % Use subject-Extended pitch detection [Dpitch,Dfreq,Ef,SF,voiceseg,vosl,vseg,vsl,T2]=... Ext_F0ztms(x1,fs,wlen,inc,T1,r2,miniL,mnlong,ThrC,Doption); %% Formant extraction Frmt=Formant_ext2(x,wlen,inc,fs,SF,Doption); Bwm=[150 200 250]; % Set fixed bandwidth Bw=repmat(Bwm',1,fn); %% speech synthesis zint=zeros(2,4); % initialization tal=0; for i=1 : fn yf=Frmt(:,i); % Fetch i Three formant frequencies and bandwidth of the frame bw=Bw(:,i); [an,bn]=formant2filter4(yf,bw,fs); % Convert to four second-order filter coefficients synt_frame=zeros(wlen,1); if SF(i)==0 % Wordless frame excitation=randn(wlen,1); % Generate white noise for k=1 : 4 % Parallel input of four filters An=an(:,k); Bn=bn(k); [out(:,k),zint(:,k)]=filter(Bn(1),An,excitation,zint(:,k)); synt_frame=synt_frame+out(:,k); % The four filter outputs are superimposed together end else % Talking frame PT=round(Dpitch(i)); % Take cycle value exc_syn1 =zeros(wlen+tal,1); % Initialization pulse generation area exc_syn1(mod(1:tal+wlen,PT)==0)=1;% A pulse is generated at the position of the pitch period with an amplitude of 1 exc_syn2=exc_syn1(tal+1:tal+inc); % Calculate frame shift inc Number of pulses in the interval index=find(exc_syn2==1); excitation=exc_syn1(tal+1:tal+wlen);% The excitation pulse source of this frame if isempty(index) % Frame shift inc There is no pulse in the interval tal=tal+inc; % Calculate the leading zero point of the next frame else % Frame shift inc There are pulses in the interval eal=length(index); % How many pulses are there in the calculation tal=inc-index(eal); % Calculate the leading zero point of the next frame end for k=1 : 4 % Parallel input of four filters An=an(:,k); Bn=bn(k); [out(:,k),zint(:,k)]=filter(Bn(1),An,excitation,zint(:,k)); synt_frame=synt_frame+out(:,k); % The four filter outputs are superimposed together end end Et=sum(synt_frame.*synt_frame); % Synthesis of speech using energy normalization rt=Etemp(i)/Et; synt_frame=sqrt(rt)*synt_frame; if i==1 % If frame 1 output=synt_frame; % Overlapping addition is not required,Preserve composite data else M=length(output); % Processing by linear addition of overlapping data output=[output(1:M-overlap); output(M-overlap+1:M).*tempr2+... 
synt_frame(1:overlap).*tempr1; synt_frame(overlap+1:wlen)]; end end ol=length(output); % Output output Extended to and input signal xx Equal length if ol<N output=[output; zeros(N-ol,1)]; end % Synthetic speech passes through a band-pass filter out1=output; out2=filter(1,[1 -0.99],out1); b=[0.964775 -3.858862 5.788174 -3.858862 0.964775]; a=[1.000000 -3.928040 5.786934 -3.789685 0.930791]; output=filter(b,a,out2); output=output/max(abs(output)); % Pronunciation through sound card,Compare original speech and synthetic speech % wavplay(xx,fs); audioplayer(xx,fs); pause(1) % wavplay(output,fs); audioplayer(output,fs); %% Mapping figure(1) figure(1) pos = get(gcf,'Position'); set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)+85)]) subplot 411; plot(time,x1,'k'); axis([0 max(time) -1 1.1]); title('Signal waveform'); ylabel('amplitude') subplot 412; plot(frameTime,Ef,'k'); hold on axis([0 max(time) 0 1.2]); plot(frameTime,T2,'k','linewidth',2); line([0 max(time)],[T1 T1],'color','k','linestyle','-.'); title('Energy entropy ratio diagram'); axis([0 max(time) 0 1.2]); ylabel('amplitude') text(3.2,T1+0.05,'T1'); for k=1 : vsl line([frameTime(vseg(k).begin) frameTime(vseg(k).begin)],... [0 1.2],'color','k','Linestyle','-'); line([frameTime(vseg(k).end) frameTime(vseg(k).end)],... [0 1.2],'color','k','Linestyle','--'); if k==vsl Tx=T2(floor((vseg(k).begin+vseg(k).end)/2)); end end text(2.65,Tx+0.05,'T2'); subplot 413; plot(frameTime,Dpitch,'k'); axis([0 max(time) 0 110]);title('Pitch period'); ylabel('Sample value') subplot 414; plot(frameTime,Dfreq,'k'); axis([0 max(time) 0 250]);title('Pitch frequency'); ylabel('frequency/Hz') xlabel('time/s'); figure(2) subplot 211; plot(time,x1,'k'); title('Original speech waveform'); axis([0 max(time) -1 1.1]); xlabel('time/s'); ylabel('amplitude') subplot 212; plot(time,output,'k'); title('Synthetic speech waveform'); axis([0 max(time) -1 1.1]); xlabel('time/s'); ylabel('amplitude') figure(3) out1=out1/max(out1); subplot 311; plot(time,out1,'k'); title('Waveform before passing the filter out1') ylabel('amplitude'); ylim([-0.5 1]); xlabel('(a)'); out2=out2/max(out2); subplot 312; plot(time,out2,'k'); title('Waveform after passing through low-pass filter out2') ylabel('amplitude'); ylim([-0.5 1]); xlabel('(b)'); subplot 313; plot(time,output,'k'); title('Waveform after passing through high pass filter output') xlabel(['time/s' 10 '(c)']); ylabel('amplitude')
10. MATLAB simulation 9: pitch and formant speech synthesis 4
% % pr10_5_4 clear all; clc; close all; filedir=[]; % Set the path of the data file filename='colorcloud.wav'; % Set the name of the data file fle=[filedir filename] % The string that makes up the path and file name % [xx,fs]=wavread(fle); % read file [xx,fs]=audioread(fle); % read file xx=xx-mean(xx); % Remove DC component x1=xx/max(abs(xx)); % normalization x=filter([1 -.99],1,x1); % Pre aggravation N=length(x); % Data length time=(0:N-1)/fs; % Time scale of signal wlen=240; % Frame length inc=80; % Frame shift overlap=wlen-inc; % Overlap length tempr1=(0:overlap-1)'/overlap; % Oblique triangular window function w1 tempr2=(overlap-1:-1:0)'/overlap; % Oblique triangular window function w2 n2=1:wlen/2+1; % Subscript value of positive frequency wind=hanning(wlen); % Window function X=enframe(x,wlen,inc)'; % Framing fn=size(X,2); % Number of frames Etemp=sum(X.*X); % Calculate energy per frame Etemp=Etemp/max(Etemp); % Energy normalization T1=0.1; r2=0.5; % Endpoint detection parameters miniL=10; % Minimum number of frames of a segment mnlong=5; % Minimum frame number of vowel body ThrC=[10 12]; % threshold p=12; % LPC Order frameTime=frame2time(fn,wlen,inc,fs); % Time scale corresponding to each frame % pitch detection [Dpitch,Dfreq,Ef,SF,voiceseg,vosl,vseg,vsl,T2]=... Ext_F0ztms(xx,fs,wlen,inc,T1,r2,miniL,mnlong,ThrC,0); const=fs/(2*pi); % constant Frmt=ones(3,fn)*nan; % initialization Bw=ones(3,fn)*nan; zint=zeros(2,4); % initialization tal=0; cepstL=6; % Width of window function on inverted frequency wlen2=wlen/2; % Take half the frame length df=fs/wlen; % Calculate the frequency scale in the frequency domain for i=1 : fn %% Extraction of formant and bandwidth u=X(:,i); % Take a frame of data u2=u.*wind; % Signal windowing function U=fft(u2); % Press formula(10-5-6)calculation FFT U2=abs(U(1:wlen2)).^2; % Calculated power spectrum U_abs=log(U2); % Press formula(10-5-7)Calculate logarithm Cepst=ifft(U_abs); % Press formula(10-5-8)calculation IDFT cepst=zeros(1,wlen2); cepst(1:cepstL)=Cepst(1:cepstL); % Press formula(10-5-9)Windowing function cepst(end-cepstL+2:end)=Cepst(end-cepstL+2:end); spect=real(fft(cepst)); % Press formula(10-5-10)calculation DFT [Loc,Val]=findpeaks(spect); % Find peak Spe=exp(spect); % Press formula(10-5-11)Calculate the linear power spectrum value ll=min(3,length(Loc)); for k=1 : ll m=Loc(k); % set up m-1,m and m+1 m1=m-1; m2=m+1; pp=Spe(m); % set up P(m-1),P(m)and P(m+1) pp1=Spe(m1); pp2=Spe(m2); aa=(pp1+pp2)/2-pp; % Press formula(10-5-13)calculation bb=(pp2-pp1)/2; cc=pp; dm=-bb/2/aa; Pp=-bb*bb/4/aa+cc; % Press formula(10-5-14)calculation m_new=m+dm; bf=-sqrt(bb*bb-4*aa*(cc-Pp/2))/aa; F(k)=(m_new-1)*df; % Press formula(10-5-12)Calculate formant frequency bw(k)=bf*df; % Press formula(10-5-12)Calculate formant bandwidth end Frmt(:,i)=F; % Store formant frequency in Frmt in Bw(:,i)=bw; % Store bandwidth in Bw in end %% speech synthesis for i=1 : fn yf=Frmt(:,i); % Fetch i Three formant frequencies and bandwidth of the frame bw=Bw(:,i); [an,bn]=formant2filter4(yf,bw,fs); % Convert to four second-order filter coefficients synt_frame=zeros(wlen,1); if SF(i)==0 % Wordless frame excitation=randn(wlen,1); % Generate white noise for k=1 : 4 % Parallel input of four filters An=an(:,k); Bn=bn(k); [out(:,k),zint(:,k)]=filter(Bn(1),An,excitation,zint(:,k)); synt_frame=synt_frame+out(:,k); % The four filter outputs are superimposed together end else % Talking frame PT=round(Dpitch(i)); % Take periodic value exc_syn1 =zeros(wlen+tal,1); % Initialization pulse generation area 
exc_syn1(mod(1:tal+wlen,PT)==0)=1; % A pulse is generated at the position of the pitch period with an amplitude of 1 exc_syn2=exc_syn1(tal+1:tal+inc); % Calculate frame shift inc Number of pulses in the interval index=find(exc_syn2==1); excitation=exc_syn1(tal+1:tal+wlen);% The excitation pulse source of this frame if isempty(index) % Frame shift inc There is no pulse in the interval tal=tal+inc; % Calculate the leading zero point of the next frame else % Frame shift inc There are pulses in the interval eal=length(index); % How many pulses are there in the calculation tal=inc-index(eal); % Calculate the leading zero point of the next frame end for k=1 : 4 % Parallel input of four filters An=an(:,k); Bn=bn(k); [out(:,k),zint(:,k)]=filter(Bn(1),An,excitation,zint(:,k)); synt_frame=synt_frame+out(:,k); % The four filter outputs are superimposed together end end Et=sum(synt_frame.*synt_frame); % Synthesis of speech using energy normalization rt=Etemp(i)/Et; synt_frame=sqrt(rt)*synt_frame; synt_speech_HF(:,i)=synt_frame; if i==1 % If frame 1 output=synt_frame; % Overlapping addition is not required,Preserve composite data else M=length(output); % Processing synthetic data by overlapping and adding in linear proportion output=[output(1:M-overlap); output(M-overlap+1:M).*tempr2+... synt_frame(1:overlap).*tempr1; synt_frame(overlap+1:wlen)]; end end ol=length(output); % Output output Extended to and input signal xx Equal length if ol<N output=[output; zeros(N-ol,1)]; end out1=output; b=[0.964775 -3.858862 5.788174 -3.858862 0.964775]; % Filter coefficient a=[1.000000 -4.918040 9.675693 -9.518749 4.682579 -0.921483]; output=filter(b,a,out1); % Filter connected in series through low-pass and high pass output=output/max(abs(output)); % Amplitude normalization % Pronunciation through sound card,Compare original speech and synthetic speech % wavplay(xx,fs); audioplayer(xx,fs); pause(1) % wavplay(output,fs); audioplayer(output,fs); %% Mapping figure(1) pos = get(gcf,'Position'); set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)+85)]) subplot 411; plot(time,x1,'k'); axis([0 max(time) -1 1.1]); title('Signal waveform'); ylabel('amplitude'); subplot 412; plot(frameTime,Ef,'k'); hold on axis([0 max(time) 0 1.2]); plot(frameTime,T2,'k','linewidth',2); line([0 max(time)],[T1 T1],'color','k','linestyle','-.'); title('Energy entropy ratio diagram'); axis([0 max(time) 0 1.2]); ylabel('amplitude'); text(3.2,T1+0.05,'T1'); for k=1 : vsl line([frameTime(vseg(k).begin) frameTime(vseg(k).begin)],... [0 1.2],'color','k','Linestyle','-'); line([frameTime(vseg(k).end) frameTime(vseg(k).end)],... [0 1.2],'color','k','Linestyle','--'); if k==vsl Tx=T2(floor((vseg(k).begin+vseg(k).end)/2)); end end text(2.65,Tx+0.05,'T2'); subplot 413; plot(frameTime,Dpitch,'k'); axis([0 max(time) 0 110]);title('Pitch period'); ylabel('Sample value'); subplot 414; plot(frameTime,Dfreq,'k'); axis([0 max(time) 0 250]);title('Pitch frequency'); ylabel('frequency/Hz'); xlabel('time/s'); figure(2) subplot 211; plot(time,x1,'k'); title('Original speech waveform'); axis([0 max(time) -1 1.1]); xlabel('time/s'); ylabel('amplitude') subplot 212; plot(time,output,'k'); title('Synthetic speech waveform'); axis([0 max(time) -1 1.1]); xlabel('time/s'); ylabel('amplitude')
11. MATLAB simulation 10: speed change (time-scale modification) of the speech signal
% % pr10_6_1 clear all; clc; close all; filedir=[]; % Set the path of the data file filename='colorcloud.wav'; % Set the name of the data file fle=[filedir filename] % The string that makes up the path and file name % [xx,fs]=wavread(fle); % read file [xx,fs]=audioread(fle); % read file xx=xx-mean(xx); % Remove DC component x=xx/max(abs(xx)); % normalization N=length(x); % Data length time=(0:N-1)/fs; % Time scale of signal wlen=240; % Frame length inc=80; % Frame shift overlap=wlen-inc; % Overlap length tempr1=(0:overlap-1)'/overlap; % Oblique triangular window function w1 tempr2=(overlap-1:-1:0)'/overlap; % Oblique triangular window function w2 n2=1:wlen/2+1; % Subscript value of positive frequency X=enframe(x,wlen,inc)'; % Framing fn=size(X,2); % Number of frames T1=0.1; r2=0.5; % Endpoint detection parameters miniL=10; % Minimum number of frames of a segment mnlong=5; % Minimum frame number of vowel body ThrC=[10 15]; % threshold p=12; % LPC Order frameTime=frame2time(fn,wlen,inc,fs); % Calculate the timescale for each frame in=input('Please enter that the time length of the telescopic voice is a multiple of the original voice time length:','s');%Enter the extension length scale rate=str2num(in); for i=1 : fn % Calculate the prediction coefficient and gain of each frame u=X(:,i); [ar,g]=lpc(u,p); AR_coeff(:,i)=ar; Gain(i)=g; end % pitch detection [Dpitch,Dfreq,Ef,SF,voiceseg,vosl,vseg,vsl,T2]=... Ext_F0ztms(x,fs,wlen,inc,T1,r2,miniL,mnlong,ThrC,0); tal=0; % initialization zint=zeros(p,1); %% LSP Parameter extraction for i=1 : fn a2=AR_coeff(:,i); % Get the prediction coefficient of this frame lsf=ar2lsf(a2); % call ar2lsf Function solution lsf Glsf(:,i)=lsf; % hold lsf store in Glsf In array end % Shorten or lengthen the corresponding array by interpolation fn1=floor(rate*fn); % Sets the total number of new frames fn1 Glsfm=interp1((1:fn),Glsf',linspace(1,fn,fn1))';% hold LSF Coefficient interpolation Dpitchm=interp1(1:fn,Dpitch,linspace(1,fn,fn1));% Interpolate the pitch period Gm=interp1((1:fn),Gain,linspace(1,fn,fn1));%Interpolate the gain coefficient SFm=interp1((1:fn),SF,linspace(1,fn,fn1)); %hold SF Coefficient interpolation %% speech synthesis for i=1:fn1; lsf=Glsfm(:,i); % Get the of this frame lsf parameter ai=lsf2ar(lsf); % call lsf2ar Function handle lsf Convert to prediction coefficient ar sigma=sqrt(Gm(i)); if SFm(i)==0 % Wordless frame excitation=randn(wlen,1); % Generate white noise [synt_frame,zint]=filter(sigma,ai,excitation,zint); else % Talking frame PT=round(Dpitchm(i)); % Take cycle value exc_syn1 =zeros(wlen+tal,1); % Initialization pulse generation area exc_syn1(mod(1:tal+wlen,PT)==0)=1;% A pulse is generated at the position of the pitch period with an amplitude of 1 exc_syn2=exc_syn1(tal+1:tal+inc); % Calculate frame shift inc Number of pulses in the interval index=find(exc_syn2==1); excitation=exc_syn1(tal+1:tal+wlen);% The excitation pulse source of this frame if isempty(index) % Frame shift inc There is no pulse in the interval tal=tal+inc; % Calculate the leading zero point of the next frame else % Frame shift inc There are pulses in the interval eal=length(index); % How many pulses are there in the calculation tal=inc-index(eal); % Calculate the leading zero point of the next frame end gain=sigma/sqrt(1/PT); % gain [synt_frame,zint]=filter(gain,ai,excitation,zint);%Speech synthesis using excitation pulses end if i==1 % If frame 1 output=synt_frame; % Overlapping addition is not required,Preserve composite data else M=length(output); % Treatment of 
overlapping parts output=[output(1:M-overlap); output(M-overlap+1:M).*tempr2+... synt_frame(1:overlap).*tempr1; synt_frame(overlap+1:wlen)]; end end output(find(isnan(output)))=0; bn=[0.964775 -3.858862 5.788174 -3.858862 0.964775]; % Filter coefficient an=[1.000000 -3.928040 5.786934 -3.789685 0.930791]; output=filter(bn,an,output); % High pass filtering output=output/max(abs(output)); % Amplitude normalization % Pronunciation through sound card,Compare original speech and synthetic speech % wavplay(x,fs); audioplayer(x,fs); pause(1) % wavplay(output,fs); audioplayer(output,fs); %% Mapping figure(1) ol=length(output); % Output data length time1=(0:ol-1)/fs; % Find the time series of the output sequence pos = get(gcf,'Position'); set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)+85)]) subplot 411; plot(time,x,'k'); xlim([0 max(time)]); title('Signal waveform'); ylabel('amplitude'); subplot 412; plot(frameTime,Ef,'k'); hold on axis([0 max(time) 0 1.2]); plot(frameTime,T2,'k','linewidth',2); line([0 max(time)],[T1 T1],'color','k','linestyle','-.'); title('Energy entropy ratio diagram'); axis([0 max(time) 0 1.2]); ylabel('amplitude'); text(3.2,T1+0.05,'T1'); for k=1 : vsl line([frameTime(vseg(k).begin) frameTime(vseg(k).begin)],... [0 1.2],'color','k','Linestyle','-'); line([frameTime(vseg(k).end) frameTime(vseg(k).end)],... [0 1.2],'color','k','Linestyle','--'); if k==vsl Tx=T2(floor((vseg(k).begin+vseg(k).end)/2)); end end text(2.6,Tx+0.05,'T2'); subplot 413; plot(frameTime,Dpitch,'k'); axis([0 max(time) 0 100]);title('Pitch period'); ylabel('Sample value'); subplot 414; plot(frameTime,Dfreq,'k'); axis([0 max(time) 0 450]);title('Pitch frequency'); ylabel('frequency/Hz'); xlabel('time/s'); figure(2) subplot 211; plot(time,x,'k'); title('Original speech waveform'); axis([0 max(time) -1 1]); xlabel('time/s'); ylabel('amplitude') subplot 212; plot(time1,output,'k'); title('Synthetic speech waveform'); xlim([0 max(time1)]); xlabel('time/s'); ylabel('amplitude')
fle = colorcloud.wav
Please enter that the time length of the telescopic voice is a multiple of the original voice time length: 1.5
12. MATLAB simulation 11: tone change (pitch modification) of the speech signal
% % pr10_6_2 clear all; clc; close all; filedir=[]; % Set the path of the data file filename='colorcloud.wav'; % Set the name of the data file fle=[filedir filename] % The string that makes up the path and file name % [xx,fs]=wavread(fle); % read file [xx,fs]=audioread(fle); % read file xx=xx-mean(xx); % Remove DC component x=xx/max(abs(xx)); % Amplitude normalization lx=length(x); % Data length time=(0:lx-1)/fs; % Find the corresponding time series wlen=240; % Set frame length inc=80; % Sets the length of the frame shift overlap=wlen-inc; % Overlap length tempr1=(0:overlap-1)'/overlap; % Oblique triangular window function w1 tempr2=(overlap-1:-1:0)'/overlap; % Oblique triangular window function w2 n2=1:wlen/2+1; % Subscript value of positive frequency X=enframe(x,wlen,inc)'; % Framing according to parameters fn=size(X,2); % Total frames T1=0.1; r2=0.5; % Endpoint detection parameters miniL=10; % Minimum number of frames of a segment mnlong=5; % Minimum frame number of vowel body ThrC=[10 15]; % threshold p=12; % LPC Order frameTime=frame2time(fn,wlen,inc,fs); % Time scale corresponding to each frame in=input('Please enter the multiple of pitch frequency rise and fall:','s'); % Input pitch frequency increase / decrease ratio rate=str2num(in); for i=1 : fn % Calculate the prediction coefficient and gain of each frame u=X(:,i); [ar,g]=lpc(u,p); AR_coeff(:,i)=ar; Gain(i)=g; end % pitch detection [Dpitch,Dfreq,Ef,SF,voiceseg,vosl,vseg,vsl,T2]=... Ext_F0ztms(x,fs,wlen,inc,T1,r2,miniL,mnlong,ThrC,0); if rate>1, sign=1; else sign=-1; end lmin=floor(fs/450); % Minimum pitch period lmax=floor(fs/60); % Maximum pitch period deltaOMG = sign*100*2*pi/fs; % The amount of clockwise or counterclockwise rotation of the root value dθ Dpitchm=Dpitch/rate; % Pitch period after increase or decrease Dfreqm=Dfreq*rate; % Pitch frequency after increase or decrease tal=0; % initialization zint=zeros(p,1); for i=1 : fn a=AR_coeff(:,i); % Get the of this frame AR coefficient sigma=sqrt(Gain(i)); % Obtain the gain coefficient of this frame if SF(i)==0 % Wordless frame excitation=randn(wlen,1); % Generate white noise [synt_frame,zint]=filter(sigma,a,excitation,zint); else % Talking frame PT=floor(Dpitchm(i)); % Change the period value to an integer if PT<lmin, PT=lmin; end % Judge whether the modified cycle value exceeds the limit if PT>lmax, PT=lmax; end ft=roots(a); % Root the prediction coefficient ft1=ft; %Increase or decrease the formant frequency to find a new root value for k=1 : p if imag(ft(k))>0, ft1(k) = ft(k)*exp(1j*deltaOMG); elseif imag(ft(k))<0 ft1(k) = ft(k)*exp(-1j*deltaOMG); end end ai=poly(ft1); % Reconstitute the prediction coefficient from the new root value exc_syn1 =zeros(wlen+tal,1); % Initialization pulse generation area exc_syn1(mod(1:tal+wlen,PT)==0)=1;% A pulse is generated at the position of the pitch period with an amplitude of 1 exc_syn2=exc_syn1(tal+1:tal+inc); % Calculate frame shift inc Number of pulses in the interval index=find(exc_syn2==1); excitation=exc_syn1(tal+1:tal+wlen);% The excitation pulse source of this frame if isempty(index) % Frame shift inc There is no pulse in the interval tal=tal+inc; % Calculate the leading zero point of the next frame else % Frame shift inc There are pulses in the interval eal=length(index); % Calculate the number of pulses tal=inc-index(eal); % Calculate the leading zero point of the next frame end gain=sigma/sqrt(1/PT); % gain [synt_frame,zint]=filter(gain,ai,excitation,zint);%Speech synthesis using excitation pulses end if i==1 % If frame 1 
output=synt_frame; % Overlapping addition is not required,Preserve composite data else M=length(output); % Treatment of overlapping parts output=[output(1:M-overlap); output(M-overlap+1:M).*tempr2+... synt_frame(1:overlap).*tempr1; synt_frame(overlap+1:wlen)]; end end output(find(isnan(output)))=0; ol=length(output); % Output output Extended to and input signal xx Equal length if ol<lx output1=[output; zeros(lx-ol,1)]; else output1=output(1:lx); end bn=[0.964775 -3.858862 5.788174 -3.858862 0.964775]; % Filter coefficient an=[1.000000 -3.928040 5.786934 -3.789685 0.930791]; output=filter(bn,an,output1); % High pass filtering output=output/max(abs(output)); % Amplitude normalization % Pronunciation through sound card,Compare original speech and synthetic speech % wavplay(x,fs); audioplayer(x,fs); pause(1) % wavplay(output,fs); audioplayer(output,fs); %% Mapping figure(1) pos = get(gcf,'Position'); set(gcf,'Position',[pos(1), pos(2)-150,pos(3),(pos(4)+100)]) subplot 411; plot(time,xx,'k'); xlim([0 max(time)]); title('Signal waveform'); ylabel('amplitude'); xlabel('(a)'); subplot 412; plot(frameTime,Ef,'k'); hold on; xlabel('(b)') axis([0 max(time) 0 1.2]); plot(frameTime,T2,'k','linewidth',2); line([0 max(time)],[T1 T1],'color','k','linestyle','-.'); title('Energy entropy ratio diagram'); axis([0 max(time) 0 1.2]); ylabel('amplitude'); text(3.2,T1+0.05,'T1'); for k=1 : vsl line([frameTime(vseg(k).begin) frameTime(vseg(k).begin)],... [0 1.2],'color','k','Linestyle','-'); line([frameTime(vseg(k).end) frameTime(vseg(k).end)],... [0 1.2],'color','k','Linestyle','--'); if k==vsl Tx=T2(floor((vseg(k).begin+vseg(k).end)/2)); end end text(2.6,Tx+0.05,'T2'); subplot 413; plot(frameTime,Dpitch,'k'); hold on; xlabel('(c)') line(frameTime,Dpitchm,'color',[.6 .6 .6]); axis([0 max(time) 0 120]); title('Pitch period'); ylabel('Sample value'); subplot 414; plot(frameTime,Dfreq,'k'); hold on line(frameTime,Dfreqm,'color',[.6 .6 .6]); axis([0 max(time) 0 450]); title('Pitch frequency'); ylabel('frequency/Hz'); xlabel(['time/s' 10 '(d)']); figure(2) subplot 211; plot(time,x,'k'); title('Original speech waveform'); axis([0 max(time) -1 1]); xlabel('time/s'); ylabel('amplitude') subplot 212; plot(time,output,'k'); title('Synthetic speech waveform'); xlim([0 max(time)]); xlabel('time/s'); ylabel('amplitude')
fle = colorcloud.wav
Please input the rising and falling multiple of pitch frequency: 2
13. MATLAB simulation 12: combined speed and tone change of the speech signal
% % pr10_6_3 clear all; clc; close all; filedir=[]; % Set the path of the data file filename='colorcloud.wav'; % Set the name of the data file fle=[filedir filename] % The string that makes up the path and file name % [xx,fs,nbits]=wavread(fle); % read file [xx,fs]=audioread(fle); % read file xx=xx-mean(xx); % Remove DC component x=xx/max(abs(xx)); % Amplitude normalization lx=length(x); % Data length time=(0:lx-1)/fs; % Find the corresponding time series wlen=320; % Set frame length inc=80; % Sets the length of the frame shift overlap=wlen-inc; % Overlap length tempr1=(0:overlap-1)'/overlap; % Oblique triangular window function w1 tempr2=(overlap-1:-1:0)'/overlap; % Oblique triangular window function w2 n2=1:wlen/2+1; % Subscript value of positive frequency X=enframe(x,wlen,inc)'; % Framing according to parameters fn=size(X,2); % Total frames T1=0.1; r2=0.5; % Endpoint detection parameters miniL=10; % Minimum number of frames of a segment mnlong=5; % Minimum frame number of vowel body ThrC=[10 15]; % threshold p=12; % LPC Order frameTime=frame2time(fn,wlen,inc,fs); % Time scale corresponding to each frame in=input('Please enter that the time length of the telescopic voice is a multiple of the original voice time length:','s');%Enter the extension length scale rate=str2num(in); in=input('Please enter the multiple of pitch frequency rise and fall:','s'); % Input pitch frequency increase / decrease ratio rate1=1/str2num(in); Dpitch=zeros(1,fn); Dfreq=zeros(1,fn); for i=1 : fn % Calculate the prediction coefficient and gain of each frame u=X(:,i); [ar,g]=lpc(u,p); AR_coeff(:,i)=ar; Gain(i)=g; end % pitch detection [Dpitch,Dfreq,Ef,SF,voiceseg,vosl,vseg,vsl,T2]=... Ext_F0ztms(xx,fs,wlen,inc,T1,r2,miniL,mnlong,ThrC,0); Dpitchm=Dpitch*rate1; if rate1==1, sign=0; elseif rate1>1, sign=1; else sign=-1; end lmin=floor(fs/450); % Minimum pitch period lmax=floor(fs/60); % Maximum pitch period deltaOMG = sign*100*2*pi/fs; % The pole is left-handed or right-handed dθ tal=0; % initialization zint=zeros(p,1); for i=1 : fn a2=AR_coeff(:,i); % Get the prediction coefficient of this frame lsf=ar2lsf(a2); % call ar2lsf Function solution lsf Glsf(:,i)=lsf; % hold lsf store in Glsf In array end % Shorten or lengthen the corresponding array by interpolation fn1=floor(rate*fn); % Sets the total number of new frames fn1 Glsfm=interp1((1:fn)',Glsf',linspace(1,fn,fn1)')'; %hold LSF Coefficient interpolation Dpitchm=interp1(1:fn,Dpitchm,linspace(1,fn,fn1)); %Interpolate the pitch period Gm=interp1((1:fn),Gain,linspace(1,fn,fn1)); %Interpolate the gain coefficient SFm=interp1((1:fn),SF,linspace(1,fn,fn1)); %hold SF Coefficient interpolation for i=1:fn1; lsf=Glsfm(:,i); % Get the of this frame lsf parameter a=lsf2ar(lsf); % call lsf2ar Function handle lsf Convert to prediction coefficient ar sigma=sqrt(Gm(i)); if SFm(i)==0 % Wordless frame excitation=randn(wlen,1); % Generate white noise [synt_frame,zint]=filter(sigma,a,excitation,zint); else % Talking frame PT=floor(Dpitchm(i)); % Change the period value to an integer if PT<lmin, PT=lmin; end if PT>lmax, PT=lmax; end ft=roots(a); ft1=ft; %Increase or decrease formant frequency for k=1 : p if imag(ft(k))>0, ft1(k) = ft(k)*exp(1j*deltaOMG); elseif imag(ft(k))<0 ft1(k) = ft(k)*exp(-1j*deltaOMG); end end ai=poly(ft1); % Composed of new root values AR coefficient exc_syn1 =zeros(wlen+tal,1); % Initialization pulse generation area exc_syn1(mod(1:tal+wlen,PT)==0) = 1; % A pulse is generated at the position of the pitch period with an amplitude of 1 
exc_syn2=exc_syn1(tal+1:tal+inc); % Calculate frame shift inc Number of pulses in the interval index=find(exc_syn2==1); excitation=exc_syn1(tal+1:tal+wlen);% The excitation pulse source of this frame if isempty(index) % Frame shift inc There is no pulse in the interval tal=tal+inc; % Calculate the leading zero point of the next frame else % Frame shift inc There are pulses in the interval eal=length(index); % Calculate the number of pulses tal=inc-index(eal); % Calculate the leading zero point of the next frame end gain=sigma/sqrt(1/PT); % gain [synt_frame,zint]=filter(gain, ai,excitation,zint); % Speech synthesis with impulse end if i==1 % If frame 1 output=synt_frame; % Overlapping addition is not required,Preserve composite data else M=length(output); % Treatment of overlapping parts output=[output(1:M-overlap); output(M-overlap+1:M).*tempr2+... synt_frame(1:overlap).*tempr1; synt_frame(overlap+1:wlen)]; end end output(find(isnan(output)))=0; ol=length(output); % Output data length time1=(0:ol-1)/fs; % Find the time series of the output sequence bn=[0.964775 -3.858862 5.788174 -3.858862 0.964775]; % Filter coefficient an=[1.000000 -3.928040 5.786934 -3.789685 0.930791]; output=filter(bn,an,output); % High pass filtering output=output/max(abs(output)); % Amplitude normalization % wavplay(x,fs); audioplayer(x,fs); pause(1) % wavplay(output,fs); audioplayer(output,fs); %% Mapping figure(1) figure(1) pos = get(gcf,'Position'); set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)+85)]) subplot 411; plot(time,xx,'k'); xlim([0 max(time)]); title('Signal waveform'); ylabel('amplitude'); subplot 412; plot(frameTime,Ef,'k'); hold on; axis([0 max(time) 0 1.2]); plot(frameTime,T2,'k','linewidth',2); line([0 max(time)],[T1 T1],'color','k','linestyle','-.'); title('Energy entropy ratio diagram'); axis([0 max(time) 0 1.2]); ylabel('amplitude'); text(3.2,T1+0.05,'T1'); for k=1 : vsl line([frameTime(vseg(k).begin) frameTime(vseg(k).begin)],... [0 1.2],'color','k','Linestyle','-'); line([frameTime(vseg(k).end) frameTime(vseg(k).end)],... [0 1.2],'color','k','Linestyle','--'); if k==vsl Tx=T2(floor((vseg(k).begin+vseg(k).end)/2)); end end text(2.6,Tx+0.05,'T2'); subplot 413; plot(frameTime,Dpitch,'k'); axis([0 max(time) 0 100]);title('Pitch period'); ylabel('Sample value'); subplot 414; plot(frameTime,Dfreq,'k'); axis([0 max(time) 0 450]);title('Pitch frequency'); ylabel('frequency/Hz'); xlabel('time/s'); figure(2) subplot 211; plot(time,x,'k'); title('Original speech waveform'); axis([0 max(time) -1 1]); xlabel('time/s'); ylabel('amplitude') subplot 212; plot(time1,output,'k'); title('Synthetic speech waveform'); xlim([0 max(time1)]); xlabel('time/s'); ylabel('amplitude')
fle = colorcloud.wav
Please enter that the time length of the telescopic voice is a multiple of the original voice time length: 1.8
Please enter the multiple of pitch frequency rise and fall: 1.6
14. MATLAB simulation 13: time domain pitch synchronous overlap-add (TD-PSOLA) speech synthesis 1
% pr10_7_1
clear all; clc; close all;
filedir=[];                             % set the path of the data file
filename='bluesky3.wav';                % set the name of the data file
fle=[filedir filename]                  % string with path and file name
% [xx,fs]=wavread(fle);                 % read the file (older MATLAB)
[xx,fs]=audioread(fle);                 % read the file
xx=xx-mean(xx);                         % remove the DC component
x=xx/max(abs(xx));                      % normalization
N=length(x);                            % data length
time=(0:N-1)/fs;                        % time scale
[LowPass] = LowPassFilter(x, fs, 500);  % low-pass filtering
p = PitchEstimation(LowPass, fs);       % estimate the pitch frequency
[u, v] = UVSplit(p);                    % locate the unvoiced and voiced segments
lu=size(u,1); lv=size(v,1);
% initialization
pm = []; ca = [];
for i = 1 : length(v(:,1))
    range = (v(i, 1) : v(i, 2));        % take one voiced segment
    in = x(range);                      % data of this segment
    % pitch-pulse marking for this segment
    [marks, cans] = VoicedSegmentMarking(in, p(range), fs);
    pm = [pm (marks + range(1))];       % save the pitch-pulse mark positions
    ca = [ca; (cans + range(1))];       % save the candidate list of pitch-pulse marks
end
% plotting
figure(1)
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-150,pos(3),pos(4)+100]);
subplot 211; plot(time,x,'k'); axis([0 max(time) -1 1.2]);
for k=1 : lv
    line([time(v(k,1)) time(v(k,1))],[-1 1.2 ],'color','k','linestyle','-')
    line([time(v(k,2)) time(v(k,2))],[-1 1.2 ],'color','k','linestyle','--')
end
title('Speech signal waveform and endpoint detection'); xlabel(['time/s' 10 '(a)']); ylabel('amplitude');
%figure(2)
%pos = get(gcf,'Position');
%set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)-160)]);
subplot 212; plot(x,'k'); axis([0 N -1 0.8]);
line([0 N],[0 0],'color','k')
lpm=length(pm);
for k=1 : lpm
    line([pm(k) pm(k)],[0 0.8],'color','k','linestyle','-.')
end
xlim([3000 4000]);
title('Part of the speech waveform with the corresponding pitch-pulse marks');
xlabel(['Sample point' 10 '(b)']); ylabel('amplitude');
15. MATLAB simulation 14: time domain pitch synchronous overlap-add (TD-PSOLA) speech synthesis 2
% pr10_7_2
clear all; clc; close all;
global config;                          % global variable config
config.pitchScale = 2.0;                % pitch modification factor
config.timeScale = 1.5;                 % duration modification factor
config.resamplingScale = 1;             % resampling factor
config.reconstruct = 0;                 % if true, perform low-pass spectrum reconstruction
config.cutOffFreq = 500;                % cut-off frequency of the low-pass filter
global data;                            % global variable data, initialized first
data.waveOut = [];                      % synthetic output, adjusted by the pitch and duration factors
data.pitchMarks = [];                   % pitch-pulse marks of the input speech signal
data.Candidates = [];                   % candidate list of pitch-pulse marks of the input speech signal
filedir=[];                             % set the path of the data file
filename='colorcloud.wav';              % set the name of the data file
fle=[filedir filename]                  % string with path and file name
% [xx,fs]=wavread(fle);                 % read the file (older MATLAB)
[xx,fs]=audioread(fle);                 % read the file
xx=xx-mean(xx);                         % remove the DC component
WaveIn=xx/max(abs(xx));                 % normalization
N=length(WaveIn);                       % data length
time=(0:N-1)/fs;                        % time scale
[LowPass] = LowPassFilter(WaveIn, fs, config.cutOffFreq);   % low-pass filtering
PitchContour = PitchEstimation(LowPass, fs);                % pitch contour of the speech signal
PitchMarking1(WaveIn, PitchContour, fs);                    % pitch-pulse marking and PSOLA synthesis
output=data.waveOut;
N1=length(output);                      % output data length
time1=(0:N1-1)/fs;                      % time scale of the output sequence
% wavplay(xx,fs);
playblocking(audioplayer(xx,fs));       % audioplayer alone does not start playback
pause(1)
% wavplay(output,fs)
playblocking(audioplayer(output,fs));
% plotting
subplot 211; plot(time,xx,'k');
xlabel('time/s'); ylabel('amplitude');
axis([0 max(time) -1 1.2 ]); title('Original speech')
subplot 212; plot(time1,output,'k');
xlabel('time/s'); ylabel('amplitude');
axis([0 max(time1) -1 1.5 ]); title('PSOLA synthetic speech')
Summary
Speech synthesis is a widely used branch of speech signal processing. Much of the automated dubbing and narration heard in today's audio software and video clips is produced by speech synthesis. At the sentence level, however, synthesized speech still falls noticeably short of a real human voice: in natural speech, intonation and emotion vary from sentence to sentence and scene to scene, and closing that gap still has a long way to go.
If you are interested in this material or want to understand it thoroughly, I recommend studying Chapter 10 of the book. Later on I will discuss and supplement some of the knowledge points based on my own understanding; everyone is welcome to study and exchange ideas together.
About Mr. Song: Song Zhiyong -- an elder who has quietly devoted himself to passing on MATLAB and signal processing knowledge.