MATLAB speech signal analysis and synthesis (Second Edition): Chapter 10 speech signal synthesis algorithm

preface

MATLAB speech signal analysis and synthesis (Second Edition) is a disgusting work accumulated by song Zhiyong, the boss of the Institute of acoustics of the Chinese Academy of Sciences. Students interested in speech signal processing can get started if they hope to do some research in the fields of speech signal analysis, processing and synthesis in the future.

Speech signal processing is an important branch of digital signal processing. This book contains many digital signal processing methods and MATLAB functions. The book consists of 10 chapters. Chapter 1-4 introduces some basic analysis methods and means of speech signal processing, as well as the corresponding matlab functions; Chapter 5-9 introduces speech signal preprocessing and feature extraction, including eliminating trend terms and basic noise reduction methods, as well as endpoint detection, pitch extraction and formant extraction. Using the basic methods of speech signal processing, a variety of extraction methods and corresponding matlab programs are given; In Chapter 10, combined with the detection of various parameters, the synthesis of speech signal, the variable speed and tone change processing of speech signal are introduced. The speech synthesis of time domain pitch synchronous superposition (TD PSOLA) is also introduced, and the corresponding matlab program is given. The methods and ideas of debugging complex programs are given in Appendix A. This book can be used as an auxiliary reading for senior undergraduate students, postgraduates or scientific research and engineering technicians engaged in speech signal processing. It can also be used as a reference book for scientific research and engineering technicians engaged in signal processing research and application.

My graduate tutor's main research direction is related to speech signal processing. Although my major thesis during my graduate study was digital image processing, the so-called speech and image are not separated. Although I rowed in the teacher's graduate lecture on wavelet transform, I also included a small doorway in the course design and engineering application of speech signal processing in the later tutor, and won the first place in the group in the class closing test, The tutor also specially distributed 300 ocean food funds as encouragement.

Picking up speech recognition again this time, I just started with Mr. Song's book. Let's review it again. It mainly introduces the source code in each chapter. This is the 14 simulation application examples in Chapter 10 of this book. Don't say much. Let's start!

1. Data and function path setting

Some functions often called in the book (self compiled functions or functions from other application toolkits) have been concentrated in basic_ In the TBX toolbox, please set the toolbox (set with set path) under the working path before running the program of this book;

When you want to run EMD processing, set the EMD toolbox under the working path;

When the subject extension pitch detection is to be run, pitch_ The ztlib toolbox is set under the working path;

When time domain pitch synchronous superposition speech synthesis is to be performed, PSOLA_ The Lib toolbox is set under the working path;

When you want to apply the voice data provided in this book, you'd better put speech_signal is set under the working path.

All functions and programs in this book are debugged under MATLAB R2009a version. (I use MATLAB2015b. Some functions have been updated, so I will modify them to pass the debugging)

The path setting method is as follows:

Open MATLAB, click "home page" to find the setting path

Add all the above folder paths to the MATLAB search path

After adding, save and start the simulation.

2. MATLAB simulation 1: overlapping phase addition speech synthesis

% pr10_1_1  
% hold conv Additive convolution with overlap conv_ovladd Compare functions
%
clear all; clc; close all;
y=load('data1.txt')';                   % read in data
M=length(y);                            % Data length
t=0:M-1;                        
h=fir1(100,0.125);                      % Design FIR wave filter
x=conv(h,y);                            % use conv Function for digital filtering
x=x(51:1050);                           % Take the filter output without delay
z=conv_ovladd1(y,h,256);                % Convolution is calculated by overlapping phase addition
% Mapping
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)-140)]);
plot(t,y,'k');
title('Original waveform of signal')
xlabel('Sample point'); ylabel('amplitude');
%xlabel('n'); ylabel('Am');
figure(2)
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)-100)]);
hline1 = plot(t,z,'k','LineWidth',1.5); 
hline2 = line(t,x,'LineWidth',5,'Color',[.6 .6 .6]);
set(gca,'Children',[hline1 hline2]);
title('Comparison of convolution and overlapping additive convolution')
xlabel('Sample point'); ylabel('amplitude');
legend('convolution conv','Overlapping additive convolution',2)
ylim([-0.8 1]);

function z=conv_ovladd1(x,h,L)
% Convolution calculation by overlapping phase addition 
x=x(:)';                                % hold x Convert to one line
NN=length(x);                           % calculation x long
N=length(h);                            % calculation h long
N1=L-N+1;                               % hold x Length of segment 
x1=[x zeros(1,L)];
H=fft(h,L);                             % seek h of FFT by H
J=fix((NN+L)/N1);                       % Find the number of blocks
y=zeros(1,NN+2*L);                      % initialization
for k=1 : J                             % Processing of each section
    xx=zeros(1,L);
    MI=(k-1)*N1;                        % The first i Duan Zai x Start position on-1
    nn=1:N1;
    xx(nn)=x1(nn+MI);                   % Take a paragraph xi
    XX=fft(xx,L);                       % do FFT
    YY=XX.*H;                           % Convolution by multiplication
    yy=ifft(YY,L);                      % do FFT Inverse transformation yi
% Overlapping addition between adjacent segments
    if k==1                             % The first block does not need to be overlapped and added                          
        for j=1 : L
            y(j)=y(j)+real(yy(j));
        end
    elseif k==J                         % The last piece is only made N1 Overlap and add data points
        for j=1 : N1
            y(MI+j)=y(MI+j)+real(yy(j));                
        end
    else        
        for j=1 : L                     % Starting from block 2, each block should be overlapped and added
            y(MI+j)=y(MI+j)+real(yy(j));                
        end
    end
end
nn=floor(N/2);
z=y(nn+1:NN+nn);                        % Ignore delay,Take output and input x Equal length


3. MATLAB simulation II: overlapping storage method speech synthesis

%
% pr10_1_2   
% hold conv Convolution with overlapping storage method conv_ovlsav1 Compare functions
clear all; clc; close all;
y=load('data1.txt');                    % read in data
M=length(y);                            % Data length
t=0:M-1;                            
h=fir1(100,0.125);                      % Design FIR wave filter
x=conv(h,y);                            % use conv Function for digital filtering
x=x(51:1050);                           % Take the filter output without delay
z=conv_ovlsav1(y,h,256);                % Convolution is calculated by overlapping storage method
% Mapping
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)-140)]);
plot(t,y,'k');
title('Original waveform of signal')
xlabel('Sample point'); ylabel('amplitude');
figure(2)
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)-100)]);
hline1 = plot(t,z,'k','LineWidth',1.5); 
hline2 = line(t,x,'LineWidth',5,'Color',[.6 .6 .6]);
set(gca,'Children',[hline1 hline2]);
title('Comparison between direct convolution and overlapping storage convolution')
xlabel('Sample point'); ylabel('amplitude');
legend('Direct convolution conv','Overlapping storage convolution',2)
ylim([-0.8 1]);
function y=conv_ovlsav1(x,h,L)
% Convolution calculation with overlapping storage method
x=x(:)';
Lenx = length (x);          % calculation x long
N=length(h);                % calculation h long 
N1=N-1;
M=L-N1;
H=fft(h, L);                % seek h of FFT by H
x=[zeros(1,N1), x, zeros(1, L-1)]; % Fill zero before and after
K = floor((Lenx+ N1-1)/M);  % Find the number of frames
Y=zeros(K+1, L);            % initialization
for k=0 : K                 % Process each frame
   Xk=fft(x(k*M+1:k*M+L));  % do FFT
   Y(k+1,:)=real(ifft(Xk.*H));  % Convolution by multiplication
end
Y=Y(:, N:L)';               % Keep only the last in each frame M Data
nm=fix(N/2);
y=Y(nm+1:nm+Lenx )';        % Ignore delay,And turn two dimensions into one,Take output and input x Equal length


4. MATLAB simulation 3: linear proportional overlapping addition speech synthesis

%
% pr10_1_3 
clear all; clc; close all;

load Labdata1                 % Read in experimental data and extended data
N=length(xx);                 % Data length
time=(0:N-1)/fs;              % Time scale
T=10091-8538+1;               % Length of missing data interval
x1=xx(1:8537);                % Previous data
x2=xx(10092:29554);           % Later segment data
y1=ydata(:,1);                % Extended data 1
xx1=[x1; y1; x2];             % Synthesis with extended data 1
y2=ydata(:,2);                % Extended data 2
xx2=[x1; y2; x2];             % Synthesized with extended data 2
% The extended data 1 and extended data 2 are overlapped and added in a linear proportion
Wt1=(0:T-1)'/T;               % Form oblique triangular window function w1
Wt2=(T-1:-1:0)'/T;            % Form oblique triangular window function w2
y=y1.*Wt2+y2.*Wt1;            % Linear scale overlapping addition
xx3=[x1; y; x2];              % Synthetic data
%Mapping
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),pos(4)-240]);
plot(time,xx,'k'); axis([0 29.6 -15 10]); 
title('Waveform of original signal'); xlabel('time/s'); ylabel('amplitude')
line([8.537 8.537],[-15 10],'color','k','linestyle','-');
line([10.092 10.092],[-15 10],'color','k','linestyle','--');

figure(2)
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),pos(4)+50]);
subplot 221; plot(time,xx1,'k'); axis([9.5 10.5 -10 5]);
line([10.091 10.091],[-15 10],'color','k','linestyle','-.');
title('Waveform synthesized after the first segment extension'); xlabel(['time/s' 10 '(a)']); ylabel('amplitude')
subplot 222; plot(time,xx2,'k'); axis([8 9.5 -10 5]); 
line([8.538 8.538],[-15 10],'color','k','linestyle','-.');
title('Waveform synthesized after the extension of the second segment'); xlabel(['time/s' 10 '(b)']); ylabel('amplitude')
subplot 212; plot(time,xx3,'k');  axis([0 29.6 -15 10]); 
line([8.537 8.537],[-15 10],'color','k','linestyle','-');
line([10.092 10.092],[-15 10],'color','k','linestyle','--');
title('Waveform synthesized after linear proportional overlap and addition'); xlabel(['time/s' 10 '(c)']); ylabel('amplitude')  


5. MATLAB simulation 4: spectrum parameter method speech synthesis

%
% pr10_3_1 
clear all; clc; close all;

filedir=[];                             % set up path
filename='colorcloud.wav';              % File name setting
fle=[filedir filename];                 % Make up the complete path and file name
% [x, fs, bits] = wavread(fle);           % Read in data file
[x, fs] = audioread(fle);           % Read in data file
x=x-mean(x);                            % Eliminate DC component
x=x/max(abs(x));                        % Amplitude normalization
xl=length(x);                           % Data length
time=(0:xl-1)/fs;                       % Calculate the time scale
p=12;                                   % LPC The order of is 12
wlen=200; inc=80;                       % Frame length and frame shift
msoverlap = wlen - inc;                 % Length of overlap per frame
y=enframe(x,wlen,inc)';                 % Framing
fn=size(y,2);                           % Frame count
% Speech analysis:Find the of each frame LPC Coefficient and prediction error
for i=1 : fn                            
    u=y(:,i);                           % Get a frame
    A=lpc(u,p);                         % LPC Obtain the coefficient
    aCoeff(:,i)=A;                      % Store in aCoeff In array
    errSig = filter(A,1,u);             % Calculate the prediction error sequence
    resid(:,i) = errSig;                % Store in resid In array
end
% speech synthesis:The synthetic speech of each frame is superimposed into a continuous speech signal
for i=1:fn                              
    A = aCoeff(:,i);                    % Obtain the prediction coefficient of the frame
    residFrame = resid(:,i);            % Obtain the prediction error of the frame
    synFrame = filter(1, A', residFrame); % Prediction error excitation,Synthetic speech
    
    outspeech((i-1)*inc+1:i*inc)=synFrame(1:inc);  % Overlapping storage method for storing data
% If it's the last frame,hold inc Add the following data
    if i==fn                            
        outspeech(fn*inc+1:(fn-1)*inc+wlen)=synFrame(inc+1:wlen);
    end

end;
ol=length(outspeech);
if ol<xl                                % hold outspeech Zero filling,Make and x Equal length
    outspeech=[outspeech zeros(1,xl-ol)];
end
% Vocalization
% wavplay(x,fs);
audioplayer(x,fs);
pause(1)
% wavplay(outspeech,fs);
audioplayer(outspeech,fs);
% Mapping
subplot 211; plot(time,x,'k');
xlabel(['time/s' 10 '(a)']); ylabel('amplitude'); ylim([-1 1.1]);
title('Original voice signal')
subplot 212; plot(time,outspeech,'k');
xlabel(['time/s' 10 '(b)']); ylabel('amplitude'); ylim([-1 1.1]);
title('Synthetic speech signal')

6. MATLAB simulation 5: linear prediction coefficient and prediction error method speech synthesis

%
% pr10_4_1  
clear all; clc; close all;

filedir=[];                               % Set the path of the data file
filename='colorcloud.wav';                % Set the name of the data file
fle=[filedir filename]                    % The string that makes up the path and file name
% [xx,fs]=wavread(fle);                     % read file
[xx,fs]=audioread(fle);                     % read file
xx=xx-mean(xx);                           % Remove DC component
x=xx/max(abs(xx));                        % normalization
N=length(x);                              % Data length
time=(0:N-1)/fs;                          % Time scale
wlen=240;                                 % Frame length
inc=80;                                   % Frame shift
overlap=wlen-inc;                         % Overlap length
tempr1=(0:overlap-1)'/overlap;            % Oblique triangular window function w1
tempr2=(overlap-1:-1:0)'/overlap;         % Oblique triangular window function w2
n2=1:wlen/2+1;                            % Subscript value of positive frequency
wind=hanning(wlen);                       % Window function
X=enframe(x,wind,inc)';                   % Framing
fn=size(X,2);                             % Number of frames
T1=0.1; r2=0.5;                           % Endpoint detection parameters
miniL=10;                                 % Minimum number of frames of a segment
mnlong=5;                                 % Minimum frame number of vowel body
ThrC=[10 15];                             % threshold
p=12;                                     % LPC Order
frameTime=frame2time(fn,wlen,inc,fs);     % Calculate the timescale for each frame
for i=1 : fn                              % Calculate the linear prediction coefficient and gain of each frame
    u=X(:,i);
    [ar,g]=lpc(u,p);
    AR_coeff(:,i)=ar;
    Gain(i)=g;
end
% pitch detection 
[Dpitch,Dfreq,Ef,SF,voiceseg,vosl,vseg,vsl,T2]=...
   Ext_F0ztms(xx,fs,wlen,inc,T1,r2,miniL,mnlong,ThrC,0);

tal=0;                                    % Initialize leading zero
zint=zeros(p,1);
for i=1:fn; 
    ai=AR_coeff(:,i);                     % Get page i Prediction coefficient of frame
    sigma_square=Gain(i);                 % Get page i Gain coefficient of frame
    sigma=sqrt(sigma_square);
    
    if SF(i)==0                           % Wordless frame
        excitation=randn(wlen,1);         % Generate white noise
        [synt_frame,zint]=filter(sigma,ai,excitation,zint); % Synthesizing speech with white noise
    else                                  % Talking frame
        PT=round(Dpitch(i));              % Take periodic value
        exc_syn1 =zeros(wlen+tal,1);      % Initialization pulse generation area
        exc_syn1(mod(1:tal+wlen,PT)==0) = 1;  % A pulse is generated at the position of the pitch period with an amplitude of 1
        exc_syn2=exc_syn1(tal+1:tal+inc); % Calculate frame shift inc Number of pulses in the interval
        index=find(exc_syn2==1);
        excitation=exc_syn1(tal+1:tal+wlen);% The excitation pulse source of this frame
        
        if isempty(index)                 % Frame shift inc There is no pulse in the interval
            tal=tal+inc;                  % Calculate the leading zero point of the next frame
        else                              % Frame shift inc There are pulses in the interval
            eal=length(index);            % How many pulses are there in the calculation
            tal=inc-index(eal);           % Calculate the leading zero point of the next frame
        end
        gain=sigma/sqrt(1/PT);            % gain
        [synt_frame,zint]=filter(gain, ai,excitation,zint); % Speech synthesis using impulse
    end
        if i==1                           % If frame 1
            output=synt_frame;            % Overlapping addition is not required,Preserve composite data
        else
            M=length(output);             % Processing synthetic data by overlapping and adding in linear proportion
            output=[output(1:M-overlap); output(M-overlap+1:M).*tempr2+...
                synt_frame(1:overlap).*tempr1; synt_frame(overlap+1:wlen)];
        end
end
ol=length(output);                        % Output output Extended to and input signal xx Equal length
if ol<N
    output1=[output; zeros(N-ol,1)];
else
    output1=output(1:N);
end
bn=[0.964775   -3.858862   5.788174   -3.858862   0.964775]; % Filter coefficient
an=[1.000000   -3.928040   5.786934   -3.789685   0.930791];
output=filter(bn,an,output1);             % High pass filtering
output=output/max(abs(output));           % Amplitude normalization

% Pronunciation through sound card,Compare original speech and synthetic speech
% wavplay(x,fs);
audioplayer(x,fs);
pause(1)
% wavplay(output,fs);
audioplayer(output,fs);
%Mapping
figure(1)
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)+85)])
subplot 411; plot(time,xx,'k'); axis([0 max(time) -1 1.2]);
xlabel('(a)'); 
title('Signal waveform'); ylabel('amplitude')
subplot 412; plot(frameTime,Ef,'k'); hold on
axis([0 max(time) 0 1.2]); plot(frameTime,T2,'k','linewidth',2);
line([0 max(time)],[T1 T1],'color','k','linestyle','-.');
title('Energy entropy ratio diagram'); axis([0 max(time) 0 1.2]); ylabel('amplitude')
xlabel('(b)'); 
text(3.2,T1+0.05,'T1');
for k=1 : vsl
        line([frameTime(vseg(k).begin) frameTime(vseg(k).begin)],...
        [0 1.2],'color','k','Linestyle','-');
        line([frameTime(vseg(k).end) frameTime(vseg(k).end)],...
        [0 1.2],'color','k','Linestyle','--');
    if k==vsl
        Tx=T2(floor((vseg(k).begin+vseg(k).end)/2));
    end
end
text(2.65,Tx+0.05,'T2');
subplot 413; plot(frameTime,Dpitch,'k'); 
axis([0 max(time) 0 110]);title('Pitch period');ylabel('Sample value')
xlabel( '(c)'); 
subplot 414; plot(frameTime,Dfreq,'k'); 
axis([0 max(time) 0 250]);title('Pitch frequency');ylabel('frequency/Hz')
xlabel(['time/s' 10 '(d)']); 

figure(2)
subplot 211; plot(time,x,'k'); title('Original speech waveform');
axis([0 max(time) -1 1.1]); xlabel('time/s'); ylabel('amplitude')
subplot 212; plot(time,output,'k');  title('Synthetic speech waveform');
axis([0 max(time) -1 1.1]); xlabel('time/s'); ylabel('amplitude')


7. MATLAB simulation VI: pitch and formant speech synthesis I

%
%  pr10_5_1 
clear all; clc; close all;

F0 = [700 900 2600];
Bw = [130 70 160];
fs=8000;

[An,Bn]=formant2filter4(F0,Bw,fs);    % Call the function to calculate the filter coefficient
for k=1 : 4                           % Make frequency response curves for four second-order bandpass filters
    A=An(:,k);                        % Obtain filter coefficients
    B=Bn(k);
    fprintf('B=%5.6f   A=%5.6f   %5.6f   %5.6f\n',B,A);
    [H(:,k),w]=freqz(B,A);            % Obtain the response curve
end
freq=w/pi*fs/2;                       % Frequency axis scale
% Mapping
plot(freq,abs(H(:,1)),'k',freq,abs(H(:,2)),'k',freq,abs(H(:,3)),'k',freq,abs(H(:,4)),'k');
axis([0 4000 0 1.05]); grid;
line([F0(1) F0(1)],[0 1.05],'color','k','linestyle','-.');
line([F0(2) F0(2)],[0 1.05],'color','k','linestyle','-.');
line([F0(3) F0(3)],[0 1.05],'color','k','linestyle','-.');
line([3500 3500],[0 1.05],'color','k','linestyle','-.');
title('Response curve of second-order bandpass filter with three formants and a fixed frequency')    
ylabel('amplitude'); xlabel('frequency/Hz')

8. MATLAB simulation VII: pitch and formant speech synthesis II

%
% pr10_5_2 
clear all; clc; close all;

Fs=8000;                                  % sampling frequency 
Fs2=Fs/2;

fp=60; fs=20;                             % Passband ripple and stopband frequency
wp=fp/Fs2; ws=fs/Fs2;                     % Convert to normalized frequency
Rp=1; Rs=40;                              % Passband and stopband attenuation
[n,Wn]=cheb2ord(wp,ws,Rp,Rs);             % Calculate filter order

[b1,a1]=cheby2(n,Rs,Wn,'high');           % Calculate filter coefficients
fprintf('b=%5.6f   %5.6f   %5.6f   %5.6f   %5.6f\n',b1);
fprintf('a=%5.6f   %5.6f   %5.6f   %5.6f   %5.6f\n',a1);
fprintf('\n');
[db,mag,pha,grd,w]=freqz_m(b1,a1);        % Find the frequency response of the filter
a=[1 -0.99];
db1=freqz_m(1,a);                         % Calculate the frequency response of low-pass filter  
A=conv(a,a1);                             % Calculate series filter coefficients
B=b1;
db2=freqz_m(B,A);
fprintf('B=%5.6f   %5.6f   %5.6f   %5.6f   %5.6f\n',B);
fprintf('A=%5.6f   %5.6f   %5.6f   %5.6f   %5.6f   %5.6f\n',A);
% Mapping
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),pos(4)+100]);
subplot 221; plot(w/pi*Fs2,db1,'k'); 
title('Amplitude frequency response curve of low-pass filter')
ylim([-50 0]); ylabel('amplitude/dB'); xlabel(['frequency/Hz' 10 '(a)']);
subplot 222; plot(w/pi*Fs2,db,'k');
title('Amplitude frequency response curve of high pass filter')
axis([0 500 -50 5]); ylabel('amplitude/dB'); xlabel(['frequency/Hz' 10 '(b)']);
subplot 212; semilogx(w/pi*Fs2,db2,'k');
title('Amplitude frequency response curve of bandpass filter'); ylabel('amplitude/dB'); 
xlabel(['frequency/Hz' 10 '(c)']); axis([10 3500 -40 5]); grid

9. MATLAB simulation 8: pitch and formant speech synthesis 3

%
% pr10_5_3 
clear all; clc; close all;

filedir=[];                               % Set the path of the data file
filename='colorcloud.wav';                % Set the name of the data file
fle=[filedir filename]                    % The string that makes up the path and file name
% [xx,fs]=wavread(fle);                     % read file
[xx,fs]=audioread(fle);                     % read file
xx=xx-mean(xx);                           % Remove DC component
x1=xx/max(abs(xx));                       % normalization
x=filter([1 -.99],1,x1);                  % Pre aggravation
N=length(x);                              % Data length
time=(0:N-1)/fs;                          % Time scale of signal
wlen=240;                                 % Frame length
inc=80;                                   % Frame shift
overlap=wlen-inc;                         % Overlap length
tempr1=(0:overlap-1)'/overlap;            % Oblique triangular window function w1
tempr2=(overlap-1:-1:0)'/overlap;         % Oblique triangular window function w2
n2=1:wlen/2+1;                            % Subscript value of positive frequency
wind=hanning(wlen);                       % Window function
X=enframe(x,wlen,inc)';                   % Framing
fn=size(X,2);                             % Number of frames
Etemp=sum(X.*X);                          % Calculate the energy per frame
Etemp=Etemp/max(Etemp);                   % Energy normalization
T1=0.1; r2=0.5;                           % Endpoint detection parameters
miniL=10;                                 % Minimum number of frames of a segment
mnlong=5;                                 % Minimum frame number of vowel body
ThrC=[10 15];                             % threshold
p=12;                                     % LPC Order
frameTime=frame2time(fn,wlen,inc,fs);     % Time scale corresponding to each frame
Doption=0;                                

% Use subject-Extended pitch detection
[Dpitch,Dfreq,Ef,SF,voiceseg,vosl,vseg,vsl,T2]=...
   Ext_F0ztms(x1,fs,wlen,inc,T1,r2,miniL,mnlong,ThrC,Doption);

%% Formant extraction
Frmt=Formant_ext2(x,wlen,inc,fs,SF,Doption);
Bwm=[150 200 250];                        % Set fixed bandwidth                    
Bw=repmat(Bwm',1,fn);

%% speech synthesis
zint=zeros(2,4);                          % initialization
tal=0;
for i=1 : fn
    yf=Frmt(:,i);                         % Fetch i Three formant frequencies and bandwidth of the frame
    bw=Bw(:,i);
    [an,bn]=formant2filter4(yf,bw,fs);    % Convert to four second-order filter coefficients
    synt_frame=zeros(wlen,1);
    
    if SF(i)==0                           % Wordless frame
        excitation=randn(wlen,1);         % Generate white noise
        for k=1 : 4                       % Parallel input of four filters
            An=an(:,k);
            Bn=bn(k);
            [out(:,k),zint(:,k)]=filter(Bn(1),An,excitation,zint(:,k));
            synt_frame=synt_frame+out(:,k); % The four filter outputs are superimposed together
        end
    else                                  % Talking frame
        PT=round(Dpitch(i));              % Take cycle value
        exc_syn1 =zeros(wlen+tal,1);      % Initialization pulse generation area
        exc_syn1(mod(1:tal+wlen,PT)==0)=1;% A pulse is generated at the position of the pitch period with an amplitude of 1
        exc_syn2=exc_syn1(tal+1:tal+inc); % Calculate frame shift inc Number of pulses in the interval
        index=find(exc_syn2==1);
        excitation=exc_syn1(tal+1:tal+wlen);% The excitation pulse source of this frame
        
        if isempty(index)                 % Frame shift inc There is no pulse in the interval
            tal=tal+inc;                  % Calculate the leading zero point of the next frame
        else                              % Frame shift inc There are pulses in the interval
            eal=length(index);            % How many pulses are there in the calculation
            tal=inc-index(eal);           % Calculate the leading zero point of the next frame
        end
        for k=1 : 4                       % Parallel input of four filters
            An=an(:,k);
            Bn=bn(k);
            [out(:,k),zint(:,k)]=filter(Bn(1),An,excitation,zint(:,k));
            synt_frame=synt_frame+out(:,k); % The four filter outputs are superimposed together
        end
    end
    Et=sum(synt_frame.*synt_frame);       % Synthesis of speech using energy normalization
    rt=Etemp(i)/Et;
    synt_frame=sqrt(rt)*synt_frame;
        if i==1                           % If frame 1
            output=synt_frame;            % Overlapping addition is not required,Preserve composite data
        else
            M=length(output);             % Processing by linear addition of overlapping data
            output=[output(1:M-overlap); output(M-overlap+1:M).*tempr2+...
                synt_frame(1:overlap).*tempr1; synt_frame(overlap+1:wlen)];
        end
end
ol=length(output);                        % Output output Extended to and input signal xx Equal length
if ol<N
    output=[output; zeros(N-ol,1)];
end
% Synthetic speech passes through a band-pass filter
out1=output;
out2=filter(1,[1 -0.99],out1);
b=[0.964775   -3.858862   5.788174   -3.858862   0.964775];
a=[1.000000   -3.928040   5.786934   -3.789685   0.930791];
output=filter(b,a,out2);
output=output/max(abs(output));
% Pronunciation through sound card,Compare original speech and synthetic speech
% wavplay(xx,fs);
audioplayer(xx,fs);
pause(1)
% wavplay(output,fs);
audioplayer(output,fs);
%% Mapping
figure(1)
figure(1)
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)+85)])
subplot 411; plot(time,x1,'k'); axis([0 max(time) -1 1.1]);
title('Signal waveform'); ylabel('amplitude')
subplot 412; plot(frameTime,Ef,'k'); hold on
axis([0 max(time) 0 1.2]); plot(frameTime,T2,'k','linewidth',2);
line([0 max(time)],[T1 T1],'color','k','linestyle','-.');
title('Energy entropy ratio diagram'); axis([0 max(time) 0 1.2]); ylabel('amplitude')
text(3.2,T1+0.05,'T1');
for k=1 : vsl
        line([frameTime(vseg(k).begin) frameTime(vseg(k).begin)],...
        [0 1.2],'color','k','Linestyle','-');
        line([frameTime(vseg(k).end) frameTime(vseg(k).end)],...
        [0 1.2],'color','k','Linestyle','--');
    if k==vsl
        Tx=T2(floor((vseg(k).begin+vseg(k).end)/2));
    end
end
text(2.65,Tx+0.05,'T2');
subplot 413; plot(frameTime,Dpitch,'k'); 
axis([0 max(time) 0 110]);title('Pitch period'); ylabel('Sample value')
subplot 414; plot(frameTime,Dfreq,'k'); 
axis([0 max(time) 0 250]);title('Pitch frequency'); ylabel('frequency/Hz')
xlabel('time/s'); 

figure(2)
subplot 211; plot(time,x1,'k'); title('Original speech waveform');
axis([0 max(time) -1 1.1]); xlabel('time/s'); ylabel('amplitude')
subplot 212; plot(time,output,'k');  title('Synthetic speech waveform');
axis([0 max(time) -1 1.1]); xlabel('time/s'); ylabel('amplitude')

figure(3)
out1=out1/max(out1);
subplot 311; plot(time,out1,'k');
title('Waveform before passing the filter out1')
ylabel('amplitude'); ylim([-0.5 1]); xlabel('(a)');
out2=out2/max(out2);
subplot 312; plot(time,out2,'k');
title('Waveform after passing through low-pass filter out2')
ylabel('amplitude'); ylim([-0.5 1]); xlabel('(b)');
subplot 313; plot(time,output,'k');
title('Waveform after passing through high pass filter output')
xlabel(['time/s' 10 '(c)']); ylabel('amplitude')



10. MATLAB simulation IX: pitch and formant speech synthesis IV

%
%  pr10_5_4 
clear all; clc; close all;

filedir=[];                               % Set the path of the data file
filename='colorcloud.wav';                % Set the name of the data file
fle=[filedir filename]                    % The string that makes up the path and file name
% [xx,fs]=wavread(fle);                     % read file
[xx,fs]=audioread(fle);                     % read file
xx=xx-mean(xx);                           % Remove DC component
x1=xx/max(abs(xx));                       % normalization
x=filter([1 -.99],1,x1);                  % Pre aggravation
N=length(x);                              % Data length
time=(0:N-1)/fs;                          % Time scale of signal
wlen=240;                                 % Frame length
inc=80;                                   % Frame shift
overlap=wlen-inc;                         % Overlap length
tempr1=(0:overlap-1)'/overlap;            % Oblique triangular window function w1
tempr2=(overlap-1:-1:0)'/overlap;         % Oblique triangular window function w2
n2=1:wlen/2+1;                            % Subscript value of positive frequency
wind=hanning(wlen);                       % Window function
X=enframe(x,wlen,inc)';                   % Framing
fn=size(X,2);                             % Number of frames
Etemp=sum(X.*X);                          % Calculate energy per frame
Etemp=Etemp/max(Etemp);                   % Energy normalization
T1=0.1; r2=0.5;                           % Endpoint detection parameters
miniL=10;                                 % Minimum number of frames of a segment
mnlong=5;                                 % Minimum frame number of vowel body
ThrC=[10 12];                             % threshold
p=12;                                     % LPC Order
frameTime=frame2time(fn,wlen,inc,fs);     % Time scale corresponding to each frame

% pitch detection 
[Dpitch,Dfreq,Ef,SF,voiceseg,vosl,vseg,vsl,T2]=...
   Ext_F0ztms(xx,fs,wlen,inc,T1,r2,miniL,mnlong,ThrC,0);

const=fs/(2*pi);                            % constant 
Frmt=ones(3,fn)*nan;                        % initialization
Bw=ones(3,fn)*nan;
zint=zeros(2,4);                            % initialization
tal=0;
cepstL=6;                                   % Width of window function on inverted frequency
wlen2=wlen/2;                               % Take half the frame length
df=fs/wlen;                                 % Calculate the frequency scale in the frequency domain

for i=1 : fn
%% Extraction of formant and bandwidth
    u=X(:,i);                               % Take a frame of data
    u2=u.*wind;		                        % Signal windowing function
    U=fft(u2);                              % Press formula(10-5-6)calculation FFT
    U2=abs(U(1:wlen2)).^2;                  % Calculated power spectrum
    U_abs=log(U2);                          % Press formula(10-5-7)Calculate logarithm
    Cepst=ifft(U_abs);                      % Press formula(10-5-8)calculation IDFT
    cepst=zeros(1,wlen2);           
    cepst(1:cepstL)=Cepst(1:cepstL);        % Press formula(10-5-9)Windowing function
    cepst(end-cepstL+2:end)=Cepst(end-cepstL+2:end);
    spect=real(fft(cepst));                 % Press formula(10-5-10)calculation DFT
    [Loc,Val]=findpeaks(spect);             % Find peak
    Spe=exp(spect);                         % Press formula(10-5-11)Calculate the linear power spectrum value
    ll=min(3,length(Loc));
    for k=1 : ll
        m=Loc(k);                           % set up m-1,m and m+1
        m1=m-1; m2=m+1;
        pp=Spe(m);                          % set up P(m-1),P(m)and P(m+1)
        pp1=Spe(m1); pp2=Spe(m2);
        aa=(pp1+pp2)/2-pp;                  % Press formula(10-5-13)calculation
        bb=(pp2-pp1)/2;
        cc=pp;
        dm=-bb/2/aa;                        
        Pp=-bb*bb/4/aa+cc;                  % Press formula(10-5-14)calculation
        m_new=m+dm;
        bf=-sqrt(bb*bb-4*aa*(cc-Pp/2))/aa;  
        F(k)=(m_new-1)*df;                  % Press formula(10-5-12)Calculate formant frequency
        bw(k)=bf*df;                        % Press formula(10-5-12)Calculate formant bandwidth
    end
    Frmt(:,i)=F;                            % Store formant frequency in Frmt in
    Bw(:,i)=bw;                             % Store bandwidth in Bw in
    
end

%% speech synthesis
for i=1 : fn
    yf=Frmt(:,i);                           % Fetch i Three formant frequencies and bandwidth of the frame
    bw=Bw(:,i);
    [an,bn]=formant2filter4(yf,bw,fs);      % Convert to four second-order filter coefficients
    synt_frame=zeros(wlen,1);
    
    if SF(i)==0                             % Wordless frame
        excitation=randn(wlen,1);           % Generate white noise
        for k=1 : 4                         % Parallel input of four filters
            An=an(:,k);
            Bn=bn(k);
            [out(:,k),zint(:,k)]=filter(Bn(1),An,excitation,zint(:,k));
            synt_frame=synt_frame+out(:,k); % The four filter outputs are superimposed together
        end
    else                                    % Talking frame
        PT=round(Dpitch(i));                % Take periodic value
        exc_syn1 =zeros(wlen+tal,1);        % Initialization pulse generation area
        exc_syn1(mod(1:tal+wlen,PT)==0)=1;  % A pulse is generated at the position of the pitch period with an amplitude of 1
        exc_syn2=exc_syn1(tal+1:tal+inc);   % Calculate frame shift inc Number of pulses in the interval
        index=find(exc_syn2==1);
        excitation=exc_syn1(tal+1:tal+wlen);% The excitation pulse source of this frame
        
        if isempty(index)                   % Frame shift inc There is no pulse in the interval
            tal=tal+inc;                    % Calculate the leading zero point of the next frame
        else                                % Frame shift inc There are pulses in the interval
            eal=length(index);              % How many pulses are there in the calculation
            tal=inc-index(eal);             % Calculate the leading zero point of the next frame
        end
        for k=1 : 4                         % Parallel input of four filters
            An=an(:,k);
            Bn=bn(k);
            [out(:,k),zint(:,k)]=filter(Bn(1),An,excitation,zint(:,k));
            synt_frame=synt_frame+out(:,k); % The four filter outputs are superimposed together
        end
    end
    Et=sum(synt_frame.*synt_frame);         % Synthesis of speech using energy normalization
    rt=Etemp(i)/Et;
    synt_frame=sqrt(rt)*synt_frame;
        synt_speech_HF(:,i)=synt_frame;
        if i==1                             % If frame 1
            output=synt_frame;              % Overlapping addition is not required,Preserve composite data
        else
            M=length(output);               % Processing synthetic data by overlapping and adding in linear proportion
            output=[output(1:M-overlap); output(M-overlap+1:M).*tempr2+...
                synt_frame(1:overlap).*tempr1; synt_frame(overlap+1:wlen)];
        end
end
ol=length(output);                          % Output output Extended to and input signal xx Equal length
if ol<N
    output=[output; zeros(N-ol,1)];
end
out1=output;
b=[0.964775   -3.858862   5.788174   -3.858862   0.964775];  % Filter coefficient
a=[1.000000   -4.918040   9.675693   -9.518749   4.682579   -0.921483];
output=filter(b,a,out1);                    % Filter connected in series through low-pass and high pass
output=output/max(abs(output));             % Amplitude normalization
% Pronunciation through sound card,Compare original speech and synthetic speech
% wavplay(xx,fs);
audioplayer(xx,fs);
pause(1)
% wavplay(output,fs);
audioplayer(output,fs);
%% Mapping
figure(1)
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)+85)])
subplot 411; plot(time,x1,'k'); axis([0 max(time) -1 1.1]);
title('Signal waveform'); ylabel('amplitude'); 
subplot 412; plot(frameTime,Ef,'k'); hold on
axis([0 max(time) 0 1.2]); plot(frameTime,T2,'k','linewidth',2);
line([0 max(time)],[T1 T1],'color','k','linestyle','-.');
title('Energy entropy ratio diagram'); axis([0 max(time) 0 1.2]);  ylabel('amplitude');
text(3.2,T1+0.05,'T1');
for k=1 : vsl
        line([frameTime(vseg(k).begin) frameTime(vseg(k).begin)],...
        [0 1.2],'color','k','Linestyle','-');
        line([frameTime(vseg(k).end) frameTime(vseg(k).end)],...
        [0 1.2],'color','k','Linestyle','--');
    if k==vsl
        Tx=T2(floor((vseg(k).begin+vseg(k).end)/2));
    end
end
text(2.65,Tx+0.05,'T2');
subplot 413; plot(frameTime,Dpitch,'k'); 
axis([0 max(time) 0 110]);title('Pitch period'); ylabel('Sample value');
subplot 414; plot(frameTime,Dfreq,'k'); 
axis([0 max(time) 0 250]);title('Pitch frequency'); ylabel('frequency/Hz');
xlabel('time/s'); 

figure(2)
subplot 211; plot(time,x1,'k'); title('Original speech waveform');
axis([0 max(time) -1 1.1]); xlabel('time/s'); ylabel('amplitude')
subplot 212; plot(time,output,'k');  title('Synthetic speech waveform');
axis([0 max(time) -1 1.1]); xlabel('time/s'); ylabel('amplitude')


11. MATLAB simulation X: speed change of voice signal

%
% pr10_6_1  
clear all; clc; close all;

filedir=[];                               % Set the path of the data file
filename='colorcloud.wav';                % Set the name of the data file
fle=[filedir filename]                    % The string that makes up the path and file name
% [xx,fs]=wavread(fle);                     % read file
[xx,fs]=audioread(fle);                     % read file
xx=xx-mean(xx);                           % Remove DC component
x=xx/max(abs(xx));                        % normalization
N=length(x);                              % Data length
time=(0:N-1)/fs;                          % Time scale of signal
wlen=240;                                 % Frame length
inc=80;                                   % Frame shift
overlap=wlen-inc;                         % Overlap length
tempr1=(0:overlap-1)'/overlap;            % Oblique triangular window function w1
tempr2=(overlap-1:-1:0)'/overlap;         % Oblique triangular window function w2
n2=1:wlen/2+1;                            % Subscript value of positive frequency
X=enframe(x,wlen,inc)';                   % Framing
fn=size(X,2);                             % Number of frames
T1=0.1; r2=0.5;                           % Endpoint detection parameters
miniL=10;                                 % Minimum number of frames of a segment
mnlong=5;                                 % Minimum frame number of vowel body
ThrC=[10 15];                             % threshold
p=12;                                     % LPC Order
frameTime=frame2time(fn,wlen,inc,fs);     % Calculate the timescale for each frame
in=input('Please enter that the time length of the telescopic voice is a multiple of the original voice time length:','s');%Enter the extension length scale
rate=str2num(in);

for i=1 : fn                              % Calculate the prediction coefficient and gain of each frame
    u=X(:,i);
    [ar,g]=lpc(u,p);
    AR_coeff(:,i)=ar;
    Gain(i)=g;
end

% pitch detection 
[Dpitch,Dfreq,Ef,SF,voiceseg,vosl,vseg,vsl,T2]=...
    Ext_F0ztms(x,fs,wlen,inc,T1,r2,miniL,mnlong,ThrC,0);

tal=0;                                    % initialization
zint=zeros(p,1); 
%% LSP Parameter extraction
for i=1 : fn
    a2=AR_coeff(:,i);                     % Get the prediction coefficient of this frame
    lsf=ar2lsf(a2);                       % call ar2lsf Function solution lsf
    Glsf(:,i)=lsf;                        % hold lsf store in Glsf In array
end

% Shorten or lengthen the corresponding array by interpolation
fn1=floor(rate*fn);                        % Sets the total number of new frames fn1
Glsfm=interp1((1:fn),Glsf',linspace(1,fn,fn1))';% hold LSF Coefficient interpolation
Dpitchm=interp1(1:fn,Dpitch,linspace(1,fn,fn1));% Interpolate the pitch period
Gm=interp1((1:fn),Gain,linspace(1,fn,fn1));%Interpolate the gain coefficient
SFm=interp1((1:fn),SF,linspace(1,fn,fn1)); %hold SF Coefficient interpolation

%% speech synthesis
for i=1:fn1; 
    lsf=Glsfm(:,i);                       % Get the of this frame lsf parameter
    ai=lsf2ar(lsf);                       % call lsf2ar Function handle lsf Convert to prediction coefficient ar 
    sigma=sqrt(Gm(i));

    if SFm(i)==0                          % Wordless frame
        excitation=randn(wlen,1);         % Generate white noise
        [synt_frame,zint]=filter(sigma,ai,excitation,zint);
    else                                  % Talking frame
        PT=round(Dpitchm(i));             % Take cycle value
        exc_syn1 =zeros(wlen+tal,1);      % Initialization pulse generation area
        exc_syn1(mod(1:tal+wlen,PT)==0)=1;% A pulse is generated at the position of the pitch period with an amplitude of 1
        exc_syn2=exc_syn1(tal+1:tal+inc); % Calculate frame shift inc Number of pulses in the interval
        index=find(exc_syn2==1);
        excitation=exc_syn1(tal+1:tal+wlen);% The excitation pulse source of this frame
        
        if isempty(index)                 % Frame shift inc There is no pulse in the interval
            tal=tal+inc;                  % Calculate the leading zero point of the next frame
        else                              % Frame shift inc There are pulses in the interval
            eal=length(index);            % How many pulses are there in the calculation
            tal=inc-index(eal);           % Calculate the leading zero point of the next frame
        end
        gain=sigma/sqrt(1/PT);            % gain
        [synt_frame,zint]=filter(gain,ai,excitation,zint);%Speech synthesis using excitation pulses
    end
    
    if i==1                               % If frame 1
            output=synt_frame;            % Overlapping addition is not required,Preserve composite data
        else
            M=length(output);             % Treatment of overlapping parts
            output=[output(1:M-overlap); output(M-overlap+1:M).*tempr2+...
                synt_frame(1:overlap).*tempr1; synt_frame(overlap+1:wlen)];
        end
end
output(find(isnan(output)))=0;
bn=[0.964775   -3.858862   5.788174   -3.858862   0.964775]; % Filter coefficient
an=[1.000000   -3.928040   5.786934   -3.789685   0.930791];
output=filter(bn,an,output);             % High pass filtering
output=output/max(abs(output));          % Amplitude normalization
% Pronunciation through sound card,Compare original speech and synthetic speech
% wavplay(x,fs);
audioplayer(x,fs);
pause(1)
% wavplay(output,fs);
audioplayer(output,fs);
%% Mapping
figure(1)
ol=length(output);                        % Output data length
time1=(0:ol-1)/fs;                        % Find the time series of the output sequence
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)+85)])
subplot 411; plot(time,x,'k'); xlim([0 max(time)]);
title('Signal waveform'); ylabel('amplitude');
subplot 412; plot(frameTime,Ef,'k'); hold on
axis([0 max(time) 0 1.2]); plot(frameTime,T2,'k','linewidth',2);
line([0 max(time)],[T1 T1],'color','k','linestyle','-.');
title('Energy entropy ratio diagram'); axis([0 max(time) 0 1.2]);  ylabel('amplitude');
text(3.2,T1+0.05,'T1');
for k=1 : vsl
        line([frameTime(vseg(k).begin) frameTime(vseg(k).begin)],...
        [0 1.2],'color','k','Linestyle','-');
        line([frameTime(vseg(k).end) frameTime(vseg(k).end)],...
        [0 1.2],'color','k','Linestyle','--');
    if k==vsl
        Tx=T2(floor((vseg(k).begin+vseg(k).end)/2));
    end
end
text(2.6,Tx+0.05,'T2');
subplot 413; plot(frameTime,Dpitch,'k'); 
axis([0 max(time) 0 100]);title('Pitch period'); ylabel('Sample value');
subplot 414; plot(frameTime,Dfreq,'k'); 
axis([0 max(time) 0 450]);title('Pitch frequency'); ylabel('frequency/Hz');
xlabel('time/s'); 

figure(2)
subplot 211; plot(time,x,'k'); title('Original speech waveform'); 
axis([0 max(time) -1 1]); xlabel('time/s'); ylabel('amplitude')
subplot 212; plot(time1,output,'k');  title('Synthetic speech waveform');
xlim([0 max(time1)]); xlabel('time/s'); ylabel('amplitude')
fle =

colorcloud.wav

Please enter that the time length of the telescopic voice is a multiple of the original voice time length:1.5


12. MATLAB simulation Xi: tone change of voice signal

%
% pr10_6_2   
clear all; clc; close all;

filedir=[];                               % Set the path of the data file
filename='colorcloud.wav';                % Set the name of the data file
fle=[filedir filename]                    % The string that makes up the path and file name
% [xx,fs]=wavread(fle);                     % read file
[xx,fs]=audioread(fle);                     % read file
xx=xx-mean(xx);                           % Remove DC component
x=xx/max(abs(xx));                        % Amplitude normalization
lx=length(x);                             % Data length
time=(0:lx-1)/fs;                         % Find the corresponding time series
wlen=240;                                 % Set frame length
inc=80;                                   % Sets the length of the frame shift  
overlap=wlen-inc;                         % Overlap length
tempr1=(0:overlap-1)'/overlap;            % Oblique triangular window function w1
tempr2=(overlap-1:-1:0)'/overlap;         % Oblique triangular window function w2
n2=1:wlen/2+1;                            % Subscript value of positive frequency
X=enframe(x,wlen,inc)';                   % Framing according to parameters
fn=size(X,2);                             % Total frames
T1=0.1; r2=0.5;                           % Endpoint detection parameters
miniL=10;                                 % Minimum number of frames of a segment
mnlong=5;                                 % Minimum frame number of vowel body
ThrC=[10 15];                             % threshold
p=12;                                     % LPC Order
frameTime=frame2time(fn,wlen,inc,fs);     % Time scale corresponding to each frame
in=input('Please enter the multiple of pitch frequency rise and fall:','s');  % Input pitch frequency increase / decrease ratio
rate=str2num(in);

for i=1 : fn                              % Calculate the prediction coefficient and gain of each frame
    u=X(:,i);
    [ar,g]=lpc(u,p);
    AR_coeff(:,i)=ar;
    Gain(i)=g;
end

% pitch detection 
[Dpitch,Dfreq,Ef,SF,voiceseg,vosl,vseg,vsl,T2]=...
    Ext_F0ztms(x,fs,wlen,inc,T1,r2,miniL,mnlong,ThrC,0);

if rate>1, sign=1; else sign=-1; end
lmin=floor(fs/450);                       % Minimum pitch period
lmax=floor(fs/60);                        % Maximum pitch period
deltaOMG = sign*100*2*pi/fs;              % The amount of clockwise or counterclockwise rotation of the root value dθ
Dpitchm=Dpitch/rate;                      % Pitch period after increase or decrease
Dfreqm=Dfreq*rate;                        % Pitch frequency after increase or decrease

tal=0;                                    % initialization
zint=zeros(p,1); 
for i=1 : fn
    a=AR_coeff(:,i);                      % Get the of this frame AR coefficient
    sigma=sqrt(Gain(i));                  % Obtain the gain coefficient of this frame

    if SF(i)==0                           % Wordless frame
        excitation=randn(wlen,1);         % Generate white noise
        [synt_frame,zint]=filter(sigma,a,excitation,zint);
    else                                  % Talking frame
        PT=floor(Dpitchm(i));             % Change the period value to an integer
        if PT<lmin, PT=lmin; end          % Judge whether the modified cycle value exceeds the limit
        if PT>lmax, PT=lmax; end
        ft=roots(a);                      % Root the prediction coefficient
        ft1=ft;
%Increase or decrease the formant frequency to find a new root value
        for k=1 : p
            if imag(ft(k))>0, 
                ft1(k) = ft(k)*exp(1j*deltaOMG);
	        elseif imag(ft(k))<0 
                ft1(k) = ft(k)*exp(-1j*deltaOMG);
	        end
        end
        ai=poly(ft1);                     % Reconstitute the prediction coefficient from the new root value

        exc_syn1 =zeros(wlen+tal,1);      % Initialization pulse generation area
        exc_syn1(mod(1:tal+wlen,PT)==0)=1;% A pulse is generated at the position of the pitch period with an amplitude of 1
        exc_syn2=exc_syn1(tal+1:tal+inc); % Calculate frame shift inc Number of pulses in the interval
        index=find(exc_syn2==1);
        excitation=exc_syn1(tal+1:tal+wlen);% The excitation pulse source of this frame
        
        if isempty(index)                 % Frame shift inc There is no pulse in the interval
            tal=tal+inc;                  % Calculate the leading zero point of the next frame
        else                              % Frame shift inc There are pulses in the interval
            eal=length(index);            % Calculate the number of pulses
            tal=inc-index(eal);           % Calculate the leading zero point of the next frame
        end
        gain=sigma/sqrt(1/PT);            % gain
        [synt_frame,zint]=filter(gain,ai,excitation,zint);%Speech synthesis using excitation pulses
    end
    
    if i==1                               % If frame 1
            output=synt_frame;            % Overlapping addition is not required,Preserve composite data
        else
            M=length(output);             % Treatment of overlapping parts
            output=[output(1:M-overlap); output(M-overlap+1:M).*tempr2+...
                synt_frame(1:overlap).*tempr1; synt_frame(overlap+1:wlen)];
        end
end
output(find(isnan(output)))=0;
ol=length(output);                        % Output output Extended to and input signal xx Equal length
if ol<lx
    output1=[output; zeros(lx-ol,1)];
else
    output1=output(1:lx);
end
bn=[0.964775   -3.858862   5.788174   -3.858862   0.964775]; % Filter coefficient
an=[1.000000   -3.928040   5.786934   -3.789685   0.930791];
output=filter(bn,an,output1);             % High pass filtering
output=output/max(abs(output));           % Amplitude normalization
% Pronunciation through sound card,Compare original speech and synthetic speech
% wavplay(x,fs);
audioplayer(x,fs);
pause(1)
% wavplay(output,fs);
audioplayer(output,fs);
%% Mapping
figure(1)
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-150,pos(3),(pos(4)+100)])
subplot 411; plot(time,xx,'k'); xlim([0 max(time)]);
title('Signal waveform'); ylabel('amplitude'); xlabel('(a)');
subplot 412; plot(frameTime,Ef,'k'); hold on; xlabel('(b)')
axis([0 max(time) 0 1.2]); plot(frameTime,T2,'k','linewidth',2);
line([0 max(time)],[T1 T1],'color','k','linestyle','-.');
title('Energy entropy ratio diagram'); axis([0 max(time) 0 1.2]);  ylabel('amplitude');
text(3.2,T1+0.05,'T1');
for k=1 : vsl
        line([frameTime(vseg(k).begin) frameTime(vseg(k).begin)],...
        [0 1.2],'color','k','Linestyle','-');
        line([frameTime(vseg(k).end) frameTime(vseg(k).end)],...
        [0 1.2],'color','k','Linestyle','--');
    if k==vsl
        Tx=T2(floor((vseg(k).begin+vseg(k).end)/2));
    end
end
text(2.6,Tx+0.05,'T2');
subplot 413; plot(frameTime,Dpitch,'k'); hold on; xlabel('(c)')
line(frameTime,Dpitchm,'color',[.6 .6 .6]);
axis([0 max(time) 0 120]); title('Pitch period'); ylabel('Sample value');
subplot 414; plot(frameTime,Dfreq,'k');  hold on
line(frameTime,Dfreqm,'color',[.6 .6 .6]);
axis([0 max(time) 0 450]); title('Pitch frequency'); ylabel('frequency/Hz');
xlabel(['time/s' 10 '(d)']); 

figure(2)
subplot 211; plot(time,x,'k'); title('Original speech waveform'); 
axis([0 max(time) -1 1]); xlabel('time/s'); ylabel('amplitude')
subplot 212; plot(time,output,'k');  title('Synthetic speech waveform');
xlim([0 max(time)]); xlabel('time/s'); ylabel('amplitude')
fle =

colorcloud.wav

Please input the rising and falling multiple of pitch frequency:2


13. MATLAB simulation XII: speed change and tone change of voice signal

%
% pr10_6_3  
clear all; clc; close all;

filedir=[];                               % Set the path of the data file
filename='colorcloud.wav';                % Set the name of the data file
fle=[filedir filename]                    % The string that makes up the path and file name
% [xx,fs,nbits]=wavread(fle);               % read file
[xx,fs]=audioread(fle);               % read file
xx=xx-mean(xx);                           % Remove DC component
x=xx/max(abs(xx));                        % Amplitude normalization
lx=length(x);                             % Data length
time=(0:lx-1)/fs;                         % Find the corresponding time series
wlen=320;                                 % Set frame length
inc=80;                                   % Sets the length of the frame shift  
overlap=wlen-inc;                         % Overlap length
tempr1=(0:overlap-1)'/overlap;            % Oblique triangular window function w1
tempr2=(overlap-1:-1:0)'/overlap;         % Oblique triangular window function w2
n2=1:wlen/2+1;                            % Subscript value of positive frequency
X=enframe(x,wlen,inc)';                   % Framing according to parameters
fn=size(X,2);                             % Total frames
T1=0.1; r2=0.5;                           % Endpoint detection parameters
miniL=10;                                 % Minimum number of frames of a segment
mnlong=5;                                 % Minimum frame number of vowel body
ThrC=[10 15];                             % threshold
p=12;                                     % LPC Order
frameTime=frame2time(fn,wlen,inc,fs);     % Time scale corresponding to each frame
in=input('Please enter that the time length of the telescopic voice is a multiple of the original voice time length:','s');%Enter the extension length scale
rate=str2num(in);
in=input('Please enter the multiple of pitch frequency rise and fall:','s');  % Input pitch frequency increase / decrease ratio
rate1=1/str2num(in);

Dpitch=zeros(1,fn); Dfreq=zeros(1,fn);
for i=1 : fn                              % Calculate the prediction coefficient and gain of each frame
    u=X(:,i);
    [ar,g]=lpc(u,p);
    AR_coeff(:,i)=ar;
    Gain(i)=g;
end

% pitch detection 
[Dpitch,Dfreq,Ef,SF,voiceseg,vosl,vseg,vsl,T2]=...
    Ext_F0ztms(xx,fs,wlen,inc,T1,r2,miniL,mnlong,ThrC,0);
Dpitchm=Dpitch*rate1;
if rate1==1, sign=0; elseif rate1>1, sign=1; else sign=-1; end
lmin=floor(fs/450);                       % Minimum pitch period
lmax=floor(fs/60);                        % Maximum pitch period
deltaOMG = sign*100*2*pi/fs;              % The pole is left-handed or right-handed dθ

tal=0;                                    % initialization
zint=zeros(p,1); 
for i=1 : fn
    a2=AR_coeff(:,i);                     % Get the prediction coefficient of this frame
    lsf=ar2lsf(a2);                       % call ar2lsf Function solution lsf
    Glsf(:,i)=lsf;                        % hold lsf store in Glsf In array
end

% Shorten or lengthen the corresponding array by interpolation
fn1=floor(rate*fn);                        % Sets the total number of new frames fn1
Glsfm=interp1((1:fn)',Glsf',linspace(1,fn,fn1)')';   %hold LSF Coefficient interpolation
Dpitchm=interp1(1:fn,Dpitchm,linspace(1,fn,fn1));    %Interpolate the pitch period
Gm=interp1((1:fn),Gain,linspace(1,fn,fn1));          %Interpolate the gain coefficient
SFm=interp1((1:fn),SF,linspace(1,fn,fn1));           %hold SF Coefficient interpolation

for i=1:fn1;
    lsf=Glsfm(:,i);                       % Get the of this frame lsf parameter
    a=lsf2ar(lsf);                        % call lsf2ar Function handle lsf Convert to prediction coefficient ar 
    sigma=sqrt(Gm(i));

    if SFm(i)==0                          % Wordless frame
        excitation=randn(wlen,1);         % Generate white noise
        [synt_frame,zint]=filter(sigma,a,excitation,zint);
    else                                  % Talking frame
        PT=floor(Dpitchm(i));             % Change the period value to an integer
        if PT<lmin, PT=lmin; end
        if PT>lmax, PT=lmax; end
        ft=roots(a);
        ft1=ft;
%Increase or decrease formant frequency
        for k=1 : p
            if imag(ft(k))>0, 
                ft1(k) = ft(k)*exp(1j*deltaOMG);
	        elseif imag(ft(k))<0 
                ft1(k) = ft(k)*exp(-1j*deltaOMG);
	        end
        end
        ai=poly(ft1);                     % Composed of new root values AR coefficient

        exc_syn1 =zeros(wlen+tal,1);      % Initialization pulse generation area
        exc_syn1(mod(1:tal+wlen,PT)==0) = 1;  % A pulse is generated at the position of the pitch period with an amplitude of 1
        exc_syn2=exc_syn1(tal+1:tal+inc); % Calculate frame shift inc Number of pulses in the interval
        index=find(exc_syn2==1);
        excitation=exc_syn1(tal+1:tal+wlen);% The excitation pulse source of this frame
        
        if isempty(index)                 % Frame shift inc There is no pulse in the interval
            tal=tal+inc;                  % Calculate the leading zero point of the next frame
        else                              % Frame shift inc There are pulses in the interval
            eal=length(index);            % Calculate the number of pulses
            tal=inc-index(eal);           % Calculate the leading zero point of the next frame
        end
        gain=sigma/sqrt(1/PT);            % gain
        [synt_frame,zint]=filter(gain, ai,excitation,zint); % Speech synthesis with impulse
    end
    
    if i==1                               % If frame 1
            output=synt_frame;            % Overlapping addition is not required,Preserve composite data
        else
            M=length(output);             % Treatment of overlapping parts
            output=[output(1:M-overlap); output(M-overlap+1:M).*tempr2+...
                synt_frame(1:overlap).*tempr1; synt_frame(overlap+1:wlen)];
        end
end
output(find(isnan(output)))=0;
ol=length(output);                        % Output data length
time1=(0:ol-1)/fs;                        % Find the time series of the output sequence
bn=[0.964775   -3.858862   5.788174   -3.858862   0.964775]; % Filter coefficient
an=[1.000000   -3.928040   5.786934   -3.789685   0.930791];
output=filter(bn,an,output);              % High pass filtering
output=output/max(abs(output));           % Amplitude normalization
% wavplay(x,fs);
audioplayer(x,fs);
pause(1)
% wavplay(output,fs);
audioplayer(output,fs);
%% Mapping
figure(1)
figure(1)
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)+85)])
subplot 411; plot(time,xx,'k'); xlim([0 max(time)]);
title('Signal waveform'); ylabel('amplitude');
subplot 412; plot(frameTime,Ef,'k'); hold on; 
axis([0 max(time) 0 1.2]); plot(frameTime,T2,'k','linewidth',2);
line([0 max(time)],[T1 T1],'color','k','linestyle','-.');
title('Energy entropy ratio diagram'); axis([0 max(time) 0 1.2]);  ylabel('amplitude');
text(3.2,T1+0.05,'T1');
for k=1 : vsl
        line([frameTime(vseg(k).begin) frameTime(vseg(k).begin)],...
        [0 1.2],'color','k','Linestyle','-');
        line([frameTime(vseg(k).end) frameTime(vseg(k).end)],...
        [0 1.2],'color','k','Linestyle','--');
    if k==vsl
        Tx=T2(floor((vseg(k).begin+vseg(k).end)/2));
    end
end
text(2.6,Tx+0.05,'T2');
subplot 413; plot(frameTime,Dpitch,'k'); 
axis([0 max(time) 0 100]);title('Pitch period'); ylabel('Sample value');
subplot 414; plot(frameTime,Dfreq,'k'); 
axis([0 max(time) 0 450]);title('Pitch frequency'); ylabel('frequency/Hz');
xlabel('time/s'); 

figure(2)
subplot 211; plot(time,x,'k'); title('Original speech waveform'); 
axis([0 max(time) -1 1]); xlabel('time/s'); ylabel('amplitude')
subplot 212; plot(time1,output,'k');  title('Synthetic speech waveform');
xlim([0 max(time1)]); xlabel('time/s'); ylabel('amplitude')

fle =

colorcloud.wav

Please enter that the time length of the telescopic voice is a multiple of the original voice time length:1.8
 Please enter the multiple of pitch frequency rise and fall:1.6


14. MATLAB simulation XIII: time domain pitch synchronous superposition (TD-PSOLA) speech synthesis I

%
% pr10_7_1 
clear all; clc; close all;

filedir=[];                                % Set the path of the data file
filename='bluesky3.wav';                   % Set the name of the data file
fle=[filedir filename]                     % The string that makes up the path and file name
% [xx,fs]=wavread(fle);                      % read file
[xx,fs]=audioread(fle);                      % read file
xx=xx-mean(xx);                            % Remove DC component
x=xx/max(abs(xx));                         % normalization
N=length(x);                               % Data length
time=(0:N-1)/fs;                           % Find the corresponding time series

[LowPass] = LowPassFilter(x, fs, 500);     % Low pass filtering
p = PitchEstimation(LowPass, fs);		   % Calculate pitch frequency
[u, v] = UVSplit(p);                       % Find out the information with and without segments
lu=size(u,1); lv=size(v,1);                % initialization

pm = [];
ca = [];
for i = 1 : length(v(:,1))
    range = (v(i, 1) : v(i, 2));           % Take a segment information
    in = x(range);                         % Fetch segment data
% Find pitch pulse annotation for a segment
    [marks, cans] = VoicedSegmentMarking(in, p(range), fs);

    pm = [pm  (marks + range(1))];         % Save pitch pulse mark position
    ca = [ca;  (cans + range(1))];         % Save pitch pulse annotation candidate list
end
% Mapping
figure(1)
pos = get(gcf,'Position');
set(gcf,'Position',[pos(1), pos(2)-150,pos(3),pos(4)+100]);
subplot 211; plot(time,x,'k'); axis([0 max(time) -1 1.2]);
for k=1 : lv
    line([time(v(k,1)) time(v(k,1))],[-1 1.2 ],'color','k','linestyle','-')
    line([time(v(k,2)) time(v(k,2))],[-1 1.2 ],'color','k','linestyle','--')
end
title('Voice signal waveform and endpoint detection');
xlabel(['time/s' 10 '(a)']); ylabel('amplitude');
%figure(2)
%pos = get(gcf,'Position');
%set(gcf,'Position',[pos(1), pos(2)-100,pos(3),(pos(4)-160)]);
subplot 212; plot(x,'k'); axis([0 N -1 0.8]);
line([0 N],[0 0],'color','k')
lpm=length(pm);
for k=1 : lpm
    line([pm(k) pm(k)],[0 0.8],'color','k','linestyle','-.')
end
xlim([3000 4000]);
title('Some speech signal waveforms and corresponding pitch pulses are marked');
xlabel(['Sample point' 10 '(b)']); ylabel('amplitude');

15. MATLAB simulation 14: time domain pitch synchronous superposition (TD-PSOLA) speech synthesis II

%
% pr10_7_2 
clear all; clc; close all;

global config;                          % global variable config
config.pitchScale           = 2.0;	    % Set fundamental frequency modification factor
config.timeScale            = 1.5;	    % Set duration modification factor
config.resamplingScale      = 1;		% Resampling
config.reconstruct          = 0;		% If true, perform low-pass spectrum reconstruction
config.cutOffFreq           = 500;	    % Cut off frequency of low-pass filter

global data;            % global variable data,Initialize first
data.waveOut = [];		% Synthetic speech output adjusted by fundamental frequency modification factor and duration modification factor
data.pitchMarks = [];	% Pitch pulse marking of input speech signal
data.Candidates = [];	% Candidate list for pitch pulse marking of input speech signal

filedir=[];                               % Set the path of the data file
filename='colorcloud.wav';                % Set the name of the data file
fle=[filedir filename]                    % The string that makes up the path and file name
% [xx,fs]=wavread(fle);                     % read file
[xx,fs]=audioread(fle);                     % read file
xx=xx-mean(xx);                           % Remove DC component
WaveIn=xx/max(abs(xx));                   % normalization
N=length(WaveIn);                         % Data length
time=(0:N-1)/fs;                          % Find the corresponding time series

[LowPass] = LowPassFilter(WaveIn, fs, config.cutOffFreq); % Filter the low-pass signal
PitchContour = PitchEstimation(LowPass, fs);% Find the pitch track of speech signal
PitchMarking1(WaveIn, PitchContour, fs);	% Mark and record pitch pulse PSOLA synthesis
output=data.waveOut;
N1=length(output);                        % Output data length
time1=(0:N1-1)/fs;                        % Find the time series of the output sequence
% wavplay(xx,fs);
audioplayer(xx,fs);
pause(1)
% wavplay(output,fs)
audioplayer(output,fs)
% Mapping
subplot 211; plot(time,xx,'k'); 
xlabel('time/s'); ylabel('amplitude');
axis([0 max(time) -1 1.2 ]); title('Original voice')
subplot 212; plot(time1,output,'k'); 
xlabel('time/s'); ylabel('amplitude');
axis([0 max(time1) -1 1.5 ]); title('PSOLA Synthetic speech')

Summary

Speech signal synthesis is a widely used speech signal processing technology. Now there are many non artificial video dubbing and audio through speech synthesis in many audio software or video clips. At present, there is still a certain distance from human voice in terms of sentence. Because the intonation and emotion of different sentences in different scenes are different in the natural state of human voice, this difficulty still has a long way to go.

If you are interested in the content of this chapter or want to fully learn and understand it, it is recommended to study the content of Chapter 10 in the book. In the later stage, some knowledge points will be discussed and supplemented on the basis of their own understanding. You are welcome to study and exchange together.

About Mr. Song: Song Zhiyong -- an old man who silently teaches MATLAB and signal processing knowledge

Posted by neuro4848 on Mon, 18 Apr 2022 08:14:27 +0930