In this paper, a manifold subspace learning algorithm based on locality preserving discriminant projection (LPDP) is applied to speaker verification. LPDP overcomes the deficiencies of total variability factor analysis and locality preserving projection (LPP) by effectively exploiting the speaker label information of the speech data. Through optimization, LPDP preserves the inherent local manifold structure of the speech data samples of the same speaker by reducing the distance between them, while enhancing the discriminability of the embedding space by enlarging the distance between speech data samples of different speakers. The proposed method is compared with LPP and total variability factor analysis on the NIST SRE 2010 telephone-telephone core condition. The experimental results indicate that LPDP overcomes the deficiencies of both baselines and further improves system performance.

Speaker verification is a subtask of speaker recognition, whose purpose is to verify whether a segment of speech is spoken by a designated speaker [

As an application of probabilistic principal component analysis (PPCA), total variability factor analysis only analyzes the speech data from a global perspective [

In view of the above shortcomings of LPP, we apply the locality preserving discriminant projection (LPDP) algorithm in speaker verification. LPDP can bring in the speaker label information from the speech data and, through optimization, preserve the inherent local manifold structure of the speech data samples from the same speaker to reduce the distance between them. At the same time, the distance between the speech data samples from different speakers is enlarged to enhance the discriminative ability of the embedding space.

The remainder of this paper is organized as follows. The LPP algorithm based on i-vector is introduced in Section 2. The LPDP algorithm is proposed in Section 3. The experiment and results are presented in Section 4. The conclusion is given in Section 5.

Based on the total variability space, the GMM mean supervector containing speaker and channel information in the speech data can be expressed as

$$M = m + Tw \quad (1)$$

where m is the mean supervector of the universal background model (UBM) independent of the speaker and channel; T is the total variability space which is defined by the total variability matrix; and w is a low-dimensional latent variable that obeys the normal distribution, known as the total variability factor vector, or identity vector (i-vector). Total variability factor analysis can be regarded as a feature-extraction module. It projects the speech data into the low-rank total variability space T to obtain the i-vector w. The training method of T and the extraction process of the i-vector have been described previously [
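As a concrete reading of Equation (1), the sketch below instantiates it with random placeholder values; the dimensions (a 1024-component UBM with 36-dimensional features and a 400-dimensional total variability space) are illustrative assumptions, not trained quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 1024 * 36, 400           # supervector dim (components x feature dim), i-vector dim
m = rng.normal(size=D)          # UBM mean supervector (speaker/channel independent)
T = rng.normal(size=(D, K))     # total variability matrix defining the low-rank space
w = rng.normal(size=K)          # total variability factor vector (i-vector), w ~ N(0, I)
M = m + T @ w                   # GMM mean supervector, Eq. (1)
print(M.shape)                  # (36864,)
```

In a real system m, T, and w come from UBM training, total variability matrix estimation, and i-vector extraction, respectively; only the linear-Gaussian structure is shown here.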

The intersession compensation can be carried out in a low-dimensional space where the i-vector lies. The linear discriminant analysis (LDA) approach [

The speaker verification system framework, in which the LPP algorithm based on i-vector is used, is presented in

On the basis of i-vector, the LPP algorithm is used to achieve an effective combination of the total variability factor analysis technique and the LPP algorithm that retains both the global and local neighborhood structures of the speech data, thereby significantly improving system performance [


LPDP is an effective manifold learning method that has been successfully applied in face recognition [

The idea of applying LPDP to speaker verification is similar to that of LPP. LPDP learns a projection that maps each i-vector w_i in the original space R^D to y_i = A^T w_i in the feature space R^K (K < D). The steps to train the locality preserving discriminant projection matrix A are as follows.

Step 1: Determine the neighborhood of the i-vector w_i, which consists of all the i-vectors whose similarity with w_i is greater than its average similarity, i.e.,

$$MS(w_i) = \frac{1}{N} \sum_{j=1}^{N} \frac{w_i^T w_j}{\|w_i\|_2 \, \|w_j\|_2} \quad (2)$$

$$NB(w_i) = \left\{ w_j \;\middle|\; \frac{w_i^T w_j}{\|w_i\|_2 \, \|w_j\|_2} > MS(w_i) \right\} \quad (3)$$

where MS(w_i) is the average similarity of i-vector w_i with all N i-vectors of the training speech data, and NB(w_i) denotes the set of neighborhood i-vectors of w_i.
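Equations (2) and (3) amount to thresholding a cosine-similarity matrix at its per-row mean. A minimal sketch, where the helper name `neighborhoods` is hypothetical and the i-vectors are the rows of `W`; note that each i-vector trivially lands in its own neighborhood, which the later graph-construction steps discard as self-edges:

```python
import numpy as np

def neighborhoods(W):
    """Return, for each row w_i of W, the indices of NB(w_i) per Eqs. (2)-(3)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-normalize rows
    S = Wn @ Wn.T                                      # cosine similarity matrix
    MS = S.mean(axis=1)                                # Eq. (2): mean similarity per i-vector
    return [np.where(S[i] > MS[i])[0] for i in range(len(W))]  # Eq. (3)

nb = neighborhoods(np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]))
print(nb[0])  # [0 1]: the second vector is close to the first, the third is not
```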

Step 2: Construct two subgraphs of the neighborhood graph: the in-class graph G^in and the out-of-class graph G^out. In both graphs, the i-th node corresponds to the i-vector w_i. For the in-class graph G^in, we put a directed edge from node i to node j if i-vector w_j is in the neighborhood of i-vector w_i and belongs to the same class as w_i. For the out-of-class graph G^out, we put a directed edge from node i to node j if w_j is in the neighborhood of w_i but belongs to a different class than w_i.
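Assuming the neighborhoods from Step 1 are available as a boolean matrix (nb[i, j] true when w_j is in NB(w_i)), the directed adjacency of the two subgraphs can be sketched as follows; `class_graphs` is an illustrative helper name:

```python
import numpy as np

def class_graphs(nb, labels):
    """Directed edges of G^in and G^out (Step 2) from neighborhoods and speaker labels."""
    same = labels[:, None] == labels[None, :]  # same-speaker indicator matrix
    G_in = nb & same                           # edge i -> j: neighbor and same speaker
    np.fill_diagonal(G_in, False)              # no self-edges
    G_out = nb & ~same                         # edge i -> j: neighbor, different speaker
    return G_in, G_out
```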

Step 3: Calculate the weights of the edges in G^{in }and G^{out}, and obtain their respective weight matrices, W^{in} and W^{out}.

1) Denote the weight of the edge between i-vector w_i and i-vector w_j in G^in as W_ij^in and choose its value as

$$W_{ij}^{in} = \begin{cases} \exp\!\left(-\dfrac{\|w_i - w_j\|^2}{t}\right) & spk(w_i) = spk(w_j),\; w_j \in NB(w_i) \text{ or } w_i \in NB(w_j) \\ 0 & \text{otherwise} \end{cases} \quad (4)$$

2) Denote the weight of the edge between i-vector w_i and i-vector w_j in G^out as W_ij^out and choose its value as

$$W_{ij}^{out} = \begin{cases} 1 & spk(w_i) \neq spk(w_j),\; w_j \in NB(w_i) \text{ or } w_i \in NB(w_j) \\ 0 & \text{otherwise} \end{cases} \quad (5)$$

Here, spk(w_i) denotes the speaker label of i-vector w_i, and t is the mean distance between all the i-vectors of the training speech data.
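Equations (4) and (5) can then be sketched in a few lines. `weight_matrices` is a hypothetical helper, and taking the heat-kernel scale t as the mean pairwise squared distance is one plausible reading of "mean distance" in the text:

```python
import numpy as np

def weight_matrices(W, labels):
    """Build W^in (Eq. 4) and W^out (Eq. 5) from i-vector rows W and speaker labels."""
    N = len(W)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    S = Wn @ Wn.T                                        # cosine similarities
    nb = S > S.mean(axis=1, keepdims=True)               # NB(w_i) as boolean rows, Eq. (3)
    sym_nb = nb | nb.T                                   # w_j in NB(w_i) or w_i in NB(w_j)
    d2 = ((W[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)  # squared distances
    t = d2[np.triu_indices(N, k=1)].mean()               # heat-kernel scale (assumption)
    same = labels[:, None] == labels[None, :]
    W_in = np.where(same & sym_nb, np.exp(-d2 / t), 0.0)     # Eq. (4)
    np.fill_diagonal(W_in, 0.0)                          # drop self-edges
    W_out = np.where(~same & sym_nb, 1.0, 0.0)           # Eq. (5)
    return W_in, W_out
```

Because the neighborhood test is symmetrized ("or" in Eqs. (4)-(5)) and the conditions on the pair (i, j) are symmetric, both matrices come out symmetric, which the Laplacians in Step 4 rely on.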

Step 4: Calculate the locality preserving discriminant projection matrix A. The idea of LPDP is that, in the embedding space, the i-vectors from the same speaker have the smallest in-class divergence after projection, i.e., the distance between the same speaker’s i-vectors is as small as possible. Conversely, the i-vectors from different speakers have the largest between-class divergence after projection, i.e., they are as far from each other as possible. To achieve these goals, they are integrated into the following two optimization problems [

$$\min_A \sum_{i,j} \|y_i - y_j\|^2 \, W_{ij}^{in} = \min_A \, tr(A^T X L^{in} X^T A) \quad (6)$$

$$\max_A \sum_{i,j} \|y_i - y_j\|^2 \, W_{ij}^{out} = \max_A \, tr(A^T X L^{out} X^T A) \quad (7)$$

where L^in = D^in − W^in is the Laplacian of the in-class graph, D^in is a diagonal matrix with D_ii^in = Σ_j W_ij^in; L^out = D^out − W^out is the Laplacian of the out-of-class graph, D^out is a diagonal matrix with D_ii^out = Σ_j W_ij^out; and X = [w_1, w_2, ⋯, w_N] is the matrix whose columns are the training i-vectors.

Using the constraint condition A^T X D^out X^T A = I, (6) and (7) can be combined into a single optimization problem,

$$\min_A \left[ \alpha \, tr(A^T X L^{in} X^T A) - \beta \, tr(A^T X L^{out} X^T A) \right] = \min_A \, tr(A^T X H X^T A), \qquad H = \alpha L^{in} - \beta L^{out} \quad (8)$$

which can be further transformed to a generalized eigenvalue problem,

$$X H X^T A = \lambda X D^{out} X^T A \quad (9)$$

By solving the generalized eigenvalue problem (9), the locality preserving discriminant projection matrix A = [a_1, a_2, ⋯, a_K] can be obtained, where a_1, a_2, ⋯, a_K are the eigenvectors corresponding to the K largest eigenvalues of the above problem.
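Step 4 is a generalized symmetric eigenproblem, solvable with plain numpy via Cholesky whitening of X D^out X^T. The helper name and the small ridge term added for numerical stability are assumptions beyond the paper, which otherwise follows the text in retaining the K largest eigenvalues:

```python
import numpy as np

def lpdp_projection(X, W_in, W_out, K, alpha=1.0, beta=1.0):
    """Solve X H X^T a = lambda X D^out X^T a (Eq. 9).
    X: (D, N) matrix with training i-vectors as columns."""
    L_in = np.diag(W_in.sum(axis=1)) - W_in        # in-class graph Laplacian
    L_out = np.diag(W_out.sum(axis=1)) - W_out     # out-of-class graph Laplacian
    H = alpha * L_in - beta * L_out                # Eq. (8)
    A_mat = X @ H @ X.T
    B_mat = X @ np.diag(W_out.sum(axis=1)) @ X.T   # constraint matrix X D^out X^T
    B_mat += 1e-6 * np.trace(B_mat) / len(B_mat) * np.eye(len(B_mat))  # stability ridge
    # Reduce to a standard symmetric eigenproblem via Cholesky whitening:
    # if (L^-1 A L^-T) u = lam u with B = L L^T, then A (L^-T u) = lam B (L^-T u).
    Lc = np.linalg.cholesky(B_mat)
    Li = np.linalg.inv(Lc)
    vals, U = np.linalg.eigh(Li @ A_mat @ Li.T)    # ascending eigenvalues
    vecs = Li.T @ U                                # generalized eigenvectors
    return vecs[:, ::-1][:, :K]                    # a_1..a_K: K largest eigenvalues
```

After training, each i-vector w is projected as y = A.T @ w before scoring.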

Experiments were carried out on the core test set of the NIST SRE 2010 telephone-training, telephone-testing condition. The equal error rate (EER) and minimum detection cost function (minDCF) were used as metrics for system evaluation [

In the experiments, 36-dimensional Mel-frequency cepstral coefficients (MFCCs), consisting of 18 static coefficients and their first-order derivatives, were used. Each frame of a speech utterance was extracted with a 20 ms Hamming window shifted by 10 ms. To mitigate channel effects, feature warping, cepstral mean normalization (CMN), and cepstral variance normalization (CVN) were applied to the features.

Two gender-dependent universal background models (UBMs) with 1024 Gaussian components were trained on the NIST SRE 2004 1-side dataset. The gender-dependent total variability matrix T and the LPP, LPDP, WCCN, and LDA matrices were trained on the NIST SRE 2004, 2005, and 2006 corpora. The background data for the SVM were also selected from the NIST SRE 2004, 2005, and 2006 datasets. The SVM Light toolkit was used for SVM modeling [

To verify the performance of the proposed LPDP algorithm, we experimentally compared it with the traditional total variability factor analysis and LPP algorithms.

| System | Male EER (%) | Male minDCF | Female EER (%) | Female minDCF |
| --- | --- | --- | --- | --- |
| Total variability factor analysis | 8.42 | 0.0672 | 9.84 | 0.0832 |
| LPP | 5.99 | 0.0606 | 8.66 | 0.0738 |
| LPDP | 5.01 | 0.0527 | 6.12 | 0.0674 |

| System | Male EER (%) | Male minDCF | Female EER (%) | Female minDCF |
| --- | --- | --- | --- | --- |
| Total variability factor analysis + LDA | 5.07 | 0.0516 | 7.40 | 0.0723 |
| LPP + LDA | 5.55 | 0.0492 | 7.65 | 0.0622 |
| LPDP + LDA | 4.23 | 0.0437 | 5.61 | 0.0585 |

| System | Male EER (%) | Male minDCF | Female EER (%) | Female minDCF |
| --- | --- | --- | --- | --- |
| Total variability factor analysis + WCCN | 6.77 | 0.0532 | 9.32 | 0.0752 |
| LPP + WCCN | 5.08 | 0.0456 | 6.22 | 0.0540 |
| LPDP + WCCN | 4.48 | 0.0426 | 5.71 | 0.0512 |

| System | Male EER (%) | Male minDCF | Female EER (%) | Female minDCF |
| --- | --- | --- | --- | --- |
| Total variability factor analysis + LDA + WCCN | 4.61 | 0.0502 | 6.14 | 0.0607 |
| LPP + LDA + WCCN | 4.43 | 0.0462 | 5.85 | 0.0577 |
| LPDP + LDA + WCCN | 4.02 | 0.0409 | 5.21 | 0.0517 |

On the basis of LPP, this paper introduced LPDP to speaker verification. LPDP makes full use of the speaker label information of the speech data to categorize and differentiate the neighborhood. It overcomes the shortcomings of the total variability factor analysis method by maintaining the intrinsic local neighborhood relationship of in-class (same-speaker) speech data, thereby reflecting both the global and local structure of the speech data more comprehensively. It also addresses the inadequacy of LPP by maximizing the distance between out-of-class (different-speaker) speech data, yielding more discriminative feature vectors and enhancing the discriminative ability of the projection space, thereby improving the recognition performance of the system. Our future work will be devoted to further enhancing the discrimination of the embedding space and improving the recognition performance of the system.

This work was supported by the National Natural Science Foundation of China (No.11704229).

The authors declare no conflicts of interest regarding the publication of this paper.

Liang, C.Y., Cao, W. and Cao, S.X. (2020) Locality Preserving Discriminant Projection for Speaker Verification. Journal of Computer and Communications, 8, 14-22. https://doi.org/10.4236/jcc.2020.811002