|
Deep neural networks are now widely used to extract speaker embeddings such as x-vectors and d-vectors, which are then combined with clustering to build a speaker diarization system. The robustness of the speaker embeddings determines the performance of such a system. Recently, ECAPA-TDNN embeddings have outperformed x-vectors in speaker recognition systems. In this paper, the embeddings extracted from each session are converted into a graph: each embedding is a node, and two nodes are connected when their similarity exceeds a set threshold. Features are then sampled and aggregated from the local neighborhood of each node, so that the structural information in the graph is used to reconstruct new speaker embeddings for each session through supervised learning. The reconstructed embeddings are then clustered with spectral clustering to perform speaker diarization. The proposed system achieves state-of-the-art results on the AMI dataset. |
|
Keywords: Signal and Information Processing; Speaker Diarization; Graph Neural Network; Clustering |
|
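The graph construction and neighborhood aggregation summarized in the abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and an unweighted mean over all neighbors as the aggregator; the abstract specifies neither, and the system it describes learns the aggregation through supervised training, which this sketch does not do. The function names `build_similarity_graph` and `mean_aggregate`, the threshold value, and the toy embeddings are all hypothetical.

```python
import numpy as np


def build_similarity_graph(embeddings, threshold=0.5):
    """Connect two nodes when their cosine similarity exceeds the threshold.

    Cosine similarity and the default threshold are assumptions of this sketch.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T
    adjacency = (similarity > threshold).astype(float)
    np.fill_diagonal(adjacency, 0.0)  # drop self-loops
    return adjacency


def mean_aggregate(embeddings, adjacency):
    """Concatenate each node's embedding with the mean of its neighbors.

    This is an untrained, GraphSAGE-style mean aggregation step; a trained
    model would additionally apply learned weights and a nonlinearity.
    """
    degree = adjacency.sum(axis=1, keepdims=True)
    # Isolated nodes have degree 0; clip to 1 so their neighbor mean is zero.
    neighbor_mean = (adjacency @ embeddings) / np.clip(degree, 1.0, None)
    return np.concatenate([embeddings, neighbor_mean], axis=1)


# Toy session with four segment embeddings from two hypothetical speakers.
embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])
adjacency = build_similarity_graph(embeddings, threshold=0.8)
features = mean_aggregate(embeddings, adjacency)
```

In this toy case the first two segments are connected to each other, as are the last two, so each node's aggregated feature is pulled toward its same-speaker neighbor. The resulting features, or an affinity matrix derived from them, would then be passed to a spectral clustering step.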