简体中文 | English

Voiceprint recognition system based on PyTorch


Disclaimer: this document was produced by machine translation; please refer to the original document here.

This branch is version 1.1; if you want to use the previous version 1.0, please switch to the 1.0 branch. This project uses several advanced voiceprint recognition models such as EcapaTdnn, ResNetSE, ERes2Net, and CAM++, and more models may be supported in the future. The project also supports multiple data preprocessing methods, such as MelSpectrogram, Spectrogram, MFCC, and Fbank. The main loss is ArcFace Loss (Additive Angular Margin Loss, called AAMLoss in this project), which normalizes the feature vectors and weights and adds an angular margin m to θ; the angular margin has a more direct effect on the angle than the cosine margin does. In addition, various other loss functions such as AMLoss, ARMLoss, and CELoss are also supported.
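For reference, the ArcFace (AAMLoss) objective described above can be written in its standard form, with s the feature scale and m the angular margin (this is the general formula, not a transcription of this project's code):

$$
L_{AAM} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}
$$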

Environment:

  • Anaconda 3
  • Python 3.11
  • Pytorch 2.4.0
  • Windows 11 or Ubuntu 22.04

Project Features

  1. Supported models: EcapaTdnn, TDNN, Res2Net, ResNetSE, ERes2Net, CAM++
  2. Supported pooling layers: AttentiveStatsPool(ASP), SelfAttentivePooling(SAP), TemporalStatisticsPooling(TSP), TemporalAveragePooling(TAP), TemporalStatsPool(TSTP)
  3. Supported loss functions: AAMLoss, SphereFace2, AMLoss, ARMLoss, CELoss, SubCenterLoss, TripletAngularMarginLoss
  4. Supported preprocessing methods: MelSpectrogram, Spectrogram, MFCC, Fbank, Wav2vec2.0, WavLM

Model Paper:

Download Model

| Model | Params(M) | Dataset | Train speakers | Threshold | EER | MinDCF |
|-------|-----------|---------|----------------|-----------|-----|--------|
| ERes2NetV2 | 6.6 | CN-Celeb | 2796 | 0.20089 | 0.08071 | 0.45705 |
| ERes2Net | 6.6 | CN-Celeb | 2796 | 0.20014 | 0.08132 | 0.45544 |
| CAM++ | 6.8 | CN-Celeb | 2796 | 0.23323 | 0.08332 | 0.48536 |
| ResNetSE | 7.8 | CN-Celeb | 2796 | 0.19066 | 0.08544 | 0.49142 |
| EcapaTdnn | 6.1 | CN-Celeb | 2796 | 0.23646 | 0.09259 | 0.51378 |
| TDNN | 2.6 | CN-Celeb | 2796 | 0.23858 | 0.10825 | 0.59545 |
| Res2Net | 5.0 | CN-Celeb | 2796 | 0.19526 | 0.12436 | 0.65347 |
| CAM++ | 6.8 | Larger dataset | 20,000+ | 0.33 | 0.07874 | 0.52524 |
| ERes2Net | 55.1 | Other datasets | 200,000+ | 0.36 | 0.02936 | 0.18355 |
| ERes2NetV2 | 56.2 | Other datasets | 200,000+ | 0.36 | 0.03847 | 0.24301 |
| CAM++ | 6.8 | Other datasets | 200,000+ | 0.29 | 0.04765 | 0.31436 |

Notes:

  1. Evaluated with the CN-Celeb test set, which contains 196 speakers.
  2. Speed perturbation augmentation (speed_perturb_3_class: True) is used, which triples the number of classification classes.
  3. The preprocessing method used is Fbank and the loss function is AAMLoss.
  4. The number of parameters does not include the parameters of the classifier.

VoxCeleb1&2 Data

| Model | Params(M) | Dataset | Train speakers | Threshold | EER | MinDCF |
|-------|-----------|---------|----------------|-----------|-----|--------|
| CAM++ | 6.8 | VoxCeleb1&2 | 7205 | 0.23 | 0.02659 | 0.18604 |
| ERes2Net | 6.6 | VoxCeleb1&2 | 7205 | 0.23 | 0.03648 | 0.25508 |
| ResNetSE | 7.8 | VoxCeleb1&2 | 7205 | | | |
| EcapaTdnn | 6.1 | VoxCeleb1&2 | 7205 | | | |
| TDNN | 2.6 | VoxCeleb1&2 | 7205 | | | |
| Res2Net | 5.0 | VoxCeleb1&2 | 7205 | | | |
| CAM++ | 6.8 | Larger dataset | 20,000+ | 0.28 | 0.03182 | 0.23731 |
| ERes2Net | 55.1 | Other datasets | 200,000+ | 0.53 | 0.08904 | 0.62130 |
| ERes2NetV2 | 56.2 | Other datasets | 200,000+ | 0.52 | 0.08649 | 0.64193 |
| CAM++ | 6.8 | Other datasets | 200,000+ | 0.49 | 0.10334 | 0.71200 |

Notes:

  1. Evaluated with the VoxCeleb1&2 test set, which contains 158 speakers.
  2. Speed perturbation augmentation (speed_perturb_3_class: True) is used, which triples the number of classification classes.
  3. The preprocessing method used is Fbank and the loss function is AAMLoss.
  4. The number of parameters does not include the parameters of the classifier.

Effect comparison experiment of preprocessing methods

| Preprocessing method | Dataset | Train speakers | Threshold | EER | MinDCF |
|----------------------|---------|----------------|-----------|-----|--------|
| Fbank | CN-Celeb | 2796 | 0.14574 | 0.10988 | 0.58955 |
| MFCC | CN-Celeb | 2796 | 0.14868 | 0.11483 | 0.61275 |
| Spectrogram | CN-Celeb | 2796 | 0.14962 | 0.11613 | 0.60057 |
| MelSpectrogram | CN-Celeb | 2796 | 0.13458 | 0.12498 | 0.60741 |
| w2v-bert-2.0 | CN-Celeb | 2796 | | | |
| wav2vec2-large-xlsr-53 | CN-Celeb | 2796 | | | |
| wavlm-base-plus | CN-Celeb | 2796 | | | |
| wavlm-large | CN-Celeb | 2796 | | | |

Notes:

  1. Evaluated with the CN-Celeb test set, which contains 196 speakers.
  2. The experimental data is CN-Celeb, the experimental model is CAM++, and the loss function is AAMLoss.
  3. The data is pre-extracted using extract_features.py, which means no data augmentation is applied to the audio during training.

Loss function effect comparison experiment

| Loss function | Dataset | Train speakers | Threshold | EER | MinDCF |
|---------------|---------|----------------|-----------|-----|--------|
| AAMLoss | CN-Celeb | 2796 | 0.14574 | 0.10988 | 0.58955 |
| SphereFace2 | CN-Celeb | 2796 | 0.20377 | 0.11309 | 0.61536 |
| TripletAngularMarginLoss | CN-Celeb | 2796 | 0.28940 | 0.11749 | 0.63735 |
| SubCenterLoss | CN-Celeb | 2796 | 0.13126 | 0.11775 | 0.56995 |
| ARMLoss | CN-Celeb | 2796 | 0.14563 | 0.11805 | 0.57171 |
| AMLoss | CN-Celeb | 2796 | 0.12870 | 0.12301 | 0.63263 |
| CELoss | CN-Celeb | 2796 | 0.13607 | 0.12684 | 0.65176 |

Notes:

  1. Evaluated with the CN-Celeb test set, which contains 196 speakers.
  2. The experimental data is CN-Celeb, the experimental model is CAM++, and the preprocessing method is Fbank.
  3. The data is pre-extracted using extract_features.py, which means no data augmentation is applied to the audio during training.

Installation Environment

  • First install the GPU version of PyTorch; skip this step if it is already installed.
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia
  • Install mvector.

Install it using pip with the following command:

python -m pip install mvector -U -i https://pypi.tuna.tsinghua.edu.cn/simple

Source installation is recommended, which ensures that the latest code is used.

git clone https://github.com/yeyupiaoling/VoiceprintRecognition_Pytorch.git
cd VoiceprintRecognition_Pytorch/
python setup.py install

Create Data

The author used CN-Celeb for this tutorial. This dataset contains the voices of about 3,000 speakers, with more than 650,000 utterances in total. After downloading, unzip the dataset into the 'dataset' directory. You also need to download the CN-Celeb test set. If you have other, better datasets you can mix them in, but it is best to use Python's aukit tool module first for audio processing, noise reduction, and silence removal.

The format of the data list is <audio_file_path>\t<speaker_label>. This list is created mainly for convenience when reading the data later, and it also makes it easy to read and use other speech datasets. The speaker label is the unique ID of the speaker. Put all of these datasets into the same data list.

Execute create_data.py to prepare the data.

python create_data.py

After executing the above program, data in the following format is generated. If you want to customize the data, refer to the data list below: each line starts with the relative path of the audio file, followed by the label of the speaker of that audio, just like a classification task (a small sketch for building such a list from a custom dataset follows the example below). A note on custom datasets: the test-list IDs do not have to match the training IDs, i.e. the test speakers do not have to appear in the training set; just make sure that, within the test list, the same person always has the same ID.

dataset/CN-Celeb2_flac/data/id11999/recitation-03-019.flac      2795
dataset/CN-Celeb2_flac/data/id11999/recitation-10-023.flac      2795
dataset/CN-Celeb2_flac/data/id11999/recitation-06-025.flac      2795
dataset/CN-Celeb2_flac/data/id11999/recitation-04-014.flac      2795
dataset/CN-Celeb2_flac/data/id11999/recitation-06-030.flac      2795
dataset/CN-Celeb2_flac/data/id11999/recitation-10-032.flac      2795
dataset/CN-Celeb2_flac/data/id11999/recitation-06-028.flac      2795
dataset/CN-Celeb2_flac/data/id11999/recitation-10-031.flac      2795
dataset/CN-Celeb2_flac/data/id11999/recitation-05-003.flac      2795
dataset/CN-Celeb2_flac/data/id11999/recitation-04-017.flac      2795
dataset/CN-Celeb2_flac/data/id11999/recitation-10-016.flac      2795
dataset/CN-Celeb2_flac/data/id11999/recitation-09-001.flac      2795
dataset/CN-Celeb2_flac/data/id11999/recitation-05-010.flac      2795
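
If you need to build such a list for a custom dataset yourself, the sketch below shows the idea, assuming one sub-directory per speaker under a data directory (this is only an illustration, not the project's create_data.py):

```python
import os

def write_data_list(data_dir, list_path):
    """Write '<audio_path>\t<speaker_label>' lines, one integer label per speaker directory."""
    speakers = sorted(d for d in os.listdir(data_dir)
                      if os.path.isdir(os.path.join(data_dir, d)))
    with open(list_path, "w", encoding="utf-8") as f:
        for label, speaker in enumerate(speakers):
            speaker_dir = os.path.join(data_dir, speaker)
            for name in sorted(os.listdir(speaker_dir)):
                if name.endswith((".wav", ".flac")):
                    f.write(f"{os.path.join(speaker_dir, name)}\t{label}\n")

# Hypothetical usage:
# write_data_list("dataset/CN-Celeb2_flac/data", "dataset/train_list.txt")
```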

Change preprocessing methods

By default, the Fbank preprocessing method is used in the configuration file. If you want to use another preprocessing method, modify the following section of the configuration file; the specific values can be adjusted to your own situation. If it is not clear how to set the parameters, you can remove that part and simply use the default values.

# Data preprocessing parameters
preprocess_conf:
  # Whether to use a Wav2Vec2-like model from HuggingFace to extract audio features
  use_hf_model: False
  # Audio preprocessing method, also called the feature extraction method
  # When use_hf_model is False, supports: MelSpectrogram, Spectrogram, MFCC, Fbank
  # When use_hf_model is True, this is a HuggingFace model name or local path, e.g. facebook/w2v-bert-2.0 or ./feature_models/w2v-bert-2.0
  feature_method: 'Fbank'
  # When use_hf_model is False, these are the API parameters; see the corresponding API for more options. If unsure, delete this part and the defaults will be used.
  # When use_hf_model is True, you can set use_gpu to control whether the GPU is used for feature extraction
  method_args:
    sample_frequency: 16000
    num_mel_bins: 80
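
For example, to extract features with a HuggingFace model instead of Fbank, the same section could look like the snippet below (an illustrative configuration based on the comments above; facebook/w2v-bert-2.0 is just one of the models mentioned there):

```yaml
preprocess_conf:
  # Use a HuggingFace Wav2Vec2-like model to extract features
  use_hf_model: True
  # HuggingFace model name or a local path
  feature_method: 'facebook/w2v-bert-2.0'
  method_args:
    # Whether to use the GPU when extracting features
    use_gpu: True
```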

Train

Use train.py to train the model. This project supports multiple audio preprocessing methods, specified by the preprocess_conf.feature_method parameter in the configs/ecapa_tdnn.yml configuration file: MelSpectrogram for the Mel spectrogram, Spectrogram for the spectrogram, MFCC for Mel-frequency cepstral coefficients, and so on. Data augmentation can be specified with the augment_conf_path argument. During training, VisualDL is used to save the training logs; you can view the training progress at any time by starting VisualDL with the command visualdl --logdir=log --host 0.0.0.0

# Single GPU training
CUDA_VISIBLE_DEVICES=0 python train.py
# Multi GPU training
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py

Train log:

[2023-08-05 09:52:06.497988 INFO   ] utils:print_arguments:13 - ----------- 额外配置参数 -----------
[2023-08-05 09:52:06.498094 INFO   ] utils:print_arguments:15 - configs: configs/ecapa_tdnn.yml
[2023-08-05 09:52:06.498149 INFO   ] utils:print_arguments:15 - do_eval: True
[2023-08-05 09:52:06.498191 INFO   ] utils:print_arguments:15 - local_rank: 0
[2023-08-05 09:52:06.498230 INFO   ] utils:print_arguments:15 - pretrained_model: None
[2023-08-05 09:52:06.498269 INFO   ] utils:print_arguments:15 - resume_model: None
[2023-08-05 09:52:06.498306 INFO   ] utils:print_arguments:15 - save_model_path: models/
[2023-08-05 09:52:06.498342 INFO   ] utils:print_arguments:15 - use_gpu: True
[2023-08-05 09:52:06.498378 INFO   ] utils:print_arguments:16 - ------------------------------------------------
[2023-08-05 09:52:06.513761 INFO   ] utils:print_arguments:18 - ----------- 配置文件参数 -----------
[2023-08-05 09:52:06.513906 INFO   ] utils:print_arguments:21 - dataset_conf:
[2023-08-05 09:52:06.513957 INFO   ] utils:print_arguments:24 -         dataLoader:
[2023-08-05 09:52:06.513995 INFO   ] utils:print_arguments:26 -                 batch_size: 64
[2023-08-05 09:52:06.514031 INFO   ] utils:print_arguments:26 -                 num_workers: 4
[2023-08-05 09:52:06.514066 INFO   ] utils:print_arguments:28 -         do_vad: False
[2023-08-05 09:52:06.514101 INFO   ] utils:print_arguments:28 -         enroll_list: dataset/enroll_list.txt
[2023-08-05 09:52:06.514135 INFO   ] utils:print_arguments:24 -         eval_conf:
[2023-08-05 09:52:06.514169 INFO   ] utils:print_arguments:26 -                 batch_size: 1
[2023-08-05 09:52:06.514203 INFO   ] utils:print_arguments:26 -                 max_duration: 20
[2023-08-05 09:52:06.514237 INFO   ] utils:print_arguments:28 -         max_duration: 3
[2023-08-05 09:52:06.514274 INFO   ] utils:print_arguments:28 -         min_duration: 0.5
[2023-08-05 09:52:06.514308 INFO   ] utils:print_arguments:28 -         noise_aug_prob: 0.2
[2023-08-05 09:52:06.514342 INFO   ] utils:print_arguments:28 -         noise_dir: dataset/noise
[2023-08-05 09:52:06.514374 INFO   ] utils:print_arguments:28 -         num_speakers: 3242
[2023-08-05 09:52:06.514408 INFO   ] utils:print_arguments:28 -         sample_rate: 16000
[2023-08-05 09:52:06.514441 INFO   ] utils:print_arguments:28 -         speed_perturb: True
[2023-08-05 09:52:06.514475 INFO   ] utils:print_arguments:28 -         target_dB: -20
[2023-08-05 09:52:06.514508 INFO   ] utils:print_arguments:28 -         train_list: dataset/train_list.txt
[2023-08-05 09:52:06.514542 INFO   ] utils:print_arguments:28 -         trials_list: dataset/trials_list.txt
[2023-08-05 09:52:06.514575 INFO   ] utils:print_arguments:28 -         use_dB_normalization: True
[2023-08-05 09:52:06.514609 INFO   ] utils:print_arguments:21 - loss_conf:
[2023-08-05 09:52:06.514643 INFO   ] utils:print_arguments:24 -         args:
[2023-08-05 09:52:06.514678 INFO   ] utils:print_arguments:26 -                 easy_margin: False
[2023-08-05 09:52:06.514713 INFO   ] utils:print_arguments:26 -                 margin: 0.2
[2023-08-05 09:52:06.514746 INFO   ] utils:print_arguments:26 -                 scale: 32
[2023-08-05 09:52:06.514779 INFO   ] utils:print_arguments:24 -         margin_scheduler_args:
[2023-08-05 09:52:06.514814 INFO   ] utils:print_arguments:26 -                 final_margin: 0.3
[2023-08-05 09:52:06.514848 INFO   ] utils:print_arguments:28 -         use_loss: AAMLoss
[2023-08-05 09:52:06.514882 INFO   ] utils:print_arguments:28 -         use_margin_scheduler: True
[2023-08-05 09:52:06.514915 INFO   ] utils:print_arguments:21 - model_conf:
[2023-08-05 09:52:06.514950 INFO   ] utils:print_arguments:24 -         backbone:
[2023-08-05 09:52:06.514984 INFO   ] utils:print_arguments:26 -                 embd_dim: 192
[2023-08-05 09:52:06.515017 INFO   ] utils:print_arguments:26 -                 pooling_type: ASP
[2023-08-05 09:52:06.515050 INFO   ] utils:print_arguments:24 -         classifier:
[2023-08-05 09:52:06.515084 INFO   ] utils:print_arguments:26 -                 num_blocks: 0
[2023-08-05 09:52:06.515118 INFO   ] utils:print_arguments:21 - optimizer_conf:
[2023-08-05 09:52:06.515154 INFO   ] utils:print_arguments:28 -         learning_rate: 0.001
[2023-08-05 09:52:06.515188 INFO   ] utils:print_arguments:28 -         optimizer: Adam
[2023-08-05 09:52:06.515221 INFO   ] utils:print_arguments:28 -         scheduler: CosineAnnealingLR
[2023-08-05 09:52:06.515254 INFO   ] utils:print_arguments:28 -         scheduler_args: None
[2023-08-05 09:52:06.515289 INFO   ] utils:print_arguments:28 -         weight_decay: 1e-06
[2023-08-05 09:52:06.515323 INFO   ] utils:print_arguments:21 - preprocess_conf:
[2023-08-05 09:52:06.515357 INFO   ] utils:print_arguments:28 -         feature_method: MelSpectrogram
[2023-08-05 09:52:06.515390 INFO   ] utils:print_arguments:24 -         method_args:
[2023-08-05 09:52:06.515426 INFO   ] utils:print_arguments:26 -                 f_max: 14000.0
[2023-08-05 09:52:06.515460 INFO   ] utils:print_arguments:26 -                 f_min: 50.0
[2023-08-05 09:52:06.515493 INFO   ] utils:print_arguments:26 -                 hop_length: 320
[2023-08-05 09:52:06.515527 INFO   ] utils:print_arguments:26 -                 n_fft: 1024
[2023-08-05 09:52:06.515560 INFO   ] utils:print_arguments:26 -                 n_mels: 64
[2023-08-05 09:52:06.515593 INFO   ] utils:print_arguments:26 -                 sample_rate: 16000
[2023-08-05 09:52:06.515626 INFO   ] utils:print_arguments:26 -                 win_length: 1024
[2023-08-05 09:52:06.515660 INFO   ] utils:print_arguments:21 - train_conf:
[2023-08-05 09:52:06.515694 INFO   ] utils:print_arguments:28 -         log_interval: 100
[2023-08-05 09:52:06.515728 INFO   ] utils:print_arguments:28 -         max_epoch: 30
[2023-08-05 09:52:06.515761 INFO   ] utils:print_arguments:30 - use_model: EcapaTdnn
[2023-08-05 09:52:06.515794 INFO   ] utils:print_arguments:31 - ------------------------------------------------
······
===============================================================================================
Layer (type:depth-idx)                        Output Shape              Param #
===============================================================================================
Sequential                                    [1, 9726]                 --
├─EcapaTdnn: 1-1                              [1, 192]                  --
│    └─Conv1dReluBn: 2-1                      [1, 512, 98]              --
│    │    └─Conv1d: 3-1                       [1, 512, 98]              163,840
│    │    └─BatchNorm1d: 3-2                  [1, 512, 98]              1,024
│    └─Sequential: 2-2                        [1, 512, 98]              --
│    │    └─Conv1dReluBn: 3-3                 [1, 512, 98]              263,168
│    │    └─Res2Conv1dReluBn: 3-4             [1, 512, 98]              86,912
│    │    └─Conv1dReluBn: 3-5                 [1, 512, 98]              263,168
│    │    └─SE_Connect: 3-6                   [1, 512, 98]              262,912
│    └─Sequential: 2-3                        [1, 512, 98]              --
│    │    └─Conv1dReluBn: 3-7                 [1, 512, 98]              263,168
│    │    └─Res2Conv1dReluBn: 3-8             [1, 512, 98]              86,912
│    │    └─Conv1dReluBn: 3-9                 [1, 512, 98]              263,168
│    │    └─SE_Connect: 3-10                  [1, 512, 98]              262,912
│    └─Sequential: 2-4                        [1, 512, 98]              --
│    │    └─Conv1dReluBn: 3-11                [1, 512, 98]              263,168
│    │    └─Res2Conv1dReluBn: 3-12            [1, 512, 98]              86,912
│    │    └─Conv1dReluBn: 3-13                [1, 512, 98]              263,168
│    │    └─SE_Connect: 3-14                  [1, 512, 98]              262,912
│    └─Conv1d: 2-5                            [1, 1536, 98]             2,360,832
│    └─AttentiveStatsPool: 2-6                [1, 3072]                 --
│    │    └─Conv1d: 3-15                      [1, 128, 98]              196,736
│    │    └─Conv1d: 3-16                      [1, 1536, 98]             198,144
│    └─BatchNorm1d: 2-7                       [1, 3072]                 6,144
│    └─Linear: 2-8                            [1, 192]                  590,016
│    └─BatchNorm1d: 2-9                       [1, 192]                  384
├─SpeakerIdentification: 1-2                  [1, 9726]                 1,867,392
===============================================================================================
Total params: 8,012,992
Trainable params: 8,012,992
Non-trainable params: 0
Total mult-adds (M): 468.81
===============================================================================================
Input size (MB): 0.03
Forward/backward pass size (MB): 10.36
Params size (MB): 32.05
Estimated Total Size (MB): 42.44
===============================================================================================
[2023-08-05 09:52:08.084231 INFO   ] trainer:train:388 - 训练数据:874175
[2023-08-05 09:52:09.186542 INFO   ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [0/13659], loss: 11.95824, accuracy: 0.00000, learning rate: 0.00100000, speed: 58.09 data/sec, eta: 5 days, 5:24:08
[2023-08-05 09:52:22.477905 INFO   ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [100/13659], loss: 10.35675, accuracy: 0.00278, learning rate: 0.00100000, speed: 481.65 data/sec, eta: 15:07:15
[2023-08-05 09:52:35.948581 INFO   ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [200/13659], loss: 10.22089, accuracy: 0.00505, learning rate: 0.00100000, speed: 475.27 data/sec, eta: 15:19:12
[2023-08-05 09:52:49.249098 INFO   ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [300/13659], loss: 10.00268, accuracy: 0.00706, learning rate: 0.00100000, speed: 481.45 data/sec, eta: 15:07:11
[2023-08-05 09:53:03.716015 INFO   ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [400/13659], loss: 9.76052, accuracy: 0.00830, learning rate: 0.00100000, speed: 442.74 data/sec, eta: 16:26:16
[2023-08-05 09:53:18.258807 INFO   ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [500/13659], loss: 9.50189, accuracy: 0.01060, learning rate: 0.00100000, speed: 440.46 data/sec, eta: 16:31:08
[2023-08-05 09:53:31.618354 INFO   ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [600/13659], loss: 9.26083, accuracy: 0.01256, learning rate: 0.00100000, speed: 479.50 data/sec, eta: 15:10:12
[2023-08-05 09:53:45.439642 INFO   ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [700/13659], loss: 9.03548, accuracy: 0.01449, learning rate: 0.00099999, speed: 463.63 data/sec, eta: 15:41:08

VisualDL: VisualDL dashboard page

Eval

After training, the prediction model is saved. We use the prediction model to extract features for the audio in the test set, then compare the features pairwise to compute EER and MinDCF.

python eval.py

The output will look like this:

······
------------------------------------------------
W0425 08:27:32.057426 17654 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 10.2
W0425 08:27:32.065165 17654 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2023-03-16 20:20:47.195908 INFO   ] trainer:evaluate:341 - 成功加载模型:models/EcapaTdnn_Fbank/best_model/model.pth
100%|███████████████████████████| 84/84 [00:28<00:00,  2.95it/s]
开始两两对比音频特征...
100%|███████████████████████████| 5332/5332 [00:05<00:00, 1027.83it/s]
评估消耗时间:65s,threshold:0.26,EER: 0.14739, MinDCF: 0.41999
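
The EER reported above is the operating point where the false acceptance rate equals the false rejection rate. A minimal numpy sketch of how it can be computed from pairwise scores is shown below (an illustration only, not the project's eval.py):

```python
import numpy as np

def compute_eer(scores, labels):
    """Return (EER, threshold); scores are pairwise similarities, labels are 1 for same speaker, 0 otherwise."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_eer, best_thr, min_gap = 1.0, 0.0, float("inf")
    for thr in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= thr)  # false acceptance rate
        frr = np.mean(scores[labels == 1] < thr)   # false rejection rate
        if abs(far - frr) < min_gap:
            min_gap, best_eer, best_thr = abs(far - frr), (far + frr) / 2, thr
    return best_eer, best_thr
```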

Voiceprint Contrast

Let's start implementing voiceprint comparison: create the infer_contrast.py program and write the infer() function. The model has two outputs: the first is the classification output, and the second is the audio feature output. Here we use the audio feature output; with the feature vector of an audio clip we can do voiceprint recognition. We input two voices, obtain their feature vectors through the prediction function, and compute the cosine similarity between them; the result can be used as their degree of similarity. The similarity threshold can be adjusted according to the accuracy requirements of your project.
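
The core of the comparison is the cosine similarity between the two feature vectors; a minimal numpy sketch of that computation is shown below (an illustration, not the project's infer_contrast.py):

```python
import numpy as np

def cosine_similarity(feature1, feature2):
    """Cosine similarity between two speaker embeddings (1-D numpy arrays)."""
    return float(np.dot(feature1, feature2) /
                 (np.linalg.norm(feature1) * np.linalg.norm(feature2)))

# Hypothetical usage: feature1 and feature2 are the vectors returned by infer()
# for the two audio files; they are judged to be the same speaker when the
# similarity exceeds the chosen threshold.
```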

python infer_contrast.py --audio_path1=audio/a_1.wav --audio_path2=audio/b_2.wav

The output will look like this:

[2023-04-02 18:30:48.009149 INFO   ] utils:print_arguments:13 - ----------- 额外配置参数 -----------
[2023-04-02 18:30:48.009149 INFO   ] utils:print_arguments:15 - audio_path1: dataset/a_1.wav
[2023-04-02 18:30:48.009149 INFO   ] utils:print_arguments:15 - audio_path2: dataset/b_2.wav
[2023-04-02 18:30:48.009149 INFO   ] utils:print_arguments:15 - configs: configs/ecapa_tdnn.yml
[2023-04-02 18:30:48.009149 INFO   ] utils:print_arguments:15 - model_path: models/EcapaTdnn_Fbank/best_model/
[2023-04-02 18:30:48.009149 INFO   ] utils:print_arguments:15 - threshold: 0.6
[2023-04-02 18:30:48.009149 INFO   ] utils:print_arguments:15 - use_gpu: True
[2023-04-02 18:30:48.009149 INFO   ] utils:print_arguments:16 - ------------------------------------------------
······································································
W0425 08:29:10.006249 21121 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 10.2
W0425 08:29:10.008555 21121 device_context.cc:465] device: 0, cuDNN Version: 7.6.
成功加载模型参数和优化方法参数:models/EcapaTdnn_Fbank/best_model/model.pth
audio/a_1.wav 和 audio/b_2.wav 不是同一个人,相似度为:-0.09565544128417969

Voiceprint Recognition

Based on the voiceprint comparison above, we create infer_recognition.py for voiceprint recognition. It uses the same infer() prediction function as the voiceprint comparison to obtain the speech features. The difference is that the author added load_audio_db(), register(), and recognition(). The first function loads the speech data in the voiceprint library; these audio clips correspond to registered users, and their registered speech data is stored here. If a user needs to log in through their voiceprint, the user's voice is compared with the voices in the speech library; if the comparison succeeds, the login succeeds and the user's registration information is retrieved. The second function, register(), saves a recording into the voiceprint library and adds its features to the set of features to compare against. Finally, recognition() compares the input speech with the speech in the library.

With the voiceprint recognition functions above, readers can complete voiceprint recognition according to the needs of their own projects. For example, the author provides a demo that performs voiceprint recognition through recording: first the speech library (the audio_db folder) is loaded; then, after the user presses the Enter key, the program automatically records for 3 seconds and uses the recorded audio for voiceprint recognition, matching it against the speech library to obtain the user's information. Readers can also adapt this to work through a service request, for example by providing an API for an app: the user logs in with their voiceprint on the app, the recorded voice is sent to the back end for voiceprint recognition, and the result is returned to the app, provided the user has already registered their voice and the speech data has been stored in the audio_db folder.
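
To illustrate the register()/recognition() logic described above, here is a minimal sketch that works directly on embeddings, i.e. the feature vectors returned by infer() (an illustration only, not the project's infer_recognition.py):

```python
import numpy as np

# Illustrative voiceprint library: speaker name -> registered embedding.
audio_db = {}

def register(name, embedding):
    """Save a user's (length-normalized) embedding into the voiceprint library."""
    audio_db[name] = embedding / np.linalg.norm(embedding)

def recognition(embedding, threshold=0.6):
    """Compare an input embedding against every registered user and return the best match."""
    query = embedding / np.linalg.norm(embedding)
    best_name, best_score = None, -1.0
    for name, registered in audio_db.items():
        score = float(np.dot(query, registered))
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score >= threshold else (None, best_score)
```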

python infer_recognition.py

The output will look like this:

[2023-04-02 18:31:20.521040 INFO   ] utils:print_arguments:13 - ----------- 额外配置参数 -----------
[2023-04-02 18:31:20.521040 INFO   ] utils:print_arguments:15 - audio_db_path: audio_db/
[2023-04-02 18:31:20.521040 INFO   ] utils:print_arguments:15 - configs: configs/ecapa_tdnn.yml
[2023-04-02 18:31:20.521040 INFO   ] utils:print_arguments:15 - model_path: models/EcapaTdnn_Fbank/best_model/
[2023-04-02 18:31:20.521040 INFO   ] utils:print_arguments:15 - record_seconds: 3
[2023-04-02 18:31:20.521040 INFO   ] utils:print_arguments:15 - threshold: 0.6
[2023-04-02 18:31:20.521040 INFO   ] utils:print_arguments:15 - use_gpu: True
[2023-04-02 18:31:20.521040 INFO   ] utils:print_arguments:16 - ------------------------------------------------
······································································
W0425 08:30:13.257884 23889 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 10.2
W0425 08:30:13.260191 23889 device_context.cc:465] device: 0, cuDNN Version: 7.6.
成功加载模型参数和优化方法参数:models/ecapa_tdnn/model.pth
Loaded 沙瑞金 audio.
Loaded 李达康 audio.
请选择功能,0为注册音频到声纹库,1为执行声纹识别:0
按下回车键开机录音,录音3秒中:
开始录音......
录音已结束!
请输入该音频用户的名称:夜雨飘零
请选择功能,0为注册音频到声纹库,1为执行声纹识别:1
按下回车键开机录音,录音3秒中:
开始录音......
录音已结束!
识别说话的为:夜雨飘零,相似度为:0.920434

Other Versions

Reference

  1. https://github.com/PaddlePaddle/PaddleSpeech
  2. https://github.com/yeyupiaoling/PaddlePaddle-MobileFaceNets
  3. https://github.com/yeyupiaoling/PPASR
  4. https://github.com/alibaba-damo-academy/3D-Speaker
  5. https://github.com/wenet-e2e/wespeaker