
Speaker Identification

To verify a speaker, Phonexia Voice Verify uses Speaker Identification technology. Speaker Identification uses voice biometry to recognize speakers automatically by their voice with very high accuracy.

The technology is based on the fact that the speech organs and speech habits of every person are unique. As a result, the characteristics of the speech signal captured in a recording are also unique, thus the technology can be language-, accent-, text-, and channel-independent.

How does Speaker Identification work?

Automatic speaker recognition systems extract the features of a voice into a voiceprint. A voiceprint is a small numerical representation of the voice, capturing the most distinctive characteristics of a speaker’s voice.
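
As a rough mental model, a voiceprint can be pictured as a fixed-length vector of floating-point numbers extracted from the audio. The sketch below is purely illustrative: the dimension, the placeholder extractor, and the cosine similarity are assumptions made for the example, not the actual Phonexia representation or scoring.

```python
# Illustrative only: a voiceprint pictured as a fixed-length float vector.
# The dimension (256) and cosine similarity are assumptions made for this
# sketch, not the actual Phonexia representation or scoring method.
import math
import random

VOICEPRINT_DIM = 256  # hypothetical size of the numerical representation

def extract_voiceprint(audio_samples: list[float]) -> list[float]:
    """Placeholder for a real feature extractor; returns a deterministic dummy vector."""
    rng = random.Random(len(audio_samples))
    return [rng.gauss(0.0, 1.0) for _ in range(VOICEPRINT_DIM)]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """One simple way to compare two fixed-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```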

The whole voice verification process consists of two parts:

  1. speaker enrollment – a speaker who is not yet in the database has to be enrolled; this is done by creating a voiceprint from their voice sample and saving it to the database
  2. speaker verification – two voice samples are compared to decide whether they belong to the same speaker; this is done by comparing their voiceprints

The result from this comparison is a score value expressing the likelihood of the voice samples being from the same speaker. At score = 0, the system is uncertain. When score > 2, the comparison is positive, meaning the recordings are, with very high probability, from the same speaker. The greater the value, the more certain the system is.

Every environment can introduce a score shift, meaning that the score = 0 baseline moves in one direction, which would lead to an incorrect threshold setting. This is solved individually by calibration (see below).
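
In code, the decision itself reduces to comparing the score against a threshold, optionally compensating the environment-specific shift. The threshold of 2 follows the interpretation above; the shift value in the example is made up, and in practice both come from calibration.

```python
# Minimal sketch of the verification decision. The threshold of 2.0 follows the
# score interpretation above; the score shift is a made-up example value that,
# in practice, would be determined by calibration for the given environment.
def is_same_speaker(score: float, threshold: float = 2.0, score_shift: float = 0.0) -> bool:
    """Return True when the two voice samples are judged to come from the same speaker."""
    calibrated_score = score - score_shift  # compensate the environment's baseline shift
    return calibrated_score > threshold

# Example: a raw score of 2.8 in an environment whose baseline is shifted by +0.5
print(is_same_speaker(2.8, threshold=2.0, score_shift=0.5))  # True (2.3 > 2.0)
```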

Accuracy

How do we measure the accuracy of our technology?

To measure the accuracy of our Speaker Identification technology, we use standard metrics such as the Equal Error Rate (EER), the Detection Error Tradeoff (DET) curve, or the False Rejection Rate (FRR) at a fixed False Acceptance Rate (FAR). If you are familiar with these terms, feel free to skip the next several paragraphs of explanation and go straight to the accuracy measurements below.

Understanding the accuracy measurements

The basic process of voice verification consists of these steps:

  1. a voice sample of the specific person is taken
  2. a voiceprint is made out of this sample (more info at Speaker Identification)
  3. the voiceprint is stored in the database (this process is called enrollment)
  4. another voiceprint can be compared to this voiceprint to compute their similarity score
  5. if the similarity score is greater than a given threshold, the voiceprints are considered to belong to the same person

The purpose of voice verification is therefore to tell if the two voiceprints being compared come from the same person. If they do, it is called a match.
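
Put together, the five steps above can be sketched as a small enrollment/verification flow. The functions below are toy stand-ins used only to illustrate the flow; they are not a real Phonexia API, and the threshold of 2 again follows the score interpretation given earlier.

```python
# Toy sketch of the enrollment/verification flow described above.
# extract_voiceprint() and compare_voiceprints() are hypothetical stand-ins for
# the real technology; only the overall flow is meant to be representative.
voiceprint_db: dict[str, list[float]] = {}    # step 3: the enrollment database

def extract_voiceprint(audio_samples: list[float]) -> list[float]:
    """Toy stand-in for a real voiceprint extractor (steps 1-2)."""
    mean = sum(audio_samples) / len(audio_samples)
    return [mean]                             # a real voiceprint has many dimensions

def compare_voiceprints(a: list[float], b: list[float]) -> float:
    """Toy stand-in for a real comparison score (higher = more similar)."""
    return 4.0 - abs(a[0] - b[0])

def enroll(speaker_id: str, audio_samples: list[float]) -> None:
    voiceprint_db[speaker_id] = extract_voiceprint(audio_samples)    # steps 1-3

def verify(speaker_id: str, audio_samples: list[float], threshold: float = 2.0) -> bool:
    probe = extract_voiceprint(audio_samples)                        # steps 1-2
    score = compare_voiceprints(voiceprint_db[speaker_id], probe)    # step 4
    return score > threshold                                         # step 5 (a match)
```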

Unfortunately, every biometric system can make mistakes, depending on the capabilities of the system and the amount of useful information available in the samples (for example, in a very noisy environment a lot of information about the voice itself is lost).

Two possible cases of a verification attempt have to be distinguished:

  • a target attempt – a verification attempt by a user to match their own stored voiceprint
  • a non-target attempt – the opposite – a user’s voiceprint is compared against someone else’s voiceprint

Therefore two possible mistakes can be made:

  • a target verification attempt is incorrectly evaluated as a non-target (this is called false rejection) – this means that the verification should be successful, but isn’t
  • a non-target verification attempt is incorrectly evaluated as a match (false acceptance) – the verification should be denied, but isn’t

These mistakes are expressed as the following rates (a short computation sketch follows the list):

  • False Rejection Rate (FRR)
    FRR = (NFR / NT) * 100,
    where NFR is the number of false rejections and NT is the number of target attempts
  • False Acceptance Rate (FAR)
    FAR = (NFA / NNT) * 100,
    where NFA is the number of false acceptances and NNT is the number of non-target attempts
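
As a small, self-contained illustration of these two formulas, the sketch below counts errors over labeled verification trials. The score lists and the threshold of 2 are hypothetical example values.

```python
# Computing FRR and FAR from labeled trial scores, following the formulas above.
# The score lists and the threshold are hypothetical example values.
def false_rejection_rate(target_scores: list[float], threshold: float) -> float:
    """FRR = (number of false rejections / number of target attempts) * 100."""
    false_rejections = sum(1 for s in target_scores if s <= threshold)
    return 100.0 * false_rejections / len(target_scores)

def false_acceptance_rate(nontarget_scores: list[float], threshold: float) -> float:
    """FAR = (number of false acceptances / number of non-target attempts) * 100."""
    false_acceptances = sum(1 for s in nontarget_scores if s > threshold)
    return 100.0 * false_acceptances / len(nontarget_scores)

target_scores = [3.1, 2.7, 1.9, 4.2, 2.4]       # genuine-speaker attempts (toy data)
nontarget_scores = [-1.5, 0.3, 2.1, -0.8, 0.0]  # impostor attempts (toy data)

print(false_rejection_rate(target_scores, threshold=2.0))      # 20.0
print(false_acceptance_rate(nontarget_scores, threshold=2.0))  # 20.0
```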

What we can do now is change the threshold value mentioned above. If we lower the threshold, more attempts will be evaluated as a match, so FAR will increase and FRR will decrease. If we raise the threshold, the opposite happens. A common way of displaying this trade-off is to plot FRR (y-axis) against FAR (x-axis). This graph is called a Detection Error Tradeoff (DET) graph:

(please note that the axes use a logarithmic scale)

In this graph you can see the DET curves for six different lengths of net speech. The blue line corresponds to 3 seconds of net speech, and as you can see, the more net speech is used, the more accurate the results are.

The higher the FAR, the lower the FRR (and the other way around). There is exactly one operating point where FAR = FRR. This value is called the Equal Error Rate (EER) and is another standard metric of the accuracy of biometric technologies. The next graph shows how the EER changes with the seconds of net speech used for different versions of our Speaker Identification. The current version is XL4.
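
To connect the threshold, the DET curve, and the EER, the sketch below sweeps a range of thresholds over toy score lists: each threshold gives one (FAR, FRR) point of the DET curve, and the EER is read off where the two rates are closest to equal. All values are placeholders; real evaluations use thousands of trials.

```python
# Threshold sweep producing DET points and a coarse EER estimate.
# The score lists and threshold grid are toy placeholders.
def det_points(target_scores, nontarget_scores, thresholds):
    points = []
    for t in thresholds:
        frr = 100.0 * sum(s <= t for s in target_scores) / len(target_scores)
        far = 100.0 * sum(s > t for s in nontarget_scores) / len(nontarget_scores)
        points.append((t, far, frr))            # one point of the DET curve
    return points

def approximate_eer(points):
    """Return the (threshold, FAR, FRR) point where FAR and FRR are closest."""
    return min(points, key=lambda p: abs(p[1] - p[2]))

target_scores = [3.1, 2.7, 1.9, 4.2, 2.4]        # genuine-speaker attempts (toy data)
nontarget_scores = [-1.5, 0.3, 2.1, -0.8, 0.0]   # impostor attempts (toy data)
thresholds = [x / 10.0 for x in range(-20, 41)]  # sweep thresholds from -2.0 to 4.0

det = det_points(target_scores, nontarget_scores, thresholds)
print(approximate_eer(det))   # the point where FAR and FRR meet on this toy data
```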

Different use cases have different requirements. Sometimes a small FAR is required (for example, when security is the top priority for a particular bank). A voice biometric technology can achieve any FAR; the question is how high the FRR will be. We can therefore plot FRR in relation to the net speech length used for different fixed FARs.

For example, if the requirement is a FAR of 3% and we need to verify customers quickly, after just 4 seconds of net speech, we can expect an FRR of around 11%.
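
Building on the sweep from the previous sketch, picking an operating point for a fixed FAR means taking the least strict threshold whose FAR still meets the requirement and reading off the FRR there. The 3% target mirrors the example above; the DET points are assumed to come from a sweep like the one shown earlier.

```python
# Choosing an operating point for a fixed FAR (here 3%, as in the example above).
# det is assumed to be a list of (threshold, FAR, FRR) points from a sweep such as
# the one in the previous sketch; with real data it would contain many trials.
def frr_at_fixed_far(det, max_far: float):
    eligible = [p for p in det if p[1] <= max_far]   # points meeting the FAR requirement
    return min(eligible, key=lambda p: p[0])         # least strict threshold among them

# threshold, far, frr = frr_at_fixed_far(det, max_far=3.0)
# print(f"at threshold {threshold}: FAR {far}%, FRR {frr}%")
```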

Datasets

Results will differ slightly for every customer/call center, as the environment changes and impacts both the data and the accuracy. These measurements were done on NIST telephone data so that they are as close to the call center scenario as possible; therefore, these values can realistically be expected in real-world usage.

EERs (in %) for different speech length combinations are listed below for two sample sets (columns represent enrollment length, rows represent test recording speech length):

Customer’s dataset – Dataset1: 

              20 seconds   30 seconds   60 seconds
3 seconds     4.71         4.71         4.71
5 seconds     4.24         3.99         4.05
7 seconds     2.59         2.35         2.59
10 seconds    2.35         2.35         2.35
20 seconds    2.17         2.09         2.12
30 seconds    2.07         1.88         1.88

Internal dataset – Dataset2: 

              20 seconds   30 seconds   60 seconds
3 seconds     8.02         7.73         7.29
5 seconds     5.99         5.73         5.33
7 seconds     5.11         4.87         4.57
10 seconds    4.48         4.26         3.94
20 seconds    3.84         3.54         3.21
30 seconds    3.63         3.36         3.08

The tables above show the accuracy of an out-of-the-box solution measured for the listed combinations of enrollment and verification speech lengths. The numbers reflect real-world use, as the datasets were selected to be as close to real-world usage as possible. Accuracy can be further improved by calibration.

During the development of the algorithms, Phonexia achieved the best results of 0.9% EER on the NIST SRE 2010 dataset, following the evaluation metrics specified by NIST. This dataset is similar to telephony data. 

Datasets information 

            Recordings   Speakers
Dataset1    483          280
Dataset2    5449         100

Results enhancements

One thing to keep in mind is that all of these results are achieved with no custom calibration at all.

Evaluation and Calibration

Calibration of the system means setting up the technology in such a way that it provides the maximum value to the customer, depending on the use case.

Overview of the calibration steps:

  1. Setting up the threshold
    Any two voice captures (streams) are similar to some extent. The similarity is expressed as a number: the higher the number, the higher the similarity.
    The first step of calibration is setting up a threshold, a border above which the voice streams are considered to come from the same speaker.
    To set up the threshold, the performance of the system needs to be assessed, or evaluated.
    1. Gathering an annotated evaluation set (20-50 speakers, 3+ recordings of each) (Client)
    2. Evaluation (done by Phonexia)
    3. Setting up a threshold based on the customer’s use case (security versus Customer convenience) (done by Phonexia)
  2. Mean normalization
    Every customer’s data is different. Even though the Phonexia Speaker Identification technology is trained on a variety of data from multiple environments, it is worth performing mean normalization, which adapts the technology to the customer’s data (a simplified illustration of the idea follows this list).
    An existing evaluation set is a prerequisite for mean normalization.
    1. Gathering an adaptation set (100-150 random recordings from the customer’s environment, real production data; 60+ seconds of net speech in each recording)
    2. Preparing the Audio Source Profile (ASP) (done by Phonexia)
    3. Setting up a proper threshold – point 1 (done by Phonexia)
  3. Adaptation for False Acceptance Rate (FAR)
    The final step in the calibration. If the customer wants to tune the technology to the maximum and has the capacity to prepare a proper dataset, it is possible to adapt the technology to the customer’s data and the desired FAR. With this type of calibration, the system performs better at the selected FAR; it can, however, perform worse at a different operating point (customer convenience versus security).
    An existing evaluation set is a prerequisite for FAR adaptation.
    1. Gathering an adaptation set (1000 recordings, 60+ seconds net speech each, different speakers)
    2. Preparing the Audio Source Profile (ASP) (done by Phonexia)
    3. Setting up a proper threshold – point 1 (done by Phonexia)
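
As a rough illustration of the mean-normalization idea from step 2: scores observed on recordings from the customer’s environment can be used to estimate the environment’s score shift, which is then subtracted from raw scores before thresholding. This is only a conceptual sketch under the assumption that the environment acts as a simple additive offset; the actual Audio Source Profile preparation is done by Phonexia and is more involved.

```python
# Conceptual sketch of mean normalization (step 2 above), assuming the environment
# acts as a simple additive score offset. The real Audio Source Profile prepared by
# Phonexia is more involved; the numbers below are made up for illustration.
from statistics import mean

def estimate_score_shift(adaptation_scores: list[float]) -> float:
    """Estimate the environment baseline from scores of unrelated recording pairs."""
    return mean(adaptation_scores)   # non-target comparisons should center around 0

def normalized_score(raw_score: float, score_shift: float) -> float:
    return raw_score - score_shift

# Example: non-target scores in this environment average about +0.6, so a raw
# score of 2.4 drops below a threshold of 2 after normalization.
shift = estimate_score_shift([0.4, 0.9, 0.5, 0.7, 0.5])
print(round(normalized_score(2.4, shift), 2))   # 1.8
```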