
Speaker Identification: Results Enhancement

Speaker Identification (SID) Results Enhancement is a process that adjusts the score threshold for detecting/rejecting speakers by removing the effect of speech length and audio quality.
This is achieved by using Audio Source Profiles, which represent the source of the speech recording as closely as possible (device, acoustic channel, distance from microphone, language, gender, etc.). Although the out-of-the-box system is robust to such factors, several result enhancement procedures can provide even better results and stronger evidence.

Audio Source Profile

An Audio Source Profile is a representation of the speech source, e.g., device, acoustic channel, distance from microphone, language, gender, etc.
Technically, an Audio Source Profile is an entity that contains all information required for any system calibration or result normalization. The Audio Source Profile includes:

  • a set of calibration voiceprints extracted from the source files
    NOTE: these voiceprints are not compatible with standard voiceprints created using the /technologies/speakerid4/extractvp REST API endpoint or the vpextract4 command-line tool!
  • information about profile version, creation date and version of voiceprints
  • hash of all data in the profile and hash of its parent profile (if applicable)
  • parameters for all calibration types (shifts, mean vectors etc.)
  • optional user comments

Creating Audio Source Profile

An Audio Source Profile can be created either via the aspcreate4 command-line tool or via Speech Engine using the /technologies/speakerid4/audiosourceprofiles/{name} endpoint. Profiles created from the same data are identical, regardless of the interface used to create them. When creating a profile, you define its content by specifying a directory of recordings to use and the calibration modes to perform when the profile is used.

Once the profile is created, its parameters cannot be changed. Instead, you can create a new “child” profile based on a previously created profile by specifying the original as the “parent” when creating the new one. You can selectively change the parameters in the child profile (e.g., add more files, switch calibration modes on/off, change the FAR percentage, or change the user calibration shift and scale). This makes it easy to create a specific profile for any use case.
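The parent/child relationship can be pictured with a small model. The sketch below is purely illustrative: `ProfileSketch`, its fields, and `derive_child` are hypothetical names, not the real profile format or the aspcreate4/REST interface, and the hashing scheme is an assumption. It only shows the idea that a profile is immutable, and that a child copies its parent, overrides selected parameters, and records the parent's hash.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)  # frozen: parameters cannot change after creation
class ProfileSketch:
    """Hypothetical stand-in for an Audio Source Profile (illustration only)."""
    recordings: Tuple[str, ...]
    mean_normalization: bool = True
    far_calibration: bool = False
    far_percent: Optional[float] = None
    user_shift: float = 0.0
    user_scale: float = 1.0
    parent_hash: Optional[str] = None

    def content_hash(self) -> str:
        # Assumed scheme: SHA-256 over all fields, serialized deterministically.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def derive_child(parent: ProfileSketch, **changes) -> ProfileSketch:
    """Create a new profile from `parent`, overriding only the given fields."""
    return replace(parent, parent_hash=parent.content_hash(), **changes)


base = ProfileSketch(recordings=("a.wav", "b.wav"))
# The child keeps the parent's recordings but switches FAR Calibration on.
child = derive_child(base, far_calibration=True, far_percent=1.0)
```

Here `base` is left untouched, while `child.parent_hash` links back to it.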

Using Audio Source Profiles

An Audio Source Profile can be used during the comparison of speaker voiceprints. In general, both sides (enroll and test) can be calibrated either by the same profile (if they come from the same source) or by two different profiles (if they come from different sources).

Examples:

  • We are selling SID to Wakanda’s Ministry of Defense. They would like to monitor the telephone network of their happy people, who speak the Wakandan language. We have never seen this data during SID training, so it is sensible to calibrate the system.
    Since there is only a single source of data (telephony) and only a single language (Wakandan), one can assume that it is enough to create a single profile and use it for both sides of the comparison.
  • We are monitoring Czech social media and trying to find PSTN recordings of many famous Czech YouTubers.
    In this case, we have a single language (Czech), but two sources of data (PSTN and wideband YouTube) and therefore we create two profiles – one for each source of data – and use each separate profile for the corresponding side of the comparison (i.e. the PSTN profile on the side of the PSTN voiceprint, and the YouTube profile on the side of the YouTube voiceprint).

Results Enhancement Modes

There are currently 3 modes of result enhancement:

  • Mean Normalization
  • False Acceptance Calibration
  • User Calibration

These modes can be used either separately or in combination.
Possible combinations are:

  • Mean Normalization + False Acceptance Calibration
  • Mean Normalization + User Calibration

Mean Normalization

Requirements: 100+ audio recordings from different speakers representing the source data, minimum 60 seconds net speech in each. Ideally, the set shouldn’t contain duplicates or target speaker recordings.

Mean Normalization makes data coming from different domains comparable by compensating for differences in channel, language, etc. Mean Normalization is extremely lightweight and has little to no effect on the duration of a comparison. It improves the system’s discriminability, which decreases the Equal Error Rate (EER; for more information, see SID Evaluation). Because it is easy to use, Mean Normalization is highly recommended in every SID implementation.
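The idea of compensating for a source-specific offset can be shown with a toy example. This is a generic mean-centering sketch, not Phonexia's implementation: the 2-D vectors, the cosine scoring, and all function names are assumptions made for illustration. Each side is centered by the mean of calibration vectors from its own source before the two are compared.

```python
import math
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty set of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def center(vector: Sequence[float], mean: Sequence[float]) -> List[float]:
    """Subtract the source mean from a vector."""
    return [x - m for x, m in zip(vector, mean)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, used here as a stand-in scoring function."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy calibration sets: each source adds its own constant offset.
pstn_calibration = [[10.0, 10.2], [10.2, 10.0], [9.8, 9.8]]
wideband_calibration = [[-5.0, -4.8], [-4.8, -5.0], [-5.2, -5.2]]

pstn_mean = mean_vector(pstn_calibration)
wideband_mean = mean_vector(wideband_calibration)

enroll = [10.3, 9.9]  # same "speaker", recorded via the PSTN source
test = [-4.7, -5.1]   # same "speaker", recorded via the wideband source

# Without centering, the source offsets dominate and the score is negative.
raw_score = cosine(enroll, test)
# After centering each side with its own source mean, the score is close to 1.
normalized_score = cosine(center(enroll, pstn_mean), center(test, wideband_mean))
```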

False Acceptance Calibration

Requirements: The number of audio recordings is 1000 divided by the desired False Acceptance Rate. The recordings must come from different speakers representing the source data, with a minimum of 60 seconds of net speech in each. The set must not contain duplicates or target speaker recordings.

With FAR Calibration, the system is calibrated to a specific False Acceptance Rate (e.g., FAR = 1%) for each reference voiceprint (speaker model). Only one side (the enroll) is calibrated, using data representing the other side (the test). The strength of FAR Calibration is twofold. First, it ensures that the system produces only a specific amount of False Acceptances with the given data. Second, it is calibrated to individual enroll voiceprints, so it can take into account whether a person’s voice is rather average or more unique.

Getting the amount of data required for FAR Calibration may prove more difficult than for Mean Normalization, but if guaranteed FAR is the goal, this calibration is the way to achieve it. It can be – and should be – combined with Mean Normalization, since any profile ready for FAR Calibration can be used for Mean Normalization as well.
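The principle of calibrating to a target FAR per enroll voiceprint can be sketched as choosing a threshold from impostor scores. This is a generic illustration under stated assumptions, not the product's algorithm: the function names, the scores, and the rule "accept if score is strictly above the threshold" are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def far_threshold(impostor_scores: Sequence[float], far: float) -> float:
    """Threshold such that at most a fraction `far` of the impostor
    scores lie strictly above it."""
    if not impostor_scores:
        raise ValueError("need at least one impostor score")
    if not 0.0 <= far < 1.0:
        raise ValueError("far must be in [0, 1)")
    ranked = sorted(impostor_scores, reverse=True)
    allowed = math.floor(far * len(ranked))  # tolerated false acceptances
    return ranked[allowed]


def false_acceptance_rate(impostor_scores: Sequence[float], threshold: float) -> float:
    """Fraction of impostor scores that would be (wrongly) accepted."""
    accepted = sum(1 for score in impostor_scores if score > threshold)
    return accepted / len(impostor_scores)


# Hypothetical scores of two enroll voiceprints against the same
# calibration recordings (none of which contain the enrolled speaker).
calibration_scores: Dict[str, List[float]] = {
    "average_voice": [5.0, 6.0, 7.0, 8.0, 9.0, 1.0, 2.0, 3.0, 4.0, 10.0],
    "distinctive_voice": [-5.0, -4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0],
}

# One threshold per enroll voiceprint, both targeting FAR = 20%.
thresholds = {
    name: far_threshold(scores, far=0.2)
    for name, scores in calibration_scores.items()
}
```

In this toy data the "average" voice needs a higher threshold than the "distinctive" one to reach the same FAR, which is the reason for calibrating each enroll voiceprint individually.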

User Calibration

Requirements: SID Evaluation performed

User Calibration enables the user to manually add custom variables “shift” and “scale” to the profile. The final score of any comparison using this profile is then adjusted according to this formula:

adjusted_score = scale × (original_score + shift)

To be able to determine proper shift and scale factors, SID Evaluation must be performed by a trained Phonexia Voice Biometrics user (for more information, have a look at SID Evaluation).
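The formula above translates directly into code. In this minimal sketch the function name `apply_user_calibration` is hypothetical; only the formula itself comes from the text. Note that the shift is added before the scale is applied.

```python
def apply_user_calibration(original_score: float, shift: float, scale: float) -> float:
    """adjusted_score = scale * (original_score + shift)"""
    return scale * (original_score + shift)


# 0.5 * (2.0 + 1.0) = 1.5
adjusted = apply_user_calibration(2.0, shift=1.0, scale=0.5)

# shift = 0 and scale = 1 leave the score unchanged.
unchanged = apply_user_calibration(2.0, shift=0.0, scale=1.0)
```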

Audio Source Profile caching (SPE only)

In order to maximize the speed of continuous Speaker Identification with calibration, we have implemented a caching mechanism with two separate caches (which themselves are separate from the basic SPE result cache).

Audio Source Profiles cache

Persistence: no (in-memory only)
Cache Type: LRU cache (Least Recently Used cache)
Configurable: via server.audio_source_profiles_cache_size in phxspe.properties. The value represents the number of profiles that can be stored in the cache (use 0 to turn off)
Default Value: 64 profiles

Each profile in the SPE is loaded from disk into memory only once and then kept in the cache, as long as the cache capacity is not exceeded. When the cache capacity is reached, the least recently used profile is removed from the cache. When an Audio Source Profile is removed from the SPE, the profile is discarded from the cache as well. The cache is in-memory only and is therefore cleared when the SPE terminates.
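The behavior described above is that of a standard LRU cache. The following is a minimal sketch of that behavior, not the SPE's code: `ProfileCacheSketch` and `fake_loader` are hypothetical names, and a capacity of 0 is modeled as "cache turned off".

```python
from collections import OrderedDict
from typing import Any, Callable, List


class ProfileCacheSketch:
    """In-memory LRU cache keyed by profile name (illustration only)."""

    def __init__(self, capacity: int, load_from_disk: Callable[[str], Any]) -> None:
        self._capacity = capacity
        self._load = load_from_disk
        self._entries: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str) -> Any:
        if self._capacity <= 0:  # cache turned off: always load
            return self._load(name)
        if name in self._entries:
            self._entries.move_to_end(name)  # mark as most recently used
            return self._entries[name]
        profile = self._load(name)
        self._entries[name] = profile
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return profile

    def discard(self, name: str) -> None:
        """Drop a profile, e.g. when it is removed from the server."""
        self._entries.pop(name, None)

    def __contains__(self, name: str) -> bool:
        return name in self._entries


loads: List[str] = []


def fake_loader(name: str) -> dict:
    loads.append(name)  # record every simulated disk load
    return {"name": name}


cache = ProfileCacheSketch(capacity=2, load_from_disk=fake_loader)
for profile_name in ["pstn", "youtube", "pstn", "radio", "youtube"]:
    cache.get(profile_name)
```

With a capacity of 2, the second request for "pstn" is served from memory, while "youtube" has to be loaded again because "radio" evicted it.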

False Acceptance cache

Persistence: no (in-memory only)
Cache Type: LRU cache (Least Recently Used cache)
Configurable: via server.bsapi_comparator_fa_cache_size in phxspe.properties. The value represents the number of score coefficients that can be stored in the cache (use 0 to turn off)
Default Value: 100000 score coefficients

Each voiceprint comparison with a profile that has FAR Calibration enabled stores the FAR Calibration coefficients in the cache. The cache contains coefficients for combinations of a profile (identified by its hash) and a voiceprint (identified by its SHA-256 hash). If the same voiceprint is compared with the same profile against any other voiceprint(s) again, the coefficients are loaded from the cache instead of being computed again. When the capacity of the cache is reached, the least recently used coefficients are removed from the cache. The cache is in-memory only and is therefore cleared when the SPE terminates. The cache is shared between all SID4 comparators.
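The composite cache key described above can be sketched as follows. This is an assumption-laden illustration, not the SPE's implementation: `FaCacheSketch`, `fake_compute`, and the shape of the coefficients are hypothetical. Only the key structure (profile hash plus SHA-256 of the voiceprint) and the LRU eviction come from the text.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List, Tuple

Coefficients = Tuple[float, ...]
CacheKey = Tuple[str, str]


class FaCacheSketch:
    """LRU cache of FAR Calibration coefficients (illustration only)."""

    def __init__(self, capacity: int,
                 compute: Callable[[str, bytes], Coefficients]) -> None:
        self._capacity = capacity
        self._compute = compute
        self._entries: "OrderedDict[CacheKey, Coefficients]" = OrderedDict()

    def coefficients(self, profile_hash: str, voiceprint: bytes) -> Coefficients:
        # Key = (profile hash, SHA-256 of the voiceprint data).
        key = (profile_hash, hashlib.sha256(voiceprint).hexdigest())
        if self._capacity > 0 and key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        value = self._compute(profile_hash, voiceprint)
        if self._capacity > 0:
            self._entries[key] = value
            if len(self._entries) > self._capacity:
                self._entries.popitem(last=False)  # evict least recently used
        return value


computed: List[CacheKey] = []


def fake_compute(profile_hash: str, voiceprint: bytes) -> Coefficients:
    computed.append((profile_hash, voiceprint.hex()))  # count real computations
    return (0.0, 1.0)  # placeholder coefficients


fa_cache = FaCacheSketch(capacity=100000, compute=fake_compute)
fa_cache.coefficients("profile-A", b"\x01\x02")  # computed
fa_cache.coefficients("profile-A", b"\x01\x02")  # served from the cache
fa_cache.coefficients("profile-B", b"\x01\x02")  # other profile: computed again
```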