General
This documentation is valid for Voice Verify version 1.7.6.
Introduction
This section provides information for the integrator and administrator roles of Phonexia Voice Verify. It covers the deployment, integration and maintenance of the Phonexia Voice Verify solution. It does not describe all the knowledge required to correctly implement voice biometrics in a call center.
The following terms are used in this section:
- Client – the company installing, using and integrating Phonexia Voice Verify
- Customer – the person or company utilizing voice biometrics technology for voiceprint enrollment or identity verification
Voice Verify utilizes Speaker Identification technology as a voice biometrics system.
To solve the speaker verification problem, two processes are used:
- Enrollment or Customer Registration
- Verification
During Enrollment (1), the voiceprint of a speaker is created and saved to a database. This baseline voiceprint is later used during all subsequent verifications of the same speaker. For this reason, being sure of a speaker’s identity during enrollment is crucial and needs to be verified by other means. Other parameters like utterance richness or the quality of speech used for enrollment also need to be checked.
The Verification (2) phase takes place multiple times. After the Customer has gone through Enrollment, every subsequent call can verify the Customer through the use of voice biometrics. For this, the Customer’s voice is used again to create an additional voiceprint, which is then compared to the baseline (Enrollment) voiceprint. The comparison of these two voiceprints indicates whether or not they come from one and the same speaker.
Single-server vs Multi-server comparison
Voice Verify can be deployed in two variants: single-server and multi-server.
The main differences are summarized in the following table:
Feature | Single-server | Multi-server |
---|---|---|
Number of servers/VMs needed | 1 | 10+ |
Number of concurrent calls | 1-150 | unlimited |
Vertical scaling | √ | √ |
Horizontal Scaling | ✕ | √ |
Accepts SIP calls | √ | ✕ |
Accepts HTTP streaming | √ | √ |
Accepts WebSocket streaming | √ | √ |
Batch import of voice recordings | √ | √ |
WebHook support | √ | √ |
Typically, the multi-server version is suitable for larger deployments with more than 150 calls processed in parallel.
API
Once Phonexia Voice Verify is deployed, all its functionality is accessible via API. API integration into the software used in the call center consists of two parts:
- giving instructions to Voice Verify (enroll a person, verify, create a back-up,…)
- representation of the verification result (verified, not sure, not verified)
Phonexia Voice Verify provides API functionality for several processes:
- Main functionality – voice enrollment/verification
- Supporting administrative processes – PBX connectivity, voice streaming, reports, logging, …
- Maintenance – back-ups, restore
Current offline version of API documentation can be downloaded here:
Voice Verify 1.7.6 API reference
To view it, please unzip the file and import the downloaded JSON to https://editor.swagger.io/
A running instance of Phonexia Voice Verify provides interactive API documentation via http://voiceverify.<mydomain.com>:8000/swagger/, where mydomain.com is the domain dedicated to it or assigned in the hosts file – see Networking requirements.
Access Management
Phonexia Voice Verify provides limited token-based access management (rest-auth). Only one user account exists in the system. The Client logs in to the system with the access credentials via POST /rest-auth/login/. The returned token is used in all follow-up requests.
The token has to be added to the HTTP header as "Authorization: Token <ACCESS_TOKEN>". If you are using Swagger (more information in the API reference chapter) to send your requests, click the “Authorize” button located as shown in the following picture:
Now insert the token in the same format, Token <ACCESS_TOKEN>, and confirm with the “Authorize” button.
The Client can change the password for the user account via POST /rest-auth/password/change/. The initial access credentials are delivered by Phonexia to the Client. After successful deployment of Phonexia Voice Verify, the Client must change the password. In case of a forgotten password, Phonexia can reset it via GET /maintenance/reset_password; access to the system (e.g. via VPN) is necessary.
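The login and token handling described above could be prepared along the following lines. This is only a minimal sketch: the host name, the JSON field names username and password, and the helper names are assumptions for illustration and are not taken from the API reference. The functions only build the request and the header; nothing is sent.

```python
import json
import urllib.request

BASE_URL = "http://voiceverify.example.com:8000"  # hypothetical host


def build_login_request(username: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST /rest-auth/login/ request."""
    # Assumption: credentials are sent as a JSON object with these field names.
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/rest-auth/login/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def auth_headers(token: str) -> dict:
    """Header that has to accompany every follow-up request."""
    return {"Authorization": f"Token {token}"}
```

The request object would be passed to urllib.request.urlopen, and the token taken from the response (its field name should be checked in the Swagger documentation) would then be fed to auth_headers.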
Voice input
The next step, after deploying Voice Verify and loading the Swagger API interface, is to provide a voice input. The following options are available:
- real-time streams
  - SIP calls (not yet available in the multi-server version)
  - HTTP streams
  - WebSocket streams
- offline audio file processing
  - voice recordings (for enrollment only)
Each real-time stream is internally bound to a unique uuid. Using this uuid, enrollment or verification can be called upon a specific voice stream. All voice input options can be combined.
Regardless of the voice input option used, the following net speech length requirements apply to enrollment and verification:
- Voice Verify needs at least 15 seconds of net speech for enrollment; if the audio is longer, up to 60 seconds of net speech is used
- Voice Verify needs 3 – 5 seconds of net speech for verification; longer audio is used to detect a speaker change during the whole call
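On the client side, these limits could be captured in a small helper such as the one below. It is only a sketch: the state names are invented for illustration, and treating 3 seconds as the lower bound for verification is an assumption based on the "3 – 5 seconds" range above.

```python
ENROLL_MIN_S = 15.0  # minimum net speech for enrollment (from the text)
ENROLL_MAX_S = 60.0  # net speech beyond this is not used (from the text)
VERIFY_MIN_S = 3.0   # assumption: lower end of the 3 to 5 second range


def enrollment_state(net_speech_s: float) -> str:
    """Classify how far an enrollment has progressed (hypothetical labels)."""
    if net_speech_s < ENROLL_MIN_S:
        return "need_more_speech"
    if net_speech_s < ENROLL_MAX_S:
        return "enrolled_can_improve"
    return "enrolled_saturated"


def can_verify(net_speech_s: float) -> bool:
    """True once there is enough net speech to attempt a verification."""
    return net_speech_s >= VERIFY_MIN_S
```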
SIP calls
Phonexia Voice Verify uses the SIP protocol to register to a PBX and acts as a standard SIP endpoint (softphone). Voice Verify is then able to accept SIP calls and process the incoming voice stream.
The PBX (for example Genesys, CISCO, Avaya, Asterisk,…) must be configured to provide a copy of a voice stream coming from a customer and initiate a call to Phonexia. The parameter UUID of each stream serves as an identifier used later for making enrollment and verification requests on this voice stream.
Configuration of the PBX depends on the vendor of the PBX. The Integrator of Phonexia Voice Verify is responsible for this part of configuration.
PBX Connectivity
To give Phonexia Voice Verify access to live streams, it needs to be registered to a PBX. Phonexia Voice Verify registers/connects to the PBX as a SIP endpoint. Phonexia Voice Verify keeps a list of available PBX instances in a database and can connect to or disconnect from any of them via an API request. A PBX instance entry needs to be created in the Phonexia Voice Verify database before such a PBX can be connected. A PBX instance entry is created via POST /api/v2/pbx/, retrieved via GET api/v2/pbx/{ID} or removed via DELETE api/v2/pbx/{ID}. All PBX entries can be listed via GET api/v2/pbx/.
When a PBX instance entry exists in Phonexia Voice Verify, the connection can be started via POST api/v2/pbx/{ID}/start (Voice Verify is then registered to this PBX) or closed via POST api/v2/pbx/{ID}/stop.
When the PBX is connected, Phonexia Voice Verify listens for and receives SIP calls. Once such a call is received, its binary content is redirected to the processing unit. From that point, a so-called stream is created. Phonexia Voice Verify works with an internal stream identifier, stream_uuid, which is generated by Voice Verify. After the call is connected, all following API requests related to streams work with this identifier. As a PBX has different means of call identification (callid, caller or callee), the Call Center SW can request stream details via POST api/v2/streams to obtain the stream_uuid.
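Starting or stopping the connection to an existing PBX entry could be prepared as sketched below. The host name, the integer type of the PBX ID, the empty request body and the helper name are assumptions for illustration; the request is only built, not sent.

```python
import urllib.request

BASE_URL = "http://voiceverify.example.com:8000"  # hypothetical host


def pbx_request(action: str, pbx_id: int, token: str) -> urllib.request.Request:
    """Prepare a start or stop request for an existing PBX entry."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    return urllib.request.Request(
        f"{BASE_URL}/api/v2/pbx/{pbx_id}/{action}",
        data=b"",  # assumption: no request body is needed
        headers={"Authorization": f"Token {token}"},
        method="POST",
    )
```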
Audio requirements
Inside the SIP call, audio is transmitted via the RTP protocol. For more information, see RFC 3550.
Supported RTP Payload types are:
- 0 (PCMU, Little-Endian, 8000 Hz, 1 channel)
HTTP streams
Sending voice to Voice Verify can also be done via HTTP streaming. HTTP streaming consists of three steps:
1) Opening a stream – done via the POST /api/v2/stream/HTTP endpoint.
The default sampling frequency is 8 000 Hz (a different frequency has to be specified by the frequency parameter).
Any “key”:“value” pair(s) can be added to the info field. All of these are optional. Specifically, for WebHook subscription, it is possible to include source_uuid – an identifier for the WebHook subscription, as follows:
{ "info": {"source_uuid": "<string>"}, "frequency": 8000 }
In the response, the uuid (unique ID of the stream) is returned. This uuid will later be used for sending voice and for enrollment/verification.
2) Sending data (voice) to the stream – using POST /api/v2/stream/HTTP/data/{uuid}.
Only mono-channel streaming is supported. A stream is automatically closed if no data is sent for more than 10 seconds. During streaming, enrollments/verifications can be requested.
3) Closing the stream – DELETE /api/v2/stream/HTTP/{uuid}.
The GET /api/v2/status endpoint can be used to:
- check how many HTTP streams are currently running
- check the maximum count of HTTP streams running at the same time
Audio requirements
HTTP – RAW s16le – frequency and number of channels are defined by API request
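The client side of the three HTTP streaming steps could be prepared roughly as follows. This is a sketch under assumptions: the helper names and the 500 ms chunk size are illustrative, and how the audio bytes are encoded in the data request should be checked in the API reference. The code builds the body for opening a stream and splits raw s16le mono audio into chunks.

```python
import json
from typing import Iterator, Optional

SAMPLE_RATE_HZ = 8000  # default frequency from the text
BYTES_PER_SAMPLE = 2   # s16le: 16-bit samples, one channel


def open_stream_body(source_uuid: Optional[str] = None,
                     frequency: int = SAMPLE_RATE_HZ) -> str:
    """Body for POST /api/v2/stream/HTTP; source_uuid is optional."""
    info = {"source_uuid": source_uuid} if source_uuid else {}
    return json.dumps({"info": info, "frequency": frequency})


def pcm_chunks(pcm: bytes, chunk_ms: int = 500,
               frequency: int = SAMPLE_RATE_HZ) -> Iterator[bytes]:
    """Yield consecutive slices of raw s16le mono audio, chunk_ms long each."""
    step = frequency * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]
```

Each chunk would then be posted to /api/v2/stream/HTTP/data/{uuid}, with no gap longer than 10 seconds between posts so that the stream is not closed automatically.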
WebSockets
To send a voice stream to Voice Verify via WebSocket, a connection has to be made first. All WebSocket messages are sent to ws://websocket-connector.mydomain.com in the single-server version or to wss://websocket-connector.mydomain.com in a multi-server deployment. In the multi-server scenario, WebSockets are accessible on the standard HTTPS port 443 by default. Please note that a custom port can be defined during deployment.
As the first step, a WebSocket connection has to be established. This can be achieved by sending the following WebSocket JSON request:
{ "event": "connected", "protocol": "Call", "version": "1.0.0" }
Voice Verify currently accepts only version 1.0.0.
After successful connection, the following message will appear as a response:
{ "version": "1.0.0", "event": "connected", "msg": "Connection established.", "status": 200 }
For the sake of keeping the connection alive, Voice Verify sends the following WebSocket message every 5 seconds:
{ "event": "ping", "msg": "Ping.", "status": 100 }
Now, a WebSocket stream can be started. Multiple streams can be sent through one WebSocket connection.
To start a WebSocket stream, send the following request:
{ "event": "start", "sequenceNumber": "1", "start": { "accountSid": "", "streamSid": "<streamSid>", "callSid": "", "tracks": ["inbound"], "mediaFormat": { "encoding": "audio/x-mulaw", "sampleRate": 8000, "channels": 1 }, "customParameters": { "uuid": "<uuid>", "source_uuid": "<string>", "additional_parameter_1": "value_1" } }, "streamSid": "<streamSid>" }
Where:
- sequenceNumber should always be “1”
- streamSid is a 34-character-long string [a-zA-Z0-9] used for pairing purposes
- mediaFormat contains information about the expected voice stream – these values cannot be changed
  - supports one channel audio with an 8k sample rate
- customParameters contains “key”:“value” pairs, either required or optional
  - uuid – required – Voice Verify internally recognizes the voice stream by this unique ID in UUID format; it must be given and must comply with the UUID4 standard
  - source_uuid – optional – identifier for WebHook subscriptions
  - optional – any other information can be added as a “key”:“value” pair – this information will be stored in the stream object and can be used to search for this stream
- accountSid, streamSid, and callSid parameters are optional and are there just for the sake of Twilio protocol support. Leave them blank or ensure they are 34 characters long, otherwise a Validation Scheme Exception will be raised
After successful stream creation, a similar message will appear as a response:
{ "id": "<streamSid>", "uuid": "<uuid>", "retries": 0, "event": "start", "msg": "Stream started.", "status": 201 }
The stream times out after 10 seconds of inactivity (no packets with streamSid received). The following message will be sent:
{ "id": "<streamSid>", "uuid": "<uuid>", "event": "media", "msg": "Request timeout.", "status": 408 }
Possible error messages when starting a stream:
Response:
{ "id": "<streamSid>", "uuid": "<uuid>", "event": "start", "msg": "<MSG>", "status": <STATUS_CODE> }
Error codes:
STATUS_CODE | MSG | NOTE |
---|---|---|
201 | Stream started. | |
403 | Resource is unavailable. | This could happen due to licensing issues (not enough stream slots). |
404 | Stream not found. | |
408 | Request timeout. | |
409 | Stream already exists. | Occurs when starting multiple streams with the same UUID. |
423 | Maintenance mode set. | Voice Verify is in maintenance mode. |
500 | Internal server error. | |
502 | Connection error. | |
503 | Resource is unavailable. | This could happen due to licensing issues (not enough technology slots). |
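A client could map the status codes from the table to its own reactions, for example as sketched below. The category names and the choice of which codes are worth retrying are assumptions for illustration, not behavior prescribed by Voice Verify.

```python
import json

CAPACITY = {403, 503}        # licensing / slot limits, per the notes in the table
RETRYABLE = {408, 500, 502}  # assumption: transient problems worth retrying


def classify_start_response(raw: str) -> str:
    """Turn a "start" response message into a coarse client-side outcome."""
    status = json.loads(raw)["status"]
    if status == 201:
        return "started"
    if status == 409:
        return "duplicate_uuid"
    if status in CAPACITY:
        return "no_capacity"
    if status == 423:
        return "maintenance"
    if status in RETRYABLE:
        return "retry"
    return "failed"
```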
After the stream is created successfully, the voice stream should be sent via a similar message:
{ "event": "media", "sequenceNumber": "2", "media": { "track": "inbound", "chunk": "1", "timestamp": "128", "payload": "<DATA>" }, "streamSid": "<streamSid>" }
The payload <DATA> is Base64-encoded RTP data (PCMU, Little-Endian, 8000 Hz, 1 channel).
The next voice stream message would look like this:
{ "event": "media", "sequenceNumber": "3", "media": { "track": "inbound", "chunk": "2", "timestamp": "158", "payload": "<DATA>" }, "streamSid": "<streamSid>" }
And so on. To avoid bandwidth overload, no message is sent back to the client when data is successfully transmitted.
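The numbering of consecutive media messages can be sketched as follows. The helper name is hypothetical, timestamps are passed in by the caller because their unit is not stated here, and the payload bytes are assumed to be already in the format expected by Voice Verify; the code only applies Base64 encoding and the sequence/chunk numbering shown in the examples.

```python
import base64
import json
from typing import Iterator, List


def media_messages(stream_sid: str, payloads: List[bytes],
                   timestamps: List[int]) -> Iterator[str]:
    """Yield one "media" message per payload, numbered after the "start" message."""
    for chunk_no, (payload, ts) in enumerate(zip(payloads, timestamps), start=1):
        yield json.dumps({
            "event": "media",
            "sequenceNumber": str(chunk_no + 1),  # the "start" message used "1"
            "media": {
                "track": "inbound",
                "chunk": str(chunk_no),
                "timestamp": str(ts),
                "payload": base64.b64encode(payload).decode("ascii"),
            },
            "streamSid": stream_sid,
        })
```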
To stop a stream, the following message is expected:
{ "event": "stop", "sequenceNumber": "<N>", "streamSid": "<streamSid>", "stop": { "accountSid": "", "callSid": "" } }
Where <N> corresponds to the next logically expected number (the last media message would have sequenceNumber = <N-1>). accountSid and callSid are optional and exist for Twilio compatibility only. One may leave them blank or provide 34-character-long string IDs.
Successful stream stopping results in the following response:
{ "id": "<streamSid>", "uuid": "<uuid>", "event": "stop", "msg": "Stream stopped.", "status": 204 }
Voice recordings
Voice recordings can be used for enrollment.
- you can enroll many recordings in one API request
- more than 15 seconds of speech is required in each recording
- each recording must contain only one speaker
- good audio quality is required
Audio requirements
For calibration and enrollment from a pre-existing database, recordings should be used in these formats:
- WAVE (*.wav) container including any of:
- unsigned 8-bit PCM (u8)
- signed 16-bit PCM (s16le)
- IEEE float 32-bit (f32le)
- A-law (alaw)
- μ-law (mulaw)
- ADPCM
- FLAC codec inside FLAC (*.flac) container
- OPUS codec inside OGG (*.opus) container
Other formats are converted using ffmpeg, but it cannot be guaranteed that the quality of these recordings will be sufficient.
One recording should contain only one speaker.
Enrollment/verification
Once the deployment is finished successfully and voice streams are connected, voice biometrics can be used. Both enrollment and verification actions can be performed on SIP, HTTP or WebSocket streams. In addition, enrollment can be done offline using voice recordings. Any enrollment or verification action using streams requires stream_uuid and external_id, as shown in the example below:
{ "stream_uuid": "3fa85f64-5717-4562-b3fc-2c963f66afa6", "external_id": "abc123" }
The stream_uuid can be obtained by listing all the streams via GET api/v2/streams and selecting the desired one, or it can be obtained via the POST api/v2/streams endpoint, which returns the specific stream(s) matching the given parameters such as callid, caller, callee, etc.
external_id is an arbitrary string with a maximum length of 256 characters, defined by the Client. It is a unique identifier of the voiceprint and corresponds to the identity of the enrolled/verified user. For example, if user John Doe has the unique identifier abcd123 in the CRM, then this value should also be used as the external_id.
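Before sending an enrollment or verification request, the two parameters can be checked on the client side, for example as sketched below. The helper name is hypothetical, and rejecting an empty external_id is an assumption; only the 256-character limit comes from the text.

```python
import json
import uuid

MAX_EXTERNAL_ID_LEN = 256  # limit stated in the text


def action_body(stream_uuid: str, external_id: str) -> str:
    """Build the JSON body shared by stream-based enroll and verify requests."""
    uuid.UUID(stream_uuid)  # raises ValueError if not a well-formed UUID
    if not external_id or len(external_id) > MAX_EXTERNAL_ID_LEN:
        raise ValueError("external_id must be 1 to 256 characters long")
    return json.dumps({"stream_uuid": stream_uuid, "external_id": external_id})
```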
Enrollment
It is recommended to first check whether the voiceprint exists via the POST api/v2/voiceprint endpoint. This action requires external_id. Once it is confirmed that a voiceprint with the provided external_id doesn’t exist, the enrollment procedure can be initiated. The user’s voice can be enrolled either by polling (i.e. calling the POST api/v2/enroll endpoint) or by WebHooks. A third option is to enroll the user’s voice from audio recordings via the POST api/v2/import endpoint.
Polling
The POST api/v2/enroll endpoint requires the stream_uuid and external_id parameters, as mentioned before. Once called, an attempt is made to extract the voiceprint from the ongoing stream, and the amount of net speech is returned. If the amount of net speech is below 15 seconds, a message is produced stating that Voice Verify requires more net speech to complete the creation of the voiceprint. After 15 seconds, the voiceprint is created and saved in the database. It can be improved by gathering more net speech, i.e. by calling the endpoint multiple times. Each time, the amount of net speech is returned. There is a hard limit of 60 seconds of net speech, after which the voiceprint cannot be improved and it is advised to end the enrollment process.
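The polling procedure could be wrapped in a loop like the one sketched below. The actual HTTP call is injected as enroll_once so that the loop itself stays independent of any HTTP library; the response field name speech_length is an assumption borrowed from the WebHook payload and should be checked against the API reference, as should the polling interval.

```python
import time
from typing import Callable, Dict

TARGET_S = 15.0  # the voiceprint is created at this amount of net speech
LIMIT_S = 60.0   # no further improvement beyond this


def poll_enrollment(enroll_once: Callable[[], Dict], want_s: float = TARGET_S,
                    interval_s: float = 1.0, max_calls: int = 120,
                    sleep: Callable[[float], None] = time.sleep) -> float:
    """Call enroll_once until enough net speech is collected; return the last value."""
    want_s = min(want_s, LIMIT_S)
    speech = 0.0
    for _ in range(max_calls):
        speech = float(enroll_once()["speech_length"])  # assumed field name
        if speech >= want_s:
            break
        sleep(interval_s)
    return speech
```

Passing a larger want_s (up to 60) keeps polling after the voiceprint has been created in order to improve it.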
WebHooks
Enrollment process using WebHooks is described here.
Audio Recordings
Requirements for audio recordings are described in the Voice recordings section. The POST /api/v2/import endpoint accepts only mono files converted to Base64 text in the body of the request, as shown in the example below:
{ "audiofiles": [ { "external_id": "abc123", "recording": "UklGRgynFwBXQVZFZm10IBAAAAABAAEAQB8AAIA+AAACABAATElTVBoAAABJTkZPSVNGVA4AAABMYXZmNTguMTIuMTAwAGRhdGHGphcA+/8BAPT/MgDz/x8ADADd/ . . . AQAAAAAAAAAAAP//AgD+/wMA/P8DAP7/AQAAAAAA//8BAP//AAACAP3/AwD8/wQA", "overwrite": false } ] }
Please note that Audio Quality Estimation is not yet available for voiceprints created from the audio recordings.
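An import request body of the shape shown in the example above could be assembled as sketched below. The helper name is hypothetical; the code only Base64-encodes the file contents and does not check that the audio is mono or in a supported format.

```python
import base64
import json
from typing import Dict


def import_body(recordings: Dict[str, bytes], overwrite: bool = False) -> str:
    """Map external_id -> raw file bytes into the "audiofiles" request body."""
    files = [
        {
            "external_id": external_id,
            "recording": base64.b64encode(audio).decode("ascii"),
            "overwrite": overwrite,
        }
        for external_id, audio in recordings.items()
    ]
    return json.dumps({"audiofiles": files})
```

In practice the bytes would come from reading each recording in binary mode, for example with pathlib.Path(path).read_bytes().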
Audio Quality Estimation
Audio Quality Estimation can optionally be turned on. It provides details about the quality of the audio with each created voiceprint. More info can be found here.
Voiceprint management
Voiceprint(s) can be removed via the DELETE api/v2/leave endpoint. As the voiceprints are bound to the external_id of a customer, this parameter needs to be provided. All the voiceprints can be exported via GET api/v2/snapshot/generate or imported to Phonexia Voice Verify via POST api/v2/snapshot/import. Import of snapshots allows existing voiceprints to be overwritten or skipped. The export includes information on voiceprint creation time, external_id, stream_uuid and more. It is useful for automated statistics and for reconciling the enrollment database with other Client systems.
Please note that Phonexia Voice Verify keeps neither recordings nor information on the speech content and all the audio is discarded immediately after the voiceprint is created.
Verification
Once a user is enrolled, their identity can be verified on a current stream. Phonexia Voice Verify always expects the external_id as part of the request, to know whose voice to verify against.
Verification can be done by:
- Polling (POST api/v2/verify)
- WebHooks
Verification by polling can be requested at any time, repeatedly. Especially for passive verification, the frequency of verification can be high, e.g. every half second, to detect a change of speaker.
Verification results interpretation
Inside Phonexia Voice Verify, the Speaker Identification technology compares the voice from the incoming stream with the enrolled voiceprint of the same customer every time a verification request is sent. As a result, the verification status and the verification score are provided in the API response. The status is one of:
- not_verified – the questioned voice does not match the enrolled one
- not_sure – the voices are too similar to reject the Customer, but not similar enough to be absolutely sure
- verified – the questioned voice is the same as the enrolled one
These results are based on the verification score and the desired threshold(s). To understand verification scores, please see the Speaker Identification chapter.
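In the call center software, the three statuses would typically be mapped to follow-up actions. The mapping below is only an illustrative sketch; the action names and the idea of falling back to another verification method for not_sure are assumptions, not part of Voice Verify.

```python
ACTIONS = {
    "verified": "proceed",
    "not_sure": "step_up",      # assumption: fall back to another verification method
    "not_verified": "reject",
}


def next_action(status: str) -> str:
    """Translate a verification status into a hypothetical client-side action."""
    try:
        return ACTIONS[status]
    except KeyError:
        raise ValueError(f"unexpected verification status: {status!r}") from None
```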
WebHooks
Voice Verify can send an HTTP callback (WebHook) for the following actions:
- a voice stream starts/stops (only HTTP or WebSocket streams)
- an enrollment attempt is made on a voice stream
- a verification attempt is made on a voice stream
WebHook callback for starting/stopping of a voice stream
To set up a WebHook callback for starting/stopping of a specific voice stream with source_uuid, use POST /api/v2/subscription with the following request body:
{ "channel_uuid": "<source_uuid>", "webhook": "<webhook_URL>", "type": "streams", "info": {} }
The URL of the callback must be specified in the <webhook_URL> parameter.
It is possible to attach start/stop WebHook callbacks to any HTTP and WebSocket voice stream. Start/stop WebHook callbacks cannot be used for SIP calls, as it is not possible to define source_uuid in this process.
Upon starting/stopping of the voice stream, all subscribers to the specific source_uuid are notified on the WebHook with the following message:
{ "channel_uuid": "<source_uuid>", "type": "stream", "status": <status>, "payload": { "action": "<action>", "stream_uuid": "<uuid>" } }
Where <action> can either be added or removed, <status> returns the numerical value of the HTTP response code and <uuid> is the unique internal stream identifier.
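The subscription body and the handling of the resulting callback could look roughly like this on the Client side. The function names are hypothetical; the field names are taken from the request and callback examples above.

```python
import json
from typing import Tuple


def stream_subscription_body(source_uuid: str, webhook_url: str) -> str:
    """Body for POST /api/v2/subscription for stream start/stop callbacks."""
    return json.dumps({
        "channel_uuid": source_uuid,
        "webhook": webhook_url,
        "type": "streams",
        "info": {},
    })


def parse_stream_callback(raw: str) -> Tuple[str, str]:
    """Return (action, stream_uuid) from a stream start/stop callback."""
    payload = json.loads(raw)["payload"]
    if payload["action"] not in ("added", "removed"):
        raise ValueError("unexpected action")
    return payload["action"], payload["stream_uuid"]
```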
WebHook callback for enrollment on a voice stream
To set up a WebHook callback for enrollment on a specific voice stream with uuid, use POST /api/v2/subscription with the following request body:
{ "channel_uuid": "<uuid>", "webhook": "<webhook_URL>", "type": "enroll", "info": { "external_id": "<external_id>" } }
Where <uuid> is the internal unique ID of the voice stream, the URL of the callback must be specified in the <webhook_URL> parameter and <external_id> is the unique ID of the created voiceprint.
Voice Verify will then send the enrollment WebHook callback every 1 second:
{ "channel_uuid": "<uuid>", "type": "enroll", "payload": { "stream_uuid": "<uuid>", "external_id": "<external_id>", "detail": "<detail>", "speech_length": <speech_length> }, "status": <status> }
Where <detail> can state one of the following:
- there is not enough net speech to create the enrollment yet
- enrollment created successfully
- enrollment with this external_id already exists
speech_length returns the number of seconds of net speech present in the enrollment and <status> returns the numerical value of the HTTP response code.
WebHook callback for verification on a voice stream
To set up a WebHook callback for verification on a specific voice stream with uuid, use POST /api/v2/subscription with the following request body:
{ "channel_uuid": "<uuid>", "webhook": "<webhook_URL>", "type": "verify", "info": { "external_id": "<external_id>" } }
Where <uuid> is the internal unique ID of the voice stream, the URL of the callback must be specified in the <webhook_URL> parameter and <external_id> is the unique ID of the voiceprint to verify the current voice stream against.
Voice Verify will then send the verification WebHook callback every 1 second:
{ "channel_uuid": "<uuid>", "type": "verify", "status": <status>, "payload": { "stream_uuid": "<uuid>", "external_id": "<external_id>", "result": "<verdict>", "speech_length": <speech_length>, "score": <score> } }
Where <verdict> can state one of the following:
- there is not enough net speech to make the verification yet
- voiceprint with external_id does not exist
- verified
- not verified
- not sure
speech_length returns the number of seconds of net speech used for the verification, <score> returns the verification score and <status> returns the numerical value of the HTTP response code.
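A WebHook receiver on the Client side could normalize enrollment and verification callbacks with a helper like the one sketched below. The function name and the shape of the returned dictionary are illustrative; the field names follow the callback examples above, and the exact strings carried in detail and result should be taken from the API reference.

```python
import json
from typing import Any, Dict


def summarize_callback(raw: str) -> Dict[str, Any]:
    """Flatten an "enroll" or "verify" callback into one dictionary."""
    msg = json.loads(raw)
    kind = msg["type"]
    if kind not in ("enroll", "verify"):
        raise ValueError(f"unsupported callback type: {kind!r}")
    payload = msg["payload"]
    summary = {
        "kind": kind,
        "stream_uuid": payload["stream_uuid"],
        "external_id": payload["external_id"],
        "speech_length": payload["speech_length"],
        "status": msg["status"],
    }
    if kind == "verify":
        summary["result"] = payload["result"]
        summary["score"] = payload["score"]
    else:
        summary["detail"] = payload["detail"]
    return summary
```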
WebHook callback removal
Any WebHook callback can be deleted via DELETE /api/v2/subscription.