General

This documentation is valid for Voice Verify version 1.6. For older versions please visit Older versions.

Introduction

This section provides information for the integrator and administrator roles of Phonexia Voice Verify. It covers the deployment, integration and maintenance of the Phonexia Voice Verify solution. This document does not describe all the knowledge required for the correct implementation of voice biometrics in a call center.

The following terms are used in this section:

  • Client – the company installing, using and integrating Phonexia Voice Verify
  • Customer – the person or company utilizing voice biometrics technology for voiceprint enrollment or identity verification

Voice Verify utilizes Speaker Identification technology as a voice biometrics system.

To solve the speaker verification problem, two processes are used:

  1. Enrollment or Customer Registration
  2. Verification

During Enrollment (1), the voiceprint of a speaker is created and saved to a database. This baseline voiceprint is later used during all subsequent verifications of the same speaker. For this reason, being sure of a speaker’s identity during enrollment is crucial and needs to be verified by other means. Other parameters like utterance richness or the quality of speech used for enrollment also need to be checked.

The Verification (2) phase takes place multiple times. After the Customer has gone through Enrollment, every subsequent call can verify the Customer through the use of voice biometrics. For this, the Customer’s voice is used again to create an additional voiceprint, which is then compared to the baseline (Enrollment) voiceprint. The comparison of these two voiceprints indicates whether or not they come from one and the same speaker.

Voice Verify single-server vs multi-server version comparison

Voice Verify can be deployed in two variants:

  1. Single-server
  2. Multi-server

The main differences are summarized in the following table:

Feature | Single-server | Multi-server
Number of servers/VMs needed | 1 | 10+
Number of concurrent calls | 1-150 | unlimited
Vertical scaling
Horizontal Scaling
Public domain needed*
Secured communication**
Accepts SIP calls
Accepts HTTP streaming
Accepts WebSocket streaming
Batch import of voice recordings
WebHook support

*public domain is needed only for secured communication in case no SSL certificate is owned
**secured communication is optional.

Typically, the multi-server version is suitable for larger deployments with more than 150 calls processed in parallel.

HTTPS

There are two ways of running Voice Verify on HTTPS:

  • Use customer’s wildcard certificate
  • Use an online-assigned, fully valid certificate. In this case, there are two requirements:
    • Virtual machine/server must have access to the internet
    • SSL domain must have a public DNS

In both cases, it is necessary to inform Phonexia’s DevOps team before deployment. Please note that the validity of the certificate provided by Phonexia is limited to 3 months.

API

Once Phonexia Voice Verify is deployed, all its functionality is accessible via API. API integration into the software used in the call center consists of two parts:

  1. giving instructions to Voice Verify (enroll a person, verify, create a back-up,…)
  2. representation of the verification result (verified, not sure, not verified)

Phonexia Voice Verify provides API functionality for several processes:

  • Main functionality – voice enrollment/verification
  • Support administrative process – PBX connectivity, voice streaming, reports, logging, …
  • Maintenance – back-ups, restore

Current offline version of API documentation can be downloaded here:

Voice Verify 1.6 API reference

To view it, please import the downloaded JSON file here: https://editor.swagger.io/

A running instance of Phonexia Voice Verify provides interactive API documentation via http://voiceverify.<mydomain.com>:8000/swagger/ , where mydomain.com is the domain dedicated to it or assigned in the hosts file – see Networking requirements.

The multi-server Phonexia Voice Verify API is available via standard HTTPS (either port 443 or a custom port) at https://voiceverify.mydomain.com/swagger/.

Access Management

Phonexia Voice Verify provides limited rest-auth access management based on a token. Only one user account exists in the system. The Client can log in to the system with the access credentials via POST /rest-auth/login/. The returned token is used in all follow-up queries.

The token has to be added to the HTTP header as "Authorization: Token <ACCESS_TOKEN>". If you are using Swagger (more information in the API reference chapter) to send your requests, click the “Authorize” button, insert the token in the same format, Token <ACCESS_TOKEN>, and confirm with the “Authorize” button.

The Client can change the password for the user account via POST /rest-auth/password/change/. The initial access credentials are delivered by Phonexia to the Client. After successful deployment of Phonexia Voice Verify, the Client must change the password. In case of a forgotten password, Phonexia can reset the password via GET /maintenance/reset_password; access to the system is necessary for this (e.g. VPN).
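The token handling described above can be sketched in Python. This is only an illustration: the host name is the placeholder used in this document, and the "username" and "password" body fields are assumptions, so check the Swagger documentation of your instance for the exact login body and for the name of the response field that carries the token.

```python
import json
import urllib.request

# Placeholder host from this document; adjust to your deployment.
BASE_URL = "http://voiceverify.mydomain.com:8000"


def auth_headers(token: str) -> dict:
    """Build the header in the documented 'Authorization: Token <ACCESS_TOKEN>' format."""
    return {"Authorization": f"Token {token}"}


def build_login_request(username: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST /rest-auth/login/ request."""
    # ASSUMPTION: the body field names are not given in this document.
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/rest-auth/login/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The prepared request can be sent with urllib.request.urlopen(); the token from the response is then passed to auth_headers() for every follow-up query.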

Voice input

The next step, after deploying Voice Verify and loading the Swagger API interface, is to provide a voice input. The following options are available:

  • real-time streams
    • SIP calls (not yet available in the multi-server version)
    • HTTP streams
    • WebSocket streams
  • offline audio file processing
    • voice recordings (for enrollment only)

Each real-time stream is internally bound to a unique uuid. Using this uuid, enrollment or verification can be called upon a specific voice stream. All voice input options can be combined.

SIP calls

Phonexia Voice Verify uses SIP protocol to register to a PBX and acts as a standard SIP endpoint (softphone). Voice Verify is then able to accept SIP calls and process the incoming voice stream.

The PBX (for example Genesys, Cisco, Avaya, Asterisk, …) must be configured to provide a copy of the voice stream coming from a customer and to initiate a call to Phonexia. The UUID of each stream serves as the identifier used later for making enrollment and verification requests on this voice stream.

Configuration of the PBX depends on the vendor of the PBX. The Integrator of Phonexia Voice Verify is responsible for this part of configuration.

PBX Connectivity

To give Phonexia Voice Verify access to live streams, it needs to be registered to a PBX. Phonexia Voice Verify registers as a SIP endpoint to the PBX. It keeps a list of available PBX instances in a database and can connect to or disconnect from any of them via an API request. A PBX instance entry needs to be created in the Phonexia Voice Verify database before such a PBX can be connected. A PBX instance entry is created via POST /api/v2/pbx/, retrieved via GET /api/v2/pbx/{ID} or removed via DELETE /api/v2/pbx/{ID}. All PBX entries can be listed via GET /api/v2/pbx/.

When a PBX instance entry exists in Phonexia Voice Verify, the connection can be started (that is, Voice Verify is registered to this PBX) via POST /api/v2/pbx/{ID}/start or closed via POST /api/v2/pbx/{ID}/stop.

When the PBX is connected, Phonexia Voice Verify listens for and receives SIP calls. Once such a call is received, its binary content is redirected to the processing unit, and a so-called stream is created. Phonexia Voice Verify works with an internal stream identifier, stream_uuid, which is generated by Voice Verify. After the call is connected, all following API requests related to streams work with this identifier. Because the PBX identifies calls by different means (callid, caller or callee), the call center software can request stream details via POST /api/v2/streams to obtain the stream_uuid.
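As a compact summary of the PBX endpoints named above, the following Python sketch maps each operation to its HTTP method and path. The operation labels ("create", "list", "get", and so on) are hypothetical names chosen for this example; only the methods and paths come from this section.

```python
def pbx_endpoint(action: str, pbx_id=None) -> tuple:
    """Return (HTTP method, path) for a PBX operation described in this section."""
    collection = "/api/v2/pbx/"
    # Operations on the whole collection need no ID.
    if action == "create":
        return ("POST", collection)
    if action == "list":
        return ("GET", collection)
    if pbx_id is None:
        raise ValueError(f"action {action!r} requires a PBX entry ID")
    item = f"{collection}{pbx_id}"
    per_entry = {
        "get": ("GET", item),
        "delete": ("DELETE", item),
        "start": ("POST", item + "/start"),  # register Voice Verify to the PBX
        "stop": ("POST", item + "/stop"),    # close the connection
    }
    if action not in per_entry:
        raise ValueError(f"unknown action {action!r}")
    return per_entry[action]
```

The request body for creating a PBX entry is not described here; see the Swagger documentation of your instance.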

Audio requirements

Inside the SIP call, audio is transmitted via the RTP protocol. For more information, see RFC 3550.
Supported RTP Payload types are:

  • 0 (PCMU, Little-Endian, 8000 Hz, 1 channel)
  • 8 (PCMA, Little-Endian, 8000 Hz, 1 channel)

HTTP streams

Sending voice to Voice Verify can also be done via HTTP streaming. HTTP streaming consists of three steps:

1) Opening a stream – done via the POST /api/v2/stream/HTTP endpoint.

The default sampling frequency is 8000 Hz; a different frequency has to be specified via the frequency parameter.

Any "key": "value" pair(s) can be added to the info field; all of them are optional. For WebHook subscription specifically, it is possible to include source_uuid, the identifier for the WebHook subscription, as follows:

{
  "info": {"source_uuid": "<string>"},
  "frequency": 8000
}

In the response, uuid (unique ID of the stream) is returned. This uuid will later be used for sending voice and enrollment/verification.

2) Sending data (voice) to the stream – using POST /api/v2/stream/HTTP/data/{uuid}.

Only mono-channel streaming is supported. A stream is automatically closed if no data is sent for more than 10 seconds. During streaming, enrollments/verifications can be requested.

3) Closing the stream – DELETE /api/v2/stream/HTTP/{uuid}.

Endpoint GET /api/v2/status can be used to:

  1. check how many HTTP streams are currently running
  2. check the maximum count of HTTP streams running at the same time

Audio requirements

HTTP – RAW s16le – frequency and number of channels are defined by API request
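The three HTTP streaming steps can be sketched in Python as follows. This is a minimal illustration, not a reference client: the host is the placeholder used in this document, and sending the audio as a raw request body with an application/octet-stream content type is an assumption that should be checked against the Swagger documentation.

```python
import json
import urllib.request

# Placeholder host from this document; adjust to your deployment.
BASE_URL = "http://voiceverify.mydomain.com:8000"


def split_pcm(pcm: bytes, frequency: int = 8000, seconds: float = 1.0) -> list:
    """Split raw mono s16le audio into chunks of roughly `seconds` each."""
    samples = max(1, int(frequency * seconds))
    chunk_bytes = samples * 2  # 2 bytes per 16-bit sample
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]


def _request(method, path, token, data=None, content_type=None):
    headers = {"Authorization": f"Token {token}"}
    if content_type:
        headers["Content-Type"] = content_type
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def open_stream_request(token, frequency=8000, source_uuid=None):
    """Step 1: open a stream; the response contains the stream uuid."""
    info = {"source_uuid": source_uuid} if source_uuid else {}
    body = json.dumps({"info": info, "frequency": frequency}).encode("utf-8")
    return _request("POST", "/api/v2/stream/HTTP", token, body, "application/json")


def send_data_request(token, stream_uuid, chunk):
    """Step 2: send one chunk of audio to the stream."""
    # ASSUMPTION: raw bytes in the body; the content type is not documented here.
    path = f"/api/v2/stream/HTTP/data/{stream_uuid}"
    return _request("POST", path, token, chunk, "application/octet-stream")


def close_stream_request(token, stream_uuid):
    """Step 3: close the stream."""
    return _request("DELETE", f"/api/v2/stream/HTTP/{stream_uuid}", token)
```

Each prepared request can be sent with urllib.request.urlopen(). Sending chunks more often than every 10 seconds keeps the stream from being closed automatically.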

WebSockets

To send a voice stream to Voice Verify via WebSocket, a connection has to be made first. All WebSocket messages are sent to http://websocket.mydomain.com in the case of the single-server version, or to https://websocket.mydomain.com in the case of a multi-server deployment. In the multi-server scenario, WebSockets are accessible on the standard HTTPS port 443 by default. Please note that a custom port can be defined during deployment.

As the first step, a WebSocket connection has to be established. This can be achieved by sending the following WebSocket JSON request:

{
  "event": "connected",
  "protocol": "Call",
  "version": "1.0.0"
}

Voice Verify currently accepts only version 1.0.0.
After successful connection, the following message will appear as a response:

{
  "version": "1.0.0",
  "event": "connected",
  "msg": "Connection established.",
  "status": 200
}

To keep the connection alive, Voice Verify sends the following WebSocket message every 5 seconds:

{
  "event": "ping",
  "msg": "Ping.", 
  "status": 100
}

Now, a WebSocket stream can be started. Multiple streams can be sent through one WebSocket connection.
To start a WebSocket stream, send the following request:

{
    "event": "start",
    "sequenceNumber": "1",
    "start": {
        "accountSid": "",
        "streamSid": "<streamSid>",
        "callSid": "",
        "tracks": ["inbound"],
        "mediaFormat": {
            "encoding": "audio/x-mulaw",
            "sampleRate": 8000,
            "channels": 1
        },
        "customParameters": {
            "uuid": "<uuid>",
            "source_uuid": "<string>",
            "additional_parameter_1": "value_1"
        }
    },
    "streamSid": "<streamSid>"
}

Where:

  • sequenceNumber should always be "1"
  • streamSid is a 34-character string [a-zA-Z0-9] used for pairing purposes
  • mediaFormat contains information about the expected voice stream – these values cannot be changed
    • only single-channel audio with an 8000 Hz sample rate is supported
  • customParameters contains "key": "value" pairs, either required or optional
    • uuid – required – the unique ID by which Voice Verify internally recognizes the voice stream; it must comply with the UUID4 standard
    • source_uuid – optional – identifier for WebHook subscriptions
    • optional – any other information can be added as a "key": "value" pair – this information is stored in the stream object and can be used to search for this stream
  • accountSid, streamSid, and callSid parameters are optional and are present only for Twilio protocol support. Leave them blank or ensure they are 34 characters long, otherwise a Validation Scheme Exception will be raised
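The rules above can be turned into a small Python helper that builds the start message. This is a sketch under stated assumptions: the function names are hypothetical, and the helper is stricter than the protocol in that it always requires a 34-character streamSid instead of also allowing a blank one.

```python
import secrets
import string
import uuid

_ALPHABET = string.ascii_letters + string.digits


def new_stream_sid() -> str:
    """Generate a random 34-character [a-zA-Z0-9] identifier."""
    return "".join(secrets.choice(_ALPHABET) for _ in range(34))


def build_start_message(stream_sid: str, stream_uuid: str = None,
                        source_uuid: str = None, **extra) -> dict:
    """Build the WebSocket "start" message described above."""
    if len(stream_sid) != 34 or any(c not in _ALPHABET for c in stream_sid):
        raise ValueError("streamSid must be 34 characters from [a-zA-Z0-9]")
    if stream_uuid is None:
        stream_uuid = str(uuid.uuid4())
    elif uuid.UUID(stream_uuid).version != 4:
        raise ValueError("uuid must be a UUID4")
    custom = {"uuid": stream_uuid}
    if source_uuid is not None:
        custom["source_uuid"] = source_uuid
    custom.update(extra)  # any additional "key": "value" pairs
    return {
        "event": "start",
        "sequenceNumber": "1",
        "start": {
            "accountSid": "",
            "streamSid": stream_sid,
            "callSid": "",
            "tracks": ["inbound"],
            # Fixed values: these cannot be changed.
            "mediaFormat": {"encoding": "audio/x-mulaw", "sampleRate": 8000, "channels": 1},
            "customParameters": custom,
        },
        "streamSid": stream_sid,
    }
```

The resulting dictionary can be serialized with json.dumps() and sent over the established WebSocket connection.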

After successful stream creation, a similar message will appear as a response:

{
    "id": "<streamSid>",
    "uuid": "<uuid>",
    "retries": 0,
    "event": "start",
    "msg": "Stream started.",
    "status": 201
}

The stream times out after 10 seconds of inactivity (no packets with streamSid received). The following message will be sent:

{
    "id": "<streamSid>",
    "uuid": "<uuid>",
    "event": "media",
    "msg": "Request timeout.",
    "status": 408
}

Possible error messages when starting a stream:
Response:

{
    "id": "<streamSid>",
    "uuid": "<uuid>",
    "event": "start",
    "msg": "<MSG>",
    "status":  <STATUS_CODE>
}

Error codes:

STATUS_CODE | MSG | NOTE
201 | Stream started. |
403 | Resource is unavailable. | This could happen due to licensing issues (not enough stream slots).
404 | Stream not found. |
408 | Request timeout. |
409 | Stream already exists. | Occurs when starting multiple streams with the same UUID.
423 | Maintenance mode set. | Voice Verify is in maintenance mode.
500 | Internal server error. |
502 | Connection error. |
503 | Resource is unavailable. | This could happen due to licensing issues (not enough technology slots).
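A client typically needs to sort incoming WebSocket messages into a few groups before reacting to them. The Python sketch below shows one possible grouping based on the status codes listed above and on the keep-alive ping; the category names are hypothetical and the grouping itself is only an illustrative policy, not behavior prescribed by Voice Verify.

```python
def classify_message(message: dict) -> str:
    """Map a server message to a coarse category for client-side handling."""
    status = message.get("status")
    if message.get("event") == "ping":
        return "keepalive"       # periodic ping, status 100
    if status in (200, 201, 204):
        return "ok"              # connected, stream started, stream stopped
    if status in (403, 503):
        return "capacity"        # licensing: not enough stream or technology slots
    if status == 409:
        return "conflict"        # a stream with the same UUID already exists
    if status == 423:
        return "maintenance"     # Voice Verify is in maintenance mode
    if status == 408:
        return "timeout"         # no data received for the stream in time
    return "error"               # 404, 500, 502 and anything unexpected
```

For example, a client might generate a new uuid on "conflict", and alert an operator on "capacity" or "maintenance".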

After the stream is created successfully, the voice stream should be sent via a similar message:

{
    "event": "media",
    "sequenceNumber": "2",
    "media": {
        "track": "inbound",
        "chunk": "1",
        "timestamp": "128",
        "payload": "<DATA>"
    },
    "streamSid": "<streamSid>"
}

The payload <DATA> is Base64-encoded RTP data (PCMU, Little-Endian, 8000 Hz, 1 channel).
The next voice stream message would then look like this:

{
    "event": "media",
    "sequenceNumber": "3",
    "media": {
        "track": "inbound",
        "chunk": "2",
        "timestamp": "158",
        "payload": "<DATA>"
    },
    "streamSid": "<streamSid>"
}

Subsequent messages follow the same pattern. To avoid bandwidth overload, no message is sent back to the client when data is successfully transmitted.
To stop a stream, the following message is expected:

{
    "event": "stop",
    "sequenceNumber": "<N>",
    "streamSid": "<streamSid>",
    "stop": {
        "accountSid": "",
        "callSid": ""
    }
}

Where <N> corresponds to the next logically expected number (the last media message would have sequenceNumber = <N-1>). accountSid and callSid are optional and are present for Twilio compatibility only. They can be left blank or set to 34-character string IDs.

Successful stream stopping results in the following response:

{
    "id": "<streamSid>",
    "uuid": "<uuid>",
    "event": "stop",
    "msg": "Stream stopped.",
    "status": 204
}
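The numbering of media and stop messages shown above can be sketched in Python. The helper below is an illustration with hypothetical function names; the timestamp values are passed in by the caller because their exact meaning is not specified in this document.

```python
import base64


def build_media_message(stream_sid, sequence_number, chunk, timestamp, audio: bytes) -> dict:
    """Build one "media" message with a Base64-encoded payload."""
    return {
        "event": "media",
        "sequenceNumber": str(sequence_number),
        "media": {
            "track": "inbound",
            "chunk": str(chunk),
            "timestamp": str(timestamp),
            "payload": base64.b64encode(audio).decode("ascii"),
        },
        "streamSid": stream_sid,
    }


def build_stop_message(stream_sid, sequence_number) -> dict:
    """Build the "stop" message that ends the stream."""
    return {
        "event": "stop",
        "sequenceNumber": str(sequence_number),
        "streamSid": stream_sid,
        "stop": {"accountSid": "", "callSid": ""},
    }


def build_session(stream_sid, chunks, timestamps) -> list:
    """Return all media messages followed by the stop message.

    The start message uses sequenceNumber 1, so media messages start at 2
    and the chunk counter starts at 1, as in the examples above.
    """
    messages = [
        build_media_message(stream_sid, index + 2, index + 1, timestamp, audio)
        for index, (audio, timestamp) in enumerate(zip(chunks, timestamps))
    ]
    messages.append(build_stop_message(stream_sid, len(messages) + 2))
    return messages
```

Each dictionary is serialized to JSON before it is sent over the WebSocket connection.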

Voice recordings

Voice recordings can be used for enrollment.

  • many recordings can be enrolled in one API request
  • each recording must contain more than 15 seconds of speech
  • each recording must contain only one speaker
  • the audio quality must be good

Audio requirements

For calibration and enrollment from a pre-existing database, recordings should be used in these formats:

  • WAVE (*.wav) container including any of:
    • unsigned 8-bit PCM (u8)
    • signed 16-bit PCM (s16le)
    • IEEE float 32-bit (f32le)
    • A-law (alaw)
    • μ-law (mulaw)
    • ADPCM
  • FLAC codec inside FLAC (*.flac) container
  • OPUS codec inside OGG (*.opus) container

Other formats are converted using ffmpeg, but it cannot be guaranteed that the quality of these recordings will be sufficient.

One recording should contain only one speaker.

Enrollment/verification

Once the deployment is finished successfully and voice streams are connected, voice biometrics can be used. Both enrollment and verification actions can be performed on SIP, HTTP or WebSocket streams. In addition, enrollment can be done offline using voice recordings. Any enrollment or verification action using streams requires stream_uuid and external_id, as shown in the example below:

{
  "stream_uuid": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "external_id": "abc123"
}

stream_uuid can be obtained by listing all the streams via GET /api/v2/streams and selecting the desired one, or it can be obtained via the endpoint POST /api/v2/streams, which returns the specific stream(s) that match given parameters such as callid, caller, callee, etc.

external_id is an arbitrary string with a maximum length of 256 characters, which is defined by the Client. It is a unique identifier of the voiceprint and corresponds to the identity of the enrolled/verified user. For example, if user John Doe has the unique identifier abcd123 in the CRM, then this value should also be used as external_id.

Enrollment

It is recommended to check the existence of the voiceprint via the endpoint POST /api/v2/voiceprint first. This action requires external_id. Once it is confirmed that a voiceprint with the provided external_id does not exist, the enrollment procedure can be initiated. The user’s voice can be enrolled either by polling (i.e. calling the endpoint POST /api/v2/enroll) or by WebHooks. A third option is to enroll the user’s voice from audio recordings via the endpoint POST /api/v2/import.

Polling

The endpoint POST /api/v2/enroll requires the stream_uuid and external_id parameters as mentioned before. Once called, an attempt is made to extract the voiceprint from the ongoing stream, and the amount of net speech is returned. If the amount of net speech is below 15 seconds, a message is returned stating that Voice Verify requires more net speech to complete the creation of the voiceprint. After 15 seconds, the voiceprint is created and saved in the database. It can be improved by gathering more net speech by calling the endpoint multiple times; each time, the amount of net speech is returned. There is a hard limit of 60 seconds of net speech, after which the voiceprint cannot be improved and it is advised to end the enrollment process.
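The polling procedure can be sketched as a simple loop in Python. The sketch is deliberately independent of the HTTP layer: enroll_once is a hypothetical callable supplied by the integrator that sends one POST /api/v2/enroll request and returns the reported amount of net speech in seconds, since the name of the response field is not given in this document. The polling interval and attempt limit are likewise illustrative choices.

```python
import time
from typing import Callable

MIN_NET_SPEECH = 15.0  # seconds needed before the voiceprint is created
MAX_NET_SPEECH = 60.0  # hard limit; more speech does not improve the voiceprint


def poll_enrollment(enroll_once: Callable[[], float],
                    target: float = MIN_NET_SPEECH,
                    interval: float = 1.0,
                    max_attempts: int = 120,
                    sleep: Callable[[float], None] = time.sleep) -> float:
    """Call enroll_once until `target` seconds of net speech are collected.

    Returns the last reported amount of net speech.
    """
    target = min(target, MAX_NET_SPEECH)  # never wait beyond the hard limit
    speech = 0.0
    for _ in range(max_attempts):
        speech = enroll_once()
        if speech >= target:
            break
        sleep(interval)
    return speech
```

Passing a target between 15 and 60 keeps polling after the voiceprint has been created in order to improve it with more net speech.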

WebHooks

The enrollment process using WebHooks is described in the WebHooks section.

Audio Recordings

Requirements for audio recordings are described in the Voice recordings section. The endpoint POST /api/v2/import accepts only mono files converted to Base64 text in the body of the request, as shown in the example below:

{
  "audiofiles": [
    {
      "external_id": "abc123",
      "recording": "UklGRgynFwBXQVZFZm10IBAAAAABAAEAQB8AAIA+AAACABAATElTVBoAAABJTkZPSVNGVA4AAABMYXZmNTguMTIuMTAwAGRhdGHGphcA+/8BAPT/MgDz/x8ADADd/
      .
      .
      .
      AQAAAAAAAAAAAP//AgD+/wMA/P8DAP7/AQAAAAAA//8BAP//AAACAP3/AwD8/wQA",
      "overwrite": false
    }
  ]
}

Please note that Audio Quality Estimation is not yet available for voiceprints created from the audio recordings.
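Preparing the import request body can be sketched in Python as follows. The helper names are hypothetical. The mono check uses the standard wave module, which reads PCM WAV files only, so for other supported formats (for example FLAC or OPUS) it is an assumption that the channel count is checked by other means.

```python
import base64
import io
import wave


def is_mono_wav(data: bytes) -> bool:
    """Return True if the PCM WAV file held in `data` has exactly one channel."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnchannels() == 1


def build_import_body(recordings: dict, overwrite: bool = False) -> dict:
    """Build the POST /api/v2/import body.

    `recordings` maps external_id to the raw bytes of one audio file.
    """
    return {
        "audiofiles": [
            {
                "external_id": external_id,
                "recording": base64.b64encode(data).decode("ascii"),
                "overwrite": overwrite,
            }
            for external_id, data in recordings.items()
        ]
    }
```

The file bytes can be read with open(path, "rb").read(), and the resulting dictionary is sent as the JSON body of the request.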

Audio Quality Estimation

Audio Quality Estimation can optionally be turned on. It provides details about the quality of the audio with each created voiceprint. More info can be found here.

Voiceprint management

Voiceprint(s) can be removed via the endpoint DELETE /api/v2/leave. As the voiceprints are bound to the external_id of a customer, this parameter needs to be provided. All the voiceprints can be exported via GET /api/v2/snapshot/generate or imported to Phonexia Voice Verify via POST /api/v2/snapshot/import. When importing a snapshot, existing voiceprints can be either overwritten or skipped. The export includes information on voiceprint creation time, external_id, stream_uuid and more. It is useful for automated statistics and for reconciling the enrollment database with other Client systems.

Please note that Phonexia Voice Verify keeps neither recordings nor information on the speech content and all the audio is discarded immediately after the voiceprint is created.

Verification

Once a user is enrolled, their identity can be verified on a current stream. Phonexia Voice Verify always expects the external_id as part of the request, to know whose voice to verify against.

Verification can be done by:

  1. Polling – (POST api/v2/verify)
  2. WebHooks

Verification by polling can be requested at any time, repeatedly. Especially for passive verification, the frequency of verification can be high, e.g. every half second, to detect a change of speaker.

Verification results interpretation

Inside Phonexia Voice Verify the Speaker Identification technology compares the voice from the incoming stream with the enrolled voiceprint of the same customer every time the verification request is sent to Phonexia Voice Verify. As a result, the status of verification and the verification score are provided in the API response with options:

  • not_verified – the questioned voice does not match the enrolled one
  • not_sure – the voices are too similar to reject the Customer, but not similar enough to be absolutely sure
  • verified – the questioned voice is the same as the enrolled one

These results are based on the verification score and the desired threshold(s). To understand verification scores, please see the Speaker Identification chapter.
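How the call center software reacts to the three statuses is up to the integrator. The following Python sketch builds the verification request body and maps each status to a follow-up action; the action names are hypothetical and only illustrate one possible policy.

```python
def build_verify_body(stream_uuid: str, external_id: str) -> dict:
    """Build the body for POST /api/v2/verify."""
    if len(external_id) > 256:
        raise ValueError("external_id must not exceed 256 characters")
    return {"stream_uuid": stream_uuid, "external_id": external_id}


# ASSUMPTION: illustrative policy, not behavior defined by Voice Verify.
_ACTIONS = {
    "verified": "proceed",
    "not_sure": "keep_listening",       # poll again once more speech is available
    "not_verified": "manual_identity_check",
}


def next_action(status: str) -> str:
    """Translate a verification status into a call center action."""
    if status not in _ACTIONS:
        raise ValueError(f"unexpected verification status: {status!r}")
    return _ACTIONS[status]
```

Because verification by polling can be repeated at any time, a "not_sure" result is usually a reason to ask again rather than to reject the Customer.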

WebHooks

Voice Verify can send an HTTP callback (WebHook) for the following actions:

  • a voice stream starts/stops (only HTTP or WebSocket streams)
  • an enrollment attempt is made on a voice stream
  • a verification attempt is made on a voice stream

WebHook callback for starting/stopping of a voice stream

To set up a WebHook callback for starting/stopping of a specific voice stream with source_uuid, use POST /api/v2/subscription with the following request body:

{
    "channel_uuid": "<source_uuid>",
    "webhook": "<webhook_URL>",
    "type": "streams",
    "info": {}
}

The URL of the callback must be specified in <webhook_URL> parameter.

It is possible to attach start/stop WebHook callbacks to any HTTP and WebSocket voice stream. Start/stop WebHook callback cannot be used for SIP calls, as it is not possible to define source_uuid in this process.

Upon starting/stopping of the voice stream, all subscribers to the specific source_uuid are notified on the WebHook with the following message:

{
    "channel_uuid": "<source_uuid>",
    "type": "stream",
    "status": <status>,
    "payload": {
        "action": "<action>",
        "stream_uuid": "<uuid>"
    }
}

Where <action> can either be added or removed, <status> returns a numerical value of HTTP response code and <uuid> is the unique internal stream identifier.
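The subscription request body can be built with a small Python helper. The function name is hypothetical; the same POST /api/v2/subscription endpoint is also used for the "enroll" and "verify" subscription types described in the following sections, which additionally carry an external_id in the info field.

```python
def build_subscription(channel_uuid: str, webhook_url: str,
                       sub_type: str = "streams", external_id: str = None) -> dict:
    """Build the body for POST /api/v2/subscription."""
    if sub_type not in ("streams", "enroll", "verify"):
        raise ValueError(f"unsupported subscription type: {sub_type!r}")
    if sub_type == "streams":
        info = {}  # channel_uuid is the source_uuid of the stream
    else:
        if not external_id:
            raise ValueError("enroll/verify subscriptions require external_id")
        info = {"external_id": external_id}  # channel_uuid is the stream uuid
    return {
        "channel_uuid": channel_uuid,
        "webhook": webhook_url,
        "type": sub_type,
        "info": info,
    }
```

For example, build_subscription("<source_uuid>", "<webhook_URL>") yields the start/stop subscription body shown above.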

WebHook callback for enrollment on a voice stream

To set up a WebHook callback for enrollment on a specific voice stream with uuid, use POST /api/v2/subscription with the following request body:

{
    "channel_uuid": "<uuid>",
    "webhook": "<webhook_URL>",
    "type": "enroll",
    "info": {
        "external_id": "<external_id>"
    }
}

Where <uuid> is the internal unique ID of the voice stream, the URL of the callback must be specified in the <webhook_URL> parameter and <external_id> is the unique ID of the created voiceprint.

Voice Verify will then send the enrollment WebHook callback every second:

{
    "channel_uuid": "<uuid>",
    "type": "enroll",
    "payload": {
        "stream_uuid": "<uuid>",
        "external_id": "<external_id>",
        "detail": "<detail>",
        "speech_length": <speech_length>
    },
    "status": <status>
}

Where <detail> can state one of the following:

  • there is not enough net speech to create the enrollment yet
  • enrollment created successfully
  • enrollment with this external_id already exists

speech_length returns the number of seconds of net speech present in the enrollment and <status> returns a numerical value of HTTP response code.

WebHook callback for verification on a voice stream

To set up a WebHook callback for verification on a specific voice stream with uuid, use POST /api/v2/subscription with the following request body:

{
    "channel_uuid": "<uuid>",
    "webhook": "<webhook_URL>",
    "type": "verify",
    "info": {
        "external_id": "<external_id>"
    }
}

Where <uuid> is the internal unique ID of the voice stream, the URL of the callback must be specified in the <webhook_URL> parameter and <external_id> is the unique ID of the voiceprint to verify the current voice stream against.

Voice Verify will then send the verification WebHook callback every second:

{
    "channel_uuid": "<uuid>",
    "type": "verify",
    "status": <status>,
    "payload": {
        "stream_uuid": "<uuid>",
        "external_id": "<external_id>",
        "result": "<verdict>",
        "speech_length": <speech_length>,
        "score": <score>
    }
}

Where <verdict> can state one of the following:

  • there is not enough net speech to make the verification yet
  • voiceprint with external_id does not exist
  • verified
  • not verified
  • not sure

speech_length returns the number of seconds of net speech present in the enrollment, <score> returns the verification score and <status> returns a numerical value of HTTP response code.
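On the receiving side, the WebHook endpoint has to tell the three callback types apart. The Python sketch below parses a callback body and produces a one-line summary; the function name and the summary format are hypothetical, and only the field names come from the callback examples in this section.

```python
import json


def summarize_callback(raw: bytes) -> str:
    """Parse a WebHook callback body and describe it in one line."""
    message = json.loads(raw)
    kind = message.get("type")
    payload = message.get("payload", {})
    if kind == "stream":
        return f"stream {payload.get('stream_uuid')} {payload.get('action')}"
    if kind == "enroll":
        return (f"enroll {payload.get('external_id')}: {payload.get('detail')} "
                f"({payload.get('speech_length')} s)")
    if kind == "verify":
        return (f"verify {payload.get('external_id')}: {payload.get('result')} "
                f"(score {payload.get('score')})")
    raise ValueError(f"unknown callback type: {kind!r}")
```

In a real deployment this function would be called from the HTTP handler that serves the <webhook_URL>, with the request body passed in as raw.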

WebHook callback removal

Any WebHook callback can be deleted by DELETE /api/v2/subscription.