This page describes the DAVID-internal JSON format (S2T), which various applications use to handle speech-to-text transcription metadata.

We offer a direct integration with Speechmatics through a DLL that accepts Speechmatics-format metadata directly. Some customers use other transcription services, however, and the Speechmatics integration is of no use to them.

Rather than building a new interface for each transcription service our customers might prefer, we document our S2T format here so that customers can transform the output of their own service into it.

Example S2T JSON

Here is an abbreviated example of an S2T JSON document; each ... marks words that were omitted for brevity. See below for descriptions of the various properties.

DAVID S2T JSON Format

{
  "version": "4.0",
  "head": {
    "original_name": "audio-that-was-transcribed.WAV",
    "duration": 15,
    "language": "en",
    "created": "2018-08-20T19:35:02",
    "service": "Speechmatics"
  },
  "speakers": [
    {
      "name": "Mary"
    },
    {
      "name": "Bob"
    }
  ],
  "text": [
    {
      "speaker": "Mary",
      "words": [
        {
          "word": "Hello",
          "duration": 600,
          "confidence": 0.61,
          "time": 190
        },
        {
          "word": "radio",
          "duration": 450,
          "confidence": 0.71,
          "time": 860
        },
        ...
      ]
    },
    {
      "speaker": "Bob",
      "words": [
        {
          "word": "this",
          "duration": 240,
          "confidence": 1.0,
          "time": 10680
        },
        {
          "word": "podcast",
          "duration": 540,
          "confidence": 1.0,
          "time": 11430
        },
        ...
      ]
    }
  ]
}
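As a rough illustration of how the structure above can be consumed, the following Python sketch parses an S2T document and prints one line per block of the text array. The function name transcript_lines and the sample constant S2T_JSON are hypothetical and used only for this example; the sample is the document above with the elided words dropped.

```python
import json

# The example document from above, with the "..." elisions removed so that
# it parses as JSON.
S2T_JSON = """
{
  "version": "4.0",
  "head": {
    "original_name": "audio-that-was-transcribed.WAV",
    "duration": 15,
    "language": "en",
    "created": "2018-08-20T19:35:02",
    "service": "Speechmatics"
  },
  "speakers": [{"name": "Mary"}, {"name": "Bob"}],
  "text": [
    {"speaker": "Mary", "words": [
      {"word": "Hello", "duration": 600, "confidence": 0.61, "time": 190},
      {"word": "radio", "duration": 450, "confidence": 0.71, "time": 860}
    ]},
    {"speaker": "Bob", "words": [
      {"word": "this", "duration": 240, "confidence": 1.0, "time": 10680},
      {"word": "podcast", "duration": 540, "confidence": 1.0, "time": 11430}
    ]}
  ]
}
"""


def transcript_lines(s2t):
    """Return one 'Speaker: word word ...' line per block in the text array."""
    lines = []
    for block in s2t["text"]:
        words = " ".join(w["word"] for w in block["words"])
        lines.append(f'{block["speaker"]}: {words}')
    return lines


document = json.loads(S2T_JSON)
for line in transcript_lines(document):
    print(line)
```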

version

The version of the S2T format that the document follows, given as a string (e.g., "4.0").

head

This property contains some general metadata, some of which is optional.

original_name – the name of the transcribed audio file

duration – the total duration of the transcribed audio, in seconds

language – an ISO language code, e.g., the two-letter ISO 639-1 codes en for English or de for German

created – an ISO 8601 timestamp, e.g., 2018-08-20T19:35:02

service – identifier for the source transcription service

head

{
  "original_name": string,
  "duration": number,
  "language": string,
  "created": string,
  "service": string
}
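A minimal sketch of assembling a head object in Python, assuming the creation time is available as a datetime without time zone information. The helper name build_head is hypothetical. Note that duration here is in seconds, whereas the word-level values in text are in milliseconds.

```python
from datetime import datetime


def build_head(original_name, duration_seconds, language, created, service):
    """Assemble a head object from plain Python values.

    'created' is a datetime; it is serialised in the same form as the
    example timestamp above (seconds precision, no time zone suffix).
    """
    return {
        "original_name": original_name,
        "duration": duration_seconds,  # whole audio file, in seconds
        "language": language,
        "created": created.isoformat(timespec="seconds"),
        "service": service,
    }


head = build_head(
    "audio-that-was-transcribed.WAV",
    15,
    "en",
    datetime(2018, 8, 20, 19, 35, 2),
    "Speechmatics",
)
print(head["created"])
```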

speakers

The top-level speakers property takes an array of objects, each of which describes a different speaker.

Each speaker identified in the transcription is represented as a member of the speakers array. The name of a speaker is used when describing members of the text array (see below).

Currently, the speaker object has only one property, name. The name of each member of speakers must be unique within the S2T.

speakers

{
  "name": string
}
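The uniqueness rule for speaker names can be checked before a document is handed over. This is only a sketch; the function name duplicate_speaker_names is hypothetical.

```python
from collections import Counter


def duplicate_speaker_names(speakers):
    """Return the names that occur more than once in a speakers array."""
    counts = Counter(speaker["name"] for speaker in speakers)
    return sorted(name for name, count in counts.items() if count > 1)


# The speakers array from the example document has no duplicates.
print(duplicate_speaker_names([{"name": "Mary"}, {"name": "Bob"}]))
```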

text

The top-level text property takes an array of blocks, each of which is a sequence of words spoken by a single speaker.

Each member of text has two properties:

  • speaker – must match the name property of a member of the top-level speakers array
  • words – an array of objects, each of which describes a transcribed word
    • Each member of this array has four properties:
      • word – the transcribed text value
      • duration – the length of the spoken word, as an integer number of milliseconds
      • confidence – a decimal number between 0.0 and 1.0 describing confidence in the accuracy of the transcription, where 1.0 indicates 100% confidence
      • time – the start of the word, as an integer number of milliseconds relative to the beginning of the entire transcribed audio file
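The constraints listed above can be sketched as a small Python check that collects problems instead of stopping at the first one. The function name validate_text is hypothetical, and the requirement that duration and time be non-negative is an assumption of this sketch rather than something this page states.

```python
def validate_text(s2t):
    """Return a list of human-readable problems found in the text array."""
    problems = []
    known = {speaker["name"] for speaker in s2t.get("speakers", [])}
    for i, block in enumerate(s2t.get("text", [])):
        # speaker must match the name of a member of the speakers array
        if block.get("speaker") not in known:
            problems.append(f"text[{i}]: unknown speaker {block.get('speaker')!r}")
        for j, word in enumerate(block.get("words", [])):
            where = f"text[{i}].words[{j}]"
            if not isinstance(word.get("word"), str):
                problems.append(f"{where}: word must be a string")
            for key in ("duration", "time"):
                value = word.get(key)
                # bool is a subclass of int in Python, so exclude it explicitly.
                # Rejecting negative values is an assumption of this sketch.
                if not isinstance(value, int) or isinstance(value, bool) or value < 0:
                    problems.append(f"{where}: {key} must be an integer number of milliseconds")
            confidence = word.get("confidence")
            if (
                not isinstance(confidence, (int, float))
                or isinstance(confidence, bool)
                or not 0.0 <= confidence <= 1.0
            ):
                problems.append(f"{where}: confidence must be between 0.0 and 1.0")
    return problems


sample = {
    "speakers": [{"name": "Mary"}],
    "text": [
        {
            "speaker": "Mary",
            "words": [{"word": "Hello", "duration": 600, "confidence": 0.61, "time": 190}],
        }
    ],
}
print(validate_text(sample))
```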

text

{
  "speaker": string,
  "words": [
    {
      "word": string,
      "duration": number,
      "confidence": number,
      "time": number
    },
    ...
  ]
}
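To illustrate the kind of transformation a customer might write, the following Python sketch converts the output of a made-up transcription service into S2T. The input shape (segments with a speaker and words carrying text, start, end and score, with times in seconds) is entirely hypothetical, as is the function name to_s2t; a real service will need its own mapping. The version value "4.0" is taken from the example document.

```python
def to_s2t(segments, head):
    """Convert hypothetical per-speaker segments into an S2T document."""
    speaker_names = []
    text = []
    for segment in segments:
        name = segment["speaker"]
        # Each speaker appears only once in the speakers array.
        if name not in speaker_names:
            speaker_names.append(name)
        words = [
            {
                "word": w["text"],
                # The hypothetical input uses seconds; S2T uses integer milliseconds.
                "duration": round((w["end"] - w["start"]) * 1000),
                "confidence": w["score"],
                "time": round(w["start"] * 1000),
            }
            for w in segment["words"]
        ]
        text.append({"speaker": name, "words": words})
    return {
        "version": "4.0",
        "head": head,
        "speakers": [{"name": name} for name in speaker_names],
        "text": text,
    }


segments = [
    {"speaker": "Mary", "words": [
        {"text": "Hello", "start": 0.19, "end": 0.79, "score": 0.61},
        {"text": "radio", "start": 0.86, "end": 1.31, "score": 0.71},
    ]},
    {"speaker": "Bob", "words": [
        {"text": "this", "start": 10.68, "end": 10.92, "score": 1.0},
    ]},
]
head = {
    "original_name": "audio-that-was-transcribed.WAV",
    "duration": 15,
    "language": "en",
    "created": "2018-08-20T19:35:02",
    "service": "ExampleService",  # hypothetical service identifier
}
result = to_s2t(segments, head)
print(result["text"][0]["words"][0])
```

Rounding the start time and the duration independently keeps both values as integers, at the cost of a possible one-millisecond difference in the implied end time.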