
DAVID S2T Transcription Data Format

An example of the DAVID "S2T" data format (JSON), used by various applications for handling speech-to-text transcription metadata.

Not all transcription services will provide metadata that directly addresses all properties of this S2T format.

Some will provide semantic equivalents but in a different format or shape.

Fitting the response from a transcription service into this S2T format is a required step before the data can be imported into and used by DigaSystem applications.

In case the transcription service provides "extra" information, there is no requirement that it be included at all – the S2T format has been designed to express all aspects of a transcription that are necessary and useful for existing applications.
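
As a rough illustration of this fitting step, the sketch below maps one word from a hypothetical service response onto an S2T word object. The input field names (text, start, end, score, phonemes) and the helper to_s2t_word are assumptions made for this example and are not taken from any real service. The input times are assumed to be in seconds, whereas S2T uses integer milliseconds.

```python
def to_s2t_word(service_word):
    """Map one word of a hypothetical service response to an S2T word object.

    Assumed input shape (illustrative only, not from any real service):
    {"text": str, "start": seconds, "end": seconds, "score": 0..1, ...extras}
    """
    start_ms = round(service_word["start"] * 1000)
    end_ms = round(service_word["end"] * 1000)
    s2t_word = {
        "word": service_word["text"],
        "time": start_ms,               # offset from the start of the audio, in ms
        "duration": end_ms - start_ms,  # length of the spoken word, in ms
    }
    # confidence is optional in S2T, so copy it only if the service supplies one
    if "score" in service_word:
        s2t_word["confidence"] = service_word["score"]
    # any other ("extra") fields, such as "phonemes" here, are simply not copied
    return s2t_word


example = {"text": "Hello", "start": 0.19, "end": 0.79, "score": 0.61, "phonemes": ["h", "e"]}
converted = to_s2t_word(example)
print(converted)
```

Converting to integer milliseconds once, at the boundary, keeps every later step working in the units that S2T expects.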

If you find anything that you feel is missing or would be nice to have, please contact DAVID – we're always happy to hear from you!


Example S2T JSON

See below for descriptions of the various properties.

DAVID S2T JSON Format

JS
{
  "version": "4.0",
  "head": {
    "original_name": "audio-that-was-transcribed.WAV",
    "duration": 15,
    "language": "en",
    "created": "2018-08-20T19:35:02",
    "service": "Speechmatics"
  },
  "speakers": [
    {
      "name": "Mary"
    },
    {
      "name": "Bob"
    }
  ],
  "text": [
    {
      "speaker": "Mary",
      "words": [
        {
          "word": "Hello",
          "duration": 600,
          "confidence": 0.61,
          "time": 190
        },
        {
          "word": "radio",
          "duration": 450,
          "confidence": 0.71,
          "time": 860
        },
        ...
      ]
    },
    {
      "speaker": "Bob",
      "words": [
        {
          "word": "this",
          "duration": 240,
          "confidence": 1.0,
          "time": 10680
        },
        {
          "word": "podcast",
          "duration": 540,
          "confidence": 1.0,
          "time": 11430
        },
        ...
      ]
    }
  ]
}

S2T Properties

version

The S2T format version.

head

This property contains some general metadata, some of which is optional.

original_name

Name of the transcribed audio file

duration

Total duration of the transcribed audio, in seconds

language

Depending on the service, this is either the language supplied to the service, or the language detected by the service

created

ISO-8601 format timestamp of transcript creation, e.g., 2018-08-20T19:35:02

service

Identifier for the service that created the transcript

head

JS
{
  "original_name": string,
  "duration": number,
  "language": string,
  "created": string,
  "service": string
}
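
A minimal sketch of assembling the head object follows. The helper make_head is hypothetical. The timestamp is written without a timezone offset and without fractional seconds, to match the example value 2018-08-20T19:35:02; whether consuming applications accept other ISO-8601 variants is not stated here, so treat that as an assumption.

```python
from datetime import datetime


def make_head(original_name, duration_seconds, language, service, created=None):
    """Build an S2T head object (hypothetical helper, for illustration only)."""
    if created is None:
        created = datetime.now()
    return {
        "original_name": original_name,
        "duration": duration_seconds,  # total audio duration, in seconds
        "language": language,
        # drop microseconds so the value looks like 2018-08-20T19:35:02
        "created": created.replace(microsecond=0).isoformat(),
        "service": service,
    }


head = make_head(
    "audio-that-was-transcribed.WAV",
    15,
    "en",
    "Speechmatics",
    created=datetime(2018, 8, 20, 19, 35, 2),
)
print(head["created"])
```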

speakers

The top-level speakers property takes an array of objects, each of which describes a different speaker.

Each speaker identified in the transcription is represented as a member of the speakers array.

name

Currently, the speaker object has only one property: name.

The value of name is arbitrary, but must be unique (within speakers), and only these names can be used as values for the speaker property of "blocks" of words (i.e., members of the text property; see below).

speaker

JS
{
  "name": string
}
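
The two naming rules above (names are unique, and blocks may only refer to declared names) can be checked before import. The function check_speakers below is a hypothetical helper written for this illustration, not part of any DAVID tooling.

```python
def check_speakers(s2t):
    """Return a list of speaker-name problems found in an S2T document (a dict)."""
    problems = []
    seen = set()
    for speaker in s2t.get("speakers", []):
        name = speaker["name"]
        if name in seen:
            problems.append(f"duplicate speaker name: {name!r}")
        seen.add(name)
    # every block in "text" must refer to one of the declared names
    for index, block in enumerate(s2t.get("text", [])):
        if block.get("speaker") not in seen:
            problems.append(f"block {index} uses unknown speaker: {block.get('speaker')!r}")
    return problems


doc = {
    "speakers": [{"name": "Mary"}, {"name": "Bob"}],
    "text": [
        {"speaker": "Mary", "words": []},
        {"speaker": "Alice", "words": []},  # not declared in "speakers"
    ],
}
print(check_speakers(doc))
```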

text

The top-level text property takes an array of "blocks" of words spoken by a single speaker.

See the example below for the shape of a "block".

Members of text occur in order according to the time position of the words they describe (in the transcribed audio).

Each block can have only one speaker.

Transcription services might handle speaker identification differently (from each other), and some might not offer it at all.

Generally, the first "block" should include all words spoken by the first speaker; a second block should begin when the first speaker stops and the second speaker begins, and so on.
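
One way to produce such blocks is to start a new block at every change of speaker. The sketch below assumes the input is a list of (speaker name, S2T word object) pairs already sorted by time; that input shape and the helper build_blocks are assumptions made for this example.

```python
from itertools import groupby


def build_blocks(tagged_words):
    """Group time-ordered (speaker, word) pairs into S2T blocks.

    A new block starts whenever the speaker changes, so the same speaker
    can appear in several blocks.
    """
    blocks = []
    for speaker, group in groupby(tagged_words, key=lambda pair: pair[0]):
        blocks.append({"speaker": speaker, "words": [word for _, word in group]})
    return blocks


tagged = [
    ("Mary", {"word": "Hello", "time": 190, "duration": 600}),
    ("Mary", {"word": "radio", "time": 860, "duration": 450}),
    ("Bob", {"word": "this", "time": 10680, "duration": 240}),
    ("Mary", {"word": "yes", "time": 12100, "duration": 300}),
]
blocks = build_blocks(tagged)
# the top-level speakers array: unique names in order of first appearance
speakers = [{"name": name} for name in dict.fromkeys(name for name, _ in tagged)]
print([block["speaker"] for block in blocks])
```

Because groupby only merges consecutive items, a speaker who talks again later gets a new block rather than being merged into an earlier one, which preserves the time order of the text array.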

speaker

The "name" of the speaker of the words. Must match the name property of a member of the top-level speakers array.

words

An array of "word" objects, each of which describes a transcribed word.

Each member of this array has the following properties:

word

The transcribed text value, typically just one word (e.g., "hello", or "space") 

duration

An integer number of milliseconds representing how long the speaker takes to speak the word (i.e., the span between where the word begins and where it ends within the audio)

confidence

(optional)

A decimal number between 0.0 and 1.0 describing the confidence in the accuracy of the transcription, where 1.0 indicates 100% confidence

time

An integer number of milliseconds, relative to the beginning of the entire transcribed audio file, describing "where" the word was spoken

breaks 

(optional)

A minimum number of line breaks to insert between this word and the next word

Used by SingleTrack Editor "Speech to Text" view

Could be used in a custom application for implementing "paragraph formatting" within single "blocks" (i.e., instead of creating multiple, successive "blocks" for a single speaker)
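
As a sketch of the custom-application case, the function below renders one block as plain text and honours breaks. It is a hypothetical helper, and it treats the minimum number of line breaks as the exact number to insert; how SingleTrack Editor itself interprets the value is not described here.

```python
def render_block(block):
    """Render the words of one S2T block as text, inserting line breaks.

    Words are separated by a single space unless the preceding word
    carries a positive "breaks" value.
    """
    parts = []
    for word in block["words"]:
        parts.append(word["word"])
        breaks = word.get("breaks", 0)  # optional property, defaults to none
        parts.append("\n" * breaks if breaks > 0 else " ")
    # drop the separator that follows the final word
    return "".join(parts).rstrip()


block = {
    "speaker": "Mary",
    "words": [
        {"word": "Hello", "time": 190, "duration": 600},
        {"word": "radio", "time": 860, "duration": 450, "breaks": 2},
        {"word": "this", "time": 10680, "duration": 240},
        {"word": "podcast", "time": 11430, "duration": 540},
    ],
}
print(render_block(block))
```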

text "block"

JS
{
  "speaker": string,
  "words": [
    {
      "word": string,
      "duration": number,
      "confidence": number,
      "time": number,
      "breaks": number
    },
    ...
  ]
}
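
The constraints on word objects described above can be checked with a short routine. The sketch below uses a hypothetical helper, check_words; it verifies that time and duration are non-negative integers, that confidence (when present) lies between 0.0 and 1.0, and that word times never go backwards. Applying the ordering rule to words inside a block, and not only to the blocks themselves, is an assumption.

```python
def check_words(s2t):
    """Return a list of problems found in the word objects of an S2T document."""
    problems = []
    last_time = -1
    for b, block in enumerate(s2t["text"]):
        for w, word in enumerate(block["words"]):
            where = f"block {b}, word {w}"
            time, duration = word.get("time"), word.get("duration")
            if not isinstance(time, int) or time < 0:
                problems.append(f"{where}: time must be a non-negative integer (ms)")
                continue
            if not isinstance(duration, int) or duration < 0:
                problems.append(f"{where}: duration must be a non-negative integer (ms)")
            confidence = word.get("confidence")  # optional property
            if confidence is not None and not 0.0 <= confidence <= 1.0:
                problems.append(f"{where}: confidence must be between 0.0 and 1.0")
            if time < last_time:
                problems.append(f"{where}: time goes backwards")
            last_time = time
    return problems


bad = {
    "text": [
        {
            "speaker": "Mary",
            "words": [
                {"word": "a", "time": 500, "duration": 100, "confidence": 1.5},
                {"word": "b", "time": 200, "duration": 100},
            ],
        }
    ]
}
print(check_words(bad))
```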