
DAVID S2T Transcription Data Format

An example of the DAVID "S2T" data format (JSON), used by various applications for handling speech-to-text transcription metadata.

Not all transcription services will provide metadata that directly addresses all properties of this S2T format.

Some will provide semantic equivalents but in a different format or shape.

Fitting the response from a transcription service into this S2T format is a required step before the data can be imported into and used by DigaSystem applications.

If the transcription service provides "extra" information, there is no requirement to include it – the S2T format has been designed to express all aspects of a transcription that are necessary and useful for existing applications.

If you find anything that you feel is missing or would be nice to have, please contact DAVID – we're always happy to hear from you!

Example S2T JSON

See below for descriptions of the various properties.

{
  "version": "4.0",
  "head": {
    "original_name": "audio-that-was-transcribed.WAV",
    "duration": 15,
    "language": "en",
    "created": "2018-08-20T19:35:02",
    "service": "Speechmatics"
  },
  "speakers": [
    {
      "name": "Mary"
    },
    {
      "name": "Bob"
    }
  ],
  "text": [
    {
      "speaker": "Mary",
      "words": [
        {
          "word": "Hello",
          "duration": 600,
          "confidence": 0.61,
          "time": 190
        },
        {
          "word": "radio",
          "duration": 450,
          "confidence": 0.71,
          "time": 860
        }
      ]
    },
    {
      "speaker": "Bob",
      "words": [
        {
          "word": "this",
          "duration": 240,
          "confidence": 1.0,
          "time": 10680
        },
        {
          "word": "podcast",
          "duration": 540,
          "confidence": 1.0,
          "time": 11430
        }
      ]
    }
  ]
}

S2T Properties


version

The S2T format version.


head

This property contains general metadata, some of which is optional.


original_name

Name of the transcribed audio file.


duration

Total duration of the transcribed audio, in seconds.


language

Depending on the service, this is either the language supplied to the service or the language detected by the service.


created

ISO 8601 timestamp of transcript creation, e.g., 2018-08-20T19:35:02.


service

Identifier for the service that created the transcript.


head

{
  "original_name": string,
  "duration": number,
  "language": string,
  "created": string,
  "service": string
}


speakers

The top-level speakers property takes an array of objects, each of which describes a different speaker.

Each speaker identified in the transcription is represented as a member of the speakers array.


Currently, the speaker object has only one property: name.

The value of name is arbitrary but must be unique within speakers, and only these names can be used as values for the speaker property of "blocks" of words (i.e., members of the text property; see below).


{
  "name": string
}


text

The top-level text property takes an array of "blocks", each holding words spoken by a single speaker.

See the example below for the shape of a "block".

Members of text occur in order according to the time position of the words they describe (in the transcribed audio).

Each block can have only one speaker.

Transcription services may handle speaker identification differently from one another, and some might not offer it at all.

Generally, the first "block" should include all words spoken by the first speaker up to the point where the second speaker begins; a second block should start there, and so on.
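
The block rule above can be sketched in Python as follows. The input shape is an assumption made for illustration: a flat list of words in which each item carries a "speaker" key next to the four word properties. Real transcription services will return something different, and the function name group_into_blocks is hypothetical.

```python
def group_into_blocks(flat_words):
    """Group a flat word list into S2T-style blocks, one speaker per block.

    Each input item is assumed (hypothetically) to carry a "speaker" key
    alongside "word", "duration", "confidence" and "time".
    """
    blocks = []
    for item in sorted(flat_words, key=lambda w: w["time"]):
        word = {key: value for key, value in item.items() if key != "speaker"}
        if blocks and blocks[-1]["speaker"] == item["speaker"]:
            # Same speaker is still talking: extend the current block.
            blocks[-1]["words"].append(word)
        else:
            # Speaker changed (or first word): start a new block.
            blocks.append({"speaker": item["speaker"], "words": [word]})
    return blocks


flat = [
    {"speaker": "Mary", "word": "Hello", "duration": 600, "confidence": 0.61, "time": 190},
    {"speaker": "Mary", "word": "radio", "duration": 450, "confidence": 0.71, "time": 860},
    {"speaker": "Bob", "word": "this", "duration": 240, "confidence": 1.0, "time": 10680},
    {"speaker": "Bob", "word": "podcast", "duration": 540, "confidence": 1.0, "time": 11430},
]
blocks = group_into_blocks(flat)
# Derive the top-level speakers array from the blocks, keeping first-seen order.
speakers = [{"name": name} for name in dict.fromkeys(b["speaker"] for b in blocks)]
```

If a service offers no speaker identification at all, one option is to assign every word to a single placeholder speaker, which yields one block.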


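speaker

The "name" of the speaker of the words. Must match the name property of a member of the top-level speakers array.


words

An array of "word" objects, each of which describes a transcribed word.

Each member of this array has four properties (word, duration, confidence, and time); a fifth, breaks, is also described below and does not appear in the example above.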


word

The transcribed text value, typically just one word (e.g., "hello", or "space").


duration

An integer number of milliseconds representing how long the speaker takes to speak the word (i.e., the span between where the word begins and where it ends within the audio).



confidence

A decimal number between 0.0 and 1.0 describing confidence in the accuracy of the transcription, where 1.0 indicates 100% confidence.


time

An integer number of milliseconds, relative to the beginning of the entire transcribed audio file, describing "where" the word was spoken.



breaks

A minimum number of line breaks to insert between this word and the next word.

Used by the SingleTrack Editor "Speech to Text" view.

Could be used in a custom application to implement "paragraph formatting" within single "blocks" (i.e., instead of creating multiple, successive "blocks" for a single speaker).
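
A custom application might apply breaks along these lines. This is only a sketch: render_block is a hypothetical name, a missing breaks value is assumed to mean "no line break", and the sketch inserts exactly the stated minimum, although a renderer could insert more.

```python
def render_block(block):
    """Join a block's words into text, honouring the "breaks" count if present."""
    pieces = []
    for word in block["words"]:
        pieces.append(word["word"])
        breaks = word.get("breaks", 0)  # assumption: absent means no line break
        # Use line breaks when requested, otherwise a single space.
        pieces.append("\n" * breaks if breaks > 0 else " ")
    return "".join(pieces).rstrip()


block = {
    "speaker": "Mary",
    "words": [
        {"word": "Hello", "duration": 600, "confidence": 0.61, "time": 190, "breaks": 2},
        {"word": "radio", "duration": 450, "confidence": 0.71, "time": 860},
    ],
}
text = render_block(block)
```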

text "block"

{
  "speaker": string,
  "words": [
    {
      "word": string,
      "duration": number,
      "confidence": number,
      "time": number,
      "breaks": number
    }
  ]
}