DAVID S2T Transcription Data Format
Not all transcription services will provide metadata that directly addresses all properties of this S2T format.
Some will provide semantic equivalents but in a different format or shape.
Fitting the response from a transcription service into this S2T format is required before the data can be imported into and used by DigaSystem applications.
In case the transcription service provides "extra" information, there is no requirement that it be included at all – the S2T format has been designed to express all aspects of a transcription that are necessary and useful for existing applications.
If you find anything that you feel is missing or would be nice to have, please contact DAVID – we're always happy to hear from you!
Example S2T JSON
See below for descriptions of the various properties.
DAVID S2T JSON Format
{
  "version": "4.0",
  "head": {
    "original_name": "audio-that-was-transcribed.WAV",
    "duration": 15,
    "language": "en",
    "created": "2018-08-20T19:35:02",
    "service": "Speechmatics"
  },
  "speakers": [
    {
      "name": "Mary"
    },
    {
      "name": "Bob"
    }
  ],
  "text": [
    {
      "speaker": "Mary",
      "words": [
        {
          "word": "Hello",
          "duration": 600,
          "confidence": 0.61,
          "time": 190
        },
        {
          "word": "radio",
          "duration": 450,
          "confidence": 0.71,
          "time": 860
        },
        ...
      ]
    },
    {
      "speaker": "Bob",
      "words": [
        {
          "word": "this",
          "duration": 240,
          "confidence": 1.0,
          "time": 10680
        },
        {
          "word": "podcast",
          "duration": 540,
          "confidence": 1.0,
          "time": 11430
        },
        ...
      ]
    }
  ]
}
S2T Properties
version
The S2T format version.
head
This property contains general metadata, some of which is optional.
original_name
Name of the transcribed audio file
duration
Total duration of the transcribed audio, in seconds
language
Depending on the service, this is either the language supplied to the service, or the language detected by the service
created
ISO-8601 format timestamp of transcript creation, e.g., 2018-08-20T19:35:02
service
Identifier for the service that created the transcript
head
{
  "original_name": string,
  "duration": number,
  "language": string,
  "created": string,
  "service": string
}
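As a minimal sketch of assembling the head object, the Python below builds one from plain values and formats created as an ISO-8601 timestamp with seconds precision, matching the example above. The function name make_head and its parameter names are hypothetical and not part of any DAVID API. Because this document does not state which head properties are optional, the sketch simply fills in all five.

```python
from datetime import datetime


def make_head(original_name, duration_seconds, language, service, created=None):
    """Build an S2T "head" object (hypothetical helper, not a DAVID API).

    `created` defaults to the current local time when it is not supplied.
    """
    if created is None:
        created = datetime.now()
    return {
        "original_name": original_name,
        "duration": duration_seconds,  # total audio duration, in seconds
        "language": language,
        # Drop microseconds so the value looks like 2018-08-20T19:35:02.
        "created": created.replace(microsecond=0).isoformat(),
        "service": service,
    }


head = make_head(
    "audio-that-was-transcribed.WAV",
    15,
    "en",
    "Speechmatics",
    created=datetime(2018, 8, 20, 19, 35, 2),
)
```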
speakers
The top-level speakers property takes an array of objects, each of which describes a different speaker.
Each speaker identified in the transcription is represented as a member of the speakers array.
name
Currently, the speaker object has only one property: name.
The value of name is arbitrary, but must be unique (within speakers), and only these names can be used as values for the speaker property of "blocks" of words (i.e., members of the text property; see below).
speaker
{
  "name": string
}
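The two rules above (names are unique within speakers, and every block's speaker must be one of those names) can be checked before import. The following is only a sketch: the function name check_speakers is hypothetical, and the speaker "Alice" in the sample data is made up to trigger a failure.

```python
def check_speakers(s2t):
    """Return a list of problems with speaker names (hypothetical helper)."""
    problems = []
    seen = set()
    for speaker in s2t.get("speakers", []):
        name = speaker["name"]
        if name in seen:
            problems.append(f"duplicate speaker name: {name!r}")
        seen.add(name)
    # Every block must refer to a name declared in the speakers array.
    for index, block in enumerate(s2t.get("text", [])):
        if block.get("speaker") not in seen:
            problems.append(
                f"block {index} uses unknown speaker: {block.get('speaker')!r}"
            )
    return problems


sample = {
    "speakers": [{"name": "Mary"}, {"name": "Bob"}],
    "text": [
        {"speaker": "Mary", "words": []},
        {"speaker": "Alice", "words": []},  # made-up name, not declared above
    ],
}
problems = check_speakers(sample)
```

An empty list means both rules hold for the document that was checked.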
text
The top-level text property takes an array of "blocks" of words spoken by a single speaker.
See the example below for the shape of a "block".
Members of text occur in order according to the time position of the words they describe (in the transcribed audio).
Each block can have only one speaker.
Transcription services might handle speaker identification differently (from each other), and some might not offer it at all.
Generally, the first "block" should include all consecutive words spoken by the first speaker; a second block should begin when the first speaker stops and the second speaker begins, and so on.
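As a rough sketch of this grouping rule, the Python below takes a flat list of (speaker, word) pairs and starts a new block each time the speaker changes. The input shape and the function name group_into_blocks are assumptions made for illustration, since each transcription service reports speakers in its own way. The final "thanks" word is made up to show a speaker returning.

```python
from itertools import groupby


def group_into_blocks(tagged_words):
    """Group (speaker, word) pairs into S2T text blocks (hypothetical helper).

    Words are ordered by their "time" value first, then each run of
    consecutive words from the same speaker becomes one block.
    """
    ordered = sorted(tagged_words, key=lambda pair: pair[1]["time"])
    blocks = []
    for speaker, run in groupby(ordered, key=lambda pair: pair[0]):
        blocks.append({"speaker": speaker, "words": [word for _, word in run]})
    return blocks


tagged = [
    ("Mary", {"word": "Hello", "duration": 600, "time": 190}),
    ("Mary", {"word": "radio", "duration": 450, "time": 860}),
    ("Bob", {"word": "this", "duration": 240, "time": 10680}),
    ("Bob", {"word": "podcast", "duration": 540, "time": 11430}),
    ("Mary", {"word": "thanks", "duration": 300, "time": 12100}),  # made up
]
blocks = group_into_blocks(tagged)
```

With this input, Mary's later word ends up in a third block rather than being merged into her first one, which keeps the blocks in time order.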
speaker
The "name" of the speaker of the words. Must match the name property of a member of the top-level speakers array.
words
An array of "word" objects, each of which describes a transcribed word.
Each member of this array has the following properties, of which confidence and breaks are optional:
word
The transcribed text value, typically just one word (e.g., "hello", or "space")
duration
An integer number of milliseconds representing how long the speaker takes to speak the word (i.e., the time between where the word begins and where it ends within the audio)
confidence
(optional)
A decimal number between 0.0 and 1.0, describing the confidence in the accuracy of the transcription, where 1.0 indicates 100% confidence
time
An integer number of milliseconds, relative to the beginning of the entire transcribed audio file, describing "where" the word was spoken
breaks
(optional)
A minimum number of line breaks to insert between this word and the next word
Used by SingleTrack Editor "Speech to Text" view
Could be used in a custom application for implementing "paragraph formatting" within single "blocks" (i.e., instead of creating multiple, successive "blocks" for a single speaker)
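To illustrate how a custom application might use breaks for paragraph formatting, the sketch below joins the words of one block into plain text. It is not how SingleTrack Editor renders text. The function name render_block is hypothetical, and inserting exactly the stated number of line breaks (rather than treating it only as a minimum) is an assumption of this sketch.

```python
def render_block(block):
    """Join a block's words into text, honoring "breaks" (hypothetical helper)."""
    parts = []
    words = block["words"]
    for index, item in enumerate(words):
        parts.append(item["word"])
        if index == len(words) - 1:
            break  # nothing follows the last word
        breaks = item.get("breaks", 0)  # "breaks" is optional
        # Use line breaks when requested, otherwise a single space.
        parts.append("\n" * breaks if breaks > 0 else " ")
    return "".join(parts)


block = {
    "speaker": "Mary",
    "words": [
        {"word": "Hello", "duration": 600, "time": 190},
        {"word": "radio", "duration": 450, "time": 860, "breaks": 2},
        {"word": "this", "duration": 240, "time": 10680},
    ],
}
text = render_block(block)
```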
text "block"
{
  "speaker": string,
  "words": [
    {
      "word": string,
      "duration": number,
      "confidence": number,
      "time": number,
      "breaks": number
    },
    ...
  ]
}
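Fitting a service response into this shape is often mostly a matter of renaming fields and converting units. The sketch below assumes a hypothetical service that reports each word with text, start, and end (in seconds) and an optional score. None of these field names come from a real service, and the function name to_s2t_word is likewise made up for illustration.

```python
def to_s2t_word(item):
    """Convert one hypothetical service word into an S2T "word" object."""
    # S2T expects integer milliseconds for both "time" and "duration".
    start_ms = round(item["start"] * 1000)
    end_ms = round(item["end"] * 1000)
    word = {
        "word": item["text"],
        "duration": end_ms - start_ms,
        "time": start_ms,
    }
    # "confidence" is optional, so it is only set when the service has a score.
    if "score" in item:
        word["confidence"] = min(1.0, max(0.0, float(item["score"])))
    return word


converted = to_s2t_word({"text": "Hello", "start": 0.19, "end": 0.79, "score": 0.61})
```

Clamping the score to the range 0.0 to 1.0 is a defensive choice in this sketch; a service that reports confidence on another scale would need its own conversion.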