DAVID S2T Transcription Data Format
A description of the DAVID-internal JSON format used by various applications to handle speech-to-text (S2T) transcription metadata.
We offer a direct integration with Speechmatics through a DLL that accepts Speechmatics-format metadata directly. Some customers use other transcription services, however, so that integration is of no use to them.
Rather than building a new interface for each transcription service a customer might prefer, we document our S2T format here so that customers can perform whatever transformation is required.
Example S2T JSON
Here is an abbreviated example of a valid S2T JSON document; each ... marks words that have been omitted. See below for descriptions of the various properties.
DAVID S2T JSON Format
{
  "version": "4.0",
  "head": {
    "original_name": "audio-that-was-transcribed.WAV",
    "duration": 15,
    "language": "en",
    "created": "2018-08-20T19:35:02",
    "service": "Speechmatics"
  },
  "speakers": [
    {
      "name": "Mary"
    },
    {
      "name": "Bob"
    }
  ],
  "text": [
    {
      "speaker": "Mary",
      "words": [
        {
          "word": "Hello",
          "duration": 600,
          "confidence": 0.61,
          "time": 190
        },
        {
          "word": "radio",
          "duration": 450,
          "confidence": 0.71,
          "time": 860
        },
        ...
      ]
    },
    {
      "speaker": "Bob",
      "words": [
        {
          "word": "this",
          "duration": 240,
          "confidence": 1.0,
          "time": 10680
        },
        {
          "word": "podcast",
          "duration": 540,
          "confidence": 1.0,
          "time": 11430
        },
        ...
      ]
    }
  ]
}
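As a rough illustration of how a consumer might read this structure, the following Python sketch loads a shortened copy of the example above (with the ... entries left out) and prints one line per block. The function name render_transcript is hypothetical and not part of any DAVID API.

```python
import json

# Shortened copy of the example document; the "..." entries are left out.
S2T_JSON = """
{
  "version": "4.0",
  "head": {"original_name": "audio-that-was-transcribed.WAV", "duration": 15,
           "language": "en", "created": "2018-08-20T19:35:02",
           "service": "Speechmatics"},
  "speakers": [{"name": "Mary"}, {"name": "Bob"}],
  "text": [
    {"speaker": "Mary", "words": [
      {"word": "Hello", "duration": 600, "confidence": 0.61, "time": 190},
      {"word": "radio", "duration": 450, "confidence": 0.71, "time": 860}]},
    {"speaker": "Bob", "words": [
      {"word": "this", "duration": 240, "confidence": 1.0, "time": 10680},
      {"word": "podcast", "duration": 540, "confidence": 1.0, "time": 11430}]}
  ]
}
"""


def render_transcript(s2t: dict) -> list[str]:
    """Return one 'Speaker: words' line per block of the text array."""
    lines = []
    for block in s2t["text"]:
        spoken = " ".join(entry["word"] for entry in block["words"])
        lines.append(f"{block['speaker']}: {spoken}")
    return lines


s2t = json.loads(S2T_JSON)
transcript = render_transcript(s2t)
for line in transcript:
    print(line)
```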
version
The version of the S2T format that the document follows, given as a string (for example, 4.0).
head
This property contains some general metadata, some of which is optional.
original_name
– the name of the transcribed audio file
duration
– the total duration of the transcribed audio, in seconds
language
– an ISO language code, e.g., en for English or de for German
created
– an ISO 8601 timestamp, e.g., 2018-08-20T19:35:02
service
– identifier for the source transcription service
head
{
  "original_name": string,
  "duration": number,
  "language": string,
  "created": string,
  "service": string
}
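A minimal Python sketch of assembling a head object follows. The helper name build_head is hypothetical. The sketch fills in all five properties because this description does not say which of them are optional.

```python
from datetime import datetime


def build_head(original_name: str, duration_seconds: float, language: str,
               service: str, created: datetime) -> dict:
    """Assemble a head object; duration is given in seconds."""
    return {
        "original_name": original_name,
        "duration": duration_seconds,
        "language": language,
        # isoformat(timespec="seconds") yields e.g. 2018-08-20T19:35:02
        "created": created.isoformat(timespec="seconds"),
        "service": service,
    }


head = build_head(
    "audio-that-was-transcribed.WAV",
    15,
    "en",
    "Speechmatics",
    datetime(2018, 8, 20, 19, 35, 2),
)
print(head)
```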
speakers
The top-level speakers property takes an array of objects, each of which describes a different speaker.
Each speaker identified in the transcription is represented as a member of the speakers array. The name of a speaker is used when describing members of the text array (see below).
Currently, the speaker object has only one property, name. The name of each member of speakers must be unique within the S2T.
speakers
{
  "name": string
}
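The uniqueness rule for speaker names can be checked before a document is handed over. The following is a sketch only; duplicate_speaker_names is a hypothetical helper, not a DAVID function.

```python
from collections import Counter


def duplicate_speaker_names(speakers: list[dict]) -> list[str]:
    """Return the speaker names that occur more than once, sorted."""
    counts = Counter(speaker["name"] for speaker in speakers)
    return sorted(name for name, count in counts.items() if count > 1)


valid = [{"name": "Mary"}, {"name": "Bob"}]
invalid = [{"name": "Mary"}, {"name": "Bob"}, {"name": "Mary"}]

print(duplicate_speaker_names(valid))
print(duplicate_speaker_names(invalid))
```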
text
The top-level text property takes an array of "blocks", each a sequence of words spoken by a single speaker.
Each member of text has two properties:
- speaker – must match the name property of a member of the top-level speakers array
- words – an array of objects, each of which describes a transcribed word. Each member of this array has four properties:
  - word – the transcribed text value
  - duration – an integer number of milliseconds
  - confidence – a decimal number between 0.0 and 1.0, describing the confidence of the accuracy of the transcription, where 1.0 indicates 100% confidence
  - time – an integer number of milliseconds, relative to the beginning of the entire transcribed audio file
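The constraints on text members can be checked mechanically. The Python sketch below covers only the rules stated here (a known speaker, integer millisecond values, and confidence within 0.0 to 1.0). The function name find_text_problems and the message wording are hypothetical.

```python
def find_text_problems(s2t: dict) -> list[str]:
    """Return human-readable descriptions of rule violations in the text array."""
    known = {speaker["name"] for speaker in s2t["speakers"]}
    problems = []
    for b, block in enumerate(s2t["text"]):
        if block["speaker"] not in known:
            problems.append(f"text[{b}]: unknown speaker {block['speaker']!r}")
        for w, entry in enumerate(block["words"]):
            where = f"text[{b}].words[{w}]"
            for key in ("duration", "time"):
                value = entry[key]
                # bool is a subclass of int in Python, so exclude it explicitly.
                if not isinstance(value, int) or isinstance(value, bool):
                    problems.append(f"{where}: {key} is not an integer")
            if not 0.0 <= entry["confidence"] <= 1.0:
                problems.append(f"{where}: confidence outside 0.0..1.0")
    return problems


good = {
    "speakers": [{"name": "Mary"}],
    "text": [{"speaker": "Mary", "words": [
        {"word": "Hello", "duration": 600, "confidence": 0.61, "time": 190}]}],
}
bad = {
    "speakers": [{"name": "Mary"}],
    "text": [{"speaker": "Bob", "words": [
        {"word": "this", "duration": 0.24, "confidence": 1.5, "time": 10680}]}],
}

print(find_text_problems(good))
print(find_text_problems(bad))
```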
text
{
  "speaker": string,
  "words": [
    {
      "word": string,
      "duration": number,
      "confidence": number,
      "time": number
    },
    ...
  ]
}
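To illustrate the kind of transformation a customer might write, the Python sketch below converts a flat word list into the speakers and text properties. The input shape (speaker, text, start, end, score, with times in seconds) and its sample values are invented for illustration and do not correspond to any particular transcription service. Starting a new block whenever the speaker changes is also an assumption.

```python
from itertools import groupby

# Hypothetical output of another transcription service: a flat word list
# with start and end times in seconds. This shape is invented for illustration.
SERVICE_WORDS = [
    {"speaker": "Mary", "text": "Hello", "start": 0.19, "end": 0.79, "score": 0.61},
    {"speaker": "Mary", "text": "radio", "start": 0.86, "end": 1.31, "score": 0.71},
    {"speaker": "Bob", "text": "this", "start": 10.68, "end": 10.92, "score": 1.0},
    {"speaker": "Mary", "text": "thanks", "start": 12.0, "end": 12.5, "score": 0.9},
]


def to_ms(seconds: float) -> int:
    """Convert seconds to a whole number of milliseconds."""
    return round(seconds * 1000)


def to_s2t_body(words: list[dict]) -> dict:
    """Build the speakers and text properties from a flat word list."""
    speakers = []
    text = []
    # groupby starts a new group each time the speaker changes (assumption:
    # one block per uninterrupted run of words by the same speaker).
    for name, run in groupby(words, key=lambda w: w["speaker"]):
        if {"name": name} not in speakers:
            speakers.append({"name": name})  # keep speaker names unique
        block_words = []
        for w in run:
            start_ms = to_ms(w["start"])
            block_words.append({
                "word": w["text"],
                "duration": to_ms(w["end"]) - start_ms,
                "confidence": w["score"],
                "time": start_ms,
            })
        text.append({"speaker": name, "words": block_words})
    return {"speakers": speakers, "text": text}


body = to_s2t_body(SERVICE_WORDS)
print(body["speakers"])
print([block["speaker"] for block in body["text"]])
```

The version and head properties would still need to be added to obtain a complete S2T document.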