[Matroska-devel] Storage of WebVTT subtitles in Matroska

Moritz Bunkus via Matroska-devel matroska-devel at lists.matroska.org
Mon Mar 14 11:24:41 CET 2016


Hey,

I'm currently looking into storing WebVTT subtitles[1] in Matroska due
to user requests and YouTube's usage of WebVTT. The format is similar
to SRT but unfortunately much more powerful and therefore not as
straightforward to store.

We have to specify how to store the following:

1. The actual CodecID and encoding
2. Codec private data (all global blocks like STYLE, REGION and NOTEs
   appearing before the first entry)
3. NOTE blocks between entries
4. tags/numbers preceding an entry
5. cue settings
6. cue timestamps in entries

Here's an example including all of the aforementioned problems:

------------------------------------------------------------
WEBVTT

STYLE
::cue {
  background-image: linear-gradient(to bottom, dimgray, lightgray);
  color: papayawhip;
}
/* Style blocks cannot use blank lines nor "dash dash greater than" */

NOTE comment blocks can be used between style blocks.

STYLE
::cue(b) {
  color: peachpuff;
}

REGION
id:bill
width:40%
lines:3
regionanchor:0%,100%
viewportanchor:10%,90%
scroll:up

NOTE
Notes always span a whole block and can cover multiple
lines. Like this one.
An empty line ends the block.

hello
00:00:00.000 --> 00:00:10.000
Example entry 1: Hello <b>world</b>.

NOTE style blocks cannot appear after the first cue.

00:00:25.000 --> 00:00:35.000
Example entry 2: Another entry.
This one has multiple lines.

4
00:00:40.000 --> 00:00:45.000
Example entry 3: Entries can be numbered (like this one)…

00:00:46.000 --> 00:00:47.000
Example entry 4: …but don't have to be.

00:01:03.000 --> 00:01:06.500 position:90% align:right size:35%
Example entry 5: That stuff to the right of the timestamps are cue settings.

00:02:02.500 --> 00:02:22.500 region:bill align:right
Example entry 6: <v Bill>Hi, I’m Bill. I'm using a region defined above.

00:03:10.000 --> 00:03:20.000 region:bill align:right
Example entry 7: Entries can even include timestamps.
For example:<00:03:15.000>This becomes visible five seconds
after the first part.
------------------------------------------------------------

Implementation details:


1. CodecID and encoding

Easy enough. I propose S_WEBVTT and using UTF-8.


2. Codec private data

WebVTT files consist of blocks. Blocks are separated by blank
lines. Each WebVTT file starts with a block consisting solely of the
line "WEBVTT".

There are several blocks that can only occur before the first
entry. Examples of those blocks are "STYLE" or "REGION".

There are other blocks that may occur both before the first entry and
before any other entry. This is mostly the comment block "NOTE".

I propose to store all blocks that occur before the first entry in
CodecPrivate without modifying them. This includes the "WEBVTT" file
magic block.
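
To illustrate, here is a minimal Python sketch of how a muxer might
separate the global blocks from the entries. The function names are
hypothetical. Detecting an entry by looking for "-->" in its first two
lines and re-joining the blocks with exactly one blank line are
assumptions of this sketch, not part of the proposal.

```python
import re


def split_blocks(text):
    """Split WebVTT text into blocks separated by one or more blank lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    blocks = re.split(r"\n(?:[ \t]*\n)+", text.strip("\n"))
    return [block for block in blocks if block.strip()]


def is_entry(block):
    """Assumption: an entry has '-->' in its first or second line."""
    return any("-->" in line for line in block.split("\n")[:2])


def split_header(text):
    """Return (codec_private, remaining_blocks).

    codec_private holds every block before the first entry, including
    the "WEBVTT" magic block. remaining_blocks holds the entries and
    the NOTE blocks found between them.
    """
    blocks = split_blocks(text)
    first_entry = next(
        (i for i, block in enumerate(blocks) if is_entry(block)),
        len(blocks),
    )
    # Assumption: blocks are re-joined with exactly one blank line.
    codec_private = "\n\n".join(blocks[:first_entry]) + "\n"
    return codec_private, blocks[first_entry:]
```

With the example file above, codec_private would run from "WEBVTT"
through the multi-line NOTE block, and the first remaining block would
be the "hello" entry.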


3. NOTE blocks between entries

Comment blocks ("NOTE …") can appear between subtitle entries (see
above between example entries 1 and 2). We have to decide whether or
not we want to keep them when muxing into Matroska. If we do, we need
to store them somehow.

I defer a proposal until I've mentioned the remaining points.


4. tags/numbers preceding an entry

Each entry always has a timestamp line, but that line may be preceded
by a line containing an entry number (example entry 3) or a free-form
tag of some kind (example entry 1). Those tags/numbers are irrelevant
for playback, but they may convey information to a person editing the
entries.

We have to decide whether or not we want to keep them when muxing into
Matroska. If we do, we need to store them somehow.

I defer a proposal until I've mentioned the remaining points.


5. cue settings

These are the things listed after the timestamps (example entries 5
and 6). They are very relevant to playback and must be included,
either out of band or in band.

I defer a proposal until I've mentioned the remaining points.
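
For reference, a small Python sketch of how the timestamp line of an
entry can be taken apart into start, end and cue settings. The names
parse_timing_line and to_ms are hypothetical. The sketch assumes
well-formed input and simply ignores settings that lack a colon.

```python
import re

# hours are optional in WebVTT timestamps; milliseconds are mandatory
TIMESTAMP = r"(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})"
TIMING_LINE = re.compile(rf"^\s*{TIMESTAMP}\s+-->\s+{TIMESTAMP}(.*)$")


def to_ms(hours, minutes, seconds, millis):
    """Convert timestamp components (strings) to milliseconds."""
    total_seconds = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def parse_timing_line(line):
    """Return (start_ms, end_ms, settings) for one timestamp line."""
    match = TIMING_LINE.match(line)
    if match is None:
        raise ValueError(f"not a WebVTT timestamp line: {line!r}")
    groups = match.groups()
    start = to_ms(*groups[0:4])
    end = to_ms(*groups[4:8])
    # cue settings are whitespace-separated "name:value" pairs
    settings = dict(
        item.split(":", 1) for item in groups[8].split() if ":" in item
    )
    return start, end, settings
```

For example entry 5 this yields a start of 63000 ms, an end of
66500 ms and the settings position=90%, align=right and size=35%.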


6. cue timestamps in entries

Cue timestamps occur within an entry (example entry 7) and tell the
player to process parts of the entry only at a certain point in
time. Cue timestamps are absolute, unfortunately, and not relative to
the start of the entry itself. Example entry 7 contains two parts: the
first is shown at 3:10, the second five seconds later at 3:15.

Storing such absolute timestamps in-band makes manipulation at the
container level (e.g. applying a delay or splitting and joining files)
very difficult.

I therefore propose that a muxer MUST change cue timestamps to be
relative to the start of the entry during muxing and that a demuxer
MUST change the cue timestamps back to be absolute during demuxing.
As several other modifications of each entry's content will be
necessary (see below), this should be OK.
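
A minimal Python sketch of that conversion follows. The function name
shift_cue_timestamps is hypothetical. The sketch always writes the
hours field, so an input that omitted the hours would not round-trip
byte for byte.

```python
import re

# matches embedded cue timestamps such as <00:03:15.000> or <03:15.000>
CUE_TIMESTAMP = re.compile(r"<(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})>")


def format_timestamp(ms):
    """Format milliseconds as hh:mm:ss.ttt."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def shift_cue_timestamps(text, offset_ms):
    """Add offset_ms to every cue timestamp embedded in text.

    A muxer would pass the negated entry start (absolute -> relative),
    a demuxer the block's start timestamp (relative -> absolute).
    """
    def replace(match):
        hours, minutes, seconds, millis = match.groups()
        value = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
        shifted = value * 1000 + int(millis) + offset_ms
        if shifted < 0:
            raise ValueError("cue timestamp lies before the entry start")
        return "<" + format_timestamp(shifted) + ">"

    return CUE_TIMESTAMP.sub(replace, text)
```

Other tags such as <b> or <v Bill> do not match the pattern and are
left untouched.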


Proposed entry storage format:

As can be seen above there are several pieces of information for each
entry that must be kept apart from the actual text to show. Matroska
currently doesn't provide block elements for these kinds of
information. The only element that comes close is CodecState. However,
CodecState is supposed to _replace_ CodecPrivate from the point of its
occurrence in the file onwards. Therefore it cannot really be used for
storing things like "NOTE" blocks between entries or the entry
tag/number lines.

I also favor keeping as much information as possible. This means
keeping the "NOTE" blocks between entries as well as keeping the
tag/number lines.

Therefore I propose to use the following storage format for entries:

1. (optional) all non-global blocks preceding an entry (the "NOTE"
   blocks); each block is followed by a blank line marking its end.

2. (optional) the tag/number line

3. (required) the timestamp line with start and end timestamps removed
   but including the cue settings; leading/trailing whitespace is
   removed; the entry's start timestamp is stored as the Matroska
   block's start timestamp; the difference between the entry's end and
   start timestamps is stored as the Matroska block's duration

4. (required) the entry's content lines with all cue timestamps being
   shifted to be relative to the Matroska block's start timestamp

A demuxer will have to re-create the entry by inserting the start/end
timestamps and by shifting the embedded cue timestamps back to their
absolute values.
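
The four parts and the demuxer's reverse step can be sketched in Python
as follows. This is only an illustration under stated assumptions: the
names mux_entry and demux_entry are hypothetical, timestamps are handled
in milliseconds, the input is assumed to be well-formed, and the
re-created timestamps always include the hours field.

```python
import re

TS = r"(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}"
TIMING = re.compile(rf"^\s*({TS})\s+-->\s+({TS})(.*)$")
INLINE = re.compile(rf"<({TS})>")


def to_ms(ts):
    parts = ts.replace(".", ":").split(":")
    millis, seconds, minutes = int(parts.pop()), int(parts.pop()), int(parts.pop())
    hours = int(parts.pop()) if parts else 0
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def to_ts(ms):
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def shift(text, offset_ms):
    """Shift all embedded cue timestamps in text by offset_ms."""
    return INLINE.sub(lambda m: f"<{to_ts(to_ms(m.group(1)) + offset_ms)}>", text)


def mux_entry(entry, preceding_notes=()):
    """Return (start_ms, duration_ms, payload) for one entry block."""
    lines = entry.split("\n")
    idx = 0 if "-->" in lines[0] else 1  # optional tag/number line comes first
    match = TIMING.match(lines[idx]) if idx < len(lines) else None
    if match is None:
        raise ValueError("entry has no timestamp line")
    start, end = to_ms(match.group(1)), to_ms(match.group(2))
    # keep "-->" and the cue settings, drop both timestamps
    timing = ("--> " + match.group(3).strip()).strip()
    content = shift("\n".join(lines[idx + 1:]), -start)
    # each preceding NOTE block is followed by a blank line
    head = "".join(note + "\n\n" for note in preceding_notes)
    return start, end - start, head + "\n".join(lines[:idx] + [timing, content])


def demux_entry(start_ms, duration_ms, payload):
    """Return (notes, entry) with absolute timestamps restored."""
    blocks = payload.split("\n\n")
    notes, lines = blocks[:-1], blocks[-1].split("\n")
    idx = 0 if lines[0].startswith("-->") else 1
    settings = lines[idx][len("-->"):]
    lines[idx] = f"{to_ts(start_ms)} --> {to_ts(start_ms + duration_ms)}{settings}"
    content = shift("\n".join(lines[idx + 1:]), start_ms)
    return notes, "\n".join(lines[:idx + 1] + [content])
```

Because the payload no longer carries the start/end timestamps, a
container-level delay only has to change the block's start timestamp;
the payload itself stays untouched.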

Rationale for removing the start/end timestamps and for shifting
embedded cue timestamps: duplicating container-level information
(start timestamp, end timestamp/duration) in-band makes any kind of
container-level transformation extremely difficult.

Here's how several example entries from above would be modified:

---start---------------------------------------------------------
hello
-->
Example entry 1: Hello <b>world</b>.
---end-----------------------------------------------------------

---start---------------------------------------------------------
NOTE style blocks cannot appear after the first cue.

-->
Example entry 2: Another entry.
This one has multiple lines.
---end-----------------------------------------------------------

---start---------------------------------------------------------
4
-->
Example entry 3: Entries can be numbered (like this one)…
---end-----------------------------------------------------------

---start---------------------------------------------------------
--> region:bill align:right
Example entry 7: Entries can even include timestamps.
For example:<00:00:05.000>This becomes visible five seconds
after the first part.
---end-----------------------------------------------------------

I'm open to all kinds of suggestions.

Kind regards,
mosu

[1]  https://w3c.github.io/webvtt/