[Matroska-devel] Clarification on CueRelativePosition

wm4 nfxjfg at googlemail.com
Fri Jul 19 14:32:13 CEST 2013

On Fri, 19 Jul 2013 09:01:27 +0200
Moritz Bunkus <moritz at bunkus.org> wrote:

> - At the beginning: read the header data, read the cues table
> - When a user wants to seek:
>   * Look up corresponding cue point
>   * Seek to absolute cluster position indicated by CueClusterPosition /
>     segment data start position
>   * Read cluster header

Now the cluster header parsing must be duplicated in the seek code. This
is not good. It leads to hard-to-maintain code.

Even worse: it requires two seeks, first to the cluster header, then to
the Block element. With media that have high latency, this can be very
inefficient. Consider HTTP or network filesystems. These are optimized
for linear reading, and seeking to the cluster header just to parse
the timecode and learn the right offset is a big waste.

It's efficient neither in terms of code size nor in terms of the
absolute performance you can achieve.
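
A back-of-the-envelope sketch of the two-seek cost follows. The round-trip
time, bandwidth and read sizes are made-up illustration values, not
measurements, and the model (one round trip per seek plus linear transfer
time) is a deliberate simplification.

```python
def seek_cost_ms(num_seeks, rtt_ms, bytes_read, bandwidth_bytes_per_s):
    """Rough cost model: one round trip per seek, plus linear transfer time."""
    transfer_ms = 1000.0 * bytes_read / bandwidth_bytes_per_s
    return num_seeks * rtt_ms + transfer_ms


# Hypothetical values, chosen only for illustration.
RTT_MS = 100.0          # round-trip latency of the medium
BANDWIDTH = 1_000_000   # bytes per second
HEADER_BYTES = 64       # cluster header plus timecode element
BLOCK_BYTES = 50_000    # the block we actually want

# Seek to the cluster header, then seek again to the block.
two_seeks = seek_cost_ms(2, RTT_MS, HEADER_BYTES + BLOCK_BYTES, BANDWIDTH)
# Seek once to a cluster that starts at the wanted frame and read linearly.
one_seek = seek_cost_ms(1, RTT_MS, BLOCK_BYTES, BANDWIDTH)

print(f"two seeks: {two_seeks:.1f} ms, one seek: {one_seek:.1f} ms")
```

In this model the extra round trip dominates as soon as latency is large
compared to the transfer time of one block.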

>   * Read first child element (almost always the cluster timecode)

Almost always? You said the spec requires it to be the timecode.

>   * Calculate wanted absolute block position with the formula above,
>     seek to it, read it, play

You also have to enter the BlockGroup parsing code somehow, which makes
for more awkward code.
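
For reference, a minimal sketch of the position calculation being
discussed. It assumes CueRelativePosition is counted from the first byte
of the cluster's data (just after the cluster's ID and size fields), and
that the cluster header bytes have already been read by the first seek.
The function names are hypothetical and not taken from any demuxer.

```python
def vint_length(first_byte: int) -> int:
    """Length in bytes of an EBML variable-length integer.

    The length is the number of leading zero bits in the first byte, plus one.
    """
    if not 0 < first_byte <= 0xFF:
        raise ValueError("invalid first byte for an EBML vint")
    return 8 - first_byte.bit_length() + 1


def absolute_block_position(segment_data_start: int,
                            cue_cluster_position: int,
                            cluster_header: bytes,
                            cue_relative_position: int) -> int:
    """Absolute file offset of the Block/BlockGroup a cue points at."""
    # CueClusterPosition is relative to the start of the segment data.
    cluster_start = segment_data_start + cue_cluster_position
    # Skip the cluster's element ID and its size field.
    id_len = vint_length(cluster_header[0])
    size_len = vint_length(cluster_header[id_len])
    cluster_data_start = cluster_start + id_len + size_len
    return cluster_data_start + cue_relative_position


# Cluster ID 0x1F43B675 (4 bytes) followed by a 2-byte size field.
header = bytes([0x1F, 0x43, 0xB6, 0x75, 0x42, 0x00])
print(absolute_block_position(100, 1000, header, 50))
```

Note that the cluster header has to be read before the final offset is
known, which is where the second seek comes from.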

Also, you forgot something. The parser code does not know how long this
child element is. For example, the BlockGroup can contain an arbitrary
number of other elements after the Block, including elements unknown to
the parser. So it's in general not clear at all when the parser should
stop reading BlockGroup elements.

Sure, you could make it so that you exit the BlockGroup parsing code as
soon as you see known Cluster-level elements. But that feels like a
hack, and duplicates yet another bit of knowledge.

> > an easier implementation would be to simply force the start of a new
> > Cluster.
> Of course you can force a new cluster before each and every key
> frame. The drawback is higher overhead.

The overhead is negligible. I tried an old 270 MB mkv file (with single
video/audio/sub tracks). Remuxing it with v6.2.0 adds about 11 KB of
overhead (I guess this is due to the new Cue elements).

Remuxing with --cluster-length 1 adds about 800 KB of overhead. This
puts at most one packet of each track into a cluster, making for
extremely short clusters. I think adding 800 KB to a 270 MB file is
VERY negligible. And that's just the absolute worst case!
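
As a quick check of the proportions, the figures above can be turned
into percentages. This sketch assumes 1 MB = 1024 KB and treats the
reported sizes as exact, although they are only approximate.

```python
def overhead_percent(overhead_kb: float, file_mb: float) -> float:
    """Overhead as a percentage of the original file size."""
    return 100.0 * overhead_kb / (file_mb * 1024)


FILE_MB = 270  # size of the test file, from the measurements above

plain_remux = overhead_percent(11, FILE_MB)      # remux with v6.2.0
tiny_clusters = overhead_percent(800, FILE_MB)   # --cluster-length 1

print(f"plain remux:        {plain_remux:.4f} %")
print(f"--cluster-length 1: {tiny_clusters:.2f} %")
```

Under these assumptions the worst case stays below a third of a percent
of the file size.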

You could mux the file so that each cluster starts with a video key
frame, and that each cluster contains exactly one video key frame. I
don't know how to produce such a file with mkvmerge, but the overhead
should be quite a bit lower than these 800 KB. So, while 800 KB were
already not the end of the world, the overhead of clustering by key
frames should be _really_ acceptable to anyone. It's probably also
faster, because you don't have to seek twice on the media for a seek.
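
A minimal sketch of what such keyframe clustering could look like on the
muxer side. The Packet type and track labels are hypothetical, this is
not how mkvmerge works internally, and a real muxer would also have to
respect cluster size and duration limits.

```python
from typing import List, NamedTuple


class Packet(NamedTuple):
    track: str       # hypothetical labels: "video", "audio", "sub"
    timecode: int    # milliseconds
    keyframe: bool


def cluster_by_video_keyframes(packets: List[Packet]) -> List[List[Packet]]:
    """Start a new cluster at every video key frame.

    Packets that arrive before the first video key frame go into an
    initial cluster of their own.
    """
    clusters: List[List[Packet]] = []
    for packet in packets:
        starts_new = packet.track == "video" and packet.keyframe
        if starts_new or not clusters:
            clusters.append([])
        clusters[-1].append(packet)
    return clusters


packets = [
    Packet("video", 0, True),
    Packet("audio", 0, True),
    Packet("video", 40, False),
    Packet("audio", 20, True),
    Packet("video", 80, True),
    Packet("audio", 80, True),
]
clusters = cluster_by_video_keyframes(packets)
print([len(c) for c in clusters])
```

With this layout a cue for a video key frame only needs the cluster
position, since the key frame is the first block in its cluster.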

And you know what's best about this? You don't have to change any
software that demuxes Matroska. Everyone can benefit from it. It keeps
the code of Matroska readers simple, making software more robust.

So that makes me really wonder why this awkward CueRelativePosition
element was added, instead of adding a keyframe-clustering mode to
mkvmerge and making it default.

I see only two arguments against this:
1. It requires you to hardcode the assumption to use video key frames
for cluster granularity, and ignore other tracks. So what, big deal?
2. Playing a video file in audio-only mode would not give optimal
seeking. OK, but clusters would still be relatively small, and thus
seeking fast.
