[Matroska-devel] Opus in Matroska Cont.

Moritz Bunkus moritz at bunkus.org
Tue Mar 26 14:45:21 CET 2013


Hey,

On Fri, Mar 22, 2013 at 10:24 PM, Frank Galligan
<frankgalligan at gmail.com> wrote:

> I want to continue the discussion on adding Opus to Matroska. The current
> Draft [1].

That's great. Because back when I worked on this I noticed that it's
all kind of a huge mess -- meaning a lot of stuff to implement and
spec out involving several unpopular changes to demuxers (and also to
muxers), making all of this rather complicated. I therefore decided to
stop working on it until someone else showed any interest in it (apart
from "hey when does Matroska support Opus?" -- meaning interest in
doing actual work on it).

> 1. Pre-roll (and muxed files). I think we should add a new element in the
> TrackHeader, SeekPreRoll, which uses the same units as the Cluster
> timecode.


I'm generally fine with your proposal and explanation. One detail I'd
like to change is the resolution for this new element. Elements in a
cluster are scaled with TimecodeScale for a very specific reason: to
save space by allowing the use of smaller numbers and therefore fewer
bytes for the variable length encoding. It also allows for the use of
longer clusters (but that's more theoretical: with the default
TimecodeScale of 1ms precision clusters could be as long as ~32
seconds, but they're usually only up to five seconds long in order not
to make seeking too costly).
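To put a number on the "~32 seconds" figure: a block's timecode is stored
as a signed 16-bit offset from its cluster's timecode, so the limit falls
straight out of TimecodeScale. A quick sketch (the function name is made
up for illustration):

```python
# Block timecodes are signed 16-bit offsets relative to the cluster timecode.
INT16_MAX = 2**15 - 1


def max_cluster_span_seconds(timecode_scale_ns: int) -> float:
    """Largest offset a block can have from its cluster, in seconds."""
    return INT16_MAX * timecode_scale_ns / 1_000_000_000


# Default TimecodeScale: 1,000,000 ns per tick, i.e. 1 ms precision.
print(max_cluster_span_seconds(1_000_000))
```

A finer TimecodeScale shrinks that limit proportionally, which is the
trade-off between precision and cluster length.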

Values in the track headers, on the other hand, don't have to be
conservative regarding the space they occupy. I therefore opt for the
highest precision possible, which would mean going for nanosecond
precision for SeekPreRoll. Another possibility would be to express
SeekPreRoll in samples, but that would pose a problem for video tracks
as the demuxer doesn't know in advance whether or not a Matroska block
contains a single field or a full frame (progressive video) -- so we
couldn't define the unit of SeekPreRoll to be "one field" for video
tracks. Therefore I still vote for a time-based value, so nanosecond
precision it should be.
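As a rough sketch of how a player might use a nanosecond SeekPreRoll: the
helper names, the 80 ms value and the 48 kHz rate below are assumptions
picked for illustration, not anything the draft defines.

```python
NS_PER_SECOND = 1_000_000_000


def ns_to_samples(duration_ns: int, sample_rate_hz: int) -> int:
    """Convert a nanosecond duration to a sample count, rounding to nearest."""
    return (duration_ns * sample_rate_hz + NS_PER_SECOND // 2) // NS_PER_SECOND


def decode_start_ns(seek_target_ns: int, seek_pre_roll_ns: int,
                    first_timecode_ns: int = 0) -> int:
    """Where decoding has to start so the output at the seek target is valid.

    Clamped to the track's first timecode, which does not have to be 0.
    """
    return max(first_timecode_ns, seek_target_ns - seek_pre_roll_ns)


pre_roll_ns = 80_000_000  # assumed: 80 ms of pre-roll
print(ns_to_samples(pre_roll_ns, 48_000))            # pre-roll in samples
print(decode_start_ns(10 * NS_PER_SECOND, pre_roll_ns))
```

With nanoseconds the conversion to samples is exact for any common sample
rate, which would not be the case at millisecond resolution.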

> 2. Pre-skip. I see 4 possibilities for handling pre-skip.

I haven't thought about this fully just yet. I have a couple of
comments so far, though.

> 2.1 The Opus audio stream pre-skip data starts from time 0 and adds the
> pre-skip time to the normal audio time, like how Opus files are
> muxed into ogg files. We would add a new element to the TrackHeader,
> PreSkip, and the decoder would adjust the timestamps of the decoded
> samples by subtracting PreSkip. It would be up to the player on how
> to handle negative audio timestamps.

Pre-skip in Opus context is not simply a value to subtract from the
timecodes (or sample numbers). If I understood correctly it is a
number of samples that have to be skipped after decoding ("at the end
of the decoding chain" would be the corresponding terminology from the
usual video codec specs) and that must not be output. So players can
not simply apply the following algorithm:

1. Subtract PreSkip from BlockTimecode
2. Discard block if resulting timecode is negative

for several reasons:

1. BlockTimecodes might not start at 0 for an audio track. So even
after subtracting PreSkip the resulting timecode might still be
positive.
2. It's not the demuxer's job to discard the samples. I think it's the
player's job, so this value must be communicated as meta data
separately from the data stream. The demuxer must not mess with the
data already.

So this means:

> Pros:
>
> - Might work for future formats that want to add a PreSkip.

This is actually one of the most compelling reasons for me to prefer
your 2.1 to all the other solutions, coupled with the remarks below
on why I don't think the other solutions are good.

> Cons:
>
> - There also could be an issue of when the real audio data starts if the
> Block timecode scale is less than the sample rate. E.g a decoded block has a
> timestamp of -10ms and a duration of 40ms.

Not really an issue, as PreSkip should really be either a number of
samples or, if it's a time-based value, use a resolution that can be
converted to a number of samples unambiguously (e.g. nanosecond
precision). If we choose "samples" as the unit, then the drawback is
having to be careful about how to define a "sample" for video
tracks. However, this is actually not as bad as it is for SeekPreRoll
above, as we're talking about discarding stuff at the end of the
decoding chain. At that point the "unit" of a video block is known.

Therefore I think we should define PreSkip's unit to be "one sample".
For audio this is well-defined, for video we define it as a single
field. For progressive video content each decoded frame counts as two
fields, of course, and PreSkip must be divisible by 2 for progressive
video.
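A minimal sketch of what "discard at the end of the decoding chain" could
look like on the player side, assuming PreSkip is counted in samples as
suggested above. The class and function names are hypothetical, and the
312-sample value is only an example.

```python
class PreSkipFilter:
    """Drops the first `pre_skip` decoded samples, regardless of block timecodes."""

    def __init__(self, pre_skip: int):
        if pre_skip < 0:
            raise ValueError("PreSkip cannot be negative")
        self.remaining = pre_skip

    def process(self, decoded_samples: list) -> list:
        """Return the part of this decoded block that may be output."""
        drop = min(self.remaining, len(decoded_samples))
        self.remaining -= drop
        return decoded_samples[drop:]


def progressive_frames_to_skip(pre_skip_fields: int) -> int:
    """For progressive video each frame counts as two fields."""
    if pre_skip_fields % 2 != 0:
        raise ValueError("PreSkip must be divisible by 2 for progressive video")
    return pre_skip_fields // 2


skipper = PreSkipFilter(312)                    # example value only
first_block = skipper.process(list(range(960)))
print(len(first_block), first_block[0])
```

The filter never looks at timecodes, so it keeps working when the first
audio block does not start at 0, and PreSkip may span several blocks.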

> - Added complexity outside of the decoder.

That is a drawback, but a necessary one, I think. The discussions I
had in #opus on freenode's IRC seemed to indicate that a player really
has to be aware of Opus handling if it wants to implement seeking and
playback properly. It cannot be handled by a demuxer+decoder alone.
Also, what I know of existing implementations suggests that demuxers
and decoders are usually loosely coupled layers that cannot implement
complex interaction by themselves -- a third layer keeping overall
control is always required. That's the player.

> - Can all players/frameworks handle negative timestamps on decoded audio?

I don't think so, but all of the solutions we can come up with will
have a severe impact on existing players. Some more so than others,
but we cannot avoid it completely. Therefore I'm in favor of
implementing a more general solution, and that's your 2.1 proposal.
All the other ones are hacks that try to shove Opus support into
existing structures with as little additional disruption as possible --
but that "as little as possible" will still be enough to trip up quite
a lot of players, especially hardware devices. That's my experience
from past modifications to Matroska: no matter what you change, some
players will always throw a fit.

> 2.2 The pre-skip data must be contained in the first audio Block (or maybe
> in the CodecPrivate) with non-pre-skip encoded data.

I don't like this because it's a hack. It changes the semantics and
the structure of certain audio blocks, requiring special-casing in
muxers/demuxers everywhere. That's extremely annoying and not
extensible; it's also a solution for one case only that would have to
be adapted for other similar cases, requiring even more special-casing.

It also takes information and control away from the player.

> Can the decoder assume that a packet of timestamp 0 will have to decode and
> throw out pre-skip samples? Then we wouldn’t need to delimit all of the Opus
> Blocks.

Audio timecodes don't have to start at 0.

> 2.3 Place the pre-skip frames in blocks that have the invisible flag set (or we
> could signal with a Block with 0 duration, much like we do with VP8
> altref frames).

Again special-casing, simply on a different level than your proposal 2.2.

> 2.4 Add the pre-skip data with a negative Block timecode. The problem
> with 2.4 is that TimecodeScale may be set so that the pre-skip data
> timecode cannot be represented in the first Cluster.

Nah... again, audio timecodes don't have to start at 0; how do you
then recognize which audio blocks belong to the PreSkip zone and which
don't?

All in all I'm strongly in favor of 2.1. You prefer 2.2 due to it
being the least disruptive, but as I've pointed out above any kind of
disruption will cause existing players to behave strangely/to fail.
That's simply how bad Matroska implementations are. For example,
Matroska has always been based on the idea that demuxers must skip
elements they don't know about (that's why there are two version
number elements in the header: the DocTypeVersion and the
DocTypeReadVersion, and if your player supports "read version" 2 then the muxer
may still use elements from Matroska v4 if they're purely optional for
playback -- and your player should just skip them). However, after
adding e.g. CueDuration and enabling them by default a lot of hardware
players suddenly refused to play such files claiming they were
invalid/unsupported or whatever. Even VLC failed. *sigh*
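Skipping unknown elements is cheap because every EBML element carries
its own size. A minimal sketch of such a reader, assuming definite sizes
only and integer payloads: KNOWN_IDS and the function names are made up
for illustration, and 0x56BB just stands in for "an ID this reader has
never heard of".

```python
def read_vint(buf: bytes, pos: int, keep_marker: bool):
    """Read an EBML variable-length integer; return (value, next position)."""
    first = buf[pos]
    length, mask = 1, 0x80
    while length <= 8 and not (first & mask):  # leading zero bits give the length
        mask >>= 1
        length += 1
    if length > 8:
        raise ValueError("invalid EBML variable-length integer")
    # Element IDs keep the length marker bit, sizes drop it.
    value = first if keep_marker else first & (mask - 1)
    for byte in buf[pos + 1:pos + length]:
        value = (value << 8) | byte
    return value, pos + length


KNOWN_IDS = {0xD7: "TrackNumber", 0x83: "TrackType"}


def parse_children(buf: bytes) -> dict:
    """Collect known unsigned-integer elements and skip everything else."""
    found, pos = {}, 0
    while pos < len(buf):
        element_id, pos = read_vint(buf, pos, keep_marker=True)
        size, pos = read_vint(buf, pos, keep_marker=False)
        name = KNOWN_IDS.get(element_id)
        if name is not None:
            found[name] = int.from_bytes(buf[pos:pos + size], "big")
        pos += size  # unknown elements are stepped over, not rejected
    return found


data = bytes([0xD7, 0x81, 0x01,                           # TrackNumber = 1
              0x56, 0xBB, 0x84, 0x00, 0x00, 0x00, 0x50,   # unknown element
              0x83, 0x81, 0x02])                          # TrackType = 2
print(parse_children(data))
```

The unknown element in the middle costs the reader one addition, which is
why refusing to play such files is so hard to excuse.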

So we will be disruptive with these changes, one way or the other. It
doesn't matter much, and therefore we should do it right and in a way
that might make future similar cases easier to implement.

Therefore I'm in favor of 2.1.

Kind regards,
mosu

