[Matroska-devel] Opus in Matroska Cont.

Frank Galligan frankgalligan at gmail.com
Thu Mar 28 00:22:14 CET 2013


On Tue, Mar 26, 2013 at 6:45 AM, Moritz Bunkus <moritz at bunkus.org> wrote:

> Hey,
>
> On Fri, Mar 22, 2013 at 10:24 PM, Frank Galligan
> <frankgalligan at gmail.com> wrote:
>
> > I want to continue the discussion on adding Opus to Matroska. The current
> > Draft [1].
>
> That's great. Because back when I worked on this I noticed that it's
> all kind of a huge mess -- meaning a lot of stuff to implement and
> spec out involving several unpopular changes to demuxers (and also to
> muxers), making all of this rather complicated. I therefore decided to
> stop working on it until someone else showed any interest in it (apart
> from "hey when does Matroska support Opus?" -- meaning interest in
> doing actual work on it).
>
> > 1. Pre-roll (and muxed files). I think we should add a new element in the
> > TrackHeader, SeekPreRoll, which uses the same units as the Cluster
> > timecode.
>
>
> I'm generally fine with your proposal and explanation. One detail I'd
> like to change is the resolution for this new element. Elements in a
> cluster are scaled with TimecodeScale for a very specific reason: to
> save space by allowing the use of smaller numbers and therefore fewer
> bytes for the variable length encoding. It also allows for the use of
> longer clusters (but that's more theoretical: with the default
> TimecodeScale of 1ms precision clusters could be as long as ~32
> seconds, but they're usually only up to five seconds long in order not
> to make seeking too costly).
>
> Values in the track headers, on the other hand, don't have to be
> conservative regarding the space they occupy. I therefore opt for the
> highest precision possible, which would mean going for nanosecond
> precision for SeekPreRoll. Another possibility would be to express
> SeekPreRoll in samples, but that would pose a problem for video tracks
> as the demuxer doesn't know in advance whether or not a Matroska block
> contains a single field or a full frame (progressive video) -- so we
> couldn't define the unit of SeekPreRoll to be "one field" for video
> tracks. Therefore I still vote for a time-based value, so nanosecond
> precision it should be.
>
Agreed. Unless someone has an objection I think we can consider pre-roll
done.
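
To make the numbers in this part concrete, here is a rough sketch. The function names are made up for illustration, the 80 ms pre-roll is only an example value, and the sketch assumes Block timecodes are stored as signed 16-bit offsets from the Cluster timecode.

```python
# Illustrative sketch only; none of these names come from the draft or a library.
NS_PER_SECOND = 1_000_000_000
INT16_MAX = 32767  # assumption: Block timecodes are signed 16-bit offsets


def max_block_offset_seconds(timecode_scale_ns: int) -> float:
    """Largest Block offset from its Cluster timecode, in seconds."""
    return INT16_MAX * timecode_scale_ns / NS_PER_SECOND


def seek_start_ns(target_ns: int, seek_pre_roll_ns: int, stream_start_ns: int = 0) -> int:
    """Timestamp from which to start decoding so output at target_ns is valid.

    SeekPreRoll is taken to be in nanoseconds, as agreed above. The result is
    clamped to the stream start, which need not be 0.
    """
    return max(stream_start_ns, target_ns - seek_pre_roll_ns)


# Default TimecodeScale of 1 ms gives the "~32 seconds" limit.
print(max_block_offset_seconds(1_000_000))
# Seeking to 10 s with an example 80 ms pre-roll starts decoding at 9.92 s.
print(seek_start_ns(10 * NS_PER_SECOND, 80_000_000))
```

Keeping SeekPreRoll in nanoseconds means the calculation does not depend on TimecodeScale at all.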


>
> > 2. Pre-skip. I see 4 possibilities for handling pre-skip.
>
> I haven't thought about this fully just yet. I have a couple of
> comments so far, though.
>
> > 2.1 The Opus audio stream pre-skip data starts from time 0 and adds the
> > pre-skip time to the normal audio time, like how Opus files are
> > muxed into ogg files. We would add a new element to the TrackHeader,
> > PreSkip, and the decoder would adjust the timestamps of the decoded
> > samples by subtracting PreSkip. It would be up to the player on how
> > to handle negative audio timestamps.
>
> Pre-skip in Opus context is not simply a value to subtract from the
> timecodes (or sample numbers).

That is how ogg demuxer calculates the sample position within the stream.
From http://tools.ietf.org/html/draft-terriberry-oggopus-01#section-4.2
PCM sample position = 'granule position' - 'pre-skip'
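
A minimal sketch of that formula. The helper names are hypothetical, the example pre-skip of 312 samples is arbitrary, and granule positions are taken to count samples at 48 kHz as in Ogg Opus.

```python
# Illustrative only: helper names and example values are not from any library.
OPUS_GRANULE_RATE_HZ = 48_000  # Ogg Opus granule positions count 48 kHz samples


def pcm_sample_position(granule_position: int, pre_skip: int) -> int:
    """PCM sample position = 'granule position' - 'pre-skip'."""
    return granule_position - pre_skip


def pcm_time_seconds(granule_position: int, pre_skip: int) -> float:
    """The same position expressed as a playback time in seconds."""
    return pcm_sample_position(granule_position, pre_skip) / OPUS_GRANULE_RATE_HZ


# A page ending at granule position 960 with a pre-skip of 312 samples
# ends at PCM sample 648, i.e. 13.5 ms into the audible stream.
print(pcm_sample_position(960, 312), pcm_time_seconds(960, 312))
```

Note that a granule position smaller than the pre-skip yields a negative position, which is the negative-timestamp case discussed under 2.1.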


> If I understood correctly it is a
> number of samples that have to be skipped after decoding ("at the end
> of the decoding chain" would be the corresponding terminology from the
> usual video codec specs) and that must not be output. So players can
> not simply apply the following algorithm:
>
> 1. Subtract PreSkip from BlockTimecode
> 2. Discard block if resulting timecode is negative
>
> for several reasons:
>
> 1. BlockTimecodes might not start at 0 for an audio track. So even
> after subtracting PreSkip the resulting timecode might still be
> positive.
>
I agree, but I wanted to keep it simple for now and assume start time is 0.
But you are correct 2) should be "Discard Block if resulting timecode is <
start timecode".

> 2. It's not the demuxer's job to discard the samples. I think it's the
> player's job, so this value must be communicated as meta data
> separately from the data stream. The demuxer must not mess with the
> data already.
>
I didn't say it was the demuxer's job. I said it was the player's job (for
specific players that could mean the demuxer), but really this should read
"it is not the decoder's job to discard the samples".

>
> So this means:
>
> > Pros:
> >
> > - Might work for future formats that want to add a PreSkip.
>
> This is actually one of the most compelling reasons for me to prefer
> your 2.1 to all the other solutions. Coupled with the following
> remarks why I don't think the other solutions are good.
>
> > Cons:
> >
> > - There also could be an issue of when the real audio data starts if the
> > Block timecode scale is less than the sample rate. E.g. a decoded block has a
> > timestamp of -10ms and a duration of 40ms.
>
> Not really an issue as PreSkip should really be either a number of
> samples or, if it's a time-based value, employ a resolution that can be
> converted to a number of samples unambiguously (e.g. nanosecond
> precision). If we chose to use "samples" as the unit then we have the
> drawback of having to be careful about how to define a "sample" for video
> tracks. However, this is actually not as bad as it is for PreRoll
> above as we're talking about discarding stuff at the end of the
> decoding chain. At that place the "unit" of a video block is known.
>
> Therefore I think we should define PreSkip's unit to be "one sample".
> For audio this is well-defined, for video we define it as a single
> field. For progressive video content each decoded frame counts as two
> fields, of course, and PreSkip must be divisible by 2 for progressive
> video.
>
If we define PreSkip unit to be "a sample defined by the type of stream"
(samples for audio, fields for video, ??? for text), then we may be
limiting the components that can handle PreSkip. Or making the
specification more fragile. E.g. If we choose fields for video then we will
most likely add to the spec that FlagInterlaced MUST reflect if the source
is interlaced. Also are there any containers that store the fields
separately?

I think we should define the PreSkip unit to be nanoseconds. This may
introduce issues converting PreSkip to stream units, but these are known
issues that players have been dealing with for years. This is not a stance
I'm firmly committed to. If everyone wants the units to be "stream
samples", that would be fine with me.
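
As a sanity check on the nanosecond option, here is a small sketch of the conversion in both directions. The function names are illustrative, and the 48 kHz rate and the 312-sample pre-skip are example values only.

```python
# Illustrative sketch: converting a nanosecond PreSkip to samples and back.
NS_PER_SECOND = 1_000_000_000


def samples_to_ns(samples: int, rate_hz: int) -> int:
    """Sample count to nanoseconds, rounded to the nearest nanosecond."""
    return (samples * NS_PER_SECOND + rate_hz // 2) // rate_hz


def ns_to_samples(ns: int, rate_hz: int) -> int:
    """Nanoseconds to sample count, rounded to the nearest sample."""
    return (ns * rate_hz + NS_PER_SECOND // 2) // NS_PER_SECOND


pre_skip_ns = samples_to_ns(312, 48_000)
print(pre_skip_ns)                         # value a muxer would write
print(ns_to_samples(pre_skip_ns, 48_000))  # value a player would recover
```

Because one nanosecond is far shorter than a sample period at any common audio rate, rounding to the nearest sample recovers the original count, provided both sides round the same way.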


> > - Added complexity outside of the decoder.
>
> That is a drawback, but a necessary one, I think. The discussions I
> had in #opus on freenode's IRC seemed to indicate that a player really
> has to be aware of Opus handling if it wants to implement seeking and
> playback properly. It cannot be handled by a demuxer+decoder alone.
>
Were there specific concerns?

I'm fairly certain we can hide pre-skip from the demuxer and contain it in
the decoder and muxer. See 2.2.3


> Also what I know of existing demuxers always hints that demuxers and
> decoders are often loosely coupled layers that cannot implement
> complex interaction by themselves -- a third layer keeping overall
> control is always required. That's the player.
>
I agree, but with 2.1 we have two issues we need to solve: throwing out the
samples and rewriting timestamps.

2.1.1 DirectShow Implementation example

Marking preskip data to be discarded
Typically this is the responsibility of the demuxer, by marking the DShow
MediaSample as "DirectShow preroll". If the end of the pre-skip data may be
in the middle of a normal Opus packet (can this happen?) then the demuxer
can not do this, because only part of the data must be thrown out.
DirectShow filters can only mark full DShow MediaSamples as "DirectShow
preroll".

If we can't rely on the Demuxer to mark which encoded frames must be thrown
out then it will be the decoder's responsibility. With 2.1 this will at
best be a fragile hack. I.e. the decoder must assume the timestamp of the
first MediaSample is the start of the pre-skip data. But this will break if
playback starts after the first MediaSample. Even if we can fix starting
playback from a non-first frame, I think there will still be an issue
handling seek time within the DirectShow framework because of the shifted
timestamps.

Re-writing timestamps can be done by either the demuxer or the decoder. But
if we create an Opus decoder that rewrites the timestamps, then we force
all demuxers that connect to our Opus decoder to shift Opus timestamps by
PreSkip, without any way to signal that restriction. I think it would be
best to perform the rewrite in the demuxer instead of special casing an
Opus decoder.
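
For the partial-packet case mentioned above, here is a sketch of what whichever component ends up owning the discard would have to do with decoded output. The function name, the packet sizes and the 312-sample pre-skip are all hypothetical.

```python
# Illustrative sketch: discard pre-skip samples from decoded output, even when
# the pre-skip region ends in the middle of a packet.
from typing import List, Tuple


def trim_pre_skip(decoded: List[int], remaining_skip: int) -> Tuple[List[int], int]:
    """Drop up to remaining_skip leading samples from one decoded packet.

    Returns the samples to render and the pre-skip still left to discard.
    """
    drop = min(remaining_skip, len(decoded))
    return decoded[drop:], remaining_skip - drop


remaining = 312                           # example pre-skip in samples
for packet_len in (120, 120, 120, 120):   # e.g. 2.5 ms packets at 48 kHz
    kept, remaining = trim_pre_skip(list(range(packet_len)), remaining)
    print(len(kept), remaining)
```

With these example sizes the first two packets are dropped whole and the third is cut partway through, which is exactly the case a whole-MediaSample preroll flag cannot express.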



> > - Can all players/frameworks handle negative timestamps on decoded audio?
>
> I don't think so, but all of the solutions we can come up with will
> have a severe impact on existing players. Some more so than others,
> but we cannot avoid it completely. Therefore I'm in favor of
> implementing a more general solution, and that's your 2.1 proposal.
> All the other ones are hacks that try to shove Opus support into
> existing structures with as little additional disruption as possible --
> but that "as little as possible" will still be enough to trip up quite
> a lot of players, especially hardware devices. That's my experience
> from past modifications to Matroska: no matter what you change, some
> players will always throw a fit.
>
I agree, making any change that is different from what has been done before
will most likely break some players. But we have to be careful what we change
in a general purpose container. I think 2.1 is a hack to shoehorn Ogg
Opus into Matroska, and the biggest change out of all of them. I think
shifting all timestamps by some T will break things in ways we cannot even
imagine (and of course some ways we can), and in the long run we will be
hacking "fixes" for a long time in most players. I think another change
will better handle pre-skip and cover more changes like this in potential
future codecs. More on this later.


> > 2.2 The pre-skip data must be contained in the first audio Block (or maybe
> > in the CodecPrivate) with non-pre-skip encoded data.
>
> I don't like this due to it being a hack. It changes the semantics and
> the structure of certain audio blocks requiring special-casing in
> muxers/demuxers everywhere. Extremely annoying and not extensible,
> also a solution for one case only that would have to be adopted for
> other similar cases requiring even more special-casing.
>
I agree, depending on the implementation, this is a hack. I see three
ways we could implement 2.2.

2.2.1. Force pre-skip packets to be prepended to the first normal packet in
the first Block.
This will force the decoder to assume that the first Block it receives
contains pre-skip data. As noted above, this will break if playback
starts on a non-first frame.

2.2.2 Add PreSkip to CodecPrivate.
This suffers the same detection issue as 2.2.1.

2.2.3 Create a new codec, OPUS_MKV.
Basically the codec will wrap Opus packets with data telling the decoder
what type of Opus packet it contains. Essentially we would be creating a
new codec to handle pre-skip data within the decoder.

wrt DirectShow, only the encoder and decoder filters would know about
PreSkip. All current demuxers, muxers, and players should work without
modification.

2.2.1 and 2.2.2 have issues we have to hack around for what
are already hacks. 2.2.3 should work but may be an extreme solution to
handle pre-skip data, as well as not being extensible.


> It also takes information and control away from the player.

But why should pre-skip information be available to the player if it truly
doesn't need it? I think exposing superfluous information to components
that don't need it is bad. And worse is exposing superfluous information to
players and forcing them to handle it.




> > Can the decoder assume that a packet of timestamp 0 will have to decode and
> > throw out pre-skip samples? Then we wouldn’t need to delimit all of the Opus
> > Blocks.
>
> Audio timecodes don't have to start at 0.
>
> > 2.3 Place the pre-skip frames in blocks that have the invisible flag set (or we
> > could signal with a Block with 0 duration, much like we do with VP8
> > altref frames).
>
> Again special-casing, simply on a different level than your proposal 2.2.
>
Here is where I disagree. I don't think setting the invisible flag is
special casing. Actually that is specifically what the invisible flag is
meant to convey within the general purpose container.

Let me give a little history of VP8 altref frames. When we first talked
about adding altref frames to WebM/Matroska, our natural first thought
would be to put them in a Block with the invisible bit set. Essentially
altref frames are frames that must be decoded but not shown. But doing this
would force all muxers and multimedia frameworks to understand and/or
handle this new type of frame. E.g. in DirectShow we would be forced to
signal that frames were altref frames outside of the framework. Demuxers didn't
have to handle the invisible flag because the VP8 decoder knew which VP8
frames were altref frames. The VP8 decoder would decode the altref frames
but not produce any output. So we decided to not force muxers to set the
invisible flag for altref frames as the decoder could handle them. Even
with making minimal changes to handle altref frames, we had issues in
frameworks (I remember some in FFmpeg) that we needed to address that we
didn't think of at first. I still think we made the right decision with
not mandating that the invisible flag be set on altref frames and forcing
muxers, demuxers, decoders, and players to handle the new invisible frame
type. Aside #1 some muxers today still set the invisible flag on altref
frames, but all (most?) players ignore this bit. Aside #2 there was some
push back and issues we had to address on VP8 altref frames, where there
was no real clear benefit handling the altref frames outside of the
decoder. In VP9 the default behavior is to handle VP9 altref frames within
the decoder.

Now let's step back and look at pre-skip data. Essentially pre-skip data is
data that needs to be decoded but not rendered. At a high level this is the
same as altref frames. They both have a temporal order when the data must
be decoded, implicitly assigning the data a timestamp. They both must be
decoded but not rendered. In the WebM spec we propose muxers SHOULD NOT
give altref frames a duration, as we do not want to change the time and
duration of the source frames (but some muxers do give altref frames a
duration, not changing the overall duration). Ogg Opus says muxers MUST give
a duration to pre-skip data. But does pre-skip data really have a duration?
Does any data that should not be rendered really have a duration? I think
no.

So if we are going to choose to force muxers, demuxers, decoders,
frameworks, and players to have to understand the new concept of data
that should be decoded but not rendered then I think we should implement
2.3, as it is a much less invasive change than 2.1. Also 2.3 is more
general than 2.1, as 2.1 is restricted to the start of the stream.
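
To show how little a demuxer would need for 2.3, here is a sketch of reading the invisible bit from a Block header. It assumes the usual Block layout (EBML-coded track number, signed 16-bit relative timecode, one flags byte with the invisible bit at 0x08). The function name and the returned dictionary are made up for illustration.

```python
# Illustrative sketch of parsing a Matroska Block header; not a full demuxer.
import struct

INVISIBLE_FLAG = 0x08  # assumption: invisible bit of the Block flags byte


def parse_block_header(data: bytes) -> dict:
    """Parse track number, relative timecode and invisible flag of a Block."""
    first = data[0]
    # EBML variable-length integer: the position of the first set bit in the
    # first byte gives the total length in bytes.
    length, marker = 1, 0x80
    while length <= 8 and not (first & marker):
        marker >>= 1
        length += 1
    if length > 8:
        raise ValueError("invalid EBML variable-length integer")
    track = first & (marker - 1)
    for byte in data[1:length]:
        track = (track << 8) | byte
    # Big-endian signed 16-bit timecode followed by the flags byte.
    rel_timecode, flags = struct.unpack_from(">hB", data, length)
    return {
        "track": track,
        "rel_timecode": rel_timecode,
        "invisible": bool(flags & INVISIBLE_FLAG),
        "payload_offset": length + 3,
    }


# Track 1, relative timecode 0, invisible bit set: decode but do not render.
print(parse_block_header(bytes([0x81, 0x00, 0x00, 0x08]) + b"opus-data"))
```

The hard part is not the parsing but getting every framework to carry the "decode but do not render" bit from the demuxer through to the renderer.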

That being said I still think we should weigh 2.2.3 vs 2.3 as we will be
putting in and maintaining hacks to existing frameworks for 2.3 for a
while. I don't like the idea of creating a new codec to handle one feature
of another codec, but I also don't like forcing the whole world to
change because of one feature.



> > 2.4 Add the pre-skip data with a negative Block timecode. The problem
> > with 2.4 is that TimecodeScale may be set so that the pre-skip data
> > timecode cannot be represented in the first Cluster.
>
> Nah... again, audio timecodes don't have to start at 0; how do you
> then recognize which audio blocks belong to the PreSkip zone and which
> don't?
>
> All in all I'm strongly in favor of 2.1. You prefer 2.2 due to it
> being the least disruptive, but as I've pointed out above any kind of
> disruption will cause existing players to behave strangely/to fail.
> That's simply how bad Matroska implementations are. For example,
> Matroska has always been based on the idea that demuxers must skip
> elements they don't know about (that's why there are two version
> number elements in the header: the EBML version and the EBML read
> version, and if your player supports "read version" 2 then the muxer
> may still use elements from Matroska v4 if they're purely optional for
> playback -- and your player should just skip them). However, after
> adding e.g. CueDuration and enabling them by default a lot of hardware
> players suddenly refused to play such files claiming they were
> invalid/unsupported or whatever. Even VLC failed. *sigh*
>
> So we will be disruptive with these changes, one way or the other. It
> doesn't matter much, and therefore we should do it right and in a way
> that might make future similar cases easier to implement.
>
> Therefore I'm in favor of 2.1.
>
As I said above if we truly want to generalize "decode but don't render"
data, I think we should go with 2.3. With what I said above do you
still prefer 2.1 over 2.3?

Also if we can keep the changes local to the Opus codec or, like 2.2.3, a
"new" codec, I think we should explore it.  Anyone have any better ideas to
keep pre-skip internal to the decoder?



> Kind regards,
> mosu