[Matroska-devel] ebml viewer with a small analysis capabilities
jcsston at jory.info
Thu Jul 3 01:15:34 CEST 2008
Impressive research :)
Thank you for sharing
Oleg Estekhin wrote:
> Hi everybody.
> As part of my research into different ways to store binary data, I created an EBML viewer-analyser application, written in Java.
> Initially I just wanted to check how much more space real-world data would take if the storage format supported only signed integers, compared to a format with both signed and unsigned integers.
> It turned out that EBML in general, and Matroska in particular, is the only format that has both of the following properties:
> 1) the format has signed and unsigned integers;
> 2) I have a big collection of data in that format (MKV video in this case).
> In the end, instead of writing a simple analyser, I wrote a program that can view the contents of EBML files and calculate some statistics on the number of elements used and so on.
> The source code and an executable Java jar file are available from http://code.google.com/p/ebml-viewer/.
> To start the program, double-click ebml-viewer-1.0.jar inside the unpacked /lib directory of ebml-viewer-1.0-bin.zip.
> There is no manual, so here are a few comments on the available menu commands:
> 1) "open file" will both parse the file and open a tree with the file contents.
> 2) "parse file" will only parse the file for the sake of collecting statistics.
> 3) "edit/element list" lets you configure whether the contents of a given container element are displayed in the tree view. Regardless of these settings, all containers are parsed, so this is just a filter for the file structure tree.
> Creating the complete tree view of a 250 MB file requires 20 to 30 MB of memory, so the viewer can display only a limited number of files before throwing OutOfMemoryError. If you are not interested in the contents of Cluster (0x1f43b675), Cues (0x1c53bb6b) and SeekHead (0x113d9b74), deselecting them greatly reduces the amount of memory required to display the file structure.
> 4) "view/file list" shows the list of processed files.
> 5) "analyse/type statistics" shows the statistics for EBML data types.
> 6) "analyse/element statistics" shows the statistics for EBML elements that were present at least once in some file.
> 7) "analyse/inspections" shows a limited number of inspections.
> "Inefficient encoding" means that the value could have been encoded with a smaller number of bytes.
> "Signed integer encoding" shows what would happen if an unsigned integer value were encoded as a signed integer.
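A minimal sketch of how an "inefficient encoding" check for integer values could work, assuming that integers are stored big-endian, that signed values use two's complement, and that zero may be stored in zero bytes (as the EBML RFC permits). The class and method names are hypothetical and are not taken from ebml-viewer.

```java
// Hypothetical helper, not part of ebml-viewer.
class InefficientEncoding {

    // Smallest number of bytes that can hold the value read as an unsigned
    // 64-bit integer; zero needs no bytes at all.
    static int minimalUnsignedLength(long value) {
        int significantBits = 64 - Long.numberOfLeadingZeros(value);
        return (significantBits + 7) / 8;
    }

    // Smallest number of bytes that can hold the value as a two's complement
    // signed integer; zero again needs no bytes.
    static int minimalSignedLength(long value) {
        if (value == 0) {
            return 0;
        }
        // Negative values are bit-flipped so both signs can be measured the
        // same way; one extra bit is reserved for the sign.
        long magnitude = value < 0 ? ~value : value;
        int significantBits = 64 - Long.numberOfLeadingZeros(magnitude) + 1;
        return (significantBits + 7) / 8;
    }

    // Bytes that a tighter unsigned encoding would save.
    static int wastedUnsignedBytes(long value, int encodedLength) {
        return encodedLength - minimalUnsignedLength(value);
    }

    public static void main(String[] args) {
        System.out.println(wastedUnsignedBytes(0L, 1));    // 1: zero stored in one byte
        System.out.println(minimalUnsignedLength(0xFFL));  // 1
        System.out.println(minimalSignedLength(0xFFL));    // 2
    }
}
```

An inspection along these lines would flag every element whose stored length exceeds the minimal length for its value.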
> Below is a summary of one of my experiments:
> The input is 338 MKV files with a total size of about 85 GB.
> The type statistics sorted by the number of instances:
> CONTAINER: 17261279 (24 different elements)
> BINARY: 16757498 (7 different elements)
> SIGNED_INTEGER: 12605228 (only 1 element of the signed integer type, ReferenceBlock to be exact)
> UNSIGNED_INTEGER: 1605713 (37 different elements)
> ASCII_STRING: 5521 (6 different elements)
> UTF_8_STRING: 4607 (8 different elements)
> FLOAT: 1965 (4 different elements)
> DATE: 338 (1 element of the Date type)
> There were also 31 instances of unknown elements (2 different unknown identifiers). It seems that http://www.matroska.org/technical/specs/index.html does not fully reflect the latest EBML/Matroska versions.
> The element statistics table is rather big, with the BlockGroup and Block elements at the top with about 16 million instances each.
> The inspections:
> 1) 339 instances of inefficient element size encoding (most cases are the encoding of the Cluster size, which is almost always 8 bytes long). Approximately 1 KB can be saved, which is 0.0000013% of the total size of the processed files.
> 2) 106 instances of inefficient signed integer encoding and 7321 instances of inefficient unsigned integer encoding. Most cases are the zero value encoded as one byte, which, at least according to the EBML RFC, can be encoded as zero bytes. Approximately 7 KB can be saved, which is 0.0000082% of the total size of the processed files.
> 3) there are 109738 instances of unsigned integers that would take more space if encoded as signed integers. For example, the values 0x80..0xFF take 1 byte as unsigned and 2 bytes as signed. If all these values were encoded as signed integers, the total size of the processed files would increase by 109738 bytes, which is 0.00012% of the total size of the processed files.
> 4) all encountered string values can be encoded in UTF-8 without any change in the file size. Leaving only one kind of string (Unicode) in the specification would not affect the files I used for this experiment.
> 5) the Date type is a small waste of file size and a big waste of specification, as there is only one element of that type in the specification and there is at most one instance of that element in each processed file. Encoding all date elements as signed seconds could save 4 bytes per file and removing the Date type from the specification could simplify the specification a little.
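For the first inspection, here is a rough sketch of how the minimal length of an EBML element data size could be computed. It assumes the usual variable-length layout in which an n-byte size carries 7n value bits, with the all-ones pattern reserved for "unknown size". The class name, the method names and the 5,000,000-byte Cluster in the example are made up for illustration.

```java
// Hypothetical helper, not part of ebml-viewer.
class EbmlSizeLength {

    // Smallest number of bytes (1..8) whose 7*n value bits can hold the size,
    // keeping the all-ones pattern free because it means "unknown size".
    static int minimalSizeLength(long size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must not be negative");
        }
        for (int n = 1; n <= 8; n++) {
            long largest = (1L << (7 * n)) - 2;
            if (size <= largest) {
                return n;
            }
        }
        throw new IllegalArgumentException("size does not fit into 8 bytes");
    }

    // Bytes that the shortest possible size field would save.
    static int wastedSizeBytes(long size, int encodedLength) {
        return encodedLength - minimalSizeLength(size);
    }

    public static void main(String[] args) {
        // An invented Cluster of 5,000,000 bytes whose size field uses 8 bytes.
        System.out.println(wastedSizeBytes(5_000_000L, 8)); // 4
    }
}
```

Writers often reserve the full 8 bytes because the final Cluster size is not known when the header is written, which would explain why this inspection fires almost only on Cluster elements.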
> The main result is that the size of this particular set of files would grow by only 0.00012% if all unsigned integers were encoded as signed integers. Not that it matters for the current versions of EBML and Matroska, but it could be taken into account when developing new versions or completely new data encoding formats not related to Matroska (which is one of the goals of the original research that led me to study EBML and write that program).
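The instance count and the byte increase in the third inspection are the same number (109738) because a minimally encoded unsigned value costs exactly one extra byte as a signed integer, and only when its top bit is set. A small sketch of that rule follows, with hypothetical names and made-up sample values.

```java
// Hypothetical helper, not part of ebml-viewer.
class SignedPenalty {

    // True when the minimal unsigned encoding fills its top byte up to the
    // highest bit, so a signed encoding needs one more byte for the sign.
    // Values with bit 63 set also return true, although they would not fit
    // into an 8-byte signed integer at all.
    static boolean needsExtraByteAsSigned(long unsignedValue) {
        int significantBits = 64 - Long.numberOfLeadingZeros(unsignedValue);
        return significantBits != 0 && significantBits % 8 == 0;
    }

    // Total growth in bytes if every value were re-encoded as signed.
    static long extraBytes(long[] unsignedValues) {
        long total = 0;
        for (long value : unsignedValues) {
            if (needsExtraByteAsSigned(value)) {
                total++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long[] sample = {0L, 1L, 0x7FL, 0x80L, 0xFFL, 0x100L, 0x8000L};
        System.out.println(extraBytes(sample)); // 3: 0x80, 0xFF and 0x8000
    }
}
```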