
Table Of Contents

1. INTRODUCTION
2. MPEG-A MULTIMEDIA APPLICATION FORMAT
2.1. Introduction
2.2. Creating MAF
2.3. Overview of Technologies used in MAF
3. WORK ITEMS
3.1. MPEG-A Part 3 2nd Edition: Protected Music Player Application Format
3.2. MPEG-A Part 4: Musical Slide Show Application Format
3.3. MPEG-A Part 4 2nd Edition: Protected Musical Slide Show Application Format
3.4. MPEG-A Part 10: Video Surveillance Application Format
4. IMPLEMENTATION
4.1. MPEG-A Part 3 2nd Edition: Protected Music Player Application Format
4.2. MPEG-A Part 4: Musical Slide Show Application Format
4.3. MPEG-A Part 4 2nd Edition: Protected Musical Slide Show Application Format
4.4. MPEG-A Part 10: Video Surveillance Application Format
5. ACHIEVEMENTS
5.1. MPEG-A Part 3 2nd Edition: Protected Music Player Application Format
5.2. MPEG-A Part 4: Musical Slide Show Application Format
5.3. MPEG-A Part 4 2nd Edition: Protected Musical Slide Show Application Format
5.4. MPEG-A Part 10: Video Surveillance Application Format
6. CONCLUSIONS
7. REFERENCES
1. Introduction

This document is the project report for the development of the MPEG-A Multimedia Application Format (MAF)
standardization project. The MPEG-A standardization project includes the development of the following:
- MPEG-A Part 3 2nd Edition (ISO/IEC 23000-3 2nd Edition) Protected Music Player Application Format,
- MPEG-A Part 4 (ISO/IEC 23000-4) Musical Slide Show Application Format,
- MPEG-A Part 4 2nd Edition (ISO/IEC 23000-4 2nd Edition) Protected Musical Slide Show Application
Format,
- MPEG-A Part 10 (ISO/IEC 23000-10) Video Surveillance Application Format.

The document is structured as follows: Section 2 gives an overview of MAF; Section 3 describes the
specification of each MAF in the form of file format, system architecture, and metadata schema; the
implementations and reference software are presented in Section 4. Section 5 describes the results and
achievements of the technical implementation of each MAF: MPEG input contribution documents, MPEG
output documents, and research papers. Finally, Section 6 concludes this report.

2. MPEG-A Multimedia Application Format

2.1. Introduction

MPEG-A (ISO/IEC 23000) is a new standard that has been developed by the Moving Picture Experts
Group by selecting existing technologies from all published MPEG standards, as well as technologies from
other standards bodies such as JPEG and 3GPP, and combining them into so-called "Multimedia
Application Formats" or MAFs.

Selecting readily tested and verified tools available from the MPEG standards reduces the need for time-
consuming research, development and testing of new technologies. If MPEG cannot provide the required
technology, then additional technologies originating from other standards bodies can be included by
reference in order to facilitate the creation of a MAF. In other words, a MAF is created by cutting
horizontally through all MPEG standards, selecting existing parts and profiles appropriate for the new
application.

The aforementioned concept of MAF is illustrated in Figure 1. Boxes on the right represent MPEG
standards while boxes on the left represent other bodies' standards. The parts and profiles of the
technologies are represented by bold square boxes, and their combinations are used by the particular MAFs
as shown in the center box. An example shown in Figure 1 is the Protected Musical Slide Show MAF, which
uses parts and profiles from MPEG-1, MPEG-4, MPEG-7 and MPEG-21 from MPEG, and JPEG and 3GPP
Timed Text from other standards.
[Figure 1 shows the MPEG technologies on the right (MPEG-1 ISO/IEC 11172, MPEG-4 ISO/IEC 14496, MPEG-7 ISO/IEC 15938, MPEG-21 ISO/IEC 21000), other standards on the left (JPEG ISO/IEC 10918, 3GPP Timed Text TS 26.245), and in the center the MAFs of MPEG-A (ISO/IEC 23000) that combine them: Music Player, Protected Music Player, Musical Slide Show, Protected Musical Slide Show, Video Surveillance, and other MAFs.]

Figure 1 – Conceptual overview of MPEG-A

2.2. Creating MAF

The work items for creating a MAF, as shown in Figure 2, take into account the specific conceptual nature
of MAF. The first step of work in creating a MAF starts with a submission that gives evidence that there
exists a need to develop and standardize the MAF, by providing documentation that describes an anticipated
application scenario that benefits from the existence of an appropriately designed standard for a MAF. The
submission shall include an assessment of the positioning of the proposal in the technology landscape,
pointing to solutions that may already exist, whether standards-based or proprietary. MPEG then
requests documentation of industry support to successfully complete the work and to deploy the candidate
format, as an important aspect of deciding whether work on the MAF shall be continued. If it is decided to
continue, the MAFs under consideration for standardization are regularly updated and published on
the MPEG web site to gather input, comments and feedback as well as contributions from interested
parties. If MPEG determines that there is enough demand to warrant a MAF, then the pertaining application
scenario description is used to derive requirements, based on which a new MAF standard is drafted.

From this point, the parties developing the MAF should express their commitment by creating and releasing
reference software as the initial implementation of the MAF, and support is documented in the form of
registered MPEG input contributions. Based on such documentation, the proponents, with the help of
knowledgeable MPEG experts, select the technologies that the MAF shall employ to arrive at a detailed
technical specification for the MAF. The chosen technologies are expected to have advanced in their
respective standardization processes to the Final Draft International Standard (FDIS) stage or beyond once
the MAF itself reaches the FDIS stage.

To let the world know about the MAF under development, the proponents are requested to produce relevant
marketing material, including at least one white paper explaining the benefits of the new MAF. At the final
step, participating experts check the validity of the specification, which indicates the completion of the work
for a new MAF. Cross-checking of multimedia standards is done by exchanging bit streams among
different parties to check whether the bits created by one party according to the new specification can be
decoded and executed successfully by another party using a decoder implementation that has been built
according to the specification.
[Figure 2 lists the work items of the MPEG-A process: documentation (Application Scenario Description, Value Proposition and Technology Landscape, Requirements, Documented Industry Support) followed by technical work (Detailed Technical Specification, Reference Software Implementation, Marketing Material / White Paper, Cross-Checking), resulting in a MAF as Part X of MPEG-A.]

Figure 2 – Work Items of the MPEG-A process for creating MAFs

2.3. Overview of Technologies used in MAF

This subsection gives an overview of the MPEG technologies used in the MAFs related to this project report.
More detailed specifications of each technology can be found in the corresponding specification
documents. Table 1 shows the MPEG technologies used in the corresponding MAFs.

Table 1 — MPEG Technologies

MPEG Technologies        | Protected Music Player AF | Musical Slide Show AF | Protected Musical Slide Show AF | Video Surveillance AF
MPEG-1 Layer III         | ✓ | ✓ | ✓ |
MP3 on MP4               | ✓ | ✓ | ✓ |
ISO Base File Format     | ✓ | ✓ | ✓ | ✓
MPEG-4 File Format       | ✓ | ✓ | ✓ | ✓
MPEG-4 AVC               |   |   |   | ✓
MPEG-4 LASeR             |   | ✓ | ✓ |
MPEG-7 Visual            | ✓ | ✓ | ✓ | ✓
MPEG-7 MDS               | ✓ | ✓ | ✓ | ✓
MPEG-21 DID              | ✓ |   | ✓ |
MPEG-21 IPMP Components  | ✓ |   | ✓ |
MPEG-21 REL              | ✓ |   | ✓ |
MPEG-21 File Format      | ✓ |   |   |

2.3.1. MPEG-1 Layer 3

MPEG-1 Layer 3 (MP3) from the MPEG-1 Audio (ISO/IEC 11172-3:1993) specification is one of the
most widely deployed MPEG audio standards ever, due to its good compression performance and
simplicity of implementation. Most compressed music archives use MP3 encoding.

Layer 3 specifies a self-synchronizing transport, making it amenable both to storage in a computer
file and to transmission over a channel without byte framing. In the context of transmission channels,
Layer 3 can operate over a constant-rate isochronous link, and has constant-rate headers. However,
Layer 3 is an instantaneously-variable-rate coder, which adapts to the constant-rate channel by
using a "bit buffer" and "back pointers". Each of the headers signals the start of another block of
audio, but the compressed data for that block may be in a prior segment of the bit stream, pointed to
by the back pointer (in Figure 3, the curved arrows pointing to main_data_begin).

Figure 3 – Layer 3 bit stream organization
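
As an illustration of this header/back-pointer layout, the short Python sketch below locates the 9-bit main_data_begin field of an MPEG-1 Layer III frame. It is a simplified reading of the bit stream syntax (it assumes the side information starts immediately after the 4-byte header and the optional 2-byte CRC), not a replacement for the normative ISO/IEC 11172-3 parser; the Python examples in this report are illustrative sketches only.

# Simplified sketch: read the back pointer of an MPEG-1 Layer III frame.
# 'frame' is assumed to start at a frame sync word (0xFFE...).
def main_data_begin(frame: bytes) -> int:
    assert frame[0] == 0xFF and (frame[1] & 0xE0) == 0xE0, "no sync word"
    crc_present = (frame[1] & 0x01) == 0       # protection_bit = 0 -> 16-bit CRC follows
    side_info = 4 + (2 if crc_present else 0)  # offset of side information in the frame
    # main_data_begin is the first 9 bits of the side information; it points
    # back that many bytes into the bit buffer (0 = main data follows directly)
    return (frame[side_info] << 1) | (frame[side_info + 1] >> 7)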

2.3.2. MPEG-4 “MPEG-1/2 Audio in MPEG-4”

MPEG-4 Audio (ISO/IEC 14496-3:2005) Subpart 9 "MPEG-1/2 Audio in MPEG-4" specifies a
method for segmenting and formatting Layer 3 bit streams into MPEG-4 Access Units, and is
therefore often referred to as "MP3onMP4". This consists primarily of re-arranging the
compressed data associated with a given header such that it follows the header. This typically
results in new segments that are no longer of constant length but that are perfectly in accordance
with the definition of MPEG-4 Access Units. An example is shown in Figure 4.

Figure 4 – Converting an MPEG-1/2 Layer 3 bit stream into mp3_channel_elements
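
The re-arrangement can be modelled at byte level: buffer the main data carried by each frame in a reservoir, then cut the reservoir at the positions given by the back pointers so that every access unit carries its header and side information followed by its own main data. The sketch below is a conceptual model using a hypothetical frame record, not the normative MP3onMP4 formatter of ISO/IEC 14496-3.

def frames_to_access_units(frames):
    # frames: list of dicts with 'header_side' (header + side info bytes),
    # 'main_data' (bytes this frame appends to the bit reservoir) and
    # 'main_data_begin' (back pointer in bytes) -- a hypothetical record.
    # Assumes the first frame's back pointer is 0.
    reservoir, starts = b"", []
    for f in frames:
        starts.append(len(reservoir) - f["main_data_begin"])  # where this frame's audio begins
        reservoir += f["main_data"]
    access_units = []
    for i, f in enumerate(frames):
        end = starts[i + 1] if i + 1 < len(frames) else len(reservoir)
        access_units.append(f["header_side"] + reservoir[starts[i]:end])  # one Access Unit
    return access_units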

2.3.3. ISO Base Media File Format

The ISO Base Media File Format is designed to contain timed media information for a
presentation in a flexible, extensible format that facilitates interchange, management, editing, and
presentation of the media. The ISO Base Media File Format is a base format for media file formats.
In particular, the MPEG-4 file format derives from this base file format.

The file structure is object oriented as shown in Figure 5, which means that a file can be
decomposed into constituent objects very simply, and the structure of the objects inferred directly
from their type. The file format is designed to be independent of any particular network protocol
while enabling efficient support for them in general.

[Figure 5 shows a simple ISO file: a movie box containing a video 'trak' box, an audio 'trak' box and other boxes, alongside a media data box holding interleaved, time-ordered video and audio frames.]

Figure 5 – Example of a simple ISO file used for interchange, containing two streams
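
Because every box starts with a 32-bit size and a 4-character type, a file can be walked generically without knowing each box's semantics. The sketch below is a minimal top-level box reader (it also handles the 64-bit "largesize" and the size-0 "to end of file" conventions); it is an illustrative fragment, not a complete ISO/IEC 14496-12 parser.

import struct

def iter_boxes(f):
    # Yield (box_type, payload_offset, payload_size) for each top-level box.
    while True:
        pos = f.tell()
        header = f.read(8)
        if len(header) < 8:
            return
        size, box_type = struct.unpack(">I4s", header)
        header_len = 8
        if size == 1:                        # 64-bit largesize follows the type
            size = struct.unpack(">Q", f.read(8))[0]
            header_len = 16
        elif size == 0:                      # box extends to the end of the file
            f.seek(0, 2)
            size = f.tell() - pos
        yield box_type.decode("ascii", "replace"), pos + header_len, size - header_len
        f.seek(pos + size)                   # skip to the next sibling box

For a simple interchange file such as the one in Figure 5, iterating this way would typically list 'ftyp', 'moov' and 'mdat'.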

2.3.4. MPEG-4 Part 14: File Format

ISO/IEC 14496-12:2005 and ISO/IEC 14496-14:2003 together specify the MPEG-4 File Format.
This supports storage of compressed audio data in tracks. It also provides support for metadata in
the form of 'meta' boxes at the file, movie and track level. This allows support for static (un-
timed) metadata. The type of content inside the file format is specified by the file type box
'ftyp'. Figure 6 schematically illustrates the location of the boxes.

[Figure 6 shows an MP4 file with an 'ftyp' box, a file-level 'meta' box, a 'moov' (movie) box containing 'trak' (track) boxes with their own 'meta' boxes, and an 'mdat' (media data) box.]

Figure 6 – ISO/MP4 file schema
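
The brand signalling carried in 'ftyp' can be read directly from its payload: a 4-byte major brand, a 4-byte minor version, then a list of 4-byte compatible brands. A hedged sketch (the payload bytes could be obtained with the box walker shown in 2.3.3):

def read_brands(payload: bytes):
    # 'ftyp' payload = major_brand(4) + minor_version(4) + compatible_brands(4 each)
    major = payload[0:4].decode("ascii")
    minor = int.from_bytes(payload[4:8], "big")
    compatible = [payload[i:i + 4].decode("ascii") for i in range(8, len(payload), 4)]
    return major, minor, compatible

# e.g. a Musical slide show AF file would report major brand 'mp42' with
# 'mss1' among the compatible brands (see section 3.2.2).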

2.3.5. MPEG-4 Part 10: Advanced Video Coding

ISO/IEC 14496-10 Advanced Video Coding (MPEG-4 AVC) provides higher compression of
moving pictures for various applications such as videoconferencing, digital storage media,
television broadcasting, internet streaming, and communication. It is also designed to enable the
use of the coded video representation in a flexible manner for a wide variety of network
environments.

A conceptual distinction is made in the specification between a video coding layer (VCL)
and a network abstraction layer (NAL). The VCL comprises the signal processing part of the
codec, such as transform, quantization, etc. The output of the VCL consists of slices containing an
integer number of macroblocks together with the information of the slice header; a macroblock is a
16x16 block of luma samples and the corresponding chroma samples. The NAL provides formatting and
encapsulation of the VCL output in a way compliant with the chosen transmission channel or storage
medium. Packet-oriented as well as bitstream systems are supported by adding appropriate
header information.

Higher-layer meta information necessary to appropriately handle the data and to operate the
decoder is conveyed in parameter sets. The specification distinguishes between two types of
parameter sets: the sequence parameter set (SPS) and the picture parameter set (PPS). An active sequence
parameter set remains unchanged throughout a coded video sequence and an active picture
parameter set remains unchanged within a coded picture. Higher-layer meta information is
supposed to be transmitted reliably and in advance. Figure 7 shows the layer abstraction of
MPEG-4 AVC.

A main property of the specification is the decoupling of the decoding process from time (e.g.
sampling time, transmission time, presentation time, etc.). The design requires only 16-bit
arithmetic for processing on the encoding and decoding side. Furthermore, it is the first MPEG video
standard achieving exactly matching decoded video quality across implementations, because of the
definition of an exact-match inverse transform.

[Figure 7 shows the layer structure: the VCL data of each frame is carried in NAL units (SPS, PPS, SEI, VCL); each NAL unit consists of a NAL header and a Raw Byte Sequence Payload (RBSP); a coded slice consists of a slice header and slice data, which in turn is a sequence of macroblocks (mb_type, mb_pred, coded_residual) and skip runs.]

Figure 7 – MPEG-4 AVC layer structure
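
For byte-stream (Annex B) storage, NAL units are separated by start codes, and the low five bits of the first NAL byte give the NAL unit type. The following sketch splits a byte stream accordingly; it is illustrative and deliberately ignores emulation prevention bytes.

import re

def iter_nal_units(stream: bytes):
    # Split an Annex-B byte stream on 0x000001 start codes (a 4-byte
    # 0x00000001 start code leaves a leading zero byte that rstrip removes
    # from the previous unit) and yield (nal_unit_type, nal_bytes).
    starts = [m.end() for m in re.finditer(b"\x00\x00\x01", stream)]
    for i, s in enumerate(starts):
        e = starts[i + 1] - 3 if i + 1 < len(starts) else len(stream)
        nal = stream[s:e].rstrip(b"\x00")
        if nal:
            yield nal[0] & 0x1F, nal   # type 7 = SPS, 8 = PPS, 1/5 = coded slice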

2.3.6. MPEG-4 Part 20: LASeR

The MPEG-4 Lightweight Application Scene Representation (LASeR) (ISO/IEC 14496-20:2006)
is a scene description format that specifies various aspects of 2D scene representation and updates
of scenes as part of rich media content. A scene description is composed of graphics, animation,
text, and spatial and temporal layout.

A scene description specifies the following areas of a presentation:

- Spatial layout of the visual elements
- Temporal organization of the media elements (e.g. synchronization)
- Interactivity (e.g. mouse clicks, key inputs)
- Change of scenes (e.g. animation effects)

LASeR is designed to be suitable for lightweight embedded devices such as mobile phones.

2.3.7. MPEG-7 Part 3: Visual

The Multimedia Content Description Interface, MPEG-7 (ISO/IEC 15938), specifies a series of
interfaces from the system to the application level to allow disparate systems to interchange information
about multimedia content. It describes the architecture for systems, a language for extensions and
specific applications, description tools in the audio and visual domains, as well as tools that are
not specific to the audio-visual domains.

ISO/IEC 15938-3 (MPEG-7 Part 3) Visual specifies tools for description of visual content,
including still images, video and 3D models. These tools are defined by their syntax in DDL and
binary representations and semantics associated with the syntactic elements. They enable
description of the visual features of the visual material, such as color, texture, shape and motion,
as well as localization of the described objects in the image or video sequence. The structure of
MPEG-7 Visual is as shown in Figure 8.

[Figure 8 groups the Visual Descriptor tools as follows:
- Basic structures: descriptor containers (GridLayout, TimeSeries, MultipleView) and basic supporting tools (TemporalInterpolation, Spatial2DCoordinateSystem)
- Color: feature descriptors (DominantColor, ScalableColor, ColorLayout, ColorStructure, GofGopColor) and supporting tools (ColorSpace, ColorQuantization)
- Texture: HomogeneousTexture, TextureBrowsing, EdgeHistogram
- Shape: RegionShape, ContourShape, Shape3D
- Motion: CameraMotion, MotionTrajectory, ParametricMotion, MotionActivity
- Localization: RegionLocator, SpatioTemporalLocator
- Other: FaceRecognition]

Figure 8 – Overview of Visual Descriptor tools

2.3.8. MPEG-7 Part 5: Multimedia Description Scheme

ISO/IEC 15938-5 (MPEG-7 Part 5) Multimedia Description Schemes specifies a metadata system
for describing multimedia content. It consists of the basic elements that form the building blocks for
the higher-level description tools; the content description tools that describe the features of the
multimedia content and the immutable metadata related to the multimedia content; the tools for
navigation and access, which describe the browsing, summarization and access of content; and
classification schemes, which organize the terms used by the description tools.

[Figure 9 groups the MDS description tools into: Basic Elements (schema tools, basic datatypes, links & media localization, basic tools); Content Description (structure, semantics) and Content Metadata (media, creation and production, usage); Content Organization (collections, models); Navigation and Access (summaries, views, variations); and User Interaction (user preferences, usage history).]

Figure 9 – Overview of MDS description tools

2.3.9. MPEG-21 Part 2: Digital Item Declaration

The Multimedia Framework, MPEG-21 (ISO/IEC 21000), provides content creators, producers,
distributors and service providers with a normative open framework for multimedia delivery and
consumption. It is based on two essential concepts: the definition of a fundamental unit of
distribution and transaction (the Digital Item) and the concept of Users interacting with Digital
Items. The goal of MPEG-21 is to define the technology needed to support Users in exchanging,
accessing, consuming, trading and otherwise manipulating Digital Items in an efficient, transparent and
interoperable way.

The purpose of the ISO/IEC 21000-2 (MPEG-21 Part 2) Digital Item Declaration specification is to describe
a set of abstract terms and concepts that form a useful model for defining Digital Items. Within this model, a
Digital Item is the digital representation of a work, and the actions (managed, described, exchanged,
collected, etc.) operate upon the model. An example of the hierarchical structure of the Digital Item
Declaration Model is shown in Figure 10.
[Figure 10 shows a Container holding Items; each Item carries Descriptors and Components (or nested Items), and each Component carries Descriptors and Resources.]

Figure 10 – Example of Digital Item Declaration model

2.3.10. MPEG-21 Part 4: Intellectual Property Management and Protection Components

ISO/IEC 21000-4 (MPEG-21 Part 4) Intellectual Property Management and Protection (IPMP)
Components aims to address the need for effective management and protection of intellectual
property in the Multimedia Framework over heterogeneous access and delivery infrastructures. It
specifies components for IPMP applied to Digital Items to facilitate the exchange of governed
content between peers. The standard includes the ways of retrieving IPMP tools from remote
locations and exchanging messages between IPMP tools and between these tools and the terminal. It
also addresses authentication of IPMP tools, and has provisions for integrating Rights Expressions
according to the Rights Data Dictionary and the Rights Expression Language.

The IPMP Components consist of two parts:

- The IPMP Digital Item Declaration Language, which provides for a protected representation of the
DID model, allowing a DID hierarchy which is encrypted, digitally signed or otherwise
governed to be included in a DID document in a schematically valid manner, and
- The IPMP Information schemas, defining structures for expressing information relating to the
protection of content, including tools, mechanisms and licenses. The IPMP information part is
flexible enough to also signal protection information for digital media which is not declared by
the DIDL model.

2.3.11. MPEG-21 Part 5: Rights Expression Language

ISO/IEC 21000-5 (MPEG-21 Part 5) Rights Expression Language is a tool that can declare rights
and permissions using the terms as defined in the Rights Data Dictionary. It is intended to provide
flexible, interoperable mechanisms to support transparent and augmented use of digital resources
in publishing, distributing, and consuming of digital movies, digital music, electronic books,
broadcasting, interactive games, computer software and other creations in digital form, in a way
that protects digital content and honors the rights, conditions, and fees specified for digital
contents. It is also intended to support specification of access and use controls for digital content
in cases where financial exchange is not part of the terms of use, and to support exchange of
sensitive or private digital content.

The REL is also intended to provide a flexible interoperable mechanism to ensure personal data is
processed in accordance with individual rights and to meet the requirement for Users to be able to
express their rights and interest in a way that addresses issues of privacy and use of personal data.
The MPEG REL data model for a rights expression consists of four basic entities and the
relationships among those entities. The basic relationship is defined by the REL assertion "grant".
Structurally, it consists of the following: the principal to whom the grant is issued, the right that
the grant specifies, the resource to which the right applies, and the condition that must be
met before the right can be exercised. This model is shown in Figure 11, while Figure 12 shows
the authorization model in MPEG-21 REL.

[Figure 11 shows the REL data model: a grant relates a Principal (to whom it is issued), a Right, a Resource (with which the right is associated) and a Condition (to which the right is subject).]

Figure 11 – REL Data Model

[Figure 12 shows the REL authorization model: an authorization request (r:Principal, r:Right, r:Resource, interval of time, authorization context) is evaluated against a primitive grant (r:Grant or r:GrantGroup) carried in an r:License; the license serves as authorization proof issued by an authorizer, alongside license and grant elements of the authorization context that do not require an authorizer.]

Figure 12 – REL Authorization Model
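
The four-entity grant model lends itself to a very small data model. The sketch below mirrors Figure 11 and the authorization idea of Figure 12 in plain Python; the condition is reduced to a hypothetical validity interval rather than a full REL condition evaluation.

from dataclasses import dataclass

@dataclass
class Grant:
    principal: str          # to whom the grant is issued
    right: str              # e.g. "play"
    resource: str           # the resource to which the right applies
    valid_from: int = 0     # simplified condition: a validity interval
    valid_to: int = 2**63 - 1

def authorize(grant: Grant, principal: str, right: str, resource: str, now: int) -> bool:
    # An authorization request succeeds when principal, right and resource
    # match the grant and the condition (here: the time interval) is satisfied.
    return (grant.principal == principal and grant.right == right
            and grant.resource == resource
            and grant.valid_from <= now <= grant.valid_to)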

2.3.12. MPEG-21 Part 9: File Format

ISO/IEC 21000-9 (MPEG-21 Part 9) File Format is a storage format inherited from the MPEG-4 file
format in order to make a multi-purpose file, in which an MPEG-21 XML document such as a DID,
IPMP and REL document and some or all of its referenced content can be placed in a single "content
package" file. This enables the interchange, editing and playback of MPEG-21 documents.

The main difference between the MPEG-21 file format and the MPEG-4 file format is the use of the meta box
at file level as a mandatory box. For a dual-function file, the MPEG-21 file format can contain both an
MP4 presentation using the movie box as well as an MPEG-21 DID, and it is permitted to use either an
MPEG-21 or an MPEG-4 player/reader.

In this project report, we will limit our description to the MAFs that are part of this project, which are
described in detail in the following sections.

3. Work Items

3.1. MPEG-A Part 3 2nd Edition: Protected Music Player Application Format

3.1.1. Overview

The Music Player MAF specifies a simple and uniform way to carry MP3 coded audio content in
MPEG-4 File Format augmented by simple MPEG-7 metadata and a JPEG image for cover art. As
such, MPEG-4, MPEG-7 and MPEG-21 represent an ideal environment to support the current
“MP3 music library” user experience, and, moreover, to extend that experience in new directions.

The Protected Music player AF builds on the Music Player AF. It adds content protection for mp4
song files, defines a default encryption for the song files, and adds protection to the mp21 album
and playlist files with flexible protection tool selection and key management components. The
following cases are possible:

a) Protected content files in mp4 file format, without Key Management components, with the
default AES-128 encryption tool and MPEG-4 IPMP-X signalling in the IPMPInfoBox;
b) Protected files with flexible tool selection and Key Management components (MPEG-21
IPMP and REL) using the mp21 file format with embedded mp4 content files;
c) A protected mp21 file with Key Management components (MPEG-21 IPMP and REL) but
without an embedded mp4 content file (a variation of (b)) that functions as a "license file" for an
external protected mp4 content file as in (a).

[Figure 13 shows the possible pairings: an mp21 file linking to an external license/KMS file for a protected mp4 file; an mp21 license file linking to external protected content; and mp21 files embedding protected mp4 files together with one or more licenses.]

Figure 13 — Examples illustrating the different cases for the relationship of mp4 and mp21 files
Optional separation of protected content and license supports a broad range of "governed content
scenarios" including “super distribution of protected content” and “subscription models”. Figure
13 illustrates the different cases and gives some examples.

3.1.2. File format

The Music player AF structure uses the MPEG-4 file format. It consists of a file type box, a movie box and a
media data box. The 'mdia' box and the child boxes of its sample description are used to
find and decode the MP3 data. The combination of the 'iloc' and 'iinf' boxes in the movie-level 'meta'
box is used to present the JPEG image.

The Music player AF specification also allows the use of the MPEG-21 file format to store an album
with one or more tracks. Each track is structured using the MPEG-4 file format (known as a
"hidden MP4 file"), while its presentation is described using 'iloc' and 'iinf' boxes in the file-
level 'meta' box.

When a single hidden mp4 file is embedded in an mp21 file, the IPMP information is signalled in
the form of an XML metadata description (in its original MPEG-21 IPMP Base Profile form). The
protection description is carried in the 'meta' box at file level using MPEG-21 DID and MPEG-21
IPMP_DIDL.

Figure 14 shows an illustration of the approach. The IPMPDIDL metadata contains two major
parts. Note that the structure described below is not an exhaustive one; additional DID
elements may exist:

– A Descriptor that contains IPMPGeneralInfo. It is recommended that this Descriptor is
defined at the beginning of the IPMPDIDL metadata. The IPMPGeneralInfo contains:

  - A ToolList, as defined in the MPEG-21 IPMP Base Profile.
  - A Container for licenses. The license information is described by the MPEG-21 REL MAM
Profile.

– An Item element that models the structure of the Protected Music Player MAF content. The
Item shall contain at least three child Container elements. Each Container carries a
Resource element for each sub-resource of the Protected Music Player MAF (mp4 file with
MP3onMP4 audio track, JPEG image and MPEG-7 metadata). If the sub-resource is protected,
the Resource element shall have an IPMPInfo element that describes the protection
mechanism.

The protection mechanism for a multiple-track file is similar to the case of a single-track file with
MPEG-21 metadata and file type. In the multiple-track case the same approach is applied as for
mp21 files with one embedded hidden mp4 file; however, the structure of the digital item in the
DIDL/IPMPDIDL has one more level.

Figure 15 shows the illustration for the case of protecting a multiple-track mp21 album file. Note
that the structure of the IPMPDIDL metadata now has several Item elements. Each Item element is
associated with one hidden mp4 file in the 'mdat' box.
[Figure 14 shows the file layout: the file-level 'meta' box carries an 'iloc' box referencing Item 1 (MP3onMP4), Item 2 (JPEG image) and Item 3 (MPEG-7 XML), and an 'xml' box holding the IPMPDIDL metadata (a Descriptor with IPMPGeneralInfo, ToolList and licenses, and an Item whose Containers carry the Resource and IPMPInfo elements for each sub-resource); the 'mdat' box holds the hidden MP4 file with the MP3onMP4 access units, the JPEG image and the MPEG-7 XML.]

Figure 14 — Protected Music player AF file format with one hidden MP4 file

[Figure 15 shows the multiple-track layout: the 'iloc' box references the items of each hidden MP4 file in the 'mdat' box, and the IPMPDIDL metadata in the 'xml' box contains one Item (with its Containers, Resources and IPMPInfo elements) per hidden MP4 file.]

Figure 15 — Protected Music player AF file with more than one hidden MP4 files

3.1.3. System architecture

The Music player AF specifies a lossless, reversible conversion of a standard mp3 bitstream file into
the MPEG-4 file structure described in the previous subsection. An MP3 bitstream file (containing
mp3 audio frames and an ID3 tag) is converted using two modules, as shown in Figure 16.
The first module translates an MP3 bitstream into a series of MP3 Access Units. This is
accomplished by the MP3onMP4 formatter, specified in ISO/IEC 14496-3:2005 Subpart 9. The
Access Units are stored in one audio track of an MPEG-4 File.

The second module extracts the metadata information from the input file's ID3 tag and expresses
it as MPEG-7 descriptors (see section 3.1.4). This MPEG-7 metadata is stored, together with the
optional JPEG image for cover art, in the corresponding meta box of the audio track.

[Figure 16 shows the creation flow: an MP3 file with ID3 tags is passed through the MP3onMP4 formatter to produce the MP3onMP4 data of the MP4 file, while the ID3 tags are extracted and expressed as MPEG-7 metadata, stored alongside the JPEG cover-art image.]

Figure 16 – Creating Music player AF

Playback consists of extracting the metadata from the MPEG-4 file and displaying it on a suitable
visual interface, and extracting the MP3onMP4 data from the MPEG-4 file, filtering it with a very
lightweight de-formatting operation, and playing it through a "classic" MP3 decoder. In practice,
the MP3onMP4 data may be played by an "MP3onMP4 decoder", consisting of the
concatenation of the MP3onMP4 deformatter and the MP3 decoder. This is illustrated
in Figure 17.

[Figure 17 shows playback: the MP3onMP4 data is de-formatted and fed to an MP3 decoder (together forming the MP3onMP4 decoder); the MPEG-7 metadata is used to display artist, album and song information, and the JPEG image is decoded to display the album art.]

Figure 17 – Playing Music player AF

3.1.4. Metadata

The MPEG-4 file format supports the storage of metadata associated with a data track. Associated
metadata describing the audio track, like artist or song name, is expressed in MPEG-7
nomenclature, as specified in ISO/IEC 15938-5:2003. MP3 bitstream files can contain associated
metadata, typically ID3 tags. The specific mapping from ID3 v1.1 tags and the corresponding ID3
v2.3 frames to MPEG-7 metadata is shown in Table 2. Parenthetical comments under Artist clarify
that MPEG-7 is able to distinguish between the Artist as a person and the Artist as a group name.

Table 2 — Mapping from ID3 v1.1 and ID3 v2.3 Tags to MPEG-7

ID3 v1     | ID3 v2.3 Frame                                | Description                 | MPEG-7 Path
Artist     | TOPE (Original artist / performer)            | Artist performing the song  | CreationInformation/Creation/Creator[Role/@href="urn:mpeg:mpeg7:RoleCS:2001:PERFORMER"]/Agent[@xsi:type="PersonType"]/Name/{FamilyName, GivenName} (Artist Name); CreationInformation/Creation/Creator[Role/@href="urn:mpeg:mpeg7:RoleCS:2001:PERFORMER"]/Agent[@xsi:type="PersonGroupType"]/Name (Group Name)
Album      | TALB (Album / Movie / Show title)             | Title of the album          | CreationInformation/Creation/Title[@type="albumTitle"]
Song Title | TIT2 (Title / Songname / Content description) | Title of the song           | CreationInformation/Creation/Title[@type="songTitle"]
Year       | TORY (Original release year)                  | Year of the recording       | CreationInformation/CreationCoordinates/Date/TimePoint (Recording date)
Comment    | COMM                                          | Any comment of any length   | CreationInformation/Creation/Abstract/FreeTextAnnotation
Track      | TRCK (Track number / Position in set)         | CD track number of song     | Semantics/SemanticBase[@xsi:type="SemanticStateType"]/AttributeValuePair
Genre      | TCON (Content type)                           | ID3 V1.1 Genre              | CreationInformation/Classification/Genre[@href="urn:id3:v1:4"]
           |                                               | ID3 V2 Genre (4)(Eurodisco) | CreationInformation/Classification/Genre[@href="urn:id3:v1:4"]/Term[@termID="urn:id3:v2:Eurodisco"], or CreationInformation/Classification/Genre[@href="urn:id3:v1:4"] together with CreationInformation/Classification/Genre[@type="secondary"][@href="urn:id3:v2:Eurodisco"]
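
The left-hand side of this mapping comes from the ID3 v1.1 tag, a fixed 128-byte record at the end of the MP3 file. A minimal reader sketch (field offsets per the public ID3 v1.1 layout; error handling omitted):

def read_id3v1(path: str):
    with open(path, "rb") as f:
        f.seek(-128, 2)              # the tag occupies the last 128 bytes
        tag = f.read(128)
    if tag[:3] != b"TAG":
        return None                  # no ID3 v1 tag present
    text = lambda b: b.split(b"\x00")[0].decode("latin-1").strip()
    return {
        "title":   text(tag[3:33]),
        "artist":  text(tag[33:63]),
        "album":   text(tag[63:93]),
        "year":    text(tag[93:97]),
        "comment": text(tag[97:125]),
        # ID3 v1.1: a zero byte at offset 125 means byte 126 holds the track number
        "track":   tag[126] if tag[125] == 0 else None,
        "genre":   tag[127],         # e.g. 80 = Folk, as in Table 3
    }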

Table 3 — ID3 information of a song

ID3 V1.1    | Value
Song Title  | If Ever You Were Mine
Album Title | Celtic Legacy
Artist      | Natalie MacMaster
Year        | 1995
Comment     | AG# 3B830D8
Track       | 05
Genre       | 80 (Folk)
The MPEG-7 Path notation is shorthand for the full XML notation, and an example of the
correspondence between MPEG-7 Path and XML notation is shown in the following tables. Table 3 shows
example ID3 information and the corresponding values. The MPEG-7 metadata representation of this ID3
information is shown in Table 4.

Table 4 — MPEG-7 instantiation example representing the ID3 information

<?xml version="1.0" encoding="UTF-8"?>
<!-- ID3 V1.1 Example -->
<Mpeg7 xmlns="urn:mpeg:mpeg7:schema:2001" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:mpeg:mpeg7:schema:2001 C:\mpeg7\is\Mpeg7-2001.xsd">
<Description xsi:type="CreationDescriptionType">
<CreationInformation id="track-05">
<Creation>
<!-- ID3 Song Title -->
<Title type="songTitle">If Ever You Were Mine</Title>
<!-- ID3 Album Title -->
<Title type="albumTitle">Celtic Legacy</Title>
<!-- ID3 Comment -->
<Abstract>
<FreeTextAnnotation>AG# 3B8308D8</FreeTextAnnotation>
</Abstract>
<!-- ID3 Artist -->
<Creator>
<Role href="urn:mpeg:mpeg7:RoleCS:2001:PERFORMER"/>
<Agent xsi:type="PersonType">
<Name>
<FamilyName>MacMaster</FamilyName>
<GivenName>Natalie</GivenName>
</Name>
</Agent>
</Creator>
<!-- ID3 Year -->
<CreationCoordinates>
<Date>
<TimePoint>1995</TimePoint>
</Date>
</CreationCoordinates>
</Creation>
<!-- ID3 Genre (80 = Folk) -->
<Classification>
<Genre href="urn:id3:cs:ID3genreCS:v1:80">
<Name>Folk</Name>
</Genre>
</Classification>
</CreationInformation>
</Description>
<Description xsi:type="SemanticDescriptionType">
<Semantics>
<SemanticBase xsi:type="SemanticStateType">
<!-- ID3 Track -->
<AttributeValuePair>
<Attribute>
<TermUse
href="urn:mpeg:maf:cs:musicplayer:CollectionElementsCS:2007:assetNum"/>
</Attribute>
<IntegerValue>6</IntegerValue>
</AttributeValuePair>
<!-- ID3v2 TRCK /12-->
<AttributeValuePair>
<Attribute>
<TermUse
href="urn:mpeg:maf:cs:musicplayer:CollectionElementsCS:2007:assetTot"/>
</Attribute>
<IntegerValue>12</IntegerValue>
</AttributeValuePair>
<!-- ID3v2 TPOS 1/2-->
<AttributeValuePair>
<Attribute>
<TermUse
href="urn:mpeg:maf:cs:musicplayer:CollectionElementsCS:2007:volumeNum"/>
</Attribute>
<IntegerValue>1</IntegerValue>
</AttributeValuePair>
<AttributeValuePair>
<Attribute>
<TermUse
href="urn:mpeg:maf:cs:musicplayer:CollectionElementsCS:2007:volumeTot"/>
</Attribute>
<IntegerValue>2</IntegerValue>
</AttributeValuePair>
</SemanticBase>
</Semantics>
</Description>
</Mpeg7>

3.2. MPEG-A Part 4: Musical Slide Show Application Format

3.2.1. Overview

The existing Music player AF was designed as a simple format for enhanced MP3 players. It
contains MP3 audio data, optional MPEG-7 metadata and a JPEG still image for cover art. The
Photo Player MAF under development combines JPEG still images with MPEG-7 metadata. The
Musical Slide show AF builds on top of the Music player AF and the Photo player AF and is meant as
a superset of these two MAFs.

The use of the Musical slide show AF is presented in the following use cases:

- Foreign language exercise materials
- Photo-music album applications
- Storytelling applications
- Personal slideshow applications
- Karaoke applications
- Karaoke + slideshow applications

3.2.2. File format

The normative file structure of the Musical slide show AF, as seen in Figure 18, consists of three
boxes at the file level: the File Type Box ('ftyp'), the Movie Box ('moov') and the Media Data Box
('mdat'). The 'ftyp' box defines the type of file format that the file structure complies with.
For the file type box, the major brand is 'mp42'. The brand that identifies the Musical slide show
application format is 'mss1'; it is used as a compatible brand for the
Musical slide show application format. The 'moov' box contains three types of tracks
(slide show, audio, text) and a metadata box:

- Normative Slide show Track Box (a 'trak' box for timed JPEG images)
- Normative Audio Track Box (MP3 audio)
- Optional Text Track Box (timed text)
- Optional Metadata Box (media resource information and LASeR scene description)
[Figure 18 shows the Musical slide show file format: the movie-level 'meta' box carries 'iloc'/'iinf' entries (item_ID 1: MP3 with a relative URI and content type audio/mp3; item_IDs 2 to n+1: JPEG images, content type image/jpeg; item_ID n+2: timed text, content type text) and an 'xml' box with the LASeR scene description; the 'moov' box holds the MP3 audio track, the JPEG slide show track and the 3GPP TS 26.245 text track ('tx3g'), each with its 'stbl' box and MPEG-7 'meta' box; the 'mdat' box holds the actual media data.]

Figure 18 — Musical slide show file format

The 'trak' boxes contain temporal and spatial information about the media data (JPEG images, MP3
audio, timed text). For the Musical slide show application format, all the images that are used in
the slide show presentation are arranged in a single track.

A Musical slide show player shall support application formats with the following number of
tracks:

- A single slide show track (normative)
- A single audio track (normative)
- A single text track (optional)

The track handler types for the above tracks are:

- 'vide' for the slide show track
- 'soun' for the audio track
- 'text' for the text track

The movie-level metadata box ('meta') contains the item information box ('iinf') and the item
location box ('iloc'). For each media data item, an item ID is assigned, and the physical location and
size of the media data are contained in the item location box. The item name and the content type
information are contained in the item information box.

The 'xml' box located in the movie-level 'meta' box contains the LASeR scripts
responsible for the animation effects, and since it exists as a single "file", the meta handler type
for the 'meta' box is 'lsr1'. The 'mdat' box contains the actual media data bytes.
For the Musical slide show application format, there are two possible rendering modes:

- Basic mode
- Enhanced mode

In "Basic" mode, the timed text and the slide show of JPEG images shall be rendered using the
sample table ('stbl') box in the file format. In "Enhanced" mode, the LASeR scene description
shall be responsible for coordinating the overall presentation (slide show, animation effects, and
timed text).

The operational flow for both Basic and Enhanced mode is shown in Figure 19.

[Figure 19 shows the decision flow: if the player has no LASeR handling capability, or the file carries no LASeR scene description, the presentation falls back to Basic mode (MP3, slide show without animation, timed text); otherwise the LASeR scene description drives Enhanced mode (slide show + animation + timed text).]

Figure 19 — Basic and Enhanced mode operational flow diagram
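
The decision logic of Figure 19 reduces to a single predicate; a trivial sketch:

def rendering_mode(player_handles_laser: bool, file_has_laser_scene: bool) -> str:
    # Enhanced mode needs both a LASeR-capable player and a LASeR scene
    # description in the file; everything else falls back to Basic mode.
    return "enhanced" if (player_handles_laser and file_has_laser_scene) else "basic"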

Basic mode

In "Basic" mode, MP3, JPEG images and timed text (3GPP TS 26.245) are rendered
concurrently using only the information (timing, sample size, sample offset) obtained from the
'stbl' boxes. Therefore, when the file is loaded, the 'moov' box is parsed first, and then the tracks
are read. For each track, the 'stbl' box is parsed in order to gain access to the spatial and
temporal information regarding the sample data. Players that are not capable of handling LASeR
scene descriptions may ignore the 'meta' box where the LASeR scene description ('xml' box) is
placed. In this mode, the JPEG images are rendered based on the timing information in the 'stbl'
box.

Enhanced mode

In "Enhanced" mode, animation effects shall be applied to the JPEG images in the slide show
presentation. The LASeR script is responsible for rendering the JPEG images and the timed text
data (using the 'text' element). Therefore, the part of the sample table that describes the timing
information for the JPEG images is ignored. In "Enhanced" mode, the timeline of the slide show
presentation is fully dependent on how the LASeR scene description is formed. The MP3 is played
in the same way as in "Basic" mode. However, the sample table box of the slide show track shall
still contain timing information in case the player does not support LASeR decoding.

3.2.3. System architecture

Creating a Musical slide show application format file involves formatting different types of media
data and storing them into an MPEG-4 file format. Figure 20 shows the system architecture for
creating a Musical slide show AF file. The Musical slide show AF consists of MP3 audio
(mandatory), JPEG images (mandatory), timed text (optional) and a LASeR scene description for
animation effects (optional).

Playing a Musical slide show AF file involves two parts: content extraction and content
synchronization. The method of content extraction for the Musical slide show AF is similar to the Music
player AF. The method of content synchronization is described in the next subsection.

[Figure 20 shows the MSS AF creator: it creates the MP4 file structure, formats the MP3 audio, JPEG images and timed text as multimedia tracks, and stores the animation effects as a LASeR scene description in the metadata of the MSS AF file.]

Figure 20 — System architecture of creating MSS AF


3.2.4. Synchronization

Synchronization of the media data is primarily achieved with the use of the sample table box
('stbl'). The sample table contains all the time and data indices of the media samples in a track.
For the slide show track, each JPEG image is considered to be a sample. Therefore, the timing
information (slide show duration) and the physical sizes and locations of the images in the
slide show presentation are stored inside the 'stbl' box. Specifically, the following sub-boxes
are used:

- 'stts' (Decoding Time to Sample Box)
- 'stsz' (Sample Size Box)
- 'stco' (Chunk Offset Box)

The slide show duration is stored in the 'stts' box, and the image size and the image location are
stored in the 'stsz' box and the 'stco' box respectively. Figure 21 shows an illustration of
allocating JPEG samples and referring to them from the sample table box.
[Figure 21 shows how the 'stts' (time), 'stsz' (size) and 'stco' (offset) entries of the sample table point into the 'mdat' box, where each JPEG image (JPEG 1 to JPEG n) is stored as one sample.]

Figure 21 — Allocating several JPEGs as a collection of samples
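
The slide-show schedule can be recovered by expanding the run-length-coded 'stts' entries into per-sample start times. A small sketch (times are in the track's media timescale units):

def sample_start_times(stts_entries):
    # stts_entries: list of (sample_count, sample_delta) runs from the 'stts' box
    t, times = 0, []
    for count, delta in stts_entries:
        for _ in range(count):
            times.append(t)
            t += delta
    return times

# e.g. three 5-second slides then two 2-second slides at timescale 1000:
# sample_start_times([(3, 5000), (2, 2000)]) -> [0, 5000, 10000, 15000, 17000]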

In order for the timeline of the text track to be aligned with the slide show track, the timing
information in the 'stts' box of the slide show track should act as the "clock". In other words,
the text track should be fully dependent on the timeline of the slide show track. Figure 22 shows
an example of a synchronized Musical slide show application format presentation.

[Figure 22 shows a synchronized presentation: LASeR animations accompany the image samples at their time stamps (0, 5, 7, 10 and 14 s in the example) while the synchronized text samples (lyrics 1 to 4) appear at their own time stamps (0, 4, 8 and 12 s).]

Figure 22 — Synchronizing resource


3.2.5. Animation

The animation effects for the Musical slide show application format focus on creating a more
entertaining image slide show presentation. Figure 23 shows an example of an animation effect
that combines image transformation and opacity control.

Figure 23 — Example of animation

It is important to note that the effects featured in the Musical slide show application format shall
only be used for image transitions during the slide show presentation, and they shall be comprised
of simple image filtering effects. In addition, the timing information within LASeR can be
defined independently of the timing information in the sample table box. Table 5
shows the functionalities and the LASeR description elements that shall be supported for the
basic transition effects.

Table 5 — List of basic transition effects

Effects                    | Functionalities                     | Description elements | Semantics
Grouping                   | Effect grouping                     | g                    | Defined in subclause 6.8.15 of ISO/IEC 14496-20:2006
Image referencing          | Image dimension, referencing image  | image                | Defined in subclause 6.8.16 of ISO/IEC 14496-20:2006
Opacity control            | Fade-in / fade-out                  | animate              | Defined in subclause 6.8.4 of ISO/IEC 14496-20:2006
Geometrical transformation | Translation, scale, rotation, skew  | animateTransform     | Defined in subclause 6.8.7 of ISO/IEC 14496-20:2006
Object motion              | Object motion on a predefined path  | animateMotion        | Defined in subclause 6.8.6 of ISO/IEC 14496-20:2006
Color change               | Changes object color                | animateColor         | Defined in subclause 6.8.5 of ISO/IEC 14496-20:2006
Attribute control          | Sets the value of an attribute      | set                  | Defined in subclause 6.8.28 of ISO/IEC 14496-20:2006
Shapes & Motion            | Motion path                         | path                 | Defined in subclause 6.8.22 of ISO/IEC 14496-20:2006
                           | Basic shapes                        | rect                 | Defined in subclause 6.8.26 of ISO/IEC 14496-20:2006
                           |                                     | circle               | Defined in subclause 6.8.9 of ISO/IEC 14496-20:2006
                           |                                     | ellipse              | Defined in subclause 6.8.13 of ISO/IEC 14496-20:2006
                           |                                     | line                 | Defined in subclause 6.8.17 of ISO/IEC 14496-20:2006
                           |                                     | polyline             | Defined in subclause 6.8.24 of ISO/IEC 14496-20:2006
                           |                                     | polygon              | Defined in subclause 6.8.23 of ISO/IEC 14496-20:2006

Using LASeR in a textual format provides an easy way to create and edit descriptions for the
animation effects, since the input data can simply be typed in, and the data itself is more
intuitive in terms of understanding the functionalities. Therefore, the textual format shall be the
normative way of using LASeR. In the Musical slide show application format, a reduced set of
scene description elements for animation is used in local playback settings; therefore, the data
size or the decoding speed is not expected to be an issue in terms of parsing or decoding the data. Figure 24
shows a possible model for a LASeR renderer.

Figure 24 — LASeR renderer model for Musical slide show application format

3.2.6. Timed text

Timed text is intended to be used for applications (e.g. “karaoke” and language study materials)
that require extensive use of textual presentation. In the Musical slide show application format,
there are two possible ways to render timed text.

For players that are not capable of handling LASeR scene descriptions (Basic mode), or for contents
that require only minimal textual presentation, the 3GPP TS 26.245 timed text format is used.
3GPP TS 26.245 timed text data consists of:

- Text samples
- Sample descriptions

A text sample consists of one text string and optional text modifiers. Figure 25 shows the
structure of the timed text. Sample descriptions and text modifiers are parameters that determine
how the text string is to be displayed. Sample descriptions provide global information, such as font,
position and background color, about a text sample or samples, whereas text modifiers provide
information about a text string when it is displayed. In the file format structure, the sample description
is located inside the sample table box, along with the timing information of the text sample associated
with it, which is located in the time-to-sample box within the same sample table box, as shown in
Figure 26. This synchronization method is similar to that of the slide show.

For the Musical slide show application format, there are four types of text modifiers (optional):

- 'styl' (for text style)
- 'hlit' (for highlighted text)
- 'krok' (for karaoke, closed captioning and dynamic highlighting)
- 'blnk' (for blinking text)

For detailed sample description and text modifier syntax, refer to the 3GPP TS 26.245 specification.
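
For orientation, a text sample in this format is a 16-bit byte count followed by the UTF-8 string, with any modifier boxes appended after it. The sketch below serializes a sample with a single 'styl' modifier; the StyleRecord field order follows TS 26.245, but treat this as an illustrative sketch (with arbitrary example style values) rather than a conformant implementation.

import struct

def text_sample(text: str, modifiers: bytes = b"") -> bytes:
    data = text.encode("utf-8")
    return struct.pack(">H", len(data)) + data + modifiers  # length, string, modifier boxes

def styl_box(start_char: int, end_char: int) -> bytes:
    # One StyleRecord: startChar, endChar, font-ID, face-style-flags,
    # font-size, text-color-rgba (the style values here are arbitrary examples)
    record = struct.pack(">HHHBB4B", start_char, end_char, 1, 0, 12, 255, 255, 255, 255)
    payload = struct.pack(">H", 1) + record                  # entry count, then records
    return struct.pack(">I4s", 8 + len(payload), b"styl") + payload

# e.g. text_sample("If Ever You Were Mine", styl_box(0, 7)) styles the first words.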

For players that are capable of handling LASeR scene descriptions (Enhanced mode), the 'text'
element in LASeR is used for timed text functionality. The supported functionalities (optional) of
timed text are:

- Characters and glyphs support
- Font support
- Color support
- Text rendering position and composition
- Highlighting, closed captioning and "karaoke"

[Figure 25 shows the timed text structure: sample descriptions reside in the sample table box ('stbl'); each text sample in the media data box ('mdat') is associated with one sample description; and each text sample consists of a text string (a sequence of characters) plus its text modifiers.]

Figure 25 — Timed text structure

[Figure 26 shows timed text in the file format: the 'stbl' box inside 'trak'/'mdia' carries the sample description ('stsd'/'tx3g') and the timing information ('stts'), which reference the text samples stored in 'mdat'.]

Figure 26 — Timed text structure in file format


3.2.7. Metadata

For the Musical slide show application format, metadata provides simple background information,
such as creation date, artist/creator information, and title of a photo series or a song.

The two types of normative metadata (textual XML) included in the Musical slide show
application format are:

- Collection- and item-level metadata for the slide show track (collection-level metadata allow
users to define groups/categories/sets of photos and to store metadata relating to
those groups, independently of the ordering of the slide show; item-level metadata enable
content-based search of slides)
- Metadata for the audio track

The Musical slide show application format file structure allows the metadata to be stored inside
the media tracks. The metadata handler type is 'mp7t' for both the slideshow and audio tracks. Figure
27 shows the locations of the metadata in the Musical slide show application format file structure.

[Figure 27 shows the metadata locations: the MP3 audio track and the JPEG slide show track each carry an MPEG-7 'xml' box in their track-level 'meta' boxes; the movie-level 'meta' box carries the 'iloc'/'iinf' items and the LASeR 'xml' box; the timed text track (3GPP TS 26.245) carries no MPEG-7 metadata.]

Figure 27 — MPEG-7 metadata in MSS AF file format

3.2.7.1. Metadata for slide show

For the Musical slide show application format, images are structurally arranged in a single
track; therefore, both collection- and item-level metadata are contained inside the slide show
track as a single XML document.

For the collection-level and item-level (metadata for individual photos) descriptive metadata, the
MPEG-7 ContentCollection and Image DS are used, respectively (aligned with the
Photo player application format). In order to combine the two types of metadata, the Image DS for
the item-level metadata shall be contained under the Content element in the
ContentCollection DS for the collection-level metadata. Every photo in the file shall
have a corresponding Content element in the root collection. This means there shall be as
many Content elements in the root collection as there are photos.
In the item-level metadata, the MediaLocator DS is used for associating the metadata
pertaining to an individual image with its resource data within the file, identified by its
item_ID. At the collection level, images can be referenced using the ContentRef element.
The ContentRef element shall only exist in sub-collections. The normative specification of
all semantics is given in ISO/IEC 15938-2 and ISO/IEC 15938-5:2003.

The slide show track schema is defined with respect to the MPEG-7 Version 2 schema as
specified in ISO/IEC 15938-10. The namespace of the Version 2 schema providing a basis for
the slide show track schema is “urn:mpeg:mpeg7:schema:2004”.

3.2.7.2. Metadata for audio

The metadata for the audio track provides the song title, name of the artist, album title, year
and genre of the audio content. It is the same as the metadata for audio used in Music player
AF. Please refer to section 3.1.4 in this document for the semantics and example.

3.3. MPEG-A Part 4 2nd Edition: Protected Musical Slide Show Application Format

3.3.1. Overview

The Protected Musical slide show application format builds on the Musical slide show AF as
described in section 3.2. It adds content protection for MP3 audio, JPEG images, 3GPP Timed
Text, and LASeR script animation with flexible protection tool selection and key management
components.

The two rendering modes of the Musical slide show AF, "Basic" mode and "Enhanced" mode as
described in section 3.2, are still available in the Protected Musical slide show AF. In this
case, the protected resources shall be unprotected (decrypted) prior to rendering. As a player in Basic
mode may ignore the LASeR script for animation, the protection for the LASeR script may also be
ignored in this mode.

The application shall be able to read the MPEG-21 DID metadata stored in the movie-level 'meta' box.
By parsing it and executing the protection scheme to unprotect the protected contents (if any), the
output of the device shall be the same as the output of a Musical slide show AF without the MPEG-
21 DID metadata, in either "Basic" or "Enhanced" mode.

As in the Musical slide show AF, the Protected Musical slide show AF uses the same method of
synchronizing contents. The protection scheme is designed not to alter or modify the
synchronization of contents. Therefore, media synchronization and animation
are not described again in this section.

Table 6 shows a comparison of the technologies used in the Musical slide show AF and the
Protected Musical slide show AF.

Table 6 — Technologies used in Musical Slide Show AF

Technologies                         | Brand 'mss1' | Brand 'mss2'
MPEG-1/2 Layer 3 Audio               | ✓ | ✓
JPEG Images                          | ✓ | ✓
MPEG-7 metadata                      | ✓ | ✓
LASeR script for animation           | ✓ | ✓
3GPP TS 26.245 timed text data       | ✓ | ✓
MPEG-21 DID                          |   | ✓
MPEG-21 IPMP Components Base Profile |   | ✓
MPEG-21 REL MAM Profile              |   | ✓
MPEG-21 Fragment Identifier          |   | ✓

3.3.2. File format

The file format of the Protected Musical slide show AF is basically the same as that of the Musical
slide show AF. It has three tracks, a JPEG slide show, MP3 audio and 3GPP timed text, which
correspond to three 'trak' boxes in the movie box. The difference is the inclusion of MPEG-21 DID in
the 'meta' box, which contains the IPMPDIDL metadata for the protection information. Because of this, the
LASeR script is now part of the MPEG-21 DID in the same 'xml' box as in the Musical slide show AF file
format. Figure 28 shows the Protected Musical slide show AF structure.

The 'ftyp' box of the ISO Base Media File Format contains a list of "brands" that are used as
identifiers in the file format. To enable player applications to easily identify files which are
compliant with this AF specification, specific brand identifiers are defined. These brands are used in
the compatible-brands list in addition to other appropriate brand types, like "iso2", "mp42" or
"mp21". The brand that identifies the Protected Musical slide show AF is 'mss2'. The brand
follows the brand of the Musical slide show AF, with the trailing number changed to '2'
to indicate the 2nd Edition.

[Figure 28 shows the Protected Musical slide show AF file structure: as in Figure 18, the movie-level 'meta' box carries the 'iloc'/'iinf' items for the MP3 (item_ID 1, audio/mp3), the JPEG images (item_IDs 2 to n+1, image/jpeg) and the timed text (item_ID n+2, text), but its 'xml' box now holds the MPEG-21 DID: a Descriptor carrying IPMPGeneralInfo (ToolList, License), a Descriptor carrying the LASeR script, and Items whose Components carry the Resource (audio/mp3, image/jpeg, text) and IPMPInfo elements; the 'moov' box holds the slide show, MP3 audio and text ('tx3g') tracks with their 'stbl' and MPEG-7 'meta' boxes, and the 'mdat' box holds the media data.]

Figure 28 — Protected Musical slide show AF file structure


3.3.3. System architecture

Creating a protected Musical slide show AF file involves formatting different types of media data,
defining the protection and license information, and storing them into an MPEG-4 file format.
Based on the Musical slide show AF system architecture described in section 3.2.3, the protection
module is included in the Creator to protect the resources based on the protection and license
description. Figure 29 shows an example of protected Musical slide show AF creator system
architecture. MP3 audio, JPEG images, and text data are formatted as individual MP4 media
tracks. Descriptions for the animation effects are stored as LASeR scene description in XML
format. These resources are described in a structured way using MPEG-21 Digital Item Declaration
Language (DIDL), while the protection and license information for the protected resource is
described using MPEG-21 Intellectual Property Management and Protection (IPMP) and MPEG-
21 Rights Expression Language (REL).

[Figure: the Protected MSS AF Creator takes MP3 audio, JPEG images, text (timed text) and animation effects; it creates the MP4 file structure and multimedia tracks, protects the resources according to the MPEG-21 DIDL/IPMP/REL protection and license information, stores the metadata (LASeR scene description), and outputs the Protected MSS AF file.]

Figure 29 — Protected Musical slide show AF creator system architecture

3.3.4. Metadata

Figure 30 shows how the information is described by IPMP and which resources are protected. The
mechanism for signalling protection of the resources in the metadata is as follows: the MP3 audio
and the 3GPP timed text are each described as one item, while the collection of JPEG images (the
slide show) is described as one item containing the individual JPEG images as components, and the
LASeR script for animation is described as a Descriptor. If a resource is protected, it is described
as a ProtectedAsset using the IPMP DIDL description scheme.
[Figure: as in Figure 28, but the LASeR script Descriptor and the Resources for audio/mp3, image/jpeg and text carry IPMPInfo, and the 'mdat' payloads are the protected JPEGs, protected MP3 and protected 3GPP TS 26.245 timed text.]

Figure 30 — Protected resources pointed by the metadata

The IPMPDIDL metadata contains a Descriptor that contains IPMPGeneralInfo. It is recommended that
this Descriptor be defined at the beginning of the IPMPDIDL metadata. The IPMPGeneralInfo contains:

- ToolList, as defined in the MPEG-21 IPMP Base Profile.
- A container for licenses. The license information is described by the MPEG-21 REL MAM Profile.

With this specification, the collection- and item-level metadata is left intact. MPEG-21 is not used
to describe any information already described by the MPEG-7 metadata in the audio and slide show
'trak's. Instead, it is used only to describe the structure and the governance of the multimedia
content inside the AF.

An example metadata instantiation protecting the contents shown in Figure 30 is given in Table 7.
The tool list is carried at the top of the IPMP DIDL together with the necessary license collection.
As described in the ipmpinfo:IPMPToolID element, the AES-128 encryption tool can be signalled by
using the tool name as the identification tag. The linkage to the tool used for content protection
is established by the localID attribute of the ToolList entry in the IPMPGeneralInfoDescriptor and
the localidref attribute in the IPMPInfoDescriptor. The items in the IPMP DIDL represent the MP3
audio, the slide show (JPEG images) and the timed text. The protected resources are described by the
Resource element. The structure of the IPMP Components Base profile is shown in Figure 31.
[Figure: panels (a) and (b) showing the structure of the IPMP Components Base Profile.]
Figure 31 — IPMP Component Base Profile

Table 7 — Metadata example of protecting contents in Protected Musical slide show AF

<?xml version="1.0" encoding="UTF-8"?>


<DIDL xmlns="urn:mpeg:mpeg21:2002:02-DIDL-NS" xmlns:dii="urn:mpeg:mpeg21:2002:01-DII-NS"
 xmlns:ipmpdidl="urn:mpeg:mpeg21:2004:01-IPMPDIDL-NS"
 xmlns:ipmpinfo="urn:mpeg:mpeg21:2004:01-IPMPINFO-BASE-NS"
 xmlns:mx="urn:mpeg:mpeg21:2003:01-REL-MX-NS" xmlns:r="urn:mpeg:mpeg21:2003:01-REL-R-NS"
 xmlns:sx="urn:mpeg:mpeg21:2003:01-REL-SX-NS" xmlns:enc="http://www.w3.org/2001/04/xmlenc#"
 xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="urn:mpeg:mpeg21:2003:01-REL-R-NS rel-r.xsd
                     urn:mpeg:mpeg21:2003:01-REL-MX-NS rel-mx.xsd
                     urn:mpeg:mpeg21:2004:01-IPMPINFO-BASE-NS IPMPInfo-Profilev0.4.xsd
                     urn:mpeg:mpeg21:2004:01-IPMPDIDL-NS IPMPDIDL.xsd
                     urn:mpeg:mpeg21:2002:01-DII-NS dii.xsd
                     urn:mpeg:mpeg21:2002:02-DIDL-NS DIDL.xsd">
<Container>
<Descriptor>
<Statement mimeType="text/xml">
<ipmpinfo:IPMPGeneralInfoDescriptor>
<ipmpinfo:ToolList>
<ipmpinfo:ToolDescription localID="10">
<ipmpinfo:IPMPToolID>AES-128-CBC</ipmpinfo:IPMPToolID>
</ipmpinfo:ToolDescription>
</ipmpinfo:ToolList>
<ipmpinfo:LicenseCollection>
<ipmpinfo:RightsDescriptor>
<ipmpinfo:License>
<!-- license information -->
</ipmpinfo:License>
</ipmpinfo:RightsDescriptor>
</ipmpinfo:LicenseCollection>
</ipmpinfo:IPMPGeneralInfoDescriptor>
</Statement>
</Descriptor>
<Descriptor>
<Statement mimeType="application/ipmp">
<ipmpdidl:ProtectedAsset mimeType="application/laser">
<ipmpdidl:Identifier>
<dii:Identifier>IPMPId001</dii:Identifier>
</ipmpdidl:Identifier>
<ipmpdidl:Info>
<ipmpinfo:IPMPInfoDescriptor>
<ipmpinfo:Tool>
<ipmpinfo:ToolRef localidref="10"/>
</ipmpinfo:Tool>
</ipmpinfo:IPMPInfoDescriptor>
</ipmpdidl:Info>
<ipmpdidl:Contents>
<![CDATA[-------------- animation script or code ---------]]>
</ipmpdidl:Contents>
</ipmpdidl:ProtectedAsset>
</Statement>
</Descriptor>
<Item id="1">
<Component>
<Resource mimeType="application/ipmp">
<ipmpdidl:ProtectedAsset mimeType="audio/mp3">
<ipmpdidl:Identifier>
<dii:Identifier>IPMPId002</dii:Identifier>
</ipmpdidl:Identifier>
<ipmpdidl:Info>
<ipmpinfo:IPMPInfoDescriptor>
<ipmpinfo:Tool>
<ipmpinfo:ToolRef localidref="10"/>
</ipmpinfo:Tool>
</ipmpinfo:IPMPInfoDescriptor>
</ipmpdidl:Info>
<ipmpdidl:Contents ref="#mp(/byte(0,4550000))"/>
</ipmpdidl:ProtectedAsset>
</Resource>
</Component>
</Item>
<Item id="2">
<Component>
<Resource mimeType="application/ipmp">
<ipmpdidl:ProtectedAsset mimeType="image/jpeg">
<ipmpdidl:Identifier>
<dii:Identifier>IPMPId003</dii:Identifier>
</ipmpdidl:Identifier>
<ipmpdidl:Info>
<ipmpinfo:IPMPInfoDescriptor>
<ipmpinfo:Tool>
<ipmpinfo:ToolRef localidref="10"/>
</ipmpinfo:Tool>
</ipmpinfo:IPMPInfoDescriptor>
</ipmpdidl:Info>
<ipmpdidl:Contents ref="#mp(/byte(4550000,4550370))"/>
</ipmpdidl:ProtectedAsset>
</Resource>
</Component>
<Component>
<Resource mimeType="application/ipmp">
<ipmpdidl:ProtectedAsset mimeType="image/jpeg">
<ipmpdidl:Identifier>
<dii:Identifier>IPMPId004</dii:Identifier>
</ipmpdidl:Identifier>
<ipmpdidl:Info>
<ipmpinfo:IPMPInfoDescriptor>
<ipmpinfo:Tool>
<ipmpinfo:ToolRef localidref="10"/>
</ipmpinfo:Tool>
</ipmpinfo:IPMPInfoDescriptor>
</ipmpdidl:Info>
<ipmpdidl:Contents ref="#mp(/byte(4550370,4550892))"/>
</ipmpdidl:ProtectedAsset>
</Resource>
</Component>
</Item>
<Item id="3">
<Component>
<Resource mimeType="application/ipmp">
<ipmpdidl:ProtectedAsset mimeType="text/txt">
<ipmpdidl:Identifier>
<dii:Identifier>IPMPId005</dii:Identifier>
</ipmpdidl:Identifier>
<ipmpdidl:Info>
<ipmpinfo:IPMPInfoDescriptor>
<ipmpinfo:Tool>
<ipmpinfo:ToolRef localidref="10"/>
</ipmpinfo:Tool>
</ipmpinfo:IPMPInfoDescriptor>
</ipmpdidl:Info>
<ipmpdidl:Contents ref="#mp(/byte(4550892,4550900))"/>
</ipmpdidl:ProtectedAsset>
</Resource>
</Component>
</Item>
</Container>
</DIDL>

Several protection scenarios are possible:

Protecting an individual item

Each resource (MP3 audio, JPEG image, 3GPP timed text, and LASeR animation script) can be protected
individually:

- Figure 32 shows how the protection can be described in the metadata for the MP3 audio only.
- Figure 33 shows how the protection can be described in the metadata for the JPEG slide show (all
  JPEGs in the slide show track treated as one single slide show item).
- Figure 34 shows how the protection can be described in the metadata for the 3GPP timed text only.
- Figure 35 shows how the protection can be described for the LASeR script only.


Figure 32 — Protecting MP3 audio



Figure 33 — Protecting JPEG slide show


Figure 34 — Protecting 3GPP timed text


Figure 35 — Protecting LASeR script

Table 8 shows an example metadata instantiation protecting only the MP3 audio.
Table 8 — Metadata instantiation example of protecting MP3 audio

<?xml version="1.0" encoding="UTF-8"?>


<DIDL xmlns="urn:mpeg:mpeg21:2002:02-DIDL-NS" xmlns:dii="urn:mpeg:mpeg21:2002:01-DII-NS"
 xmlns:ipmpdidl="urn:mpeg:mpeg21:2004:01-IPMPDIDL-NS"
 xmlns:ipmpinfo="urn:mpeg:mpeg21:2004:01-IPMPINFO-BASE-NS"
 xmlns:mx="urn:mpeg:mpeg21:2003:01-REL-MX-NS" xmlns:r="urn:mpeg:mpeg21:2003:01-REL-R-NS"
 xmlns:sx="urn:mpeg:mpeg21:2003:01-REL-SX-NS" xmlns:enc="http://www.w3.org/2001/04/xmlenc#"
 xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="urn:mpeg:mpeg21:2003:01-REL-R-NS rel-r.xsd
                     urn:mpeg:mpeg21:2003:01-REL-MX-NS rel-mx.xsd
                     urn:mpeg:mpeg21:2004:01-IPMPINFO-BASE-NS IPMPInfo-Profilev0.4.xsd
                     urn:mpeg:mpeg21:2004:01-IPMPDIDL-NS IPMPDIDL.xsd
                     urn:mpeg:mpeg21:2002:01-DII-NS dii.xsd
                     urn:mpeg:mpeg21:2002:02-DIDL-NS DIDL.xsd">
<Container>
<Descriptor>
<Statement mimeType="text/xml">
<ipmpinfo:IPMPGeneralInfoDescriptor>
<ipmpinfo:ToolList>
<ipmpinfo:ToolDescription localID="10">
<ipmpinfo:IPMPToolID>AES-128-CBC</ipmpinfo:IPMPToolID>
</ipmpinfo:ToolDescription>
</ipmpinfo:ToolList>
<ipmpinfo:LicenseCollection>
<ipmpinfo:RightsDescriptor>
<ipmpinfo:License>
<!-- license information -->
</ipmpinfo:License>
</ipmpinfo:RightsDescriptor>
</ipmpinfo:LicenseCollection>
</ipmpinfo:IPMPGeneralInfoDescriptor>
</Statement>
</Descriptor>
<Descriptor>
<Statement mimeType="application/laser">
<![CDATA[-------------- animation script or code ---------]]>
</Statement>
</Descriptor>
<Item id="1">
<Component>
<Resource mimeType="application/ipmp">
<ipmpdidl:ProtectedAsset mimeType="audio/mp3">
<ipmpdidl:Identifier>
<dii:Identifier>IPMPId001</dii:Identifier>
</ipmpdidl:Identifier>
<ipmpdidl:Info>
<ipmpinfo:IPMPInfoDescriptor>
<ipmpinfo:Tool>
<ipmpinfo:ToolRef localidref="10"/>
</ipmpinfo:Tool>
</ipmpinfo:IPMPInfoDescriptor>
</ipmpdidl:Info>
<ipmpdidl:Contents ref="#mp(/byte(0,4550000))"/>
</ipmpdidl:ProtectedAsset>
</Resource>
</Component>
</Item>
<Item id="2">
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4550000,4550370))"/>
</Component>
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4550370,4550892))"/>
</Component>
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4550892,4551006))"/>
</Component>
</Item>
<Item id="3">
<Component>
<Resource mimeType="text/txt" ref="#mp(/byte(4551006,4551078))"/>
</Component>
</Item>
</Container>
</DIDL>

Table 9 shows an example metadata instantiation protecting the LASeR animation script for the JPEG
slide show.

Table 9 — Metadata instantiation example of protecting LASeR animation script

<?xml version="1.0" encoding="UTF-8"?>


<DIDL xmlns="urn:mpeg:mpeg21:2002:02-DIDL-NS" xmlns:dii="urn:mpeg:mpeg21:2002:01-DII-NS"
 xmlns:ipmpdidl="urn:mpeg:mpeg21:2004:01-IPMPDIDL-NS"
 xmlns:ipmpinfo="urn:mpeg:mpeg21:2004:01-IPMPINFO-BASE-NS"
 xmlns:mx="urn:mpeg:mpeg21:2003:01-REL-MX-NS" xmlns:r="urn:mpeg:mpeg21:2003:01-REL-R-NS"
 xmlns:sx="urn:mpeg:mpeg21:2003:01-REL-SX-NS" xmlns:enc="http://www.w3.org/2001/04/xmlenc#"
 xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="urn:mpeg:mpeg21:2003:01-REL-R-NS rel-r.xsd
                     urn:mpeg:mpeg21:2003:01-REL-MX-NS rel-mx.xsd
                     urn:mpeg:mpeg21:2004:01-IPMPINFO-BASE-NS IPMPInfo-Profilev0.4.xsd
                     urn:mpeg:mpeg21:2004:01-IPMPDIDL-NS IPMPDIDL.xsd
                     urn:mpeg:mpeg21:2002:01-DII-NS dii.xsd
                     urn:mpeg:mpeg21:2002:02-DIDL-NS DIDL.xsd">
<Container>
<Descriptor>
<Statement mimeType="text/xml">
<ipmpinfo:IPMPGeneralInfoDescriptor>
<ipmpinfo:ToolList>
<ipmpinfo:ToolDescription localID="10">
<ipmpinfo:IPMPToolID>AES-128-CBC</ipmpinfo:IPMPToolID>
</ipmpinfo:ToolDescription>
</ipmpinfo:ToolList>
<ipmpinfo:LicenseCollection>
<ipmpinfo:RightsDescriptor>
<ipmpinfo:License>
<!-- license information -->
</ipmpinfo:License>
</ipmpinfo:RightsDescriptor>
</ipmpinfo:LicenseCollection>
</ipmpinfo:IPMPGeneralInfoDescriptor>
</Statement>
</Descriptor>
<Descriptor>
<Statement mimeType="application/ipmp">
<ipmpdidl:ProtectedAsset mimeType="application/laser">
<ipmpdidl:Identifier>
<dii:Identifier>IPMPId001</dii:Identifier>
</ipmpdidl:Identifier>
<ipmpdidl:Info>
<ipmpinfo:IPMPInfoDescriptor>
<ipmpinfo:Tool>
<ipmpinfo:ToolRef localidref="10"/>
</ipmpinfo:Tool>
</ipmpinfo:IPMPInfoDescriptor>
</ipmpdidl:Info>
<ipmpdidl:Contents>
<![CDATA[-------------- animation script or code ---------]]>
</ipmpdidl:Contents>
</ipmpdidl:ProtectedAsset>
</Statement>
</Descriptor>
<Item id="1">
<Component>
<Resource mimeType="audio/mp3" ref="#mp(/byte(0,4550000))"/>
</Component>
</Item>
<Item id="2">
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4550000,4550370))"/>
</Component>
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4550370,4550892))"/>
</Component>
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4550892,4551024))"/>
</Component>
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4551024,4551812))"/>
</Component>
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4551812,4552002))"/>
</Component>
</Item>
<Item id="3">
<Component>
<Resource mimeType="text/txt" ref="#mp(/byte(4552002,4552082))"/>
</Component>
</Item>
</Container>
</DIDL>

Protecting a combination of individual items

All resources, or any combination of them (e.g. MP3 audio and JPEG images, or JPEG images and their
animation), can be protected at the same time. Figure 36 shows the protection description protecting
the JPEG images and the slide show animation.


Figure 36 — Protecting JPEG slide show and LASeR script

Protecting one or more JPEG images

Each JPEG image inside the image track can be protected individually or collectively. Figure 37
shows the protection description for protecting two JPEG images.

Figure 37 — Protecting two JPEG images

Table 10 shows an example metadata instantiation protecting two JPEG images.

Table 10 — Metadata instantiation example of protecting two JPEG images in slide show

<?xml version="1.0" encoding="UTF-8"?>


<DIDL xmlns="urn:mpeg:mpeg21:2002:02-DIDL-NS" xmlns:dii="urn:mpeg:mpeg21:2002:01-DII-NS"
 xmlns:ipmpdidl="urn:mpeg:mpeg21:2004:01-IPMPDIDL-NS"
 xmlns:ipmpinfo="urn:mpeg:mpeg21:2004:01-IPMPINFO-BASE-NS"
 xmlns:mx="urn:mpeg:mpeg21:2003:01-REL-MX-NS" xmlns:r="urn:mpeg:mpeg21:2003:01-REL-R-NS"
 xmlns:sx="urn:mpeg:mpeg21:2003:01-REL-SX-NS" xmlns:enc="http://www.w3.org/2001/04/xmlenc#"
 xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="urn:mpeg:mpeg21:2003:01-REL-R-NS rel-r.xsd
                     urn:mpeg:mpeg21:2003:01-REL-MX-NS rel-mx.xsd
                     urn:mpeg:mpeg21:2004:01-IPMPINFO-BASE-NS IPMPInfo-Profilev0.4.xsd
                     urn:mpeg:mpeg21:2004:01-IPMPDIDL-NS IPMPDIDL.xsd
                     urn:mpeg:mpeg21:2002:01-DII-NS dii.xsd
                     urn:mpeg:mpeg21:2002:02-DIDL-NS DIDL.xsd">
<Container>
<Descriptor>
<Statement mimeType="text/xml">
<ipmpinfo:IPMPGeneralInfoDescriptor>
<ipmpinfo:ToolList>
<ipmpinfo:ToolDescription localID="10">
<ipmpinfo:IPMPToolID>AES-128-CBC</ipmpinfo:IPMPToolID>
</ipmpinfo:ToolDescription>
</ipmpinfo:ToolList>
<ipmpinfo:LicenseCollection>
<ipmpinfo:RightsDescriptor>
<ipmpinfo:License>
<!-- license information -->
</ipmpinfo:License>
</ipmpinfo:RightsDescriptor>
</ipmpinfo:LicenseCollection>
</ipmpinfo:IPMPGeneralInfoDescriptor>
</Statement>
</Descriptor>
<Descriptor>
<Statement mimeType="application/laser">
<![CDATA[-------------- animation script or code ---------]]>
</Statement>
</Descriptor>
<Item id="1">
<Component>
<Resource mimeType="audio/mp3" ref="#mp(/byte(0,4550000))"/>
</Component>
</Item>
<Item id="2">
<Component>
<Resource mimeType="application/ipmp">
<ipmpdidl:ProtectedAsset mimeType="image/jpeg">
<ipmpdidl:Identifier>
<dii:Identifier>IPMPId001</dii:Identifier>
</ipmpdidl:Identifier>
<ipmpdidl:Info>
<ipmpinfo:IPMPInfoDescriptor>
<ipmpinfo:Tool>
<ipmpinfo:ToolRef localidref="10"/>
</ipmpinfo:Tool>
</ipmpinfo:IPMPInfoDescriptor>
</ipmpdidl:Info>
<ipmpdidl:Contents ref="#mp(/byte(4550000,4550370))"/>
</ipmpdidl:ProtectedAsset>
</Resource>
</Component>
<Component>
<Resource mimeType="application/ipmp">
<ipmpdidl:ProtectedAsset mimeType="image/jpeg">
<ipmpdidl:Identifier>
<dii:Identifier>IPMPId002</dii:Identifier>
</ipmpdidl:Identifier>
<ipmpdidl:Info>
<ipmpinfo:IPMPInfoDescriptor>
<ipmpinfo:Tool>
<ipmpinfo:ToolRef localidref="10"/>
</ipmpinfo:Tool>
</ipmpinfo:IPMPInfoDescriptor>
</ipmpdidl:Info>
<ipmpdidl:Contents ref="#mp(/byte(4550370,4550892))"/>
</ipmpdidl:ProtectedAsset>
</Resource>
</Component>
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4550892,4551024))"/>
</Component>
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4551024,4551812))"/>
</Component>
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4551812,4552002))"/>
</Component>
</Item>
<Item id="3">
<Component>
<Resource mimeType="text/txt" ref="#mp(/byte(4552002,4552082))"/>
</Component>
</Item>
</Container>
</DIDL>

Protecting a certain segment of a resource

Using the MPEG-21 Fragment Identifier (ISO/IEC 21000-17), it is possible to protect a specific
segment of the content, such as a specifically defined rectangular region of a JPEG image or a
specific byte range of the MP3 audio, as shown in Figure 38 and Figure 39, respectively.
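As an illustration, the following minimal C++ sketch parses the simplified "#mp(/byte(start,end))" form used in the examples of this section into a byte range; the full fragment identifier grammar of ISO/IEC 21000-17 is considerably richer.

#include <cstdio>
#include <string>

// Returns true and fills start/end when the fragment matches the byte form.
bool parseByteFragment(const std::string& frag, long& start, long& end) {
    return std::sscanf(frag.c_str(), "#mp(/byte(%ld,%ld))", &start, &end) == 2;
}

int main() {
    long s = 0, e = 0;
    if (parseByteFragment("#mp(/byte(4550000,4550370))", s, e))
        std::printf("protected bytes %ld..%ld\n", s, e);
}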

Figure 38 — Protecting specific segment in MP3 audio



Figure 39 — Protecting specific segment in a JPEG image

Table 11 shows an example metadata instantiation protecting a specific segment of the MP3 audio
data (from byte 35,430 to byte 3,200,234).

Table 11 — Metadata instantiation example of protecting specific segment of MP3 audio

<?xml version="1.0" encoding="UTF-8"?>


<DIDL xmlns="urn:mpeg:mpeg21:2002:02-DIDL-NS" xmlns:dii="urn:mpeg:mpeg21:2002:01-DII-NS"
 xmlns:ipmpdidl="urn:mpeg:mpeg21:2004:01-IPMPDIDL-NS"
 xmlns:ipmpinfo="urn:mpeg:mpeg21:2004:01-IPMPINFO-BASE-NS"
 xmlns:mx="urn:mpeg:mpeg21:2003:01-REL-MX-NS" xmlns:r="urn:mpeg:mpeg21:2003:01-REL-R-NS"
 xmlns:sx="urn:mpeg:mpeg21:2003:01-REL-SX-NS" xmlns:enc="http://www.w3.org/2001/04/xmlenc#"
 xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="urn:mpeg:mpeg21:2003:01-REL-R-NS rel-r.xsd
                     urn:mpeg:mpeg21:2003:01-REL-MX-NS rel-mx.xsd
                     urn:mpeg:mpeg21:2004:01-IPMPINFO-BASE-NS IPMPInfo-Profilev0.4.xsd
                     urn:mpeg:mpeg21:2004:01-IPMPDIDL-NS IPMPDIDL.xsd
                     urn:mpeg:mpeg21:2002:01-DII-NS dii.xsd
                     urn:mpeg:mpeg21:2002:02-DIDL-NS DIDL.xsd">
<Container>
<Descriptor>
<Statement mimeType="text/xml">
<ipmpinfo:IPMPGeneralInfoDescriptor>
<ipmpinfo:ToolList>
<ipmpinfo:ToolDescription localID="10">
<ipmpinfo:IPMPToolID>AES-128-CBC</ipmpinfo:IPMPToolID>
</ipmpinfo:ToolDescription>
</ipmpinfo:ToolList>
<ipmpinfo:LicenseCollection>
<ipmpinfo:RightsDescriptor>
<ipmpinfo:License>
<!-- license information -->
</ipmpinfo:License>
</ipmpinfo:RightsDescriptor>
</ipmpinfo:LicenseCollection>
</ipmpinfo:IPMPGeneralInfoDescriptor>
</Statement>
</Descriptor>
<Descriptor>
<Statement mimeType="application/laser">
<![CDATA[-------------- animation script or code ---------]]>
</Statement>
</Descriptor>
<Item id="1">
<Component>
<Resource mimeType="audio/mp3" ref="#mp(/byte(0,4550000))"/>
<Anchor>
<Fragment fragmentId="#mp(/byte(35430,3200234))">
<ipmpdidl:ProtectedAsset mimeType="audio/mp3">
<ipmpdidl:Identifier>
<dii:Identifier>IPMPId001</dii:Identifier>
</ipmpdidl:Identifier>
<ipmpdidl:Info>
<ipmpinfo:IPMPInfoDescriptor>
<ipmpinfo:Tool>
<ipmpinfo:ToolRef localidref="10"/>
</ipmpinfo:Tool>
</ipmpinfo:IPMPInfoDescriptor>
</ipmpdidl:Info>
</ipmpdidl:ProtectedAsset>
</Fragment>
</Anchor>
</Component>
</Item>
<Item id="2">
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4550000,4550370))"/>
</Component>
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4550370,4550892))"/>
</Component>
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4550892,4551006))"/>
</Component>
</Item>
<Item id="3">
<Component>
<Resource mimeType="text/txt" ref="#mp(/byte(4551006,4551078))"/>
</Component>
</Item>
</Container>
</DIDL>

Table 12 shows a metadata instantiation example of protecting specific regions in two JPEG images.
The first image is protected in a rectangular region defined by the upper-left pixel coordinate
[20, 20] and the lower-right pixel coordinate [40, 40]. The second image is protected in an
elliptical region defined by its circumscribing rectangle, from the upper-left corner at [45, 30]
to the lower-right corner at [120, 120].
Table 12 — Metadata instantiation example of protecting specific segment of a JPEG image

<?xml version="1.0" encoding="UTF-8"?>


<DIDL xmlns="urn:mpeg:mpeg21:2002:02-DIDL-NS" xmlns:dii="urn:mpeg:mpeg21:2002:01-DII-NS"
 xmlns:ipmpdidl="urn:mpeg:mpeg21:2004:01-IPMPDIDL-NS"
 xmlns:ipmpinfo="urn:mpeg:mpeg21:2004:01-IPMPINFO-BASE-NS"
 xmlns:mx="urn:mpeg:mpeg21:2003:01-REL-MX-NS" xmlns:r="urn:mpeg:mpeg21:2003:01-REL-R-NS"
 xmlns:sx="urn:mpeg:mpeg21:2003:01-REL-SX-NS" xmlns:enc="http://www.w3.org/2001/04/xmlenc#"
 xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="urn:mpeg:mpeg21:2003:01-REL-R-NS rel-r.xsd
                     urn:mpeg:mpeg21:2003:01-REL-MX-NS rel-mx.xsd
                     urn:mpeg:mpeg21:2004:01-IPMPINFO-BASE-NS IPMPInfo-Profilev0.4.xsd
                     urn:mpeg:mpeg21:2004:01-IPMPDIDL-NS IPMPDIDL.xsd
                     urn:mpeg:mpeg21:2002:01-DII-NS dii.xsd
                     urn:mpeg:mpeg21:2002:02-DIDL-NS DIDL.xsd">
<Container>
<Descriptor>
<Statement mimeType="text/xml">
<ipmpinfo:IPMPGeneralInfoDescriptor>
<ipmpinfo:ToolList>
<ipmpinfo:ToolDescription localID="10">
<ipmpinfo:IPMPToolID>AES-128-CBC</ipmpinfo:IPMPToolID>
</ipmpinfo:ToolDescription>
</ipmpinfo:ToolList>
<ipmpinfo:LicenseCollection>
<ipmpinfo:RightsDescriptor>
<ipmpinfo:License>
<!-- license information -->
</ipmpinfo:License>
</ipmpinfo:RightsDescriptor>
</ipmpinfo:LicenseCollection>
</ipmpinfo:IPMPGeneralInfoDescriptor>
</Statement>
</Descriptor>
<Descriptor>
<Statement mimeType="application/laser">
<![CDATA[-------------- animation script or code ---------]]>
</Statement>
</Descriptor>
<Item id="1">
<Component>
<Resource mimeType="audio/mp3" ref="#mp(/byte(0,4550000))"/>
</Component>
</Item>
<Item id="2">
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4550000,4550370))"/>
<Anchor>
<Fragment fragmentId="#mp(/region(rect(20,20,40,40)))">
<ipmpdidl:ProtectedAsset mimeType="image/jpeg">
<ipmpdidl:Identifier>
<dii:Identifier>IPMPId001</dii:Identifier>
</ipmpdidl:Identifier>
<ipmpdidl:Info>
<ipmpinfo:IPMPInfoDescriptor>
<ipmpinfo:Tool>
<ipmpinfo:ToolRef localidref="10"/>
</ipmpinfo:Tool>
</ipmpinfo:IPMPInfoDescriptor>
</ipmpdidl:Info>
</ipmpdidl:ProtectedAsset>
</Fragment>
</Anchor>
</Component>
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4550370,4550892))"/>
</Component>
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4550892,4551024))"/>
<Anchor>
<Fragment fragmentId="#mp(/region(ellipse(45,30,120,120)))">
<ipmpdidl:ProtectedAsset mimeType="image/jpeg">
<ipmpdidl:Identifier>
<dii:Identifier>IPMPId002</dii:Identifier>
</ipmpdidl:Identifier>
<ipmpdidl:Info>
<ipmpinfo:IPMPInfoDescriptor>
<ipmpinfo:Tool>
<ipmpinfo:ToolRef localidref="10"/>
</ipmpinfo:Tool>
</ipmpinfo:IPMPInfoDescriptor>
</ipmpdidl:Info>
</ipmpdidl:ProtectedAsset>
</Fragment>
</Anchor>
</Component>
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4551024,4551812))"/>
</Component>
<Component>
<Resource mimeType="image/jpeg" ref="#mp(/byte(4551812,4552002))"/>
</Component>
</Item>
<Item id="3">
<Component>
<Resource mimeType="text/txt" ref="#mp(/byte(4552002,4552082))"/>
</Component>
</Item>
</Container>
</DIDL>

Moreover, it is also possible to combine the protection of a certain segment of a resource with the
protection of other content, e.g., protecting a certain segment of the MP3 audio together with the
JPEG images that are synchronized to the timestamps of the protected MP3 segment, as shown in
Figure 40.


Figure 40 — Protecting specific segment in MP3 audio as well as specific segment in a JPEG
image
3.4. MPEG-A Part 10: Video Surveillance Application Format

3.4.1. Overview

The Video Surveillance AF is a file format designed to provide a first level of interoperability
for video-based surveillance systems. It contains MPEG-4 AVC baseline profile video and associated
MPEG-7 metadata. Usage of other coded video formats will also be assisted. The proposed Video
surveillance application format is not intended to directly accommodate legacy content and
components; rather, it is intended to provide a lightweight and useful wrapper for video content
based on the MPEG technologies that are the best fit for purpose at the expected date of
finalisation. However, the description of any relation existing between Video surveillance
application format content and other video content will need to be addressed. Currently, JPEG and
MPEG-4 Part 2 are arguably the most commonly deployed digital video standards. In due course,
however, AVC is expected to be more commonly deployed, not least because it is understood to be the
most 'fit-for-purpose' of the available video technologies.

3.4.2. File format

A Video Surveillance AF consists of a set of self-contained AF fragments which are connected to
each other. A Video surveillance application format fragment covers a limited amount of time. Each
fragment is identified by a UUID (universally unique identifier) and is linked to its predecessor
and successor fragments through their UUIDs, as shown in Figure 41. All Video surveillance
application format data is stored within the fragments. If a fragment has no predecessor or
successor, the corresponding value is set to the UUID of the current fragment. Additionally, a URI
can be given as a hint to the location of the predecessor and successor fragments. A fragment
remains self-contained even if detached from the chain. Note that there is no requirement to use
more than one fragment; the fragment concept enables, for example, ring buffer architectures. Each
fragment shall be a valid AVC file as defined by the AVC file format.
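The fragment chaining rule can be summarized in a small sketch; the struct below is an illustrative stand-in for the fields carried in the fragment identification box, not a normative data structure.

#include <iostream>
#include <string>

struct VsafFragment {
    std::string uuid;            // identifies this fragment
    std::string predecessorUuid; // previous fragment, or own UUID if none
    std::string successorUuid;   // next fragment, or own UUID if none
    std::string predecessorUri;  // optional location hint
    std::string successorUri;    // optional location hint
};

// A fragment with no predecessor (or successor) points at itself.
bool isFirstFragment(const VsafFragment& f) { return f.predecessorUuid == f.uuid; }
bool isLastFragment(const VsafFragment& f)  { return f.successorUuid == f.uuid; }

int main() {
    VsafFragment f{"uuid-n", "uuid-n", "uuid-n-plus-1", "", "file://next.vsf"};
    std::cout << "first: " << isFirstFragment(f)
              << ", last: " << isLastFragment(f) << '\n';  // prints 1, 0
}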

To store the compressed video data in sample units together with its corresponding metadata, the
Video surveillance application format provides a file structure based on the ISO Base Media File
Format, which takes an object-oriented approach to the design of boxes for data storage. The file
structure also contains the boxes specifically designed for the AVC File Format.

Figure 42 shows the file structure of the Video surveillance application format, which consists of
all the MPEG-4 file format mandatory boxes: the file type ('ftyp') box to store the information
identifying the file format, the movie container ('moov') box to store the location and timing
information of video samples, and the media data container ('mdat') box to store the compressed
video data in sample units. The sample unit for AVC bitstreams is the NAL (Network Abstraction
Layer) unit.

In the file type box, the major brand 'vsf1', which stands for 'video surveillance format 1', is
specified to identify the Video surveillance application format. Any Video surveillance application
format-capable player should check for this brand in order to parse the file structure and handle
its contents.
[Figure: a chain of VS AF fragments (n-1, n, n+1), each carrying a UUID, start time, duration, predecessor UUID and successor UUID, linked to its neighbours through the predecessor and successor ids.]

Figure 41 — VS AF fragments

In the movie container box, each track box carries information about the video content in terms of
locations, times, durations, track type handlers, AVC-specific information (profiles and levels for
codec types, bit rates, frame rates, screen sizes), media information, etc., which is used to
identify and to enable access to the corresponding media content. By default, at least one video
track shall exist in the Video surveillance application format. The specification allows only one
video data type, encoded using the AVC baseline profile up to level 3.1. The handler type for each
media handler box is specified based on the content it refers to: each video track is identified by
the 'vide' handler type, while each timed metadata track is signalled by the 'meta' handler type.

The Video surveillance application format requires the capture time stamp to be stored for every
video frame. The timestamps are stored in timestamp metadata samples in a time-parallel metadata
track which is linked to the video track by means of a track reference of type 'vsmd'. For every
video sample a timestamp metadata sample shall exist with decoding time equal to the decoding time
of the corresponding video sample. The current format defines the storage of a binary coded
timestamp for all video samples of a video track; future versions might store more information
about a video sample. The video surveillance metadata sample entry therefore contains a version
number to inform the reader of the sample format used in the metadata track.
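A minimal sketch of this pairing rule, assuming the sample decoding times have already been read from the sample tables; the vectors are illustrative stand-ins for parsed 'stbl' data.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Every video sample must have a timestamp metadata sample with the same
// decoding time in the 'vsmd'-linked metadata track.
bool timestampsCoverVideo(const std::vector<uint64_t>& videoDts,
                          std::vector<uint64_t> metaDts) {
    std::sort(metaDts.begin(), metaDts.end());
    for (uint64_t dts : videoDts)
        if (!std::binary_search(metaDts.begin(), metaDts.end(), dts))
            return false;  // a video sample without a capture timestamp
    return true;
}

int main() {
    std::vector<uint64_t> video{0, 3000, 6000}, meta{0, 3000, 6000};
    std::cout << (timestampsCoverVideo(video, meta) ? "ok\n" : "missing\n");
}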
[Figure: VS AF file structure: 'ftyp'; a file-level 'meta' box with 'vsmi' and MPEG-7 xml; and 'moov' containing AVC video trak boxes (each with a track-level 'meta' holding 'cami' and MPEG-7 xml, and an 'stbl') plus a timed metadata trak.]
Figure 42 — VS AF file structure

3.4.3. System architecture


Creating a Video surveillance application format file involves the use of movie fragments (not to
be confused with Video surveillance application format fragments). In general, movie and track
fragments extend a presentation in time. All fragments must be stored in the sequence given by an
ordinal sequence number. Each movie fragment contains one or more track fragments for all tracks in
the movie. Using movie fragments, the video data can be read while the Video surveillance
application format file is still being stored and created. Figure 43 illustrates how movie
fragments are used in the file format.

[Figure: 'moov' with trak boxes, an 'mfra' box whose 'tfra' track fragment run arrays (with free space to which new array fields are appended) index the AVC video samples and timed metadata samples appended to 'mdat'.]

Figure 43 — Using movie fragment


3.4.4. AVC video

Because of the multi-functionality of MPEG-4 AVC, subsets of its tools have been defined in order
to allow effective implementations of the standard. These subsets, called "Profiles", limit the
tool set which shall be implemented. For each of these Profiles, one or more Levels have been set
to restrict the computational complexity of implementations.

MPEG-4 AVC accepts various sizes of input picture within the capability specified by the Profile
and Level. In this AF, use of the MPEG-4 AVC video codec is required. The Baseline Profile tool set
is used up to level 3.1 (the maximum value of level_idc shall be 31). Both constraint_set0_flag and
constraint_set1_flag shall be set to 1. The MPEG-4 AVC Baseline profile features and the Levels
from 1 to 3.1 are shown in Table 13 and Table 14, respectively.

Table 13 — MPEG-4 AVC Baseline profile

Features                                   Baseline
I and P Slices                             Yes
B Slices                                   No
SI and SP Slices                           No
Multiple Reference Frames                  Yes
In-Loop Deblocking Filter                  Yes
CAVLC Entropy Coding                       Yes
CABAC Entropy Coding                       No
Flexible Macroblock Ordering (FMO)         Yes
Arbitrary Slice Ordering (ASO)             Yes
Redundant Slices (RS)                      Yes
Data Partitioning                          No
Interlaced Coding (PicAFF, MBAFF)          No
4:2:0 Chroma Format                        Yes
Monochrome Video Format (4:0:0)            No
4:2:2 Chroma Format                        No
4:4:4 Chroma Format                        No
8 Bit Sample Depth                         Yes
9 and 10 Bit Sample Depth                  No
11 to 14 Bit Sample Depth                  No
8x8 vs. 4x4 Transform Adaptivity           No
Quantization Scaling Matrices              No
Separate Cb and Cr QP control              No
Separate Color Plane Coding                No
Predictive Lossless Coding                 No

Table 14 — MPEG-4 AVC Levels up to level 3.1

Level  Max macroblocks  Max frame size   Max video bit rate (VCL)   Examples for high resolution @ frame rate
       per second       (macroblocks)    for Baseline, Extended     (max stored frames) in Level
                                         and Main Profiles
1      1485             99               64 kbit/s                  128x96@30.9 (8); 176x144@15.0 (4)
1b     1485             99               128 kbit/s                 128x96@30.9 (8); 176x144@15.0 (4)
1.1    3000             396              192 kbit/s                 176x144@30.3 (9); 320x240@10.0 (3); 352x288@7.5 (2)
1.2    6000             396              384 kbit/s                 320x240@20.0 (7); 352x288@15.2 (6)
1.3    11880            396              768 kbit/s                 320x240@36.0 (7); 352x288@30.0 (6)
2      11880            396              2 Mbit/s                   320x240@36.0 (7); 352x288@30.0 (6)
2.1    19800            792              4 Mbit/s                   352x480@30.0 (7); 352x576@25.0 (6)
2.2    20250            1620             4 Mbit/s                   352x480@30.7 (10); 352x576@25.6 (7); 720x480@15.0 (6); 720x576@12.5 (5)
3      40500            1620             10 Mbit/s                  352x480@61.4 (12); 352x576@51.1 (10); 720x480@30.0 (6); 720x576@25.0 (5)
3.1    108000           3600             14 Mbit/s                  720x480@80.0 (13); 720x576@66.7 (11); 1280x720@30.0 (5)
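As an illustration of these constraints, the following minimal C++ sketch checks the profile and level fields carried in the sequence parameter set (profile_idc 66 identifies the Baseline profile; the flag byte layout follows the AVC syntax). The function name and the way the fields are obtained are illustrative.

#include <cstdint>
#include <iostream>

bool conformsToVsafVideo(uint8_t profile_idc, uint8_t constraint_flags,
                         uint8_t level_idc) {
    const bool set0 = (constraint_flags & 0x80) != 0; // constraint_set0_flag
    const bool set1 = (constraint_flags & 0x40) != 0; // constraint_set1_flag
    // Baseline profile, both flags set, and at most level 3.1 (level_idc 31).
    return profile_idc == 66 && set0 && set1 && level_idc <= 31;
}

int main() {
    // Values as they would be read from the SPS / 'avcC' decoder configuration.
    std::cout << (conformsToVsafVideo(66, 0xC0, 31) ? "conforms\n"
                                                    : "does not conform\n");
}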

3.4.5. Metadata

To store the information about the time stamp of each video sample, a timed metadata track is used
in the 'moov' box. In addition, 'meta' boxes are used to store the MPEG-7 metadata. Two types of
'meta' box are used: a file-level 'meta' box to store the file level metadata, and track-level
'meta' boxes to store track level metadata.

File level metadata

The file level metadata describes the identification and creation time of a Video surveillance
application format file and carries textual annotation about the contents within it. It also allows
a classification scheme to be used to describe the content. This metadata should be located at the
top level of the Video surveillance application format file in order to enable easy access to the
identification information. In the file structure hierarchy it is located at file level, hence the
name file level metadata.

At file level, two different boxes may be used to store metadata. The AF Identification Box is
required and shall be included in every Video surveillance application format fragment. The 'vsmi'
box shall provide the fragment identification UUID and the UUIDs of the successor and predecessor.
It may also provide URIs to the successor and predecessor; if these URIs are provided, it must be
possible to resolve them. The box covers the following information:

- File identification: a UUID identifying every Video surveillance application format fragment
- Successor and predecessor identification: the UUIDs of the previous/next fragments in composition
  time shall be included (URIs describing the corresponding locations may be included)
- The UTC-based time stamps of the first and the last sample in the video tracks

An additional metadata box containing further information may also be included in a Video
Surveillance AF fragment. If present, the metadata contained inside this additional box is
expressed using MPEG-7. The following profile applies to MPEG-7 in the file level metadata:

1. General text annotation

a. Textual annotation should be added using either free text or structured annotation. The
cardinality of all elements within a structured annotation is limited to zero or one.

2. For file level metadata:

a. ID

i. The UUID of the Video surveillance application format file shall be repeated in the
PublicIdentifier of the DescriptionMetadata element. There shall be one of
these descriptors.

ii. It is assumed that the UUID of the camera already contains the information about the cluster
the camera belongs to.

b. Time

i. The Video surveillance application format start time should be repeated at file level
(CreationTime of the DescriptionMetadata element). This indicates the creation time of the
fragment.

c. Textual annotations should be added using Comment of the DescriptionMetadata element. Zero or
one of the types described in 1.a shall be used.

d. Classification schemes and Terms can be defined and used as described in MPEG-7 MDS. The
cardinality of the Definition element of the TermDefinitionBaseType shall be zero or one. The
cardinality of the Name element of the TermDefinitionBaseType shall be zero or one. The preferred
attribute of the Name element of the TermDefinitionBaseType shall not be used.

e. Maintaining object references

i. Object references are grouped using the Graph DS and referenced using Relation DS elements. The
objects can be any DS used within the XML document, e.g. a still region of video at track level.
ii. The attributes within a Relation are restricted to unary values, i.e. source, target and type
can each contain only a single reference to a Classification Scheme Term, id reference, etc. (the
values of the termReferenceType).

Track level metadata

The track level metadata describes the content of the Video surveillance application format. It
describes the track identification information, camera equipment, timing information for each
track, text annotation describing events, the decomposition of frames, the locations of objects as
ROIs in a frame, the color appearance of objects, and the identification of objects. The current
specification of the Video surveillance application format uses a limited set of MPEG-7 Visual
descriptors, namely the dominant color and scalable color descriptors. Since the track level
metadata describes the video content, the metadata with the MPEG-7 dominant color and scalable
color descriptor values is located inside the respective track boxes, hence the name track level
metadata.

Track level metadata is included in two boxes. A required camera identifier box shall be included;
it provides the camera identification UUID and user defined identification extensions used to
create a particular track in a Video surveillance application format. The camera UUID should be
associated with a physical camera or with a camera location. The camera identification box may be
enlarged (indicated by the size of the box) if storage of user defined identification data is
required; this extra information might not be understood by all Video surveillance application
format readers. If there are alternative tracks holding different encodings from the same camera,
the camera UUID shall be identical for all these alternate tracks. This box contains the camera
identification:

- A UUID identifying the camera
- Additional space for user defined identification

An additional metadata box may be included. If present, the metadata contained inside this
additional box is expressed using MPEG-7 with the following profile:

a. Metadata for each track is described using the Video Segment DS, e.g. VideoType. Only one of
these types shall be used.

b. ID

i. The camera id shall be repeated from the 'cami' box in PublicIdentifier of the
DescriptionMetadata element. There shall be one of these descriptors.

c. Equipment

i. The camera / cluster settings should be given (Instrument of the DescriptionMetadata element).
If present, zero or one of these types shall be used.
ii. Additional information regarding the cluster to which the camera belongs should be given by
EntityIdentifier and its VideoDomain element. If present, one of the EntityIdentifier types shall
be used and zero or more of the VideoDomain types should be used. The VideoDomain elements
reference entries from a Classification Scheme (ClassificationScheme).
iii. The camera stream should be identified with StreamID. Zero or more of these types shall be
used. It is necessary to include the element InstanceIdentifier, although this can be kept empty.
iv. The camera geographic position should be given using CreationLocation. Zero or one of these
types shall be used.
v. Camera calibration should be provided with the Spatial2DCoordinateSystemType. Describing more
than one of these descriptors allows a calibration function to be provided for each preset of a
PTZ camera. Zero or one of these types shall be used.
vi. If the media is outside of the Video surveillance application format fragment and referenced
using the Data Reference Box ('dref'), the MediaURI shall contain the same reference. If no Data
Reference Box ('dref') is present, the MediaURI should contain a valid reference to a media
instance. It is necessary to include the element InstanceIdentifier, although this can be kept
empty.

d. Time

i. The video offset has no specific element, so the Description Metadata DS (DescriptionMetadata),
Instrument and its Tool Setting elements should be used. The setting name is "offset". If present,
one of these types shall have the format and precision given in Section 6.
ii. The time of the video shall be given with a media time element (MediaTime). Within this element
one time point shall be given. Duration information can also be given here. To specify the duration
of a video decomposition, MediaDuration should be used. This is a representation of the duration of
the track as given in Section 6.
iii. To isolate where a StillRegion exists in the video, the MediaTimePoint shall be used.

e. Decomposition

i. Groups of frames should be defined within the video using the TemporalDecomposition. If present,
one of these types shall be used.
ii. Single frames should be decomposed using the StillRegion DS. If a frame (StillRegion DS) is
decomposed, its time position shall be specified.
iii. To isolate a region within a frame, a choice shall be made between a Box or a Polygon. Zero or
one of these can be described per StillRegion. If more regions are required for a frame, another
StillRegion can be instantiated, referencing the same media time point (mediaTimePoint).

f. Visual Descriptions

i. Color should be described in the StillRegion DS by the VisualDescriptor DS or the GridLayout DS.
The VisualDescriptor shall include one color descriptor. The GridLayout can specify an arbitrary
number of cells, each of which should contain one color descriptor.
ii. DominantColor and ScalableColor shall be the only descriptors present from the VisualDescriptor
DS and GridLayout DS.

g. Semantic descriptions

i. A camera track should define semantic descriptions using TextAnnotation (see 1, "General text
annotation"). This is possible with FreeTextAnnotation at the DescriptionMetadata, Video and
StillRegion levels.
ii. A camera track should also define semantic descriptions using a structured annotation
(StructuredAnnotation) at the DescriptionMetadata, Video and StillRegion levels.
iii. In order to provide detailed meaning to semantic descriptions, Terms should be referenced from
Classification Schemes (ClassificationScheme), as described in 2.d.

h. Maintaining object references

i. Object references are defined as described in 2.e.

4. Implementation

4.1. MPEG-A Part 3 2nd Edition: Protected Music Player Application Format

4.1.1. Reference software

The reference software consists of a Protected Music player AF authoring tool and player, which are
built in C++ using MFC in Visual C++ 6.0. Additional libraries are used to decode the MP3 audio and
JPEG image data and to parse the XML metadata. The FMOD sound system library by Firelight
Technologies is used to decode the MP3 audio. The CxImage library by Davide Pizzolato is used to
decode JPEG data. To parse the XML metadata in the player, we use the MSXML SDK. This reference
software also uses the ISOLIB library to parse the ISO-based file format.

4.1.1.1. Authoring tool

The software architecture to create the aforementioned file format is shown in Figure 44. First,
the MP3 audio files and JPEG images are loaded into the creator. The creator maintains the list of
candidate files to be packaged into the Protected Music player AF, categorized by type: audio and
images. Next, one audio file and one image file are selected to be bound in the AF file. The
binding information is used by the creator to define the file format structure and protection
information. Based on this information, the creator then loads the actual resources and packages
them into one or more tracks within a Protected Music player AF file.
[Figure: the Protected Music Player AF Creator gets the resource directories, loads the MP3 and JPEG files, binds an MP3 to a JPEG together with the protection information, defines the boxes' data, and writes/saves the MAF file.]

Figure 44 — Software architecture of Protected Music player AF creator

In the user interface, as shown in Figure 45, the following procedure applies:

1. Add an MP3 file using the "Add MP3" button. Repeat this step to add more MP3 files.
2. Add a JPEG file using the "Add JPEG" button. Repeat this step to add more JPEG files.
3. Select the desired MP3 and JPEG file to be combined in one MAF track. The selected files are
   shown in the text boxes in the resource preview area.
4. Select the desired protection tool to be applied to the MAF file by pressing the "Protection
   tool" button. The protection tool GUI is displayed as shown in Figure 46. Insert the protection
   key in the protection key text input. Although currently not supported by the reference
   software, it is also possible to add a remote reference to the protection tool.
5. Define the license for the MP3 resource by double-clicking the MP3 file in the list. The license
   information GUI is displayed as shown in Figure 47.
6. Combine the MP3 and JPEG using the "Bind JPEG to MP3" button. This step creates one MAF track.
7. Repeat steps 3 to 6 to create another track.
8. Finally, press the "Save" button to define a name for the new MAF file.
[Figure: authoring tool GUI with an image preview, the resources list, audio controls, the protection tool button, and text boxes showing the selected resources.]

Figure 45 — Protected Music Player MAF Authoring Tool GUI

Figure 46 — Protection tool GUI


Figure 47 — License information GUI

4.1.1.2. Player

The software architecture of the Protected Music player AF player application is shown in Figure
48. It consists of three parts: box parser, un-protector and resource playback. The box parser is
invoked first, parsing all necessary information stored inside the boxes; the un-protector is used
to decrypt the encrypted protected resource(s); and the resource playback plays the MP3 audio data
and displays its corresponding JPEG image data and MPEG-7 metadata based on the parsed information.
To reduce memory usage, a resource is loaded into memory only during playback. The player is an
extension of the Music player AF player reference software; the extension includes the use of the
ISOLIB library to parse the ISO-based file format, the IPMP DIDL parser, and the un-protection
(decryption) module.

[Figure: the player's MP4 file reader parses the FTYP, MOOV, META, TRAK and MDAT boxes of the MAF file, the un-protection tool unprotects the resources, and the decoded MP3 is played while the corresponding JPEG is displayed.]

Figure 48 — Protected Music player AF player system architecture


To parse the Protected Music player AF, the ISOLIB API is used in the following procedure:

1. Parse the file-level meta box to get the primary data, i.e. the IPMP DIDL metadata.
2. Parse the file-level meta box to get the first item, i.e. the hidden MP4 file.
3. Parse the hidden MP4 file to get the resources: MP3 audio, JPEG images and MPEG-7 metadata.

By parsing the IPMP DIDL metadata, we obtain the identifier of the protection tool used inside the
file. The un-protection tool is invoked just once, when a protected item is found during the third
step.

The IPMP DIDL metadata is parsed using the MSXML library, following the schema of the IPMP
Components base profile and the profiled REL schema. This reference software supports several
protection tools for unprotecting protected resource(s): an XOR algorithm and the AES algorithm
(Rijndael). The protection tool is applied to each access unit of the protected MP3 audio data.
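For the XOR tool, protection and un-protection can be illustrated with the following minimal sketch; the key layout and the representation of access units are assumptions for illustration, not the reference software's actual code.

#include <cstddef>
#include <cstdint>
#include <vector>

// XOR is its own inverse, so the same routine protects and unprotects.
void xorAccessUnit(std::vector<uint8_t>& au, const std::vector<uint8_t>& key) {
    for (std::size_t i = 0; i < au.size(); ++i)
        au[i] ^= key[i % key.size()];
}

// The tool is applied to each access unit of the protected MP3 track.
void unprotectTrack(std::vector<std::vector<uint8_t>>& accessUnits,
                    const std::vector<uint8_t>& key) {
    for (auto& au : accessUnits)
        xorAccessUnit(au, key);
}

int main() {
    std::vector<std::vector<uint8_t>> track{{0x12, 0x34}, {0x56}};
    unprotectTrack(track, {0xAB});  // decrypts in place
}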

The following describes the method used to play back the three types of resources stored inside the
hidden MP4 file. The MP3 data is stored inside the 'mdat' box of the hidden MP4 file, partitioned
into many chunks (access units). To play back the audio data, the application first parses the
'iloc' box to read each item's position and length. Next, it parses the 'infe' box to read the
item's name and content type for the respective item index. Based on the content type given in the
'iinf' box, the resources can be separated into audio and images, which makes them easier to manage.
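A minimal sketch of this separation step; the Item struct is an illustrative stand-in for the values read from the 'iloc'/'iinf' boxes.

#include <iostream>
#include <string>
#include <vector>

struct Item {
    int id;
    std::string name;        // e.g. a relative URI to the resource
    std::string contentType; // e.g. "audio/mp3" or "image/jpeg"
};

// Group the parsed items by their content type prefix.
void splitByType(const std::vector<Item>& items,
                 std::vector<Item>& audio, std::vector<Item>& images) {
    for (const auto& it : items) {
        if (it.contentType.rfind("audio/", 0) == 0)      audio.push_back(it);
        else if (it.contentType.rfind("image/", 0) == 0) images.push_back(it);
    }
}

int main() {
    std::vector<Item> items{{1, "song.mp3", "audio/mp3"},
                            {2, "cover.jpg", "image/jpeg"}};
    std::vector<Item> audio, images;
    splitByType(items, audio, images);
    std::cout << audio.size() << " audio item(s), "
              << images.size() << " image item(s)\n";
}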

Since the JPEG data is regarded as an item of the 'meta' box, the ISOFindItemByID() and
ISOGetItemData() APIs are used to obtain the data from the 'mdat' box and display it using the
CxImage library.

The method of parsing the IPMPDIDL metadata inside the file-level 'meta' box of the MPEG-21 file
and the MPEG-7 metadata inside the track-level meta box of the hidden MP4 file is the same. In both
modules, the ISOGetFileMeta() API is used to get file-level metadata, while the ISOGetTrackMeta()
API is used to get track-level metadata. A parser built using the MSXML library is then used to
obtain the information located inside the metadata.

For this reference software, we provide several example files as follows:

Single track MAF

This example consists of a single protected MP3 audio stream and a JPEG image. The audio is a
Korean pop song, "Gido", and the image shows the artist. The output is shown in Figure 49.
Figure 49 — Protected Music Player MAF with single track MAF

Multiple tracks MAF

This example consists of two MAF tracks (each with one MP3 audio stream and one JPEG image). The
first track is a Korean folk story audio file, with an image showing the title of the story. The
second track is a Korean drama soundtrack, with an image from the drama. The output is shown in
Figure 50 (a) for the first track and (b) for the second track.

(a)
Figure 50a — Protected Music Player MAF with two tracks MAF: first track
(b)

Figure 50b — Protected Music Player MAF with two tracks MAF: second track

To unprotect a protected resource, the MAF player automatically invokes a dialog box prompting the
user to insert the protection key, based on the protection tool defined inside the IPMPDIDL
metadata. Figure 51 shows the dialog box.

Figure 51 — Un-protector GUI

4.2. MPEG-A Part 4: Musical Slide Show Application Format

4.2.1. Reference software

A Musical slide show AF file can be created using the Protected Musical slide show AF reference
software by excluding the protection information. See section 4.3.2 of this document for a detailed
description of the reference software.

4.3. MPEG-A Part 4 2nd Edition: Protected Musical Slide Show Application Format

4.3.1. Implementation on PDA

As part of the development of the Protected Musical slide show AF, a player application has been
built for a PDA running the Windows Mobile OS. The user interface is shown in Figure 52.
[Figure: player GUI with a slideshow area and a control panel (left to right): open file, play/stop, pause, show information.]

Figure 52 — User interface of Musical Slide Show AF player

Playing unprotected resource

This is the simplest case: a Musical slide show application format file consisting of MP3 data,
several JPEG images and a set of lyrics, without any protection. The user can freely play the
resource an unlimited number of times, with no time restriction.

Figure 53 — User interface of AF Player application showing unprotected resource

As shown in Figure 53, if the user presses the "File Information" button, the application shows
"No license information available" for the unprotected resource.

Playing protected resource

When playing a protected resource, as shown in Figure 54, the user can see the license information
of the protected resource using the "File Information" button. In this example, the user has been
granted the right to play the resource 20 times between January 1, 2006 and January 1, 2007. For
this application, we define the condition of "playing the content" as "playing the content
continuously for more than 75% of the length of the MP3 data". This means that if the user plays
the resource and listens to more than 75% of the song without any seeking or pausing, the exercise
limit decreases by one. In the example, the user has already played the content once, so the file
information shows the remaining count (19/20 means only 19 plays are left out of the 20 granted by
the license).
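A minimal sketch of this exercise-limit rule; the structure of the counter and its persistence are illustrative assumptions, not the application's actual implementation.

#include <iostream>

struct ExerciseLimit {
    int granted; // plays granted by the license, e.g. 20
    int used;    // plays already consumed
};

// Called when playback stops; a play is counted only if more than 75% of
// the MP3 was played continuously, as the application defines "playing".
bool registerPlayback(ExerciseLimit& lim, double playedSec, double totalSec) {
    if (lim.used >= lim.granted) return false;   // limit already expired
    if (playedSec > 0.75 * totalSec) ++lim.used; // counts as one play
    return true;
}

int main() {
    ExerciseLimit lim{20, 1};                    // one play already consumed
    registerPlayback(lim, 180.0, 200.0);         // 90% played: counts
    std::cout << (lim.granted - lim.used) << "/" << lim.granted
              << " plays remaining\n";           // prints 18/20
}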

Figure 54 — User interface of AF Player application showing license of protected resource

Playing protected resource when exercise limit license already expired

When playing a protected resource whose exercise-limit license has already expired, the
application still allows the user to see the unprotected JPEG images and the file information;
however, the user cannot play the protected resource. As shown in Figure 55, the user was granted
two plays of the content and has already played the content continuously twice. When the user
tries to play the content one more time, the application shows a message that the exercise
limit has expired.

Figure 55 — User interface of AF Player application showing expired exercise limit


Playing protected resource when validity condition license already expired

When playing a protected resource whose validity-condition license has already expired, the
application still allows the user to see the unprotected JPEG images and the file information;
however, the user cannot play the protected resource. As shown in Figure 56, the Protected
Musical slide show application format file is valid from January 1, 2005 until July 16, 2006, and
the user is trying to play the resource on July 18, 2006.

Figure 56 — User interface of AF Player application showing expired license

Playing protected resource with different protection tool

When playing a resource protected with a different protection tool during the authoring process,
the application shows a message, as in Figure 57, telling the user that the application is unable to
unprotect the protected audio. This can happen, for example, if the user tries to play a protected
AF file acquired from a different author or producer that uses a different protection tool.

Figure 57 — User interface of AF Player application showing failure to unprotect the resource


4.3.2. Reference software

The reference software is built using the same platform and libraries as the reference software of
the Protected Music player AF, as described in section 3.1.5.2. Some modules are also based on
those of the Protected Music player AF; therefore, this section omits the technical description of
those software modules.

4.3.2.1. Authoring tool

The authoring tool is presented to demonstrate how the protected musical slide show can be
constructed. It has the following features:

- MP3 player
- JPEG display
- MP3-JPEG synchronization
- MP3-Timed text synchronization
- Timed text font, highlight color, and background color settings
- Content protection: MP3, JPEG, Timed text, LASeR (in schema only)
- Choice of protection tools: XOR, AES-128-ECB, AES-128-CBC, and AES-128-CFB (a minimal
  sketch of the XOR tool follows this list)
- Partial protection for MP3
- Region protection for JPEG (experimental, using the XOR tool)

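Among the protection tools listed above, XOR is the simplest. The minimal sketch below shows how a partial (byte-range) XOR protection of an MP3 buffer could look; it is our own illustration only, and the normative tool definitions and key handling are given in the specification. Since XOR is symmetric, applying the same operation with the same key unprotects the data.

    // Minimal illustration of XOR-based partial protection: only the bytes in
    // [start, end) are transformed, mirroring the "[s]"/"[e]" segment selection
    // in the authoring tool. XOR is symmetric, so the same call unprotects.
    #include <cstddef>
    #include <string>
    #include <vector>

    void xorProtect(std::vector<unsigned char>& data,
                    std::size_t start, std::size_t end,
                    const std::string& key)
    {
        if (key.empty() || start >= end || end > data.size()) return;
        for (std::size_t i = start; i < end; ++i)
            data[i] ^= static_cast<unsigned char>(key[(i - start) % key.size()]);
    }

    int main() {
        std::vector<unsigned char> mp3 = {0x49, 0x44, 0x33, 0x04, 0x00}; // fake MP3 bytes
        xorProtect(mp3, 1, 4, "key");   // protect bytes [1,4)
        xorProtect(mp3, 1, 4, "key");   // same key again restores the original bytes
    }

The AES modes listed above would replace this loop with calls to a cryptographic library, but the byte-range selection works the same way.
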
The authoring tool, whose user interface is shown in Figure 62, uses the following procedure to
create a Protected Musical slide show AF file.

- Select the MP3 by clicking the “Add MP3” button
- Select JPEG images by clicking the “Add JPEG” button, one image at a time; repeat to add more
- Select the timed text lyrics by clicking the “Add Text” button. The timed text lyrics file is a text
  file pre-formatted using the following rules (as shown in Figure 58):
  - Separate synchronized text using the slash “/” character
  - End the file with “/”

Figure 58 — Formatting the timed text

To synchronize images:

- Select the image to be synchronized
- Play the MP3 using the “Play” button, or drag the slider to the desired timestamp
- Select the animation effect
- Click the “Synchronize” button below the slider, and click “OK” to confirm

To synchronize text:

- Play the MP3 using the “Play” button
- Click the “Synchronize” button below the timed text viewer according to the synchronization
  rules

To add protection to audio:

- Double-click the audio file name in the resource list to invoke the protection window
  (Figure 59)
- Click “Protect” to protect the whole audio file
- To protect a specific segment, first click the “|>” button to play the audio. At the desired
  start timestamp, click the “[s]” button to start the protection; at the desired end timestamp,
  click the “[e]” button. Click the “[r]” button to reset
- Click “OK” to confirm

Figure 59 — MP3 protection user interface

To add protection to an image:

- Double-click the image file name in the resource list to invoke the protection window
  (Figure 60)
- Click “Protect” to protect the whole image region
- To protect a rectangular region, click within the image to set the top-left corner of the
  rectangle, then click once more to set the bottom-right corner
- Click “OK” to confirm

Figure 60 — Image protection user interface

To set the protection tool and license scheme:

- Click the “IPMP Tool” button to invoke the IPMP Tool window (Figure 61)
- Select one of the provided Tool IDs
- Input the protection key (any characters)
- Define the license validity range
- Input the desired exercise number, or check “Unlimited” to allow exercising the content an
  unlimited number of times
- Click “OK” to confirm

To protect the LASeR animation, check the “Protect” button below the slider.

To protect the timed text, check the “Protect” button above the timed text viewer.

Finally, click “Save” to save the musical slide show file.

A video tutorial on how to use the authoring tool is available on YouTube:
http://www.youtube.com/watch?v=hJfOaEGQxsE
Figure 61 — IPMP Tool and REL settings user interface

[Figure annotations: resource list, timestamp slider, timed text viewer]

Figure 62 — Authoring tool user interface

4.3.2.2. Player

The musical slide show player reference software is built to provide an example implementation
of how to extract content from the specified file format and play it. Implementations of the
protected musical slide show do not have to follow the algorithms of this reference software. It
has the following features:

- MP3-JPEG-Timed text synchronized playback
- ISO base media file format structure view
- MPEG-7 SMP structure view
- MPEG-21 IPMP/REL structure view
- SVG player (from MPEG output document N8821) for LASeR rendering

The user interface of the player, shown in Figure 63, uses the following procedure to play a
Protected Musical slide show AF file:

- Click “Open” to load the musical slide show file
- If the file is protected, input the protection key in the input window and click “OK” to
  continue
- Click “Play” to play the musical slide show
- The ISO base media file format structure, the MPEG-7 SMP, and the MPEG-21 IPMP are
  shown in the tree structure on the right side of the player, as shown in Figure 64.

Figure 63 — Player user interface


Figure 64 — Clockwise from top left: file structure, MPEG-7 and MPEG-21 structure view

4.4. MPEG-A Part 10: Video Surveillance Application Format

4.4.1. Initial implementation

The initial implementation of the Video surveillance AF was built as an experimental result for an
MPEG input proposal. It includes the file format creator and the metadata generator. Since the
current specification of the VS AF covers only basic requirements, some parts are not implemented
in the reference software.

4.4.1.1. Authoring tool

The authoring tool of the video surveillance MAF needs the assistance of an authorized user to
enter the information regarding the creation of the MAF, along with the information describing
the equipment settings, and takes the video data itself from the surveillance camera. These inputs
are then processed in the authoring tool by the following modules: video processing modules,
metadata modules, the MAF module, and the file writing module. The system architecture of the
authoring tool is shown in Figure 65.
[Figure 65 block diagram, Video surveillance AF creator: the surveillance camera supplies video data to the manual and automatic video processing modules; user, creation, and camera information feed the file-level and track-level metadata generators; the MAF module defines the MAF file structure using the MP4 file format generator, and the file writing module writes the VSAF file.]

Figure 65 — Video surveillance AF creator system architecture

There are two video processing modules: an automatic module and a manual module. The
automatic video processing module contains algorithms (which can be complex) to automatically
extract or, if necessary, alter the video, performing processing such as object segmentation, object
tracking, motion activity analysis, and color extraction, and generates item-type metadata such as
summary or segmentation information and visual descriptors. The manual video processing
module handles any user assistance to the video processing, such as manual video segmentation
or user-assisted object extraction, and, like the automatic module, generates item-type metadata.
The metadata output from these modules flows from the processing modules to the metadata
modules, while the processed video data is sent to the MAF module. The video modules use the
DirectShow library to render the video during processing.

The metadata modules consist of the collection-type metadata generator module and the item-type
metadata generator module. Collection-type metadata describes the equipment information,
creation information, and video collection information, while item-type metadata describes the
content of the video. The collection-type metadata generator module receives the creation
information entered initially by the user, while the equipment information is obtained from the
user based on the equipment's (camera) specification (it could also be obtained automatically
from the equipment itself). The item-type metadata generator receives input from both video
processing modules, as mentioned before. Both metadata generator modules send their output, as
metadata in the form of XML, to the MAF module, as depicted by the blue lines in the figure.

Static metadata contains the description of the creation of the VS AF, as given in Table 15.

The purpose of describing the equipment information, such as the lens and video settings, is
to enable processing of the video based on the capabilities of its source. For example, the
angle-of-view information of the lens can be used in 3D object tracking to determine whether an
object is located near the camera or far away from it. By knowing the video compression used by
the camera, the creator of the MAF can determine what kind of tool is suitable for that video
format.
Table 15 – Description of the creation of the surveillance AF

Element           Description
Comment           A brief description of the creation of the MAF file, described using
                  FreeTextAnnotation
Creator           The creator of the surveillance MAF. For now we use the PersonType
                  descriptor with the following descriptions: Name/GivenName,
                  Name/FamilyName, Affiliation/Organization/Name (three Names can be added)
CreationLocation  The location of the scene in the video
CreationTime      The time of creation of the video
Instrument        The camera settings. The following descriptions are used: tool name,
                  camera name, lens maker, lens focal ratio, lens focal length, lens focus
                  range, horizontal angle of view, vertical angle of view, video compression,
                  video compression profile, video resolution and video frame rate

Figure 66 shows the user interface for creating and displaying this metadata. The creation time
and the video resolution are obtained automatically from the video file. The metadata generated
by this application has been previously validated.

The content metadata that can be generated by the application are the summary description and
the video segmentation. Figure 67 shows how the video can be segmented manually. To segment
the video, simply play the video until the desired time point/position, then pause the video and set
it as a video segment (by default, a video is one big segment from the beginning to the end). The
start time and duration of each segment are displayed in the list, and a description for each
segment can be added if necessary. Only a sequential summary of the segments has been
implemented, but it is possible to group the segments to create a hierarchical summary.

A visual descriptor can also be generated as metadata. Currently we can generate the
MotionActivity descriptor based on the motion intensity of each region, where the regions are
described using the GridLayout descriptor. The intensity of the motion is calculated from the
amount of pixel difference between a frame and the previous frame. A threshold value is set so
that only large pixel differences are considered as “object is moving/active”. Based on this value,
we partition the frame into 8x8 blocks, and only blocks that have pixel activity (i.e., a large pixel
value difference) are considered active. From this, we can determine the magnitude of activity in
each region.
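
The following self-contained C++ sketch illustrates this computation; the threshold and per-block activity values are illustrative parameters, not the exact values used in the implementation.

    // Sketch of the motion-activity measure described above: threshold the
    // per-pixel difference between consecutive grayscale frames, mark 8x8
    // blocks containing enough changed pixels as "active", then count active
    // blocks per grid region (e.g., 3x3 for "lab", 4x4 for "traffic").
    #include <cstdlib>
    #include <vector>

    std::vector<int> motionActivity(const std::vector<unsigned char>& prev,
                                    const std::vector<unsigned char>& cur,
                                    int width, int height,
                                    int gridCols, int gridRows,
                                    int pixelThreshold = 30,        // illustrative values
                                    int activePixelsPerBlock = 16)
    {
        const int B = 8;                                // 8x8 pixel blocks
        std::vector<int> activity(gridCols * gridRows, 0);
        for (int by = 0; by + B <= height; by += B) {
            for (int bx = 0; bx + B <= width; bx += B) {
                int changed = 0;
                for (int y = by; y < by + B; ++y)
                    for (int x = bx; x < bx + B; ++x)
                        if (std::abs(int(cur[y * width + x]) - int(prev[y * width + x]))
                                > pixelThreshold)
                            ++changed;
                if (changed >= activePixelsPerBlock) {   // block is "active"
                    int gx = bx * gridCols / width;      // map block to grid region
                    int gy = by * gridRows / height;
                    ++activity[gy * gridCols + gx];      // region magnitude = active blocks
                }
            }
        }
        return activity;                                 // higher value = more motion
    }

    int main() {
        const int W = 24, H = 24;
        std::vector<unsigned char> a(W * H, 0), b(W * H, 0);
        for (int i = 0; i < 8 * 8; ++i) b[(i / 8) * W + (i % 8)] = 255; // top-left motion
        std::vector<int> act = motionActivity(a, b, W, H, 3, 3);
        // act[0] is now nonzero; the other eight regions stay zero.
    }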

One purpose of describing the activity of a region is to identify one or more regions within the
frames that need more attention. This can be useful for a camera that covers a fairly large area
where only some parts of the frame need to be monitored. For example, in the “traffic” sequence,
we might need to consider only the road part of the frame.

Figures 68 and 69 show the magnitude of activity for the “lab” sequence and the “traffic”
sequence, respectively. For the “lab” sequence, a 3x3 grid of regions is used, while for the
“traffic” sequence a 4x4 grid is used. From the experiment, we can say that for the “lab”
sequence, region no. 5 might need special attention, because it covers the alley part of the lab.
For the “traffic” sequence, special attention might be paid to regions no. 9 to 12, because they
cover the road part.

The MAF module defines all the data necessary for creating the MAF boxes. This module
contains an MPEG file format library (here we use the ISOLib library). The module receives
two types of input data: metadata and video data. Metadata is put in the 'meta' box, while
video data is put in the 'mdat' box. After these MAF definitions (the values of the box fields)
are set, the writing module writes the MAF file as the output of the authoring tool.
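
To make the box packaging concrete, the sketch below writes the basic size-and-type framing that ISO base media file format boxes use. It is an illustration only: a real VS AF file needs the complete required structure ('ftyp', 'moov', handler boxes, and so on), which ISOLib takes care of, and the real 'meta' and 'xml ' boxes are FullBoxes with additional version/flags fields that are omitted here.

    // Minimal box framing per the ISO base media file format: each box starts
    // with a 32-bit big-endian size (including the 8-byte header) followed by
    // a four-character type. This only illustrates how XML metadata and video
    // bytes could be framed; real files need much more structure.
    #include <cstdint>
    #include <fstream>
    #include <string>

    void writeBox(std::ofstream& out, const char type[5], const std::string& payload) {
        std::uint32_t size = 8 + static_cast<std::uint32_t>(payload.size());
        unsigned char hdr[8] = {
            static_cast<unsigned char>(size >> 24), static_cast<unsigned char>(size >> 16),
            static_cast<unsigned char>(size >> 8),  static_cast<unsigned char>(size),
            static_cast<unsigned char>(type[0]), static_cast<unsigned char>(type[1]),
            static_cast<unsigned char>(type[2]), static_cast<unsigned char>(type[3])
        };
        out.write(reinterpret_cast<const char*>(hdr), 8);
        out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
    }

    int main() {
        std::ofstream out("sketch.maf", std::ios::binary);      // hypothetical output name
        writeBox(out, "xml ", "<CreationInformation/>");        // simplified metadata payload
        writeBox(out, "mdat", std::string("\x00\x00\x01", 3));  // stand-in for video bytes
    }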

[Figure annotation: fields obtained automatically from the file]

Figure 66 — Static description metadata user interface

Figure 67 — Segmenting the video


[Figure content: a 3x3 grid with regions numbered 1 to 9]

Figure 68 — Grid layout and motion activity for “lab” sequence

[Figure content: a 4x4 grid with regions numbered 1 to 16]

Figure 69 — Grid layout and motion activity for “traffic” sequence

4.4.1.2. Player

The video surveillance AF player basically does the reverse of the authoring tool. It parses the
boxes inside the MAF to extract the metadata and the video stored within it, and plays the
video based on the information described in the metadata. As shown in Figure 70, the player
consists of the following modules: MAF parser, XML parser, video renderer, and display unit.

The MAF parser first reads the MAF file to determine whether a valid MAF file has been loaded
into the player. Next, it parses all the boxes inside the MAF and extracts the data within them.
The video data itself, stored in the 'mdat' box of the MAF file, is also extracted by the MAF
parser module.
Based on the data from the MAF parser module, the XML parser parses the descriptions of
the collection-type and item-type metadata, storing the values in memory, where they are used
to define how the video data will be rendered/played. This module uses the MSXML library to
parse the metadata; it walks through the nodes of the XML file and obtains the values.
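
As an illustration of this step, the sketch below loads an XML fragment into an MSXML DOM and queries a node value. It assumes MSXML 3.0 (msxml2.h) on Windows, and the XML content and XPath expression are illustrative, not taken from the actual metadata schema.

    // Sketch of metadata extraction with MSXML (COM): load an XML string into
    // a DOM and query a node value. Error handling is trimmed for brevity.
    #include <comdef.h>     // _bstr_t
    #include <msxml2.h>
    #include <objbase.h>
    #include <iostream>
    #pragma comment(lib, "msxml2.lib")

    int main() {
        CoInitialize(nullptr);
        IXMLDOMDocument* doc = nullptr;
        CoCreateInstance(__uuidof(DOMDocument30), nullptr, CLSCTX_INPROC_SERVER,
                         __uuidof(IXMLDOMDocument), reinterpret_cast<void**>(&doc));

        VARIANT_BOOL ok = VARIANT_FALSE;
        doc->loadXML(_bstr_t(L"<CreationInformation>"
                             L"<CreationTime>2007-01-01</CreationTime>"
                             L"</CreationInformation>"), &ok);

        IXMLDOMNode* node = nullptr;
        doc->selectSingleNode(_bstr_t(L"//CreationTime"), &node); // illustrative query
        if (node) {
            BSTR text = nullptr;
            node->get_text(&text);
            std::wcout << L"CreationTime: " << text << L"\n";
            SysFreeString(text);
            node->Release();
        }
        doc->Release();
        CoUninitialize();
    }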

[Figure 70 block diagram, Video surveillance AF player: the file format parser reads the file type, movie, metadata, and tracks from the VSAF file; the XML parser and the video decoder feed the display unit.]

Figure 70 — Video surveillance AF player system architecture

The video renderer renders the video based on the information obtained from the XML parser
(depicted by the dotted orange line in the figure). The information on how the video will be
rendered includes the video segmentation and the visual descriptors. Along with the video data,
the renderer is also responsible for rendering any descriptors, such as the grid or object markers.
As in the authoring tool, we use the DirectShow library in this module to perform this job.
Finally, the display unit displays the rendered video (depicted by the blue line in the figure) and
any static and visual descriptions (green line) in the user interface.

Figure 71 — Video surveillance AF creator and player user interface


The player part can parse the MAF file, extract the metadata and video parts, and play the video.
Based on the segment descriptions in the metadata, the user can jump to the positions they
describe. The user can also see the camera information and the intensity of motion in the regions
described by the grid layout descriptors. The user interface is shown in Figure 71.

4.4.2. Reference software

The reference software for the MPEG-A Part 10 standardization development is being developed
as joint work with Kingston University, UK. ICU is responsible for the conformance points in the
reference software. This subsection follows the descriptions in the MPEG contribution documents
and output documents of the reference software.

The Video surveillance application format reference software is normative in the sense that it
correctly implements the normative clauses contained in ISO/IEC 23000-10. Conforming ISO/IEC
23000-10 implementations are not expected to follow the algorithms or the programming
techniques used by the Video surveillance application format reference software. Although the
packing software is considered normative, it is not expected to add anything normative to the
Video surveillance application format textual clauses included in ISO/IEC 23000-10. The
reference software consists of an authoring tool (Video Surveillance Packer, VSP) and a player
(Video Surveillance Viewer, VSV).

4.4.2.1. Authoring tool

The authoring tool is a simple packager. It packages XML data and AVC video into the VS AF
file format using a modified ISOLib library. The modification adds the CAMI and VSMI boxes
as described in the specification.

An initialization settings file (an .INI file) is used to define the UUIDs of the AVC video and the
role of the XML metadata to be packaged. As shown in Figure 72, the content handler reads the
initialization file and, accordingly, loads the XML metadata and AVC video. The file format
generator then packages the contents and produces a VS AF file that conforms to the VS AF
specification.
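
For illustration, such a settings file might look as follows; the section and key names here are hypothetical, since the actual keys are defined by the reference software.

    ; Hypothetical VSP initialization settings; the key names are
    ; illustrative only and not taken from the actual reference software.
    [video]
    file = camera01.264                            ; AVC elementary stream to package
    uuid = 6B6C6D6E-0000-1000-8000-00AA00389B71    ; UUID assigned to the video

    [metadata]
    file = camera01_mpeg7.xml                      ; MPEG-7 description to package
    role = VSMI                                    ; target metadata box (e.g., CAMI or VSMI)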

[Figure 72 block diagram, Video surveillance AF packager: the content handler reads the initialization settings, XML metadata, and AVC video, and passes them to the file format generator, which uses ISOLib to produce the VSAF file.]

Figure 72 — VSP system architecture


4.4.2.2. Player

The player is a simple application that unpacks the contents of the VS AF file. It also
contains a library for MPEG-7 metadata parsing, the MP7JRS library, and the JM decoder for
decoding the AVC bitstream. It works similarly to the player in the initial implementation of
the surveillance AF; however, as shown in Figure 73, additional conformance check modules for
both the MPEG-7 metadata and the AVC video are implemented to ensure the conformance of
the metadata and video data inside the VS AF file. As shown in Figure 74 and Figure 75, the
user interface of the reference software displays information regarding the conformance points
of the VS AF file unpacked by the player.

[Figure 73 block diagram, Video surveillance AF player: the file format parser reads the file type, movie, metadata, and tracks; the XML parser and the video decoder each pass through a conformance check module before the display unit.]

Figure 73 — VSV system architecture

Figure 74 — VSV main screen showing contents and meta-data


Figure 75 — VSV track player screen showing contents and meta-data

5. Achievements

In this section we present the achievements that have been made during the development of each
standard. For each MAF, we list the MPEG input contributions, MPEG output documents, and papers.

5.1. MPEG-A Part 3 2nd Edition: Protected Music Player Application Format

5.1.1. MPEG input contributions


a. M12197, Hendry, Munchurl Kim, Protecting and Governing Music MAF Player Format
based Contents by using MPEG-21 IPMP, Poznan, Poland, July 2005.
This input document is the initial proposal to the protection of Music player AF using MPEG-
21 IPMP.
b. M12588, Hendry, Munchurl Kim, Florian Pestoni, A Flexible and Extensible Protection of
Music Player MAF using Lightweight MPEG-21 IPMP, Nice, France, October 2005.
This input document is the updated version of the previous input document, specifying a
profiled MPEG-21 IPMP named “Lightweight MPEG-21 IPMP”.
c. M12855, Hendry, Munchurl Kim, Takafumi Ueno, Shen ShengMei, ZhongYang Huang,
Florian Pestoni, Satoshi Ito, Jeho Nam, Protected Music Player MAF based on MPEG-21
IPMP and REL Profiles, Bangkok, Thailand, January 2006.
This input document is the updated version of the previous input document, adding rights
expression information to the protection using the REL profile.
d. M13223, Hendry, Munchurl Kim, Florian Pestoni, Zhongyang Huang, Updated Text for WD
of AMD1 Protected Music Player MAF — Section 2, Montreux, Switzerland, April 2006.
This input document is the updated version of the output document from the previous meeting
regarding the encryption scheme for the Protected Music player AF.
e. M13370, Zhongyang Huang, Shengmei Shen, Takafumi Ueno, Hendry, Munchurl Kim, Idea
to harmonize section 1 and section 2 of WD 1.0 AMD/1 Protected MPEG-A Music Player,
Montreux, Switzerland, April 2006.
This input document is the initial working draft of the Protected Music player AF, based on the
working draft of the protection scheme described in the output documents of the previous
meeting. The result of this input document is the working draft of the Protected Music player AF.
f. M13658, Houari Sabirin, Jeongyeon Lim, Hendry, Munchurl Kim, Contribution to Reference
Software of ISO/IEC 23000-2: MPEG Music Player Application Format, Klagenfurt, Austria,
July 2006.
This input document is the initial implementation of the Music player AF. The work done in this
document is used as the basis for the development of the reference software of the Protected
Music player AF.
g. M14172, Hendry, Houari Sabirin, Munchurl Kim, Contribution for Protected Music Player
MAF Reference Software, Marrakech, Morocco, January 2007.
This input document is the initial implementation of Protected Music player AF reference
software.
h. M14176, Hendry, Houari Sabirin, Munchurl Kim, Editor's Study of ISO/IEC FCD 23000-2
MPEG-A, Music Player 2nd Edition, Marrakech, Morocco, January 2007.

5.1.2. MPEG output documents


a. N7874, Working Draft of AMD/1 Protected MPEG-A Music Player Section 1, Bangkok,
Thailand, January 2006.
This output document is the description of file structure specification for Protected Music
player AF.
b. N7875, Working Draft of AMD/1 Protected MPEG-A Music Player Section 2, Bangkok,
Thailand, January 2006.
This output document is the description of encryption scheme specification for Protected
Music player AF.
c. N8091, Working Draft of 2nd Edition of MPEG-A Music Player Section 2: Protected Music
Player, Montreux, Switzerland, April 2006.
This output document is the result of input document M13370 by combining output document
N7874 and N7875.
d. N8359, ISO/IEC CD 23000-2 MPEG-A Music Player 2nd edition, Klagenfurt, Austria, July
2006.
This output document advances the working draft produced at the previous meeting.
e. N8582, ISO/IEC FCD 23000-2 MPEG-A Music Player 2nd edition, Hangzhou, China,
October, 2006.
This output document advances the committee draft produced at the previous meeting.
f. N8583, Reference Software Workplan for MPEG-A Music Player 2nd edition, Hangzhou,
China, October, 2006.
This output document describes the workplan for developing the reference software for
Protected Music player AF, based on the contribution M13658.
g. N8820, Study of ISO/IEC FCD 23000-2 MPEG-A Music Player 2nd edition, Marrakech,
Morocco, January 2007.
This output document advances the final committee draft produced at the previous meeting.
h. N8821, Reference Software Workplan v2.0 for MPEG-A Music Player 2nd edition including
initial software, Marrakech, Morocco, January 2007.
This output document describes the workplan for developing the reference software for
Protected Music player AF based on the contribution M14172.
i. N9122, Text of ISO/IEC 23000-2 FDIS Music Player Application Format 2nd Edition, San
Jose, USA, April 2007.
This output document is the final draft international standard of the Protected Music player AF,
concluding the work on the Protected Music player AF.
5.2. MPEG-A Part 4: Musical Slide Show Application Format

5.2.1. MPEG input contributions


a. M12396, Jeongyeon Lim, Munchurl Kim, Synchronization of Multiple JPEG data to MP3
tracks in Music MAF Player Format, Poznan, Poland, July 2005.
This input document is the initial proposal of implementing slide show in Music player AF.
b. M12589, Chansuk Yang, Jeongyeon Lim, Munchurl Kim, Extensions to Music MAF Player
Format for Multiple JPEG images and Text data with Synchronizations to MP3 data, Nice,
France, October 2005.
This input document is the updated proposal of implementing slide show to Music player AF
with addition of text synchronization.
c. M13673, Houari Sabirin, Jeongyeon Lim, Hendry, Munchurl Kim, Contribution to Reference
Software of ISO/IEC 23000-4: MPEG Musical Slideshow Application Format, Klagenfurt,
Austria, July 2006.
This input document is the initial implementation of Musical slide show AF. Most parts of the
contribution are used as the basis of the development of the Protected Musical slide show AF.
d. M13563, H. Jean Cha, Tae Hyeon Kim, Harald Fuchs, Munchurl Kim, Updated text for WD
1.0 Musical Slide Show MAF, Klagenfurt, Austria, July 2006.
This input document is the updated working draft of Musical slide show AF.
e. M14184, Hyouk-Jean Cha, Tae Hyeon Kim, Harald Fuchs, Munchurl Kim, Editor's study text
of ISO/IEC FCD 23000-4 Musical slide show MAF, Marrakech, Morocco, January 2007.
This input document is the updated final committee draft of the Musical slide show AF.
5.2.2. MPEG output documents
a. N8131, WD of ISO/IEC 23000-4 (Musical Slide Show MAF), Montreux, Switzerland, April
2006.
This output document is the first working draft of Musical slide show AF.
b. N8397, Text of ISO/IEC 23000-4/CD (Musical Slide Show MAF), Klagenfurt, Austria, July
2006.
This output document is the update of the working draft from the previous meeting. It
incorporates the proposal on the synchronization method for timed text and JPEG images as one
slide show sample.
c. N8674, Text of ISO/IEC 23000-4/FCD (Musical Slide Show MAF), Hangzhou, China,
October, 2006.
This output document is the update of the committee draft from the previous meeting.
d. N8880, Study Text of ISO/IEC 23000-4/FCD (Musical Slide Show MAF), Marrakech,
Morocco, January 2007.
This output document is the update of the final committee draft from the previous meeting.
e. N9038, Text of ISO/IEC 23000-4/FDIS (Musical Slide Show MAF), San Jose, USA, April
2007.
This output document is the final draft international standard of the Musical slide show AF,
concluding the work on the Musical slide show AF.
5.2.3. Papers
a. Muhammad Syah Houari Sabirin, Munchurl Kim, "Authoring Tool of Musical Slide Show MAF
Content", 2006 Conference of the Korean Society of Broadcast Engineers, November 10, Seoul
National University of Technology.
5.3. MPEG-A Part 4 2nd Edition: Protected Musical Slide Show Application Format

5.3.1. MPEG input contributions


a. M13722, Houari Sabirin, Hendry, Munchurl Kim, Proposal to Improve Musical Slideshow
File Format, Klagenfurt, Austria, July 2006.
This input document is an initial proposal for Protected Musical slide show AF. It contains
very basic information on the idea of protecting contents in Musical Slide show AF.
b. M14175, Hendry, Houari Sabirin, Munchurl Kim, Proposal for Protected Musical Slide Show
MAF with IPMP, Marrakech, Morocco, January 2007.
This input document is a proposal for adding content protection in Musical slide show AF
using MPEG-21 IPMP and MPEG-21 REL. It contains metadata instantiation examples of
content protection for Musical slide show AF.
c. M14477, Houari Sabirin, Hendry, Munchurl Kim, Updated Proposal for Protected Musical
Slide Show MAF with IPMP, San Jose, USA, April 2007.
This input document is updated text of previous proposal of Protected Musical slide show AF.
It provides more detailed descriptions and metadata instantiation examples for protected
Musical slide show AF.
d. M14644, Houari Sabirin, Munchurl Kim, Proposed text for Protected Musical Slide Show
MAF PDAM, Lausanne, Switzerland, July 2007.
This input document is a proposed text for amendment of Musical slide show AF. It contains
updated description of technical specifications and metadata schemes for protecting contents
of Musical slide show AF.
e. M15124, Houari Sabirin, Munchurl Kim, Use cases for content protection in Musical slide
show Application Format 2nd Edition, Antalya, Turkey, January 2008.
This input document describes some use cases for content protection in the Protected Musical
slide show AF. The proposed use cases have been incorporated into the Annex of FDIS
ISO/IEC 23000-4.
f. M15417, Houari Sabirin, Munchurl Kim, ISO/IEC 23000-4 2nd Edition Reference Software,
Archamps, France, April 2008.
This input document describes the specifications of the Protected Musical slide show AF
reference software. It contains the descriptions of the authoring tool and the player. It also
provides a description of the conformant files created by the authoring tool. The proposed
reference software has been incorporated into Amendment 2 of FDIS ISO/IEC 23000-4
and is now in PDAM status.
5.3.2. MPEG output documents
a. N9040, WD1.0 of ISO/IEC 23000-4/Amd.2 Protected Musical Slide Show, San Jose, USA,
April 2007.
This output document is the first working draft of Protected Musical slide show AF based on
input documents M13722, M14175 and M14477.
b. N9290, Text of ISO/IEC 23000-4/CD Musical Slide Show 2nd Edition, Lausanne,
Switzerland, July 2007.
This output document advances the working draft produced at the previous meeting. It
includes the updated input document M14644.
c. N9389, Text of ISO/IEC 23000-4/FCD Musical Slide Show 2nd Edition, Shenzhen, China,
October 2007.
This output document advances the committee draft produced at the previous meeting.
d. N9691, Study Text of ISO/IEC FCD 23000-4 Musical Slide Show 2nd Edition, Antalya,
Turkey, January 2008.
This output document advances the final committee draft produced at the previous meeting.
e. N9843, Text of ISO/IEC FDIS 23000-4 Musical Slide Show 2nd Edition, Archamps, France,
April 2008.
This output document is the final draft international standard of the Protected Musical slide
show AF, concluding the work on the Protected Musical slide show standardization. Included in
the document is the contribution M15124.
f. N9847, ISO/IEC 23000-4:2008/PDAM2 Protected MSS Conf. & Ref. Software, Archamps,
France, April 2008.
This document is the committee draft of amendment no. 2 of the Protected Musical slide show
(the first amendment is the reference software for the Musical slide show).
5.3.3. Papers
a. Muhammad Syah Houari Sabirin, Hendry, Munchurl Kim, "Musical Slide Show MAF with
Protection and Governance using MPEG-21 IPMP Components and REL," 19th IS&T/SPIE
Symposium on Electronic Imaging: Multimedia on Mobile Devices 2007, January 2007, San
Jose, California, USA.

5.4. MPEG-A Part 10: Video Surveillance Application Format

5.4.1. MPEG input contributions


a. M14173, Jeongyeon Lim, Houari Sabirin, Munchurl Kim, Proposal for Surveillance MAF,
Marrakech, Morocco, January 2007.
This input document is the initial proposal for VS AF. It includes the proposal for the file
format and the MPEG-7 MDS and Visual to be implemented in VS AF. The document also
includes some use cases.
b. M14486, Houari Sabirin, Jeongyeon Lim, Munchurl Kim, A Proposal for Basic Video
Surveillance Application Format, San Jose, USA, April 2007.
This input document is the update of the previous proposal for the VS AF, containing a specific
description of the file format and MPEG-7 metadata based on the requirements for the basic
version of the VS AF.
c. M14645, Houari Sabirin, Munchurl Kim, MPEG-7 core description profile and visual
descriptors for Video Surveillance MAF, Lausanne, Switzerland, July 2007.
This input document proposes the use of the MPEG-7 core description profile as the starting
point for creating the MPEG-7 profile for the VS AF, based on the requirements of the VS AF as
described in the MAF Overview document from the previous meeting.
d. M14893, Houari Sabirin, James Annesley, Munchurl Kim, and James Orwell, Visual
Surveillance Multimedia Application Format: MPEG-7 Profile, Shenzhen, China, October
2007.
This input document proposes the MPEG-7 profile for VS AF. Most of the descriptions are
based on the input document M14645.
e. M15426, Houari Sabirin, Munchurl Kim, Proposal for the usage of MPEG-7 and MPEG-21 in
Advanced Video Surveillance AF, Archamps, France, April 2008.
This input document proposes the use of more elements in MPEG-7 and MPEG-21 for the
next stage of development of VS AF.
f. M15472, James Annesley, Houari Sabirin, Update of ISO/IEC 23000-10/Amd1 WD1.0
Conformance and Reference Software, Archamps, France, April 2008.
This input document is an update to the working draft of the VS AF conformance and
reference software produced at the previous meeting.
5.4.2. MPEG output documents
a. N9295, Text of ISO/IEC 23000-10/CD (Video Surveillance MAF), Lausanne, Switzerland,
July 2007.
This output document includes the proposal described in input document M14645.
b. N9412, Study Text of ISO/IEC 23000-10/FCD (Video Surveillance Application Format),
Shenzhen, China, October 2007.
This output document is the updated result of the VS AF committee draft. It includes the
proposal described in input document M14893.
c. N9706, Text of ISO/IEC FCD 23000-10 (Video Surveillance Application Format), Antalya,
Turkey, January 2008.
This output document advances the final committee draft produced at the previous meeting.
d. N9707, Text of ISO/IEC 23000-10/AMD1 WD1.0 Conformance and Reference Software,
Antalya, Turkey, January 2008.
This output document is the initial work of developing conformance files and reference
software for VS AF. It includes the first version of the reference software.
e. N9856, Study Text of ISO/IEC FCD 23000-10 (Video Surveillance Application Format),
Archamps, France, April 2008.
This output document advances the final committee draft produced at the previous meeting.
f. N9857, Text of ISO/IEC 23000-10/AMD1 WD2.0 Conformance and Reference Software,
Archamps, France, April 2008.
This output document is the updated version of the working draft from the previous meeting. It
includes the second version of the reference software.
g. N9858, Future Work on Surveillance AF's - collection of requirements, Archamps, France,
April 2008.
This output document is the list of work to be done for the next stage of development of VS
AF. It includes the proposal described in input document M15426.
5.4.3. Papers
a. Wonsang You, M.S. Houari Sabirin, and Munchurl Kim, "Moving Object Tracking in
H.264/AVC bitstream," MCAM 2007, LNCS 4577, pp.483-492.
b. M. Syah Houari Sabirin, Munchurl Kim, "Computation of MPEG-7 Motion Descriptors in
AVC|H.264 Bitstreams for Video Surveillance MAF", 2007 Conference of the Korean Society of
Broadcast Engineers, pp. 117-118, November 3, Changeui Hall, Science and Engineering
Campus, Korea University.

6. Conclusions

This document describes the work on the project of standardizing several parts of the MPEG-A
Multimedia Application Format. MPEG-A is a standard from the Moving Picture Experts Group
(MPEG) that specifies storage formats combining existing technologies to create rich-content
multimedia applications.

The work spans over two years and has resulted in three final drafts of international standards:
MPEG-A Part 3 2nd Edition: Protected Music Player Application Format, MPEG-A Part 4: Musical
Slide Show Application Format, and MPEG-A Part 4 2nd Edition: Protected Musical Slide Show
Application Format. One part, MPEG-A Part 10: Video Surveillance Application Format, is still in
final committee draft status, and its reference software is still in development (to be completed as
soon as the specification reaches final draft status).

Most of the contributions proposed during the development of each MAF have been promoted to
output documents, such as the implementation of protection in the Music player application format
and the Musical slide show application format, the synchronization of JPEG images for the slide
show, and the metadata profile for the Video surveillance application format. The work has also
resulted in reference software for the Protected Music player application format, the Protected
Musical slide show application format (which also includes the non-protected part), and some parts
of the Video surveillance application format.
7. References

1. ISO/IEC JTC1/SC29/WG11 MPEG2005/N7068, Busan, Korea, April 2005, White Paper on MPEG-A
2. ISO/IEC JTC1/SC29/WG11 MPEG2008/N9840, Archamps, France, April 2008, MAF Overview
3. ISO/IEC 14496-12:2005, Information technology – Coding of audio-visual objects – Part 12: ISO base media
file format
4. ISO/IEC 14496-14:2003, Information technology – Coding of audio-visual objects – Part 14: MP4 file format
5. ISO/IEC 14496-20:2006, Information technology – Coding of audio-visual objects – Part 20: Lightweight
Application Scene Representation (LASeR) and Simple Aggregation Format (SAF)
6. ISO/IEC 15938-1, Information technology – Multimedia content description interface – Part 1: Systems
7. ISO/IEC 15938-3, Information technology – Multimedia content description interface – Part 3: Visual
8. ISO/IEC 15938-5, Information technology – Multimedia content description interface – Part 5: Multimedia
description schemes
9. ISO/IEC 21000-2, Information technology – Multimedia framework (MPEG-21) – Part 2: Digital Item
Declaration
10. ISO/IEC 21000-4, Information technology – Multimedia framework (MPEG-21) – Part 4: Intellectual Property
Management and Protection Components
11. ISO/IEC 21000-5, Information technology – Multimedia framework (MPEG-21) – Part 5: Rights Expression
Language
12. ISO/IEC 21000-17, Information technology – Multimedia framework (MPEG-21) – Part 17: Fragment
identification of MPEG resources
13. 3GPP TS 26.245, Transparent end-to-end Packet switched Streaming Service (PSS); Timed text format, V7.0.0,
2007-06-21
14. ISO/IEC JTC1/SC29/WG11 MPEG2006/M13658, Houari Sabirin, Jeongyeon Lim, Hendry, Munchurl Kim,
Contribution to Reference Software of ISO/IEC 23000-2: MPEG Music Player Application Format, Klagenfurt,
Austria, July 2006
15. ISO/IEC JTC1/SC29/WG11 MPEG2007/M14172, Hendry, Houari Sabirin, Munchurl Kim, Contribution for
Protected Music Player MAF Reference Software, Marrakech, Morocco, January 2007
16. ISO/IEC JTC1/SC29/WG11 MPEG2006/M13673, Houari Sabirin, Jeongyeon Lim, Hendry, Munchurl Kim,
Contribution to Reference Software of ISO/IEC 23000-4: MPEG Musical Slideshow Application Format,
Klagenfurt, Austria, July 2006.
17. ISO/IEC JTC1/SC29/WG11 MPEG2008/M15417, Houari Sabirin, Munchurl Kim, ISO/IEC 23000-4 2nd
Edition Reference Software, Archamps, France, April 2008.
18. ISO/IEC JTC1/SC29/WG11 MPEG2008/N9856, Study Text of ISO/IEC FCD 23000-10 (Video Surveillance
Application Format), Archamps, France, April 2008.
19. ISO/IEC JTC1/SC29/WG11 MPEG2008/N9857, Text of ISO/IEC 23000-10/AMD1 WD2.0 Conformance and
Reference Software, Archamps, France, April 2008.
