
Proceeding of the SMPTE 2008 Annual Technical Conference, Oct 28-30, 2008.

H.264 Parameter Optimizations for Internet Based Distribution of High Quality Video
Kourosh Soroushian1, Shaiwal Priyadarshi1 and John Villasenor2
The digital video revolution is evolving from a physical-media distribution model to electronic-media distribution models that utilize Content Delivery Networks (CDNs) and Consumer Grade Networks (CGNs, such as residential Internet and in-home networks) for delivery of content to devices. The utilization of the Advanced Video Coding (AVC/H.264) standard is prevalent in today's optical and broadcast industries, but the adoption of this standard at bit-rates suitable for CDN/CGN distribution has not yet materialized in a unified and open specification for resolutions including full-HD (1080p) video. In this paper we present a set of empirical and scientific measurements (PSNR and SSIM) collected through over 6,500 H.264 encodings of a set of content samples in order to determine the optimal compression settings for delivering a high-quality viewing experience across CDNs/CGNs. Based on this research, a specific set of operating points has been devised to maximize compatibility across both personal computer (PC) and consumer electronics (CE) platforms, resulting in high-quality video encoded at rates up to 40% lower than the H.264 Level 4 data rates while still maintaining good visual quality. To achieve the higher compression ratios necessary for network delivery, compression systems may need to reduce the quality of the user experience of certain operations. In this paper we also examine the effect of the CDN/CGN compression settings on visual search, and offer a method that improves the user experience beyond traditional visual search on optical media. The proposed solution offers smooth visual-search capability in both the forward and reverse directions, operating at speeds from 2x to 200x, implementable on both PCs and CE devices that access content from optical disks or electronic sources across CDNs/CGNs. When combined, these two features are expected to provide a high-quality user experience for content targeted at delivery over many types of networks.

1. DivX, Inc. email: (ksoroushian, spriyadarshi)@divxcorp.com 2. Electrical Engineering Department, UCLA e-mail: jvilla@attglobal.net

1 Introduction
Since its adoption as an ISO/IEC standard, the H.264 specification has gained tremendous momentum in the digital media industry. The applications for this video specification range from optical media (such as the Blu-ray Disc/BD specification) to digital broadcast (DVB, ATSC) to videos watched and distributed via IP networks such as the World Wide Web (Adobe Flash now supports H.264). The popularity of H.264 to a large part is due to its considerable bit-rate savings over legacy standards such as MPEG-2 and MPEG-4, which, while varying over a wide range, has been estimated to average 63% and 37%, respectively [1]. This significant bit-rate saving has helped to enable new markets which previously were not economically feasible: digital broadcast of high-definition (HD) content via satellite is now more viable, and distributing video over IP networks can be enabled at bandwidths that are available over consumer-grade networks such as DSL and Cable Modems (in the context of this paper, HD is taken to mean 1080p). Furthermore, the adoption of H.264 as a standard has ensured interoperability across a wide range of product applications and consumer electronic devices. Similar to legacy video standards, H.264 has selected certain operating points via the profile and level designations. Profiles are put into place to ensure compatibility across the different spectrum of encoding tools that a standard may offer, whereas levels are used in limiting operational parameters such as bit-rate, resolution, and Macroblock rate. Among the most popular profiles for H.264 is the High Profile, which extends the H.264 Main Profile with tools that are part of the MPEG-2 specification. The High Profile has been selected by a number of broadcast and optical-media forums. DivX, Inc. (in the context of this paper, we will refer to DivX, Inc. 
as DivX, and will use trademark identifiers where appropriate to differentiate the technology from the company) intends to support the H.264 format, and is planning to use High Profile Level 4.0 as the basis of its primary combination of profile and level for its future high-definition encoders, decoders and content services.¹ We recommend some restrictions within the confines of the High Profile Level 4.0 parameter set. These restrictions aim to strike a good balance amongst the following factors:
• Visual quality
• Average bit-rate
• Consumer Electronics (CE) device performance
Based on specific use-case scenarios of digital video encode, decode, and service product categories, certain other factors have been considered in deciding the profile restrictions. These additional factors include:
• Support for H.264 in CE ASIC decoders
• Content Distribution Network (CDN) bandwidths
• In-home wireless network bandwidths
• Encoder and decoder performance on Personal Computers
The overall operating parameters suggested in this paper will be rolled out in a program known as DivX Plus™, and will be backed by a DivX certification program which can help ensure compatibility between different decoding devices.

1.1 Summary of Internet Profile Constraints for Decoders

We placed additional restrictions beyond High Profile Level 4.0 on our selected profile/level combination based on the capabilities of the entire set of addressable devices. These restrictions have been verified via a series of objective encoding tests, and the combination of all restrictions verified via a set of single stimulus quality evaluation (SSQE) subjective tests. Results for both sets of tests will be presented in the following sections. The additional restrictions on H.264 High Profile Level 4.0 that form the DivX H.264 Decoder Profile include:
• A limit on the maximum consecutive B-pictures (3)
• A maximum limit on an IDR interval (4 seconds)
• Flexible picture resolution assignment for progressive frames:
  o A minimum and maximum picture width range (320 to 1920 columns)
  o A minimum and maximum picture height range (240 to 1080 lines)
• A set of predetermined horizontal and vertical resolution combinations for interlaced coding
• A set of predetermined video frame-rates (between 24 Hz and 60 Hz)
  o For frame-rates greater than 30 Hz, the maximum picture resolution is 1280 pixels by 720 lines
• Disallowing of mixed slice types, unpaired fields, and mid-stream change of entropy coding modes

¹ Many of the features for optimization of video for internet-based distribution include patent-pending inventions from DivX.
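The progressive-frame constraints listed above can be expressed as a simple parameter check. The following is a minimal sketch rather than a DivX tool: the `StreamParams` type and `profile_violations` function are hypothetical names, the interlaced resolution combinations and slice-type rules are not modeled, and the lower frame-rate bound is relaxed slightly on the assumption that 23.976 fps counts as nominal 24 Hz.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamParams:
    """Hypothetical description of a progressive H.264 stream."""
    width: int
    height: int
    frame_rate: float          # frames per second
    max_consecutive_b: int     # longest run of B-pictures
    idr_interval_s: float      # longest time between IDR pictures


def profile_violations(p: StreamParams) -> List[str]:
    """Return the decoder-profile constraints that the stream breaks."""
    problems = []
    if p.max_consecutive_b > 3:
        problems.append("more than 3 consecutive B-pictures")
    if p.idr_interval_s > 4.0:
        problems.append("IDR interval longer than 4 seconds")
    if not 320 <= p.width <= 1920:
        problems.append("width outside 320 to 1920 columns")
    if not 240 <= p.height <= 1080:
        problems.append("height outside 240 to 1080 lines")
    # Assumption: 23.976 fps is treated as nominal 24 Hz.
    if not 23.9 <= p.frame_rate <= 60.0:
        problems.append("frame-rate outside 24 Hz to 60 Hz")
    if p.frame_rate > 30.0 and (p.width > 1280 or p.height > 720):
        problems.append("above 30 Hz the picture is limited to 1280x720")
    return problems


if __name__ == "__main__":
    print(profile_violations(StreamParams(1920, 816, 23.976, 3, 4.0)))
    print(profile_violations(StreamParams(1920, 1080, 59.94, 4, 10.0)))
```

An empty list means the stream satisfies every constraint modeled here; the second call reports three violations.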

1.2 Summary of Internet Profile Constraints for Encoders

In addition to this Decoder Profile selection, we have also conducted an extensive program of encoder parameter selection targeted at delivering high-quality high-definition video files. This parameter selection process has led to the following recommendations for encoding:
• Recommended maximum reference frame count of 4 frames
• Recommended default average bit-rates for different operating points:

Width   Height   Frame-rate   Average Bit-rate
1280    720      23.976       2.2 Mbps
1920    816      23.976       4.65 Mbps
1920    1080     23.976       7.2 Mbps
1280    720      29.97        2.73 Mbps
1920    1080     29.97        8.96 Mbps

2 Profile Definition
As mentioned previously, four distinct factors have been used in determining the starting point for the profile selections presented in this paper. Each factor plays an important, and at least partially independent, role in helping narrow down the profile selection: the speed of consumer-grade networks is important in acquiring the content, and the in-home wireless distribution bandwidth helps determine how effectively the content can be made available at different stations in a home. Successful adoption by the consumer-electronics industry is an important factor in ensuring broad adoption for a profile. And finally, for internet-based content distribution one can assume that a minimum requirement is that the data can be encoded and played back on a Personal Computer whether a laptop or desktop solution. In this section of the document we present our research behind each one of these categories and how it helped to shape our decisions around selecting the final H.264 profile.

2.1 H.264 Compatibility of CE ASICs

In creating the new profile, we studied two different spheres of operation: the existing specifications with relation to H.264, and the capability of the latest CE decoding processors and systems. The existing specifications came from four distinct organizational sources: Blu-ray H.264 encoding parameters, DVB broadcast standard, CableLabs Packet Cable 2.0 specification, and a survey of open-source high definition PC encoders and their default behavior. An examination of the existing H.264 profiles showed a uniform selection of High Profile as the common operating point for high definition videos, with a varying degree of commonality of the Level selection from 4.0 all the way to 5.1.

Organizational Source   H.264 Profile / Level
Blu-ray                 High Profile, Level 4.1 (up to 40 Mbps [6])
DVB                     High Profile, Level 4.0
Packet Cable 2.0        High Profile, Level 4.0
x264 OSS encoder        High Profile, Up to Level 5.1

Table 1. Operating points of various encoding sources

A survey of publicly available information on CE ASIC devices capable of decoding H.264 shows a mixed number of devices capable of decoding H.264 Baseline, Main, and High profiles at various levels. None of the devices surveyed, however, claimed compliance with H.264 High Profile Level 5.1. In an effort to create interoperability between the H.264 CE decoding and the PC encoding market segments, we chose the High Profile Level 4.0 as the operational basis for our high definition H.264 decoding profiles.

2.2 Constraints Imposed by Consumer Grade Networks

The selection of an appropriate profile for distribution of H.264 content over the internet also involves the determination of an adequate operating point which serves the mainstream internet-distribution marketplace. The operating point serves as the basis for conducting subjective and objective quality measurements and determining the level of adjustment required for a satisfactory quality level. Looking at consumer-grade networks, let us first establish the following formula which correlates the time required for downloading a multimedia stream to its video and audio bit-rates:

\[ T_{acquisition} = T_{content} \cdot \frac{(R_{video} + R_{audio})\,(1 + C_{overhead})}{BW} \]

where
T_acquisition: total acquisition time in seconds
T_content: total run time of the content in seconds
R_video, R_audio: bit-rates of the video and audio streams in bits/sec
C_overhead: container overhead percentage
BW: bandwidth rate in bits/sec

Equation 1. Content acquisition time

Here, the container overhead Coverhead is represented by a percentage value that is typically below 5% and becomes a multiply factor for the audio and video rates. As expected, Equation 1 shows that the acquisition time is directly proportional to the bit-rate and inversely proportional to the available bandwidth to the home. Maintaining an acceptable acquisition time is important in download-to-own business models, and essential in near real-time delivery systems. In order to complete the equation we need to compute two additional factors: content time and bandwidth to the home. For the first variable, we used two different sources in our estimates: one from the United States box office records and the second from a list of some 460 Blu-ray titles [3] for which there is public information available (not all of this information has been independently verified). Here, we use a list of Hollywood movies grossing more than $100 million, and the top 1000 grossing Hollywood movies, which have an average of 115.6 and 119.6 minutes, respectively. Similarly we have an average of 114 minutes for a large number of released Blu-ray titles [3]. Other studies are also available [2] which have measured a more distributed use-case scenario by sampling actual file sizes and durations of over 10000 random files via a network-based Personal Video Recorder. For television content, they report four different groupings: most files have been measured to be around 30 minutes (animation series, sitcoms, etc.), along with shorter files which were typically news events; another peak was observed around 45-60 minutes which represents the length of television dramas and periodical shows; and finally there was a fourth grouping of content between 90 to 120 minutes which are typical dramas. Using the traditional motion picture market as the focus of our estimates, we will use a 116 minute metric as the default used in our calculations. 
Estimating the average bandwidth to the home is a more difficult challenge, given the broad diversity of delivery mechanisms and the available options in each of the world's geographical regions. Here, we rely on several sources to perform this estimation. The OECD [4] provides a list of 30 countries and their broadband penetration as of June 2007. While this report does not provide a study of bandwidth measurements to the home, it does provide the fastest advertised broadband download speeds offered by all technologies in the countries covered by the report. Using the top 15 of these countries (Japan, Sweden, Korea, France, Finland, U.S.A., Portugal, New Zealand, Italy, Denmark, Australia, Belgium, Norway, Germany, and Canada) we next turn to SpeedTest [5], which provides a bandwidth measurement service and uses a GeoIP database to determine a user's location. The SpeedTest bandwidth measurements (Figure 1) are generally not representative of what is available to the home, as their numbers include data from all internet sources including corporations, government offices, and universities. But the numbers do offer a glimpse of the average internet bandwidth available in each country. In order to get a more realistic view of bandwidth to the home, we used video streaming services such as YouTube, Joost, and PPLive as the basis for determining our real-time bandwidth delivery figures: given the requirements of real-time or near real-time video delivery to the home, it is reasonable to treat the bit-rates used by these services as one possible estimate of the bandwidth to the average broadband home. For this purpose, we used published information from the literature [2] in addition to our own samplings of video bit-rates from different streaming sources. The summary of these findings is shown in Figure 2, which includes recent samplings from YouTube High Quality videos (at an average of 589 kbps in H.264 format). Averaged across these services, the bandwidth for real-time, internet-based delivery systems comes to 465 kbps.

[Figure 1: bar chart. x-axis: Country (Japan, Sweden, Korea, France, Finland, USA, Portugal, New Zealand, Italy, Denmark, Australia, Belgium, Norway, Germany, Canada, Switzerland, Netherlands, Luxembourg); y-axis: Bandwidth (KBPS), 0 to 16000]

Figure 1. Average bandwidth reported by country on SpeedTest.net

[Figure 2: bar chart. x-axis: Service Name (samples from Joost, Zatto, PPLive, SopCast and YouTube, plus an Average bar); y-axis: Bandwidth (KBPS), 0 to 700]

Figure 2. Average bandwidth measured for different real-time internet-video providers

Given the average video bandwidth of 22.85 Mbps [3] and a maximum bandwidth of 40 Mbps [6], the Blu-ray encoding profile, which falls under Level 4.1, yields internet acquisition times that are probably not realistic for the average user. This was the first confirmation that Level 4.0 is the correct operating point for our profile; furthermore, we also have automated recommendations for default encoding bit-rates based on a maximum average bit-rate of 12 Mbps, which is set within H.264 Level 4.0 (see later sections). Using the bit-rate figures listed previously, with a 192 kbps audio rate, a video bit-rate of 4650 kbps for a 1080p movie of 1920x816 resolution at 23.976 frames per second (from Table 1), and a 3 percent container overhead, the acquisition time for a movie of 116 minutes at a bandwidth of 465 kbps will be approximately 21 hours (see Equation 1). Comparing this figure with the acquisition time of the same title encoded following the Blu-ray specification (an average total bit-rate of 30.33 Mbps [3] results in a 5+ day acquisition time), we achieve a six-fold increase in efficiency while offering the viewer good visual quality at high-definition resolutions (see Subjective Testing Methodology and Results). This comparison indicates that operating within the confines of H.264 Level 4.1 does not serve an internet-distribution model for high-definition content over consumer-grade bandwidths.
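The acquisition-time figures quoted above follow directly from Equation 1. The sketch below reproduces the arithmetic; the function name `acquisition_time_s` is illustrative, and applying the same 3 percent container overhead to the Blu-ray total bit-rate is an assumption, since the text does not state it.

```python
def acquisition_time_s(content_s, video_bps, audio_bps, overhead, bandwidth_bps):
    """Equation 1: seconds needed to download a title."""
    return content_s * (video_bps + audio_bps) * (1.0 + overhead) / bandwidth_bps


CONTENT_S = 116 * 60      # 116-minute movie
BANDWIDTH = 465_000       # 465 kbps average real-time delivery bandwidth
OVERHEAD = 0.03           # 3 percent container overhead

# Recommended 1920x816 operating point: 4650 kbps video + 192 kbps audio.
divx_s = acquisition_time_s(CONTENT_S, 4_650_000, 192_000, OVERHEAD, BANDWIDTH)

# Blu-ray: 30.33 Mbps is an average *total* bit-rate, so audio is set to 0.
# Assumption: the same container overhead is applied.
bluray_s = acquisition_time_s(CONTENT_S, 30_330_000, 0, OVERHEAD, BANDWIDTH)

divx_hours = divx_s / 3600
bluray_days = bluray_s / 86400
ratio = bluray_s / divx_s

if __name__ == "__main__":
    print(f"Recommended profile: {divx_hours:.1f} hours")
    print(f"Blu-ray profile:     {bluray_days:.1f} days")
    print(f"Ratio:               {ratio:.1f}x")
```

Under these assumptions the result is a little under 21 hours against a little over 5 days, a ratio slightly above 6, which is consistent with the figures in the text.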

2.3 Constraints Imposed by Wireless Distribution inside the Home

Wireless distribution via IEEE 802.11 has gained popularity, especially with the introduction of the 802.11g and the subsequent 802.11n standards. Among these, 802.11n is the latest standard to take shape and has a relatively higher price-point compared to its predecessors. Likewise, 802.11a is typically associated with corporate environments due to its relatively higher cost-basis at the time of its introduction, which was partially due to the new parts required to support the 5 GHz carrier band, versus 802.11b and 802.11g which both utilize the 2.4 GHz carrier band. 802.11n introduces new technologies such as Multiple-Input/Multiple-Output (MIMO), channel bonding, and improved MAC efficiency [8].

Both 802.11a and 802.11g have a maximum data rate of 54 Mbps, whereas 802.11b supports a maximum data rate of 11 Mbps. Due to their backwards compatibility with the 802.11b standard, 802.11g routers have gained popularity in home-networking environments (see Table 2). As a rule of thumb, however, the effective sustainable throughput for any of the above standards is 50-60% of the wireless rate or lower, due to the media-access and protection mechanisms ([9],[10]).

Wireless Router Brand                                                      802.11g Price   802.11n Price
Linksys WRT54G2 Wireless-G Broadband Router                                $45.99
Linksys-Cisco WRT54GL Wireless-G Broadband Router                          $49.99
D-Link DIR-655 Extreme N Wireless Router                                                   $94.99
Linksys WRT54G Wireless-G Router                                           $44.99
Linksys WRT54GS Wireless-G Broadband Router with SpeedBooster              $54.85
Linksys WRT160N Ultra RangePlus Wireless-N Broadband Router                                $69.99
Linksys WRT310N Wireless-N Gigabit Router                                                  $79.99
TRENDnet 54 Mbps Wireless G Broadband Router (TEW-432BRP Version D1.0R)    $24.99
Netgear WGR614 Wireless-G Router                                           $39.99
Linksys WRT150N Wireless N Home Router with 4-Port Switch Mimo                             $89.99
Average Price                                                              $43.46          $83.74

Table 2. A sampling of recent wireless router prices from Amazon.com ([7])

Given the relatively recent introduction of the 802.11n standard and the price-points associated with this product as compared to 802.11g (almost twice, as shown in Table 2), we chose to focus primarily on the 802.11b and 802.11g standards, where we expect these devices to dominate the home-networking environment. Likewise, an 802.11n network can theoretically support bandwidths that are well beyond 20 Mbps and hence would not require any further investigation into its viability for distributing H.264 Level 4.0. The maximum throughput for 802.11b using the Transmission Control Protocol (TCP) is estimated at 5.5~6.0 Mbps ([9],[10]), and likewise the maximum theoretical throughput for TCP under both 802.11a and 802.11g is 27.3 Mbps [9]. Actual throughput measurements, however, are typically far below the theoretical throughput rates ([11]). Additionally, because of the backward compatibility mode of 802.11g with the slower 802.11b standard, the theoretical rates for 802.11g vary between 9 and 15 Mbps in an environment with mixed 802.11b and 802.11g Access Points (APs).
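The 50-60% rule of thumb quoted earlier in this section can be applied to the nominal data rates as a quick sanity check on the theoretical TCP figures. This is only an illustrative calculation; the function name is hypothetical and the rule is a rough bound rather than a measurement.

```python
def effective_throughput_range(nominal_mbps, low=0.5, high=0.6):
    """Rule-of-thumb sustainable throughput as 50-60% of the wireless rate."""
    return nominal_mbps * low, nominal_mbps * high


if __name__ == "__main__":
    for name, rate in (("802.11b", 11.0), ("802.11a/g", 54.0)):
        lower, upper = effective_throughput_range(rate)
        print(f"{name}: {lower:.1f} to {upper:.1f} Mbps")
```

The cited theoretical TCP maxima (5.5~6.0 Mbps for 802.11b and 27.3 Mbps for 802.11a/g) fall within these ranges, near their lower ends.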

Figure 3. Maximum throughput for various IEEE 802.11 standards under ideal conditions

The 802.11 standard supports a multitude of data rates which enable clients to communicate at the best possible data rate while minimizing the total communication errors; systems perform a specific procedure to determine the best data rate for a communication session. A number of environmental variables affect the selected data rates, including distance between Wireless LAN devices, building and home materials, and radio frequency interference [12]. The throughput for 802.11g varies in the presence of a legacy 802.11b access point, since the 802.11g access points coordinate the use of the transmission medium with protection mechanisms. These protection mechanisms, while enabling backwards compatibility, decrease the maximum throughput of an 802.11g network for the transmission of TCP data down to potentially 9 Mbps, as in the case of the RTS/CTS protection model [10].

There is a clear gap between the maximum throughput and actual measurements taken in real-world settings such as an office or home. In addition to the environmental factors listed above, a number of product-specific factors come into play, including the transmission power levels, channel interference, and antenna type and location. Two separate studies report UDP throughputs of 5 Mbps and 7 Mbps with 802.11g and RTS/CTS protection turned on, at distances of 18 feet and 22 feet, respectively ([12], [13]). In the latter study, UDP and TCP measurements were similar at the distances measured, and hence we assume the same holds true in the first study. Without the RTS/CTS protection enabled (as would be the case in an 802.11g-only environment), measured throughputs were 13.5 Mbps [12]. Throughput measurements conducted for 802.11b for UDP packets in close proximity have been reported to be as high as 6.36 Mbps [14], while the same measurements for TCP data in a 2-story U.S. home (2,500 ft², wood construction) with 6 testing nodes averaged 2.41 Mbps [15] between the different nodes at the 11 Mbps 802.11b transmission rate. Based on all of the theoretical and experimental throughput measurements of various IEEE 802.11 standards for TCP data packets, we have chosen to use the following three levels: 2.41 Mbps for 802.11b, 5 Mbps for a mixed 802.11b/g environment with 802.11g access points, and 13.5 Mbps for the maximum of an isolated 802.11g access point. In accordance with these operating points, we performed a series of visual experiments to determine the maximum acceptable bit-rate upon which we can build our recommended encoding profile. After much experimentation, we chose 12 Mbps as the maximum average encoding bit-rate from which all of our other encoding bit-rates are calculated.
Based on this maximum bit-rate, we have devised recommended default bit-rates according to formulas which take into account the image resolution, frame-rate, and the specific operating point. Using these formulas, we have recommended a set of default encoding bit-rates in correspondence with the above operating points:
• The DivX default 720p (1280x720 @ 23.976 fps) at 2.2 Mbps for 802.11b AP
• The DivX default 1080p (1920x816 @ 23.976 fps) at 4.65 Mbps for 802.11g AP (mixed b/g environment)
These bit-rates will allow a multiplex with audio and the container overhead to be streamed within the average bit-rate confines of the 802.11 throughput measurements. Note that DivX streams will still utilize the maximum bit-rate and CPB buffer size as specified by H.264 Level 4.0. However, with proper buffering and considering the low average bit-rates, the high thresholds of Level 4.0 are expected to achieve smooth playback when the recommended profiles have been observed.
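To see how much room each default video bit-rate leaves for audio inside the chosen throughput level, Equation 1's multiplex model can be rearranged. This is a sketch under an assumption: it reuses the 3 percent container overhead from the Section 2.2 example, which the paper does not explicitly state for this case, and the function name is illustrative.

```python
def audio_headroom_kbps(throughput_kbps, video_kbps, overhead=0.03):
    """Largest audio rate such that (video + audio) * (1 + overhead) <= throughput."""
    return throughput_kbps / (1.0 + overhead) - video_kbps


# 802.11b level (2.41 Mbps) with the 720p default of 2.2 Mbps
headroom_b = audio_headroom_kbps(2410, 2200)

# Mixed 802.11b/g level (5 Mbps) with the 1080p default of 4.65 Mbps
headroom_g = audio_headroom_kbps(5000, 4650)

if __name__ == "__main__":
    print(f"802.11b:       {headroom_b:.0f} kbps available for audio")
    print(f"mixed 802.11g: {headroom_g:.0f} kbps available for audio")
```

With this overhead value the headroom is roughly 140 kbps in the 802.11b case and roughly 204 kbps in the mixed b/g case; a different overhead would shift both numbers.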

Figure 4. The DivX operating points corresponding to 802.11 throughput measurements

2.4 Performance Measurements on Personal Computers

In the process of defining the DivX Plus™ Profile, we performed a set of decoder and encoder cycle-performance measurements to refine the allowed parameters within the H.264 Level 4.0 High Profile. The results of our encoding performance measurements are shown in the next section, where some of these results were used to fine-tune our recommended encoder profile. Likewise, we used performance measurements on a personal computer with the DivX H.264 decoder software to confirm that streams encoded with the recommended parameters decode faster than the comparable streams in their original profiles. For this study, we used four streams: two encoded in compliance with the Blu-ray specification and the other two in compliance with the DVB specification. The decoding performance measurements were made on a Quad-core Intel Pentium 4 CPU running at 2.8 GHz. The results of the performance comparison are shown next:

[Figure 5: bar chart. x-axis: Stream Type (BD1, BD2, DVB1, DVB2); y-axis: Frames Per Second Decoded, 0 to 60; series: Original Profile, DivX Plus Profile]

Figure 5. Comparison of decoding performance (frames per second) of Blu-ray and DVB streams vs. DivX Plus™

Based on these results, the recommended default encoding profile is anywhere from 30% to 138% faster to decode than the comparable stream encoded at the Blu-ray or DVB profiles.

3 Objective Testing Methodology and Results


The DivX Plus™ objective tests were performed in order to further fine-tune the recommended encoder settings for the DivX H.264 High Profile Level 4.0 encoder. In performing our tests we used a set of 10 video source sequences that were selected based on filming method, aspect-ratio, frame rate and content. The 10 video sequences were obtained from 8 different clips, two of which were used at two different resolutions. The following table gives information for each sequence: all clips are over 5 minutes in length and contain at least 7,200 frames of video; all frames of all clips were encoded during the tests.

Clip #   Source     Filming Method   Frame Rate   Original Aspect Ratio   Resolution(s) Encoded         Types of Motion
1        1080p      CGI              23.976       16:9                    896x480                       Static shots; many scene cuts; low motion
2        1080p      Film             23.976       2.35:1                  1120x480                      Slow sweeps; frequent scene cuts; wide-angle
3        1080p      Film             23.976       2.35:1                  3a) 1920x816, 3b) 1120x480    Moderate sweeps and number of scene-cuts; close-ups
4        1080p      Green screen     23.976       16:9                    896x480                       Zoom in; slow sweeps; high motion
5        DVB        Green screen     25.000       2.35:1                  5a) 1920x816, 5b) 1120x480    Zoom out; slow sweeps; low motion
6        DVB        Green screen     25.000       2.35:1                  1120x480                      Zoom out; slow sweeps; medium motion
7        DVD        VHS              29.970       4:3                     720x480                       Fast zooms and sweeps
8        MOV File   DV               29.970       16:9                    1280x720                      Fast sweeps; medium motion; high detail

Table 3. Video clips used in our objective tests

These types and lengths of video sequences are representative of typical encoding scenarios and are therefore well-suited for a generalized study into the effects of different encoding parameters. If in analyzing the results of our encode sessions we find that the averaged effect of a certain variable has a large standard deviation, then this would indicate that the variable's effect is dependent on the properties of the video being encoded; conversely, a small standard deviation implies that the video sequence's properties have a small effect on the variable's behavior. The encoder was operated in a fixed-quantization mode rather than a rate-control mode. The effect of the change of a variable was measured by performing two encodes: one at a low-quality quantizer setting (30) and the other at a high-quality setting (20). Fixed quantizer encoding was used over a rate-control method for the following reasons:

• by essentially disabling rate-control, the implementation-specific differences between encoders are eliminated from the results analysis, allowing the results to be applied generically to other encoders
• using a fixed quantizer allows a cleaner analysis of the effect of a change to a single variable, rather than the effect being compounded by any form of a compensatory response embedded within a rate-control algorithm
• although not a driving force, the fact that only single-pass encoding was required allowed many more results to be gathered in a fixed time

In achieving the results of our experiments, we utilized a proprietary H.264 encoder as well as an open-source encoder. The encodings were performed on an 8-core Intel Xeon machine running at 2 GHz.

3.1 Analysis of GOP-Length

In these tests, GOP Length (the maximum time between two adjacent IDR-frames) was varied over a range of 0.2 to 9.9 sec and the effect on PSNR, SSIM, average bit-rate (ABR) and the encoder speed performance was measured as shown in the table below. PSNR values are averages across all 10 video sequences. The Increase columns show the difference relative to the baseline 4 second GOP spacing. This value of baseline was chosen because it is the maximum GOP length permitted under our recommended profile, independent of frame rate. The standard deviation provides a measure of the consistency of the results across the 10 sequences (measured across the relative percentage change with respect to the baseline), with lower standard deviation corresponding to more consistency.
                        Qp=30                                     Qp=20
GOP Length (seconds)    PSNR    Increase     Std.Dev of the      PSNR    Increase     Std.Dev of the
                                (Decrease)   % Change                    (Decrease)   % Change
0.2                     41.14   2.35%        0.60%               46.15   1.59%        0.38%
1.0                     40.46   0.67%        0.27%               45.56   0.30%        0.14%
2.0                     40.31   0.28%        0.14%               45.47   0.11%        0.05%
3.0                     40.24   0.11%        0.06%               45.44   0.04%        0.02%
4.0 (baseline)          40.20   0.0%         0.0%                45.42   0.0%         0.0%
9.9                     40.12   (0.18%)      0.12%               45.39   (0.07%)      0.05%

Table 4. Average GOP Length vs. Average PSNR

From this table it can be seen that PSNR varies by an average of 2.53% over almost a 50x change in the GOP length for low-quality encodings (Qp=30). At the higher quality Qp=20, PSNR varies even less with GOP length. Hence, even with a 3.3-sigma addition to the average for the results of the lower-quality encodes, there would be a negligible quality difference in the resulting video. We have also measured the mean and standard deviation in quality with a second metric, the Structural Similarity index or SSIM [16], shown in Table 5 (SSIM was included to ensure that a second objective quality metric other than PSNR was also considered in our results). Similar to the PSNR, the mean SSIM change is small while the standard deviation over different encodings is even smaller. Visual inspection of the different clips also confirms the negligible difference in quality, although specifically between a 2 second and a 9.9 second GOP there is some objective difference between the average PSNR and SSIM. From these tables we can conclude the following:
• PSNR and SSIM both decrease slightly with increased GOP lengths
• the rate of PSNR and SSIM decrease itself decreases as the required video encoding quality increases
• the amount of decrease is stable across many types of video sequences
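The "Increase (Decrease)" and "Std.Dev of the % Change" columns are described as statistics over the per-sequence percentage change relative to the 4 second baseline. A minimal sketch of that calculation follows; the per-sequence PSNR values are made up for illustration (the paper reports only averages), and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def relative_change_stats(values, baseline):
    """Mean and sample std. dev. of the per-sequence % change vs. the baseline."""
    changes = [100.0 * (v - b) / b for v, b in zip(values, baseline)]
    return mean(changes), stdev(changes)


if __name__ == "__main__":
    # Hypothetical per-sequence PSNR values (dB), not taken from the paper.
    psnr_baseline_gop = [40.1, 38.7, 42.3, 39.9]   # 4.0 s GOP
    psnr_short_gop = [41.0, 39.8, 43.1, 40.9]      # 0.2 s GOP
    avg, sd = relative_change_stats(psnr_short_gop, psnr_baseline_gop)
    print(f"mean change {avg:.2f}%, std. dev. {sd:.2f}%")
```

Because the percentage change is averaged per sequence, it need not equal the change between the averaged PSNR values: 41.14 against 40.20 in Table 4 gives about 2.34%, close to but not identical with the tabulated 2.35%.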

                        Qp=30                                     Qp=20
GOP Length (seconds)    SSIM    Increase     Std.Dev of the      SSIM    Increase     Std.Dev of the
                                (Decrease)   % Change                    (Decrease)   % Change
0.2                     0.963   0.60%        0.16%               0.984   0.22%        0.06%
1.0                     0.960   0.19%        0.07%               0.982   0.04%        0.02%
2.0                     0.959   0.08%        0.03%               0.982   0.02%        0.01%
3.0                     0.958   0.03%        0.01%               0.982   0.01%        0.00%
4.0 (baseline)          0.958   0.0%         0.0%                0.982   0.0%         0.0%
9.9                     0.957   (0.05%)      0.04%               0.982   (0.01%)      0.01%

Table 5. Average GOP Length vs. Average SSIM

                        Qp=30                                     Qp=20
GOP Length (seconds)    ABR     Increase     Std.Dev of the      ABR     Increase     Std.Dev of the
                                (Decrease)   % Change                    (Decrease)   % Change
0.2                     1,892   126.88%      45.66%              7,330   74.12%       37.34%
1.0                     1,044   17.77%       7.22%               5,069   10.65%       5.40%
2.0                     954     6.12%        2.61%               4,817   3.71%        1.99%
3.0                     922     1.82%        0.61%               4,725   1.06%        0.43%
4.0 (baseline)          907     0.0%         0.0%                4,683   0.0%         0.0%
9.9                     883     (3.09%)      1.37%               4,617   (1.84%)      0.96%

Table 6. Average GOP Length vs. Average Bit-rate (ABR)


| GOP Length (seconds) | EP (Qp=30) | Increase (Decrease), Qp=30 | Std.Dev of the % Change, Qp=30 | EP (Qp=20) | Increase (Decrease), Qp=20 | Std.Dev of the % Change, Qp=20 |
|---|---|---|---|---|---|---|
| 0.2 | 8.9 | 13.03% | 6.76% | 8.0 | 17.26% | 6.91% |
| 1.0 | 7.9 | 1.23% | 1.61% | 7.1 | 3.03% | 1.23% |
| 2.0 | 8.0 | 1.35% | 1.16% | 7.0 | 1.53% | 1.02% |
| 3.0 | 7.9 | 0.90% | 1.65% | 7.0 | 0.84% | 1.02% |
| 4.0 (baseline) | 7.8 | 0.0% | 0.0% | 6.9 | 0.0% | 0.0% |
| 9.9 | 7.8 | (0.43%) | 0.86% | 6.9 | (0.19%) | 0.77% |

Table 7. Average GOP Length vs. Relative Encoding Performance

Table 6 provides a similar set of data, this time for the average bit-rate (ABR). From this table we can see that, as one would expect, ABR varies greatly with GOP length, and that the variation itself is a strong function of the quality of the encoding. However, the increasing standard deviations with decreasing GOP length indicate that the effect on ABR depends strongly on the particular video sequence. Additionally, the impact of changing GOP length is smaller at higher quality levels.

Given a requirement for smooth trick-play functionality in CE devices, our proposed GOP length of 4 seconds does not substantially increase encoding times (see Table 7) or reduce visual quality relative to the much larger GOP length of 9.9 seconds. A choice of 4 seconds is a good operating point in terms of balancing encoding complexity, PSNR performance, bit-rate, and the importance of being able to offer effective visual search and other trick-play capabilities. Using a significantly shorter GOP length would give smoother forward and reverse search, though at an increasing cost in average bit-rate. And while there is a potential 3% increase in ABR between a 4-second and a 9.9-second GOP, this difference is deemed unimportant in light of achieving a good user experience during trick-play operations. When combined with the enhanced visual-search method outlined later in this paper, we believe the user experience will surpass any available option in today's consumer market space.
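
The trade-off described above can be explored with a few lines of Python. This is a rough sketch using the Qp=30 figures from Tables 6 and 7; the frame rate of 23.976 fps and the helper names are assumptions made for illustration, and the selection rule (shortest GOP that fits a bit-rate budget) is only one possible way to pick an operating point.

```python
FRAME_RATE = 23.976  # assumption: film-rate content

# GOP length (s) -> (mean ABR change %, mean encoding-performance change %)
# relative to the 4-second baseline at Qp=30, from Tables 6 and 7.
GOP_QP30 = {
    0.2: (126.88, 13.03),
    1.0: (17.77, 1.23),
    2.0: (6.12, 1.35),
    3.0: (1.82, 0.90),
    4.0: (0.0, 0.0),
    9.9: (-3.09, -0.43),
}


def keyframe_interval_frames(gop_seconds, fps=FRAME_RATE):
    """Approximate number of frames between key-frames for a GOP length."""
    return round(gop_seconds * fps)


def shortest_gop_within_budget(max_abr_increase_pct, table=GOP_QP30):
    """Shortest GOP whose mean ABR increase over the baseline fits the budget.

    Returns None when no tabulated GOP length satisfies the budget.
    """
    candidates = [gop for gop, (abr_pct, _) in table.items()
                  if abr_pct <= max_abr_increase_pct]
    return min(candidates) if candidates else None


print(keyframe_interval_frames(4.0))      # frames between key-frames at 4 s
print(shortest_gop_within_budget(2.0))    # GOP allowed by a 2% ABR budget
```

A shorter GOP gives finer-grained search positions, so a budget-driven choice like this makes the cost of each step towards smoother trick-play explicit.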

3.2 Analysis of Consecutive B-frame Limit

In this set of analyses we turned our attention to the issues involved with using multiple bi-directionally predicted frames (B-frames). Using the same test sequences, we kept all other encoding variables constant while selecting a limit of 0, 1, 2, or 3 consecutive B-frames. Naturally, the choice of whether to use all the allowed B-frames lies with the encoder's mode decision process, and no alterations were made to bias the encoding towards using more or fewer B-frames.

| Max B-frames | PSNR (Qp=30) | Increase (Decrease), Qp=30 | Std.Dev of the % Change, Qp=30 | PSNR (Qp=20) | Increase (Decrease), Qp=20 | Std.Dev of the % Change, Qp=20 |
|---|---|---|---|---|---|---|
| 0 | 40.45 | 0.63% | 0.15% | 45.74 | 0.71% | 0.19% |
| 1 | 40.25 | 0.12% | 0.06% | 45.48 | 0.13% | 0.06% |
| 2 | 40.21 | 0.03% | 0.02% | 45.44 | 0.03% | 0.02% |
| 3 (baseline) | 40.20 | 0.0% | 0.0% | 45.42 | 0.0% | 0.0% |

Table 8. Consecutive B-frame Limit vs. Average PSNR


| Max B-frames | SSIM (Qp=30) | Increase (Decrease), Qp=30 | Std.Dev of the % Change, Qp=30 | SSIM (Qp=20) | Increase (Decrease), Qp=20 | Std.Dev of the % Change, Qp=20 |
|---|---|---|---|---|---|---|
| 0 | 0.959 | 0.16% | 0.05% | 0.983 | 0.09% | 0.05% |
| 1 | 0.958 | 0.04% | 0.02% | 0.982 | 0.02% | 0.01% |
| 2 | 0.958 | 0.01% | 0.01% | 0.982 | 0.00% | 0.00% |
| 3 (baseline) | 0.958 | 0.0% | 0.0% | 0.982 | 0.0% | 0.0% |

Table 9. Consecutive B-frame Limit vs. Average SSIM

Based on the above two tables, it is apparent that the change in both PSNR and SSIM is relatively minor, although there is a more marked difference when B-frames are completely omitted. This effect, in turn, is reflected in the average bit-rate which is shown next:

| Max B-frames | ABR (Qp=30) | Increase (Decrease), Qp=30 | Std.Dev of the % Change, Qp=30 | ABR (Qp=20) | Increase (Decrease), Qp=20 | Std.Dev of the % Change, Qp=20 |
|---|---|---|---|---|---|---|
| 0 | 1,020 | 14.61% | 6.85% | 5,231 | 13.71% | 5.69% |
| 1 | 917 | 1.62% | 1.58% | 4,739 | 1.54% | 1.39% |
| 2 | 908 | 0.17% | 0.30% | 4,689 | 0.25% | 0.48% |
| 3 (baseline) | 907 | 0.0% | 0.0% | 4,683 | 0.0% | 0.0% |

Table 10. Consecutive B-frame Limit vs. Average Bit-rate (ABR)


| Max B-frames | EP (Qp=30) | Increase (Decrease), Qp=30 | Std.Dev of the % Change, Qp=30 | EP (Qp=20) | Increase (Decrease), Qp=20 | Std.Dev of the % Change, Qp=20 |
|---|---|---|---|---|---|---|
| 0 | 8.5 | (3.41%) | 2.68% | 5.7 | (2.79%) | 4.45% |
| 1 | 8.7 | (1.37%) | 2.00% | 5.9 | (1.04%) | 1.96% |
| 2 | 8.8 | (0.10%) | 0.52% | 5.9 | (0.79%) | 1.04% |
| 3 (baseline) | 8.9 | 0.0% | 0.0% | 6.0 | 0.0% | 0.0% |

Table 11. Consecutive B-frame Limit vs. Relative Encoding Performance

From Table 10, we see that the average bit-rate increases by 13% to 15% when B-frames are not used, compared with when they are used, at both encoder quality levels. The standard deviation across streams encoded without B-frames is between 5.69% and 6.85%, which signifies a fairly uniform bit-rate increase across all stream types. Once B-frames are used, however, the bit-rate improvements diminish substantially: the difference between using 2 and 3 B-frames is between 0.17% and 0.25%.

Based on the above results, the bit-rate gain from using three B-frames rather than two is minimal. However, one of our main goals is to create a profile that is acceptable to its community of software-encoder users, while creating streams that are decodable by most CE devices. It was deemed important to give users the flexibility of determining the total number of B-frames while offering a feature that clearly goes beyond what was offered by the previous DivX MPEG-4-compliant encoder. More importantly, this additional flexibility will help encompass more streams that have already been created with some open-source encoders.
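
The diminishing return from each additional B-frame can be made explicit by differencing the Table 10 figures. The sketch below is illustrative only; `marginal_savings` is a hypothetical helper name, and the values are the mean ABR increases relative to the 3-B-frame baseline.

```python
# Mean ABR increase (%) relative to the 3-B-frame baseline, from Table 10.
# List index = maximum number of consecutive B-frames (0, 1, 2, 3).
ABR_INCREASE = {
    30: [14.61, 1.62, 0.17, 0.0],
    20: [13.71, 1.54, 0.25, 0.0],
}


def marginal_savings(increase_pct):
    """Percentage points of bit-rate saved by each extra allowed B-frame."""
    return [round(a - b, 2) for a, b in zip(increase_pct, increase_pct[1:])]


for qp, increases in ABR_INCREASE.items():
    steps = marginal_savings(increases)
    for n, saved in enumerate(steps):
        print(f"Qp={qp}: {n} -> {n + 1} B-frames saves {saved:.2f} points")
```

Almost all of the saving comes from allowing the first B-frame; the step from two to three B-frames contributes only a fraction of a percentage point at either quality level.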

3.3 Analysis of Reference Frame Counts

The recommended H.264 profile in this paper places no restrictions on the number of reference frames beyond those imposed by the H.264 profile/level combination. However, the DivX encoder by default uses a maximum of 4 reference frames. In this case, in addition to performing the quality and bit-rate analysis, we also took into account the cycle-performance of the encoder and how it could affect the speed of the encoding session. The results are shown in the next four tables. Based on the results, we can see that there is minimal effect on SSIM and PSNR between the different reference frame counts. The average bit-rate, however, decreases with an increased number of reference frames, and given the small standard deviation across the various sources we used, it can be concluded that a direct relation exists between the number of reference frames and the average bit-rate. In choosing the proper operating point, we used a fourth metric: the average performance of the encoder on our system, which decreases as the number of reference frames is increased.

| Max Ref-frames | PSNR (Qp=30) | Increase (Decrease), Qp=30 | Std.Dev of the % Change, Qp=30 | PSNR (Qp=20) | Increase (Decrease), Qp=20 | Std.Dev of the % Change, Qp=20 |
|---|---|---|---|---|---|---|
| 1 | 40.156 | (0.10%) | 0.08% | 45.408 | (0.03%) | 0.06% |
| 2 | 40.180 | (0.04%) | 0.04% | 45.417 | (0.01%) | 0.03% |
| 3 | 40.190 | (0.02%) | 0.01% | 45.420 | (0.01%) | 0.01% |
| 4 (baseline) | 40.197 | 0.0% | 0.0% | 45.423 | 0.0% | 0.0% |
| 5 | 40.203 | 0.01% | 0.01% | 45.425 | 0.00% | 0.01% |
| 6 | 40.207 | 0.03% | 0.02% | 45.427 | 0.01% | 0.01% |
| 7 | 40.211 | 0.03% | 0.02% | 45.429 | 0.01% | 0.02% |

Table 12. Max Reference Frame Limit vs. Average PSNR

| Max Ref-frames | SSIM (Qp=30) | Increase (Decrease), Qp=30 | Std.Dev of the % Change, Qp=30 | SSIM (Qp=20) | Increase (Decrease), Qp=20 | Std.Dev of the % Change, Qp=20 |
|---|---|---|---|---|---|---|
| 1 | 0.958 | (0.01%) | 0.02% | 0.982 | (0.00%) | 0.01% |
| 2 | 0.958 | (0.01%) | 0.01% | 0.982 | (0.00%) | 0.00% |
| 3 | 0.958 | (0.00%) | 0.00% | 0.982 | (0.00%) | 0.00% |
| 4 (baseline) | 0.958 | 0.0% | 0.0% | 0.982 | 0.0% | 0.0% |
| 5 | 0.958 | 0.00% | 0.00% | 0.982 | 0.00% | 0.00% |
| 6 | 0.958 | 0.01% | 0.00% | 0.982 | 0.00% | 0.00% |
| 7 | 0.958 | 0.01% | 0.01% | 0.982 | 0.00% | 0.00% |

Table 13. Max Reference Frame Limit vs. Average SSIM


| Max Ref-frames | ABR (Qp=30) | Increase (Decrease), Qp=30 | Std.Dev of the % Change, Qp=30 | ABR (Qp=20) | Increase (Decrease), Qp=20 | Std.Dev of the % Change, Qp=20 |
|---|---|---|---|---|---|---|
| 1 | 924.9 | 2.06% | 0.81% | 4,764.8 | 1.78% | 0.52% |
| 2 | 916.9 | 1.18% | 0.45% | 4,720.9 | 0.86% | 0.29% |
| 3 | 910.9 | 0.44% | 0.31% | 4,697.1 | 0.32% | 0.14% |
| 4 (baseline) | 907.4 | 0.0% | 0.0% | 4,683.2 | 0.0% | 0.0% |
| 5 | 904.8 | (0.34%) | 0.17% | 4,673.7 | (0.23%) | 0.08% |
| 6 | 903.6 | (0.53%) | 0.28% | 4,667.3 | (0.38%) | 0.14% |
| 7 | 902.2 | (0.74%) | 0.43% | 4,661.9 | (0.53%) | 0.21% |

Table 14. Max Reference Frame Limit vs. Average Bit-rate (kbps)
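
| Max Ref-frames | EP (Qp=30) | Increase (Decrease), Qp=30 | Std.Dev of the % Change, Qp=30 | EP (Qp=20) | Increase (Decrease), Qp=20 | Std.Dev of the % Change, Qp=20 |
|---|---|---|---|---|---|---|
| 1 | 8.2 | 8.59% | 5.92% | 6.7 | 12.73% | 5.61% |
| 2 | 8.0 | 5.47% | 3.46% | 6.5 | 8.49% | 2.76% |
| 3 | 7.8 | 2.14% | 1.50% | 6.2 | 3.89% | 1.42% |
| 4 (baseline) | 7.6 | 0.0% | 0.0% | 6.0 | 0.0% | 0.0% |
| 5 | 7.5 | (2.50%) | 1.67% | 5.8 | (3.36%) | 1.21% |
| 6 | 7.3 | (4.83%) | 1.91% | 5.6 | (7.19%) | 1.92% |
| 7 | 7.2 | (6.56%) | 2.84% | 5.5 | (9.57%) | 2.72% |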

Table 15. Max Reference Frame Limit vs. Relative Encoding Performance

While the recommended profile does not limit the total number of reference frames (other than what is bounded by H.264 Level 4.0), the maximum allowed reference frame count for our software encoder is 4 frames. At Qp=30, this setting yielded an average bit-rate that was 0.74% higher than using 7 reference frames, but the encoding was also an average of 6.56% faster than the same encode with 7 reference frames on our encoding system (see Tables 14 and 15).
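
One way to weigh the two effects against each other is to ask how much encoding speed is given up per percentage point of bit-rate saved. The sketch below uses the Qp=30 columns of Tables 14 and 15; the ratio itself is an illustrative metric chosen here, not one used in the original analysis, and the function names are hypothetical.

```python
# Mean change (%) relative to the 4-reference-frame baseline at Qp=30.
REFS = [1, 2, 3, 4, 5, 6, 7]
ABR_CHANGE = [2.06, 1.18, 0.44, 0.0, -0.34, -0.53, -0.74]   # Table 14
EP_CHANGE = [8.59, 5.47, 2.14, 0.0, -2.50, -4.83, -6.56]    # Table 15


def tradeoff(refs):
    """Return (ABR change %, encoding-performance change %) for a setting."""
    i = REFS.index(refs)
    return ABR_CHANGE[i], EP_CHANGE[i]


def speed_cost_per_bitrate_point(refs):
    """Percent of encoding speed lost per percent of bit-rate saved."""
    abr_pct, ep_pct = tradeoff(refs)
    if abr_pct >= 0:
        raise ValueError("only defined for settings that save bit-rate")
    return ep_pct / abr_pct


for refs in (5, 6, 7):
    ratio = speed_cost_per_bitrate_point(refs)
    print(f"{refs} reference frames: {ratio:.1f}% speed per 1% bit-rate")
```

By this measure, each percentage point of bit-rate saved beyond four reference frames costs several percent of encoding speed, which is consistent with choosing four as the operating point.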

4 Subjective Testing Methodology and Results


In order to verify the selection of 12 Mbps as the recommended maximum average bit-rate for the DivX encoder, we performed a single stimulus quality evaluation (SSQE) in which viewers were presented with four different clips, each encoded with five different configurations (see Table 16). Each clip was selected to be between 1 and 2 minutes long, and one of the five encodings was first shown at random to allow the viewers to familiarize themselves with the contents of the clip. All clips were encoded with the H.264 High Profile tools. The original source was not among the clips shown to the viewers, and the presentation order of the five encodings was randomized. A total of 24 viewers with varying levels of expertise (novice to expert) were chosen in order to ensure that the results were not determined solely by video engineers. After the playback of each clip in its entirety, the viewer was asked to rate the clip with a score of 1 through 5, representing quality ranging from excellent (1) to bad (5).

| Designation | Format | Single/Dual Pass | Average Bit-rate | GOP Size | Maximum Bit-rate |
|---|---|---|---|---|---|
| 1PASS_4.0MAX | 1080p@24 | Single | 19 Mbps | 2 seconds | 20 Mbps |
| 1PASS_P12MAX | 1080p@24 | Single | 11 Mbps | 2 seconds | 12 Mbps |
| 2PASS_P12BEST | 1080p@24 | Dual | 8.33 Mbps | 2 seconds | 12 Mbps |
| 2PASS_14BEST_LG | 1080p@24 | Dual | 9.75 Mbps | 10 seconds | 14 Mbps |
| 1PASS_P12MAX_720P | 720p@24 | Single | 11 Mbps | 2 seconds | 12 Mbps |

Table 16. Encoding parameters and their corresponding designation

Initially we aimed to limit the maximum bit-rate as well as the average bit-rate, which is why we chose different operating parameters (12 Mbps and 14 Mbps). Likewise, we wanted to know whether we could reduce the average bit-rate by using dual-pass encoding (2PASS_14BEST_LG, 2PASS_P12BEST) instead of single-pass encoding. We also wanted to measure the effects of using a long-GOP encoding (2PASS_14BEST_LG). The results of our visual testing (see Table 17) were weighted with a non-linear scale in order to make the aggregation of the data more reflective of the typical human experience; the scale weighted worsening video quality more heavily.

| Designation | Score |
|---|---|
| 1PASS_4.0MAX | 1200 |
| 1PASS_P12MAX | 1665 |
| 2PASS_P12BEST | 2070 |
| 2PASS_14BEST_LG | 2245 |
| 1PASS_P12MAX_720P | 2915 |

Table 17. Results of the DivX Single Stimulus Quality Evaluation (lower score is better)

Based on these results, we see that our viewers were able to differentiate the quality of the 720p clips from that of the 1080p clips, and that the highest encoding profile (1PASS_4.0MAX) was rated as having the highest quality. The long-GOP encoding was rated worse than the short-GOP encodings, even when compared to a lower bit-rate encoding. Given the results of the subjective visual testing, we decided to adopt the bandwidth limitations of H.264 High Profile Level 4.0 for the maximum bit-rate, and to cap the average bit-rate of an encoding at 12 Mbps. Upon further evaluation, we determined that the quality gap between 12 Mbps and 20 Mbps was smaller than originally anticipated, which is why we chose 12 Mbps as the artificial cap on the average bit-rate of our encodings.
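
The aggregation step can be illustrated as follows. The actual non-linear scale is not listed in this paper, so the weights below are purely hypothetical, as are the sample ratings; the sketch only shows how a scale that penalizes poor ratings more heavily separates two encodings further than a plain sum would.

```python
# Hypothetical non-linear weights (NOT the scale used in the study):
# rating 1 = excellent ... rating 5 = bad, with bad ratings costing more.
WEIGHTS = {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}


def weighted_score(ratings, weights=WEIGHTS):
    """Aggregate viewer ratings into a single score (lower is better)."""
    return sum(weights[r] for r in ratings)


# Made-up ratings for two encodings, for illustration only.
ratings_a = [1, 1, 2, 2, 3]
ratings_b = [2, 3, 3, 4, 5]

print(sum(ratings_a), sum(ratings_b))                        # linear totals
print(weighted_score(ratings_a), weighted_score(ratings_b))  # weighted totals
```

With these made-up inputs, the weighted totals differ by a larger factor than the linear totals, which is the intended effect of weighting the scale towards worsening quality.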

5 Improving Visual Search Performance for Long GOP Encodings


Visual-search through digitally encoded multimedia files is typically performed by displaying only the key-frames (also known as intra-frames or IDR-frames) of the relevant video stream. Each key-frame is displayed for a time corresponding to the speed of the visual search being performed, and some key-frames may be skipped when a high search speed is requested. To facilitate smooth visual-search, key-frames are usually placed at regular and frequent intervals throughout an encoding. However, as has already been presented in this paper, frequent insertion of key-frames can increase the average bit-rate of the encoding by large amounts; hence, the smoother the visual-search requirement, the larger the ABR of the encoding, with the increase growing steeply at short key-frame intervals (see Table 6). It is for this reason that most systems compromise visual-search capability, as can be seen in the functionality of DVD and even Blu-ray players.

Infrequent key-frames introduce another problem: the time between key-frames is large and irregular, delivering visually inconsistent performance with a coarse-grained presentation of the content's video frames, which can actually make visual-search more difficult to use. More important, however, is that smooth reverse trick play with high-definition content is extremely costly, especially given the flexible GOP structure of an H.264 stream. Smooth forward and reverse trick play may require decoding all frames of the video sequence while presenting the viewer with only a subset of the decoded frames, which increases the burden on the decoding device.

We propose an alternative method of enabling visual-search in both forward and reverse time-order, which can be scaled to trade off file-size impact against smoothness of visual-search. Furthermore, our proposed system can provide 10 times more key-frames than the DVD or Blu-ray ([6]) formats while incurring an impact of much less than 10% on the total file-size. Finally, our system does not increase the processing requirements of the video playback system, including network throughput, while also allowing the visual-search data to be used for content-preview animations in advanced media managers.

Our method is as follows:
- sub-sample the original content to a lower resolution
- sub-sample the low-resolution content to a lower frame rate
- encode this low-resolution, low-frame-rate visual-search track (VST) as a key-frame-only bitstream
- write the resulting data either in-line with the original content, or append this VST to the end of the original piece of content
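
The first two steps of the method can be sketched in a few lines of Python. This is a simplified illustration, assuming frames are held as nested lists of pixel values and using naive decimation without any low-pass filtering; the function names are hypothetical, and the key-frame-only encoding and the writing of the VST into the file are left to a real encoder and muxer.

```python
def subsample_frame(frame, step):
    """Keep every `step`-th pixel in each dimension (naive decimation)."""
    return [row[::step] for row in frame[::step]]


def build_vst_source(frames, source_fps, vst_fps, spatial_step):
    """Select frames on a uniform vst_fps grid and decimate each of them.

    Returns a list of (source_frame_index, low_resolution_frame) pairs that
    would then be handed to an encoder configured for key-frames only.
    """
    if not 0 < vst_fps <= source_fps:
        raise ValueError("vst_fps must be positive and not exceed source_fps")
    duration_seconds = len(frames) / source_fps
    vst_frame_count = int(duration_seconds * vst_fps)
    selected = []
    for k in range(vst_frame_count):
        index = min(len(frames) - 1, round(k * source_fps / vst_fps))
        selected.append((index, subsample_frame(frames[index], spatial_step)))
    return selected


# One second of dummy 8x8 video at 25 fps, reduced to 5 fps and 4x4 pixels.
dummy = [[[n] * 8 for _ in range(8)] for n in range(25)]
vst = build_vst_source(dummy, source_fps=25, vst_fps=5, spatial_step=2)
print([index for index, _ in vst])
```

Because the selected frames sit on a uniform time grid, the resulting track has evenly spaced access points regardless of where the key-frames fall in the main encoding.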

For example, by taking 25% of the spatial data (pixels) and 21% of the temporal data (frames) from a source, nearly 95% of the original data can be discarded. The resulting frames are all encoded as key-frames, which is known to be extremely inefficient. However, since the VST source is only about 5% of the original, our experiments have shown that the VST file can be anywhere from a few percent to 10% of the size of the original content (a general rule of thumb of 7.5% can be used to estimate the encoded VST file-size). Using actual numbers, we can present the following real-world scenario:
- assume that a normally encoded (i.e. 1 or more key-frames per second) movie (i.e. 23.976 fps) of resolution 1920x816 has a file-size of 10 GB
- the movie is re-mastered with a key-frame rate of 1 or more key-frames every 4 seconds, reducing the file-size to 8.23 GB, i.e. a 17.7% reduction (as shown in Table 6)
- the content is then sub-sampled to a resolution of 480x272 and a frame-rate of 5 fps (25% and 21% respectively) to generate a VST source
- this VST is then encoded as key-frames only, resulting in a VST file-size of 618 MB (i.e. 7.5% of 8.23 GB)
- the combined file-size of the best-encoding with visual-search enhancement is 8.85 GB
- this is a saving of 1.15 GB from the original, with improved visual-search performance and the ability for advanced video players and media managers to show animated icons
- in this case, a device could perform up to 40x visual-search (in either the forward or reverse time-line) without requiring more than 1x original video system performance
- higher speeds can be achieved by skipping key-frames as needed to keep the system performance within the limits of the device (or software) performing the visual-search

The concept of the enhanced visual-search is shown in the next few diagrams. The first figure shows a sample video encoded at 25 frames-per-second, where each frame type is represented with a small square underneath the video frames. Here, a red square designates an intra-frame (I-frame), a yellow square a bi-directionally predicted frame (B-frame), and a green square a uni-directionally predicted frame (P-frame).
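
The file-size arithmetic of the real-world scenario above can be checked with a short script. This is a sketch using the figures quoted in the text (10 GB original, 8.23 GB re-mastered, 7.5% rule of thumb); the function name `vst_budget` and the use of decimal gigabytes are assumptions made for illustration.

```python
ORIGINAL_GB = 10.0      # movie encoded with 1 or more key-frames per second
REMASTERED_GB = 8.23    # same movie with key-frames roughly every 4 seconds
VST_FRACTION = 0.075    # rule-of-thumb VST size relative to the content


def vst_budget(original_gb, remastered_gb, vst_fraction):
    """Return the size figures for a re-mastered file with an added VST."""
    vst_gb = remastered_gb * vst_fraction
    total_gb = remastered_gb + vst_gb
    return {
        "reduction_pct": 100.0 * (original_gb - remastered_gb) / original_gb,
        "vst_gb": vst_gb,
        "total_gb": total_gb,
        "saving_gb": original_gb - total_gb,
    }


# Fraction of the source data retained by the VST source:
# 25% of the pixels times 21% of the frames.
retained_fraction = 0.25 * 0.21

budget = vst_budget(ORIGINAL_GB, REMASTERED_GB, VST_FRACTION)
for name, value in budget.items():
    print(f"{name}: {value:.3f}")
print(f"retained_fraction: {retained_fraction:.4f}")
```

The results agree with the rounded figures in the scenario: a VST of a little over 0.6 GB, a combined size of about 8.85 GB, and a saving of about 1.15 GB, with roughly 5% of the source data retained in the VST source.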

Figure 6. Example video encoded at 25 fps, with non-uniformly dispersed Intra frames (red squares)

From Figure 6 we see that the distance between the intra frames is not uniform, which is shown more clearly in Figure 7. We also see that, besides the temporal distance between the intra frames, the bit-rates of those intra frames vary greatly. Both of these issues are resolved by the addition of the VST, which is shown in Figure 8. Here we can see that the bit-rates across the intra frames are much more uniform, and the temporal distance is also consistent. Both of these factors will lead to a better user experience during smooth forward and reverse operations, in both optical and streaming playback scenarios.

Figure 7. Intra frame properties in original video sequence

Figure 8. Visual search track with uniform bit-rate and temporal distance of access frames

Of course, what has been shown here are simple examples used to illustrate the process and benefits of creating a VST. We believe that this technique brings real-time, over-the-Internet visual-search capability to the market for the first time. This feature significantly improves the on-line video viewing experience, bringing it closer to the physical at-home optical-disk experience.

6 Conclusions
This paper has discussed the distribution of high quality video over the internet using H.264. We have considered the bandwidths available in various wired and wireless internet technologies, and presented the results of a comprehensive set of experiments designed to identify H.264 parameters that lead to advantageous tradeoffs between complexity, video quality, and bit-rate. In addition, we have proposed a new and efficient visual search methodology to facilitate smooth fast-forward and reverse playback at different speeds.

7 References
[1] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, G. Sullivan, "Rate-Constrained Coder Control and Comparison of Video Coding Standards," IEEE Trans. Circuits Syst. Video Technol., vol. 13, pp. 688-703, July 2003.
[2] T. Hoßfeld, K. Leibnitz, "A Qualitative Measurement Survey of Popular Internet-based IPTV Systems," Second International Conference on Communications and Electronics, June 2008.
[3] http://forum.blu-ray.com/showthread.php?t=3338
[4] "Broadband Growth and Policies in OECD Countries," OECD Ministerial Meeting on the Future of the Internet Economy, June 2008.
[5] http://www.speedtest.net, Doug Suttles.
[6] Wikipedia, "Blu-ray Disc" (http://en.wikipedia.org/wiki/Blu-ray_Disc), Sep. 2008.
[7] Amazon.com sales ranking, 8/20/2008.
[8] M. Gast, "802.11 Wireless Networks, the Definitive Guide," O'Reilly Publications, 2nd Edition.
[9] I. Kozintsev, J. McVeigh, "Improving last-hop multicast streaming video over 802.11," Workshop on Broadband Wireless Multimedia, Oct. 2004.
[10] M. Gast, "When is 54 Not Equal to 54?," O'Reilly Wireless DevCenter, Aug. 2003.
[11] White paper, "The New Mainstream Wireless LAN Standard," Broadcom Corporation, July 2003.
[12] A. Wijesinha, Y. Song, M. Krishnan, V. Marhur, J. Ahn, V. Shyamasundar, "Throughput Measurement for UDP Traffic in an IEEE 802.11g WLAN," Proc. of the Sixth Int. Conf. on SW Engineering, AI, and Networking and Parallel/Distributed Computing (SNPD/SAWN), July 2005.
[13] M. Boulmalf, T. Rabee, K. Shuaib, A. Lakas, "Performance Characterization of IEEE 802.11g in a Cubicles Environment," The Seventh Annual U.A.E. University Research Conference, Apr. 2006.
[14] S. Garg, M. Kappes, "An Experimental Study of Throughput for UDP and VoIP Traffic in IEEE 802.11b Networks," Wireless Communications and Networking, March 2003.
[15] K. Papagiannaki, M. Yarvis, W. Conner, "Experimental Characterization of Home Wireless Networks and Design Implications," INFOCOM 2006, 25th IEEE International Conference on Computer Communications, Proceedings, Apr. 2006.
[16] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, Apr. 2004.
