

Appendix A: Technical Report

Table of Contents

1. MPEG Information

2. Parallelization Issues

3. Parallelization Choices

4. Scheduling Method

5. Performance Issues

MPEG Information

Digital video is represented as a series of frames, and each frame is represented as a matrix of pixels. For a person to perceive the video as continuous motion, roughly 20 frames per second must be displayed; films are shown at 24 frames per second, television at 30. MPEG-1, the most popular video format used with computers, is the form of compression implemented in this project. It commonly uses a frame size of 352 by 240 pixels, although the standard does not explicitly require this. That size roughly corresponds to television-quality resolution.

There are three kinds of MPEG video frames: I, P, and B. I (Intra) frames are encoded using information entirely within the frame. The process closely resembles the popular JPEG still image compression method. P (forward Predicted) frames are encoded using information both from within the frame and from the previous I or P frame. Compression is about 10 times better than for I frames, but the performance costs are high. B (Bi-directional predicted) frames are encoded using information from within the frame, from the previous I or P frame, and from the following I or P frame. B frames generally are compressed to about one sixtieth the size of I frames, but performance suffers even further. B frames are never used as a basis for the encoding of other frames. The time it takes to decode each of the three frame types is roughly the same.

The MPEG standard defines no set sequence of I, P, and B frames that must be used to encode a video. However, several patterns have become de facto standards. A stream consisting strictly of I frames can be encoded in real time on the last several generations of computers, but the compression is not high enough to be useful for most purposes. Streams encoded with many B frames provide excellent compression, but encoding them in real time is well out of the reach of consumer hardware. A very popular compromise is:

IBB PBB PBB PBB PBB P

In this sequence, each pair of B frames depends on the I or P frame that precedes it and the P frame that follows it. This sets a small minimum delay between when frames are received and when they can be displayed. All of the P frames ultimately depend on the initial I frame.

Each triplet can be encoded independently of the triplets that follow it, and the sequence in its entirety can be encoded independently of the sequences that precede and follow it. At real time, this 16-frame sequence would be repeated 1.5 to 2 times per second (24 to 30 frames per second divided by 16 frames). One final note: the choice of sequence usually has a negligible effect on the performance of the decoder.

Parallelization Issues

The first choice that had to be made when converting the serial encoder into a parallel one was what size chunks to deliver to each of the processors. There are several good choices.

The encoder we adapted for this project encoded many frames at once on a single processor. That decision was made at a time when real-time compression could not be achieved even with a large pool of then-current workstations. If large enough groups of frames are encoded at once, no dependencies exist between subsequent sets, which is a major benefit of this approach. Additionally, almost any pattern of I, P, and B frames can be used. However, because the sets that are encoded represent large units of time, this approach is unsuitable for real time.

At the other extreme, MPEG defines a subdivision of the frame called a slice. There are no interdependencies between the slices of a frame, so any pattern of I, P, and B frames can be used. This approach is also scalable: adding computers allows the frame size to grow. The biggest obstacle is that the structure of such a program would be very different from that of the encoder we modified for this project, and restructuring it would have been infeasible. Furthermore, the cost of networking would be high. The throughput required to encode at real time is several dozen megabits per second. To use ten-megabit hardware, a tree or cube topology would be required, and the propagation delays this would induce would be too high. The only alternatives would be all 100-megabit hardware with a hub or a combination of 10- and 100-megabit hardware connected with a switch. These solutions would make the end product unreasonably expensive.
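
To put a number on that throughput claim, a quick back-of-the-envelope calculation helps. The sketch below assumes 352 by 240 frames with 4:2:0 chroma sampling (1.5 bytes per pixel, the common MPEG-1 choice) at 30 frames per second; the exact figure depends on the format actually chosen.

    #include <stdio.h>

    /* Raw video bandwidth for 352x240 frames, 4:2:0 sampling, 30 frames per
       second. The frame size and chroma format are the common MPEG-1 choices,
       not requirements of the standard. */
    int main(void)
    {
        const double pixels_per_frame = 352.0 * 240.0;
        const double bytes_per_pixel  = 1.5;   /* Y plus quarter-size U and V planes */
        const double frames_per_sec   = 30.0;

        double bits_per_sec = pixels_per_frame * bytes_per_pixel * 8.0
                              * frames_per_sec;
        printf("raw throughput: %.1f Mbit/s\n", bits_per_sec / 1e6);
        return 0;
    }

That works out to roughly 30 megabits per second of raw frame data, well past what ten-megabit hardware can carry even before protocol overhead.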

We chose to encode one frame at a time on each processor. This approach could be carried out with reasonable modifications to the original encoder. However, once a network topology is chosen, there are limitations on the patterns of I, P, and B frames that can be used. As mentioned in the previous section, only a few patterns are actually used in practice, so this is not a major problem. Given the sequence IBBPBBPBBPBBP, each set of three frames can be encoded concurrently with no dependencies, making this a good compromise between fine- and coarse-grained parallelism.

This decision inherently sets a limit on the number of machines that can be used to perform the encoding. If many slower machines were used, the minimum delay from encoding to decoding would be set by the longest time for a single computer to encode a single frame, which can grow quite large; this approach would face the same penalties as the ``many frames at once'' method described above. Given that dependent frames come in sets of three, it makes sense to encode three or six frames at once; longer sequences would lead to undesirably long delays. Three computers can be connected on a single 100-megabit ethernet wire for a reasonable cost. This setup provides adequate throughput, and it is the easiest to configure. This choice has one further effect: it sets the maximum amount of time each computer has to encode a frame (at 30 frames per second, for example, each round of three frames must finish within a tenth of a second).

Parallelization Choices

We looked at several different parallel programming environments. The most popular of them was PVM, but we also looked at MPI, LINDA, and several others.

PVM had many advantages. Initial configuration and control of processes on slave computers are provided through an interactive shell or through a parameter file. Message passing is straightforward, and any variations in data representation between computers are hidden from the user. Also, if an error occurs in either the network or one of the remote computers, PVM can often stop the program gracefully.

MPI provides a lower-level interface than PVM, and it doesn't offer much in the way of error recovery. It wouldn't have offered any real advantage over PVM, and the implementation would have been more difficult. LINDA is a higher-level environment than PVM; in fact, one implementation of the LINDA model sits on top of PVM. Although LINDA's documentation claims that it is fundamentally different from message passing or shared memory, LINDA basically provides an implicit form of message passing. This model probably would have worked as well as PVM for our purposes. LINDA is distributed under a more restrictive policy than PVM, but evaluation software is available for free.

In the end, we decided on a different method for splitting the encoding process between several machines. The encoder we used as a basis for our project provided a usable infrastructure for the development of a frame-level parallel encoder. The underlying functions are socket based, but the interface we were presented with was at a somewhat higher level. The remote machines, which can vary in number, are initialized using the rsh command. Functions are provided to start a server on the master machine. Messages must be packed into a buffer and exchanged using send and recv, but nearly all communication is of char size, so the byte order of the machines is not an issue. In short, the interface we faced was much the same as the one PVM provides.
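
As a rough illustration of that interface, the sketch below packs a message into a buffer before handing it to send. The message layout and the function name are invented for this example; the point is that because every field is written as individual chars, the byte order of the sending and receiving machines never matters.

    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Illustrative only: a tag byte, a frame-type byte, and a length encoded
       one char at a time, followed by the payload. Writing multi-byte values
       char by char sidesteps byte-order differences between machines.
       (A production version would also loop on short sends.) */
    static int send_frame_msg(int sock, unsigned char frame_type,
                              const unsigned char *data, size_t len)
    {
        unsigned char header[6];
        header[0] = 'F';                         /* message tag: frame data */
        header[1] = frame_type;                  /* 'I', 'P', or 'B' */
        header[2] = (unsigned char)( len        & 0xff);
        header[3] = (unsigned char)((len >>  8) & 0xff);
        header[4] = (unsigned char)((len >> 16) & 0xff);
        header[5] = (unsigned char)((len >> 24) & 0xff);

        if (send(sock, header, sizeof header, 0) != (ssize_t)sizeof header)
            return -1;
        if (send(sock, data, len, 0) != (ssize_t)len)
            return -1;
        return 0;
    }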

Using the existing parallel features of the original encoder provided a noticeable performance improvement. In the PVM approach, the three frames of each related group are encoded concurrently: one on the master (which gets the less difficult I or P frame) and one on each slave. The master blocks until the slaves have completed their work, and the three frames are then merged into the output bitstream using the serial combination function of the original encoder. Our final version is very similar, except for the final steps. The service used to combine frames received from multiple computers, also provided as part of the original encoder, is generic enough to handle both series of frames (its original purpose) and single frames. This method is useful for us because encoded frames are queued before they are combined, which allows for a more uniform use of network and I/O resources.

There is one major reason it took us so long to decide on a parallel environment. The choice had to be made after the structure of the code was thoroughly understood. However, in order to understand the original program, we had to learn many details of the MPEG encoding algorithm, some of which are fairly complex. This shifted the balance of effort we expended away from implementation and design and into the preliminary planning stage.

Scheduling Method

The original encoder and our encoder are fundamentally different in the way tasks are allocated to the various computers. The original encoder is designed for total utilization of resources, even if the computers have different capabilities. This is achieved by sending a few frames to each computer as a test. When a slave finishes its allotment of frames, it notifies the server how long the job took. After all of the slaves have reported on their initial load, the server divides most of the remaining frames among the computers, using the relative completion times to weight the number of frames each gets. Frames at the end of the sequence are held in reserve; these are used to keep computers busy if some finish sooner than others.
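
That proportional division might look something like the sketch below, which weights each slave by the inverse of its benchmark time. The names and the exact rounding are illustrative; the original encoder's bookkeeping differs.

    /* Divide the bulk of the remaining frames among the slaves in proportion
       to measured speed (the inverse of each benchmark time). A handful of
       frames is held back, as described above, to occupy slaves that finish
       early. */
    void divide_frames(int nslaves, const double *bench_secs,
                       int frames_remaining, int reserve, int *share)
    {
        int to_divide = frames_remaining - reserve;
        double total_speed = 0.0;

        for (int i = 0; i < nslaves; i++)
            total_speed += 1.0 / bench_secs[i];   /* faster slave, bigger weight */

        for (int i = 0; i < nslaves; i++)
            share[i] = (int)(to_divide * (1.0 / bench_secs[i]) / total_speed);
    }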

After the initial time test, allocation of frames by the original encoder is asynchronous. This is not suitable for a real-time environment: although resource utilization is important, it is even more important that the encoder's behavior be completely predictable. Therefore, after starting the slaves, our encoder waits for each of them to report back. This message includes a number identifying the host, which is then matched with its socket (PVM would have taken care of this step for us if we had chosen to use it). One frame is then allocated to each of the computers. When those three have been completed, another three are sent. This process is repeated until all of the frames have been completed.
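
In outline, the loop looks like the sketch below. send_frame and wait_for_frame stand in for the encoder's actual socket routines (the real names differ); the essential property is that no new frames go out until every machine in the current round has reported back.

    /* Synchronous frame scheduling: one frame to each machine, then block
       until all of them finish. Predictable, at the cost of some idle time. */
    extern void send_frame(int machine, int frame_no);   /* assumed stand-ins */
    extern void wait_for_frame(int machine);

    void schedule_frames(int nmachines, int nframes)
    {
        int next = 0;
        while (next < nframes) {
            int issued = 0;
            for (int m = 0; m < nmachines && next < nframes; m++, next++, issued++)
                send_frame(m, next);      /* one frame per machine per round */
            for (int m = 0; m < issued; m++)
                wait_for_frame(m);        /* block until the whole round is done */
        }
    }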

Performance Issues

There are four possible performance bottlenecks for the parallel encoder: the processor (and memory), the hard drive, the network, and data dependencies between neighboring frames. Using the hardware we had available at the time, we ran the existing software and made predictions about what kind of hardware would be needed to achieve real time. However, these predictions fell short. One reason is that we thought the hard drive was a more important performance factor than it turned out to be; in reality, processor performance dominated more than we originally believed.

At real time, both hard drive and network throughput must be several tens of megabits per second. Assuming 100-megabit ethernet hardware, network performance will not be a factor. Current, moderately priced hard drives can provide that bandwidth, particularly given that any access of consequence is completely sequential. Currently, frames that have been encoded but have not yet been included in the output stream are buffered to disk. This is because the original encoder was designed to encode many frames on each computer concurrently, meaning that a large buffer was needed. With extra modification, our encoder could eliminate those accesses by buffering the frames in memory, further reducing the influence of the hard drive.
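
Such an in-memory buffer could be as simple as the sketch below: encoded frames sit in a small table, indexed by frame number, until the combiner drains them. The sizes and names are assumptions for illustration.

    #include <stdlib.h>
    #include <string.h>

    #define SLOTS 8                   /* assumed: a few frames in flight at once */

    struct slot { int frame_no; size_t len; unsigned char *data; };
    static struct slot buf[SLOTS];    /* data == NULL marks an empty slot */

    /* Stash an encoded frame in memory instead of writing it to disk. */
    int buffer_frame(int frame_no, const unsigned char *data, size_t len)
    {
        for (int i = 0; i < SLOTS; i++) {
            if (buf[i].data == NULL) {
                buf[i].data = malloc(len);
                if (buf[i].data == NULL)
                    return -1;
                memcpy(buf[i].data, data, len);
                buf[i].len = len;
                buf[i].frame_no = frame_no;
                return 0;
            }
        }
        return -1;                    /* table full: combiner must drain first */
    }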

Data dependency is also a relatively minor issue. With the original encoder, which was primarily designed to reduce this kind of overhead, the parallel version loses about 10% of its performance to dependencies. Our encoder, because of its less flexible scheduling rules, loses about 20% of the total execution time waiting for an event to occur. There is no practical way to improve that figure; it is dictated by the structure of the algorithm. The figure could become worse if the three computers completed their tasks in significantly differing periods of time, because our encoder blocks until all three computers complete their current tasks. However, the workload is well balanced between them, so this is not an issue.

This means that processor speed will solely determine performance. In the parallelization sections, we outlined why only three computers would be used. Originally, we expected to use older hardware. However, in order to achieve fast compression, very modern computers must be employed. Using 266-megahertz Pentium II based computers, we managed only a few frames per second.

In order to increase performance, the SIMD abilities of the Pentium II processor were used. Using the GNU program profiler, we found that a couple dozen lines of code accounted for about 70% of the encoder's execution time. To determine where predictable motion occurs between two frames, this implementation of the MPEG algorithm subtracts a block of pixels in one frame from a corresponding block in another frame. The absolute value of each difference is accumulated, and this value is used as a basis for choosing the best way to encode the frame (this is by far the most common method for accomplishing this goal, but the standard does not require that it be done this way). It is a simple process, but it is repeated many times. By using MMX instructions, the time spent on that portion of the algorithm was cut to a quarter. However, the total frame rate only increased to just past 10 frames per second.
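
The hot spot in question is the sum-of-absolute-differences computation at the heart of motion estimation. In scalar C it looks roughly like the sketch below (16 by 16 is the usual MPEG macroblock size; the original encoder's version differs in its details).

    /* Sum of absolute differences between a block in the current frame and a
       candidate block in the reference frame. Small and simple, but executed
       for every candidate motion vector, hence the 70% share of the profile. */
    unsigned sad16x16(const unsigned char *cur, const unsigned char *ref,
                      int stride)
    {
        unsigned sad = 0;
        for (int y = 0; y < 16; y++) {
            for (int x = 0; x < 16; x++) {
                int d = cur[x] - ref[x];
                sad += (unsigned)(d < 0 ? -d : d);   /* accumulate |difference| */
            }
            cur += stride;                           /* step both blocks down a row */
            ref += stride;
        }
        return sad;
    }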

This would leave two options for increasing performance: more computers or faster ones. As it turns out, the new generation of Pentium computers includes a new instruction aimed directly at performing the process described above. By combining 8 byte-sized subtractions and the accumulation of the absolute difference of those subtractions into a single instruction (who said it was a RISC world?), the performance can be increased drastically with a small increase in cost. This would clearly be the best choice for improving performance to meet the goal of real time. Minimal systems can be bought for only several hundred dollars, which means that the encoder could be assembled for only slightly more than the cost of inexpensive hardware encoders while providing better features and higher compression. The cost would still be far less than that of professional hardware encoders. As a final benefit, the hardware needed will be readily available in many settings as Pentium III based systems become the standard for personal computing as the year progresses.
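
For reference, the instruction described above is PSADBW, introduced with the Pentium III's SSE extensions. Through compiler intrinsics, the inner loop of the scalar sketch collapses to something like the following (assuming an SSE-capable compiler; the intrinsic names come from Intel's headers).

    #include <xmmintrin.h>   /* MMX/SSE intrinsics, including _mm_sad_pu8 */

    /* PSADBW version of the 16x16 SAD: each _mm_sad_pu8 call subtracts eight
       byte pairs, takes absolute values, and sums the results in a single
       instruction. */
    unsigned sad16x16_psadbw(const unsigned char *cur, const unsigned char *ref,
                             int stride)
    {
        __m64 acc = _mm_setzero_si64();
        for (int y = 0; y < 16; y++) {
            acc = _mm_add_pi32(acc, _mm_sad_pu8(*(const __m64 *)(cur + 0),
                                                *(const __m64 *)(ref + 0)));
            acc = _mm_add_pi32(acc, _mm_sad_pu8(*(const __m64 *)(cur + 8),
                                                *(const __m64 *)(ref + 8)));
            cur += stride;
            ref += stride;
        }
        unsigned sad = (unsigned)_mm_cvtsi64_si32(acc);
        _mm_empty();         /* reset MMX state before any floating-point code */
        return sad;
    }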