
SROS: A DYNAMICALLY-SCALABLE DISTRIBUTED REAL-TIME OPERATING SYSTEM FOR ATM SWITCHING NETWORK


Sung Ik Jun*, Boo Geum Jung*, Young Jun Cha*, Hyeoung Kyu Chang*, Hyung Hwan Kim*, Eun Hyang Lee*, Hyeong Ho Lee**, and Young Man Kim**

*Switching System Department, Switching & Transmission Technology Laboratory
Electronics and Telecommunications Research Institute
161 Kajong-Dong, Yusong-Gu, Taejon, 305-350, Korea
E-mail: sijun@em.re.kr

**Department of Computer Science, Kookmin University, Seoul, Korea

Abstract
In this paper, we introduce a novel fault-tolerant distributed real-time operating system, SROS (Scalable Real-time Operating System), that has been successfully implemented and is working in an ATM switching system for B-ISDN. SROS is based on a micro-kernel for the purpose of achieving portability, scalability, and reusability. In addition, to reduce the system development period and to preserve call service continuity during system/application software upgrades, SROS adopts the Dynamic Code Binding (DCB) method, which enables system/application software upgrades without system halt or service disconnection. We also propose a cross-development environment suitable for embedded, distributed real-time systems. This environment, together with the SROS mechanisms, greatly cuts down system development time.

1 Introduction

As distributed real-time systems have recently become larger in scale and more complex, portability, scalability, and reusability have become major goals of modern real-time operating systems. The micro-kernel architecture provides an excellent methodology for realizing these goals efficiently, and thus several modern operating systems such as Mach[1], Chorus[13], and VxWorks[3] are based on a micro-kernel. Although a micro-kernel based distributed real-time operating system greatly enhances portability, scalability, and reusability, the following problems still exist.

First, in an embedded real-time system, frequent upgrade or exchange of server software is inevitable in order to offer new service functions, fix software bugs, etc. However, in almost all operating systems, including those introduced above, a server operating in the system cannot be replaced with a new version unless the whole system and ongoing services are paused and all software, including the operating system, bound with the new version of the software, is reloaded. This cumbersome reloading procedure costs time in server development and causes frequent service stops or terminations.

Second, many distributed real-time systems should offer continuous service; otherwise, loss of life and property could occur during the service suspension period. For example, in an ATM switching system employed for public B-ISDN, continuous switching and transmission service is an essential goal, so service discontinuity should be restrained to at most a few minutes per year. There are two sources that prevent service continuity: system failure and software upgrading. For the first source (system failure), to protect the system from hardware failure, the hardware redundancy of a duplex architecture[5] consisting of active/standby modules can be applied to the B-ISDN ATM switching system. In that case, the operating system should offer a duplex server that manages the duplex modules. For the second source (software upgrading), to obtain service continuity during software upgrading, we need different approaches for two kinds of software: system programs and user applications. For upgrading a system program such as a server module, a dynamic binding mechanism is necessary so that the remaining parts of the operating system need no modification. As for upgrading a user application, a checkpoint mechanism records the application state, which the new version of the application inherits in order to proceed with the ongoing task.

In this paper, we introduce a fault-tolerant distributed real-time operating system, the Scalable Real-time Operating System (SROS), that is micro-kernel based and resolves all the problems mentioned above: dynamic scalability, enhanced communication performance, reliability, and service continuity.

The remaining parts of this paper are organized as follows. In Section 2, the architecture of the ATM switching system for B-ISDN with duplex structure is presented and SROS is overviewed. Then, in Section 3, the Dynamic Code Binding (DCB) method is explained, which is the key concept of SROS realizing on-the-fly upgrade of system software. DCB and the duplex structure guarantee service continuity. In Section 4, a new style of efficient system development environment for an embedded distributed real-time system is proposed. The proposed environment is distinct from conventional development environments in that it can be utilized to develop the operating system itself. Finally, in Section 5, we conclude this paper.

0-7803-4984-9/98/$10.00 © 1998 IEEE

2 ATM Switching System and SROS


SROS (Scalable Real-time Operating System) is a dynamically-scalable fault-tolerant distributed real-time operating system that is implemented in an ATM switching system for B-ISDN. In this section, we describe first the architecture of the ATM switching system for B-ISDN, then the fault-tolerant duplex structure, and finally SROS.

2.1 Architecture of ATM Switching System for B-ISDN

The architecture of the ATM switching system for B-ISDN, in which SROS manages system resources, is illustrated in Figure 1. In the system, there are three types of control modules, distributed around and interconnected by the central switching network. In a large-scale system, the number of control modules amounts to several hundred, forming a distributed system connected via the ATM switching network. The remaining links or channels of the switch provide B-ISDN services (voice, data, image, and video transmission) to user subscribers.

Figure 1: Structure of ATM switching system for B-ISDN

The first type of control module is the Operation and Maintenance Processor (OMP), which is composed of a processor, an IPC (Inter-Process Communication) device, secondary storage, and a system console for the man-machine interface. The OMP takes charge of system management, for example, call service charging, system administration, and maintenance. The Subscriber Call and Signaling Processor (SCP) is another module, configured with a processor and an IPC device. The SCP manages calls from subscribers, including signaling processing. The final type of control module is the Real-time Controller (RC), which is closely connected to the SCP and consists of a processor and an IPC device. Its main function is monitoring subscriber status at the interface device between the switching system and the subscriber terminal module. As already mentioned, the control modules share the ATM switch with customer subscribers rather than using a separate private control network. This architectural feature enables easy maintenance of the ATM system and high-speed message transfer between control modules.

2.2 Duplex Architecture

The system specification of the ATM switching system for B-ISDN requires subscriber service continuity such that the estimated service discontinuity owing to inevitable occurrences such as system failure, maintenance, and software upgrading should be lower than a few minutes per year. To overcome unplanned causes such as system failure, a fault-tolerant feature is necessary in the system. In this subsection, the duplex architecture[5] that provides this feature and is adopted in the ATM switching system for B-ISDN is presented. The other factor causing service discontinuity is the need to improve or change the system, for example by software upgrade, bug fixing, or hardware replacement; it will be discussed with its solution in the next section.

Figure 2: Duplex control architecture for main control modules

Figure 2 depicts the duplex architecture employed in the ATM switching system. Each control module is composed of two identical sets of processors and their external devices, such as the IPC device, secondary storage, and I/O adapters (Ethernet Controller (EC), Disk Controller (DC), Duplex Interface (DI), etc.): the active and standby modules. The active module is in charge of module operation, and the standby module monitors the activity of the active module. When a malfunction of the active module is detected, the active module is isolated and the standby module takes over its task. Like an extended system bus, the DI has the role of selecting the active processor between the duplicated processors. In addition, the High-speed channel (HS-ch) provides synchronization and communication between the processors. The duplicated processor units also have their own exclusive power supplies and system clocks. HS-ch is a bi-directional parallel bus used for delivering state information and backup data[5].
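The active/standby takeover described above can be illustrated with a minimal sketch in C. The paper states only that the standby module monitors the active module and takes over when a malfunction is detected; the heartbeat counter, the miss threshold `MISS_LIMIT`, and every type and function name below are assumptions made for illustration and are not taken from SROS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: standby-side monitoring of an active module.
 * The heartbeat scheme and all names are illustrative assumptions. */

enum role { ROLE_ACTIVE, ROLE_STANDBY, ROLE_ISOLATED };

#define MISS_LIMIT 3u  /* assumed: unchanged heartbeats tolerated before takeover */

struct duplex_pair {
    enum role side[2];        /* role of processor 0 and processor 1 */
    unsigned last_heartbeat;  /* last heartbeat counter seen from the active side */
    unsigned misses;          /* consecutive checks without a new heartbeat */
};

/* Processor 0 starts as active, processor 1 as standby. */
void duplex_init(struct duplex_pair *p)
{
    assert(p != NULL);
    p->side[0] = ROLE_ACTIVE;
    p->side[1] = ROLE_STANDBY;
    p->last_heartbeat = 0;
    p->misses = 0;
}

/* Index of the currently active processor, or -1 if none is active. */
int duplex_active(const struct duplex_pair *p)
{
    for (int i = 0; i < 2; i++)
        if (p->side[i] == ROLE_ACTIVE)
            return i;
    return -1;
}

/* Called periodically with the heartbeat counter most recently received
 * from the active side (e.g. over a channel such as HS-ch).
 * Returns true when this check caused a takeover. */
bool duplex_monitor_tick(struct duplex_pair *p, unsigned heartbeat)
{
    int active = duplex_active(p);
    if (active < 0)
        return false;
    if (heartbeat != p->last_heartbeat) {   /* active side made progress */
        p->last_heartbeat = heartbeat;
        p->misses = 0;
        return false;
    }
    if (++p->misses < MISS_LIMIT)
        return false;
    if (p->side[1 - active] != ROLE_STANDBY) /* no healthy standby available */
        return false;
    p->side[active] = ROLE_ISOLATED;         /* isolate the failed module */
    p->side[1 - active] = ROLE_ACTIVE;       /* standby takes over */
    p->misses = 0;
    return true;
}
```

In this sketch a caller would invoke `duplex_monitor_tick` from a periodic timer on the standby side; after `MISS_LIMIT` consecutive checks with no new heartbeat, the roles are swapped and the failed side is marked isolated.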

Figure 3: Dynamically-scalable architecture of SROS

2.3 SROS (Scalable Real-time Operating System)


Real-time systems range from small microcontrollers to large-scale systems[6]. A real-time operating system is scalable, or adaptable, if it can be constructed by combining a set of necessary functional modules according to the system size and configuration. Since the micro-kernel structure possesses high scalability, many modern operating systems, including SROS, employ a micro-kernel. In this section, SROS, a dynamically-scalable distributed real-time operating system implemented in an ATM switching system for B-ISDN, is overviewed. Some of the unique notions applied to SROS will be explained in detail in the following sections.

Although the micro-kernel structure has several desirable properties besides scalability, such as portability and reusability, upgrading a system program module requires all the other independent modules to be suspended and relocated as the current module is replaced with a new one. Switching and transmission services are therefore discontinued during the upgrade, violating the ATM switching system specification on service continuity. To overcome this defect of the conventional micro-kernel structure, SROS adopts a dynamically-scalable mechanism by which any system program module is replaced or upgraded on the fly with no interruption to the other services.

Figure 3 shows the dynamically-scalable architecture of SROS. In the figure, the micro-kernel is assigned the basic system management and control tasks: process management, semaphores and message queues for process synchronization and communication, memory management, and interrupt handling. The other system service modules are attached to and detached from the micro-kernel dynamically as system reformation or upgrade becomes necessary. In SROS, those modules include, among many others, the IPC server for inter-processor communication via the network, the ether server for communication through the LAN, the time server for time-based services, the debugging server for program debugging control (explained in a later section), the dual server for supporting and complementing the duplex fault-tolerant feature, and the file system server for the management of secondary storage devices. To bind and release a system program module dynamically with the micro-kernel, SROS devises the Dynamic Code Binding (DCB) method and its implementation specification, the Server-Kernel Interface (SKI), which will be explained in the next section.

3 Dynamic Code Binding


In this section, we present the DCB (Dynamic Code Binding)[8] method of interfacing between the kernel and system servers (or device driver modules), which enables on-the-fly software upgrade without any interruption to the other processes. Figure 4 shows a typical situation in SROS when some system servers are registered. When a system server is initialized, SROS uses the DCB table (DCBT) to register the names and addresses of the system primitives offered by the server. A call to these primitives is then resolved through the DCBT. For example, if server B in the figure is to use primitive func6 of server C, the primitive name func6 and its arguments are delivered to the micro-kernel, which finds the actual primitive address in the DCBT and calls the primitive procedure. When a server is upgraded with a new service, SROS detaches all its registered primitives from the DCBT and then attaches a new set of primitives instead. Since the table look-up is critical to system performance, the DCBT is implemented as a multi-level hash table structure. It is also possible to change the binding method from DCB to conventional static binding after the server development phase is completed and nothing related to the server is expected to change for the foreseeable future.

Figure 4: Dynamic Code Binding structure

One of the limitations of the original micro-kernel architecture is the performance degradation stemming from message passing between the kernel and system servers[2]. To remove this obstacle, OSF/1 introduced the notion of address collocation, by which a server shares the system address space with the kernel so as to use procedure calls between kernel and server. However, in the address collocation method, all the other independent processes must be stopped to upgrade any system server. Since the DCB method adopted in SROS resolves this defect and can collocate dynamically, guaranteeing service continuity, DCB is called a dynamic collocation method.

To implement DCB, a unitary interface between server/device driver and kernel, SKI (Server-Kernel Interface)[9], is defined. SKI prescribes the processes of loading, unloading, and executing servers. When a server is loaded, the loader calls a pre-defined server initialization function, which initializes data structures and attaches the server primitives to the DCBT. When the server is unloaded, the unloader calls the server finish function to detach its registered primitives and reset the modified environment. During the registered period, a call to any server primitive is resolved by searching for its name in the DCBT using a hash function.

Currently, C and assembly language are used to develop the various system servers and applications. Since the ATM switching system control modules (SCP, OMP, and RC) are real-time systems loaded with memory-persistent system programs and applications, without ample memory space available for software development steps such as compiling and linking, software is developed on a host system and downloaded to the target ATM system modules. The kernel is pre-allocated to a fixed address, downloaded to target memory without relocation, and then started. A server, on the other hand, is compiled and linked with the relocation option. The dynamic loader at the target system then allocates memory for the downloaded server and relocates its code and data to the dynamically-allocated memory. After that, the dynamic loader executes the pre-defined initialization function of the server, which in turn attaches its primitives to the DCBT and initializes the server variables and data structures.

Figure 5: System structure using SKI

Before leaving this section, we review the SROS structure related to SKI from another angle. Figure 5 illustrates the SROS structure, highlighting the role and position of SKI. System servers and device drivers adopt an identical SKI interface mechanism. This simple interface makes them easy to program and debug. Each system server consists of user primitives for applications, server primitives for the other system servers, and its own local functions. Device drivers usually have no user primitives, and their driver primitives are provided for the system servers. Sometimes a device driver is used as a part of some server. For example, the LAN driver is used by the ethernet server or the IPC server. In that case, the LAN driver provides primitives such as device initialization and packet sending and receiving to the ethernet server. In addition, a driver has driver functions to manage and control its device resource. All user primitives of the kernel and servers should be predefined and enrolled as system call service routines, so that the API (Application Program Interface) follows the traditional system call by trap. The kernel primitives used for server development include process control, memory management, semaphore, message queue, and interrupt handling functions.
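To make the name-based binding described in this section concrete, the following is a minimal sketch in C of a binding table with attach, detach, and call-by-name operations, together with initialization and finish functions of the kind a loader and unloader would invoke. It is not the SROS implementation: the paper describes a multi-level hash table, whereas this sketch assumes a single-level table with linear probing, and all identifiers (`dcbt_attach`, `dcbt_detach`, `dcbt_call`, `server_c_v1_init`, and so on), the table size, and the primitive signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of a Dynamic Code Binding Table (DCBT).
 * Single-level hashing, sizes, and all names are assumptions. */

#define DCBT_SLOTS 64
#define DCBT_NAME_MAX 32

typedef long (*primitive_fn)(long arg);   /* assumed primitive signature */

enum slot_state { SLOT_EMPTY = 0, SLOT_USED, SLOT_DELETED };

struct dcbt_entry {
    char name[DCBT_NAME_MAX];   /* primitive name, e.g. "func6" */
    primitive_fn fn;            /* address currently bound to the name */
    enum slot_state state;
};

static struct dcbt_entry dcbt[DCBT_SLOTS];   /* zero-initialized: all empty */

static unsigned dcbt_hash(const char *s)
{
    unsigned h = 5381u;
    while (*s)
        h = h * 33u + (unsigned char)*s++;
    return h % DCBT_SLOTS;
}

/* Returns the slot index holding name, or -1 if it is not registered. */
static int dcbt_find(const char *name)
{
    unsigned start = dcbt_hash(name);
    for (unsigned i = 0; i < DCBT_SLOTS; i++) {
        unsigned idx = (start + i) % DCBT_SLOTS;
        if (dcbt[idx].state == SLOT_EMPTY)
            return -1;
        if (dcbt[idx].state == SLOT_USED && strcmp(dcbt[idx].name, name) == 0)
            return (int)idx;
    }
    return -1;
}

/* Registers a primitive; 0 on success, -1 on duplicate, long name, or full table. */
int dcbt_attach(const char *name, primitive_fn fn)
{
    assert(name != NULL && fn != NULL);
    if (strlen(name) >= DCBT_NAME_MAX || dcbt_find(name) >= 0)
        return -1;
    unsigned start = dcbt_hash(name);
    for (unsigned i = 0; i < DCBT_SLOTS; i++) {
        unsigned idx = (start + i) % DCBT_SLOTS;
        if (dcbt[idx].state != SLOT_USED) {
            strcpy(dcbt[idx].name, name);
            dcbt[idx].fn = fn;
            dcbt[idx].state = SLOT_USED;
            return 0;
        }
    }
    return -1;
}

/* Removes a primitive; 0 on success, -1 if the name is not registered. */
int dcbt_detach(const char *name)
{
    int idx = dcbt_find(name);
    if (idx < 0)
        return -1;
    dcbt[idx].state = SLOT_DELETED;   /* tombstone keeps probe chains intact */
    dcbt[idx].fn = NULL;
    return 0;
}

/* Kernel-side call: the address is resolved by name at call time. */
int dcbt_call(const char *name, long arg, long *result)
{
    int idx = dcbt_find(name);
    if (idx < 0)
        return -1;
    *result = dcbt[idx].fn(arg);
    return 0;
}

/* Two versions of a hypothetical primitive offered by "server C". */
static long func6_v1(long x) { return x + 1; }
static long func6_v2(long x) { return x * 2; }

/* Initialization/finish functions, as a loader and unloader would call them. */
int server_c_v1_init(void)   { return dcbt_attach("func6", func6_v1); }
int server_c_v1_finish(void) { return dcbt_detach("func6"); }
int server_c_v2_init(void)   { return dcbt_attach("func6", func6_v2); }
```

Because callers hold only the name, replacing version 1 with version 2 requires no change to them: after `server_c_v1_finish()` and `server_c_v2_init()`, the next `dcbt_call("func6", ...)` reaches the new code. Note that this sketch leaves a short window between detach and attach in which the call fails; how SROS handles calls arriving during that window is not described in the paper.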

4 Program Development Environment


In an embedded real-time system, in which the host and target systems are in charge of program development and execution respectively, a cross-development environment should be constructed with a communication channel between host and target for the purposes of program download, execution monitoring, and program debugging[10, 15, 12, 14]. The recent growth of the embedded system market raises the importance of an efficient cross-development environment[7, 11].


One of the primary supporting functions in a cross-development environment is a source-level cross debugger, by which a test program is loaded into target memory and executed under the control of the debugger. Currently, commercial real-time operating systems such as VxWorks offer such tools. However, it is difficult to find a development environment that makes such tools available for developing system programs such as an operating system. For developing system programs, VxWorks provides a debugging tool for device drivers which, however, does not work properly when debugging other parts of an operating system under execution. Furthermore, since VxWorks is designed for stand-alone systems, it is not suitable as a development environment for distributed systems. In this section, we propose a cross-development environment for distributed real-time systems that is based on the DCB described in the previous section.

In the past, since the development environment was constructed after completing the target operating system, no development environment was available during OS development. This is an inevitable weakness of the monolithic-structure operating system. In contrast, in a micro-kernel based OS like SROS, a cross-development environment can be built on a minimum subset of OS component modules: the micro-kernel, IPC server, host-target communication server, and debugging server. Figure 6 depicts the micro-kernel and server structure for the cross-development environment. The debugging server in the target system offers basic debugging functions such as memory/register read/write and execution flow control functions such as break, continue, step, and trace. The target manager at the host system exchanges information with the target system through the communication server. Since all basic system calls are managed by a few servers and the micro-kernel, a development environment for any other server can be constructed upon them, and the server under development can be dynamically loaded and debugged during normal OS operation without interruption (by the DCB mechanism in SROS).
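As an illustration of the basic debugging functions listed above, the following C sketch shows how a target-resident debugging server might dispatch requests received from the host. The request layout, the command codes, and the simulated target state are all assumptions introduced for this example; the paper does not specify the debug protocol or the internal structure of the debugging server.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a debugging server request handler.
 * Command set, structures, and sizes are illustrative assumptions. */

enum dbg_cmd { DBG_MEM_READ, DBG_MEM_WRITE, DBG_BREAK, DBG_CONTINUE, DBG_STEP };
enum dbg_status { DBG_OK = 0, DBG_ERR_RANGE = -1, DBG_ERR_CMD = -2 };

#define TARGET_MEM_SIZE 256u
#define MAX_BREAKPOINTS 8

/* Simulated state of the program being debugged. */
struct dbg_target {
    uint8_t  mem[TARGET_MEM_SIZE];
    uint32_t pc;                         /* simulated program counter */
    int      running;                    /* 1 while executing, 0 while stopped */
    uint32_t breakpoints[MAX_BREAKPOINTS];
    int      nbreak;
};

/* One request as it might arrive from the host-side target manager. */
struct dbg_request {
    enum dbg_cmd cmd;
    uint32_t addr;    /* memory or breakpoint address */
    uint8_t  value;   /* value for DBG_MEM_WRITE */
};

/* Handles one request; for DBG_MEM_READ the byte is stored in *out. */
int dbg_server_handle(struct dbg_target *t, const struct dbg_request *req,
                      uint8_t *out)
{
    assert(t != NULL && req != NULL);
    switch (req->cmd) {
    case DBG_MEM_READ:
        if (req->addr >= TARGET_MEM_SIZE || out == NULL)
            return DBG_ERR_RANGE;
        *out = t->mem[req->addr];
        return DBG_OK;
    case DBG_MEM_WRITE:
        if (req->addr >= TARGET_MEM_SIZE)
            return DBG_ERR_RANGE;
        t->mem[req->addr] = req->value;
        return DBG_OK;
    case DBG_BREAK:                      /* record a breakpoint address */
        if (t->nbreak >= MAX_BREAKPOINTS)
            return DBG_ERR_RANGE;
        t->breakpoints[t->nbreak++] = req->addr;
        return DBG_OK;
    case DBG_CONTINUE:                   /* resume execution */
        t->running = 1;
        return DBG_OK;
    case DBG_STEP:                       /* advance one simulated step, then stop */
        t->pc += 1;
        t->running = 0;
        return DBG_OK;
    default:
        return DBG_ERR_CMD;
    }
}
```

Keeping the target-side handler this small is in line with the stated goal of off-loading the target: symbol handling and user interaction would stay in the host tools, and only primitive operations would run on the target.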

Figure 6: Debugging architecture in SROS environment

In contrast to SROS, with a conventional OS like VxWorks, independent loading and unloading of a server is almost impossible due to the interdependence between servers created by static binding. Moreover, as information exchange between distributed sites is supported network-transparently by the IPC server and ether server, a distributed development environment is constructed easily in SROS.

In summary, a cross-development environment can be constructed on a minimal micro-kernel based OS that includes only the micro-kernel, IPC server, networking server, communication server, and debugging server. Upon this environment, the remaining system servers can be implemented quickly. Since the ether server and IPC server provide network transparency to the remaining parts, so that the whole system operates like a stand-alone system, software reusability is greatly enhanced, and the design and implementation of development tools for SROS is much easier and quicker than for conventional operating systems.

Figure 6 describes the detailed debugging architecture in the SROS environment. The key components supporting the debugging facilities are a target manager at the host system and a debugging server at each target system. The target manager is a host-resident module that acts as a target information server for multiple host-based client debugging tools. It receives requests from the debugging tools on the host and transfers them to the corresponding debugging server on a target, and it receives the results from the debugging server and delivers them back to the tools; that is, a single target server can provide services to many host-based debugging tools. The debugging server is a small target-resident module that not only implements basic debugging support facilities but also provides inter-target routing and simple services to support a debug protocol, a peer-level protocol between the debugging server and the target manager. It is a self-contained, media-independent protocol that allows multiple host-resident tools to communicate with target systems. This approach allows any number of users to debug one target concurrently with multiple host-based tools, or a single host-based tool to debug multiple cooperating targets simultaneously.

Figure 6 also describes the client-server debugging architecture of SROS. In the traditional client-server architecture, the clients are host tools and the server is a target-resident monitor. We replace this configuration with a more advanced one in which both the clients and the server reside on host machines. Host-based tools at any host in the system (the clients) are served not by a target-resident server (debugging server) directly but rather by a host-resident delegate (target manager). This architecture provides greater flexibility and functionality since all the host machines are on a single network. It also off-loads the target by keeping only the bare minimum of function on it and optimizing the traffic between host and target.

The debugging server, in addition to conventional debugging functions, provides facilities to handle features of the distributed real-time system such as interprocess communication and timeouts. The facilities can be summarized as follows.

The ability to trap on any interprocess communication (IPC).
The ability to modify, insert, and delete IPC messages.
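The two facilities above can be pictured with a small C sketch in which the message delivery path consults an optional trap hook installed by the debugger. This is only one possible arrangement, shown under stated assumptions: the message layout, the fixed-length queue, and the names `ipc_send`, `ipc_debug_insert`, and `example_trap` are hypothetical and do not come from SROS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of an IPC delivery path with a debugger trap hook.
 * Message layout, queue, and all names are illustrative assumptions. */

#define IPC_QUEUE_LEN 8

struct ipc_msg {
    int src;       /* sending process id */
    int dst;       /* receiving process id */
    int payload;
};

enum trap_action { TRAP_PASS, TRAP_DROP };

/* A trap may inspect and modify the message in place, or ask to drop it. */
typedef enum trap_action (*ipc_trap_fn)(struct ipc_msg *m);

struct ipc_queue {
    struct ipc_msg slots[IPC_QUEUE_LEN];
    size_t count;
    ipc_trap_fn trap;   /* NULL when no debugger trap is installed */
};

/* Normal send path: returns 1 if queued, 0 if deleted by the trap,
 * -1 if the queue is full. */
int ipc_send(struct ipc_queue *q, struct ipc_msg m)
{
    assert(q != NULL);
    if (q->trap != NULL && q->trap(&m) == TRAP_DROP)
        return 0;
    if (q->count >= IPC_QUEUE_LEN)
        return -1;
    q->slots[q->count++] = m;
    return 1;
}

/* Debugger-initiated insertion: bypasses the trap hook. */
int ipc_debug_insert(struct ipc_queue *q, struct ipc_msg m)
{
    assert(q != NULL);
    if (q->count >= IPC_QUEUE_LEN)
        return -1;
    q->slots[q->count++] = m;
    return 1;
}

/* Example trap: delete messages addressed to process 9 and
 * modify the payload of every other message. */
static enum trap_action example_trap(struct ipc_msg *m)
{
    if (m->dst == 9)
        return TRAP_DROP;
    m->payload += 100;
    return TRAP_PASS;
}
```

With `example_trap` installed in `q->trap`, a message to process 2 is delivered with a modified payload, a message to process 9 is deleted, and `ipc_debug_insert` lets the debugger place an arbitrary message in the queue.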

There are two types of errors that we often encounter while debugging a distributed real-time program: logical errors and timing errors. To help debug the latter, the debugging server monitors and records the internal timing behavior of the system. The recorded information is analyzed off-line by host-resident debugging tools.

5 Conclusion

SROS, a dynamically-scalable distributed real-time operating system, has been implemented in an ATM switching system for B-ISDN and has been operating successfully for a couple of years. SROS was designed and implemented from scratch, based on the experience gained from its predecessor, the Concurrent Real-time Operating System (CROS)[4]. CROS has been operating in TDX-10, a modern telecommunication switching system, for over 12 years. Although CROS had good real-time performance and high reliability, it also exposed several defects in field application. First, it was not easy to port CROS even to a slightly different version of the telecommunication system, since CROS was monolithic and each module was tightly coupled with the others. For example, hardware-dependent code was scattered throughout. Second, when a module was inserted or replaced with a new one, it was usually necessary to examine all parts of CROS to prevent harmful side effects. Finally, no efficient commercial cross-development environment for distributed real-time systems was available to us.

Therefore, when we planned to design a new operating system for the ATM switching system for B-ISDN, the defects mentioned above were carefully analyzed and their solutions were devised. In particular, we selected the micro-kernel structure for SROS so as to confine the hardware-dependent part to a small micro-kernel and to achieve high portability and scalability. Also, to guarantee service continuity during system server upgrading, we devised a notion of dynamic collocation, DCB (Dynamic Code Binding), and implemented DCB as the SKI (Server-Kernel Interface) specification. Finally, a cross-development environment model suitable for distributed real-time systems was devised and implemented. It enables on-the-fly system debugging from early stages of development, such as the OS development phase, and minimizes the interruption to the embedded real-time modules under examination, so that a debugging session in a distributed real-time system reproduces the same fine-grained behavior as normal execution.

References

[1] D.L. Black, et al., Microkernel Operating System Architecture and Mach, Proceedings of USENIX Workshop on Micro-kernels and Other Kernel Architectures, Seattle, Washington, pp. 11-30, April 1992.
[2] M. Condict, et al., Microkernel Modularity with Integrated Kernel Performance, Collected Papers, The OSF Research Institute, pp. 1-15, June 1994.
[3] J. Fogelin, Minimizing Preemptive Latency in Real-Time Kernels, WINDAYS, G6, April 1991.
[4] Sung-Ik Jun, et al., Concurrent Realtime Operating System for Telecommunication System, EIC, pp. 67-71, Japan, July 1991.
[5] Sung-Ik Jun, Design and Implementation of a Fault Tolerant Control Architecture for Large Scale Real-Time Systems, APSMT, Bangkok, Thailand, 1993.
[6] Young-Boo Kim, et al., An Architecture of Scalable ATM Switching System and its Call Processing Capacity, ETRI Journal, vol. 18, no. 3, pp. 107-124, Oct. 1996.
[7] Boo-Geum Jung, et al., Micro-Kernel and Cross-Development Methodology for Distributed Real-Time Systems, JCCI95, pp. 646-649, Kyoungju, Korea, April 1995.
[8] Boo-Geum Jung, et al., Dynamic Code Binding for Scalable Operating System in Distributed Real-Time Systems, IEEE ICCS/ISPACS, Singapore, 1996.
[9] Boo-Geum Jung, et al., A Scalable Architecture for Distributed Real-Time Operating System, IEEE ICCS/ISPACS, Sydney, Australia, 1997.
[10] P.S. Dodd and C.V. Ravishankar, Monitoring and debugging distributed real-time programs, Software Practice and Experience, vol. 22, no. 10, pp. 863-877, Oct. 1992.
[11] Dong-Gill Lee, et al., A New Integrated Software Development Environment Based on SDL, MSC, and CHILL for Large-scale Switching Systems, ETRI Journal, vol. 18, no. 4, pp. 265-286, Jan. 1997.
[12] C.E. McDowell and D.P. Helmbold, Debugging Concurrent Programs, ACM Computing Surveys, vol. 21, no. 4, pp. 593-622, Dec. 1989.
[13] M. Rozier, et al., Overview of the Chorus Distributed Operating System, Proceedings of USENIX Workshop on Micro-kernels and Other Kernel Architectures, Seattle, Washington, pp. 39-69, April 1992.
[14] W. Schutz, Fundamental Issues in Testing Distributed Real-Time Systems, Real-Time Systems, vol. 7, pp. 129-257, 1994.
[15] H. Garcia-Molina, et al., Debugging a distributed computing system, IEEE Transactions on Software Engineering, vol. 10, no. 2, pp. 210-219, March 1984.
[16] P.D. Varhol, Small Kernels Hit it Big, BYTE, pp. 119-124, Jan. 1994.
