DOCUMENT IDENTIFICATION
Word file: NOC process & roles - English.doc
DOCUMENT STATUS
Corrections denote minor changes to the document (new release). Modifications denote major changes to the document (new version).
DOCUMENT CONFIDENTIALITY
Private document (internal company use). Highly sensitive document (addressees only).
12/03/2008
Page 1
DOCUMENT CONTROL
DOCUMENT VERIFICATION
Reviewer / Approver | Name: Bernard Fanga | Date | Signature: BF
DISTRIBUTION LIST
Recipients | Contact details
DOCUMENT REVISION HISTORY
Version: 1.0 | Date: 11/10/2010 | Operation and comment: Update
CONTENTS
1. ROLES
1.1 Monitor
1.2 Manage
1.3 Troubleshoot
2. FUNCTIONS
2.1 Performance monitoring
2.2 Status monitoring
2.3 Alert management
2.4 Policy monitoring
2.5 Quality assurance
2.6 Reporting
2.7 Schedule
2.8 Documentation
3. ESCALATION LEVEL BY CARRIER
3.1 MTN
3.2 ORANGE
3.3 CAMTEL
3.4
4.
5.
6. INCIDENT MANAGEMENT
1. ROLES
The AES SONEL NOC is a team with several important roles within the IT infrastructure. In general, the team has to:
- Monitor
- Manage
- Troubleshoot
1.1 MONITOR
The NOC has to monitor the entire network system: routers, switches, UPS units, servers and so on, so that the WAN, MAN and LAN can be managed across all AES-SONEL facilities. This equipment exposes resources such as CPU, memory, NIC, storage/flash, power supply, battery charge level, temperature, network equipment interfaces, links, etc. We have to check this information and collect events, because they help the team to be proactive.
1.2 MANAGE
Some parameters on this equipment have to be optimized. Before anything is changed on the equipment, forms have to be filled in and approved by a committee. The NOC can also adapt its reports to the needs of management, which should help management take the right decisions.
1.3 TROUBLESHOOT
When an incident happens and nobody knows its source, the NOC has to investigate and find the problem. Once it is found, the NOC fixes it if it has the authority to do so, or transfers it to the service or person in charge of that task.
2. FUNCTIONS
The NOC covers a large share of the functions around the IT system. Its main functions are:
Level 1:
- Performance monitoring
- Status monitoring
- Policy monitoring
- Incident management
- Open, update, and close trouble tickets
- Periodic activities
- Reporting
- Documentation
- Other duties as assigned
Level 2:
- Monitor data communications networks to ensure that networks are available to all system users
- Monitor datacenter infrastructures
- Resolve and document data communications problems
- Develop and follow troubleshooting procedures in an effort to resolve problems
- Contact users to correct and maintain network operations
- Escalate problems as needed to engineering staff
- Record daily network statistics
- Open, update, and close trouble tickets
- Update documentation to record new equipment installed, new sites, and changes to configurations
- Coordinate installation of communications equipment
- Install communications equipment
- Schedule operations on IT facilities
- Quality assurance
- Other duties as assigned
Given these functions, the NOC works as a back-end team that solves problems affecting a site, a building or the global network. End-user problems are handled by the HelpDesk.
2.6 REPORTING
The NOC team has to produce:
- Weekly and monthly reports on performance and status monitoring of network equipment
- Monthly KPI reports
Monthly reports:
- KPI report of the network
- Availability of the entire network and of each MAN/WAN operator link
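As a rough illustration of the availability figure in the monthly report, the sketch below computes availability as the share of the reporting period during which a link was up. The function name, the representation of outages as start/end pairs, and the sample figures are assumptions made for this example; the document does not define the exact formula the NOC uses.

```python
from datetime import datetime, timedelta

def availability_percent(period_start, period_end, outages):
    """Return link availability (%) over a reporting period.

    `outages` is a list of (start, end) datetime pairs. Each outage is
    clipped to the reporting period; outages are assumed not to overlap.
    """
    period = (period_end - period_start).total_seconds()
    if period <= 0:
        raise ValueError("period_end must be after period_start")
    down = 0.0
    for start, end in outages:
        start = max(start, period_start)
        end = min(end, period_end)
        if end > start:
            down += (end - start).total_seconds()
    return 100.0 * (period - down) / period

# Hypothetical example: a 30-day month with a single 3-hour outage.
month_start = datetime(2010, 11, 1)
month_end = month_start + timedelta(days=30)
outage = (datetime(2010, 11, 10, 8, 0), datetime(2010, 11, 10, 11, 0))
print(round(availability_percent(month_start, month_end, [outage]), 2))  # 99.58
```

The same function can be called once per operator link and once for the whole network, provided outage times are recorded consistently in the incident log.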
2.7 SCHEDULE
To cover the NOC team's tasks, a NOC schedule has to be put in place, based on 02 teams working in round robin. The working hours are:
- Group 01: from 07H00 to 15H30, with a break from 12H00 to 13H00
- Group 02: from 10H00 to 18H30, with a break from 13H00 to 14H00
During the month, the two groups should alternate:
- Group 01: 1st and 3rd weeks
- Group 02: 2nd and 4th weeks
On Saturday, the group which starts at 07H00 should work from 08H00 to 14H00.
2.8 DOCUMENTATION
Each incident has to be documented. Each action carried out by a supplier or an employee must be recorded in an intervention report.
2.9 PREREQUISITES:
LEVEL 1:
Network Operations Center
- Knowledge of network operations.
- Knowledge of layer data communications protocols.
- Previous experience with tools used in monitoring the network.
- Ability to troubleshoot network problems effectively in a network operations environment.
- Maintain a broad knowledge of all products, services and NOC procedures.
- Strong interpersonal, verbal, and written communication skills.
- Excellent organizational, multitasking, prioritizing, and teamwork skills.
- Ability to work independently with little supervision.
- Ability to qualify for security clearance.
LEVEL 2:
Network Operations Center
- Knowledge of CISCO routers and switches, and of VSAT network operations.
- Knowledge of layer data communications protocols.
- Ability to verify that switches and routers, as well as their configured network services and protocols, operate as intended within a given network specification.
- Previous experience with tools used in monitoring the network, including datacenter management tools.
- Ability to troubleshoot network problems effectively in a network operations environment.
- Maintain a broad knowledge of all products, services and NOC procedures.
- Strong interpersonal, verbal, and written communication skills.
- Excellent organizational, multitasking, prioritizing, and teamwork skills.
- Ability to work independently with little supervision.
- Ability to qualify for security clearance.
3.1 MTN
HOTLINE number to call for incidents: 7126

Level | Delay
1 | Immediately
2 | hour
3 | 1 hour
4 | 2 hours
5 | 4 hours

Contacts, in escalation order: NOC MTN, NOC Coordinator, Solution & Service Support Manager, Account Manager, Senior Operations Manager, Senior Manager Corporate Sales, Chief Technical Officer

Contact | Person | Cellular | E-mail
NOC MTN | Network Operations Center | 79 00 92 13 | noc@mtncameroon.net
NOC Coordinator | Armand Pichele | 77 55 04 61 | Pichel_A@mtncameroon.net
Solution and Service Support Manager | Samuel PII | 77 55 02 58 | Pii_s@mtncameroon.net
Account Manager | Augustin MIAFFO | 77 55 03 51 | Miaffo_A@mtncameroon.net
Senior Operations Manager | Pierre Paul BISSOMBI | 77 55 10 97 | Bissom_p@mtncameroon.net
Senior Manager Corporate Sales | Alain MORE | 77 55 05 13 | More_A@mtncameroon.net
Chief Technical Officer | Gilbert NGONO | 77 55 10 01 | Ngono_C@mtncameroon.net
3.2 ORANGE
HOTLINE number to call for incidents: 96 40 04 00

Level | Delay
1 | Immediately
2 | 4 hours

Contacts, in escalation order: NOC ORANGE, Infrastructure Manager, Director of Operations, Senior Project Manager, OCMS Director, Deputy Managing Director (DGA) in charge of technical and administrative affairs

Contact | Person | Cellular | E-mail
NOC ORANGE | Network Operations Center | 96 40 04 00 | supporttechnique.internet@orange.cm
Infrastructure Manager | Martin BIYICK | 99 94 98 81 | martin.biyick@orange.cm
Director of Operations | | 99 94 28 38 | serge.nafteur@orange.cm
Senior Project Manager | | 99 94 12 20 | danielparfait.nlend@orange.cm
OCMS Director | | 99 94 01 04 | jeanmichel.canto@orange.cm
Deputy Managing Director (DGA) in charge of technical and administrative affairs | Alain MARQUIS | 99 94 08 08 | alain.marquis@orange.cm
3.3 CAMTEL
HOTLINE number to call for incidents: ?

Levels: 1a, 1b, 1c, 2, 3, 4, 5
Delay: Immediately
Contacts, in escalation order: SAT3 operations room; Douala Internet Service; Yaoundé Internet Service; Technical Manager, Littoral; After-Sales Service Division; AES Account Manager; Sales Manager

Persons: DONFACK LOWE Bertin AKOUA Anicet KAMA Bienvenu EYOUM KOUADIO Chantal NJI AWA Mathias
Cellular: 33 41 07 55, 22 02 04 53, 22 00 73 33, 33 02 13 55, 22 02 01 12, 33 00 30 03, 22 00 12 91
E-mail: bertinlowe2002@yahoo.fr
Level: 1
Delays: Immediately; hour; 3/4 hour; 1 hour
Contacts, in escalation order: NOC AES SONEL; Network Team, Infrastructures Team, Application Team; Division Head, Deputy Director of Infrastructure, Head of the Infrastructures and Networks Division; DSI

Contacts: NOC; Network Team; Infrastructures Team; Head of ITSC Service; Application Team Leader; Network and Infra Team Leader; Deputy Director DSI
Persons: Gabriel OYONO; Felix NGOH; Daniel Claude MOFEN; Sidonie MBWANG; Bernard FANGA
E-mail: sonel.noc@aes.com, sylvain.bithe@aes.com, intern.tbell@aes.com, nicolas.tongo@aes.com, Camille.kite@aes.com, aimeclaude.tampoo@aes.com, gabriel.oyono@aes.com, felix.ngoh@aes.com, christian.awomo@aes.com, Edibe.dooh@aes.com, Christian.nolaze@aes.com, Bernard.fanga@aes.com, jeanlouis.ngamby@aes.com
[Flowchart: escalation procedure. Escalation 1 (immediately): define the list of persons to alert (provider NOC, AES SONEL NOC, AES SONEL network team); the NOC sends the alert by e-mail to all recipients on the list, stating the time elapsed since the incident started; YES/NO decision branches.]
6. INCIDENT MANAGEMENT
An incident is opened each time a link goes down. The information needed to manage an incident is:
- Incident number (identifier)
- Date and time the incident occurred
- Type of incident
- Description of the incident
- The interventions made to solve the incident (TSP & ISP / AES SONEL)
- Date and time the incident was closed
- Time elapsed from detection of the incident to its resolution
- Optionally, a copy of the work permit for the intervention
- Report of the troubleshooting operations carried out
At the end of each week and each month, we can produce a report on how many incidents were opened and closed. These reports are useful for statistics and for analyzing how the carrier and AES SONEL react when an event occurs. They can also be used as a basis for payment penalties. Useful reports include:
- Opened incidents, closed incidents, percentage of incidents solved
- Incidents opened and closed by provider, with minimum, maximum and average times to solve the incident
- Incidents opened and closed by level of escalation
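The per-provider report described above can be sketched as follows. This is only an illustration: the `Incident` record, its field names, and the sample incidents are hypothetical and would need to match whatever ticketing tool the NOC actually uses.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional

@dataclass
class Incident:
    number: str                        # incident identifier
    provider: str                      # carrier concerned
    opened: datetime                   # date and time the incident occurred
    closed: Optional[datetime] = None  # None while the incident is still open

def provider_report(incidents):
    """Summarize incidents per provider: opened/closed counts, percentage
    solved, and min/max/average resolution time in hours."""
    report = {}
    for provider in sorted({i.provider for i in incidents}):
        mine = [i for i in incidents if i.provider == provider]
        hours = [(i.closed - i.opened).total_seconds() / 3600
                 for i in mine if i.closed is not None]
        report[provider] = {
            "opened": len(mine),
            "closed": len(hours),
            "percent_solved": 100.0 * len(hours) / len(mine),
            "min_h": min(hours) if hours else None,
            "max_h": max(hours) if hours else None,
            "avg_h": mean(hours) if hours else None,
        }
    return report

# Hypothetical sample data: two closed MTN incidents, one open ORANGE incident.
incidents = [
    Incident("INC-001", "MTN", datetime(2010, 11, 1, 8), datetime(2010, 11, 1, 10)),
    Incident("INC-002", "MTN", datetime(2010, 11, 2, 9), datetime(2010, 11, 2, 13)),
    Incident("INC-003", "ORANGE", datetime(2010, 11, 3, 7)),
]
report = provider_report(incidents)
print(report["MTN"])
```

A report by escalation level would follow the same pattern with an additional field recording the highest level reached.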
Connectivity problems occur when end stations cannot communicate with other areas of your local area network (LAN) or wide area network (WAN). Using management tools, you can often fix a connectivity problem before users even notice it. Connectivity problems include:
- Loss of connectivity - When users cannot access areas of your network, your organization's effectiveness is impaired. Immediately correct any connectivity breaks.
- Intermittent connectivity - Although users have access to network resources some of the time, they are still facing periods of downtime. Intermittent connectivity problems can indicate that your network is on the verge of a major break. If connectivity is erratic, investigate the problem immediately.
- Timeout problems - Timeouts cause loss of connectivity, but are often associated with poor network performance.
Your network has performance problems when it is not operating as effectively as it should. For example, response times may be slow, the network may not be as reliable as usual, and users may be complaining that it takes them longer to do their work. Some performance problems are intermittent, such as instances of duplicate addresses. Other problems can indicate a growing strain on your network, such as consistently high utilization rates. If you regularly examine your network for performance problems, you can extend the usefulness of your existing network configuration and plan network enhancements, instead of waiting for a performance problem to adversely affect the users' productivity.
7.1.3 Solving Connectivity and Performance Problems
When you troubleshoot your network, you employ tools and knowledge already at your disposal. With an in-depth understanding of your network, you can use network software tools, such as "Ping", and network devices, such as "NMS", to locate problems, and then make corrections, such as swapping equipment or reconfiguring segments, based on your analysis. So you can:
- Baseline the network's normal status to use as a basis for comparison when the network operates abnormally
- Precisely monitor network events
- Be notified immediately of critical problems on your network, such as a device losing connectivity
- Establish alert thresholds to warn you of potential problems that you can correct before they affect your network
- Resolve problems by disabling ports or reconfiguring devices
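One possible way to turn a baseline into an alert threshold is sketched below. The rule used here (mean plus three standard deviations of the baseline samples) is an assumption chosen for illustration, not a rule stated in this document, and the sample CPU figures are hypothetical.

```python
from statistics import mean, stdev

def alert_threshold(baseline_samples, k=3.0):
    """Derive an alert threshold from baseline samples:
    mean + k sample standard deviations (k=3 is an assumed default)."""
    return mean(baseline_samples) + k * stdev(baseline_samples)

def check(value, threshold):
    """Return "ALERT" when a reading exceeds the threshold, else "OK"."""
    return "ALERT" if value > threshold else "OK"

# Hypothetical CPU utilization (%) sampled while the network behaves normally.
baseline = [20, 22, 19, 21, 23, 20, 21]
threshold = alert_threshold(baseline)

print(check(22, threshold))  # reading within the normal range
print(check(60, threshold))  # reading far above the baseline
```

In practice the threshold would be configured in the monitoring tool itself; the point of the sketch is only that the baseline, not a guess, drives the value.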
8. TROUBLESHOOTING STRATEGY
If you notice changes on your network, ask the following questions:
- Is the change expected or unusual?
- Has this event ever occurred before?
- Does the change involve a device or network path for which you already have a backup solution in place?
- Does the change interfere with vital network operations?
- Does the change affect one or many devices or network paths?
After you have an idea of how the change is affecting your network, you can categorize it as critical or noncritical. Both of these categories need resolution (except for changes that are one-time occurrences); the difference between the categories is the time that you have to fix the problem. By using a strategy for network troubleshooting, it is possible to approach a problem methodically and resolve it with minimal disruption to network users. It is also important to have an accurate and detailed map of your current network environment. Beyond that, a good approach to problem resolution is:
1. Recognizing symptoms
2. Understanding the problem
3. Identifying and testing the cause of the problem
4. Solving the problem
The first step to resolving any problem is to identify and interpret the symptoms. You may discover network problems in several ways. Users may complain that the network seems slow or that they cannot connect to a server. You may pass your network management station and notice that a node icon is red. Your beeper may go off and display the message: WAN connection down.
8.1.1.1 User Comments
Although you can often solve networking problems before users notice a change in their environment, you invariably get feedback from your users about how the network is running, such as:
- They cannot print.
- They cannot access the application server.
- It takes them much longer to copy files across the network than it usually does.
- They cannot log on to a remote server.
- When they send e-mail to another site, they get a routing error message.
- Their system freezes whenever they try to Telnet.
8.1.1.2 Network Management Software Alerts
Network management software, as described in "Your Network Troubleshooting Toolbox", can alert you to areas of your network that need attention. For example:
- The application displays red (Warning) icons.
- Your weekly Top-N utilization report (which indicates the 10 ports with the highest utilization rates) shows that one port is experiencing much higher utilization levels than normal.
- You receive an e-mail message from your network management station that the threshold for broadcast and multicast packets has been exceeded.
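A Top-N utilization report of the kind mentioned above can be sketched in a few lines. The port names, the utilization figures, and the "twice the usual level" rule for flagging a port are assumptions for illustration; a real NMS produces this report from its own collected statistics.

```python
def top_n_ports(utilization, n=10):
    """Return the n (port, utilization %) pairs with the highest utilization."""
    ranked = sorted(utilization.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

def unusual_ports(current, normal, factor=2.0):
    """Return ports whose current utilization exceeds `factor` times
    their usual level (the factor is an assumed rule of thumb)."""
    return sorted(port for port, value in current.items()
                  if port in normal and value > factor * normal[port])

# Hypothetical weekly averages (percent utilization per port).
this_week = {"sw1/1": 12.5, "sw1/2": 87.0, "sw1/3": 45.2, "sw2/1": 63.9}
usual = {"sw1/1": 11.0, "sw1/2": 30.0, "sw1/3": 44.0, "sw2/1": 60.0}

print(top_n_ports(this_week, n=2))      # [('sw1/2', 87.0), ('sw2/1', 63.9)]
print(unusual_ports(this_week, usual))  # ['sw1/2']
```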
These signs usually provide additional information about the problem, allowing you to focus on the right area.
8.1.1.3 Analyzing Symptoms
When a symptom occurs, ask yourself these types of questions to narrow the location of the problem and to get more data for analysis:
- To what degree is the network not acting normally (for example, does it now take one minute to perform a task that normally takes five seconds)?
- On what subnetwork is the user located?
- Is the user trying to reach a server, end station, or printer on the same subnetwork or on a different subnetwork?
- Are many users complaining that the network is operating slowly or that a specific network application is operating slowly?
- Are many users reporting network logon failures?
- Are the problems intermittent? For example, some files may print with no problems, while other printing attempts generate error messages, make users lose their connections, and cause systems to freeze.
Networks are designed to move data from a transmitting device to a receiving device. When communication becomes problematic, you must determine why data are not traveling as expected and then find a solution. The two most common causes for data not moving reliably from source to destination are:
- The physical connection breaks (that is, a cable is unplugged or broken).
- A network device is not working properly and cannot send or receive some or all data.
Network management software can easily locate and report a physical connection break (layer 1 problem). It is more difficult to determine why a network device is not working as expected, which is often related to a layer 2 or a layer 3 problem. To determine why a network device is not working properly, look first for:
- Valid service - Is the device configured properly for the type of service it is supposed to provide? For example, has Quality of Service (QoS), which is the definition of the transmission parameters, been established?
- Restricted access - Is an end station supposed to be able to connect with a specific device or is that connection restricted? For example, is a firewall set up that prevents that device from accessing certain network resources?
- Correct configuration - Is there a misconfiguration of IP address, subnet mask, gateway, or broadcast address? Network problems are commonly caused by misconfiguration of newly connected or configured devices.
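The "correct configuration" check above can be partly automated. The sketch below uses Python's standard `ipaddress` module to test whether a gateway lies inside the host's subnet and whether the broadcast address matches the mask. The function name and the sample addresses are hypothetical, and the check deliberately ignores the special cases of /31 and /32 networks.

```python
import ipaddress

def check_ip_config(address, netmask, gateway, broadcast=None):
    """Return a list of problems found in a host's IPv4 settings
    (an empty list means no problem was detected)."""
    problems = []
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    network = iface.network
    # A host must not use the network or broadcast address itself
    # (not valid for /31 and /32 networks, which this sketch ignores).
    if iface.ip in (network.network_address, network.broadcast_address):
        problems.append("host address is the network or broadcast address")
    if ipaddress.ip_address(gateway) not in network:
        problems.append(f"gateway {gateway} is outside {network}")
    if broadcast is not None and ipaddress.ip_address(broadcast) != network.broadcast_address:
        problems.append(f"broadcast should be {network.broadcast_address}")
    return problems

# Hypothetical settings of a newly connected device.
print(check_ip_config("192.168.10.25", "255.255.255.0", "192.168.10.1"))
print(check_ip_config("192.168.10.25", "255.255.255.0", "192.168.11.1"))
```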
After you develop a theory about the cause of the problem, test your theory. The test must conclusively prove or disprove your theory.
If you cannot reproduce a problem, then no problem exists unless it happens again on its own. If the problem is intermittent and you cannot replicate it, you can configure your network management software to catch the event in progress.
Although network management tools can provide a great deal of information about problems and their general location, you may still need to swap equipment or replace components of your network until you locate the exact trouble spot. After you test your theory, either fix the problem as described in "Solving the Problem" or develop another theory.
8.3.1.1 Sample Problem Analysis
This section illustrates the analysis phase of a typical troubleshooting incident. On your network, a user cannot access the mail server. You need to establish two areas of information:
- What you know - In this case, the user's workstation cannot communicate with the mail server.
- What you do not know and need to test:
  - Can the workstation communicate with the network at all, or is the problem limited to communication with the server? Test by sending a "Ping" or by connecting to other devices.
  - Is the workstation the only device that is unable to communicate with the server, or do other workstations have the same problem? Test connectivity at other workstations.
  - If other workstations cannot communicate with the server, can they communicate with other network devices? Again, test the connectivity.
1. Can the workstation communicate with any other device on the subnetwork?
- If no, then go to step 2.
- If yes, determine if only the server is unreachable.
  - If only the server cannot be reached, this suggests a server problem. Confirm by doing step 2.
  - If other devices cannot be reached, this suggests a connectivity problem in the network. Confirm by doing step 3.
2. Can other workstations communicate with the server?
- If no, then most likely it is a server problem. Go to step 3.
- If yes, then the problem is that the workstation is not communicating with the subnetwork. (This situation can be caused by workstation issues or a network issue with that specific station.)
3. Can other workstations communicate with other network devices?
- If no, then the problem is likely a network problem.
- If yes, the problem is likely a server problem.
When you determine whether the problem is with the server, subnetwork, or workstation, you can further analyze the problem, as follows:
- For a problem with the server - Examine whether the server is running, if it is properly connected to the network, and if it is configured appropriately.
- For a problem with the subnetwork - Examine any device on the path between the users and the server.
- For a problem with the workstation - Examine whether the workstation can access other network resources and if it is configured to communicate with that particular server.
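The sample analysis can be condensed into a small decision function. This is a simplified sketch: it takes as input the results of two of the tests listed in the sample analysis (whether other workstations reach the server, and whether they reach other network devices), and the function and parameter names are hypothetical. The connectivity tests themselves (for example "Ping") are assumed to have been run beforehand.

```python
def triage(others_reach_server, others_reach_other_devices):
    """Classify the likely location of a fault when one workstation
    cannot reach a server, from two connectivity test results."""
    if others_reach_server:
        # The server works for everyone else, so the fault is local to
        # the affected workstation or its own network connection.
        return "workstation"
    if not others_reach_other_devices:
        # Other workstations cannot reach anything either.
        return "network"
    # Other workstations reach other devices but not the server.
    return "server"

print(triage(others_reach_server=True, others_reach_other_devices=True))    # workstation
print(triage(others_reach_server=False, others_reach_other_devices=False))  # network
print(triage(others_reach_server=False, others_reach_other_devices=True))   # server
```

The result only indicates where to look first; the detailed examination of the server, subnetwork, or workstation still has to follow.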
- A laptop computer that is loaded with a terminal emulator, TCP/IP stack, TFTP server, CD-ROM drive (to read the online documentation), and some key network management applications. With the laptop computer, you can plug into any subnetwork to gather and analyze data about the segment.
- A spare managed hub to swap for any hub that does not have management. Swapping in a managed hub allows you to quickly spot which port is generating the errors.
- A single port probe to insert in the network if you are having a problem where you do not have management capability.
- Console cables for each type of connector, labeled and stored in a secure place.
Many device or network problems are straightforward to resolve, but others yield misleading symptoms. If one solution does not work, continue with another. A solution often involves:
- Upgrading software or hardware (for example, upgrading to a new version of agent software or installing Gigabit Ethernet devices)
- Balancing your network load by analyzing:
  - What users communicate with which servers
  - What the user traffic levels are in different segments
  Based on these findings, you can decide how to redistribute network traffic.
- Adding segments to your LAN (for example, adding a new switch where utilization is continually high)
- Replacing faulty equipment (for example, replacing a module that has port problems or replacing a network card that has a faulty jabber protection mechanism)
- Spare hardware equipment (such as modules and power supplies), especially for your critical devices
- A recent backup of your device configurations to reload if flash memory gets corrupted (which can sometimes happen due to a power outage)
8. DNS servers
DAILY CHECKING
All servers shall be checked manually on a daily basis. The following items shall be checked and recorded:
1. The amount of free space on each drive shall be recorded in a server log.
2. Services shall be checked to determine whether any services have failed.
3. The status of backup of files or system information for the server shall be checked daily.
EXTERNAL CHECKS
Essential servers shall be checked using either a separate computer from the ones being monitored or a server monitoring service. The external monitoring service shall have the ability to notify multiple IT personnel when a service is found to have failed. Servers to be monitored externally include:
1. The mail server
2. The web server
3. External DNS servers
4. Externally used application servers
5. Database or file servers supporting externally used application servers or web servers
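The first daily check, recording the free space on each drive, could be scripted along these lines. This is a minimal sketch using Python's standard `shutil.disk_usage`; the function name and the log line format are assumptions, and the list of paths to check would have to be adapted to each server.

```python
import shutil
from datetime import date

def free_space_entry(path):
    """Return one log line giving the free space on the drive that holds `path`."""
    usage = shutil.disk_usage(path)
    free_gib = usage.free / 2**30
    percent_free = 100.0 * usage.free / usage.total
    return f"{date.today().isoformat()} {path} free={free_gib:.1f}GiB ({percent_free:.0f}%)"

# Hypothetical usage: check the drive holding the current directory.
entry = free_space_entry(".")
print(entry)
```

Appending each day's lines to a file gives the server log required above; the service and backup checks would need tool-specific commands and are not covered by this sketch.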
CATEGORIES | ACTIONS | OBSERVATIONS | STATUS
Network links
- Record the statistics from the NMS dashboards
- Review the results of the latest site testing
- Check and update the list of incidents currently being resolved
- Produce the availability report via Nagios for MTN, OCMS, Camtel, AES
- Communicate the information to the related teams
Infrastructures
- Check the status of the UPS units via Nagios or Centreon
- Check the status of the generators
- Check the status of the servers
- Check the status of the cooling systems
- Check the integrity of the network equipment (router, switch, firewall, ...)
Services
- SAGE 1000
- BSA
- ORACLE DB
- xSQL DB
- CITRIX
- INTERNET WEB ACCESS
- Outlook Web Access
- AES SONEL CONTACT
- All others
- Internal Portal
- DHCP
- DNS