
ACCEPTED FROM OPEN CALL

Machine Learning for Networking: Workflow, Advances and Opportunities


Mowei Wang, Yong Cui, Xin Wang, Shihan Xiao, and Junchen Jiang

Mowei Wang, Yong Cui, and Shihan Xiao are with Tsinghua University; Yong Cui is the corresponding author. Xin Wang is with Stony Brook University. Junchen Jiang is with Carnegie Mellon University.

Digital Object Identifier: 10.1109/MNET.2017.1700200

Abstract

Recently, machine learning has been used in every possible field to leverage its remarkable power. For a long time, networking and distributed computing systems have been the key infrastructure providing efficient computational resources for machine learning. Networking itself can also benefit from this promising technology. This article focuses on the application of machine learning for networking (MLN), which can not only help solve intractable old network problems but also stimulate new network applications. We summarize the basic workflow to explain how to apply machine learning technology in the networking domain. Then we provide a selective survey of the latest representative advances, with explanations of their design principles and benefits. These advances are divided into several network design objectives, and detailed information on how they perform in each step of the MLN workflow is presented. Finally, we shed light on new opportunities in networking design and in community building for this new interdisciplinary field. Our goal is to provide a broad research guideline on networking with machine learning to help motivate researchers to develop innovative algorithms, standards and frameworks.

Introduction

With the prosperous development of the Internet, networking research has attracted a lot of attention in the past several decades, both in academia and industry. Researchers and network operators face various types of networks (e.g., wired or wireless) and applications (e.g., network security and live streaming [1]). Each network application also has its own features and performance requirements, which may change dynamically with time and space. Because of the diversity and complexity of networks, specific algorithms are often built for different network scenarios based on the network characteristics and user demands. Developing efficient algorithms and systems to deal with complex problems in different network scenarios is a challenging task.

Recently, machine learning (ML) techniques have made breakthroughs in a variety of application areas, such as bioinformatics, speech recognition and computer vision. Machine learning tries to construct algorithms and models that can learn to make decisions directly from data without following pre-defined rules. Existing machine learning algorithms generally fall into three categories: supervised learning (SL), unsupervised learning (USL) and reinforcement learning (RL). More specifically, SL algorithms learn to conduct classification or regression tasks from labeled data, while USL algorithms focus on classifying the sample sets into different groups (i.e., clusters) from unlabeled data. In RL algorithms, agents learn to find the best series of actions to maximize the cumulative reward (i.e., objective function) by interacting with the environment. The latest breakthroughs, including deep learning (DL), transfer learning and generative adversarial networks (GAN), also open up potential research and application directions in previously unimaginable ways.

Dealing with complex problems is one of the most important advantages of machine learning. For some tasks requiring classification, regression and decision making, machine learning may perform close to or even better than human beings; facial recognition and game-playing artificial intelligence are two examples. Since the network field often sees complex problems that demand efficient solutions, it is promising to bring machine learning algorithms into the network domain to leverage these powerful ML abilities for higher network performance. The incorporation of machine learning into network design and management also opens the possibility of generating new network applications. In fact, ML techniques have been used in the network field for a long time. However, existing studies are limited to traditional ML capabilities, such as prediction and classification. The recent development of infrastructures (e.g., computational devices like GPUs and TPUs, and ML libraries like TensorFlow and Scikit-Learn) and distributed data processing frameworks (e.g., Hadoop and Spark) provides a good opportunity to unleash the power of machine learning and pursue new potential in network systems.

Specifically, machine learning for networking (MLN) is suitable and efficient for the following reasons. First, as the best-known capabilities of ML, classification and prediction play basic but important roles in network problems such as intrusion detection and performance prediction [1]. In addition, machine learning can also help decision making, which facilitates network scheduling [2] and parameter adaptation [3, 4] according to the current state of the environment. Second, many network problems need to interact with complicated system environments, and it is not easy to build accurate analytic models to represent complex system behaviors such as the load-changing patterns of CDNs [5] and throughput characteristics [1]. Machine learning can provide an estimated model of these systems with acceptable accuracy. Finally, each network scenario may have different characteristics (e.g., traffic patterns and network states), and researchers often need to solve the problem for each scenario independently.



[Figure 1 depicts a six-step pipeline: Step 1, problem formulation (prediction, regression, clustering, decision making); Step 2, data collection (e.g., traffic traces, performance logs); Step 3, data analysis (preprocessing, feature extraction); Step 4, model construction (offline training and tuning); Step 5, model validation (cross validation, error analysis); Step 6, deployment and inference (tradeoffs on speed, memory, stability and accuracy of inference). If the requirements are not met at Step 5, the workflow loops back to the earlier steps; otherwise it proceeds to Step 6.]

FIGURE 1. The typical workflow of machine learning for networking.

Machine learning may provide new possibilities for constructing a generalized model via a uniform training method [3, 4]. Among efforts in MLN, deep learning has also been investigated and applied to provide end-to-end solutions. The latest work in [6] conducts a comprehensive survey of previous efforts that apply deep learning technology in network-related areas.

In this article, we investigate how machine learning technology can benefit network design and optimization. Specifically, we summarize the typical workflow and requirements for applying machine learning techniques in the network domain, which could provide a basic but practical guideline for researchers to get a quick start in the area of MLN. Then we provide a selective survey of important networking advances built with the support of machine learning technology, most of which have been published in the last three years. We group these advances into several typical networking fields and explain how these prior efforts perform at each step of the MLN workflow. We then discuss the opportunities of this emerging interdisciplinary area. We hope our studies can serve as a guide for potential future research directions.

Basic Workflow for MLN

Figure 1 shows the baseline workflow for applying machine learning in the network field, including problem formulation, data collection, data analysis, model construction, model validation, and deployment and inference. These stages are not independent but have inner relationships. This workflow is very similar to the traditional workflow for machine learning, as network problems are still applications in which machine learning can play a role. In this section, we explain each step of the MLN workflow with representative cases.

Problem Formulation: Since the training process of machine learning is often time consuming and involves high cost, it is important to correctly abstract and formulate the problem at the first step of MLN. A target problem can be classified into one of the machine learning categories, such as classification, clustering and decision making. This helps decide what kind and amount of data to collect and which learning model to select. An improper problem abstraction may lead to an unsuitable learning model, which can result in unsatisfactory learning performance. For example, it is better to cast the optimization of quality of experience (QoE) for live streaming as a real-time exploration-exploitation process rather than a prediction-based problem [7], which better matches the application characteristics.

Data Collection: The goal of this step is to collect a large amount of representative network data without bias. The network data (e.g., traffic traces and session logs with performance metrics) are recorded from different network layers according to the application needs. For example, the traffic classification problem often requires datasets containing packet-level traces labeled with the corresponding application classes [8]. In the context of MLN, data are often collected in two phases. In the offline phase, collecting enough high-quality historical data is important for data analysis and model training. In the online phase, real-time network state and performance information are often used as inputs or feedback signals for the learning model. The newly collected data can also be stored to update the historical data pool for model adaptation.

Data Analysis: Every network problem has its own characteristics and is impacted by many factors, but only a few factors (i.e., features) have the greatest effect on the target network performance metric. For instance, the RTT and the inter-arrival time of ACKs may be the critical features in choosing the best size of the TCP congestion window [3]. In the learning paradigm, finding proper features is the key to fully unleashing the potential of the data. This step attempts to extract the effective features of a network problem by analyzing the historical data samples, which can be regarded as a feature engineering process in the machine learning community. Before feature extraction, it is important to preprocess and clean the raw data through processes such as normalization, discretization and missing-value completion. Extracting features from cleaned data often needs domain-specific knowledge and insights into the target network problem [5], which is not only difficult but time consuming. Thus, in some cases deep learning can be a good choice to help automate feature extraction [2, 6].
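To make the data analysis step concrete, the following is a minimal sketch of preprocessing and feature ranking over synthetic flow records, assuming pandas and NumPy are available. The column names (rtt_ms, ack_interarrival_ms, loss_rate) and the target metric are invented for illustration and are not taken from any system surveyed here.

```python
# A minimal, hypothetical feature-engineering pass for MLN (step 3 of Fig. 1):
# cleaning (missing-value completion), normalization, and a simple
# correlation-based ranking to shortlist candidate features.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
# Synthetic flow records; column names are illustrative only.
flows = pd.DataFrame({
    "rtt_ms": rng.gamma(2.0, 20.0, n),
    "ack_interarrival_ms": rng.gamma(2.0, 5.0, n),
    "loss_rate": rng.beta(1.0, 50.0, n),
})
flows.loc[rng.choice(n, 50, replace=False), "rtt_ms"] = np.nan  # simulate gaps
# Target: a made-up "good cwnd" metric that depends on RTT and loss.
target = 100.0 / (flows["rtt_ms"].fillna(flows["rtt_ms"].median())
                  * (1.0 + 100.0 * flows["loss_rate"]))

# 1) Missing-value completion with per-column medians.
clean = flows.fillna(flows.median())
# 2) Min-max normalization so features share a [0, 1] scale.
normed = (clean - clean.min()) / (clean.max() - clean.min())
# 3) Rank features by absolute correlation with the target metric.
ranking = normed.corrwith(target).abs().sort_values(ascending=False)
print(ranking)  # keep only the top-ranked features for model construction
```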

IEEE Network • Accepted for publication 2


Model Construction: Model construction involves model selection, training and tuning. A suitable learning model or algorithm needs to be selected according to the size of the dataset, the typical characteristics of the network scenario, the problem category, and so on. For example, accurate throughput prediction can improve the bitrate adaptation of Internet video, and a Hidden Markov Model may be selected for prediction due to the dynamic patterns of stateful throughput [1]. The historical data are then used to train the model with hyper-parameter tuning, which can take a long time in the offline phase. The parameter tuning process still lacks sufficient theoretical guidance and often involves searching a large space for acceptable parameters, or tuning them based on personal experience.

Model Validation: Offline validation is an indispensable step in the MLN workflow to evaluate whether the learning algorithm works well enough. During this step, cross validation is usually used to test the overall accuracy of the model and to show whether the model is overfitting or underfitting. This provides good guidance on how to optimize the model, e.g., increasing the data volume or reducing model complexity when overfitting occurs. Analyzing wrongly predicted samples helps find the sources of error and determine whether the model and features are proper and whether the data are representative enough for the problem [5, 8]. The procedures in the previous steps may need to be repeated depending on the error sources.
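The model construction and validation steps can likewise be sketched generically. The snippet below, assuming scikit-learn and synthetic data, runs a small hyper-parameter grid search with 5-fold cross validation and then compares training and test scores to check for overfitting; the model and grid are placeholders rather than choices from any work surveyed here.

```python
# A generic sketch of model construction and validation (steps 4-5):
# grid search over hyper-parameters with k-fold cross validation, then a
# train/test comparison as a rough overfitting check. Model choice is arbitrary.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))              # stand-in network features
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=2000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
search = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, 8, None]},
    cv=5,                                   # 5-fold cross validation
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# Scores are negative MSE; a large train/test gap suggests overfitting,
# so collect more data or shrink the model before deployment.
print("best params:", search.best_params_)
print("train score:", search.score(X_train, y_train))
print("test score: ", search.score(X_test, y_test))
```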

For each work, the entries follow the workflow steps: problem formulation; data collection (offline collection; online measurement); data analysis; offline model construction; deployment and online inference. A slash (/) marks steps left empty in the original table.

- Information cognition, Sibyl [11] (route measurement). Problem formulation: SL, prediction with RuleFit. Data collection: offline, combine data from platforms with a few powerful VPs in homogeneous deployments and platforms with many limited VPs around the world; online, take users' queries as input round by round. Data analysis: /. Offline model construction: construct a RuleFit model to assign confidence to each predicted path. Deployment and online inference: optimize the measurement budget in each round to get the best query coverage.
- Traffic prediction, Ref. [9] (traffic volume prediction). Problem formulation: SL, prediction with a Hidden Markov Model (HMM). Data collection: offline, synthetic and real traffic traces with flow statistics; online, only observe the flow statistics. Data analysis: the flow count and the traffic volume have significant correlation. Offline model construction: train an HMM with the Kernel Bayes Rule and a recurrent neural network with Long Short-Term Memory units. Deployment and online inference: take flow statistics as input and output the traffic volume.
- Traffic classification, RTC [8] (traffic classification). Problem formulation: SL and USL, clustering and classification. Data collection: offline, labeled and unlabeled traffic traces; online, flow statistical features extracted from traffic flows. Data analysis: zero-day applications exist and may degrade the classification accuracy. Offline model construction: find the zero-day application class and train the classifier. Deployment and online inference: inference with the trained model to output the classification results.
- Resource management, DeepRM [13] (job scheduling). Problem formulation: RL, decision making with deep RL. Data collection: offline, synthetic workloads with different patterns are used for training; online, the real-time resource demand of the arriving job. Data analysis: the action space is too large and may have conflicts between actions. Offline model construction: offline training to update the policy network. Deployment and online inference: directly schedule the arriving jobs with the trained model.
- Network adaptation, Ref. [2] (routing strategy). Problem formulation: SL, decision making with Deep Belief Architectures (DBA). Data collection: offline, traffic patterns labeled with routing paths computed by the OSPF protocol; online, online traffic patterns in each router. Data analysis: it is difficult to characterize the input and output patterns to reflect the dynamic nature of large-scale heterogeneous networks. Offline model construction: layer-wise training to initialize and the backpropagation process to fine-tune the DBA structure. Deployment and online inference: record and collect the traffic patterns in each router periodically and obtain the next routing nodes from the DBAs.
- Network adaptation, Pytheas [7] (general QoE optimization). Problem formulation: RL, decision making with a variant of the UCB algorithm. Data collection: offline, session quality information with features on a large time scale; online, session quality information on a small time scale. Data analysis: application sessions sharing the same features can be grouped. Offline model construction: the backend cluster determines the session groups using CFA [5] on a long time scale. Deployment and online inference: the frontend performs the group-based exploration-exploitation strategy in real time.
- Network adaptation, Remy [3] (TCP congestion control). Problem formulation: RL, decision making with a tabular method. Data collection: offline, collect experience from a network simulator; online, calculate network state variables from ACKs. Data analysis: select the most influential metrics as state variables. Offline model construction: given the network assumptions, the generated algorithm interacts with the simulator to learn the best actions for each state. Deployment and online inference: directly implement the Remy-generated algorithm in the corresponding network environment.
- Network adaptation, PCC [4] (TCP congestion control). Problem formulation: RL, decision making with online learning. Data collection: offline, /; online, calculate the utility function according to the received SACKs. Data analysis: TCP assumptions are often violated, so the direct performance is a better signal. Offline model construction: /. Deployment and online inference: take trials with different sending rates and find the best rate according to the feedback utility function.
- Performance prediction, CFA [5] (video QoE optimization). Problem formulation: USL, clustering with a self-designed algorithm. Data collection: offline, datasets consisting of quality measurements collected from public CDNs; online, take session features as input, such as bitrate, CDN and player. Data analysis: similar sessions have similar quality, determined by critical features. Offline model construction: critical feature learning on a minutes scale and quality estimation on a tens-of-seconds scale. Deployment and online inference: look up the feature-quality table to respond to real-time queries.
- Performance prediction, CS2P [1] (throughput prediction). Problem formulation: SL, prediction with HMMs. Data collection: offline, datasets of HTTP throughput measurements from iQIYI; online, take users' session features as input. Data analysis: sessions with similar features tend to behave in related patterns. Offline model construction: find the set of critical features and learn an HMM for each cluster of similar sessions. Deployment and online inference: a new session is mapped to the most similar session cluster, and the corresponding HMM is used to predict throughput.
- Configuration extrapolation, CherryPick [15] (cloud configurations). Problem formulation: SL, parameter searching with Bayesian extrapolation. Data collection: offline, /; online, take the performance under the current configuration as model input. Data analysis: large configuration space and heterogeneous applications. Offline model construction: /. Deployment and online inference: take trials with different configurations and decide the next trial direction with the Bayesian Optimization model.

TABLE 1. Relationships between latest advances and MLN workflow.



Deployment and Inference: When implementing the learning model in an operational network environment, some practical issues should be considered. Since there are often limitations on computation or energy resources and requirements on response time, the tradeoff between accuracy and overhead is important for the performance of a practical network system [7]. In addition, machine learning often works in a best-effort way and does not provide any performance guarantee, which requires system designers to consider fault tolerance. Finally, practical applications often require the learning system to take real-time input, draw the inference, and output the corresponding policy online.
Overview of Recent Advances

Recent breakthroughs in deep learning and other promising machine learning techniques have had a considerable influence on new attempts in the network community. Existing efforts have led to notable advances in different subfields of networking. To illustrate the relationship between these up-to-date advances and the MLN workflow, in Table 1 we divide the literature into several application scenarios and show how each study performs at each step of the MLN workflow.

Without ML techniques, the typical solutions to these problems involve time-series analytics [1, 9], statistical methods [1, 5, 7, 8] and rule-based heuristic algorithms [2–5, 10], which are often more interpretable and easier to implement. However, ML-based methods have a stronger ability to provide fine-grained strategies and can achieve higher prediction accuracy by extracting hidden information from historical data. As a big challenge for ML-based solutions, the feasibility problem is also discussed in this section.
Information Cognition

Since data are the fundamental resource for MLN, information (data) cognition with high efficiency is critical for capturing network characteristics and monitoring network performance. However, due to the complex nature of existing networks and the limitations of measurement tools and architectures, it is still not easy to access some types of data (e.g., traceroute and traffic matrix data) with acceptable granularity and cost. With its capability for prediction, machine learning can help evaluate network reliability or the probability of a certain network state. As a first example, Internet route measurements help monitor network running states and troubleshoot performance problems. However, due to insufficient usable vantage points (VPs) and a limited probing budget, it is impossible to execute every route query, because a query may not match any previously measured path or the path may have changed. Sibyl [11] attempts to predict the unseen paths and assign confidence to them by using a supervised machine learning technique called RuleFit.

Learning relies on data acquisition, and MLN also requires a new scheme of data cognition. MLN often needs to maintain an up-to-date global network state and perform real-time responses to client demands, which requires measuring and collecting information in the core network. To enable the network to perform diagnostics and make decisions by itself with the help of machine learning or cognitive algorithms, a different network architecture, the Knowledge Plane [12], was proposed to achieve automatic information cognition; it has inspired subsequent efforts that leverage ML or data-driven methods to enhance network performance.

Traffic Prediction and Classification

Traffic prediction and classification are two of the earliest machine learning applications in the networking field. Because of their well-formulated problem descriptions and the demands from various subfields of networking, studies of these two topics always maintain a certain degree of popularity.

Traffic Prediction: As an important research problem, the accurate estimation of traffic volume (e.g., the traffic matrix) is beneficial to congestion control, resource allocation, network routing, and even high-level live streaming applications. There are mainly two directions of research, time series analysis and network tomography, which can be distinguished by whether traffic prediction is conducted with direct observations or not. However, it is expensive to directly measure traffic volume, especially in a large-scale, high-speed network environment.

Many existing studies focus on reducing the measurement cost by using indirect metrics rather than only trying different ML algorithms. There are two methods to handle this problem. One is to invest more human effort in developing sophisticated algorithms by exploring domain-specific knowledge and undiscovered data patterns. As an example, the work in [9] attempts to predict traffic volume according to the dependence between flow counts and flow volume. The other method is inspired by the end-to-end deep learning approach: it takes some easily obtained information (e.g., bits of the header in the first few packets of a flow) as direct input and extracts features automatically with the help of the learning model [10].
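As a rough illustration of the indirect-metric idea behind [9] (though not of its HMM-based method), the toy sketch below fits a linear regression from an easily observed flow count to traffic volume, on purely synthetic data:

```python
# Toy illustration of indirect traffic-volume prediction: learn a mapping
# from an easily measured metric (active flow count) to traffic volume.
# This mimics only the idea behind [9], not its actual HMM-based method.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
flow_count = rng.integers(10, 1000, size=500).astype(float)
# Synthetic ground truth: volume roughly proportional to flow count.
volume_mb = 0.8 * flow_count + rng.normal(scale=20.0, size=500)

model = LinearRegression().fit(flow_count.reshape(-1, 1), volume_mb)
print("predicted MB for 300 flows:", model.predict([[300.0]])[0])
```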
Traffic Classification: As a fundamental function component in network management and security systems, traffic classification matches network applications and protocols with the corresponding traffic flows. Traditional traffic classification methods include the port-based approach and the payload-based approach. The port-based approach has proved to be ineffective due to unfixed or reused port assignments, while the payload-based approach suffers from privacy problems caused by deep packet inspection and can even fail in the presence of encrypted traffic. As a result, machine learning approaches based on statistical features have been extensively studied in recent years, especially in the network security domain. However, it is not easy to treat machine learning as an omnipotent solution and deploy it in a real-world operational environment. For instance, unlike a traditional machine learning application such as identifying whether a picture shows a cat, a misclassification creates a large cost in the context of network security. Generally, these studies range from all-known classification scenarios to the more realistic situation with unknown traffic (e.g., zero-day application traffic [8]). This research roadmap closely mirrors the evolution of machine learning technology from supervised learning to unsupervised and semi-supervised learning, and it can be treated as a pioneering paradigm for importing machine learning into networking fields.
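To give a flavor of statistical-feature traffic classification, here is a minimal sketch assuming scikit-learn: a classifier trained on invented per-flow features, with a confidence threshold that routes uncertain flows to an "unknown" bucket. This only gestures at the zero-day problem and is not the RTC [8] algorithm.

```python
# Generic statistical-feature traffic classifier. A probability threshold
# flags low-confidence flows as possible "unknown" (zero-day-like) traffic;
# this is only a gesture at the problem RTC [8] solves, not its algorithm.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
# Features per flow: mean packet size, mean inter-arrival (ms), duration (s).
web  = rng.normal([800, 10, 5],  [100, 3, 2],  size=(300, 3))
voip = rng.normal([160, 20, 60], [20, 5, 10], size=(300, 3))
X = np.vstack([web, voip])
y = np.array([0] * 300 + [1] * 300)          # 0 = web, 1 = voip

clf = RandomForestClassifier(random_state=3).fit(X, y)

def classify(flow, threshold=0.8):
    proba = clf.predict_proba([flow])[0]
    # Below-threshold confidence: report "unknown" rather than misclassify,
    # since a wrong label is costly in a security setting.
    return int(np.argmax(proba)) if proba.max() >= threshold else "unknown"

print(classify([790, 11, 4]))    # clearly web-like -> class 0
print(classify([480, 15, 30]))   # mid-way flow: confidence may drop -> unknown
```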



[Figure 2 depicts Remy as a reinforcement learning loop. Prior knowledge comprises a traffic model (web traffic, video conferencing, batch processing, or a mixture), network assumptions (ranges of the bottleneck link speeds, non-queueing delays, queue sizes, and degrees of multiplexing), and a reward given by the objective function. The agent, a tabular method whose state variables are searched greedily, interacts with an NS-2 network simulator environment; the feedback signals are ACKs and RTTs, and the action adjusts the cwnd parameter. The output, RemyCC, is a state-action mapping from network state to cwnd parameter, given the traffic model and network assumptions.]

FIGURE 2. Remy's mechanism illustration [3].

Resource Management and Network Adaptation

Efficient resource management and network adaptation are the keys to improving network system performance. Example issues to address are traffic scheduling, routing [2], and TCP congestion control [3, 4]. All these issues can be formulated as decision-making problems [13]. However, it is challenging to solve them with rule-based heuristic algorithms due to the complexity of diverse system environments, noisy inputs, and the difficulty of optimizing tail performance [13]. Specifically, arbitrary parameter assignments based on experience and actions taken following predetermined rules often result in a scheduling algorithm that is understandable to people but far from optimal.

Deep learning is a promising solution due to its ability to characterize the inherent relationships between the inputs and outputs of network systems without human involvement. To meet the requirements of changing network environments, previous efforts in [2, 14] design a traffic control system with the support of deep learning techniques. Reconsidering backbone router architectures and strategies, the system takes the traffic pattern in each router as input and outputs the next nodes in the routing path using Deep Belief Architectures. These advancements unleash the potential of DL-based strategies in network routing and scheduling. Harnessing the powerful representational ability of deep neural networks, deep reinforcement learning achieves great results in many AI problems.
DeepRM [13] is the first work that applies a deep RL algorithm to cluster resource scheduling. Its performance is comparable to state-of-the-art heuristic algorithms but with less cost. The QoE optimization problem can also benefit from the RL methodology. Unlike previous efforts, Pytheas [7] regards this problem as an exploration-exploitation problem rather than a prediction-based problem. As a result, Pytheas outperforms state-of-the-art prediction-based systems by lessening the prediction bias and delayed response. From this perspective, machine learning may help close the loop of "sensing-analysis-decision," especially in wireless sensor networks, where the three actions are currently separated from each other.
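Pytheas' frontend strategy can be approximated by a textbook UCB1 loop within a single session group. In the sketch below, the "arms" are hypothetical CDN choices and the reward is a noisy session quality report; Pytheas' actual UCB variant and its group management are more involved.

```python
# Textbook UCB1 within one session group: pick a decision (e.g., which CDN
# serves the group), observe quality, and balance exploring alternatives
# against exploiting the best-so-far. This simplifies Pytheas [7]; it is
# not its implementation.
import math
import random

random.seed(4)
TRUE_QOE = {"cdn_a": 0.70, "cdn_b": 0.85, "cdn_c": 0.60}  # hidden from learner
counts = {arm: 0 for arm in TRUE_QOE}
totals = {arm: 0.0 for arm in TRUE_QOE}

def measure(arm):
    # Stand-in for a real session quality report (e.g., 1 - rebuffer ratio).
    return max(0.0, min(1.0, random.gauss(TRUE_QOE[arm], 0.1)))

for t in range(1, 2001):
    # Play each arm once, then choose by mean reward plus exploration bonus.
    untried = [a for a in counts if counts[a] == 0]
    if untried:
        arm = untried[0]
    else:
        arm = max(counts, key=lambda a: totals[a] / counts[a]
                  + math.sqrt(2.0 * math.log(t) / counts[a]))
    reward = measure(arm)
    counts[arm] += 1
    totals[arm] += reward

print(counts)  # cdn_b should dominate after enough sessions
```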
Several attempts have been made to optimize the TCP congestion control algorithm using the reinforcement learning approach, due to the difficulty of designing a congestion control algorithm that fits all network states. To make the algorithm self-adaptive, Remy [3] takes the target network assumptions and traffic model as prior knowledge to automatically generate a specific algorithm, which achieves an impressive performance gain in many circumstances. In the offline phase, Remy tries to learn a mapping, i.e., RemyCC, between the network state and the corresponding parameters of the congestion window (cwnd) by interacting with a network simulator. In the online phase, whenever an ACK is received, RemyCC looks up its mapping table and changes its cwnd behavior according to the current network state. The mechanism of Remy is illustrated in Fig. 2. Without the specific network assumptions, a performance-oriented attempt, PCC [4], benefits instead from its online-learning nature. Although these TCP-related efforts still focus on decision making, they take the first important step toward automated protocol design.
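The flavor of a RemyCC-style mapping can be conveyed with a tiny hard-coded rule table keyed on discretized congestion signals, as sketched below. The signals, bin edges, and multiplicative cwnd actions are all invented for illustration; Remy learns such a table offline against a simulator rather than having it written by hand.

```python
# Caricature of a RemyCC-style rule table: on each ACK, discretize a few
# congestion signals into a state and look up a pre-learned cwnd action.
# Everything here is invented; Remy [3] learns its table offline under
# explicit traffic and network assumptions.
import bisect

EWMA_BINS = [0.5, 1.0, 2.0]      # bin edges for the ACK inter-arrival EWMA
RTT_RATIO_BINS = [1.1, 1.5]      # bin edges for RTT / min-RTT

# (ewma_bin, rtt_bin) -> multiplicative change applied to cwnd.
RULE_TABLE = {
    (0, 0): 1.25, (0, 1): 1.10, (0, 2): 0.90,
    (1, 0): 1.10, (1, 1): 1.00, (1, 2): 0.80,
    (2, 0): 1.00, (2, 1): 0.90, (2, 2): 0.70,
    (3, 0): 0.90, (3, 1): 0.80, (3, 2): 0.60,
}

def on_ack(cwnd, ack_ewma, rtt, min_rtt):
    # Discretize the observed signals into a table index and apply the rule.
    state = (bisect.bisect(EWMA_BINS, ack_ewma),
             bisect.bisect(RTT_RATIO_BINS, rtt / min_rtt))
    return max(1.0, cwnd * RULE_TABLE[state])

cwnd = 10.0
for ack_ewma, rtt in [(0.4, 0.05), (0.8, 0.06), (2.5, 0.12)]:
    cwnd = on_ack(cwnd, ack_ewma, rtt, min_rtt=0.05)
    print(round(cwnd, 2))        # cwnd grows when signals look healthy
```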



For each work, the entries give the offline time cost, the online time cost, and the device information.

- Network adaptation, Ref. [2] (routing strategy). Offline: training 100,000 samples with 1000 routers takes ~100,000 s on an Intel i7-6900K and ~1,000 s on an Nvidia Titan X Pascal. Online (when <400 routers): >100 ms on the Intel i7-6900K; <1 ms on the Nvidia Titan X Pascal.
- Network adaptation, Pytheas [7] (general QoE optimization). Offline: session grouping finds 200 groups per minute with 8.5 million sessions. Online: not mentioned. Device: 2.4 GHz, 8 cores and 64 GB RAM.
- Network adaptation, Remy [3] (TCP congestion control). Offline: a few hours. Online: not mentioned. Device: Amazon EC2 and 80-core and 48-core servers.
- Performance prediction, CFA [5] (video QoE optimization). Offline: critical feature learning, ~30.1 min every 30–60 min; quality estimation, ~30.7 s every 1–5 min. Online: query response, ~0.66 ms for queries arriving every 1 ms. Device: two clusters of 32 cores.
- Performance prediction, CS2P [1] (throughput prediction). Offline: not mentioned. Online: server side, ~150 predictions per second (Intel i7 2.2 GHz, 16 GB RAM, Mac OS X 10.11); client side, <10 ms per prediction (Intel i7 2.8 GHz, 8 GB RAM, Mac OS X 10.9).

TABLE 2. Processing time of selective advances.

Network Performance Prediction and Configuration Extrapolation

Performance prediction can guide decision making. Example applications include video QoE prediction, CDN location selection, best wireless channel selection, and performance extrapolation under different configurations. Machine learning is a natural approach to predicting system states for better decision making.

Typically, there are two general prediction scenarios. In the first, the system owner has the ability to obtain varied and sufficient historical data, but it is non-trivial to build a complex prediction model and update it in real time, which calls for a new approach that exploits domain-specific knowledge to simplify the problem (e.g., CFA [5] for video QoE optimization). In prior work, CS2P [1] aims to improve video bitrate selection with accurate prediction. From data analysis, it finds that sessions with similar key features tend to have related throughput behavior. CS2P learns to cluster similar sessions offline and trains a different Hidden Markov Model for each cluster to predict the corresponding throughput given the current session information. CS2P thus reinforces the correlation of similar sessions in the training process, which outperforms approaches with one single model. This is very similar to the traffic prediction problem mentioned above, since they both passively fit the runtime ground truth with a certain metric.
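A toy version of the CS2P idea, assuming NumPy only: group sessions by one shared feature (here simply the ISP) and fit a per-group transition matrix over discretized throughput states. A plain Markov chain stands in for CS2P's HMMs, and the traces are synthetic.

```python
# Toy CS2P-like predictor: sessions are grouped by a shared key feature
# (here just the ISP), and each group gets its own Markov transition matrix
# over discretized throughput states. CS2P [1] clusters on learned critical
# features and trains HMMs; this plain Markov chain only mirrors the idea.
import numpy as np

STATES = 3  # 0 = low, 1 = medium, 2 = high throughput

def fit_transitions(state_seqs):
    counts = np.ones((STATES, STATES))          # add-one smoothing
    for seq in state_seqs:
        for a, b in zip(seq, seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Synthetic per-ISP session traces of discretized throughput.
sessions = {
    "isp_x": [[0, 1, 2, 2, 2], [0, 1, 1, 2, 2]],
    "isp_y": [[2, 1, 1, 0, 0], [2, 2, 1, 0, 0, 0]],
}
models = {isp: fit_transitions(seqs) for isp, seqs in sessions.items()}

def predict_next(isp, current_state):
    # Map the new session to its group's model, then take the most likely
    # next throughput state.
    return int(np.argmax(models[isp][current_state]))

print(predict_next("isp_x", 1))  # sessions on isp_x tend to ramp up -> 2
print(predict_next("isp_y", 1))  # sessions on isp_y tend to decay -> 0
```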
In the second prediction scenario, little historical data exist and it is infeasible to obtain representative data by conducting performance tests, due to the high trial costs in real network systems. To deal with this dilemma, CherryPick [15] leverages the Bayesian Optimization algorithm to minimize pre-run rounds, using its directional guidance to collect representative runtime data of workloads under different configurations.
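A miniature CherryPick-flavored loop might look like the following sketch, assuming scikit-learn and SciPy are available: a Gaussian-process surrogate over a discrete configuration grid, with expected improvement choosing each next trial so that only a few costly benchmark runs are needed. The workload function and the grid are invented; CherryPick's production system differs.

```python
# Miniature CherryPick-flavored loop: model workload cost over configurations
# with a Gaussian-process surrogate and pick the next trial by expected
# improvement, so few expensive test runs are needed. Purely illustrative.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

np.random.seed(5)

def run_workload(cfg):
    # Stand-in for an expensive benchmark: cost as a function of (vcpus, ram).
    vcpus, ram = cfg
    return (vcpus - 8) ** 2 + 0.5 * (ram - 32) ** 2 + np.random.normal(0, 1)

# Discrete candidate configurations (vCPUs, RAM in GB).
grid = np.array([[v, r] for v in [2, 4, 8, 16, 32] for r in [8, 16, 32, 64]])
tried_idx = [0, len(grid) - 1]                  # two seed trials
costs = [run_workload(grid[i]) for i in tried_idx]

for _ in range(6):                              # budget: six more trials
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1.0,
                                  normalize_y=True)
    gp.fit(grid[tried_idx], costs)
    mu, sigma = gp.predict(grid, return_std=True)
    best = min(costs)
    # Expected improvement for minimization.
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    ei[tried_idx] = -np.inf                     # do not repeat trials
    nxt = int(np.argmax(ei))
    tried_idx.append(nxt)
    costs.append(run_workload(grid[nxt]))

print("best configuration found:", grid[tried_idx[int(np.argmin(costs))]])
```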
Feasibility Discussion

One big challenge faced by ML-based methods is their feasibility. Since many networking applications are delay-sensitive, it is non-trivial to design a real-time system with heavy computation loads. To make it practical, a common solution is to train the model with global information over a long period of time and incrementally update the model with local information on a small time scale [5, 7], which trades off computation overhead against information staleness. In the online phase, the common case is to look up a result table or draw the inference with a trained model to make real-time decisions. The processing times of the advances above are selectively listed in Table 2, which shows that ML has practical value when the system is well designed. In addition, the robustness and generalization of a design are also important for feasibility and are discussed later.

From these perspectives, ML in its current state is not suitable for all networking problems. The network problems solved with ML techniques so far are more or less related to prediction, classification and decision making, while it is difficult to apply machine learning to other types of problems. Other reasons that prevent the application of ML techniques include the lack of labeled data, high system dynamics, and the high cost brought by learning errors.

Opportunities for MLN

Prior efforts mostly focus on the generalized concepts of prediction and classification, and few get out of this scope to explore other possible applications. However, with the latest breakthroughs in machine learning and its infrastructures, new potential demands may appear in network disciplines. Some opportunities are introduced as follows.

Open Datasets for the Networking Community

Collecting a large amount of high-quality data that contain both network profiles and performance metrics is one of the most critical issues for MLN. However, acquiring enough labeled data is still expensive and labor intensive, even in today's machine learning community. For many reasons, it is not easy for researchers to acquire enough real trace data, even though there are many existing open datasets in the networking domain.

This reality drives us to learn from the machine learning community and put much more effort into constructing open datasets like ImageNet. With unified open datasets, performance benchmarks are an inevitable outcome, providing a standard platform for researchers to compare their new algorithms or architectures with state-of-the-art ones. This can reduce unrepresentative repeated experiments and improve the rigor of comparisons. In addition, it has been proved in the machine learning domain that learning with a simulator rather than in a real environment is more effective and cheaper in RL scenarios [3].



In the networking domain, due to the limited accessibility and high test cost of large-scale network systems, simulators with sufficient fidelity, scalability and high running speed are also required. These items contribute to both MLN and the further development of the networking domain, and public resources also make it possible for the wider community to conduct research.

Automated Network Protocol and Architecture Design

With a deeper understanding of the network, researchers gradually find that the existing network has many limitations. The network system is entirely created by human beings. The current network components are likely to have been added based on people's understanding at a particular time instant rather than as a paragon of engineering. There is still enough room to improve network performance and efficiency by redesigning the network protocol and architecture.

It is still quite difficult to design a protocol or architecture automatically today. However, the machine learning community has made some of the simplest attempts in this direction and has achieved some amazing results, such as letting agents communicate with each other to finish a task cooperatively. Other new achievements, e.g., GAN, have also shown that machine learning models have the ability to generate elements that exist in the real world and to create strategies people have not discovered (e.g., AlphaGo). However, these generated results are still far from protocol design. There is great potential and possibility to create new feasible network components without human involvement, which may refresh our understanding of network systems and propose clean-slate redesign frameworks that seem unacceptable today.

Automated Network Resource Scheduling and Decision Making

It is hard to conduct online scheduling with a principle-based heuristic algorithm due to the uncertainty and dynamics of network conditions. In the machine learning community, it has been proved that reinforcement learning has a strong capability to deal with decision-making problems. The recent breakthrough in the game of Go also proves that ML can make not only coarse but also precise decisions, sometimes beyond people's common sense. Although it is not easy to directly apply an exploration-exploitation strategy in highly varying network environments, reinforcement learning can be a candidate to replace the adaptive algorithms of present network systems; related efforts include [3, 4, 7, 13]. In addition, reinforcement learning is highly suitable for problems where several undetermined parameters need to be assigned adaptively according to the network state. However, these methods introduce new complexity and uncertainty into the network system itself, while stability, reliability and repeatability are always goals of network design.

Moreover, network scheduling with RL also provides a new opportunity to support flexible objective functions and cross-layer optimization. It is very convenient to change the optimization goal just by changing the reward function in the learning model, which is impossible with a traditional heuristic algorithm. The system may also perceive high-level application behaviors or QoE metrics as a reward, which may enable adaptive cross-layer optimization without a network model. In practice, it is nontrivial to design an effective reward function. The simplest reward design principle is to set the direct goal that needs to be maximized as the reward. However, it is often difficult to capture the exact optimization objective, and as a result we end up with an imperfect but easily obtained metric instead. In most cases this works well, but sometimes it leads to faulty reward functions that may result in undesired or even dangerous behavior.
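The reward-design pitfall can be made concrete with a small hypothetical example: a video QoE reward that combines bitrate with rebuffering and smoothness penalties. The weights are invented; setting the rebuffering weight to zero, an imperfect but easily obtained proxy, makes a stall-prone policy look best.

```python
# Illustrative reward shaping for a QoE-driven controller: the "true" goal
# (user experience) is proxied by a weighted sum of bitrate, rebuffering,
# and smoothness terms. Weights are invented; a faulty proxy that ignores
# rebuffering makes an aggressive, stall-prone policy look best.
def qoe_reward(bitrate_mbps, rebuffer_s, prev_bitrate_mbps,
               w_rebuffer=4.0, w_switch=1.0):
    return (bitrate_mbps
            - w_rebuffer * rebuffer_s
            - w_switch * abs(bitrate_mbps - prev_bitrate_mbps))

# Two hypothetical policies over one video chunk.
cautious   = dict(bitrate_mbps=2.0, rebuffer_s=0.0, prev_bitrate_mbps=2.0)
aggressive = dict(bitrate_mbps=6.0, rebuffer_s=1.5, prev_bitrate_mbps=6.0)

for w in (4.0, 0.0):   # sensible weight vs. faulty reward ignoring stalls
    print(w,
          round(qoe_reward(**cautious, w_rebuffer=w), 2),
          round(qoe_reward(**aggressive, w_rebuffer=w), 2))
# With w = 4.0 the cautious policy wins (2.0 vs. 0.0); with the faulty
# w = 0.0, the aggressive policy looks best (6.0) despite 1.5 s of stalling.
```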
Improving the Comprehension of Network Systems

Network behavior is quite complex due to the end-to-end network design principle, which generates various protocols that take simple actions in the end system but cause nontrivial in-network behavior. From this perspective, it is not easy to figure out which factors directly affect a certain network metric and which can be simplified during the algorithm design process, even in a mature network research domain like TCP congestion control. However, with the help of machine learning methods, people can analyze the output of learning algorithms in a post hoc fashion to find useful insights into how the network behaves and how to design a high-performance algorithm.

DeepRM [13], a resource management framework, is a good example. To understand why DeepRM performs better, the authors find that DeepRM is not work-conserving but decides to reserve room for yet-to-arrive small jobs, which eventually contributes to reducing job waiting time. For other evidence, refer to CFA [5] and Remy [3] and their follow-up works, which provide insights into the key influencing factors in video QoE optimization and TCP congestion control, respectively.
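One concrete post hoc tool is permutation feature importance: shuffle each input of a trained model and measure how much performance drops. In the sketch below, the features, data, and model are all synthetic; it only illustrates the kind of analysis used to explain systems like DeepRM.

```python
# Post hoc analysis of a trained model: permutation feature importance
# reveals which (synthetic) network features the learned model actually
# relies on -- one simple way to open the black box a little.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(6)
X = rng.normal(size=(1500, 3))               # [queue_len, rtt, time_of_day]
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1500)

model = GradientBoostingRegressor(random_state=6).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=6)

for name, score in zip(["queue_len", "rtt", "time_of_day"],
                       result.importances_mean):
    print(f"{name:12s} importance: {score:.3f}")
# Expect queue_len >> rtt > time_of_day: the model ignores the noise feature.
```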



Promoting the Development of Machine Learning

When machine learning is applied in networking fields, the specific requirements of network systems and practical implementation problems can push some inherent limitations and other emerging problems of machine learning toward a new stage of understanding, through the joint efforts of the two research communities.

Typically, there are several problems that are expected to be resolved. First, the robustness of machine learning algorithms is a key challenge for applications (e.g., self-driving cars and network operation) in real-world environments where learning errors could lead to high costs. The networking setting often requires hard constraints on the algorithm output and worst-case performance guarantees. Second, a model with high generalization ability that can adapt to high-variance and dynamic traffic conditions is needed, since it is unacceptable to retrain the model every time the characteristics of network traffic change. Although some experiments show that a model trained under a specific network environment can, to some degree, achieve good performance in other environments [3], this is still not easy, because most machine learning algorithms assume that the data follow the same distribution, which is not realistic in networking environments. In addition, the accountability and interpretability [3] of machine learning algorithms create big obstacles to practical implementation, since many learning models, especially deep learning models, are still black boxes. People do not know why and how a model behaves, and hence cannot interfere with its policy.

Conclusions

Due to the heterogeneity of networking systems, it is imperative to embrace machine learning techniques in the networking domain for potential breakthroughs. However, it is not easy for networking researchers to put this into practice due to the lack of machine learning experience and insufficient guidance. In this article, we present a basic workflow to provide researchers with a practical guideline for exploring new machine learning paradigms for future networking research. For a deeper comprehension, we summarize the latest advances in machine learning for networking, covering multiple important network techniques, including measurement, prediction and scheduling. Moreover, numerous issues are still open, and we shed light on the opportunities that need further research effort from both the networking and machine learning perspectives.

Acknowledgment

This work is supported by NSFC (no. 61422206), TNList and the "863" Program of China (no. 2015AA016101). We would also like to thank Keith Winstein from Stanford University for his helpful suggestions to improve this article.

References

[1] Y. Sun et al., "CS2P: Improving Video Bitrate Selection and Adaptation with Data-Driven Throughput Prediction," Proc. ACM SIGCOMM 2016, pp. 272–85.
[2] B. Mao et al., "Routing or Computing? The Paradigm Shift Towards Intelligent Computer Network Packet Transmission Based on Deep Learning," IEEE Trans. Computers, 2017.
[3] K. Winstein and H. Balakrishnan, "TCP Ex Machina: Computer-Generated Congestion Control," ACM SIGCOMM Computer Commun. Rev., vol. 43, no. 4, 2013, pp. 123–34.
[4] M. Dong et al., "PCC: Re-Architecting Congestion Control for Consistent High Performance," Proc. NSDI 2015, pp. 395–408.
[5] J. Jiang et al., "CFA: A Practical Prediction System for Video QoE Optimization," Proc. NSDI 2016, pp. 137–50.
[6] Z. Fadlullah et al., "State-of-the-Art Deep Learning: Evolving Machine Intelligence Toward Tomorrow's Intelligent Network Traffic Control Systems," IEEE Commun. Surveys & Tutorials, 2017.
[7] J. Jiang et al., "Pytheas: Enabling Data-Driven Quality of Experience Optimization Using Group-Based Exploration-Exploitation," Proc. NSDI 2017, pp. 393–406.
[8] J. Zhang et al., "Robust Network Traffic Classification," IEEE/ACM Trans. Networking, vol. 23, no. 4, 2015, pp. 1257–70.
[9] Z. Chen, J. Wen, and Y. Geng, "Predicting Future Traffic Using Hidden Markov Models," Proc. IEEE 24th Int'l. Conf. Network Protocols (ICNP), 2016, pp. 1–6.
[10] P. Poupart et al., "Online Flow Size Prediction for Improved Network Routing," Proc. IEEE 24th Int'l. Conf. Network Protocols (ICNP), 2016, pp. 1–6.
[11] I. Cunha et al., "Sibyl: A Practical Internet Route Oracle," Proc. NSDI 2016, pp. 325–44.
[12] D. D. Clark et al., "A Knowledge Plane for the Internet," Proc. ACM SIGCOMM 2003, pp. 3–10.
[13] H. Mao et al., "Resource Management with Deep Reinforcement Learning," Proc. HotNets 2016, pp. 50–56.
[14] N. Kato et al., "The Deep Learning Vision for Heterogeneous Network Traffic Control: Proposal, Challenges, and Future Perspective," IEEE Wireless Commun., 2016.
[15] O. Alipourfard et al., "Cherrypick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics," Proc. NSDI 2017, pp. 469–82.

Biographies

Mowei Wang received the B.Eng. degree in communication engineering from Beijing University of Posts and Telecommunications, Beijing, China, in 2017. He is currently working toward his Ph.D. degree in the Department of Computer Science and Technology, Tsinghua University, Beijing, China. His research interests are in the areas of data center networks and machine learning.

Yong Cui received the B.E. degree and the Ph.D. degree, both in computer science and engineering, from Tsinghua University, China, in 1999 and 2004, respectively. He is currently a full professor in the Computer Science Department at Tsinghua University. He has published over 100 papers in refereed conferences and journals and has received several Best Paper Awards. He has co-authored seven Internet standard documents (RFCs) for his proposals on IPv6 technologies. His major research interests include mobile cloud computing and network architecture. He has served or serves on the editorial boards of IEEE TPDS, IEEE TCC and IEEE Internet Computing. He is currently a working group co-chair in the IETF.

Xin Wang received the B.S. and M.S. degrees in telecommunications engineering and wireless communications engineering, respectively, from Beijing University of Posts and Telecommunications, Beijing, China, and the Ph.D. degree in electrical and computer engineering from Columbia University, New York, NY. She is currently an associate professor in the Department of Electrical and Computer Engineering, State University of New York at Stony Brook, Stony Brook, NY. Before joining Stony Brook, she was a member of technical staff in the area of mobile and wireless networking at Bell Labs Research, Lucent Technologies, New Jersey, and an assistant professor in the Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY. Her research interests include algorithm and protocol design in wireless networks and communications, mobile and distributed computing, and networked sensing and detection. She has served on the executive committees and technical committees of numerous conferences and funding review panels, and serves as an associate editor for IEEE Transactions on Mobile Computing. She received the NSF CAREER Award in 2005 and the ONR Challenge Award in 2010.

Shihan Xiao received the B.Eng. degree in electronic and information engineering from Beijing University of Posts and Telecommunications, Beijing, China, in 2012. He is currently working toward his Ph.D. degree in the Department of Computer Science and Technology, Tsinghua University, Beijing, China. His research interests are in the areas of wireless networking and cloud computing.

Junchen Jiang is a Ph.D. candidate in the Computer Science Department at Carnegie Mellon University, Pittsburgh, PA, USA, where he is advised by Prof. Hui Zhang and Prof. Vyas Sekar. He received the Bachelor's degree in computer science and technology from Tsinghua University, Beijing, China, in 2011.

