
Data Sourcing

Below is a summary of the aspects I considered during data sourcing. These include data sources, formats, and maximum dataset size, as well as supported data types.

All three services can train models on uploaded text files. Both AWS and MS Azure can also read data from tables in
their storage services.
AWS supports the largest datasets for batch training.
Google supports update calls, which one can use to incrementally train the model - that is, to do online training.
MS Azure supports the widest variety of data sources and formats.
There is no clear winner for now, as each service has its strengths and weaknesses.

Data Preprocessing

The table below lists whether certain data preprocessing operations can be performed using these services. The operations covered here are commonly used, but this is not an exhaustive list of all preprocessing techniques.
Keep in mind that you can perform most, if not all, of these operations using Python or some other language before sending the data to the service. What is being assessed here is whether these operations can be performed within the service itself.

Some transformations may be performed behind the scenes, before the actual training takes place. In this table, we are referring to explicitly applying the transformations to the data.
In AWS, most operations are performed using so-called recipes. Recipes are JSON-like scripts used to transform the data before feeding it to a machine learning model.
All the above transformations other than data visualization, data splitting, and missing value imputation are done using recipes. For instance, quantile_bin(session_length, 5) would discretize the session_length variable into 5 bins.
You can also apply the operations to groups of variables; the groups themselves are also defined in the recipes.
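The recipe syntax itself is specific to AWS, but if you prefer to prepare the data before upload, the same binning operation is easy to reproduce; a minimal pandas sketch (the session_length values are made up):

```python
import pandas as pd

# Made-up values for the session_length variable from the recipe example.
df = pd.DataFrame({"session_length": [12, 45, 3, 88, 27, 60, 9, 150, 33, 71]})

# Rough equivalent of quantile_bin(session_length, 5): discretize the variable
# into 5 bins, each holding roughly the same number of observations.
df["session_length_bin"] = pd.qcut(df["session_length"], q=5, labels=False)

print(df)
```
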
Missing value imputation is also indicated as being possible within AWS. Although the transformation is not directly
implemented, one can train a simple model - a linear regression, for instance - to predict the missing values. This
model can then be chained with the main model. For this reason, I consider AWS as allowing missing value
imputation.
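A minimal sketch of that idea with scikit-learn, assuming a numeric column with gaps and two complete columns to predict it from (all names and values here are made up):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data: 'age' has missing values; 'income' and 'visits' are complete.
df = pd.DataFrame({
    "income": [30, 50, 45, 80, 60, 55],
    "visits": [2, 5, 3, 9, 6, 4],
    "age":    [25, np.nan, 31, 52, np.nan, 38],
})

observed = df["age"].notna()

# Fit a simple model on the rows where 'age' is known...
imputer = LinearRegression()
imputer.fit(df.loc[observed, ["income", "visits"]], df.loc[observed, "age"])

# ...then fill in the missing values before training the main model.
df.loc[~observed, "age"] = imputer.predict(df.loc[~observed, ["income", "visits"]])
print(df)
```
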
In MS Azure, transformations are applied sequentially using the built-in modules. The binning example above could be done using the 'Quantize Data' module, and one can choose which variable or variables are affected.
R and Python scripts can also be included to apply custom transformations.
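As far as I remember the module's interface, a custom Python script exposes a function named azureml_main that receives and returns pandas DataFrames; a minimal sketch (the log-transform and the column name are just illustrative assumptions):

```python
import numpy as np

# Entry point expected by the module: datasets connected to the input ports
# arrive as pandas DataFrames, and the returned DataFrame is passed on to
# the next module in the experiment.
def azureml_main(dataframe1=None, dataframe2=None):
    # Illustrative custom transformation: log-scale a skewed numeric column.
    dataframe1["session_length"] = np.log1p(dataframe1["session_length"])
    return dataframe1,
```
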
When using Google, most of the data processing will have to be done before feeding the data to the service.

Strings with more than one word are separated into multiple features within the Google Prediction API: 'load the data' would be split into 'load', 'the', and 'data'. This type of processing is common in Natural Language Processing (NLP) applications such as document summarization and text translation.
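The same splitting is trivial to reproduce yourself; a one-line Python sketch:

```python
# Whitespace tokenization: each word becomes a separate feature.
tokens = "load the data".split()
print(tokens)  # ['load', 'the', 'data']
```
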
You may choose to do all the data processing before sending the data to any of these services. Though this may mean more work, it is also a way to gain more control - you know exactly what you are doing to your data.

Aspects to Consider

Which service works best for your application? For now, the short answer really is 'it depends'. A number of factors need to be considered:
These services support data loaded from their own storage services, so how you store your data can prove to be a decisive factor.
Can you handle batch training? If so, evaluate the typical size of your dataset. On the other hand, if your dataset is really large, or if you want to keep updating the model as you go, consider online training.
If you can implement the data transformations on your side, not having any built-in data transformation tools may not be a problem at all.
If that is not a possibility, know which transformations you need to perform on your data, and check whether the service you choose offers them. Pay special attention to missing values and text features, as typical application data are sure to have both.

Final Thoughts

Personally, I found MS Azure's flexibility in both data sourcing and preprocessing attractive. I did not use the custom R or Python scripts, mostly because I did not need to.
However, I do like to know exactly what I am doing to the data I feed a model with. Although I was able to quickly transform data using MS Azure, I would still do the data transformation using my own tools. This gives me full control, and allows me to exploit my data's specific traits to perform operations in the most efficient way.
Google provides what I believe to be a key feature in ML applications: incremental training. It allows you to use virtually unlimited amounts of data, and it takes the weight of deciding when to retrain a model off your shoulders.
When it comes to data processing, Amazon lies somewhere between the other two: it has some functionalities, but
not many. But given how recent this service is - it was launched little more than a month ago - I see potential. If the
service continues to evolve, it may become a very versatile tool.

Model Building

Training machine learning models is the core part of these services. The table below presents an overview of which types of models you can build with each service, and a few other aspects.

All services have algorithms to solve both classification and regression problems. MS Azure also has algorithms to
perform clustering and anomaly detection.
As was seen in Part I, Google allows a model to be trained incrementally.
In AWS, training data has to be provided in a single batch. However, the model is trained stochastically. Because of this, the order in which the data samples are presented matters. Be sure to shuffle your data before using it to train a model.
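A minimal sketch of that precaution in Python, assuming the training set sits in a CSV file before being uploaded to AWS (the file names are hypothetical):

```python
import pandas as pd

# Hypothetical file names; the training set is shuffled before upload.
df = pd.read_csv("training_data.csv")

# Reorder the rows at random so that stochastic training does not see the
# samples in a biased order; a fixed random_state makes this reproducible.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

df.to_csv("training_data_shuffled.csv", index=False)
```
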
In terms of algorithms, MS Azure offers the widest range of model choices. AWS uses linear models only, and comments on Reddit and CrossValidated suggest that linear models are Google's choice as well.
Linear models are not necessarily worse than the more complex non-linear ones. Linear models scale better with
the size of the dataset, are faster to train, and are easier to tune.
Additionally, linear models perform as well as non-linear models in a variety of problems; they are known for being very successful in NLP tasks such as part-of-speech tagging.
A final note before moving on to model evaluation: as far as I can tell, no service allows you to inspect or export a trained model. Whatever model you build using these services, you have to use it through the service provider, and pay for it.

Model Evaluation

Below, you will find listed the performance metrics that each service provides for evaluating a model. One can easily obtain these and other metrics oneself from the true and predicted labels.

When it comes to model assessment, the differences between the three services are small.
You can do basic model assessment on all three platforms using accuracy or the mean squared error. In classification problems, you can use the confusion matrix to compute other performance metrics such as precision and recall.
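As a reminder of how those metrics follow from the confusion matrix, here is a short scikit-learn sketch (the label vectors are made up):

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for a two-class problem.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# For binary problems, ravel() unpacks the matrix into its four cells.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # fraction of predicted positives that are real
recall = tp / (tp + fn)     # fraction of real positives that were found

print(f"precision={precision:.2f}, recall={recall:.2f}")
```
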
Both AWS and MS Azure allow you to visually assess your models; below are a couple of screenshots of the interfaces. You can adjust the decision threshold when doing two-class classification, and the performance metrics for any given threshold are recomputed for you.
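The same threshold adjustment is easy to replicate locally once a model outputs class probabilities; a minimal sketch (probabilities and labels are made up):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up predicted probabilities of the positive class, with true labels.
proba = np.array([0.15, 0.40, 0.65, 0.80, 0.30, 0.90, 0.55, 0.10])
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])

# Recompute the metrics for a few candidate decision thresholds.
for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    print(threshold,
          precision_score(y_true, y_pred),
          recall_score(y_true, y_pred))
```
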

If you wish to see individual classifications on the training examples, you have to perform a separate prediction on that data. You can then use the true and predicted outputs to perform your own performance analysis using R or Python, for instance.
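If you go down that road, scikit-learn can summarize the usual metrics in a single call; a tiny sketch (labels are again made up):

```python
from sklearn.metrics import classification_report

# Made-up true labels and the predictions returned by the service.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

# Per-class precision, recall, and F1 in a single call.
print(classification_report(y_true, y_pred))
```
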
At this point it is only natural to ask the following:
Does any of these services build better models than the others?
The short answer is 'No.'

I used all three services to solve a toy two-class classification problem. The training data had 9400 patterns, one third of which belonged to the positive class. Each training pattern had 94 numeric features. Below is a sample of the training data.

To ensure the comparison was done using similar machine learning algorithms, I compared AWS and Google with MS Azure's logistic regression. I also compared the performance of these services with scikit-learn's logistic regression model. The performance was assessed on a separate test set with 7800 examples, one third of them positive. The results on this data set are shown below. All models performed equally well, the differences lying only in the individual classification of the patterns.
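For reference, the scikit-learn baseline amounts to a few lines; a sketch of the setup described above, with hypothetical file and column names:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical file names; both sets have 94 numeric feature columns
# and a 'label' column with the class.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X_train, y_train = train.drop(columns="label"), train["label"]
X_test, y_test = test.drop(columns="label"), test["label"]

# The local baseline the cloud services were compared against.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```
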

Final Thoughts

Hopefully, these posts will serve as a foundation for people who are trying to include machine learning in their products.
