hyperion.com
white paper
preparing for data mining

cube is the data source

The algorithms in the Data Mining Framework are designed to work on data present within an Analytic Services cube. The design of the cube should take into consideration the data needs for all kinds of analyses (OLAP and Data Mining) that the user is interested in performing. Once the data is brought into the cube environment it can then be accessed through the Data Mining Framework for predictive analytics.

The Data Mining Framework uses MDX expressions to identify sections within the cube to obtain input data for the algorithm as well as to write back the results. The Data Mining Framework can only take regular dimension members as mining attributes. What this implies is that only data referenced through regular dimension members (not through attribute dimensions or user defined attributes) can be presented as input data to the Data Mining Framework. Accordingly, the data required for predictive analytics should be modeled within the standard dimensions and measures of a cube.

In the case study discussed in this paper, the primary business requirement was to build a classification model for prediction. Since there were no other accompanying business requirements, the design of the Analytic Services cube was driven primarily by the Data Mining analytics need. For example, we have not used any attribute dimension modeling in the case study. In the generic case, however, it is more likely that the cube caters to both regular OLAP analytics and predictive analytics within the same dimensional model.

preparing mining attributes

The available input data can broadly be of two data types – 'number' or 'string'. However, since measures in Analytic Services are stored in the database in a numerical format, 'string' type input data has to be encoded into 'number' type data before being stored in Analytic Services. For example, if gender information is available as a string stating 'Male' or 'Female', it needs to be encoded into a numeric value – like '1' or '0' – before being stored as a measure in the Analytic Services OLAP database.

Mining attributes can be of two types – 'categorical' or 'numerical'. Mining attributes that describe discrete information content, like gender ('Male' or 'Female'), zip code (95054, 94304, 90210, etc.), customer category ('Gold', 'Silver', 'Blue'), or status information ('Applied', 'Approved', 'Declined', 'On Hold'), are termed 'categorical' attribute types. Mining attributes that describe continuous information content, like sales, revenue, or income, are termed 'numerical' attribute types. The Analytic Services Data Mining Framework has the capability of working with algorithms that can handle both categorical and numerical attribute types. Among the algorithms shipped in the box with the Analytic Services Data Mining Framework, the Naïve Bayes and Decision Tree algorithms have the capability to handle both categorical and numerical mining attribute types and treat them accordingly.

One of the key steps in Data Mining is the data auditing or data conditioning phase. This involves putting together, cleansing, categorizing, normalizing, and properly encoding the data. This step is usually performed outside the Data Mining tool. The effectiveness of a Data Mining algorithm is largely dependent on the quality and completeness of the source data. In some cases, for various mathematical reasons, the available input data may also need to be transformed before it is brought into a Data Mining environment. Transformations may sometimes also include splitting or combining input data columns. Some of these transformations may be done on the input dataset outside the Data Mining Framework by using standard data manipulation techniques available in ETL tools or RDBMS environments. For the current case the input data does not need any mathematical transformation, but some encoding is needed to convert data into a format that can be processed within the Analytic Services OLAP environment.

In the current problem at ABC University, the available set of input data consisted of both 'string' and 'number' data types. The list below gives some of the input data that needed encoding from 'string' type into 'number' type:

• Identity related data – like Gender, City, State, Ethnicity
• Data related to the application process – like Application Status, Primary Source of Contact, Applicant Type, etc.
• Date related data – like Application Date, Source Date, etc. (Dates were available in the original dataset as strings in two different formats – "yymmdd" and "mm/dd/yy" – and had to be encoded into numbers.)

In the current case study, these encodings were done outside the Analytic Services environment by constructing look-up master tables in which the 'string' type inputs were listed in a tabular format and the records sequentially numbered. Subsequently, each 'string' type input was referred to by its corresponding numeric identifier during data load into Analytic Services. Table 2 shows a few samples of what such mapping files look like.

State ID   State Name      AppliedStatus ID   Application Status
1          VT              3                  Applied
2          CA              4                  Offered Admission
3          MA              5                  Paid Fees
4          MI              6                  Enrolled
5          NH
6          NJ

Table 2: Typical mapping of numeric identifiers
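The look-up-table encoding described above happens outside Analytic Services, in an ETL tool or RDBMS. A minimal Python sketch of the idea follows; the record values are illustrative samples in the spirit of Table 2, and the sequential numbering here simply starts at 1 (the actual mapping files may use other identifier ranges):

```python
# Illustrative sketch (not part of the Data Mining Framework): build look-up
# master tables that encode 'string' type inputs as numeric identifiers
# before the data is loaded into Analytic Services.

def build_lookup(values):
    """Assign a sequential numeric ID to each distinct string value."""
    table = {}
    for v in values:
        if v not in table:
            table[v] = len(table) + 1
    return table

records = [
    {"State": "VT", "AppStatus": "Applied"},
    {"State": "CA", "AppStatus": "Offered Admission"},
    {"State": "VT", "AppStatus": "Enrolled"},
]

state_ids = build_lookup(r["State"] for r in records)
status_ids = build_lookup(r["AppStatus"] for r in records)

# Replace each string with its numeric identifier for the data load.
encoded = [
    {"State": state_ids[r["State"]], "AppStatus": status_ids[r["AppStatus"]]}
    for r in records
]
```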
preparing the cube

After all the input data has been identified and made ready, the next step is to design an outline and load the data into an Analytic Services cube. In the context of the current case the Analytic Services outline was created as follows:

• All the input data (measures in the OLAP context) were organized into five groups (a two-level hierarchy in the measures dimension) based on a logical grouping of measures. The details of each measure group are explained in Table 3 below.
• Data load is performed just as it is normally done for any Analytic Services cube.

Table 3: Analytic Services outline expanded
• Measures related to information about the applicants' identity were organized into this group. Some of these measures were transformed from 'string' type to 'number' type to facilitate modeling them within the Analytic Services database context.
• Measures related to various test scores and high school examination results were organized into this group.
• Measures related to the context of the applicant's application processing were organized into this group.
• Measures providing information about the financial support and funding associated with the applicant.

At this stage we have:
• Designed an Analytic Services cube
• Loaded it with relevant data

It should be noted that the steps described so far are generic to Analytic Services cube building and did not need any specific support from the Analytic Services Data Mining Framework.
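As a rough sketch, the two-level measures hierarchy can be pictured as logical groups of related measures. The group and member names below are hypothetical placeholders (the actual names live in the cube outline of Table 3); only a few measures known from the case study appear:

```python
# Hypothetical sketch of the two-level measures hierarchy; group names and
# most member names are placeholders, not the actual outline members.
measures_outline = {
    "ApplicantIdentity":  ["Gender", "City", "State", "Ethnicity"],
    "TestScores":         ["SATScore", "HSExamResult"],
    "ApplicationProcess": ["AppStatus", "ApplicantType", "ApplicationDate"],
    "FinancialSupport":   ["FARecieved", "StudBudget", "TotalAward"],
}

# Each measure lives at the second level, under exactly one logical group.
all_measures = [m for group in measures_outline.values() for m in group]
```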
identifying the mining attributes

It is necessary to reduce the number of attributes / variables presented to an algorithm so that the information content is enhanced and the noise minimized. This is usually performed using supporting mathematical techniques to ensure that the most significant attributes are retained within the dataset that is presented to the algorithm. It should be noted here that the choice of significant attributes is driven more by the particular data than by the problem itself. Attribute analysis or attribute conditioning is one of the initial steps in the Data Mining process and is currently performed outside the Data Mining Framework. The main objective during this exercise is to identify a subset of mining attributes that is highly correlated with the predicted attribute, while ensuring that the correlation within the identified subset of attributes is as low as possible.

The Analytic Services platform provides a wide variety of tools and techniques that can be used in the attribute selection process. One method to identify an optimal set of attributes is to use special data reduction techniques implemented within Analytic Services through Custom Defined Functions (CDFs). Additionally, users can use other data visualization tools like Hyperion Visual Explorer to arrive at a decision on the effectiveness of specific attributes in contributing to the overall predictive strength of the Data Mining algorithm. Depending on the nature of the problem the users may choose an appropriate tool and technique for deciding the optimal set of attributes.

One of the advantages of working with the Analytic Services Data Mining Framework is the inherent capability in Analytic Services to support customized methods for attribute selection through the use of Custom Defined Functions (CDFs). This is essential since the process of mining attribute selection can vary significantly across problems, and an extensible toolkit comes in very handy for customizing a method to suit a specific problem.

In the current case at ABC University, a CDF was used to identify the correlation effects amongst the available set of mining attributes. A thorough analysis of various subsets of the available mining attributes was performed to identify a subset that is highly correlated with the predicted mining attribute and at the same time has low correlation scores within the subset itself. Since some Data Mining algorithms (like Naïve Bayes and Neural Net) are quite sensitive to inter-attribute dependencies, an attempt was made to outline the clusters of mutually dependent attributes, with a certain degree of success. From each cluster a single, most convenient, attribute was selected. For this case study, an expert made the decision, but this process can be generalized to a large degree. An optimal set of five mining attributes was identified after this exercise. Table 4 shows the list of identified mining attributes, grouped by the input attribute type – categorical or numerical.

Categorical Type    Numerical Type
FARecieved          StudBudget
AppStatus           TotalAward
Applicant Type

Table 4: Optimal set of mining attributes identified

At this stage we have:
• Designed an Analytic Services cube
• Loaded it with relevant data
• Identified the optimal subset of measures (mining attributes)

modeling the problem

We will now use the Data Mining Framework to define an appropriate model (for the business problem) based on the Analytic Services cube and the identified subset of mining attributes (measures). Setting up the model includes selecting the algorithm, defining algorithm parameters, and identifying the input data location and output data location for the algorithm.

choosing the algorithm

The next step in the Data Mining process is to pick the appropriate algorithm. There are six basic algorithms provided in the Data Mining Framework – Naïve Bayes, Regression, Decision Tree, Neural Network, Clustering and Association Rules. The Analytic Services Data Mining Framework also allows for the inclusion of new algorithms through a well defined process described in the vendor guide that is part of the Data Mining SDK. The six basic algorithms are a sample set shipped with the product to provide a starting point for using the Data Mining Framework.

Choosing an algorithm for a specific problem needs basic knowledge of the problem domain and the applicability of specific mathematical techniques to efficiently solve problems in that domain. The specific problem discussed in this paper falls into a class of problems termed classification problems. The need here is to classify each applicant into a discrete set of classes on the basis of certain numerical and categorical information available about the applicant. The 'class' referred to in this context is the status of the applicant's application looked at from an enrollment perspective: "will enroll" or "will not enroll". There is historical data available indicating which kind of applicants (with a specific combination of categorical and numerical factors associated with them) have gone ahead and accepted offers from ABC University and subsequently enrolled into the programs. There is data available for the negative case as well – i.e. applicants that did not eventually enroll into the program.
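The attribute-selection objective described above – keep attributes highly correlated with the predicted attribute while holding the correlation *within* the selected subset low – can be sketched in plain Python. This is an illustration of the idea only, not the CDF actually used in the case study, and the data values are invented:

```python
# Greedy correlation-based attribute selection: rank candidates by their
# correlation with the target, then admit each one only if it is not too
# correlated with an attribute already selected.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_attributes(data, target, max_inter=0.7):
    ranked = sorted(data, key=lambda a: abs(pearson(data[a], target)),
                    reverse=True)
    selected = []
    for attr in ranked:
        if all(abs(pearson(data[attr], data[s])) < max_inter
               for s in selected):
            selected.append(attr)
    return selected

data = {
    "A": [1, 2, 3, 4, 5],    # strongly tracks the target
    "B": [2, 4, 6, 8, 10],   # duplicate information: perfectly correlated with A
    "C": [5, 1, 4, 2, 3],    # weakly related to the target
}
target = [1, 2, 3, 4, 5]
```

Here "B" is dropped despite its high target correlation, because it adds no information beyond "A" – the same reasoning used to pick one representative attribute from each cluster of mutually dependent attributes.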
Given that this problem can be looked at as a classification problem, and that there is historical information available, one suitable algorithm for the analysis is the Naïve Bayes classification algorithm. We chose Naïve Bayes for modeling this particular business problem.

deciding on the algorithm parameters

Every algorithm has a set of parameters that control its behavior. Algorithm users need to choose the parameters based on their knowledge of the problem domain and the characteristics of the input data. Analytic Services provides adequate support for such preliminary analysis of data using Hyperion Visual Explorer or the Analytic Services Spreadsheet Client. Users are free to analyze the data using any convenient tool and determine their choices for the various algorithm parameters.

Each of the algorithms has a set of parameters that determine the way the algorithm will process the input data. For the current case, the algorithm chosen is Naïve Bayes and it has four parameters that need to be specified – "Categorical, Numerical, RangeCount, Threshold". The details of each of the parameters and the implications of setting them are described in the online help documentation.

Out of the selected list of attributes a few are of categorical type, and hence our choice for the 'Categorical' parameter is 'yes'. Similarly, there are attributes of numerical type, and hence the choice for the 'Numerical' parameter is also 'yes'. The data was analyzed using a histogram plot to understand the distribution before deciding on the value to be provided for the 'RangeCount' parameter. This parameter needs to be large enough to allow the algorithm to use all the variety available in the data, and at the same time small enough to prevent overfitting. From the analysis of the input data for this particular case, setting this parameter to '12' seemed reasonable. The 'RangeCount' controls the binning process in the algorithm. It should be emphasized that binning schemes (including bin count) really depend on the specific circumstances and may vary to a great degree between different problems.

At this stage we have:
• Designed an Analytic Services cube
• Loaded it with relevant data
• Identified the optimal subset of measures (mining attributes)
• Chosen the algorithm suitable for the problem
• Identified the parameter values for the chosen algorithm

There are three steps involved in effectively using the Data Mining functionality to provide predictive solutions to business problems:
1. Building the Data Mining model
2. Testing the Data Mining model
3. Applying the Data Mining model

Each of these steps, performed using the Data Mining Wizard in the Administration Services Console, uses MDX expressions to define the context within the cube in which to perform the data mining operation. Various accessors, specified as MDX expressions, identify data locations within the cube. The framework uses the data in those locations as input to the algorithm or writes output to the specified location.

Accessors need to be defined for each of the algorithms so as to let the algorithm know the specific context for each of the following:

• (the attribute domain) the expression to identify the factors of our analysis that will be used for prediction [in the current context this expression pertains to the mining attributes that we identified]
• (the sequence domain) the expression to identify the cases/records that need to be analyzed [in the current context this expression will identify the list of applicants]
• (the external domain) the expression to identify if multiple models need to be built [not relevant in the current context]
• (the anchor) the expression to specify additional restrictions from dimensions that are not really participating in this data mining operation [in the current context all the dimensions of the cube that we used have relevance to the problem; accordingly, the anchor here only helps restrict the algorithm scope to the right measure in the 'Measures' dimension]

Additional details for each of these expressions can be obtained from the online help documentation.

building the data mining model

To access the Data Mining Framework, bring up the Data Mining Wizard in the Administration Services Console and choose the appropriate application and database as shown in Figure 1 on the next page.
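To make the Naïve Bayes choice concrete, here is a minimal, self-contained sketch of how such a classifier scores the two enrollment classes from categorical evidence. The records and feature values are invented for illustration, and this is not the framework's implementation; numerical predictors would first be discretized into ranges (which is what 'RangeCount' governs) and then treated the same way:

```python
# Minimal Naive Bayes sketch for a two-class enrollment problem.
from collections import Counter, defaultdict

def train(rows, labels):
    classes = Counter(labels)          # per-class record counts, for P(class)
    counts = defaultdict(Counter)      # (class, feature index) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            counts[(y, i)][v] += 1
    return classes, counts

def predict(classes, counts, row):
    total = sum(classes.values())
    best, best_p = None, -1.0
    for y, ny in classes.items():
        p = ny / total                 # prior P(class)
        for i, v in enumerate(row):
            # Laplace smoothing (+1, with 2 possible values per predictor)
            # so an unseen value does not zero out the whole product.
            p *= (counts[(y, i)][v] + 1) / (ny + 2)
        if p > best_p:
            best, best_p = y, p
    return best

rows = [("instate", "aid"), ("instate", "aid"),
        ("outstate", "noaid"), ("outstate", "aid")]
labels = ["enroll", "enroll", "no_enroll", "no_enroll"]
model = train(rows, labels)
```

The "naïve" assumption is visible in the inner loop: each predictor contributes an independent factor to the class score, which is why the algorithm is sensitive to strongly inter-correlated attributes, as noted in the attribute-selection discussion.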
In the next screen (Figure 2 below), depending on whether you are building a new model or revising an existing model, you choose the appropriate task option.
This brings up the wizard screen for setting the algorithm parameters and the accessor information associated with the chosen algorithm, in this case Naïve Bayes. The user selects a node in the left pane to see and provide values for the appropriate options and fields displayed in the right pane. As shown in Figure 3, select "Choose mining task settings" to set how missing data in the cube is handled. The choice in this case is to replace it with 'As NaN' (Not-A-Number).

The Naïve Bayes algorithm requires that we declare upfront whether we plan to use either or both of 'Categorical' and 'Numerical' predictors. In the context of the current case, we have both categorical and numerical attribute types and hence the choice is 'True' for both of these parameters. 'RangeCount' was set to 12. 'Threshold' was fixed at 1e-4, a very small value. Figure 4 shows the completed parameter-settings screen.
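The binning that 'RangeCount' controls can be illustrated with a simple equal-width scheme. The framework's actual binning may differ, and the budget figures below are invented; the point is only that a continuous measure becomes a fixed number of discrete ranges:

```python
# Equal-width binning sketch: discretize a numerical attribute into
# range_count bins, as governed by the 'RangeCount' parameter.

def bin_values(values, range_count):
    lo, hi = min(values), max(values)
    width = (hi - lo) / range_count
    # Map each value to a bin index 0 .. range_count - 1; the maximum
    # value is clamped into the last bin.
    return [min(int((v - lo) / width), range_count - 1) if width else 0
            for v in values]

budgets = [5000, 8000, 12000, 20000, 26000, 29000]
bins = bin_values(budgets, range_count=12)
# Too few bins hide the variety in the data; too many produce sparse
# bins and invite overfitting - the trade-off discussed above.
```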
The Naïve Bayes algorithm has two predictor accessors – 'Numerical Predictor' and 'Categorical Predictor' – and one target accessor. Figure 5 shows the various domains that need to be defined for the accessors. Table 5 shows the values that were used for the case being discussed. All the information provided during this stage of model building is preserved in a template file so as to facilitate reuse of the information if necessary.

Table 5: Setting up accessors for the "build" mode while using the Naive Bayes algorithm
Once the accessors are defined, the Data Mining Wizard prompts the user to provide names for the template and model that will be generated at this stage. Figure 6 shows the screen in which the model and template names need to be defined.

At this stage we have:
• Built a Data Mining model using the Naïve Bayes algorithm

testing the data mining model

The next step is to test the newly built model to verify that it satisfies the level of statistical significance needed for the model to be put to use. Ideally, a part of the input data (with valid known outcomes – historical data) is set aside as a test dataset to verify the goodness of the Data Mining model developed by the use of the algorithm. Testing the model on this test dataset and comparing the outcomes predicted by the model against the known outcomes (historical data) is one of the processes supported by the Data Mining Wizard. A 'test' mode template can be created by a process similar to creating a 'build' mode template as described in the previous section. While building the 'test' mode template the user needs to provide a 'Confidence' parameter to tell the Data Mining Framework the minimum confidence level necessary to declare the model valid. We specified a value of 0.95 for the 'Confidence' parameter. The exact steps in the wizard and descriptions of the various parameters can be obtained from the online help documentation.
Once the process is completed, the test results (whose name was specified in the last step of the Data Mining Wizard) appear under the 'Model Results' node. Figure 7 shows the node in the Administration Services Console 'Enterprise View' pane where the 'Mining Results' node is visible.

The model can be queried within the Administration Services Console interface to obtain a list of the model accessors by using the "Query Result" functionality. Invoking "Show Result" for the 'Test' accessor will indicate the result of the test. Figure 8 below shows the list of model accessors in the result set of a model based on the Naïve Bayes algorithm used in test mode.

If the 'Test' accessor has a value of 1.0 then the test is deemed successful and the model is declared 'good' or 'valid' for prediction. Figure 9 shows the result of the test for the case being discussed in this paper.

At this stage we have:
• Built a Data Mining model using the Naïve Bayes algorithm
• Verified the model as valid with 95% confidence
Figure 7: Model Results node in the Administration Services Console interface
Figure 8: Model accessors for result set associated with a model based on the Naive Bayes algorithm
Table 6: Setting up accessors for the “apply” mode while using Naive Bayes algorithm
… means additional promotional expenditure in trying to follow up on an applicant who will eventually not enroll. The importance of each should be analyzed in the context of the business, and the model rebuilt if necessary with a different training set (historical data) or with a different set of attributes.

Figure 10 below shows the confusion matrix constructed using the data set that was analyzed as part of this case study. It is evident from the confusion matrix that the model predicted that 1550 (1478 + 72) students will enroll. Of those, only 1478 actually enrolled and 72 did not enroll. This implies that there were 72 false positives. Similarly, the model predicted that 9805 (9356 + 449) students will not enroll. Of those, only 9356 actually did not enroll, whereas 449 actually did enroll. This implies that there were 449 false negatives.

Success rate of the model: 95.41% (only 521 incorrect predictions in 11355 cases)

additional functionality

The Analytic Services Data Mining Framework offers more functionality that can be used when deploying models in real business scenarios. Some of the further steps that can be considered include:

transformations

The Data Mining Framework offers the ability to apply a transform to the input data just before it is presented to the algorithm. Similarly, the output data can be transformed before being written into the Analytic Services cube. The Data Mining Framework offers a basic list of transformations – exp, log, pow, scale, shift, linear – that can be used through the Data Mining Wizard. The details of each of these transformations, what they do and how to use them, can be obtained from the Analytic Services online help documentation. This list of transformations is further extensible through the import of custom Java routines written specifically for the purpose. The details of how to write Java routines to be imported as additional transforms can be obtained from the vendor guide that is shipped as part of the Data Mining SDK.

mapping

In some cases, when the model has been developed for a different context and needs to be used elsewhere, the 'Mapping' functionality is useful. Through this functionality the user can tell the Data Mining Framework how to interpret the existing model accessors in the new context in which the model is being deployed. More information on using this functionality can be obtained from the online help documentation.

using the data mining framework in batch mode

There is also a batch mode interface to the functionality provided in the Data Mining Framework. Scripts written using the MaxL command interface can be used to perform almost all the functionality that is exposed through the Data Mining Wizard. Details of the MaxL commands and their usage can be obtained from the online help documentation.

building custom applications

Custom applications can be developed using Analytic Services as the backend database and the developer tools provided with Hyperion Application Builder. The functionality provided by the Data Mining Framework can be invoked through APIs.
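The arithmetic behind the confusion matrix and the quoted success rate can be checked directly from the four cell counts reported above:

```python
# Confusion-matrix arithmetic from the case study: 1478 true positives,
# 72 false positives, 9356 true negatives, 449 false negatives.
tp, fp = 1478, 72     # predicted "will enroll"
tn, fn = 9356, 449    # predicted "will not enroll"

total = tp + fp + tn + fn         # all cases scored by the model
incorrect = fp + fn               # wrong predictions of either kind
success_rate = (tp + tn) / total  # fraction of correct predictions
```

This reproduces the figures in the text: 11355 cases, 521 incorrect predictions, and a success rate of 95.41%.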
© Copyright 2005 Hyperion Solutions Corporation. All rights reserved. “Hyperion,” the Hyperion “H” logo, and Hyperion’s product names are trademarks of Hyperion. References to
other companies and their products use trademarks owned by the respective companies and are for reference purpose only. 5164_0805