Commercial Viewpoint
Lots of data is being collected and warehoused:
Web data and e-commerce
Purchases at department and grocery stores
Bank and credit card transactions
Computers have become cheaper and more powerful.
Competitive pressure is strong: provide better, customized services for an edge (e.g. in Customer Relationship Management).
Traditional techniques are infeasible for raw data.
Data mining may help scientists in classifying and segmenting data and in hypothesis formation.
Mining?
Look up phone number in phone directory
Query a Web search engine for information about "Amazon"
Overview:
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. It also enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detailed transactional data. With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments.
Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
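The backpack example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the purchase histories are made up, and `follow_on_rate` is a hypothetical helper that counts how often the target item is bought after all of the earlier items, which is a much simpler estimate than a full sequential pattern mining algorithm would produce.

```python
# Hypothetical purchase histories: each list is one customer's purchases in time order.
histories = [
    ["sleeping bag", "hiking shoes", "backpack"],
    ["hiking shoes", "sleeping bag", "tent"],
    ["sleeping bag", "hiking shoes", "stove", "backpack"],
    ["backpack", "sleeping bag", "hiking shoes"],
    ["hiking shoes", "water bottle"],
]


def follow_on_rate(histories, antecedents, target):
    """Share of customers who bought every antecedent item and then,
    at some later point, bought the target item."""
    qualified = 0
    followed = 0
    for purchases in histories:
        # Position of the first purchase of each antecedent item.
        positions = [purchases.index(a) for a in antecedents if a in purchases]
        if len(positions) < len(antecedents):
            continue  # this customer never bought all antecedent items
        qualified += 1
        last_antecedent = max(positions)
        if target in purchases[last_antecedent + 1:]:
            followed += 1
    return followed / qualified if qualified else 0.0


rate = follow_on_rate(histories, ["sleeping bag", "hiking shoes"], "backpack")
print(f"{rate:.2f}")
```

Because order matters here, the customer who bought the backpack first is counted as qualified but not as a follow-on purchase.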
The analytical techniques used in data mining are often well-known mathematical algorithms and techniques. What is new is the application of those techniques to general business problems made possible by the increased availability of data and inexpensive storage and processing power. Also, the use of graphical interfaces has led to tools becoming available that business experts can easily use.
Genetic algorithms - Optimization techniques based on the concepts of genetic combination, mutation, and natural selection.
Nearest neighbor - A classification technique that classifies each record based on the records most similar to it in a historical database.
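As a minimal sketch of the nearest neighbor technique, the Python below labels a new record by majority vote among its k closest historical records. The records, the two features (age and income), the labels, and the choice of k = 3 with Euclidean distance are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical historical records: (age, annual income in $1000s) -> label.
history = [
    ((25, 30), "low"),
    ((30, 35), "low"),
    ((28, 40), "low"),
    ((45, 90), "high"),
    ((50, 85), "high"),
    ((40, 95), "high"),
]


def classify(record, history, k=3):
    """Label a record by majority vote among its k closest historical records."""
    # Sort the historical records by Euclidean distance to the new record.
    by_distance = sorted(history, key=lambda item: math.dist(record, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(classify((27, 33), history))
print(classify((47, 88), history))
```

In practice the features would usually be rescaled first, since a column with a large numeric range otherwise dominates the distance.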
Building a mining model is part of a larger process that includes everything from asking questions about the data and creating a model to answer those questions, to deploying the model into a working environment. This process can be defined by using the following six basic steps:
To answer these questions, you might have to conduct a data availability study, to investigate the needs of the business users with regard to the available data. If the data does not support the needs of the users, you might have to redefine the project.
Preparing Data:
The second step in the data mining process is to consolidate and clean the data that was identified in the Defining the Problem step. Data can be scattered across a company and stored in different formats, or may contain inconsistencies such as incorrect or missing entries. For example, the data might show that a customer bought a product before the product was offered on the market, or that the customer shops regularly at a store located 2,000 miles from her home. Data cleaning is not just about removing bad data, but about finding hidden correlations in the data, identifying sources of data that are the most accurate, and determining which columns are the most appropriate for use in analysis. For example, should you use the shipping date or the order date? Is the best sales influencer the quantity, total price, or a discounted price? Incomplete data, wrong data, and inputs that appear separate but are in fact strongly correlated can influence the results of the model in ways you do not expect. Therefore, before you start to build mining models, you should identify these problems and determine how you will fix them.
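Two of the inconsistencies described above, missing entries and a purchase dated before the product was on the market, can be flagged with a simple check. The sketch below is illustrative only: the field names, the launch dates, and the `find_problems` helper are hypothetical and not part of any particular tool.

```python
from datetime import date

# Hypothetical product launch dates and order records.
launch_dates = {"P1": date(2020, 3, 1), "P2": date(2021, 6, 15)}
orders = [
    {"order_id": 1, "product": "P1", "order_date": date(2020, 4, 2), "quantity": 2},
    {"order_id": 2, "product": "P2", "order_date": date(2021, 1, 10), "quantity": 1},
    {"order_id": 3, "product": "P1", "order_date": date(2020, 5, 9), "quantity": None},
]


def find_problems(orders, launch_dates):
    """Return (order_id, reason) pairs for records that need review."""
    problems = []
    for order in orders:
        # Flag any field that has no value.
        missing = [key for key, value in order.items() if value is None]
        if missing:
            problems.append((order["order_id"], "missing: " + ", ".join(missing)))
        # Flag orders dated before the product was offered.
        launch = launch_dates.get(order["product"])
        if launch and order["order_date"] and order["order_date"] < launch:
            problems.append((order["order_id"], "ordered before product launch"))
    return problems


problems = find_problems(orders, launch_dates)
for order_id, reason in problems:
    print(order_id, reason)
```

The check only reports suspicious records; deciding whether to correct, drop, or keep them is still a judgment about the business problem.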
Exploring Data:
The third step in the data mining process is to explore the prepared data. You must understand the data in order to make appropriate decisions when you create the mining models. Exploration techniques include calculating the minimum and maximum values, calculating the mean and standard deviation, and looking at the distribution of the data. For example, you might determine by reviewing the maximum, minimum, and mean values that the data is not representative of your customers or business processes, and that you therefore must obtain more balanced data or review the assumptions that are the basis for your expectations. Standard deviations and other distribution values can provide useful information about the stability and accuracy of the results. A large standard deviation can indicate that adding more data might help you improve the model. Data that strongly deviates from a standard distribution might be skewed, or might represent an accurate picture of a real-life problem, but make it difficult to fit a model to the data.
By exploring the data in light of your own understanding of the business problem, you can decide if the dataset contains flawed data, and then you can devise a strategy for fixing the problems or gain a deeper understanding of the behaviors that are typical of your business.
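The exploration statistics mentioned in this step can be computed with Python's standard `statistics` module. The sketch below uses made-up order totals, and the rule of thumb that flags skew (mean more than 1.5 times the median) is an arbitrary assumption chosen for illustration, not an established threshold.

```python
import statistics

# Hypothetical order totals in dollars; the last value is an outlier.
order_totals = [20, 22, 19, 21, 23, 20, 150]

summary = {
    "min": min(order_totals),
    "max": max(order_totals),
    "mean": statistics.mean(order_totals),
    "median": statistics.median(order_totals),
    "stdev": statistics.stdev(order_totals),  # sample standard deviation
}
for name, value in summary.items():
    print(f"{name}: {value:.2f}")

# Assumed rule of thumb: a mean far above the median hints at a skewed distribution.
skewed = summary["mean"] > 1.5 * summary["median"]
print("possibly skewed:", skewed)
```

Here a single large order pulls the mean well above the median, which is the kind of signal that should prompt a closer look at the data before building a model.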
Building Models:
The fourth step in the data mining process is to build the mining model or models. You will use the knowledge that you gained in the Exploring Data step to help define and create the models. You define which data you want to use by creating a mining structure. The mining structure defines the source of data, but does not contain any data until you process it. When you process the mining structure, Analysis Services generates aggregates and other statistical information that can be used for analysis. This information can be used by any mining model that is based on the structure. It is important to remember that whenever the data changes, you must update both the mining structure and the mining model. When you update a mining structure by reprocessing it, Analysis Services retrieves data from the source, including any new data if the source is dynamically updated, and repopulates the mining structure. If you have models that are based on the structure, you can choose to update them, which means they are retrained on the new data, or you can leave the models as is. For more information, see Processing Data Mining Objects.
so many variables. From his existing database of customers, which contains information such as age, sex, credit history, income, zip code, occupation, etc., he can use data mining tools, such as neural networks, to identify the characteristics of those customers who make lots of long distance calls. For instance, he might learn that his best customers are unmarried females between the ages of 34 and 42 who make in excess of $60,000 per year. This, then, is his model for high value customers, and he would budget his marketing efforts accordingly.
Customer segmentation - Grouping customers or clients together, even by their own self-determined characteristics, can allow large organizations to manage marketing campaigns or even just organize their service professionals around similar groupings.
Targeted ads - Marketers use data mining to deliver customized ads online, but organizations always want to know how to tailor any communications to be based on what they already know about their customers or clients.
Forecasting - Time-series analysis takes data from the past, and provides a look into the future, even when there are seasonal increases or declines.
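To make the forecasting item concrete, here is a minimal sketch of one of the simplest seasonal methods, the seasonal naive forecast, which predicts each future period with the value observed one season earlier. The quarterly sales figures and the `seasonal_naive_forecast` helper are hypothetical, and real time-series tools use considerably richer models.

```python
def seasonal_naive_forecast(series, season_length, horizon):
    """Forecast each future period with the value observed one season earlier."""
    if len(series) < season_length:
        raise ValueError("need at least one full season of history")
    start = len(series) - season_length  # first index of the last full season
    forecast = []
    for step in range(horizon):
        # Wrap around the last season when the horizon is longer than a season.
        forecast.append(series[start + step % season_length])
    return forecast


# Hypothetical quarterly sales for two years, with a fourth-quarter peak.
quarterly_sales = [100, 120, 110, 180, 105, 125, 115, 190]
next_year = seasonal_naive_forecast(quarterly_sales, season_length=4, horizon=4)
print(next_year)
```

The forecast repeats the most recent year quarter by quarter, so the fourth-quarter peak carries forward even though no trend is modeled.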
Conclusion:
Data mining is an active research field, and you could spend years reading peer-reviewed articles and textbooks on different aspects of the topic. The field has historically been dominated by academics, and there is much careful thought behind not only the algorithms but also the statistical philosophies of analysis and synthesis. Though I have provided data mining training, and teach at the university level, I consider myself a lifelong student of this topic. You might be or become an important part of that story. I encourage you to share what you know and learn.