INTRODUCTION
Data: Data are any facts, numbers, or text that can be processed by a computer, e.g. sales, cost, inventory, and forecast data.
Information: The patterns, associations, or relationships among all this data can provide information, e.g. analysis of retail point-of-sale transaction data can yield information on which products are selling and when.
Knowledge: Information can be converted into knowledge about historical patterns and future trends. E.g. summary information on retail supermarket sales can be analysed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.
Generally, any of four types of relationships are sought:
Classes: Stored data is used to locate data in groups. E.g. a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order.
Clusters: Data items are grouped according to logical relationships or consumer preferences. E.g. data can be mined to identify market segments or consumer affinities.
Associations:
Sequential Patterns: Data is mined to anticipate behaviour patterns and trends. E.g. an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
Data Mining Consists of Five Major Elements:
Extract, transform, and load transaction data onto the data warehouse system.
Store and manage the data in a multidimensional database system.
Provide data access to business analysts and information technology professionals.
Analyse the data by application software.
Present the data in a useful format, such as a graph or table.
Understand the application domain
Collect and create the target dataset
Clean and transform the target dataset
Select features, reduce dimensions
Apply data mining algorithms
Interpret, evaluate, and visualize patterns
Distributed Data Mining (DDM) is concerned with the application of the classical Data Mining procedure in a distributed computing environment trying to make the best of the available resources (communication network, computing units and databases). Data Mining takes place both locally at each distributed site and at a global level where the local knowledge is fused in order to discover global knowledge.
The first phase normally involves the analysis of the local database at each distributed site. Then, the discovered knowledge is usually transmitted to a merger site, where the integration of the distributed local models is performed. The results are transmitted back to the distributed databases, so that all sites become updated with the global knowledge. When the distributed databases are heterogeneous, the attributes differ among them. In certain applications a key attribute might be present in all of the heterogeneous databases, which allows tuples to be associated across them. In other applications the target attribute for prediction might be common across all distributed databases.
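The local-analysis, merge, and update phases described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sites, their baskets, and the helper names `mine_local` and `merge_models` are hypothetical, the sites are assumed to share the same attributes, and simple item counts stand in for whatever local model a real DDM system would build.

```python
from collections import Counter

def mine_local(transactions):
    """Local phase: count how many baskets at one site contain each item."""
    counts = Counter()
    for basket in transactions:
        counts.update(set(basket))  # count each item once per basket
    return counts

def merge_models(local_models):
    """Merger site: fuse the local counts into one global model."""
    global_model = Counter()
    for model in local_models:
        global_model.update(model)  # Counter.update adds the counts
    return global_model

# Hypothetical data held at two distributed sites.
site_a = [["tent", "stove"], ["tent", "boots"]]
site_b = [["boots", "stove"], ["tent"]]

local_models = [mine_local(site_a), mine_local(site_b)]
global_model = merge_models(local_models)

# Final phase: every site receives a copy of the global knowledge.
sites_updated = [dict(global_model) for _ in local_models]
print(global_model.most_common())
```

Only the compact local models travel over the network here, not the raw transactions, which is the main reason for mining locally before merging.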
One trend noticeable in recent years is the implementation of DDM systems using emerging distributed computing paradigms such as Web services, and the application of DDM algorithms in emerging distributed environments such as mobile networks, sensor networks, grids, and peer-to-peer networks.
Uncertainty handling
Dealing with missing values
Dealing with noisy data
Efficiency of the algorithm used
Constraining the knowledge discovered to only useful or interesting knowledge
Size and complexity of data
Data selection
Understandability of discovered knowledge
Consistency between data and discovered knowledge
Citibank came in third, reporting cyber frauds worth 24 crore, followed by Axis (15.9 crore) and HSBC (13.8 crore).
In July 2013, press reports indicated four Russians and a Ukrainian were indicted in New Jersey for what was called the largest hacking and data breach scheme ever prosecuted in the United States.
Between Nov. 27, 2013 and Dec. 15, 2013, a breach of systems at Target Corporation exposed data from about 40 million credit cards. The information stolen included names, account numbers, expiry dates, and card security codes. From 16 July to 30 October 2013, a hacking attack compromised about a million sets of payment card data stored on computers at Neiman Marcus.
[Figure: percentages by country; surviving labels: France, Indonesia, Sweden, Germany]
Telecom Industry:
Biological Industry: Identifying co-occurring gene sequences and linking genes to different stages of disease development
Scientific Application: Accumulation of huge volumes of high-dimensional data, stream data, and heterogeneous data
Intrusion Detection:
Modeling Strategies
Data mining strategies fall into two broad categories: supervised learning and unsupervised learning.
Supervised learning
Used when there exists a target variable with known values, about which predictions will be made using the values of other variables as input.
Unsupervised learning
Used when there is no target variable with known values, but input variables do exist.
Prediction algorithms determine models or rules to predict continuous or discrete target values for given input data. For example, a prediction problem could attempt to predict the value of the S&P 500 Index, given some input data such as a sudden change in a foreign exchange rate.
Classification algorithms determine models to predict discrete values for given input data. A classification problem might involve trying to determine whether a transaction represents fraudulent behavior based on indicators such as the type of establishment at which the purchase was made, the time of day the purchase was made, and the amount of the purchase.
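As a rough illustration of classification, the sketch below learns a one-rule model (a decision stump) that flags a transaction as fraudulent when its amount exceeds a learned threshold. The labeled history, the choice of purchase amount as the only indicator, and the names `fit_stump` and `predict` are assumptions made for the example; a real fraud model would use many more indicators.

```python
def fit_stump(amounts, labels):
    """Pick the amount threshold that misclassifies the fewest training rows.

    A transaction is predicted fraudulent when amount > threshold.
    """
    best_threshold, best_errors = None, len(labels) + 1
    for threshold in sorted(set(amounts)):
        errors = sum((a > threshold) != y for a, y in zip(amounts, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def predict(threshold, amount):
    """Return True when the model classifies the purchase as fraudulent."""
    return amount > threshold

# Hypothetical labeled history: purchase amount and whether it was fraud.
amounts = [12.0, 25.0, 40.0, 900.0, 1500.0, 30.0]
labels = [False, False, False, True, True, False]

threshold = fit_stump(amounts, labels)
print(threshold, predict(threshold, 1200.0), predict(threshold, 20.0))
```

This is supervised learning in the sense used above: the known fraud labels are the target variable, and the purchase amount is the input.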
Exploration uncovers dimensionality in input data. For example, trying to uncover groups of similar customers based on spending habits for a large, targeted mailing is an exploration problem.
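One common way to uncover such groups is clustering. The sketch below runs a small one-dimensional k-means over monthly spending figures; the data, the function name `kmeans_1d`, and the evenly spaced starting centroids are assumptions chosen to keep the example deterministic, not a recommendation for real data.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Group numbers into k clusters (k >= 2) by repeated reassignment."""
    # Spread the starting centroids evenly between min and max.
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        # Assign every value to its nearest centroid.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Move each centroid to the mean of its group (keep it if empty).
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids, groups

# Hypothetical monthly spend per customer.
spend = [20, 25, 30, 200, 220, 210]
centroids, groups = kmeans_1d(spend)
print(centroids, groups)
```

No target variable is involved: the groups emerge from the input values alone, which is what makes this an unsupervised technique.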
Affinity analysis determines which events are likely to occur in conjunction with one another. Retailers use affinity analysis to analyze product purchase combinations.
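Affinity between two products is often summarized with two ratios: support (the share of baskets containing both items) and confidence (the share of baskets containing the first item that also contain the second). The sketch below computes both over a handful of made-up baskets; the basket contents and the helper names `support` and `confidence` are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets from an outdoor equipment retailer.
baskets = [
    {"sleeping bag", "hiking shoes", "backpack"},
    {"sleeping bag", "backpack"},
    {"hiking shoes", "backpack"},
    {"sleeping bag", "hiking shoes"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    # Sort so each pair is always stored under the same key.
    pair_counts.update(combinations(sorted(basket), 2))

def support(a, b):
    """Fraction of all baskets that contain both a and b."""
    return pair_counts[tuple(sorted((a, b)))] / len(baskets)

def confidence(a, b):
    """Fraction of baskets containing a that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / item_counts[a]

print(support("sleeping bag", "backpack"))
print(confidence("sleeping bag", "backpack"))
```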
Predictive modeling
Predictive modeling is the use of statistical and mathematical techniques to discover patterns in data in order to make predictions.
Optimize website profitability by making appropriate offers to each visitor.
Predict customer response rates in marketing campaigns.
Define new customer groups for marketing purposes.
Predict customer defections: which customers are likely to switch to an alternative supplier in the near future.
Distinguish between profitable and unprofitable customers.
Improve yields in complex production processes by finding unexpected relationships between process parameters and defect rates.
Identify "wedge issues" and target political campaigns.
Identify suspicious (unusual) behavior, as part of a fraud detection process.
Quick and accurate access to useful information lets companies concentrate on decision making and other important processes, and this is what has made data mining so effective and popular. Industries and organizations use data mining for purposes such as identifying market trends, industry research, sales promotion, competitor analysis, and medical research.
Retailers can learn useful and accurate trends about their customers and their purchasing behavior. This knowledge can be used to market products more effectively, attract targeted customers, develop products that customers will like, manage supermarket shelf space better, introduce coupons or discount offers on certain products, increase sales, set pricing strategies, and so on.
Retailers and other organizations that focus on customer satisfaction often adopt data mining techniques. In law enforcement, data mining helps identify criminal suspects by analyzing the crime types, behaviour, and habits of criminals who are already on record. In healthcare, data mining techniques are used to identify certain diseases and to decide which treatment methods are effective.
Financial institutions such as banks and credit companies use data mining to identify fraudulent customers, detect fraudulent medical claims, manage risk, and so on. Weather forecasting is an area where data mining is widely used. Data mining also plays a major role in the identification, classification, and age determination of celestial objects. In the development of new medicines, data mining is used to predict the effectiveness of the medicines being developed.
Duplicate records, missing data values, unneeded data fields, a lack of proper data standards, a lack of timely data updates, and human errors can all affect the quality of data and thus the data mining process. The data cleaning process therefore includes removing duplicate records, entering appropriate values for missing fields (for example 0 rather than null), removing unneeded data fields, identifying and removing logically wrong values (such as 200 as an age or 01/01/1100 as a birth date), standardizing data formats, and updating data fields in a timely manner.
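Several of the cleaning steps listed above can be sketched in a few lines of Python. The record layout, the field names, the age limit of 120, and the function name `clean` are all assumptions made for illustration; which fields to keep and what counts as a logically wrong value depend on the dataset.

```python
def clean(records, keep_fields=("name", "age", "amount"), max_age=120):
    """Apply a few basic cleaning steps and return the cleaned rows."""
    cleaned, seen = [], set()
    for record in records:
        # Remove unneeded data fields.
        row = {f: record.get(f) for f in keep_fields}
        # Standardize formats: trim whitespace and normalize name casing.
        row["name"] = row["name"].strip().title()
        # Enter 0 for a missing amount instead of leaving it null.
        if row["amount"] is None:
            row["amount"] = 0
        # Remove logically wrong values such as an age of 200.
        if row["age"] is not None and not 0 <= row["age"] <= max_age:
            row["age"] = None
        # Remove duplicate records (compared after standardization).
        key = tuple(row[f] for f in keep_fields)
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned

# Hypothetical raw customer records.
raw = [
    {"name": "ann lee ", "age": 34, "amount": 120.0, "fax": "n/a"},
    {"name": "Ann Lee", "age": 34, "amount": 120.0, "fax": None},
    {"name": "Bob Roy", "age": 200, "amount": None, "fax": None},
]
rows = clean(raw)
print(rows)
```

Duplicates are checked after standardization so that records differing only in spacing or capitalization are still recognized as the same.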
Interoperability is another major data mining issue. Because data may be collected from heterogeneous sources, the types of data differ, and it would be practically impossible to standardize all of them. Different databases and data mining software need to be interoperable so that data can be analysed and summarized correctly to make the best use of data mining. Suppose a government launches an initiative to share information across its departments in order to improve interdepartmental collaboration.
Because the existing databases of the different departments would be of different types, such a project would have to overcome interoperability issues. Because large amounts of private and sensitive information about companies and individuals have to be stored and used for data mining activities, security and privacy have become major issues to be addressed before data mining becomes completely mature.
This could also lead to illegal access to confidential data and to the disclosure of implicit details that individuals or companies do not want revealed. Selecting the correct data mining method is very important for getting correct results. Performance issues must also be resolved, since good performance is a key expectation of any data mining system.
THANK YOU