
Q1. Explain the meaning of data cleaning and data formatting.

Ans: Data cleaning. This step complements the previous one. It is also the most time consuming, because many techniques can be applied to optimize data quality for the subsequent modelling stage. Possible techniques for data cleaning include:
o Data normalization, for example decimal scaling into the range (0, 1), or standard deviation normalization.
o Data smoothing. Discretization of numeric attributes is one example; this is helpful or even necessary for logic-based methods.
o Treatment of missing values. There is no simple and safe solution for cases where some attributes have a significant number of missing values. Generally, it is good to experiment with and without these attributes in the modelling phase, in order to find out how important the missing values are.

Data formatting. This is the final data preparation step. It covers syntactic modifications to the data that do not change its meaning but are required by the particular modelling tool chosen for the data mining task. These include:
o Reordering of attributes or records: some modelling tools require the attributes (or records) in the dataset to be reordered, e.g. putting the target attribute at the beginning or at the end, or randomizing the order of records (required by neural networks, for example).
o Changes related to the constraints of modelling tools: removing commas, tabs or special characters, trimming strings to the maximum allowed number of characters, or replacing special characters with an allowed set of special characters.

Q2. What is metadata? Explain the various purposes for which metadata is used.

Ans: Metadata is data about data. Since the data in a data warehouse is both voluminous and dynamic, it needs constant monitoring. This can be done only if a separate set of data about the data is stored; this is the purpose of metadata. Metadata is useful for data transformation and loading, data management and query generation. Metadata, by definition, is data that describes the data. In simple terms, the data warehouse contains data that describes different situations, but there should also be some data that gives details about the data stored in the warehouse; this data is metadata. A few commonly used metadata functions for each of these purposes are discussed later. Metadata, apart from other things, is used for the following purposes:
1. data transformation and loading
2. data management
3. query generation

Q3. Write the steps in designing fact tables.

Ans: DESIGNING OF FACT TABLES. The methods listed earlier, when iterated repeatedly, help to finally arrive at the set of entities that go into a fact table. The next question is how big a fact table can be. An answer could be that it should be big enough to store all the facts, while still keeping the task of collecting data from the table reasonably fast. Obviously, this depends on the hardware architecture as well as the design of the database. A suitable hardware architecture can ensure that the cost of collecting data is reduced by the inherent capability of the hardware; on the other hand, the database design should ensure that whenever data is asked for, the time needed to search for it is minimal. In other words, the designer should be able to balance the value of the information made available by the database against the cost of making that data available to the user. A larger database obviously stores more detail, so it is definitely useful, but the cost of storing it, as well as the cost of searching and evaluating it, becomes higher.
Technologically, there is perhaps no limit on the size of the database. How does one optimize the cost-benefit ratio? There are no standard formulae, but the following points can be taken note of.
i. Understand the significance of the data stored with respect to time. Only data that is still needed for processing needs to be stored. For example, customer details may become irrelevant after a period of time, and salary details paid in the 1980s may be of little use in analyzing the employee cost of the 21st century. As and when data becomes obsolete, it can be removed.
ii. Find out whether statistical samples of each of the subsets could be maintained instead of storing the entire data. For example, instead of storing the sales details of all 200 towns over the last 5 years, one can store the details of 10 smaller towns, five metros, 10 bigger cities and 20 villages. After all, data warehousing is most often used to obtain trends rather than exact figures, and the individual details can always be extrapolated from such subsets.

Q3. List and explain the aspects to be looked into while designing the summary tables.

Ans: ASPECTS TO BE LOOKED INTO WHILE DESIGNING THE SUMMARY TABLES. The main purpose of using summary tables is to cut down the time taken to execute a specific query. The main methodology involves minimizing the volume of data being scanned each time the query is to be answered; in other words, partial answers to the query are made available in advance. For example, in the earlier example of the mobile market, if one expects that i) citizens above 18 years of age, ii) with salaries greater than 15,000, and iii) with professions that involve travelling are the potential customers, then every time the query is processed (maybe every month or every quarter) one would have to scan the entire database to compute these values and then combine them suitably to get the relevant answers. The other method is to prepare summary tables, which hold the values pertaining to each of these sub-queries, beforehand, and then combine them as and when the query is raised. It can be noted that the summaries can be prepared in the background (or when relatively few queries are running) and only the final aggregation is done on the fly. Summary tables are designed by following the steps given below (a small sketch follows this list):
i) Decide the dimensions along which aggregation is to be done.
ii) Determine the aggregation of multiple facts.
iii) Aggregate multiple facts into the summary table.
iv) Determine the level of aggregation and the extent of embedding.
v) Design time into the table.
vi) Index the summary table.
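As an illustration of steps i) to iii), here is a minimal sketch of how a summary table could be pre-computed with pandas. The table and the column names (region, month, units_sold, revenue) are hypothetical and not taken from the text.

```python
# Minimal sketch: pre-computing a summary table with pandas.
# The columns region, month, units_sold and revenue are made up for illustration.
import pandas as pd

sales = pd.DataFrame({
    "region":     ["north", "north", "south", "south", "south"],
    "month":      ["2024-01", "2024-02", "2024-01", "2024-01", "2024-02"],
    "units_sold": [120, 95, 200, 180, 160],
    "revenue":    [2400.0, 1900.0, 4100.0, 3600.0, 3300.0],
})

# Steps i)-iii): choose the aggregation dimensions, decide how each fact is
# aggregated, and materialise the result as a summary table.
summary = (
    sales.groupby(["region", "month"], as_index=False)
         .agg(total_units=("units_sold", "sum"),
              total_revenue=("revenue", "sum"))
)
print(summary)
```

A query about monthly revenue per region would then scan the small summary table rather than the full fact table.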

Q4. Explain the role of access control issues in data mart design.

Ans: ROLE OF ACCESS CONTROL ISSUES IN DATA MART DESIGN. This is one of the major constraints in data mart design. Any data warehouse, with its huge volume of data, is more often than not subject to various access controls as to who can access which part of the data. The easiest case is where the data is partitioned so clearly that a user of one partition cannot access any other data. In such cases, each partition can be put in a data mart and its user can access only his own data. In the data warehouse, the data pertaining to all these marts is stored, but the partitioning is retained. If a super user wants an overall view of the data, suitable aggregations can be generated. However, in certain other cases the demarcation may not be so clear. In such cases, a judicious analysis of the privacy constraints is needed so that the privacy of each data mart is maintained.

Data marts, as described in the previous sections, can be designed based on several splits noticeable either in the data, in the organization or in privacy laws. They may also be designed to suit the user access tools. In the latter case, there is not much choice available for the design parameters. In the other cases, it is always desirable to design the data mart to suit the design of the warehouse itself. This helps to maintain maximum control over the database instances, by ensuring that the same design is replicated in each of the data marts. Similarly, the summary information of each data mart can be a smaller replica of the summary of the data warehouse itself.

Q5. List the applications and reasons for the growing popularity of data mining.

Ans: REASONS FOR THE GROWING POPULARITY OF DATA MINING
a) Growing Data Volume: The main reason for the necessity of automated computer systems for intelligent data analysis is the enormous volume of existing and newly appearing data that requires processing. The amount of data accumulated each day by various business, scientific and governmental organizations around the world is daunting. It becomes impossible for human analysts to cope with such overwhelming amounts of data.
b) Limitations of Human Analysis: Two other problems that surface when human analysts process data are the inadequacy of the human brain when searching for complex multifactor dependencies in data, and the lack of objectiveness in such an analysis. A human expert is always a hostage of the previous experience of investigating other systems. Sometimes this helps, sometimes this hurts, but it is almost impossible to get rid of this fact.
c) Low Cost of Machine Learning: One additional benefit of using automated data mining systems is that this process has a much lower cost than hiring many highly trained professional statisticians. While data mining does not eliminate human participation in solving the task completely, it significantly simplifies the job and allows an analyst who is not a professional in statistics and programming to manage the process of extracting knowledge from data.

Q6. What is data mining? What kind of data can be mined?

Ans: There are many definitions of data mining; a few important ones are given below. Data mining refers to extracting or mining knowledge from large amounts of data. Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules.

WHAT KIND OF DATA CAN BE MINED?
In principle, data mining is not specific to one type of media or data; it should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data, and indeed the challenges presented by different types of data vary significantly. Data mining is being put into use and studied for databases, including relational databases, object-relational databases and object-oriented databases, data warehouses, transactional databases, unstructured and semi-structured repositories such as the World Wide Web, advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases, and even flat files. Here are some examples in more detail:

Flat files: Flat files are actually the most common data source for data mining algorithms, especially at the research level. Flat files are simple data files in text or binary format with a structure known to the data mining algorithm to be applied. The data in these files can be transactions, time-series data, scientific measurements, etc.

Relational databases: A relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples.

Q7. Give the top-level syntax of the data mining query language DMQL.

Ans: A data mining language helps in effective knowledge discovery from data mining systems. Designing a comprehensive data mining language is challenging because data mining covers a wide spectrum of tasks, from data characterization to mining association rules, data classification and evolution analysis, and each task has different requirements. The design of an effective data mining query language therefore requires a deep understanding of the power, limitations and underlying mechanisms of the various kinds of data mining tasks.

Q8. Explain the meaning of data mining with the Apriori algorithm.

Ans: Apriori-based data mining discovers items that are frequently associated together. Consider the example of a store that sells DVDs, videos, CDs, books and games. The store owner might want to discover which of these items customers are likely to buy together; this can be used to increase the store's cross-sell and upsell ratios. Customers in this particular store may like buying a DVD and a game in 10 out of every 100 transactions, while the sale of videos may hardly ever be associated with the sale of a DVD. With this information, the store could strive for better placement of DVDs and games, since the sale of one of them may improve the chances of the sale of the other frequently associated item. On the other hand, mailing campaigns may be fine-tuned to reflect the fact that offering discount coupons on videos may even negatively impact the sales of DVDs offered in the same campaign; a better decision could be not to offer both DVDs and videos in one campaign. To arrive at these decisions, the store may have had to analyze 10,000 past transactions of customers, using calculations that separate frequent and consequently important associations from weak and unimportant ones.
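The following is a minimal, illustrative sketch of the first two Apriori passes described above: counting frequent single items and then frequent pairs against a minimum support threshold. The transactions and the threshold are made up; a full Apriori implementation would continue to larger itemsets with candidate generation and pruning.

```python
# Minimal sketch of the first two Apriori passes: count frequent items and
# frequent pairs against a minimum support threshold. Transactions are made up.
from itertools import combinations
from collections import Counter

transactions = [
    {"DVD", "Game"}, {"DVD", "Game", "CD"}, {"Video", "Book"},
    {"DVD", "Game"}, {"CD", "Book"}, {"DVD", "CD"},
]
min_support = 2  # absolute count, chosen arbitrarily for this toy example

# Pass 1: frequent single items.
item_counts = Counter(item for t in transactions for item in t)
frequent_items = {i for i, c in item_counts.items() if c >= min_support}

# Pass 2: candidate pairs are built only from frequent items (the Apriori
# property: a pair can be frequent only if both of its members are frequent).
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t & frequent_items), 2):
        pair_counts[pair] += 1
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}

print(frequent_items)
print(frequent_pairs)   # e.g. {('DVD', 'Game'): 3, ('CD', 'DVD'): 2}
```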

Q9. Explain the working principle of decision trees used for data mining.

Ans: DATA MINING WITH DECISION TREES. Decision trees are powerful and popular tools for classification and prediction. The attractiveness of tree-based methods is due in large part to the fact that they are simple and that decision trees represent rules. Rules can readily be expressed so that humans can understand them, or in a database access language like SQL, so that records falling into a particular category may be retrieved. In some applications, the accuracy of a classification or prediction is the only thing that matters; if a direct mail firm obtains a model that can accurately predict which members of a prospect pool are most likely to respond to a certain solicitation, it may not care how or why the model works.

Decision tree working concept: A decision tree is a classifier in the form of a tree structure where each node is either a leaf node, indicating a class of instances, or a decision node that specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test. A decision tree can be used to classify an instance by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.

Example: decision making in the Bombay stock market. Suppose that the major factors affecting the Bombay stock market are: what it did yesterday; what the New Delhi market is doing today; the bank interest rate; the unemployment rate; India's prospects at cricket.

Q10. What is Bayes' theorem? Explain the working procedure of the Bayesian classifier.

Ans: Bayes' Theorem. Let X be a data sample whose class label is unknown. Let H be some hypothesis, such as that the data sample X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the observed data sample X. P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose the world of data samples consists of fruits, described by their color and shape. Suppose that X is red and round, and that H is the hypothesis that X is an apple. Then P(H|X) reflects our confidence that X is an apple given that we have seen that X is red and round. In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability that any given data sample is an apple, regardless of how the data sample looks. The posterior probability P(H|X) is based on more information (such as background knowledge) than the prior probability P(H), which is independent of X. Similarly, P(X|H) is the posterior probability of X conditioned on H; that is, the probability that X is red and round given that we know X is an apple. P(X) is the prior probability of X. P(X), P(H) and P(X|H) may be estimated from the given data. Bayes' theorem is useful in that it provides a way of calculating the posterior probability P(H|X) from P(H), P(X) and P(X|H):

P(H|X) = P(X|H) P(H) / P(X)

Q11. Explain how neural networks can be used for data mining.

Ans: A neural processing element receives inputs from other connected processing elements. These input signals or values pass through weighted connections, which either amplify or diminish the signals.
Inside the neural processing element, all of these input signals are summed together to give the total input to the unit. This total input value is then passed through a mathematical function to produce an output or decision value ranging from 0 to 1. Notice that this is a real-valued (analog) output, not a digital 0/1 output. If the input signal matches the connection weights exactly, the output is close to 1; if the input signal totally mismatches the connection weights, the output is close to 0. Varying degrees of similarity are represented by the intermediate values. Of course, we can force the neural processing element to make a binary (1/0) decision, but by using analog values ranging between 0.0 and 1.0 as the outputs, we retain more information to pass on to the next layer of neural processing units. In a very real sense, neural networks are analog computers. Each neural processing element acts as a simple pattern recognition machine: it checks the input signals against its memory traces (connection weights) and produces an output signal that corresponds to the degree of match between those patterns. In typical neural networks, there are hundreds of neural processing elements whose pattern recognition and decision-making abilities are harnessed together to solve problems.

Backpropagation: Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class. These modifications are made in the backwards direction, that is, from the output layer, through each hidden layer, down to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in general the weights eventually converge and the learning process stops. The main steps are summarized below.

Initialize the weights: The weights in the network are initialized to small random numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated with it, and the biases are similarly initialized to small random numbers. Each training sample, X, is then processed by a forward pass followed by a backward propagation of the error.
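The following is a minimal sketch of the steps just described for a single training sample: a tiny 2-2-1 network with sigmoid units, weights initialised to small random numbers, one forward pass and one backward weight update. The network size, learning rate and data values are arbitrary choices for illustration, not from the text.

```python
# Minimal sketch of one backpropagation update for a 2-2-1 sigmoid network.
# Sizes, data and the learning rate are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Initialise weights and biases to small random numbers (e.g. -0.5 to 0.5).
W1 = rng.uniform(-0.5, 0.5, size=(2, 2)); b1 = rng.uniform(-0.5, 0.5, size=2)
W2 = rng.uniform(-0.5, 0.5, size=(2, 1)); b2 = rng.uniform(-0.5, 0.5, size=1)

x = np.array([0.2, 0.9])   # one training sample
target = np.array([1.0])   # its known class label
lr = 0.5                   # learning rate

# Forward pass: propagate the inputs through the hidden and output layers.
h = sigmoid(x @ W1 + b1)
y = sigmoid(h @ W2 + b2)

# Backward pass: error terms for the output and hidden units, then update the
# weights and biases in the backwards direction (output layer first).
delta_out = (target - y) * y * (1 - y)
delta_hid = h * (1 - h) * (W2 @ delta_out)
W2 += lr * np.outer(h, delta_out); b2 += lr * delta_out
W1 += lr * np.outer(x, delta_hid); b1 += lr * delta_hid

print("prediction after one update:", sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2))
```

In training, this update is repeated over all samples for many passes until the weights stop changing appreciably.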

K-means algorithm: This algorithm takes as input a predefined number of clusters, the k of its name. "Means" stands for an average: the average location of all the members of a particular cluster. When dealing with clustering techniques, one has to adopt the notion of a high-dimensional space, a space whose orthogonal dimensions are all the attributes from the table of data being analyzed. The value of each attribute of an example represents the distance of the example from the origin along that attribute axis. The coordinates of a centroid are the averages of the attribute values of all examples that belong to the cluster. The steps of the K-means algorithm are given below (a short sketch in code appears after Q16 below):
1. Select randomly k points (they can also be examples) to be the seeds for the centroids of k clusters.
2. Assign each example to the centroid closest to the example, forming in this way k exclusive clusters of examples.
3. Calculate new centroids of the clusters; for that purpose, average all attribute values of the examples belonging to the same cluster (centroid).

Q12. Explain the star-flake schema in detail.

Ans: STAR-FLAKE SCHEMAS. One of the key factors for a database designer is to ensure that the database can answer all types of queries, even those not initially visualized by the developer. To do this, it is essential to understand how the data within the database is used. In a decision support system, which is basically what a data warehouse is supposed to provide, a large number of different questions are asked about the same set of facts. For example, given sales data, questions like the following can be asked:
i) What is the average sales quantum of a particular item?
ii) Which were the most popular brands in the last week?
iii) Which item has the least turnaround time?
iv) How many customers returned to procure the same item within one month?
They are all based on the sales data, but the way of viewing the data to answer each question is different; the answers need to be given by rearranging or cross-referencing different facts.

Q13. Explain the method for designing dimension tables.

Ans: DESIGNING DIMENSION TABLES. After the fact tables have been designed, it is essential to design the dimension tables. The design of dimension tables need not be considered a critical activity, though a good design helps in improving performance. It is also desirable to keep the volumes relatively small, so that the restructuring cost will be less. Some commonly used dimensions are discussed below.

Star dimensions: They speed up query performance by denormalising reference information into a single table. They presume that the bulk of incoming queries analyze the facts by applying a number of constraints to singly dimensioned data. For example, the details of sales from a store can be stored in horizontal rows, and a query selects one or a few of the attributes. Suppose a cloth store stores details of its sales one below the other, and questions like "how many white shirts of size 85 were sold in one week?" are asked. All that the query has to do is apply the relevant constraints to get the information. This technique works well in situations where there are a number of entities, all related to the key dimension entity.

Q14. Explain horizontal partitioning briefly.

Ans: Needless to say, the data warehouse design process should try to maximize the performance of the system. One way to ensure this is to optimize the database design with respect to a specific hardware architecture.
Obviously, the exact details of optimization depend on the hardware platform. Normally the following guidelines are useful:
i. Maximize the use of processing power, disk and I/O operations.
ii. Reduce bottlenecks at the CPU and I/O.
The following mechanisms are handy.

Maximising the processing and avoiding bottlenecks: One way of ensuring faster processing is to split a data query into several parallel queries, convert them into parallel threads and run them in parallel. This method works only when there are a sufficient number of processors, or sufficient processing power, to ensure that the threads can actually run in parallel. (Note that to run five threads it is not always necessary to have five processors; even a smaller number of processors can do the job, provided they are fast enough to avoid bottlenecks at the processor.)

Normalisation: The usual approach to normalization in database applications is to ensure that the data is divided into two or more tables, such that when the data in one of them is updated, it does not lead to anomalies in the data (the student is advised to refer to any book on database management systems for details). The idea is to ensure that, when combined, the available data is consistent. However, in data warehousing one may even tend to break a large table into several denormalized smaller tables. This may use a lot of extra space, but it helps in an indirect way: it avoids the overhead of joining the data during queries.

Q16. Explain the need for data marts in detail.

Ans: THE NEED FOR DATA MARTS. In a crude sense, if you consider a data warehouse a store house of data, a data mart is a retail outlet of data. Searching for any data in a huge store house is difficult, but if the data is available, you should positively be able to get it. On the other hand, in a retail outlet, since the volume to be searched is small, you can access the data fast, but it is possible that the data you are searching for is not available there, in which case you have to go back to the main store house to search for it. Coming back to technical terminology, the following are the reasons for which data marts are created:
i) Since the volume of data scanned is small, they speed up query processing.
ii) Data can be structured in a form suitable for a user access tool.
iii) Data can be segmented or partitioned so that it can be used on different platforms, and different control strategies become applicable.
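As promised earlier, here is a minimal sketch of the three K-means steps listed under the K-means algorithm above. The data, the value of k and the iteration cap are all made up, and empty clusters are not handled, so this is an illustration rather than a robust implementation.

```python
# Minimal sketch of the K-means steps: seed centroids, assign points to the
# nearest centroid, recompute centroids. Data is random; no empty-cluster handling.
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(30, 2))          # 30 examples with 2 attributes
k = 3

# Step 1: pick k examples at random as the initial centroids.
centroids = points[rng.choice(len(points), size=k, replace=False)]

for _ in range(10):                        # repeat steps 2 and 3 until stable
    # Step 2: assign each example to its closest centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: recompute each centroid as the mean of its members.
    new_centroids = np.array([points[labels == c].mean(axis=0) for c in range(k)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)
print(centroids)
```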


3. IDENTIFY THE ACCESS TOOL REQUIREMENTS. Data marts are required to support internal data structures that support the user access tools. Data within those structures is not actually controlled by the warehouse, but it is rearranged and updated by the warehouse. This arrangement (called populating of data) is made to suit the existing requirements of data analysis. While the requirements are few and uncomplicated, any populating method may be suitable, but as the demands increase (as happens over a period of time) the populating methods should match the tools used. As a rule, this rearrangement (or populating) is to be done by the warehouse after acquiring the data from the source. In other words, the data received from the source should not directly be arranged in the form of structures needed by the access tools, because each piece of data is likely to be used by several access tools which need different populating methods, and additional requirements may come up later. Hence each data mart is to be populated from the warehouse based on the access tool requirements.

Q17. Explain the data warehouse process managers in detail.

Ans: DATA WAREHOUSE PROCESS MANAGERS. These are responsible for the smooth flow, maintenance and upkeep of data into and out of the database. The main types of process managers are the load manager, the warehouse manager and the query manager. We shall look into each of them briefly (a minimal structural sketch appears after Q18 below).

Load manager: This is responsible for any data transformations and for loading data into the database. It should effect data source interaction, data transformation and data load. The actual complexity of each of these modules depends on the size of the database. It should be able to interact with the source systems to verify the received data. This is a very important aspect: any improper operation leads to invalid data affecting the entire warehouse. This is normally achieved by making the source and data warehouse systems compatible.

Warehouse manager: The warehouse manager is responsible for maintaining the data of the warehouse. It should also create and maintain a layer of metadata. Some of the responsibilities of the warehouse manager are:
o Data movement
o Metadata management
o Performance monitoring
o Archiving
Data movement includes the transfer of data within the warehouse, aggregation, and the creation and maintenance of tables, indexes and other objects of importance. It should be able to create new aggregations as well as remove old ones. Creation of additional rows/columns, keeping track of the aggregation processes and creating metadata are also its functions.

Query manager: Last, but not of any less importance, is the query manager. Its main responsibilities include control of the following:
o Users' access to data
o Query scheduling
o Query monitoring
These jobs are varied in nature and have not been automated as yet. The main job of the query manager is to control the users' access to data and to present the data resulting from query processing in a format suitable to the user. The raw data, often from different sources, needs to be compiled into a format suitable for querying. The query manager has to act as a mediator between the user on one hand and the metadata on the other.

Q18. Explain the data warehouse delivery process in detail.
Ans: THE DATA WAREHOUSE DELIVERY PROCESS. This section deals with the data warehouse from a different viewpoint: how the different components that go into it enable the building of a data warehouse. The study helps us in two ways:
i) to have a clear view of the data warehouse building process;
ii) to understand the working of the data warehouse in the context of its components.
Now we look at the concepts in detail.

i. IT strategy: The company must have an overall IT strategy, and data warehousing has to be a part of that overall strategy. This not only ensures that adequate backing in terms of data and investment is available, but also helps in integrating the warehouse into the strategy. In other words, a data warehouse cannot be visualized in isolation.

ii. Business case analysis: This looks like an obvious thing, but is most often misunderstood. An overall understanding of the business and the importance of its various components is a must. This ensures that one can clearly justify the appropriate level of investment that goes into the data warehouse design, and also the amount of returns accruing. Unfortunately, in many cases the returns from the warehousing activity are not quantifiable; at the end of the year, one cannot make statements of the sort "I have saved / generated 2.5 crore Rs. because of data warehousing". A data warehouse affects the business and strategy plans indirectly, giving scope for undue expectations on one hand and total neglect on the other. Hence it is essential that the designer has a sound understanding of the overall business and of the scope for his concept (the data warehouse) in the project, so that he can answer the probing questions.
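Returning to the process managers of Q17, the following is a minimal structural sketch of how their responsibilities could be organised in code. Every class and method name here is illustrative only; none of it is taken from the text or from any real product.

```python
# Minimal structural sketch of the three process managers described in Q17.
# All names are hypothetical; method bodies are intentionally left empty.
class LoadManager:
    """Interacts with source systems, transforms data, and loads it."""
    def extract(self, source): ...
    def transform(self, rows): ...
    def load(self, rows): ...

class WarehouseManager:
    """Moves data within the warehouse, maintains aggregations and metadata."""
    def create_aggregation(self, dimensions): ...
    def archive(self, older_than): ...
    def update_metadata(self, entry): ...

class QueryManager:
    """Controls user access, schedules and monitors queries."""
    def authorize(self, user, table): ...
    def schedule(self, query): ...
    def monitor(self, query_id): ...
```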

Q19. Briefly explain the system management tools.

Ans: SYSTEM MANAGEMENT TOOLS. The most important jobs done by this class of managers include the following:
1. Configuration managers
2. Schedule managers
3. Event managers
4. Database managers
5. Backup recovery managers
6. Resource and performance monitors
We shall look into the working of the first five classes, since the last type of manager is less critical in nature.

Configuration manager: This tool is responsible for setting up and configuring the hardware. Since several types of machines are being addressed, several aspects like machine configuration, compatibility etc. are to be taken care of, as well as the platform on which the system operates.

Schedule manager: Scheduling is the key to successful warehouse management. Almost all operations in the warehouse need some type of scheduling. Every operating system has its own scheduler and batch control mechanism, but these schedulers may not be capable of fully meeting the requirements of a data warehouse.

Event manager: An event is defined as a measurable, observable occurrence of a defined action. If this definition seems quite vague, it is because it encompasses a very large set of operations. The event manager is a software component that continuously monitors the system for the occurrence of an event and then takes any action that is suitable (note that the event is a measurable and observable occurrence). The action to be taken is also normally specific to the event.

Database manager: The database manager normally also has a separate (and often independent) system manager module. The purpose of these managers is to automate certain processes and simplify the execution of others. Some of the operations are:
o Ability to add/remove users
o User management
o Manipulation of user quotas
o Assigning and de-assigning user profiles

Q20. What is a schema? Distinguish between facts and dimensions.

Ans: Schema: A schema, by definition, is a logical arrangement of facts that facilitates ease of storage and retrieval, as described by the end users. The end user is not bothered about the overall arrangement of the data or the fields in it. For example, a sales executive trying to project the sales of a particular item is interested only in the sales details of that item, whereas a tax practitioner looking at the same data is interested only in the amounts received by the company and the profits made.

Distinguish between facts and dimensions: The star schema looks a good solution to the problem of warehousing. It simply states that one should identify the facts, store them in the read-only area, and have the dimensions surround that area. Whereas the dimensions are liable to change, the facts are not. But given a set of raw data from the sources, how does one identify the facts and the dimensions? It is not always easy, but the following steps can help in that direction (a small illustration appears after Q22 below):
i) Look for the fundamental transactions in the entire business process. These basic entities are the facts.
ii) Find out the important dimensions that apply to each of these facts. They are the candidates for dimension tables.
iii) Ensure that the facts do not include candidates that are actually dimensions, with a set of facts attached to them.
iv) Ensure that the dimensions do not include candidates that are actually facts.

Q21. Explain how to categorize data mining systems.

Ans: CATEGORIZING DATA MINING SYSTEMS. There are many data mining systems available or being developed.
Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria; among other classifications are the following:
a) Classification according to the type of data source mined: this categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, the World Wide Web, etc.
b) Classification according to the data model drawn on: this categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional database, etc.
c) Classification according to the kind of knowledge discovered: this categorizes data mining systems based on the kind of knowledge discovered or the data mining functionalities, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive systems offering several data mining functionalities together.

Q22. A DATA MINING QUERY LANGUAGE

A data mining query language provides the necessary primitives that allow users to communicate with data mining systems. However, novice users may find a data mining query language difficult to use and its syntax difficult to remember. Instead, users may prefer to communicate with data mining systems through a graphical user interface (GUI). In relational database technology, SQL serves as a standard core language for relational systems, on top of which GUIs can easily be designed. Similarly, a data mining query language may serve as a core language for data mining system implementations, providing a basis for the development of GUIs for effective data mining. A data mining GUI may consist of the following functional components:
a) Data collection and data mining query composition: this component allows the user to specify task-relevant data sets and to compose data mining queries. It is similar to GUIs used for the specification of relational queries.
b) Presentation of discovered patterns: this component allows the display of the discovered patterns in various forms, including tables, graphs, charts, curves and other visualization techniques.
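To make Q20's distinction between facts and dimensions concrete, here is a small, hypothetical star-schema illustration: sales transactions live in a fact table that holds keys into product and store dimension tables, and a typical query joins and aggregates them. All table and column names are invented for the example.

```python
# Small illustration of facts vs dimensions: a sales fact table with keys into
# product and store dimension tables. Names and values are hypothetical.
import pandas as pd

product_dim = pd.DataFrame({"product_id": [1, 2], "brand": ["A", "B"]})
store_dim   = pd.DataFrame({"store_id": [10, 11], "city": ["Pune", "Delhi"]})
sales_fact  = pd.DataFrame({
    "product_id": [1, 1, 2], "store_id": [10, 11, 10],
    "quantity": [5, 3, 7], "amount": [500.0, 300.0, 350.0],
})

# A typical query joins the fact table to the dimensions it needs and aggregates.
report = (sales_fact.merge(product_dim, on="product_id")
                    .merge(store_dim, on="store_id")
                    .groupby(["brand", "city"], as_index=False)["amount"].sum())
print(report)
```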

Q24. With the help of a block diagram, explain the typical process flow in a data warehouse.

Ans: TYPICAL PROCESS FLOW IN A DATA WAREHOUSE. Any data warehouse must support the following activities:
i) populating the warehouse (i.e. inclusion of data);
ii) day-to-day management of the warehouse;
iii) the ability to accommodate changes.
The processes that populate the warehouse have to be able to extract the data, clean it up, and make it available to the analysis systems. This is done on a daily or weekly basis, depending on the quantum of data to be incorporated. The day-to-day management of the data warehouse is not to be confused with maintenance and management of hardware and software: when large amounts of data are stored and new data is continually added at regular intervals, maintaining the quality of the data becomes an important element. The ability to accommodate changes implies that the system is structured in such a way that it can cope with future changes without the entire system being remodelled. Based on these, we can view the processes that a typical data warehouse scheme should support as follows.

Q25. How does naive Bayesian classification work?

Ans: Naive Bayesian Classification. The naive Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Each data sample is represented by an n-dimensional feature vector, X = (x1, x2, ..., xn), depicting n measurements made on the sample from n attributes, respectively A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given an unknown data sample X (i.e., having no class label), the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, the naive Bayesian classifier assigns an unknown sample X to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 <= j <= m, j != i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum a posteriori hypothesis. By Bayes' theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X).
3. As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized. If the class prior probabilities are not known, it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci); otherwise, we maximize P(X|Ci) P(Ci). Note that the class prior probabilities may be estimated by P(Ci) = si/s, where si is the number of training samples of class Ci and s is the total number of training samples.

Training Bayesian Belief Networks: In the learning or training of a belief network, a number of scenarios are possible. The network structure may be given in advance or inferred from the data. The network variables may be observable or hidden in all or some of the training samples. The case of hidden data is also referred to as missing values or incomplete data. If the network structure is known and the variables are observable, then training the network is straightforward: it consists of computing the CPT entries, as is similarly done when computing the probabilities involved in naive Bayesian classification.

Neural Network Topologies: The arrangement of neural processing units and their interconnections can have a profound impact on the processing capabilities of a neural network. In general, all neural networks have some set of processing units that receive inputs from the outside world, which we refer to appropriately as the input units.
Many neural networks also have one or more layers of hidden processing units that receive inputs only from other processing units. A layer or slab of processing units receives a vector of data or the outputs of a previous layer of units and processes them in parallel. The set of processing units that represents the final result of the neural network computation is designated as the output units. There are three major connection topologies that define how data flows between the input, hidden and output processing units.

Backpropagation: Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class. These modifications are made in the backwards direction, that is, from the output layer, through each hidden layer, down to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in general the weights eventually converge and the learning process stops. Initialize the weights: the weights in the network are initialized to small random numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated with it, and the biases are similarly initialized to small random numbers.

Nonlinear Regression: Polynomial regression can be modelled by adding polynomial terms to the basic linear model. By applying transformations to the variables, we can convert the nonlinear model into a linear one that can then be solved by the method of least squares. Transformation of a polynomial regression model to a linear regression model: consider a cubic polynomial relationship given by
Y = a + b1 X + b2 X^2 + b3 X^3.
To convert this equation to linear form, we define new variables
X1 = X, X2 = X^2, X3 = X^3.
The equation can then be converted to linear form by applying these assignments, resulting in
Y = a + b1 X1 + b2 X2 + b3 X3,
which is solvable by the method of least squares.
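The following sketch carries out exactly the transformation just described: expand X into X1 = X, X2 = X^2, X3 = X^3 (plus an intercept column) and estimate the coefficients by ordinary least squares. The data points are made up for illustration; the printed coefficients correspond to (a, b1, b2, b3).

```python
# Sketch of the polynomial-to-linear transformation: expand X into (X, X^2, X^3)
# and fit the cubic by ordinary least squares. The data points are made up.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.1, 7.9, 24.5, 58.0])          # roughly cubic, with noise

# Define the new variables X1 = X, X2 = X^2, X3 = X^3 plus an intercept column.
design = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Least-squares estimates of (a, b1, b2, b3) in Y = a + b1*X1 + b2*X2 + b3*X3.
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coeffs)
```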

Q. Enlist the desirable schemes required for a good architecture of data mining systems.
Ans: ARCHITECTURES OF DATA MINING SYSTEMS. A good system architecture will enable the system to make the best use of the software environment, accomplish data mining tasks in an efficient and timely manner, interoperate and exchange information with other information systems, be adaptable to users' different requirements and evolve with time. To see which architectures are desirable for data mining systems, we consider how data mining is integrated with database/data warehousing systems, using the following coupling schemes:
a) no coupling
b) loose coupling
c) semi-tight coupling
d) tight coupling
No coupling: this means that the data mining system does not utilize any function of a database or data warehousing system. Such a system fetches data from a particular source such as a file, processes the data using some data mining algorithms, and then stores the mining results in another file. This approach has a clear disadvantage: a database system provides a great deal of flexibility and efficiency in storing, organizing, accessing and processing data; without it, a data mining system may spend a large amount of time finding, collecting, cleaning and transforming data held in files.

Q. CLUSTERING IN DATA MINING

Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. Representing data by fewer clusters necessarily loses certain fine details (akin to lossy data compression), but achieves simplification: it represents many data objects by few clusters, and hence models data by its clusters. Data modelling puts clustering in a historical perspective rooted in mathematics, statistics and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. Therefore, clustering is unsupervised learning of a hidden data concept. Data mining deals with large databases that impose additional severe computational requirements on clustering analysis.

Requirements for clustering: Clustering is a challenging and interesting field whose potential applications pose their own special requirements. The following are typical requirements of clustering in data mining.
Scalability: Many clustering algorithms work well on small data sets containing fewer than 200 data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal) and ordinal data, or mixtures of these data types.

Nominal, Ordinal and Ratio-Scaled Variables

Nominal variables: A nominal variable is a generalization of the binary variable in that it can take on more than two states. For example, map_color is a nominal variable that may have, say, five states: red, yellow, green, pink and blue. Nominal variables can be encoded by asymmetric binary variables by creating a new binary variable for each of the M nominal states. For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0. Thus, for the variable map_color, a binary variable can be created for each of the five colors listed above (a short encoding sketch appears at the end of this answer block).
For an object having the color yellow, the yellow variable is set to 1, while the remaining four variables are set to 0.

Ordinal variables: A discrete ordinal variable resembles a nominal variable, except that the M states of the ordinal value are ordered in a meaningful sequence. Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively. For example, professional ranks are often enumerated in a sequential order, such as assistant, associate and full. A continuous ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of the values is essential but their actual magnitude is not. For example, the relative ranking in a particular sport (e.g., gold, silver, bronze) is often more essential than the actual values of a particular measure.

Ratio-scaled variables: A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula Ae^(Bt) or Ae^(-Bt), where A and B are positive constants. Typical examples include the growth of a bacteria population or the decay of a radioactive element. There are three methods to handle ratio-scaled variables when computing the dissimilarity between objects.

Neural Network Topologies: The arrangement of neural processing units and their interconnections can have a profound impact on the processing capabilities of a neural network. In general, all neural networks have some set of processing units that receive inputs from the outside world, which we refer to appropriately as the input units. Many neural networks also have one or more layers of hidden processing units that receive inputs only from other processing units. A layer or slab of processing units receives a vector of data or the outputs of a previous layer of units and processes them in parallel. The set of processing units that represents the final result of the neural network computation is designated as the output units. There are three major connection topologies that define how data flows between the input, hidden and output processing units.

Feed-Forward Networks: Feed-forward networks are used in situations where we can bring all of the information to bear on a problem at once and present it to the neural network. It is like a pop quiz, where the teacher walks in, writes a set of facts on the board, and says, "OK, tell me the answer." You must take the data, process it, and jump to a conclusion. In this type of neural network, the data flows through the network in one direction, and the answer is based solely on the current set of inputs.
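As referenced under nominal variables above, here is a tiny sketch of the map_color encoding: each of the five states becomes its own asymmetric binary variable. The function name and output layout are illustrative only.

```python
# Tiny sketch of the nominal-variable encoding described above: map_color with
# five states becomes five binary variables (one-hot encoding).
states = ["red", "yellow", "green", "pink", "blue"]

def encode_color(value):
    """Return one binary variable per state; only the matching state is set to 1."""
    return {f"is_{s}": int(s == value) for s in states}

print(encode_color("yellow"))
# {'is_red': 0, 'is_yellow': 1, 'is_green': 0, 'is_pink': 0, 'is_blue': 0}
```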

Q. Explain the concepts of data warehousing and data mining.

Ans: A data warehouse is a collection of a large amount of data, and this data consists of pieces of information that are used to make suitable managerial decisions (a store house of data), e.g., student data, the details of the citizens of a city, the sales of previous years, or the number of patients that came to a hospital with different ailments. Such data becomes a store house of information. Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules. The main concept of data mining is to use a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and to extract these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation.

Q15. Define a data mining query in terms of primitives.

Ans: a) Growing Data Volume: The main reason for the necessity of automated computer systems for intelligent data analysis is the enormous volume of existing and newly appearing data that requires processing. The amount of data accumulated each day by various business, scientific and governmental organizations around the world is daunting.
b) Limitations of Human Analysis: Two other problems that surface when human analysts process data are the inadequacy of the human brain when searching for complex multifactor dependencies in data, and the lack of objectiveness in such an analysis.
c) Low Cost of Machine Learning: While data mining does not eliminate human participation in solving the task completely, it significantly simplifies the job and allows an analyst who is not a professional in statistics and programming to manage the process of extracting knowledge from data.

Q. List the various applications of data mining in various fields.

Explain in brief the data mining applications.


Ans: Data mining has many and varied fields of application, some of which are listed below.
Retail/Marketing: Identify buying patterns of customers; find associations among customer demographic characteristics; predict the response to mailing campaigns; market basket analysis.
Banking: Detect patterns of fraudulent credit card use; identify loyal customers; predict customers' credit card spending; identify stock trading patterns.
Insurance and Health Care: Claims analysis; identify the behaviour patterns of risky customers; identify fraudulent behaviour.
Transportation: Determine the distribution schedules among outlets; analyze loading patterns.
Medicine: Characterize patient behaviour to predict office visits; identify successful medical therapies for different illnesses.

Q20. What are the guidelines for a KDD environment?

Ans: The guidelines for a KDD environment are as follows:
1. Support extremely large data sets: Data mining deals with extremely large data sets consisting of billions of records, and without proper platforms to store and handle these volumes of data, no reliable data mining is possible. Parallel servers with databases optimized for decision-support-oriented queries are useful. Fast and flexible access to large data sets is very important.
2. Support hybrid learning: Learning tasks can be divided into three areas: a. classification tasks, b. knowledge engineering tasks, c. problem-solving tasks. No single algorithm performs well in all the above areas, as discussed in previous chapters; depending on the requirement, one has to choose the appropriate one.
3. Establish a data warehouse: A data warehouse contains historic data and is subject oriented and static; that is, users do not update the data, but it is created on a regular time-frame on the basis of the operational data of an organization.
4. Introduce data cleaning facilities: Even when a data warehouse is in operation, the data is certain to contain all sorts of heterogeneous mixtures. Special tools for cleaning data are necessary, and some advanced tools are available, especially in the field of deduplication of client files.
5. Facilitate working with dynamic coding: Creative coding is the heart of the knowledge discovery process. The environment should enable the user to experiment with different coding schemes, store partial results, make attributes discrete, create time series out of historic data, select random sub-samples, separate test sets and so on.

Q21. Explain data mining for financial data analysis.

Ans: Financial data collected in the banking and financial industries is often relatively complete, reliable and of high quality, which facilitates systematic data analysis and data mining. The various issues are:
a) Design and construction of data warehouses for multidimensional data analysis and data mining: Data warehouses need to be constructed for banking and financial data. Multidimensional data analysis methods should be used to analyze the general properties of such data. Data warehouses, data cubes, multifeature and discovery-driven data cubes, characteristic and comparative analyses, and outlier analyses all play important roles in financial data analysis and mining.
b) Loan payment prediction and customer credit policy analysis: Loan payment prediction and customer credit analysis are critical to the business of a bank. Many factors can strongly or weakly influence loan payment performance and customer credit rating. Data mining methods, such as feature selection and attribute relevance ranking, may help identify the important factors and eliminate the irrelevant ones.
c) Classification and clustering of customers for targeted marketing: Classification and clustering methods can be used for customer group identification and targeted marketing.
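As a rough illustration of the attribute relevance ranking mentioned in point (b), the following sketch ranks made-up loan attributes by the absolute correlation of each with a repayment outcome. The column names and values are entirely hypothetical, and real credit-scoring work would use far more careful methods.

```python
# Hedged sketch of attribute relevance ranking: correlate each made-up loan
# attribute with the repayment outcome and sort by absolute correlation.
import pandas as pd

loans = pd.DataFrame({
    "income":       [30, 55, 42, 80, 25, 60, 48, 33],   # in thousands, made up
    "loan_amount":  [10, 20, 15, 25, 12, 18, 16, 11],
    "years_at_job": [1, 8, 3, 12, 0, 9, 4, 2],
    "repaid":       [0, 1, 1, 1, 0, 1, 1, 0],           # 1 = loan repaid
})

relevance = (loans.drop(columns="repaid")
                  .corrwith(loans["repaid"])
                  .abs()
                  .sort_values(ascending=False))
print(relevance)   # attributes ordered from most to least relevant
```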

Q23. What is the importance of the period of retention of data?

Ans: A businessman may say he wants the data to be retained for as long as possible: 5, 10, 15 years, the longer the better, since the more data we have, the better the information generated. But such a view is unnecessarily simplistic. If a company wants an idea of its reorder levels, the details of sales for the last 6 months to one year may be enough; the sales pattern of 5 years ago is unlikely to be relevant today. It is therefore important to determine the retention period for each function; once this is drawn up, it becomes easy to decide on the optimum volume of data to be stored.

Q25. Give the advantages and disadvantages of equal segment partitioning.

Ans: The advantage is that the slots are reusable. Suppose we are sure that we will no longer need the data of 10 years back; then we can simply delete the data of that slot and use the slot again. Of course, there is a serious drawback in the scheme if the partitions tend to differ too much in size. The number of visitors visiting a hill station, say, will be much larger in the summer months than in the winter months, and hence the corresponding partitions will differ greatly in size.

Q37. Define aggregation. Explain the steps required in designing a summary table.

Ans: Association: a collection of items and a set of records which contain some number of items from the given collection; an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items.
Summary tables are designed by following the steps given below:
a) Decide the dimensions along which aggregation is to be done.
b) Determine the aggregation of multiple facts.
c) Aggregate multiple facts into the summary table.
d) Determine the level of aggregation and the extent of embedding.
e) Design time into the table.
f) Index the summary table.

Q30. Explain horizontal and vertical partitioning and differentiate them.

Ans: HORIZONTAL PARTITIONING: This essentially means that the table is partitioned after the first few thousand entries, the next few thousand entries, and so on. This is because in most cases not all the information in the fact table is needed all the time. Thus horizontal partitioning helps to reduce the query access time by directly cutting down the amount of data to be scanned by the queries.
a) Partitioning by time into equal segments: This is the most straightforward method of partitioning, by months or years etc. It helps if queries often concern fortnightly or monthly performance, sales, etc.
b) Partitioning by time into different-sized segments: This is a very useful technique to keep the physical table small and the operating cost low.
VERTICAL PARTITIONING: A vertical partitioning scheme divides the table vertically; each row is divided into two or more partitions. We may not need to access all the data pertaining to a student all the time; for example, we may need either only his personal details like age, address etc., or only the examination details of marks scored etc. Then we may choose to split them into separate tables, each containing data only about the relevant fields. This speeds up access.

Q27. Explain the data mining applications of the retail industry.
Ans: The retail industry is a major application area for data mining, since it collects huge amounts of data on sales, customer shopping history, goods transportation, consumption, service records and so on. The quantity of data collected continues to expand rapidly due to the web and e-commerce.
a) Design and construction of data warehouses based on the benefits of data mining: The first aspect is to design the warehouse. This involves deciding which dimensions and levels to include and what preprocessing to perform in order to facilitate quality and efficient data mining.
b) Multidimensional analysis of sales, customers, products, time and region: The retail industry requires timely information regarding customer needs, product sales, trends and fashions, as well as the quality, cost, profit and service of commodities. It is therefore important to provide powerful multidimensional analysis and visualization tools, including the construction of sophisticated data cubes according to the needs of data analysis.
c) Purchase recommendations can be advertised on the web, in weekly flyers or on sales receipts to help improve customer service, aid customers in selecting items and increase sales.

Q36. Explain multidimensional schemas.

Ans: This is a very convenient method of analyzing data when it goes beyond normal tabular relations. For example, a store maintains a table of each item it sells over a month, in each of its 10 outlets; this is a 2-dimensional table. On the other hand, if the company wants the data for all items sold by its outlets, this can be done simply by superimposing the 2-dimensional tables for each of these items one behind the other; it then becomes a 3-dimensional view. The query, instead of looking for a 2-dimensional rectangle of data, will look for a 3-dimensional cuboid of data. There is no reason why the dimensioning should stop at 3 dimensions: in fact, almost all queries can be thought of as approaching a multi-dimensioned unit of data within a multidimensional volume of the schema. A lot of design effort goes into optimizing such searches (a small pivot sketch appears after Q26 below).

Q26. Explain query generation.

Ans: Metadata is also required to generate queries. The query manager uses the metadata to build a history of all queries run and generates a query profile for each user or group of users. A few of the commonly used metadata items for queries are listed below; the names are self-explanatory.
o Query: table accessed, column accessed, name, reference identifier.
o Restrictions applied: column name, table name, reference identifier, restrictions.
o Join criteria applied: column name, table name, reference identifier; column name, table name, reference identifier.
o Aggregate functions used: column name, reference identifier, aggregate function.
o Syntax
o Resources
o Disk
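As a small sketch of the multidimensional view described in Q36 above, the following pandas pivot builds a cube-like item-by-outlet-by-month structure from made-up sales rows; the table and column names are hypothetical.

```python
# Small sketch of a multidimensional view: sales recorded by item, outlet and
# month are pivoted into a cube-like structure. All data is made up.
import pandas as pd

sales = pd.DataFrame({
    "item":   ["soap", "soap", "tea", "tea", "soap", "tea"],
    "outlet": ["A", "B", "A", "B", "A", "A"],
    "month":  ["Jan", "Jan", "Jan", "Jan", "Feb", "Feb"],
    "qty":    [10, 4, 7, 6, 12, 9],
})

# item x outlet for one month is a 2-dimensional slice; indexing by month as
# well adds the third dimension of the "cuboid" described in the text.
cube = sales.pivot_table(index="item", columns=["month", "outlet"],
                         values="qty", aggfunc="sum", fill_value=0)
print(cube)
```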

