
Name- Priyanshu Srivastava
Registration No- 520926927
Sub. Name- Data Mining
Course: MCA
LC Code- 0918
Sub. Code- MC0088

Q.1 Differentiate between Data Mining and Data Warehousing.

Ans:
The primary differences between data mining and data warehousing lie in their system designs, the methodologies used, and their purpose. Data mining is the use of pattern-recognition logic to identify trends within a sample data set and to extrapolate this information against the larger data pool. Data warehousing is the process of extracting and storing data to allow easier reporting.

Data mining is a general term used to describe a range of business processes that derive patterns from data. Typically, a statistical analysis software package is used to identify specific patterns, based on the data set and the queries generated by the end user. Typical uses of data mining are to create targeted marketing programs, identify financial fraud, and flag unusual patterns in behavior as part of a security review.

It is important to note that the primary purpose of data mining is to spot patterns in the data. The specifications used to define the sample set have a huge impact on the relevance of the output and the accuracy of the analysis. Returning to the targeted-marketing example, if the data set is limited to customers within a specific geographical area, the results and patterns will differ from those of a broader data set. Although both data mining and data warehousing work with large volumes of information, the processes used are quite different.

Q.2 Explain briefly about Business Intelligence.

Ans.
Business intelligence is information about a company's past performance that is used to help predict the company's future performance. It can reveal emerging trends from which the company might profit. Data mining allows users to sift through the enormous amount of information available in data warehouses; it is from this sifting process that business intelligence gems may be found. Data mining itself is not an intelligence tool or framework. Business intelligence, typically drawn from an enterprise data warehouse, is used to analyze and uncover information about past performance on an aggregate level, whereas data mining is more exploratory, allowing for increased insight beyond data warehousing. An implementation of data mining in an organization serves as a guide to uncovering inherent trends and tendencies in historical information, and also allows for statistical predictions, groupings, and classifications of data.

Business organizations can gain a competitive advantage with a well-designed business intelligence (BI) infrastructure. Think of the BI infrastructure as a set of layers that begin with the operational systems information and metadata and end in the delivery of business intelligence to various business user communities. Based on the overall requirements of business intelligence, the data integration layer is required to extract, cleanse, and transform data into load files for the information warehouse. This layer begins with transaction-level operational data and metadata about those operational systems. Typically this data integration is done using a relational staging database and flat-file extracts from source systems, as sketched after the list below. The product of a good data-staging layer is high-quality data, a reusable infrastructure, and metadata supporting both business and technical users. The information warehouse is usually developed incrementally over time and is architected to include key business variables and business metrics in a structure that meets all the business analysis questions raised by the business groups.

1. The information warehouse layer consists of relational and/or OLAP cube services that allow business users to gain insight into their areas of responsibility in the organization.
2. Customer intelligence relates to customer, service, sales, and marketing information viewed along time periods, location/geography, product, and customer variables.
3. Business decisions that can be supported with customer intelligence range from pricing, forecasting, promotion strategy, and competitive analysis to up-sell strategy and customer service resource allocation.
4. Operational intelligence relates to finance, operations, manufacturing, distribution, logistics, and human resource information viewed along time periods, location/geography, product, project, supplier, carrier, and employee.
5. The most visible layer of the business intelligence infrastructure is the applications layer, which delivers the information to business users.
6. Business intelligence requirements include scheduled report generation and distribution, query and analysis capabilities to pursue special investigations, and graphical analysis permitting trend identification. This layer should enable business users to interact with the information to gain new insight into the underlying business variables and support business decisions.
7. Presenting business intelligence on the Web through a portal is gaining considerable momentum. Portals are usually organized by communities of users, such as suppliers, customers, employees, and partners.
8. Portals can reduce the overall infrastructure costs of an organization as well as deliver strong self-service and information access capabilities.
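The data integration layer described above can be pictured as a small extract-cleanse-transform-load script. The following is a minimal sketch only, assuming pandas and a hypothetical flat-file extract named orders.csv with order_date, customer_id, region, and amount columns; it is not tied to any particular warehouse product.

```python
# Minimal ETL sketch for the data integration layer (hypothetical file/columns).
import pandas as pd

# Extract: read a flat-file extract produced by an operational source system.
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Cleanse: drop records with missing keys and normalize the region codes.
orders = orders.dropna(subset=["customer_id", "amount"])
orders["region"] = orders["region"].str.strip().str.upper()

# Transform: aggregate transaction-level data into a monthly business metric.
orders["month"] = orders["order_date"].dt.to_period("M")
monthly_sales = orders.groupby(["region", "month"], as_index=False)["amount"].sum()

# Load: write the load file for the information warehouse layer.
monthly_sales.to_csv("monthly_sales_load.csv", index=False)
```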
Q.3 Explain the concepts of Data Integration and Transformation.

Ans:
Data integration involves combining data residing in different sources and providing users with a unified view of these data. This process becomes significant in a variety of situations, both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories, for example). Data integration appears with increasing frequency as the volume of data and the need to share existing data explode. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. In management circles, data integration is frequently referred to as Enterprise Information Integration (EII).

A data integration system is formally defined as a triple ⟨G, S, M⟩, where G is the global (or mediated) schema, S is the heterogeneous set of source schemas, and M is the mapping that maps queries between the source schemas and the global schema. Both G and S are expressed in languages over alphabets composed of symbols for each of their respective relations. The mapping M consists of assertions between queries over G and queries over S. When users pose queries over the data integration system, they pose queries over G, and the mapping then asserts connections between the elements in the global schema and the source schemas.
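To make the ⟨G, S, M⟩ idea concrete, the sketch below builds a toy global-as-view style mapping with Python's built-in sqlite3 module. The table and column names (s1_clients, s2_buyers, customer) are hypothetical and chosen only for illustration; the mapping M is realized here as an SQL view that rewrites queries over the global schema into queries over the sources.

```python
# Toy illustration of a data integration triple <G, S, M> using sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# S: two heterogeneous source schemas that store customer data differently.
cur.execute("CREATE TABLE s1_clients (cid INTEGER, cname TEXT, city TEXT)")
cur.execute("CREATE TABLE s2_buyers (buyer_id INTEGER, full_name TEXT)")
cur.executemany("INSERT INTO s1_clients VALUES (?, ?, ?)",
                [(1, "Asha", "Delhi"), (2, "Ravi", "Pune")])
cur.executemany("INSERT INTO s2_buyers VALUES (?, ?)", [(7, "Meena")])

# M: the mapping, expressed as a view over the sources.
cur.execute("""
    CREATE VIEW customer AS
        SELECT cid AS id, cname AS name FROM s1_clients
        UNION
        SELECT buyer_id AS id, full_name AS name FROM s2_buyers
""")

# G: users query the unified global schema without knowing the sources.
for row in cur.execute("SELECT id, name FROM customer ORDER BY id"):
    print(row)
```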

Transformation: In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:

1. Smoothing, which works to remove noise from the data. Such techniques include binning, clustering, and regression.
2. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.
3. Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or county. Similarly, values for numeric attributes, like age, may be mapped to higher-level concepts, like young, middle-aged, and senior.
4. Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.

Normalization is particularly useful for classification algorithms involving neural networks, or for distance measurements such as nearest-neighbor classification and clustering. If the neural network back-propagation algorithm is used for classification mining, normalizing the input values for each attribute measured in the training samples will help speed up the learning phase. For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). There are many methods for data normalization; three common ones are min-max normalization, z-score normalization, and normalization by decimal scaling.

Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing

    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

Min-max normalization preserves the relationships among the original data values. It will encounter an "out of bounds" error if a future input case for normalization falls outside the original data range of A.
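The following sketch applies the three normalization methods to a small, made-up income attribute using NumPy; the values and the target range are illustrative only.

```python
# Sketch of min-max, z-score, and decimal-scaling normalization (toy data).
import numpy as np

income = np.array([12000.0, 35000.0, 47000.0, 98000.0])  # hypothetical attribute A

# Min-max normalization to [new_min, new_max]:
# v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min())
minmax = minmax * (new_max - new_min) + new_min

# Z-score normalization: v' = (v - mean_A) / std_A
zscore = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer such that
# max(|v'|) < 1.
j = int(np.floor(np.log10(np.abs(income).max()))) + 1
decimal_scaled = income / (10 ** j)

print(minmax)
print(zscore)
print(decimal_scaled)
```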

Q.4 Differentiate between database management systems (DBMS) and data mining.

Ans:
A database management system (DBMS), sometimes just called a database manager, is a program that lets one or more computer users create and access data in a database. The DBMS manages user requests (and requests from other programs) so that users and other programs are free from having to understand where the data is physically located on storage media and, in a multi-user system, who else may also be accessing the data. In handling user requests, the DBMS ensures the integrity of the data (that is, making sure it continues to be accessible and is consistently organized as intended) and its security (making sure only those with access privileges can access the data). The most typical DBMS is a relational database management system (RDBMS), for which the standard user and program interface is the Structured Query Language (SQL). A newer kind of DBMS is the object-oriented database management system (ODBMS). A DBMS can be thought of as a file manager that manages data in databases rather than files in file systems. In IBM's mainframe operating systems, the non-relational data managers were (and are, because these legacy application systems are still used) known as access methods. A DBMS is usually an inherent part of a database product. On PCs, Microsoft Access is a popular example of a single- or small-group user DBMS. Microsoft's SQL Server is an example of a DBMS that serves database requests from multiple (client) users. Other popular DBMSs (all of them RDBMSs, by the way) are IBM's DB2, Oracle's line of database management products, and Sybase's products. IBM's Information Management System (IMS) was one of the first DBMSs. A DBMS may be used by or combined with transaction managers, such as IBM's Customer Information Control System (CICS).

Data mining, in contrast, is rooted in statistics, a branch of mathematics. Statistical techniques are incorporated into data mining methods. Data mining methods or techniques find the relations between variables or data in a given database and express these relations using statistical nomenclature. Without statistics there would be no data mining, as statistics are the foundation of most technologies on which data mining is built. Classical statistics embraces concepts such as regression analysis, standard distributions, standard deviation, standard variance, discriminant analysis, cluster analysis, and confidence intervals, all of which are used to study data and data relationships. These are the very building blocks on which more advanced statistical analyses are built. Certainly, within the heart of today's data mining tools and techniques, classical statistical analysis plays a significant role.

Note: Data mining has its roots in statistics, artificial intelligence, and machine learning.
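As a rough illustration of the difference, the sketch below uses Python's built-in sqlite3 and statistics modules with a hypothetical sales table: the DBMS answers a precise, pre-specified query, while the mining step looks for a relationship that was never stated explicitly.

```python
# DBMS retrieval vs. a simple pattern-discovery step (hypothetical sales data).
import sqlite3
from statistics import correlation  # requires Python 3.10+

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, ad_spend REAL, revenue REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("north", 10.0, 120.0), ("south", 15.0, 180.0),
                 ("east", 7.0, 90.0), ("west", 20.0, 230.0)])

# DBMS: answer a known question with SQL (what is the total revenue?).
total = cur.execute("SELECT SUM(revenue) FROM sales").fetchone()[0]
print("Total revenue (DBMS query):", total)

# Data mining: discover a pattern, here the correlation between
# advertising spend and revenue across regions.
spend, revenue = zip(*cur.execute("SELECT ad_spend, revenue FROM sales"))
print("Correlation of ad_spend and revenue:", round(correlation(spend, revenue), 3))
```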

Q.5 Differentiate between K-means and Hierarchical clustering.

Ans:


Hierarchical clustering

Hierarchical clustering creates a hierarchy of clusters, which may be represented in a tree structure called a dendrogram. The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual observations. Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the leaves and successively merges clusters together, or divisive, in which one starts at the root and recursively splits the clusters. Any non-negative-valued function may be used as a measure of similarity between pairs of observations. The choice of which clusters to merge or split is determined by a linkage criterion, which is a function of the pairwise distances between observations. Cutting the tree at a given height gives a clustering at a selected precision. For example, for six elements {a}, {b}, {c}, {d}, {e}, {f}, cutting the dendrogram after the second row of merges might yield the clusters {a} {b c} {d e} {f}, while cutting after the third row might yield {a} {b c} {d e f}, which is a coarser clustering with a smaller number of larger clusters.

Agglomerative hierarchical clustering

For example, suppose these six elements are to be clustered and the Euclidean distance is the distance metric. This method builds the hierarchy from the individual elements by progressively merging clusters. The first step is to determine which elements to merge into a cluster. Usually, we want to take the two closest elements according to the chosen distance. Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row and j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering and has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described on the single-linkage clustering page; it can easily be adapted to different types of linkage (listed below).

Suppose we have merged the two closest elements b and c. We now have the clusters {a}, {b, c}, {d}, {e} and {f} and want to merge them further. To do that, we need the distance between {a} and {b, c}, and therefore need to define the distance between two clusters A and B. It is usually one of the following:

- The maximum distance between elements of each cluster (complete-linkage clustering): max { d(x, y) : x in A, y in B }
- The minimum distance between elements of each cluster (single-linkage clustering): min { d(x, y) : x in A, y in B }
- The mean distance between elements of each cluster (average-linkage clustering, used e.g. in UPGMA): (1 / (|A| |B|)) * sum of d(x, y) over x in A, y in B
- The sum of all intra-cluster variance.
- The increase in variance for the cluster being merged (Ward's criterion).
- The probability that candidate clusters spawn from the same distribution function (V-linkage).

Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).
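A compact way to try agglomerative clustering is SciPy's hierarchy module. The sketch below is illustrative only: the six 2-D points are made up to stand in for the elements a to f, and single linkage is chosen arbitrarily from the criteria listed above.

```python
# Agglomerative hierarchical clustering on six toy points (stand-ins for a..f).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0],   # a
                   [1.2, 1.1],   # b
                   [1.3, 1.0],   # c
                   [5.0, 5.0],   # d
                   [5.1, 5.2],   # e
                   [9.0, 9.0]])  # f

# Build the merge hierarchy bottom-up with single linkage (minimum distance
# between elements of each cluster); Z encodes the dendrogram.
Z = linkage(points, method="single", metric="euclidean")

# "Cut the tree" to get a flat clustering with three clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)   # cluster label for each of the six points
```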
k-means clustering

The k-means algorithm assigns each point to the cluster whose center (also called the centroid) is nearest. The center is the average of all the points in the cluster; that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.

Example: the data set has three dimensions and the cluster has two points, X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z becomes Z = (z1, z2, z3), where zi = (xi + yi) / 2 for i = 1, 2, 3.

The algorithm steps are:

1. Choose the number of clusters, k.
2. Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers.
3. Assign each point to the nearest cluster center, where "nearest" is defined with respect to one of the distance measures discussed above.
4. Recompute the new cluster centers.
5. Repeat the two previous steps until some convergence criterion is met (usually that the assignments no longer change).

The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments (the k-means++ algorithm addresses this problem by seeking to choose better starting clusters). It minimizes intra-cluster variance but does not ensure that the result is a global minimum of variance. Another disadvantage is the requirement that a mean be definable, which is not always the case; for such datasets the k-medoids variant is appropriate. An alternative, using a different criterion for deciding which points are best assigned to which center, is k-medians clustering.
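The steps above translate almost directly into code. The sketch below is a bare-bones NumPy implementation on a made-up 2-D data set; it omits practical details such as handling empty clusters or repeated restarts with different initializations.

```python
# Bare-bones k-means (Lloyd's algorithm) on toy 2-D data.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and pick k distinct data points as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to the nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centers (and hence the assignments) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1],
              [5.2, 4.9], [9.0, 9.1], [8.9, 9.0]])
labels, centers = kmeans(X, k=3)
print(labels)
print(centers)
```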

Q.6 Differentiate between Web content mining and Web usage mining.

Ans:
Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. It has two main branches:

1. Web Content Mining
2. Web Usage Mining

A. Web Content Mining

Web content mining targets knowledge discovery in which the main objects are the traditional collections of multimedia documents, such as images, video, and audio, which are embedded in or linked to web pages. It is quite different from data mining because Web data are mainly semi-structured and/or unstructured, while data mining deals primarily with structured data. Web content mining is also different from text mining because of the semi-structured nature of the Web, whereas text mining focuses on unstructured texts. Web content mining thus requires creative applications of data mining and/or text mining techniques as well as its own unique approaches. In the past few years there has been a rapid expansion of activities in the Web content mining area. This is not surprising, given the phenomenal growth of Web content and the significant economic benefit of such mining. However, due to the heterogeneity and lack of structure of Web data, automated discovery of targeted or unexpected knowledge still presents many challenging research problems.

Web content mining can be approached from two points of view: the agent-based approach or the database approach. The first approach aims to improve information finding and filtering. The second approach aims to model the data on the Web into a more structured form in order to apply standard database querying mechanisms and data mining applications to analyze it.

Web content mining problems/challenges:

1. Data/information extraction: extraction of structured data from Web pages, such as products and search results, is a difficult task. Extracting such data allows one to provide services. Two main types of techniques, machine learning and automatic extraction, are used to solve this problem (a small extraction sketch is given after this list).
2. Web information integration and schema matching: although the Web contains a huge amount of data, each web site (or even page) represents similar information differently. Identifying or matching semantically similar data is a very important problem with many practical applications.
3. Opinion extraction from online sources: there are many online opinion sources, e.g., customer reviews of products, forums, blogs, and chat rooms. Mining opinions (especially consumer opinions) is of great importance for marketing intelligence and product benchmarking.
4. Knowledge synthesis: concept hierarchies or ontologies are useful in many applications, but generating them manually is very time consuming. A few existing methods exploit the information redundancy of the Web; the main application is to synthesize and organize the pieces of information on the Web to give the user a coherent picture of the topic domain.
5. Segmenting Web pages and detecting noise: in many Web applications one only wants the main content of a Web page, without advertisements, navigation links, or copyright notices. Automatically segmenting Web pages to extract their main content is an interesting problem.

All these tasks present major research challenges.
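As a toy illustration of the data-extraction challenge, the sketch below pulls product names out of a made-up HTML fragment using Python's built-in html.parser; the class name product-name and the markup are hypothetical, and real extraction systems are far more robust than this.

```python
# Toy Web content extraction: collect product names from an HTML snippet.
from html.parser import HTMLParser

class ProductExtractor(HTMLParser):
    """Collect the text of elements marked with class="product-name"."""
    def __init__(self):
        super().__init__()
        self._capture = False
        self.products = []

    def handle_starttag(self, tag, attrs):
        if ("class", "product-name") in attrs:
            self._capture = True

    def handle_endtag(self, tag):
        self._capture = False

    def handle_data(self, data):
        if self._capture and data.strip():
            self.products.append(data.strip())

html = """
<ul>
  <li><span class="product-name">Wireless Mouse</span> <span class="price">$12</span></li>
  <li><span class="product-name">USB Keyboard</span> <span class="price">$25</span></li>
</ul>
"""
parser = ProductExtractor()
parser.feed(html)
print(parser.products)   # ['Wireless Mouse', 'USB Keyboard']
```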
B. Web Usage Mining

Web usage mining is the process of extracting useful information from server logs, i.e., of finding out what users are looking for on the Internet. Some users might be looking only at textual data, whereas others might be interested in multimedia data. Web usage mining is the application of data mining techniques to discover interesting usage patterns from Web data in order to understand and better serve the needs of Web-based applications. Usage data captures the identity or origin of Web users along with their browsing behavior at a Web site. Web usage mining itself can be classified further depending on the kind of usage data considered:

1. Web server data: the user logs are collected by the Web server. Typical data include IP address, page reference, and access time (see the log-parsing sketch after this list).
2. Application server data: commercial application servers have significant features that enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs.
3. Application level data: new kinds of events can be defined in an application, and logging can be turned on for them, thus generating histories of these specially defined events.

It must be noted, however, that many end applications require a combination of one or more of the techniques applied in the categories above.
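The sketch below shows the flavor of mining Web server data: it parses a few invented log lines in the Common Log Format and counts page requests and distinct visitor IPs. The log entries are fabricated for illustration.

```python
# Minimal Web usage mining sketch: parse Common Log Format lines and count usage.
import re
from collections import Counter

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

log_lines = [
    '192.168.1.10 - - [10/Oct/2023:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326',
    '192.168.1.10 - - [10/Oct/2023:13:56:02 +0530] "GET /products.html HTTP/1.1" 200 4521',
    '10.0.0.7 - - [10/Oct/2023:14:01:17 +0530] "GET /index.html HTTP/1.1" 200 2326',
]

page_hits = Counter()
visitors = set()
for line in log_lines:
    m = LOG_PATTERN.match(line)
    if m:
        page_hits[m.group("page")] += 1   # which pages users are looking for
        visitors.add(m.group("ip"))       # identity/origin of Web users

print("Most requested pages:", page_hits.most_common())
print("Distinct visitors:", len(visitors))
```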
