
BT9001 Data Mining

Question 1 - What is data mining? Write Data Mining applications.


Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
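To make "finding correlations among fields" concrete, here is a minimal sketch that computes the Pearson correlation between two numeric fields. The field names (ad_spend, revenue) and all values are made up for illustration and do not come from any real data set.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric fields."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical fields taken from the same records (illustrative values only).
ad_spend = [10, 20, 30, 40, 50]
revenue = [25, 45, 65, 85, 105]  # constructed as 2 * ad_spend + 5

r = pearson(ad_spend, revenue)
print(f"correlation between ad_spend and revenue: {r:.2f}")
```

A value near 1 or -1 suggests a strong linear relationship between the two fields, while a value near 0 suggests none. A real tool would run such checks across many fields and much larger tables.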

Although data mining is a relatively new term, the technology is not. Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost. In short, data mining is the discovery of knowledge by analyzing enormous sets of data: extracting the meaning of the data, predicting future trends, and helping companies make sound decisions based on knowledge and information.

Data Mining Applications

Question 2 - What is OLAP? Write the benefits of OLAP.


OLAP: Online Analytical Processing is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data. For example, they provide time series and trend analysis views. OLAP is often used in data mining. OLAP is a design paradigm, a way to seek information out of the physical data store. OLAP is all about summation. It aggregates information from multiple systems and stores it in a multidimensional format. The format could be a star schema, a snowflake schema, or a hybrid schema.

The chief component of OLAP is the OLAP server, which sits between a client and the database management system (DBMS). The OLAP server understands how data is organized in the database and has special functions for analyzing the data. There are OLAP servers available for nearly all the major database systems.

Benefits of OLAP:

One main benefit of OLAP is consistency of information and calculations. No matter how much or how fast data is processed through OLAP software or servers, the resulting reports are presented in a consistent format, so analysts and executives always know what to look for and where. This is especially helpful when comparing information from previous reports with information contained in new ones and in projected future ones. It avoids lengthy discussions about who has the correct information.

"What if" scenarios are some of the most popular uses of OLAP software and are made far more practical by multidimensional processing.

Another benefit of multidimensional data presentation is that it allows a manager to pull data from an OLAP database in broad or specific terms. In other words, reporting can be as simple as comparing a few lines of data in one column of a spreadsheet or as complex as viewing all aspects of a mountain of data. Multidimensional presentation can also reveal relationships that were not previously recognized.

OLAP creates a single platform for all information and business needs: planning, budgeting, forecasting, reporting and analysis.

Last but not least, the learning curve for OLAP is minimal. The most widely used interface for analyzing data stored in OLAP technology is the familiar spreadsheet. All of this can typically be done with fast, interactive response times.
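The idea of aggregating along dimensions, in broad or specific terms, can be illustrated with a small sketch. It is not how an OLAP server is implemented. It only mimics two common cube operations, roll-up and slice, over a tiny in-memory fact table. The dimension names, the function names roll_up and slice_cube, and all sales figures are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical fact table: (year, region, product, sales). Values are made up.
FACTS = [
    (2022, "North", "Widget", 100),
    (2022, "South", "Widget", 80),
    (2022, "North", "Gadget", 50),
    (2023, "North", "Widget", 120),
    (2023, "South", "Gadget", 70),
]
DIMENSIONS = ("year", "region", "product")

def roll_up(facts, keep):
    """Sum the sales measure, grouping only by the dimensions named in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]  # the last column is the sales measure
    return dict(totals)

def slice_cube(facts, dimension, value):
    """Keep only the facts where one dimension equals a fixed value."""
    i = DIMENSIONS.index(dimension)
    return [row for row in facts if row[i] == value]

# Broad view: total sales per year.
by_year = roll_up(FACTS, ["year"])
# More specific view: sales per year and region.
by_year_region = roll_up(FACTS, ["year", "region"])
# Slice on region = North, then summarize by product.
north_by_product = roll_up(slice_cube(FACTS, "region", "North"), ["product"])

print(by_year)
print(by_year_region)
print(north_by_product)
```

The same fact table answers all three questions. Only the choice of dimensions changes, which is the sense in which a report can be as broad or as specific as the analyst needs.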

Question 3 - Describe the key features of a Data Warehouse.


Data warehouses have been defined in many ways, making it difficult to formulate a rigorous definition. A data warehouse refers to a database that is maintained separately from an organization's operational databases. Data warehouse systems allow for the integration of a variety of application systems. The data warehouse has become an increasingly important platform for data analysis and online analytical processing, and it provides an effective platform for data mining. Therefore, prior to presenting a systematic coverage of data mining technology in the remainder of this book, we devote this unit to an overview of data warehouse technology. Such an overview is essential for understanding data mining technology.

Key features of a Data Warehouse:

Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers. Hence, data warehouses typically provide a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.

Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.

Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.

Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.
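The integrated, time-variant and nonvolatile features can be sketched together in a few lines. This is an illustrative toy, not a real warehouse design. The two source layouts, the COUNTRY_CODES table, the Warehouse class and all values are hypothetical.

```python
from datetime import date

# Assumed mapping used to unify country encodings across sources.
COUNTRY_CODES = {"us": "US", "united states": "US", "in": "IN", "india": "IN"}

def from_orders(row):
    """Convert a row from a hypothetical order system to the warehouse schema."""
    return {
        "customer_id": int(row["cust_id"]),
        "country": COUNTRY_CODES[row["country"].strip().lower()],
        "amount_cents": round(row["amount_usd"] * 100),  # unify the unit of measure
    }

def from_billing(row):
    """Convert a row from a hypothetical billing system to the warehouse schema."""
    return {
        "customer_id": row["customer"],
        "country": COUNTRY_CODES[row["country"].strip().lower()],
        "amount_cents": row["amount_cents"],
    }

class Warehouse:
    """Nonvolatile store: rows are only loaded and read, never updated in place."""

    def __init__(self):
        self._rows = []

    def load(self, rows, snapshot_date):
        # Time-variant: every stored row carries an explicit element of time.
        for r in rows:
            self._rows.append({**r, "snapshot_date": snapshot_date})

    def query(self, predicate=lambda r: True):
        # Return copies so callers cannot modify the stored rows.
        return [dict(r) for r in self._rows if predicate(r)]

wh = Warehouse()
wh.load(
    [
        from_orders({"cust_id": "17", "country": "US", "amount_usd": 12.5}),
        from_billing({"customer": 17, "country": "United States", "amount_cents": 990}),
    ],
    snapshot_date=date(2024, 1, 31),
)
rows = wh.query(lambda r: r["customer_id"] == 17)
print(rows)
```

Both sources describe the same customer with different field names, country encodings and amount units. After integration the rows share one schema, and the store exposes only the two operations named above: loading and access.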

Question 4 - What is Business Intelligence? Explain the components of BI architecture.


Today's business is getting increasingly global, while the boundaries between functions are dissolving. BI is being merged with business processes to provide an integrated user experience. From being the prerogative of only top management a few years ago, BI today is all-pervasive, helping executives fine-tune strategies at the operational level. Seamless integration of BI with MS-Office and on-demand delivery is a significant productivity driver. Advanced BI systems handle both structured and unstructured data to deliver tremendous value to users. Real-time BI reduces latency to deliver up-to-date results quickly, and innovative service models are revolutionizing the concept of information delivery.

Components of BI architecture:

The information warehouse layer consists of relational and/or OLAP cube services that allow business users to gain insight into their areas of responsibility in the organization.

Customer Intelligence relates to customer, service, sales and marketing information viewed along time periods, location/geography, and product and customer variables. Business decisions that can be supported with customer intelligence range from pricing, forecasting, promotion strategy and competitive analysis to up-sell strategy and customer service resource allocation.

Operational Intelligence relates to finance, operations, manufacturing, distribution, logistics and human resource information viewed along time periods, location/geography, product, project, supplier, carrier and employee.

The most visible layer of the business intelligence infrastructure is the applications layer, which delivers the information to business users. Business intelligence requirements include scheduled report generation and distribution, query and analysis capabilities to pursue special investigations, and graphical analysis permitting trend identification. This layer should enable business users to interact with the information to gain new insight into the underlying business variables and so support business decisions.

Presenting business intelligence on the Web through a portal is gaining considerable momentum. Portals are usually organized around communities of users, such as suppliers, customers, employees and partners. Portals can reduce the overall

Question 5 - Describe Data Cleaning and its importance.


Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data. Maintaining databases has become an essential part of many organizations these days. Database maintenance helps organizations carry out their daily work processes and procedures efficiently and smoothly. In addition, the databases hold information that is critical to various project processes. Hence, maintaining the accuracy and validity of the data is of utmost importance and needs to be done on a regular basis.

Importance of Data Cleaning:

Data cleansing brings many benefits to an organization, including accurate and well-organized data and more streamlined processes across the organization. Data cleaning techniques are also needed to deal with the following sources of error:

Application Errors: These data errors are essentially mechanical and happen because the legacy system cannot automatically validate certain user inputs. It is often difficult to prevent this type of error, as it is part and parcel of every legacy system.

Human Errors: Human errors are a major source of incorrect data in the legacy system. Again, they can occur because the legacy system cannot validate data entered manually by users. However, some of these errors are logical in nature. Take the example of a date field that holds the purchase date of a product. The user may enter a date that is valid from his perspective but wrong from the business perspective, i.e. a date on which no business transaction could have taken place.

Deliberate Manipulation: This can happen for two reasons. Firstly, the user might force the data to conform to the requirements of the legacy system. Secondly, the user might purposely manipulate the data in the legacy system to serve his own ends.

Target System Model Definition: If the target system model dictates the data in a certain format that cannot be found in the legacy system, this kind of error can crop up.
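The purchase-date example above can be sketched as a validation step that separates a date that is not a real calendar date from one that is syntactically valid but wrong from the business point of view. The function name check_purchase_date, the date format, and the business rules (no future dates, no weekend transactions) are assumptions chosen for illustration. A real system would encode its own rules.

```python
from datetime import date, datetime

def check_purchase_date(text, today):
    """Classify a purchase date string as 'ok' or name the problem found."""
    try:
        d = datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return "not a valid date"      # mechanical error, e.g. 30 February
    if d > today:
        return "in the future"         # valid date, impossible purchase
    if d.weekday() >= 5:               # assumed rule: no business on Sat/Sun
        return "non-business day"
    return "ok"

today = date(2024, 3, 15)
for raw in ["2024-03-11", "2024-02-30", "2024-04-01", "2024-03-10"]:
    print(raw, "->", check_purchase_date(raw, today))
```

Records that fail a check can then be corrected, flagged for review, or removed, depending on the cleaning policy.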

Question 6 - What is Data Mining? How does it work?


Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. In short, data mining streamlines the transformation of masses of information into meaningful knowledge, which is the bottom line of business intelligence.

How Data Mining works:

As a simple example of building a model, consider the director of marketing for a telecommunications company. He would like to focus his marketing and sales efforts on the segments of the population most likely to become big users of long distance services. He knows a lot about his customers, but it is impossible to discern the common characteristics of his best customers because there are so many variables. From his existing database of customers, which contains information such as age, sex, credit history, income, zip code and occupation, he can use data mining tools, such as neural networks, to identify the characteristics of those customers who make lots of long distance calls. For instance, he might learn that his best customers are unmarried females between the ages of 34 and 42 who make in excess of $60,000 per year. This, then, is his model for high-value customers, and he would budget his marketing efforts accordingly.

Data mining generally looks for four types of relationships:

Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.

Sequential Patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
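As a rough illustration of associative mining in the spirit of the beer-diaper example, the sketch below counts how often pairs of items appear together and derives two standard measures: support (the fraction of baskets containing both items) and confidence (the fraction of baskets containing the first item that also contain the second). The baskets, the function name pair_rules and the 0.5 support threshold are invented for the example. Production tools use more scalable algorithms such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (illustrative data only).
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]

def pair_rules(baskets, min_support=0.5):
    """Return {(a, b): (support, confidence)} for rules a -> b over item pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair is too rare to be interesting
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules

rules = pair_rules(BASKETS)
for (a, b), (support, confidence) in sorted(rules.items()):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

With this made-up data only the beer and diapers pair clears the threshold. A retailer would read the resulting rule as "customers who buy one of these items often buy the other".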
