
CS1004 DATA WAREHOUSING & MINING UNIT 5

1. What are the features of object-relational and object-oriented databases?
Both kinds of systems deal with the efficient storage and access of vast amounts of disk-based, complex structured objects. They organise a large set of data objects into classes, which are in turn organised into class/subclass hierarchies. Each object in a class is associated with
a. an object identifier,
b. a set of attributes, and
c. a set of methods that specify the computational routines or rules associated with the object class.

2. How is data mining performed on complex data types?
Vast amounts of data are stored in various complex forms. The complex data types include objects, spatial data, multimedia data, text data and web data. Multidimensional analysis and data mining can be performed by
a. class-based generalization of complex objects, including set-valued and list-valued data, class/subclass hierarchies, and class composition hierarchies,
b. constructing object data cubes, and
c. performing generalization-based mining.

3. Give an example of a star schema of a spatial data warehouse.
There are 3000 weather probes distributed in British Columbia (BC), Canada, each recording daily temperature and precipitation for a designated small area and transmitting signals to a provincial weather station. With a spatial data warehouse that supports spatial OLAP, a user can view weather patterns on a map by month, by region, etc.
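The "by month, by region" view in question 3 amounts to a roll-up over the time and location dimensions. The sketch below illustrates that idea with a plain dictionary; it is not a spatial OLAP engine, and the probe identifiers, region names, dates and temperatures are hypothetical values invented for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical fact rows: (probe_id, region, reading_date, temperature_celsius).
readings = [
    ("p1", "north", date(2024, 1, 5), -4.0),
    ("p2", "north", date(2024, 1, 9), -6.0),
    ("p3", "south", date(2024, 1, 5), 3.0),
    ("p3", "south", date(2024, 2, 1), 5.0),
]


def roll_up_temperature(rows):
    """Average temperature per (region, month), rolled up from daily readings."""
    sums = defaultdict(lambda: [0.0, 0])  # (region, month) -> [total, count]
    for _probe, region, day, temp in rows:
        cell = sums[(region, day.strftime("%Y-%m"))]
        cell[0] += temp
        cell[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


monthly = roll_up_temperature(readings)
```

Each key of `monthly` is one cell of the rolled-up view; a map display would colour each region by the value of its cell for the chosen month.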

4. How is a spatial data warehouse constructed?
As with relational data, we can integrate spatial data to construct a data warehouse that facilitates spatial data mining. A spatial data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of both spatial and non-spatial data.

5. What are spatial association rules?
Similar to the mining of association rules in transactional and relational databases, spatial association rules can be mined in spatial databases. A spatial association rule is of the form A => B [s%, c%], where A and B are sets of spatial or non-spatial predicates, s% is the support of the rule, and c% is the confidence of the rule.
E.g.: is_a(X, "school") ^ close_to(X, "sports_center") => close_to(X, "park") [0.5%, 80%]
This rule states that 80% of schools that are close to sports centers are also close to parks, and that 0.5% of the data belongs to such a case.
Various kinds of spatial predicates can constitute a spatial association rule, e.g.:
a. distance information (such as close_to and far_away),
b. topological relations (such as intersect, overlap, and disjoint), and
c. spatial orientation (such as left_of and right_of).

6. What are the different categories of mining associations in multimedia data?
Association rules involving multimedia objects can be mined in image and video databases. The three categories are:
a. Associations between image content and non-image content features. E.g.: a rule like "If at least 50% of the upper part of the picture is blue, then it is likely to represent sky". Here the image content is linked to the keyword "sky".
b. Associations among image contents that are not related to spatial relationships. E.g.: "If a picture contains two blue squares, then it is likely to contain one red circle as well."

c. Associations among image contents related to spatial relationships. E.g.: a rule like "If a red triangle is between two yellow squares, then it is likely that a big oval-shaped object is underneath."
To mine associations among multimedia objects, we can treat each image as a transaction and find frequently occurring patterns among different images.

7. Explain audio and video data mining.
There is great demand for effective content-based retrieval and data mining methods for audio and video data. Besides still images, an immense amount of audiovisual information is available in digital archives, on the World Wide Web, in broadcast data streams, and in personal and professional databases. Typical examples include searching for and editing particular video clips in a TV studio, and detecting suspicious persons or scenes in surveillance videos. To facilitate the recording, search and analysis of audio and video information from multimedia data, the following industry standards are available:
a. MPEG (Moving Picture Experts Group)
b. JPEG (Joint Photographic Experts Group)

8. What are the different text retrieval methods?
a. Document selection methods: a query specifies constraints for selecting relevant documents. A typical method is the Boolean retrieval model, in which a document is represented by a set of keywords and the query is a Boolean expression of keywords. E.g.: "database systems but not Oracle".
b. Document ranking methods: used to rank all documents in order of relevance. In these methods, we may match the keywords in the query with

those in the documents and score each document based on how well it matches the query. The goal is to approximate the degree of relevance of a document with a score computed from information such as the frequency of words in the document and in the whole collection.

9. How can automated document classification be performed?
There is a tremendous number of online documents available. Automated document classification is an important text mining task, because there is a need to organize documents into classes automatically to facilitate document retrieval and subsequent analysis. A general procedure for automated document classification is as follows. First, a set of preclassified documents is taken as a training set. The training set is then analyzed in order to derive a classification scheme. Such a classification scheme often needs to be refined with a testing process. The derived classification scheme can then be used to classify other online documents. A few typical methods used in text classification are
a. nearest-neighbor classification,
b. feature selection methods, and
c. Bayesian classification.

10. Explain briefly some data classification methods.
a. Nearest-neighbor classification: k-nearest-neighbor classification is based on the intuition that similar documents are expected to be assigned the same class label.
i) We can simply index the training documents and associate each with a class label.
ii) The class label of a test document can be determined from the class label distribution of its k nearest neighbors.
By tuning k and incorporating refinements, this kind of classifier can achieve accuracy comparable to that of the best classifiers.
b. Feature selection: a feature selection process can be used to remove terms in the training documents that are statistically uncorrelated with the class labels.
c. Bayesian classification: first trains the model by calculating a generative document distribution P(d|c) for each class c of document d, and then tests which class is most likely to generate the test document.

11. What are the different methods of document clustering?
Document clustering is one of the most crucial techniques for organizing documents in an unsupervised manner (class labels are not known in advance).
a. Spectral clustering: first performs spectral embedding (dimensionality reduction) on the original data, and then applies a traditional clustering algorithm (e.g. k-means) on the reduced document space.
b. Mixture model clustering: models the text data with a mixture model (often involving multinomial component models). Clustering involves two steps: (1) estimating the model parameters based on the text data and any additional prior knowledge, and (2) inferring the clusters based on the estimated model parameters.
c. Latent semantic indexing (LSI) and locality preserving indexing (LPI): these are linear dimensionality reduction methods. We can acquire the transformation vectors (the embedding function) and then use that function to embed all of the data into a lower-dimensional space.

12. What is a time-series database?
A time-series database consists of sequences of values or events obtained over repeated measurements of time (hourly, daily, weekly). Time-series databases are popular in many applications, such as stock market analysis, economic and sales forecasting, budgetary analysis, workload projections, process and quality control, observation of natural phenomena (such as atmosphere, temperature, wind and earthquakes), scientific and engineering experiments, and medical treatments. The amount of time-series data is increasing rapidly, by gigabytes per day (such as in stock trading) or even per minute (such as in NASA space programs). There is a need to find correlation relationships within time-series data, as well as to analyse huge numbers of regular patterns, trends, bursts (such as sudden sharp changes) and outliers, with fast or even real-time response.

13. What is trend analysis?
A time series involving a variable Y, representing, say, the closing price of a share in a stock market, can be viewed as a function of time t, that is, Y = f(t). Such a function can be illustrated as a time-series graph.
How can we study time-series data? There are two goals:
(1) modelling time series (to gain insight into the mechanisms or underlying forces that generate the time series), and
(2) forecasting time series (to predict the future values of the time-series variables).
Trend analysis consists of the following four major components:
1) Trend or long-term movements: displayed by a trend curve or a trend line.
2) Cyclic movements or cyclic variations: these refer to cycles, the long-term oscillations about a trend line or curve.
3) Seasonal movements or variations: these are systematic or calendar related. E.g.: events that recur annually, such as the sudden increase in sales of items before Christmas, or the observed increase in water consumption during summer.
4) Irregular or random movements. E.g.: floods, or personnel changes within companies.
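The trend component in question 13 is commonly estimated by smoothing out the shorter-term movements. The notes do not name a smoothing method, so the following is only a sketch under the assumption of a simple moving average of order `window`; the function name and the sample series are illustrative.

```python
def moving_average(values, window):
    """Return the simple moving average of order `window` over `values`.

    Each output point is the mean of `window` consecutive observations, so
    the result has len(values) - window + 1 points.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Illustrative series; a larger window gives a smoother trend estimate.
series = [3, 7, 2, 0, 4, 5, 9, 7, 2]
trend = moving_average(series, 3)
```

A larger window suppresses more of the cyclic, seasonal and irregular variation, at the cost of losing points at both ends of the series.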

Fig.: Time-series data of a stock price. The dashed curve shows the trend.

14. What are the basic measures for text retrieval?
a. Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e. correct responses).

$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$

b. Recall: the percentage of documents relevant to the query that were in fact retrieved.

$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$
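The two measures in question 14 map directly onto set operations. The following is a minimal sketch; the function name and the document identifiers are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the notes specify.

```python
def precision_recall(relevant, retrieved):
    """Compute (precision, recall) for one query from two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    # Assumption: define a measure as 0.0 when its denominator set is empty.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 relevant documents, 3 retrieved, 2 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d5"})
```

Here two of the three retrieved documents are relevant (precision 2/3), while only two of the four relevant documents were found (recall 1/2).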

15. What is an object cube?
In an object database, data generalization and multidimensional analysis are not applied to individual objects but to classes of objects. The attribute-oriented induction method developed for mining characteristics of relational databases can be extended to mine data characteristics in object databases. The generalization of the multidimensional attributes of a complex object class can be performed by examining each attribute (or dimension), generalizing each attribute to simple-valued data, and constructing a multidimensional data cube, called an object cube. Once an object cube is constructed, multidimensional analysis and data mining can be performed on it in a manner similar to that for relational data cubes.

16. What are the challenges faced in web data mining?
1) The web seems to be too large for effective data warehousing and data mining.
2) The complexity of the web is far greater than that of any text document collection. Web pages lack a unifying structure. They contain far more authoring style and content variations.
3) The web is a highly dynamic information source. News, stock market data, weather, etc. are updated regularly on the web.
4) The web serves a broad diversity of user communities. The Internet currently connects more than 100 million workstations. Users can easily get lost by groping in the "darkness" of the network.
5) Only a small portion of the information on the web is truly relevant or useful.

17. What are the data mining applications?
1) Intrusion detection
2) Association and correlation analysis
3) Analysis of stream data
4) Distributed data mining
5) Visualization and querying tools

18. What are the recent trends in data mining?
1) Application exploration: in financial analysis, telecommunications, biomedicine, countering terrorism, etc.
2) Scalable and interactive data mining methods: constraint-based mining
3) Integration of data mining systems with database, data warehouse, and web database systems
4) Standardization of data mining language
5) Visual data mining
6) New methods for mining complex types of data
7) Biological data mining: mining DNA and protein sequences, etc.
8) Data mining applications in software engineering
9) Web mining
10) Distributed data mining
11) Real-time or time-critical data mining
12) Graph mining
13) Privacy protection and information security in data mining
14) Multirelational and multidatabase data mining

19. What is web usage mining?
Besides mining web contents and web structures, another important task for web mining is web usage mining, which mines weblog records to discover user access patterns of web pages. This helps to identify high-potential customers for electronic commerce, improve web server performance, etc. A web server usually registers a weblog entry for every access of a web page. The entry includes the URL requested, the IP address from which the request originated, and a timestamp.

20. What is similarity-based retrieval in image databases?
a. Description-based retrieval systems: these build indices and perform object retrieval based on image descriptions, such as keywords, captions, size and time of creation.
b. Content-based retrieval systems: these support retrieval based on the image content, such as color histogram, texture, pattern, image topology, and the shape of the objects and their layouts and locations in the image.

21. What are the approaches used for similarity-based retrieval in image databases?
1) Color histogram-based signature: it is based on the color composition of the image. It does not contain any information about shape, image topology or texture.
2) Multifeature composed signature: the signature of the image includes multiple features, namely color histogram, shape, image topology and texture. The extracted features are stored as metadata, and images are indexed based on this metadata.
3) Wavelet-based signature: this approach uses the dominant wavelet coefficients of an image as its signature. Wavelets capture shape, texture, and image topology information in a single unified framework.
4) Wavelet-based signature with region-based granularity: the computation and comparison of signatures are at the granularity of regions.
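The color histogram-based signature in question 21 can be sketched in a few lines. The sketch assumes an image is given as a list of (r, g, b) tuples with channel values in 0..255, quantizes each channel into `bins` levels, and compares two signatures by histogram intersection. The notes do not prescribe a quantization scheme or a similarity measure, so both are assumptions, as are the function names.

```python
from collections import Counter


def color_histogram(pixels, bins=4):
    """Normalized color histogram: maps each quantized (r, g, b) bucket to its pixel fraction."""
    if not pixels:
        raise ValueError("image has no pixels")
    step = 256 // bins  # width of one quantization level per channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bucket: n / total for bucket, n in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: sum over buckets of the smaller of the two fractions."""
    return sum(min(p, h2.get(bucket, 0.0)) for bucket, p in h1.items())


# Tiny hypothetical "images" of four pixels each.
red = [(255, 0, 0)] * 4
blue = [(0, 0, 255)] * 4
mixed = [(255, 0, 0)] * 2 + [(0, 0, 255)] * 2

similarity = histogram_intersection(color_histogram(red), color_histogram(mixed))
```

Because the signature only counts colors, two images with the same palette but entirely different shapes or layouts score as identical, which is the limitation noted in item 1.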
