Table of Contents
Declaration
Certificate
Abstract
Acknowledgements
List of Tables
List of Figures
CHAPTER 1: Concepts, Applications and Trends In Data Mining
1.1 KNOWLEDGE DATA DISCOVERY
1.2 DATA MINING PROCESS
1.3 DATA MINING TECHNIQUE
1.3.1 Anomaly Detection
1.3.2 Association
1.3.3 Classification
1.3.4 Clustering
1.3.5 Regression
1.3.6 Summarization
1.4 HISTORY OF DATA MINING
1.5 DATA MINING PROJECT CYCLE
1.6 HOW DOES DATA MINING DIFFER FROM THE STATISTICAL APPROACH
1.7 APPLICATION OF DATA MINING
1.8 REFERENCES
CHAPTER 2: Educational Data Mining
2.1 Introduction
2.1.1 Basic Concepts
2.1.2 Pre Processing in EDM
2.1.3 Data Mining in EDM
2.1.4 Post Processing of EDM
2.2 Main Applications of EDM methods
2.3 OPEN Issues in EDM
2.4 MOTIVATIONAL WORK
2.5 FACT ANALYSIS IN EDM
2.6 CONCLUSION
2.7 REFERENCES
CHAPTER 3: Classification Model of Prediction for Placement of Students
3.1 Abstract
3.2 INTRODUCTION
3.3 DATA MINING
3.3.1 Naïve Bayesian Classification
3.3.2 Multilayer Perceptron
3.3.3 C4.5 Tree
3.4 BACKGROUND AND RELATED WORK
3.5 DATA MINING PROCESS
3.5.1 Data Preparations
3.5.2 Data selection and Transformation
3.5.3 Implementation of Mining Model
3.5.4 Results
3.5.5 Discussion
3.6 Conclusions
3.7 References
CHAPTER 4: Data Mining Techniques in EDM for Predicting the Performance of Students
4.1 Abstract
4.2 Introduction
4.3 BACKGROUND AND RELATED WORK
4.4 Data Mining Techniques
4.4.1 OneR (Rule Learner)
4.4.2 C4.5
4.4.3 MultiLayer Perceptron
4.4.4 Nearest Neighbour algorithm
4.5 DATA MINING PROCESS
4.5.1 Data Preparations
4.5.2 Data selection and transformation
4.5.3 Implementation of Mining Model
4.5.4 Results and Discussion
4.6 Conclusions
4.7 References
CHAPTER 5: Analysis and Mining of Educational Data for Predicting the Performance of Students
5.1 Abstract
5.2 Introduction
5.3 BACKGROUND AND RELATED WORK
5.4 Data Mining Techniques
5.4.1 ID3 (Iterative Dichotomiser 3)
5.4.2 C4.5
5.4.3 Bagging
5.5 DATA MINING PROCESS
5.5.1 Data Preparations
5.5.2 Data selection and transformation
5.5.3 Implementation of Mining Model
5.5.4 Results and Discussion
5.6 Conclusions
5.7 References
CHAPTER 6: Evaluation of Teacher’s Performance: A Data Mining Approach
6.1 Abstract
6.2 Introduction
6.3 Data mining
6.3.1 Naïve Bayes Classification
6.3.2 ID3 (Iterative Dichotomiser 3)
6.3.3 CART
6.3.4 LAD Tree
6.4 BACKGROUND AND RELATED WORK
6.5 DATA MINING PROCESS
6.5.1 Data Preparations
6.5.2 Data selection and Transformation
6.5.3 Implementation of Mining Model
6.5.4 Results
6.5.5 Discussion
6.6 Conclusions
6.7 References
CHAPTER 7: CONCLUSIONS AND DIRECTIONS FOR FUTURE RESEARCH
7.1 SUMMARY OF RESULTS
7.2 DIRECTIONS FOR FUTURE RESEARCH
Appendices
Bibliography
PUBLISHED RESEARCH PAPERS
LIST OF FIGURES
Fig 1.1 Data mining in the process of knowledge discovery
Fig 1.2 Data mining as a step in the process of knowledge discovery
Fig 1.3 Role of data model
Fig 1.4 Data Modelling
Fig 1.5 Diagram of Apriori Algorithm
Fig 1.6 Rule Generation for Apriori Algorithm
Fig 1.7 Graphical Representation of Classification
Fig 1.8 Graphical Representation of ID3 algorithms
Fig 1.9 Graphical Representation of CART algorithms
Fig 1.10 Graphical Representation of Clustering
Fig 1.11 Graphical Representation of Cluster Analysis
Fig 1.12 Graphical Representation of Regression
Fig 1.13 Graphical Representation of Regression
Fig 1.14 Graphical Representation of Summarization
Fig 1.15 Data mining project life cycle
Fig 3.1 Results
Fig 3.2 Comparison between Parameters
Fig 3.3 Decision Tree
Fig 4.1 Visualization of the Students Categorization
Fig 4.2 Efficiency of different models
Fig 4.3 Comparison between Parameters
Fig 5.1 Visualization of the Students Categorization
Fig 5.2 Efficiency of different models
Fig 5.3 Comparison between Parameters
Fig 5.4 Importance (Chi-squared) plot for predictors
Fig 6.1 LAD Tree
LIST OF TABLES
Table 1.1: Transactional Data to demonstrate association rule
Table 1.2: Item sets for Apriori Principle
Table 1.3: Item sets and Support values for Apriori Principle
Table 1.4: Evolution of data mining
Table 2.1: Works about applying data mining techniques in educational systems
Table 2.2: Some specific educational data mining, statistics and visualization tools
Table 2.3: Interestingness measure of association
Table 3.1: Student related variables
Table 3.2: Result of tests and average rank
Table 3.3: Performance of the classifiers
Table 3.4: Training and Simulation Error
Table 3.5 Comparison of evaluation measures
Table 3.6: Confusion matrix
Table 4.1: Student related variables
Table 4.2: Performance of the classifiers
Table 4.3: Training and Simulation Error
Table 4.4: Comparison of evaluation measures
Table 4.5: Confusion matrix
Table 5.1: Student related variables
Table 5.2: Performance of the classifiers
Table 5.3: Training and Simulation Error
Table 5.4: Classifiers Accuracy for ID3
Table 5.5: Classifiers Accuracy for C4.5
Table 5.6: Classifiers Accuracy for Bagging
Table 6.1: Student related variables
Table 6.2: Result of tests and average rank
Table 6.3: Performance of the classifiers
DECLARATION
I hereby declare that this submission is my own work and that, to the best of my knowledge and belief, it contains no material previously published or written by another person nor material which to a substantial extent has been accepted for the award of any other degree or diploma of the university or other institute of higher learning, except where due acknowledgment has been made in the text.
Signature of Research Scholar
Name : Ajay Kumar Pal
Enrollment No. 1250106123
CERTIFICATE
Certified that Ajay Kumar Pal (enrollment no. 1250106123) has carried out the research work presented in this thesis entitled “Data mining applications: A comparative study for predicting student's performance” for the award of Doctor of Philosophy from Sai Nath University, Ranchi, under my supervision. The thesis embodies the results of original work and studies carried out by the student himself, and its contents do not form the basis for the award of any other degree to the candidate or to anybody else from this or any other University/Institution.
Signature
Dr. Saurabh Pal
Sr. Lecturer,
Dept. of MCA,
VBS Purvanchal University, Jaunpur
Date: 15/01/2014
Abstract
The primary objective of this research is to develop a process for accurately predicting useful information from the huge amount of available data using data mining techniques. Data mining is the process of finding trends, patterns and correlations between fields in large relational databases. It permits users to analyse and study data from multiple dimensions and approaches, classify it, and summarize identified data relationships. Our focus in this thesis is to use educational data mining procedures to better understand higher education system data, which can help in improving the efficiency and effectiveness of education. In order to achieve a decisional database, many steps need to be taken, and these are explained in this thesis. This work investigates the efficiency, scalability, maintenance and interoperability of data mining techniques. In this research work, the results obtained through different data mining techniques have been compiled and analysed using a variety of business intelligence tools. An effort has also been made to identify ways to apply this information efficiently in the daily decision processes of higher education in India.
Mining in an educational environment is called educational data mining. Han and Kamber describe data mining software that allows users to analyze data from different dimensions, categorize it, and summarize the relationships identified during the mining process. New methods can thus be used to discover knowledge from educational databases. We propose to explore the following areas in detail:
- To measure students' performance through multidimensional attributes, where each dimension and its associated factors are designed to predict student behaviour.
- To predict students' preferences for certain fields of study.
- To study the placement of students.
- To study the performance of teachers.
All data hold a great deal of hidden information, and the method used to process the data determines what kind of information it yields. In India, the education sector holds a large amount of data that can produce valuable information, which can in turn be used to increase the quality of education. However, educational institutions rarely apply any knowledge discovery process to these data. Information and communication technology has entered the education sector, making it possible to capture and compile information at low cost. A new research community, educational data mining (EDM), is now growing at the intersection of data mining and pedagogy.
The first chapter of the thesis elaborates on the knowledge discovery process, the concept of data mining, and the history and applications of data mining in various industries.
The second chapter presents a roadmap of research done in educational data mining across the various segments of the education sector. It is a literature review covering work done in the field of educational data mining, the various tools used in the field, and their successful applications.
The third chapter addresses the problem of finding effective advertisement methods for attracting students to institutions. Support, confidence and cosine analysis are used to identify the best methods among newspapers, hoardings, pamphlets, radio, advertisement vans and personal contact. It concludes that hoardings and personal contact are more effective than the other advertisement methods.
Chapter four addresses the prediction of student performance, using data sets of previous years' students. Such prediction can help in reducing the dropout ratio and in improving the performance of the institution. In this chapter the Bayesian classification method is used.
The fifth chapter covers the impact of the language of instruction on students' classroom performance. Support, confidence, added value, lift, correlation and conviction analysis are used, with Hindi, English and mixed-medium instruction considered for the analysis. The chapter concludes that mixed-medium classes attract the greatest interest.
Chapter six addresses the measurement of teacher quality and gives teachers feedback on the qualities they possess. A psychometric test is administered to collect students' opinions about a teacher, and eight groups are formed from different teaching qualities. A teacher's score for a group is obtained by subtracting the standard deviation from the average score of that group. The chapter provides a model for analysing these quality scores, which can be used to select the most suitable teacher for a given duty.
Chapter seven concludes the thesis and summarizes the findings of the research work.
ACKNOWLEDGEMENTS
The present research work would not have been completed without the help of many generous and inspiring people, and at this point I want to thank the following individuals and recognize their contributions. I would like to express my deepest appreciation to my supervisor, reverend Dr. Saurabh Pal, not only for giving me valuable feedback, guidance, support and motivation throughout my academic endeavours, but also for setting the highest of standards for my academic work. It is under his encouragement, direction and guidance that this research work could be completed. Like many other students I had doubts regarding the completion of this research work, but under the guidance and direction of Dr. Saurabh Pal, a great scholar and successful teacher of computer science, the subject became easy and understandable for me, and now the work is complete.
In completing a great work of this kind, the blessings of one's parents are of special importance. The blessings of my father Jaynath Pal will always remain with me; he gave me mental freedom and encouragement time and again, and his unique contribution was in correcting errors, improving sentence construction, and giving the letters their proper shape. Likewise, I feel highly obliged and thankful to my reverend mother Shakuntala Pal, my brothers Mr. Ashish Pal and Saurabha Pal, and my wife Mrs. Sarika Pal, who have extended their utmost cooperation and encouragement to me on several occasions.
Thereafter, I cannot forget my sincere obligations and gratitude to Dr. Saurabh Pal of the MCA Department of V B S Purvanchal University, Jaunpur, and Mr. Umesh Kumar Pandey of the BCA Department of the Pt. Sukhraj Raghunathi Institute of Education and Technology, Ranjeetpur Chilbila, Pratapgarh (UP). They have extended their valuable guidance and utmost cordial cooperation to me at all times. I am particularly thankful to Er. Chandan Kumar and Mr. Ravi Prakash Paddey of the CSE Department, IET Dr. R M L Awadh University, Faizabad, who contributed to the writing of this thesis with their guidance, direction and blessings. In the same context, I feel duty-bound to offer my sincere gratefulness to Er. Abhishek Bajaj, Mr. Sanjeet Pandey and Er. Vineet Singh, Assistant Professors, IET Dr. R M L Awadh University, Faizabad, who have also encouraged and cooperated with me in completing this research work in countless ways.
I express my special obligation to Prof. L. K. Singh, Director, all the members of the teaching staff, and my students at IET Dr. R M L Awadh University, Faizabad (UP). I acknowledge sincerely that without their cooperation I would probably have been forced to face some complicated problems.
I cannot neglect or forget my faithful friends, who contributed their knowledge and dedication according to their capacities from the very beginning of this research work till its end.
In the end, I feel highly thankful to all those known and unknown persons who have directly or indirectly cooperated with me and contributed to completing this research work.
Ajay Kumar Pal
CHAPTER – 1 Concepts, Applications and Trends in Data Mining
1.1 KNOWLEDGE DATA DISCOVERY
Different authors define data mining and knowledge discovery differently. Goebel and Gruenwald [1] define knowledge discovery in databases (KDD) as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” and data mining as “the extraction of patterns or models from observed data.” Berzal et al. [2] define KDD as “the non-trivial extraction of potentially useful information from a large volume of data where the information is implicit (although previously unknown).” Goebel and Gruenwald's model of KDD, paraphrased below, shows data mining as one step in the overall KDD process:
1. Identify and develop an understanding of the application domain.
2. Select the data set to be studied.
3. Select complementary data sets. Integrate the data sets.
4. Code the data. Clean the data of duplicates and errors. Transform the data.
5. Develop models and build hypotheses.
6. Select appropriate data mining algorithms.
7. Interpret results. View results using appropriate visualization tools.
8. Test results in terms of simple proportions and complex predictions.
9. Manage the discovered knowledge.
Although data mining is only one part of the KDD process, data mining techniques provide the algorithms that fuel it. The KDD process shown above is iterative and never-ending, and data mining is its essence: whenever data mining is discussed, it is understood that the wider KDD process is being used. In this work, we will focus on data mining algorithms.
Adriaans and Zantinge [3] emphasize that the KDD community reserves the term data mining for the discovery stage of the KDD process. Their definition of KDD is as follows: “... the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data.” Similarly, Berzal et al. [2] define data mining as “a generic term which covers research results, techniques and tools used to extract useful information from large databases.” Also, Adriaans and Zantinge [3] point out that KDD draws on techniques from the fields of expert systems, machine learning, statistics, visualization, and database technology.
In recent years the amount of data collected and stored by electronic devices has risen tremendously. These data collections are far too large to be examined manually, and even methods for automatic data analysis based on classical statistics and machine learning often face problems when processing large, dynamic collections of complex objects. To analyze such large amounts of collected information, the area of knowledge discovery in databases provides techniques which extract interesting patterns in a reasonable amount of time. KDD therefore employs methods at the crossing point of machine learning, statistics and database systems [4].
To analyze these huge amounts of data, the interdisciplinary field of knowledge discovery in databases (KDD) has emerged. The core step of KDD is called data mining. Data mining applies efficient algorithms to extract interesting patterns and regularities from the data [4].
The term knowledge discovery itself appeared around 1989, defined as follows: “knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data.” The term pattern is used here in a broad sense; it may embrace relationships, correlations, trends, descriptors, rare events, etc. [5]
Krzysztof et al. presented figure 1.1 below, which includes the main functional phases of knowledge discovery [3].
[Figure not included in this excerpt]
Fig 1.1 Data mining in the process of knowledge discovery [10]
According to Krzysztof et al., the KDD process is arranged into a stream of steps:
- Understanding the domain in which the discovery will be carried out.
- Forming the data set, cleaning it, and warehousing it.
- Extracting patterns; this is the essence of data mining.
- Post-processing the discovered knowledge.
- Putting the results of knowledge discovery to use.
Han and Kamber depicted the knowledge discovery process in figure 1.2 [6] and explained that it consists of an iterative sequence of the following steps [6]:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis tasks are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation)
5. Data Mining (in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
Steps 1 to 4 are different forms of data pre-processing, where data are prepared for mining; a minimal sketch of these steps is given below figure 1.2.
Fig 1.2 Data mining as a step in the process of knowledge discovery [6]
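To make pre-processing steps 1 to 4 concrete, here is a minimal sketch in Python with pandas. It is an illustration only: the student records, column names and binning thresholds are hypothetical, not taken from the thesis data.

```python
import pandas as pd

# Hypothetical raw records from two sources (stand-ins for step 2's
# "multiple data sources"); all values are invented for illustration.
students = pd.DataFrame({
    "student_id": [1, 2, 2, 3, 4],
    "marks":      [67.0, 82.0, 82.0, None, 120.0],  # noisy: NaN, >100
})
attendance = pd.DataFrame({
    "student_id":     [1, 2, 3, 4],
    "attendance_pct": [88, 95, 60, 72],
})

# Step 1 - Data cleaning: drop duplicates and missing marks, cap
# impossible values.
students = students.drop_duplicates().dropna(subset=["marks"])
students["marks"] = students["marks"].clip(upper=100)

# Step 2 - Data integration: combine the two sources on a shared key.
data = students.merge(attendance, on="student_id")

# Step 3 - Data selection: keep only attributes relevant to the task.
data = data[["student_id", "marks", "attendance_pct"]]

# Step 4 - Data transformation: consolidate marks into categorical
# performance levels suitable for mining.
data["performance"] = pd.cut(data["marks"], bins=[0, 40, 60, 80, 100],
                             labels=["poor", "average", "good", "excellent"])

print(data)  # the mining step (step 5) starts from this prepared table
```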
Han and Kamber present the architecture of a typical data mining system, which has the following major components [6]:
1. Database, data warehouse, World Wide Web or other information repository (data cleaning and data integration techniques may be performed on the data)
2. Database or data warehouse server (responsible for fetching the relevant data based on the user's data mining request)
3. Knowledge base (used to guide the search or evaluate the interestingness of resulting patterns)
4. Data mining engine (a set of functional modules for tasks such as characterization, association, classification, etc.)
5. Pattern evaluation module (employs interestingness measures)
6. User interface (allows the user to interact with the system by specifying a data mining query or task, and provides information to help focus the search)
1.2 DATA MINING PROCESS
Data mining is a process of discovering and identifying numerous models, summaries, relations, structures and derived values from a given collection of data. It is essential to realize that determining or approximating dependencies from data is only one part of the general experimental process. The general experimental procedure applied to data-mining problems involves the following steps:
1. Define the problem and formulate the hypothesis.
2. Collect the data.
3. Pre-process the data.
4. Estimate the data model.
5. Interpret the data model and draw conclusions.
A data warehouse comprises five types of data, sorted according to their time-dependence on the data sources: old detail data, current detail data, lightly summarized data, highly summarized data and metadata. To produce these five types of data in the warehouse, fundamental data transformations are applied. There are four main types of transformations, each with its own features:
1. Simple transformations of data – These transformations are the building blocks of all other, more complex transformations. They include manipulations of data focused on one field at a time, without taking into account values in related fields.
2. Cleansing and validating the data - These transformations ensure regular configuration and use of a field and its related groups of fields. This class of transformations also includes checks for valid values in a particular field.
3. Data integration – This involves collecting data from one or more operational data sources and mapping them, field by field, onto a new data structure in the data warehouse.
4. Aggregation and summarization of data – This is the process of reducing the many instances of data found in the operational environment to far fewer instances in the warehouse environment, achieved by aggregation and summarization. Summarization is a simple accumulation of values along one or more data dimensions; aggregation refers to the addition of different business elements into a common total. A short sketch below illustrates the difference.
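To distinguish the two operations concretely, here is a minimal pandas sketch; the table and column names are hypothetical, invented for illustration.

```python
import pandas as pd

# Hypothetical operational data: one row per fee transaction.
fees = pd.DataFrame({
    "department": ["CS", "CS", "MCA", "MCA", "MCA"],
    "year":       [2013, 2014, 2013, 2013, 2014],
    "tuition":    [500, 550, 400, 420, 430],
    "exam_fee":   [50, 55, 40, 42, 43],
})

# Summarization: accumulate one value along data dimensions
# (total tuition per department per year -> fewer, coarser rows).
summary = fees.groupby(["department", "year"])["tuition"].sum()

# Aggregation: add different business elements into a common total
# (tuition and exam fee combined into one revenue figure).
fees["revenue"] = fees["tuition"] + fees["exam_fee"]

print(summary)
print(fees)
```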
Although the implementation of a data warehouse is a complicated task, it follows a few basic steps. A three-stage data-warehousing development process is briefly explained through the following simple steps:
Modelling – Taking the time to recognize business processes, the information requirements of those processes, and the decisions that are presently made within them. Modelling is the process of making a data model for an information system by understanding business processes, reviewing the information requirements of those processes, and applying formal data modelling techniques.
[Figure not included in this excerpt]
Fig 1.3 Role of data model [Matthew, 1996 [7]]
[Figure not included in this excerpt]
Fig 1.4 Data Modelling [Matthew, 1996 [7]]
A data model offers a framework for the data used within information systems by giving it a specific description and arrangement. If a data model is used consistently across systems, compatibility of data can be attained; when the same data structures are used to collect and access data, diverse applications can share data effortlessly.
Building – Creating requirements for tools that support the decision support systems required for the targeted business process; creating a data model to help define the information requirements; and resolving the problem into data specifications and a data store, such as a data mart or a more comprehensive data warehouse.
Deploying – Deciding on and implementing the type of data to be warehoused and the various business intelligence tools to be employed, beginning with the training of users.
1.3 DATA MINING TECHNIQUE
Various data mining techniques have been used for different data mining applications. Data mining involves six common classes of tasks [8]:
1.3.1 Anomaly detection
Anomaly detection, also referred to as outlier detection [9], is the detection of patterns in a given data set that do not conform to established normal behaviour [10].
The detected patterns are called anomalies. They frequently translate into critical, actionable information in numerous application domains. Three broad groups of anomaly detection techniques exist.
Unsupervised anomaly detection techniques identify anomalies in an unlabelled test data set under the hypothesis that most of the instances in the data set are normal.
Supervised anomaly detection techniques detect anomalies in a data set whose instances are labelled as “normal” or “abnormal”; this requires training a classifier.
Semi-supervised anomaly detection techniques create a model representing normal behaviour from a “normal” training data set, and then test the likelihood that a test instance is “abnormal” using the learnt model.
Anomaly detection is used in several areas such as intrusion detection, event detection, fraud detection, fault detection and system health monitoring. It is also regularly used in pre-processing to eliminate anomalous data from a dataset.
A few anomaly detection techniques:
1) Distance-based techniques such as k-nearest neighbour. The k-nearest neighbour algorithm (k-NN) is a technique for classifying objects based on the closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. It is among the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbours, being assigned to the class most common among its k nearest neighbours (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbour. (A short k-NN sketch appears after this list.)
2) One Class Support Vector Machines.
Support vector machines are supervised machine learning models that analyse data and recognize patterns, used for classification and regression analysis. The elementary SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. (A one-class SVM sketch also follows the list.)
3) Replicator Neural Networks
4) Control charts such as the real-time contrasts chart.
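As promised above, here is a minimal from-scratch sketch of the k-NN majority-vote rule in Python. The two-dimensional training points and labels are made up for illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by a majority vote among its k nearest
    training points; `train` is a list of ((x, y), label) pairs."""
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Vote among the labels of the k closest points.
    k_labels = [label for _, label in by_distance[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical 2-D training data: two classes of points.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((3.0, 3.0), "B"), ((3.2, 2.9), "B"), ((2.8, 3.1), "B")]

print(knn_classify(train, (1.1, 1.0), k=3))  # -> A (majority vote)
print(knn_classify(train, (3.0, 3.2), k=1))  # k = 1: single nearest neighbour
```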
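And here is a one-class SVM sketch using scikit-learn's OneClassSVM. This is an assumed minimal workflow on synthetic data, not the thesis's own experiment: the model learns a boundary around “normal” training points and flags far-away test points as anomalies.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical "normal" training data: points clustered around (0, 0).
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu bounds the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(X_train)

# One typical point and one far-away (anomalous) point.
X_test = np.array([[0.1, -0.2], [6.0, 6.0]])
print(model.predict(X_test))  # +1 = normal, -1 = anomaly
```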
1.3.2 Association
Association is one of the best-known data mining techniques. In association, a pattern is discovered based on a relationship between a particular item and other items in the same transaction. Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time.
For example, an association rule would be: “If a customer buys Pencil and Eraser, he is 80% likely to also purchase Notebook.”
Association rules are if/then statements that help uncover relationships between apparently unrelated data in a relational database or other information repository. An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an item found in the data. A consequent is an item that is found in combination with the antecedent.
Association rules are created by analysing data for regular if/then patterns and using the criteria of support and confidence to identify the most important relationships. Support is an indication of how frequently the items appear in the database. Confidence indicates how often the if/then statement has been found to be true.
In data mining, association rules are useful for analysing and predicting customer behaviour. They play an important part in shopping basket data analysis, product clustering, and catalogue design and store layout.
Association rule mining, one of the most important and well-researched techniques of data mining, was first introduced for discovering relationships between sets of items in large databases [11]. It aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in transaction databases or other data repositories. Association rules are widely used in various areas such as telecommunication networks, market and risk management, and inventory control. Various association mining techniques and algorithms are briefly introduced and compared later. Association rule mining finds the association rules that satisfy a predefined minimum support and confidence in a given database. The problem is usually decomposed into two sub-problems: the first is to find those item sets whose occurrences exceed a predefined threshold in the database; such item sets are called frequent or large item sets. The second is to generate association rules from those large item sets under the constraint of minimal confidence.
Programmers use association rules to build programs capable of machine learning. Machine learning is a type of artificial intelligence (AI) that seeks to build programs with the ability to become more efficient without being explicitly programmed.
Rakesh Agrawal [11] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, a rule found in the sales data of a supermarket might indicate that if a customer buys Pencil and Eraser together, he is likely to also buy Notebook. Such information can be used to identify which products customers frequently purchase together, and businesses can run corresponding marketing campaigns and activities, such as promotional pricing or product placements, to sell more products and make more profit.
Association rules are employed today in many application areas including Web usage mining, intrusion detection and bioinformatics. Association rule learning typically does not consider the order of items either within a transaction or across transactions. Applications include discovering affinities for market basket analysis and cross-marketing, catalogue design, loss-leader analysis, store layout and customer segmentation based on buying patterns.
To illustrate the concepts, let us take an example of stationery transactions.
The following transactions are used to find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
Table-1.1 Transactional Data to demonstrate association rule
[Table not included in this excerpt]
Input: a collection of transactions. Output: rules to predict the occurrence of any item(s) from the occurrence of other items in a transaction.
1) Item set – a collection of one or more items, e.g. {Notebook, Pencil, Eraser}. A k-item set is an item set that contains k items.
2) Support count (σ) – the frequency of occurrence of an item set, e.g. σ({Notebook, Pencil, Eraser}) = 2.
3) Support – the fraction of transactions that contain an item set, e.g. s({Notebook, Pencil, Eraser}) = 2/5.
4) Association rule – an implication expression of the form X → Y, where X and Y are item sets, e.g. {Notebook, Eraser} → {Ruler}.
Rule Evaluation Metrics
Support - Fraction of transactions that contain both X and Y
Confidence - Measures how often items in Y appear in transactions that contain X
Support(X → Y) = σ(X ∪ Y) / |T|, where |T| is the total number of transactions; Confidence(X → Y) = σ(X ∪ Y) / σ(X).
Confidence can be interpreted as an estimate of the conditional probability P(Y | X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS. [12]
Example – {Notebook, Eraser} → {Ruler}
Support (s) = σ({Notebook, Eraser, Ruler}) / |total transactions| = 2/5 = 0.4
Confidence (c) = σ({Notebook, Eraser, Ruler}) / σ({Notebook, Eraser}) = 2/3 ≈ 0.67
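Since Table 1.1 is not included in this excerpt, the sketch below uses a hypothetical five-transaction stationery data set, constructed only to be consistent with the counts quoted above (σ({Notebook, Eraser, Ruler}) = 2 and σ({Notebook, Eraser}) = 3 out of 5 transactions); it then computes the support and confidence of {Notebook, Eraser} → {Ruler}.

```python
# Hypothetical transactions, invented only to match the counts quoted
# in the text (the original Table 1.1 is not included in this excerpt).
transactions = [
    {"Notebook", "Pencil", "Eraser"},
    {"Notebook", "Pencil", "Eraser", "Ruler"},
    {"Notebook", "Eraser", "Ruler"},
    {"Pencil", "Ruler"},
    {"Notebook", "Pencil"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Notebook", "Eraser", "Ruler"}))       # 0.4
print(confidence({"Notebook", "Eraser"}, {"Ruler"}))  # 0.666...
```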
The binary partitions of this item set yield six candidate rules: {Notebook, Eraser} → {Ruler}, {Notebook, Ruler} → {Eraser}, {Eraser, Ruler} → {Notebook}, {Notebook} → {Eraser, Ruler}, {Eraser} → {Notebook, Ruler} and {Ruler} → {Notebook, Eraser}.
Observations:
a) All the above rules are binary partitions of the same item set: {Notebook, Eraser, Ruler}
b) Rules originating from the same item set have identical support but can have different confidence
c) Hence, the support and confidence requirements can be separated.
Given a set of transactions, where each transaction is a set of literals (called items), an association rule is an expression of the form X → Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions of the database which contain X tend to contain Y. An example of an association rule is: “67% of transactions that contain Notebook and Eraser also contain Ruler; 40% of all transactions contain all three items.” Here 67% is called the confidence of the rule, and 40% the support of the rule. Both the left-hand side and the right-hand side of the rule can be sets of items. The problem is to find all association rules that satisfy user-specified minimum support and minimum confidence constraints.
This example is extremely small. In practical applications, a rule needs a support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
The lift of a rule is defined as
lift(X → Y) = supp(X ∪ Y) / (supp(X) × supp(Y)),
i.e., the ratio of the observed support to that expected if X and Y were independent.
For the rule {Notebook, Eraser} → {Ruler}: lift = 0.4 / (0.6 × 0.6) ≈ 1.11.
Lift is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random choice targeting model.
The conviction of a rule is defined as
conv(X → Y) = (1 − supp(Y)) / (1 − conf(X → Y)).
For the rule {Notebook, Eraser} → {Ruler}: conviction = (1 − 0.6) / (1 − 0.67) = 1.212. Conviction can be interpreted as the ratio of the expected frequency that X occurs without Y (that is, the frequency with which the rule would make an incorrect prediction) if X and Y were independent, divided by the observed frequency of incorrect predictions. In this example, the conviction value of 1.212 shows that the rule {Notebook, Eraser} → {Ruler} would be incorrect about 20% more often (1.2 times as often) if the association between X and Y were purely random chance.
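Continuing the earlier sketch (same hypothetical transactions and the `support` and `confidence` functions defined there), lift and conviction can be computed directly; under those assumed transactions the values match the ones derived above.

```python
# Continues the previous sketch: reuses `support` and `confidence`
# (and the same hypothetical `transactions`).

def lift(lhs, rhs):
    """Observed support over the support expected under independence."""
    return support(set(lhs) | set(rhs)) / (support(lhs) * support(rhs))

def conviction(lhs, rhs):
    """(1 - supp(rhs)) / (1 - conf(lhs -> rhs))."""
    return (1 - support(rhs)) / (1 - confidence(lhs, rhs))

print(lift({"Notebook", "Eraser"}, {"Ruler"}))        # 0.4/(0.6*0.6) ≈ 1.11
print(conviction({"Notebook", "Eraser"}, {"Ruler"}))  # (1-0.6)/(1-2/3) = 1.2
# Note: the text's 1.212 comes from using the rounded confidence 0.67.
```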
1.3.2.1 Apriori Algorithm
Apriori is a classic algorithm for learning association rules. It is designed to operate on databases containing transactions. The Apriori algorithm breaks the computation of support and confidence into two separate tasks:
1) Frequent item sets are generated, i.e. those item sets which meet the minimum support criterion.
2) Rules are then generated from the frequent item sets by evaluating the confidence measure.
Let’s visualize the approach diagrammatically as shown below:
[Figure not included in this excerpt]
Fig 1.5 Diagram of Apriori Algorithm [13]
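As a companion to the diagram, here is a compact sketch of the two Apriori phases in Python: level-wise frequent item-set generation under a minimum support, followed by rule generation under a minimum confidence. It is a minimal assumed implementation for illustration, run on the same hypothetical stationery transactions as before.

```python
from itertools import combinations

# Same hypothetical transactions as in the earlier sketches.
transactions = [
    {"Notebook", "Pencil", "Eraser"},
    {"Notebook", "Pencil", "Eraser", "Ruler"},
    {"Notebook", "Eraser", "Ruler"},
    {"Pencil", "Ruler"},
    {"Notebook", "Pencil"},
]

def apriori(transactions, min_support=0.4):
    """Phase 1: level-wise search for all frequent item sets."""
    n = len(transactions)
    supp = lambda s: sum(s <= t for t in transactions) / n
    items = {item for t in transactions for item in t}
    level = [frozenset([i]) for i in items if supp(frozenset([i])) >= min_support]
    frequent, k = {}, 1
    while level:
        frequent.update({s: supp(s) for s in level})
        k += 1
        # Join step: unions of two frequent (k-1)-sets that form a k-set;
        # prune step: every (k-1)-subset must itself be frequent (the
        # Apriori property), and the candidate must meet min_support.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
                 and supp(c) >= min_support]
    return frequent

def rules(frequent, min_confidence=0.6):
    """Phase 2: generate rules lhs -> rhs from each frequent item set."""
    out = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                if s / frequent[lhs] >= min_confidence:
                    out.append((set(lhs), set(itemset - lhs), s / frequent[lhs]))
    return out

for lhs, rhs, conf in rules(apriori(transactions)):
    print(lhs, "->", rhs, f"confidence={conf:.2f}")
```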
Interestingness measures can be classified into two categories: subjective and objective. A subjective measure often involves heuristics and domain expertise to eliminate uninteresting rules, while objective measures are domain-independent. Support and confidence are good examples of objective measures. Objective measures can be either symmetric binary or asymmetric binary.
[...]