|Home | Articles | Forum | Glossary | Books|
3. Data Pre-processing
4. Data Warehousing and Online Analytical Processing
5. Data Cube Technology
6. Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
7. Advanced Pattern Mining
8. Classification: Basic Concepts
9. Classification: Advanced Methods
10. Cluster Analysis: Basic Concepts and Methods
11. Advanced Cluster Analysis
12. Outlier Detection
13. Data Mining Trends and Research Frontiers
Analyzing large amounts of data is a necessity. Even popular science books, like "super crunchers," give compelling cases where large amounts of data yield discoveries and intuitions that surprise even experts. Every enterprise benefits from collecting and analyzing its data: Hospitals can spot trends and anomalies in their patient records, search engines can do better ranking and ad placement, and environmental and public health agencies can spot patterns and abnormalities in their data. The list continues, with cyber security and computer network intrusion detection; monitoring of the energy consumption of household appliances; pattern analysis in bioinformatics and pharmaceutical data; financial and business intelligence data; spotting trends in blogs, Twitter, and anymore. Storage is inexpensive and getting even less so, as are data sensors. Thus, collecting and storing data is easier than ever before.
The problem then becomes how to analyze the data. This is exactly the focus of this guide. We give encyclopedic coverage of all the related methods, from the classic topics of clustering and classification, to database methods (e.g., association rules, data cubes) to more recent and advanced topics (e.g., SVD/PCA, wavelets, support vector machines).
The exposition is extremely accessible to beginners and advanced readers alike. The guide gives the fundamental material first and the more advanced material in follow-up sections. It also has numerous rhetorical questions, which I found extremely helpful for maintaining focus.
We have used the first previous versions of this guide in data mining courses at UCLA and plan to continue to do so with this guide. The new version has significant additions: Notably, it has more than 100 citations to works from 2008 onward, focusing on more recent material such as graphs and social networks, sensor networks, and outlier detection. This guide has a new section for visualization, has expanded outlier detection into a whole section, and has separate sections for advanced methods-for example, pattern mining with top-k patterns and more and clustering methods with bi-clustering and graph clustering.
Overall, it is an excellent guide on classic and modern data mining methods, and it is ideal not only for teaching but also as a reference guide.
We are deluged by data-scientific data, medical data, demographic data, financial data, and marketing data. People have no time to look at this data. Human attention has become the precious resource. So, we must find ways to automatically analyze the data, to automatically classify it, to automatically summarize it, to automatically discover and characterize trends in it, and to automatically flag anomalies. This is one of the most active and exciting areas of the database research community. Researchers in areas including statistics, visualization, artificial intelligence, and machine learning are contributing to this field. The breadth of the field makes it difficult to grasp the extraordinary progress over the last few decades.
Several years ago, this guide organized and presented Data Mining. It heralded a golden age of innovation in the field. This revision of their guide reflects that progress; more than half of the references and historical notes are to recent work. The field has matured with many new and improved algorithms, and has broadened to include many more data types: streams, sequences, graphs, time-series, geospatial, audio, images, and video. We are certainly not at the end of the golden age- indeed research and commercial interest in data mining continues to grow-but we are all fortunate to have this modern compendium.
The guide gives quick introductions to database and data mining concepts with particular emphasis on data analysis. It then covers in a section-by- section tour the concepts and techniques that underlie classification, prediction, association, and clustering. These topics are presented with examples, a tour of the best algorithms for each problem class, and with pragmatic rules of thumb about when to apply each technique.
The Socratic presentation style is both very readable and very informative. I certainly learned a lot from reading the first edition and got re-educated and updated in reading the second edition.
We have been leading contributors to data mining research. This is the text they use with their students to bring them up to speed on the field. The field is evolving very rapidly, but this guide is a quick way to learn the basic ideas, and to understand where the field is today. I found it very informative and stimulating, and believe you will too.
The computerization of our society has substantially enhanced our capabilities for both generating and collecting data from diverse sources. A tremendous amount of data has flooded almost every aspect of our lives. This explosive growth in stored or transient data has generated an urgent need for new techniques and automated tools that can intelligently assist us in transforming the vast amounts of data into useful information and knowledge. This has led to the generation of a promising and flourishing frontier in computer science called data mining, and its various applications. Data mining, also popularly referred to as knowledge discovery from data (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories, or data streams.
This guide explores the concepts and techniques of knowledge discovery and data mining. As a multidisciplinary field, data mining draws on work from areas including statistics, machine learning, pattern recognition, database technology, information retrieval, network science, knowledge-based systems, artificial intelligence, high-performance computing, and data visualization. We focus on issues relating to the feasibility, usefulness, effectiveness, and scalability of techniques for the discovery of patterns hidden in large data sets. As a result, this guide is not intended as an introduction to statistics, machine learning, database systems, or other such areas, although we do provide some background knowledge to facilitate the reader’s comprehension of their respective roles in data mining. Rather, the guide is a comprehensive introduction to data mining.
It is useful for computing science students, application developers, and business professionals, as well as researchers involved in any of the disciplines previously listed.
Data mining emerged during the late 1980s,made great strides during the 1990s, and continues to flourish into the new millennium. This guide presents an overall picture of the field, introducing interesting data mining techniques and systems and discussing applications and research directions. An important motivation for writing this guide was the need to build an organized framework for the study of data mining-a challenging task, owing to the extensive multidisciplinary nature of this fast-developing field. We hope that this guide will encourage people with different backgrounds and experiences to exchange their views regarding data mining so as to contribute toward the further promotion and shaping of this exciting and dynamic field.
Organization of this Guide
Since the publication of the first two editions of this guide, great progress has been made in the field of data mining. Many new data mining methodologies, systems, and applications have been developed, especially for handling new kinds of data, including information networks, graphs, complex structures, and data streams, as well as text, Web, multimedia, time-series, and spatiotemporal data. Such fast development and rich, new technical contents make it difficult to cover the full spectrum of the field in a single guide. Instead of continuously expanding the coverage of this guide, we have decided to cover the core material in sufficient scope and depth, and leave the handling of complex data types to a separate forthcoming guide.
The third edition substantially revises the first two editions of the guide, with numerous enhancements and a reorganization of the technical contents. The core technical material, which handles mining on general data types, is expanded and substantially enhanced. Several individual sections for topics from the second edition (e.g., data pre processing, frequent pattern mining, classification, and clustering) are now augmented and each split into two sections for this new edition. For these topics, one section encapsulates the basic concepts and techniques while the other presents advanced concepts and methods.
Sections from the second edition on mining complex data types (e.g., stream data, sequence data, graph-structured data, social network data, and multi-relational data, as well as text, Web, multimedia, and spatiotemporal data) are now reserved for a new guide that will be dedicated to advanced topics in data mining. Still, to support readers in learning such advanced topics, we have placed an electronic version of the relevant sections from the second edition onto the guide’s web site as companion material for the third edition.
The sections of the third edition are described briefly as follows, with emphasis on the new material.
Section 1 provides an introduction to the multidisciplinary field of data mining. It discusses the evolutionary path of information technology, which has led to the need for data mining, and the importance of its applications. It examines the data types to be mined, including relational, transactional, and data warehouse data, as well as complex data types such as time-series, sequences, data streams, spatiotemporal data, multimedia data, text data, graphs, social networks, and Web data. The section presents a general classification of data mining tasks, based on the kinds of knowledge to be mined, the kinds of technologies used, and the kinds of applications that are targeted. Finally, major challenges in the field are discussed.
Section 2 introduces the general data features. It first discusses data objects and attribute types and then introduces typical measures for basic statistical data descriptions. It overviews data visualization techniques for various kinds of data. In addition to methods of numeric data visualization, methods for visualizing text, tags, graphs, and multidimensional data are introduced. Section 2 also introduces ways to measure similarity and dissimilarity for various kinds of data.
Section 3 introduces techniques for data preprocessing. It first introduces the concept of data quality and then discusses methods for data cleaning, data integration, data reduction, data transformation, and data discretization.
Sections 4 and 5 provide a solid introduction to data warehouses, OLAP (online analytical processing), and data cube technology. Section 4 introduces the basic concepts, modeling, design architectures, and general implementations of data warehouses and OLAP, as well as the relationship between data warehousing and other data generalization methods. Section 5 takes an in-depth look at data cube technology, presenting a detailed study of methods of data cube computation, including Star-Cubing and high dimensional OLAP methods. Further explorations of data cube and OLAP technologies are discussed, such as sampling cubes, ranking cubes, prediction cubes, multi-featured cubes for complex analysis queries, and discovery-driven cube exploration.
Sections 6 and 7 present methods for mining frequent patterns, associations, and correlations in large data sets. Section 6 introduces fundamental concepts, such as market basket analysis, with many techniques for frequent item set mining presented in an organized way. These range from the basic a priori algorithm and its variations to more advanced methods that improve efficiency, including the frequent pattern growth approach, frequent pattern mining with vertical data format, and mining closed and max frequent item sets. The section also discusses pattern evaluation methods and introduces measures for mining correlated patterns. Section 7 is on advanced pattern mining methods. It discusses methods for pattern mining in multi level and multidimensional space, mining rare and negative patterns, mining colossal patterns and high-dimensional data, constraint-based pattern mining, and mining com pressed or approximate patterns. It also introduces methods for pattern exploration and application, including semantic annotation of frequent patterns.
Sections 8 and 9 describe methods for data classification. Due to the importance and diversity of classification methods, the contents are partitioned into two sections.
Section 8 introduces basic concepts and methods for classification, including decision tree induction, Bayes classification, and rule-based classification. It also discusses model evaluation and selection methods and methods for improving classification accuracy, including ensemble methods and how to handle imbalanced data. Section 9 discusses advanced methods for classification, including Bayesian belief networks, the neural network technique of back propagation, support vector machines, classification using frequent patterns, k-nearest-neighbor classifiers, case-based reasoning, genetic algorithms, rough set theory, and fuzzy set approaches. Additional topics include multiclass classification, semi-supervised classification, active learning, and transfer learning.
Cluster analysis forms the topic of Sections 10 and 11. Section 10 introduces the basic concepts and methods for data clustering, including an overview of basic cluster analysis methods, partitioning methods, hierarchical methods, density-based methods, and grid-based methods. It also introduces methods for the evaluation of clustering.
Section 11 discusses advanced methods for clustering, including probabilistic model based clustering, clustering high-dimensional data, clustering graph and network data, and clustering with constraints.
Section 12 is dedicated to outlier detection. It introduces the basic concepts of outliers and outlier analysis and discusses various outlier detection methods from the view of degree of supervision (i.e., supervised, semi-supervised, and unsupervised methods), as well as from the view of approaches (i.e., statistical methods, proximity-based methods, clustering-based methods, and classification-based methods). It also discusses methods for mining contextual and collective outliers, and for outlier detection in high-dimensional data.
Finally, in Section 13, we discuss trends, applications, and research frontiers in data mining. We briefly cover mining complex data types, including mining sequence data (e.g., time series, symbolic sequences, and biological sequences), mining graphs and networks, and mining spatial, multimedia, text, and Web data. In-depth treatment of data mining methods for such data is left to a guide on advanced topics in data mining, the writing of which is in progress. The section then moves ahead to cover other data mining methodologies, including statistical data mining, foundations of data mining, visual and audio data mining, as well as data mining applications. It discusses data mining for financial data analysis, for industries like retail and telecommunication, for use in science and engineering, and for intrusion detection and prevention. It also discusses the relationship between data mining and recommender systems. Because data mining is present in many aspects of daily life, we discuss issues regarding data mining and society, including ubiquitous and invisible data mining, as well as privacy, security, and the social impacts of data mining. We conclude our study by looking at data mining trends.
Throughout the text, italic font is used to emphasize terms that are defined, while bold font is used to highlight or summarize main ideas. Sans serif font is used for reserved words. Bold italic font is used to represent multidimensional quantities.
This guide has several strong features that set it apart from other texts on data mining.
It presents a very broad yet in-depth coverage of the principles of data mining. The sections are written to be as self-contained as possible, so they may be read in order of interest by the reader. Advanced sections offer a larger-scale view and may be considered optional for interested readers. All of the major methods of data mining are presented.
The guide presents important topics in data mining regarding multidimensional OLAP analysis, which is often overlooked or minimally treated in other data mining guides.
The guide also maintains web sites with a number of online resources to aid instructors, students, and professionals in the field. These are described further in the following.
Note to the Teacher/Instructor
This guide is designed to give a broad, yet detailed overview of the data mining field. It can be used to teach an introductory course on data mining at an advanced undergraduate level or at the first-year graduate level. Sample course syllabi are provided on the guide’s web sites in addition to extensive teaching resources such as lecture slides, instructors’ manuals, and reading lists.
Depending on the length of the instruction period, the background of students, and your interests, you may select subsets of sections to teach in various sequential orderings. For example, if you would like to give only a short introduction to students on data mining, you may follow the suggested sequence in the Index above. Notice that depending on the need, you can also omit some sections or subsections in a section if desired.
Depending on the length of the course and its technical scope, you may choose to selectively add more sections to this preliminary sequence. For example, instructors who are more interested in advanced classification methods may first add “ Section 9.
Classification: Advanced Methods”; those more interested in pattern mining may choose to include “ Section 7. Advanced Pattern Mining”; whereas those interested in OLAP and data cube technology may like to add “ Section 4. Data Warehousing and Online Analytical Processing” and “ Section 5. Data Cube Technology.” Alternatively, you may choose to teach the whole guide in a two-course sequence that covers all of the sections in the guide, plus, when time permits, some advanced topics such as graph and network mining. Material for such advanced topics may be selected from the companion sections available from the guide’s web site, accompanied with a set of selected research papers.
Individual sections in this guide can also be used for tutorials or for special topics in related courses, such as machine learning, pattern recognition, data warehousing, and intelligent data analysis.
Each section ends with a set of exercises, suitable as assigned homework. The exercises are either short questions that test basic mastery of the material covered, longer questions that require analytical thinking, or implementation projects. Some exercises can also be used as research discussion topics. The bibliographic notes at the end of each section can be used to find the research literature that contains the origin of the concepts and methods presented, in-depth treatment of related topics, and possible extensions.
Note to the Student/Learner
We hope that this guide will spark your interest in the young yet fast-evolving field of data mining. We have attempted to present the material in a clear manner, with careful explanation of the topics covered. Each section ends with a summary describing the main points. We have included many figures and illustrations throughout the text to make the guide more enjoyable and reader-friendly. Although this guide was designed as a “textbook”, we have tried to organize it so that it will also be useful to you as a reference guide or “handbook”, should you later decide to perform in-depth research in the related fields or pursue a career in data mining.
What do you need to know to read this guide? You should have some knowledge of the concepts and terminology associated with statistics, database systems, and machine learning. However, we do try to provide enough background of the basics, so that if you are not so familiar with these fields or your memory is a bit rusty, you will not have trouble following the discussions in the guide.
You should have some programming experience. In particular, you should be able to read pseudocode and understand simple data structures such as multidimensional arrays.
Note to the Professional
This guide was designed to cover a wide range of topics in the data mining field. As a result, it is an excellent “handbook” on the subject. Because each section is designed to be as standalone as possible, you can focus on the topics that most interest you. The guide can be used by application programmers and information service managers who wish to learn about the key ideas of data mining on their own. The guide would also be useful for technical data analysis staff in banking, insurance, medicine, and retailing industries who are interested in applying data mining solutions to their businesses. Moreover, the guide may serve as a comprehensive survey of the data mining field, which may also benefit researchers who would like to advance the state-of-the-art in data mining and extend the scope of data mining applications.
The techniques and algorithms presented are of practical utility. Rather than selecting algorithms that perform well on small "toy" data sets, the algorithms described in the guide are geared for the discovery of patterns and knowledge hidden in large, real data sets. Algorithms presented in the guide are illustrated in pseudocode. The pseudocode is similar to the C programming language, yet is designed so that it should be easy to follow by programmers unfamiliar with C or C++. If you wish to implement any of the algorithms, you should find the translation of our pseudocode into the programming language of your choice to be a fairly straightforward task.