Data mining is the relatively recent practice of using algorithms to distill patterns, summaries, and other specific forms of information from databases.
Category: Business, Economics, and Marketing.
Fields of Study: Data Analysis and Probability; Measurement; Number and Operations.
Advances in technology in the latter half of the twentieth century led to the accumulation of massive data sets in government, business, industry, and various sciences. Extracting useful information from these large-scale data sets required new mathematical and statistical methods to model data, account for error, and handle issues like missing data values and different variable scales or measures. Data mining uses tools from statistics, machine learning, computer science, and mathematics to extract information from data, especially from large databases. The concepts involved in data mining are drawn from many mathematical fields such as fuzzy sets, developed by mathematician and computer scientist Lotfi Zadeh, and genetic algorithms, based on the work of mathematicians such as Nils Barricelli. Because of the massive amounts of data processed, data mining relies heavily on computers, and mathematicians contribute to the development of new algorithms and hardware systems. For example, the Gfarm Grid File System was developed in the early twenty-first century to facilitate high-performance petascale-level computing and data mining.
Data mining has roots in three areas: classical statistics, artificial intelligence, and machine learning. In the late 1980s and early 1990s, companies that owned large databases of customer information, in particular credit card banks, wanted to explore the potential for learning more about their customers through their transactions. The term “data mining” had been used by statisticians since the 1960s as a pejorative term to describe the undisciplined exploration of data. It was also called “data dredging” and “fishing.” However, in the 1990s, researchers and practitioners from the field of machine learning began successfully applying their algorithms to these large databases in order to discover patterns that enable businesses to make better decisions and to develop hypotheses for future investigations.
Partly to avoid the negative connotations of the term “data mining,” researchers coined the term “knowledge discovery in databases” (KDD) to describe the entire process of finding useful patterns in databases, from the collection and preparation of the data, to the end product of communicating the results of the analyses to others. This term gained popularity in the machine learning and AI fields, but the term “data mining” is still used by statisticians. Those who use the term “KDD” refer to data mining as only the specific part of the KDD process where algorithms are applied to the data. The broader interpretation will be used in this discussion.
Software programs to implement data mining emerged in the 1990s and continue to evolve today. There are open-source programs (such as WEKA, http://www.cs.waikato.ac.nz/ml/weka) and packages in R (http://www.r-project.org).
TYPES OF PROBLEMS
The specific types of tasks that data mining addresses are typically broken into four types:
Predictive modeling is the building of models for a response variable for the main purpose of predicting the value of that response under new—or future—values of the predictor variables. Predictive modeling problems, in turn, are further broken into classification problems and regression problems, depending on the nature of the response variable being predicted. If the response variable is categorical (for example, whether a customer will switch telephone providers at the end of a subscription period or will stay with his or her current company), the problem is called a “classification problem.” If the response is quantitative (for example, the amount a customer will spend with the company in the next year), the problem is a “regression problem.” The term “regression” is used for these problems even when techniques other than regression are used to produce the predictions. Because there is a clear response variable, predictive modeling problems are also called “supervised problems” in machine learning.
Sometimes there is no response variable to predict, but an analyst may want to divide customers into segments based on a variety of variables. These segments may be meaningful to the analyst, but there is no response variable to predict in order to evaluate the accuracy of the segmentation. Such problems with no specified response variable are known as “unsupervised learning problems.”
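As a concrete illustration of these three situations, the short sketch below sets up a classification problem, a regression problem, and an unsupervised segmentation in Python. It assumes the open-source scikit-learn and pandas libraries, which the article does not name, and the data file and column names are hypothetical, chosen only to mirror the telephone-customer examples above.

import pandas as pd
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

# Hypothetical customer file with demographic predictors and two response variables
customers = pd.read_csv("customers.csv")
predictors = customers[["age", "income", "tenure_months"]]

# Classification: the response is categorical (will the customer switch providers?)
classifier = LogisticRegression(max_iter=1000).fit(predictors, customers["switched"])

# Regression: the response is quantitative (amount spent with the company next year)
regressor = LinearRegression().fit(predictors, customers["next_year_spend"])

# Unsupervised segmentation: no response variable; simply divide customers into groups
segments = KMeans(n_clusters=4, n_init=10).fit_predict(predictors)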
Summarization describes any numerical summaries of variables that are not necessarily used to model a response. For example, an analyst may want to examine the average age, income, and credit scores of a large batch of potential new customers without wanting to predict other behaviors. Any use of graphical displays for this purpose, especially those involving many variables at the same time, is called “visualization.”
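A brief sketch of summarization and visualization follows, again in Python with pandas and matplotlib as assumed tools (the article itself prescribes no particular software); the file and variable names are hypothetical.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical batch of potential new customers
prospects = pd.read_csv("prospects.csv")

# Summarization: numerical summaries of several variables, with no response to predict
print(prospects[["age", "income", "credit_score"]].describe())

# Visualization: a scatterplot matrix displaying several variables at the same time
pd.plotting.scatter_matrix(prospects[["age", "income", "credit_score"]], figsize=(6, 6))
plt.show()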
Data mining uses a variety of algorithms (computer code) based on mathematical equations to build models that describe the relationship between the response variable and a set of predictor variables. The algorithms are drawn from the statistics and machine learning literature, including such classical statistical techniques as linear regression, logistic regression, and time series analysis, as well as more recently developed techniques like classification and regression trees (ID3 or C4.5 in machine learning), neural networks, naïve Bayes, K-nearest neighbor techniques, and support vector machines.
One of the challenges of data mining is to choose which algorithm to use in a particular application. Unlike the practice in classical statistics, the data miner often builds multiple models on the same data set, using a new set of data (called the “test set”) to evaluate which model performs best.
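The sketch below illustrates this practice, fitting several of the model families named above to the same hypothetical data set and comparing them on a held-out test set; scikit-learn is assumed, since the article recommends no particular implementation.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

customers = pd.read_csv("customers.csv")              # hypothetical data set
X = customers[["age", "income", "tenure_months"]]
y = customers["switched"]

# Hold out a test set to evaluate which of the competing models performs best
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "classification tree": DecisionTreeClassifier(),
    "naive Bayes": GaussianNB(),
    "K-nearest neighbors": KNeighborsClassifier(),
    "support vector machine": SVC(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))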
Recent advances in data mining combine models into ensembles in an effort to collect the benefits of the constituent models. The two main ensemble methods are known as “bootstrap aggregation” (bagging) and “boosting.” Both methods build many (possibly hundreds or even thousands of) models on resampled versions of the same data set and take a (usually weighted) average (in the case of regression) or a majority vote (in the case of classification) to combine the models. The claim is that ensemble methods produce models with both less variance and less bias than individual models in a wide variety of applications. This is a current area of research in data mining.
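A minimal sketch of the two ensemble ideas follows, again assuming scikit-learn implementations (bagged classification trees and a boosted tree ensemble); fitting and test-set evaluation would proceed exactly as in the previous sketch.

from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

# Bootstrap aggregation (bagging): build many trees on bootstrap resamples of the data
# and combine them by majority vote (the default base model is a classification tree)
bagged_trees = BaggingClassifier(n_estimators=200)

# Boosting: build many small trees in sequence and combine them as a weighted sum
boosted_trees = GradientBoostingClassifier(n_estimators=200)

# Either ensemble is then fit and evaluated like any single model, for example:
# bagged_trees.fit(X_train, y_train); bagged_trees.predict(X_test)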
Data mining techniques are being applied everywhere there are large data sets. Important application areas include business and marketing, banking and credit, government, healthcare, and the sciences.
Privacy issues are among the public’s main concerns with respect to data mining, and some kinds of data mining and discovery are in fact illegal. There are federal and state privacy laws that protect the information of individuals. Nearly every Web site, credit card company, and other information-collecting organization has a publicly available privacy policy. Social networking sites, such as Facebook, have been criticized for sharing and selling information about subscribers for data mining purposes. In healthcare, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) was enacted to help protect individuals’ health information from being shared without their knowledge.
—Richard De Veaux
Berry, M. J. A., and G. Linoff. Data Mining Techniques: For Marketing, Sales, and Customer Support. Hoboken, NJ: Wiley, 1997.
De Veaux, R. D. “Data Mining: A View From Down in the Pit.” Stats 34 (2002).
———, and H. Edelstein. “Reducing Junk Mail Using Data Mining Techniques.” In Statistics: A Guide to the Unknown. 4th ed. Belmont, CA: Thomson, Brooks-Cole, 2006.
Piatetsky-Shapiro, Gregory. “Knowledge Discovery in Real Databases: A Workshop Report.” AI Magazine 11, no. 5 (January 1991).