Data mining is the relatively recent practice of using algorithms to distill patterns, summaries, and other specific forms of information from databases.
Category: Business, Economics, and Marketing.
Fields of Study: Data Analysis and Probability; Measurement; Number and Operations.
Advances in technology in the latter half of the twentieth century led to the accumulation of massive data sets in government, business, industry, and various sciences. Extracting useful information from these large-scale data sets required new mathematical and statistical methods to model data, account for error, and handle issues like missing data values and different variable scales or measures. Data mining uses tools from statistics, machine learning, computer science, and mathematics to extract information from data, especially from large databases. The concepts involved in data mining are drawn from many mathematical fields such as fuzzy sets, developed by mathematician and computer scientist Lotfi Zadeh, and genetic algorithms, based on the work of mathematicians such as Nils Barricelli. Because of the massive amounts of data processed, data mining relies heavily on computers, and mathematicians contribute to the development of new algorithms and hardware systems. For example, the Gfarm Grid File System was developed in the early twenty-first century to facilitate high-performance petascale-level computing and data mining.
Data mining has roots in three areas: classical statistics, artificial intelligence, and machine learning. In the late 1980s and early 1990s, companies that owned large databases of customer information, in particular credit card banks, wanted to explore the potential for learning more about their customers through their transactions. The term “data mining” had been used by statisticians since the 1960s as a pejorative term to describe the undisciplined exploration of data. It was also called “data dredging” and “fishing.” However, in the 1990s, researchers and practitioners from the field of machine learning began successfully applying their algorithms to these large databases in order to discover patterns that enable businesses to make better decisions and to develop hypotheses for future investigations.
Partly to avoid the negative connotations of the term “data mining,” researchers coined the term “knowledge discovery in databases” (KDD) to describe the entire process of finding useful patterns in databases, from the collection and preparation of the data, to the end product of communicating the results of the analyses to others. This term gained popularity in the machine learning and AI fields, but the term “data mining” is still used by statisticians. Those who use the term “KDD” refer to data mining as only the specific part of the KDD process where algorithms are applied to the data. The broader interpretation will be used in this discussion.
Software programs to implement data mining emerged in the 1990s and continue to evolve today. There are open-source programs (such as WEKA, http://www.cs.waikato.ac.nz/ml/weka) and packages in R (http://www.r-project.org).
TYPES OF PROBLEMS
The specific types of tasks that data mining addresses are typically broken into four types:
Predictive modeling is the building of models for a response variable for the main purpose of predicting the value of that response under new—or future—values of the predictor variables. Predictive modeling problems, in turn, are further broken into classification problems and regression problems, depending on the nature of the response variable being predicted. If the response variable is categorical (for example, whether a customer will switch telephone providers at the end of a subscription period or will stay with his or her current company), the problem is called a “classification problem.” If the response is quantitative (for example, the amount a customer will spend with the company in the next year), the problem is a “regression problem.” The term “regression” is used for these problems even when techniques other than regression are used to produce the predictions. Because there is a clear response variable, predictive modeling problems are also called “supervised problems” in machine learning.
Sometimes there is no response variable to predict, but an analyst may want to divide customers into segments based on a variety of variables. These segments may be meaningful to the analyst, but there is no response variable to predict in order to evaluate the accuracy of the segmentation. Such problems with no specified response variable are known as “unsupervised learning problems.”
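As a concrete illustration of these three situations, the short sketch below sets up a classification problem, a regression problem, and an unsupervised segmentation in Python. It assumes the open-source scikit-learn and pandas libraries, which the article does not name, and the data file and column names are hypothetical, chosen only to mirror the telephone-customer examples above.

import pandas as pd
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

# Hypothetical customer file with demographic predictors and two response variables
customers = pd.read_csv("customers.csv")
predictors = customers[["age", "income", "tenure_months"]]

# Classification: the response is categorical (will the customer switch providers?)
classifier = LogisticRegression(max_iter=1000).fit(predictors, customers["switched"])

# Regression: the response is quantitative (amount spent with the company next year)
regressor = LinearRegression().fit(predictors, customers["next_year_spend"])

# Unsupervised segmentation: no response variable; simply divide customers into groups
segments = KMeans(n_clusters=4, n_init=10).fit_predict(predictors)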
Summarization describes any numerical summaries of variables that are not necessarily used to model a response. For example, an analyst may want to examine the average age, income, and credit scores of a large batch of potential new customers without wanting to predict other behaviors. Any use of graphical displays for this purpose, especially those involving many variables at the same time, is called “visualization.”
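A brief sketch of summarization and visualization follows, again in Python with pandas and matplotlib as assumed tools (the article itself prescribes no particular software); the file and variable names are hypothetical.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical batch of potential new customers
prospects = pd.read_csv("prospects.csv")

# Summarization: numerical summaries of several variables, with no response to predict
print(prospects[["age", "income", "credit_score"]].describe())

# Visualization: a scatterplot matrix displaying several variables at the same time
pd.plotting.scatter_matrix(prospects[["age", "income", "credit_score"]], figsize=(6, 6))
plt.show()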
Data mining uses a variety of algorithms (computer code) based on mathematical equations to build models that describe the relationship between the response variable and a set of predictor variables. The algorithms are drawn from the statistics and machine learning literature, including such classical statistical techniques as linear regression, logistic regression, and time series analysis, as well as more recently developed techniques like classification and regression trees (ID3 or C4.5 in machine learning), neural networks, naïve Bayes, K-nearest neighbor techniques, and support vector machines.
One of the challenges of data mining is to choose which algorithm to use in a particular application. Unlike the practice in classical statistics, the data miner often builds multiple models on the same data set, using a new set of data (called the “test set”) to evaluate which model performs best.
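The sketch below illustrates this practice, fitting several of the model families named above to the same hypothetical data set and comparing them on a held-out test set; scikit-learn is assumed, since the article recommends no particular implementation.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

customers = pd.read_csv("customers.csv")              # hypothetical data set
X = customers[["age", "income", "tenure_months"]]
y = customers["switched"]

# Hold out a test set to evaluate which of the competing models performs best
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "classification tree": DecisionTreeClassifier(),
    "naive Bayes": GaussianNB(),
    "K-nearest neighbors": KNeighborsClassifier(),
    "support vector machine": SVC(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))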
Recent advances in data mining combine models into ensembles in an effort to collect the benefits of the constituent models. The two main ensemble methods are known as “bootstrap aggregation” (bagging) and “boosting.” Both methods build many (possibly hundreds or even thousands of) models on resampled versions of the same data set and take a (usually weighted) average (in the case of regression) or a majority vote (in the case of classification) to combine the models. The claim is that ensemble methods produce models with both less variance and less bias than individual models in a wide variety of applications. This is a current area of research in data mining.
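A minimal sketch of the two ensemble ideas follows, again assuming scikit-learn implementations (bagged classification trees and a boosted tree ensemble); fitting and test-set evaluation would proceed exactly as in the previous sketch.

from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

# Bootstrap aggregation (bagging): build many trees on bootstrap resamples of the data
# and combine them by majority vote (the default base model is a classification tree)
bagged_trees = BaggingClassifier(n_estimators=200)

# Boosting: build many small trees in sequence and combine them as a weighted sum
boosted_trees = GradientBoostingClassifier(n_estimators=200)

# Either ensemble is then fit and evaluated like any single model, for example:
# bagged_trees.fit(X_train, y_train); bagged_trees.predict(X_test)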
Data mining techniques are being applied everywhere there are large data sets. Important application areas include business and marketing, banking and credit, government, healthcare, and the sciences.
Privacy issues are among the public’s main concerns with respect to data mining, and some kinds of data mining and discovery are in fact illegal. There are federal and state privacy laws that protect the information of individuals. Nearly every Web site, credit card company, and other information-collecting organization has a publicly available privacy policy. Social networking sites, such as Facebook, have been criticized for sharing and selling information about subscribers for data mining purposes. In healthcare, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) was enacted to help protect individuals’ health information from being shared without their knowledge.
—Richard De Veaux
Berry, M. J. A., and G. Linoff. Data Mining Techniques: For Marketing, Sales, and Customer Support. Hoboken, NJ: Wiley, 1997.
De Veaux, R. D. “Data Mining: A View From Down in the Pit.” Stats 34 (2002).
———, and H. Edelstein. “Reducing Junk Mail Using Data Mining Techniques.” In Statistics: A Guide to the Unknown. 4th ed. Belmont, CA: Thomson, Brooks-Cole, 2006.
Piatetsky-Shapiro, Gregory. “Knowledge Discovery in Real Databases: A Workshop Report.” AI Magazine 11, no. 5 (January 1991).