What Are Data Mining and Data Analytics?
Data mining is the process of discovering hidden patterns in large-scale data, where:
Patterns refer to inherent relationships and/or dependencies in the data, and
Large-scale data is typically stored in a database environment.
Data mining is also frequently referred to as knowledge discovery in databases (KDD). Data analytics is the process of transforming raw data into knowledge and insight for making better decisions.
Data mining and analytics are closely related to the areas of databases, artificial intelligence (AI), statistics, and information retrieval. However, there are considerable differences between data mining and these fields.
Databases - this field focuses on data storage and access technology, while data mining focuses on data analysis and knowledge discovery.
Artificial intelligence - AI and data mining techniques overlap, notably in machine learning. However, AI techniques are not necessarily data-oriented (e.g., expert systems).
Statistics - statistical science traditionally assumes data is scarce; it focuses on numeric data and a parametric approach (e.g., "assume the data follows a normal distribution"). Conversely, data mining assumes data is abundant; it deals with various data types and focuses on efficient algorithms for large-scale data.
Information retrieval (IR) - this field concerns finding materials (e.g., documents) of an unstructured nature (e.g., text) that satisfy an information need; it is closely related to text and web mining. A search engine is a typical example of an IR technique.
Customer ID   Age   Income    Year-of-Education   Purchase-Amount   Favorite
1             35    62,000    10                  429               YES
2             25    54,000    8                   314               YES
3             36    100,000   20                  659               YES
4             65    89,000    11                  551               YES
5             29    87,000    5                   483               YES
6             69    48,000    11                  463               YES
We define a few terms:
Attribute (aka a variable, field, or column):
Numeric attribute (aka a continuous or real attribute) - mathematical operations (e.g., addition, multiplication) can be applied to the values of this type of attribute.
Categorical attribute (aka a nominal attribute) - mathematical operations cannot be meaningfully applied to the values of such an attribute, even if the values appear in a numeric format (e.g., social security number, credit card number).
Record (aka an observation, instance, or row).
Dataset (aka a relation or table) - a set of data with attributes in columns and records in rows.
For the above table, we say:
This is a dataset.
It has 6 attributes (variables, fields, or columns): "Customer ID", "Age", "Income", "Year-of-Education", "Purchase-Amount", and "Favorite".
"Age", "Income", "Year-of-Education", and "Purchase-Amount" are numeric attributes. "Favorite" is a categorical attribute, as is "Customer ID", which is an identifier even though its values look numeric.
There are 20 records (observations, instances, or rows).
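As a rough illustration of these terms, the six rows shown in the table can be held in Python as a list of records, one dictionary per row. The names used here (COLUMNS, ROWS, records, NUMERIC, CATEGORICAL, mean) are chosen for this sketch only.

```python
# Each record (row) is a dict mapping attribute (column) names to values.
COLUMNS = ["Customer ID", "Age", "Income", "Year-of-Education",
           "Purchase-Amount", "Favorite"]

# The six rows shown in the table above.
ROWS = [
    (1, 35, 62000, 10, 429, "YES"),
    (2, 25, 54000, 8, 314, "YES"),
    (3, 36, 100000, 20, 659, "YES"),
    (4, 65, 89000, 11, 551, "YES"),
    (5, 29, 87000, 5, 483, "YES"),
    (6, 69, 48000, 11, 463, "YES"),
]

records = [dict(zip(COLUMNS, row)) for row in ROWS]

# Attribute types follow the definitions in the text, not the Python type:
# Customer ID is stored as a number, but arithmetic on it is meaningless.
NUMERIC = ["Age", "Income", "Year-of-Education", "Purchase-Amount"]
CATEGORICAL = ["Customer ID", "Favorite"]


def mean(attribute):
    """Average of a numeric attribute over the records shown."""
    values = [r[attribute] for r in records]
    return sum(values) / len(values)


print(len(COLUMNS), "attributes,", len(records), "records shown")
print("Mean age:", mean("Age"))
```

Taking the mean of "Age" is meaningful, whereas the mean of "Customer ID" would not be, which is the practical difference between the two attribute types.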
Now let's look at some possible business applications:
Market basket analysis
Web usage mining and personalization
Data Mining Tasks
Supervised Learning - where there is a predefined attribute whose values are to be predicted:
Classification
Prediction (of numeric values)
Unsupervised Learning - where there is no predefined attribute for prediction:
Clustering (or Cluster Analysis)
Classification is the process of assigning data records to one of several predefined groups, referred to as classes. Classification involves building a model, called a classifier, which can be a mathematical function, a set of rules, or another representation.
From the above table, if we want to use "Age", "Income", and "Year-of-Education" to predict "Favorite", then this task is classification, since the outcomes are categorical ("YES" or "NO").
More Examples of Classification:
Fraud detection (true or false)
Security trading decision (buy, sell, or hold)
Medical diagnosis (presence or absence of a disease)
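To make the idea of a classifier concrete, here is a minimal sketch of one simple technique, a 1-nearest-neighbor classifier, which is an assumption of this example and not a method named in the text. The three "NO" training records are hypothetical, invented only so that both classes are present; the function names are likewise illustrative.

```python
# Training data: (Age, Income, Year-of-Education) -> Favorite.
# The first three rows come from the table; the "NO" rows are hypothetical.
TRAIN = [
    ((35, 62000, 10), "YES"),
    ((25, 54000, 8), "YES"),
    ((36, 100000, 20), "YES"),
    ((52, 30000, 6), "NO"),   # hypothetical
    ((41, 25000, 9), "NO"),   # hypothetical
    ((60, 28000, 12), "NO"),  # hypothetical
]


def fit_bounds(rows):
    """Return (min, max) per attribute so attributes can share one scale."""
    return [(min(col), max(col)) for col in zip(*rows)]


def scale(row, bounds):
    """Rescale each attribute to [0, 1] so Income does not dominate Age."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(row, bounds)]


def classify(row, train=TRAIN):
    """Predict the class of `row` as the class of its closest training record."""
    bounds = fit_bounds([features for features, _ in train])
    target = scale(row, bounds)

    def squared_distance(item):
        features, _ = item
        return sum((a - b) ** 2
                   for a, b in zip(scale(features, bounds), target))

    _, label = min(train, key=squared_distance)
    return label


print(classify((30, 90000, 16)))
```

The classes ("YES", "NO") are fixed in advance and the model only decides which one a new record belongs to, which is what distinguishes classification from clustering.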
The prediction of numeric values helps us discover the relationship between one set of variables (called independent or input variables) and another set of variables (called dependent or output variables). Once these relationships are discovered, the past or current values of the independent variables can be used to predict the future values of the dependent variables.
Prediction vs. Classification:
Prediction - The values of the attribute to be predicted (dependent variable) are numeric
Classification - The values of the attribute to be predicted (class attribute) are categorical
From the above table, if we want to use "Age", "Income", and "Year-of-Education" to predict "Purchase-Amount", then this task is prediction, since the outcomes are numeric values.
More Examples of Prediction:
Sales volume / revenue prediction
Stock price prediction
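As a small sketch of numeric prediction, the code below fits a straight line by ordinary least squares, using "Income" alone as the independent variable and "Purchase-Amount" as the dependent variable. Using a single input and a linear model is a simplifying assumption of this example; the text itself does not prescribe a technique, and the function names are illustrative.

```python
# Income and Purchase-Amount for the six rows shown in the table.
incomes = [62000, 54000, 100000, 89000, 87000, 48000]
purchases = [429, 314, 659, 551, 483, 463]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(incomes, purchases)


def predict(income):
    """Predicted Purchase-Amount for a given Income."""
    return intercept + slope * income


print("Predicted purchase amount at income 75,000:", round(predict(75000), 2))
```

Here the output is a number on a continuous scale, not one of a fixed set of classes.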
Clustering is the process of grouping data records into a number of groups, called clusters, such that records within the same cluster are more similar to each other than to records in other clusters. This process differs from classification in that clusters are formed as a result of the analysis, instead of being predefined.
From the above table, if we want to use "Age" to group customers, then this task is clustering. For example, we can group customers into 3 groups: Young (age 20-30), Mid-Age (age 31-60), and Senior (age above 60).
More Examples of Clustering:
Grouping of library books by field
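The sketch below groups the six ages from the table with a one-dimensional k-means procedure. k-means is an assumption of this example (the text does not name a clustering algorithm), as are the function name and the deterministic choice of starting centroids.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on one numeric attribute (requires k >= 2).

    Returns one cluster index per value, plus the final centroids.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: k values spread evenly across the sorted data.
    centroids = [float(ordered[i * (n - 1) // (k - 1)]) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                      for v in values]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    return labels, centroids


ages = [35, 25, 36, 65, 29, 69]  # customers 1 to 6
labels, centroids = kmeans_1d(ages, k=3)
print(labels)     # [1, 0, 1, 2, 0, 2]
print(centroids)  # [27.0, 35.5, 67.0]
```

No age ranges are supplied in advance: the three groups emerge from the data, and for these six customers they happen to coincide with the Young, Mid-Age, and Senior groups.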
The Data Mining Process
1. Problem Identification: Define the purpose of the data-mining project and nature of the problem (classification, prediction, clustering).
2. Data Preparation:
Data collection - retrieving, merging, and/or dividing data
Data cleaning - correcting errors, handling missing data, resolving inconsistencies
Data reduction - sampling (in rows), feature (attribute) selection (in columns)
Data transformation - standardizing data, reforming data, conversion between numeric and categorical data
3. Model Formulation and Pattern Exploration:
Select appropriate data-mining techniques and tools, then use the selected techniques and tools to build models and explore the patterns/relationships hidden in the data
4. Verification and Modification:
Test if the models built are valid; modify the models if necessary
Compare different candidate models
5. Interpretation and Implementation:
Interpret the results of data mining in an intuitive manner
Implement (AKA deploy) the model into related applications
Missing Data Replacement:
Listwise deletion: disregard a record if it has any missing attribute values
Mean/mode replacement:
- For a missing value of a numeric attribute, replace it with the mean of the existing values of that attribute
- For a missing value of a categorical attribute, replace it with the mode (most frequent value) of the non-missing values of that attribute
Task-specific missing value replacement methods: We will discuss some of these in detail later in this course
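The first two approaches above can be sketched as follows. The sample records are hypothetical, None is assumed to mark a missing value, and the function names are illustrative.

```python
from collections import Counter

# Hypothetical records with gaps; None marks a missing value.
records = [
    {"Age": 35, "Favorite": "YES"},
    {"Age": None, "Favorite": "NO"},
    {"Age": 25, "Favorite": None},
    {"Age": 30, "Favorite": "YES"},
]


def listwise_delete(rows):
    """Keep only the records that have no missing attribute values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute(rows, numeric, categorical):
    """Fill numeric gaps with the mean and categorical gaps with the mode."""
    filled = [dict(r) for r in rows]  # copy so the input is left untouched
    for attr in numeric:
        present = [r[attr] for r in rows if r[attr] is not None]
        mean = sum(present) / len(present)
        for r in filled:
            if r[attr] is None:
                r[attr] = mean
    for attr in categorical:
        present = [r[attr] for r in rows if r[attr] is not None]
        mode = Counter(present).most_common(1)[0][0]
        for r in filled:
            if r[attr] is None:
                r[attr] = mode
    return filled


print(listwise_delete(records))
print(impute(records, numeric=["Age"], categorical=["Favorite"]))
```

Listwise deletion here keeps only two of the four records, while replacement keeps all four at the cost of introducing estimated values.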
Normalizing Numeric Data
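Transform a series of numeric values into the range [0, 1] using min-max normalization:
normalized value = (original value - min value) / (max value - min value)
Here "min value" is the smallest value in the series and "max value" is the largest.
For example, take the original values -1, 0, 2, and 4.
From the original values, "min value" = -1 and "max value" = 4.
The normalized values are therefore 0, 0.2, 0.6, and 1.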
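A minimal sketch of this normalization in Python, assuming the usual min-max rule (value minus minimum, divided by maximum minus minimum). The function name is illustrative, and returning 0.0 when all values are equal is a choice made here to avoid dividing by zero.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the formula would divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_normalize([-1, 0, 2, 4]))  # [0.0, 0.2, 0.6, 1.0]
```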
Overfitting and Data Partitioning
A training set is the portion of the data used to build data-mining models. For example, we could use records 1 to 10 of the above table as our training set.
A validation set is the portion of the data used to validate or adjust the models and to guard against overfitting. For example, we could use records 11 to 15 as our validation set.
A test set is the portion of the data used to evaluate the performance of the models. For example, we could use records 16 to 20 as our test set. The test set serves as unseen future data.
Overfitting occurs when a model fits the training data very well (even perfectly) but performs poorly when it is applied to new data.
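The 10/5/5 split described above could be sketched like this. The record IDs stand in for the actual records, and the function name and parameters are illustrative.

```python
def partition(records, n_train, n_validation):
    """Split records, in order, into training, validation, and test sets."""
    train = records[:n_train]
    validation = records[n_train:n_train + n_validation]
    test = records[n_train + n_validation:]  # whatever remains
    return train, validation, test


record_ids = list(range(1, 21))  # stand-ins for records 1 to 20
train, validation, test = partition(record_ids, n_train=10, n_validation=5)

print(train)       # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(validation)  # [11, 12, 13, 14, 15]
print(test)        # [16, 17, 18, 19, 20]
```

In practice the records are often shuffled before splitting so that each set is representative of the whole; the in-order split here simply mirrors the example in the text.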