Decision Tree Algorithm in Data Mining

Decision trees are among the most useful techniques in data mining. A decision tree is a hierarchical diagram used to determine the answer to an overall question. It does this by asking a sequence of sub-questions related to that question. Each branch of the diagram represents a possible choice or answer to a specific sub-question, and each sub-question iteratively narrows the remaining choices until only the correct answer for the overall question, in that particular situation, remains.

Let’s look at an example: a decision tree for the overall question, ‘Is the weather good enough to go outside?’ This isn’t a simple question to answer; there are a number of factors to consider. Each node in the tree represents a factor, or sub-question, and each branch below it represents a choice or answer to that sub-question.

So the first sub-question we ask is, ‘Is it windy?’ If it is, we go down the left of the diagram, if not, we go down the right. Let’s say it is windy. That takes us to the ‘What is the outlook?’ sub-question. If the answer is sunny, we go down the left, if overcast down the center, and if rainy, down the right. Let’s say that it is sunny, so we go down the left. Then the next sub-question is ‘What is the humidity?’. If the humidity is less than 80 percent, the answer to the overall question is ‘Yes’. And if the humidity is greater than 80 percent, the answer is ‘No.’
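The walkthrough above can be sketched as a small function of the three factors. Only the path the text actually describes (windy, then sunny, then humidity) is fully specified, so the other branches are left as placeholders; the function name and argument names are invented for illustration.

```python
def go_outside(windy, outlook, humidity):
    """Return 'Yes' or 'No' for the branch described in the text,
    or None for branches the walkthrough does not specify."""
    if windy:
        if outlook == "sunny":
            # Humidity below 80 percent means the weather is good enough.
            return "Yes" if humidity < 80 else "No"
        # Overcast and rainy branches are not specified in the walkthrough.
        return None
    # The non-windy side of the tree is not described either.
    return None

print(go_outside(windy=True, outlook="sunny", humidity=70))  # Yes
print(go_outside(windy=True, outlook="sunny", humidity=90))  # No
```

Each `if` corresponds to one sub-question in the tree, and each `return` to a leaf where only one answer remains.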

What is Data Mining?

Data mining is the process of identifying trends and patterns in large data sets.
The steps are as follows:

  1. Business understanding
  2. Data understanding
  3. Data preparation
  4. Modeling
  5. Evaluation
  6. Deployment

The data is usually collected and stored in data warehouses. Suitable data mining algorithms are then applied to identify trends. Among the most popular algorithms are clustering and regression trees.
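To make the idea of clustering concrete, here is a minimal k-means sketch in pure Python; the function name `kmeans_1d` and the toy data are invented for illustration. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points.

```python
import random

def kmeans_1d(values, k=2, iters=20, seed=0):
    """Tiny k-means on 1-D data: assign points to the nearest
    centroid, then move each centroid to the mean of its cluster."""
    random.seed(seed)
    centroids = random.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups, around 1 and around 10:
print(kmeans_1d([1.0, 1.2, 0.8, 9.8, 10.1, 10.3]))
```

Real data mining work would use a library implementation on multi-dimensional data, but the assign-then-update loop is the whole algorithm.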

Data Mining can be done for:

  1. Mining for patterns
  2. Mining for associations
  3. Mining for correlations
  4. Mining for clusters
  5. Mining for predictive analysis

What Are Deep Neural Networks?

A deep neural network (DNN) is an artificial neural network (ANN) with multiple hidden layers between the input and output layers. DNNs can model complex non-linear relationships. DNN architectures generate compositional models where the object is expressed as a layered composition of primitives. The extra layers enable composition of features from lower layers, potentially modeling complex data with fewer units than a similarly performing shallow network.

Deep architectures include many variants of a few basic approaches. Each architecture has found success in specific domains. It is not always possible to compare the performance of multiple architectures unless they have been evaluated on the same data sets.

DNNs are typically feedforward networks in which data flows from the input layer to the output layer without looping back.
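As a concrete sketch of such a feedforward pass, the toy network below pushes data from two inputs through one hidden layer to a single output, with no loops. All weights, biases, and layer sizes here are arbitrary, made-up values chosen only to illustrate the computation.

```python
import math

def relu(x):
    return max(0.0, x)

def dense(inputs, weights, biases, activation):
    """One fully connected layer: for each unit, take a weighted sum
    of the inputs plus a bias, then apply the activation function."""
    return [activation(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Toy network: 2 inputs -> 3 hidden units -> 1 output.
hidden = dense([0.5, 1.0],
               weights=[[0.1, 0.4], [-0.3, 0.8], [0.5, 0.5]],
               biases=[0.0, 0.1, -0.2],
               activation=relu)
output = dense(hidden,
               weights=[[1.0, -1.0, 0.5]],
               biases=[0.0],
               activation=lambda z: 1 / (1 + math.exp(-z)))  # sigmoid
print(output)
```

Data flows strictly forward: the hidden layer is computed from the inputs, and the output from the hidden layer. Stacking more `dense` calls is what makes the network "deep".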

Recurrent neural networks (RNNs), in which data can flow in any direction, are used for applications such as language modeling. Long short-term memory (LSTM) networks are particularly effective for this use.

Convolutional deep neural networks (CNNs) are used in computer vision. CNNs also have been applied to acoustic modeling for automatic speech recognition (ASR).

Reference: https://deeplearning4j.org/neuralnet-overview

Many applications have been developed using deep learning, and it now underpins a number of widely used products and services.

What Is Data Wrangling?

Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time. In other words, it is the process of cleaning and unifying messy and complex data sets for easy access and analysis.

  1. With the amount of data and the number of data sources rapidly growing, it is increasingly essential for the large amounts of available data to be organized for analysis.
  2. This process typically includes manually converting/mapping data from one raw form into another format to allow for more convenient consumption and organization of the data.

The goals of data wrangling:

  1. Reveal a “deeper intelligence” within your data, by gathering data from multiple sources
  2. Provide accurate, actionable data to business analysts in a timely manner
  3. Reduce the time spent collecting and organizing unruly data before it can be utilized
  4. Enable data scientists and analysts to focus on the analysis of data, rather than the wrangling
  5. Drive better decision-making skills by senior leaders in an organization

The key steps to data wrangling:

  1. Data Acquisition: Identify and obtain access to the data within your sources
  2. Joining Data: Combine the edited data for further use and analysis
  3. Data Cleansing: Redesign the data into a usable/functional format and correct/remove any bad data
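The three steps above can be sketched in Python. The data sets, column names, and the cleansing rule here are all invented for illustration:

```python
# 1. Data acquisition: here the "sources" are simply in-memory lists.
customers = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
]
orders = [
    {"customer_id": 1, "amount": "120.50"},
    {"customer_id": 2, "amount": "n/a"},  # bad value to cleanse later
    {"customer_id": 1, "amount": "75.00"},
]

# 2. Joining data: attach the customer name to each order by key.
by_id = {c["id"]: c["name"] for c in customers}
joined = [{**o, "name": by_id.get(o["customer_id"])} for o in orders]

# 3. Data cleansing: convert amounts to numbers, dropping bad rows.
clean = []
for row in joined:
    try:
        clean.append({**row, "amount": float(row["amount"])})
    except ValueError:
        pass  # discard rows whose amount is not numeric

print(clean)
```

In practice each step would be far more involved (databases and APIs for acquisition, relational joins, many cleansing rules), but the shape of the pipeline is the same.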

How to Remove Duplicate Data in R

During data cleansing, it is often necessary to remove duplicate values from a data set. A very useful application of subsetting data is to find and remove duplicate values. R has a useful function, duplicated(), that finds duplicate values and returns a logical vector telling you whether the specified value is a duplicate of a previous value. This means that duplicated() returns FALSE for the first occurrence of a value and TRUE for every following occurrence, as in the following example:

> duplicated(c(1,2,1,6,1,8))
[1] FALSE FALSE TRUE FALSE TRUE FALSE

If you try this on a data frame, R automatically checks the observations (meaning, it treats every row as a value). So, for example, with the data frame iris:

> duplicated(iris)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [10] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
....
 [136] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE

If you look carefully, you notice that row 143 is a duplicate (because the 143rd element of your result has the value TRUE). You also can tell this by using the which() function:

> which(duplicated(iris))
[1] 143

Now, to remove the duplicate from iris, you need to exclude this row from your data. Remember that there are two ways to exclude data using subsetting:

  • Specify a logical vector, where FALSE means that the element will be excluded. The ! (exclamation point) operator is a logical negation. This means that it converts TRUE into FALSE and vice versa. So, to remove the duplicates from iris, you do the following:

> iris[!duplicated(iris), ]

  • Specify negative values. In other words:

> index <- which(duplicated(iris))
> iris[-index, ]

In both cases, you’ll notice that your instruction has removed row 143.