
DATA MODEL

  1. Association rule
     a. FP-Growth
                Association rule mining is a rule-based machine learning technique used to discover interesting relations between variables in databases. An algorithm is run to determine whether the discovered rules are strong or weak, and this rule-based approach generates new rules as it analyzes more data. Association rules are created by analyzing data for frequent if/then patterns and using the criteria support and confidence to identify the most important relationships. Support indicates how frequently an itemset appears in the database, while confidence indicates how often the if/then statement has been found to be true. Such information can be used as the basis for decisions. The FP-Growth algorithm is an efficient algorithm for finding frequently co-occurring items in a transaction database.
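The support and confidence measures described above can be sketched in a few lines of plain Python. The transaction database and item names here are purely illustrative, not taken from the project's dataset:

```python
# Hypothetical mini transaction database; item names are illustrative.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """Of the transactions containing `antecedent`, the fraction that also
    contain `consequent` -- i.e. the reliability of the if/then rule."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # ~0.667
```

A rule such as bread → milk would be kept if both values exceed the user-chosen minimum support and minimum confidence thresholds.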
Association rule
This is a snapshot of the process used to apply the association rule.
The Frequent Pattern Growth (FP-Growth) algorithm is applied to the data set using RapidMiner. Several modules are required to apply the association rule to the data set. “Read CSV” reads the dataset from the computer. “Select Attributes” selects the attributes to be used from the dataset. Since the “FP-Growth” module works on binomial attributes to compute frequencies, all attributes must first be converted to binomial. “Discretize” converts numerical attributes into nominal ones: it discretizes the review rating into 3 groups, as shown in the data visualization, while “Discretize (2)” discretizes the other numerical attributes into nominal ones by binning. The “Nominal to Binomial” module then converts the nominal attributes to binomial. The “FP-Growth” module uses the binomial attributes to find the frequently occurring item sets. Once the frequent item sets have been found, the association rules can be generated from the dataset using “Create Association Rules”.
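The binning step that the “Discretize” modules perform can be sketched in plain Python. This is a minimal equal-width binning, assuming 3 bins and illustrative rating values; RapidMiner's operator offers more binning options:

```python
def discretize(values, n_bins=3):
    """Equal-width binning: map each numeric value to a nominal bin label."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the max into the last bin
        labels.append(f"range{idx + 1}")
    return labels

ratings = [1.0, 2.5, 3.0, 4.8, 5.0]
print(discretize(ratings))  # ['range1', 'range2', 'range2', 'range3', 'range3']
```

Each resulting nominal label can then be expanded into one binomial (true/false) attribute per bin, which is what the “Nominal to Binomial” module does before FP-Growth runs.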
2. Clustering
p2.png
This is a snapshot of one of the processes used to apply a clustering model.
        The tool used to apply the clustering models is RapidMiner. To apply X-Means to the data set, several modules are needed. “Read CSV” reads the dataset from the computer. “Select Attributes” selects the attributes to be used from the dataset. Since clustering models cannot operate on nominal attributes, only the numerical attributes are selected with “Select Attributes”. “Multiply” duplicates the cluster model, and “Multiply (2)” duplicates the clustered set so that it can be visualized and its performance checked. “Cluster Model Visualization” visualizes the clustering result with graphs and a summary. “Performance” computes the centroid distance within each cluster and the Davies-Bouldin index to evaluate the quality of the clustering applied to the dataset.
 
    a. X-means
X-Means is one example of a clustering algorithm. Clustering methods group the data so that records within a segment are alike while records across segments differ. Cluster centroids are initialized randomly for a fixed number K of clusters. X-Means partitions objects into clusters that are “similar” within themselves and “dissimilar” to objects belonging to other clusters. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data, and it provides a fast and efficient way to cluster unstructured data.
X-Means is essentially a variation of K-means clustering. The variation matters because in K-means the number of clusters K must be supplied by the user, and the search is prone to local minima for a fixed K; a different choice of K changes the clustering result. X-Means therefore refines the cluster assignments by repeatedly attempting splits and keeping the optimal resulting splits until some criterion is reached. X-Means reveals the true number of classes in the underlying distribution, and it is much faster than repeatedly running accelerated K-means for different values of K.
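The K-means inner loop that X-Means builds on can be sketched in plain Python. This is a toy implementation on illustrative 2-D points, not RapidMiner's operator; X-Means would wrap this loop, splitting clusters and keeping splits that improve a criterion such as BIC so that K is chosen automatically:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: assign points to the nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):  # update step
            if cl:
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
centroids, clusters = kmeans(points, k=2)
print(clusters)  # the two tight groups separate into two clusters
```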
p3.jpg
This is the process for K-means and X-Means clustering. The only difference is that K-means requires the number of clusters K to be estimated manually, whereas X-Means estimates the best value of K itself.
p4.png
This is a snapshot of the process used to apply X-Means.
     b. k-Medoids
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the medoidshift algorithm. The k-means algorithm is sensitive to outliers, because a mean is easily influenced by extreme values. Instead of using the mean point as the center of a cluster, k-medoids uses an actual point in the cluster to represent it. The medoid is the most centrally located object of the cluster, the one with the minimum sum of distances to the other points.
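The medoid definition above can be expressed directly in plain Python. The cluster points below are illustrative; note how the outlier does not change which point is chosen, whereas it would drag a mean far away:

```python
import math

def medoid(cluster):
    """The medoid is the actual cluster point with the minimum total
    distance to all other points -- robust to outliers, unlike the mean."""
    return min(cluster, key=lambda p: sum(math.dist(p, q) for q in cluster))

cluster = [(1, 1), (2, 2), (2, 3), (100, 100)]  # (100, 100) is an outlier
print(medoid(cluster))  # (2, 2); the mean, by contrast, is pulled toward the outlier
```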
p5.png
This is a snapshot of the process used to apply k-Medoids.
 3. Classification and prediction
        All the classification and prediction models use the same process in RapidMiner; the only difference is the model itself. Another classification or prediction model can be applied simply by swapping in the desired model.
p6.png
This is a snapshot of the process used to apply k-NN, one example of a classification and prediction model.
To apply classification and prediction to the dataset, several RapidMiner modules are used. “Read CSV” reads the dataset from the computer. “Normalize” rescales the dataset to improve the performance of each model. The value to be predicted, reviews.rating, is discretized into 3 groups, as shown in the data visualization, using “Discretize”. “Select Attributes” selects the attributes to be used from the dataset. reviews.rating must be labelled with “Set Role” to mark it as the value to be predicted. The data is split with “Split Data” so that the model runs smoothly and its accuracy can be calculated.
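The normalization step can be sketched in plain Python. This is a minimal min-max rescaling on illustrative numbers; RapidMiner's “Normalize” operator also supports other methods such as z-transformation:

```python
def normalize(column):
    """Min-max normalization: rescale a numeric column to [0, 1] so that
    attributes with large ranges do not dominate distance-based models."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

print(normalize([10, 20, 30, 50]))  # [0.0, 0.25, 0.5, 1.0]
```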
                The data is then split into 3 sets: training, validation, and test. The current model is run on the training dataset and produces a result, which is compared with the target for each input vector in the training dataset. The fitted model is then used to predict the responses for the observations in a second dataset, the validation dataset. Validation datasets can be used for regularization by early stopping: training stops when the error on the validation dataset increases, as this is a sign of overfitting to the training dataset. Finally, the test dataset is used to provide an unbiased evaluation of the final model fit on the training dataset.
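A shuffle-and-split like the one described above can be sketched in plain Python. The 60/20/20 ratios and the row data here are illustrative assumptions, not the project's actual partition settings:

```python
import random

def split_data(rows, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle the rows, then cut them into training, validation,
    and test sets according to the given ratios."""
    rows = rows[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

train, val, test = split_data(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```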
                “k-NN” is the RapidMiner module for the k-NN model; it can be replaced with any other classification or prediction model. “Apply Model” applies the model to the validation set to guard against overfitting to the training set. “Model Simulator” provides an easy, real-time way to change the inputs of a model and view its output. “Performance” statistically evaluates the strengths and weaknesses of the classification and tests the accuracy of the model on the dataset.
 
    a. k-NN (Classification and prediction)
        k-NN (k-Nearest Neighbors) can be used for both classification and regression predictive problems, though it is more widely used for classification. It is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure. k-NN has been used in statistical estimation and pattern recognition as a non-parametric technique since the beginning of the 1970s. Non-parametric means that the technique makes no assumptions about the underlying data distribution; in other words, the model structure is determined from the data.
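The store-everything, vote-by-similarity behaviour described above can be sketched in plain Python. The training points, labels, and k=3 below are illustrative assumptions:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training
    examples; `train` is a list of (point, label) pairs."""
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "low"), ((1, 2), "low"),
         ((8, 8), "high"), ((9, 9), "high"), ((8, 9), "high")]
print(knn_predict(train, (2, 2)))  # 'low' -- two of its three nearest neighbors are 'low'
```

There is no training phase at all: every prediction re-measures distances to the stored cases, which is why k-NN is called a lazy learner.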
p7.png
This is a snapshot of the process used to apply k-NN.
   b. Naïve Bayes (Classification and prediction)
        The Naive Bayes classifier is based on Bayes’ theorem with independence assumptions between predictors. A Naive Bayes model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, the Naive Bayes classifier is widely used because it often outperforms more sophisticated classification methods. A Naive Bayes classifier assumes that the presence (or absence) of a feature of a class is unrelated to the presence (or absence) of any other feature.
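The Bayes-with-independence idea can be sketched for categorical features in plain Python. The weather-style training data is a toy assumption, and the smoothing used is a simple Laplace-style variant:

```python
from collections import Counter, defaultdict

def naive_bayes(train, query):
    """Score each label by P(label) * product of P(feature | label),
    treating features as independent; `train` is (feature_tuple, label) pairs."""
    labels = Counter(label for _, label in train)
    counts = defaultdict(Counter)  # (label, feature position) -> value counts
    for feats, label in train:
        for i, f in enumerate(feats):
            counts[(label, i)][f] += 1

    def score(label):
        p = labels[label] / len(train)  # prior P(label)
        for i, f in enumerate(query):
            c = counts[(label, i)]
            p *= (c[f] + 1) / (sum(c.values()) + len(c) + 1)  # smoothed P(f | label)
        return p

    return max(labels, key=score)

train = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
         (("rainy", "mild"), "yes"), (("rainy", "cool"), "yes")]
print(naive_bayes(train, ("rainy", "mild")))  # 'yes'
```

Because each feature contributes an independent factor, training reduces to simple counting, which is what makes the model so cheap to build.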
p8.png
This is a snapshot of the process used to apply Naïve Bayes.
    c. Decision tree
        A decision tree is a map of the possible outcomes of a series of related choices. Decision trees are a non-parametric supervised learning method used for classification. A decision tree can be used to visually and explicitly represent decisions and decision making; as the name suggests, it uses a tree-like model of decisions. A decision tree can be used either to drive informal discussion or to map out an algorithm that predicts the best choice mathematically. A decision tree typically starts with a single node, which branches into possible outcomes. Each of those outcomes leads to additional nodes, which branch off into further possibilities.
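The choice made at each node -- which test best separates the classes -- can be sketched in plain Python using Gini impurity, one common splitting criterion (others, such as information gain, work similarly). The rows below are toy (feature, label) data:

```python
def best_split(rows):
    """Find the (feature, threshold) test with the lowest weighted Gini
    impurity; rows are tuples whose last element is the class label."""
    def gini(groups):
        total = sum(len(g) for g in groups)
        score = 0.0
        for g in groups:
            if not g:
                continue
            props = [sum(1 for r in g if r[-1] == lbl) / len(g)
                     for lbl in set(r[-1] for r in g)]
            score += (1 - sum(p * p for p in props)) * len(g) / total
        return score

    best = None
    for i in range(len(rows[0]) - 1):          # try every feature...
        for value in set(r[i] for r in rows):  # ...and every observed value
            left = [r for r in rows if r[i] < value]
            right = [r for r in rows if r[i] >= value]
            g = gini([left, right])
            if best is None or g < best[0]:
                best = (g, i, value)
    return best  # (impurity, feature index, threshold)

rows = [(1, "low"), (2, "low"), (8, "high"), (9, "high")]
print(best_split(rows))  # (0.0, 0, 8): splitting at 8 separates the labels perfectly
```

Growing a full tree just applies this search recursively to each resulting branch until the leaves are pure or some stopping criterion is met.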
p9.png
This is a snapshot of the process used to apply a decision tree.
    d. Random Forest
        Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time. Random forest is a flexible, easy-to-use machine learning algorithm that produces a good result most of the time, even without hyper-parameter tuning. It is also one of the most used algorithms, because of its simplicity and the fact that it can be used for both classification and regression tasks.
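The bagging-and-voting idea behind a random forest can be sketched in plain Python. This is a deliberately toy version: each "tree" here is just a 1-nearest-neighbour rule fit on a bootstrap sample, whereas a real random forest grows a decision tree per sample (with random feature subsets) and aggregates predictions the same way:

```python
import random
from collections import Counter

def bootstrap_vote(rows, query, n_models=25, seed=0):
    """Fit one simple model per bootstrap sample of the data,
    then combine their predictions by majority vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        sample = [rng.choice(rows) for _ in rows]        # bootstrap sample
        nearest = min(sample, key=lambda r: abs(r[0] - query))
        votes[nearest[1]] += 1                           # each model votes once
    return votes.most_common(1)[0][0]

rows = [(1, "low"), (2, "low"), (8, "high"), (9, "high")]
print(bootstrap_vote(rows, 1.5))  # 'low'
```

Averaging many models trained on resampled data is what makes the ensemble more stable than any single tree.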
p10.png
This is a snapshot of the process used to apply random forest.