
DATA ANALYSIS

  1.  Frequent Pattern Growth Algorithm (Association Rules)
After applying FP-Growth in RapidMiner, the resulting association rules are shown below.
p1.png
p2.png

This graph shows all of the association rules found by the FP-Growth algorithm. Some of them are meaningless, while others carry useful information. One example rule is “[reviews.month = range2 [6.500 - ∞], reviews.userProvince = CA] --> [province = CA]”. This rule says that when the review month falls in the range 6.5 to 12 and the reviewer's province is CA, the hotel's province is also CA. This is not the kind of result we need: the rules we need are those whose conclusion concerns the review rating, for example “[review.year = range2 [2010.500 - ∞], name = Simpson House Inn] --> [reviews.rating = last]”. From this rule, we can conclude that when the review year is between 2010.5 and 2017 and the hotel name is Simpson House Inn, the review rating falls in the last range, which is 4.0-5.0.
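As an illustration of the measures reported alongside each rule, here is a minimal pure-Python sketch of how support, confidence and lift are computed for a rule A --> B. The item names mimic the rule quoted above, but the transactions are invented, not taken from our dataset.

```python
# Toy sketch of association-rule measures for a rule A --> B.
# Transactions are sets of items; all data here is made up.
transactions = [
    {"year_range2", "Simpson House Inn", "rating_last"},
    {"year_range2", "Simpson House Inn", "rating_last"},
    {"year_range2", "rating_first"},
    {"Simpson House Inn", "rating_last"},
    {"year_range2", "Simpson House Inn", "rating_last"},
]

def support(itemset, transactions):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent = {"year_range2", "Simpson House Inn"}
consequent = {"rating_last"}

sup = support(antecedent | consequent, transactions)      # 0.6
conf = sup / support(antecedent, transactions)            # 1.0
lift = conf / support(consequent, transactions)           # 1.25
```

A lift above 1 means the antecedent makes the conclusion more likely than it is overall; rules with high confidence but lift near 1 (like the province rule above) add little information, which is why so many of the discovered rules are not useful.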

This table shows each rule together with its support, confidence, gain, lift and other measures. Only rules whose conclusion is reviews.rating are kept; rules with other conclusions have been filtered out.
p3.png
This is the association rule map. It shows which items lead to the conclusion reviews.rating = last; rules with other conclusions have been filtered out.
​
Applying FP-Growth produced more than 1000 association rules, but most of them are not useful for this dataset and only a few are suitable. Thus, FP-Growth (association rules) is not very suitable for our dataset.
​
    2. X-means (Clustering)
The model was tested with different attribute settings from the dataset. The result of each test is shown below.
p4.PNG
The average within-centroid distance is about 243.037 and the Davies–Bouldin index is 0.534, which indicates that this clustering model fits the data only moderately well. The Davies–Bouldin index is almost the same for settings 2, 4, 5, 6 and 7, lying between 0.36 and 0.39; however, setting 4 has the lowest value. The Davies–Bouldin index is a metric for evaluating clustering results: the lower the index, the better the clustering. Thus, the results for setting 4 are shown below.
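Since the model selection here rests on the Davies–Bouldin index, a small sketch of how it is computed may help. This uses toy 1-D clusters and mean absolute deviation as the scatter measure for simplicity; RapidMiner's exact implementation may differ.

```python
# Sketch of the Davies-Bouldin index: for each cluster, take the worst-case
# ratio of (scatter_i + scatter_j) to the distance between centroids i and j,
# then average over clusters. Lower = compact, well-separated clusters.
def davies_bouldin(clusters):
    centroids = [sum(c) / len(c) for c in clusters]
    # scatter: average absolute deviation of points from their centroid
    scatter = [sum(abs(x - m) for x in c) / len(c)
               for c, m in zip(clusters, centroids)]
    k = len(clusters)
    return sum(
        max((scatter[i] + scatter[j]) / abs(centroids[i] - centroids[j])
            for j in range(k) if j != i)
        for i in range(k)
    ) / k

tight = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]  # compact, well separated
loose = [[1.0, 5.0, 9.0], [8.0, 12.0, 16.0]]   # spread out, overlapping
```

Here `davies_bouldin(tight)` is about 0.15 while `davies_bouldin(loose)` is about 0.76, matching the rule of thumb used above: the lower the index, the better the clustering.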
p5.png

This is the performance of setting 4 for X-means clustering. From this figure, the average within-centroid distance of cluster 1 is greater than that of cluster 0.

p6.png
This is an overview of the clustering for setting 4.
p7.png
This plot shows review.rating against longitude. From this graph, we can see two groups of review ratings based on longitude.
​
    3. k-Medoids (Clustering)
The model was tested with different attribute settings from the dataset. The result of each test is shown below.
p8.PNG

From the table above, we can conclude that setting 2 is the most suitable for this model, because it has the highest average within-centroid distance and the lowest Davies–Bouldin index.
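For intuition, the objective that k-Medoids optimises can be sketched by brute force on tiny made-up 1-D data. Unlike X-means, the cluster centres (medoids) must be actual data points, chosen to minimise the total distance from each point to its nearest medoid.

```python
# Brute-force k-medoids sketch for k=2: try every pair of data points as
# medoids and keep the pair with the lowest total point-to-medoid distance.
# (Real implementations such as PAM swap medoids iteratively instead.)
from itertools import combinations

data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # toy values, two obvious groups

def cost(medoids):
    # total distance from each point to its nearest medoid
    return sum(min(abs(x - m) for m in medoids) for x in data)

best = min(combinations(data, 2), key=cost)
```

Here the best medoids are 2.0 and 11.0, the middle points of the two natural groups, with a total cost of 4.0. Because medoids are real records, k-Medoids is less sensitive to outliers than centroid-based methods.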

p9.png

This is the performance of setting 2 for k-Medoids clustering. From this figure, the average within-centroid distance of cluster 0 is greater than that of cluster 1.

p10.png

This is an overview of the clustering for setting 2.

p11.png
This plot shows review.rating against longitude. From this graph, we can see two groups of review ratings based on longitude.
 
 
Comparison between X-means and k-Medoids
p12.PNG

From the table above, we can conclude that X-means is more suitable for this dataset, because the average within-centroid distance for X-means is higher than that of k-Medoids, and the Davies–Bouldin index for X-means is lower than 0.534. This means that X-means produces a more accurate clustering model for this dataset.

p13.PNG
This table lists each attribute setting tried with X-means. From it, we can see that setting 4 is more accurate than the others, as it has the lowest Davies–Bouldin index and a high average within-centroid distance.
​
Thus, from these two tables, we can conclude that applying setting 4 with X-means produces the most accurate clustering model for this dataset.
​
   4. k-NN (Classification and Prediction)
        For this model, all attributes were selected for better classification and prediction. However, different values of k were tried, and the results for each k are shown below.
p14.PNG
The accuracy on the training set is highest at k=2, whereas the accuracy on the validation and test sets is highest at k=10. Although at k=2 the training set accuracy is high (91.83%), the validation and test accuracies are very low, and increasing k does not remove the problem. The average training accuracy is 64.82%, the average validation accuracy is 41.48%, and the average test accuracy is 40.90%. The training accuracy being higher than both the validation and test accuracy is a sign of overfitting, which means that either the model is not suitable or the data is too biased.
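The k-NN rule itself is simple enough to sketch in a few lines of plain Python (toy points and labels, not our dataset): a query point takes the majority label among its k nearest training points.

```python
# Minimal k-NN classifier: label a query point by majority vote among its
# k nearest training points (squared Euclidean distance in 2-D here).
from collections import Counter

def knn_predict(train, query, k):
    # train: list of ((x, y), label) pairs
    nearest = sorted(train,
                     key=lambda p: (p[0][0] - query[0]) ** 2
                                 + (p[0][1] - query[1]) ** 2)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "low"), ((0, 1), "low"), ((1, 0), "low"),
         ((5, 5), "high"), ((5, 6), "high"), ((6, 5), "high")]
```

With k=3, a query near the origin is labelled "low" and one near (5, 5) is labelled "high". Very small k (such as k=2) lets individual noisy neighbours dominate the vote, which is one way the overfitting pattern described above can arise.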
​
        Below are example results of k-NN with k=5.
p15.png

This is the overall view of the k-NN model.

p16.png

This is the performance of k-NN on the training set when k=5. The training accuracy is 70.24%.

p17.png
This is the performance of k-NN on the validation set when k=5. The validation accuracy is 46.33%.
p18.png

This is the model simulator for k-NN when k=5. The inputs on the left can be adjusted to see the model's reaction on the right. The test set accuracy for k-NN is 45.55%.

​

5. Naïve Bayes (Classification and Prediction)

p19.png
This is the overall view of the Naïve Bayes model.
p20.png

This is the performance of Naïve Bayes on the training set. The training accuracy is 66.35%.

p21.png

This is the performance of Naïve Bayes on the validation set. The validation accuracy is 48.70%.

p1.png
This is the model simulator for Naïve Bayes. The inputs on the left can be adjusted to see the model's reaction on the right. The test set accuracy for Naïve Bayes is 48.11%.
​
From these figures, the training accuracy is higher than the validation and test accuracy, which again indicates overfitting in this model.
p2.png

This is the graph of review.month against its density. From the graph above, the review ratings in the first range have the highest density.

p3.png
This is the graph of review.year against its density. From the graph above, the review ratings in the second range have the highest density.
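Per-class density curves like these are exactly the ingredients Naïve Bayes uses: each class is summarised by a distribution for each attribute, and the predicted class is the one that maximises prior times likelihood. A minimal one-feature Gaussian sketch with made-up means, variances and priors:

```python
# One-feature Gaussian Naive Bayes sketch: each class stores (mean, variance,
# prior); the predicted class maximises prior * Gaussian likelihood of x.
# All parameters here are hypothetical, not fitted from our review data.
import math

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

classes = {"first": (2.0, 1.0, 0.5),   # (mean, variance, prior) per rating range
           "last":  (8.0, 1.0, 0.5)}

def predict(x):
    return max(classes,
               key=lambda c: classes[c][2] * gaussian(x, classes[c][0], classes[c][1]))
```

Here `predict(1.5)` gives "first" and `predict(9.0)` gives "last", because each value lies close to that class's mean. With several attributes, Naïve Bayes simply multiplies the per-attribute likelihoods, assuming the attributes are independent given the class.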
​
   6. Decision Tree (Classification and Prediction)
p4.png

This is the result after applying the decision tree.

p5.png

This is the decision tree built from the dataset.

p6.png

This is the performance of the decision tree on the training set. The training accuracy is 59.36%.

p7.png

This is the performance of the decision tree on the validation set. The validation accuracy is 44.13%.

p8.png
This is the model simulator for the decision tree. The inputs on the left can be adjusted to see the model's reaction on the right. The test set accuracy for the decision tree is 44.10%.
​
From these figures, the training accuracy is higher than the validation and test accuracy, which indicates overfitting in this model.
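The core step a decision tree repeats is choosing the split with the lowest impurity; a sketch with Gini impurity on made-up 1-D data:

```python
# Sketch of decision-tree split selection: choose the threshold that minimises
# the weighted Gini impurity of the two sides. Data and labels are made up.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(points, threshold):
    left = [label for x, label in points if x <= threshold]
    right = [label for x, label in points if x > threshold]
    n = len(points)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

points = [(1, "low"), (2, "low"), (3, "low"),
          (7, "high"), (8, "high"), (9, "high")]
candidates = [1.5, 2.5, 5.0, 7.5, 8.5]
best = min(candidates, key=lambda t: split_impurity(points, t))
```

Here `best` is 5.0, which separates the two labels perfectly (impurity 0). A tree that keeps splitting until every leaf is pure effectively memorises the training data, which is exactly the overfitting pattern seen in the accuracies above.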

7.  Random Forest (Classification and Prediction)

p9.png

This is one of the results from the random forest.

p10.png

This is the performance of the random forest on the training set. The training accuracy is 69.26%.

p11.png

This is the performance of the random forest on the validation set. The validation accuracy is 47.60%.

p12.png
This is the model simulator for the random forest. The inputs on the left can be adjusted to see the model's reaction on the right. The test set accuracy for the random forest is 48.00%.
​
From these figures, the training accuracy is higher than the validation and test accuracy, which indicates overfitting in this model.
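The random-forest idea can be sketched in miniature: train many simple models on bootstrap resamples of the data and combine their predictions by majority vote. Each "tree" below is just a one-threshold stump, and the data and labels are made up.

```python
# Minimal random-forest sketch: bootstrap resampling plus majority voting.
# Each stump puts its threshold midway between the two class means of its sample.
import random

random.seed(0)
points = [(1, "low"), (2, "low"), (3, "low"),
          (7, "high"), (8, "high"), (9, "high")]

stumps = []
while len(stumps) < 25:
    # bootstrap sample: draw with replacement, same size as the data
    sample = [random.choice(points) for _ in range(len(points))]
    lows = [x for x, label in sample if label == "low"]
    highs = [x for x, label in sample if label == "high"]
    if lows and highs:  # need both classes present to place a threshold
        stumps.append((sum(lows) / len(lows) + sum(highs) / len(highs)) / 2)

def forest_predict(x, stumps):
    votes = ["low" if x <= t else "high" for t in stumps]
    return max(set(votes), key=votes.count)
```

Averaging over many resampled models reduces variance, which is why a random forest usually overfits less than a single deep decision tree, even though both show the train/validation gap in this dataset.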
 
 
Comparison of accuracy on the training, validation and test sets for the classification models
p13.PNG
From the table above, we can conclude that Naïve Bayes performs best among all the models, since it has the highest accuracy on the training, validation and test sets. Another conclusion from the table is that the decision tree performs worst, as it has the lowest accuracy on all three sets.
​
Although Naïve Bayes performs best among the four models, it is still not suitable for this dataset, because its training accuracy is higher than both its validation and test accuracy. The difference between the training accuracy and the validation accuracy is about 20%. This large gap is one of the major problems faced in data science: overfitting. The problem occurs not only in the Naïve Bayes model but also in k-NN, random forest and decision tree. No matter which classification model is applied, the training accuracy is higher than the validation and test accuracy. We conclude that all the classification and prediction models suffer from overfitting.
               
 
Conclusion
                There are many models that can be applied to a dataset. For this dataset, FP-Growth is not very suitable, as many of the association rules are meaningless and only a few carry meaning.
​
                For clustering, two models were applied to the dataset: X-means and k-Medoids. X-means has a lower Davies–Bouldin index and a higher average within-centroid distance than k-Medoids, which means X-means is more suitable for this dataset. By using X-means, a more accurate clustering result can be produced.
​
                For classification and prediction, four models were applied to the dataset. Naïve Bayes performs the best and the decision tree the worst, but every model's training accuracy is higher than its validation and test accuracy, which shows overfitting. Overfitting is a modelling error that occurs when a function fits a limited set of data points too closely. Several techniques can help overcome it. First, cross-validation can be used: it is a powerful preventative measure against overfitting, using the initial training data to generate multiple mini train-test splits for tuning the model. Second, enlarging the dataset can reduce the problem. Third, more than one model can be applied to the dataset in combination: for example, apply the clustering model first, and then apply prediction within each cluster.
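The k-fold splitting that underlies cross-validation can be sketched as follows (index generation only; the model-fitting step is omitted):

```python
# Sketch of k-fold cross-validation index generation: each record appears in
# the validation fold exactly once, so the accuracy estimate does not depend
# on one lucky train/validation split.
def kfold_indices(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val

splits = list(kfold_indices(10, 5))
```

With n=10 and k=5, each of the five splits holds out two records for validation and trains on the other eight; the k accuracy scores are then averaged into one more reliable estimate.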
​
                Thus, from all the models applied to this dataset, I would suggest applying X-means first, and then applying the classification and prediction models within each cluster. Following these steps, the overfitting problem can be reduced and a more accurate model obtained on all sets. The predictions produced by these combined models will also be more accurate than those from a single model.
​
                By predicting a hotel's rating, hotel management can improve the hotel when they know the rating is low, and this can increase the service quality of the tourism industry.