Machine Learning

Henrik Christensen
Python Web Development Techdegree Student 38,322 Points

Amount of data?

Hi,

I'm giving ML a try and am currently working on a very simple WikipediaLanguagePredicter that tries to predict what language a page is in just by looking at how many links there are on the page.

Using the sklearn DecisionTreeClassifier, how much data (how many items) would I need to feed the decision tree to get reasonably accurate predictions?

I'm currently at a 60% success rate.
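For context, a minimal sketch of the kind of setup I mean (the link counts and language labels below are made up, not real Wikipedia data):

```python
# Minimal sketch: predict a page's language from its link count alone.
# The feature values and labels here are placeholders, not real data.
from sklearn.tree import DecisionTreeClassifier

# Each sample is [number_of_links]; labels are language codes.
X = [[120], [95], [300], [280], [45], [60]]
y = ["da", "da", "en", "en", "de", "de"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Predict the language of a page with 100 links.
print(clf.predict([[100]]))
```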

Kent Åsvang
18,823 Points

In general, it's the quality of the data that matters most, even though certain ML/DL models can crack a lot of problems if you feed them enough data. Maybe you could try adding some features, for example extracting the suffixes from the links and including them in your model.
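A rough sketch of what that feature extraction could look like (the suffix list and example URLs are just illustrative, not from your data):

```python
# Sketch: count domain suffixes (TLDs) among a page's links and use the
# counts as extra features alongside the total link count.
from urllib.parse import urlparse
from collections import Counter

def suffix_counts(links, suffixes=("de", "dk", "en", "org", "com")):
    """Count how many links end in each suffix (hypothetical feature set)."""
    counts = Counter()
    for url in links:
        host = urlparse(url).hostname or ""
        tld = host.rsplit(".", 1)[-1]
        if tld in suffixes:
            counts[tld] += 1
    return [counts[s] for s in suffixes]

# Illustrative links only.
links = ["https://de.wikipedia.org/wiki/Baum", "https://example.de/seite"]
features = [len(links)] + suffix_counts(links)
print(features)
```

The feature vector (total link count plus one count per suffix) can then be fed to the classifier instead of the link count alone.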

1 Answer

Gilang Ilhami
12,045 Points

When you are working with machine learning, the amount of data you need is related to the type of model you choose. There is no exact answer for a given model, but most machine learning models require around 5,000 to 10,000 data points at minimum to get significant results, while most deep learning models need more than that (around 100,000 to 1 million at minimum).

The more data you have, the more complex your model can be. Correct me if I'm wrong, but as far as I know from experience, decision trees work well when the data is not too spread out; in statistical terms, when the variance is not too large.

Is your 60% success rate the training accuracy or the validation/test accuracy? If the training accuracy is much higher than the validation/test accuracy, your model might be overfitting. In that case, you could try an ensemble method such as a Random Forest.
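A quick way to check, sketched with randomly generated stand-in data (not your real features; with random labels the tree will overfit on purpose, so the train/test gap is visible):

```python
# Sketch: compare training vs. held-out accuracy to spot overfitting,
# and try a RandomForest alongside the decision tree.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 500, size=(1000, 1))        # fake "link count" feature
y = rng.choice(["en", "da", "de"], size=1000)   # fake language labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(type(model).__name__, round(train_acc, 2), round(test_acc, 2))
```

A large gap between the two numbers means the model memorizes the training set rather than generalizing; similar numbers for both mean you likely need better features or more data, not a bigger model.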