Machine Learning

Henrik Christensen
Python Web Development Techdegree Student 38,322 Points

Amount of data?

Hi,

I'm giving ML a try and am currently working on a very simple WikipediaLanguagePredicter that tries to predict what language a page is in just by looking at how many links there are on the page.

Using the sklearn DecisionTreeClassifier, how much data (how many items) would I need to feed the decision tree to get reasonably accurate predictions?

I'm currently at a 60% success rate.
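For context, a minimal sketch of the kind of setup I mean (the link counts and language labels below are made up, not real Wikipedia data):

```python
# Minimal sketch: predict a page's language from its link count alone.
# The feature values and labels here are placeholders, not real data.
from sklearn.tree import DecisionTreeClassifier

# Each sample is [number_of_links]; labels are language codes.
X = [[120], [95], [300], [280], [45], [60]]
y = ["da", "da", "en", "en", "de", "de"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Predict the language of a page with 100 links.
print(clf.predict([[100]]))
```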

Kent Åsvang
18,823 Points

In general, it's the quality of the data that matters most, even though certain ML/DL models can crack a lot of problems if you feed them enough data. Maybe you could try adding some features, for example extracting the suffixes from the links and including them in your model.
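A rough sketch of what that feature extraction could look like (the suffix list and example URLs are just illustrative, not from your data):

```python
# Sketch: count domain suffixes (TLDs) among a page's links and use the
# counts as extra features alongside the total link count.
from urllib.parse import urlparse
from collections import Counter

def suffix_counts(links, suffixes=("de", "dk", "en", "org", "com")):
    """Count how many links end in each suffix (hypothetical feature set)."""
    counts = Counter()
    for url in links:
        host = urlparse(url).hostname or ""
        tld = host.rsplit(".", 1)[-1]
        if tld in suffixes:
            counts[tld] += 1
    return [counts[s] for s in suffixes]

# Illustrative links only.
links = ["https://de.wikipedia.org/wiki/Baum", "https://example.de/seite"]
features = [len(links)] + suffix_counts(links)
print(features)
```

The feature vector (total link count plus one count per suffix) can then be fed to the classifier instead of the link count alone.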

1 Answer

Gilang Ilhami
12,045 Points

When you are working with machine learning, the amount of data you need is related to the type of model you choose. There is no exact answer for a given model, but most machine learning models require around 5,000 to 10,000 data points at minimum to get significant results, while most deep learning models need more than that (around 100,000 to 1 million at minimum).

The more data you have, the more complex your model can be. Correct me if I'm wrong, but as far as I know from experience, decision trees work well when the data is not too spread out; in statistical terms, when the variance is not too large.

Is your 60% success rate the training accuracy or the validation/test accuracy? If the training accuracy is much higher than the validation/test accuracy, your model might be overfitting. In that case, you could try an ensemble method such as a Random Forest.
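A quick way to check, sketched with randomly generated stand-in data (not your real features; with random labels the tree will overfit on purpose, so the train/test gap is visible):

```python
# Sketch: compare training vs. held-out accuracy to spot overfitting,
# and try a RandomForest alongside the decision tree.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 500, size=(1000, 1))        # fake "link count" feature
y = rng.choice(["en", "da", "de"], size=1000)   # fake language labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(type(model).__name__, round(train_acc, 2), round(test_acc, 2))
```

A large gap between the two numbers means the model memorizes the training set rather than generalizing; similar numbers for both mean you likely need better features or more data, not a bigger model.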