Histograms are used to show distributions of data. Let's explore the Iris data set with this chart style.
Further Reading
- Matplotlib style sheets
- Number of bins and widths for histograms
- Freedman-Diaconis rule for Histogram Bin widths
As I've mentioned, histograms are used
to show distributions of data.
0:00
This can be very useful to see
how closely grouped together or
0:04
spread out a variable is.
0:07
The area of the rectangles in
a histogram is proportional
0:09
to the frequency of the variable.
0:12
This allows for
0:14
the rough assessment of the probable
distribution of a given variable.
0:15
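As a rough illustration of the area claim above, here is a minimal sketch using NumPy's histogram function. The sample values are made up for illustration and are not the Iris data, and the variable names are assumptions.

```python
import numpy as np

# Made-up sample values for illustration only (not the real Iris data).
sample = np.array([4.5, 4.8, 5.0, 5.1, 5.1, 5.5, 5.6, 5.6, 5.8, 6.0, 6.1, 6.7])

# Raw counts per bin.
counts, edges = np.histogram(sample, bins=5)

# density=True rescales the bar heights so that the total bar area is 1.
heights, _ = np.histogram(sample, bins=5, density=True)
widths = np.diff(edges)

# Area of each bar = height * width, which matches the fraction of
# values that fall into that bin.
areas = heights * widths
print(areas)
print(counts / len(sample))
```

The two printed arrays should match, which is what "area proportional to frequency" means in practice.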
The rectangles or
0:19
bins in a histogram are important to
consider when doing data visualization.
0:20
Both the number of overall bins and
0:25
the bin width can have an impact on
the overall presentation of data.
0:27
From our iris data set, let's generate
a histogram chart to see the distribution
0:32
of petal length.
0:36
Let's examine the petal lengths
of the iris virginica class and
0:38
visualize the distribution of that data.
0:41
Here's where we start off in
a new notebook, iris histogram,
0:43
bringing in our data and
getting it stored in a list called irises.
0:47
Let's process through this list
to just obtain the petal length
0:51
of the iris virginica species.
0:54
Create a list to hold our data.
0:57
Let's also create a variable for
our bin numbers,
1:08
so we can see how changing bin
numbers impacts our visualization.
1:11
Now let's loop through our
data to get our petal lengths.
1:18
For petal in range of our iris data.
1:23
So if the species is Iris-virginica,
1:50
we'll add the petal length to
our virginica_petal_length list.
1:53
And we'll get that from our iris data.
2:12
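The loop described above might look something like the sketch below. The exact layout of the irises list is not shown in the transcript, so the column order (petal length at index 2, species at index 4), the string values, and the name number_of_bins are all assumptions. The four rows are a small stand-in for the full data set.

```python
# Stand-in rows for illustration; in the notebook, irises holds the full data set.
# Assumed column order: sepal length, sepal width, petal length, petal width, species.
irises = [
    ["5.1", "3.5", "1.4", "0.2", "Iris-setosa"],
    ["7.0", "3.2", "4.7", "1.4", "Iris-versicolor"],
    ["6.3", "3.3", "6.0", "2.5", "Iris-virginica"],
    ["5.8", "2.7", "5.1", "1.9", "Iris-virginica"],
]

# A list to hold our data, and a variable for the number of bins.
virginica_petal_length = []
number_of_bins = 10  # hypothetical name

# Loop through the data and keep only the Iris-virginica petal lengths.
for petal in range(len(irises)):
    if irises[petal][4] == "Iris-virginica":
        # Values are assumed to be strings, so convert them to numbers.
        virginica_petal_length.append(float(irises[petal][2]))

print(virginica_petal_length)
```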
Now we can pass our data
into our plt.hist method.
2:19
This method takes several parameters,
2:22
including the number of
bins we'd like to have,
2:24
the color we'd like to set,
and alpha values.
2:27
plt.hist pass in our
virginica_petal_length.
2:30
Our number of bins.
2:40
The color of our plot will be red.
2:45
And we give it an alpha value
to make it slightly transparent.
2:50
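The plt.hist call described above could be sketched as follows. The petal length values here are made up for illustration, and the alpha value of 0.5 is an assumption, since the transcript only says the bars should be slightly transparent.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only (not the real Iris data).
virginica_petal_length = [4.5, 4.8, 5.0, 5.1, 5.1, 5.5, 5.6, 5.6, 5.8, 6.0, 6.1, 6.7]
number_of_bins = 10

# plt.hist returns the per-bin counts, the bin edges, and the drawn bars.
counts, bin_edges, patches = plt.hist(
    virginica_petal_length,
    bins=number_of_bins,
    color="red",
    alpha=0.5,  # assumed value: 0 is fully transparent, 1 is fully opaque
)
```

The returned counts and bin_edges can be handy for checking what the chart is actually showing.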
As I've mentioned, it's always
important to add labels to your charts.
2:55
For chart title.
2:59
Iris-virginica Petal length.
3:05
We'll give that a font size of 12.
3:13
For our x-axis, for xlabel,
3:16
we'll give it what it is,
3:20
Petal length in centimeters.
3:23
Font size of 10.
3:30
And for our ylabel.
3:34
We'll just call it Probability.
3:36
And again,
we'll give that a font size of 10.
3:46
Cool, and then we call our show method and
run our cell.
3:54
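Put together, the labeling steps above might look like this sketch. The data values are made up, and the exact label strings and alpha value are assumptions based on what is said in the video.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only (not the real Iris data).
virginica_petal_length = [4.5, 4.8, 5.0, 5.1, 5.1, 5.5, 5.6, 5.6, 5.8, 6.0, 6.1, 6.7]

# Note: by default plt.hist plots raw counts on the y-axis;
# passing density=True would rescale the bars to a probability density.
plt.hist(virginica_petal_length, bins=10, color="red", alpha=0.5)

# Title and axis labels, with the font sizes mentioned in the video.
plt.title("Iris-virginica Petal Length", fontsize=12)
plt.xlabel("Petal Length (cm)", fontsize=10)
plt.ylabel("Probability", fontsize=10)
```

In the notebook, calling plt.show() after these lines renders the chart.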
We are shown a histogram
chart with red rectangles.
4:02
However, the rectangles
are all clumped together and
4:05
can be a challenge to differentiate.
4:08
Matplotlib includes some
chart styling options which can help out.
4:10
Let's apply matplotlib's
classic style to our chart and
4:16
see if it helps clear things up.
4:19
We'll go back up here and,
under where we assign our figure size,
4:22
we'll ask it to use the classic style and
then we can run our cell.
4:32
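Applying the classic style can be sketched as below. The figure size line from the notebook is not shown in the transcript, so it is left out here, and the data values are made up.

```python
import matplotlib.pyplot as plt

# "classic" is one of the style sheets that ships with Matplotlib.
print("classic" in plt.style.available)

# Apply the style before drawing; it changes the defaults for later plots.
plt.style.use("classic")

# Made-up values for illustration only (not the real Iris data).
virginica_petal_length = [4.5, 4.8, 5.0, 5.1, 5.1, 5.5, 5.6, 5.6, 5.8, 6.0, 6.1, 6.7]
counts, bin_edges, patches = plt.hist(
    virginica_petal_length, bins=10, color="red", alpha=0.5
)
```

One likely reason the bars become easier to tell apart is that the classic style draws an outline around each bar. A similar effect should be possible without changing the style by passing edgecolor="black" to plt.hist.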
That's much better.
4:40
Now we are setting our
number of bins to ten,
4:41
which is also the matplotlib default for
histograms.
4:44
Let's change that to 15 and then to 5 to
see how that impacts our visualization.
4:47
Notice here that at 15 bins
we have some empty bins.
4:58
While we get more detail about the data
set, it also spreads the data into
5:02
a broken comb look that doesn't provide as
clear a picture of the distribution.
5:07
And now let's go back and set it to 5 bins.
5:11
At 5 bins,
the data isn't portrayed very well either.
5:19
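To compare bin counts without redrawing the chart each time, something like the following sketch can help. It uses NumPy's histogram function on made-up values (not the real Iris data) and reports the bin width and the number of empty bins for each setting.

```python
import numpy as np

# Made-up values for illustration only (not the real Iris data).
sample = np.array([4.5, 4.8, 5.0, 5.1, 5.1, 5.5, 5.6, 5.6, 5.8, 6.0, 6.1, 6.7])

summary = {}
for bins in (5, 10, 15):
    counts, edges = np.histogram(sample, bins=bins)
    width = edges[1] - edges[0]          # all bins have equal width here
    empty = int((counts == 0).sum())     # bins that contain no values
    summary[bins] = (width, empty)
    print(f"{bins:>2} bins: width={width:.2f}, empty bins={empty}")
```

More bins means narrower bins, which is where the gaps of the broken comb look tend to come from.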
There are a variety of formulas and
considerations for the number of bins and
5:23
their widths to use.
5:27
I've included links to some resources for
these in the teacher's notes.
5:28
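One of the formulas listed in the notes is the Freedman-Diaconis rule, which sets the bin width to 2 * IQR / n^(1/3). Here is a minimal sketch of it; the function name is hypothetical and the sample values are made up.

```python
import numpy as np

def freedman_diaconis_bins(values):
    """Suggest a number of bins using the Freedman-Diaconis rule."""
    values = np.asarray(values, dtype=float)
    q75, q25 = np.percentile(values, [75, 25])
    iqr = q75 - q25                           # interquartile range
    width = 2 * iqr / len(values) ** (1 / 3)  # suggested bin width
    if width == 0:
        return 1                              # all central values identical
    return int(np.ceil((values.max() - values.min()) / width))

# Made-up values for illustration only (not the real Iris data).
sample = [4.5, 4.8, 5.0, 5.1, 5.1, 5.5, 5.6, 5.6, 5.8, 6.0, 6.1, 6.7]
print(freedman_diaconis_bins(sample))
```

NumPy and Matplotlib also accept bins="fd" to apply this rule directly, for example plt.hist(sample, bins="fd").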
It is not uncommon in practice
to produce multiple histograms
5:33
with different numbers of bins, before
settling on the best communication tool.
5:37
Histograms are great for
exploring the distribution of data, but
5:42
our data set has many more
ways that it can be explored.
5:46
Sepal length, sepal width, and
5:49
petal width can all be explored
across all different species.
5:50
Before the next video,
5:54
practice creating some other
histograms of this data on your own.
5:56
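As a starting point for that practice, here is one possible sketch of a small helper that pulls any measurement for any species. The helper name, the COLUMNS mapping, and the assumed column order are all hypothetical, and the four rows are a stand-in for the full irises list.

```python
import matplotlib.pyplot as plt

# Assumed column positions within each row of the data.
COLUMNS = {"sepal length": 0, "sepal width": 1, "petal length": 2, "petal width": 3}

# Stand-in rows for illustration; in the notebook, irises holds the full data set.
irises = [
    ["5.1", "3.5", "1.4", "0.2", "Iris-setosa"],
    ["7.0", "3.2", "4.7", "1.4", "Iris-versicolor"],
    ["6.3", "3.3", "6.0", "2.5", "Iris-virginica"],
    ["5.8", "2.7", "5.1", "1.9", "Iris-virginica"],
]

def measurements(rows, species, column_name):
    """Return one measurement column, as floats, for a single species."""
    index = COLUMNS[column_name]
    return [float(row[index]) for row in rows if row[4] == species]

sepal_width = measurements(irises, "Iris-virginica", "sepal width")
plt.hist(sepal_width, bins=10, color="blue", alpha=0.5)
plt.title("Iris-virginica Sepal Width", fontsize=12)
```

Swapping the species or column name gives a different histogram from the same code.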
Next, we'll look at box plots.
5:59