How do you get insight on all this data?
Learn More
- Apache Hadoop homepage
- Wikipedia has a great history of the Hadoop project and much of the Big Data ecosystem that formed on top of Hadoop
- Apache Spark homepage
- Apache Spark Documentation
- Apache Solr Quickstart
- Apache Lucene homepage
- Elasticsearch in 5 minutes
- Getting started with TensorFlow
- Getting started with Scikit-Learn
- Machine Learning on Spark with MLlib
- More ML on Spark Tutorial
Storing data is only part of the process. Often, you'll need to get insight out of that data, or process it at lightning speed in order to keep up. Let's discuss three of the main computational use cases for big data.
When we talk about generalized data processing, we're talking about being fed data and running a computation over it. The data is typically fed through a stream, and the processing can be as simple as counting, taking the variance of the data, or running some other custom algorithm. Generalized data processing forms the foundation of most of what you'll encounter if you work with big data systems and problems. These systems provide APIs, or application programming interfaces, which are specific, documented functions that are available for you to use. These functions are robust enough to solve nearly any problem, with data in nearly any format.
Apache Hadoop is the most popular generalized data processing engine in the big data world today. Hadoop is based on a system originally developed at Google to rank the webpages of the entire Internet. Apache Spark is a newer player to the game, and it's built on top of Hadoop. Spark brings lightning-fast speed not available in Hadoop, through largely in-memory processing and a simpler API. Spark also has the ability to handle streaming data, work with graphs, and query structured data with SQL. It also has a machine learning component. Nearly every company that deals with big data uses Hadoop or Spark somewhere in their stack. Hadoop and Spark are both backed by HDFS. Remember, that stands for Hadoop Distributed File System. And therefore, they can scale to tens of thousands of machines in a single cluster.
Okay, so let's move on to our next computational use case: search. Often, you'll need to find some piece of data within all the data that you have, so that you can display it to a user. Now, to find that relevant data, all of your internal data has to be stored in a way that it can be quickly retrieved and surfaced to the application asking for it. This turns out to be such a difficult problem at scale that there are major tools built just for this.
Popular tools here include Solr and Lucene, both of which are Apache projects. You've probably noticed that a bunch of these big data projects are part of Apache. Check the teacher's notes for more. Lucene is the full-text search tool that Solr uses to provide more advanced searching features. Full-text searching involves breaking your content up into search terms so that it can ignore differences in tense, or other differences, like the singular book versus the plural books. Users of Solr and Lucene include Netflix, DuckDuckGo, Instagram, AOL, and Twitter.
Elasticsearch is another popular open source search tool; it is also built on Lucene, and is a common alternative to Solr. These systems take data from different storage layers, like HDFS, then index the data on disk, and finally provide APIs for front-end clients that hook into the search engine and perform full-text searches on that indexed data.
The next computational use case that we're going to look at is machine learning. You can think of this as training computers to recognize patterns. This is done through statistical analysis and more complex algorithms. Check the teacher's notes for more. TensorFlow and scikit-learn are two of the most popular machine learning frameworks available. Machine learning can be used for a plethora of applications. Some examples where it's used are in recommending products or services. It can also be used to detect and prevent financial or ad fraud. And the magic behind self-driving cars operating on crowded roads is largely powered by machine learning.
TensorFlow is a Google open source project that allows users to build complex dataflow graphs that can perform a wide variety of machine learning tasks. Since TensorFlow can be rather complex, beginners are usually advised to start exploring with scikit-learn. Scikit-learn is a Python-based framework that is very approachable. It offers a wide-ranging set of features for all kinds of machine learning.
So that just about wraps up the domain of computations. Let's take a deeper look at our final domain, infrastructure, right after this quick break.