Let's discuss what problems we are trying to solve with all these data needs.
Terms
- Sentiment Analysis -- The analysis of text to determine the emotion behind it.
- Cluster -- A group of computers arranged together logically to work more efficiently on tasks in parallel.
With an understanding of the importance
of big data, and a desire to learn
0:00
the paradigms and major tools for dealing
with it, you're now ready to tackle many
0:04
new problems you may have never dealt with
before across many different domains.
0:08
Let's take a moment to look in more detail
at some of these problems, just to get you
0:13
thinking about how great a tool
big data can be in your solutions.
0:17
We often want to store large amounts
of data for various reasons.
0:21
For instance, maybe your company processes
large amounts of credit card transactions,
0:25
and you want to store them for
fraud detection.
0:29
Maybe your side project requires you to
store tweets for sentiment analysis like,
0:33
is this a happy tweet or an angry one?
0:37
Or perhaps your school project
requires that you find and
0:39
rank related articles from
Wikipedia based on relevancy.
0:42
Storing large amounts of data is hard
to do in memory on one machine;
0:47
you can't just store a 100 gigabyte data
set in RAM on your typical laptop.
0:51
Keeping that much data on your hard disk
0:56
means the code that you write
has to process all that data.
0:58
It also has to be able to read
it efficiently and all at once.
1:02
As you might imagine,
1:05
this is something that is hard to
write well, especially from scratch.
1:06
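To make the disk problem a little more concrete, here is a minimal sketch of one common approach: streaming a file row by row and keeping only a running total in memory. The `amount` column and the tiny in-memory sample are assumptions made for illustration; they are not from the video.

```python
import csv
import io

def total_amount(lines):
    """Sum the 'amount' column while holding only one row in memory at a time."""
    total = 0.0
    reader = csv.DictReader(lines)
    for row in reader:
        total += float(row["amount"])
    return total

# Stand-in for a very large file on disk (hypothetical data).
# With a real file, pass open(path, newline="") instead.
sample = io.StringIO("id,amount\n1,10.50\n2,4.25\n3,100.00\n")
print(total_amount(sample))  # 114.75
```

Even this simple version only works when the answer fits in a running summary, which hints at why writing the general case from scratch is hard.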
Searching through lots of data
introduces several problems.
1:11
Think a minute here about search
bars on your most used applications,
1:15
like LinkedIn, Facebook, or Twitter.
1:18
There is a lot of data in
those tiny little search
1:21
bars that you need to search through.
1:24
So first, you have to index
the data into search terms, and
1:26
then surface it quickly enough for
users, so
1:29
that they don't notice too much latency or
delay in their request.
1:32
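Indexing data into search terms can be sketched as an inverted index: a lookup from each word to the documents that contain it. This is a toy, single-machine illustration under assumed sample documents, not how any of the applications mentioned actually implement search.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical documents, keyed by id.
docs = {
    1: "big data tools",
    2: "big clusters of machines",
    3: "data on many machines",
}
index = build_index(docs)
print(sorted(search(index, "big data")))  # [1]
```

Because the index is built ahead of time, a query only touches the entries for its own words instead of scanning every document, which is what keeps latency low.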
You also need to make sure that
your data is stored consistently,
1:36
otherwise the results will be wrong for
each different request.
1:39
Now, searching is typically
spread across many machines.
1:43
So you need tools to ingest the new data
that will update the search indexes, so
1:46
that your query systems get
the most up to date data.
1:51
Another common problem is that we need
to process large amounts of incoming or
1:55
streaming data.
1:59
Now, for instance, imagine a power company
that has thousands of sensors in their
2:00
power stations, distributed
across large geographic regions.
2:04
They need to be able to
ingest all that new data,
2:08
which could be in any number of different
units, as well as different formats.
2:11
They will use that data
to detect anomalies
2:15
that could indicate failures or surges.
2:18
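As a rough sketch of that sensor scenario, the code below converts readings to a single unit and flags any value that strays too far from the recent average. The unit table, window size, and tolerance are all hypothetical choices for illustration; a real power company would use far more careful methods.

```python
from collections import deque

# Hypothetical unit table: factors that convert each unit to kilowatts.
TO_KILOWATTS = {"kW": 1.0, "MW": 1000.0}

def detect_anomalies(readings, window=3, tolerance=0.5):
    """Yield normalized readings that differ from the recent average
    by more than `tolerance`, expressed as a fraction of that average."""
    recent = deque(maxlen=window)
    for value, unit in readings:
        kilowatts = value * TO_KILOWATTS[unit]
        if len(recent) == window:
            average = sum(recent) / window
            if abs(kilowatts - average) > tolerance * average:
                yield kilowatts
                continue  # keep the outlier out of the baseline window
        recent.append(kilowatts)

# Made-up stream of (value, unit) readings in mixed units.
stream = [(100, "kW"), (0.125, "MW"), (110, "kW"), (0.5, "MW"), (105, "kW")]
print(list(detect_anomalies(stream)))  # [500.0]
```

The function is a generator, so it can consume readings one at a time as they arrive rather than waiting for the whole data set.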
Social media applications like
Facebook need to be able to process
2:21
actions from users quickly,
and send out notifications.
2:24
As a Facebook user, you need to know
immediately when you get that like.
2:28
I mean, it's like,
why you posted it, right?
2:32
They don't want you feeling like,
no one likes me?
2:35
That validation needs
to be almost immediate.
2:38
Netflix, Amazon, and Hulu all want to be
able to process your movie choices and
2:41
provide specific
recommendations in real time,
2:46
when you need them, with the latest
versions of their video catalog.
2:49
Cyber security companies want to be
able to ingest customers' logs, and
2:53
tell the customer whether they've
been potentially compromised.
2:57
Minutes matter here, and the wrong
tools will provide answers far too
3:01
slowly to limit the magnitude
of the possible attack.
3:04
To solve these problems,
we've referenced the need to use many
3:08
machines to do both the data
processing and the storage.
3:11
Now, in general, this is another problem
presented to us in the realm of big data.
3:14
To store, process, and recall information
from large and complex data sets,
3:19
it's almost always a necessity to
have more than one computer, or
3:24
relatively small server,
to handle the data.
3:28
When you start to have data spread across,
potentially,
3:32
many machines, you need to have tools
that abstract away the management and
3:34
workflow needed to use multiple machines.
3:38
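One small piece of the bookkeeping such tools hide is deciding which machine holds which record. A minimal sketch, assuming simple hash partitioning and made-up records (real systems layer replication, rebalancing, and failure handling on top of this):

```python
import hashlib

def assign_machine(key, machine_count):
    """Pick a machine index for `key` using a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % machine_count

def partition(records, machine_count):
    """Group (key, value) records into one list per machine."""
    machines = [[] for _ in range(machine_count)]
    for key, value in records:
        machines[assign_machine(key, machine_count)].append((key, value))
    return machines

# Hypothetical user actions keyed by user id.
records = [("user-1", "like"), ("user-2", "post"), ("user-1", "comment")]
for number, shard in enumerate(partition(records, 3)):
    print(number, shard)
```

Records with the same key always land on the same machine, so everything about one user can be processed in one place.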
As we'll learn,
almost all big data tools and
3:42
systems are built for running across
large groups, or clusters, of machines.
3:44
Now that we have an idea of the new
problems for big data, let's take a look
3:50
at how they are being solved by some
of the most popular tools out there.
3:54