How do you keep data flowing and scale?
The final important domain of big data
that we'll take a look at here is
0:00
infrastructure.
0:04
Infrastructure in the world of big
data allows the data to keep flowing.
0:05
And also allows the systems that we
have discussed to run at scale and
0:10
on large data sets.
0:13
The fundamental unit in
big data infrastructure is
0:15
often a cluster of machines.
0:18
Now typically, this is a group
of networked Linux servers.
0:20
Managing clusters of machines
is a non-trivial task.
0:24
You can't just write your own
homegrown software to manage
0:27
all the servers you have available.
0:30
You need to have the ability to run your
processing code across the cluster and
0:32
then gather the results to
display them to the client.
0:37
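As a rough illustration of running processing code across a cluster and then gathering the results, here is a minimal sketch. It assumes worker threads on a single machine as a stand-in for servers; the names `count_words` and `scatter_gather` are hypothetical and are not part of Mesos, Kubernetes, or any other cluster manager.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Work done on one 'machine': count the words in its share of the data."""
    return sum(len(line.split()) for line in chunk)

def scatter_gather(lines, workers=3):
    # Scatter: split the data set into one chunk per worker.
    chunks = [lines[i::workers] for i in range(workers)]
    # Each worker thread stands in for a server in the cluster (an assumption
    # for illustration; a real cluster manager schedules work across machines).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # Gather: combine the partial results into one answer for the client.
    return sum(partial_counts)

lines = ["big data keeps flowing", "clusters run at scale", "gather the results"]
total = scatter_gather(lines)
print(total)
```

A real cluster manager also has to deal with machine failures, scheduling, and resource limits, which this sketch leaves out.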
The great news is that there
are awesome cluster management tools.
0:41
A popular cluster manager is Apache Mesos.
0:45
Mesos is used by companies like Airbnb,
Apple, Cisco Systems, Netflix, and Uber.
0:48
Another cluster manager that you're
likely to hear about is Kubernetes.
0:54
Kubernetes is more often used for managing
containers and not virtual machines.
0:58
More in the teacher's notes.
1:03
Another area of infrastructure
that is vastly important
1:05
is the layer of messaging
done between services.
1:08
Now, this includes sending data
between the various storage layers,
1:11
computation engines and
other infrastructure pieces.
1:15
We need systems that can handle
the robust transportation of messages.
1:18
Because our normal tools,
like simple Unix pipes or TCP connections,
1:23
just cannot do the trick for
very large amounts of streaming data.
1:27
One of the most widely used tools
to handle this messaging dilemma is
1:31
Apache Kafka.
1:35
Kafka ensures that you are always able
to keep your data around to be ingested
1:37
by general computation engines or
storage layers.
1:42
It also allows for
1:45
historical playback of data that has
already been streamed through the system.
1:46
Kafka is typically placed
between clients and
1:50
the back end servers that run general
computation engines and databases.
1:52
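To make the ideas of keeping data around and historical playback more concrete, here is a toy sketch of an append-only log. It is only an in-memory illustration of the concept; `ToyLog` and its methods are hypothetical names and are not Kafka's API.

```python
class ToyLog:
    """Append-only log; a toy stand-in for a retained message stream."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # Messages are kept after being read, so the offset is just the index.
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset):
        # A consumer can start at any retained offset, including 0 for replay.
        return self._messages[offset:]

log = ToyLog()
for event in ["play", "pause", "play", "stop"]:
    log.append(event)

live = log.read_from(3)    # a consumer that only wants the newest event
replay = log.read_from(0)  # a new consumer replaying the full history
print(live, replay)
```

Because reading does not remove messages, a storage layer and a computation engine can each consume the same stream at their own pace.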
There are many other infrastructure
services that exist, but
1:56
they are out of the scope of this course.
2:00
You may also want to look into visualizing
your data with tools like D3.js.
2:02
Or maybe you want to manage your state and
configuration across many machines using
2:07
tools like Apache ZooKeeper or
HashiCorp's Consul.
2:11
Data serialization for
faster transfer of data
2:15
can be performed using tools
like Apache Thrift and Parquet.
2:18
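One way to see why serialization can speed up data transfer is to compare a text encoding with a compact binary one. The sketch below uses only Python's standard `json` and `struct` modules, not Thrift or Parquet, and the record fields are hypothetical.

```python
import json
import struct

# Hypothetical record, invented for illustration.
record = {"user_id": 1234567, "movie_id": 42, "seconds_watched": 3600}

# Text encoding: the field names travel with every record.
as_json = json.dumps(record).encode("utf-8")

# Binary encoding: both sides agree on the schema up front
# (three unsigned 32-bit integers, little-endian), so only values are sent.
SCHEMA = struct.Struct("<III")
as_binary = SCHEMA.pack(
    record["user_id"], record["movie_id"], record["seconds_watched"]
)

# The receiver uses the same schema to decode the bytes.
decoded = SCHEMA.unpack(as_binary)
print(len(as_json), len(as_binary), decoded)
```

The binary form is smaller because the schema is shared ahead of time instead of being repeated in every message.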
Now that we've touched on the three
major domains of big data: storage,
2:22
computation and infrastructure,
2:26
we're ready for the final stage of this
course where we'll look at specific
2:27
problems that a well-known company,
Netflix, has encountered,
2:31
and how they are solving
them using big data.
2:35