Domain: Infrastructure

2:38 with Craig Dennis and Jared Smith

How do you keep data flowing and scale?

The final important domain of big data that we'll take a look at here is 0:00

infrastructure. 0:04

Infrastructure in the world of big data allows the data to keep flowing, 0:05

and it allows the systems that we have discussed to run at scale and 0:10

on large data sets. 0:13

The fundamental unit in big data infrastructure is 0:15

often a cluster of machines. 0:18

Now typically, this is a group of networked Linux servers. 0:20

Managing clusters of machines is a non-trivial task. 0:24

You can't just write your own homegrown software to manage 0:27

all the servers you have available. 0:30

You need to have the ability to run your processing code across the cluster and 0:32

then gather the results to display them to the client. 0:37
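
To make the "run across the cluster, then gather the results" idea concrete, here is a minimal sketch in Python. It is not how Mesos or Kubernetes work internally. It assumes a thread pool can stand in for the machines in a cluster, and the names count_words and scatter_gather are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Work done on one simulated node: count the words in its share of lines."""
    return sum(len(line.split()) for line in chunk)


def scatter_gather(lines, workers=3):
    """Split the lines across workers, process each share in parallel, combine."""
    # Scatter: worker i receives every `workers`-th line, starting at index i.
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # Gather: merge the partial results into one answer for the client.
    return sum(partial_counts)


lines = [
    "big data keeps flowing",
    "clusters run at scale",
    "gather the results",
]
print(scatter_gather(lines))  # prints 11
```

A real cluster manager also has to handle scheduling, machine failures, and resource limits, which is a large part of why homegrown tooling is so hard to get right.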

The great news is that there are awesome cluster management tools. 0:41

A popular cluster manager is Apache Mesos. 0:45

Mesos is used by companies like Airbnb, Apple, Cisco Systems, Netflix, and Uber. 0:48

Another cluster manager that you're likely to hear about is Kubernetes. 0:54

Kubernetes is more often used for managing containers and not virtual machines. 0:58

More in the teacher's notes. 1:03

Another area of infrastructure that is vastly important 1:05

is the layer of messaging done between services. 1:08

Now, this includes sending data between the various storage layers, 1:11

computation engines and other infrastructure pieces. 1:15

We need systems that can handle the robust transportation of messages, 1:18

because our normal tools, like simple Unix pipes or TCP connections, 1:23

just cannot do the trick for very large amounts of streaming data. 1:27

One of the most widely used tools to handle this messaging dilemma is 1:31

Apache Kafka. 1:35

Kafka ensures that you are always able to keep your data around to be ingested 1:37

by general computation engines or storage layers. 1:42

It also allows for 1:45

historical playback of data that has already been streamed through the system. 1:46
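
The two properties just described, keeping data around and replaying it later, can be sketched with a tiny in-memory log. This is not Kafka's API. It assumes a single append-only list and a per-consumer offset, and the names ReplayableLog, poll, and seek are hypothetical, chosen only for this illustration.

```python
class ReplayableLog:
    """Toy append-only log: records are kept after being read, so they can be replayed."""

    def __init__(self):
        self._records = []   # every record ever appended, in arrival order
        self._offsets = {}   # consumer name -> index of the next record to read

    def append(self, record):
        """Store a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return the next unread records for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        """Move a consumer back (or forward) so that older records can be re-read."""
        if not 0 <= offset <= len(self._records):
            raise ValueError("offset out of range")
        self._offsets[consumer] = offset


log = ReplayableLog()
for event in ["click", "view", "purchase"]:
    log.append(event)

print(log.poll("storage"))   # ['click', 'view', 'purchase']
print(log.poll("storage"))   # [] because nothing new has arrived
log.seek("storage", 0)       # rewind for historical playback
print(log.poll("storage"))   # ['click', 'view', 'purchase']
```

Because reading does not delete anything, a second consumer such as a computation engine could read the same records independently of the storage consumer. The real Kafka adds topics, partitions, replication, and retention limits on top of this idea.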

Kafka is typically placed between clients and 1:50

the back end servers that run general computation engines and databases. 1:52

There are many other infrastructure services that exist, but 1:56

they are out of the scope of this course. 2:00

You may also want to look into visualizing your data with tools like D3.js. 2:02

Or maybe you want to manage your state and configuration across many machines using 2:07

tools like Apache ZooKeeper or HashiCorp's Consul. 2:11

Data serialization for faster transfer of data 2:15

can be performed using tools like Apache Thrift and Parquet. 2:18
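
To see why serialization format matters for transfer size, here is a small sketch that uses only Python's standard library as a stand-in. It does not use Thrift or Parquet. The record layout (user_id, temperature, status) and the function names are assumptions made up for this example.

```python
import json
import struct

# Fixed binary layout, little-endian with no padding:
# user_id as uint32, temperature as float64, status as uint16.
RECORD = struct.Struct("<IdH")


def to_json(user_id, temperature, status):
    """Text encoding: self-describing, but field names travel with every record."""
    record = {"user_id": user_id, "temperature": temperature, "status": status}
    return json.dumps(record).encode("utf-8")


def to_binary(user_id, temperature, status):
    """Binary encoding: both sides must agree on the layout ahead of time."""
    return RECORD.pack(user_id, temperature, status)


def from_binary(payload):
    """Decode a binary record back into a (user_id, temperature, status) tuple."""
    return RECORD.unpack(payload)


as_json = to_json(42, 21.5, 200)
as_binary = to_binary(42, 21.5, 200)
print(len(as_json), len(as_binary))  # the binary form is 14 bytes, the JSON form is larger
print(from_binary(as_binary))        # (42, 21.5, 200)
```

Tools like Thrift build on the same trade-off by generating the encoding and decoding code from a shared schema, so the sender and receiver cannot drift apart.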

Now that we've touched on the three major domains of big data: storage, 2:22

computation, and infrastructure, 2:26

we're ready for the final stage of this course where we'll look at specific 2:27

problems that a well-known company, Netflix, has encountered, 2:31

and how they are solving them using big data. 2:35