How do you keep data flowing and scale?
The final important domain of big data that we'll take a look at here is infrastructure. Infrastructure in the world of big data allows the data to keep flowing. It also allows the systems that we have discussed to run at scale and on large data sets.

The fundamental unit in big data infrastructure is often a cluster of machines. Typically, this is a group of networked Linux servers. Managing clusters of machines is a non-trivial task. It is rarely practical to write your own homegrown software to manage all the servers you have available. You need to have the ability to run your processing code across the cluster and then gather the results to display them to the client.

The great news is that there are awesome cluster management tools. A popular cluster manager is Apache Mesos. Mesos is used by companies like Airbnb, Apple, Cisco Systems, Netflix, and Uber. Another cluster manager that you're likely to hear about is Kubernetes. Kubernetes is more often used for managing containers and not virtual machines. More in the teacher's notes.

Another area of infrastructure that is vastly important is the layer of messaging done between services. This includes sending data between the various storage layers, computation engines, and other infrastructure pieces. We need systems that can handle the robust transportation of messages, because our normal tools, like simple Unix pipes or TCP connections, just cannot do the trick for very large amounts of streaming data.

One of the most widely used tools to handle this messaging dilemma is Apache Kafka. Kafka ensures that you are always able to keep your data around to be ingested by general computation engines or storage layers. It also allows for historical playback of data that has already been streamed through the system.
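To make the idea of historical playback a bit more concrete, here is a minimal sketch of an append-only log in plain Python. It is only a toy model: the `ToyLog` class and its `append` and `read_from` methods are hypothetical names made up for this illustration and are not Kafka's API. The assumption it illustrates is that records are kept in order and addressed by an offset, so reading does not remove anything and a consumer can start over from an earlier position.

```python
class ToyLog:
    """In-memory, append-only log (hypothetical toy model, not Kafka's API)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return every record at or after `offset`; reading removes nothing."""
        return list(self._records[offset:])


log = ToyLog()
for event in ["play", "pause", "play"]:
    log.append(event)

# A consumer that has already processed the first two records
# only needs what arrived after them.
live = log.read_from(2)

# A consumer added later can still replay the full history from offset 0.
replayed = log.read_from(0)

print(live)      # ['play']
print(replayed)  # ['play', 'pause', 'play']
```

Because each reader simply remembers its own offset, a storage layer and a computation engine could read the same records independently, each at its own pace.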
Kafka is typically placed between clients and the back end servers that run general computation engines and databases.

There are many other infrastructure services that exist, but they are out of the scope of this course. You may also want to look into visualizing your data with tools like D3.js. Or maybe you want to manage your state and configuration across many machines using tools like Apache ZooKeeper or HashiCorp's Consul. Data serialization for faster transfer and more compact storage of data can be performed using tools like Apache Thrift and Apache Parquet.

Now that we've touched on the three major domains of big data (storage, computation, and infrastructure), we're ready for the final stage of this course, where we'll look at specific problems that a well-known company, Netflix, has encountered and how they are solving them using big data.