The final important domain of big data that we'll take a look at here is infrastructure. Infrastructure in the world of big data keeps the data flowing, and it allows the systems we've discussed to run at scale on large data sets.

The fundamental unit in big data infrastructure is often a cluster of machines. Now typically, this is a group of networked Linux servers. Managing clusters of machines is a non-trivial task. You can't just write your own homegrown software to manage all the servers you have available. You need the ability to run your processing code across the cluster and then gather the results to display them to the client. The great news is that there are excellent cluster management tools. A popular cluster manager is Apache Mesos, which is used by companies like Airbnb, Apple, Cisco Systems, Netflix, and Uber. Another cluster manager that you're likely to hear about is Kubernetes, which is more often used for managing containers rather than virtual machines. There's more on this in the teacher's notes.

Another vastly important area of infrastructure is the messaging layer between services. This includes sending data between the various storage layers, computation engines, and other infrastructure pieces. We need systems that can handle the robust transportation of messages, because our normal tools, like simple Unix pipes or TCP connections, just can't do the trick for very large amounts of streaming data. One of the most widely used tools to handle this messaging problem is Apache Kafka. Kafka ensures that your data is always available to be ingested by general computation engines or storage layers. It also allows for historical playback of data that has already been streamed through the system. Kafka is typically placed between clients and the back-end servers that run general computation engines and databases.

There are many other infrastructure services out there, but they're beyond the scope of this course. You may also want to look into visualizing your data with tools like D3.js, or managing your state and configuration across many machines with tools like Apache ZooKeeper or HashiCorp's Consul.
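Teacher's notes: to make the Kafka idea above a little more concrete, here is a minimal sketch using the third-party kafka-python client. The broker address and the "events" topic name are placeholders invented for this illustration, not anything from the video.

```python
# Minimal Kafka producer/consumer sketch using the kafka-python client.
# Assumes a broker running at localhost:9092 and a topic named "events";
# both are illustrative placeholders.
from kafka import KafkaProducer, KafkaConsumer

# Produce: clients publish messages to a topic instead of piping
# bytes directly to a back-end server.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b'{"user": 42, "action": "play"}')
producer.flush()  # block until the message is actually delivered

# Consume: a computation engine or storage layer reads from the topic.
# auto_offset_reset="earliest" starts from the oldest retained message,
# which is what enables the "historical playback" described above.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.offset, message.value)
```

Because the topic retains messages, the producer and consumer never need to be online at the same time, which is what lets Kafka sit between clients and the back-end systems.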
Data serialization for faster transfer of data can be performed using tools like Apache Thrift and Parquet (a brief sketch follows below).

Now that we've touched on the three major domains of big data (storage, computation, and infrastructure), we're ready for the final stage of this course, where we'll look at specific problems that a well-known company, Netflix, has encountered, and how they're solving them using big data.
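Teacher's notes: as a rough illustration of the serialization point above, here is a minimal Parquet sketch using the pyarrow library. The column names and file path are made up for the example.

```python
# Minimal Parquet sketch using pyarrow: a compact, columnar format
# that is faster to transfer and scan than row-by-row text formats.
# The column names and file path are illustrative placeholders.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "watch_time_minutes": [12.5, 98.0, 41.25],
})

pq.write_table(table, "watch_history.parquet")          # serialize to disk
round_tripped = pq.read_table("watch_history.parquet")  # deserialize
print(round_tripped.to_pydict())
```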