1 00:00:00,890 --> 00:00:04,530 With an understanding of the importance of big data, and a desire to learn 2 00:00:04,530 --> 00:00:08,610 the paradigms and major tools for dealing with it, you're now ready to tackle many 3 00:00:08,610 --> 00:00:12,220 new problems you may have never dealt with before across many different domains. 4 00:00:13,380 --> 00:00:17,580 Let's take a moment to look in more detail at some of these problems, just to get you 5 00:00:17,580 --> 00:00:21,970 thinking about how great a tool big data can be in your solutions. 6 00:00:21,970 --> 00:00:25,920 We often want to store large amounts of data for various reasons. 7 00:00:25,920 --> 00:00:29,910 For instance, maybe your company processes large numbers of credit card transactions, 8 00:00:29,910 --> 00:00:32,040 and you want to store them for fraud detection. 9 00:00:33,050 --> 00:00:37,796 Maybe your side project requires you to store tweets for sentiment analysis, like, 10 00:00:37,796 --> 00:00:39,913 is this a happy tweet or an angry one? 11 00:00:39,913 --> 00:00:42,937 Or perhaps your school project requires that you find and 12 00:00:42,937 --> 00:00:46,090 rank related articles from Wikipedia based on relevance. 13 00:00:47,470 --> 00:00:51,540 Storing large amounts of data is hard to do in memory on one machine; 14 00:00:51,540 --> 00:00:56,510 you can't just store a 100-gigabyte data set in RAM on your typical laptop. 15 00:00:56,510 --> 00:00:58,970 Keeping that much data on your hard disk 16 00:00:58,970 --> 00:01:02,160 means the code that you write has to process all that data. 17 00:01:02,160 --> 00:01:05,920 It also has to be able to read it efficiently and all at once. 18 00:01:05,920 --> 00:01:06,780 As you might imagine, 19 00:01:06,780 --> 00:01:09,720 this is something that is hard to write well, especially from scratch. 20 00:01:11,680 --> 00:01:14,320 Searching through lots of data introduces several problems.
21 00:01:15,370 --> 00:01:18,770 Think for a minute here about the search bars in your most-used applications, 22 00:01:18,770 --> 00:01:21,710 like LinkedIn, Facebook, or Twitter. 23 00:01:21,710 --> 00:01:24,690 There is a lot of data behind those tiny little search 24 00:01:24,690 --> 00:01:26,280 bars that you need to search through. 25 00:01:26,280 --> 00:01:29,570 So first, you have to index the data into search terms, and 26 00:01:29,570 --> 00:01:32,000 then surface it quickly enough for users, so 27 00:01:32,000 --> 00:01:36,350 that they don't notice too much latency or delay in their request. 28 00:01:36,350 --> 00:01:39,430 You also need to make sure that your data is stored consistently, 29 00:01:39,430 --> 00:01:43,080 otherwise the results will be wrong for each different request. 30 00:01:43,080 --> 00:01:46,710 Now, searching is typically spread across many machines. 31 00:01:46,710 --> 00:01:51,150 So you need tools to ingest the new data that will update the search indexes, so 32 00:01:51,150 --> 00:01:53,760 that your query systems get the most up-to-date data. 33 00:01:55,450 --> 00:01:59,830 Another common problem is that we need to process large amounts of incoming or 34 00:01:59,830 --> 00:02:00,910 streaming data. 35 00:02:00,910 --> 00:02:04,660 Now, for instance, imagine a power company that has thousands of sensors in its 36 00:02:04,660 --> 00:02:08,550 power stations, distributed across large geographic regions. 37 00:02:08,550 --> 00:02:11,680 They need to be able to ingest all that new data, 38 00:02:11,680 --> 00:02:15,540 which could be in any number of different units, as well as different formats. 39 00:02:15,540 --> 00:02:18,350 They will use that data to detect anomalies 40 00:02:18,350 --> 00:02:20,150 that could indicate failures or surges. 41 00:02:21,570 --> 00:02:24,900 Social media applications like Facebook need to be able to process 42 00:02:24,900 --> 00:02:28,510 actions from users quickly, and send out notifications.
43 00:02:28,510 --> 00:02:32,860 As a Facebook user, you need to know immediately when you get that like. 44 00:02:32,860 --> 00:02:35,140 I mean, that's why you posted it, right? 45 00:02:35,140 --> 00:02:38,170 They don't want you feeling like, "No one likes me." 46 00:02:38,170 --> 00:02:40,380 That validation needs to be almost immediate. 47 00:02:41,840 --> 00:02:46,350 Netflix, Amazon, and Hulu all want to be able to process your movie choices and 48 00:02:46,350 --> 00:02:49,330 provide specific recommendations in real time, 49 00:02:49,330 --> 00:02:52,260 when you need them, with the latest version of their video catalog. 50 00:02:53,660 --> 00:02:57,270 Cybersecurity companies want to be able to ingest customers' logs, and 51 00:02:57,270 --> 00:03:00,030 tell the customer whether they've been potentially compromised. 52 00:03:01,100 --> 00:03:04,420 Minutes matter here, and the wrong tools will provide answers far too 53 00:03:04,420 --> 00:03:07,280 slowly to limit the magnitude of a possible attack. 54 00:03:08,440 --> 00:03:11,600 To solve these problems, we've referenced the need to use many 55 00:03:11,600 --> 00:03:14,910 machines to do both the data processing and the storage. 56 00:03:14,910 --> 00:03:19,330 Now, in general, this is another problem presented to us in the realm of big data. 57 00:03:19,330 --> 00:03:24,010 To store, process, and recall information from large and complex data sets, 58 00:03:24,010 --> 00:03:28,430 it's almost always necessary to have more than one computer, or 59 00:03:28,430 --> 00:03:30,730 relatively small server, to handle the data. 60 00:03:32,070 --> 00:03:34,320 When you start to have data spread across, potentially, 61 00:03:34,320 --> 00:03:38,480 many machines, you need to have tools that abstract away the management and 62 00:03:38,480 --> 00:03:41,080 workflow needed to use multiple machines.
63 00:03:42,120 --> 00:03:44,870 As we'll learn, almost all big data tools and 64 00:03:44,870 --> 00:03:49,410 systems are built for running across large groups, or clusters, of machines. 65 00:03:50,990 --> 00:03:54,500 Now that we have an idea of the new problems for big data, let's take a look 66 00:03:54,500 --> 00:03:57,770 at how they are being solved by some of the most popular tools out there.