1 00:00:00,840 --> 00:00:03,030 Storing data is only part of the process. 2 00:00:03,030 --> 00:00:05,350 Often, you'll need to get insight out of that data or 3 00:00:05,350 --> 00:00:08,590 process it at lightning speeds in order to keep up. 4 00:00:08,590 --> 00:00:12,610 Let's discuss three of the main computational use cases for big data. 5 00:00:12,610 --> 00:00:15,080 When we talk about generalized data processing, 6 00:00:15,080 --> 00:00:19,136 we're talking about being fed data and running a computation over it. 7 00:00:19,136 --> 00:00:22,520 It's typically fed through a stream, and the processing could be as simple 8 00:00:22,520 --> 00:00:26,570 as counting, taking the variance of data, or some other custom algorithm. 9 00:00:27,740 --> 00:00:30,980 Generalized data processing forms the foundation 10 00:00:30,980 --> 00:00:35,410 of most of what you'll encounter if you work with big data systems and problems. 11 00:00:35,410 --> 00:00:39,340 These systems provide APIs, or application programming interfaces, 12 00:00:39,340 --> 00:00:43,810 which are specific documented functions that are available for you to use. 13 00:00:43,810 --> 00:00:46,480 These functions are robust enough to solve nearly 14 00:00:46,480 --> 00:00:49,840 any problem with data in nearly any format. 15 00:00:49,840 --> 00:00:54,270 Apache Hadoop is the most popular generalized data processing engine 16 00:00:54,270 --> 00:00:56,136 in the big data world today. 17 00:00:56,136 --> 00:01:00,541 Hadoop is based on a system originally developed at Google to rank webpages of 18 00:01:00,541 --> 00:01:01,785 the entire Internet. 19 00:01:01,785 --> 00:01:06,610 Apache Spark is a newer player in the game, and it's built on top of Hadoop. 20 00:01:06,610 --> 00:01:09,690 Spark brings lightning-fast speed not available in Hadoop 21 00:01:09,690 --> 00:01:12,770 through largely in-memory processing and a simpler API.
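As a rough illustration of the stream processing described above (being fed data one value at a time and counting it or taking its variance), here is a minimal single-machine sketch in plain Python. It is not how Hadoop or Spark do it internally; the names RunningStats and process_stream are made up for this example, and the running-variance update is Welford's online method, chosen here because it never needs to hold the whole stream in memory.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningStats:
    """Hypothetical helper: tracks count, mean, and variance of a stream."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared differences from the mean

    def update(self, value: float) -> None:
        # Welford's online update: one pass, constant memory.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; defined as 0.0 for an empty stream.
        return self.m2 / self.count if self.count else 0.0


def process_stream(stream: Iterable[float]) -> RunningStats:
    """Consume values one at a time, as if they arrived over a stream."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
    return stats


if __name__ == "__main__":
    result = process_stream(iter([2, 4, 4, 4, 5, 5, 7, 9]))
    print(result.count, result.mean, result.variance)
```

A distributed engine would split the stream across many machines and merge partial results, but the shape of the computation (feed data in, update a small running state) is the same idea.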
22 00:01:14,000 --> 00:01:18,890 Spark also has the ability to handle streaming data, work with graphs, and 23 00:01:18,890 --> 00:01:21,660 manipulate data using SQL. 24 00:01:21,660 --> 00:01:23,790 It also has a machine learning component. 25 00:01:25,190 --> 00:01:28,650 Nearly every company that deals with big data uses Hadoop or 26 00:01:28,650 --> 00:01:31,060 Spark at some place in their stack. 27 00:01:31,060 --> 00:01:34,320 Hadoop and Spark are both backed by HDFS. 28 00:01:34,320 --> 00:01:37,630 Remember, that stands for Hadoop Distributed File System. 29 00:01:37,630 --> 00:01:41,560 And therefore, they can scale to tens of thousands of machines in a single cluster. 30 00:01:42,860 --> 00:01:47,880 Okay, so let's move on to our next computational use case, search. 31 00:01:47,880 --> 00:01:51,940 Often, you'll need to find some piece of data within all the data that you have, so 32 00:01:51,940 --> 00:01:54,090 that you can display it to a user. 33 00:01:54,090 --> 00:01:58,350 Now to find that relevant data, all of your internal data has to be stored in 34 00:01:58,350 --> 00:02:03,180 a way that can be quickly retrieved and surfaced to the application asking for it. 35 00:02:03,180 --> 00:02:06,790 This turns out to be such a difficult problem at scale 36 00:02:06,790 --> 00:02:10,260 that there are major tools built just for this. 37 00:02:10,260 --> 00:02:14,690 So next up in our computational use cases is search. 38 00:02:15,780 --> 00:02:21,330 Popular tools here include Solr and Lucene, both of which are Apache projects. 39 00:02:21,330 --> 00:02:25,510 You've probably noticed a bunch of these big data projects are part of Apache. 40 00:02:25,510 --> 00:02:27,500 Check the teacher's notes for more. 41 00:02:27,500 --> 00:02:32,000 Lucene is the full-text search tool that Solr uses to provide more advanced 42 00:02:32,000 --> 00:02:32,910 searching features.
43 00:02:34,020 --> 00:02:37,860 Full-text searching involves breaking your content up into search terms so 44 00:02:37,860 --> 00:02:41,240 that it can ignore differences in tense, or other differences, 45 00:02:41,240 --> 00:02:44,880 like the singular book versus the plural books. 46 00:02:45,990 --> 00:02:52,260 Users of Solr and Lucene include Netflix, DuckDuckGo, Instagram, AOL, and Twitter. 47 00:02:53,410 --> 00:02:56,720 Elasticsearch is another popular open source search tool, and 48 00:02:56,720 --> 00:02:59,200 it is an alternative choice to Lucene and Solr. 49 00:03:00,250 --> 00:03:04,900 These systems take data from different storage layers, like HDFS, then 50 00:03:04,900 --> 00:03:10,120 index the data on the disk, and finally provide APIs for front-end clients that 51 00:03:10,120 --> 00:03:14,978 hook into the search engine and perform full-text searches on that indexed data. 52 00:03:14,978 --> 00:03:19,050 The next computational use case that we are going to look at is machine learning. 53 00:03:19,050 --> 00:03:23,150 You can think of this as training computers to recognize patterns. 54 00:03:23,150 --> 00:03:27,170 Now this is done through statistical analysis and more complex algorithms. 55 00:03:27,170 --> 00:03:29,218 Check the teacher's notes for more. 56 00:03:29,218 --> 00:03:30,402 TensorFlow and 57 00:03:30,402 --> 00:03:35,190 scikit-learn are two of the most popular machine learning frameworks available. 58 00:03:35,190 --> 00:03:38,680 Machine learning can be used for a plethora of applications. 59 00:03:38,680 --> 00:03:42,820 Some examples where it's used are in recommending products or services. 60 00:03:42,820 --> 00:03:46,679 It can also be used to detect and prevent financial or ad fraud. 61 00:03:46,679 --> 00:03:50,439 The magic behind self-driving cars operating on crowded roads is greatly 62 00:03:50,439 --> 00:03:52,655 powered by machine learning.
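Looking back at the search use case for a moment: the steps described earlier (break content into search terms, ignore differences like book versus books, index the data, then query the index) can be sketched with a toy inverted index in plain Python. This is only an illustration under simplifying assumptions, not how Lucene, Solr, or Elasticsearch are implemented. The functions normalize, tokenize, build_index, and search are hypothetical names for this example, and the "strip a trailing s" rule is a crude stand-in for the real analyzers and stemmers those tools ship with.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def normalize(token: str) -> str:
    """Lowercase a token and crudely fold plurals (toy rule, not a real stemmer)."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s"):
        token = token[:-1]
    return token


def tokenize(text: str) -> List[str]:
    """Break content up into normalized search terms."""
    return [normalize(t) for t in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each term to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


if __name__ == "__main__":
    docs = {
        1: "A book about big data",
        2: "Books on machine learning",
        3: "Streaming data systems",
    }
    index = build_index(docs)
    print(search(index, "book"))
    print(search(index, "data books"))
```

Because both the documents and the query go through the same normalize step, a search for the singular form also matches documents that only contain the plural.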
63 00:03:52,655 --> 00:03:56,320 TensorFlow is a Google open source project that allows users to build 64 00:03:56,320 --> 00:04:01,812 complex data flow graphs that can perform a wide variety of machine learning tasks. 65 00:04:01,812 --> 00:04:04,390 Since TensorFlow can be rather complex, 66 00:04:04,390 --> 00:04:08,270 beginners are usually advised to start exploring with scikit-learn. 67 00:04:08,270 --> 00:04:11,910 Scikit-learn is a Python-based framework that is very approachable. 68 00:04:11,910 --> 00:04:14,829 It has a wide-ranging set of features for all kinds of machine learning. 69 00:04:16,030 --> 00:04:18,870 So that just about wraps up the domain of computations. 70 00:04:18,870 --> 00:04:22,160 Let's take a deeper look at our final domain, infrastructure, 71 00:04:22,160 --> 00:04:23,140 right after this quick break.