How do you get insight on all this data?
- Apache Hadoop homepage
- Wikipedia has a great history of the Hadoop project and much of the Big Data ecosystem that formed on top of Hadoop
- Apache Spark homepage
- Apache Spark Documentation
- Apache Solr Quickstart
- Apache Lucene homepage
- Elasticsearch in 5 minutes
- Getting started with TensorFlow
- Getting started with Scikit-Learn
- Machine Learning on Spark with MLlib
- More ML on Spark Tutorial
Storing data is only part of the process. Often, you'll need to get insight out of that data, or process it at lightning speed in order to keep up. Let's discuss three of the main computational use cases for big data.

When we talk about generalized data processing, we're talking about being fed data and running a computation over it. The data is typically fed through a stream, and the processing could be as simple as counting, taking the variance of the data, or running some other custom algorithm. Generalized data processing forms the foundation of most of what you'll encounter if you work with big data systems and problems. These systems provide APIs, or application programming interfaces, which are specific, documented functions that are available for you to use. These functions are robust enough to solve nearly any problem with data in nearly any format.

Apache Hadoop is the most popular generalized data processing engine in the big data world today. Hadoop is based on a system originally developed at Google to rank the webpages of the entire Internet. Apache Spark is a newer player to the game, and it's built on top of Hadoop. Spark brings lightning-fast speed not available in Hadoop through largely in-memory processing and a simpler API. Spark can also handle streaming data, work with graphs, and query structured data with SQL, and it has a machine learning component. Nearly every company that deals with big data uses Hadoop or Spark somewhere in its stack. Hadoop and Spark are both backed by HDFS (remember, that stands for Hadoop Distributed File System), and can therefore scale to tens of thousands of machines in a single cluster.

Okay, so let's move on to our next computational use case, search.
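The map-and-reduce model that Hadoop popularized can be sketched without a cluster at all. Below is a minimal, single-machine Python sketch of the idea, not Hadoop's actual API: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group. Hadoop and Spark run these same phases in parallel across many machines.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would
    # before handing each key's list of values to a reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values; here, a simple sum.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big systems", "big insight"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # → 3
```

Counting words is the classic first example, but swapping the reducer for a different aggregation (a variance, a custom algorithm) gives you the "generalized" part of generalized data processing.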
Often, you'll need to find some piece of data within all the data that you have, so that you can display it to a user. Now, to find that relevant data, all of your internal data has to be stored in a way that lets it be quickly retrieved and surfaced to the application asking for it. This turns out to be such a difficult problem at scale that there are major tools built just for this.

Popular tools here include Solr and Lucene, both of which are Apache projects. You've probably noticed that a bunch of these big data projects are part of Apache; check the teacher's notes for more. Lucene is the full-text search tool that Solr uses to provide more advanced searching features. Full-text searching involves breaking your content up into search terms so that the engine can ignore differences in tense, or other differences, like the singular "book" versus the plural "books". Users of Solr and Lucene include Netflix, DuckDuckGo, Instagram, AOL, and Twitter. Elasticsearch is another popular open source search tool, and it is an alternative to Lucene and Solr. These systems take data from different storage layers, like HDFS, index the data on disk, and finally provide APIs for front-end clients to hook into the search engine and perform full-text searches on that indexed data.

The next computational use case that we're going to look at is machine learning. You can think of this as training computers to recognize patterns. This is done through statistical analysis and more complex algorithms; check the teacher's notes for more. TensorFlow and scikit-learn are two of the most popular machine learning frameworks available. Machine learning can be used for a plethora of applications. One example is recommending products or services.
It can also be used to detect and prevent financial or ad fraud. And the magic behind self-driving cars operating on crowded roads is largely powered by machine learning.

TensorFlow is a Google open source project that allows users to build complex data flow graphs that can perform a wide variety of machine learning tasks. Since TensorFlow can be rather complex, beginners are usually advised to start exploring with scikit-learn. Scikit-learn is a Python-based framework that is very approachable, with a wide-ranging set of features for all kinds of machine learning.

So that just about wraps up the domain of computations. Let's take a deeper look at our final domain, infrastructure, right after this quick break.
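To make "training computers to recognize patterns" concrete, here is a from-scratch nearest-neighbor classifier in plain Python. It mimics the fit-then-predict shape that scikit-learn estimators use, but it is a sketch, not scikit-learn itself: training memorizes labeled examples, and prediction labels a new point by its closest training example.

```python
import math

class NearestNeighbor:
    def fit(self, points, labels):
        # "Training" here is simply memorizing the labeled examples.
        self.points = points
        self.labels = labels
        return self

    def predict(self, point):
        # Label a new point with the label of its closest neighbor.
        distances = [math.dist(point, p) for p in self.points]
        return self.labels[distances.index(min(distances))]

# Two toy clusters of 2-D points with made-up labels.
model = NearestNeighbor().fit(
    [(1, 1), (1, 2), (8, 9), (9, 8)],
    ["cheap", "cheap", "pricey", "pricey"])
print(model.predict((2, 1)))  # → cheap
print(model.predict((8, 8)))  # → pricey
```

Real frameworks add better algorithms, validation, and scale, but the pattern-recognition idea (learn from labeled data, then generalize to new inputs) is exactly this.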