How do you get insight on all this data?
Learn More
- Apache Hadoop homepage
- Wikipedia has a great history of the Hadoop project and much of the Big Data ecosystem that formed on top of Hadoop
- Apache Spark homepage
- Apache Spark Documentation
- Apache Solr Quickstart
- Apache Lucene homepage
- Elasticsearch in 5 minutes
- Getting started with TensorFlow
- Getting started with Scikit-Learn
- Machine Learning on Spark with MLlib
- More ML on Spark Tutorial
Storing data is only part of the process. Often, you'll need to get insight out of that data, or process it at lightning speed in order to keep up. Let's discuss three of the main computational use cases for big data.
When we talk about generalized data processing, we're talking about being fed data and running a computation over it. The data is typically fed through a stream, and the processing can be as simple as counting, taking the variance of the data, or running some other custom algorithm. Generalized data processing forms the foundation of most of what you'll encounter if you work with big data systems and problems. These systems provide APIs, or application programming interfaces, which are specific, documented functions that are available for you to use. These functions are robust enough to solve nearly any problem, with data in nearly any format.
Apache Hadoop is the most popular generalized data processing engine in the big data world today. Hadoop is based on a system originally developed at Google to rank the webpages of the entire Internet. Apache Spark is a newer player to the game, and it's built on top of Hadoop. Spark brings lightning-fast speed not available in Hadoop, through largely in-memory processing and a simpler API. Spark also has the ability to handle streaming data, work with graphs, and query structured data with SQL. It also has a machine learning component. Nearly every company that deals with big data uses Hadoop or Spark somewhere in their stack. Hadoop and Spark are both backed by HDFS. Remember, that stands for Hadoop Distributed File System. And therefore, they can scale to tens of thousands of machines in a single cluster.
Okay, so let's move on to our next computational use case: search. Often, you'll need to find some piece of data within all the data that you have, so that you can display it to a user. Now, to find that relevant data, all of your internal data has to be stored in a way that it can be quickly retrieved and surfaced to the application asking for it. This turns out to be such a difficult problem at scale that there are major tools built just for this.
Popular tools here include Solr and Lucene, both of which are Apache projects. You've probably noticed that a bunch of these big data projects are part of Apache. Check the teacher's notes for more. Lucene is the full-text search tool that Solr uses to provide more advanced searching features. Full-text searching involves breaking your content up into search terms so that it can ignore differences in tense, or other differences, like the singular book versus the plural books. Users of Solr and Lucene include Netflix, DuckDuckGo, Instagram, AOL, and Twitter.
Elasticsearch is another popular open source search tool; it is also built on Lucene, and is a common alternative to Solr. These systems take data from different storage layers, like HDFS, then index the data on disk, and finally provide APIs for front-end clients that hook into the search engine and perform full-text searches on that indexed data.
The next computational use case that we're going to look at is machine learning. You can think of this as training computers to recognize patterns. This is done through statistical analysis and more complex algorithms. Check the teacher's notes for more. TensorFlow and scikit-learn are two of the most popular machine learning frameworks available. Machine learning can be used for a plethora of applications. Some examples where it's used are in recommending products or services. It can also be used to detect and prevent financial or ad fraud. And the magic behind self-driving cars operating on crowded roads is largely powered by machine learning.
TensorFlow is a Google open source project that allows users to build complex dataflow graphs that can perform a wide variety of machine learning tasks. Since TensorFlow can be rather complex, beginners are usually advised to start exploring with scikit-learn. Scikit-learn is a Python-based framework that is very approachable. It offers a wide-ranging set of features for all kinds of machine learning.
So that just about wraps up the domain of computations. Let's take a deeper look at our final domain, infrastructure, right after this quick break.