1
00:00:00,840 --> 00:00:03,030
Storing data is only part of the process.

2
00:00:03,030 --> 00:00:05,350
Often, you'll need to get
insight out of that data or

3
00:00:05,350 --> 00:00:08,590
process it at lightning
speeds in order to keep up.

4
00:00:08,590 --> 00:00:12,610
Let's discuss three of the main
computational use cases for big data.

5
00:00:12,610 --> 00:00:15,080
When we talk about
generalized data processing,

6
00:00:15,080 --> 00:00:19,136
we're talking about being fed data and
running a computation over it.

7
00:00:19,136 --> 00:00:22,520
It's typically fed through a stream,
and the processing could be as simple

8
00:00:22,520 --> 00:00:26,570
as counting, taking the variance of data,
or some other custom algorithm.

9
00:00:27,740 --> 00:00:30,980
Generalized data processing
forms the foundation

10
00:00:30,980 --> 00:00:35,410
of most of what you'll encounter if you
work with big data systems and problems.

11
00:00:35,410 --> 00:00:39,340
These systems provide APIs, or
application programming interfaces,

12
00:00:39,340 --> 00:00:43,810
which are specific documented functions
that are available for you to use.

13
00:00:43,810 --> 00:00:46,480
These functions are robust
enough to solve nearly

14
00:00:46,480 --> 00:00:49,840
any problem with data
in nearly any format.

15
00:00:49,840 --> 00:00:54,270
Apache Hadoop is the most popular
generalized data processing engine

16
00:00:54,270 --> 00:00:56,136
in the big data world today.

17
00:00:56,136 --> 00:01:00,541
Hadoop is based on a system originally
developed at Google to rank webpages of

18
00:01:00,541 --> 00:01:01,785
the entire Internet.

19
00:01:01,785 --> 00:01:06,610
Apache Spark is a newer player to
the game, and it can run on top of Hadoop.

20
00:01:06,610 --> 00:01:09,690
Spark brings lightning-fast
speed not available in Hadoop

21
00:01:09,690 --> 00:01:12,770
through largely in-memory processing and
a simpler API.

22
00:01:14,000 --> 00:01:18,890
Spark also has the ability to handle
streaming data, work with graphs, and

23
00:01:18,890 --> 00:01:21,660
query structured data with SQL.

24
00:01:21,660 --> 00:01:23,790
It also has a machine learning component.

25
00:01:25,190 --> 00:01:28,650
Nearly every company that deals
with big data uses Hadoop or

26
00:01:28,650 --> 00:01:31,060
Spark at some place in their stack.

27
00:01:31,060 --> 00:01:34,320
Hadoop and Spark are both backed by HDFS.

28
00:01:34,320 --> 00:01:37,630
Remember, that stands for
Hadoop Distributed File System.

29
00:01:37,630 --> 00:01:41,560
And therefore, it can scale to tens of
thousands of machines in a single cluster.

30
00:01:42,860 --> 00:01:47,880
Okay, so let's move on to our next
computational use case, search.

31
00:01:47,880 --> 00:01:51,940
Often, you'll need to find some piece of
data within all the data that you have, so

32
00:01:51,940 --> 00:01:54,090
that you can display it to a user.

33
00:01:54,090 --> 00:01:58,350
Now to find that relevant data, all of
your internal data has to be stored in

34
00:01:58,350 --> 00:02:03,180
a way that can be quickly retrieved, and
surfaced to the application asking for it.

35
00:02:03,180 --> 00:02:06,790
This turns out to be such
a difficult problem at scale

36
00:02:06,790 --> 00:02:10,260
that there are major tools built just for
this.

37
00:02:10,260 --> 00:02:14,690
So next up in our computational
use cases is search.

38
00:02:15,780 --> 00:02:21,330
Popular tools here include Solr and
Lucene, both of which are Apache projects.

39
00:02:21,330 --> 00:02:25,510
You've probably noticed a bunch of these
big data projects are part of Apache.

40
00:02:25,510 --> 00:02:27,500
Check the teacher's notes for more.

41
00:02:27,500 --> 00:02:32,000
Lucene is the full-text search tool
that Solr uses to provide more advanced

42
00:02:32,000 --> 00:02:32,910
searching features.

43
00:02:34,020 --> 00:02:37,860
Full-text searching involves breaking
your content up into search terms so

44
00:02:37,860 --> 00:02:41,240
that it can ignore differences in tense,
or other differences,

45
00:02:41,240 --> 00:02:44,880
like the singular book
versus the plural books.

46
00:02:45,990 --> 00:02:52,260
Users of Solr and Lucene include Netflix,
DuckDuckGo, Instagram, AOL, and Twitter.

47
00:02:53,410 --> 00:02:56,720
Elasticsearch is another popular
open source search tool and

48
00:02:56,720 --> 00:02:59,200
it is an alternative to Solr,
also built on Lucene.

49
00:03:00,250 --> 00:03:04,900
These systems take data from different
storage layers, like HDFS, then

50
00:03:04,900 --> 00:03:10,120
index the data on the disk, and finally
provide APIs for front-end clients that

51
00:03:10,120 --> 00:03:14,978
hook into the search engine and perform
full-text searches on that indexed data.

52
00:03:14,978 --> 00:03:19,050
The next computational use case that we
are going to look at is machine learning.

53
00:03:19,050 --> 00:03:23,150
You can think of this as training
computers to recognize patterns.

54
00:03:23,150 --> 00:03:27,170
Now this is done through statistical
analysis and more complex algorithms.

55
00:03:27,170 --> 00:03:29,218
Check the teacher's notes for more.

56
00:03:29,218 --> 00:03:30,402
TensorFlow and

57
00:03:30,402 --> 00:03:35,190
scikit-learn are two of the most popular
machine learning frameworks available.

58
00:03:35,190 --> 00:03:38,680
Machine learning can be used for
a plethora of applications.

59
00:03:38,680 --> 00:03:42,820
Some examples where it's used are in
recommending products or services.

60
00:03:42,820 --> 00:03:46,679
It can also be used to detect and
prevent financial or ad fraud.

61
00:03:46,679 --> 00:03:50,439
The magic behind self-driving cars
operating on crowded roads is largely

62
00:03:50,439 --> 00:03:52,655
powered by machine learning.

63
00:03:52,655 --> 00:03:56,320
TensorFlow is a Google open source
project that allows users to build

64
00:03:56,320 --> 00:04:01,812
complex data flow graphs that can perform
a wide variety of machine learning tasks.

65
00:04:01,812 --> 00:04:04,390
Since TensorFlow can be rather complex,

66
00:04:04,390 --> 00:04:08,270
beginners are usually advised to
start exploring with scikit-learn.

67
00:04:08,270 --> 00:04:11,910
Scikit-learn is a Python-based
framework that is very approachable.

68
00:04:11,910 --> 00:04:14,829
It has a wide-ranging set of features for
all kinds of machine learning.

69
00:04:16,030 --> 00:04:18,870
So that just about wraps up
the domain of computations.

70
00:04:18,870 --> 00:04:22,160
Let's take a deeper look at our
final domain, infrastructure,

71
00:04:22,160 --> 00:04:23,140
right after this quick break.