Big data spans a broad range of data sets that are nearly impossible to use without specialized tools and systems. Now, I'd like you to think of big data as not only the data sets being processed, but also the infrastructure that's needed to support such analysis. This spans from ingesting the data on the back end all the way to displaying visualizations on the front end. Big data plays a part in all of that. Big data is typically characterized by what are known as the four V's: volume, velocity, variety, and veracity. Let's take a look at each one of those.

The size of the data helps to define whether it can actually be considered big data. You define the volume by the size of the data in gigabytes, terabytes, or even petabytes. Now this varies widely across data sets, but usually anything over one gigabyte is considered to be a large volume.

Velocity describes how quickly data arrives and how quickly it must be processed, and it's often what makes a system difficult to grow and develop. This is usually defined by the problem space. For example, one problem space you might encounter is search querying in real time, where you want to process data extremely quickly. This is typically called streaming.
On the other hand, if you want to process data once per day, that's called batch processing. For instance, maybe you want to process usage data from a whole suite of mobile applications at the end of each day. You won't really be bothered by the latency of getting a response back, and you can prepare the data to be ready when you need it.

Data can be very diverse, often containing both structured and unstructured sources. Often, it's made up of many different types: some of it is dense or sparse, and sometimes it's dependent on time and sometimes it's not. There are many other defining characteristics. All of these different properties mean that processing the data can be significantly harder, because of the amount of work that has to be done ahead of time to get it into the correct format.

The trustworthiness and validity of captured data is not always immediately clear and can therefore vary greatly, affecting accurate analysis. This can lead to longer pre-processing times and more specific requirements on how much data is necessary to make effective decisions with the tools available.

So once again, those four V's are volume, the scale of data; velocity, the speed of data; veracity, the certainty of data; and variety, the diversity of data.
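The streaming-versus-batch distinction described above can be sketched in a few lines of Python. This is a hypothetical illustration rather than anything from the course: the event records, app names, and function names are all made up. The point is that both styles compute the same totals; they differ only in when the work happens.

```python
from datetime import datetime

# Hypothetical usage events from a suite of mobile apps:
# each record is (app_name, timestamp, session_seconds).
events = [
    ("maps", datetime(2024, 1, 1, 9, 30), 120),
    ("mail", datetime(2024, 1, 1, 10, 0), 45),
    ("maps", datetime(2024, 1, 1, 21, 15), 300),
]

def batch_daily_usage(day_events):
    """Batch style: process the whole day's data in one pass.

    Latency doesn't matter here; the report is prepared at the
    end of the day and is ready when you need it."""
    totals = {}
    for app, _ts, seconds in day_events:
        totals[app] = totals.get(app, 0) + seconds
    return totals

def stream_usage(event, running_totals):
    """Streaming style: update state one event at a time, so
    results are available in near real time."""
    app, _ts, seconds = event
    running_totals[app] = running_totals.get(app, 0) + seconds
    return running_totals
```

Feeding the same events through either function yields identical per-app totals; the design choice is purely about whether you can tolerate the latency of waiting for the batch.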
With these four characteristics in mind, let's explore why big data is so important and why you should be aware of it, right after this quick break.