How do you characterize Big Data?
petabyte -- 2^50 bytes; 1024 terabytes, or roughly a million (1,048,576) gigabytes.
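To make the petabyte definition above concrete, here is a minimal Python sketch of the arithmetic. It assumes the binary (power-of-two) convention used in that definition, and `describe_size` is a hypothetical helper written only for this illustration.

```python
# Binary (power-of-two) units, matching the "2^50 bytes" definition above.
KB = 2**10
MB = 2**20
GB = 2**30
TB = 2**40
PB = 2**50


def describe_size(num_bytes):
    """Hypothetical helper: format a byte count using the largest fitting unit."""
    units = [("PB", PB), ("TB", TB), ("GB", GB), ("MB", MB), ("KB", KB)]
    for name, size in units:
        if num_bytes >= size:
            return f"{num_bytes / size:.2f} {name}"
    return f"{num_bytes} bytes"


print(PB // TB)                # terabytes in one petabyte: 1024
print(PB // GB)                # gigabytes in one petabyte: 1048576
print(describe_size(3 * TB))   # 3.00 TB
```

The "million gigabytes" figure is a rounding of 1,048,576, which is why the two numbers in the definition do not look like exact multiples of 1000.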
The 4V's of Big Data
- Volume - the scale
- Velocity - the speed
- Veracity - the certainty
- Variety - the diversity
We know that there might be more than 4 V's in the world of Big Data, but for those of you getting started in the industry, these are the most relevant to remember.
Big data spans a broad range of data sets that are nearly impossible to use without specialized tools and systems. Now, I'd like you to think of big data as not only the data sets that are being processed, but also the infrastructure that's needed to support such analysis. Now this spans from ingesting the data in the back end all the way to displaying visualizations on the front end. Big data plays a part in all of that.

Big data is typically characterized by what is known as the four V's. That's volume, velocity, variety, and veracity. Let's take a look at each one of those.

The size of the data helps to define whether it can actually be considered big data. You define the volume by the size of the data in gigabytes, terabytes, or even petabytes. Now this varies widely across data sets, but usually anything over one gigabyte is considered to be a large volume.

The velocity helps you define the challenges and the demands that make growth and development difficult. This is often defined by the problem space. Now for example, one problem space you might encounter is that you're doing search querying in real time, so you want to process data extremely quickly. This is typically called streaming. On the other hand, if you want to process data once per day, that's called batch processing. Now for instance, maybe you want to process usage data from a whole suite of mobile applications at the end of each day. You won't really be bothered by the latency of getting a response back. And you can prepare the data to be ready when you need it.

Data can be very diverse, often containing both structured and unstructured sources. Often, it's made up of many different types. Some of it is dense or sparse, and sometimes it's dependent on time and sometimes it's not. There are many other defining characteristics.
All of these different properties of data mean that processing it can be significantly harder, due to the amount of work that has to be done ahead of time to get it into the correct format.

The trustworthiness and validity of captured data is not always immediately clear and can therefore vary greatly, affecting accurate analysis. This could lead to longer pre-processing times and more specific requirements on how much data is necessary to make effective decisions with the tools available.

So once again, those four V's are volume, the scale of data; velocity, the speed of data; veracity, the certainty of data; and variety, the diversity of data. With these four characteristics in mind, let's explore why big data is so important and why you should be aware of it, right after this quick break.
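The difference between streaming and batch processing described under velocity can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `usage_events` records and both function names are hypothetical, and a real system would read from a message queue or a data store rather than an in-memory list.

```python
from collections import Counter


def process_batch(events):
    """Batch style: wait until all events are collected, then count them once."""
    return Counter(event["app"] for event in events)


def process_stream(events):
    """Streaming style: update running counts as each event arrives."""
    counts = Counter()
    for event in events:
        counts[event["app"]] += 1
        # Yield a snapshot so a caller can react immediately to each event.
        yield dict(counts)


# Hypothetical usage data from a suite of mobile applications.
usage_events = [
    {"app": "maps", "user": 1},
    {"app": "mail", "user": 2},
    {"app": "maps", "user": 3},
]

# End-of-day batch job: one result, latency does not matter.
print(dict(process_batch(usage_events)))

# Streaming job: an up-to-date result after every single event.
for snapshot in process_stream(usage_events):
    print(snapshot)
```

Both approaches end with the same totals. What changes is when a result becomes available: once at the end of the day for the batch job, or after every event for the streaming job.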