How do you characterize Big Data?
petabyte -- 2^50 bytes; 1,024 terabytes, or roughly a million gigabytes.
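The unit arithmetic in the note above can be checked with a quick sketch using standard binary (base-2) prefixes; the constant names here are just illustrative:

```python
# Binary (base-2) data-size units.
GIB = 2**30  # gigabyte (binary: gibibyte)
TIB = 2**40  # terabyte (binary: tebibyte)
PIB = 2**50  # petabyte (binary: pebibyte)

print(PIB // TIB)  # terabytes per petabyte -> 1024
print(PIB // GIB)  # gigabytes per petabyte -> 1048576, roughly a million
```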
The 4 V's of Big Data
- Volume - the scale
- Velocity - the speed
- Veracity - the certainty
- Variety - the diversity
There may be more than four V's in the world of Big Data, but for those of you getting started in the industry, these four are the most relevant to remember.
Big data spans a broad range of data sets
that are nearly impossible to use without specialized tools and systems.
Now, I'd like you to think of big data as not only the data sets that are being
processed, but also the infrastructure that's needed to support such analysis.
Now this spans from ingesting the data in the back end all the way to displaying
visualizations on the front end.
Big data plays a part in all of that.
Big data is typically characterized by what is known as the four V's.
That's volume, velocity, variety, and veracity.
Let's take a look at each one of those.
The size of the data helps to define whether it can
actually be considered big data.
You define the volume by the size of the data in gigabytes,
terabytes or even petabytes.
Now this varies widely across data sets,
but usually anything over one gigabyte is considered to be a large volume.
Velocity describes the speed at which data arrives and must be processed,
and it defines the challenges and demands placed on your system.
This is often defined by the problem space.
Now for example, one problem space you might encounter is that you're doing
search querying in real time, so you want to process data extremely quickly.
This is typically called streaming.
On the other hand,
if you want to process data once per day, that's called batch processing.
Now for instance, maybe you want to process
usage data from a whole suite of mobile applications at the end of each day.
You won't really be bothered by the latency of getting a response back.
And you can prepare the data to be ready when you need it.
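The streaming and batch modes described above can be contrasted in a minimal sketch; the event records and function names here are hypothetical, not from any particular framework:

```python
# Hypothetical usage events from a suite of mobile applications.
events = [
    {"app": "mail", "user": "a", "ts": "2024-01-01T09:00:00"},
    {"app": "maps", "user": "b", "ts": "2024-01-01T09:00:01"},
    {"app": "mail", "user": "a", "ts": "2024-01-01T23:59:59"},
]

# Streaming: handle each event as soon as it arrives (low latency).
def handle_stream(event):
    return f"processed {event['app']} event immediately"

for e in events:
    handle_stream(e)

# Batch: accumulate events all day, then aggregate once (latency is fine).
def run_daily_batch(batch):
    counts = {}
    for e in batch:
        counts[e["app"]] = counts.get(e["app"], 0) + 1
    return counts

print(run_daily_batch(events))  # {'mail': 2, 'maps': 1}
```

The batch path trades response time for simplicity: nothing has to happen until the end of the day, when one pass over the accumulated data prepares it for use.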
Data can be very diverse, often containing both structured and unstructured sources.
Often, it's made up of many different types.
Some of it is dense, some of it is sparse, and sometimes it's dependent on
time and sometimes it's not.
There are many other defining characteristics.
All of these different properties of data mean that processing it
can be significantly harder, because of the up-front work that
has to be done to get it into the correct format.
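The up-front formatting work just mentioned can be sketched with two hypothetical sources of different shapes, normalized into one common schema before analysis (the field names are illustrative):

```python
import csv
import io
import json

# Two hypothetical sources with different formats: JSON records and CSV rows.
json_blob = '[{"name": "a", "score": 1}, {"name": "b", "score": 2}]'
csv_blob = "name,score\nc,3\nd,4\n"

def normalize(records):
    # Coerce every record into one common schema before analysis.
    return [{"name": r["name"], "score": int(r["score"])} for r in records]

unified = (normalize(json.loads(json_blob))
           + normalize(csv.DictReader(io.StringIO(csv_blob))))
print(unified)  # four records, all with the same keys and types
```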
The trustworthiness and validity of captured data are not always immediately
clear and can vary greatly, which affects the accuracy of your analysis.
This could lead to longer pre-processing times and
more specific requirements on how much data is necessary to make
effective decisions with the tools available.
So once again, those four V's are volume, the scale of data; velocity, the speed
of data; veracity, the certainty of data; and variety, the diversity of data.
With these four characteristics in mind, let's explore why big data is so
important and why you should be aware of it, right after this quick break.