Let's discuss the problems we are trying to solve with all these data needs.
- Sentiment Analysis -- The analysis of text (typically unstructured, such as tweets) to determine the emotion behind it.
- Cluster -- A group of computers arranged together logically to work more efficiently on tasks in parallel.
With an understanding of the importance of big data, and a desire to learn the paradigms and major tools for dealing with it, you're now ready to tackle many new problems you may have never dealt with before across many different domains. Let's take a moment to look in more detail at some of these problems, just to get you thinking about how great a tool big data can be in your solutions.

We often want to store large amounts of data for various reasons. For instance, maybe your company processes large amounts of credit card transactions, and you want to store them for fraud detection. Maybe your side project requires you to store tweets for sentiment analysis, like: is this a happy tweet or an angry one? Or perhaps your school project requires that you find and rank related articles from Wikipedia based on relevancy.

Storing large amounts of data is hard to do in memory on one machine; you can't just store a 100-gigabyte data set in RAM on your typical laptop. Keeping that much data on your hard disk means the code that you write has to process all that data. It also has to be able to read it efficiently and all at once. As you might imagine, this is something that is hard to write well, especially from scratch.

Searching through lots of data introduces several problems. Think a minute here about search bars on your most used applications, like LinkedIn, Facebook, or Twitter. There is a lot of data behind those tiny little search bars that you need to search through. So first, you have to index the data into search terms, and then surface it quickly enough for users, so that they don't notice too much latency or delay in their request. You also need to make sure that your data is stored consistently, otherwise the results will be wrong for each different request. Now, searching is typically spread across many machines.
So you need tools to ingest the new data that will update the search indexes, so that your query systems get the most up-to-date data.

Another common problem is that we need to process large amounts of incoming or streaming data. Now, for instance, imagine a power company that has thousands of sensors in their power stations, distributed across large geographic regions. They need to be able to ingest all that new data, which could be in any number of different units, as well as different formats. They will use that data to detect anomalies that could indicate failures or surges.

Social media applications like Facebook need to be able to process actions from users quickly, and send out notifications. As a Facebook user, you need to know immediately when you get that like. I mean, it's, like, why you posted it, right? They don't want you feeling like, "No one likes me?" That validation needs to be almost immediate.

Netflix, Amazon, and Hulu all want to be able to process your movie choices and provide specific recommendations in real time, when you need them, with the latest versions of their video catalog.

Cyber security companies want to be able to ingest customers' logs, and tell the customer whether they've been potentially compromised. Minutes matter here, and the wrong tools will provide answers far too slowly to limit the magnitude of the possible attack.

To solve these problems, we've referenced the need to use many machines to do both the data processing and the storage. Now, in general, this is another problem presented to us in the realm of big data. To store, process, and recall information from large and complex data sets, it's almost always a necessity to have more than one computer, or relatively small server, to handle the data.
When you start to have data spread across, potentially, many machines, you need to have tools that abstract away the management and workflow needed to use multiple machines. As we'll learn, almost all big data tools and systems are built for running across large groups, or clusters, of machines.

Now that we have an idea of the new problems for big data, let's take a look at how they are being solved by some of the most popular tools out there.
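
To make the idea of spreading data across a cluster a little more concrete, here is a minimal sketch of hash partitioning, one common way to decide which machine stores which record. It is not how any particular big data tool does it. The function names `node_for_key` and `partition`, the three-node cluster, and the sample credit card records are all made up for illustration.

```python
import hashlib
from collections import defaultdict


def node_for_key(key: str, num_nodes: int) -> int:
    """Map a record key to a node index using a stable hash.

    hashlib is used instead of the built-in hash() because hash() on
    strings is randomized per Python process, so it would not give the
    same placement on every machine.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(records, num_nodes):
    """Group (key, value) records by the node that would store them."""
    nodes = defaultdict(list)
    for key, value in records:
        nodes[node_for_key(key, num_nodes)].append((key, value))
    return dict(nodes)


if __name__ == "__main__":
    # Hypothetical transactions keyed by card ID.
    transactions = [
        ("card-1001", 25.00),
        ("card-1002", 310.50),
        ("card-1001", 12.75),
        ("card-1003", 8.99),
    ]
    for node, recs in sorted(partition(transactions, num_nodes=3).items()):
        print(f"node {node}: {recs}")
```

Because the hash is stable, every record for the same card lands on the same node, so a fraud check for that card only has to look in one place. Real cluster tools also handle the parts this sketch leaves out, such as adding or removing machines, replication, and failures.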