1
00:00:00,400 --> 00:00:04,520
In order to work with data, you often need to store it in some medium before or

2
00:00:04,520 --> 00:00:05,970
after processing it.

3
00:00:05,970 --> 00:00:09,420
For this reason, data storage tools and frameworks make up a large part of

4
00:00:09,420 --> 00:00:12,240
the major tools and frameworks in the big data ecosystem.

5
00:00:13,265 --> 00:00:17,325
We'll be taking a look at these three major classes of data storage systems.

6
00:00:17,325 --> 00:00:21,855
Relational databases are used to store structured data like we just talked about.

7
00:00:21,855 --> 00:00:26,845
These databases store the structured data according to what is known as a schema.

8
00:00:26,845 --> 00:00:30,465
A schema defines the way the data is organized in the system,

9
00:00:30,465 --> 00:00:32,740
also known as its structure.

10
00:00:32,740 --> 00:00:37,090
Relational databases use a query language that can access these schemas,

11
00:00:37,090 --> 00:00:42,440
most typically a dialect of SQL, which stands for Structured Query Language.

12
00:00:42,440 --> 00:00:45,890
SQL provides a standard way to query, manipulate, and

13
00:00:45,890 --> 00:00:48,370
store data in relational databases.

14
00:00:48,370 --> 00:00:51,076
Check the teacher's notes if you're looking to learn more about SQL.

15
00:00:52,355 --> 00:00:57,370
Relational databases perform very well for data that is not sparse.

16
00:00:57,370 --> 00:01:01,750
Sparsity in a database is measured by the number of blank entries.

17
00:01:01,750 --> 00:01:06,500
In the case of relational databases, the less sparse the data, the better.

18
00:01:06,500 --> 00:01:09,710
They also do well with data that can be contained on a single machine.

19
00:01:10,810 --> 00:01:14,700
They become less appropriate when data needs to be spread across many machines

20
00:01:14,700 --> 00:01:16,390
and accessed in parallel.

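As a rough illustration of the ideas above (a schema, a SQL query, and sparsity), here is a minimal sketch using Python's built-in `sqlite3` module. The `users` table, its columns, and the helper names `build_demo_db`, `names_in_city`, and `sparsity` are all made up for this example; they are not from the course.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    # The schema: every row must fit this structure.
    conn.execute(
        "CREATE TABLE users ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " email TEXT,"
        " city TEXT)"
    )
    rows = [
        (1, "Ada", "ada@example.com", "London"),
        (2, "Grace", None, "New York"),        # blank email
        (3, "Linus", "linus@example.com", None),  # blank city
    ]
    conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", rows)
    return conn


def names_in_city(conn, city):
    """Query the table with SQL, using a parameter for the city."""
    cur = conn.execute(
        "SELECT name FROM users WHERE city = ? ORDER BY name", (city,)
    )
    return [row[0] for row in cur.fetchall()]


def sparsity(conn, table, columns):
    """Return the fraction of blank (NULL) entries across the given columns.

    Table and column names are interpolated directly, so only pass
    trusted, hard-coded names here.
    """
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    total = row_count * len(columns)
    blanks = 0
    for col in columns:
        blanks += conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
        ).fetchone()[0]
    return blanks / total if total else 0.0
```

With this sample data, 2 of the 12 entries are blank, so the table is only lightly sparse, which is the kind of data the transcript says relational databases handle well.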
21
00:01:16,390 --> 00:01:18,460
There are databases built specifically for

22
00:01:18,460 --> 00:01:22,330
these highly distributed purposes and we'll cover those here shortly.

23
00:01:22,330 --> 00:01:27,302
A few of the major relational databases you've probably heard

24
00:01:27,302 --> 00:01:31,346
of are PostgreSQL, MySQL, Microsoft SQL Server, and MariaDB.

25
00:01:31,346 --> 00:01:35,871
Non-relational databases are often based around documents, which you can think

26
00:01:35,871 --> 00:01:40,760
of as pieces of data that don't have a predefined schema, or structure.

27
00:01:40,760 --> 00:01:45,692
Now these documents could be JSON, which stands for JavaScript Object Notation,

28
00:01:45,692 --> 00:01:47,814
XML, or just plain old text blobs.

29
00:01:47,814 --> 00:01:52,790
Non-relational databases, or NoSQL databases, perform better when the data needs

30
00:01:52,790 --> 00:01:57,130
to be distributed or sharded across many machines.

31
00:01:57,130 --> 00:02:00,640
This opens up the possibility of accessing everything in parallel,

32
00:02:00,640 --> 00:02:05,570
with reads and writes spread across a cluster.

33
00:02:05,570 --> 00:02:10,430
One of the most popular NoSQL databases is MongoDB, a document-based NoSQL

34
00:02:10,430 --> 00:02:14,560
database that stores data in BSON, which is a binary format.

35
00:02:14,560 --> 00:02:17,260
Clients then retrieve the results in the form of JSON.

36
00:02:18,430 --> 00:02:22,460
Remember that there's often more data than can fit on a single computer.

37
00:02:22,460 --> 00:02:26,030
When you need to scale the number of machines where you're storing your data

38
00:02:26,030 --> 00:02:31,020
to potentially thousands or more, you have to use specialized storage systems.

39
00:02:31,020 --> 00:02:34,360
These systems are built to scale, have been battle tested, and

40
00:02:34,360 --> 00:02:37,900
can now store petabytes of data.

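To make the idea of sharding schemaless documents a little more concrete, here is a toy sketch in Python. It assumes each document is a dict with a hypothetical `_id` string field and picks a shard by hashing that field. The function names `shard_for` and `distribute` are invented for this example, and real systems such as MongoDB use their own, more sophisticated strategies.

```python
import hashlib
import json


def shard_for(key, num_shards):
    """Map a document key to a shard number in a stable, repeatable way."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(documents, num_shards):
    """Place each document, serialized as JSON, on the shard chosen by its key.

    Each inner list stands in for one machine in a cluster.
    """
    shards = [[] for _ in range(num_shards)]
    for doc in documents:
        shards[shard_for(doc["_id"], num_shards)].append(json.dumps(doc))
    return shards


# Documents do not need to share a schema: each one carries its own fields.
docs = [
    {"_id": "user-1", "name": "Ada", "languages": ["Python", "SQL"]},
    {"_id": "user-2", "name": "Grace", "rank": "Rear Admiral"},
    {"_id": "order-9", "total": 42.5},
]
shards = distribute(docs, num_shards=4)
```

Because the shard is derived only from the key, any machine can work out where a document lives without consulting the others, which is part of what makes parallel reads and writes possible.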
41
00:02:37,900 --> 00:02:41,960
A few of the most popular storage engines for large distributed data sets

42
00:02:41,960 --> 00:02:47,080
are the Hadoop Distributed File System, or HDFS, and Cassandra.

43
00:02:47,080 --> 00:02:49,940
These are used for unstructured or structured text data.

44
00:02:51,320 --> 00:02:55,140
Amazon's Simple Storage Service, more commonly referred to as Amazon S3,

45
00:02:55,140 --> 00:02:58,090
is used to store files of nearly any size.

46
00:02:59,350 --> 00:03:04,640
Hadoop was inspired by the systems Google built to index the entire web, like all of it.

47
00:03:04,640 --> 00:03:09,510
Cassandra was originally built at Facebook to power part of their systems.

48
00:03:09,510 --> 00:03:12,310
Amazon S3 is used by Dropbox and many others for

49
00:03:12,310 --> 00:03:14,620
storing files across many regions of the world.

50
00:03:15,890 --> 00:03:20,250
And last but not least, we should discuss graph-based databases.

51
00:03:20,250 --> 00:03:24,870
Graph databases store data that can be represented by nodes and edges, where

52
00:03:24,870 --> 00:03:30,260
a node could be a person and an edge could be a relationship that two nodes share.

53
00:03:30,260 --> 00:03:32,590
They help search and walk relationships, and

54
00:03:32,590 --> 00:03:36,000
find patterns in the interconnectivity between nodes.

55
00:03:36,000 --> 00:03:41,350
The canonical example of a good use case for a graph database is a social network.

56
00:03:41,350 --> 00:03:45,210
It's important to keep in mind that you don't want to use a graph database

57
00:03:45,210 --> 00:03:47,800
just for the sake of using a graph database.

58
00:03:47,800 --> 00:03:52,130
It sounds cool, but often, a normal SQL database will do the trick.

59
00:03:52,130 --> 00:03:56,270
If you do find this is a good choice for your data, Neo4j and Dgraph

60
00:03:56,270 --> 00:04:00,130
are two very popular graph databases that are open source and widely used.

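To show what walking relationships between nodes can look like, here is a small in-memory sketch in plain Python. It is not how Neo4j or Dgraph work internally. The names `build_graph`, `friends_of_friends`, and `degrees_of_separation`, and the sample people, are invented for this example.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected graph: each node maps to the set of its neighbors."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def friends_of_friends(graph, person):
    """People two hops away who are not already direct friends."""
    direct = graph.get(person, set())
    result = set()
    for friend in direct:
        result |= graph.get(friend, set())
    return result - direct - {person}


def degrees_of_separation(graph, start, goal):
    """Breadth-first search for the fewest edges between two people.

    Returns None when either person is unknown or no path exists.
    """
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Nodes are people; each edge is a friendship between two of them.
network = build_graph([("alice", "bob"), ("bob", "carol"), ("carol", "dave")])
```

In this tiny social network, carol is a friend of a friend of alice, and alice and dave are three friendships apart. Queries like these are the relationship walks that graph databases are designed to make fast.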
61
00:04:01,550 --> 00:04:04,750
Now that we've had a brief overview of the domain of data storage,

62
00:04:04,750 --> 00:04:07,340
let's start looking at our next domain, computation.