Changing the Size of the Window with Frame Clauses9:49 with Danielle Kiowski
We conclude this workshop on SQL Window Functions by introducing frame clauses. A frame clause lets you control the size of the window, or the number of rows over which the window function operates. This lets you analyze data to determine trends and rolling averages.
- Frame clause — Narrows the frame to a subset of rows based on the order of the rows
- ROWS — Keyword that introduces a frame clause. Treats each row individually for purposes of inclusion in the frame.
- RANGE — Keyword that introduces a frame clause. Groups rows together according to the ordering clause for purposes of inclusion in the frame (i.e. “CURRENT ROW” includes all of the rows with the same ordering value as the row on which the result is returned).
SELECT PLAYER_NAME, PLAYER_RANKING, PLAYER_TEAM, COUNT(MATCH_ID) as MATCHES_PLAYED, SUM(CHECKMATE) as MATCHES_WON, SUM(CHECKMATE)/COUNT(MATCH_ID) as WIN_RATE, AVG(PLAYER_RANKING) OVER () as AVG_RANKING, PLAYER_RANKING - AVG(PLAYER_RANKING) OVER () as DIFF_FROM_AVERAGE, SUM(SUM(CHECKMATE)) OVER () as TOTAL_MATCHES, SUM(MAX(CHECKMATE)) OVER () as PLAYERS_WITH_WINS, MAX(SUM(CHECKMATE)) OVER () as HIGHEST_WINS, SUM(SUM(CHECKMATE)) OVER (PARTITION BY PLAYER_TEAM) / SUM(COUNT(MATCH_ID)) OVER (PARTITION BY PLAYER_TEAM) as TEAM_WIN_RATE, MIN(PLAYER_RANKING) OVER (PARTITION BY SUM(CHECKMATE)) as TIEBREAKER, MIN(PLAYER_RANKING) OVER (PARTITION BY SUM(CHECKMATE), PLAYER_TEAM) as TIEBREAKER_TEAM, RANK() OVER (PARTITION BY PLAYER_TEAM ORDER BY PLAYER_RANKING DESC) as RANKING, MAX(PLAYER_RANKING) OVER (ORDER BY SUM(CHECKMATE) ROWS BETWEEN UNBOUNDED PRECEDING and CURRENT ROW) as MAX_RANKING_ROWS, MAX(PLAYER_RANKING) OVER (ORDER BY SUM(CHECKMATE) ROWS BETWEEN UNBOUNDED PRECEDING and CURRENT ROW) as MAX_RANKING_RANGE, SUM(SUM(CHECKMATE)) OVER (ORDER BY PLAYER_RANKING DESC RANGE UNBOUNDED PRECEDING) / SUM(SUM(CHECKMATE)) OVER () as PROP_MATCHES_WON_BY_EQ_OR_HIGHER, SUM(SUM(CHECKMATE)) OVER (ORDER BY PLAYER_RANKING RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) / SUM(SUM(CHECKMATE)) OVER () as PROP_MATCHES_WON_BY_EQ_OR_HIGHER_2, CAST(COUNT(PLAYER_NAME) OVER (ORDER BY PLAYER_RANKING DESC RANGE UNBOUNDED PRECEDING) AS FLOAT) / COUNT(PLAYER_NAME) OVER ()::FLOAT as PROP_PLAYERS_EQ_OR_HIGHER FROM treehouse.chess_data_matches GROUP BY PLAYER_NAME, PLAYER_RANKING,PLAYER_TEAM ORDER BY SUM(CHECKMATE)
We can enhance the power of window functions further, 0:00 by specifying a frame clause. 0:03 The frame clause is appropriately named, because it determines the size of 0:05 the window, or the number of rows over which the window function operates. 0:08 So far ,we've been working with the default frame 0:12 which encompasses the total result set. 0:15 Frame clauses allow us to narrow our focus to a subset of rows. 0:18 For example, we could look from six rows before up to the current row. 0:22 This is useful for doing moving averages or running sum. 0:26 Say that you have a report of visitors to a website broken down by day. 0:29 It's likely to have a lot of variation depending on the day of the week. 0:34 Maybe people are more likely to visit the site on weekdays, but 0:37 visits are lower on weekends. 0:40 With a lot of variation, it's hard to see whether visits are trending up or down. 0:42 By adding in the average for the day, plus the preceding six days, 0:46 you can get a rolling weekly average that washes out day of the week differences. 0:50 Let's look at how to create a frame clause, 0:54 it comes directly after the order by clause. 0:56 Anytime that we have a frame clause, there must be an ordering clause, 0:59 to specify which rows are before, and which are after the current row. 1:02 The frame clause, tells the function which row to start the window on, and 1:06 where to end it. 1:10 The syntax of the frame clause is between, frames start, and frame end. 1:11 For the frame start and frame end, there are three options. 1:17 We can specify the current row, rows preceding and rows following. 1:21 The current row refers to the row of the query on which the result will appear. 1:26 Rows preceding come before the current row according to the ordering clause and 1:31 rows following come after. 1:35 If we specify unbounded preceding or 1:37 unbounded following, that means the beginning or end of the window. 1:39 Sometimes the order clause will return the same value for multiple rows. 1:43 For example, if we ordered by the number of wins, 1:47 multiple players will have the same number. 1:50 The keyword that we used to introduce the frame clause either rows or 1:53 range determines what happens then. 1:56 Let's look at the chess tournament report to see what the difference is. 1:59 Let's find out for each player what the maximum ranking is for 2:03 players who won the same number of matches or fewer. 2:06 We'll create a new field using MAX( PLAYER_RANKING). 2:08 For the order by clause, we'll use SUM(CHECKMATE) to get the number of wins. 2:17 Then we'll add the frame clause. 2:23 Let's see what happens when we introduce the frame using the ROWS keyword. 2:25 We want the maximum to be calculated for all of the rows before. 2:29 So we'll make the frame start UNBOUNDED PRECEDING. 2:33 We don't want to include any of the rows after, so we'll make the frame And 2:39 CURRENT ROW. 2:43 To compare this to what happens when we introduce the frame clause using the range 2:51 keyword, let's duplicate this entire function. 2:55 Then we'll replace the word ROWS, with the word RANGE. 3:01 Let's order the entire query by the SUM(CHECKMATE), so 3:13 that it's easier to see what these functions are doing. 3:16 Let's look at the results of our two window functions. 3:21 When we use rows, 3:24 sometimes the values change even when the rows have the same order value. 3:25 This is because row looks at the actual row number, but 3:29 range counts any row with the same order value as part of the current row. 3:32 Range is the better choice, if you're doing calculations that depend on an order 3:36 value that could be the same across multiple rows. 3:40 In a regional sales report with one row per region per day, you could use the SUM 3:43 window function partition by date to get the total company sales to date. 3:47 Range would be the best choice in this example because you want the total 3:51 to date to include all the sales from the current date 3:54 not just from the regions that are listed before the current row. 3:57 The rows keyword is sometimes the better choice, because it allows you more control 4:00 over the number of rows to include in the window. 4:04 Here, we've specified UNBOUNDED to get to the beginning of the window. 4:07 If you use the ROWS keyword, you can replace UNBOUNDED with an integer. 4:10 This is useful when you want rolling totals or 4:14 averages over a certain number of rows. 4:16 For example, in a daily sales report you can have a rolling seven days total 4:19 by using rows between six preceding and current row. 4:24 If you leave out the frame clause the default is range between 4:28 unbounded preceding and current row. 4:31 If there is also no ordering clause, then every row is ordered equally so 4:34 the entire window is included. 4:38 There is also a simplified syntax that we can use, if we want the frame to end at 4:40 the current row, in that case we can leave out the between keyword, and 4:44 just use rows or range followed by the frame start. 4:48 Now that we've seen how to create frame clauses, we can answer our final question. 4:52 Do higher ranked players win more often the lower ranked ones? 4:56 To do this, we'll find the percent of all matches that are won by players with 5:00 a ranking equal to or higher than the player in question. 5:04 Well compare that to the percentage of total players that are of equal or 5:07 higher ranking. 5:10 First, let's find the numerator for the win rate. 5:12 We know how to find the total number of wins SUM(SUM(CHECKMATE)). 5:15 We're interested in how all players compare, so 5:23 we'll leave out the partition clause. 5:25 We need an ordering clause to tell the function which rows are higher and 5:28 which are lower. 5:31 So we'll use ORDER BY PLAYER_RANKING DESC. 5:33 Now, we'll add the frame clause. 5:40 There aren't any players with the same ranking, but there could be. 5:42 It's good practice to use RANGE in a case like this. 5:45 Since the order is descending, 5:48 all the higher rankings will be in the fourth and current row. 5:49 So our frame start will be UNBOUNDED PRECEDEDING. 5:52 We can use the simplified syntax, since we want to end the frame on the current row. 5:55 Now we have the total matches won by players with equal or higher ranking. 6:16 We'll divide by the total number of matches. 6:20 We get that using SUM(SUM(CHECKMATE)) over 6:25 empty parenthesis for the entire window. 6:30 We'll compare this to the proportion of players of equal or higher ranking. 6:39 We'll make a new field for that. 6:46 We'll use COUNT(PLAYER_NAME) to get the total number of players, over, and 6:52 we're going to leave out the partitions since we want the entire field. 6:58 But we're going to ORDER BY the PLAYER_RANKING again. 7:05 Just like above, that will be descending over a RANGE UNBOUNDED PRECEDING. 7:12 And we'll divide it by the total number of players. 7:17 COUNT(PLAYER_NAME) OVER the entire window. 7:20 When we run the query we find that it's all 0s and 1s. 7:46 That's because we're dividing two integers. 7:49 In PostgreSQL, the precision of the answer defaults to an integer also. 7:51 We want it to return a number with decimals. 7:56 We'll need to change the type of the numerator, denominator or both to do that. 7:58 Let's start with the numerator. 8:02 We can change it using the CAST function. 8:07 Which requires the expression to be changed. 8:12 Which is the whole window function. 8:16 Then the word AS, and then the type. 8:20 We'll use FLOAT, which will give us a number with decimals. 8:24 It's important to know to use the whole window function including over and 8:28 the parenthesis containing the options as the expression for a type conversion. 8:32 If you try to put the CAST just around the first part, before the over keyword, 8:37 it won't work. 8:40 Now the proportion has decimals, so 8:45 it's showing us a lot more information than just zeroes or ones. 8:47 We can also use a shorthand for 8:50 type conversion, which is a double colon followed by the type. 8:51 Let's do that on the denominator It needs to go at the end of the window function. 8:55 Similar to CAST, 9:03 it won't work if the double colon notation is used before the OVER keyword. 9:04 Looking at the results of our window function again, we can see that 9:13 the winning percentage is consistently higher than the proportion of players, so 9:16 higher ranked players win more frequently than lower ranked ones. 9:20 Now you have everything you need to incorporate window functions into 9:24 your reports. 9:28 Feel free to experiment with the chess dataset, or with your own data 9:29 to explore what window functions can add to your analytical tool kit. 9:33 What other questions can you think of that you can answer using window functions? 9:37 Try passing different expressions in the different clauses and 9:41 see how it affects your result. 9:43 Above all, have fun getting to know the data. 9:45 We'll see you next time. 9:48
You need to sign up for Treehouse in order to download course files.Sign up