1. Install QueryScript
If you haven't already, make sure to install QueryScript. We highly recommend using the VSCode extension as you follow along.
2. Download the data
In this tutorial, we'll analyze some data from the National Basketball Association (NBA) courtesy of data from Sports Reference. Start by downloading the queryscript-tutorial repository locally.
The repository contains an analysis.qs
file that contains the final output from this tutorial. You're welcome to either create your own file or use this file as you work through each step.
3. Get familiar with the data
We'll be working with two datasets, data/teams.parquet
(NBA team metadata) and data/regular_season.parquet
(regular season games and outcomes). Create a file ending in .qs
(this tells VSCode to invoke the QueryScript extension) and add these two lines at the top:
let teams = load('data/teams.parquet');let regular_season = load('data/regular_season.parquet');
If you're using VSCode, you'll see a Preview
button (CodeLens) above each statement:

You can click this button to quickly preview the data in each of these variables.
Each of these statements defines a variable which maps to the underlying data. QueryScript will automatically infer the types of each variable based on the underlying parquet file.

You can use these variables in SQL queries, expressions, import statements, etc. Let's get a sense of how big each table is. Add the following two lines to your file:
SELECT COUNT(*) FROM teams;SELECT COUNT(*) FROM regular_season;
We've now run our first queries with QueryScript. Wow, the regular_season table is almost 82,000 rows!
You can run these queries directly by running
qs <your_file.qs>
which runs each unnamed expression and prints its results to stdout.
4. Initial analysis
Let's start analyzing this data by looking for the highest scoring teams of all time:
SELECT TEAM_ABBREVIATION, SUM(PTS) points FROM regular_seasonGROUP BY 1 ORDER BY 2 DESC LIMIT 10;
You should see DEN
as the top result, with 365,920 points. To get the full team name, we can perform a join:
SELECT full_name, SUM(pts) points FROM regular_season JOIN teamsON regular_season.team_abbreviation = teams.abbreviationGROUP BY 1 ORDER BY 2 DESC LIMIT 10;
5. Adding a let expression
Hmm, these two queries look very similar. Other than the join, the underlying logic is the same. As we analyze this data further, we'll probably want to access the full team's name more often. QueryScript allows us to capture this join in a reusable variable:
let regular_season_teams =SELECT full_name, regular_season.*FROMregular_season JOIN teamsON regular_season.team_abbreviation = teams.abbreviation;}
We can now query regular_season_teams
instead of regular_season
and use the full_name
field as we wish. This is called a let expression and is similar to a CTE (except that you can use it in multiple queries) or a view (without updating the database's catalog).
SELECT full_name, SUM(PTS) points FROM regular_season_teamsGROUP BY 1 ORDER BY 2 DESC;
Let expressions are lazily evaluated when they're used by an expression. QueryScript will automatically figure out how to either execute the expression directly or inline it into the surrounding query.
6. A more complex expression
Let's now look at some team-vs-team matchups over time. We'll start by defining a variable that joins together winners and losers:
let matchups = SELECTt1.FULL_NAME AS WINNER,t1.PTS AS WINNER_POINTS,T2.FULL_NAME AS LOSER,T2.PTS AS LOSER_POINTS,T1.SEASON,T1.GAME_IDFROMregular_season_teams t1JOIN regular_season_teams t2 ON t1.GAME_ID = t2.GAME_IDWHEREt1.WL = 'W' AND t1.FULL_NAME != t2.FULL_NAME;
This query tells us the biggest lobsided matchups of all time:
SELECTwinner, loser,COUNT(*) wins,AVG(winner_points-loser_points) avg_diffFROM matchupsGROUP BY 1, 2 DESC;
As you're following along the tutorial, you should use the Preview
CodeLens in VSCode to make sure each query returns what you expect, or simply invoke the expression (e.g. matchups;
) in a file if you're using the command line.
You should see that the Boston Celtics beat the Washington Wizards a whopping 119 times, by an average of 11.6 points.
7. Using functions
What if we want to add a few extra parameters to this analysis? For example, who were the biggest winners in the 90s? What about the last few seasons? In SQL, we would have to copy/paste several slightly different queries to explore this data. QueryScript allows us to capture this analysis in a function:
fn matchup_report(min_season bigint, max_season bigint, winning_team text) {SELECT winner, loser, COUNT(*) wins, AVG(winner_points-loser_points) avg_diffFROM matchupsWHERE(winning_team IS NULL OR winner = winning_team)AND (min_season IS NULL OR season >= min_season)AND (max_season IS NULL OR season <= max_season)GROUP BY 1, 2 ORDER BY 3 DESC}matchup_report(1980, 1989);matchup_report(1990, 1999);matchup_report(winning_team => 'Golden State Warriors');
By default, function parameters are optional and NULL
if unspecified.
We can even compute the latest season and look at relative windows:
let latest_season = (SELECT MAX(season) FROM regular_season_teams);matchup_report(latest_season-1, latest_season);matchup_report(latest_season-5, latest_season);matchup_report(latest_season-5, winning_team => 'Golden State Warriors');
Variables can be relations (e.g. matchups
), or scalars (e.g. latest_season
which is just a number). SQL allows you to express both kinds of expressions, and as a convention forces you to wrap expressions in parantheses to indicate that they're scalars. By parenthesizing (SELECT MAX(season) FROM regular_season_teams)
in the definition latest_season
, we're telling the compiler that we want the single scalar maximum season as a numeric value, just like you would a scalar subselect.
8. Next steps
Now you know the basics of QueryScript! Hopefully you feel empowered to write clean, robust analytics code with powerful abstractions like functions and variables. As you continue your journey, you should explore:
- Using remote databases
- Schemas and imports
- Type definitions
- Materialized views
- Saving schemas to the database