First Steps #4: Digging Into DataFrames

DataFrames.jl provides the most widely used tabular data structure in Julia. In this post we'll explore DataFrames using sample data from RDatasets.jl (and we'll plot stuff using StatsPlots).
A rather timely event: DataFrames.jl has reached version 1.0!
⚙️ Setup
First, install DataFrames and RDatasets via Pkg mode (`]`) in the REPL.
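As a rough sketch of the install step: the Pkg-mode command is shown as a comment, followed by the equivalent call through the `Pkg` standard library in case you'd rather run it from a script.

```julia
# In the REPL, press ] to enter Pkg mode, then type:
#   add DataFrames RDatasets
#
# The same thing from ordinary Julia code:
using Pkg
Pkg.add(["DataFrames", "RDatasets"])
```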
Now load both packages along with the diamonds dataset from R's ggplot2 package. The diamonds data contains price/size/quality information on 53,940 different diamonds.
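A minimal sketch of the loading step. The variable name `df` is our own choice for illustration; `dataset` is the RDatasets function that takes a package name and a dataset name.

```julia
using DataFrames, RDatasets

# Load the diamonds data that ships with R's ggplot2 package
df = dataset("ggplot2", "diamonds")

nrow(df)       # number of rows (one per diamond)
names(df)      # column names, as strings
first(df, 5)   # peek at the first five rows
```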
🚀 DataFrames Quickstart
- Variables (columns) of a DataFrame can be referenced either by strings or symbols, e.g. `"I am a string"` and `:I_am_a_symbol`.
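For instance, here is a small sketch (assuming the diamonds data is bound to a variable we call `df`) showing that a string, a symbol, and property access all name the same column:

```julia
using DataFrames, RDatasets
df = dataset("ggplot2", "diamonds")   # `df` is an illustrative name

by_string   = df[:, "Price"]   # column referenced by a string
by_symbol   = df[:, :Price]    # column referenced by a symbol
by_property = df.Price         # property access, also keyed by the column name

by_string == by_symbol == by_property   # true
```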
Make a Copy of a Column
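A sketch of one way to copy a column, assuming the diamonds data is bound to `df` (our name for illustration). Indexing with `:` in the row position allocates a new vector, so edits to it don't touch the DataFrame.

```julia
using DataFrames, RDatasets
df = dataset("ggplot2", "diamonds")   # `df` is an illustrative name

original_first = df.Price[1]

price_copy = df[:, :Price]   # `:` as the row index makes a copy
price_copy[1] = 0            # change the copy only

df.Price[1] == original_first   # true: the DataFrame is untouched
```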
Extract a Column
- Extracting a column retrieves the exact data held in the DataFrame, not a copy. Warning! Making a change to the extracted data will change the values in the DataFrame.
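To see that warning in action, here is a small sketch (again assuming the data is bound to a variable we call `df`). Both `df.Price` and `df[!, :Price]` hand back the stored vector itself.

```julia
using DataFrames, RDatasets
df = dataset("ggplot2", "diamonds")   # `df` is an illustrative name

price = df.Price            # extract without copying
price === df[!, :Price]     # true: both are the very same vector

price[1] = 0                # modify the extracted data...
df.Price[1] == 0            # true: ...and the DataFrame changed too
```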
Selecting a Subset of Columns
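Two common ways to do this, sketched under the assumption that the diamonds data is bound to `df`. The three columns picked here are just an example.

```julia
using DataFrames, RDatasets
df = dataset("ggplot2", "diamonds")   # `df` is an illustrative name

# Indexing with a vector of column names (copies the selected columns)
subset_by_index = df[:, [:Carat, :Cut, :Price]]

# The `select` function does the same job
subset_by_select = select(df, :Carat, :Cut, :Price)

names(subset_by_select)
```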
Filtering a Subset of Rows
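Here is a minimal sketch using `filter`, assuming the diamonds data is bound to `df`. The 10,000 dollar cutoff is an arbitrary threshold chosen for illustration.

```julia
using DataFrames, RDatasets
df = dataset("ggplot2", "diamonds")   # `df` is an illustrative name

# Keep only the rows where the function returns true
expensive = filter(row -> row.Price > 10_000, df)

nrow(expensive)   # how many diamonds passed the filter
```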
The syntax `x -> do something with x` is an anonymous function (sometimes called a lambda expression). The `filter` function applies a function to each row and returns a DataFrame of the rows for which it returned `true`.
- We can also use indexing (with broadcasting) rather than `filter`.
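A sketch of the indexing approach, assuming the data is bound to `df` and reusing an arbitrary 10,000 dollar cutoff. The broadcasted comparison `.>` builds a vector of `true`/`false` values that picks out the rows.

```julia
using DataFrames, RDatasets
df = dataset("ggplot2", "diamonds")   # `df` is an illustrative name

# One Bool per row, then use it as the row index
is_expensive = df.Price .> 10_000
expensive = df[is_expensive, :]

nrow(expensive) == count(is_expensive)   # true
```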
- For functions that accept a function as their first argument, Julia's do-block syntax can help you clean up your code. Inside the block, `&&`, the logical "and" operator, lets you combine multiple filter conditions.
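As a sketch of the do-block form (assuming the data is bound to `df`; the two conditions are made up for illustration), the body of the block becomes the function passed as `filter`'s first argument:

```julia
using DataFrames, RDatasets
df = dataset("ggplot2", "diamonds")   # `df` is an illustrative name

# Equivalent to filter(row -> ..., df), but easier to read
# once the condition spans more than one line
pricey_and_small = filter(df) do row
    row.Price > 10_000 && row.Carat < 1.5
end

nrow(pricey_and_small)
```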
You can now do several essential DataFrame tasks:
- Get a single column
- Choose a subset of columns
- Choose a subset of rows
Next we'll use `groupby` and `combine` to apply functions across groups of data.
🤔 How does Price relate to Cut?
We are big on learning by example, so let's start by answering this relatively simple question. First things first: What do the Price and Cut variables look like?
- Price: The cost in US Dollars.
- Cut: The rating of cut quality. In order (best-to-worst): "Ideal", "Premium", "Very Good", "Good", and "Fair". Side note: The data is stored in a CategoricalArray, which uses less memory than storing each element as a separate `String`.
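If you want to poke at these two variables yourself, here is a small sketch (assuming the data is bound to a variable we call `df`):

```julia
using DataFrames, RDatasets
df = dataset("ggplot2", "diamonds")   # `df` is an illustrative name

# Summary statistics for just the two columns of interest
describe(df[:, [:Price, :Cut]])

# The distinct Cut ratings that appear in the data
cut_ratings = unique(df.Cut)

# The container type used to store Cut
typeof(df.Cut)
```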
Using groupby
We can use the groupby function to group our data by the "Cut" variable.
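A minimal sketch of the grouping step, assuming the data is bound to `df`. The name `gdf` (for "grouped DataFrame") is our own.

```julia
using DataFrames, RDatasets
df = dataset("ggplot2", "diamonds")   # `df` is an illustrative name

# Split the rows into one group per level of Cut
gdf = groupby(df, :Cut)

length(gdf)   # number of groups
keys(gdf)     # the Cut value that identifies each group
```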
Using our grouped DataFrame, we can then apply a function to a variable in each group using `combine`. Let's get the average Price for each level of Cut.
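Sketched out, that could look like the following. We assume the data is bound to `df`, and we pull `mean` from the `Statistics` standard library. The `:Price => mean` pair tells `combine` which column to feed to which function.

```julia
using DataFrames, RDatasets, Statistics
df = dataset("ggplot2", "diamonds")   # `df` is an illustrative name

gdf = groupby(df, :Cut)

# One row per group; the result column is named Price_mean
avg_price = combine(gdf, :Price => mean)
```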
Now we know what the distribution center is for each Cut, but what about the spread and shape?
📊 Using StatsPlots
The StatsPlots package adds functionality and plot recipes to Plots.jl. We'll use it to do the grouping for us so that we don't need `groupby`. First, add StatsPlots.
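A rough sketch of that step, with the Pkg-mode command as a comment and the equivalent `Pkg` call below it:

```julia
# In Pkg mode (press ] in the REPL):
#   add StatsPlots
#
# Or from ordinary Julia code:
using Pkg
Pkg.add("StatsPlots")
```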
Next, use the `@df <dataframe> <plot command>` syntax to create a violin plot overlaid with a box plot for each level of Cut.
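Here is a sketch of what that can look like, assuming the data is bound to a variable we call `df`. The exact styling is our guess at a reasonable default rather than the one true way to draw it.

```julia
using DataFrames, RDatasets, StatsPlots
df = dataset("ggplot2", "diamonds")   # `df` is an illustrative name

# Violin plot of Price for each level of Cut
@df df violin(string.(:Cut), :Price, lab="")

# Overlay a semi-transparent box plot on the same axes
p = @df df boxplot!(string.(:Cut), :Price, alpha=0.4, lab="")
```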
Things to note in the code/plot above:
- The `@df` macro will replace `Symbol`s with the associated DataFrame columns.
- We must use `string.(:Cut)` because Plots/StatsPlots doesn't know how to work with `CategoricalArray`s directly.
- We use `boxplot!` (instead of `boxplot`) to add a new series to the existing plot.
- We set `lab` (shorthand for `label`) to `""` to avoid adding an entry to the plot legend. If all legend entries are blank, the legend will not appear.
- We use `alpha=.4` to set the opacity of the boxplot so that it doesn't cover up the violin in the layer beneath it.
From our plot, we can see the distributions are all similarly skewed with a long right tail. Some Cuts (Good, Premium, and Very Good) are bimodal (they have two "peaks"). However, we are ignoring some important factors (such as how Carat and Color affect the price!), so we shouldn't draw any conclusions based solely on this plot.
🚀 That's It!
You now know how to do a little bit of data wrangling with DataFrames. What do you want to learn about next?
Enjoying Julia For Data Science? Please share us with a friend and follow us on Twitter at @JuliaForDataSci.