First Steps #4: Digging Into DataFrames
DataFrames.jl provides the most widely used tabular data structure in Julia. In this post we'll explore DataFrames using sample data from RDatasets.jl (and we'll plot stuff using StatsPlots).
A rather timely event: DataFrames.jl has reached version 1.0!
⚙️ Setup
First, install DataFrames and RDatasets via Pkg mode (`]`) in the REPL:
(@v1.6) pkg> add DataFrames RDatasets
Now load both packages along with the diamonds dataset from R's ggplot2 package. The diamonds data contains price/size/quality information on 53,940 different diamonds.
julia> using DataFrames, RDatasets
julia> df = dataset("ggplot2", "diamonds")
53940×10 DataFrame
Row │ Carat Cut Color Clarity Depth Tabl ⋯
│ Float64 Cat… Cat… Cat… Float64 Floa ⋯
───────┼────────────────────────────────────────────────────
1 │ 0.23 Ideal E SI2 61.5 5 ⋯
2 │ 0.21 Premium E SI1 59.8 6
3 │ 0.23 Good E VS1 56.9 6
4 │ 0.29 Premium I VS2 62.4 5
5 │ 0.31 Good J SI2 63.3 5 ⋯
6 │ 0.24 Very Good J VVS2 62.8 5
7 │ 0.24 Very Good I VVS1 62.3 5
8 │ 0.26 Very Good H SI1 61.9 5
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
🚀 DataFrames Quickstart
- Variables (columns) of a DataFrame can be referenced by either strings or symbols, e.g. `"I am a string"` and `:I_am_a_symbol`.
Make a Copy of a Column
df[:, "Carat"]
df[:, :Carat]
Extract a Column
- These commands retrieve the exact data held in the DataFrame, not a copy. Warning! Changing the extracted data will change the values in the DataFrame.
df.Carat
df[!, "Carat"]
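The copy/no-copy distinction is easy to check on a small table. Here is a minimal sketch using a hypothetical three-row `toy` DataFrame (made-up values, not the diamonds data):

```julia
using DataFrames

# Hypothetical three-row table standing in for the diamonds data
toy = DataFrame(Carat = [0.23, 0.21, 1.10], Cut = ["Ideal", "Premium", "Good"])

# `:` returns a copy, so editing it leaves the DataFrame alone
copied = toy[:, :Carat]
copied[1] = 9.99
unchanged = toy.Carat[1]      # still 0.23

# `!` (like `toy.Carat`) returns the stored vector itself
direct = toy[!, :Carat]
direct[1] = 9.99
changed = toy.Carat[1]        # now 9.99
```

If you only want to read a column, either style works. Reach for the copy when you plan to modify the result and want the DataFrame left as it was.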
Selecting a Subset of Columns
select(df, "Carat")
select(df, ["Carat", "Cut"])
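As a quick sketch of what `select` gives back, here is a hypothetical two-row `toy` table (made-up values). The `Not` selector is not used elsewhere in this post; it is exported by DataFrames and drops the named column instead of keeping it.

```julia
using DataFrames

# Hypothetical two-row table with three columns
toy = DataFrame(Carat = [0.23, 1.10], Cut = ["Ideal", "Good"], Price = [326, 5200])

two_cols = select(toy, ["Carat", "Cut"])   # keep only the listed columns
no_cut   = select(toy, Not(:Cut))          # keep everything except Cut

names(two_cols)   # ["Carat", "Cut"]
names(no_cut)     # ["Carat", "Price"]
names(toy)        # ["Carat", "Cut", "Price"], the original is untouched
```

`select` returns a new DataFrame, so `toy` still has all three columns afterwards.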
Filtering a Subset of Rows
The syntax `x -> do something with x` is an anonymous function (sometimes called a lambda expression). The `filter` function applies a function to each row and returns a DataFrame containing the rows for which it returned `true`.
filter(row -> row.Carat > 1, df)
- We can also use indexing (with broadcasting) rather than `filter`:
df[df.Carat .> 1, :]
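To see that these approaches agree, here is a small sketch on a hypothetical four-row `toy` table (made-up values). It also shows a third form this post doesn't cover, `filter(:Carat => f, df)`, where the function receives a single column's value rather than the whole row. The name `is_big` is just for illustration.

```julia
using DataFrames

# Hypothetical four-row table
toy = DataFrame(Carat = [0.23, 1.10, 1.52, 0.90], Price = [326, 5200, 9100, 3100])

is_big = carat -> carat > 1     # an anonymous function bound to a name

by_row   = filter(row -> is_big(row.Carat), toy)   # function sees a whole row
by_col   = filter(:Carat => is_big, toy)           # function sees one column's value
by_index = toy[is_big.(toy.Carat), :]              # broadcasting builds a Bool mask

all_equal = by_row == by_col == by_index           # true
```

All three return the same two rows, so the choice is mostly about readability.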
- For functions that accept a function as their first argument, Julia's do-block syntax can help you clean up your code. Here we use `&&`, the logical "and" operator, to combine multiple filter conditions.
filter(x -> x.Carat > 1 && x.Cut == "Premium" && x.Color == "J" && 5000 <= x.Price <= 6000, df)
# Same as above, but with do-block
filter(df) do x
    x.Carat > 1 &&
    x.Cut == "Premium" &&
    x.Color == "J" &&
    5000 <= x.Price <= 6000
end
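The do-block isn't special to `filter`. It works with any function whose first argument is a function, because Julia passes the block in as that first argument. A minimal sketch with a hypothetical helper, `apply_twice` (our own name, not part of any package):

```julia
# Hypothetical helper: takes a function first, then a value
apply_twice(f, x) = f(f(x))

# Passing an anonymous function directly
a = apply_twice(x -> x * 2, 3)

# Same call written as a do-block; the block becomes the first argument
b = apply_twice(3) do x
    x * 2
end
```

Both calls double 3 twice, so `a` and `b` are both 12.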
You can now do several essential DataFrame tasks:
- Get a single column
- Choose a subset of columns
- Choose a subset of rows
Next we'll use `groupby` and `combine` to apply functions across groups of data.
🤔 How does Price relate to Cut?
We are big on learning by example, so let's start by answering this relatively simple question. First things first: What do the Price and Cut variables look like?
julia> df.Price
53940-element Vector{Int32}:
326
326
⋮
2757
2757
julia> df.Cut
53940-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
"Ideal"
"Premium"
⋮
"Premium"
"Ideal"
- Price: The cost in US Dollars.
- Cut: The rating of cut quality. In order (best-to-worst): "Ideal", "Premium", "Very Good", "Good", and "Fair". Side note: The data is stored in a CategoricalArray, which uses less memory than storing each element as a separate `String`.
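To get a feel for the memory side note, here is a rough sketch. It assumes CategoricalArrays has been added to your environment (`add CategoricalArrays`) and uses hypothetical repeated labels rather than the real Cut column. `Base.summarysize` gives an approximate size in bytes, so treat the comparison as indicative only.

```julia
using CategoricalArrays

# Hypothetical repetitive labels, similar in spirit to the Cut column
labels = repeat(["Ideal", "Premium", "Very Good"], 10_000)

# compress = true picks the smallest integer type that can index the levels
cuts = categorical(labels; compress = true)

n_levels      = length(levels(cuts))      # distinct labels, each stored once
plain_bytes   = Base.summarysize(labels)  # approximate size of the plain vector
compact_bytes = Base.summarysize(cuts)    # approximate size of the categorical version
```

Each row of `cuts` holds a small integer code pointing at one of the three levels, which is where the savings come from.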
Using groupby
We can use the groupby function to group our data by the "Cut" variable.
gdf = groupby(df, :Cut)
Using our grouped DataFrame, we can then apply a function to a variable in each group using `combine`. Let's get the average Price for each level of Cut:
julia> using Statistics # for `mean`
julia> combine(gdf, :Price => mean)
5×2 DataFrame
Row │ Cut Price_mean
│ Cat… Float64
─────┼───────────────────────
1 │ Fair 4358.76
2 │ Good 3928.86
3 │ Very Good 3981.76
4 │ Premium 4584.26
5 │ Ideal 3457.54
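`combine` also accepts several `column => function` pairs at once, and a second `=>` lets you name the output column. A small sketch on a hypothetical five-row `toy` table (made-up prices); the output column names are our own choice:

```julia
using DataFrames, Statistics

# Hypothetical five-row table with two levels of Cut
toy = DataFrame(
    Cut   = ["Ideal", "Ideal", "Premium", "Premium", "Premium"],
    Price = [300, 500, 1000, 2000, 3000],
)

price_stats = combine(
    groupby(toy, :Cut),
    nrow => :Count,                    # number of rows in each group
    :Price => mean   => :MeanPrice,    # source column => function => new name
    :Price => median => :MedianPrice,
)
```

Each group becomes one row of `price_stats`, with one column per requested summary.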
Now we know the center of the distribution for each Cut, but what about the spread and shape?
📊 Using StatsPlots
The StatsPlots package adds functionality and plot recipes to Plots.jl. We'll use it to do the grouping for us so that we don't need `groupby`. First, add StatsPlots:
(@v1.6) pkg> add StatsPlots
Next, use the `@df <dataframe> <plot command>` syntax to create a violin plot overlaid with a box plot for each level of Cut.
julia> using StatsPlots
julia> @df df violin(string.(:Cut), :Price, lab="")
julia> @df df boxplot!(string.(:Cut), :Price, alpha=.4, lab="")
Things to note in the code/plot above:
- The `@df` macro will replace `Symbol`s with the associated DataFrame columns.
- We must use `string.(:Cut)` because Plots/StatsPlots doesn't know how to work with `CategoricalArray`s directly.
- We use `boxplot!` (instead of `boxplot`) to add a new series to the existing plot.
- We set `lab` (shorthand for `label`) to `""` to avoid adding an entry to the plot legend. If all legend entries are blank, the legend will not appear.
- We use `alpha=.4` to set the opacity of the boxplot so that it doesn't cover up the violin in the layer beneath it.
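To make the first note concrete, here is a rough sketch of the same kind of plot written with and without `@df`. It uses a hypothetical 20-row `toy` table (made-up prices) whose Cut column holds plain strings, so `string.` isn't needed here.

```julia
using DataFrames, StatsPlots

# Hypothetical table: 10 rows per level of Cut, plain String labels
toy = DataFrame(
    Cut   = repeat(["Ideal", "Premium"], inner = 10),
    Price = vcat(300:100:1200, 1000:200:2800),
)

# With the macro: Symbols stand in for columns of `toy`
p1 = @df toy violin(:Cut, :Price, lab = "")

# Without the macro: pass the column vectors yourself
p2 = violin(toy.Cut, toy.Price, lab = "")
```

The two calls should draw the same violins. `@df` mainly saves you from repeating the DataFrame name for every column.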
From our plot, we can see the distributions are all similarly skewed with a long right tail. Some Cuts (Good, Premium, and Very Good) are bimodal (they have two "peaks"). However, we are ignoring some important factors (such as how Carat and Color affect the price!), so we shouldn't draw any conclusions based solely on this plot.
🚀 That's It!
You now know how to do a little bit of data wrangling with DataFrames. What do you want to learn about next?
Enjoying Julia For Data Science? Please share us with a friend and follow us on Twitter at @JuliaForDataSci.