Author

Josh Day

Published

June 3, 2021

Big CSVs

“Big data” can be defined as any dataset requiring you to change how you analyze it because of its size.


Small CSVs: CSV.File

For datasets that fit comfortably in memory, the CSV package offers CSV.File() for high-performance file reading:

  • Automatically determines column types
  • Creates objects loadable into DataFrames
  • Supports numerous formatting options
using CSV, DataFrames, Plots, Downloads

url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv"

file = CSV.File(Downloads.download(url))

df = DataFrame(file)

plot(df.date, df.cases, title="Cumulative COVID-19 Cases in the US",
    lab="", ylab="N", xlab="Date")

Large CSVs: CSV.Rows

For files too large to fit in memory, CSV.Rows provides a memory-efficient iterator that loads one row at a time. By default every value is returned as a String, so you must parse types manually.
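As a minimal sketch of the pattern, using a small in-memory buffer to stand in for a huge file on disk (the `amount` column is a made-up example):

```julia
using CSV

# A tiny in-memory CSV standing in for a file too big to load at once.
data = IOBuffer("amount\n1.5\n2.5\n4.0\n")

# CSV.Rows streams one row at a time; reusebuffer=true recycles the
# internal row buffer so iteration allocates almost nothing per row.
# Each value arrives as a String and must be parsed before use.
total = sum(parse(Float64, row.amount) for row in CSV.Rows(data; reusebuffer=true))
```

Because only one row is materialized at a time, the same loop works unchanged on a multi-gigabyte file path in place of the IOBuffer.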


CSV Combined with OnlineStats

The OnlineStats package provides fast single-pass algorithms that use constant memory, so you can compute statistics on datasets of arbitrary size.
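The core idea, as a minimal sketch: a statistic is updated incrementally with `fit!`, so only the running state is ever held in memory, never the data itself.

```julia
using OnlineStats

# A running mean stores only its current estimate and a count,
# no matter how many observations stream through it.
o = Mean()
fit!(o, 1:1_000_000)   # feed a million values one at a time
value(o)               # ≈ 500000.5
```

The same `fit!`/`value` interface applies to every OnlineStats object, including the `GroupBy` and `Hist` used below.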

using CSV, OnlineStats, Plots

rows = CSV.Rows("/path/to/large_file.csv", reusebuffer=true)

itr = (parse(Int, r.passenger_count) => parse(Float64, r.fare_amount)
    for r in rows)

o = GroupBy(Int, Hist(0:.5:100))

fit!(o, itr)

plot(plot(o[1]), plot(o[2]), plot(o[3]), plot(o[4]),
    layout=(4,1), link=:all, lab=[1 2 3 4]
)

The example above examines NYC taxi data to visualize fare distributions by passenger count — processing arbitrarily large files with constant memory usage.