Reading CSV files with Clojure
The face-melting speeds of TMD
As a Data Scientist, one of the first things I do when trying a new language is read in a CSV, so that I have some data to kick around. When trying this with Clojure’s clojure.data.csv contrib library, I noticed that it was really quite slow. To demonstrate this, I’ll use the nycflights13 dataset which is available from here.
And here’s the code I wrote to read the file:
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(defn csv->df [file-path]
  (with-open [reader (io/reader file-path)]
    (let [in-file (csv/read-csv reader)
          names (map keyword (first in-file))
          data (rest in-file)]
      (zipmap names (apply mapv vector data)))))
After loading the necessary libraries, I define a function for reading CSVs, csv->df. I define a reader, which I pass to read-csv to load the data. After that, I parse the first row as column names and the rest of the rows as data. The data is placed into vectors, one per column, using (apply mapv vector data), each of which is associated with a column name in a map using zipmap. And that’s it. Nothing too fancy or complicated. But the performance isn’t great.
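To make the column-building step concrete, here is a minimal sketch of what (apply mapv vector data) and zipmap do, using only clojure.core. The rows below are a made-up stand-in for what read-csv returns (a sequence of rows, each a vector of strings); they are not taken from the flights file.

```clojure
;; Hypothetical toy rows, shaped like the output of csv/read-csv:
;; a header row followed by data rows, every value a string.
(def rows
  [["year" "carrier" "distance"]
   ["2013" "UA" "1400"]
   ["2013" "AA" "1089"]
   ["2013" "B6" "1576"]])

(def names (map keyword (first rows)))
(def data (rest rows))

;; mapv over several collections takes one element from each at a time,
;; so applying it to the rows transposes them into columns.
(def columns (apply mapv vector data))
;; => [["2013" "2013" "2013"] ["UA" "AA" "B6"] ["1400" "1089" "1576"]]

;; zipmap pairs each column name with its column vector.
(def df (zipmap names columns))
;; => {:year ["2013" "2013" "2013"],
;;     :carrier ["UA" "AA" "B6"],
;;     :distance ["1400" "1089" "1576"]}

;; Values stay strings, so any typing is a separate, manual step.
(def distances (mapv #(Long/parseLong %) (:distance df)))
;; => [1400 1089 1576]
```

Because mapv and zipmap are eager, the whole file is realized before with-open closes the reader. With the shape of the result clear, here is how csv->df fares on the full flights file: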
(time (csv->df "data/flights.csv"))
;; "Elapsed time: 8267.756866 msecs"
So, around 8 seconds to read a file with 336,776 rows and 19 columns. Thankfully, the folks at TechAscent have developed a high-performance data library that reduces run time substantially. The ->dataset function in tech.ml.dataset (TMD) can read CSVs into data structures that are essentially augmented maps of vectors (augmented because they also include column typing, for example), which is what I was aiming for with my initial code. Here’s the TMD way of reading a CSV:
(require '[tech.v3.dataset :as ds])

(time (ds/->dataset "data/flights.csv"))
;; "Elapsed time: 2289.541069 msecs"
That cuts the run time by roughly 72%, clocking in at a little over 2 seconds. Very cool!
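For a feel of the "augmented map of vectors" idea, here is a small sketch that builds a dataset in memory and pokes at it. It assumes the tech.ml.dataset dependency is on the classpath, and the toy columns are invented for illustration rather than taken from the flights file.

```clojure
;; Assumes the techascent/tech.ml.dataset dependency is available.
(require '[tech.v3.dataset :as ds])

;; Hypothetical toy data: a map of column name to column values.
(def toy
  (ds/->dataset {:carrier ["UA" "AA" "B6"]
                 :distance [1400 1089 1576]}))

(ds/row-count toy)     ;; => 3
(ds/column-count toy)  ;; => 2

;; A dataset can be called like a map to pull out a column by name.
(vec (toy :distance))  ;; => [1400 1089 1576]

;; Each column carries metadata, including the datatype TMD chose for it.
(map (comp :datatype meta) (ds/columns toy))
```

Unlike the plain map built by csv->df, where every value is a string, the columns here are typed, which is the "augmented" part.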