Reading CSV files with Clojure

The face-melting speeds of TMD

CSV

Clojure

Published

2024-01-09

As a Data Scientist, one of the first things I do when trying a new language is read in a CSV, so that I have some data to kick around. When trying this with clojure.data.csv, Clojure’s contrib library for CSV files, I noticed that it was really quite slow. To demonstrate this, I’ll use the flights table from the nycflights13 dataset, which is available from here.

And here’s the code I wrote to read the file:

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(defn csv->df [file-path]
  (with-open [reader (io/reader file-path)]
    (let [in-file (csv/read-csv reader)
          names (map keyword (first in-file))
          data (rest in-file)]
      (zipmap names (apply mapv vector data)))))

After loading the necessary libraries, I define a function for reading CSVs, csv->df. I define a reader, which I pass to read-csv to load the data. After that, I parse the first row as column names and the rest of the rows as data. The data is placed into vectors — one per column using (apply mapv vector data) — each of which is associated with a column name in a map using zipmap. And that’s it. Nothing too fancy or complicated. But the performance isn’t great.
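To make the column-building step concrete, here is a minimal sketch that runs the same transformation on a few hand-written rows instead of a file. The `rows` literal is made up for illustration and stands in for what read-csv returns (a sequence of vectors of strings); only clojure.core is needed.

```clojure
;; Toy rows standing in for the output of csv/read-csv:
;; a header row followed by data rows, every cell a string.
(def rows [["year" "month" "carrier"]
           ["2013" "1" "UA"]
           ["2013" "1" "AA"]])

(def names (map keyword (first rows))) ; header row -> column keywords
(def data (rest rows))                 ; remaining rows -> data

;; (apply mapv vector data) transposes rows into columns:
;; it calls (mapv vector row-1 row-2 ...), pairing up the nth cell of every row.
(def columns (apply mapv vector data))
;; => [["2013" "2013"] ["1" "1"] ["UA" "AA"]]

;; zipmap then attaches each column vector to its name.
(def df (zipmap names columns))
;; => {:year ["2013" "2013"], :month ["1" "1"], :carrier ["UA" "AA"]}
```

Note that every value is still a string: this approach does no type parsing, so numeric columns would need to be converted separately.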

(time (csv->df "data/flights.csv"))
;; "Elapsed time: 8267.756866 msecs"

So, around 8 seconds to read a file with 336,776 rows and 19 columns. Thankfully, the folks at TechAscent have developed a high-performance data library that reduces run time substantially. tech.ml.dataset’s ->dataset function can read CSVs into data structures that are essentially augmented maps of vectors — augmented because they also include column typing, for example — which is what I was aiming for with my initial code. Here’s the TMD way of reading a CSV:
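As a rough illustration of the "augmented map of vectors" idea, the sketch below builds a tiny dataset from an in-memory map rather than a file and pokes at it with ordinary map functions. The `toy` data is made up, and the sketch assumes tech.ml.dataset is on the classpath and that column metadata carries a :datatype entry, as in the library's walkthrough; treat it as a sketch rather than a reference for the API.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical toy data: a map of column name -> vector of values.
(def toy (ds/->dataset {:year    [2013 2013 2013]
                        :carrier ["UA" "AA" "B6"]}))

(ds/row-count toy)    ; number of rows
(ds/column-count toy) ; number of columns
(ds/column-names toy) ; column names

;; A dataset can be treated like a map from column name to column...
(def year-col (get toy :year))

;; ...and each column carries metadata, which is the "augmented" part.
;; Assumption: the metadata map includes the column's :datatype.
(:datatype (meta year-col))
```

Unlike the hand-rolled map of string vectors, the numeric column here is stored with a numeric type, so it can be used in calculations without a separate parsing step.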

(require '[tech.v3.dataset :as ds])

(time (ds/->dataset "data/flights.csv")) 
;; "Elapsed time: 2289.541069 msecs"

That cuts the run time by roughly 72%, from around 8 seconds to just over 2. Very cool!
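For anyone who wants to check the arithmetic, the comparison can be worked out directly from the two elapsed times reported above. This is only a sketch: each figure comes from a single `time` call, so it includes JVM warm-up and will vary between runs and machines.

```clojure
;; Elapsed times (in milliseconds) copied from the two runs above.
(def baseline-ms 8267.756866) ; clojure.data.csv version
(def tmd-ms 2289.541069)      ; tech.ml.dataset version

;; Percentage of the original run time that was saved.
(def reduction-pct (* 100.0 (- 1.0 (/ tmd-ms baseline-ms))))

;; How many times faster the TMD version ran.
(def speedup (/ baseline-ms tmd-ms))

(println (format "reduction: %.1f%%, speedup: %.1fx" reduction-pct speedup))
```

With these numbers the reduction comes out at a little over 72%, or a speedup of about 3.6 times.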