Building a CSV Streaming Library in Haskell

April 25, 2022

Hey Internet.

I've been trying to reduce my reliance on proprietary software, but I try to be pragmatic about it. Sometimes convenience wins out, as was the case when I decided to use the Samsung Health app to track my blood glucose levels. I've recently decided that I want to use something that doesn't hand all my health data over to a large corporation with dubious motives, but I have several years of historical data, and I don't want to lose it, even if the older data isn't really relevant any more.

I was able to find an option to export my data from the app into a series of CSV files, but as you might expect, with years' worth of data, these files are a little on the large side. They also contain a lot of information that I'm not particularly interested in keeping. I decided I wanted to make a utility to process this data and marshal it into a format that was more useful to me, and I figured I'd use Haskell for the task... because I like Haskell. Surely, I could find a library that handled CSV input in the official repositories, and yes, I could... but there was a catch.

All of the libraries I found for dealing with CSV seemed to involve loading the entire file into memory before you could start working with the data. Most of the time this is fine, but the files I'm working with are quite large. Rather than hunting for a utility that would allow me to work with the data row by row, I decided instead to build one. Something like this might have existed already, but I was looking for an opportunity to familiarize myself with the conduit library anyway. The end result was the csv-sip library. It was Katy who came up with the name. When I was explaining to her the problem I was trying to solve, I said that most libraries "slurp the whole file into memory", and she said: "so you want to sip it instead?" I decided that I liked the concept.

The conduit library allows me to read from the file one row at a time, perform some action, and then load the next row, discarding the previous. For those who don't know, the conduit library defines the ConduitT type, which is used for streaming data. Conduits can essentially be broken up into three general types:

- Producers, which generate a stream of values from some source, such as a file.
- Consumers, which receive a stream of values and do something with them, such as writing them to a file.
- Transformers, which take a stream of one type and turn it into a stream of another.

Conduits can be piped together using the .| operator to produce a more complex conduit. The .| operator works similarly to the way a pipe works in a *NIX shell.
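
Just to make that concrete, here's a tiny self-contained example using nothing but the conduit package itself (no csv-sip yet): a producer, a transformer, and a consumer piped together with .|.

```haskell
import Conduit (runConduit, yieldMany, mapC, sinkList, (.|))

-- A producer (yieldMany), a transformer (mapC), and a consumer (sinkList)
-- chained with (.|), much like a shell pipeline.
doubled :: IO [Int]
doubled = runConduit $ yieldMany [1 .. 5 :: Int] .| mapC (* 2) .| sinkList
-- doubled evaluates to [2,4,6,8,10]
```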

A naïve approach would be to simply read a line of text from the CSV file and pass it to an existing CSV library as a single-row CSV file. That way, I wouldn't have to reinvent the wheel. This, in fact, was my original plan. The problem is that a field in a CSV file can contain a literal CR/LF character combination, so a more nuanced approach is required. This meant that I had to essentially write my own CSV parser from scratch. I also subscribe to the design philosophy that a program should be lenient about the input it will accept, and strict about the output it produces. In keeping with that philosophy, there were some trade-offs that I made in terms of reading CSV files.
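
To see why the naïve approach falls apart, here's a small made-up example: the second record's last field contains a literal CR/LF inside its quotes, so splitting the input on line breaks would cut that record in half.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T

-- Raw file contents; note the CR/LF inside the quoted "notes" field.
rawFile :: T.Text
rawFile = "glucose,notes\r\n\"5.8\",\"taken before\r\nbreakfast\"\r\n"

-- What a correct parser should produce: two rows, the second with an
-- embedded line break in its final field.
expected :: [[T.Text]]
expected =
  [ ["glucose", "notes"]
  , ["5.8", "taken before\r\nbreakfast"]
  ]
```

Anyway, back to those trade-offs.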

I'm given to understand that according to the official CSV spec (RFC 4180), a CSV file should use CR/LF as a row separator; however, it's likely that somewhere along the line I'll encounter the more UNIX-ey LF-only format, or even potentially CR-only. For this reason, I decided to treat any contiguous combination of CR and LF characters as a row break. The consequence of this is that the library will omit blank rows. I don't expect this to be a thing that happens often though, and am considering a different approach for future releases. Files produced by the library will always use the proper CR/LF row terminators. Also, if a cell contains improperly encoded information, such as bad UTF-8 or a quotation mark in the middle of an unquoted field, the library will simply return a blank cell rather than choking on the whole thing.
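
As a rough illustration of that row-break leniency:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T

-- Mixed line endings, including what would otherwise be a blank row
-- between "c,d" and "e,f".
messyInput :: T.Text
messyInput = "a,b\r\nc,d\n\ne,f"

-- Any contiguous run of CR and LF characters counts as a single row
-- break, so the blank row disappears.
expectedRows :: [[T.Text]]
expectedRows = [["a", "b"], ["c", "d"], ["e", "f"]]
```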

One other thing to note is that the library assumes that all files are encoded in UTF-8... because honestly, UTF-8 is really just the way to go in modern times. Nevertheless, every function that would yield a Text value also has a "Raw" equivalent that returns a ByteString instead. This way, if you want to use some other encoding scheme, you're free to do so.
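
So if, say, a file turned out to be encoded in Latin-1, you could take the ByteString fields from one of the Raw functions and decode them yourself. A quick sketch of the idea, using decodeLatin1 from the text package (the helper here isn't part of the library):

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1)

-- Decode every raw field as Latin-1 instead of the library's default
-- UTF-8.  Feed it the [[ByteString]] produced by one of the Raw functions.
decodeTableLatin1 :: [[BS.ByteString]] -> [[T.Text]]
decodeTableLatin1 = map (map decodeLatin1)
```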

Slurping Files

Okay, so let's have a look at the actual library I've written. It's all contained within one module, and the documentation for the public API is here, but let's look at the parts of it individually. First off, I did decide to create "slurp" functions that work with a CSV file as a whole. Sometimes that might actually be the thing you want to do. For reading the whole file at once, I have the slurpCSV and slurpRawCSV functions. Both will read the file at the supplied path and return a list of lists of fields, [[Text]] and [[ByteString]] respectively. A row is made up of a list of fields, and a CSV file is made up of a list of rows, which is the reason for the nested lists.
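
Usage is about as simple as it sounds; something like this (I'm simplifying the types here and omitting the csv-sip import; the real signatures are in the documentation):

```haskell
import qualified Data.Text as T
-- plus slurpCSV from csv-sip (import omitted here)

-- Read an entire CSV file into memory and report how many rows it has,
-- treating slurpCSV as roughly FilePath -> IO [[T.Text]].
countRows :: FilePath -> IO Int
countRows path = do
  rows <- slurpCSV path
  return (length rows)
```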

For convenience, there are also the slurpLabelledCSV and slurpRawLabelledCSV functions, which assume that the first row of the file contains headers describing what the columns are. Instead of returning a list of fields, they return a mapping of labels to values for each row. If a row extends beyond the headers, the additional fields are simply truncated. If the row is too short, the extra keys will simply not be present in the map for that row. I'm considering the scenario where the header row contains duplicate labels to be "undefined behaviour", which is just a lazy way of saying "don't do that". The actual behaviour is whatever is done by Map's fromList function.
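
The truncation behaviour falls out naturally if you picture the labelling as zipping the header row against each data row. Roughly the idea (not the actual implementation):

```haskell
import qualified Data.Map as Map
import qualified Data.Text as T

-- Pair each row's fields with the header labels.  zip truncates to the
-- shorter of the two lists, and Map.fromList decides what happens when a
-- label is duplicated.
labelRows :: [[T.Text]] -> [Map.Map T.Text T.Text]
labelRows []              = []
labelRows (header : body) = map (Map.fromList . zip header) body
```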

Next, we have functions for writing entire CSV files. For this we have writeCSV and writeRawCSV. Both take a file path and a list of rows (in the same format as produced by slurpCSV and slurpRawCSV) and write them to a CSV file at the given path. Pretty straight-forward stuff.
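
For instance, a whole-file round trip looks something like this (same caveats as before: imports omitted, types simplified):

```haskell
import qualified Data.Text as T
-- plus slurpCSV and writeCSV from csv-sip (imports omitted here)

-- Load a whole table, keep only the first two columns of every row, and
-- write the result back out.  Approximate types:
--   slurpCSV :: FilePath -> IO [[T.Text]]
--   writeCSV :: FilePath -> [[T.Text]] -> IO ()
keepFirstTwoColumns :: FilePath -> FilePath -> IO ()
keepFirstTwoColumns inPath outPath = do
  rows <- slurpCSV inPath
  writeCSV outPath (map (take 2) rows)
```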

Streaming Functions

Now we get to the actual streaming part of the library. This is where the "sipping" part comes in. First off, let's look at our producers. These are the functions that actually create a stream of values from some source. Up first we have readFromCSV and readFromCSVRaw. These will take a file path, and produce a stream of rows from that file. There are also two other producers: encodeCSV and encodeRawCSV. Much like writeCSV and writeRawCSV, these take an entire table as input, but instead of writing them to a file, they produce a stream of CSV-encoded ByteStrings which can then be sent over a network connection, for instance. We also have two consumers: writeToCSV and writeToCSVRaw, which will take a stream of rows and write them to a file at a specified path.
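
Put together, the row-by-row processing I was after ends up looking something like this (a sketch: csv-sip imports omitted, and I'm assuming the usual conduit/ResourceT plumbing rather than quoting the real signatures):

```haskell
import Conduit (runConduitRes, mapC, (.|))
-- plus readFromCSV and writeToCSV from csv-sip (imports omitted here)

-- Stream a CSV file row by row, keep only the first three fields of each
-- row, and write the result to a new file, without ever holding the whole
-- table in memory.
trimColumns :: FilePath -> FilePath -> IO ()
trimColumns inPath outPath =
  runConduitRes $
    readFromCSV inPath       -- producer: a stream of rows
      .| mapC (take 3)       -- transformer: drop the extra columns
      .| writeToCSV outPath  -- consumer: write each row to the output file
```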

Lastly we have our transformers. Transformers are arguably where the real power of conduits comes in. A transformer can take a stream of one type and transform it into another. All of the functions listed above are basically convenience functions that were made by chaining the functions below together along with a handful of primitives provided by the conduit library. For encoding, we have encodeRows and encodeRawRows; these take a stream of rows and convert them into a stream of CSV-encoded ByteStrings, roughly one ByteString per field, though the chunking is kind of arbitrary. The functions primarily used for decoding are decodeRows and decodeRawRows, which essentially do the opposite of encodeRows and encodeRawRows, taking a stream of ByteStrings and decoding them into a stream of rows.
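
Because they're just transformers, they compose with conduit's own primitives. Roughly the kind of pipeline I have in mind (same caveats again):

```haskell
import Conduit (runConduitRes, sourceFile, sinkFile, filterC, (.|))
-- plus decodeRows and encodeRows from csv-sip (imports omitted here)

-- Stream bytes from one file, decode them into rows, keep only the rows
-- with at least two fields, and encode the survivors into a second file.
dropShortRows :: FilePath -> FilePath -> IO ()
dropShortRows inPath outPath =
  runConduitRes $
    sourceFile inPath              -- a stream of ByteString chunks
      .| decodeRows                -- transformer: bytes -> rows of Text
      .| filterC ((>= 2) . length) -- transformer: keep rows with 2+ fields
      .| encodeRows                -- transformer: rows -> CSV-encoded bytes
      .| sinkFile outPath          -- consumer: write the bytes to disk
```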

Apart from directly encoding and decoding, there are also some other small transformers. There's a transformer (labelFields) that takes a stream of rows and uses the first row as headers to produce a stream of key/value pairs. You may recognize this functionality from the slurpLabelledCSV function. Also, we have a transformer (toBytes) that converts a stream of ByteStrings into a stream of individual bytes. This is used internally by the decodeRows and decodeRawRows functions, but was deemed useful enough to expose.
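
labelFields drops into a pipeline the same way; for example, printing a single column of a labelled file, row by row (imports omitted, types simplified as usual):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Conduit (runConduitRes, mapM_C, (.|))
import Control.Monad.IO.Class (liftIO)
import qualified Data.Map as Map
-- plus readFromCSV and labelFields from csv-sip (imports omitted here)

-- Stream a labelled CSV file and print the "glucose" column, treating
-- each labelled row as a Map from label to value.
printGlucose :: FilePath -> IO ()
printGlucose path =
  runConduitRes $
    readFromCSV path
      .| labelFields
      .| mapM_C (liftIO . print . Map.lookup "glucose")
```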

I've also just realized that I have a decodeUTF8 function which I have erroneously listed as a transformer, but it's really just a plain old wrapper around the text package's decodeUtf8' function, returning Nothing on failure. I just pushed an update to the development branch so that it gets corrected in the next version.

At any rate, that's the basic rundown of the new library I've released. Hopefully, now I can actually get back to the task of freeing my data, which is why I made the thing in the first place. Perhaps somebody else will find it useful as well.

Have a good one.