The Independent Life of Data

Something I frequently say, or write, is an assertion that data has a lifecycle and existence independent of the tools we use. I’ve never sat down to explore what that actually means.

Over the years I’ve come to the conclusion that there are technologists who get data, and those who don’t. It’s likely that this is an outcome of experiences as much as an outcome of training, but I do find that it’s more common for me to be dumbfounded or exasperated by conversations where something self-evident about data is… not self-evident.

At the core of it – for me – is the ability to grok that data is separate from and different to code (with the slippery, subtle complication that data sometimes looks like code, and that code itself can be considered data).

Where I want to start is the temporal nature of data. I’ve observed that a lot of computing, and hence a lot of code, is quite transactional: “how do I deal with this current request to formulate a response”, where every transaction is dealt with atomically, and with a peculiarly zen attitude toward time.

As an aside, in the more pathological thinking about components in our computing ecosystem, there’s very little consideration of the behaviour of components as they start up, as they crash or are shut down, or of how their behaviour may evolve over time. This always puzzles me. As witches, biologists, and chemists know, all the interesting things happen at the boundaries. Folks, the world has four dimensions that we can directly perceive, not three.

The first observation then is this: generally the data exists whether the code that is processing it is running or not. It may exist before the code starts, and probably exists after it stops.

Data doesn’t even have to exist in one place for its lifespan. Let’s use a simple, small set of data for our consideration: the collected works of William Shakespeare. Doing some quick online searches, and eschewing ChatGPT, I find that his plays average about 22.6 thousand words, and that estimates of the average letter count of an English word sit around 5 – which gives us a reasonable guess that a plain ASCII representation of a play is somewhere around 120KB. Roughly 40 plays gives us about 4.8MB for all the plays, so let’s call it 6MB of ASCII text for all of Shakespeare. I could comfortably store that on my watch, and it’s smaller than a high quality image of Shakespeare might be. I absolutely could pop it on an SD card and carry it in my wallet.
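As a quick sanity check, here’s that back-of-envelope arithmetic as a Python sketch – every input is a loose estimate rather than a measurement, but it lands in the same ballpark:

    # Rough sizing of Shakespeare's plays as plain ASCII.
    # All inputs are loose estimates, not measurements.
    words_per_play = 22_600   # estimated average words per play
    bytes_per_word = 6        # ~5 letters plus a space, one byte each in ASCII
    plays = 40                # roughly 40 plays

    play_bytes = words_per_play * bytes_per_word
    total_bytes = play_bytes * plays

    print(f"one play:  ~{play_bytes / 1000:.0f} kB")   # ~136 kB
    print(f"all plays: ~{total_bytes / 1e6:.1f} MB")   # ~5.4 MB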

So. If I download a copy of all that ASCII text from Project Gutenberg, is the data that I have the same as the data that they have? If I move it from my laptop to an SD card, is it the same data? If I recode it from ASCII to UTF-8, is it the same data? How about if I wrap it in a PDF or PostScript? If I print it? For each of these, the answer is arguably “yes”, although I tentatively come down on the idea that a radical transcription of the data into a new encoding format might constitute creation of new data. It’s very context specific.
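The ASCII-to-UTF-8 case is the easy end of that spectrum: ASCII is a strict subset of UTF-8, so the bytes don’t change at all, while something like UTF-16 produces a genuinely different byte stream for the same text. A minimal Python sketch:

    text = "Shall I compare thee to a summer's day?"

    ascii_bytes = text.encode("ascii")
    utf8_bytes = text.encode("utf-8")
    utf16_bytes = text.encode("utf-16-le")

    # ASCII is a strict subset of UTF-8: the bytes are identical.
    assert ascii_bytes == utf8_bytes

    # UTF-16 re-encodes every character: a genuinely different byte stream.
    assert ascii_bytes != utf16_bytes
    print(len(ascii_bytes), len(utf16_bytes))   # 39 and 78 bytes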

Restricting myself to talking just about that 6MB of ASCII text, I hope you would agree that it’s the same data no matter where I store it, or how I move it between storage mediums.

What I’d like to posit is that the immutability of the data rests somewhere down at the lowest atomic level of representation: the ones and zeros that can then be interpreted according to some rule.
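A tiny sketch of what “interpreted according to some rule” means in practice – the same four bytes, three different readings depending entirely on the rule applied:

    import struct

    raw = b"data"   # the same four bytes: 0x64 0x61 0x74 0x61

    print(raw.decode("ascii"))            # 'data'
    print(struct.unpack("<I", raw)[0])    # 1635017060, as a little-endian uint32
    print(struct.unpack(">I", raw)[0])    # 1684108385, as a big-endian uint32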

Oh, hello, we’ve just introduced something independent of the data as well.

For data to be useful, and to be distinguishable from random noise, it looks like a law of the universe (hello, Claude) that there has to be some information that accompanies the data.

Of course, there’s a universe of data that isn’t useful.

 head /dev/random | od

will give you a nice chunk of data that contains no useful information. (Yes, I know, it can be used in places where you actually do need random noise.)

Let’s bring it down to a real world example. Say I sent you this:

ID                Product Code  SubID
5207953207820168  0828          830
5207956528270120  0927          516
5207953478653603  1130          350

I hope that you will agree that it contains considerably different information than this:

Credit Card Number  Expiry Month  CVC
5207953207820168    0828          830
5207956528270120    0927          516
5207953478653603    1130          350

even though the CSV encoding of the two is identical:

5207953207820168,0828,830
5207956528270120,0927,516
5207953478653603,1130,350
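If you want to see that schema-dependence in code, here’s a minimal Python sketch using the two sets of column names from the tables above – the same bytes, two entirely different readings, and neither schema travels inside the payload:

    import csv
    import io

    # The identical bytes from the CSV above.
    payload = (
        "5207953207820168,0828,830\n"
        "5207956528270120,0927,516\n"
        "5207953478653603,1130,350\n"
    )

    # Two possible schemas; note that neither lives inside the payload.
    inventory_schema = ["ID", "Product Code", "SubID"]
    payment_schema = ["Credit Card Number", "Expiry Month", "CVC"]

    for schema in (inventory_schema, payment_schema):
        first_row = next(csv.DictReader(io.StringIO(payload), fieldnames=schema))
        print(first_row)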

Those UTF-8 bytes that have gone from my server to your browser to paint on your screen for your eyeballs are the same data everywhere, even though they’ve gone through all sorts of encodings and re-encodings and transliterations between my server and your eyeballs.

Crucially, for that transport to have been possible at all, there’s a huge slab of accompanying convention and consensus on how to map the ones and zeroes onto the screen, none of which is contained in the data.

So, we have what looks like another rule of the universe: the information needed to extract the information from the data is not part of the data. Oops. Looks like data does not inherently contain information.

This is something that long lived organisations like JPL or NASA with a lot of data struggle with. Having tapes full of magnetically encoded ones and zeroes is no use if you can’t map the ones and zeroes back to a representation that provides useful information.

I know that information theory, and Shannon, would probably argue that the information on the tapes is present whether they can be read and printed as photos or not. That seems a little too abstract for day-to-day work with data, though. For now, I want to keep the focus on the day-to-day work of coders and data professionals.

Where did I get to? How we store and move data doesn’t change the data. The data is in some crucial way independent of how we encode it. The meaning and usefulness of data is dependent on accompanying information which may itself be encoded as data.

And that’s the independence I’m talking about. The data is the same whether it’s on tape, disk, SD card, paper punch cards or engraved on basalt tablets in the Mojave desert. The data is the same when I copy it from my server to your laptop. As long as we have a shared understanding of the encoding and the semantics of the data, it makes no difference if I create the data using Go and you derive insights from it using Python in a notebook.
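As a sketch of that last point – the producer and consumer below both happen to be Python, but a Go program writing the same convention (here, JSON Lines in UTF-8, with illustrative word counts) would be indistinguishable to the reader:

    import json

    # Producer: write records under an agreed convention (JSON Lines, UTF-8).
    # The word counts are illustrative. A Go program emitting the same
    # convention would produce data indistinguishable to the consumer below.
    with open("plays.jsonl", "w", encoding="utf-8") as f:
        f.write(json.dumps({"title": "Hamlet", "words": 30557}) + "\n")
        f.write(json.dumps({"title": "Macbeth", "words": 17121}) + "\n")

    # Consumer: any language sharing the convention reads the same data back.
    with open("plays.jsonl", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            print(record["title"], record["words"])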

Data is independent of the code we use to process it, independent of how it is stored, and its lifespan is not implicitly connected to any computer system.

These opinions have been brought to you by Booster by Tangerine Dream, whose existence and continued output are somehow independent of the individual musicians who have contributed over the decades.
