Joy Payton
Joy Payton

JSON, XML, CSV … when you’re downloading a data set from an API or a data repository, or have to decide on a “flat file” (not database solution) to store data, what format should you choose? How should you store your data for use by another application or team? This very light intro to data storage might help you understand what’s going on.

CSV

Most likely, you’re accustomed to using .CSV files, which you may open using Microsoft Excel (sometimes resulting in odd results, as Excel tries to guess what your data is and adds formatting without your consent). CSV stands for “comma separated values” and consists of a series of lines of data, each of which has the same number of fields (representing columns), separated by commas. If you open a CSV in a text editor like SublimeText or XCode, you can see this raw formatting, which looks something like this:

"YEAR","WORLDCONSUMPTION","COPPERPRICE","INCOMEINDEX","ALUMPRICE","INVENTORYINDEX","TIME"
1951,3173,26.56,0.7,19.76,0.98,1
1952,3281.1,27.31,0.71,20.78,1.04,2
1953,3135.7,32.95,0.72,22.55,1.05,3
1954,3359.1,33.9,0.7,23.06,0.97,4
1955,3755.1,42.7,0.74,24.93,1.02,5
1956,3875.9,46.11,0.74,26.5,1.04,6

There’s a lot of good things to say about CSV files:

  • They’re easy to get a quick view of using Excel
  • They are easy to create in Excel
  • They don’t suffer from the formatting issues that crop up in Excel – they are text only without fancy formatting
  • They are frequently used to store and share data
  • R, Python, SQL etc. have easy-to-use functions to load this kind of data
  • They are industry standard in many ways

However, a CSV format may not be right for your data. If your data fits pretty well into a row-and-column format without too many duplicates or empty cells, .CSV is probably a great choice. If, however, you have observations (rows) that differ widely in what they record, CSV may not be a good fit. Another way to say this is that if your data is 1:1, CSV is good, and if not, CSV may not be the best choice, because it’s so strictly tabular.

Consider two hypothetical datasets. One has employee information, namely employee first and last name, id number, salary, and hire date. Each employee will have one and only one value for each of these fields, except for some rare edge cases, like quitting and being re-hired. There will be very little to no duplication or empty cells. This is a good fit for CSV data:

"ID","FNAME","LNAME","HIRE","SALARY"
234967,"Jane","Doe","2000-04-03",82000
381613,"Jose Maria","Fernandez","1998-11-10",103000
367181,"Mark","Anderson","2010-03-30",45700

But what if we wanted to track the social media and contact information for the same employees? Some employees will have one phone number, others may have three. Do we create three columns of possible data, leaving the third column blank for most everyone? What if some people started having four or five phone numbers, like home, work, personal cell, work cell, skype, etc? Do we restructure the entire dataset? Most of our employees have one or two social media accounts, but we’ll have some Instagram, some SnapChat, some Facebook, some Twitter, some LinkedIn, etc. Again, if we go the CSV route, that means creating a column for each possibility, and leaving lots of blank cells. What if some folks have two Twitter accounts, one they monitor for work and one for personal use? And what if you wanted to nest information, such as separating contact info, which could include phone numbers and fax, and social media, which could include Twitter, Facebook, etc.?

In a case like this, CSV may not be the best bet. CSV is heavily schema-driven, meaning you need to know exactly what kinds of data you’ll have before you start constructing the file. When the kind of available data (fields, columns, values, etc.) can vary significantly from one observation to the next, a schema-heavy data format is not the best fit. That’s where XML and JSON come in.

XML

We’ll treat XML lightly, as it’s largely been replaced by JSON.

XML stands for eXtensible Markup Language, and allows you to wrap your data in beginning and ending tags that demonstrate the kind of value you’re storing. It has some important rules associated, like making sure you close tags and are careful with nesting things in ways that make sense. You also have to state what version of XML you’re using. A sample XML datasest for our three employees’ contact and social media info could be as follows. Indentation is optional, but helps with readability quite a bit! Take a look at the repeated, nested data. If someone doesn’t have Twitter, we don’t have to list that for them. If someone has two phone numbers, we can include both, no problem.

<?xml version="1.0" encoding="UTF-8"?>
<staff>
<employee id="234967">
  <contact>
    <phone>215-555-1212</phone>
    <phone>215-123-1234</phone>
  </contact>
  <social-media>
    <facebook>jane.doe.51</facebook>
  </social-media>
</employee>
<employee id="381613">
  <contact>
    <phone>215-321-9999</phone>
    <fax>267-231-3413</fax>
  </contact>
  <social-media>
    <twitter>@jmfern</twitter>
    <twitter>@wdfans4life</twitter>
  </social-media>
</employee>
<employee id="367181">
  <contact>
    <phone>267-555-1212</phone>
  </contact>
</employee>
</staff>

XML is a bit heavy, though – you have full human-readable opening and closing tags, which add bulk. A better nestable, not-schema-heavy option is JSON.

JSON

JSON stands for “JavaScript Object Notation”, and if you think this means it plays well with javascript, you’re right! JSON is widely used to pass data back and forth in web applications, and has the same strengths as XML (i.e. it’s good for data that can have zero, one, or many values for any given field, can be changed on the fly without restructuring every piece of data to fit a schema, etc.). It’s lighter weight, though, and while XML is reminscent of HTML markup, JSON is reminiscent of array and object creation in programming languages like javascript. Let’s see the same data from our XML file in a JSON format:

{
  "staff": {
    "employee": [
      {
        "id": "234967",
        "contact": {
          "phone": [
            "215-555-1212",
            "215-123-1234"
          ]
        },
        "social-media": { "facebook": "jane.doe.51" }
      },
      {
        "id": "381613",
        "contact": {
          "phone": "215-321-9999",
          "fax": "267-231-3413"
        },
        "social-media": {
          "twitter": [
            "@jmfern",
            "@wdfans4life"
          ]
        }
      },
      {
        "id": "367181",
        "contact": { "phone": "267-555-1212" }
      }
    ]
  }
}

Again, indentation is optional, but helps with readability.

There are many, many sites to help you familiarize yourself with XML, JSON, and CSV formats and even convert between them.