My File is Over There: File Paths for Data Scientists
Let’s say that you have some .csv files locally on your computer, and you want to load them into R or Python. You’re working in RStudio or a Jupyter notebook, and you’re not sure how to point to the file you want to bring in. This can be considerably painful if you are new to the concept of file paths. If you’re new to writing code, or you’ve encountered problems with this, read on!
Both R and Python use the concept of a “working directory”. A “directory” here is a folder on your computer. The “working directory” is where your software is “working”. Imagine, if you will, a workshop, where R or Python is chugging along hammering out the changes you indicate to your data – cleaning, aggregating, computing summary statistics, graphing. It’s doing that in a location – but where? You can figure out the working directory in R by typing getwd() and in Python by importing the os module and calling os.getcwd(). Knowing where the working directory (keep thinking of a physical workshop) is will help you understand what directions you need to give to reach a file.
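As a minimal sketch of the Python side of this, the snippet below prints the working directory two ways. The variable names are only illustrative, and the pathlib version is an addition that is not mentioned above; it reports the same location as a Path object.

```python
import os
from pathlib import Path

# Ask the operating system where this Python session is "working".
working_directory = os.getcwd()
print(working_directory)

# pathlib reports the same location as a Path object, which is handy
# for building further paths later on.
working_directory_as_path = Path.cwd()
print(working_directory_as_path)
```

Whatever this prints on your machine is the starting point for every relative path you give Python in that session.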
In computing, we can use what are called “relative” paths – paths that start from where “you” are, which can change. For example, the path to my toothbrush at home is “go to the bathroom, open the medicine cabinet, it’s on the right at the bottom.” This relative path works great when I’m in the living room or kitchen at home. Those directions don’t work for me, however, when I change locations, like when I’m at work. The work bathroom doesn’t have a medicine cabinet, and doesn’t have my toothbrush. If you gave me that relative path while I was at work, I’d return an error: “no medicine cabinet found.”
Similarly, let’s say that within your working directory, there’s a directory called “Downloads”, and within that, a file called “phone_nums.csv”. A relative path from your working directory to that file would be “Downloads/phone_nums.csv”. But that relative path wouldn’t be helpful if you were in a totally different directory up a couple of levels, because there, there’s no “Downloads” directory to go to, much less a file inside called “phone_nums.csv”! What seems obvious to you, which is “you know, the main Downloads directory I always use”, isn’t obvious to R or Python. You have to be very explicit about what to do. For relative paths, keep in mind that the start of your path is the working directory. What are the exact steps to get a data elf from the working directory workshop to the file you’re interested in? Do this in pseudocode, if it helps. It might look like this:
- Go back one step to the directory that contains the working directory where you are now.
- Then look for a directory called “Data”, and go into it.
- Once in “Data”, look for a directory called “Smith_Project”, and go into it.
- Once in “Smith_Project”, open the directory called “data_pull_08_25_2018” and go into it.
- Finally, the file you want is called “subject_vitals.csv”.
The relative file path for this would look like this (note the use of the two dots to say “go back a step”): ../Data/Smith_Project/data_pull_08_25_2018/subject_vitals.csv
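The steps listed above can be sketched in Python by handing each step to pathlib in order. This is only an illustration: the working directory /home/analyst/Projects is a made-up example, and PurePosixPath is used so the result prints with forward slashes on any operating system.

```python
import posixpath
from pathlib import PurePosixPath

# Each step from the list, in order; ".." means "go back one step".
steps = ["..", "Data", "Smith_Project", "data_pull_08_25_2018", "subject_vitals.csv"]
relative_path = PurePosixPath(*steps)
print(relative_path)

# Hypothetical working directory, used only to show where the path leads.
working_directory = "/home/analyst/Projects"
destination = posixpath.normpath(posixpath.join(working_directory, str(relative_path)))
print(destination)
```

With that assumed working directory, the two dots step out of Projects and the rest of the path walks down into Data, so the destination starts with /home/analyst/Data.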
On the other hand, if you are running R or Python within the same directory as where your subject vitals file is located, there’s no need to move to any other place – the file is within direct reach and can be referenced with just its name: subject_vitals.csv
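To see how much the working directory matters, here is a small experiment, written as a sketch under some assumptions: it builds a throwaway folder layout in a temporary directory (the folder names workshop and elsewhere are invented for the demonstration), then tries the same relative path, Downloads/phone_nums.csv, from two different working directories.

```python
import os
import tempfile
from pathlib import Path

relative_path = Path("Downloads") / "phone_nums.csv"
starting_directory = Path.cwd()

with tempfile.TemporaryDirectory() as scratch:
    # Two hypothetical directories; only "workshop" contains Downloads.
    workshop = Path(scratch) / "workshop"
    elsewhere = Path(scratch) / "elsewhere"
    (workshop / "Downloads").mkdir(parents=True)
    elsewhere.mkdir()
    (workshop / "Downloads" / "phone_nums.csv").write_text("name,phone\n")

    try:
        os.chdir(workshop)    # the working directory is now "workshop"
        found_from_workshop = relative_path.exists()

        os.chdir(elsewhere)   # same relative path, different starting point
        found_from_elsewhere = relative_path.exists()
    finally:
        os.chdir(starting_directory)  # put the working directory back

print(found_from_workshop)   # True: Downloads is right there
print(found_from_elsewhere)  # False: no Downloads directory to go into
```

The path text never changes; only the starting point does, and that alone decides whether the file is found.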
Alternatively, we can also use “absolute” paths, which are complete and the same no matter what the reference point is. The absolute path to my toothbrush could include my street address, the floor information about where the bathroom is, and so forth. I could use that info to find my way to my toothbrush no matter where I was. It’s universally applicable in a way that “go to the bathroom, open the medicine cabinet…” is not.
The same goes for computing. My computer can find its way to every file and folder by traversing the file system from the root all the way into all of the branches that make up various directories and subdirectories. For example, the absolute path (which, since I’m on a Mac, will start with a forward slash to say “start at the root”) might be something like /Volumes/paytonk/Data/Smith_Project/data_pull_08_25_2018/subject_vitals.csv. On your PC, it might be C:/Users/doej/Data/Smith_Project/data_pull_08_25_2018/subject_vitals.csv. Here, the C:/ indicates starting from the root of the C drive. Note also that in R and Python strings, as in many programming languages, a backslash usually signals the start of an escape sequence (for example, \n means a new line), so when writing Windows paths in code we normally use forward slashes, even though Windows itself reports these paths with backslashes.
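Both points can be checked in Python with a short sketch. The first half uses the two example paths from the text; the second half uses a made-up Windows path, C:\temp\new_data.csv, chosen because \t and \n happen to be escape sequences. The "pure" path classes from pathlib only inspect the text of a path, so this runs on any operating system and the files do not need to exist.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The two absolute paths from the text.
mac_path = PurePosixPath(
    "/Volumes/paytonk/Data/Smith_Project/data_pull_08_25_2018/subject_vitals.csv"
)
pc_path = PureWindowsPath(
    "C:/Users/doej/Data/Smith_Project/data_pull_08_25_2018/subject_vitals.csv"
)
print(mac_path.is_absolute(), mac_path.anchor)  # True /
print(pc_path.is_absolute(), pc_path.drive)     # True C:

# Hypothetical Windows path written three ways.
with_escapes = "C:\temp\new_data.csv"    # \t becomes a tab, \n a newline
raw_string = r"C:\temp\new_data.csv"     # raw string keeps the backslashes
forward_slashes = "C:/temp/new_data.csv"

print("\t" in with_escapes, "\n" in with_escapes)  # True True
print(PureWindowsPath(raw_string) == PureWindowsPath(forward_slashes))  # True
```

The first spelling silently turns into something that is no longer a path at all, which is why forward slashes (or a raw string) are the safer habit in Python.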
The important thing for you to know is that when you tell R or Python about a file, you need to be sure that it can follow the path you give it. If you use a relative path, you need to know what the starting point is (the working directory) so you can give the correct path. If you use a full, absolute path, the starting point no longer matters, which is probably safer – but it can be more typing!
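One way to take the guesswork out of this in Python is to check a path before trying to load anything from it. The helper below is a hypothetical sketch, not part of any library; describe_path is an invented name. It reports whether a path is relative or absolute, what full path it leads to from the current working directory, and whether anything actually exists there.

```python
from pathlib import Path


def describe_path(path_text):
    """Report how Python would interpret a path before you try to load it."""
    path = Path(path_text)
    # A relative path is followed starting from the working directory.
    full_path = path if path.is_absolute() else Path.cwd() / path
    return {
        "given": path_text,
        "is_absolute": path.is_absolute(),
        "full_path": str(full_path),
        "exists": full_path.exists(),
    }


# Example with the file name used earlier; the result depends on your machine.
report = describe_path("subject_vitals.csv")
for key, value in report.items():
    print(f"{key}: {value}")
```

If exists comes back False, the full_path line shows exactly where Python looked, which usually makes the wrong turn easy to spot.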