Ever do some analysis on your computer that bogs down your performance, takes forever, and maybe eventually crashes? Unsure of whether you can do the analysis you need on your own CHOP desktop or laptop, or if you need to perform your analyses on a server? How does one even get a server, or borrow server space? This article will help you understand what’s available for CHOP researchers.
A computer’s ability to perform what you need it to do can be limited by several factors or resource constraints:
Memory, or RAM, is the amount of information a computer can hold at once while actively working on it. Imagine this like your “working memory” – you may have a huge rolodex of phone numbers on your desk, but your brain can only hold a couple of number sequences at a time. That means that if you have to make a bunch of phone calls, you’ll have to keep flipping through your rolodex before each call – you don’t have sufficient memory to load all of those numbers into your working memory and make 100 phone calls in a row.
Disk space refers to the amount of information that can be stored long-term, not in active use – like the rolodex on your desk. Disk space is cheap these days, and you can easily add storage that’s not on your own computer but is on the network somewhere (like a file share you request in Cirrus). This is usually not the rate-limiting factor in performing an analysis. Still, if you are processing big data on a disk that’s not local to the computer doing the processing, you have to deal with transit time as the data comes in from the remote disk. This may not pose much of a problem when you’re dealing with network-attached storage (storage that’s local to CHOP), but it could be sluggish if you’re moving terabyte-sized data back and forth to the cloud (say, AWS, or Amazon Web Services).
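To get a feel for transit time, here is a rough back-of-the-envelope sketch in Python. The link speeds are illustrative assumptions, not measurements of CHOP's network or of any cloud connection, and real transfers are usually slower than the theoretical best case computed here.

```python
def transfer_hours(size_bytes: float, link_bits_per_second: float) -> float:
    """Best-case hours needed to move size_bytes over a link of the given speed."""
    bits = size_bytes * 8
    seconds = bits / link_bits_per_second
    return seconds / 3600


TERABYTE = 10**12  # bytes (decimal terabyte)

# Hypothetical link speeds, chosen only for illustration.
links = {
    "100 Mbit/s": 100 * 10**6,
    "1 Gbit/s": 10**9,
    "10 Gbit/s": 10 * 10**9,
}

for name, speed in links.items():
    hours = transfer_hours(TERABYTE, speed)
    print(f"1 TB over {name}: about {hours:.1f} hours")
```

Under these assumptions, a single terabyte needs a bit over two hours at 1 Gbit/s and most of a day at 100 Mbit/s, which is why where your data lives matters once it gets big.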
Processor speed and number of cores govern how many things a computer can do simultaneously and how quickly. Let’s take the example of the phone calls and rolodex. Let’s say you have a really good working memory, and you know that when you call Aunt Mabel, she’s just going to talk your ear off about the weather, so you can tune out. In theory, your memory would allow you to start another conversation at the same time, because you only need a little bit of your awareness to deal with Mabel. But you only have one phone, and one mouth, and you can’t really make two conversations happen at once. Generally speaking, processor speeds are not going to be a rate-limiter for you on modern computing devices (with the possible exception of specialized computation that takes advantage of specific processor capabilities). However, if you have more cores, you can do more things at once. This is very important in parallel processing, where the different things you’re doing (calling multiple family members, analyzing multiple FASTQ files) can be done independently – the results from one activity won’t affect what you want done in the others.
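As a minimal sketch of what "independent tasks on multiple cores" can look like, the Python below hands each file to its own worker process. The function names are made up for illustration, and the record count assumes the common FASTQ layout of four lines per record. It is one possible approach, not a recommended pipeline.

```python
import os
import sys
from concurrent.futures import ProcessPoolExecutor


def count_records(path: str) -> int:
    """Count FASTQ records in one file, assuming 4 lines per record."""
    with open(path, "r") as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4


def count_all(paths, workers=None):
    """Give each file to a worker process; results come back in input order."""
    workers = workers or os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(count_records, paths)))


if __name__ == "__main__":
    files = sys.argv[1:]
    if files:
        for path, records in count_all(files).items():
            print(f"{path}: {records} records")
    else:
        print(f"This machine reports {os.cpu_count()} cores; "
              "pass file names to count their records.")
```

Because no file's count depends on any other file, the work splits cleanly across cores. A task where each step needs the previous step's result would not speed up this way.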
Can I do this on my computer?
When you ask “can I do __ on my computer?”, you have to know a little about the program you’re using and how it works. This can be tricky to learn at first, but if you consistently do the same types of things you’ll figure it out. If you don’t know how to judge ahead of time whether code will run on your computer, you can always try it (after you save everything and close out of as many operations as you can). If your analysis slows down your computer’s performance, takes forever, or causes the system to lock up, you may well have a resource constraint.
In my case, let’s say I want to use R to analyze some very large datasets. These datasets aren’t complex, they’re .csvs of 30 columns, but I have millions of rows to analyze. I know that R brings an entire object into RAM, so if my files together are 4 gigs, and my laptop’s RAM is 4 gigs, I’m going to have a problem – I’m maxing out my memory. What are some solutions?
Restructure your analysis. Instead of bringing everything into R at once, maybe do something like bring a single .csv that’s hundreds of megs into R as a data frame or data table, process it, and then remove that object from R, which will release the RAM it was taking up. Do that for each .csv.
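The same one-file-at-a-time idea can be sketched in Python rather than R. This version goes a step further and reads each file row by row, keeping only a running total in memory. The function names, file pattern, and column name are hypothetical.

```python
import csv
import glob


def column_sum(path: str, column: str) -> float:
    """Stream one CSV row by row, so only a single row is held at a time."""
    total = 0.0
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            total += float(row[column])
    return total


def sum_across_files(pattern: str, column: str) -> float:
    """Handle matching files one after another, keeping only the running total."""
    grand_total = 0.0
    for path in sorted(glob.glob(pattern)):
        grand_total += column_sum(path, column)
    return grand_total
```

A call such as sum_across_files("data/*.csv", "value") would then never need more memory than one row plus the total, no matter how many files match. Analyses that need every row at once (a sort, say) will not fit this pattern so neatly.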
Change your processing environment. In this example, R is kind of the culprit, because its weak point is memory. If you’re in a hurry and don’t have access to a higher-memory server, maybe switch to Python, which can read a file row by row instead of loading it whole, so it won’t hit RAM as hard as R does (unless you load everything into memory at once, for example into numpy arrays).
Diagnose your problem. Are you sure it’s memory that’s causing things to run super slow? Or are you writing inefficient code that’s maxing out your processor? Especially if you’re doing the same thing thousands or millions of times, your algorithm (how you do a task computationally) really matters. For example, in R, for loops should be avoided in favor of vectorized operations, where possible, because fewer instructions are executed and the code will be much faster. Learn how to check out your system resource monitor / task manager and see what’s maxing out.
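Alongside the system monitor, you can measure from inside your own code. The Python sketch below (an analogue to the R discussion, not R itself) times two ways of computing the same sum and records their peak memory with the standard tracemalloc module. The helper names are made up, and tracemalloc only sees memory allocated by Python, so treat the numbers as a rough guide.

```python
import time
import tracemalloc


def measure(func, *args):
    """Run func(*args) and return (result, seconds, peak_bytes) for that call."""
    tracemalloc.start()
    started = time.perf_counter()
    result = func(*args)
    seconds = time.perf_counter() - started
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, seconds, peak


def total_with_list(n):
    """Builds the whole list of squares in memory before summing."""
    return sum([i * i for i in range(n)])


def total_streaming(n):
    """Sums the squares one at a time without storing them."""
    return sum(i * i for i in range(n))


for func in (total_with_list, total_streaming):
    result, seconds, peak = measure(func, 200_000)
    print(f"{func.__name__}: {seconds:.3f} s, peak {peak / 1e6:.1f} MB")
```

Both functions return the same answer, but the first holds every intermediate value at once. Seeing that difference in numbers tells you whether to chase memory or speed.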
If you’re pretty sure your code is structured logically and you’re not overlooking major inefficiencies, and you’re just stuck with analysis that is too big for your computer, what can you do?
High Performance Computing
High Performance Computing (HPC) is a diffuse term that refers to computing solutions that far surpass your typical desktop computer environment. An HPC solution will generally offer lots of RAM (128+ gigs), lots of cores (multiple cores per computer, with multiple computers connected together in a cluster), and lots of disk space. These are expensive to build and maintain, and you probably don’t need a dedicated solution (like your own set of pricey servers). CHOP has a shared HPC cluster called Respublica, and you can read more about it on the Respublica Wiki.
HPC pros and cons:
- The HPC (aka Respublica) requires you to become comfortable with the command line interface in Linux – there’s no point-and-click graphical interface.
- The HPC comes at no cost to your lab (unless you use cloud resources in your job, which you won’t do by accident).
- Respublica is local to CHOP, so network latency (slowness) shouldn’t be a problem (as opposed, say, to moving huge FASTA or FASTQ files to Amazon’s cloud, AWS).
- HPC jobs are scheduled – if you ask for tons of resources, it will take a long time to make a reservation (think about how hard it is to book a table for 8 versus a table for 2 at a popular restaurant).
- The HPC is backed by CHOP’s Research IS team, so you don’t have to worry about OS updates and other things you might have to think about if you had your own server(s) to maintain.
- CHOP’s HPC can’t automatically fix problems with your software logic. For example, if you could in theory parallelize your process (say, simultaneously analyze multiple fMRI files), but you don’t build in the logic to do this and instead just do one after the other, you aren’t utilizing the HPC the way you could.
Learning to use Respublica effectively takes practice! There are a few things to keep in mind:
- Jobs are created and submitted using the Univa Grid Engine.
- Your requests affect what’s available for other researchers. If you’ve ever experienced the rage of needing a place to drink a cup of coffee and seeing a single person push two tables together to get a huge workspace they’re not even using, you’ll understand why this is important. Request only what you need!
- Knowing what you need can also be a challenge, but you can try submitting jobs with small memory requests and see what happens. If they max out on memory and die or get bogged down, you can see that and then try again with a larger request.
- DBHi’s Scientific Computing Team exists to help researchers understand how to maximize their use of resources like CHOP’s HPC, so don’t be afraid to ask for a consult!
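The "start small, then ask for more" approach can be sketched as a ladder of doubling memory requests. The Python below only builds and prints the commands; it does not submit anything. It assumes the cluster accepts memory requests in the form -l h_vmem=<N>G, which is a common Grid Engine setup but is configured per site, so check the Respublica Wiki before relying on it. The script name my_analysis.sh is hypothetical.

```python
def qsub_command(script, memory_gb, job_name=None):
    """Build a qsub argument list with a memory request (not executed here).

    ASSUMPTION: the cluster takes memory requests as -l h_vmem=<N>G.
    The resource name is set by each site, so confirm it locally.
    """
    command = ["qsub", "-l", f"h_vmem={memory_gb}G"]
    if job_name:
        command += ["-N", job_name]
    command.append(script)
    return command


def memory_ladder(start_gb=1, limit_gb=16):
    """Yield doubling memory requests: start small, grow only after a failure."""
    request = start_gb
    while request <= limit_gb:
        yield request
        request *= 2


# Print the commands you would try in order, stopping at the first that works.
for gigabytes in memory_ladder():
    print(" ".join(qsub_command("my_analysis.sh", gigabytes)))
```

The upper limit keeps the ladder from growing into a request that blocks other researchers. If you reach it, that is a good moment to ask for a consult.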
Very Quick How-To
You’ll eventually be better off checking out resources like CHOP’s Respublica Wiki or another guide put out by the Max Delbrück Center for Molecular Medicine (MDC). Note that MDC uses a Unix, not Linux, cluster, but the grid engine works in the same way. Still, sometimes just getting started is the hardest thing!
Step 1: Sign up for the HPC
See the directions provided by Research IS.
Step 2: SSH to the HPC
If you’re on a Mac, just open the terminal and type “ssh respublica.research.chop.edu”. You’ll need to type your (CHOP) password, this time and every time you log in.
If you’re on a PC, you’ll want to download a “terminal emulator” (I like PuTTY). Follow the directions for your software. You want to establish a connection using “ssh” as the protocol, “respublica.research.chop.edu” as the host, and your username and password.
The very first time you ssh into respublica, you’ll get a message asking if you really want to connect. Type “yes”. Note that you won’t see asterisks appear or anything like that when you type in your password!
Step 3: Look around / Linux warmup
Now you’re “in the HPC”. The login node (the computer you land in when you ssh in) is not intended for computation – it will take your commands and ship jobs over to other computers for execution. You can execute commands here, but you won’t want to run analysis jobs here the way you might run a bash script or Python script on your local computer. If you’re new to Linux, I encourage you to check out the many “Linux Cheat Sheets” that are out there (just Google it!) and choose one you like. Print it and keep it near your computer until common commands are second nature.
If you’re very used to Linux you might want to skip this part.
Where am I? The pwd command, short for “print working directory”, tells you where you are. When you ssh into the login node, you should be in your home directory. At the prompt, type “pwd” and hit enter. The output should be the full path to your home directory.
What files are here? The ls command allows you to LiSt files in your current location. Type “ls” and hit enter. You probably won’t see any files show up yet!
How can I make a new file? You can use several different kinds of text editors. I like nano, but other folks swear by emacs. For now, type “nano myfile.txt” and type a few lines. Then use Ctrl + X to exit. You’ll be prompted to save the file, which you want (so, type “Y”). You can save it using the file name suggested (the one you typed in the command itself), just by hitting enter. Try “ls” again to see your file!
How can I delete a file? The rm command acts to ReMove a file. Type “rm myfile.txt”, or just start typing “rm” and the first few characters of the file name and hit tab. Autocompletion when there’s no ambiguity is a nice feature!
Step 4: Submit a job!
Let’s take a look at an example job. This is a “bash script”, which is generally how you’ll want to submit jobs. Type the following:
You should see a file that you can scroll through using your up and down arrows. You’ll see a lot of lines that start with the hash mark (#), which are comments. Then you’ll see some “if” statements (that end in “fi”) to pull in any parameters (when you run the script with add-on parameters). The script then issues the “sleep” command. It also prints a couple of lines of output. Pretty simple, very low demand.
Exit out of the nano session with Ctrl + X (don’t save any accidental changes).
The most basic way to submit the job (remember that it will sleep for 60 seconds, so it’ll take just over 60 seconds to fully execute) is to type:
This means that qsub will use its default parameters for resources. Your job isn’t asking for much, so it should get scheduled immediately and begin. You can see where your job is by typing “qstat”. Give it a try!
Once the job finishes (qstat gives an empty answer), list the files in your home directory again, using “ls”. You should have two new files, each of which includes the name and number of your job. Take a look at them using nano. The one that begins Sleeper.e should be empty (no errors), and the one that begins Sleeper.o should have any output from the job.
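Once you have many jobs, opening each file in nano gets tedious. As a small sketch, the Python below lists a job's output and error files with their sizes, so an empty error file is easy to spot. It assumes the files sit in your home directory and start with the job name followed by .o or .e, as described above. The function name is made up.

```python
import glob
import os


def job_output_report(directory, job_name="Sleeper"):
    """Return {file name: size in bytes} for <job_name>.o* and <job_name>.e* files."""
    report = {}
    for kind in ("o", "e"):
        pattern = os.path.join(directory, f"{job_name}.{kind}*")
        for path in sorted(glob.glob(pattern)):
            report[os.path.basename(path)] = os.path.getsize(path)
    return report


if __name__ == "__main__":
    home = os.path.expanduser("~")
    for name, size in job_output_report(home).items():
        kind = "error" if name.startswith("Sleeper.e") else "output"
        status = "empty" if size == 0 else f"{size} bytes"
        print(f"{name} ({kind}): {status}")
```

An error file larger than zero bytes is your cue to open it and read what went wrong.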
Leave your ssh session by typing “exit”.