Data
Data: How to surf, rather than drown!
Introduction
Data on Disk
A Salutary Tale of Copying Files
We'll start by considering data stored on a disk drive. One thing that you might not know is that file systems and disk drives perform best when they are dealing with larger files. But how large is large? Here's a simple example, which you can try yourself:
First of all, let's get ourselves a large-ish file. A compressed tarball of the source code for the Linux kernel will do. In this case it's about 70MB in size. We can time how long it takes to create a copy of it. Below are the results of doing this experiment on BlueCrystal phase 2:
BCp2$ wget https://www.kernel.org/pub/linux/kernel/v3.x/testing/linux-3.10-rc7.tar.xz BCp2$ time cp linux-3.10-rc7.tar.xz linux-3.10-rc7.tar.xz.copy real 0m3.530s user 0m0.000s sys 0m0.068s
Now, that tar ball contains around 47,000 files, many of which are only a few hundred or thousand of bytes in size. These are files at the smaller end of the scale. Let's unpack the tarball and time how long it takes to copy these files, one-by-one:
BCp2$ tar --use-compress-program=xz -xf linux-3.10-rc7.tar.xz BCp2$ time cp -r linux-3.10-rc7 linux-3.10-rc7.copy real 18m17.102s user 0m0.214s sys 0m6.359s
Yikes! that took over 350 times longer than copying the single, large file. (These timings were taken at ~10:45 on the 25 Jun 2013. Any differences from the above values will be due to differences in load on the filesystem, which is shared with all the other users of the cluster. More of that in a moment..)
Now, we can repeat these tests on a different system. I got the values below from my very standard desktop machine:
desktop$ wget https://www.kernel.org/pub/linux/kernel/v3.x/testing/linux-3.10-rc7.tar.xz desktop$ time cp linux-3.10-rc7.tar.xz linux-3.10-rc7.tar.xz.copy real 0m0.192s user 0m0.000s sys 0m0.156s desktop$ tar --use-compress-program=xz -xf linux-3.10-rc7.tar.xz desktop$ time cp -r linux-3.10-rc7 linux-3.10-rc7.copy real 0m25.961s user 0m0.168s sys 0m2.360s
That's a lot quicker! However, copying the small files still took over 130 times longer than copying the large file. (Again, your mileage may vary.)
But hang on, I thought BlueCrystal was meant to be "super"?! Well it is. It's just that it's filesystem is servicing much more than just your file copying request.
Modern SATA disks have read and write bandwidths close to 100MB/s--for example, I just tested my Linux desktop machine with a handy, built-in utility (System > Administration > Disk utility) and recorded a read performance of ~75MB/s. We can compare this to the filesystem on BlueCrystal phase 2, where we see a peak of about 500MB/s throughput on the mixed workload of full, running cluster.
Another test below highlights a key difference between a parallel filesystem and that on a single disk. If I start several processes writing to the disk in my desktop machine, I see a rapid drop-off in performance as the number of processes increases. In contrast the parallel filesystem is able to support many processes with modest degradation to file writing performance.
System | Clients x1 | Clients x4 | Clients x8 |
Desktop | ~65MB/s | ~10MB/s | ~0.5MB/s |
Parallel FS | ~160MB/s | 130MB/s | 67MB/s |
So what have we learned?
- You'll get closer to the theoretical peak performance of a storage device if you:
- Use larger files.
- Make use of contiguous access patterns, rather than random ones.
- Make use of minimal seek operations.
- If you are concerned about disk access performance, you may want to consider using local disk, which is serving only your processing. All the nodes on BCp2 have (300GB) local disks accessible via /local. Please do clean up after your process if you use them and you may well want to check the available free space on the local disk as part of your job.
Data over the Network
Filling the pipe.
Data when Writing your own Code
Memory Hierarchy
L1 Cache | Picking up a book off your desk (~3s) |
L2 Cache | Getting up and getting a book off a shelf (~15s) |
Main Memory | Walking down the corridor to another room (several minutes) |
Disk | Walking the coastline of Britain (about a year) |
Files & File Formats
Data Analytics
Some common operations you may want to perform on your data:
- Cleaning
- Filtering
- Calculating summary statics (means, medians, variances)
- Creating plots & graphics
- Tests of statistical significance
- Sorting and searching
Selecting the right tools.
Databases
GUI. Accessing from a program or script. Enterprise Grade The data haven.
Numerical Packages
Such as R, MATLAB & Python.
Bespoke Applications
Rolling Your Own
Principles: Sort & binary search.
Tools: Languages, libraries and packages.
When Data gets Big
Quotas.
Local Disks.
Swapping.
Data the Google way - Map-Reduce.
Hadoop & Friends.
Summary
- Use large files whenever possible.
- Disks are poor at servicing a large number of seek requests.
- Check that you're making best use of a computer's memory hierarchy, i.e.:
- Think about locality of reference.
- Go to main memory as infrequently as possible.
- Go to disk as infrequently as possible as possible.
- Check that your are still using the right tools if your data grows.