Data: How to surf, rather than drown!

Introduction

Data on Disk

A Salutary Tale of Copying Files

We'll start by considering data stored on a disk drive. One thing that you might not know is that file systems and disk drives perform best when they are dealing with larger files. But how large is large? Here's a simple example, which you can try yourself:

First of all, let's get ourselves a large-ish file. A compressed tarball of the source code for the Linux kernel will do. In this case it's about 70MB in size. We can time how long it takes to create a copy of it. Below are the results of doing this experiment on BlueCrystal phase 2:

BCp2$ wget https://www.kernel.org/pub/linux/kernel/v3.x/testing/linux-3.10-rc7.tar.xz
BCp2$ time cp linux-3.10-rc7.tar.xz linux-3.10-rc7.tar.xz.copy

real	0m3.530s
user	0m0.000s
sys	0m0.068s

Now, that tar ball contains around 47,000 files, many of which are only a few hundred or thousand of bytes in size. These are files at the smaller end of the scale. Let's unpack the tarball and time how long it takes to copy these files, one-by-one:

BCp2$ tar --use-compress-program=xz -xf linux-3.10-rc7.tar.xz
BCp2$ time cp -r linux-3.10-rc7 linux-3.10-rc7.copy

real	18m17.102s
user	0m0.214s
sys	0m6.359s

Yikes! that took over 350 times longer than copying the single, large file. (These timings were taken at ~10:45 on the 25 Jun 2013. Any differences from the above values will be due to differences in load on the filesystem, which is shared with all the other users of the cluster. More of that in a moment..)

Now, we can repeat these tests on a different system. I got the values below from my very standard desktop machine:

desktop$ wget https://www.kernel.org/pub/linux/kernel/v3.x/testing/linux-3.10-rc7.tar.xz
desktop$ time cp linux-3.10-rc7.tar.xz linux-3.10-rc7.tar.xz.copy

real	0m0.192s
user	0m0.000s
sys	0m0.156s

desktop$ tar --use-compress-program=xz -xf linux-3.10-rc7.tar.xz
desktop$ time cp -r linux-3.10-rc7 linux-3.10-rc7.copy

real	0m25.961s
user	0m0.168s
sys	0m2.360s

That's a lot quicker! However, copying the small files still took over 130 times longer than copying the large file. (Again, your mileage may vary.)

But hang on, I thought BlueCrystal was meant to be "super"?! Well it is. It's just that it's filesystem is servicing much more than just your file copying request.

Modern SATA disks have read and write bandwidths close to 100MB/s--for example, I just tested my Linux desktop machine with a handy, built-in utility (System > Administration > Disk utility) and recorded a read performance of ~75MB/s. We can compare this to the filesystem on BlueCrystal phase 2, where we see a peak of about 500MB/s throughput on the mixed workload of full, running cluster.

Another test below highlights a key difference between a parallel filesystem and that on a single disk. If I start several processes writing to the disk in my desktop machine, I see a rapid drop-off in performance as the number of processes increases. In contrast the parallel filesystem is able to support many processes with modest degradation to file writing performance.

System	Clients x1	Clients x4	Clients x8
Desktop	~65MB/s	~10MB/s	~0.5MB/s
Parallel FS	~160MB/s	130MB/s	67MB/s

So what have we learned?

You'll get better file access performance if you use larger rather than smaller files.
If you are using BlueCrystal, you may want to consider using disks local to the compute nodes. These nodes are 300GB in size and are accessible via /local. If you do elect to use local disks:
- It is advisable to check the available space in /local as part of your job.
- Please do clean up /local after your job has finished
- when transferring data from node to node, please use the suffix .data.cluster when referring to a node, e.g. scp u04n037.data.cluster:/local/mydata $HOME.

Additionally (and not demonstrated above):

Make use of minimal seek operations, which implies
Make use of contiguous access patterns, rather than random ones.

Data over the Network

Filling the pipe.

We can think of networks as pipelines for data.
Unless we fill the pipeline we won't achieve it's throughput capacity.

Data when Writing your own Code

Locality of Reference: http://en.wikipedia.org/wiki/Locality_of_reference

Temporal locality: We expect to re-use of data already seen. Spatial locality: We expect to access data stored close to data that we've already seen.

Computer hardware is optimised to exploit these principles. We will get the most from our hardware if we design software accordingly.

Memory Hierarchy

L1 Cache	Picking up a book off your desk (~3s)
L2 Cache	Getting up and getting a book off a shelf (~15s)
Main Memory	Walking down the corridor to another room (several minutes)
Disk	Walking the coastline of Britain (about a year)

Files & File Formats

Data Analytics

Some common operations you may want to perform on your data:

Cleaning
Filtering
Calculating summary statics (means, medians, variances)
Creating plots & graphics
Tests of statistical significance
Sorting and searching

Selecting the right tools.

Databases

SQLite

A very popular relational database. Lightweight as it is unmanaged and so has very simple access controls. However does support SQL. The command line interface is widely available. For example, it is installed on BlueCrystal.

sqlite3 test.db

sqlite> CREATE TABLE planets(Id INT, Name TEXT, Diameter REAL, Mass REAL, Orbital_Period REAL);
sqlite> INSERT INTO planets VALUES(1,'Mercury',0.382,0.06,0.24);
sqlite> INSERT INTO planets VALUES(2,'Venus',0.949,0.82,0.72);
sqlite> INSERT INTO planets VALUES(3,'Earth',1.0,1.0,1.0);
sqlite> INSERT INTO planets VALUES(4,'Mars',0.532,0.11,1.52);
sqlite> INSERT INTO planets VALUES(5,'Jupiter',11.209,317.8,5.20);
sqlite> INSERT INTO planets VALUES(6,'Saturn',9.449,95.2,9.54);
sqlite> INSERT INTO planets VALUES(7,'Uranus',4.007,14.6,19.22);
sqlite> INSERT INTO planets VALUES(8,'Neptune',3.883,17.2,30.06);

sqlite> .mode column
sqlite> .headers on
sqlite> SELECT * FROM planets;

Id          Name        Diameter    Mass        Orbital_Period   
----------  ----------  ----------  ----------  --------------
1           Mercury     0.382       0.06        0.24      
2           Venus       0.949       0.82        0.72      
3           Earth       1.0         1.0         1.0       
4           Mars        0.532       0.11        1.52      
5           Jupiter     11.209      317.8       5.2       
6           Saturn      9.449       95.2        9.54      
7           Uranus      4.007       14.6        19.22     
8           Neptune     3.883       17.2        30.06

    
sqlite> SELECT * FROM planets WHERE Mass >= 1.0;

Id          Name        Diameter    Mass        Orbital_Period   
----------  ----------  ----------  ----------  --------------
3           Earth       1.0         1.0         1.0       
5           Jupiter     11.209      317.8       5.2       
6           Saturn      9.449       95.2        9.54      
7           Uranus      4.007       14.6        19.22     
8           Neptune     3.883       17.2        30.06

Let's create an additional table called moons:

sqlite> CREATE TABLE moons(Name TEXT, Num_Moons INT);
sqlite> INSERT INTO moons VALUES('Mercury',0);
sqlite> INSERT INTO moons VALUES('Venus',0);
sqlite> INSERT INTO moons VALUES('Earth',1);
sqlite> INSERT INTO moons VALUES('Mars',2);
sqlite> INSERT INTO moons VALUES('Jupiter',67);
sqlite> INSERT INTO moons VALUES('Saturn',62);
sqlite> INSERT INTO moons VALUES('Uranus',27);
sqlite> INSERT INTO moons VALUES('Neptune',13);
sqlite> INSERT INTO moons VALUES('Pluto',5);

Joins. Natural inner join.

sqlite> SELECT Name, Orbital_Period, Num_moons FROM planets NATURAL JOIN moons;

Name        Orbital_Period  Num_Moons 
----------  --------------  ----------
Mercury     0.24            0         
Venus       0.72            0         
Earth       1.0             1         
Mars        1.52            2         
Jupiter     5.2             67        
Saturn      9.54            62        
Uranus      19.22           27        
Neptune     30.06           13

Outer join. Need to swap the table order as a natural right outer join is not currently supported.

sqlite> SELECT Name, Orbital_Period, Num_moons FROM moons NATURAL LEFT OUTER JOIN planets;

Notice that Pluto does not have an orbital period as it is not listed in the planets table.

Name        Orbital_Period  Num_Moons 
----------  --------------  ----------
Mercury     0.24            0         
Venus       0.72            0         
Earth       1.0             1         
Mars        1.52            2         
Jupiter     5.2             67        
Saturn      9.54            62        
Uranus      19.22           27        
Neptune     30.06           13        
Pluto                       5

If you would like to export to, e.g., a CSV file (this is useful for subsequent import into, e.g., R):

sqlite> .mode csv
sqlite> .output planets.csv
sqlite> SELECT * FROM planets WHERE Mass >= 1.0;

Where the contents of the file planets.csv is:

Id,Name,Diameter,Mass,Orbital
3,Earth,1.0,1.0,1.0
5,Jupiter,11.209,317.8,5.2
6,Saturn,9.449,95.2,9.54
7,Uranus,4.007,14.6,19.22
8,Neptune,3.883,17.2,30.06

There is, of course a corresponding command to import data from a file into a table. For example, if I had information about stars in an appropriately formatted CSV files, I could load it into a table called stars using the commands:

sqlite> .separator ','
sqlite> .import stars.csv stars

To exit the SQLite command line interpreter type:

sqlite> .exit

It is perfectly possible--indeed, perhaps preferable--to access a database from inside a program. (SQLite was really designed with that in mind.) For example, you can learn more about accessing an SQLite database from inside a python script at:

https://source.ggy.bris.ac.uk/wiki/Python1#Relational_Databases Relational

More information about SQLite and other interfaces for access is at:

SQLite: e.g. http://zetcode.com/db/sqlite/

MySQL

http://dev.mysql.com/tech-resources/articles/mysql_intro.html

UoB Data Haven

GUI. Accessing from a program or script. Enterprise Grade The data haven.

GUI Tools

Numerical Packages

Such as R, MATLAB & Python.

Bespoke Applications

Rolling Your Own

Principles: Sort & binary search.

Tools: Languages, libraries and packages.

When Data gets Big

Quotas.

Local Disks.

Swapping.

Data the Google way - Map-Reduce.

Hadoop & Friends.

http://hadoop.apache.org/docs/r0.18.3/streaming.html

Summary

Use large files whenever possible.
Disks are poor at servicing a large number of seek requests.
Check that you're making best use of a computer's memory hierarchy, i.e.:
- Think about locality of reference.
- Go to main memory as infrequently as possible.
- Go to disk as infrequently as possible as possible.
Check that your are still using the right tools if your data grows.

Data

Contents