Data

Data: How to surf, rather than drown!

=Introduction=

=Data on Disk=

BCp2$ wget https://www.kernel.org/pub/linux/kernel/v3.x/testing/linux-3.10-rc7.tar.xz
BCp2$ time cp linux-3.10-rc7.tar.xz linux-3.10-rc7.tar.xz.copy

real	0m3.530s
user	0m0.000s
sys	0m0.068s

BCp2$ tar --use-compress-program=xz -xf linux-3.10-rc7.tar.xz
BCp2$ time cp

These timings were taken at ~10:45 on 25 Jun 2013. Your mileage may vary!
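The same measurement can be made from inside a program. Here is a minimal Python sketch, assuming the tarball fetched above is in the current directory; the 1 MiB chunk size is an illustrative choice.

import time
start = time.perf_counter()
with open("linux-3.10-rc7.tar.xz", "rb") as f:
    # Large sequential reads let the disk stream data rather than seek.
    while f.read(1024 * 1024):
        pass
print(f"sequential read took {time.perf_counter() - start:.3f}s")

Reading the same number of bytes spread across many small files would be much slower, because each file costs the disk extra seeks.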

=Data over the Network=

Filling the pipe: making full use of the bandwidth available between you and the data.
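One way to check whether you are filling the pipe is to measure the throughput you actually achieve. A minimal Python sketch, reusing the kernel tarball URL from the section above (the 64 KiB chunk size is an arbitrary choice):

import time
import urllib.request
url = "https://www.kernel.org/pub/linux/kernel/v3.x/testing/linux-3.10-rc7.tar.xz"
start = time.perf_counter()
total = 0
with urllib.request.urlopen(url) as resp:
    chunk = resp.read(64 * 1024)
    while chunk:
        total += len(chunk)
        chunk = resp.read(64 * 1024)
elapsed = time.perf_counter() - start
# Compare the achieved rate with what the link should be able to deliver.
print(f"{total / 1e6:.1f} MB in {elapsed:.1f}s = {total / 1e6 / elapsed:.1f} MB/s")

If the achieved rate is far below the link's capacity, the bottleneck may be latency, the server, or the disks at either end rather than the network itself.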

=Data when Writing your own Code=

==Files & File Formats==
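To see why the choice of format matters, here is a hedged Python sketch contrasting a text format with a fixed-width binary one; the filenames and values are made up for illustration.

import os
import struct
values = list(range(100000))
# Text format: human-readable, but bulky and slow to parse back in.
with open("values.txt", "w") as f:
    for v in values:
        f.write(f"{v}\n")
# Binary format: fixed-width 8-byte integers, compact and quick to reload.
with open("values.bin", "wb") as f:
    f.write(struct.pack(f"{len(values)}q", *values))
print(os.path.getsize("values.txt"), "bytes as text")
print(os.path.getsize("values.bin"), "bytes as binary")
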
=Data Analytics=

Some common operations you may want to perform on your data (the first few are sketched in code after the list):


 * Cleaning
 * Filtering
 * Calculating summary statistics (means, medians, variances)
 * Creating plots & graphics
 * Tests of statistical significance
 * Sorting and searching
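As promised above, a minimal Python sketch of the cleaning, filtering and summary-statistics steps, using the standard statistics module on made-up readings:

import statistics
raw = [4.1, None, 3.9, 5.2, -999.0, 4.7, 4.4, 6.0, 3.8]
# Cleaning & filtering: drop missing values and sentinel markers.
data = [x for x in raw if x is not None and x != -999.0]
print("mean    ", statistics.mean(data))
print("median  ", statistics.median(data))
print("variance", statistics.variance(data))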

Selecting the right tools.

==Databases==
 * GUI.
 * Accessing from a program or script.
 * Enterprise grade.
 * The data haven.
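As a concrete example of the "from a program or script" route, here is a minimal sketch using Python's standard-library sqlite3 module; the database file, table and columns are made up.

import sqlite3
conn = sqlite3.connect("experiments.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS runs (id INTEGER, result REAL)")
cur.execute("INSERT INTO runs VALUES (?, ?)", (1, 3.14))
conn.commit()
# Let the database do the searching and sorting for you.
for row in cur.execute("SELECT id, result FROM runs ORDER BY result"):
    print(row)
conn.close()

An enterprise-grade server (PostgreSQL, MySQL and friends) is driven the same way in principle, just through a different connector.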

==Numerical Packages==
Such as R, MATLAB & Python.
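For instance, in Python the usual numerical workhorse is NumPy (assumed to be installed here); R and MATLAB offer equivalent one-liners.

import numpy as np
data = np.array([4.1, 3.9, 5.2, 4.7, 4.4, 6.0, 3.8])
# Vectorised operations: no explicit loops required.
print(data.mean(), np.median(data), data.var(ddof=1))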

==Rolling Your Own==
Principles: sort & binary search (see the sketch below).
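A minimal sketch of that principle with Python's standard-library bisect module; the numbers are made up.

import bisect
data = [17, 3, 42, 8, 23, 5]
data.sort()                       # O(n log n), paid once
i = bisect.bisect_left(data, 23)  # O(log n) per subsequent lookup
print("found 23" if i < len(data) and data[i] == 23 else "not found")

Sorting once up front turns every later search from a linear scan into a handful of comparisons.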

Tools: Languages, libraries and packages.

=When Data gets Big=

Quotas.

Local Disks.

Swapping.

Data the Google way - Map-Reduce.
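The classic illustration is word counting. Below is a minimal single-machine Python sketch of the map, shuffle and reduce phases; distributed frameworks run the same phases spread across many machines.

from itertools import groupby
from operator import itemgetter
lines = ["the quick brown fox", "the lazy dog", "the fox"]
# Map: emit a (key, value) pair for every word.
pairs = [(word, 1) for line in lines for word in line.split()]
# Shuffle: bring identical keys together.
pairs.sort(key=itemgetter(0))
# Reduce: combine the values for each key.
counts = {k: sum(v for _, v in grp) for k, grp in groupby(pairs, key=itemgetter(0))}
print(counts)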

Hadoop & Friends.

=Summary=


 * Use large files whenever possible.
 * Disks are poor at servicing a large number of seek requests.
 * Check that you're making best use of a computer's memory hierarchy, i.e.:
   * Think about locality of reference.
   * Go to main memory as infrequently as possible.
   * Go to disk as infrequently as possible.
 * Check that you are still using the right tools if your data grows.