Data

From SourceWiki
Jump to navigation Jump to search

Data: How to surf, rather than drown!

Introduction

Data on Disk

BCp2$ wget https://www.kernel.org/pub/linux/kernel/v3.x/testing/linux-3.10-rc7.tar.xz
BCp2$ time cp linux-3.10-rc7.tar.xz linux-3.10-rc7.tar.xz.copy

real	0m3.530s
user	0m0.000s
sys	0m0.068s
BCp2$ tar --use-compress-program=xz -xf linux-3.10-rc7.tar.xz
BCp2$ time cp -r linux-3.10-rc7 linux-3.10-rc7.copy

real	18m17.102s
user	0m0.214s
sys	0m6.359s

These timings were taken at ~10:45 on the 25 Jun 2013. Your mileage may vary!

desktop$ wget https://www.kernel.org/pub/linux/kernel/v3.x/testing/linux-3.10-rc7.tar.xz
desktop$ time cp linux-3.10-rc7.tar.xz linux-3.10-rc7.tar.xz.copy

real	0m0.192s
user	0m0.000s
sys	0m0.156s
desktop$ tar --use-compress-program=xz -xf linux-3.10-rc7.tar.xz
desktop$ time cp -r linux-3.10-rc7 linux-3.10-rc7.copy

real	0m25.961s
user	0m0.168s
sys	0m2.360s

Again, your mileage may vary.

Data over the Network

Filling the pipe.

Data when Writing your own Code

Memory Hierarchy

L1 Cache Picking up a book off your desk (~3s)
L2 Cache Getting up and getting a book off a shelf (~15s)
Main Memory Walking down the corridor to another room (several minutes)
Disk Walking the coastline of Britain (about a year)

Files & File Formats

Data Analytics

Some common operations you may want to perform on your data:

  • Cleaning
  • Filtering
  • Calculating summary statics (means, medians, variances)
  • Creating plots & graphics
  • Tests of statistical significance
  • Sorting and searching

Selecting the right tools.

Databases

GUI. Accessing from a program or script. Enterprise Grade The data haven.

Numerical Packages

Such as R, MATLAB & Python.

Bespoke Applications

Rolling Your Own

Principles: Sort & binary search.

Tools: Languages, libraries and packages.


When Data gets Big

Quotas.

Local Disks.

Swapping.

Data the Google way - Map-Reduce.

Hadoop & Friends.

Summary

  • Use large files whenever possible.
  • Disks are poor at servicing a large number of seek requests.
  • Check that you're making best use of a computer's memory hierarchy, i.e.:
    • Think about locality of reference.
    • Go to main memory as infrequently as possible.
    • Go to disk as infrequently as possible as possible.
  • Check that your are still using the right tools if your data grows.