Difference between revisions of "Data"

From SourceWiki
Revision as of 09:51, 25 June 2013

Data: How to surf, rather than drown!

Introduction

Data on Disk

BCp2$ wget https://www.kernel.org/pub/linux/kernel/v3.x/testing/linux-3.10-rc7.tar.xz
BCp2$ time cp linux-3.10-rc7.tar.xz linux-3.10-rc7.tar.xz.copy

real	0m3.530s
user	0m0.000s
sys	0m0.068s


BCp2$ tar --use-compress-program=xz -xf linux-3.10-rc7.tar.xz
BCp2$ time cp 

These timings were taken at ~10:45 on 25 Jun 2013. Your mileage may vary!
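In the transcript above, the real time (3.530s) is far larger than user plus sys (0.068s), which suggests the copy spent most of its time waiting rather than computing. A minimal sketch for repeating that kind of measurement from a script follows. The helper name timed_copy and the 16 MiB scratch file are illustrative assumptions, not part of the original exercise.

```python
import os
import shutil
import tempfile
import time


def timed_copy(src, dst):
    """Copy src to dst and return (elapsed wall-clock seconds, bytes copied)."""
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    return elapsed, os.path.getsize(dst)


if __name__ == "__main__":
    # Hypothetical scratch file; substitute a real file such as the tarball above.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(16 * 1024 * 1024))
        elapsed, size = timed_copy(src, src + ".copy")
        print(f"copied {size} bytes in {elapsed:.3f}s")
```

A repeat run may well be faster, because the operating system can serve recently used data from its cache rather than from disk.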

Data over the Network

Filling the pipe.
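Assuming "filling the pipe" refers to the usual bandwidth-delay product idea, the amount of data that must be in flight to keep a link busy can be estimated with a line of arithmetic. The function name and the example link figures below are hypothetical.

```python
def bandwidth_delay_product(bandwidth_bits_per_s, rtt_s):
    """Bytes that need to be in flight to keep a link of this speed busy."""
    return bandwidth_bits_per_s * rtt_s / 8


# Hypothetical link: 1 Gbit/s with a 20 ms round-trip time.
bdp = bandwidth_delay_product(1e9, 0.020)
print(f"{bdp / 1e6:.1f} MB in flight")
```

If less data than this is outstanding at any moment, the sender spends part of each round trip idle, which is one reason many small transfers tend to use a fast link poorly.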

Data when Writing your own Code

Memory Hierarchy

Files & File Formats

Data Analytics

Some common operations you may want to perform on your data:

  • Cleaning
  • Filtering
  • Calculating summary statistics (means, medians, variances)
  • Creating plots & graphics
  • Tests of statistical significance
  • Sorting and searching
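Several of the operations listed above can be sketched with nothing more than the Python standard library. The readings, the clean helper and the threshold of 50.0 are invented for illustration.

```python
import statistics

# Hypothetical raw readings; None and "n/a" stand in for bad records.
raw = ["3.2", "4.1", None, "n/a", "2.7", "100.0", "3.9"]


def clean(values):
    """Cleaning: keep only the entries that parse as floats."""
    out = []
    for v in values:
        try:
            out.append(float(v))
        except (TypeError, ValueError):
            continue  # skip missing or non-numeric records
    return out


cleaned = clean(raw)

# Filtering: drop readings at or above an arbitrary, illustrative threshold.
filtered = [x for x in cleaned if x < 50.0]

# Summary statistics and sorting.
print("mean    ", statistics.mean(filtered))
print("median  ", statistics.median(filtered))
print("variance", statistics.variance(filtered))  # sample variance
print("sorted  ", sorted(filtered))
```

For larger data sets a numerical package is likely to be a better fit than hand-written loops, but the steps stay the same.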

Selecting the right tools.

Databases

GUI. Accessing from a program or script. Enterprise Grade. The data haven.
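As one possible illustration of accessing a database from a script, the sketch below uses the sqlite3 module from the Python standard library. The choice of SQLite, the readings table and its columns are assumptions made for the example, not something the course prescribes.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.5), ("a", 2.5), ("b", 10.0)],
)

# Let the database do the grouping, averaging and sorting.
rows = conn.execute(
    "SELECT site, AVG(value) FROM readings GROUP BY site ORDER BY site"
).fetchall()
print(rows)
conn.close()
```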

Numerical Packages

Such as R, MATLAB & Python.

Bespoke Applications

Rolling Your Own

Principles: Sort & binary search.
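The sort-then-search principle can be sketched with the bisect module from the Python standard library. The contains helper and the record IDs are hypothetical.

```python
import bisect


def contains(sorted_items, target):
    """Binary search: return True if target is in the already sorted list."""
    i = bisect.bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target


# Sort once, then each lookup needs only O(log n) comparisons.
ids = sorted([42, 7, 19, 88, 3, 61])
print(contains(ids, 19))  # True
print(contains(ids, 20))  # False
```

The up-front sort pays for itself when the same data is searched many times.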

Tools: Languages, libraries and packages.


When Data gets Big

Quotas.

Local Disks.

Swapping.

Data the Google way - Map-Reduce.
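The shape of a Map-Reduce job can be sketched on a single machine as a word count. This is a toy illustration of the map, shuffle and reduce steps rather than how any particular framework implements them, and all function names are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data big files", "Big disks"]))
# {'big': 3, 'data': 1, 'files': 1, 'disks': 1}
```

Because each call to map_phase and each call to reduce_phase is independent of the others, a real system can spread them across many machines.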

Hadoop & Friends.

Summary

  • Use large files whenever possible.
  • Disks are poor at servicing a large number of seek requests.
  • Check that you're making best use of a computer's memory hierarchy, i.e.:
    • Think about locality of reference.
    • Go to main memory as infrequently as possible.
    • Go to disk as infrequently as possible.
  • Check that you are still using the right tools if your data grows.
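To make the "go to disk as infrequently as possible" point concrete, the sketch below reads the same file with different request sizes and counts the reads. The checksum helper, the chunk sizes and the 4 MiB scratch file are illustrative assumptions. The file is opened unbuffered so that each read() corresponds to at most one request to the operating system.

```python
import hashlib
import os
import tempfile


def checksum(path, chunk_size=1024 * 1024):
    """Hash a file in sequential chunks; return (hex digest, number of reads)."""
    digest = hashlib.sha256()
    reads = 0
    # buffering=0 gives a raw file object: one read() is one OS-level request.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            reads += 1
    return digest.hexdigest(), reads


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(4 * 1024 * 1024))  # hypothetical 4 MiB file
        for size in (4096, 1024 * 1024):
            digest, reads = checksum(path, size)
            print(f"chunk={size:>8} reads={reads:>5} sha256={digest[:12]}")
```

Both runs produce the same digest, but the larger chunk size gets there with far fewer requests, which is the same reasoning behind preferring a few large files to many small ones.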