Difference between revisions of "Data"
Revision as of 10:23, 25 June 2013
Data: How to surf, rather than drown!
Introduction
Data on Disk
<pre>
BCp2$ wget https://www.kernel.org/pub/linux/kernel/v3.x/testing/linux-3.10-rc7.tar.xz
BCp2$ time cp linux-3.10-rc7.tar.xz linux-3.10-rc7.tar.xz.copy

real    0m3.530s
user    0m0.000s
sys     0m0.068s
</pre>
<pre>
BCp2$ tar --use-compress-program=xz -xf linux-3.10-rc7.tar.xz
BCp2$ time cp -r linux-3.10-rc7 linux-3.10-rc7.copy

real    18m17.102s
user    0m0.214s
sys     0m6.359s
</pre>
These timings were taken at ~10:45 on the 25 Jun 2013. Your mileage may vary!
<pre>
desktop$ wget https://www.kernel.org/pub/linux/kernel/v3.x/testing/linux-3.10-rc7.tar.xz
desktop$ time cp linux-3.10-rc7.tar.xz linux-3.10-rc7.tar.xz.copy

real    0m0.192s
user    0m0.000s
sys     0m0.156s
</pre>
<pre>
desktop$ tar --use-compress-program=xz -xf linux-3.10-rc7.tar.xz
desktop$ time cp -r linux-3.10-rc7 linux-3.10-rc7.copy

real    0m25.961s
user    0m0.168s
sys     0m2.360s
</pre>
Again, your mileage may vary.
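The timings above contrast copying one large archive with copying the many small files it unpacks to. A minimal Python sketch of the same experiment follows, assuming synthetic random data in a temporary directory rather than the kernel source; the function name `compare_copies` and its parameters are hypothetical, chosen for illustration only. On a local disk with a warm page cache the gap may be small, so treat any numbers it prints as indicative at best.

```python
import os
import shutil
import tempfile
import time


def compare_copies(total_bytes=1_000_000, n_small=200):
    """Copy the same number of bytes as one large file and as many small files.

    Hypothetical helper for illustration; sizes are arbitrary assumptions.
    """
    chunk = total_bytes // n_small
    with tempfile.TemporaryDirectory() as root:
        # One large file holding all the data.
        big = os.path.join(root, "big.bin")
        with open(big, "wb") as f:
            f.write(os.urandom(chunk * n_small))

        # The same amount of data spread over n_small files.
        small_dir = os.path.join(root, "small")
        os.mkdir(small_dir)
        for i in range(n_small):
            with open(os.path.join(small_dir, f"{i:05d}.bin"), "wb") as f:
                f.write(os.urandom(chunk))

        # Time the single-file copy (like `cp file file.copy`).
        start = time.perf_counter()
        shutil.copy(big, big + ".copy")
        seconds_big = time.perf_counter() - start

        # Time the recursive copy (like `cp -r dir dir.copy`).
        start = time.perf_counter()
        shutil.copytree(small_dir, small_dir + ".copy")
        seconds_small = time.perf_counter() - start

        # Confirm both copies moved the same number of bytes.
        bytes_big = os.path.getsize(big + ".copy")
        bytes_small = sum(
            os.path.getsize(os.path.join(small_dir + ".copy", name))
            for name in os.listdir(small_dir + ".copy")
        )

    return {
        "bytes_big": bytes_big,
        "bytes_small": bytes_small,
        "seconds_big": seconds_big,
        "seconds_small": seconds_small,
    }


if __name__ == "__main__":
    print(compare_copies())
```

Running it on a shared filesystem and on a local disk gives a rough feel for how much per-file overhead each one adds.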
Data over the Network
Filling the pipe.
Data when Writing your own Code
Memory Hierarchy
L1 Cache | Picking up a book off your desk (~3s) |
L2 Cache | Getting up and getting a book off a shelf (~15s) |
Main Memory | Walking down the corridor to another room (several minutes) |
Disk | Walking the coastline of Britain (about a year) |
Files & File Formats
Data Analytics
Some common operations you may want to perform on your data:
- Cleaning
- Filtering
- Calculating summary statistics (means, medians, variances)
- Creating plots & graphics
- Tests of statistical significance
- Sorting and searching
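As a rough illustration of the first few operations in this list, the sketch below cleans, filters and summarises a handful of readings using only the Python standard library. The sample values, the rule that negative readings are invalid, and the function names are all assumptions made up for the example.

```python
import statistics

# Hypothetical raw readings, as they might arrive from a text file.
RAW = ["3.2", " 4.1", "n/a", "5.0", "", "4.1", "-1.0"]


def clean(values):
    """Cleaning: convert to float, dropping entries that are not numbers."""
    cleaned = []
    for value in values:
        try:
            cleaned.append(float(value.strip()))
        except ValueError:
            continue  # skip "n/a", empty strings, etc.
    return cleaned


def summarise(values):
    """Summary statistics: mean, median and sample variance."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),
    }


cleaned = clean(RAW)
# Filtering: assume (for this example only) that negative readings are invalid.
filtered = [x for x in cleaned if x >= 0]
summary = summarise(filtered)
ordered = sorted(filtered)  # sorting, ready for searching

if __name__ == "__main__":
    print(filtered, summary, ordered)
```

For data of any real size, one of the packages discussed under Numerical Packages is likely to be a better fit than hand-written loops.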
Selecting the right tools.
Databases
GUI. Accessing from a program or script. Enterprise grade. The data haven.
Numerical Packages
Such as R, MATLAB & Python.
Bespoke Applications
Rolling Your Own
Principles: Sort & binary search.
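The sort-then-binary-search principle can be sketched in a few lines of Python using the standard `bisect` module. The helper names `build_index` and `contains` are hypothetical; the idea is to pay for one sort up front so that each later lookup takes a logarithmic number of comparisons instead of a full scan.

```python
import bisect


def build_index(records):
    """Sort once so that later lookups can use binary search."""
    return sorted(records)


def contains(sorted_records, key):
    """Binary search: True if key is present in the sorted list."""
    i = bisect.bisect_left(sorted_records, key)
    return i < len(sorted_records) and sorted_records[i] == key


if __name__ == "__main__":
    ids = build_index([42, 7, 19, 3, 88])
    print(ids)
    print(contains(ids, 19), contains(ids, 20))
```

This pays off when the data is searched many times between changes; for a single lookup, a linear scan avoids the cost of sorting.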
Tools: Languages, libraries and packages.
When Data gets Big
Quotas.
Local Disks.
Swapping.
Data the Google way - Map-Reduce.
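To make the Map-Reduce idea concrete, here is a single-process Python sketch of the classic word-count example. It only imitates the three phases (map, shuffle, reduce) in memory; a real framework would spread them across many machines. The function names and sample documents are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big files big disks"]))
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the calls could in principle run in parallel on different machines, which is the property the approach relies on.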
Hadoop & Friends.
Summary
- Use large files whenever possible.
- Disks are poor at servicing a large number of seek requests.
- Check that you're making best use of a computer's memory hierarchy, i.e.:
  - Think about locality of reference.
  - Go to main memory as infrequently as possible.
  - Go to disk as infrequently as possible.
- Check that you are still using the right tools if your data grows.