Big Data is usually defined in terms of the 3Vs:

  • volume,
  • velocity,
  • and variety.

Doug Laney originally defined the 3Vs in a 2001 research note, written at META Group (later acquired by Gartner).

Big Data is also described as Internet-scale data sets.

"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it." (Dan Ariely, 2013)

4 - Sources

4.1 - Monitoring

Much of the data behind Big Data comes from online monitoring, which can record:

  • every click on a website,
  • every ad viewed,
  • every billing event,
  • every fast-forward or pause while you're watching a video,
  • every request that's made from a client to a server,
  • every transaction,
  • every network message,
  • and every fault.

Anything that occurs could potentially be recorded.

A lot of it is recorded, but very little of it gets analyzed. This is why Big Data is often pictured as an iceberg: a phenomenal amount of data is collected, but only a tiny part of it is ever analyzed.
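
The kind of recording described above can be sketched as an append-only event log with one JSON record per line. This is only an illustrative sketch: the `record_event` helper, the event kinds, and the field names are assumptions made for the example, and an in-memory stream stands in for a real log file or message queue.

```python
import io
import json
import time


def record_event(stream, kind, **fields):
    """Append one event to the stream as a single JSON line."""
    event = {"ts": time.time(), "kind": kind, **fields}
    stream.write(json.dumps(event) + "\n")
    return event


# An in-memory stream stands in for a log file or message queue (assumption).
log = io.StringIO()
record_event(log, "click", page="/home", user="u1")
record_event(log, "video_pause", video="v42", position_s=73.5)
record_event(log, "fault", service="billing", code=500)

# Reading the log back: one JSON document per line.
events = [json.loads(line) for line in log.getvalue().splitlines()]
kinds = [e["kind"] for e in events]
```

Writing events is cheap, which is part of why so much gets recorded; the analysis step that reads them back is where most collected data is never touched.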

4.2 - User-generated content

User-generated content includes:

  • a post on Facebook,
  • a picture on Instagram,
  • a review on Yelp or TripAdvisor,
  • a tweet on Twitter,
  • and a video on YouTube.

4.3 - Health and scientific computing

  • the Large Hadron Collider. It is sometimes said to generate more data in a year than all other data sources combined, a claim that presumably refers to raw detector output before filtering.
  • genome sequencing data. The cost of sequencing is dropping exponentially, much faster than Moore's Law, so as a result we're collecting more sequencing data than ever before.
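
To make "dropping exponentially, faster than Moore's Law" concrete, the sketch below compares two exponential cost curves with different halving times. The numbers are hypothetical and chosen only for illustration: it assumes, as a rough reading of Moore's Law, that cost halves about every two years, and compares that with a process that halves every year. Neither figure is a measured sequencing cost.

```python
def cost_after(initial_cost, years, halving_time_years):
    """Cost after `years` if it halves every `halving_time_years`."""
    return initial_cost * 0.5 ** (years / halving_time_years)


# Hypothetical halving times, for illustration only (not measured data).
MOORE_HALVING = 2.0   # assumed: cost halves roughly every two years
FASTER_HALVING = 1.0  # assumed: a process whose cost halves every year

moore = cost_after(1000.0, 6, MOORE_HALVING)    # 1000 * 0.5**3 = 125.0
faster = cost_after(1000.0, 6, FASTER_HALVING)  # 1000 * 0.5**6 = 15.625
```

Under these assumptions, after six years the faster curve is eight times cheaper than the Moore's Law curve, and the gap keeps widening with time.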


4.4 - Graphs

Graphs include things like:

  • social networks,
  • telecommunication networks,
  • computer networks,
  • road networks,
  • and collaborations or relationships.

Some of these graphs can be absolutely enormous (for example, Facebook's user graph).
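
As a minimal sketch of what such a graph looks like as data, the example below stores a tiny, made-up social network as an adjacency list (each node mapped to the set of its neighbors). The `build_graph` helper and the names in `friendships` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected graph as a dict mapping each node to its neighbors."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


# Hypothetical friendships in a tiny social network.
friendships = [("ann", "bob"), ("bob", "cho"), ("ann", "dee")]
graph = build_graph(friendships)

# Degree = number of direct connections per person.
degrees = {person: len(friends) for person, friends in graph.items()}
```

The same structure applies to road, telecommunication, or computer networks. For a very large graph, a single in-memory dictionary like this would likely not be enough, and the graph would need to be partitioned across machines.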

4.5 - Log files

4.6 - Internet of things

  • sensors,
  • and RFID tags (for example, the California FasTrak electronic toll collection transponder, which is used to pay highway tolls and also to collect data for traffic reporting).

5 - Documentation / Reference

