Data Science - Big Data

1 - About

Big Data is usually defined in terms of the 3Vs:

  • volume,
  • velocity,
  • and variety.

Doug Laney originally defined the 3Vs in a 2001 research note written at META Group, which Gartner later acquired.

The term usually refers to Internet-scale data sets.

"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…" (Dan Ariely, 2013)

3 - Word Cloud

Apache Cassandra, Machine Learning, Hadoop, NoSQL, Apache Hive, Map/Reduce and HDFS, Data Visualization, ZooKeeper, Distributed Search and Real Time Analytics, Avro, Visualizing Your Graph, Analytics Maturity Model, R

4 - Sources

4.1 - Monitoring

Much Big Data comes from recording online activity:

  • every click on a website,
  • every ad viewed,
  • every billing event,
  • every fast-forward or pause while you're watching a video,
  • every request that's made from a client to a server,
  • every transaction,
  • every network message,
  • and every fault.

Potentially, anything that occurs can be recorded.

A lot of it is recorded, but very little of it gets analyzed. Hence the familiar picture of an iceberg: a phenomenal amount of data is collected, but only a tiny fraction of it is ever analyzed.
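As a minimal sketch of what "recording" can look like in practice, the example below appends each event as one JSON line and then analyzes a single aspect of the log (a count per event type). The function names, the field names (`page`, `ad_id`, `code`) and the in-memory log are assumptions made for illustration; a real system would write to files or a message queue.

```python
import io
import json
from collections import Counter


def record_event(log, event_type, **fields):
    """Append one event to a file-like log as a single JSON line."""
    log.write(json.dumps({"type": event_type, **fields}) + "\n")


def count_by_type(log_text):
    """Count the recorded events per event type."""
    return Counter(
        json.loads(line)["type"] for line in log_text.splitlines() if line
    )


# An in-memory buffer stands in for a real log file (assumption).
log = io.StringIO()
record_event(log, "click", page="/home")
record_event(log, "ad_view", ad_id=17)
record_event(log, "click", page="/pricing")
record_event(log, "fault", code=500)

counts = count_by_type(log.getvalue())
print(counts)
```

Every field of every event is stored, yet the analysis reads only the `type` field, which is the iceberg effect in miniature.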

4.2 - User-generated content

Content that users create themselves, such as:

  • a post on Facebook,
  • a picture on Instagram,
  • a review on Yelp or TripAdvisor,
  • a tweet on Twitter,
  • and a video on YouTube.

4.3 - Health and scientific computing

  • the Large Hadron Collider, which is claimed to generate more data in a year than all the other data sources combined.
  • genome sequencing data. The cost of sequencing is dropping exponentially, much faster than Moore's Law, so as a result we are collecting more sequencing data than ever before.
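To make "dropping exponentially, much faster than Moore's Law" concrete, here is a small arithmetic sketch. Both halving times are assumptions: two years is a common reading of Moore's Law, and the one-year figure for sequencing is purely hypothetical and not a measured value.

```python
def relative_cost(years, halving_time_years):
    """Fraction of the starting cost left after `years` of exponential decline."""
    return 0.5 ** (years / halving_time_years)


MOORE_HALVING_YEARS = 2.0       # common reading of Moore's Law (assumption)
SEQUENCING_HALVING_YEARS = 1.0  # hypothetical value, for illustration only

moore = relative_cost(10, MOORE_HALVING_YEARS)            # 0.5 ** 5
sequencing = relative_cost(10, SEQUENCING_HALVING_YEARS)  # 0.5 ** 10

# How many times cheaper the faster curve is after ten years.
print(moore / sequencing)  # 32.0
```

Under these assumed rates, halving twice as often gives a 32-fold cost advantage after a decade, which is why a fixed budget buys far more sequencing data each year.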


4.4 - Graphs

Graphs include things like:

  • social networks,
  • telecommunication networks,
  • computer networks,
  • road networks,
  • and collaborations or relationships.

Some of these graphs can be absolutely enormous, for example Facebook's user graph.
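The following is a minimal sketch of how such a graph can be held in memory, assuming an undirected friendship network stored as adjacency sets. The names and the tiny edge list are made up for illustration; graphs at Facebook's scale need distributed storage rather than a single dictionary.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected graph as a mapping from node to set of neighbors."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def mutual_neighbors(graph, a, b):
    """Return the nodes connected to both a and b."""
    return graph.get(a, set()) & graph.get(b, set())


# Hypothetical friendships: each pair is one edge of the social network.
friendships = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]
graph = build_graph(friendships)

# Degree = number of direct connections of each person.
degrees = {node: len(neighbors) for node, neighbors in graph.items()}
print(degrees)
print(mutual_neighbors(graph, "ann", "bob"))
```

Adjacency sets make both neighbor lookup and the mutual-friend intersection cheap, which is why they are a common starting point for small graphs.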

4.5 - Log files

4.6 - Internet of things

  • sensors,
  • RFID tags. For example, the California FasTrak electronic toll collection transponder is used to pay highway tolls, and the data it produces is also used for traffic reporting.

data_mining/big_data.txt · Last modified: 2017/02/12 21:02 by gerardnico