Introduction to Data Science on Coursera
Commerce and research are being transformed by data-driven discovery and prediction. Skills required for data analytics at massive levels – scalable data management on and off the cloud, parallel algorithms, statistical modeling, and proficiency with a complex ecosystem of tools and platforms – span a variety of disciplines and are not easy to obtain through conventional curricula. Tour the basic techniques of data science, including both SQL and NoSQL solutions for massive data management (e.g., MapReduce and contemporaries), algorithms for data mining (e.g., clustering and association rule mining), and basic statistical modeling (e.g., linear and non-linear regression).
Introduction. Examples, data science articulated, history and context, technology landscape
Readings
- (example) Yong-Yeol Ahn, Sebastian E. Ahnert, James P. Bagrow, Albert-László Barabási, Flavor network and the principles of food pairing, Scientific Reports 1, Article number: 196 doi:10.1038/srep00196
- (example) Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8(3): e59030. doi:10.1371/journal.pone.0059030
- (example) Google Flu Trends
- Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski, Larry Brilliant, Detecting influenza epidemics using search engine query data, Nature 457, 1012-1014 (19 February 2009) (paywalled)
- David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani, The Parable of Google Flu Trends: Traps in Big Data Analysis, Science 14 March 2014: Vol. 343 no. 6176 pp. 1203-1205 (paywalled)
- Steve Lohr, Google Flu Trends: The Limits of Big Data, NYTimes, March 28, 2014
- (example) Eigenfactor, and publications
- (example) L'Aquila quake: Italy scientists guilty of manslaughter, BBC
- Discussion of data science and data scientists
- Drew Conway's Venn Diagram
- Mike Loukides, What is data science?, O'Reilly Radar, 2010
- Mike Driscoll, "The Seven Secrets of Successful Data Scientists"
- Origins of "Volume, Velocity, Variety"
- eScience: The Fourth Paradigm (Foreword and Introduction, pages xi-xxxi; Gray's Laws, pages 5-12)
- Chris Anderson, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Wired magazine, 2008
Databases and the relational algebra
Readings
- How Vertica Was the Star of the Obama Campaign, and Other Revelations
- E. F. Codd, 1981 Turing Award Lecture, "Relational Database: A Practical Foundation for Productivity", 1981 (Think about which arguments from this short piece are still relevant today.)
- [Advanced] Cohen et al., “MAD Skills: New Analysis Practices for Big Data”, 2009
- [Advanced] Erik Meijer, Gavin Bierman, A co-Relational Model of Data for Large Shared Data Banks, Communications of the ACM, 2011
Parallel databases, parallel query processing, in-database analytics
Readings for step 3-4-5
- Ullman, Rajaraman, Mining of Massive Datasets, Chapter 2
- Stonebraker et al., “MapReduce and Parallel DBMSs: Friends or Foes?”, Communications of the ACM, January 2010.
- Dean and Ghemawat, “MapReduce: A Flexible Data Processing Tool”, Communications of the ACM, January 2010.
- Rick Cattell, “Scalable SQL and NoSQL Data Stores”, SIGMOD Record, December 2010 (39:4)
- Optional Technical Background: The Hadoop Distributed File System
Data cleaning, entity resolution, data integration, information extraction
(NOT COVERED IN LECTURES) Readings / Talks
- Elmagarmid et al., Duplicate Record Detection: A Survey
- Koudas et al., Record Linkage: Similarity Measures and Algorithms
MapReduce, Hadoop, relationship to databases, algorithms, extensions, languages
Key-value stores and NoSQL; tradeoffs of SQL and NoSQL
Topics in statistical modeling: basic concepts, experiment design, pitfalls
Readings
- Chapter 3 of A Handbook of Statistical Analyses Using R
- Gregory Park on overfitting to the leaderboard in a Kaggle Competition
- John P. A. Ioannidis, Why Most Published Research Findings Are False, PLoS Medicine, August 30, 2005
- Benford's Law (wikipedia)
Topics in machine learning
- Supervised learning (rules, trees, forests, nearest neighbor, regression)
- Optimization (gradient descent and variants)
- Unsupervised learning
Readings
- Xindong Wu et al., Top 10 Algorithms in Data Mining, Knowledge and Information Systems, 14(2008), 1: 1-37. (read section on C4.5)
- Ullman, Rajaraman, Mining of Massive Datasets, Chapter 1
- Pedro Domingos, A Few Useful Things to Know about Machine Learning, CACM 55(10), 2012
Unsupervised learning: k-means, multi-dimensional scaling
Readings
- Xindong Wu et al., Top 10 Algorithms in Data Mining, Knowledge and Information Systems, 14(2008), 1: 1-37. (read section on k-means)
Visualization, data products, visual data analytics
Readings (well, watchings)
- Hans Rosling, The Joy of Stats
- Pat Hanrahan, Tools for Data Enthusiasts
- Jeffrey Heer, Michael Bostock, Vadim Ogievetsky, A Tour through the Visualization Zoo, Communications of the ACM, Volume 53 Issue 6, June 2010
Provenance, privacy, ethics, governance
Backlash: Ethics, privacy, unreliable methods, irreproducible results
(NOT COVERED IN LECTURES)
- Howard Wen, "Big Ethics for Big Data", O'Reilly Media
- John Markoff, New York Times, Unreported Side Effects of Drugs Are Found Using Internet Search Data, March 13, 2013
- Mike Loukides, Data Skepticism, O'Reilly Media, April 2013
- Gary Marcus and Ernest Davis, Eight (No, Nine!) Problems With Big Data, New York Times, April 6, 2014
- Tim Harford, Big data: are we making a big mistake?, March 28, 2014
- K.N.C., The backlash against big data, The Economist, Apr 20th 2014 (very short)
- See also: Gartner Hype cycle
- George Johnson, New Truths That Only One Can See, New York Times, January 20, 2014
- John P. A. Ioannidis, Why Most Published Research Findings Are False, PLoS Medicine, August 30, 2005
- Dan McKinley, Whom the Gods Would Destroy, They First Give Real-Time Analytics
Guest Lectures
Graph Analytics
- structure
- traversals
- analytics
- PageRank
- community detection
- recursive queries
- semantic web
Readings
- Sherif Sakr, Processing large-scale graph data: A guide to current technology, June 2013 (more to come)
- 10 July 2014, 11:15