Data Science
The massive amounts of data that can now be generated and collected (so-called big data) demand a highly advanced skill set to analyze the data and derive new solutions for a vast number of applications.
To become a data scientist means to be “better at statistics than any software engineer and better at software engineering than any statistician,” according to Josh Wills.
What is Big Data & Data Science?
- Big data: The next frontier for innovation, competition, and productivity
- McKinsey’s public report, published in May 2011; its content is still relevant
- A thorough report on big data & data science techniques, technologies, applications, and market forecasts
- Compare data science with other related terms
- 8 steps to become a data scientist – source: blog.datacamp.com
Common Techniques and Technologies
Below is a short (incomplete) list of common skills required for undertaking data science tasks:
- Statistics, Probability, Statistical Inference, Statistical Modeling, and Data Visualization
- Mathematics, Discrete Math, Calculus, Linear Algebra, Numerical Analysis, and Algorithms
- Machine Learning
- The study of computer algorithms that improve automatically through experience, as defined by Tom Mitchell
- Mitchell’s fundamental book on Machine Learning, published in 1997
- Andrew Ng’s online course, including short video lectures
- Data Mining
- The study of sophisticated automated processes that discover patterns in data for segmentation and prediction
- Free tools like WEKA, RapidMiner, KNIME, NLTK, Orange, Rattle, and Apache Mahout
- Popular commercial tools like Microsoft Azure, IBM SPSS Modeler, MATLAB, SAS Enterprise Miner…
- Cloud platforms like AWS, Azure, and Cloudera
- Hadoop, HBase, Hive, Pig, Spark…
- Frameworks for processing large data sets across a cluster of smaller machines
- Familiarity with the Linux/Unix environment and shell scripts
- Online Hadoop tutorials using Hortonworks Sandbox
- Brief tutorials for Hadoop, HDFS, and MapReduce for experienced programmers
- Introduction to MapReduce
- Programming languages like R, Python, Java (incl. Data Structures)…
- Structured Query Language (SQL) in general; relational vs. non-relational databases
- SQL tutorials with online practice exercises
- Other statistical and visualization software like SAS, SPSS, Stata, Tableau, and QlikView…
- And soft skills like Domain Knowledge, Communication, Cooperation, Management, Creativity, Curiosity, and Ethics
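To make the MapReduce item in the list above more concrete, here is a minimal single-machine sketch of the idea in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. A real framework would run the map and reduce steps in parallel on many machines and handle the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key.

    In a real cluster the framework does this between map and reduce;
    here it is simulated with an in-memory dictionary.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in sequence over a list of text documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data needs big tools", "data science needs data"]
    print(word_count(docs))
```

Because each call to `map_phase` looks at only one document and each call to `reduce_phase` looks at only one key, both steps can be spread across a cluster of smaller machines, which is the property the frameworks listed above rely on.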
Other Resources for Data Science
- Recommended books available ONLINE through Mason
- An Introduction to Data Science by Jeffrey Stanton, Syracuse University
- Free online courses
- Some academic programs:
- List of Master’s programs in Data Science in the U.S.
- GMU’s M.S. in Data Analytics Engineering
- Also consider graduate courses in CSI, Statistics, Mathematics, and CS departments
Community Support
- The industry’s online resource for big data practitioners
- Data Miners’ Blogs
- IBM’s Big Data and Analytics Hub
- IBM’s Big Data University
- Kaggle – home for prediction competitions and tutorial scripts (mostly in Python or R)
- Sample Projects from “Data Science for Social Good”