Why Data Science?
The digital age is here. The world’s most valuable resource is no longer oil, but data (The Economist). Smartphones and the internet have made data abundant, ubiquitous and far more valuable.
One of the most important aspects of data science is that its results apply to almost any sector, including healthcare, e-commerce, travel and education. A sound understanding of data science can help organizations in these sectors identify and quantify their challenges and address them effectively. For instance, retail giant Target identified pregnant customers based on the products they purchased, and Amazon, Netflix and the like rely on recommendation engines. Data science enables all that and more.
Many people are baffled by the number of tools available for data science and unsure which one to start with. The one I recommend is R. Let’s discuss why.
Why R?
R is a modern implementation of the S language, which started as a research project at Bell Labs in 1975. R was developed for data analysis, statistical modeling, simulation and graphics. However, its features are powerful enough that its uses go well beyond generic data analysis and statistics. It is therefore more of a general-purpose language than a domain-specific language (DSL).
R is Not Just a Statistics Package, It’s a Language
R is not just a statistics package that lets you use a fixed set of predefined functions. It is a language in which you can develop and execute functionality to suit your own needs. Take a look at this interview with Joe Cheng: “…people say that one of the differences between say Python or Julia and R is that R is a DSL for stats, whereas these other things are general purpose languages. R is not a DSL. It’s a language for writing DSLs, which is something that’s altogether more powerful.”
R is Designed to Operate the Way that Problems are Thought About
One of the goals of R is that the language should mirror the way people think about real-world problems. Take the concept of vectorization, for instance. Suppose we want to convert a field holding time in minutes to one in seconds. The command to do this in R would be:
time.sec <- time.min*60
The vector time.min may contain hundreds or thousands of numbers. Humans tend to think of this as a single conversion applied to the whole field, rather than as a collection of individual numbers. R makes the statement more human by hiding the series of multiplications that would have to be written out as an explicit loop in languages such as C, C++ or Java.
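As a rough illustration of the point above, the sketch below compares the vectorized command with the explicit loop it replaces. The sample values in time.min and the name time.sec.loop are made up for this example.

```r
# Hypothetical sample data: durations in minutes
time.min <- c(1.5, 2, 10, 45)

# Vectorized form: one expression is applied to every element
time.sec <- time.min * 60

# Equivalent explicit loop, closer to what one would write in C or Java
time.sec.loop <- numeric(length(time.min))
for (i in seq_along(time.min)) {
  time.sec.loop[i] <- time.min[i] * 60
}

# Check that both approaches give the same result
identical(time.sec, time.sec.loop)
```

The loop spells out each multiplication, while the vectorized line states only the intent.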
R is Both Flexible and Powerful
R is also known for its flexibility. Because it is open source, you can obtain the source code of any piece of functionality and adapt it to your own needs. Vectorization, as discussed above, gives R an edge over other languages too: what R does in a single line of code would require an explicit loop in a lower-level language. Functions in R are treated as first-class objects, which is another example of R’s flexibility. You can do an awful lot with R. From building a web application to generating mesmerizing harmonographs, you can implement a wide variety of ideas.
R also lets you call C and C++ code from within R to complement your R code, so you can mix tools and choose the best one for a specific job. The Rcpp package helps connect C++ to R. Several other packages also rely on compiled languages such as C and C++ in their implementation. For instance, the readxl package, which imports Excel files into R, builds on compiled C and C++ libraries.
Let’s add a few more reasons to choose R over other similar languages.
R Package Ecosystem
A large part of R’s success is due to its ecosystem of open-source packages, which add functionality beyond what the core installation of R (base R) offers. There are about 10,000 packages on CRAN (as of 2017) and more on GitHub. Users can write their own packages and publish them on CRAN or, more easily, on GitHub.
Amazing Community
R has an amazing online community committed to improving data analytics. Various platforms offer community support and discussion forums. Other sources include exploration and starter scripts on competition platforms, where a great deal of knowledge is shared.
Functions as First-Class Objects
R treats functions as first-class objects. Users can therefore create functions, pass them as arguments, and have them returned as the result of other computations. For instance, functions like mean, sd or a user-defined function can be treated as data and passed as arguments to other functions.
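A minimal sketch of what this looks like in practice is given below. The helper names summarise_with, value_range, make_multiplier and to_seconds are hypothetical and exist only for this illustration; mean, sd and sapply are base R functions.

```r
# A hypothetical helper that accepts any summary function as an argument
summarise_with <- function(x, f) f(x)

values <- c(2, 4, 6, 8)
summarise_with(values, mean)   # a built-in function passed like data
summarise_with(values, sd)

# A user-defined function can be passed in exactly the same way
value_range <- function(x) max(x) - min(x)
summarise_with(values, value_range)

# Functions can also be returned as the result of other functions
make_multiplier <- function(k) function(x) x * k
to_seconds <- make_multiplier(60)
to_seconds(c(1, 2.5))

# Functions can be stored in a list and applied in turn
stats <- list(mean = mean, sd = sd, range = value_range)
sapply(stats, function(f) f(values))
```

Because the function is just another argument, the same helper works with any summary function without being rewritten.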
Data Structures in R
When programming in C, C++ and similar languages, the user must specify the data type of every object. This allows the compiler to perform type-specific optimization, but it is tedious to write and tends to produce fragile code. Essentially, there is a trade-off between processor run time and the developer’s thinking time. R’s data types let programmers work with data in a natural form without first declaring a particular predefined structure for it.
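The following sketch shows this in a few lines, assuming a small made-up order table. All variable names and values are hypothetical; the point is only that no type declarations are needed.

```r
# No type declarations: R infers the type from the value assigned
count <- 3L              # integer
price <- 9.99            # numeric (double)
label <- "widget"        # character

class(count)
class(price)
class(label)

# A list can hold values of different types side by side
item <- list(name = label, price = price, in_stock = TRUE)

# A data frame stores tabular data, with one type per column
orders <- data.frame(
  product    = c("widget", "gadget"),
  quantity   = c(3L, 5L),
  unit_price = c(9.99, 24.50)
)

# New columns can be derived without declaring their type either
orders$total <- orders$quantity * orders$unit_price
str(orders)
```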
Graphics
Graphics can undoubtedly be said to be central to data science. Humans perceive visual information far better than raw numbers. It is easy to produce publication-quality graphs in R. A very popular package for creating elegant graphics is ggplot2 by Hadley Wickham.
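As a small, hedged example, the sketch below draws a scatter plot with ggplot2. It assumes the ggplot2 package is already installed and uses mtcars, a data set that ships with R. The axis labels, title and output file name are illustrative choices.

```r
library(ggplot2)

# Scatter plot of fuel efficiency against car weight (built-in mtcars data)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    title = "Fuel efficiency versus weight"
  ) +
  theme_minimal()

# Draw the plot on the active graphics device
print(p)

# Optionally save it to a file (hypothetical file name):
# ggsave("mpg_vs_weight.pdf", plot = p, width = 6, height = 4)
```

The plot is built up in layers, so labels or a theme can be changed without touching the data mapping.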
End Notes
I hope this article helps you in your journey with data science. Once you’re comfortable with R, you will be well placed to move ahead confidently and add more tools to your skill set, such as Python, D3 and Tableau.
Choosing which tool to take up depends on a variety of factors, such as your background in programming or statistics, the output you want, or simply availability. There cannot be one universal answer. However, R lets you carry out a wide variety of tasks, and that set keeps growing as CRAN and other repositories add packages. R is also the first choice of academics for publishing implemented techniques as R packages. The flexibility and ease of doing things in R appeal to a broad audience, which is clearly reflected in the results of KDnuggets’ 18th annual poll of data science software usage and several other similar surveys.
Of course, mastering R or any other tool should not be the ultimate goal. Keep in mind that languages and tools are just a means of execution. One should be thorough with techniques and methodologies and, at the same time, up to speed with the latest tools at one’s disposal.