Statistics
R or Python?
One of the first major decisions when getting into data analysis is which language to choose. Python and R are the two most widely used programming languages in statistics and data science, and both are great choices. Below you will find some condensed information about how each is used for statistics. You might also want to look at this infographic by DataCamp, which provides helpful general guidance on both languages and can be a good starting point.
R for statistics
R was initially used mostly in academia and is nowadays popular among social science scholars, statisticians, engineers, and scientists without strong computer programming skills. It is great for exploratory data analysis, and all kinds of statistical tests and models can be implemented easily.
Usability
- R's visualization tools can be very eye-pleasing compared to other programming libraries
- Statistical models can be built using very few lines of code
Advantages
- Great for statistical models and tests as well as exploratory data analysis
- Very popular for visualization
- A good range of functionality for data science
Disadvantages
- In general, less popular than Python and less of a multifunctional tool beyond data science
- Compared to Python, not as popular for sophisticated NLP applications
- For some tasks, R can be comparatively slow, particularly when the code is poorly written
Popular libraries
- Data manipulation: dplyr, tidyr and data.table
- String manipulation: stringr
- Time series: zoo
- Machine learning: caret (short for Classification And REgression Training), which contains tools for data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation
- Visualization: ggplot2
- NLP and text analysis: quanteda, text2vec, tidytext
IDE
Writing code can be messy, so programmers rely on shortcuts and helpful tools to get the work done. An IDE (integrated development environment) combines several functions and tools, such as a code editor, syntax highlighting, autocomplete, and debugging, to make your life easier. It is a must-have, especially if you are a beginner and everything just mentioned sounds confusing. For R, RStudio is by far the most popular IDE, and it’s great!
Ecosystems
- A package is a collection of R functions and compiled code, usually provided by the community. This work by others lets you execute complex tasks with just a few lines of code, as you profit from the shared effort of your fellow R coders.
- Packages are available on CRAN and on the GitHub pages of their developers
- For more advanced users, R has a rich ecosystem of functionality for chaining the steps of a data analysis workflow together smoothly
Communities
Python + pandas
Python is used by programmers who work with data analysis and statistical techniques, as well as by software developers in general. It can serve as a single tool that integrates with every part of your workflow. Python is also flexible enough for beginners to build something entirely new.
Usability
- Coding and debugging are easy thanks to the simple syntax
- People who come from a computer science background may find Python more natural than R
- Indentation directly reflects the structure of the code
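To illustrate the point about indentation, here is a minimal sketch. The function name `classify` and the sample numbers are made up for this example; the only thing it is meant to show is that the indentation level decides which lines belong to the loop, to the `if`/`else` branches, or to the function as a whole.

```python
def classify(values, threshold):
    """Split values into those at or above the threshold and those below it."""
    high, low = [], []
    for v in values:
        # Everything indented under the `for` runs once per value.
        if v >= threshold:
            high.append(v)
        else:
            low.append(v)
    # Dedented back to function level: runs once, after the loop has finished.
    return high, low


# Hypothetical input values, chosen only for illustration.
high, low = classify([3, 8, 5, 10], threshold=5)
print(high, low)
```

Moving the `return` line one level deeper would place it inside the loop and end the function after the first value, so the visual layout and the behavior always match.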
Advantages
- General-purpose programming language that is also strong in data analysis
- Popular among users for its readable code
- Great for math and statistical computation
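As a small, hedged illustration of the last point, the sketch below computes a few descriptive statistics using only Python's built-in `statistics` module. The variable names and the sample values are invented for this example and do not come from any real data set. For larger tabular data, a library such as pandas would likely be the more common choice.

```python
import statistics

# Hypothetical sample: reaction times in milliseconds (made-up numbers).
reaction_ms = [310, 295, 340, 305, 330, 300]

mean_ms = statistics.mean(reaction_ms)      # arithmetic mean
median_ms = statistics.median(reaction_ms)  # middle value of the sorted data
stdev_ms = statistics.stdev(reaction_ms)    # sample standard deviation (n - 1)

print(f"mean={mean_ms:.1f} median={median_ms:.1f} stdev={stdev_ms:.1f}")
```

Note that `statistics.stdev` treats the data as a sample; `statistics.pstdev` would be the counterpart if the values represented an entire population.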
Disadvantages
- Doesn’t have as many libraries dedicated to data science as R
- Default visualization results are often not as eye-pleasing compared to R
Popular libraries
IDE
Similar to RStudio for R, Python also has IDEs that can support your workflow and are especially useful for beginners. Popular examples are Jupyter notebooks, Spyder, and VS Code.
Ecosystems
- Python code is clear and easy to interpret
- Good for data science and machine learning, which can also be integrated with web frameworks
- The Python Package Index (PyPI) and Anaconda are repositories of Python software that host a very wide range of libraries, and users can also contribute their own packages
Communities