Analytics - Perceptive Analytics

Recreating John Snow’s Viz in Tableau

admin — Tue, 19 Jan 2021 11:53:38 +0000

London cholera outbreak viz Tableau workbook

The post Recreating John Snow’s Viz in Tableau first appeared on Perceptive Analytics.

Tableau Web Data Connector for FACTSET

admin — Thu, 11 Oct 2018 10:34:10 +0000

The project was to create Tableau Dashboards for visualizing financial information of different stocks/tickers that would be used by Financial Analysts. The data had to be obtained through an Application Programming Interface (API) of a third-party service provider called FactSet. The data had to be obtained on the fly as and when the dashboard user inputs a specific ticker.
What is a Web Data Connector?
A Web Data Connector (WDC) is a data connection option in Tableau that can be used to fetch information from web. A WDC is a web page with HTML, CSS & JavaScript. Whenever a user inputs a ticker, the WDC would take that ticker and make multiple AJAX calls to the third-party API asking for the required data. The third-party API’s servers would check authorization & authentication. If successful, they return the requested data in JSON / XML formats. The WDC parses the received JSON / XML information and transforms them into a tabular structure which is submitted to Tableau. Tableau then receives this information and populates the dashboard with updated data.
Project Experience & Learning

Part I: To build the dashboards in Tableau, we built a WDC that can fetch the required data from FactSet’s servers. While developing the WDC, we used JavaScript extensively to get the ticker input by user, frame the appropriate URL, make the AJAX calls, receive information, parse the response & transform it into the requires shape so that it can be given to Tableau. HTML and CSS were used to build the user interface of the WDC where the Tableau user can input the ticker data. We dealt with Cross Origin Resource Sharing (CORS) limitations to fetch the required data. To overcome CORS, we had used a proxy server to route our request.

Part II: We built multiple dashboards to display the fetched data in a way useful for the analyst to understand the trading history of the stock/ticker, look into its financials, look into the broker estimates for future & make buy/hold/sell decision.

How to Get in touch
You can reach out to us by emailing cs AT perceptive-analytics.com

Tableau Expert

The post Tableau Web Data Connector for FACTSET first appeared on Perceptive Analytics.

Basic Statistics in Tableau: Correlation

Saneesh V — Thu, 30 Aug 2018 09:00:21 +0000

Statistics in Tableau

Data in the right hands can be extremely powerful and can be a key element in decision making. American statistician, W. Edwards Deming quoted that, “In God we trust. Everyone else, bring data”. We can employ statistical measures to analyze data and make informed decisions. Tableau enables us to calculate multiple types of statistical measures like residuals, correlation, regression, covariance, trend lines and many more.

Today let’s discuss how people misunderstand causation and correlation using Tableau.

Correlation and Causation

Correlation is a statistical measure that describes the magnitude and direction of a relationship between two or more variables.

Causation shows that one event is a result of the occurrence of another event, which demonstrates a causalrelationship between the two events. This is also known as cause and effect.

Types of correlation:

1 → Positive correlation.
-1 → Negative Correlation.
0 → No correlation.

Why are correlation and causation important?

The objective of analysing data is to identify the extent by which a variable relates to another variable.

Examples of Correlation and Causation

Vending machines and obesity in schools: people gain weight due to junk food. One important source of junk food in schools is vending machines. So if we remove vending machines from schools obesity must reduce, right? But it isn’t true. Research shows that children who move from schools without vending machines to schools with vending machines don’t gain weight. Here we can find a correlation between children who were overweight and eating junk food from vending machines. In actuality, the “causal” point (which is the removal of vending machines from schools) has a negligible effect on obesity.
Ice cream sales and temperature: If we observe ice cream sales and temperature in the summer, we can determine that they are causally related; i.e. there is a strong correlation between them. As temperature increases, ice cream consumption also increases. Understanding correlation and causation allows people to understand data better.

Now let’s explore correlation using Tableau. We are going to use the orders table from the superstore dataset which comes default with Tableau.

Before going further let’s understand how to calculate the correlation coefficient ‘r’.

We can easily understand the above formula by breaking it into pieces.

In Tableau, we can represent the above formula as 1/SIZE() -1 where SIZE is function in Tableau.

We can use WINDOWSUM function for doing this summation in Tableau.

x_i is the sum of profit and x-bar is the mean of profit, which is window average of sum of profit, and s_xis standard deviation of profit. That means that we need to subtract mean from sum of profit and divide that by standard deviation.

(SUM([Profit])-WINDOW_AVG(SUM([Profit]))) / WINDOW_STDEV(SUM([Profit])))

This is similar to the formula above but we only need to swap profit with sales.

(SUM([Sales])-WINDOW_AVG(SUM([Sales]))) / WINDOW_STDEV(SUM([Sales])))

Now we have to join all these formulae to get the value of the correlation coefficient of r. Be careful while using parenthesis or you may face errors. Here is our final formula to calculate r.

1/(SIZE()-1) * WINDOW_SUM(( (SUM([Profit])-WINDOW_AVG(SUM([Profit]))) / WINDOW_STDEV(SUM([Profit]))) * (SUM([Sales])-WINDOW_AVG(SUM([Sales]))) / WINDOW_STDEV(SUM([Sales])))

Let’s implement this in Tableau to see how it works. Load superstore data into Tableau before getting started.

After loading the superstore excel file into Tableau, examine the data in the orders sheet. You can see that it contains store order details complete with sales and profits. We will use this data to find correlation between profit and sales.

Let’s get our hands dirty by making a visualization. Go to sheet1 to get started. I made a plot between profit and sales per category.

Now in order to find the correlation between profit and sales, we need to use our formula to make a calculated field which serves our purpose.

Now drag and drop our calculated field onto the colors card and make sure to compute using customer name as we are using it for detailing.

Here we can see the strength of the relationship between profits and sales of data per category; the darker the color, th he stronger the correlation.

Next we’ll add trend lines to determine the direction of forecasted sales.

These trend lines help demonstrate which type of correlation (positive, negative or zero correlation) there is in our data. You can explore some more and gain additional insights if you add different variables like region.

From this analysis we can understand how two or more variables are correlated with each other. We begin to understand how each region’s sales and profits are related.

Let’s see how a correlation matrix helps us represent the relationship between multiple variables.

A correlation matrix is used to understand the dependence between multiple variables at same time. Correlation matrices are very helpful in obtaining insights between the same variables or commodities. They are very useful in market basket analysis.

Let’s see how it works in Tableau. Download the “mtcars” dataset from this link. After downloading it, connect it to Tableau and explore the dataset.

The dataset has 35 variables where each row represents one model of car and each column represents an attribute of that car.

Variables present in dataset:

Mpg = Miles/gallon.

Cyl = Number of Cylinders.

Disp = Displacement (cubic inches)

Hp = Gross Horsepower

Drat = Rear axle ratio

Wt = Weight (lb/1000)

Qsec = ¼ mile time

Vs = V/Sec

Am = Transmission (0 = automatic, 1 = manual)

Gear = Number of forward gears

Carb =Number of Carburetors

Let’s use these variables to make our visualization. I made this amazing visualization showing correlation between models by referring to Bore Beran’s blog article, in which he explained how to make this visualization which helps us understand more about using Tableau to understand correlation.

Conclusion

We must keep in mind that if we want to measure the dependence between two variables, correlation is the best way to do it. A correlation value always lies between -1 and 1. The closer the value of the correlation coefficient is to 1, the stronger their relationship. We must remember that correlation is not causation and many people misunderstand this. There are many more relations and insights that can be unlocked from this dataset. Explore more by experimenting with this dataset using Tableau. Practice to be perfect.

The post Basic Statistics in Tableau: Correlation first appeared on Perceptive Analytics.

Tableau for Marketing: Become a Segmentation Sniper

Saneesh V — Wed, 29 Aug 2018 09:00:07 +0000

Did you know that Netflix has over 76,000 genres to categorize its movie and tv show database? I am sure this must be as shocking to you as this was to me when I read about it first. Genres, rather micro-genres, could be as granular as “Asian_English_Mother-Son-Love_1980.” This is the level of granularity to which Netflix has segmented its product offerings, which is movies and shows.

But do you think is it necessary to go to this level to segment the offerings?

I think the success of Netflix answers this question on its own. Netflix is considered to have one of the best recommendation engines. They even hosted a competition on Kaggle and offered a prize money of USD 1 million to the team beating their recommendation algorithm. This shows the sophistication and advanced capabilities developed by the company on its platform. This recommendation tool is nothing but a segmentation exercise to map the movies and users. Sounds easy, right?

Gone are the days when marketers used to identify their target customers based on their intuition and gut feelings. With the advent of big data tools and technologies, marketers are relying more and more on analytics software to identify the right customer with minimal spend. This is where segmentation comes into play and makes our lives easier. So, let’s first understand what is segmentation? and why do we need segmentation?

Segmentation, in very simple terms, is grouping of customers in such a way that that customers falling into one segment have similar traits and attributes. The attributes could be in terms of their likings, preference, demographic features or socio-economic behavior. Segmentation is mainly talked with respect to customers, but it can refer to products as well. We will explore few examples as we move ahead in the article.

With tighter marketing budgets, increasing consumer awareness, rising competition, easy availability of alternatives and substitutes, it is imperative to use marketing budgets to prudently to target the right customers, through the right channel, at the right time and offer them the right set of products. Let’s look at an example and understand why segmentation is important for marketers.

There is an e-commerce company which is launching a new service for a specific segment of customers who shop frequently and whose ticket size is also high. For this, the company wants to see which all customers to target for the service. Let’s first look at the data at an aggregate level and then further drill down to understand in detail. There are 5 customers for whom we want to evaluate the spend. The overall scenario is as follows:

Should the e-commerce company offer the service to all the five customers?

Who is the right customer to target for this service? Or which is the right customer segment to target?

We will see the details of each of the customers and see the distribution of data.

Looking at the data above, it looks like Customer 1 and Customer 2 would be the right target customers for company’s offering. If we were to segment these 5 customers into two segments, then Customer 1 and Customer 2 would fall in one segment because they have higher total spend and higher number of purchases than the other three customers. We can use Tableau to create clusters and verify our hypothesis. Using Tableau to create customer segments, the output would look like as below.

Customer 1 and customer 2 are part of cluster 1; while customer 3, customer 4 and customer 5 are part of cluster 2. So, the ecommerce company should focus on all the customers falling into cluster 1 for its service offering.

Let’s take another example and understand the concept further.

We will try to segment the countries in the world by their inbound tourism industry (using the sample dataset available in Tableau). Creating four segments we get the following output:

There are few countries which do not fall into any of the clusters because data for those countries is not available. Looking at clusters closely, we see that the United States of America falls in the cluster 4; while India, Russia, Canada, Australia, among others fall in the cluster 2. Countries in the Africa and South America fall in the cluster 1; while the remaining countries fall in the cluster 3. Thus, it makes it easier for us to segment countries based on certain macro-economic (or other) parameters and develop a similar strategy for countries in the same cluster.

Now, let’s go a step further and understand how Tableau can help us in segmentation.

Segmentation and Clustering in Tableau

Tableau is one of the most advanced visualization and business intelligence tool available in the market today. It provides a lot of interactive and user-friendly visualizations and can handle large amounts of data. It can handle millions of rows at once and provides connection support to almost all the major databases in the market.

With the launch of Tableau 10 in 2016, the company offered a new feature of clustering. Clustering was once considered a technique to be used only by statisticians and advanced data scientists, but with this new feature in Tableau it becomes as easy as simple drag and drop. This feature can provide a big support to marketers in segmenting their customers and products, and get better insights.

Steps to Becoming a Segmentation Sniper

Large number of sales channels, increase in product options and rise in advertisement cost has made it inevitable not only for marketers but for almost all the departments to analyze customer data and understand their behavior to maintain market position. We will now take a small example and analyze the data using Tableau to understand our customer base and zero-in on the target customer segment.

There is a market research done by a publishing company which is mainly into selling of business books. They want to further expand their product offerings to philosophy books, marketing, fiction and biographies. Their objective is to use customer responses and find out which age group like which category of books the most.

For an effective segmentation exercise, one should follow the below four steps.

Understand the objective
Identify the right data sources
Creating segments and micro-segments
Reiterate and refine

We will now understand each of the steps and use Tableau, along with, to see the findings at every step.

Understand the objective

Understanding the objective is the most important thing that you should do before starting the segmentation exercise. Having a clear objective is the most imperative thing because it will help you channelize your efforts towards the objective and prevent you from just spending endless hours in plain slicing and dicing. In our publishing company example, the objective is to find out the target age group which the company should focus on in each of the segments, namely philosophy, marketing, fiction and biography. This will help the publishing company in targeting its marketing campaign to specific set of customers for each of the genres. Also, it will help the company in identifying the target age group that like both business and philosophy or business and marketing, or similar other groups.

Identify the right data sources

In this digital age, data is spread across multiple platforms. Not using the right data sources could prove to be as disastrous as not using analytics at all. Customer data residing in CRM systems, operational data in SAP systems, demographic data, macro-economic data, financial data, social media footprint – there could be endless list of data sources which could prove to be useful in achieving our objective. Identifying right variables from each of the sources and then integrating them to form a data lake forms the basis of further analysis.

In our example, dataset is not as complex as it might be in real life scenarios. We are using a market survey data gathered by a publishing company. The data captures the age of customer and their liking/disliking for different genres of books, namely philosophy, marketing, fiction, business and biography.

Creating segments and micro-segments

At this stage, we have our base data ready in the analyzable format. We will start analyzing data and try to form segments. Generally, you should start by exploring relationships in the data that you are already aware of. Once you establish few relationships among different variables, keep on adding different layers to make it more granular and specific.

We will start by doing some exploratory analysis and then move on to add further layers. Let’s first see the results of the market survey at an aggregate level.

From the above analysis, it looks like fiction is the most preferred genre of books among the respondents. But before making any conclusions, let’s explore a little further and move closer to our objective.

If we split the results by age group and then analyze, results will look something like the below graph.

In the above graph, we get further clarity on the genre preferences by respondents. It gives us a good idea as to which age group prefers which genre. Fiction is most preferred by people under the age of 20; while for other age groups fiction is not among the top preference. If we had only taken the average score and went ahead with that, we would have got skewed results. Philosophy is preferred by people above the age of 40; while others prefer business books.

Now moving a step ahead, for each of the genre we want to find out the target age group.

The above graph gives us the target group for each of the genres. For biography and philosophy genres, people above the age of 40 are the right customers; while for business and marketing, age group 20-30 years should be the target segment. For fiction, customers under the age of 20 are the right target group.

Reiterate and refine

In the previous section, we created different customer segments and identified the target segment for publishing company. Now, let’s say we need to move one more step ahead and identify only those age groups and genres which have overlap with business genres. To put it the other way, if the publishing company was to target only one new genre (remember, they already have customer base for business books) and one age group, which one should it be?

Using Tableau to develop a relation amongst the different variables, our chart should look like the one below.

Starting with the biography genre, age group 30-40 years comes closest to our objective, i.e., people in this age group like both biography and business genre (Biography score – 0.22, Business score – 0.31). Since, we have to find only one genre we will further explore the relationships.

For fiction, there is no clear overall with any of the age groups. For marketing, age group 20-30 year looks to be clear winner. The scores for the groups are – marketing – 0.32, business – 0.34. The relation between philosophy and business is not as strong as it is for business and marketing.

To sum it up, if the publishing company was to launch one more genre of books then it should be marketing and target customer group should be in the range of 20-30 years.

Such analysis can be refined further depending on the data we have. We can add gender, location, educational degree, etc. to the analysis and further refine our target segment to make our marketing efforts more focused.

I think after going through the examples in the article, you can truly appreciate the level of segmentation that Netflix has done and it clearly reflects the reason behind its success.

The post Tableau for Marketing: Become a Segmentation Sniper first appeared on Perceptive Analytics.

Tableau Sales Dashboard Performance

Saneesh V — Tue, 28 Aug 2018 09:00:24 +0000

Business heads often use KPI tracking dashboards that provide a quick overview of their company’s performance and well-being. A KPI tracking dashboard collects, groups, organizes and visualizes the company’s important metrics either in a horizontal or vertical manner. The dashboard provides a quick overview of business performance and expected growth.

An effective and visually engaging way of presenting the main figures in a dashboard is to build a KPI belt by combining text, visual cues and icons. By using KPI dashboards, organizations can access their success indicators in real time and make better informed decisions that support long-term goals.

What is a KPI?

KPIs (i.e. Key Performance Indicators) are also known as performance metrics, performance ratios or business indicators. A Key Performance Indicator is a measurable value that demonstrates how effectively a company is achieving key business objectives.

A sales tracking dashboard provides a complete visual overview of the company’s sales performance by year, quarter or month. Additional information such as the number of new leads and the value of deals can also be incorporated.

Example of KPIs on a Sales Dashboard:

Number of New Customers and Leads
Churn Rate (i.e. how many people stop using the product or service)
Revenue Growth Rate
Comparison to Previous Periods
Most Recent Transactions
QTD (quarter to date) Sales
Profit Rate
State Wise Performance
Average Revenue for Each Customer

Bringing It All Together with Dashboards and Stories

An essential element of Tableau’s value is delivered via dashboards. Well-designed dashboards are visually engaging and draw in the user to play with the information. Dashboards can facilitate details-on-demand that enable the information consumer to understand what, who, when, where, how and perhaps even why something has changed.

Best Practices to Create a Simple and Effective Dashboard to Observe Sales Performance KPIs

A well-framed KPI dashboard instantly highlights problem areas. The greatest value of a modern business dashboard lies in its ability to provide real-time information about a company’s sales performance. As a result, business leaders, as well as project teams, are able to make informed and goal-oriented decisions, acting on actual data instead of gut feelings. The choice of chart types on a dashboard should highlight KPIs effectively.

Bad Practices Examples in a Sales Dashboard:

A sales report displaying 12 months of history for twenty products; 12 × 20 = 240 data points.
- Multiple data points do not enable the information consumer to effectively discern trends and outliers as easily as a time-series chart comprised of the same information
The quality of the data won’t matter if the dashboard takes five minutes to load
The dashboard fails to convey important information quickly
The pie chart has too many slices, and performing precise comparisons of each product sub-category is difficult
The cross-tab at the bottom requires that the user scroll to see all the data

Now, we will focus on the best practices to create an effective dashboard to convey the most important sales information. Tableau is designed to supply the appropriate graphics and chart types by default via the “Show me” option.

I. Choose the Right Chart Types

With respect to sales performance, we can use the following charts to show the avg. sales, profits, losses and other measures.

Bar charts to compare numerical data across categories to show sales quantity, sales expense, sales revenue, top products and sales channel etc. This chart represents sales by region.

Line charts to illustrate sales or revenue trends in data over a period of time:

A Highlight table allows us to apply conditional formatting (a color scheme in either a continuous or stepped array of colors from highest to lowest) to a view.

Use Scatter plots or scatter graphs to investigate the relationship between different variables or to observe outliers in data. Example: sales vs profit:

Use Histograms to see the data distribution across groups or to display the shape of the sales distribution:

Advanced Chart Types:

Use Bullet graphs to track progress against a goal, a historical sales performance or other pre-assigned thresholds:

The Dual-line chart (or dual-axis chart), is an extension of the line chart and allows for more than one measure to be represented with two different axis ranges. Example: revenue vs. expense
The Pareto chart is the most important chart in a sales analysis. The Pareto principle is also known as 80-20 rule; i.e roughly 80% of the effects come from 20% of the causes.

When performing a sales analysis, this rule is used for detecting the 80% of total sales derived from 20% of the products.

Use Box plots to display the distribution of data through their quartiles and to observe the major data outliers

Tableau Sales Dashboard

Here is a Tableau dashboard comprised of the aforementioned charts. This interactive dashboard enables the consumer to understand sales information by trend, region, profit and top products.

II. Use Actions to filter instead of Quick Filters

Using actions in place of Quick Filters provides a number of benefits. First, the dashboard will load more quickly. Using too many Quick Filters or trying to filter a very large dimension set can slow the load time because Tableau must scan the data to build the filters. The more quick filters enabled on the dashboard, the longer it will take the dashboard to load.

III. Build Cascading Dashboard Designs to Improve Load Speed

By creating a series of four-panel, four cascading dashboards the load speed was improved dramatically and the understandability of the information presented was greatly enhanced. The top-level dashboard provided a summary view, but included filter actions in each of the visualizations that allowed the executive to see data for different regions, products, and sales teams.

IV. Remove All Non-Data-Ink

Remove any text, lines, or shading that doesn’t provide actionable information. Remove redundant facts. Eliminate anything that doesn’t help the audience understand the story contained in the data.

V. Create More Descriptive Titles for Each Data Pane

Adding more descriptive data object titles will make it easier for the audience to interpret the dashboard. For example:

Bullet Graph—Sales vs. Budget by Product
Sparkline—Sales Trend
Cross-tab—Summary by Product Type
Scatter Plot—Sales vs. Marketing Expense

VI. Ensure That Each Worksheet Object Fits Its Entire View

When possible, change the graphs fit from “Normal” to “Entire View” so that all data can be displayed at once.

VII. Adding Dynamic Title Content

There is an option to use dynamic content and titles within Tableau. Titles can be customized in a dynamic way so that when a filter option is selected, the title and content will change to reflect the selected value. A dynamic title expresses the current content. For example: if the dashboard title is “Sales 2013” and the user has selected year 2014 from the filter, the title will update to “Sales 2014”.

VIII. Trend Lines and Reference Lines

Visualizing granular data sometimes results in random-looking plots. Trend lines help users interpret data by fitting a straight or curved line that best represents the pattern contained within detailed data plots. Reference lines help to compare the actual plot against targets or to create statistical analyses of the deviation contained in the plot; or the range of values based on fixed or calculated numbers.

IX. Using Maps to Improve Insight

Seeing the data displayed on a map can provide new insights. If an internet connection is not available, Tableau allows a change to locally-rendered offline maps. If the data includes geographic information, we can very easily create a map visualization.

This map represents sales by state. The red color represents negative numbers and the green color represents positive numbers.

X. Developing an Ad Hoc Analysis Environment

Tableau facilitates ad hoc analysis in three ways:

Generating new data with forecasts
Designing flexible views using parameters
Changing or creating designs in Tableau Server

XI. Using Filters Wisely

Filters generally improve performance in Tableau. For example, when using a dimension filter to view only the West region, a query is passed to the underlying data source, resulting in information returned for only that region. We can see the sales performance of the particular region in the dashboard. By reducing the amount of data returned, performance improves.

Enhance Visualizations Using Colors, Labels etc.

I. Using colors:

Color is a vital way of understanding and categorizing what we see. We can use color to tell a story about the data, to categorize, to order and to display quantity. Color helps with distinguishing the dimensions. Bright colors pop at us, and light colors recede into the background. We can use color to focus attention on the most relevant parts of the data visualization. We choose color to highlight some elements over others, and use it to convey a message.

Red is used to denote smaller values, and blue or green is used to denote higher values. Red is often seen as a warning color to show the loss or any negative number whereas blue or green is seen as a positive result to show profit and other positive values.

Without colors:

With colors:

II. Using Labels:

Enable labels to call out marks of interest and to make the view more understandable. Data labels enable comprehension of exact data point values. In Tableau, we can turn on mark labels for marks, selected marks, highlighted marks, minimum and maximum values, or only the line ends.

Without labels:

With labels:

Using Tableau to enhance KPI values

The user-friendly interface allows non-technical users to quickly and easily create customized dashboards. Tableau can connect to nearly any data repository, from MS Excel to Hadoop clusters. As mentioned above, using colors and labels, we can enhance visualization and enhance KPI values. Here are some additional ways by which we can enhance the values especially with Tableau features.

I. Allow for Interactivity

Playing, exploring, and experimenting with the charts is what keeps users engaged. Interactive dashboards enable the audiences to perform basic analytical tasks such as filtering views, drilling down and examining underlying data – all with little training.

II. Custom Shapes to Show KPIs

Tableau shapes and controls can be found in the marks card to the right of the visualization window. There are plenty of options built into Tableau that can be found in the shape palette.

Custom shapes are very powerful when telling a story with visualizations in dashboards and reports. We can create unlimited shape combinations to show mark points and create custom formatting. Below is an example that illustrates how we can represent the sales or profit values with a symbolic presentation.

Here green arrows indicate good sales progress and red arrows indicate a fall in Year over Year Sales by Category

III. Creating Calculated Fields

Calculated fields can be used to create new dimensions such as segments, or new measures such as ratios. There are many reasons to create calculated fields in Tableau. Here are just a few:

Segmentation of data in new ways on the fly
Adding a new dimension or a new measure before making it a permanent field in the underlying data
Filtering out unwanted results for better analyses
Using the power of parameters, putting the choice in the hands of end users
Calculating ratios across many different variables in Tableau, saving valuable database processing and storage resources

IV. Data-Driven Alerts

With version 10.3, Tableau has introduced a very useful feature: Data-Driven Alerts. We may want to use alerts to notify users or to remind that a certain filter is on and want to be alerted somehow if performance is ever higher or lower than expected. Adding alerts to dashboards can help elicit necessary action by the information consumer. This is an example of a data driven alert that we can set while displaying a dashboard or worksheet.

In a Tableau Server dashboard, we can set up automatic mail notifications to a set of recipients when a certain value reaches a specific threshold.

Summary

For an enterprise, a dashboard is a visual tool to help track, monitor and analyze information about the organization. The aim is to enable better decision making.

A key feature of sales dashboards in Tableau is interactivity. Dashboards are not simply a set of reports on a page; they should tell a story about the business. In order to facilitate the decision-making process, interactivity is an important part of assisting the decision-maker to get to the heart of the analysis as quickly as possible.

The post Tableau Sales Dashboard Performance first appeared on Perceptive Analytics.

Tableau Filtering Actions Made Easy

Saneesh V — Mon, 27 Aug 2018 09:00:06 +0000

TABLEAU FILTERING ACTIONS MADE EASY

This is a guest post provided by Vishal Bagla, Chaitanya Sagar, and Saneesh Veetil of Perceptive Analytics.

Tableau is one of the most advanced visualization tools available on the market today. It is consistently ranked as a ‘Leader’ in Gartner’s Magic Quadrant. Tableau can process millions of rows of data and perform a multitude of complex calculations with ease. But sometimes analyzing large amounts of data can become tedious if not performed properly. Tableau provides many features that make our lives easier with respect to handling datasets big and small, which ultimately enables powerful visualizations.

Tableau’s filtering actions are useful because they create subsets of a larger dataset to enable data analysis at a more granular level. Filtering also aids user comprehension of data. Within Tableau data can be filtered at the data source level, sheet level or dashboard level. The application’s filtering capabilities enable data cleansing and can also increase processing efficiency. Furthermore, filtering aids with unnecessary data point removal and enables the creation of user defined date or value ranges. The best part is that all of these filtering capabilities can be accessed by dragging and dropping. Absolutely no coding or elaborate data science capabilities are required to use these features in Tableau.

In this article, we will touch upon the common filters available in Tableau and how they can be used to create different types of charts. After reading this article, you should be able to understand the following four filtering techniques in Tableau:

Keep Only/Exclude Filters
Dimension and Measure Filters
Quick Filters
Higher Level Filters

We will use the sample ‘Superstore’ dataset built in Tableau to understand these various functions.

1. Keep Only/Exclude Filters in Tableau

These filters are the easiest to use in Tableau. You can filter individual/multiple data points in a chart by simply selecting them and choosing the “Keep Only” or “Exclude” option. This type of filter is useful when you want to focus on a specific set of values or a specific region in a chart.

While using the default Superstore dataset within Tableau, if we want to analyze sales by geography, we’d arrive at the following chart.

However, if we want to keep or exclude data associated with Washington state, we can just select the “Washington” data point on the map. Tableau will then offer the user the option to “Keep Only” or “Exclude”. We can then simply choose the option that fits our need.

2. Dimension and Measure Filters

Dimension and measure filters are the most common filters used while working with Tableau. These filters enable analysis at the most granular level. Let’s examine the difference between a dimension filter and a measure filter.

Dimension filters are applied to data points which are categorical in nature (e.g. country names, customer names, patient names, products offered by a company, etc.). When using a dimension filter, we can individually select each of the values that we wish to include or exclude. Alternatively, we can identify a pattern for the values that we wish to filter.

Measure filters can be applied to data points which are quantitative in nature, (e.g. sales, units, etc.). For measure filters, we generally work with numerical functions such as sum, average, standard deviation, variance, minimum or maximum.

Let’s examine dimension filters using the default Tableau Superstore dataset. The chart below displays a list of customers and their respective sales.

Let’s examine how to exclude all customers whose names start with the letter ‘T’ and then subsequently keep only the top 5 customers by Sales from the remaining list.

One way would be to simply select all the customers whose names start with ‘T’ and then use the ‘Exclude’ option to filter out those customers. However, this is not a feasible approach when we have hundreds or thousands of customers. We will use a dimension filter to perform this task.

When you move the Customer Name field from the data pane to the filters pane, a dialogue box like the one shown below will appear.

As shown in the above dialogue box, you can select all the names starting with “T” and exclude them individually. The dialogue box should look like the one shown below.

The more efficient alternative is to go to the Wildcard tab in the dialogue box and select the “Exclude” check box. You can then choose the relevant option “Does not start with”.

To filter the top 5 customers by sales, right click on “Customer Name” in the Filters area, select “Edit Filter” and then go to the “Top” tab in the filter dialogue box. Next, choose the “By Field” option. Make your selections align to the following screenshot.

After performing the necessary steps, the output will yield the top 5 customers by sales.

Let’s move on to measure filtering within the same Tableau Superstore dataset. We’re going to filter the months where 2016 sales were above $50,000. Without a measure filter applied, our sales data for 2016 would look like the following:

To filter out the months where sales were more than $50,000, move the sales measure from the data pane to the filter pane. Observe the following:

Here, we can choose any one of the filter options depending upon our requirement. Let’s choose sum and click on “Next”. As shown below, we are provided with four different options.

We can then choose one of the following filter options:

Enter a range of values;
Enter the minimum value that you want to display using the “At least” tab;
Enter the maximum value that you want to display using the “At most” tab;
From the Special tab, select “all values”, “null values” or “non-null” values;

Per our example, we want to filter for sales that total more than $50,000. Thus, we will choose the “At least” tab and enter a minimum value of 50,000.

In the output, we are left with the six months (i.e. March, May, September, October, November, December) that have a sum of sales that is greater than $50,000.

Similarly, we can choose other options such as minimum, maximum, standard deviation, variance, etc. for measure filters. Dimension and measure filters make it very easy to analyze our data. However, if the dataset is very large, measure filters can lead to slow performance since Tableau needs to analyze the entire dataset before it filters out the relevant values.

3. Quick Filters

Quick filters are radio buttons or check boxes that enable the selection of different categories or values that reside in a data field. These filters are very intuitive and infuse your visualizations with additional interactivity. Let’s review how to apply quick filters in our Tableau sheet.

In our scenario, we have sales data for different product segments and different regions from 2014 to 2019. Our data looks like the following:

We want to filter the data by segments and see data for only two segments (Consumer and Corporate). One way to do this would be to use a dimension filter, but what if we want to compare segments and change the segment every now and then? In this scenario, a quick filter would be a useful addition to the visualization. To add a quick filter, right click on the “Segment” dimension in Marks pane and choose “Show Filter”.

Once we click on “Show Filter”, a box will appear on the right side of the Tableau screen. The box contains all constituent values of the Segment dimension. At this point, we could choose to filter on any segment value available in the quick filter box. If we were to select both Consumer and Corporate values, Tableau will display two charts instead of three.

Similarly, we can add other quick filters for region, country, ship status or any other dimension.

4. Higher Level Filters

Dimension, measure and quick filters are very easy to use and make the process of analyzing data hassle free. However, when multiple filters are used on a large data source, processing becomes slow and inefficient. Application performance degrades with each additional filter.

The right way to begin working with a large data source is to initially filter when making a connection to the data. Once the data is filtered at this stage, any further analysis will be performed on the remaining data subset; in this manner, data processing is more efficient. These filters are called Macro filters or Higher-Level filters. Let’s apply a macro level filter on our main data source.

We can choose the “Add” option under the Filters tab in top right corner of the Data Source window.

Once we click on “Add”, Tableau opens a window which presents an option to add various filters.

Upon clicking “Add” in the Edit Data Source Filters dialogue box, we’re presented with the entire list of variables in the dataset. We can then add filters to the one we select. Let’s say we want to add a filter to the Region field and include only the Central and East region in our data.

Observe that, our dataset is filtered at the data source level. Only those data points where the region is either Central or East will be available for our analyses. Let’s turn our attention back to the sales forecast visualization that we used to understand quick filters.

In the above window, we observe options for only “Central” and “East” in the Region Filter pane. This means that our filter applied at the data source level was successful.

Hopefully after reading this article you are more aware of both the importance and variety of filters available in Tableau. However, using unnecessary filters in unorthodox ways can lead to performance degradation and impact overall productivity. Therefore, always assess if you’re adding unnecessary options to your charts and dashboards that have the potential to negatively impact performance.

The post Tableau Filtering Actions Made Easy first appeared on Perceptive Analytics.

How to Perform the Principal Component Analysis in R

Saneesh V — Sun, 26 Aug 2018 09:00:37 +0000

Implementing Principal Component Analysis (PCA) in R

Give me six hours to chop down a tree and I will spend the first four sharpening the axe.

—- Abraham Lincoln

The above Abraham Lincoln quote has a great influence in the machine learning too. When it comes to modeling different machine learning models most of the time need to spend on data preprocessing and feature engineering stages.

The general idea of feature engineering is to identify the influence features over all the available features. So the identified features used to train the model.

Identifying the influence features doesn’t mean picking the features in an analytical way. Some times converting latent features into a meaningful feature. This is known as dimensionality reduction.

In this article, you will learn the basic concept to perform the dimensionality reduction with one famous approach know as Principal component analysis. In short PCA

Before we drive further let’s look at the table of content for this article.

Lifting the curse using principal component analysis
- Curse of dimensionality in layman’s terms
Shlen’s paper nuggets on Principal component analysis
PCA conceptual background
Principal component analysis implementation in R programming language
- Loading Iris data set
- Covariance matrix calculation
- Eigen values and Eigen vectors calculation
- PCA component calculation
- Importance of the components
Summary

The main intention of this article is to explain how to perform the principal component analysis in R.

So let’s begin.

Lifting the Curse using Principal Component Analysis

Many problems in Analytics are often visioned to have incomplete data with a few features. This led to the development of the common myth.

Having more features and more data will always improve the accuracy of solving the machine learning problem.

In reality, this is a curse more than a boon.

Sometimes, we face a situation. When there are a lot of features with few data points. Fitting a model in this scenario often leads to a low accuracy model even with many features. It has been encountered so many times that it is also called as the “Curse of dimensionality”.

Curse of dimensionality in layman’s terms

In layman’s language, the curse of dimensionality refers to the phenomenon when an increase in the number of features results in decrease the model accuracy.

We can understand the curse of dimensionality in another way also. Which is like increase in the number of features increase the model complexity (more precise the model complexity increase exponentially)

There are two ways to stay away from this curse of dimensionality.

Add more data to the problem.
Reduce the number of features in the data.

Adding data may not be possible in many scenarios as the data is often limited to what was collected. Reducing the number of features, a technique is known as

Adding data may not be possible in many scenarios as the data is often limited to what was collected. Reducing the number of features is more preferable. Such a technique is known as “Dimensionality reduction” is thus more preferable. PCA or Principal component analysis is a very popular dimensionality reduction technique.

Shlen’s paper nuggets on Principal component analysis

Principal component analysis aptly described in the famous Shlen’s paper.

Shlen’s Principal component analysis paper

The paper explains that even a simple problem such as recording the motion of a pendulum, which moves in only one direction. If one is unaware of the exact direction. The number of cameras required to record its movement will at least be three, given that we are able to place the cameras perpendicular to each other.

Shlens pca

" data-medium-file="https://i2.wp.com/dataaspirant.com/wp-content/uploads/2017/08/Shlens-pca.png?fit=300%2C205" data-large-file="https://i2.wp.com/dataaspirant.com/wp-content/uploads/2017/08/Shlens-pca.png?fit=345%2C236" data-lazy-loaded="true" />

Shlens pca

If we do not have the knowledge of keeping the cameras perpendicular to each other, then more cameras will be required. The problem then keeps on increasing and more and more cameras are required as available information keeps on decreasing.

The PCA technique transforms our features (or cameras) into a new dimensional space and represents it as a set of new orthogonal variables so that our problem is observed with a reduced set of features.

These orthogonal features are known as “Principal Components”.

In practice, our data is like the motion of the pendulum. If we had complete knowledge of the system, we will require a small number of features. Without the knowledge, we have to observe the system using a set of features which will convey maximum information if they are orthogonal. This is done using principal component analysis.

The new set of features which are produced after PCA transformation are linearly uncorrelated as they are orthogonal. Moreover, the features are arranged in decreasing order of their importance.

This means that the first principal component alone will explain a very large component of the data. The second principal component will explain less than the first major component but more than all other components. The last principal component will explain only a small change in the data. Typically, one can run PCA and take the top principal components such that they together explain most of the data. In most analytical problems, explaining 95-99% of the is considered very high.

PCA Conceptual Background

Before getting our hands dirty with PCA in R. Let’s understand the concept with a simple example.

Suppose we have a dataset with ‘m’ data points and ‘n’ features. We can call this dataset as a m*n matrix A

A₁₁ A₁₂ … A_1n

A= A₂₁ A₂₂ … A_2n

…………

A_m1 A_m2 … A_mn

We will transform this matrix A to A’ such that A’ is a m*k matrix and k. The number ‘k’ represents the subset of the number of transformed features.

How do we transform a given set of features into a new feature set such that they are orthogonal?

The answer is the eigenvectors of the matrix.

We know that eigenvectors are orthogonal to each other so transforming our features in the direction of the eigenvectors will also make them orthogonal. But wait! Before transforming a matrix, it is always recommended to normalize.

If the matrix is not normalized, our transformation will always be in favor of the feature with the largest scale of values. This is why PCA is sensitive to the relative scaling of the original variables.

We do that by observing the loadings of the PCA. Loadings are the factors which show what amount of load is shared by a particular feature. So, the component with maximum loading value can be considered as the most influential feature.

The loadings are arranged in such a manner that the higher significance factors are placed first, followed by subsequent factors. On the basis of our tolerance level or precision level required, we select the number of components to be considered. One important thing for PCA to get started is to get the same type of data sets, and therefore, we normalize the original data. In order to capture the variances, we need to identify the distribution of data, and therefore we transform the data in a normal distribution.

Variance is a nice measure of letting us know the kind of distribution. If we can maximize the variance, then our information is maximized. PCA can be done on any distribution, but as variance measure is the best parametric measure for normal distribution and also, data sets generally trend to follow the normal distribution or can be transformed easily to be made to follow the normal distribution, so we normally take normal distribution variance analysis in PCA.

To capture the variance of each feature with respect to other feature we try to get the variance – covariance matrix of the features, followed by finding out the eigenvalues of the matrix and then finding the eigenvectors of the matrix of the dataset which gives the various principal components.

We now construct the covariance matrix of A by multiplying A with its transpose A^t. This will give us a m*m matrix Cov A

Cov A = A.A^t

We can now simply calculate the eigenvectors and eigenvalues of the Covariance matrix Cov A. The Eigenvalues will represent the relative variance of the data.

Principal component analysis implementation in R programming language

Now that we understand the concept of PCA. We can implement the same in R programming language.

The princomp() function in R calculates the principal components of any data. We will also compare our results by calculating eigenvectors and eigenvalues separately. Let’s use the IRIS dataset.

Let’s start by loading the dataset.

1

2

# Taking the numeric part of the IRIS data

data_iris <– iris[1:4]

The iris dataset having 150 observations (rows) with 4 features.

Let’s use the cov() function to calculate the covariance matrix of the loaded iris data set.

1

2

# Calculating the covariance matrix

Cov_data <– cov(data_iris )

The next step is to calculate the eigenvalues and eigenvectors.

We can use the eigen() function to do this automatically for us.

1

2

# Find out the eigenvectors and eigenvalues using the covariance matrix

Eigen_data <– eigen(Cov_data)

We have calculated the Eigen values from the data. We will now look at the PCA function princomp() which automatically calculates these values.

Let’s calculate the components and compare the values.

1

2

# Using the inbuilt function

PCA_data <– princomp(data_iris ,cor=“False”)

Let’s now compare the output variances

1

2

# Let’s now compare the output variances

Eigen_data$values

Output

1

# The output is 4.22824171 0.24267075 0.07820950 0.02383509

1

PCA_data$sdev^2

Output

1

# The output is 4.20005343 0.24105294 0.07768810 0.02367619

There is a slight difference due to squaring in PCA_data but the outputs are more or less similar. We can also compare the eigenvectors of both models.

1

2

PCA_data$loadings[,1:4]

Eigen_data$vectors

Output

1

2

3

4

5

6

7

8

9

10

11

               Comp.1      Comp.2      Comp.3     Comp.4

Sepal.Length  0.36138659 –0.65658877 –0.58202985  0.3154872

Sepal.Width  –0.08452251 –0.73016143  0.59791083 –0.3197231

Petal.Length  0.85667061  0.17337266  0.07623608 –0.4798390

Petal.Width   0.35828920  0.07548102  0.54583143  0.7536574

            [,1]        [,2]        [,3]       [,4]

[1,]  0.36138659 –0.65658877 –0.58202985  0.3154872

[2,] –0.08452251 –0.73016143  0.59791083 –0.3197231

[3,]  0.85667061  0.17337266  0.07623608 –0.4798390

[4,]  0.35828920  0.07548102  0.54583143  0.7536574

This time the eigenvectors calculated are same and there is no difference.

Let us now understand our model. We transformed our 4 features into 4 new orthogonal components. To know the importance of the first component, we can view the summary of the model.

1

summary(PCA_data)

Importance of components

1

2

3

4

5

Importance of components:

                          Comp.1     Comp.2     Comp.3       Comp.4

Standard deviation     2.0494032 0.49097143 0.27872586 0.153870700

Proportion of Variance 0.9246187 0.05306648 0.01710261 0.005212184

Cumulative Proportion   0.9246187 0.97768521 0.99478782 1.000000000

From the Proportion of Variance, we see that the first component has an importance of 92.5% in predicting the class while the second principal component has an importance of 5.3% and so on. This means that using just the first component instead of all the 4 features will make our model accuracy to be about 92.5% while we use only one-fourth of the entire set of features.

If we want the higher accuracy, we can take the first two components together and obtain a cumulative accuracy of up to 97.7%. We can also understand how our features are transformed by using the biplot function on our model.

1

biplot (PCA_data)

pca feature transformation
" data-medium-file="https://i0.wp.com/dataaspirant.com/wp-content/uploads/2017/08/pca_feature_transformation.png?fit=300%2C295" data-large-file="https://i0.wp.com/dataaspirant.com/wp-content/uploads/2017/08/pca_feature_transformation.png?fit=526%2C517" data-lazy-loaded="true" />

PCA feature transformation

The X-Axis represents the first principal component. We see that the Petal Width and Petal Length vectors are parallel to the X-Axis. Hence they are combined and completely transformed into the first principal component. The first component also contains some part of Sepal Length and Sepal Width. The vertical part of the Sepal Length and Sepal Width Vectors are explained by the second principal component.

To determine what should be an ‘ideal’ set of features we should take after using PCA, we use a screeplot diagram. The screeplot() function in R plots the components joined by a line. We look at the plot and find the point of ‘arm-bend’. This is the point where the cumulative contribution starts decreasing and becomes parallel to the x-axis.

1

screeplot(PCA_data, type=“lines”)

principle components
" data-medium-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/08/principle_components.png?fit=300%2C295" data-large-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/08/principle_components.png?fit=526%2C517" data-lazy-loaded="true" />

principle components

This plot shows the bend at the second principal component.

Let us now fit two naive Bayes models.

one over the entire data.

The second on the first principal component.

We will calculate the difference in accuracy between these two models.

1

2

3

4

5

6

7

8

9

10

11

12

13

#Select the first principal component for the second model

model2 = PCA_data$loadings[,1]

#For the second model, we need to calculate scores by multiplying our loadings with the data

model2_scores <– as.matrix(data_iris) %*% model2

#Loading libraries for naiveBayes model

library(class)

library(e1071)

#Fitting the first model over the entire data

mod1<–naiveBayes(iris[,1:4], iris[,5])

#Fitting the second model using the first principal component

mod2<–naiveBayes(model2_scores, iris[,5])

Accuracy of the first model

1

2

# Accuracy for the first model

table(predict(mod1, iris[,1:4]), iris[,5])

1

2

3

4

         setosa versicolor virginica

  setosa         50          0         0

  versicolor       0         47         3

  virginica       0           3         47

Accuracy of the second model

1

2

# Accuracy for the second model

table(predict(mod2, model2_scores), iris[,5])

We can see that there is a difference of 3 predictions between the two models. In return for reducing our data to one-fourth of the original, we lose on the accuracy only slightly. Hence it is a great tradeoff.

With this, we came to the end of the article. Let’s summarize key points.

Summary

PCA is a very popular method of dimensionality reduction because it provides a way to easily reduce the dimensions and is easy to understand. For this reason, PCA has been used in various applications from image compression to complex gene comparison. While using PCA, one should keep in mind its limitations well.

PCA is very sensitive to the scale of the data. It will create an initial basis in the direction of the largest variance in the data. Moreover, PCA applies a transformation over the data where all new components are orthogonal. The new features may not be interpretable in business.

Another limitation of PCA is the reliance on the mean and variance of the data. If the data has a relationship in higher dimensions such as kurtosis and skewness then PCA may not be the right technique to use on the data. In situations when the features are already orthogonal to each other and are uncorrelated, PCA will not produce any useful results except ordering the features in decreasing order of their variances.

PCA is very useful in situations when the data at hand is very large. Example, in case of image compression, PCA can be used to store the image in the first few hundred components and use less number of pixels.

The entire code used in the article is as follows:

You can also clone the complete code from DataAspirant GitHub.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

#Taking the numeric part of the IRIS data

data_iris <– iris[1:4]

#Calculating the covariance matrix

Cov_data <– cov(data_iris )

#Find out the eigenvectors and eigenvalues using the covariance matrix

Eigen_data <– eigen(Cov_data)

#Using the inbuilt function

PCA_data <– princomp(data_iris ,cor=“False”)

#Let’s now compare the output variances

Eigen_data$values #The output is 4.22824171 0.24267075 0.07820950 0.02383509

PCA_data$sdev^2 #The output is 4.20005343 0.24105294 0.07768810 0.02367619

PCA_data$loadings[,1:4]

Eigen_data$vectors

summary(PCA_data)

biplot (PCA_data)

screeplot(PCA_data, type=“lines”)

#Select the first principal component for the second model

model2 = PCA_data$loadings[,1]

#For the second model, we need to calculate scores by multiplying our loadings with the data

model2_scores <– as.matrix(data_iris) %*% model2

#Loading libraries for naiveBayes model

library(class)

library(e1071)

#Fitting the first model over the entire data

mod1<–naiveBayes(iris[,1:4], iris[,5])

#Fitting the second model using the first principal component

mod2<–naiveBayes(model2_scores, iris[,5])

#Accuracy for the first model

table(predict(mod1, iris[,1:4]), iris[,5])



#Accuracy for the second model

table(predict(mod2, model2_scores), iris[,5])

The post How to Perform the Principal Component Analysis in R first appeared on Perceptive Analytics.

How to Create Histograms in R

Saneesh V — Sat, 25 Aug 2018 09:00:58 +0000

Histogram in R
" data-medium-file="https://i0.wp.com/dataaspirant.com/wp-content/uploads/2017/10/Histogram-in-R.png?fit=300%2C169" data-large-file="https://i0.wp.com/dataaspirant.com/wp-content/uploads/2017/10/Histogram-in-R.png?fit=560%2C315" data-lazy-loaded="true" />

Histogram in R

How to create histograms in R

To start off with analysis on any data set, we plot histograms. Knowing the data set involves details about the distribution of the data and histogram is the most obvious way to understand it.

Besides being a visual representation in an intuitive manner. It gives an overview of how the values are spread.

We come across many depictions of data using histograms in our day to day life. For example, the distribution of marks in a class can be best represented using a histogram and so does the age distribution in an organization.

The good thing about histograms is that it can visualize a large amount of data in a single figure and convey lots of information.

It is quite easy to spot the median and mode by looking at histograms. A histogram can also indicate possible outliers and gaps in data. Thus a single figure can help know a lot about data.

So in this article, we are going implement different kinds of histograms. Starting with the basic histogram and to customize it to a great extended.

Before we drive further let’s look at the table of content for this article.

Table of contents:

Basics of Histogram

Implementing different kinds of Histograms

Basics of Histogram

A histogram consists of bars and is made for one variable at a time. That’s why knowledge of plotting a histogram is the foundation of univariate descriptive analytics.

To plot a histogram, we use one of the axis as the count or frequency of values and another axis as the range of values divided into buckets.Let’s jump to plotting a few histograms in R.

Implementing different kinds of Histograms

I will work on two different datasets and cite examples from them. The first data is the AirPassengers data.

This data is a time series denoting monthly totals of international airline passengers. There are 144 values from 1949 to 1960.

Let us see how the data looks like. The plot() function creates a plot of the time series.

Plot Air Passengers Data

1

2

# Plot Air Passengers data

plot(AirPassengers)

Histogram for Air Passengers Data with Time
" data-medium-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Histogram_for_air_passengers_data.png?fit=300%2C184" data-large-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Histogram_for_air_passengers_data.png?fit=690%2C422" data-lazy-loaded="true" />

Histogram for Air Passengers Data with Time

Some patterns are inherently visible in the time series. There are trends and seasonality component. The plot clearly shows how the values gradually increase from 100 to 600 due to increasing trend with a repeating seasonality pattern across years.

We can now use the built-in function hist() to plot histogram of the series in R

Histogram for Air Passengers Data with Frequency

1

2

# Plot a histogram for Air Passengers data

hist(AirPassengers)

Histogram for Air Passengers Data with Frequency
" data-medium-file="https://i2.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Histogram_for_air_passengers_with_frequency.png?fit=300%2C184" data-large-file="https://i2.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Histogram_for_air_passengers_with_frequency.png?fit=690%2C422" data-lazy-loaded="true" />

Histogram for Air Passengers Data with Frequency

This plot is indicative of a histogram for time series data. The bars represent the range of values and their height indicates the frequency.

The data shows that most numbers of passengers per month have been between 100-150 and 150-200 followed by the second highest frequency in the range 200-250 and 300-350.

Since it is a time series with a gradual seasonality and trend, most of the values are towards the lower end of the spectrum.

That is why the histogram shows a decreasing trend as the values increases. Had it been a time series with a decreasing trend, the bars would have been in increasing order of the number of Air Passengers.

Let’s get into the game

The Air Passengers data was a single variable data. I will now use the iris dataset to help understand more about histograms.

The histogram can plot only one variable at a time. For plotting features of the iris dataset, the $ notation is used to specify the specific variable I start with plotting the petal length.

Petal Length in Distribution

1

2

# See how the petal length is distributed

plot(iris$Petal.Length)

Petal length is distributed
" data-medium-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Petal_length_distribution.png?fit=300%2C184" data-large-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Petal_length_distribution.png?fit=690%2C422" data-lazy-loaded="true" />

Petal length is distributed

The data shows a clear demarcation of three clusters. The clusters have values from 1-2, 3-5 and 5-7 respectively. Let’s make the histogram to get additional insights

Histogram for iris petal length

1

2

# Plot the histogram for iris petal length

hist(iris$Petal.Length)

Histogram for iris petal length
" data-medium-file="https://i2.wp.com/dataaspirant.com/wp-content/uploads/2017/09/histogram_for_iris_petal_length.png?fit=300%2C184" data-large-file="https://i2.wp.com/dataaspirant.com/wp-content/uploads/2017/09/histogram_for_iris_petal_length.png?fit=690%2C422" data-lazy-loaded="true" />

Histogram for iris petal length

There are a lot of values in the range 1-1.5 and a small number of values between 1.5-2. This range corresponds to the first cluster. The second and third clusters somewhat overlap. This can be seen from the rest of the values.

There is no gap and the maximum number of values lie between 4-5. There are a few values in the range 2.5-3.5 and beyond 6 which can belong to cluster 2 and cluster 3 respectively.

The iris dataset also contains a non-numeric feature – Species. The plot function can give the count of each species. For this feature, the hist() function will give an error.

Distribution of Species in Iris data

1

2

# Distribution or Species in iris data

plot(iris$Species)

Distribution or Species in iris data
" data-medium-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/09/distribution_species_in_iris.png?fit=300%2C184" data-large-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/09/distribution_species_in_iris.png?fit=690%2C422" data-lazy-loaded="true" />

Distribution or Species in iris data

1

2

# Try making a histogram for iris species

hist(iris$Species)

Output:

1

Error in hist.default(iris$Species) : ‘x’ must be numeric

As apparent as it is, the plot function provides a count of all the values and thus histogram is not used to show the distribution of non-numeric features.

However, the hist() function in R is very rich. You can specify a lot of parameters. The important ones are specifying the axis, title, and color of the histogram. You can also specify limits to the axis and change bin size

Adding cheery to the cake – parameters for hist() function

Before looking at the commonly used parameters for a histogram, we first use the help function.

Get the documentation for hist() function

1

2

# Get the documentation for hist() function

?hist

Documentation Output

1

2

3

4

5

6

7

8

9

10

# ## Default S3 method:

# hist(x, breaks = “Sturges”,

# freq = NULL, probability = !freq,

# include.lowest = TRUE, right = TRUE,

# density = NULL, angle = 45, col = NULL, border = NULL,

# main = paste(“Histogram of” , xname),

# xlim = range(breaks), ylim = NULL,

# xlab = xname, ylab,

# axes = TRUE, plot = TRUE, labels = FALSE,

# nclass = NULL, warn.unused = TRUE, …)

The hist function contains about 20+ parameters. We first look at the way to describe the title to the plot and the labels to x and y-axis. The parameter main=”” specifies the title and labels to the axis are plotted by xlab and ylab for x-axis and y-axis respectively. I will go step by step and set these values for iris petal length feature

Adding the labels

1

2

# Add all the labels

hist(iris$Petal.Length,main=“Histogram for petal length”, xlab = “Petal length in cm”, ylab = “Count”)

Histogram for petal length with labels
" data-medium-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Histogram_for_petal_length_with_labels.png?fit=300%2C184" data-large-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Histogram_for_petal_length_with_labels.png?fit=690%2C422" data-lazy-loaded="true" />

Histogram for petal length with labels

The next step is to add colors and border. The border parameter and col parameter can be used to set this.

1

2

# Add all the labels and color

hist(iris$Petal.Length,main=“Histogram for petal length”, xlab = “Petal length in cm”, ylab = “Count”,border=“red”, col=“blue”)

Histogram for petal length with Labels and Colors
" data-medium-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Histogram_for_petal_length_with_labels_and_colors.png?fit=300%2C184" data-large-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Histogram_for_petal_length_with_labels_and_colors.png?fit=690%2C422" data-lazy-loaded="true" />

Histogram for petal length with Labels and Colors

Now it looks a little lovelier. To make the y-axis indexes more readable, we can rotate it using the las function. las is 0 by default and is parallel to axis. It can be 1 for projecting horizontally. las=, 2 makes x-axis indexes perpendicular to axis along with y axis indexes 3 for y axis indexes parallel to axis while keeping x axis perpendicular.

1

2

# Add all the labels and color. Set the y axis indexes horizontal

hist(iris$Petal.Length,main=“Histogram for petal length”, xlab = “Petal length in cm”, ylab = “Count”,border=“red”, col=“blue”,las=1)

Histogram with y-axis indexes horizontal
" data-medium-file="https://i2.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Histogram_with_y_axis.png?fit=300%2C184" data-large-file="https://i2.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Histogram_with_y_axis.png?fit=690%2C422" data-lazy-loaded="true" />

Histogram with y-axis indexes horizontal

We now change the bars. hist() function allows setting the limits of the axis using the xlim and ylim parameters. We can also set the length of the bars using the breaks parameter

1

2

3

# Add all the labels and color. Set the y axis indexes horizontal. Set limits for axis and

#set 6 breaks

hist(iris$Petal.Length,main=“Histogram for petal length”, xlab = “Petal length in cm”, ylab = “Count”,border=“red”, col=“blue”,las=1,xlim=c(1,7),ylim=c(0,40),breaks=6)

Histogram with axis limits
" data-medium-file="https://i0.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Histogram_with_axis_limits.png?fit=300%2C184" data-large-file="https://i0.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Histogram_with_axis_limits.png?fit=690%2C422" data-lazy-loaded="true" />

Histogram with axis limits

Keep in mind that setting the breaks can be done by keeping breaks as a vector as well. Another parameter include.lowest also goes with the breaks when it is a vector. When this parameter is set to true, setting right = FALSE or TRUE will keep all values equal to breaks values in the left bars or right bars respectively.

Additionally, there are two types are histograms. This one is the common one which plots the frequencies of values. A similar plot can be made using probability by setting the freq=FALSE or probability=TRUE parameters.

1

2

3

# Add all the labels and color. Set the y axis indexes horizontal. Set limits for axis and

# set probabilistic plot

hist(iris$Petal.Length,main=“Histogram for petal length”, xlab = “Petal length in cm”, ylab = “Count”,border=“red”, col=“blue”,las=1,xlim=c(1,7),ylim=c(0,1),freq=FALSE)

Probabilistic Plot
" data-medium-file="https://i0.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Probabilistic_plot.png?fit=300%2C184" data-large-file="https://i0.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Probabilistic_plot.png?fit=690%2C422" data-lazy-loaded="true" />

Probabilistic Plot

The hist() function also provides color shading. The density and angle parameters come in the picture for this. We can set the color density in lines per inch and the angle of the lines. Let’s try an example with density set to 50 and an angle of 60

1

2

3

# Add all the labels and color. Set the y axis indexes horizontal. Set limits for axis and

# Setting color density in lines per inch and the angle

hist(iris$Petal.Length,main=“Histogram for petal length”, xlab = “Petal length in cm”, ylab = “Count”,border=“red”, col=“blue”,las=1,xlim=c(1,7),ylim=c(0,40),density=50,angle=60)

Histogram with Color Density in Lines
" data-medium-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Histogram_with_color_density_in_lines.png?fit=300%2C184" data-large-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Histogram_with_color_density_in_lines.png?fit=690%2C422" data-lazy-loaded="true" />

Histogram with Color Density in Lines

Additionally, it may not even be necessary to get a plot. We can get the output in our console. The plot parameter, when set to FALSE, gives us the relevant. Let’s try it out

1

2

# Getting the output instead of the plot

hist(iris$Petal.Length,plot=FALSE)

Console Output

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

$breaks

[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0

$counts

[1] 37 13 0 1 4 11 21 21 17 16 5 4

$density

[1] 0.49333333 0.17333333 0.00000000 0.01333333 0.05333333 0.14666667 0.28000000 0.28000000 0.22666667 0.21333333

[11] 0.06666667 0.05333333

$mids

[1] 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25 5.75 6.25 6.75

$xname

[1] “iris$Petal.Length”

$equidist

[1] TRUE

attr(,“class”)

[1] “histogram”

The final parameter which I am going to explore is the labels parameter. To get more clarity on our plots, the labels parameter writes the exact value the bar is representing when set to TRUE

1

2

3

# Add all the labels and color. Set the y axis indexes horizontal. Set limits for axis and

# Drawing lables on top of bars

hist(iris$Petal.Length,main=“Histogram for petal length”, xlab = “Petal length in cm”, ylab = “Count”,border=“red”, col=“blue”,las=1,xlim=c(1,7),ylim=c(0,40),labels=TRUE)

Drawing Labels on Bars
" data-medium-file="https://i2.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Drawing_labels_on_bars.png?fit=300%2C184" data-large-file="https://i2.wp.com/dataaspirant.com/wp-content/uploads/2017/09/Drawing_labels_on_bars.png?fit=690%2C422" data-lazy-loaded="true" />

Drawing Labels on Bars

Isn’t this a lot of features a built-in histogram function can do? With histograms, one can know so much about the data and make the plots look good as well. For practice, here is the complete R code used in this article.

You can get this post complete code on our Github account.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

# Plot Air Passengers data

plot(AirPassengers)

# Plot a histogram for Air Passengers data

hist(AirPassengers)

# See how the petal length is distributed

plot(iris$Petal.Length)

# Plot the histogram for iris petal length

hist(iris$Petal.Length)

# Distribution or Species in iris data

plot(iris$Species)

# Try making a histogram for iris species

hist(iris$Species)

# Get the documentation for hist() function

?hist

# ## Default S3 method:

# hist(x, breaks = “Sturges”,

# freq = NULL, probability = !freq,

# include.lowest = TRUE, right = TRUE,

# density = NULL, angle = 45, col = NULL, border = NULL,

# main = paste(“Histogram of” , xname),

# xlim = range(breaks), ylim = NULL,

# xlab = xname, ylab,

# axes = TRUE, plot = TRUE, labels = FALSE,

# nclass = NULL, warn.unused = TRUE, …)

# Add all the labels

hist(iris$Petal.Length,main=“Histogram for petal length”, xlab = “Petal length in cm”, ylab = “Count”)

# Add all the labels and color

hist(iris$Petal.Length,main=“Histogram for petal length”, xlab = “Petal length in cm”, ylab = “Count”,border=“red”, col=“blue”)

# Add all the labels and color. Set the y axis indexes horizontal

hist(iris$Petal.Length,main=“Histogram for petal length”, xlab = “Petal length in cm”, ylab = “Count”,border=“red”, col=“blue”,las=1)

# Add all the labels and color. Set the y axis indexes horizontal. Set limits for axis and

# set 6 breaks

hist(iris$Petal.Length,main=“Histogram for petal length”, xlab = “Petal length in cm”, ylab = “Count”,border=“red”, col=“blue”,las=1,xlim=c(1,7),ylim=c(0,40),breaks=6)

# Add all the labels and color. Set the y axis indexes horizontal. Set limits for axis and

# set probabilistic plot

hist(iris$Petal.Length,main=“Histogram for petal length”, xlab = “Petal length in cm”, ylab = “Count”,border=“red”, col=“blue”,las=1,xlim=c(1,7),ylim=c(0,1),freq=FALSE)

# Add all the labels and color. Set the y axis indexes horizontal. Set limits for axis and

# Setting color density in lines per inch and the angle

hist(iris$Petal.Length,main=“Histogram for petal length”, xlab = “Petal length in cm”, ylab = “Count”,border=“red”, col=“blue”,las=1,xlim=c(1,7),ylim=c(0,40),density=50,angle=60)

# Getting the output instead of the plot

hist(iris$Petal.Length,plot=FALSE)

# Add all the labels and color. Set the y axis indexes horizontal. Set limits for axis and

# Drawing lables on top of bars

hist(iris$Petal.Length,main=“Histogram for petal length”, xlab = “Petal length in cm”, ylab = “Count”,border=“red”, col=“blue”,las=1,xlim=c(1,7),ylim=c(0,40),labels=TRUE)

The post How to Create Histograms in R first appeared on Perceptive Analytics.

10 Smart R Programming Tips to become Better R Programmer

Saneesh V — Fri, 24 Aug 2018 09:00:12 +0000

Coding is the process by which a programmer converts tasks from human-readable logic to machine-readable language. The reason behind coding being so popular is that there are so many ways to do the same thing that programmers don’t know the right choice anymore.

As a result, each programmer has his/her own style in writing implementations to the same part of an algorithm.

Writing code can sometimes be the most difficult and time-consuming part of any project. If the code is written in such a way that it is hard to change or requires a lot of work for every small update, then the investments will keep on piling up and more and more issues will crop up as the project progresses.

A good and well-written code is reusable, efficient and written cleverly by a smart programmer. This is what differentiates programmers from each other.

So, here are some tips to becoming a SMART coder:

Table of contents:

Writing codes for Programmer, Developer, and Even for A Layman

Knowing how to improve the code

Writing robust code

When to use shortcuts and when not to use

Reduce effort through code reuse

Write planned out code

Active memory management

Remove redundant tasks

Learn to adapt

Peer review

1. Writing Codes for Programmer, Developer, and Even for A Layman

Though codes are primarily written for the machine to understand. They should be structured and well organized for other developers or for any layman to understand. In reality, codes should be written for all the three.

Those who keep this fact in mind are one step ahead of other coders while those who are able to make sure everyone can understand their code are miles ahead than their struggling friends.

Good programmers always document their codes and make use of IDE. I will use R language to explain the concept. Using IDE such as Rstudio makes it easier to write code quickly.

The main advantage available in almost all IDE is the auto-completion feature which suggests the function or command when part of it is written.

IDE is also known to suggest the syntax of the selected functions which saves time. Rstudio IDE environment also displays environment variables alongside with some basic details of each variable.

Documentation is another ability which differentiates good programmers from the rest.

Let’s look at this viewpoint using an example. Say you read the following code:

Code snippet 1

1

2

3

4

5

6

7

# Code snippet 1

a=16

b=a/2

c=(a+b)/2

Code snippet 2

1

2

3

4

5

6

7

8

9

10

11

12

13

# Code snippet 2

# store the max memory size

a=16

# taking half of the maximum memory as the minimum memory

b=a/2

# taking mean of maximum and minimum memory as the recommended memory

c=(a+b)/2

Code snippet 3

1

2

3

4

5

6

7

8

9

10

11

12

13

# Code snippet 3

# store the max memory size

max_mem=16

# taking half of the maximum memory as the minimum memory

min_mem=max_mem/2

# taking mean of maximum and minimum memory as the recommended memory

mean_mem=(max_mem+min_mem)/2

The difference in documentation is highlighted in these three code snippets and this is just a simple demonstration of code understandability.

The first code is difficult to understand. It just sets the values of three variables. There are no comments and the variable names do not explain anything.

The second code snippet explains that ‘a’ is the maximum memory, ‘b’ is the minimum memory and ‘c’ is the mean of the two.

Without the comments in code snippet 2, no one can understand whether the calculation for ‘c’ is correct or not.

The third code is a step further with the variables representing what is stored in them.

The third code is the easiest to understand even though all the three codes perform similar tasks. Moreover, when the variables are used elsewhere, the variables used in the third snippet are self-explanatory and will not require a programmer to search in the code for what they store until an error occurs in the code.

2. Knowing how to Improve

R has multiple ways to achieve a task. Each of the possibilities comes from using more memory, faster execution or different algorithm/logic.

Whenever possible, good programmers make this choice wisely.

R has the feature to execute code in parallel. Lengthy tasks such as fitting models can be executed in parallel, resulting in time-saving. Other tasks can also be executed faster based on the logic and packages used.

As an illustration, the following code snippets reflects the same task, one with sqldf package and another with dplyr package.

Using sqldf version

1

2

3

4

5

6

7

# Using sqldf version

install.packages(“sqldf”)

library(sqldf)

Out_df=sqldf(“select * from table_a left outer join table_b on table_a.var_x=table_b.var_x”)

Using dplyr version

1

2

3

4

5

6

7

# Using dplyr version

install.packages(“dplyr”)

library(dplyr)

Out_df=left_join(table_a,table_b)

I personally prefer the dplyr version whenever possible. However, there are some differences between the outputs.

The dplyr version will look at all variables with the same name and join using them. If there is more than one such variable, I need to use them by field. Moreover, left join using dplyr will not keep both copies of the variable used to join tables whereas sqldf does.

One advantage of sqldf is that sqldf is not case sensitive and can easily join tables even if the variable names in the two tables are completely different. However, it is slower than dplyr.

3. Writing Robust Code

While writing code, you can make the code simple but situation specific or write a generic code. One such way in which programmers write simple but situation-specific code is by ‘Hard Coding’.

It is the term given to fixing values of variables and is never recommended.

For example, dividing the sum of all salaries in a 50,000-row salary data by 50,000 rather than dividing the sum of that sum with the number of rows may seem to make the same sense but have a different meaning in programming.

If the data changes with the change in the number of rows, the number 50,000 needs to be searched and updated. If the programmer misses making the small change, all the work goes down the drain. On the other hand, the latter approach automatically does the task and is a robust method.

Another popular programming issue quite specific to languages such as R is Code Portability. Codes running on one computer may not work on another because the other computer does not have some packages installed or has outdated packages.

Such cases can be handled by checking for installed packages first and then installing them. These tasks can be collectively called as robust programming and make the code error free.

Using an illustration for checking and installing/updating h2o package.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

# If h2o package is already loaded, unload it and uninstall

if (“package:h2o” %in% search()) { detach(“package:h2o”, unload=TRUE) }

# Checking

if (“h2o” %in% rownames(installed.packages())) { remove.packages(“h2o”) }

# Next, we download packages that H2O depends on.

# methods

if (! (“methods” %in% rownames(installed.packages()))) { install.packages(“methods”) }

# statmod

if (! (“statmod” %in% rownames(installed.packages()))) { install.packages(“statmod”) }

# stats

if (! (“stats” %in% rownames(installed.packages()))) { install.packages(“stats”) }

# graphics

if (! (“graphics” %in% rownames(installed.packages()))) { install.packages(“graphics”) }

# Rcurl

if (! (“RCurl” %in% rownames(installed.packages()))) { install.packages(“RCurl”) }

# jsonlite

if (! (“jsonlite” %in% rownames(installed.packages()))) { install.packages(“jsonlite”) }

# tools

if (! (“tools” %in% rownames(installed.packages()))) { install.packages(“tools”) }

# utils

if (! (“utils” %in% rownames(installed.packages()))) { install.packages(“utils”) }

# Finally install and load h2o package

install.packages(“h20”)

library(h2o)



4. When to Use Shortcuts and When Not to

Using shortcuts may be tempting in the pursuit of writing code swiftly but the right practice is to know when to use them.

For instance, shortcut keys are something which is really helpful and can always be used. Using Ctrl+L in windows clears the console output screen, Using Ctrl+Shift+C in windows comments and un-comments all selected lines of code in one go are my favorite shortcuts in Rstudio.

Another shortcut is writing code for fixing code temporarily or writing faulty fixes which are not desired.

Here are some of the examples of faulty fixes.

This code changes a particular column name without checking its existing name

1

2

3

# This code changes a particular column name without checking its existing name

colnames(data_f)[5]=”new_name”

This removes certain columns using a number. This may remove important ones and code may give the error if the number of columns less than 10 in this case.

1

2

3

# This removes certain columns using a number. This may remove important ones and code may give error if the number of columns are less than 10 in this case

data_f=data_f[,1:4,6:10]

This converts a value to numeric without checking if it actually has all numbers. If the value does not contain numbers, it may produce NAs by coercion

1

2

3

# This converts a value to numeric without checking if it actually has all numbers. If the value does not contain numbers, it may produce NAs by coercion

Num_val=”123″

The following converts Num_val to 123 correctly

1

2

3

4

5

# The following converts Num_val to 123 correctly

Num_val=as.numeric(Num_val)

char_val=”A_Name”

The following issues a warning and converts Num_val to NA as it is not a number

1

2

3

# The following issues a warning and converts Num_val to NA as it is not a number

char_val=as.numeric(char_val)

5. Reduce Effort Through Code Reuse

When you start writing a code, you don’t need to waste time if a particular piece of logic has already been written for you. Better known as “Code Re-use”, you can always use your own code you previously wrote or even google to reach out the large R community.

Don’t be afraid to search. Looking up for already implemented solutions online is very helpful in learning the methods prevalent for similar situations and the pros and cons associated with them.

Even when it becomes necessary to reinvent the wheel, the existing solutions can serve as a benchmark to test your new solution. An equally important part of writing code is to make your own code reusable.

Here are two snippets which highlight reusability.

Code which needs to be edited before resuing it

1

2

3

4

5

6

7

# Code which needs to be edited before reusing it

for(i in 1:501)             {

df[,i]=as.numeric(df[,i])

}

Code which can be reused with lesser editing

1

2

3

4

5

6

7

# Code which can be reused with lesser editing

for(i in 1:ncol(df))      {

df[,i]=as.numeric(df[,i])

}

6. Write Planned Out Code

Writing code on the fly may be a cool-to-have skill but not helpful for writing efficient codes. Coding is most efficient when you know what you are writing.

Always plan and write your logic on a piece of paper before implementing it. Inculcating the habit of adding tabs and spaces and basic formatting as you code is another time-saving skill for a good programmer.

For instance, every new ‘if’, ‘for’ or ‘while’ statement can be followed by tabs so that indentation is clearly visible. Although optional, such actions separate out blocks of code and helpful in identifying breakpoints as well as debugging.

A more rigorous but helpful approach is to write code using functions and modules and explaining every section with examples in comments or printing progress inside loops and conditions. Ultimately it all depends on the programmer how he/she chooses to document and log in the code.

7. Active Memory Management

Adding memory handling code is like handling a double-edged sword. It may not be useful for small-scale programs due to a slowdown in execution speed but nevertheless a great skill to have for writing scalable code.

In Rstudio, removing variables and frames when they are no longer required with the rm() function, garbage collection using gc() command and selecting the relevant features and data for proceeding are ways to manage memory.

Adjusting RAM usage with memory.limit() and setting parallel processing are also tasks for managing your memory usage. Remember! Memory management goes hand in hand with data backup.

It only takes a few seconds create and store copies of data. It should be done to ensure that data loss does not occur if backtracking is required.

Have a look at this example snippet which stores the master data and then frees up memory.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

# dividing master dataset into train and test with ratio 7:3

library(dplyr)

train<–sample_frac(master_data, 0.7

train_ind<–as.numeric(rownames(train))

test<–master_data[–train_ind,]

# saving backup of master_data and removing unneeded data

write.csv(master_data,”master_data.csv”)

rm(master_data)

rm(train_ind)

gc()<span style=“font-family: -apple-system, BlinkMacSystemFont, ‘Segoe UI’, Roboto, Oxygen-Sans, Ubuntu, Cantarell, ‘Helvetica Neue’, sans-serif; font-size: 16px; background-color: #ffffff;”> </span>

8. Remove Redundant Tasks

Sometimes programmers do some tasks repeatedly or forget to remove program code without knowing it.

Writing separate iterations for each data manipulation step, leaving libraries loaded even after they are no longer required, not removing features until the last moment, multiple joins and queries,etc. are some examples of redundancy lurking in your code.

While these happen somewhat naturally as more and more changes are made and new logic is added. It is a good practice to look at existing code and adjust your new lines to save runtime.

Redundancy can slow your code so much that removing it can do wonders in execution speed.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

# Redundant code

# Takes about 0.5 seconds for iris data

for(i in 1:ncol(df))      {

            df[,i]=as.numeric(df[,i])

}

for(i in 1:ncol(df))      {

            #storing missing values per column in mis vector

            mis[i]=length(which(is.na(df[,i])))

}

#Better implementation (implementations faster than the one below also exist)

#Gives a similar output but takes about 0.3 seconds for iris data – 35% improvement

for(i in 1:ncol(df))      {

            df[,i]=as.numeric(df[,i])

            #storing missing values per column in mis vector

            mis[i]=length(which(is.na(df[,i])))

}

9. Learn to Adapt

No matter how good a programmer you are, you can always be better! This tip is not related to typical coding practices but teamwork. Sharing and understanding codes from peers, Reading codes online (such as from repositories).

setting yourself up to date with books and blogs and learning about new technologies and packages which are released for R are some ways to learn.

Being flexible and adaptive to new methods and keeping yourself up to date with what’s happening in the analytics industry today can help you in avoiding becoming obsolete with old practices.

10. Peer Review

The code you write may be straightforward for you but very complex for everyone else. How will you know that? The only way is to know what others think about it.

Code review is thus the last but not the least in terms of importance for better coding. Ask people to go through your code and be open to suggested edits. You may come across situations when some code you thought is written beautifully can be replaced with more efficient code.

Code review is a process which helps both the coder and reviewer as it is a way of helping each other to improve and move forward.

The Path is Not So Difficult: Conclusion

Becoming a good programmer is no easy feat but becoming better at programming as you progress is possible. Though it will take time, persevering to add strong programming habits will make you a strong member in every team’s arsenal.

These tips are just the beginning and there may be more ways to improve. The knowledge to always keep improving will take you forward and let you taste the sweet results of being a hi-tech programmer.

In the rapidly changing analytics world, staying with the latest tools and techniques is a priority and being good at R programming can be a prime factor towards your progress in your analytics career.

So go out there and make yourself acquainted with the techniques of becoming better at R programming.
The post 10 Smart R Programming Tips to become Better R Programmer first appeared on Perceptive Analytics.

How to Perform Hierarchical Clustering in R

Saneesh V — Thu, 23 Aug 2018 09:00:48 +0000

Over the last couple of articles, We learned different classification and regression algorithms. Now in this article, We are going to learn entirely another type of algorithm. Which falls into the unsupervised learning algorithms.

If you were not aware of unsupervised learning algorithms, all the machine learning algorithms mainly classified into two main categories.

Supervised learning algorithms

Unsupervised learning algorithms

All the classification and regression algorithms belong to supervised learning algorithms. The other set of algorithms which fall under unsupervised learning algorithms are clustering algorithms.

In fact, the foremost algorithms to study in unsupervised learning algorithms is clustering analysis algorithms. Today we are going to learn an algorithm to perform the cluster analysis.

We have a decent number of algorithms to perform cluster analysis; In this article, we will be learning how to perform the clustering with the Hierarchical clustering algorithm.

Before we drive further. Let’s have a look at the table of contents.

Table of contents:

What is clustering analysis?

Clustering analysis example

Hierarchical clustering

Dendrogram

Agglomerative clustering

Divisive clustering

Clustering linkage comparison

Implementing hierarchical clustering in R programming language

Data preparation

Packages need to perform hierarchical clustering

Visualizing clustering in 3d view

Complete code

Summary

Related courses

Exploratory data analysis in r

Clustering analysis in Python

Machine learning clustering and information retrieval

What is clustering analysis?

Clustering the name itself has a deep meaning about the ongoing process which happens in the cluster analysis. The fundamental problem clustering address is to divide the data into meaningful groups (clusters).

When we say, meaningful groups, the meaningfulness purely depends on the intention behind the purpose of the group’s formations.

Suppose if we intend to cluster the search results for a particular keyword. We Intended to find all the search results which were meaningfully similar to the search keyword.

If we intend to cluster the search results for a particular location, then we need to group the search results belongs to one specific place.

The identified cluster elements within the same cluster should be similar to each other when compared to the other cluster elements.

Suppose we have 100 articles and we want to group them into different categories. Let’s consider the below categories.

Sports articles

Business articles

Entertainment articles

When we group all the 100 articles into the above 3 categories. All the articles belong to the sports category will be same, In the sense, the content in the sports articles belongs to sports category.

When you pick an article from sports category and the other article from business articles. Content-wise they will be completely different. This summarises the rule of thumb condition to form clusters.

All the elements in the same cluster should be similar and elements of the different cluster should not be similar.

Now let’s have a look at one clustering analysis example to understand it better.

Clustering analysis example

Clustering Example

Suppose from the above example of clustering gender based on hair length. You can notice all the female cluster elements are so close to each other, Whereas the male cluster elements are far from the female cluster elements.

Here it’s easy to say the clusters in the same cluster (female or male clusters ) elements are so close together by visualizing the plot. We use different clustering algorithms to create clusters by following the thumb rule all the elements in the same cluster has to be close together.

To calculate this closeness we use different similarity measures. These similarity measures determine whether the given point is closer by giving the similarity score.

Hierarchical clustering

Hierarchical clustering is an alternative approach to k-means clustering for identifying groups in the dataset and does not require to pre-specify the number of clusters to generate.

It refers to a set of clustering algorithms that build tree-like clusters by successively splitting or merging them. This hierarchical structure is represented using a tree.

Hierarchical clustering methods use a distance similarity measure to combine or split clusters. The recursive process continues until there is only one cluster left or we cannot split more clusters. We can use a dendrogram to represent the hierarchy of clusters.

Dendrogram

A dendrogram is a tree-like structure frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering.

Hierarchical classifications produced by either

Agglomerative

Divisive

The agglomerative or divisive route may be represented by a two-dimensional diagram known as a dendrogram, which illustrates the fusions or divisions made at each stage of the analysis. Agglomerative clustering usually yields a higher number of clusters, with fewer leaf nodes in the cluster.

In a hierarchical classification, the data are not partitioned into a particular number of classes or clusters at a single step. Instead, the classification consists of a series of partitions, which may run from a single cluster containing all individuals to n clusters each containing a single individual.

Hierarchical clustering algorithms can be either bottom-up or top-down.

Hierarchical clustering agglomerative and divisive methods
" data-medium-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2018/01/Hierarchical_clustering_agglomerative_and_divisive_methods.png?fit=300%2C166" data-large-file="https://i1.wp.com/dataaspirant.com/wp-content/uploads/2018/01/Hierarchical_clustering_agglomerative_and_divisive_methods.png?fit=599%2C332" data-lazy-loaded="true" />

Hierarchical clustering agglomerative and divisive methods

Agglomerative clustering

Agglomerative clustering is Bottom-up technique start by considering each data point as its own cluster and merging them together into larger groups from the bottom up into a single giant cluster.

Divisive clustering

Divisive clustering is the opposite, it starts with one cluster, which is then divided in two as a function of the similarities or distances in the data. These new clusters are then divided, and so on until each case is a cluster.

Clustering linkage comparison

In this article, we describe the bottom-up approach in the detailed manner i.e. agglomerative algorithm

The necessary steps of an agglomerative algorithm are (diagrammed visually in the figure below):

Start with each point in its own cluster.

Compare each pair of data points using a distance metric. This could be any of the methods discussed above.

Use a linkage criterion to merge data points (at the first stage) or clusters (in subsequent phases), where the linkage is represented by a function such as:

Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.

Minimum or single linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, “loose” clusters.

Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters.

Centroid linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.

Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step, the pair of clusters with minimum between-cluster distance are merged.

Single linkage | Complete linkage | Average linkage clustering | Centroid linkage

Image Credit :: http://slideplayer.com/slide/9336538/
" data-medium-file="https://i0.wp.com/dataaspirant.com/wp-content/uploads/2018/01/Hierarchical-clustering-comparison.jpg?fit=300%2C225" data-large-file="https://i0.wp.com/dataaspirant.com/wp-content/uploads/2018/01/Hierarchical-clustering-comparison.jpg?fit=690%2C518" data-lazy-loaded="true" />

Single-link | Complete-link | Average-link | Centroid distance Image Credit:: http://slideplayer.com/slide/9336538/

Implementing hierarchical clustering in R programming language

Data Preparation

To perform a cluster analysis in R, generally, the data should be prepared as follows:

Rows are observations (individuals) and columns are variables

Any missing value in the data must be removed or estimated.

The data must be standardized (i.e., scaled) to make variables comparable. Recall that, standardization consists of transforming the variables such that they have mean zero and standard deviation one.

Here, we’ll use the built-in R dataset iris, which contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

1

df <– iris

To remove any missing value that might be present in the data, type this:

1

df <– na.omit(df)

As we don’t want the clustering algorithm to depend to an arbitrary variable unit, we start by scaling/standardizing the data using the R function scale:

1

df <– scale(df)

Packages need to perform hierarchical clustering

hclust [in stats package]

agnes [in cluster package]

We can perform agglomerative HC with hclust. First, we compute the dissimilarity values with dist and then feed these values into hclust and specify the agglomeration method to be used (i.e. “complete”, “average”, “single”, “ward.D”). We can plot the dendrogram after this.

1

2

3

4

5

6

7

8

# Dissimilarity matrix

d <– dist(df, method = “euclidean”)

# Hierarchical clustering using Complete Linkage

hc1 <– hclust(d, method = “complete” )

# Plot the obtained dendrogram

plot(hc1)

Alternatively, we can use the agnes function. These functions behave very similarly; however, with the agnes function, we can also get the agglomerative coefficient, which measures the amount of clustering structure found (values closer to 1 suggest strong clustering structure).

1

2

# Compute with agnes

hc2 <– agnes(df, method = “complete”)

This allows us to find certain hierarchical clustering methods that can identify stronger clustering structures. Here we see that Ward’s method identifies the strongest clustering structure of the four methods assessed.

Cluster dendrogram
" data-medium-file="https://i2.wp.com/dataaspirant.com/wp-content/uploads/2018/01/Cluster_dendrogram.png?fit=300%2C225" data-large-file="https://i2.wp.com/dataaspirant.com/wp-content/uploads/2018/01/Cluster_dendrogram.png?fit=490%2C368" data-lazy-loaded="true" />

Cluster dendrogram

Visualizing clustering in 3d view

Let’s examine, this time visually. How this algorithm proceeds using a simple dataset.

As visual representations are limited to three dimensions, we will only use three attributes, but the computation is similar if we use more attributes. We will display these using the scatterplot3d() function of the plot3D package.

which we will install and load after creating the attributes. We then examine the clustering solution provided by hclust() in order to assess whether it confirms the impressions we get from visual inspection.

1

2

3

4

5

6

7

8

9

10

A1 = c(2,3,5,7,8,10,20,21,23)

A2 = A1

A3 = A1

install.packages(“scatterplot3d”)

library(scatterplot3d)

scatterplot3d(A1,A2,A3, angle = 25, type = “h”)

demo = hclust(dist(cbind(A1,A2,A3)))

plot(demo)

Cluster dendrogram 3D view
" data-medium-file="https://i2.wp.com/dataaspirant.com/wp-content/uploads/2018/01/Cluster_dendrogram_3D_view.jpg?fit=300%2C150" data-large-file="https://i2.wp.com/dataaspirant.com/wp-content/uploads/2018/01/Cluster_dendrogram_3D_view.jpg?fit=690%2C345" data-lazy-loaded="true" />

Cluster dendrogram 3D view

In the left panel of the above plot, there are three groups of two points that are very close to each other. Another point is quite close to each of these groups of two.

Consider that the groups of two constitute a group of three with the points that lie closest to them. Finally, the two groups on the left are closer to each other than they are to the group of three on the right.

If we have a look at the dendrogram, we can see that the very same pattern is visible.

Complete Code

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

df <– iris

df <– na.omit(df)

df <– scale(df)

d <– dist(df, method = “euclidean”)

# Hierarchical clustering using Complete Linkage

hc1 <– hclust(d, method = “complete” )

# Plot the obtained dendrogram

plot(hc1)

# Compute with agnes

hc2 <– agnes(df, method = “complete”)

A1 = c(2,3,5,7,8,10,20,21,23)

A2 = A1

A3 = A1

install.packages(“scatterplot3d”)

library(scatterplot3d)

scatterplot3d(A1,A2,A3, angle = 25, type = “h”)

demo = hclust(dist(cbind(A1,A2,A3)))

plot(demo)

You can clone complete codes of dataaspirant from our GitHub account

Summary

Hierarchical methods form the backbone of cluster analysis. The need for hierarchical clustering naturally emerges in domains where it is not only required to discover similarity-based groups but also need to organize them.

This is a valuable capability wherever the complexity of similarity patterns goes beyond the limited representation power of flat clustering models.

Hierarchical clustering tends to be presented as a form of descriptive rather than predictive modeling.
The post How to Perform Hierarchical Clustering in R first appeared on Perceptive Analytics.