your image

SQL as a Data Analysis Tool. How SQL is used to perform efficient… | by Soner Yıldırım | Towards Data Science

TOWARDSDATASCIENCE
Related Topic
:- SQL MS-Excel Data Analysis

SQL as a Data Analysis Tool

How SQL is used to perform efficient data filtering and transformation

 

 

Soner Yıldırım

Jan 13·7 min read

 

 

 

Photo by Isaac Smith on Unsplash

SQL is a language used for managing data in relational databases. The core component of a relational database is table that stores data in tabular form with labelled rows and columns.

We query data from a relational database with the select statement of SQL. The select statement is highly versatile and flexible in terms of data transformation and filtering operations.

In that sense, SQL can be considered as a data analysis tool. The advantage of using SQL for data transformation and filtering is that we only retrieve the data we need. It is more practical and efficient than retrieving all the data and then applying these operations.

In this article, we will go over 7 examples to demonstrate how SQL can be used as a data analysis tool. The complexity of the examples steadily increase so it is better if you reach the end.

I previously stored the medical cost dataset available on Kaggle into an SQL table called insurance. Let’s first take a look at the table by displaying the first 5 rows.

mysql> select * from insurance-> limit 5;

 

insurance table (image by author)

The “*” indicates that we want to display all column and the limit keyword specifies the number of rows to be displayed.

The dataset contains some personal information and the amount charged for the insurance.

Example 1

We can calculate the average charge amount for smokers and non-smokers.

mysql> select smoker, avg(charges)-> from insurance-> group by smoker;+--------+-----------------+| smoker | avg(charges)    |+--------+-----------------+| no     |  8434.268297857 || yes    | 32050.231831532 |+--------+-----------------+

It is similar to the group by function of pandas. We select the smoker and charges columns and apply the average function on the charges. The group by statement allows for separating rows based on the different categories in the smoker column. Thus, we get two averages in the result.

If we want to see the overall average, we only need to select the charges column.

mysql> select avg(charges) from insurance;+-----------------+| avg(charges)    |+-----------------+| 13270.422265142 |+-----------------+

Example 2

We may also want to see the number of smokers and non-smokers along with the average charges value.

mysql> select smoker, avg(charges), count(charges)-> from insurance-> group by smoker;+--------+-----------------+----------------+| smoker | avg(charges)    | count(charges) |+--------+-----------------+----------------+| no     |  8434.268297857 |           1064 || yes    | 32050.231831532 |            274 |+--------+-----------------+----------------+

In addition to the previous example, we select the count of the charges column.

Example 3

Consider a case where we need to further separate smokers and non-smokers based on gender. To accomplish this task, we need to the sex column to the group by statement.

mysql> select smoker, sex, avg(charges), count(charges)-> from insurance-> group by smoker, sex;+--------+--------+-----------------+----------------+| smoker | sex    | avg(charges)    | count(charges) |+--------+--------+-----------------+----------------+| no     | female |  8762.297299542 |            547 || no     | male   |  8087.204731276 |            517 || yes    | female | 30678.996276260 |            115 || yes    | male   | 33042.005975283 |            159 |+--------+--------+-----------------+----------------+

Example 4

The select statement also allows for filtering. For instance, we can run the query in the previous statement for people who live in the southeast region.

mysql> select smoker, sex, avg(charges), count(charges)-> from insurance-> where region = 'southeast'-> group by smoker, sex;+--------+--------+-----------------+----------------+| smoker | sex    | avg(charges)    | count(charges) |+--------+--------+-----------------+----------------+| no     | female |  8440.205551942 |            139 || no     | male   |  7609.003586716 |            134 || yes    | female | 33034.820716388 |             36 || yes    | male   | 36029.839366545 |             55 |+--------+--------+-----------------+----------------+

We use the where statement to specify the condition for filtering. It is important to note that the where statement must be written before the group by statement in the select query.

Example 5

We want to find the average bmi value for different categories based on the smoker and children columns but only display the top 3 categories in terms of the average bmi.

This examples includes sorting aggregated values which can be done with the order by statement.

mysql> select smoker, children, avg(bmi)    -> from insurance-> group by smoker, children-> order by avg(bmi) desc-> limit 3;+--------+----------+------------+| smoker | children | avg(bmi)   |+--------+----------+------------+| no     |        4 | 31.6754545 || yes    |        2 | 31.3041818 || no     |        2 | 30.8811622 |+--------+----------+------------+

The order by statement sorts the rows based on the given column in ascending order. We can change it to descending order by using the desc keyword after the name of the column.

Example 6

Let’s elaborate on the previous example. Consider a case where we need the groups with average bmi values higher than the overall average.

One way is to separately calculate the overall average and use it as a condition for filtering.

mysql> select avg(bmi) from insurance;+------------+| avg(bmi)   |+------------+| 30.6633969 |+------------+mysql> select smoker, children, avg(bmi)-> from insurance-> group by smoker, children-> having avg(bmi) > 30.6633969;+--------+----------+------------+| smoker | children | avg(bmi)   |+--------+----------+------------+| no     |        2 | 30.8811622 || no     |        3 | 30.7384322 || no     |        4 | 31.6754545 || yes    |        1 | 30.8743443 || yes    |        2 | 31.3041818 |+--------+----------+------------+

It is important to note that when we filter based on an aggregated value, we use the having statement instead of the where statement.

The second option is to combine these two queries in a nested query.

mysql> select smoker, children, avg(bmi)-> from insurance-> group by smoker, children-> having avg(bmi) > (-> select avg(bmi) from insurance-> );+--------+----------+------------+| smoker | children | avg(bmi)   |+--------+----------+------------+| no     |        2 | 30.8811622 || no     |        3 | 30.7384322 || no     |        4 | 31.6754545 || yes    |        1 | 30.8743443 || yes    |        2 | 31.3041818 |+--------+----------+------------+

I prefer the second option because the first one is more prone to making mistakes.

Example 7

This example is a little more complex than the previous one. Instead of only displaying the rows that have a more than average bmi value, we want to create a new column that indicates whether a row is more than average or not.

We will use a case-when statement inside our nested query.

mysql> select smoke, children, avg(bmi),-> (case when avg(bmi) > (-> select avg(bmi) from insurance) then "more than avg"-> else "less than avg" end) as compared_to_avg-> from insurance-> group by smoker, children;

Before displaying the result, let’s try to understand what each step in the query does.

We select three columns similar to the previous example. The fourth column is created with a case-when statement. The new column takes the values “more than avg” or “less than avg” based on the comparison of the average bmi value for the row and the overall average bmi value.

The overall average bmi value is calculated by using a nested select statement inside the case-when statement. Here is the result of this query.

+--------+----------+------------+-----------------+| smoker | children | avg(bmi)   | compared_to_avg |+--------+----------+------------+-----------------+| no     |        0 | 30.5519499 | less than avg   || no     |        1 | 30.5648859 | less than avg   || no     |        2 | 30.8811622 | more than avg   || no     |        3 | 30.7384322 | more than avg   || no     |        4 | 31.6754545 | more than avg   || no     |        5 | 30.2700000 | less than avg   || yes    |        0 | 30.5436957 | less than avg   || yes    |        1 | 30.8743443 | more than avg   || yes    |        2 | 31.3041818 | more than avg   || yes    |        3 | 30.5206410 | less than avg   || yes    |        4 | 29.3066667 | less than avg   || yes    |        5 | 18.3000000 | less than avg   |+--------+----------+------------+-----------------+

Conclusion

We have covered some query examples to demonstrate the data analysis capabilities of SQL.

I think SQL is a required skill for data scientists or analysts. We should at least know how to query a relational database. Complex queries that perform data transformation and manipulation operations while retrieving the data have the potential to save memory. They also ease the tasks that need to be done afterwards.

Thank you for reading. Please let me know if you have any feedback.

Comments