As we already know Data Science/Machine Learning is such a multi-disciplinary field means that the sheer volume of material you need to feel properly prepared can become overwhelming.The sure thing is that will involve some SQL technical questions as Data Scientists very often query some databases.

In this article, I will discuss some SQL interview questions which a Data Scientist should revise before going on to a technical interview.The part 1 can be found here.


Table of Contents

  • Trips and Users [Hard]
  • Duplicate Emails [Easy]
  • Rising Temperature [Easy]
  • Employees Earning More Than Their Managers [Easy]
  • Department Highest Salary [Medium]
  • Not Boring Movies [Easy]
  • Nth Highest Salary [Medium]
  • Conclusion
  • References
Photo by Cristina Gottardi / Unsplash

Trips and Users [Hard]

Problem description

The Trips table holds all taxi trips. Each trip has a unique Id, while Client_Id and Driver_Id are both foreign keys to the Users_Id at the Users table. Status is an ENUM type of (‘completed’, ‘cancelled_by_driver’, ‘cancelled_by_client’).

+----+-----------+-----------+---------+--------------------+----------+
| Id | Client_Id | Driver_Id | City_Id |        Status      |Request_at|
+----+-----------+-----------+---------+--------------------+----------+
| 1  |     1     |    10     |    1    |     completed      |2013-10-01|
| 2  |     2     |    11     |    1    | cancelled_by_driver|2013-10-01|
| 3  |     3     |    12     |    6    |     completed      |2013-10-01|
| 4  |     4     |    13     |    6    | cancelled_by_client|2013-10-01|
| 5  |     1     |    10     |    1    |     completed      |2013-10-02|
| 6  |     2     |    11     |    6    |     completed      |2013-10-02|
| 7  |     3     |    12     |    6    |     completed      |2013-10-02|
| 8  |     2     |    12     |    12   |     completed      |2013-10-03|
| 9  |     3     |    10     |    12   |     completed      |2013-10-03| 
| 10 |     4     |    13     |    12   | cancelled_by_driver|2013-10-03|
+----+-----------+-----------+---------+--------------------+----------+

The Users table holds all users. Each user has an unique Users_Id, and Role is an ENUM type of (‘client’, ‘driver’, ‘partner’).

+----------+--------+--------+
| Users_Id | Banned |  Role  |
+----------+--------+--------+
|    1     |   No   | client |
|    2     |   Yes  | client |
|    3     |   No   | client |
|    4     |   No   | client |
|    10    |   No   | driver |
|    11    |   No   | driver |
|    12    |   No   | driver |
|    13    |   No   | driver |
+----------+--------+--------+

Write a SQL query to find the cancellation rate of requests made by unbanned users between Oct 1, 2013 and Oct 3, 2013. For the above tables, your SQL query should return the following rows with the cancellation rate being rounded to two decimal places.

+------------+-------------------+
|     Day    | Cancellation Rate |
+------------+-------------------+
| 2013-10-01 |       0.33        |
| 2013-10-02 |       0.00        |
| 2013-10-03 |       0.50        |
+------------+-------------------+

Solution

The solution looks like that:

select 
	result.Request_at as "Day",
	round(sum(case when result.Status = 'completed' then 0 else 1 end) / count(*), 2) as "Cancellation Rate"
from (
    select 
		Driver_Id,
		Status,
		Request_at
    from trips left join users on trips.client_id=users.users_id
    where users.banned = 'NO'
     ) result 
	left join users on result.driver_id=users.users_id 
where 
	users.Banned ='NO' 
	and result.Request_at between '2013-10-01' and '2013-10-03'
group by 
	result.Request_at

Duplicate Emails [Easy]

Problem description

Write a SQL query to find all duplicate emails in a table named Person.

Table: Customers.

+----+---------+
| Id | Email   |
+----+---------+
| 1  | a@b.com |
| 2  | c@d.com |
| 3  | a@b.com |
+----+---------+

For example, your query should return the following for the above table:

+---------+
| Email   |
+---------+
| a@b.com |
+---------+

Note: All emails are in lowercase.

Solution

Since all email are in lowercase we can simply groupby email and print those that have a count >1.

SELECT EMAIL
FROM PERSON
GROUP BY EMAIL
HAVING COUNT(*)>1

Rising Temperature [Easy]

Problem description

Given a Weather table, write a SQL query to find all dates' Ids with higher temperature compared to its previous (yesterday's) dates.

+---------+------------------+------------------+
| Id(INT) | RecordDate(DATE) | Temperature(INT) |
+---------+------------------+------------------+
|       1 |       2015-01-01 |               10 |
|       2 |       2015-01-02 |               25 |
|       3 |       2015-01-03 |               20 |
|       4 |       2015-01-04 |               30 |
+---------+------------------+------------------+

For example, return the following Ids for the above Weather table:

+----+
| Id |
+----+
|  2 |
|  4 |
+----+

Solution

The solution is to join the table to itself when the dates differ by one day (DATEDIFF() function) and make sure that the temperature is higher than the previous date.

SELECT W1.ID
FROM WEATHER W1 INNER JOIN WEATHER W2 ON DATEDIFF(W1.RecordDate, W2.RecordDate) = 1
WHERE W1.Temperature > W2.Temperature

Employees Earning More Than Their Managers [Easy]

Problem description

The Employee table holds all employees including their managers. Every employee has an Id, and there is also a column for the manager Id.

+----+-------+--------+-----------+
| Id | Name  | Salary | ManagerId |
+----+-------+--------+-----------+
| 1  | Joe   | 70000  | 3         |
| 2  | Henry | 80000  | 4         |
| 3  | Sam   | 60000  | NULL      |
| 4  | Max   | 90000  | NULL      |
+----+-------+--------+-----------+

Given the Employee table, write a SQL query that finds out employees who earn more than their managers. For the above table, Joe is the only employee who earns more than his manager.

+----------+
| Employee |
+----------+
| Joe      |
+----------+

Solution

The solution is to join again the table to itself as shown below:

SELECT E1.NAME AS EMPLOYEE
FROM EMPLOYEE E1 INNER JOIN EMPLOYEE E2 ON E1.MANAGERID=E2.ID
WHERE E1.SALARY>E2.SALARY

Department Highest Salary [Medium]

Problem description

The Employee table holds all employees. Every employee has an Id, a salary, and there is also a column for the department Id.

+----+-------+--------+--------------+
| Id | Name  | Salary | DepartmentId |
+----+-------+--------+--------------+
| 1  | Joe   | 70000  | 1            |
| 2  | Jim   | 90000  | 1            |
| 3  | Henry | 80000  | 2            |
| 4  | Sam   | 60000  | 2            |
| 5  | Max   | 90000  | 1            |
+----+-------+--------+--------------+

The Department table holds all departments of the company.

+----+----------+
| Id | Name     |
+----+----------+
| 1  | IT       |
| 2  | Sales    |
+----+----------+

Write a SQL query to find employees who have the highest salary in each of the departments. For the above tables, your SQL query should return the following rows (order of rows does not matter).

+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT         | Max      | 90000  |
| IT         | Jim      | 90000  |
| Sales      | Henry    | 80000  |
+------------+----------+--------+

Solution

The solution looks like that:

select 
    department.name as Department,
	result.Name as Employee,
	result.Salary as Salary
from
    (select 
		e1.DepartmentId,
		e1.Name,e1.Salary
    from 
		Employee e1 left join Employee e2 on e1.Salary <= e2.Salary 
        and e1.DepartmentId = e2.DepartmentId 
    group by 
        e1.DepartmentId, 
        e1.Salary,
        e1.Name
    having count(distinct e2.Salary) = 1) 
   result join department on result.DepartmentId=department.id

The solution doesn't seem to be very easy to follow. In that case let's have a closer look at the query:

First let's try to understand from the innermost sub-query and expand our understanding outward.

select 
	e1.DepartmentId,
	e1.Name,
	e1.Salary
from 
	Employee e1 left join Employee e2 on e1.Salary <= e2.Salary 
	and e1.DepartmentId = e2.DepartmentId 
group by 
    e1.DepartmentId, 
    e1.Salary,
    e1.Name
having count(distinct e2.Salary) = 1

Now this is a self-join query, let's check the output without the group by + having clause:

     select 
            e1.DepartmentId,e1.Name,e1.Salary,e2.Salary
     from 
            Employee e1 left join Employee e2 
            on e1.Salary <= e2.Salary 
            and e1.DepartmentId = e2.DepartmentId 
+--------------+-------+--------+--------+
| DepartmentId | Name  | Salary | Salary | 
+--------------+-------+--------+--------+
|     1        | Joe   | 70000  | 90000  |
|     1        | Joe   | 70000  | 90000  |
|     1        | Joe   | 70000  | 70000  |
|     1        | Jim   | 90000  | 90000  |
|     1        | Jim   | 90000  | 90000  |
|     1        | Max   | 90000  | 90000  |
|     1        | Max   | 90000  | 90000  |
|     2        | Henry | 80000  | 80000  |
|     2        | Sam   | 60000  | 80000  |
|     2        | Sam   | 60000  | 60000  |

So now by adding the group by + having clause we get

 select 
	e1.DepartmentId,
	e1.Name,
	e1.Salary
from 
	Employee e1 left join Employee e2 on e1.Salary <= e2.Salary 
	and e1.DepartmentId = e2.DepartmentId 
group by 
    e1.DepartmentId, 
    e1.Salary,
    e1.Name
having count(distinct e2.Salary) = 1
+--------------+-------+--------+
| DepartmentId | Name  | Salary | 
+--------------+-------+--------+
|     1        | Max   | 90000  |
|     1        | Jim   | 90000  |
|     2        | Henry | 80000  |

The rest is simple we just do a join in order to fetch the Department name from the Department Table by joining on DepartmentID.

Not Boring Movies [Easy]

Problem description

X city opened a new cinema, many people would like to go to this cinema. The cinema also gives out a poster indicating the movies’ ratings and descriptions.

Please write a SQL query to output movies with an odd numbered ID and a description that is not 'boring'. Order the result by rating.

For example, table cinema:

+---------+-----------+--------------+-----------+
|   id    | movie     |  description |  rating   |
+---------+-----------+--------------+-----------+
|   1     | War       |   great 3D   |   8.9     |
|   2     | Science   |   fiction    |   8.5     |
|   3     | irish     |   boring     |   6.2     |
|   4     | Ice song  |   Fantacy    |   8.6     |
|   5     | House card|   Interesting|   9.1     |
+---------+-----------+--------------+-----------+

For the example above, the output should be:

+---------+-----------+--------------+-----------+
|   id    | movie     |  description |  rating   |
+---------+-----------+--------------+-----------+
|   5     | House card|   Interesting|   9.1     |
|   1     | War       |   great 3D   |   8.9     |
+---------+-----------+--------------+-----------+

Solution

The solution looks like that:

SELECT *
FROM CINEMA
WHERE ID%2=1 AND DESCRIPTION != 'BORING'
ORDER BY RATING DESC

Nth Highest Salary [Medium]

Problem description

Write a SQL query to get the nth highest salary from the Employee table.

+----+--------+
| Id | Salary |
+----+--------+
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |
+----+--------+

For example, given the above Employee table, the nth highest salary where n = 2 is 200. If there is no nth highest salary, then the query should return null.

+------------------------+
| getNthHighestSalary(2) |
+------------------------+
| 200                    |
+------------------------+

Solution

The solution looks pretty similar with the one presented above for the Department Highest Salary problem:

CREATE FUNCTION getNthHighestSalary(N INT) RETURNS INT
BEGIN
  RETURN (
      SELECT E1.SALARY
      FROM EMPLOYEE E1 LEFT JOIN EMPLOYEE E2 ON E1.SALARY <= E2.SALARY
      GROUP BY E1.SALARY
      HAVING COUNT(DISTINCT E2.SALARY) = N
      
  );
END

Conclusion

Hope you find it interesting and please remember that probably every data scientist interview have at least one SQL question. Be prepared!!

If you like this article you can found the part 1 here.

Thanks for reading and I am looking forward to hearing your questions :)
Stay tuned and Happy Coding.


References