As we already know Data Science/Machine Learning is such a multi-disciplinary field means that the sheer volume of material you need to feel properly prepared can become overwhelming.The sure thing is that will involve some SQL technical questions as Data Scientists very often query some databases.

In this article, I will discuss some SQL interview questions which a Data Scientist should revise before going on to a technical interview.


Table of Contents

  • Customers Who Never Order [Easy]
  • Consecutive Numbers [Medium]
  • Rank Scores [Medium]
  • Department Top Three Salaries [Hard]
  • Conclusion
  • References
Photo by Kalen Emsley / Unsplash

Customers Who Never Order [Easy]

Problem description

Suppose that a website contains two tables, the Customers table and the Orders table. Write a SQL query to find all customers who never order anything.

Table: Customers.

+----+-------+
| Id | Name  |
+----+-------+
| 1  | Joe   |
| 2  | Henry |
| 3  | Sam   |
| 4  | Max   |
+----+-------+

Table: Orders.

+----+------------+
| Id | CustomerId |
+----+------------+
| 1  | 3          |
| 2  | 1          |
+----+------------+

Using the above tables as example, return the following:

+-----------+
| Customers |
+-----------+
| Henry     |
| Max       |
+-----------+

Solution

If we have a list of customers who have ever ordered, it will be easy to know who never ordered. Thus:

SELECT Name as 'Customers'
FROM Customers
WHERE ID not IN (SELECT CustomerId FROM Orders) -- the above subquery gives a list 
-- of customers who never ordered.

Another solution would be to do a left join of the Customers Table with the Orders table and display the rows which have NAN values. Those rows would be the ones that belong to Customers Table but not to the Order table.

SELECT Name as 'Customers'
FROM Customers c LEFT JOIN Orders o ON c.Id=o.CustomerId
WHERE CustomerId is NULL

Consecutive Numbers [Medium]

Problem description

Write a SQL query to find all numbers that appear at least three times consecutively.

+----+-----+
| Id | Num |
+----+-----+
| 1  |  1  |
| 2  |  1  |
| 3  |  1  |
| 4  |  2  |
| 5  |  1  |
| 6  |  2  |
| 7  |  2  |
+----+-----+

For example, given the above Logs table, 1 is the only number that appears consecutively for at least three times.

+-----------------+
| ConsecutiveNums |
+-----------------+
| 1               |
+-----------------+

Solution

The solution is to join the table to itself three times and make sure the Id field is consecutive and the Num is the same.

select distinct l3.Num as 'ConsecutiveNums'
FROM Logs l3 inner join
(SELECT l1.Id as Id,l1.Num as Num
FROM Logs l1 inner join Logs l2 ON l1.Id=l2.Id+1 AND l1.Num=l2.Num) doubles
ON l3.Id = doubles.Id+1 and l3.Num = doubles.Num

Rank Scores [Medium]

Problem description

Write a SQL query to rank scores. If there is a tie between two scores, both should have the same ranking. Note that after a tie, the next ranking number should be the next consecutive integer value. In other words, there should be no "holes" between ranks.

+----+-------+
| Id | Score |
+----+-------+
| 1  | 3.50  |
| 2  | 3.65  |
| 3  | 4.00  |
| 4  | 3.85  |
| 5  | 4.00  |
| 6  | 3.65  |
+----+-------+

For example, given the above Scores table, your query should generate the following report (order by highest score):

+-------+------+
| Score | Rank |
+-------+------+
| 4.00  | 1    |
| 4.00  | 1    |
| 3.85  | 2    |
| 3.65  | 3    |
| 3.65  | 3    |
| 3.50  | 4    |
+-------+------+

Solution

The solution looks like that:

SELECT Scores.score as 'Score', count(ranking.Score) as 'Rank'
FROM Scores left join (select distinct Score FROM Scores) ranking ON Scores.Score <= ranking.Score
GROUP BY Scores.ID
ORDER BY Rank asc

The solution doesn't seem to be very easy to follow. In that case let's have a closer look at an example of expected input and output:

INPUT
+----+-------+
| Id | Score |
+----+-------+
| 1  | 3.50  |
| 2  | 3.65  |
| 3  | 4.00  |
| 4  | 3.85  |
| 5  | 4.00  |
| 6  | 3.65  |
+----+-------+

OUTPUT
+-------+------+
| Score | Rank |
+-------+------+
| 4.00  | 1    |
| 4.00  | 1    |
| 3.85  | 2    |
| 3.65  | 3    |
| 3.65  | 3    |
| 3.50  | 4    |
+-------+------+

So, the task is to group all identical scores and then order them from largest to smallest. Let's see how the solution works step by step. First, it creates a helper table called ranking - note (select distinct Score from scores) ranking. Its content will be:

+----+--+
| Score |
+----+--+
| 3.50  |
| 3.65  |
| 4.00  |
| 3.85  |        
+----+--+

Notice how all duplicate scores have been eliminated (that is the purpose of distinct keyword). Next, there is a left join between tables ranking and scores where we join each record from scores table with all records from ranking table that have greater or equal score.

SELECT *FROM Scores left join (select distinct Score FROM Scores) ranking ON Scores.Score <= ranking.Score

So, the result of this mid-phase would be:

+----+-------+---------------+
| Id | Score | ranking.Score |
+----+-------+---------------+
| 1  | 3.50  | 3.50          |
| 1  | 3.50  | 3.65          |
| 1  | 3.50  | 3.85          |
| 1  | 3.50  | 4.00          |
| 2  | 3.65  | 3.65          |
| 2  | 3.65  | 3.85          |
| 2  | 3.65  | 4.00          |
| 3  | 4.00  | 4.00          |
| 4  | 3.85  | 3.85          |
| 4  | 3.85  | 4.00          |
| 5  | 4.00  | 4.00          |
| 6  | 3.65  | 3.65          |
| 6  | 3.65  | 3.85          |
| 6  | 3.65  | 4.00          |
+----+-------+---------------+

Next comes group by which groups all records with same Id into one record. Since in the select part we have count(ranking.Score), the result of grouping will be the count of different ranking scores for each Id. And since we joined from ranking only those scores that are greater or equal than the original score, this count will give the requested ranking. We are almost done:

+----+-------+--------------+-------+
| Id | count(ranking.Score) | Score |
+----+-------+--------------+-------+
| 1  |       4              | 3.50  |
| 2  |       3              | 3.65  |
| 3  |       1              | 4.00  |
| 4  |       2              | 3.85  |
| 5  |       1              | 4.00  |
| 6  |       3              | 3.65  |
+----+-------+--------+-------------+

Now the easiest part - order by which orders the results by Rank. Since the select does not include Id, that column is omitted and we get the final result.

Department Top Three Salaries [Hard]

Problem description

The Employee table holds all employees. Every employee has an Id, and there is also a column for the department Id.

+----+-------+--------+--------------+
| Id | Name  | Salary | DepartmentId |
+----+-------+--------+--------------+
| 1  | Joe   | 85000  | 1            |
| 2  | Henry | 80000  | 2            |
| 3  | Sam   | 60000  | 2            |
| 4  | Max   | 90000  | 1            |
| 5  | Janet | 69000  | 1            |
| 6  | Randy | 85000  | 1            |
| 7  | Will  | 70000  | 1            |
+----+-------+--------+--------------+

The Department table holds all departments of the company.

+----+----------+
| Id | Name     |
+----+----------+
| 1  | IT       |
| 2  | Sales    |
+----+----------+

Write a SQL query to find employees who earn the top three salaries in each of the department. For the above tables, your SQL query should return the following rows (order of rows does not matter).

+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT         | Max      | 90000  |
| IT         | Randy    | 85000  |
| IT         | Joe      | 85000  |
| IT         | Will     | 70000  |
| Sales      | Henry    | 80000  |
| Sales      | Sam      | 60000  |
+------------+----------+--------+

Explanation:

In IT department, Max earns the highest salary, both Randy and Joe earn the second highest salary, and Will earns the third highest salary. There are only two employees in the Sales department, Henry earns the highest salary while Sam earns the second highest salary.

Solution

The solution looks like that:

select 
    Department.Name as 'Department', 
    result.Name as 'Employee', 
    result.Salary
FROM 
    (select 
            e1.DepartmentId,e1.Name,e1.Salary
     from 
            Employee e1 left join Employee e2 on e1.Salary <= e2.Salary and e1.DepartmentId = e2.DepartmentId 
     group by 
            e1.DepartmentId, e1.Salary,e1.Name
     having 
            count(distinct e2.Salary) <= 3) result)
Join 
    Department on result.DepartmentId = Department.Id
ORDER BY 
    Department.Name, result.Salary

The solution doesn't seem to be very easy to follow. In that case let's have a closer look at the query:

First let's try to understand from the innermost sub-query and expand our understanding outward.

     select 
            e1.DepartmentId,e1.Name,e1.Salary
     from 
            Employee e1 left join Employee e2 
            on e1.Salary <= e2.Salary 
            and e1.DepartmentId = e2.DepartmentId 
     group by 
            e1.DepartmentId, e1.Salary,e1.Name
     having 
            count(distinct e2.Salary) <= 3

Now this is a self-join query, let's check the output without the group by + having clause:

     select 
            e1.DepartmentId,e1.Name,e1.Salary,e2.Salary
     from 
            Employee e1 left join Employee e2 
            on e1.Salary <= e2.Salary 
            and e1.DepartmentId = e2.DepartmentId 
+--------------+-------+--------+--------+
| DepartmentId | Name  | Salary | Salary | 
+--------------+-------+--------+--------+
|     1        | Joe   | 85000  | 85000  |
|     1        | Joe   | 85000  | 90000  |
|     1        | Joe   | 85000  | 85000  |
|     1        | Janet | 69000  | 85000  |
|     1        | Janet | 69000  | 90000  |
|     1        | Janet | 69000  | 69000  |
|     1        | Janet | 69000  | 85000  |
|     1        | Janet | 69000  | 70000  |
|     1        | Randy | 85000  | 85000  |
|     1        | Randy | 85000  | 90000  |
|     1        | Randy | 85000  | 85000  |
|     1        | Will  | 70000  | 85000  |
|     1        | Will  | 70000  | 90000  |
|     1        | Will  | 70000  | 85000  |
|     1        | Will  | 70000  | 70000  |
|     1        | Max   | 90000  | 90000  |
|     2        | Sam   | 60000  | 60000  |
|     2        | Sam   | 60000  | 80000  |
|     2        | Henry | 80000  | 80000  |

So now by adding the group by + having clause we get

 select 
        e1.DepartmentId,e1.Name,e1.Salary
 from 
        Employee e1 left join Employee e2 
        on e1.Salary <= e2.Salary 
        and e1.DepartmentId = e2.DepartmentId 
 group by 
        e1.DepartmentId, e1.Salary,e1.Name
 having 
        count(distinct e2.Salary) <= 3
+--------------+-------+--------+
| DepartmentId | Name  | Salary | 
+--------------+-------+--------+
|     1        | Joe   | 85000  |
|     1        | Randy | 85000  |
|     1        | Will  | 70000  |
|     1        | Max   | 90000  |
|     2        | Sam   | 60000  | 
|     2        | Henry | 80000  |

The rest is simple we just do a join in order to fetch the Department name from the Department Table by joining on DepartmentID.

Conclusion

Hope you find it interesting and please remember that probably every data scientist interview have at least one SQL question. Be prepared!!

Thanks for reading and I am looking forward to hearing your questions :)
Stay tuned and Happy Coding.


References