As we already know Data Science/Machine Learning is such a multi-disciplinary field means that the sheer volume of material you need to feel properly prepared can become overwhelming.The sure thing is that will involve some SQL technical questions as Data Scientists very often query some databases.

In this article, I will discuss some SQL interview questions which a Data Scientist should revise before going on to a technical interview.

## Table of Contents

• Customers Who Never Order [Easy]
• Consecutive Numbers [Medium]
• Rank Scores [Medium]
• Department Top Three Salaries [Hard]
• Conclusion
• References

## Customers Who Never Order [Easy]

### Problem description

Suppose that a website contains two tables, the Customers table and the Orders table. Write a SQL query to find all customers who never order anything.

Table: Customers.

+----+-------+
| Id | Name  |
+----+-------+
| 1  | Joe   |
| 2  | Henry |
| 3  | Sam   |
| 4  | Max   |
+----+-------+

Table: Orders.

+----+------------+
| Id | CustomerId |
+----+------------+
| 1  | 3          |
| 2  | 1          |
+----+------------+

Using the above tables as example, return the following:

+-----------+
| Customers |
+-----------+
| Henry     |
| Max       |
+-----------+

### Solution

If we have a list of customers who have ever ordered, it will be easy to know who never ordered. Thus:

SELECT Name as 'Customers'
FROM Customers
WHERE ID not IN (SELECT CustomerId FROM Orders) -- the above subquery gives a list
-- of customers who never ordered.


Another solution would be to do a left join of the Customers Table with the Orders table and display the rows which have NAN values. Those rows would be the ones that belong to Customers Table but not to the Order table.

SELECT Name as 'Customers'
FROM Customers c LEFT JOIN Orders o ON c.Id=o.CustomerId
WHERE CustomerId is NULL

## Consecutive Numbers [Medium]

### Problem description

Write a SQL query to find all numbers that appear at least three times consecutively.

+----+-----+
| Id | Num |
+----+-----+
| 1  |  1  |
| 2  |  1  |
| 3  |  1  |
| 4  |  2  |
| 5  |  1  |
| 6  |  2  |
| 7  |  2  |
+----+-----+

For example, given the above Logs table, 1 is the only number that appears consecutively for at least three times.

+-----------------+
| ConsecutiveNums |
+-----------------+
| 1               |
+-----------------+

### Solution

The solution is to join the table to itself three times and make sure the Id field is consecutive and the Num is the same.

select distinct l3.Num as 'ConsecutiveNums'
FROM Logs l3 inner join
(SELECT l1.Id as Id,l1.Num as Num
FROM Logs l1 inner join Logs l2 ON l1.Id=l2.Id+1 AND l1.Num=l2.Num) doubles
ON l3.Id = doubles.Id+1 and l3.Num = doubles.Num

## Rank Scores [Medium]

### Problem description

Write a SQL query to rank scores. If there is a tie between two scores, both should have the same ranking. Note that after a tie, the next ranking number should be the next consecutive integer value. In other words, there should be no "holes" between ranks.

+----+-------+
| Id | Score |
+----+-------+
| 1  | 3.50  |
| 2  | 3.65  |
| 3  | 4.00  |
| 4  | 3.85  |
| 5  | 4.00  |
| 6  | 3.65  |
+----+-------+

For example, given the above Scores table, your query should generate the following report (order by highest score):

+-------+------+
| Score | Rank |
+-------+------+
| 4.00  | 1    |
| 4.00  | 1    |
| 3.85  | 2    |
| 3.65  | 3    |
| 3.65  | 3    |
| 3.50  | 4    |
+-------+------+

### Solution

The solution looks like that:

SELECT Scores.score as 'Score', count(ranking.Score) as 'Rank'
FROM Scores left join (select distinct Score FROM Scores) ranking ON Scores.Score <= ranking.Score
GROUP BY Scores.ID
ORDER BY Rank asc

The solution doesn't seem to be very easy to follow. In that case let's have a closer look at an example of expected input and output:

INPUT
+----+-------+
| Id | Score |
+----+-------+
| 1  | 3.50  |
| 2  | 3.65  |
| 3  | 4.00  |
| 4  | 3.85  |
| 5  | 4.00  |
| 6  | 3.65  |
+----+-------+

OUTPUT
+-------+------+
| Score | Rank |
+-------+------+
| 4.00  | 1    |
| 4.00  | 1    |
| 3.85  | 2    |
| 3.65  | 3    |
| 3.65  | 3    |
| 3.50  | 4    |
+-------+------+

So, the task is to group all identical scores and then order them from largest to smallest. Let's see how the solution works step by step. First, it creates a helper table called ranking - note (select distinct Score from scores) ranking. Its content will be:

+----+--+
| Score |
+----+--+
| 3.50  |
| 3.65  |
| 4.00  |
| 3.85  |
+----+--+

Notice how all duplicate scores have been eliminated (that is the purpose of distinct keyword). Next, there is a left join between tables ranking and scores where we join each record from scores table with all records from ranking table that have greater or equal score.

SELECT *FROM Scores left join (select distinct Score FROM Scores) ranking ON Scores.Score <= ranking.Score

So, the result of this mid-phase would be:

+----+-------+---------------+
| Id | Score | ranking.Score |
+----+-------+---------------+
| 1  | 3.50  | 3.50          |
| 1  | 3.50  | 3.65          |
| 1  | 3.50  | 3.85          |
| 1  | 3.50  | 4.00          |
| 2  | 3.65  | 3.65          |
| 2  | 3.65  | 3.85          |
| 2  | 3.65  | 4.00          |
| 3  | 4.00  | 4.00          |
| 4  | 3.85  | 3.85          |
| 4  | 3.85  | 4.00          |
| 5  | 4.00  | 4.00          |
| 6  | 3.65  | 3.65          |
| 6  | 3.65  | 3.85          |
| 6  | 3.65  | 4.00          |
+----+-------+---------------+

Next comes group by which groups all records with same Id into one record. Since in the select part we have count(ranking.Score), the result of grouping will be the count of different ranking scores for each Id. And since we joined from ranking only those scores that are greater or equal than the original score, this count will give the requested ranking. We are almost done:

+----+-------+--------------+-------+
| Id | count(ranking.Score) | Score |
+----+-------+--------------+-------+
| 1  |       4              | 3.50  |
| 2  |       3              | 3.65  |
| 3  |       1              | 4.00  |
| 4  |       2              | 3.85  |
| 5  |       1              | 4.00  |
| 6  |       3              | 3.65  |
+----+-------+--------+-------------+

Now the easiest part - order by which orders the results by Rank. Since the select does not include Id, that column is omitted and we get the final result.

ðŸš€ For people who like video courses and want to kick-start a career in data science today, I highly recommend the below video course from Udacity:

ðŸ“š While for book lovers:

## Department Top Three Salaries [Hard]

### Problem description

The Employee table holds all employees. Every employee has an Id, and there is also a column for the department Id.

+----+-------+--------+--------------+
| Id | Name  | Salary | DepartmentId |
+----+-------+--------+--------------+
| 1  | Joe   | 85000  | 1            |
| 2  | Henry | 80000  | 2            |
| 3  | Sam   | 60000  | 2            |
| 4  | Max   | 90000  | 1            |
| 5  | Janet | 69000  | 1            |
| 6  | Randy | 85000  | 1            |
| 7  | Will  | 70000  | 1            |
+----+-------+--------+--------------+

The Department table holds all departments of the company.

+----+----------+
| Id | Name     |
+----+----------+
| 1  | IT       |
| 2  | Sales    |
+----+----------+

Write a SQL query to find employees who earn the top three salaries in each of the department. For the above tables, your SQL query should return the following rows (order of rows does not matter).

+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT         | Max      | 90000  |
| IT         | Randy    | 85000  |
| IT         | Joe      | 85000  |
| IT         | Will     | 70000  |
| Sales      | Henry    | 80000  |
| Sales      | Sam      | 60000  |
+------------+----------+--------+

Explanation:

In IT department, Max earns the highest salary, both Randy and Joe earn the second highest salary, and Will earns the third highest salary. There are only two employees in the Sales department, Henry earns the highest salary while Sam earns the second highest salary.

### Solution

The solution looks like that:

select
Department.Name as 'Department',
result.Name as 'Employee',
result.Salary
FROM
(select
e1.DepartmentId,e1.Name,e1.Salary
from
Employee e1 left join Employee e2 on e1.Salary <= e2.Salary and e1.DepartmentId = e2.DepartmentId
group by
e1.DepartmentId, e1.Salary,e1.Name
having
count(distinct e2.Salary) <= 3) result)
Join
Department on result.DepartmentId = Department.Id
ORDER BY
Department.Name, result.Salary

The solution doesn't seem to be very easy to follow. In that case let's have a closer look at the query:

First let's try to understand from the innermost sub-query and expand our understanding outward.

     select
e1.DepartmentId,e1.Name,e1.Salary
from
Employee e1 left join Employee e2
on e1.Salary <= e2.Salary
and e1.DepartmentId = e2.DepartmentId
group by
e1.DepartmentId, e1.Salary,e1.Name
having
count(distinct e2.Salary) <= 3

Now this is a self-join query, let's check the output without the group by + having clause:

     select
e1.DepartmentId,e1.Name,e1.Salary,e2.Salary
from
Employee e1 left join Employee e2
on e1.Salary <= e2.Salary
and e1.DepartmentId = e2.DepartmentId 
+--------------+-------+--------+--------+
| DepartmentId | Name  | Salary | Salary |
+--------------+-------+--------+--------+
|     1        | Joe   | 85000  | 85000  |
|     1        | Joe   | 85000  | 90000  |
|     1        | Joe   | 85000  | 85000  |
|     1        | Janet | 69000  | 85000  |
|     1        | Janet | 69000  | 90000  |
|     1        | Janet | 69000  | 69000  |
|     1        | Janet | 69000  | 85000  |
|     1        | Janet | 69000  | 70000  |
|     1        | Randy | 85000  | 85000  |
|     1        | Randy | 85000  | 90000  |
|     1        | Randy | 85000  | 85000  |
|     1        | Will  | 70000  | 85000  |
|     1        | Will  | 70000  | 90000  |
|     1        | Will  | 70000  | 85000  |
|     1        | Will  | 70000  | 70000  |
|     1        | Max   | 90000  | 90000  |
|     2        | Sam   | 60000  | 60000  |
|     2        | Sam   | 60000  | 80000  |
|     2        | Henry | 80000  | 80000  |

So now by adding the group by + having clause we get

 select
e1.DepartmentId,e1.Name,e1.Salary
from
Employee e1 left join Employee e2
on e1.Salary <= e2.Salary
and e1.DepartmentId = e2.DepartmentId
group by
e1.DepartmentId, e1.Salary,e1.Name
having
count(distinct e2.Salary) <= 3
+--------------+-------+--------+
| DepartmentId | Name  | Salary |
+--------------+-------+--------+
|     1        | Joe   | 85000  |
|     1        | Randy | 85000  |
|     1        | Will  | 70000  |
|     1        | Max   | 90000  |
|     2        | Sam   | 60000  |
|     2        | Henry | 80000  |

The rest is simple we just do a join in order to fetch the Department name from the Department Table by joining on DepartmentID.

## Conclusion

Hope you find it interesting and please remember that probably every data scientist interview have at least one SQL question. Be prepared!!

Thanks for reading and I am looking forward to hearing your questions :)
Stay tuned and Happy Coding.