Learning SQL as Data Scientist — I

Mayur Jain
4 min readMay 29, 2020

When it comes to responsibilities, the role of Data Scientist is as mystical as the God’s existence in this world. I have been asked to write Flask API as DS, how to deploy model in large scale as DS like Docker and stuff, etc. So it’s a challenge to be Data Scientist, when roles and responsibilities is not clearly defined.

Data Analytics

Anyway, without being more vague, I’ll come to the point of SQL. SQL is one of the great employable skill to have in your arsenal while looking for Job as Data Scientist. I will not go through the topics like how to install sql and connecting to database server and creating databases etc.

But I will do teach, how to query in a database table, extracting the data, performing basic analytics and gain superifical insights before going indepth with pandas etc.

Let’s Learn SQL

There are a ton of businesses that use large, relational databases, which makes a basic understanding of SQL a great employable skill not only for data scientists, but for almost everyone.

Follow Link — Setting up SQL Server

After Setting up the server as mentioned in above link after adding dataset of your likeness to database table, we can go ahead without wasting any time, i will write down all the basic query structure of all basic queries as follows below

CREATING TABLE

CREATE TABLE table_name (col_1 datatype, col_2 datatype);

INSERTING INTO TABLE BY ORDER

INSERT INTO table_name VALUES (val_1, val_2);

INSERTING INTO TABLE BY NAME

INSERT INTO table_name (col_1, col_2) VALUES (val_1, val_2);

UPDATE A COLUMN VALUE

UPDATE table_name SET col_1 = val_1, col_2 = val_2 WHERE some_col= some_val;

ALTER TABLE TO ADD NEW COLUMN

ALTER TABLE table_name ADD col_name datatype;

DELETE A RECORD IN A TABLE

DELETE FROM table_name WHERE some_col = some_val;

Standard way of create database table with column constraints.

CREATE TABLE student ( id INTEGER PRIMARY KEY, name TEXT UNIQUE, grade INTEGER NOT NULL, age INTEGER DEFAULT 10 );

Column constraints are the rules applied to the values of individual columns:

  • PRIMARY KEY constraint can be used to uniquely identify the row.
  • UNIQUE columns have a different value for every row.
  • NOT NULL columns must have a value.
  • DEFAULT assigns a default value for the column when no value is specified.

There can be only one PRIMARY KEY column per table and multiple UNIQUE columns.

As a data scientist, if you work on a dataset in CSV, Excel, Json formats etc, you can load it into pandas and start working on it. Otherwise we would be given access to databases, to extract the data according to our requirements, taking fields(columns) for our analysis. Since databases has multiple table, it will be our task to extract data from multiple tables before making it as a full proof dataset with help queries like join, case etc.

I have taken course from Codeacademy to refresh my SQL and sharing the knowledge here. Here the database i am working on is movie database with different columns, which are straightforward like genre, ratings, name etc.

I will keep reference to pandas/python for analogy with SQL:

Similar to ‘re’ library in Python for regular expression, we have LIKE keyword in SQL:

LIKE

Selects all the movies with letter ‘se’ at beginning and at the end.

SELECT *
FROM movies
WHERE name LIKE ‘Se_en’;

Selects all the movies that starts with ‘the’ at front.
SELECT *
FROM movies
WHERE name LIKE ‘the %’;

Selects all the movies which contains the word ‘man’ any in name.

SELECT *
FROM movies
WHERE name LIKE ‘%man%’;

WHERE

Selects movies with rating greater than 8.

SELECT *
FROM movies
WHERE imdb_rating > 8;

NULL or NOT NULL in a column similar to isna() in pandas:

Selects all movies where ‘imdb_rating’ is null or not.

SELECT name
FROM movies
WHERE imdb_rating IS NULL;

SELECT name
FROM movies
WHERE imdb_rating IS NOT NULL;

BETWEEN command is similar to ‘filter’ in pandas:

Selects all movies whose initial letter starts with A till J, excluding J.
SELECT *
FROM movies
WHERE name BETWEEN ‘A’ AND ‘J’;

SELECT *
FROM movies
WHERE year BETWEEN 1970 AND 1979;

Multiple Conditions in a statement, similar to python if /else.

SELECT *
FROM movies
WHERE year < 1985
AND genre IS ‘horror’;

SELECT *
FROM movies
WHERE genre = ‘romance’
OR genre = ‘comedy’;

Note: Above the ‘=’ is not an assignment operator but checking if genre is comedy or not.

DISTINCT — Similar to unique in Pandas

SELECT DISTINCT genre
FROM movies;

ORDER BY — Similar to sort keyword in python. DESC — Descending order.

SELECT name, year, imdb_rating
FROM movies
ORDER BY imdb_rating DESC;

LIMIT and Order BY- Similar to sort and head keyword in python and pandas.

SELECT *
FROM movies
ORDER BY imdb_rating DESC
LIMIT 3;

CASE statement - Similar to if/elif/else in python. We are creating a new column Review based on rating.

SELECT name,
CASE
WHEN imdb_rating > 8 THEN ‘Fantastic’
WHEN imdb_rating > 6 THEN ‘Poorly Received’
ELSE ‘Avoid at All Costs’
END AS ‘Review’
FROM movies;

SELECT name,
CASE
WHEN genre = ‘romance’ THEN ‘Chill’
WHEN genre = ‘comedy’ THEN ‘Chill’
ELSE ‘Intense’
END AS ‘Mood’
FROM movies;

Let’s summarize:

SELECT is the clause we use every time we want to query information from a database.
AS renames a column or table.
DISTINCT return unique values is same as Unique
WHERE is a popular command that lets you filter the results of the query based on conditions that you specify.
LIKE and BETWEEN are special operators.
AND and OR combines multiple conditions.
ORDER BY sorts the result.
LIMIT specifies the maximum number of rows that the query will return.
CASE creates different outputs.

If you have learnt anything new here, then don't forget to clap👏”.

Connect with me in LinkedIn

--

--