Useful and interesting BigQuery functions that you might not use

Lucas Misael
9 min read · Sep 7, 2021


Have you ever heard about BIG DATA? Do you know what this acclaimed concept means?

In a nutshell, BIG DATA describes scenarios where a large volume of widely varied data is produced and moved around. A few main characteristics define it; let's see:

VOLUME:

According to data from SBG (Social Good Brasil), the volume of data created in recent years is greater than the amount produced in the entire previous history of humanity, and it keeps growing exponentially; it is predicted that by 2023 we will have three times more connected devices than the global population.

SPEED:

Would you cross a street blindfolded if the last information you had was a photograph of moving traffic taken 5 minutes ago? Probably not: the photograph from 5 minutes ago is irrelevant, and you need to know the current conditions to cross the street safely (Forbes, 2012). The same logic applies to businesses: they need very current data about their operations, i.e. SPEED.

VARIETY:

Data comes from a variety of sources: social media, apps, emails, GPS, cookies, IoT, public databases, etc. These sources neither follow the same standards nor provide the same types of information, which makes compiling and organizing the data quite challenging.

VERACITY:

One in three leaders does not trust the data they receive (IBM). To reap good fruit from the Big Data process, you need truthful data that is in accordance with reality.

VALUE:

The greater the richness of data, the more important it is to know how to ask the right question at the beginning of every analysis process (Brown, Eric, 2014).

In other words, we need to transform a veritable tsunami of information into data that can actually be used in business.

Considering these points, especially the variety of data, it is clear that the way we used to work has to change: in the past it was not necessary to think about unstructured data. We now need tools to manage this data efficiently and quickly (it is worth remembering that nowadays data is the new oil of humanity!).

Given this scenario, many tools have been developed to handle this traffic of large volumes of data, such as the cloud services from Amazon, Microsoft, and Google.

Therefore, I have chosen one of these tools, Google Cloud Platform (GCP), to show some very cool and useful BigQuery functions that we perhaps don't use.

To start, what is BigQuery?

BigQuery is a fully managed, serverless data warehouse that enables scalable analysis of large volumes of data (we're even talking petabytes of data here 😳). It is a platform as a service that supports queries in our dear SQL, and it also has built-in machine learning capabilities.
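To give a first taste before we dive into the functions, here is a minimal sketch of a query against one of Google's public sample tables (I am assuming the bigquery-public-data.samples.shakespeare table, a classic public dataset, and a GCP project with BigQuery enabled):

```sql
-- Top 5 Shakespeare works by total word count, from a public sample table
SELECT
  corpus,
  SUM(word_count) AS total_words
FROM
  `bigquery-public-data.samples.shakespeare`
GROUP BY corpus
ORDER BY total_words DESC
LIMIT 5;
```

You can paste this straight into the BigQuery console to see results in seconds, with no servers to provision.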

Without a tool like this, gathering the information that lets business areas understand the results of their investments is slow, expensive, and unproductive.

Furthermore, the BigQuery database offers its users some advantages, such as:

Speed of Data retrieval:

Instead of operating multiple compute clusters and requiring each one to be managed individually, BigQuery distributes its compute resources dynamically. This reduces both the time it takes to examine data and the cost of building a system.

Another point: BigQuery is serverless, i.e. a fully cloud-based service where scalability and query speed are prioritized. This means you can scale and run analyses faster, even when working with petabytes of data.

It has a scalable pricing structure:

Being a serverless tool, BigQuery offers a variety of pricing options that are cost-effective for companies: its models are based on computing power and needs, and we can set up and run new workloads without having to pay the cost of full servers.

Artificial Intelligence to optimize your datasets:

One of the most interesting, and exceptionally useful, aspects of BigQuery is its ability to optimize your storage and datasets in the background. The platform uses artificial intelligence to continuously evaluate the data store.

After detecting patterns, BigQuery will use them to optimize data sets into structures that are better suited to the types of analysis and queries that users are regularly performing.

The result is a platform whose queries get faster the more you query it, thanks to these machine learning algorithms.

But without further ado, shall we look at some cool BigQuery functions?

IEEE_DIVIDE

Have you ever run into that annoying division-by-zero error? You probably have 😂

In BigQuery we can use the IEEE_DIVIDE function to prevent the division-by-zero error!

Let’s see an example:

Imagine you work for a delivery company and want to know the percentage of one-way runs within the total number of runs of each day; but on Wednesday, for some reason, there were no runs at all. There would certainly be a division-by-zero error!

But if we use the IEEE_DIVIDE function as in the example below, we will see that NaN is returned instead of an error.

WITH deliveries AS (
  SELECT 'Sat' AS Day, 1451 AS numrides, 1018 AS oneways
  UNION ALL
  SELECT 'Sun' AS Day, 2376 AS numrides, 936 AS oneways
  UNION ALL
  SELECT 'Wed' AS Day, 0 AS numrides, 0 AS oneways
)
SELECT
  *,
  ROUND(IEEE_DIVIDE(oneways, numrides), 2) AS frac_oneway
FROM deliveries;

SAFE

Another cool feature we can use to prevent an error from being returned by our query is the SAFE prefix: when a function would raise an error, it returns NULL on the corresponding row instead. Take a look at the example below:

SELECT SAFE.LOG(10, -3), SAFE.LOG(10, 3);
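Along the same lines, BigQuery also has a SAFE_DIVIDE function; unlike IEEE_DIVIDE, which returns NaN, SAFE_DIVIDE returns NULL when the divisor is zero. A minimal sketch:

```sql
SELECT
  SAFE_DIVIDE(10, 0) AS division_by_zero, -- returns NULL instead of an error
  SAFE_DIVIDE(10, 2) AS normal_division;  -- returns 5.0
```

Pick IEEE_DIVIDE or SAFE_DIVIDE depending on whether NaN or NULL is easier to handle downstream.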

RANGE_BUCKET

For the next example, imagine a situation: you work in HR and need to count employees by age group, in order to present the distribution of employees' ages to your boss. The age groups are:

Young: under 20 years old
Adults: 20 to 59 years old
Old Adults: over 59 years old

There is a very nice way to do this counting using RANGE_BUCKET: we pass the AGE column as the first parameter, and as the second parameter an array with the age boundaries we want to use:

WITH employees AS (
  SELECT 'Nicolas Cage' AS EMPLOYEE, 70 AS AGE
  UNION ALL
  SELECT 'Brad Pitt' AS EMPLOYEE, 40 AS AGE
  UNION ALL
  SELECT 'Angelina Jolie' AS EMPLOYEE, 35 AS AGE
  UNION ALL
  SELECT 'Sandra Bullock' AS EMPLOYEE, 25 AS AGE
  UNION ALL
  SELECT 'Samuel L. Jackson' AS EMPLOYEE, 19 AS AGE
  UNION ALL
  SELECT 'Bruce Willis' AS EMPLOYEE, 23 AS AGE
  UNION ALL
  SELECT 'Nicole Kidman' AS EMPLOYEE, 62 AS AGE
  UNION ALL
  SELECT 'Ann-Margret' AS EMPLOYEE, 58 AS AGE
  UNION ALL
  SELECT 'Elizabeth Taylor' AS EMPLOYEE, 35 AS AGE
  UNION ALL
  SELECT 'Audrey Hepburn' AS EMPLOYEE, 40 AS AGE
  UNION ALL
  SELECT 'Adam Sandler' AS EMPLOYEE, 34 AS AGE
  UNION ALL
  SELECT 'Robert De Niro' AS EMPLOYEE, 18 AS AGE
  UNION ALL
  SELECT 'Greta Garbo' AS EMPLOYEE, 53 AS AGE
)
SELECT
  RANGE_BUCKET(AGE, [20, 60]) AS age_group, -- 0: under 20, 1: 20 to 59, 2: 60 and over
  COUNT(*) AS quantity
FROM employees
GROUP BY 1;
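If the bucket numbers returned look opaque at first, it helps to know that RANGE_BUCKET simply returns how many boundaries in the sorted array are less than or equal to the value. A quick self-contained sketch (the boundaries here are arbitrary ones chosen for illustration):

```sql
SELECT
  RANGE_BUCKET(15, [20, 60]) AS bucket_a, -- 0: no boundary <= 15
  RANGE_BUCKET(25, [20, 60]) AS bucket_b, -- 1: one boundary (20) <= 25
  RANGE_BUCKET(70, [20, 60]) AS bucket_c; -- 2: both boundaries <= 70
```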

Cool right ? 😎

DATE_ADD and DATE_SUB

Have you ever needed to do mathematical operations with dates?

If so, I imagine you had a hard time developing a function to convert these dates and do the calculation; but did you know that BigQuery has very cool functions for exactly these calculations?

The DATE_ADD and DATE_SUB functions came to make our lives easier!
Take a look at these examples 😎

SELECT
  DATE_ADD(DATE(2008, 12, 25), INTERVAL 5 DAY) AS FIVE_DAYS_LATER,
  DATE_ADD(DATE(2008, 12, 25), INTERVAL 4 YEAR) AS FOUR_YEARS_LATER,
  TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE) AS TEN_MINUTES_LATER;

And to subtract the dates just follow this example:

SELECT
  DATE_SUB(DATE(2008, 12, 25), INTERVAL 5 DAY) AS FIVE_DAYS_BEFORE,
  DATE_SUB(DATE(2008, 12, 25), INTERVAL 4 YEAR) AS FOUR_YEARS_BEFORE,
  TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE) AS TEN_MINUTES_BEFORE;

DATE_DIFF and DATETIME_DIFF

If we need the days or hours between two dates, we can use the DATE_DIFF and DATETIME_DIFF functions.
Take a look at these examples:

SELECT
  DATE_DIFF(DATE(2010, 12, 25), DATE(2008, 9, 15), DAY) AS DIFF_DAYS,
  DATETIME_DIFF(CURRENT_DATETIME(), DATETIME(TIMESTAMP('2020-07-01 10:00:00')), MINUTE) AS DIFF_MINUTES;

EXTRACT

If you use or have ever used SUBSTRING to get a specific piece of a date, I have good news!

Your problems are over!

You can use the EXTRACT function to extract the desired part of your date 😎
Look at these examples:

SELECT
  DATE,
  EXTRACT(MONTH FROM DATE) AS MONTH,
  EXTRACT(DAY FROM DATE) AS DAY,
  EXTRACT(YEAR FROM DATE) AS YEAR,
  EXTRACT(DAYOFWEEK FROM DATE) AS DAY_OF_WEEK
FROM UNNEST(GENERATE_DATE_ARRAY('2016-10-05', '2017-10-08')) AS DATE
ORDER BY DATE;

LAST_DAY

We can also get the last day of the month, or even the last day of the year, relative to a specific date, as in the example below:

SELECT
  DATETIME_ADD(CURRENT_DATETIME(), INTERVAL 90 DAY) AS DATE_90_DAYS_LATER,
  LAST_DAY(DATETIME_ADD(CURRENT_DATETIME(), INTERVAL 90 DAY), MONTH) AS LAST_DAY_OF_MONTH_90_DAYS_LATER,
  LAST_DAY(DATETIME_ADD(CURRENT_DATETIME(), INTERVAL 90 DAY), YEAR) AS LAST_DAY_OF_YEAR_90_DAYS_LATER;

Well, we’ve talked enough about dates, let’s see some cool features about Geolocation ?

BigQuery Geo Viz

First, what is Geo Viz ?

BigQuery Geo Viz is a web tool for visualizing geospatial data in BigQuery using the Google Maps APIs. You can run a SQL query and display the results on an interactive map. Besides Geo Viz, we can visualize BigQuery geospatial data in Google Data Studio, Earth Engine, and even Jupyter Notebooks (through the GeoJSON extension).

ST_GEOGPOINT

To use Geo Viz we'll need ST_GEOGPOINT, which converts values from latitude and longitude columns into GEOGRAPHY points. Note that the function takes the longitude first: ST_GEOGPOINT(longitude, latitude).

Note: I am using a public dataset provided by Google, Citi Bike, to create these points.

Three columns in this table are relevant for this tutorial:

bike_stations.longitude: the longitude of a station. The values are valid WGS 84 longitudes in decimal degrees format.
bike_stations.latitude: the latitude of a station. The values are valid WGS 84 latitudes in decimal degrees format.
num_bikes_available: the number of bikes available for rent.

How to create a Geogpoint:

SELECT 
ST_GEOGPOINT(longitude, latitude) AS Station,
num_bikes_available
FROM
`bigquery-public-data.new_york.citibike_stations`
WHERE num_bikes_available > 10;

ST_MAKELINE

To create a line between two geographic points we can use the ST_MAKELINE function; this function needs an array with two ST_GEOGPOINT values to work.

To create a line between two points you can write this code:

SELECT
  ST_MAKELINE(ARRAY_AGG(Point)) AS Line
FROM
(
  SELECT ST_GEOGPOINT(YOUR_LONG, YOUR_LAT) AS Point
  UNION ALL
  SELECT ST_GEOGPOINT(YOUR_LONG, YOUR_LAT)
);

ST_DISTANCE

We can also calculate the distance (in meters) between two geographic points, using the ST_DISTANCE function:

SELECT
  ST_DISTANCE(
    ST_GEOGPOINT(YOUR_LONG, YOUR_LAT),
    ST_GEOGPOINT(YOUR_LONG, YOUR_LAT)
  ) AS Distance;
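As a concrete sketch, here is the same call with approximate coordinates I picked purely for illustration (two points in Rio de Janeiro; remember, longitude comes first):

```sql
SELECT
  ST_DISTANCE(
    ST_GEOGPOINT(-43.2105, -22.9519), -- Christ the Redeemer (approx.)
    ST_GEOGPOINT(-43.1823, -22.9711)  -- Copacabana beach (approx.)
  ) AS distance_in_meters;
```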

ST_MAKEPOLYGON

To create a polygon we use the ST_MAKEPOLYGON function; it expects a chain of geographic points, provided as a single linestring GEOGRAPHY, as in the code below:

SELECT
  ST_MAKEPOLYGON(ST_MAKELINE(ARRAY_AGG(point))) AS polygon
FROM
(
  SELECT ST_GEOGPOINT(YOUR_LONG, YOUR_LAT) AS point
  UNION ALL
  SELECT ST_GEOGPOINT(YOUR_LONG, YOUR_LAT) AS point
  UNION ALL
  SELECT ST_GEOGPOINT(YOUR_LONG, YOUR_LAT) AS point
);

ST_AREA

Using the polygons created, we can calculate area, distance, and other information about the polygon.

To calculate the area we will use the ST_AREA function, which returns the area in square meters covered by the polygons in the input GEOGRAPHY.

SELECT
  ST_AREA(ST_MAKEPOLYGON(ST_MAKELINE(ARRAY_AGG(point)))) AS Area
FROM
(
  SELECT ST_GEOGPOINT(YOUR_LONG, YOUR_LAT) AS point
  UNION ALL
  SELECT ST_GEOGPOINT(YOUR_LONG, YOUR_LAT) AS point
  UNION ALL
  SELECT ST_GEOGPOINT(YOUR_LONG, YOUR_LAT) AS point
);

Tip:

Google provides a very nice dataset for testing, and even for discovering new possibilities within BigQuery: the Citi Bike dataset.

If you want to explore more about this dataset, you can find it among the BigQuery public datasets (bigquery-public-data.new_york).

References:

Google Geography Functions

Os 5 V’s do Big Data — Universidade Federal do Rio de Janeiro

Os 5 V’s do Big Data — ELISA PAIXÃO

Forbes: Volume, Velocity, Variety: What You Need to Know About Big Data

BigQuery Geo Viz
