return arrow  Return to Community

SQL Tip: Use the Database Layer

By Vincent Brandon, Data Coordinator

January 2, 2020

code

The rise in popularity of programming languages for analytics such as Python, R, and Julia have led many to do basic aggregation statistics client side. While this works for smaller files of a few hundred thousand records, this will not translate well to larger datasets. While there is a wealth of support for loading and lazy execution on datasets that can’t fit into memory, most researchers will waste more time setting them up than running early data exploration at the database level.

Also, imagine that the data stored in the database is ground truth, not the csv saved out two weeks ago. The most common data exploration statistics are already implemented or can be easily calculated in most any modern RDBMS (relational database management system).

GROUP BY + HAVING Logic for Faster Filtering:

It can be helpful to understand the order of function application in SQL. It’s not always first come first serve. SQL has a lot of tricks to optimize massively powerful filters. One of the most common is GROUP BY. View a demo of the code in Python here.

Say you have a list of students and want them grouped by their age.

SELECT

Age, AVG(GPA), MODE(Status)

FROM

               Students

GROUP BY

               Age;

Note the MODE() function. This function is available in recent versions of Oracle, PostgreSQL and other modern RDMS platforms. Well, what if you’re testing in SQLite? We can build powerful logic at the database level via HAVING conditions to groups generated in GROUP BY.

Let’s first create a view that looks at Status and Age specifically.

WITH STATUS_VIEW AS (

SELECT

Age, Status, COUNT(Status) AS Count

FROM

Students

GROUP BY

Status, Age

ORDER BY

AGE DESC).

** Clusters data by Age group and status, then orders the view by Age (Not a necessary step, but useful for visually validating results in this example).

Now, note that the Mode of any series of values is the element where COUNT is equal to the MAX of all elements’ counts. We can write that logic in with a HAVING statement.

SELECT

sv.Age, sv.Status

FROM

STATUS_VIEW as sv

GROUP BY

sv.Age

HAVING

sv.Count = MAX(sv.Count)

HAVING will operate on each group individually, giving us the row information where the maximum is found. We can also find the least frequent (inverse mode) in much the same way by changing MAX to MIN!

SELECT

sv.Age, sv.Status

FROM

STATUS_VIEW as sv

GROUP BY

sv.Age

HAVING

sv.Count = MIN(sv.Count)

In the next SQL tips blog post, we’ll look at calculating frequentist statistics such as confidence intervals and binning data to make histograms.



Utah Data Research Center
140 East 300 South, Salt Lake City, UT 84111 - Email: UDRC@utah.gov
Utah.gov Home | Terms of Use | Privacy Policy | Accessibility Policy
©2019 State of Utah