# Relation - Data Operator (Relational Algebra)

Relational algebra defines the relational database through a set of data operators (select, filter, join, sort, union, etc.) that defines an intermediate format for query planning/optimization.

Relational algebra is an extension to mathematical set theory, it was devised by E.F. Code (IBM) in 1970.

where:

Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation.

All operations on a table return a table. The operation can then be chained together.

Relational algebra expressions dictate how to achieve an answer by giving what operations to do and in what order to do them. A declarative language only expresses conditions that must be met in order for a result to be an answer, not how to get that answer.

For instance, SQL is the what not the how. It's clear what we want unclear how to get it

Every query returns a table. The language is “algebraically closed” A view is stored as an algebraic closure so it can be optimized when used with queries.

## 3 - Order

By design, query results in relational algebra are unordered. Repeating the same query multiple times is not guaranteed to return results in the same order. Similarly, database-backed relational data also do not guarantee row order.

## 4 - Relation Operator Algebra

Type Name Symbol Sql Operator
Normal select, Selection $\sigma$ of s
Normal project $\pi$
Normal join $\Join$
Normal union $\cup$ union (without duplicates)
of union all (with duplicates)
Normal intersection $\cap$
Normal difference - except, minus
Extended Grouping and aggregating g
Extended Duplicate elimination d
Extended Sorting t

### 4.1 - Intersection

• Derived operator using minus

$R1 \cap R2 = R1 - ( R1 - R2 )$

This expression $( R1 - R2 )$ returns anything that is only in R1

• Derived using join

$R1 \cap R2 = R1 x R2$

### 4.2 - Selection

condition: =, <, >, =<, >=, <>, …

### 4.3 - Project

A project operator keeps only a set of columns (in other words eliminates the others).

### 4.4 - Cross Product

$R1 x R2$

The size of this cross-join is then the size of R1 multiplied by the size of R2.

Find all pairs of similar images/tweets/songs: Compute the cross product, then compute a similarity function f(x,,x2) for every possible pair

### 4.5 - Equi-join

Equi-join is a join where the condition is an equality.

Equality Notation:

SELECT *
FROM R1 JOIN R2
ON R1.A = R2.A

Cross product notation:

SELECT *
FROM R1, R2
WHERE
R1.A = R2.A

### 4.6 - Theta-join

A theta-join is a difficult join where generally the condition is not a equality. Band join or range join.

It's not because you don't have any join key in the sql that you don't have a join physically.

• Example 1: Find all the hospitals within 5 miles of a school

SELECT DISTINCT h.name
FROM hospitals h, schools s
WHERE
distance(h.location,s.location) < 5
• Example 2: Find all user clicks made within 5 seconds of page load

SELECT *
WHERE
abs(c.click_time - p.load_time) < 5

## 5 - Bag vs Set

Relational Algebra has two semantics:

• Set semantics = standard Relational Algebra
• Bag semantics = extended Relational Algebra

Rule of thumb:

• Every paper will assume set semantics
• Every implementation will assume bag semantics

## 8 - Documentation / Reference

• Bill Howe, PhD (Director of Research, Scalable Data Analytics) University of Washington - eScience Institute