Mongo Aggregations in 5 Minutes
At Universe, we use Mongo as a primary data store for our events, users, sessions, and more. The documents we store in our collections tend to be mostly uniform attribute-wise. We model relationships using the standard relationship_id approach implemented by most ORM/ODMs like Mongoid (a Ruby library).
To date, we’ve relied on the domain modelled in our application, a consumer-facing Rails app, to implement traversals of relationships between documents. Mongoid provides us with easy-to-use relationship helpers that make our best code concise and expressive.
Alas, the tale often told about ORM/ODMs is true: sometimes you just need to write a query. But wait, isn’t querying for relationships going to be a pain? Aha! It doesn’t have to be if you use the Mongo Aggregation Pipeline.
What are aggregations?
Aggregation operations process documents and return computed results. They group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result.
Aggregations can be used to apply a sequence of query-operations to the documents in a collection, reducing and transforming them. With aggregations, we can perform queries that offer similar functionality to the behaviors we might expect to see in a relational database query.
In Mongo, aggregations work as a pipeline: a list of operators/filters applied to the data. Operators come in three varieties: stages, expressions, and accumulators. When calling aggregate on a collection, we pass a list of stage operators. Documents are processed through the stages in sequence, with each stage applied to each document individually.
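To make the shape of a pipeline concrete, here is a minimal sketch. The collection name and fields (a hypothetical articles collection with a status field) are illustrative, not from a real schema:

```javascript
// Each object in the array is one stage, applied in order.
const pipeline = [
  { $match: { status: "published" } }, // stage 1: filter documents
  { $limit: 5 }                        // stage 2: cap the result size
];
// In the mongo shell: db.articles.aggregate(pipeline)
```

Every document flows through stage 1, and only the documents it passes reach stage 2.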
$match (where)
$match is a stage operator with this definition: { $match: { <query> } }
The syntax for query is identical to read operation query syntax. Ideally, you will use $match as early in the pipeline as possible:
Because $match limits the total number of documents in the aggregation pipeline, earlier $match operations minimize the amount of processing down the pipe.
In effect, with $match Mongo will filter the collection according to the query parameters, and pass only the documents matching the query through to the next stage of the pipeline. Take this example of “article” documents in a collection, each with an author, filtered by the name of the author:
We can apply other conditions and constraints by using an expression operator (one of the two other operator varieties, alongside stages, mentioned above). For example, we can pass $or to $match and then provide a list of matchable queries, one of which must be true in order for the document to be included by the filter. This is an example of a boolean aggregation.
We can also apply additional constraints such as comparisons, where document values must be greater than, less than, or equal to some other value. Here we use both $or and $gte in a $match stage:
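For instance (the score and views fields and their thresholds are made up for illustration):

```javascript
// Include a document if either comparison holds.
const pipeline = [
  { $match: {
      $or: [
        { score: { $gte: 70 } },   // highly rated, or…
        { views: { $gte: 1000 } }  // widely read
      ]
  } }
];
// db.articles.aggregate(pipeline)
```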
$skip + $limit
$skip and $limit both accept a positive integer, and do what you expect them to do: skip documents, and limit the number returned. For the purposes of developing rich APIs on top of Mongo, this is obviously quite useful. A simple example building on the above articles collection might look like:
db.articles.aggregate([
{ $match: { score: { $gt: 60 }}},
{ $limit: 2 },
{ $skip: 1 }
])
First we’re filtering the collection for documents with a field called score with a value greater than 60. Next, we limit the number of documents to two, and then return the second one by skipping the first.
$skip and $limit are also subject to pipeline optimizations.
$lookup (join)
Ok, so far, so good. But what about modelling relationships?
$lookup functions somewhat like a join: it matches each document in the pipeline stage against a set of documents from another collection, and then passes those documents as an attribute on the current one to the next stage of the pipeline. You can even match against arrays.
The syntax of this operation looks like:
{ $lookup: {
from: <collection to join>,
localField: <field from the input documents>,
foreignField: <field from the documents of the "from" collection>,
as: <output array field>
} }
As a full-fledged example, inside of a database modelling an inventory system:
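A sketch along the lines of the orders/inventory example in the Mongo manual, assuming orders documents with an item field and inventory documents with a sku field:

```javascript
// Join each order to its matching inventory documents.
const pipeline = [
  { $lookup: {
      from: "inventory",     // collection to join against
      localField: "item",    // field on each order document
      foreignField: "sku",   // matching field on inventory documents
      as: "inventory_docs"   // matched documents land here as an array
  } }
];
// db.orders.aggregate(pipeline) — each order gains an
// "inventory_docs" array for the next stage to use.
```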
$project (select)
Naturally, we’re going to want to reduce the documents into smaller objects, returning just the fields we want, or aliasing their names. In the SQL paradigm, this sounds like a SELECT; for Mongo, it’s $project.
The structure you pass to $project is a field mapping:
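For example (field names are illustrative; $substrCP is one of many expression operators you could use for a computed field):

```javascript
// Keep two fields, drop _id, and add one computed field.
const pipeline = [
  { $project: {
      title: 1,   // include "title"
      author: 1,  // include "author"
      _id: 0,     // _id is included by default; suppress it
      excerpt: { $substrCP: ["$body", 0, 100] } // computed via an expression
  } }
];
// db.articles.aggregate(pipeline)
```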
We can actually compose very complex serialization routines using $project with <expression>. Otherwise, including a field is as easy as passing <field>: <1 or true>. The documents returned to the next stage of the pipeline will contain only the values specified by $project.
Hopefully you’re excited to try out aggregations yourself! The current version of the Mongo manual (3.4) provides some excellent example datasets to experiment with: