Iterate a DynamoDB table with AWS Step Functions

Benjamin Tamasi
Sep 14, 2022

Step Functions provide a great way to iterate over data stored in a DynamoDB table and process it in parallel. A few quirks can make starting out difficult, but once they are addressed, Step Functions can be a powerful tool.

Iterating over data in DynamoDB

To iterate over the data, we first use either a Query or a Scan operation to fetch it. Sometimes all the data fits in a single result, but often it does not: DynamoDB limits a single Query or Scan call to 1 MB of data. We can apply a lower limit of our own with the “Limit” parameter. If there is more data to be returned, we need to repeat the call until nothing is left. This is straightforward pagination, where a token tells the service where the next page should start.

In the case of DynamoDB, if there are results that haven’t been returned yet, the response includes a LastEvaluatedKey, which we pass to the next call as ExclusiveStartKey. That call then returns the next batch of results. When the response has no LastEvaluatedKey, we have reached the end.
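As a rough illustration of that hand-off, here is what one page of a Scan response and the follow-up request might look like. The table name ExampleTable, the string partition key id, and the item values are all made up for this sketch.

```json
{
  "Items": [
    { "id": { "S": "item-0001" } },
    { "id": { "S": "item-0002" } }
  ],
  "Count": 2,
  "ScannedCount": 2,
  "LastEvaluatedKey": { "id": { "S": "item-0002" } }
}
```

The LastEvaluatedKey from that response is copied unchanged into the next request:

```json
{
  "TableName": "ExampleTable",
  "Limit": 2,
  "ExclusiveStartKey": { "id": { "S": "item-0002" } }
}
```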

Querying DynamoDB in a Step Function

If you are using the Workflow Studio view, you can add the DynamoDB Scan or Query operation by dragging its block onto the canvas.

DynamoDB: Scan block

Or, if you are editing the definition as code, use the resource arn:aws:states:::aws-sdk:dynamodb:scan or arn:aws:states:::aws-sdk:dynamodb:query.

The first issue is that on the first run we don’t want to specify an ExclusiveStartKey. We only want to add it on subsequent runs. Unfortunately, you have to tell the Scan operation exactly what its payload is, and you can’t mark a parameter as optional. Luckily for us, DynamoDB will ignore this field if it’s set to null. Our approach can then be to set it from the state input, like so: "ExclusiveStartKey.$": "$.LastEvaluatedKey". The problem is that if the input field doesn’t exist, we don’t simply get a null value. Instead, the entire step fails with an error, because we tried to access an input that wasn’t there.
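A minimal sketch of such a Scan task is shown here as an entry of the state machine’s States object. The state names Scan and ProcessItems and the table name ExampleTable are placeholders chosen for illustration, not taken from the original workflow.

```json
{
  "Scan": {
    "Type": "Task",
    "Comment": "Sketch only: ExampleTable and the state names are hypothetical",
    "Resource": "arn:aws:states:::aws-sdk:dynamodb:scan",
    "Parameters": {
      "TableName": "ExampleTable",
      "ExclusiveStartKey.$": "$.LastEvaluatedKey"
    },
    "Next": "ProcessItems"
  }
}
```

As written, this state only works if its input always contains a LastEvaluatedKey field, even a null one.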

Pass State

We can get around the issue by using a Pass state to “inject” the missing input on the first run, then having our loop go back directly to the Scan operation on subsequent runs. This ensures that the input parameter we want to use is always there.
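One way to write that Pass state is sketched below, as an excerpt of a definition rather than a complete one. The name InjectStartKey is invented for this example, and Scan is assumed to be the name of the Scan task.

```json
{
  "StartAt": "InjectStartKey",
  "States": {
    "InjectStartKey": {
      "Type": "Pass",
      "Comment": "Hypothetical name; seeds a null key so the first Scan finds the field it references",
      "Result": {
        "LastEvaluatedKey": null
      },
      "Next": "Scan"
    }
  }
}
```

Because the workflow starts here and the loop returns to Scan directly, the null value is only ever used for the first page.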

The Result

Step Function graph

By adding a Map state after the Scan, we can iterate over each result from the database. After the Map, we add a Choice state that checks for LastEvaluatedKey. If it’s there, we need to run our query again. If not, we are done. The only caveat is that we must remember to preserve the LastEvaluatedKey for when we loop back. I have added one more Pass state, which prepares the input data for the next Scan iteration.
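The loop described above could be sketched roughly as follows. All state names are placeholders, ProcessItem stands in for whatever work is done per item, and Scan is assumed to be the name of the Scan task. Setting ResultPath to null on the Map is one way to keep the Scan output, including LastEvaluatedKey, flowing on to the Choice state. It is an assumption of this sketch and not necessarily how the original workflow preserves the key. Older definitions use the field name Iterator where this sketch uses ItemProcessor.

```json
{
  "ProcessItems": {
    "Type": "Map",
    "Comment": "Hypothetical names; ResultPath null discards the Map output and passes the Scan result through",
    "ItemsPath": "$.Items",
    "ResultPath": null,
    "ItemProcessor": {
      "ProcessorConfig": { "Mode": "INLINE" },
      "StartAt": "ProcessItem",
      "States": {
        "ProcessItem": {
          "Type": "Pass",
          "Comment": "Placeholder for the real per-item work",
          "End": true
        }
      }
    },
    "Next": "MoreResults"
  },
  "MoreResults": {
    "Type": "Choice",
    "Choices": [
      {
        "Variable": "$.LastEvaluatedKey",
        "IsPresent": true,
        "Next": "PrepareNextScan"
      }
    ],
    "Default": "Done"
  },
  "PrepareNextScan": {
    "Type": "Pass",
    "Comment": "Keeps only the key, dropping the items that were already processed",
    "Parameters": {
      "LastEvaluatedKey.$": "$.LastEvaluatedKey"
    },
    "Next": "Scan"
  },
  "Done": {
    "Type": "Succeed"
  }
}
```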

Further Optimisation

In many cases we don’t actually need the entire content of each item. We can drastically reduce the number of iterations by selecting only the fields we need. In my example, I only need the id field, so I use a ProjectionExpression to select only the id. This way I can fit a lot more data in a single query before I hit the 1 MB limit.
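A sketch of the Scan task with a projection added might look like this. The table and state names are again placeholders. The #id alias through ExpressionAttributeNames is a defensive choice in this example that sidesteps any clash with DynamoDB reserved words. A plain attribute name in ProjectionExpression may work just as well.

```json
{
  "Scan": {
    "Type": "Task",
    "Comment": "Sketch only: returns just the id attribute of each item",
    "Resource": "arn:aws:states:::aws-sdk:dynamodb:scan",
    "Parameters": {
      "TableName": "ExampleTable",
      "ProjectionExpression": "#id",
      "ExpressionAttributeNames": {
        "#id": "id"
      },
      "ExclusiveStartKey.$": "$.LastEvaluatedKey"
    },
    "Next": "ProcessItems"
  }
}
```

Smaller items also keep the state output small, which helps because Step Functions limits the payload passed between states to 256 KiB.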

Code Example
