
Processing CloudTrail Logs from S3: Discovery and Resumption Patterns

November 24, 2025 · 8 min read · Engineering

We often found ourselves needing to pull CloudTrail logs quickly from an organization trail spanning hundreds of AWS accounts. Our detection engineering workflow involved performing actions across different accounts, then pulling the logs to check whether the expected events were generated and to examine the actual event data - so we could quickly validate our hypotheses. Over the last year, this loop let us build and verify our assumptions and understand the many nuances CloudTrail is notorious for.

Tools exist for synchronizing S3 buckets, such as AWS's own aws s3 sync, but by nature the command checks every file under the prefix to determine what needs transferring, which becomes expensive across thousands of immutable CloudTrail logs. If we performed an action and then - accounting for the delay before CloudTrail flushes to S3 - pulled the logs a few minutes later across 10+ accounts and several regions, even writing the command becomes a challenge because of the hierarchical path layout. We looked for existing tooling built specifically for this, but what we found was poorly suited to the task.

CloudTrail writes to S3 with a deep hierarchy: AWSLogs/{org-id}/{account-id}/CloudTrail/{region}/{year}/{month}/{day}/. Files are named with timestamps: 295916874782_CloudTrail_us-east-1_20251123T2345Z_abc123.json.gz - the YYYYMMDDTHHmmZ format sorts chronologically when sorted alphabetically. Most paths are empty - an account might only have activity in a few regions, but naive enumeration checks all possible combinations.

Say we want to sync 5 accounts across several regions for the last 30 minutes: sync utilities aren't designed to exploit the S3 hierarchy to avoid reprocessing files. They check everything, create thousands of compressed files locally, and don't understand that CloudTrail logs are append-only and immutable. We needed something that would grab only new logs since the last run, without re-checking everything or manually tracking which account/region paths to process - so we could speed up our inner loop and build axioms while validating our assumptions about CloudTrail in a multi-region, multi-account environment. Here's how we did it and what we learned.

What You Need to Handle

Processing CloudTrail logs from S3 at scale means solving a few problems:

Discovery - Figure out where CloudTrail data actually exists. With potentially thousands of account/region/date combinations, you need to avoid checking empty paths.

Resumption - Track what you've already processed so you can pick up where you left off. CloudTrail files are immutable, so you don't need to re-check files that haven't changed.

Deduplication - Handle the case where multiple trails (organization-level and account-specific) log the same events.

Processing - Download the gzipped files, decompress them, parse the JSON, extract events.

The sections below explain how we solved each of these using S3's API features and CloudTrail's structure.

Discovery

With an organization spanning 50 accounts across AWS's 33 regions, that's 1,650 potential account/region combinations to check. Add the date hierarchy and you're looking at tens of thousands of possible paths. Most paths are empty - an account might only have activity in us-east-1, us-west-2, and eu-west-1, but naive enumeration still checks all 33 regions.

S3's ListObjectsV2 has a delimiter parameter that returns CommonPrefixes instead of individual objects - essentially letting you navigate the hierarchy like directories without fetching file lists.

# Without delimiter: returns all objects under the prefix
aws s3api list-objects-v2 --bucket my-bucket --prefix AWSLogs/o-abc123/123456789012/CloudTrail/

# With delimiter: returns only immediate subdirectories
aws s3api list-objects-v2 --bucket my-bucket --prefix AWSLogs/o-abc123/123456789012/CloudTrail/ --delimiter /

Start with prefix AWSLogs/{org-id}/ and delimiter / to discover which account IDs exist. For each account, list with prefix AWSLogs/{org-id}/{account-id}/CloudTrail/ and delimiter / to find only regions containing data. You end up with a map of active account/region pairs without listing empty paths - typically reducing discovery from 1,650+ checks to ~100-200 API calls that actually return data.
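
To make this concrete, here's a minimal sketch of that two-level walk using the AWS SDK for Go v2. The bucket name, org ID, and the listPrefixes helper are illustrative placeholders rather than gocloudtrail's actual code:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// listPrefixes returns the immediate "subdirectories" under prefix by asking
// S3 for CommonPrefixes instead of individual objects.
func listPrefixes(ctx context.Context, client *s3.Client, bucket, prefix string) ([]string, error) {
	var out []string
	p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
		Bucket:    aws.String(bucket),
		Prefix:    aws.String(prefix),
		Delimiter: aws.String("/"),
	})
	for p.HasMorePages() {
		page, err := p.NextPage(ctx)
		if err != nil {
			return nil, err
		}
		for _, cp := range page.CommonPrefixes {
			out = append(out, aws.ToString(cp.Prefix))
		}
	}
	return out, nil
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)
	bucket, orgID := "my-cloudtrail-bucket", "o-abc123" // placeholders

	// Level 1: which accounts have any CloudTrail data at all?
	accounts, err := listPrefixes(ctx, client, bucket, "AWSLogs/"+orgID+"/")
	if err != nil {
		log.Fatal(err)
	}
	// Level 2: for each account, which regions actually contain data?
	for _, account := range accounts {
		regions, err := listPrefixes(ctx, client, bucket, account+"CloudTrail/")
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(account, "->", regions)
	}
}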

Resumption

CloudTrail files are immutable once written - a file named 295916874782_CloudTrail_us-east-1_20251123T2345Z_abc123.json.gz never changes after CloudTrail writes it. This means you can track the last processed file and skip everything before it on subsequent runs.

Each (bucket, account, region) combination needs independent state tracking. Organization trails and account-specific trails can write to different buckets, so the same account/region appears in multiple locations generating different (but potentially overlapping) events. The composite key handles this:

state[bucket][account][region] = last_processed_key

On restart, ListObjectsV2 uses both Prefix and StartAfter. Prefix narrows to the specific account/region: AWSLogs/{org-id}/{account-id}/CloudTrail/{region}/. StartAfter skips past the checkpoint - the last processed S3 key from the previous run. S3 handles this filtering server-side, so you never see files you've already processed. Listing picks up exactly where you left off.
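
Here's a sketch of that scoped, checkpointed listing using the same SDK as above - the listSince name and the .json.gz filter are our own framing, not gocloudtrail's API:

package main

import (
	"context"
	"strings"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// listSince lists the CloudTrail log keys for one account/region prefix,
// resuming after the checkpoint key saved by the previous run.
func listSince(ctx context.Context, client *s3.Client, bucket, prefix, checkpoint string) ([]string, error) {
	input := &s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String(prefix), // AWSLogs/{org-id}/{account-id}/CloudTrail/{region}/
	}
	if checkpoint != "" {
		// S3 skips every key lexicographically <= checkpoint server-side.
		input.StartAfter = aws.String(checkpoint)
	}
	var keys []string
	p := s3.NewListObjectsV2Paginator(client, input)
	for p.HasMorePages() {
		page, err := p.NextPage(ctx)
		if err != nil {
			return nil, err
		}
		for _, obj := range page.Contents {
			if strings.HasSuffix(aws.ToString(obj.Key), ".json.gz") {
				keys = append(keys, aws.ToString(obj.Key))
			}
		}
	}
	return keys, nil
}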

CloudTrail's YYYYMMDDTHHmmZ filename format sorts chronologically under lexicographic ordering, which is exactly how S3 returns keys from ListObjectsV2. This alignment means files arrive in chronological order by design - no client-side sorting needed.

Checkpoint progress every 100 files to avoid losing work on interruption.
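
A minimal sketch of what that could look like, assuming checkpoints live in a local JSON file keyed exactly like the state map above (the file layout is an illustrative assumption, not gocloudtrail's actual state store); call saveState every 100 processed files:

package main

import (
	"encoding/json"
	"os"
)

// State maps bucket -> account -> region -> last processed S3 key.
type State map[string]map[string]map[string]string

// saveState writes the checkpoint atomically: write to a temp file, then
// rename, so an interruption never leaves a half-written checkpoint behind.
func saveState(path string, s State) error {
	data, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}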

Deduplication

Multiple trails can capture identical events - organization trails and account-specific trails often log the same API calls. Each CloudTrail event has a unique eventID field. Bloom filters provide probabilistic membership testing - you can check whether you've seen an eventID before without storing every ID in memory. The tradeoff is a small false-positive rate (a genuinely new event is occasionally mistaken for a duplicate and skipped) in exchange for significant memory savings, which is acceptable for analysis workloads where a rare missed event matters less than the memory footprint.
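
As a sketch, here's what the dedup check could look like in Go using the github.com/bits-and-blooms/bloom/v3 library - one common choice, not necessarily what gocloudtrail uses. Sized for roughly 10 million eventIDs at a 0.1% false-positive rate, the filter needs on the order of 17 MB:

package main

import (
	"github.com/bits-and-blooms/bloom/v3"
)

// Deduper wraps a Bloom filter keyed on CloudTrail eventIDs.
type Deduper struct {
	filter *bloom.BloomFilter
}

func NewDeduper() *Deduper {
	// Sized for ~10M eventIDs at a 0.1% false-positive rate.
	return &Deduper{filter: bloom.NewWithEstimates(10_000_000, 0.001)}
}

// Seen reports whether the eventID was (probably) processed already and
// records it in the same call. A false positive means a genuinely new
// event is occasionally skipped; there are no false negatives.
func (d *Deduper) Seen(eventID string) bool {
	return d.filter.TestAndAdd([]byte(eventID))
}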

Algorithm

  1. Use S3 delimiter-based discovery to find active account/region pairs:
     a. List the bucket prefix with delimiter / to discover accounts
     b. For each account, list {account}/CloudTrail/ with delimiter / to discover regions with data
  2. For each discovered account/region pair:
     a. Query the state database for the last processed S3 key (resumption)
     b. List S3 objects with Prefix scoped to the account/region and StartAfter set to the checkpoint
     c. Filter for .json.gz files and queue them for download
  3. Download worker pool processes queued files (see the sketch after this list):
     a. Fetch the object from S3
     b. Decompress the gzip stream
     c. Queue the decompressed data for processing
  4. Processing worker pool handles events:
     a. Parse the JSON to extract the CloudTrail Records array
     b. For each record, extract the eventID field
     c. Check the bloom filter for the eventID:
        • If seen: skip (duplicate)
        • If not seen: add to the bloom filter and write the event to JSONL output
  5. Checkpoint state periodically (every 100 files) by saving last processed S3 key per (bucket, account, region) tuple
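
Here's a condensed sketch of steps 3 and 4 - a download pool and a processing pool connected by a channel. Worker counts, struct names, and the emit callback are illustrative choices, not gocloudtrail's actual internals:

package main

import (
	"compress/gzip"
	"context"
	"encoding/json"
	"io"
	"sync"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// logFile mirrors the top-level shape of a CloudTrail log file.
type logFile struct {
	Records []json.RawMessage `json:"Records"`
}

// runPipeline drains S3 keys from keys, downloads and decompresses each file,
// then parses the Records array and hands every event to emit (where dedup
// and JSONL output would happen). emit must be safe for concurrent use.
func runPipeline(ctx context.Context, client *s3.Client, bucket string, keys <-chan string, emit func(json.RawMessage)) {
	payloads := make(chan []byte, 16)

	// Download workers: fetch and gunzip each object.
	var dl sync.WaitGroup
	for i := 0; i < 8; i++ {
		dl.Add(1)
		go func() {
			defer dl.Done()
			for key := range keys {
				obj, err := client.GetObject(ctx, &s3.GetObjectInput{
					Bucket: aws.String(bucket), Key: aws.String(key),
				})
				if err != nil {
					continue // real code should log and retry
				}
				if gz, err := gzip.NewReader(obj.Body); err == nil {
					if data, err := io.ReadAll(gz); err == nil {
						payloads <- data
					}
				}
				obj.Body.Close()
			}
		}()
	}

	// Processing workers: parse each decompressed file and emit its records.
	var proc sync.WaitGroup
	for i := 0; i < 4; i++ {
		proc.Add(1)
		go func() {
			defer proc.Done()
			for data := range payloads {
				var f logFile
				if err := json.Unmarshal(data, &f); err != nil {
					continue
				}
				for _, rec := range f.Records {
					emit(rec)
				}
			}
		}()
	}

	dl.Wait()
	close(payloads)
	proc.Wait()
}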

Getting Started with gocloudtrail

We built and open-sourced gocloudtrail, which implements these patterns. To get started you'll need s3:ListBucket and s3:GetObject permissions on the CloudTrail buckets; if you also want it to discover your trails and generate the configuration automatically, you'll need cloudtrail:DescribeTrails as well.

Install it with:

go install github.com/deceptiq/gocloudtrail@latest

You can then generate a config from your existing trails, which it will auto-discover:

gocloudtrail generate-config config.json

Finally, run the processor - you can interrupt it at any time, and it'll resume efficiently from the last checkpoint:

gocloudtrail run -config config.json
