Detection & Response · Jun 1, 2026 · 10 min

Reconstructing a Data-Exfiltration Breach from CloudTrail with Athena

When an S3 bucket leaks, your CloudTrail logs already hold the full story. Here is how to query them with Athena, pivot across identity and network signals, and rebuild the attacker's timeline from first API call to final GetObject.

You get the call at 02:00: a customer dataset is for sale on a forum, and the only thing anyone knows is that it lived in an S3 bucket. No detection rule fired, no dashboard exists, no SIEM is ingesting your AWS logs. What you do have is a CloudTrail trail writing to S3, and that is enough. With Amazon Athena you can turn months of gzipped JSON log objects into a queryable table and reconstruct exactly what the attacker did, when, from where, and with which credentials. This walkthrough shows the method end to end: standing up the table, pivoting across identity and network fields, separating enumeration from exfiltration, and assembling a defensible timeline.

Step 1: Make CloudTrail queryable in Athena

Athena reads CloudTrail objects in place via the purpose-built CloudTrail SerDe, so you never move or duplicate data. One detail decides whether the case is winnable: management events alone are not enough. By default CloudTrail records the control plane (CreateAccessKey, AssumeRole, PutBucketPolicy) but not object-level operations (GetObject, PutObject). Those are data events, and S3 data event logging must have been enabled before the incident for you to see the actual exfiltration. Create the management-events table first; if you captured data events, point a second table at that prefix. Partition both tables with partition projection on region and date. A flat scan over a year of org-wide CloudTrail reads terabytes, and Athena bills by bytes scanned, so every unbounded query costs real money; bounding to the relevant region and dates typically cuts scan volume by 90 percent or more.

CREATE EXTERNAL TABLE cloudtrail_logs (
  eventTime STRING,
  eventSource STRING,
  eventName STRING,
  awsRegion STRING,
  sourceIPAddress STRING,
  userAgent STRING,
  errorCode STRING,
  requestParameters STRING,
  responseElements STRING,
  userIdentity STRUCT<
    type: STRING,
    arn: STRING,
    accountId: STRING,
    accessKeyId: STRING,
    userName: STRING,
    sessionContext: STRUCT<
      attributes: STRUCT<creationDate: STRING, mfaAuthenticated: STRING>>>,
  resources ARRAY<STRUCT<arn: STRING, accountId: STRING, type: STRING>>
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-org-cloudtrail/AWSLogs/123456789012/CloudTrail/';
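
The table above has no partitions, so every query against it scans the whole prefix. A minimal sketch of the partition-projected variant might look like the following. The table name cloudtrail_logs_pp, the region list, and the start of the date range are assumptions for illustration; the bucket and account ID are the placeholders already used above.

```sql
-- Sketch only: same columns as cloudtrail_logs, plus projected partitions.
-- Assumptions: table name, region list, and the 2025/01/01 range start.
-- Adjust all three to match where and since when your trail writes.
CREATE EXTERNAL TABLE cloudtrail_logs_pp (
  eventTime STRING,
  eventSource STRING,
  eventName STRING,
  awsRegion STRING,
  sourceIPAddress STRING,
  userAgent STRING,
  errorCode STRING,
  requestParameters STRING,
  responseElements STRING,
  userIdentity STRUCT<
    type: STRING,
    arn: STRING,
    accountId: STRING,
    accessKeyId: STRING,
    userName: STRING,
    sessionContext: STRUCT<
      attributes: STRUCT<creationDate: STRING, mfaAuthenticated: STRING>>>,
  resources ARRAY<STRUCT<arn: STRING, accountId: STRING, type: STRING>>
)
PARTITIONED BY (region STRING, `timestamp` STRING)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-org-cloudtrail/AWSLogs/123456789012/CloudTrail/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.region.type' = 'enum',
  'projection.region.values' = 'us-east-1,us-west-2',
  'projection.timestamp.type' = 'date',
  'projection.timestamp.format' = 'yyyy/MM/dd',
  'projection.timestamp.range' = '2025/01/01,NOW',
  'projection.timestamp.interval' = '1',
  'projection.timestamp.interval.unit' = 'DAYS',
  'storage.location.template' =
    's3://my-org-cloudtrail/AWSLogs/123456789012/CloudTrail/${region}/${timestamp}'
);
```

With this layout, a predicate such as region = 'us-east-1' AND "timestamp" BETWEEN '2026/05/28' AND '2026/05/30' should let Athena read only the matching day folders. The eventTime filters used in the queries below narrow the rows returned but do not by themselves reduce the bytes scanned.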

Step 2: Find the principal, then split recon from theft

Start from what you know, the bucket name, and collapse to the distinct identities that touched it. userIdentity.arn tells you who, accessKeyId tells you which credential, and sourceIPAddress plus userAgent fingerprint the actor. The tell is a legitimate principal acting from an IP and user agent it has never used before: often a generic aws-cli or SDK string from a hosting-provider or Tor-exit address rather than your corporate ranges. Then read the shape of the activity. Real intrusions orient before they steal, and CloudTrail makes each phase loud. The errorCode field is especially useful, because enumeration probes resources the principal cannot access and throws a wall of AccessDenied errors that legitimate automation almost never produces.

  • Enumeration: clustered GetCallerIdentity, ListBuckets, ListObjectsV2, GetBucketPolicy and ListRoles in minutes, with many AccessDenied errorCodes as the actor maps blast radius.
  • Persistence / escalation: CreateAccessKey, AttachUserPolicy, PutUserPolicy, or AssumeRole into a more privileged role.
  • Exfiltration: a sustained run of successful GetObject calls (errorCode is null) against one bucket prefix, from the same anomalous sourceIPAddress.
  • Anti-forensics: StopLogging or DeleteTrail to blind defenders, and PutBucketPolicy or DeleteBucketEncryption to weaken the bucket's protections or open it to public access.

SELECT
  eventTime,
  userIdentity.arn               AS principal,
  userIdentity.accessKeyId       AS access_key,
  eventName,
  sourceIPAddress,
  userAgent,
  errorCode
FROM cloudtrail_logs
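-- ListObjectsV2 calls are generally recorded under the eventName 'ListObjects'.
-- ListObjects and GetObject are data events, so they appear here only if
-- this table's prefix also contains S3 data events.
WHERE eventName IN (
        'ListBuckets', 'GetBucketPolicy', 'ListObjects', 'ListObjectsV2', 'GetObject',
        'CreateAccessKey', 'GetCallerIdentity', 'AssumeRole')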
  AND eventTime >= '2026-05-28T00:00:00Z'
  AND eventTime <  '2026-05-31T00:00:00Z'
ORDER BY eventTime;
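
To collapse the bucket's activity to distinct identities, as described at the start of this step, a rough sketch is to group by the identity and network fields and count denials per group. It assumes the cloudtrail_logs table defined above and reuses the bucket name and date window from the other queries in this post.

```sql
-- One row per (principal, access key, source IP, user agent) that touched the bucket.
-- A high denied_calls count from a never-before-seen IP or agent is the signal to chase.
SELECT
  userIdentity.arn                      AS principal,
  userIdentity.accessKeyId              AS access_key,
  sourceIPAddress,
  userAgent,
  min(eventTime)                        AS first_seen,   -- ISO 8601 UTC strings sort chronologically
  max(eventTime)                        AS last_seen,
  count(*)                              AS total_calls,
  count_if(errorCode = 'AccessDenied')  AS denied_calls
FROM cloudtrail_logs
WHERE json_extract_scalar(requestParameters, '$.bucketName') = 'acme-customer-pii'
  AND eventTime >= '2026-05-28T00:00:00Z'
  AND eventTime <  '2026-05-31T00:00:00Z'
GROUP BY 1, 2, 3, 4
ORDER BY first_seen;
```

Calls that carry no bucketName in requestParameters, such as ListBuckets or GetCallerIdentity, are excluded by this filter. Once a suspect principal stands out, drop the bucket predicate and filter on that principal instead to see its account-wide enumeration.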

Step 3: Quantify the theft, then pivot into a timeline

This is where S3 data events earn their keep: management events show the door opening, data events show what walked out. Query the data-event table for successful GetObject calls made with the suspect access key, count distinct objects, and group by hour to bound the window and see the burst. Then widen back out. Take your three high-confidence indicators (the access key, the source IP, and the anomalous user agent) and pivot on each independently across the full management table. The access key reveals everything that credential touched; the source IP catches activity even after the actor rotates to a freshly minted key; the user agent links sessions across both. Stitched together and ordered by eventTime, the results produce the canonical narrative: initial access, enumeration, key creation, the GetObject run, and any StopLogging or DeleteTrail at the end, where the attacker tries to cut the very telemetry you are reading.

SELECT
  date_trunc('hour', from_iso8601_timestamp(eventTime)) AS hour_bucket,
  userIdentity.accessKeyId       AS access_key,
  sourceIPAddress,
  count(*)                       AS get_object_calls,
  count(DISTINCT json_extract_scalar(requestParameters, '$.key')) AS distinct_objects
FROM cloudtrail_data_events
WHERE eventSource = 's3.amazonaws.com'
  AND eventName = 'GetObject'
  AND errorCode IS NULL
  AND json_extract_scalar(requestParameters, '$.bucketName') = 'acme-customer-pii'
  AND eventTime >= '2026-05-30T00:00:00Z'
GROUP BY 1, 2, 3
ORDER BY get_object_calls DESC;
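
The three-indicator pivot described above can be sketched as a single pass over the management table, with one flag column per indicator so you can see which pivot pulled each row in. The access key, IP address, and user-agent pattern below are hypothetical placeholders, not values from a real case; substitute the indicators you identified in Step 2.

```sql
-- Placeholders (assumptions): 'AKIAEXAMPLEKEY000000', '203.0.113.45', '%example-agent%'.
-- coalesce(..., false) keeps the flags boolean when a field is NULL.
SELECT
  eventTime,
  eventSource,
  eventName,
  errorCode,
  userIdentity.arn          AS principal,
  userIdentity.accessKeyId  AS access_key,
  sourceIPAddress,
  userAgent,
  coalesce(userIdentity.accessKeyId = 'AKIAEXAMPLEKEY000000', false) AS hit_key,
  coalesce(sourceIPAddress = '203.0.113.45', false)                  AS hit_ip,
  coalesce(userAgent LIKE '%example-agent%', false)                  AS hit_agent
FROM cloudtrail_logs
WHERE (userIdentity.accessKeyId = 'AKIAEXAMPLEKEY000000'
       OR sourceIPAddress = '203.0.113.45'
       OR userAgent LIKE '%example-agent%')
  AND eventTime >= '2026-05-28T00:00:00Z'
  AND eventTime <  '2026-05-31T00:00:00Z'
ORDER BY eventTime;
```

Rows where hit_ip is true but hit_key is false are the ones most likely to expose a second credential, for example a key minted via CreateAccessKey partway through the intrusion. Widen the date window if the earliest matching row sits right at its lower bound.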

If you run CloudTrail Lake instead of raw S3 plus Athena, the same investigation collapses into native SQL against an immutable event data store. No table DDL or SerDe is required, and retention is configurable up to ten years. Lake removes the table-management toil and resists exactly the StopLogging and DeleteTrail tampering you are hunting for during a live breach. Whichever you use, centralize trails into a logging account where the workload role cannot delete them, enable log file validation, turn on S3 data events for sensitive buckets before you need them, and rehearse these queries. That way, at 02:00 you are running a known playbook, not writing SQL from scratch against an unfamiliar schema.
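
For comparison, a rough sketch of the Step 3 lookup in CloudTrail Lake might look like the following. It assumes an event data store that collects S3 data events; the all-zero ID in the FROM clause is a placeholder for your own event data store ID, and the use of element_at assumes requestParameters is exposed as a map in your store's schema. Check both against your event data store before relying on it.

```sql
-- Assumption: replace the FROM target with your event data store ID.
-- Assumption: requestParameters is a map, so keys are read with element_at.
SELECT
  eventTime,
  eventName,
  sourceIPAddress,
  userIdentity.arn                      AS principal,
  element_at(requestParameters, 'key')  AS object_key
FROM 00000000-0000-0000-0000-000000000000
WHERE eventSource = 's3.amazonaws.com'
  AND eventName = 'GetObject'
  AND element_at(requestParameters, 'bucketName') = 'acme-customer-pii'
  AND eventTime >= '2026-05-30 00:00:00'
ORDER BY eventTime;
```

Note the differences from the Athena version: the FROM target is an event data store ID rather than a table name, and eventTime is compared as a timestamp rather than as an ISO 8601 string.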
