GA4 Traffic Sources Explained
The two most common things people want from their GA4/analytics data is:
- How are people getting to my site?
- How is my traffic, conversions & revenue looking?
Today we’re talking about the first - the murky world of traffic sources and acquisition.
As with all our posts we’ve got the same 3 goals:
- Understand the fundamentals of how it all works.
- Apply that to using GA4.
- Get comfortable using GA4 BigQuery Export knowing the principles.
What is traffic source?
This part is quite easy.
Traffic source describes where someone came from when they visit your site.
Medium, source & default channel grouping are the three which are always populated automatically, but pretty much all of these can be manually set using UTM parameters.
Now let's look at 3 basic parts to traffic sources in analytics that we need to understand.
Concept 1: User vs Session Traffic Sources
The part that confuses most people immediately is how traffic source works with users & sessions.
In GA4 you’ll see two different windows in the UI:
Each with a different set of dimensions
- First User Medium, First User Source … etc.
- Session Medium, Session Source … etc.
They’re different. What’s going on here? Which one do you choose?
An example
Let’s say we’re tracking a single user’s activity on a specific day on our site.
- They get our newsletter and then visit our site.
- Later in the day, the come back to the site using Google.
There are two ways we can think about the way the user visited the site:
- At the session level - i.e. How did they arrive to the site for each individual visit?
- These are Session Dimensions - i.e Session Medium or Session Source
- At the user level - i.e How did the user initially come to the site?
- These are First User Dimensions - i.e First User Medium or First User Source.
Let’s look at session level dimensions
Using the example above, there are two sessions which have a unique session medium & session source.
In a table it would look like this:
User ID | Session Number | Session Medium | Session Source |
---|---|---|---|
user_1 | Session 1 | mailchimp | |
user_1 | Session 2 | organic |
Then let’s look at user level dimensions
When it comes to First User Medium & Source, we look at the first session for that user - which in this case is via the newsletter.
In a table it would look like this:
User ID | Session Number | Session Medium | Session Source | First User Medium | First User Source |
---|---|---|---|---|---|
user_1 | Session 1 | mailchimp | mailchimp | ||
user_1 | Session 2 | organic | mailchimp |
Note how the First User Medium & First User Source are the same across both sessions.
First user today or first time the user ever visited?
When we say first user medium, does GA4 mean today or over all time?
It means over all time.
This means that it doesn’t matter if the first session for a user is one day, or one month ago, the First User Medium & Source would still come from that first user session.
Should I use First User or Session Dimensions?
Ultimately, it depends on which metric you want to use.
You can’t combine any metric with any dimension as these numbers can be misleading or straight-up wrong. We discuss this in detail in this blog about scoping in GA4.
The short answer is that:
- If it's a website probably session.
- If it's an app probably user.
Just remember this rule when pairing metrics and dimensions.
Concept 2: What is Last Non Direct Click attribution?
There’s one really important last thing we need to remember with traffic sources in GA4.
It uses last non direct click attribution.
Let’s take our previous example and add on a 3rd visit. The user comes back direct to the homepage.
The session is direct. But that isn’t what GA4 will show. It will show:
Medium | Source | Sessions |
---|---|---|
mailchimp | 1 | |
organic | 2 |
What is happening?
Last Non Direct Click attribution
GA4 uses the last non direct click attribution model.
If the source of traffic is “direct”, GA4 then looks back through all the previous sessions (for 90 days) until it finds a source of traffic which isn’t direct.
Finally let’s talk about Google’s automatic groupings and customising them.
Concept 3: How do channel groupings work?
One last thing we need to cover is channel groupings. We'll be wanting this in BigQuery too.
If you’ve hopped into the GA4 interface, the main dimension you see isn’t source or medium it’s channel groupings.
How do channel groupings work?
The channel groupings group together similar sources of traffic under one label.
E.g. “Organic Search” groups together organic traffic from Google with organic traffic from Bing, and other search engines.
In a table it might look like this:
Default Channel Grouping | Session Medium | Session Source |
---|---|---|
Organic Search | organic | bing |
Organic Search | organic | |
Organic Search | organic | baidu |
Organic Search | referral | m.baidu.com |
Organic Search | referral | cn.bing.com |
What is Google's definition of Default Channel Grouping?
Google documentation defines it on this page.
We think there are a couple places it can be improved and you definitely should customise, which we’ve gone over in our Pipeline documentation.
What’s the difference between default channel grouping/primary channel grouping & custom channel groupings?
- Default Channel Grouping is the default channel grouping that Google has set.
- Custom Channel Groupings is a channel grouping you can set yourself.
- Primary Channel Grouping is just an interface option of what do you want to see by default in the UI - default channel grouping, custon 1, custom 2 etc.
Where is the traffic source data stored in GA4 BigQuery?
If you’ve set up an automatic GA4 export to BigQuery, you’ll be able to get the raw data about traffic acquisition.
There are different fields depending on whether you’re looking for a a user or session level dimension.
User Level Dimensions
The traffic_source
set of nested fields contains the information about the first time the user ever arrived at your website. This is over all time.
If you are looking for the very First User Source or First User Medium, this is where you’ll find this information.
Field Name | Data Type | What is it? |
---|---|---|
traffic.source | RECORD | This record contains traffic source information that first acquired the user. |
traffic_source.name | STRING | The name of the marketing campaign that first brought the user to the site (i.e. Spring_Sale, Holiday_Promo). |
traffic_source.medium | STRING | The type of channel that first brought in the user (i.e. organic search, email, paid search) |
traffic_source.source | STRING | The name of the platform or site that first brought in the user (i.e. Google, Facebook) |
Session Level Dimensions
The session level traffic source dimensions are found in several locations:
- As nested fields in the
event_params
i.e. Event Parameters. - In the
collected_traffic_source
fields (as of 2023/05/03). - In the
session_traffic_source_last_click
fields (as of 2024/07/17)
Let’s break these down
Nested Fields in Event Parameters
Looking at an example session_start
event, we can see where the session medium & source information is found.
This is the value just for the event it's set on.
Collected Traffic Fields
These are the same values we see in event parameters, but they’re now pulled out for you into their own column.
Session Traffic Source - Last Click
These fields contain the session level traffic source information attributed to last non-direct click.
There has been a bug with Paid Search historically in the GA4 BigQuery export
Paid Search results include a gclid in the parameters, which shows that the traffic came from Paid Google Ads.
However, historically, there was a bug in with the medium/source being set incorrectly and paid traffic wasn't being attributed correctly.
To fix this, you can search for the gclid in the URL and then set the medium and source correctly.
As mentioned before, in July 2024, Google Added the session_traffic_source_last_click
field to the GA4 BigQuery export which provides the last non direct session level traffic source.
This is what GA4 uses in the interface, but isn't technically session level (as it uses last non direct click.) This can be used to fix the issue, but only for data from July 2024 onwards.
In the example queries below, we’ve added in a fix to ensure that this same misattribution isn’t happening.
When we're using Pipeline, we pull everything from the params because it will work historically - whereas new fields such as session_traffic_source_last_click
only became available from July 2024.
How can we query traffic source metrics in GA4 BigQuery?
Right now you know all the fields and concepts, so let's look at some examples.
To make this easier, we've built a little widget which allows you to define the data set and date range which will automatically update the FROM statement in the SQL.
You can also custom set this based on whether you are looking at the Streaming or Daily GA4 tables.
Customise the queries.
Set-up variables to customise our BigQuery queries for your setup.
Calculating the number of Sessions by Session Medium
Take note when reading this and the next handful if queries. The main difference is the GROUP BY at the end.
This is the total number of sessions broken down by session medium.
-- Step 1:
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with
session_data as (
select
user_pseudo_id,
event_timestamp,
(
select value.int_value
from unnest(event_params)
where key = 'ga_session_id'
) as ga_session_id,
(
select value.string_value from unnest(event_params) where key = 'medium'
) as session_medium,
(
select value.string_value from unnest(event_params) where key = 'source'
) as session_source,
(
select value.string_value
from unnest(event_params)
where key = 'campaign'
) as session_campaign,
(
select value.string_value
from unnest(event_params)
where key = 'page_location'
) as page_location
from `@project.analytics_@analytics_property_id.@event_type*`
where _table_suffix BETWEEN '@start_date' AND '@end_date'
),
-- Step 2:
-- This fixes the GCLID bug and handle missing session attributes.
-- This happens when the information exists in the URL (page_location)
-- but is not attributed correctly to session_source or session_medium.
fixing_gclid_bug as (
select
* except (session_source, session_medium, session_campaign),
case
when page_location like '%gclid%' and session_source is null
then "google"
else session_source
end as session_source,
case
when page_location like '%gclid%' and session_medium is null
then "cpc"
else session_medium
end as session_medium,
case
when page_location like '%gclid%' and session_campaign is null
then "(cpc)"
else session_campaign
end as session_campaign,
regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
from session_data
),
-- Step 3:
-- This gets the first session_medium using the event_timestamp.
session_data_2 as (
select
* except (session_medium),
coalesce(
first_value(
case
when session_source <> '(direct)'
then coalesce(session_medium, '(none)')
end ignore nulls
) over (session_window),
'(none)'
) as session_medium
from fixing_gclid_bug
window
session_window as (
partition by ga_session_id, user_pseudo_id
order by event_timestamp asc
rows between unbounded preceding and unbounded following
)
),
-- Step 4: Count distinct sessions grouped by session_medium
final_output as (
select
session_medium,
count(distinct concat(user_pseudo_id, ga_session_id)) as sessions
from session_data_2
group by session_medium
)
-- Step 5: Output final session counts
select *
from final_output
;
Calculating the number of Sessions by Session Source
This is the total number of sessions broken down by session source.
-- Step 1:
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with session_data as (
select
user_pseudo_id,
event_timestamp,
(
select value.int_value
from unnest(event_params)
where key = 'ga_session_id'
) as ga_session_id,
(
select value.string_value
from unnest(event_params)
where key = 'medium'
) as session_medium,
(
select value.string_value
from unnest(event_params)
where key = 'source'
) as session_source,
(
select value.string_value
from unnest(event_params)
where key = 'campaign'
) as session_campaign,
(
select value.string_value
from unnest(event_params)
where key = 'page_location'
) as page_location
from `@project.analytics_@analytics_property_id.@event_type*`
where _table_suffix BETWEEN '@start_date' AND '@end_date'
),
-- Step 2:
-- This fixes the GCLID bug and handle missing session attributes.
-- This happens when the information exists in the URL (page_location)
-- but is not attributed correctly to session_source or session_medium.
fixing_gclid_bug as (
select
* except (session_source, session_medium, session_campaign),
case
when page_location like '%gclid%' and session_source is null
then "google"
else session_source
end as session_source,
case
when page_location like '%gclid%' and session_medium is null
then "cpc"
else session_medium
end as session_medium,
case
when page_location like '%gclid%' and session_campaign is null
then "(cpc)"
else session_campaign
end as session_campaign,
regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
from session_data
),
-- Step 3:
-- This gets the first session_source using the event_timestamp.
session_data_2 as (
select
* except (session_source),
coalesce(
first_value(
case
when session_source <> '(direct)'
then coalesce(session_source, '(none)')
end ignore nulls
) over (session_window),
'(direct)'
) as session_source
from fixing_gclid_bug
window session_window as (
partition by ga_session_id,user_pseudo_id
order by event_timestamp asc
rows between unbounded preceding and unbounded following
)
),
-- Step 4: Count distinct sessions grouped by session_source
final_output as (
select
session_source,
count(distinct concat(user_pseudo_id, ga_session_id)) as sessions
from session_data_2
group by session_source
)
-- Step 5: Output final session counts
select *
from final_output;
Calculating the number of Sessions by Session Medium & Source
This is the total number of sessions broken down by session medium & source.
-- Step 1:
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with
session_data as (
select
user_pseudo_id,
event_timestamp,
(
select value.int_value
from unnest(event_params)
where key = 'ga_session_id'
) as ga_session_id,
(
select value.string_value from unnest(event_params) where key = 'medium'
) as session_medium,
(
select value.string_value from unnest(event_params) where key = 'source'
) as session_source,
(
select value.string_value
from unnest(event_params)
where key = 'campaign'
) as session_campaign,
(
select value.string_value
from unnest(event_params)
where key = 'page_location'
) as page_location
from `@project.analytics_@analytics_property_id.@event_type*`
where _table_suffix BETWEEN '@start_date' AND '@end_date'
),
-- Step 2:
-- This fixes the GCLID bug and handle missing session attributes.
-- This happens when the information exists in the URL (page_location)
-- but is not attributed correctly to session_source or session_medium.
fixing_gclid_bug as (
select
* except (session_source, session_medium, session_campaign),
case
when page_location like '%gclid%' and session_source is null
then "google"
else session_source
end as session_source,
case
when page_location like '%gclid%' and session_medium is null
then "cpc"
else session_medium
end as session_medium,
case
when page_location like '%gclid%' and session_campaign is null
then "(cpc)"
else session_campaign
end as session_campaign,
regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
from session_data
),
-- Step 3:
-- This gets the first session_source using the event_timestamp.
session_data_2 as (
select
* except (session_source, session_medium),
coalesce(
first_value(
case
when session_source <> '(direct)'
then coalesce(session_source, '(none)')
end ignore nulls
) over (session_window),
'(direct)'
) as session_source,
coalesce(
first_value(
case
when session_source <> '(direct)'
then coalesce(session_medium, '(none)')
end ignore nulls
) over (session_window),
'(none)'
) as session_medium
from fixing_gclid_bug
window
session_window as (
partition by ga_session_id, user_pseudo_id
order by event_timestamp asc
rows between unbounded preceding and unbounded following
)
),
-- Step 4: Count distinct sessions grouped by session_source and session_medium
final_output as (
select
session_source,
session_medium,
count(distinct concat(user_pseudo_id, ga_session_id)) as sessions
from session_data_2
group by session_source, session_medium
)
-- Step 5: Output final session counts
select *
from final_output
;
Calculating the number of Sessions by other Campaign Parameters
As noted above, you can use campaign parameters to provide more information about how users have come to your site. These are contained in UTM parameters.
You can adapt the following query to look at whichever campaign parameters you have set up such as:
- campaign
- term
- gclid etc.
For this example, I am going to use “campaign”
-- Step 1:
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with
session_data as (
select
user_pseudo_id,
event_timestamp,
(
select value.int_value
from unnest(event_params)
where key = 'ga_session_id'
) as ga_session_id,
(
select value.string_value from unnest(event_params) where key = 'medium'
) as session_medium,
(
select value.string_value from unnest(event_params) where key = 'source'
) as session_source,
(
select value.string_value
from unnest(event_params)
where key = 'campaign'
) as session_campaign,
(
select value.string_value
from unnest(event_params)
where key = 'page_location'
) as page_location
from `@project.analytics_@analytics_property_id.@event_type*`
where _table_suffix BETWEEN '@start_date' AND '@end_date'
),
-- Step 2:
-- This fixes the GCLID bug and handle missing session attributes.
-- This happens when the information exists in the URL (page_location)
-- but is not attributed correctly to session_source or session_medium.
fixing_gclid_bug as (
select
* except (session_source, session_medium, session_campaign),
case
when page_location like '%gclid%' and session_source is null
then "google"
else session_source
end as session_source,
case
when page_location like '%gclid%' and session_medium is null
then "cpc"
else session_medium
end as session_medium,
case
when page_location like '%gclid%' and session_campaign is null
then "(cpc)"
else session_campaign
end as session_campaign,
regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
from session_data
),
-- Step 3:
-- This gets the first session_source using the event_timestamp.
session_data_2 as (
select
* except (session_campaign),
coalesce(
first_value(
case
when session_campaign is not null and session_campaign <> ''
then session_campaign
end ignore nulls
) over (session_window),
'(none)'
) as session_campaign
from fixing_gclid_bug
window
session_window as (
partition by ga_session_id, user_pseudo_id
order by event_timestamp asc
rows between unbounded preceding and unbounded following
)
),
-- Step 4: Count distinct sessions grouped by session_source and session_medium
final_output as (
select
session_campaign,
count(distinct concat(user_pseudo_id, ga_session_id)) as sessions
from session_data_2
group by session_campaign
)
-- Step 5: Output final session counts
select *
from final_output
;
First User Traffic Source Calculations
There are two ways we can use to calculate first user dimensions:
- Using the
Traffic Source
field which looks to the first user session ever. - Manually calculating this by looking at the first session for a user throughout the time period selected.
The Traffic Source
fields are great, but these fields don't exist in the streaming tables. Boo.
Let's talk through these two methods.
Calculating the number of Users by First User Medium (Using the Traffic Source Field)
This is the total number of users by First User Medium using the traffic_source
field which looks at the first every user session for traffic source attribution.
Note: This will not work for the intraday (streaming) tables.
with
data_sorting as (
select
user_pseudo_id,
coalesce(traffic_source.medium, '(none)') as medium, -- Coerce NULL to '(none)'
row_number() over (
partition by user_pseudo_id
order by
event_timestamp ASC -- Order by earliest timestamp
) as row_number
from
-- This table will need to be updated
`@project.analytics_@analytics_property_id.@event_type*`
where
-- The date range being selected will need to be updated
_table_suffix between '@start_date' and '@end_date'
)
-- Select only the first occurrence of each user_pseudo_id
select count(distinct user_pseudo_id) as total_pseudo_users, medium
from data_sorting
where row_number = 1 -- Eliminate duplicate user_pseudo_ids across mediums
group by medium
order by total_pseudo_users desc
;
Calculating the number of Users by First User Source (Using the Traffic Source Field)
This is the total number of users by First User Source.
Note: This will not work for the intraday (streaming) tables.
with
data_sorting as (
select
user_pseudo_id,
coalesce(traffic_source.source, 'direct') as source, -- Coerce NULL to 'direct'
row_number() over (
partition by user_pseudo_id
order by
case when traffic_source.source = 'direct' then 1 else 0 end, -- Prioritise other sources over '(none)'
traffic_source.source -- Default order for remaining sources
) as row_number
from
-- This table will need to be updated
`@project.analytics_@analytics_property_id.@event_type*`
where
-- The date range being selected will need to be updated
_table_suffix between '@start_date' and '@end_date'
)
-- Select only the first occurrence of each user_pseudo_id
select count(distinct user_pseudo_id) as total_pseudo_users, source
from data_sorting
where row_number = 1 -- Eliminate duplicate user_pseudo_ids across sources
group by source
order by total_pseudo_users desc
;
Calculating the number of Users by First User Medium (Manually Calculated)
This calculation comes from looking at the traffic source information for the first session by a user throughout the time period selected.
-- Step 1:
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with
session_data as (
select
user_pseudo_id,
event_timestamp,
(
select value.int_value
from unnest(event_params)
where key = 'ga_session_id'
) as ga_session_id,
(
select value.string_value from unnest(event_params) where key = 'medium'
) as session_medium,
(
select value.string_value from unnest(event_params) where key = 'source'
) as session_source,
(
select value.string_value
from unnest(event_params)
where key = 'campaign'
) as session_campaign,
(
select value.string_value
from unnest(event_params)
where key = 'page_location'
) as page_location
from `@project.analytics_@analytics_property_id.@event_type*`
where _table_suffix BETWEEN '@start_date' AND '@end_date'
),
-- Step 2:
-- This fixes the GCLID bug and handle missing session attributes.
-- This happens when the information exists in the URL (page_location)
-- but is not attributed correctly to session_source or session_medium.
fixing_gclid_bug as (
select
* except (session_source, session_medium, session_campaign),
case
when page_location like '%gclid%' and session_source is null
then "google"
else session_source
end as session_source,
case
when page_location like '%gclid%' and session_medium is null
then "cpc"
else session_medium
end as session_medium,
case
when page_location like '%gclid%' and session_campaign is null
then "(cpc)"
else session_campaign
end as session_campaign,
regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
from session_data
),
-- Step 3:
-- This gets the first session_medium for the user using the event_timestamp.
session_data_2 as (
select
* except (session_medium),
coalesce(
first_value(
case
when session_source <> '(direct)'
then coalesce(session_medium, '(none)')
end ignore nulls
) over (user_pseudo_window),
'(none)'
) as first_user_medium
from fixing_gclid_bug
window
user_pseudo_window as (
partition by user_pseudo_id
order by event_timestamp asc
rows between unbounded preceding and unbounded following
)
),
-- Step 4: Count distinct usersr grouped by first_user_medium
final_output as (
select
first_user_medium,
count(distinct concat(user_pseudo_id)) as users
from session_data_2
group by first_user_medium
)
-- Step 5: Output final session counts
select *
from final_output
Calculating the number of Users by First User Source (Manually Calculated)
This calculation comes from looking at the traffic source information for the first session by a user throughout the time period selected.
-- Step 1:
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with
session_data as (
select
user_pseudo_id,
event_timestamp,
(
select value.int_value
from unnest(event_params)
where key = 'ga_session_id'
) as ga_session_id,
(
select value.string_value from unnest(event_params) where key = 'medium'
) as session_medium,
(
select value.string_value from unnest(event_params) where key = 'source'
) as session_source,
(
select value.string_value
from unnest(event_params)
where key = 'campaign'
) as session_campaign,
(
select value.string_value
from unnest(event_params)
where key = 'page_location'
) as page_location
from `@project.analytics_@analytics_property_id.@event_type*`
where _table_suffix BETWEEN '@start_date' AND '@end_date'
),
-- Step 2:
-- This fixes the GCLID bug and handle missing session attributes.
-- This happens when the information exists in the URL (page_location)
-- but is not attributed correctly to session_source or session_medium.
fixing_gclid_bug as (
select
* except (session_source, session_medium, session_campaign),
case
when page_location like '%gclid%' and session_source is null
then "google"
else session_source
end as session_source,
case
when page_location like '%gclid%' and session_medium is null
then "cpc"
else session_medium
end as session_medium,
case
when page_location like '%gclid%' and session_campaign is null
then "(cpc)"
else session_campaign
end as session_campaign,
regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
from session_data
),
-- Step 3:
-- This gets the first session_source for the user using the event_timestamp.
session_data_2 as (
select
* except (session_source),
coalesce(
first_value(
case
when session_source <> '(direct)'
then coalesce(session_source, '(none)')
end ignore nulls
) over (user_pseudo_window),
'(direct)'
) as first_user_source,
from fixing_gclid_bug
window
user_pseudo_window as (
partition by user_pseudo_id
order by event_timestamp asc
rows between unbounded preceding and unbounded following
)
),
-- Step 4: Count distinct userr grouped by first_user_source
final_output as (
select
first_user_source,
count(distinct concat(user_pseudo_id)) as users
from session_data_2
group by first_user_source
)
-- Step 5: Output final session counts
select *
from final_output
order by users DESC
;