BigQuery & Basics

GA4 Traffic Sources Explained

The two most common things people want from their GA4/analytics data is:

  1. How are people getting to my site?
  2. How is my traffic, conversions & revenue looking?

Today we’re talking about the first - the murky world of traffic sources and acquisition.

As with all our posts we’ve got the same 3 goals:

  1. Understand the fundamentals of how it all works.
  2. Apply that to using GA4.
  3. Get comfortable using GA4 BigQuery Export knowing the principles.

What is traffic source?

This part is quite easy.

Traffic source describes where someone came from when they visit your site.

Medium, source & default channel grouping are the three which are always populated automatically, but pretty much all of these can be manually set using UTM parameters.

Now let's look at 3 basic parts to traffic sources in analytics that we need to understand.

Concept 1: User vs Session Traffic Sources

The part that confuses most people immediately is how traffic source works with users & sessions.

In GA4 you’ll see two different windows in the UI:

user_vs_session_dimensions

Each with a different set of dimensions

  • First User Medium, First User Source … etc.
  • Session Medium, Session Source … etc.

They’re different. What’s going on here? Which one do you choose?

An example

Let’s say we’re tracking a single user’s activity on a specific day on our site.

  1. They get our newsletter and then visit our site.
  2. Later in the day, the come back to the site using Google.

single_users_sessions

There are two ways we can think about the way the user visited the site:

  1. At the session level - i.e. How did they arrive to the site for each individual visit?
    • These are Session Dimensions - i.e Session Medium or Session Source
  2. At the user level - i.e How did the user initially come to the site?
    • These are First User Dimensions - i.e First User Medium or First User Source.

Let’s look at session level dimensions

Using the example above, there are two sessions which have a unique session medium & session source.

In a table it would look like this:

User IDSession NumberSession MediumSession Source
user_1Session 1emailmailchimp
user_1Session 2organicgoogle

Then let’s look at user level dimensions

When it comes to First User Medium & Source, we look at the first session for that user - which in this case is via the newsletter.

first_session_via_newsletter

first_user_medium_source_example

In a table it would look like this:

User IDSession NumberSession MediumSession SourceFirst User MediumFirst User Source
user_1Session 1emailmailchimpemailmailchimp
user_1Session 2organicgoogleemailmailchimp

Note how the First User Medium & First User Source are the same across both sessions.

First user today or first time the user ever visited?

When we say first user medium, does GA4 mean today or over all time?

It means over all time.

This means that it doesn’t matter if the first session for a user is one day, or one month ago, the First User Medium & Source would still come from that first user session.

Should I use First User or Session Dimensions?

Ultimately, it depends on which metric you want to use.

You can’t combine any metric with any dimension as these numbers can be misleading or straight-up wrong. We discuss this in detail in this blog about scoping in GA4.

The short answer is that:

  • If it's a website probably session.
  • If it's an app probably user.

Just remember this rule when pairing metrics and dimensions.

first_user_dimensions_compatability

Concept 2: What is Last Non Direct Click attribution?

There’s one really important last thing we need to remember with traffic sources in GA4.

It uses last non direct click attribution.

Let’s take our previous example and add on a 3rd visit. The user comes back direct to the homepage.

this_session_is_direct

The session is direct. But that isn’t what GA4 will show. It will show:

MediumSourceSessions
emailmailchimp1
organicgoogle2

What is happening?

Last Non Direct Click attribution

GA4 uses the last non direct click attribution model.

If the source of traffic is “direct”, GA4 then looks back through all the previous sessions (for 90 days) until it finds a source of traffic which isn’t direct.

session_medium_source_previous

Finally let’s talk about Google’s automatic groupings and customising them.

Concept 3: How do channel groupings work?

One last thing we need to cover is channel groupings. We'll be wanting this in BigQuery too.

If you’ve hopped into the GA4 interface, the main dimension you see isn’t source or medium it’s channel groupings.

session_primary_channel_group

How do channel groupings work?

The channel groupings group together similar sources of traffic under one label.

E.g. “Organic Search” groups together organic traffic from Google with organic traffic from Bing, and other search engines.

In a table it might look like this:

Default Channel GroupingSession MediumSession Source
Organic Searchorganicbing
Organic Searchorganicgoogle
Organic Searchorganicbaidu
Organic Searchreferralm.baidu.com
Organic Searchreferralcn.bing.com

What is Google's definition of Default Channel Grouping?

Google documentation defines it on this page.

We think there are a couple places it can be improved and you definitely should customise, which we’ve gone over in our Pipeline documentation.

What’s the difference between default channel grouping/primary channel grouping & custom channel groupings?

  • Default Channel Grouping is the default channel grouping that Google has set.
  • Custom Channel Groupings is a channel grouping you can set yourself.
  • Primary Channel Grouping is just an interface option of what do you want to see by default in the UI - default channel grouping, custon 1, custom 2 etc.

Where is the traffic source data stored in GA4 BigQuery?

If you’ve set up an automatic GA4 export to BigQuery, you’ll be able to get the raw data about traffic acquisition.

There are different fields depending on whether you’re looking for a a user or session level dimension.

User Level Dimensions

The traffic_source set of nested fields contains the information about the first time the user ever arrived at your website. This is over all time.

If you are looking for the very First User Source or First User Medium, this is where you’ll find this information.

Field NameData TypeWhat is it?
traffic.sourceRECORDThis record contains traffic source information that first acquired the user.
traffic_source.nameSTRINGThe name of the marketing campaign that first brought the user to the site (i.e. Spring_Sale, Holiday_Promo).
traffic_source.mediumSTRINGThe type of channel that first brought in the user (i.e. organic search, email, paid search)
traffic_source.sourceSTRINGThe name of the platform or site that first brought in the user (i.e. Google, Facebook)

Session Level Dimensions

The session level traffic source dimensions are found in several locations:

  • As nested fields in the event_params i.e. Event Parameters.
  • In the collected_traffic_source fields (as of 2023/05/03).
  • In the session_traffic_source_last_click fields (as of 2024/07/17)

Let’s break these down

Nested Fields in Event Parameters

Looking at an example session_start event, we can see where the session medium & source information is found.

This is the value just for the event it's set on.

session_start_event

Collected Traffic Fields

These are the same values we see in event parameters, but they’re now pulled out for you into their own column.

Session Traffic Source - Last Click

These fields contain the session level traffic source information attributed to last non-direct click.

There has been a bug with Paid Search historically in the GA4 BigQuery export

Paid Search results include a gclid in the parameters, which shows that the traffic came from Paid Google Ads.

gclid

However, historically, there was a bug in with the medium/source being set incorrectly and paid traffic wasn't being attributed correctly.

To fix this, you can search for the gclid in the URL and then set the medium and source correctly.

As mentioned before, in July 2024, Google Added the session_traffic_source_last_click field to the GA4 BigQuery export which provides the last non direct session level traffic source.

This is what GA4 uses in the interface, but isn't technically session level (as it uses last non direct click.) This can be used to fix the issue, but only for data from July 2024 onwards.

In the example queries below, we’ve added in a fix to ensure that this same misattribution isn’t happening.

When we're using Pipeline, we pull everything from the params because it will work historically - whereas new fields such as session_traffic_source_last_click only became available from July 2024.

How can we query traffic source metrics in GA4 BigQuery?

Right now you know all the fields and concepts, so let's look at some examples.

To make this easier, we've built a little widget which allows you to define the data set and date range which will automatically update the FROM statement in the SQL.

You can also custom set this based on whether you are looking at the Streaming or Daily GA4 tables.

Customise the queries.

Set-up variables to customise our BigQuery queries for your setup.

Calculating the number of Sessions by Session Medium

Take note when reading this and the next handful if queries. The main difference is the GROUP BY at the end.

This is the total number of sessions broken down by session medium.


-- Step 1: 
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with
    session_data as (
        select
            user_pseudo_id,
            event_timestamp,
            (
                select value.int_value
                from unnest(event_params)
                where key = 'ga_session_id'
            ) as ga_session_id,
            (
                select value.string_value from unnest(event_params) where key = 'medium'
            ) as session_medium,
            (
                select value.string_value from unnest(event_params) where key = 'source'
            ) as session_source,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'campaign'
            ) as session_campaign,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'page_location'
            ) as page_location
        from `@project.analytics_@analytics_property_id.@event_type*`
        where _table_suffix BETWEEN '@start_date' AND '@end_date'
    ),

    -- Step 2: 
    -- This fixes the GCLID bug and handle missing session attributes.
    -- This happens when the information exists in the URL (page_location) 
    -- but is not attributed correctly to session_source or session_medium.
    fixing_gclid_bug as (
        select
            * except (session_source, session_medium, session_campaign),
            case
                when page_location like '%gclid%' and session_source is null
                then "google"
                else session_source
            end as session_source,
            case
                when page_location like '%gclid%' and session_medium is null
                then "cpc"
                else session_medium
            end as session_medium,
            case
                when page_location like '%gclid%' and session_campaign is null
                then "(cpc)"
                else session_campaign
            end as session_campaign,
            regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
        from session_data
    ),

    -- Step 3: 
    -- This gets the first session_medium using the event_timestamp.
    session_data_2 as (
        select
            * except (session_medium),
            coalesce(
                first_value(
                    case
                        when session_source <> '(direct)'
                        then coalesce(session_medium, '(none)')
                    end ignore nulls
                ) over (session_window),
                '(none)'
            ) as session_medium
        from fixing_gclid_bug
        window
            session_window as (
                partition by ga_session_id, user_pseudo_id
                order by event_timestamp asc
                rows between unbounded preceding and unbounded following
            )
    ),

    -- Step 4: Count distinct sessions grouped by session_medium
    final_output as (
        select
            session_medium,
            count(distinct concat(user_pseudo_id, ga_session_id)) as sessions
        from session_data_2
        group by session_medium
    )

-- Step 5: Output final session counts
select *
from final_output
;

Calculating the number of Sessions by Session Source

This is the total number of sessions broken down by session source.


-- Step 1: 
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.

with session_data as (
    select
        user_pseudo_id,
        event_timestamp,
        (
            select value.int_value
            from unnest(event_params)
            where key = 'ga_session_id'
        ) as ga_session_id,
        (
            select value.string_value 
            from unnest(event_params) 
            where key = 'medium'
        ) as session_medium,
        (
            select value.string_value 
            from unnest(event_params) 
            where key = 'source'
        ) as session_source,
        (
            select value.string_value 
            from unnest(event_params) 
            where key = 'campaign'
        ) as session_campaign,
        (
            select value.string_value 
            from unnest(event_params) 
            where key = 'page_location'
        ) as page_location
        from `@project.analytics_@analytics_property_id.@event_type*`
        where _table_suffix BETWEEN '@start_date' AND '@end_date'
),

-- Step 2: 
-- This fixes the GCLID bug and handle missing session attributes.
-- This happens when the information exists in the URL (page_location) 
-- but is not attributed correctly to session_source or session_medium.

fixing_gclid_bug as (
    select
        * except (session_source, session_medium, session_campaign),
        case
            when page_location like '%gclid%' and session_source is null
            then "google"
            else session_source
        end as session_source,
        case
            when page_location like '%gclid%' and session_medium is null
            then "cpc"
            else session_medium
        end as session_medium,
        case
            when page_location like '%gclid%' and session_campaign is null
            then "(cpc)"
            else session_campaign
        end as session_campaign,
        regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
    from session_data
),

-- Step 3: 
-- This gets the first session_source using the event_timestamp.
session_data_2 as (
    select
        * except (session_source),
        coalesce(
            first_value(
                case
                    when session_source <> '(direct)'
                    then coalesce(session_source, '(none)')
                end ignore nulls
            ) over (session_window),
            '(direct)'
        ) as session_source

    from fixing_gclid_bug
    window session_window as (
        partition by ga_session_id,user_pseudo_id
        order by event_timestamp asc
        rows between unbounded preceding and unbounded following
    )
),

-- Step 4: Count distinct sessions grouped by session_source
final_output as (
    select
        session_source,
        count(distinct concat(user_pseudo_id, ga_session_id)) as sessions
    from session_data_2
    group by session_source
)

-- Step 5: Output final session counts
select *
from final_output;

Calculating the number of Sessions by Session Medium & Source

This is the total number of sessions broken down by session medium & source.


-- Step 1: 
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with
    session_data as (
        select
            user_pseudo_id,
            event_timestamp,
            (
                select value.int_value
                from unnest(event_params)
                where key = 'ga_session_id'
            ) as ga_session_id,
            (
                select value.string_value from unnest(event_params) where key = 'medium'
            ) as session_medium,
            (
                select value.string_value from unnest(event_params) where key = 'source'
            ) as session_source,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'campaign'
            ) as session_campaign,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'page_location'
            ) as page_location
        from `@project.analytics_@analytics_property_id.@event_type*`
        where _table_suffix BETWEEN '@start_date' AND '@end_date'
    ),

    -- Step 2: 
    -- This fixes the GCLID bug and handle missing session attributes.
    -- This happens when the information exists in the URL (page_location) 
    -- but is not attributed correctly to session_source or session_medium.
    fixing_gclid_bug as (
        select
            * except (session_source, session_medium, session_campaign),
            case
                when page_location like '%gclid%' and session_source is null
                then "google"
                else session_source
            end as session_source,
            case
                when page_location like '%gclid%' and session_medium is null
                then "cpc"
                else session_medium
            end as session_medium,
            case
                when page_location like '%gclid%' and session_campaign is null
                then "(cpc)"
                else session_campaign
            end as session_campaign,
            regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
        from session_data
    ),

    -- Step 3: 
    -- This gets the first session_source using the event_timestamp.
    session_data_2 as (
        select
            * except (session_source, session_medium),
            coalesce(
                first_value(
                    case
                        when session_source <> '(direct)'
                        then coalesce(session_source, '(none)')
                    end ignore nulls
                ) over (session_window),
                '(direct)'
            ) as session_source,
            coalesce(
                first_value(
                    case
                        when session_source <> '(direct)'
                        then coalesce(session_medium, '(none)')
                    end ignore nulls
                ) over (session_window),
                '(none)'
            ) as session_medium

        from fixing_gclid_bug
        window
            session_window as (
                partition by ga_session_id, user_pseudo_id
                order by event_timestamp asc
                rows between unbounded preceding and unbounded following
            )
    ),

    -- Step 4: Count distinct sessions grouped by session_source and session_medium
    final_output as (
        select
            session_source,
            session_medium,
            count(distinct concat(user_pseudo_id, ga_session_id)) as sessions
        from session_data_2
        group by session_source, session_medium
    )

-- Step 5: Output final session counts
select *
from final_output
;

Calculating the number of Sessions by other Campaign Parameters

As noted above, you can use campaign parameters to provide more information about how users have come to your site. These are contained in UTM parameters.

You can adapt the following query to look at whichever campaign parameters you have set up such as:

  • campaign
  • term
  • gclid etc.

For this example, I am going to use “campaign”


-- Step 1: 
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with
    session_data as (
        select
            user_pseudo_id,
            event_timestamp,
            (
                select value.int_value
                from unnest(event_params)
                where key = 'ga_session_id'
            ) as ga_session_id,
            (
                select value.string_value from unnest(event_params) where key = 'medium'
            ) as session_medium,
            (
                select value.string_value from unnest(event_params) where key = 'source'
            ) as session_source,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'campaign'
            ) as session_campaign,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'page_location'
            ) as page_location
        from `@project.analytics_@analytics_property_id.@event_type*`
        where _table_suffix BETWEEN '@start_date' AND '@end_date'
    ),

    -- Step 2: 
    -- This fixes the GCLID bug and handle missing session attributes.
    -- This happens when the information exists in the URL (page_location) 
    -- but is not attributed correctly to session_source or session_medium.
    fixing_gclid_bug as (
        select
            * except (session_source, session_medium, session_campaign),
            case
                when page_location like '%gclid%' and session_source is null
                then "google"
                else session_source
            end as session_source,
            case
                when page_location like '%gclid%' and session_medium is null
                then "cpc"
                else session_medium
            end as session_medium,
            case
                when page_location like '%gclid%' and session_campaign is null
                then "(cpc)"
                else session_campaign
            end as session_campaign,
            regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
        from session_data
    ),

    -- Step 3: 
    -- This gets the first session_source using the event_timestamp.
    session_data_2 as (
        select
            * except (session_campaign),
            coalesce(
                first_value(
                    case
                        when session_campaign is not null and session_campaign <> ''
                        then session_campaign
                    end ignore nulls
                ) over (session_window),
                '(none)'
            ) as session_campaign

        from fixing_gclid_bug
        window
            session_window as (
                partition by ga_session_id, user_pseudo_id
                order by event_timestamp asc
                rows between unbounded preceding and unbounded following
            )
    ),

    -- Step 4: Count distinct sessions grouped by session_source and session_medium
    final_output as (
        select
            session_campaign,
            count(distinct concat(user_pseudo_id, ga_session_id)) as sessions
        from session_data_2
        group by session_campaign
    )

-- Step 5: Output final session counts
select *
from final_output
;

First User Traffic Source Calculations

There are two ways we can use to calculate first user dimensions:

  1. Using the Traffic Source field which looks to the first user session ever.
  2. Manually calculating this by looking at the first session for a user throughout the time period selected.

The Traffic Source fields are great, but these fields don't exist in the streaming tables. Boo.

Let's talk through these two methods.

Calculating the number of Users by First User Medium (Using the Traffic Source Field)

This is the total number of users by First User Medium using the traffic_source field which looks at the first every user session for traffic source attribution.

Note: This will not work for the intraday (streaming) tables.


with
    data_sorting as (
        select
            user_pseudo_id,
            coalesce(traffic_source.medium, '(none)') as medium,  -- Coerce NULL to '(none)'
            row_number() over (
                partition by user_pseudo_id
                order by
                    event_timestamp  ASC -- Order by earliest timestamp
            ) as row_number
        from
            -- This table will need to be updated
            `@project.analytics_@analytics_property_id.@event_type*`
        where
            -- The date range being selected will need to be updated
            _table_suffix between '@start_date' and '@end_date'
    )
-- Select only the first occurrence of each user_pseudo_id
select count(distinct user_pseudo_id) as total_pseudo_users, medium
from data_sorting
where row_number = 1  -- Eliminate duplicate user_pseudo_ids across mediums
group by medium
order by total_pseudo_users desc
;

Calculating the number of Users by First User Source (Using the Traffic Source Field)

This is the total number of users by First User Source.

Note: This will not work for the intraday (streaming) tables.


with
    data_sorting as (
        select
            user_pseudo_id,
            coalesce(traffic_source.source, 'direct') as source,  -- Coerce NULL to 'direct'
            row_number() over (
                partition by user_pseudo_id
                order by
                    case when traffic_source.source = 'direct' then 1 else 0 end,  -- Prioritise other sources over '(none)'
                    traffic_source.source  -- Default order for remaining sources
            ) as row_number
        from
            -- This table will need to be updated
            `@project.analytics_@analytics_property_id.@event_type*`
        where
            -- The date range being selected will need to be updated
            _table_suffix between '@start_date' and '@end_date'
    )
-- Select only the first occurrence of each user_pseudo_id
select count(distinct user_pseudo_id) as total_pseudo_users, source
from data_sorting
where row_number = 1  -- Eliminate duplicate user_pseudo_ids across sources
group by source
order by total_pseudo_users desc
;

Calculating the number of Users by First User Medium (Manually Calculated)

This calculation comes from looking at the traffic source information for the first session by a user throughout the time period selected.


-- Step 1: 
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with
    session_data as (
        select
            user_pseudo_id,
            event_timestamp,
            (
                select value.int_value
                from unnest(event_params)
                where key = 'ga_session_id'
            ) as ga_session_id,
            (
                select value.string_value from unnest(event_params) where key = 'medium'
            ) as session_medium,
            (
                select value.string_value from unnest(event_params) where key = 'source'
            ) as session_source,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'campaign'
            ) as session_campaign,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'page_location'
            ) as page_location
        from `@project.analytics_@analytics_property_id.@event_type*`
        where _table_suffix BETWEEN '@start_date' AND '@end_date'
    ),

    -- Step 2: 
    -- This fixes the GCLID bug and handle missing session attributes.
    -- This happens when the information exists in the URL (page_location) 
    -- but is not attributed correctly to session_source or session_medium.
    fixing_gclid_bug as (
        select
            * except (session_source, session_medium, session_campaign),
            case
                when page_location like '%gclid%' and session_source is null
                then "google"
                else session_source
            end as session_source,
            case
                when page_location like '%gclid%' and session_medium is null
                then "cpc"
                else session_medium
            end as session_medium,
            case
                when page_location like '%gclid%' and session_campaign is null
                then "(cpc)"
                else session_campaign
            end as session_campaign,
            regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
        from session_data
    ),

    -- Step 3: 
    -- This gets the first session_medium for the user using the event_timestamp.
    session_data_2 as (
        select
            * except (session_medium),
            coalesce(
                first_value(
                    case
                        when session_source <> '(direct)'
                        then coalesce(session_medium, '(none)')
                    end ignore nulls
                ) over (user_pseudo_window),
                '(none)'
            ) as first_user_medium
        from fixing_gclid_bug
        window
            user_pseudo_window as (
                partition by user_pseudo_id
                order by event_timestamp asc
                rows between unbounded preceding and unbounded following
            )
    ),

    -- Step 4: Count distinct usersr grouped by first_user_medium
    final_output as (
        select
            first_user_medium,
            count(distinct concat(user_pseudo_id)) as users
        from session_data_2
        group by first_user_medium
    )

-- Step 5: Output final session counts
select *
from final_output

Calculating the number of Users by First User Source (Manually Calculated)

This calculation comes from looking at the traffic source information for the first session by a user throughout the time period selected.


-- Step 1: 
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with
    session_data as (
        select
            user_pseudo_id,
            event_timestamp,
            (
                select value.int_value
                from unnest(event_params)
                where key = 'ga_session_id'
            ) as ga_session_id,
            (
                select value.string_value from unnest(event_params) where key = 'medium'
            ) as session_medium,
            (
                select value.string_value from unnest(event_params) where key = 'source'
            ) as session_source,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'campaign'
            ) as session_campaign,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'page_location'
            ) as page_location
        from `@project.analytics_@analytics_property_id.@event_type*`
        where _table_suffix BETWEEN '@start_date' AND '@end_date'
    ),

    -- Step 2: 
    -- This fixes the GCLID bug and handle missing session attributes.
    -- This happens when the information exists in the URL (page_location) 
    -- but is not attributed correctly to session_source or session_medium.
    fixing_gclid_bug as (
        select
            * except (session_source, session_medium, session_campaign),
            case
                when page_location like '%gclid%' and session_source is null
                then "google"
                else session_source
            end as session_source,
            case
                when page_location like '%gclid%' and session_medium is null
                then "cpc"
                else session_medium
            end as session_medium,
            case
                when page_location like '%gclid%' and session_campaign is null
                then "(cpc)"
                else session_campaign
            end as session_campaign,
            regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
        from session_data
    ),

    -- Step 3: 
    -- This gets the first session_source for the user using the event_timestamp.
    session_data_2 as (
        select
            * except (session_source),
            coalesce(
                first_value(
                    case
                        when session_source <> '(direct)'
                        then coalesce(session_source, '(none)')
                    end ignore nulls
                ) over (user_pseudo_window),
                '(direct)'
            ) as first_user_source,
        from fixing_gclid_bug
        window
            user_pseudo_window as (
                partition by user_pseudo_id
                order by event_timestamp asc
                rows between unbounded preceding and unbounded following
            )
    ),

    -- Step 4: Count distinct userr grouped by first_user_source
    final_output as (
        select
            first_user_source,
            count(distinct concat(user_pseudo_id)) as users
        from session_data_2
        group by first_user_source
    )

-- Step 5: Output final session counts
select *
from final_output
order by users DESC
;