BigQuery & Basics

GA4 Traffic Sources Explained

The two most common things people want from their GA4/analytics data is:

  1. How are people getting to my site?
  2. How is my traffic, conversions & revenue looking?

Today we’re doing the first. Traffic sources.

As with all our posts we’ve got the same 3 goals:

  1. Understand the fundamentals of how it all works.
  2. Apply that to using GA4.
  3. Get comfortable using GA4 BigQuery Export knowing the principles.

GA4 and BigQuery don't do the best job of explaining why they have so many traffic sources and what the difference is between them, so we'll fill in that gap.

What is traffic source?

Traffic source describes where someone came from when they visit your site.

The basic definition is quite simple. The fiddly part is each traffic source will have multiple scopes/attribution models which is going to make it messy.

Medium, source & campaign are the "big 3". Medium/source are populated automatically, but pretty much all of these can be manually set using UTM parameters.

Of course if you've scanned the BigQuery schema you're probably wondering how we go from simply collecting that data to:

  • traffic_source.medium
  • event_params.medium
  • collected_traffic_source.medium
  • session_traffic_source_last_click.manual_campaign.medium
  • session_traffic_source_last_click.google_ads_campaign.medium
  • session_traffic_source_last_click.cross_channel_campaign.medium

That's a lot of mediums.

So what's the difference between session_traffic_source_last_click, collected_traffic_source, traffic_source and the event_params?

What a mouthful. The "short" version is:

  • traffic_source is all time first user medium.
  • Event parameters is event level medium.
  • collected_traffic_source is event level medium but you don't have to unnest it.
  • session_traffic_source_last_click is last non direct click session level medium.
    • This then comes in different flavours:
      • manual: I.e. a manually set UTM.
      • google_ads_campaign: This got pulled from Google Ads. (GA4 can be tied to Google Ads.)
      • cross_channel_campaign: This got pulled from a Google Ads connection but it doesn't just have to be Google Ads. You could get a youtube campaign in here for example.

If that made no sense read on.

Let's look at 3 basic parts to traffic sources in GA4 that we need to understand to make sense of this.

Concept 1: User vs Session Traffic Sources

The first part that confuses most people is how traffic source works at different scopes.

Each traffic source has a different scope.

In GA4 you’ll see two different windows in the UI:

user_vs_session_dimensions

Each with a different set of dimensions

  • First User Medium, First User Source … etc.
  • Session Medium, Session Source … etc.

They’re different. What’s going on here? Which one do you choose?

An example

Let’s say we’re tracking a single user’s activity on a specific day on our site.

  1. They get our newsletter and then visit our site.
  2. Later in the day, the come back to the site using Google.

single_users_sessions

There are two ways we can think about the way the user visited the site:

  1. At the session level - i.e. How did they arrive to the site for each individual visit?
    • These are Session Dimensions - i.e Session Medium or Session Source
  2. At the user level - i.e How did the user initially come to the site?
    • These are First User Dimensions - i.e First User Medium or First User Source.

Let’s look at session level dimensions

Using the example above, there are two sessions which have a unique session medium & session source.

In a table it would look like this:

User IDSession NumberSession MediumSession Source
user_1Session 1emailmailchimp
user_1Session 2organicgoogle

Then let’s look at user level dimensions

When it comes to First User Medium & Source, we look at the first session for that user - which in this case is via the newsletter.

first_session_via_newsletter

first_user_medium_source_example

In a table it would look like this:

User IDSession NumberSession MediumSession SourceFirst User MediumFirst User Source
user_1Session 1emailmailchimpemailmailchimp
user_1Session 2organicgoogleemailmailchimp

Note how the First User Medium & First User Source are the same across both sessions.

First user today or first time the user ever visited?

When we say first user medium, does GA4 mean today or over all time?

It means over all time.

This means that it doesn’t matter if the first session for a user is one day, or one month ago, the First User Medium & Source would still come from that first user session.

Should I use First User or Session Dimensions?

Ultimately, it depends on which metric you want to use.

You can’t combine any metric with any dimension as these numbers can be misleading or straight-up wrong. We discuss this in detail in this blog about scoping in GA4.

The short answer is that:

  • If it's a website probably session.
  • If it's an app probably user.

Just remember this rule when pairing metrics and dimensions.

first_user_dimensions_compatability

That's our first concept.

Each traffic source has a different scope. The most common is session vs user.

But we can also have event level too.

Concept 2: What is Last Non Direct Click attribution?

The next concept we need to understand is attribution. By default GA4 uses last non direct click attribution.

Let’s take our previous example and add on a 3rd visit. The user comes back direct to the homepage.

this_session_is_direct

The session is direct. But that isn’t what GA4 will show. It will show:

MediumSourceSessions
emailmailchimp1
organicgoogle2

What is happening?

Last Non Direct Click attribution

GA4 uses the last non direct click attribution model.

If the source of traffic is “direct”, GA4 then looks back through all the previous sessions (for 90 days) until it finds a source of traffic which isn’t direct.

session_medium_source_previous

This is crucial and explains the difference between our event level medium from event parameters and the session level medium's from session_traffic_source_last_click.

Concept 3: Extra data sources

This one is nice and simple.

You can connect GA4 to other Google data sources. If you do they'll attempt to pull in that data and connect it your GA4 data.

This means that in our BigQuery export we're going to get data from all these different sources which is going to force many of our traffic source fields to have different versions: one of each different source Google can connect to and pull from.

Concept 4: Unnesting

event_params is the clever core of GA4's data model flexibility.

We talk about at length in our explanation of the GA4 export schema.

For the purposes of this post all we need to know is that it's slightly more complex to use than just selecting from a column and so Google tried to make it easier.

Now we have all the pieces we need to differentiate between the different traffic sources

So lets go through each of them using medium as an example:

What is the traffic_source property?

  • This is the medium of the first time this user ever visited your site. (And by user we mean device + browser).

What are the traffic sources in event_params?

  • This is the medium of this specific event. So it will exist for events fired on the landing page typically and disappear after that. We have to unnest this to use it.

What are the traffic sources in the collected_traffic_source property?

  • It's the same as event_params.medium, but Google did the unnesting for us. The downside is that Google didn't make this available from the start. It seems like it was added around 2023-05-03, at least on the data sources we can see.

What are the traffic sources in the session_traffic_source_last_click property?

  • This is the medium of the session (using last non direct click, Google has set the attribution for us). It's taken the event level data and aggregated it to session level. Now of course there are multiple versions in here's an example one.

What are the traffic sources in the session_traffic_source_last_click.google_ads_campaign property?

  • This is the medium of the session (using last non direct click, Google has set the attribution for us) as Google Ads saw it. This means this will only show up if you have a Google Ads connection and the medium was Google Ads i.e. (cpc).

And so on and so forth!

Where is the traffic source data stored in GA4 BigQuery?

Great so that's how they're all different. Now we'll just briefly run through all the fields grouped by the scope they're in (as there is more just medium), just to be thorough.

User Level Dimensions

We'll start with the traffic source fields. This is the user one so:

The traffic_source set of nested fields contains the information about the first time the user ever arrived at your website. This is over all time.

Unlike the other ones, this contains a reduced set of fields and isn't quite as complete as the others.

Field NameData TypeWhat is it?
traffic.sourceRECORDThis record contains traffic source information that first acquired the user.
traffic_source.nameSTRINGThe name of the marketing campaign that first brought the user to the site (i.e. Spring_Sale, Holiday_Promo).
traffic_source.mediumSTRINGThe type of channel that first brought in the user (i.e. organic search, email, paid search)
traffic_source.sourceSTRINGThe name of the platform or site that first brought in the user (i.e. Google, Facebook)

Session Level Dimensions

As mentioned the session level traffic source dimensions are found in several locations:

  • As nested fields in the event_params i.e. Event Parameters.
  • In the collected_traffic_source fields (as of 2023/05/03).
  • In the session_traffic_source_last_click fields (as of 2024/07/17)

Split by attribution model and ease of use.

Nested Fields in Event Parameters

There is no set list of fields here, it's literally whatever you set in event parameters in the URL.

Looking at an example session_start event, we can see this one has medium and source set.

This is the value just for the event it's set on.

session_start_event

Collected Traffic Fields

These are the same values we see in event parameters, but they’re now pulled out for you into their own column.

Session Traffic Source - Last Click

These fields contain the session level traffic source information attributed to last non-direct click.

Sometimes the GA4 BigQuery export doesn't attribute paid search correctly.

You can spot this by looking for a gclid/wbraid/gbraid parameter in the URL.

gclid

To fix this you have to look for these parameters and manually set medium in order to correctly classify your traffic.

To fix this, you can search for the gclid in the URL and then set the medium and source correctly.

In the example queries below, we’ve added in a fix to ensure that this same misattribution isn’t happening.

When we're using Pipeline, we pull everything from the params because it will work historically - whereas new fields such as session_traffic_source_last_click only became available from July 2024.

How can we query traffic source metrics in GA4 BigQuery?

Right now you know all the fields and concepts, so let's look at some examples.

To make this easier, we've built a little widget which allows you to define the data set and date range which will automatically update the FROM statement in the SQL.

You can also custom set this based on whether you are looking at the Streaming or Daily GA4 tables.

Customise the queries.

Set-up variables to customise our BigQuery queries for your setup.

Calculating the number of Sessions by Session Medium

Take note when reading this and the next handful if queries. The main difference is the GROUP BY at the end.

This is the total number of sessions broken down by session medium.


-- Step 1: 
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with
    session_data as (
        select
            user_pseudo_id,
            event_timestamp,
            (
                select value.int_value
                from unnest(event_params)
                where key = 'ga_session_id'
            ) as ga_session_id,
            (
                select value.string_value from unnest(event_params) where key = 'medium'
            ) as session_medium,
            (
                select value.string_value from unnest(event_params) where key = 'source'
            ) as session_source,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'campaign'
            ) as session_campaign,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'page_location'
            ) as page_location
        from `@project.analytics_@analytics_property_id.@event_type*`
        where _table_suffix BETWEEN '@start_date' AND '@end_date'
    ),

    -- Step 2: 
    -- This fixes the GCLID bug and handle missing session attributes.
    -- This happens when the information exists in the URL (page_location) 
    -- but is not attributed correctly to session_source or session_medium.
    fixing_gclid_bug as (
        select
            * except (session_source, session_medium, session_campaign),
            case
                when page_location like '%gclid%' and session_source is null
                then "google"
                else session_source
            end as session_source,
            case
                when page_location like '%gclid%' and session_medium is null
                then "cpc"
                else session_medium
            end as session_medium,
            case
                when page_location like '%gclid%' and session_campaign is null
                then "(cpc)"
                else session_campaign
            end as session_campaign,
            regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
        from session_data
    ),

    -- Step 3: 
    -- This gets the first session_medium using the event_timestamp.
    session_data_2 as (
        select
            * except (session_medium),
            coalesce(
                first_value(
                    case
                        when session_source <> '(direct)'
                        then coalesce(session_medium, '(none)')
                    end ignore nulls
                ) over (session_window),
                '(none)'
            ) as session_medium
        from fixing_gclid_bug
        window
            session_window as (
                partition by ga_session_id, user_pseudo_id
                order by event_timestamp asc
                rows between unbounded preceding and unbounded following
            )
    ),

    -- Step 4: Count distinct sessions grouped by session_medium
    final_output as (
        select
            session_medium,
            count(distinct concat(user_pseudo_id, ga_session_id)) as sessions
        from session_data_2
        group by session_medium
    )

-- Step 5: Output final session counts
select *
from final_output
;

Calculating the number of Sessions by Session Source

This is the total number of sessions broken down by session source.


-- Step 1: 
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.

with session_data as (
    select
        user_pseudo_id,
        event_timestamp,
        (
            select value.int_value
            from unnest(event_params)
            where key = 'ga_session_id'
        ) as ga_session_id,
        (
            select value.string_value 
            from unnest(event_params) 
            where key = 'medium'
        ) as session_medium,
        (
            select value.string_value 
            from unnest(event_params) 
            where key = 'source'
        ) as session_source,
        (
            select value.string_value 
            from unnest(event_params) 
            where key = 'campaign'
        ) as session_campaign,
        (
            select value.string_value 
            from unnest(event_params) 
            where key = 'page_location'
        ) as page_location
        from `@project.analytics_@analytics_property_id.@event_type*`
        where _table_suffix BETWEEN '@start_date' AND '@end_date'
),

-- Step 2: 
-- This fixes the GCLID bug and handle missing session attributes.
-- This happens when the information exists in the URL (page_location) 
-- but is not attributed correctly to session_source or session_medium.

fixing_gclid_bug as (
    select
        * except (session_source, session_medium, session_campaign),
        case
            when page_location like '%gclid%' and session_source is null
            then "google"
            else session_source
        end as session_source,
        case
            when page_location like '%gclid%' and session_medium is null
            then "cpc"
            else session_medium
        end as session_medium,
        case
            when page_location like '%gclid%' and session_campaign is null
            then "(cpc)"
            else session_campaign
        end as session_campaign,
        regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
    from session_data
),

-- Step 3: 
-- This gets the first session_source using the event_timestamp.
session_data_2 as (
    select
        * except (session_source),
        coalesce(
            first_value(
                case
                    when session_source <> '(direct)'
                    then coalesce(session_source, '(none)')
                end ignore nulls
            ) over (session_window),
            '(direct)'
        ) as session_source

    from fixing_gclid_bug
    window session_window as (
        partition by ga_session_id,user_pseudo_id
        order by event_timestamp asc
        rows between unbounded preceding and unbounded following
    )
),

-- Step 4: Count distinct sessions grouped by session_source
final_output as (
    select
        session_source,
        count(distinct concat(user_pseudo_id, ga_session_id)) as sessions
    from session_data_2
    group by session_source
)

-- Step 5: Output final session counts
select *
from final_output;

Calculating the number of Sessions by Session Medium & Source

This is the total number of sessions broken down by session medium & source.


-- Step 1: 
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with
    session_data as (
        select
            user_pseudo_id,
            event_timestamp,
            (
                select value.int_value
                from unnest(event_params)
                where key = 'ga_session_id'
            ) as ga_session_id,
            (
                select value.string_value from unnest(event_params) where key = 'medium'
            ) as session_medium,
            (
                select value.string_value from unnest(event_params) where key = 'source'
            ) as session_source,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'campaign'
            ) as session_campaign,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'page_location'
            ) as page_location
        from `@project.analytics_@analytics_property_id.@event_type*`
        where _table_suffix BETWEEN '@start_date' AND '@end_date'
    ),

    -- Step 2: 
    -- This fixes the GCLID bug and handle missing session attributes.
    -- This happens when the information exists in the URL (page_location) 
    -- but is not attributed correctly to session_source or session_medium.
    fixing_gclid_bug as (
        select
            * except (session_source, session_medium, session_campaign),
            case
                when page_location like '%gclid%' and session_source is null
                then "google"
                else session_source
            end as session_source,
            case
                when page_location like '%gclid%' and session_medium is null
                then "cpc"
                else session_medium
            end as session_medium,
            case
                when page_location like '%gclid%' and session_campaign is null
                then "(cpc)"
                else session_campaign
            end as session_campaign,
            regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
        from session_data
    ),

    -- Step 3: 
    -- This gets the first session_source using the event_timestamp.
    session_data_2 as (
        select
            * except (session_source, session_medium),
            coalesce(
                first_value(
                    case
                        when session_source <> '(direct)'
                        then coalesce(session_source, '(none)')
                    end ignore nulls
                ) over (session_window),
                '(direct)'
            ) as session_source,
            coalesce(
                first_value(
                    case
                        when session_source <> '(direct)'
                        then coalesce(session_medium, '(none)')
                    end ignore nulls
                ) over (session_window),
                '(none)'
            ) as session_medium

        from fixing_gclid_bug
        window
            session_window as (
                partition by ga_session_id, user_pseudo_id
                order by event_timestamp asc
                rows between unbounded preceding and unbounded following
            )
    ),

    -- Step 4: Count distinct sessions grouped by session_source and session_medium
    final_output as (
        select
            session_source,
            session_medium,
            count(distinct concat(user_pseudo_id, ga_session_id)) as sessions
        from session_data_2
        group by session_source, session_medium
    )

-- Step 5: Output final session counts
select *
from final_output
;

Calculating the number of Sessions by other Campaign Parameters

As noted above, you can use campaign parameters to provide more information about how users have come to your site. These are contained in UTM parameters.

You can adapt the following query to look at whichever campaign parameters you have set up such as:

  • campaign
  • term
  • gclid etc.

For this example, I am going to use “campaign”


-- Step 1: 
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with
    session_data as (
        select
            user_pseudo_id,
            event_timestamp,
            (
                select value.int_value
                from unnest(event_params)
                where key = 'ga_session_id'
            ) as ga_session_id,
            (
                select value.string_value from unnest(event_params) where key = 'medium'
            ) as session_medium,
            (
                select value.string_value from unnest(event_params) where key = 'source'
            ) as session_source,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'campaign'
            ) as session_campaign,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'page_location'
            ) as page_location
        from `@project.analytics_@analytics_property_id.@event_type*`
        where _table_suffix BETWEEN '@start_date' AND '@end_date'
    ),

    -- Step 2: 
    -- This fixes the GCLID bug and handle missing session attributes.
    -- This happens when the information exists in the URL (page_location) 
    -- but is not attributed correctly to session_source or session_medium.
    fixing_gclid_bug as (
        select
            * except (session_source, session_medium, session_campaign),
            case
                when page_location like '%gclid%' and session_source is null
                then "google"
                else session_source
            end as session_source,
            case
                when page_location like '%gclid%' and session_medium is null
                then "cpc"
                else session_medium
            end as session_medium,
            case
                when page_location like '%gclid%' and session_campaign is null
                then "(cpc)"
                else session_campaign
            end as session_campaign,
            regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
        from session_data
    ),

    -- Step 3: 
    -- This gets the first session_source using the event_timestamp.
    session_data_2 as (
        select
            * except (session_campaign),
            coalesce(
                first_value(
                    case
                        when session_campaign is not null and session_campaign <> ''
                        then session_campaign
                    end ignore nulls
                ) over (session_window),
                '(none)'
            ) as session_campaign

        from fixing_gclid_bug
        window
            session_window as (
                partition by ga_session_id, user_pseudo_id
                order by event_timestamp asc
                rows between unbounded preceding and unbounded following
            )
    ),

    -- Step 4: Count distinct sessions grouped by session_source and session_medium
    final_output as (
        select
            session_campaign,
            count(distinct concat(user_pseudo_id, ga_session_id)) as sessions
        from session_data_2
        group by session_campaign
    )

-- Step 5: Output final session counts
select *
from final_output
;

First User Traffic Source Calculations

There are two ways we can use to calculate first user dimensions:

  1. Using the Traffic Source field which looks to the first user session ever.
  2. Manually calculating this by looking at the first session for a user throughout the time period selected.

The Traffic Source fields are great, but these fields don't exist in the streaming tables. Boo.

Let's talk through these two methods.

Calculating the number of Users by First User Medium (Using the Traffic Source Field)

This is the total number of users by First User Medium using the traffic_source field which looks at the first every user session for traffic source attribution.

Note: This will not work for the intraday (streaming) tables.


with
    data_sorting as (
        select
            user_pseudo_id,
            coalesce(traffic_source.medium, '(none)') as medium,  -- Coerce NULL to '(none)'
            row_number() over (
                partition by user_pseudo_id
                order by
                    event_timestamp  ASC -- Order by earliest timestamp
            ) as row_number
        from
            -- This table will need to be updated
            `@project.analytics_@analytics_property_id.@event_type*`
        where
            -- The date range being selected will need to be updated
            _table_suffix between '@start_date' and '@end_date'
    )
-- Select only the first occurrence of each user_pseudo_id
select count(distinct user_pseudo_id) as total_pseudo_users, medium
from data_sorting
where row_number = 1  -- Eliminate duplicate user_pseudo_ids across mediums
group by medium
order by total_pseudo_users desc
;

Calculating the number of Users by First User Source (Using the Traffic Source Field)

This is the total number of users by First User Source.

Note: This will not work for the intraday (streaming) tables.


with
    data_sorting as (
        select
            user_pseudo_id,
            coalesce(traffic_source.source, 'direct') as source,  -- Coerce NULL to 'direct'
            row_number() over (
                partition by user_pseudo_id
                order by
                    case when traffic_source.source = 'direct' then 1 else 0 end,  -- Prioritise other sources over '(none)'
                    traffic_source.source  -- Default order for remaining sources
            ) as row_number
        from
            -- This table will need to be updated
            `@project.analytics_@analytics_property_id.@event_type*`
        where
            -- The date range being selected will need to be updated
            _table_suffix between '@start_date' and '@end_date'
    )
-- Select only the first occurrence of each user_pseudo_id
select count(distinct user_pseudo_id) as total_pseudo_users, source
from data_sorting
where row_number = 1  -- Eliminate duplicate user_pseudo_ids across sources
group by source
order by total_pseudo_users desc
;

Calculating the number of Users by First User Medium (Manually Calculated)

This calculation comes from looking at the traffic source information for the first session by a user throughout the time period selected.


-- Step 1: 
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with
    session_data as (
        select
            user_pseudo_id,
            event_timestamp,
            (
                select value.int_value
                from unnest(event_params)
                where key = 'ga_session_id'
            ) as ga_session_id,
            (
                select value.string_value from unnest(event_params) where key = 'medium'
            ) as session_medium,
            (
                select value.string_value from unnest(event_params) where key = 'source'
            ) as session_source,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'campaign'
            ) as session_campaign,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'page_location'
            ) as page_location
        from `@project.analytics_@analytics_property_id.@event_type*`
        where _table_suffix BETWEEN '@start_date' AND '@end_date'
    ),

    -- Step 2: 
    -- This fixes the GCLID bug and handle missing session attributes.
    -- This happens when the information exists in the URL (page_location) 
    -- but is not attributed correctly to session_source or session_medium.
    fixing_gclid_bug as (
        select
            * except (session_source, session_medium, session_campaign),
            case
                when page_location like '%gclid%' and session_source is null
                then "google"
                else session_source
            end as session_source,
            case
                when page_location like '%gclid%' and session_medium is null
                then "cpc"
                else session_medium
            end as session_medium,
            case
                when page_location like '%gclid%' and session_campaign is null
                then "(cpc)"
                else session_campaign
            end as session_campaign,
            regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
        from session_data
    ),

    -- Step 3: 
    -- This gets the first session_medium for the user using the event_timestamp.
    session_data_2 as (
        select
            * except (session_medium),
            coalesce(
                first_value(
                    case
                        when session_source <> '(direct)'
                        then coalesce(session_medium, '(none)')
                    end ignore nulls
                ) over (user_pseudo_window),
                '(none)'
            ) as first_user_medium
        from fixing_gclid_bug
        window
            user_pseudo_window as (
                partition by user_pseudo_id
                order by event_timestamp asc
                rows between unbounded preceding and unbounded following
            )
    ),

    -- Step 4: Count distinct usersr grouped by first_user_medium
    final_output as (
        select
            first_user_medium,
            count(distinct concat(user_pseudo_id)) as users
        from session_data_2
        group by first_user_medium
    )

-- Step 5: Output final session counts
select *
from final_output

Calculating the number of Users by First User Source (Manually Calculated)

This calculation comes from looking at the traffic source information for the first session by a user throughout the time period selected.


-- Step 1: 
-- Extract key session attributes from event parameters
-- and pulling in required fields such as user_pseudo_id and
-- event_timestamp.
with
    session_data as (
        select
            user_pseudo_id,
            event_timestamp,
            (
                select value.int_value
                from unnest(event_params)
                where key = 'ga_session_id'
            ) as ga_session_id,
            (
                select value.string_value from unnest(event_params) where key = 'medium'
            ) as session_medium,
            (
                select value.string_value from unnest(event_params) where key = 'source'
            ) as session_source,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'campaign'
            ) as session_campaign,
            (
                select value.string_value
                from unnest(event_params)
                where key = 'page_location'
            ) as page_location
        from `@project.analytics_@analytics_property_id.@event_type*`
        where _table_suffix BETWEEN '@start_date' AND '@end_date'
    ),

    -- Step 2: 
    -- This fixes the GCLID bug and handle missing session attributes.
    -- This happens when the information exists in the URL (page_location) 
    -- but is not attributed correctly to session_source or session_medium.
    fixing_gclid_bug as (
        select
            * except (session_source, session_medium, session_campaign),
            case
                when page_location like '%gclid%' and session_source is null
                then "google"
                else session_source
            end as session_source,
            case
                when page_location like '%gclid%' and session_medium is null
                then "cpc"
                else session_medium
            end as session_medium,
            case
                when page_location like '%gclid%' and session_campaign is null
                then "(cpc)"
                else session_campaign
            end as session_campaign,
            regexp_extract(page_location, r"(?i)gclid=([^&]+)") as dim__gclid
        from session_data
    ),

    -- Step 3: 
    -- This gets the first session_source for the user using the event_timestamp.
    session_data_2 as (
        select
            * except (session_source),
            coalesce(
                first_value(
                    case
                        when session_source <> '(direct)'
                        then coalesce(session_source, '(none)')
                    end ignore nulls
                ) over (user_pseudo_window),
                '(direct)'
            ) as first_user_source,
        from fixing_gclid_bug
        window
            user_pseudo_window as (
                partition by user_pseudo_id
                order by event_timestamp asc
                rows between unbounded preceding and unbounded following
            )
    ),

    -- Step 4: Count distinct userr grouped by first_user_source
    final_output as (
        select
            first_user_source,
            count(distinct concat(user_pseudo_id)) as users
        from session_data_2
        group by first_user_source
    )

-- Step 5: Output final session counts
select *
from final_output
order by users DESC
;