BigQuery & Basics

An Introduction to GA4 & BigQuery

GA4 lets you export data to BigQuery for free.

Ok so the GA4 interface is a bit of a mess, but the raw export is brilliant.

It opens up so many possibilities that were simply not possible with Universal Analytics (unless you were paying for GA360).

This article will talk through the fundamentals of how GA4 works and how that transfers to BigQuery.

If you do nothing else:

Set-up the BigQuery export.

  • You can set it up now for essentially free.
  • It doesn’t backfill.
  • So if you do need it you’ll be glad to have it!

Then when you want more from GA4 you can come back to it.

GA4 is an Event-Based Data Model

This means everything it collects is a single event.

ga4_centred_on_events

The types of things that can be an event includes:

  • Starting a session
  • Making a purchase
  • Doing a site search
  • Scrolling down a page

In the interface you can work with these events in a relatively limited fashion.

But you can also export the raw events from GA4 into BigQuery.

ga4_export_to_bigquery

You can set-up GA4 BigQuery right now, even if you don’t use it immediately

If you have a GA4 property set up you can set that up now and for 95% of sites it’s will be basically free.

There are some small costs associated, but for most sites you’ll be paying cents a month.

For example: We get a couple thousand users a month and our cost is ~$0.03.

If you’re unsure where to start with setting this up then read our blog.

If you haven’t already got it set up, we’d recommend it.

It can’t be backfilled so good to get set up sooner rather than later.

Even if it does cost you slightly more, BigQuery primarily bills on how much use the data not what you store, so you set-up it and if you don't use it, it will still be super cheap.

Why use GA4 BigQuery?

There are a lot of pros to using the raw export. Let’s dive into this:

You have access to all the historical data

In the GA4 interface you can only hold onto data for a set period of time before it’s deleted forever from Analytics servers.

The default is:

  • User and event data: 2 months or 14 months for most properties. For Google Analytics 360, extended options like 26, 38, or 50 months are available.
  • Large/XL properties: Event-level retention is reduced to 2 months once collection limits are exceeded, and older data is deleted.

The good news is that when you have the raw data, you don’t have these limits and can store all the information for as long as you like.

You can avoid sampling issues in the GA4 interface

GA4 has a quota request limit.

This limit is 10 million events for standard Google Analytics properties for event level queries.

When you hit this limit, GA4 takes a sample of the data and scales it to give results that reflect the overall trends.

This can be a PITA.

boo

When you have access to the raw data yourself, you don’t need to sample - it’s your data.

You don’t run into cardinality problems.

Dimensions such as Country or Source can have lots of values.

When you view these dimensions with metrics such as Sessions, for example, you may see a large number grouped under (other).

This is because the GA4 interface report hits a row limit.

problem_with_cardinality

Having access to the raw data with BigQuery means that you don’t run into this issue.

You can easily exclude spam traffic

We’ve all been in the situation where Spam traffic has messed up our GA4 data.

Unlike UA with views, in GA4 there is no easy way to exclude spam from all of your reports in GA4.

You either have to filter when it's being collected or permanently run filters on all of your report, which can cause some of the sampling and lookback issues we've talked about here.

Having the raw data makes it really easy to just exclude this as bot traffic follows very similar patterns and trends.

You can fix old data.

If your UTMs were previously broken, you can now update them historically.

This means that historic traffic sources do not need to be broken forever.

If you have badly named events, you can also fix them.

we_can_fix_weird_naming

GA4 is expecting events to have specific names or they won’t appear in reports in the UI.

If you have submitted events with the wrong name, they won’t appear.

monetization_overview

Having access to the raw data means we can easily pull this through to custom reports and bypass issues.

Retroactive calculation of data

Very similar to the above.

With the raw data, any segments, conversions etc. you can define can be applied retrospectively. This means that if you didn’t set up a key event or conversion at the very onset, you are able to go and backdate it (assuming that you were tracking the event).

One example could be users who have visited specific page categories or downloaded specific resources, for example.

No thresholding

The opposite problem to sampling is thresholding.

With the raw GA4 data, you do not have access to more specific demographic information such as gender and interests, but this also means that you don’t run into thresholding issues.

These data limits are applied in the interface to prevent GA users from being able to figuring out who individual users are or finding out sensitive info about them, based on things like demographics, interests, or other identifiable information.

Better conversions

With the raw data, you’re able to write SQL which allows you to make more complex conversions.

We're no longer stuck with key events. Here's an example for catching blog email signups visualised using our Pipeline tool.

You're seeing a form, but all it's doing on the backend is writing SQL which we can do because we have the raw data!

better conversions

Which might generate something like this:

max(
    case
        when
            event_name = 'form_submit'
            and regexp_contains(page_location, r'blog|landing-page-a')
        then 1
        else 0
    end
) over (session_window) as example_email_sub_from_blog

The point is with SQL we can be so much more flexible.

Even better imagine that we'd changed the name of the conversion event? We could easily just et-up an OR piece of logic to cover both.

You can create better session and user segments

In the GA4 interface, you can create “segments” in Explorations.

There are plenty of limitations:

  1. Only dimensions - These can only be based on dimensions - and not metrics.
  2. Limited to explorations - You're limited to explorations. You can't easily get them in dashboards.
  3. Lookback limits - Explorations have data lookback limits.
  4. Sampling issues - You are going to run into sampling issues if you're a decent size.
  5. Limited to filters - They're filters, not adding new custom dimensions & metrics. GA4 is even more limited in adding those.

But with access to the raw data and a bit of SQL we can setup similar segments to the interface as well as way beyond it.

Take a super basic. Sessions that have had more than 5 pageviews. It's built off metrics, it's a custom metric, not just a filter, although we can use it as one, we can retroactively apply it and it won't sample.

In our output tables it might look like this:

example_custom_session_dimension_2

Some common segments we see people setting up:

  • Is a B2B user - For a site with mixed B2C and B2B users, they'll flag any user who visits the B2B section of the site/reads a B2B email and report just on that segment.
  • Last blog page visited in the session - What's the last content they visited in the session. Then filter to those who converted.
  • Last page visited in the session - If you have users who land in the wrong locale and then change locales this lets you report on users based on the correct locale they end up in.
  • Sessions that have had more than 5 pageviews - Let's get all of the heavy use sessions and see what they have in common.

The SQL here is slightly more complex, but we're mostly building things like this:

last_value(page_location) over (session_window) as last_page_visited_in_session

or

case    
    sum(if(event_name = "page_view", 1, 0)) over (session_window) > 5
    then 1
    else 0
end as more_than_5_pageviews_in_session

Take a look at our Pipeline tool documentation if you want to get a sense of how to structure them.

And of course if you do want a tool to help you do this. Please checkout Pipeline!

Anyway you get the idea.

In short, the raw GA4 data allows you to be far more specific than the native GA4 segments which are fundamentally just filters.

You can do your own custom attribution

By default, GA4 provides several attribution models to determine which channel get credit for a conversion. The primary one used by GA4 is data driven attribution.

Having access to all the raw data, though, means that it’s possible to create your own attribution model based on what works best for your business.

So this could be:

  • Time delay
  • Last Touch
  • First Touch
  • Linear Attribution

We have all the raw data - the world is our oyster.

How can I use the GA4 BigQuery?

Great you've solved me. How can I go about using it?

The basic process is:

  • Write some SQL
  • Put it into a BI platform/or download to a spreadsheet.

There are a couple ways you can do that process:

  • BI Tool Basic:
    • Write SQL queries directly in a BI tool.
  • Spreadsheet:
    • Write them in BigQuery and then download to a spreadsheet.
  • Write your own pipelines:
    • Make pipelines with something like DBT or Dataform.
  • Piped Out Pipeline:
    • We built a whole product to let you use your GA4 BigQuery data without having to know any SQL!
    • You configure it with our UI.
    • You can then use this data directly in a BI platform, download to Excel or use our Looker Studio connector to start directly building reports.

Conceptually all of these methods are doing the same thing. They’re all converting raw event data into concepts that you actually use:

  • Sessions
  • Users
  • etc.

event_bigquery_sessions

What does raw event data look like?

The event is core to BigQuery GA4 export and comes with supporting information.

Each event has:

  • event_name - Every event needs a name.
  • event_timestamp - Every event arrives at a timestamp.

There are also other pieces of information which help us tie everything together such as:

  • session_id - The unique ID for the session.
  • user_id - The unique ID for the user.

How many events are there?

As many as you want there to be! You can define as many custom events as you’d like.

There are however a number of default events.

These either fire by default or fire automatically if you have enabled enhanced measurement.

These include events such as:

  • session_start: This is triggered whenever a user starts a new session on your site or app.
  • user_engagement: This event fires when a user actively engages with a page.
  • video_start: This tracks when a user starts playing a video on your site.
  • page_view (web) or screen_view (app): These events capture every time a user views a new page (on web) or screen (in an app).

There are also interaction based events:

  • form_submit: This event triggers when a user submits a form.
  • file_download : This tracks when a user downloads a file.
  • click : When a user clicks a link that leads away from the current domain,

Having these enhanced measurement events makes it super easy to just get up and running.

Events come with supporting information (parameters)

This is then supported by information tied to that event such as:

  • The date/ time
  • The device
  • The location
  • The traffic source
  • The session the event fell in.
  • The user responsible.

Google calls all of these event properties and we can see them in the BigQuery table structure (schema):

BigQuery_data_structure

Often this information is buried in nested fields.

This is just to group what is quite a lot of information.

nested_fields

The event_params can be used on top of this to pass whatever you want

Event parameters is a special field ( event_params) that lets you set up to 25 custom values to be passed with any event.

Here's an example for a session_start event.

custom_parameters

GA also adds some default values to this. Things like ga_session_id to tell you which session the event belongs to.

There are also default event parameters:

default_event_params

The types of default event_params which fire with every event include include:

  • language
  • page_location
  • page_referrer
  • page_title
  • screen_resolution

You can then fire as many custom values as you’d like. And Google will often set pieces of information which are specific to one event in here.

E.g. if you click an outbound link (one leaving your site), it will store the link of the URL here.

How do we turn raw event data into the metrics and dimensionswe see in GA4?

The key to this is using IDs. ID's let us tie events together.

ga_session_id which is in event_params, gives us the ID for a session.

Multiple events can have the same ga_session_id, so we can use this to group events together.

E.g. If a user landed onto a site, visited 3 pages and then converted in one session, each event would have the same ga_session_id in the event parameters for each of these events.

same_session_id

There's a lot more to this, but this is core of what is happening. We make these constructs using keys and then we count within them.

Do conversions exist in GA4 BigQuery?

Yep. In the GA4 interface, conversions are called "Key Events", but in BigQuery they are just events.

You count them just like any other event. And as we mentioned earlier you can actually do better conversions.

How are Universal Analytics and GA4 different?

Right so we care about, how is the underlying data model different?

There are loads of differences when it comes to the UI, but that's not what we're interested in here (let's quickly pour one out for the old UA interface).

The GA4 data model is way better and more flexible.

Essentially:

  • Universal Analytics creates sessions and puts everything into a session.
  • Google Analytics 4 stores everything events - these are the building blocks for everything else.

It might be easier to just see it in BigQuery.

Let’s look firstly at Universal Analytics.

Each row is based on a session.

The screenshot below shows the data structure for Universal Analytics which is centred on visits (i.e Sessions).

ua_data

Then when we look at GA4

Each row is based on an event.

The other columns provide additional information about that event.

Those events contain all the information to roll up to everything else into sessions and users.

ga4_data

As you can imagine the GA4 is harder to work with at a glance. But once you get past that it's just so much more flexible and powerful.

That's it for now.

If you'd like some help with GA4 BigQuery please check out our GA4 Pipeline tool. It'll take care of all the complex SQL, let you configure it through a UI and then you can either build directly with a Looker Studio connector or connect directly to our pre-built tables.