At the moment it feels like, almost daily, we're seeing incredibly powerful and exciting new developments in AI and Machine Learning. Large Language Models like GPT-4 are giving normal people, with no expertise in creating AI tools, brand-new opportunities to do things we thought were impossible (or at least very, very hard).
Even so, you can still run into problems where GPT doesn't do exactly what you need it to. Maybe it doesn't know all the information you need (perhaps it's quite niche, or it's private data about your company), or you find you have to spend ages writing instructions to convince it to pay attention to certain things, answer in specific ways, or extract information the way you want.
OpenAI themselves have given talks on how to solve these problems and supercharge GPT's performance.
(Pictured above) A recreation of OpenAI's diagram of techniques for maximising LLM performance.
OpenAI recommends that everyone starts in the bottom left-hand corner with "prompt engineering" before moving up, or to the right, based on specific needs. While options like "prompt engineering" have been well covered by others in the past, until now, options like RAG and fine-tuning have stayed locked behind specific technical knowledge. That means that the average person can't use these powerful options to get even more out of these world-class tools.
Not anymore! In this post, we're going to briefly explain "RAG" and "fine-tuning" and share some free tools that will let you supercharge how you use GPT, without needing any coding knowledge at all.
Getting extra data into the model with RAG (Retrieval-Augmented Generation)
A common problem when working with tools like GPT is that, while they are great at surfacing information, they might not always know the information we need them to work with.
Often the best solution to that is just "put that information in the prompt". We have some great examples of doing just that in this blog post which talks about how different members of our team use some of the latest ML tools.
The big problem is that sometimes we don't know exactly what information is the most relevant. Say we want GPT to answer a series of questions using internal data from our company. We've got a few options. We could:
- Copy and paste in all of the information that could be relevant (but then the model tends to miss the actual relevant information about 40% of the time).
- Pull out the most relevant information for each question (but that takes a bunch of time, and we were really hoping systems like GPT could do that sort of work for us!).
- Use the same kind of tech that powers GPT, to dynamically pull in the most relevant information each time we send a question.
(Spoiler alert - what we're offering does number 3 for you).
How do we make our information accessible to GPT?
One of the things that makes GPT so powerful is that you don't have to use exact words to find the information you need. You can describe things however makes most sense to you, or misspell things, and the system will still understand what you're talking about.
That's because, when you send a message to a tool like GPT, it doesn't actually see the words you've written. Instead, your words are converted into a big long string of numbers that represents the concept of what you wrote down.
This is the concept of "mouse" and the concept of "elephant" converted into numbers the way GPT understands them.
What's really useful about that is once words are converted into numbers you can plot them on a kind of chart. You can say, "Here are all the mouse-like words, elephant-like words are a bit further away, and right now I don't want any of that stuff - I want the sentences that are about last year's revenue".
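To make that idea concrete, here's a rough sketch using the OpenAI Python library - the model name, example phrases, and similarity function are just assumptions for illustration, not part of the tool itself:

```python
# A rough sketch of the "concepts as numbers" idea, assuming the openai Python package
# (pip install openai numpy) and an OPENAI_API_KEY set in your environment.
import numpy as np
from openai import OpenAI

client = OpenAI()

# Hypothetical phrases - "mouse", "elephant", and the revenue example from above.
phrases = ["mouse", "elephant", "last year's revenue by quarter"]
response = client.embeddings.create(model="text-embedding-3-small", input=phrases)
vectors = [np.array(item.embedding) for item in response.data]

def similarity(a, b):
    # Cosine similarity: closer to 1 means the two concepts sit closer together.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(similarity(vectors[0], vectors[1]))  # mouse vs elephant - relatively close (both animals)
print(similarity(vectors[0], vectors[2]))  # mouse vs revenue - noticeably further apart
```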
So, basically, RAG involves three steps (there's a rough code sketch of the whole flow just after this list):
- Converting all your important, potentially relevant, information into these numbers and saving it in a special database.
- Every time you want to ask GPT a question, converting that question into numbers too and checking your database for any saved data with similar numbers.
- Dynamically adding that data to the prompt when you ask GPT to answer your question.
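Here's a minimal sketch of those three steps in Python, assuming the OpenAI library and a plain in-memory list in place of a proper vector database - the documents and question are made up purely for illustration:

```python
# A minimal RAG sketch. Model names, documents, and the question are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBEDDING_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-4o-mini"

# Step 1: convert your important information into numbers and keep them stored.
documents = [
    "Q3 revenue last year was £1.2m, up 8% on Q2.",
    "The office closes at 5:30pm on Fridays.",
    "Our biggest client renewed their contract in March.",
]
doc_vectors = [
    np.array(item.embedding)
    for item in client.embeddings.create(model=EMBEDDING_MODEL, input=documents).data
]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Step 2: convert the question into numbers and find the saved data with similar numbers.
question = "What was revenue in Q3 last year?"
q_vector = np.array(
    client.embeddings.create(model=EMBEDDING_MODEL, input=[question]).data[0].embedding
)
best_match = documents[int(np.argmax([cosine(q_vector, v) for v in doc_vectors]))]

# Step 3: dynamically add that data to the prompt when asking GPT the question.
answer = client.chat.completions.create(
    model=CHAT_MODEL,
    messages=[
        {"role": "system", "content": f"Answer the question using this context: {best_match}"},
        {"role": "user", "content": question},
    ],
)
print(answer.choices[0].message.content)
```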
Until now, that has required writing code to do advanced data pipelining and to work with specialised APIs and databases. Now you can create your own, disposable version of these RAG databases just by opening up a GSheet, pasting your important data into a table, and clicking the buttons on our free, custom extension.
If you want to learn more, here are some videos of me explaining how the tool is useful and how to use it:
Teaching the model to behave differently with fine-tuning
While RAG helps add extra information to the model, fine-tuning helps change the way the model behaves by giving it new examples to learn from.
The idea here is more straightforward than RAG. When we're working with a tool like GPT we're basically trying to give it the right instructions so that it gives us certain outputs. The default way to do that is to write those instructions straight into the prompt - this is a big part of "prompt engineering".
Sometimes those instructions end up having to be quite long and complex, and sometimes just writing them into the prompt isn't enough. Something you've probably noticed in the past (I certainly have) is that when you're trying to get someone to do something specific, the best way to get on the same page is to give them a bunch of examples of exactly what you want.
That's the idea behind fine-tuning - instead of writing out loads of instructions, you just give it a bunch of examples, creating a custom version of a GPT model that is trained on your examples on top of all of the training and knowledge it started with.
OpenAI has found with its own clients that fine-tuned versions (even ones based on older models) often outperform the most cutting-edge models, simply because they are better suited to the task.
Normally you would have to write code to construct your examples, use APIs to start fine-tuning, and use a specific API to work with your fine-tuned models - but not any more!
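For the curious, here's roughly what that do-it-yourself route looks like with the OpenAI Python library. It's only a sketch: the file name, base model, and example conversation are assumptions for illustration, not something our sheet requires.

```python
# A sketch of the manual fine-tuning route, assuming the openai Python package.
import json
from openai import OpenAI

client = OpenAI()

# Build a JSONL file of examples - each line is a conversation showing the exact
# behaviour you want the model to learn. These examples are hypothetical.
examples = [
    {"messages": [
        {"role": "system", "content": "You summarise analytics updates in one friendly sentence."},
        {"role": "user", "content": "Sessions were up 12% month on month."},
        {"role": "assistant", "content": "Good news - sessions grew 12% compared with last month."},
    ]},
    # ...many more examples like this...
]
with open("fine_tune_examples.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Upload the examples to your own OpenAI account and start a fine-tuning job.
training_file = client.files.create(file=open("fine_tune_examples.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")

# When the job finishes, the custom model's name appears on the job as fine_tuned_model,
# and you can send prompts to it just like any other model.
print(client.fine_tuning.jobs.retrieve(job.id).status)
```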
We've created another free sheet for you where you just paste in a bunch of example questions and answers to a tab.
Then you press some buttons in the extension we made (pictured right) to upload your data to your own private GPT account.
The sheet will automatically start a "fine tune" job, again - in your GPT account. It'll start creating a custom-tuned version of GPT that has learned from the extra examples you've shared.
When the fine-tuning is done, you can use the same extension to send prompts to your fine-tuned model and even compare the responses you get from it to responses you've got from other models.
If you want to learn more, here are some videos of me explaining how the tool is useful and how to use it:
Conclusion
So there you have it - two free tools that will help you take your use of GPT to the next level.
At Aira, we believe that the more people have access to more tools, the more we'll see amazing and creative ideas for what to do next. We can't wait to see what you do with these!
Are you struggling with the fact your GA4 data doesn’t stretch back as far as you’d like?
Perhaps you're struggling to join up your UA data, which stopped dead in July 2023, with your GA4 data. Or maybe you've noticed really significant discrepancies between your GA4 sessions and your UA sessions?
If so, then this Google Sheet is for you! We've put together a sheet that helps you blend your existing GA4 data with historical UA data. Our Historic GA4 Sessions Generator creates backdated, estimated GA4 sessions going back to the starting period of your UA data.
This allows you to maintain a consistent analytics history which is really useful for reporting, forecasting and general data analysis.
Make a copy and download the sheet here
Why is this tool required?
In July 2023, Google phased out Universal Analytics (UA) and transitioned to Google Analytics 4 (GA4). With the discontinuation of UA, many found themselves missing historical data, as they hadn't implemented GA4 tracking going back that far.
This may lead to challenges, such as:
- Generalised reporting - If you want to examine session data over time, you are limited to only the period where you have GA4 set up.
- Forecasting and Causal Impact testing - For these, you will generally require at least a year or two of data for the model to understand seasonality and yearly trends.
- Comparative analysis: Comparing current performance with historical performance becomes challenging. For folks who were late to set up GA4 tracking, it can be difficult to draw direct comparisons between past and present performance metrics.
So what’s the problem with simply joining UA and GA4 data together?
It may seem easy enough to just take your historical UA data and blend it with your GA4 data from a date of your choosing, but there is one pretty fundamental problem. Even with early GA4 setup and concurrent running with UA, you’ll notice discrepancies.
This discrepancy comes from the fact that a 'session' means something different in UA and GA4, because the two platforms differ in how they track user activity across various devices and platforms.
The solution? Generating backdated “predicted” GA4 sessions.
How does this Google Sheet generate “predicted” GA4 sessions?
This sheet generates these backdated, historical GA4 session numbers by looking at the average proportional difference between UA and GA4 sessions for the periods where both are tracked simultaneously.
From there, the historical GA4 sessions are calculated by multiplying the known UA sessions by this ratio and “voilà!” you have your backdated GA4 sessions.
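If it helps to see the maths spelled out, here's a tiny sketch of that calculation in Python with pandas - the session numbers are completely made up, and the sheet itself does all of this with formulas rather than code:

```python
# A tiny sketch of the sheet's core calculation, using pandas and invented numbers.
import pandas as pd

# Daily sessions for a period where both UA and GA4 were tracking at the same time.
overlap = pd.DataFrame({
    "ua_sessions":  [1000, 1200, 900],
    "ga4_sessions": [850, 1020, 780],
})

# Average proportional difference between the two (expressed here as GA4 divided by UA).
ratio = (overlap["ga4_sessions"] / overlap["ua_sessions"]).mean()

# Backdated GA4 estimate: multiply the known historical UA sessions by that ratio.
historic_ua = pd.Series([1100, 950, 1300])
estimated_ga4 = (historic_ua * ratio).round()
print(round(ratio, 3), estimated_ga4.tolist())
```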
Benefits
- This sheet is straightforward to use and is based on simple calculations.
- All calculations and data are handled in Google Sheets as opposed to a Colab notebook etc. (and who doesn’t love Google Sheets?)
- The visualisations allow you to see how the backdated GA4 sessions compare to:
a) The recorded UA sessions
b) The recorded GA4 sessions.
Limitations
- This does require having a period where both GA4 and UA sessions are tracked simultaneously (the longer the better!)
- These backdated GA4 sessions are estimates based on the proportional difference between known UA and GA4 sessions.
- This sheet does require you to pull in the UA and GA4 data yourself, though you can use Google Sheets add-ons to help with this.
How do I use this tool?
This section provides a step-by-step guide of how to use the sheet.
Step 1: Make a copy of the Google Sheet
You can make your copy of the Google Sheet here.
Step 2: Navigate to the “UA - Raw Data” tab to provide Universal Analytics Sessions data
You will need to provide historical Universal Analytics data in the “UA - Raw Data” tab.
This sheet is ideally designed to work using an output from the Google Analytics Spreadsheet Add-On, with the data starting from row 15 (though you can enter it manually).
For this sheet to work, Column A should be Date and Column B Sessions.
This data can be inputted by either…
Option A - Using the Google Analytics Spreadsheet Add-On. This is the easiest option.
This requires you to have downloaded the Add On, set up the Report Configuration using the template below and then run the report. This will auto-populate the “UA - Raw Data” tab.
Option B - Manually copying and pasting in the sessions data from the Google Analytics interface or from a Looker Studio Report.
Important notes
- You will need to ensure that the data is formatted in exactly the same way as the existing sheet, which includes:
- The data starting at row 15.
- The date column being in the format yyyy-mm-dd.
- Column A being Date and Column B being Sessions.
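If you'd like a quick way to sanity-check your export before pasting it in, a short Python snippet like the one below works - the file name is hypothetical and this step is entirely optional (the sheet only needs the data pasted into the tab):

```python
# Optional sanity check on a hypothetical CSV export before pasting it into the tab.
import pandas as pd

df = pd.read_csv("ua_sessions_export.csv")  # hypothetical file name
df.columns = ["Date", "Sessions"]           # Column A = Date, Column B = Sessions

# Every date should parse as yyyy-mm-dd and every sessions value should be numeric.
assert pd.to_datetime(df["Date"], format="%Y-%m-%d", errors="coerce").notna().all(), \
    "Dates must be in yyyy-mm-dd format"
assert pd.to_numeric(df["Sessions"], errors="coerce").notna().all(), \
    "Sessions must be numeric"
print("Looks good - ready to paste in from row 15.")
```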
Step 3 - Navigate to the “GA4 - Raw Data” tab to provide GA4 Sessions data
You will need to provide GA4 data in the “GA4 - Raw Data” tab.
This operates in a very similar way to the UA sessions data, ideally using a similar output from a Google Sheets add-on. In this case, the Add-On is the Adformatics Google Analytics 4 Google Sheet Add-On, with the data starting from row 15.
For this sheet to work, Column A should be Date and Column B Sessions.
This data can be inputted by either…
Option A - Using the Adformatics Google Analytics 4 Google Sheet Add On.
This requires you to have downloaded the Add On, set up the Report Configuration using the template below and then run the report. This will auto-populate the “GA4 - Raw Data” tab.
Option B - Manually copying and pasting in the sessions data from the Google Analytics 4 interface or from a Looker Studio Report.
Important notes
- You will need to ensure that the data is formatted the exact same way as the existing sheet which includes:
- The data starting at row 15.
- The data column is in the format yyyy-mm-dd.
- Column A being Date and Column B being sessions
Step 4 - Review the outputs
Once you have loaded in the UA and GA4 data, the Google Sheet will do the magic in the background.
If you want to see what’s going on in the background, see the “How do these calculations work?” section.
There are a number of outputs which allow you to see how the sessions compare between the different sources.
The raw numbers
The first output to have a look at is the set of tables containing the raw data broken down by date. These include:
- UA Recorded Sessions - These are the known UA sessions.
- GA4 Recorded Sessions - These are the known GA4 sessions.
- GA4 Sessions: Recorded + Backfilled Estimates - These combine the known GA4 sessions with the “predicted” backfilled GA4 sessions. These numbers are calculated using the average ratio between UA and GA4 sessions for the period where there is overlap.
Comparing UA Sessions To GA4 Sessions
This section allows you to see the proportional difference (or ratio) between the GA4 sessions and the UA sessions.
This is broken down into:
- Classic Average - This is the average of the proportional difference between the GA4 sessions and UA sessions throughout the whole time period.
- Weighted Average - This is the average of the proportional difference between the GA4 sessions and UA sessions throughout the whole time period but this time weighted by the recorded GA4 sessions. This means that the average is going to be more impacted by days with higher sessions recorded.
By default, this is set to use the Classic Average, but this can be updated in the ‘[HIDDEN] Calculations’ tab.
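For anyone curious about the difference between the two, here's the same idea expressed in a few lines of Python - the ratios and session counts are invented purely to show the calculation:

```python
# Classic vs weighted average of the daily GA4/UA ratios, with invented numbers.
import numpy as np

ratios = np.array([0.85, 0.88, 0.80, 0.90])      # daily GA4-to-UA ratios
ga4_sessions = np.array([500, 2000, 300, 1800])  # recorded GA4 sessions on those days

classic_average = ratios.mean()                              # every day counts equally
weighted_average = np.average(ratios, weights=ga4_sessions)  # busier days count for more
print(round(classic_average, 3), round(weighted_average, 3))
```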
"GA4 Total: Recorded + Backfilled Estimates" vs UA Recorded Sessions
This graph allows you to compare how the GA4 sessions (including the backdated “predicted” GA4 sessions) compare to the recorded UA sessions across the full period.
This allows us to compare how GA4 sessions generally compare to UA sessions - you’ll often see that one is consistently higher than the other.
Recorded GA4 Sessions vs Recorded UA Sessions
This graph allows you to compare how the known GA4 Sessions compare to the known UA Sessions. You’ll most likely see that UA Recorded Sessions are cut off at a specific point, before just GA4 sessions are recorded.
Recorded GA4 Sessions vs "GA4 Total: Recorded + Backfilled Estimates"
The graph displays the Recorded GA4 sessions (green) and allows you to compare them with how the backdated GA4 sessions look.
This enables you to see when the transition takes place from recorded to predicted sessions, which is useful - especially when it comes to forecasting and adding regressors.
GA4 Total: Recorded + Backfilled Estimates - Mapped Over Time
This graph displays the main data and output that you’ll want from this Google Sheet: the recorded GA4 sessions alongside the backdated GA4 sessions.
The surrounding graphs are primarily there to provide context for these final figures.
How do these calculations work?
There are six key stages in generating these figures:
- Step 1 - Work out the earliest start date and latest end date across the GA4 and UA data.
- Step 2 - Use these start and end dates to generate a date range from the starting point of the UA data, to the end point of the GA4 data.
For those interested, SEQUENCE is the magic formula to be able to do this.
- Step 3 - Pull in the UA and GA4 data for each of the days in the date range.
A simple VLOOKUP is all that’s used here with an IFERROR to catch the gaps.
- Step 4 - Calculate the average proportional difference between UA and GA4 sessions (we’ve done this both as a weighted average and as a classic average, to provide a couple of options).
The first step here is calculating the proportional difference for each day between the UA and GA4 Sessions which I’ve done using an ARRAYFORMULA.
The second step is then using this column to generate the different averages.
- Step 5 - Multiply the historic UA sessions by the average ratio between UA and GA4 sessions to generate our backdated, predicted GA4 session numbers.
We can then decide which average we’d like to use, using the checkbox which dictates which column populates the final “GA4 Sessions: Recorded + Backfilled Estimates” column.
- Step 6 - Once we have generated these numbers, we are then able to pull these final figures into the final Outputs page to generate our graphs and final GA4 Sessions with the predicted backdated sessions.
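To make those six steps easier to follow, here's a compact sketch of the same logic in Python with pandas - the dates and session numbers are invented, and the sheet itself does all of this with SEQUENCE, VLOOKUP and ARRAYFORMULA rather than code:

```python
# The six steps re-expressed with pandas, using invented dates and session counts.
import pandas as pd

ua = pd.DataFrame({"date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]),
                   "ua_sessions": [1000, 1200, 900]})
ga4 = pd.DataFrame({"date": pd.to_datetime(["2022-01-03", "2022-01-04"]),
                    "ga4_sessions": [780, 950]})

# Steps 1-2: date range from the start of the UA data to the end of the GA4 data
# (the SEQUENCE step in the sheet).
dates = pd.DataFrame({"date": pd.date_range(ua["date"].min(), ga4["date"].max())})

# Step 3: pull in UA and GA4 data for each day, leaving gaps where a source is missing
# (the VLOOKUP + IFERROR step).
merged = dates.merge(ua, on="date", how="left").merge(ga4, on="date", how="left")

# Step 4: average GA4/UA ratio over the overlap (classic average shown here).
overlap = merged.dropna(subset=["ua_sessions", "ga4_sessions"])
ratio = (overlap["ga4_sessions"] / overlap["ua_sessions"]).mean()

# Step 5: backdated, predicted GA4 sessions from the historic UA sessions.
merged["ga4_estimated"] = merged["ua_sessions"] * ratio

# Step 6: final column - recorded GA4 where it exists, otherwise the backfilled estimate.
merged["ga4_recorded_plus_backfilled"] = merged["ga4_sessions"].fillna(merged["ga4_estimated"])
print(merged)
```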
Final takeaways
This tool is a practical solution for bridging the gap between Universal Analytics and Google Analytics 4. It's not a perfect solution, but it's a way of filling the potential void in your data using a somewhat more intelligent approach than just pulling GA4 and UA session data in together.
Reach out to me @da_westby on X (formerly known as Twitter) to let me know what you think.