Data tips for when you’re doing Causal AI

Collecting the Right Data

Several tools on the market help you collect interaction data from your site or product. Many of them also suggest which types of interactions to track and offer best practices for event naming conventions. We strongly recommend that you follow the advice given by those event collection tools.

That said, if you already know which outcomes you want to model, make sure you’re collecting the data needed to model them. For example, if you’re interested in modeling sign-ups, you’d want to consider tracking things like:

  • Which pages users viewed
  • What sections users scrolled to
  • Which elements users interacted with (buttons, dropdowns, etc.)
  • Date and time stamps for each event
  • An identifier for each user
  • etc
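As a rough illustration of the list above, here is a minimal sketch of what a single tracked event might look like. The `Event` class, its field names, and the sample values are all hypothetical; your event collection tool will have its own schema and naming conventions, and those should take precedence.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """One tracked interaction. All field names here are illustrative."""
    user_id: str           # an identifier for each user
    event_name: str        # e.g. "page_view", "scroll", "click"
    target: str            # the page, section, or element involved
    occurred_at: datetime  # when the event happened (UTC in this sketch)


# Made-up sample data for a single user.
example_events = [
    Event("u1", "click", "sign_up_button", datetime(2024, 1, 5, 9, 3, tzinfo=timezone.utc)),
    Event("u1", "page_view", "/pricing", datetime(2024, 1, 5, 9, 0, tzinfo=timezone.utc)),
    Event("u1", "scroll", "/pricing#plans", datetime(2024, 1, 5, 9, 1, tzinfo=timezone.utc)),
]

# Sorting by timestamp recovers the order in which the user did things.
timeline = sorted(example_events, key=lambda e: e.occurred_at)
```

Keeping the user identifier and the timestamp on every record is what makes it possible to rebuild each user's journey later.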

If you’re collecting data on, and modeling, something that happens in-product rather than on your site, the types of things you collect may be different.


In conversations about causal inference or causal AI, one thing we’ve seen get too little attention is the ordering of events, or chronology. Anybody familiar with causal inference knows why it matters, but not everybody who does the data collection and processing does.

Chronology matters because the model needs to know the order in which events happened. A cause has to come before its effect, so chronology gives the model the context it needs to answer questions like: Did X happen before Y? Does X lead to Y?

When collecting data, make sure you’re recording not just the event, but also the time at which it occurred.

If you collect data that looks like the image below, it could be interpreted as meaning that watching a video leads to signing up.

No Date or Time

But, if you include a time stamp, you might see that the flow actually looks more like this:

With Time Stamp

Here you can see that the video is actually being watched after sign-up occurs, so it could not have caused the sign-up and is not relevant to it. This example is simplified, but it helps to illustrate the point.
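The same check can be sketched in code. This is a minimal example, assuming events are stored as `(user_id, event_name, timestamp)` tuples; the function name `happened_before`, the event names, and the sample data are all made up for illustration.

```python
from datetime import datetime


def happened_before(events, user_id, first, second):
    """Return True if the user's earliest `first` event precedes their
    earliest `second` event, False if it does not, and None if the user
    is missing either event."""
    earliest = {}
    for uid, name, ts in events:
        if uid == user_id and name in (first, second):
            if name not in earliest or ts < earliest[name]:
                earliest[name] = ts
    if first not in earliest or second not in earliest:
        return None
    return earliest[first] < earliest[second]


# Made-up data: the video is watched twelve minutes after sign-up.
events = [
    ("u1", "watch_video", datetime(2024, 1, 5, 9, 12)),
    ("u1", "sign_up", datetime(2024, 1, 5, 9, 0)),
]

video_first = happened_before(events, "u1", "watch_video", "sign_up")
signup_first = happened_before(events, "u1", "sign_up", "watch_video")
```

Without the timestamps, the two rows alone would give no way to tell which of these two orderings actually happened.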

Selection Bias

Selection bias occurs when the data you use is not a properly randomized sample, and so does not represent the larger population you want to draw conclusions about.

This relates to collecting the right data, but it goes further: it isn’t just about the data being collected, but also about the data that’s used in the model. Just because you collect data doesn’t mean it needs to be included in your model.

That said, when collecting data and choosing what to include, you need to avoid selection bias. Selection bias produces inaccurate causal results because the model doesn’t see the full range of possible positive and negative outcomes. The results are skewed, and you could end up taking action that makes your overall results worse.
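To make the skew concrete, here is a small sketch with entirely made-up numbers. It compares the sign-up rate across all visitors with the rate in a subset restricted to one acquisition channel. The `channel` field, the channel names, and the counts are hypothetical and only illustrate how an unrepresentative subset can distort an estimate.

```python
def signup_rate(rows):
    """Fraction of rows in which the visitor signed up."""
    return sum(r["signed_up"] for r in rows) / len(rows)


# Made-up population: 200 visitors from two hypothetical channels.
visitors = (
    [{"channel": "newsletter", "signed_up": True}] * 30
    + [{"channel": "newsletter", "signed_up": False}] * 20
    + [{"channel": "ads", "signed_up": True}] * 10
    + [{"channel": "ads", "signed_up": False}] * 140
)

# Rate over everyone: 40 sign-ups out of 200 visitors.
full_rate = signup_rate(visitors)

# Rate over a biased subset: 30 sign-ups out of 50 newsletter visitors.
newsletter_only = [r for r in visitors if r["channel"] == "newsletter"]
biased_rate = signup_rate(newsletter_only)

print(f"all visitors: {full_rate:.0%}, newsletter only: {biased_rate:.0%}")
```

In this toy data, a model built only on the newsletter subset would see a sign-up rate three times higher than the one in the full population.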

Having Enough Data

Causal inference doesn’t work like an LLM. You don’t need petabytes of data to get accurate models, but you do need enough occurrences of each event to make predictions confidently. We suggest having a data set of at least a few hundred rows (you could think of a row as a website visit, a product user, etc.). As you collect more data the model will get more accurate, but based on our testing even a few hundred rows can start to generate meaningful results.
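A quick way to sanity-check a data set against this guidance is to count rows overall and per event. The sketch below assumes each row is a dict with an `event_name` key. The thresholds are assumptions chosen for illustration: 300 stands in for "a few hundred rows", and the per-event minimum of 30 is a hypothetical value, not a recommendation from our testing.

```python
from collections import Counter

MIN_ROWS = 300       # assumption: stands in for "a few hundred rows"
MIN_PER_EVENT = 30   # assumption: hypothetical per-event minimum


def data_readiness(rows, min_rows=MIN_ROWS, min_per_event=MIN_PER_EVENT):
    """Summarize whether a data set meets the (assumed) size thresholds."""
    counts = Counter(r["event_name"] for r in rows)
    sparse = sorted(name for name, n in counts.items() if n < min_per_event)
    return {
        "total_rows": len(rows),
        "enough_rows": len(rows) >= min_rows,
        "sparse_events": sparse,  # events seen fewer than min_per_event times
    }


# Made-up data: plenty of page views, very few sign-ups.
rows = [{"event_name": "page_view"}] * 350 + [{"event_name": "sign_up"}] * 12
report = data_readiness(rows)
```

Here the data set passes the overall row count, but `sign_up` is flagged as sparse, which suggests collecting more data before modeling on that event.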

You can read more about data requirements in a previous blog post here.

If you want to talk about causal inference or have questions about the data you’re collecting and how to work with it, we’re happy to help!

If you want to see what Dacture is working on, you can schedule a meeting with us, or contact us here.