An in-depth discussion of both the implementation and technical underpinnings of Robyn follows -
robyn_inputs() mainly captures all model specification for the dataset. Here we break down some of the underlying concepts
InputCollect <- robyn_inputs( dt_input = dt_simulated_weekly ,dt_holidays = dt_prophet_holidays ,date_var = "DATE" # date format must be "2020-01-01" ,dep_var = "revenue" # there should be only one dependent variable ,dep_var_type = "revenue" # "revenue" or "conversion" ,prophet_vars = c("trend", "season", "holiday") # "trend","season", "weekday" & "holiday" ,prophet_country = "DE"# input one country. dt_prophet_holidays includes 59 countries by default ,context_vars = c("competitor_sales_B", "events") # e.g. competitors, discount, unemployment etc ,paid_media_spends = c("tv_S","ooh_S","print_S","facebook_S", "search_S") ,paid_media_vars = c("tv_S", "ooh_S,"print_S","facebook_I","search_clicks_P") ,organic_vars = c("newsletter") # marketing activity without media spend ,factor_vars = c("events") # specify which variables in context_vars or organic_vars are factorial ,window_start = "2016-01-01" ,window_end = "2018-12-31" ,adstock = "geometric" # geometric, weibull_cdf or weibull_pdf.)
Refer to the Data Collection section of the Analyst's Guide to MMM for a detailed discussion on best practices for selecting your paid media variables.
For paid media variables, we currently require that users designate both
paid_media_spends. These vectors should be of the same length, and the elements in the same positions in each list should correspond (e.g. if element 2 in paid_media_vars is your tv variable, then element 2 in paid_media_spends should be your spend that corresponds to that tv data)
When exposure metrics are provided in
paid_media_vars, Robyn will fit them against their corresponding spend using the Michaelis-Menten nonlinear fitting. If the fit is low, it indicates that exposure metrics have different underlying pattern than their spends. Therefore, Robyn will recommend to splitting the channel for potentially better modelling result. Take Meta as example, retargeting and prospecting campaigns might have very different CPMs and efficiencies. In which case, it would be meaningful to split Meta by retargeting and prospecting.
Ensure that the paid media data is complete and accurate before proceeding. We will talk more about the variable transformations paid media variables will be subject to in the Variable Transformation section.
Robyn enables users to specify
organic_vars to model marketing activities without direct spend. Typically, this may include newsletter, push notification, social media posts, among other efforts. Moreover, organic variables are expected to have similar carryover (adstock) and saturating behavior as paid media variables. The respective transformation techniques such as geometric or Weibull transformation for adstock; or Hill transformation for saturation, are also applied to organic variables. More on these transformations in the following section -
- Reach / Impressions on blog posts
- Impressions on organic & unpaid social media
- SEO improvements
- Email campaigns
- Reach on UGC
Below a chart showing the different types of organic variables that could be modeled
All contextual variables must be specified as elements in
context_vars. For a detailed discussion on potential contextual variables to include in your model, see the Data Collection section of the Analyst's guide to MMM.
MMM is typlically characterised by the following two hypothesis:
- Ads investment has lagged effect / carries over through time. For example, I see ads today and buy next week.
- Ads investment has diminishing returns. For example, the more I spent on a channel, the less marginal return I will get.
Robyn conducts two types of transformation to account for these hypothesis:
- Adstock transformation
- Saturation transformation
Adstock refers to the hypothesis of ads carryover and reflects the theory that the effects of advertising can lag and decay following initial exposure. It's also related to certain brand equity metrics like ad-recall or campaign awareness. The logical narrative is "I saw the ads X days ago before I bought the product". Usually, it's assumed that this "memory" decays as time passes. But there're also cases when it's legitimate to assume that the effect of this "memory" will increase first before decreasing. For example, expensive products like cars or credits are unlikely to be purchased directly after the ads, esp. for offline channels. Therefore, it's usual to assume lagged effect with later peaks for offline channels for these products. At the same time, online conversions from digital channels might be suitable without lagged peaks and only deal with decay.
There are three adstock transformation options you can choose in Robyn:
The one-parametric exponential decay function is used with theta as the fixed-rate decay parameter. For example, an adstock of theta = 0.75 means that 75% of the ads in Period 1 carryover to Period 2. Robyn's implementation of Geometric transformation can be found here and is shown conceptually as followed:
Some rule of thumb estimates we have found from historically building weekly-level models are that TV has tended to have adstock between 0.3 - 0.8, OOH/Print/Radio has had 0.1-0.4, and Digital has had 0.0 - 0.3. This is anecdotal advice so please use your best judgement when building your own models.
Another useful property of Geometric decay is that the limit of the inifinite sum is equal to
1 / (1 - theta). For example, when theta = 0.75, its infinite sum is 4. Because Robyn conducts adstock transformation on spends, it can give you a quick and inituitive idea of how much "inflation" the adstock transformation will add on your raw data.
Robyn offers the two-parametric Weibull function in the formats of PDF and CDF. Compared to the one-parametric Geometric function where theta is equal to the fixed decay rate, Weibull produces time-varying decay rates through more flexibility in the transformation with the parameters shape and scale. Robyn's implementation of Weibull transformation can be found here and is shown conceptually as followed. Note that the theta from Weibull is time-dependent.
The plot below shows the flexibility of Weibull adstocks with regard to the two parameters shape and scale. This flexibility does come at a cost of more computational power due to the extra hyperparameter. Especially Weibull PDF is strongly recommmended when the product is considered to have longer conversion window. We've seen Weibulll PDF leading to considerabaly better fit in some cases.
Weibull CDF adstock: The Cumulative Distribution Function of Weibull has two parmeters , shape & scale, and has flexible decay rate, compared to Geometric adstock with fixed decay rate. The shape parameter controls the shape of the decay curve. Recommended bound is c(0.0001, 2). The larger the shape, the more S-shape. The smaller, the more L-shape. Scale controls the inflexion point of the decay curve. We recommend very conservative bounce of c(0, 0.1), because scale increases the adstock half-life greatly.
Weibull PDF adstock: The Probability Density Function of the Weibull also has two parameters, shape & scale, and also has flexible decay rate as Weibull CDF. The difference is that Weibull PDF offers lagged effect. When shape > 2, the curve peaks after x = 0 and has NULL slope at x = 0, enabling lagged effect and sharper increase and decrease of adstock, while the scale parameter indicates the limit of the relative position of the peak at x axis; when 1 < shape < 2, the curve peaks after x = 0 and has infinite positive slope at x = 0, enabling lagged effect and slower increase and decrease of adstock, while scale has the same effect as above; when shape = 1, the curve peaks at x = 0 and reduces to exponential decay, while scale controls the inflexion point; when 0 < shape < 1, the curve peaks at x = 0 and has increasing decay, while scale controls the inflexion point. When all possible shapes are relevant, we recommend c(0.0001, 10) as bounds for shape; when only strong lagged effect is of interest, we recommend c(2.0001, 10) as bound for shape. In all cases, we recommend conservative bound of c(0, 0.1) for scale. Due to the great flexibility of Weibull PDF, meaning more freedom in hyperparameter spaces for Nevergrad to explore, it also requires larger iterations to converge.
Implementation of adstocking is done in a few steps
adstockequal to the distribution you plan to use (geometric, weibull_cdf, weibull_pdf)
hyper_names(adstock = InputCollect$adstock, all_media = InputCollect$all_media)to identify the hyperparameters that will need to be set up correctly in order to begin the modeling process. All paid and organic media variables will have hyper parameters that need to be set up.
- Set ranges for each hyperparameter
The theory of diminishing returns holds that each additional unit of advertising increases the response, but at a declining rate. This key marketing principle is reflected in marketing mix models as a variable transformation.
The nonlinear response to a media variable on the dependent variable can be modelled using a variety of functions. For example, we can use a simple logarithm transformation (taking the log of the units of advertising log(x) ), or a power transformation (x^alpha). In the case of a power transformation, the modeler tests the different variables (different levels of parameter alpha) for the highest significance of the variable in the model and the highest significance of the equation overall. However, the most common approach is to use the flexible S-curve (Hill) transformation:
Note that the gamma in the equation is scaled to the inflexion point of variable in question. The variations of the parameters give modelers full flexibility on the look of the S-curve, specifically the shape and the inflection points:
To understand these charts, on the x axis we have spend, and on the y-axis we have response. So as the spend rises, the response changes and depending on the curve we understand what the marginal response is. These curves are extremely useful in the end stages of MMM where we seek to understand how we can more optimally allocate budgets between all of our media channels.
Hill function for saturation: The Hill function is a two-parametric function in Robyn with alpha and gamma. Alpha controls the shape of the curve between exponential and s-shape. Recommended bound is c(0.5, 3). The larger the alpha, the more S-shape. The smaller, the more C-shape. Gamma controls the inflection point. Recommended range is c(0.3, 1). The larger the gamma, the later the inflection point in the response curve.
Implementation in Robyn is all done in hyper parameter setting
- Ensure you have a alpha and gamma hyper parameter for each paid media variable with a range specified
Prophet has been included in the code in order to improve the fit and forecast of time-series by decomposing the effect of trend, seasonality and holiday components in the response variable (Sales, conversions, etc.).
Prophet is a Meta original procedure for forecasting time series data, based on a model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. More details can be found here.
Trend, season, holiday and an extra regressor ("events" in this case) decomposition by Prophet. Weekday is not used because the sample data is weekly (not daily). Robyn uses Prophet to also decompose categorical variables as an extra regressor which can simplify later programming. For technical details of decomposition, please refer to Prophet's documentation here.
In order to address multicollinearity among many regressors and prevent overfitting we apply a regularization technique to reduce variance at the cost of introducing some bias. This approach tends to improve the predictive performance of MMMs. The most common regularization, and the one we are using in this code is Ridge regression. The mathematical notation for Ridge regression is:
If we go a bit deeper into the actual components we will be using within the model specification, besides the lambda penalization term above, we can identify the following formula:
Ridge regression has an additional benefit of being relatively easy to interpret compared to other more complex techniques. As you can see in the formula, the hyperparameters that we set for each variable are used in this equation. As you'll see in the following section, we use automated hyper parameter optimization in order to ensure we get the best fitting ridge regression model.
MMMs are likely to contain high cardinality of parameters. ie. alphas and gammas for the diminishing returns (Hill) function, as well as, thetas for geometric ad stock transformation. In addition, parameters dimensionality increases proportionally with the total number of marketing channels to be measured. Thus, it is extremely necessary to deal with a high dimensionality parameter space where, the greater the number of parameters, the greater the model complexity, its dimensionality and computational requirements.
In order to achieve computational efficiency while optimizing overall model accuracy, we leverage Meta’s Nevergrad gradient-free optimization platform. Nevergrad allows us to optimize the explore and exploit balance through the ask and tell commands, in order to perform a multi-objective optimization tha balances out the Normalized Root Mean Square Error (NRMSE) and decomp.RSSD ratio (Relationship between spend share and channels coefficient decomposition share) providing a set of Pareto optimal model solutions
Please find below an example of a common chart for the Pareto model solutions. Each dot in the chart represents an explored model solution, while the lower-left corner lines are Pareto-fronts 1-3 and contains the best possible model results from all iterations. The two axes (NRMSE on x and DECOMP.RSSD on y) are the two objective functions to be minimized. As the iteration increases, a trend down the lower-left corner of the coordinate can be clearly observed. This is a proof of Nevergrad's ability to drive the model result towards an optimal direction.
The premise of an evolutionary algorithm is that of natural selection. In an evolutionary algorithm you may have a set of iterations where some combinations of coefficients that will be explored by the model will survive and proliferate, while unfit models will die off and not contribute to the gene pool of further generations, much like in natural selection. In robyn, we recommend a minimum of 2000 iterations where each of these will provide feedback to its upcoming generation, and therefore guide the model towards the optimal coefficient values for alphas, gammas and thetas. We also recommend a minimum of 5 trials which are a set of independent initiations of the model that will each of them have the number of iterations you set under ‘set_iter’ object. E.g. 2000 iterations on set_iter x 5 trials = 10000 different iterations and possible model solutions.
In Robyn, we consider the model to be converged UNDER REVISION when:
Criteria #1: Last quantile's standard deviation < first 3 quantiles' mean standard deviation
Criteria #2: Last quantile's absolute median < absolute first quantile's absolute median - 2 * first 3 quantiles' mean standard deviation
The quantiles are ordered by the iterations of the model, so if we ran 1000 iterations, the first 200 iterations would make up the first quantile. These two criteria represent an effort to demonstrate that both the standard deviation and the mean for both NRMSE and DECOMP.RSSD have improved relative to where they began, and they are not as variable.
After the modeling is complete, you can run:
To investigate the convergence of your multi-objective optimization. See below for examples of these graphs.
By applying results from randomized controlled-experiments, you may improve the accuracy of your marketing mix models dramatically. It is recommended to run these on a recurrent basis to keep the model calibrated permanently. In general, we want to compare the experiment result with the MMM estimation of a marketing channel. Conceptually, this method is like a Bayesian method, in which we use experiment results as a prior to shrink the coefficients of media variables. A good example of these types of experiments is Facebook’s conversion lift tool which can help guide the model towards a specific range of incremental values.
The figure illustrates the calibration process for one MMM candidate model. Facebook’s Nevergrad gradient-free optimization platform allows us to include the MAPE(cal,fb) as a third optimization score besides Normalized Root Mean Square Error (NRMSE) and decomp.RSSD ratio (Please refer to the automated hyperparameter selection and optimization for further details) providing a set of Pareto optimal model solutions that minimize and converge to a set of Pareto optimal model candidates. This calibration method can be applied to other media channels which run experiments, the more channels that are calibrated, the more accurate the MMM model.
Implementation of calibration is as follows
- Create a data file corresponding to your experimental results - here is an example
calibration_input <- data.frame( # channel name must in paid_media_vars channel = c("facebook_S", "tv_S", "facebook_S"), # liftStartDate must be within input data range liftStartDate = as.Date(c("2018-05-01", "2018-04-03", "2018-07-01")), # liftEndDate must be within input data range liftEndDate = as.Date(c("2018-06-10", "2018-06-03", "2018-07-20")), # Provided value must be tested on same campaign level in model and same metric as dep_var_type liftAbs = c(400000, 300000, 200000), # Spend within experiment: should match within a 10% error your spend on date range for each channel from dt_input spend = c(421000, 7100, 240000), # Confidence: if frequentist experiment, you may use 1 - pvalue confidence = c(0.85, 0.8, 0.99), # KPI measured: must match your dep_var metric = c("revenue", "revenue", "revenue"))
calibration_input <- dt_calibration(or whatever you name your calibration data frame) within InputCollect
Increase your iteration count by at least 1000 to account for needing to optimize 3 objective functions (NRMSE, Decomp.RSSD, MAPE.lift) rather than 2 (NRMSE, Decomp.RSSD)
The MMM code will automatically generate a set of plots under the folder you specify on the ‘model_output_collect’ object. Each of these plots represents one of the optimal model solutions as a result of the multi-objective optimization Pareto optimal process mentioned in the ‘Automated hyperparameter selection and optimization’ section. Please find below an example of the model output:
As you may observe we have 6 different charts above:
Response decomposition waterfall by predictor: This chart reflects the percentage of each of the variables effect (Baseline and Media variables + intercept) on the response variable. E.g. If newsletter effect says it's 9.6% that means that 9.6% of the total sales can be attributed to newsletter activations.
Tip: Intercept, and Prophet vars like trend can constitute a large portion of your decomp. This makes sense if your brand is established, as it is basically saying that you would still have a large baseline of sales without paid media spend.
Share of spend vs. share of effect: This plot reflects the comparison of the total effect each channel had by means of the decomposition of the coefficients into the response variable divided by the total effect. As well as, the total spend (cost or investment) each channel had and its relative share over total marketing spend. We also plot the return on investment (ROI) each channel had which can give you an idea over the most profitable channels.
Tip: Decomp.RSSD is one of the objective functions we minimize which corresponds to the distance between share of spend and share of effect. So in essence we do not want to see extreme distances between share of spend and share of effect because it does not make realistic business sense to optimize to. Consider this when comparing model solutions.
Average adstock decay rate: This chart represents, on average, what is the percentage decay rate each channel had. The higher the decay rate, the longer the effect in time for that specific channel media exposure. This is a simple calculation for Geometric adstocks, with average adstock decay rate = theta. For Weibull adstocks this calculation is a bit more complicated.
Tip: It can be useful to look at the adstock rates in comparison to each other. For example, how do your upper funnel channel adstocks compare to your lower funnel channel adstocks. There are some logical conclusions to make about upper funnel having longer adstocks than lower funnel, so keep an eye out for cases where it doesn't seem to be the case. This isn't a hard and fast rule, but can be a way to differentiate between solutions.
Actual vs. predicted response: This plot shows the actual data for the response variable E.g sales, and how the modeled predicted data for that response variable is capturing the real curve. We aim for models that can capture most of the variance from the actual data and therefore the R-squared is closer to 1 while NRMSE is low.
Tip: Look for specific periods where the model is predicting worse/better. For example, if one notices the model is predicting noticeably worse around certain periods related to promotions periods, it may be a good way to identify a contextual variable that should be included in the model
Response curves and mean spend by channel: These are the diminishing returns response curves from hill function. They represent how saturated a channel is and therefore, may suggest potential budget reallocation strategies. The faster the curves reach to an inflection point and to a horizontal/flat slope, the quicker they will saturate with each extra ($) spent.
Tip: You can use the
robyn_responsefunction in a loop to recreate these response curves. For example, creating a loop around an execution like this:
Spend1 <- 60000Response1 <- robyn_response(robyn_object = robyn_object#, select_build = 1 # 2 means the second refresh model. 0 means the initial model, media_metric = "search_S", metric_value = Spend1)Response1$response/Spend1 # ROI for search 80k
Fitted vs. residual: This chart shows the relationship between fitted and residual values. A residual value is a measure of how much a regression line vertically misses a data point. A residual plot is typically used to find problems with a regression. Some data sets are not good candidates for regression, such as points at widely varying distances from the line. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a nonlinear model is more appropriate.
Rather than exploring all Pareto-optimal solutions (of which there could be dozens), we use clustering to find similar solutions and encourage users to explore those solutions as the most distinct, best solutions. The process for the clustering is as follows:
- Using the pool of models on Pareto fronts obtained, we run k-means clustering using (normalized) ROI data on each (paid) media variable
k = "auto"(which is the default), we calculate the WSS (Within Group Sum of Squares) on k-means clustering using k = 1 to k = 20 to find the "best automatic k". We pick the one k whose WSS reduction doesn't change more than 5% from the previous k. You could also override this by setting k manually (on
robyn_run(..., cluster = TRUE),
robyn_outputs(..., clusters = TRUE), or
- After we have run k-means on all Pareto front models, using the defined k, we pick those "best models" that have the lowest normalized combined errors (NRMSE, DECOM.RSSD, and MAPE if calibrated was used).
When you run robyn_clusters() you'll get a list of results: the data used to calculate the clusters with each model's cluster-ID, and some visualizations on WSS-k selection, ROI per media on winner models, and correlations of ROI by cluster to understand each cluster's content models better. See below for example output on clustering selection.
Once you have analyzed the optimal model results plots and have chosen your model, you may introduce the model unique ID from the results in the previous section. E.g setting select_model = "1_92_12" could be an example of a selected model from the list of best models in 'OutputCollect$allSolutions' results object. Once you run the budget allocator, results will be plotted and saved under the same folder where the model plots had been saved. The result would look like the following:
You may encounter three charts as in the example above:
- Initial vs. optimized budget allocation: This channel shows the original spend share vs. the new optimized recommended one. If optimized share is greater than original, this would mean you will need to proportionally increase budgets for that channel according to the difference between both shares. And you would reduce budgets in the case spend share would be greater than optimized share.
- Initial vs. optimized mean response: Similar to the chart above, we have initial and optimized shares, but this time over the total expected response E.g. Sales. The optimized response is the total increase in sales that we are expecting you to have if you switch budgets following the chart we explained above, increasing those with better share for optimized spend and decreasing spend for those with lower optimized spend than the initial.
- Response curve and mean spend by channel: These again are the diminishing returns response curves from hill function. They represent how saturated a channel is and therefore, may suggest potential budget reallocation strategies. The faster the curves reach to an inflection point and to a horizontal/flat slope, the quicker they will saturate with each extra ($) spent. The initial mean spend is represented by a circle and the optimized one by the triangle.
Another strong capability requirement linked to the model window we identified, is the ability to refresh the model when new data would arrive. In other words, to be able to continuously report in a monthly, weekly or even daily frequency, based on a previously selected model but applied onto new fresh data. Consequently, enabling MMM to be a continuous reporting tool for actionable and timely decision-making that could feed your reporting or BI tools within the defined cadence.
robyn_refresh() function is able to continuously build and add new model periods, at any given cadence (weeks, months, etc.), based on previously selected models saved in the Robyn.RData object specified in the robyn_object.
For example, when updating the initial build with 4 weeks of new data, robyn_refresh() consumes the selected model of the initial build. What Robyn does is to set lower and upper bounds of hyperparameters for the new build which are consistent with the selected hyperparameters of the previous build. Therefore, stabilizing the effect of contextual and organic variables across old and new builds as well as, regulating the new share of effect of media variables towards the new added period spend level. Finally, returning aggregated results containing all previous builds for reporting purposes and their corresponding plots.
The example below shows the model refreshing mechanism for 5 different periods of time based in an initial window covering most of 2017 and 2018:
You will also obtain a set of results for each refresh period that describes the assigned ROI and effects from each of the variables within the model. The baseline variable is the sum of all prophet variables (trend, seasonality, weekday and holiday) plus the intercept. The charts are based on simulated and do not have real-life implication:
robyn_refresh() function builds updated models based on the previously built models saved in the Robyn.RData object specified in robyn_object.
For example, when updating the initial build with 4 weeks of new data,
robyn_refresh() consumes the selected model of the initial build.
What Robyn does, is to set lower and upper bounds of hyperparameters for the new build around the selected hyperparameters of the previous build, stabilizes the effect of baseline variables across old and new builds and regulates the new effect share of media variables towards the latest spend level. It returns aggregated results with all previous builds for reporting purposes and produces reporting plots.