Packaging testing in virtual environments: validating shelf impact before production
A new SKU that fails to gain traction in its first 12 weeks on shelf is typically delisted. That window is the standard FMCG measurement period for initial velocity and repeat purchase, and failing to hit targets within it triggers a chain of costs most pack-design budgets are not built to absorb.
Slotting fees alone can run into seven figures for a major launch across a national retailer footprint. Add reset costs, retailer relationship damage, and the re-engineering needed when artwork has to be redone, and the original design investment becomes the smallest line item in the post-mortem.
Shelf impact testing exists to catch that failure earlier. Brand teams have always run some version of it: physical mock shelves, focus groups, post-launch in-store testing. Each addresses part of the problem, and each has a structural limit. Packaging testing in virtual environments is the fourth approach, and it captures behavioural data the other three cannot. Storelab operates in this fourth category, with more than 35 years of retail solutions experience underpinning the platform.
What shelf impact testing measures
Shelf impact testing assesses whether a pack can be seen, recognised, and chosen on a shelf populated with competitors. The four sub-components matter individually and can fail independently of each other.
- Visibility. Whether the pack registers at all in a shopper’s field of view as they approach the fixture.
- Brand block recognition. Whether the pack is identified as the brand at distance, before specific SKU details are legible.
- Shelf differentiation. Whether the pack is distinguished from adjacent SKUs in the same brand range, so shoppers reach for the right variant.
- Pick-up triggering. Whether attention converts into selection from the fixture.
The diagnostic value of the framework comes from separating these signals, so a pack-design problem can be pinned to a specific component rather than treated as a single undifferentiated failure.
Real packs typically fail on one or two of the four, not all of them. A pack with strong brand block recognition but weak differentiation creates within-range cannibalisation: shoppers see the brand and pick up the wrong SKU. A pack with strong differentiation but weak visibility never reaches the recognition stage at all. The framework gives the brand team specific targets for redesign rather than a general ‘this pack underperformed’ verdict.
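As a purely illustrative sketch of how separating the signals changes the verdict, the snippet below scores a hypothetical variant on each of the four components and reports only the ones that fall short. The score names, 0–100 scale, and threshold are invented for the example; they are not Storelab metrics.

```python
# Illustrative sketch only: the four shelf-impact signals held as separate scores,
# with a simple diagnostic that names the weak links instead of a single verdict.
# The 0-100 scale and the pass threshold are hypothetical.
SIGNALS = ("visibility", "brand_block_recognition",
           "shelf_differentiation", "pick_up_triggering")

def diagnose(scores: dict[str, float], threshold: float = 60.0) -> list[str]:
    """Return the signals a pack variant fails on."""
    return [s for s in SIGNALS if scores.get(s, 0.0) < threshold]

# Strong brand block but weak differentiation: within-range cannibalisation risk.
variant = {"visibility": 78, "brand_block_recognition": 85,
           "shelf_differentiation": 41, "pick_up_triggering": 66}
print(diagnose(variant))  # ['shelf_differentiation']
```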
Methods compared: physical mock shelf, focus groups, in-store testing, and VR simulation
Each method has a defensible role in a packaging programme. The differences sit in cost, speed, the type of data captured, and the lifecycle stage each method suits best.
| Method | Cost per round | Time to insight | Behavioural data | Parallel variants | Best lifecycle stage |
|---|---|---|---|---|---|
| Physical mock shelf | High | Weeks | Limited; observed only | One at a time | Pre-production lock |
| Focus groups | Low to medium | Days to weeks | Stated preferences | Limited | Early concept |
| In-store testing | Highest (post-launch) | Months | Real purchase behaviour | Difficult | Post-launch only |
| VR simulation | Medium | Days | Eye-tracking, dwell, pick rate | Many in parallel | All stages |
Physical mock shelves remain the strongest method for tactile assessment and structural pack engineering. A shopper picking up a pack feels the weight, the texture of the substrate, and the closure mechanism in a way no virtual test reproduces. The constraint is cost and reset speed: every variant means a new physical build and the disposal of the previous one.
Focus groups produce qualitative reactions that help diagnose why a pack feels confusing or unappealing. The known limit is that stated preferences in a discussion environment often diverge from observed behaviour at shelf. A pack that focus group participants describe as appealing can fail in market because the explanation reflects post-rationalisation rather than the split-second decision the shelf demands.
In-store testing measures real purchase behaviour, which is the strongest possible signal. The cost is timing. By the time the data arrives, artwork is printed, the production run is complete, and the shelf-space negotiation has already happened.
Combined, the four methods cover the full design cycle; relying on a subset leaves gaps between stages. A programme using only physical mock shelves and in-store testing leaves the early concept and pre-production check stages thinly covered. A programme using only focus groups and in-store testing has no behavioural data before launch.
What VR captures that other methods cannot
Virtual store environments capture behavioural data at the pre-production stage that the other methods either cannot reach or cannot scale. The standard metrics in shopper research include time to first fixation (how quickly a pack is seen), fixation duration (how long attention dwells on a specific element), fixation count, gaze sequence (the order in which a shopper scans a fixture), dwell time, and areas-of-interest (AOI) analysis on specific pack zones.
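To show how two of these metrics fall out of raw gaze data, the sketch below computes time to first fixation and dwell time from a simplified sample log. The data format, AOI labels, and fixed sampling interval are assumptions made for the example, not a description of any particular eye-tracking platform's output.

```python
# Illustrative sketch only: time to first fixation and dwell time per AOI,
# computed from a simplified gaze-sample log. Schema and AOI names are hypothetical.
from dataclasses import dataclass

@dataclass
class GazeSample:
    t_ms: int         # milliseconds since the shopper reached the fixture
    aoi: str | None   # AOI the gaze falls on ("brand_logo", ...), or None

def first_fixation_ms(samples: list[GazeSample], aoi: str) -> int | None:
    """Time from fixture approach to the first sample landing on the AOI."""
    for s in samples:
        if s.aoi == aoi:
            return s.t_ms
    return None  # the AOI was never looked at

def dwell_ms(samples: list[GazeSample], aoi: str, interval_ms: int = 16) -> int:
    """Total dwell time on the AOI, assuming a fixed sampling interval."""
    return sum(interval_ms for s in samples if s.aoi == aoi)

# Example: a shopper scans a competitor first, then the test pack's logo.
log = [GazeSample(0, None), GazeSample(16, "competitor_pack"),
       GazeSample(32, "brand_logo"), GazeSample(48, "brand_logo")]
print(first_fixation_ms(log, "brand_logo"))  # 32
print(dwell_ms(log, "brand_logo"))           # 32
```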
Storelab’s Research product uses proprietary 3D eye-tracking software to capture gaze data inside virtual store environments. The output is a frame-by-frame record of where shopper attention went while navigating the fixture and selecting a pack, captured directly from the gaze data rather than estimated by an external observer or self-reported in a post-shop interview. Run across a sample of shoppers and across pack variants, the data shows which design elements drive attention and which are ignored.
VR does not replicate every part of the shopper experience. Tactile assessment, weight, scent, and the social dynamics of an actual shopping trip sit outside what virtual testing can capture. VR’s strength is the visual decision-making layer, which is where pack design lives and where most shelf failures occur.
Where VR fits in the packaging design cycle
Three stages produce the strongest return on virtual packaging testing.
Early concept validation
Before artwork is committed, brand teams can test design directions against the competitive set. A round of virtual testing at this stage filters concepts before the cost of finished artwork is incurred. The output is directional rather than absolute, but a concept that fails to register in virtual testing is unlikely to register in market.
Pre-production lock
At the final round before production, virtual testing serves as a quantitative check on the chosen design. The pack is close to launch state, and the data tests whether the final artwork delivers the visibility, recognition, and pick-up triggering the brief specified. A failure caught at this stage is recoverable through artwork revision. A failure caught post-launch is recoverable only through a full redesign cycle.
Post-launch diagnostic
When a launched SKU underperforms, the question is why. Virtual testing reconstructs the shelf condition the SKU faces in market and isolates the failure mode. A pack that is invisible at distance has a different problem from a pack that is seen but not picked up. The diagnostic informs whether the right response is artwork, formulation, pricing, or placement.
What VR cannot replace
Several aspects of pack design and shopper behaviour sit outside what virtual testing can validate. Acknowledging the limits matters for getting the most out of the method.
- Tactile assessment. Substrate texture, weight, structural integrity, and closure mechanisms need physical samples.
- Structural pack engineering. Drop tests, line speeds, and palletisation are physical-world questions virtual environments cannot answer.
- Price-value perception. Shopper response to pricing is sensitive to context that virtual environments approximate but do not fully reproduce.
- On-shelf interaction with adjacent products. How a pack performs sitting next to a specific competitor SKU at a specific retailer at a specific time still benefits from in-store observation.
- Scent, taste, and product trial. Any shopper response that depends on the product itself sits beyond visual testing.
The strongest packaging programmes integrate virtual testing with the other methods. Each method captures a different signal, and the picture is clearer when several are in the dataset.
Commissioning a virtual packaging test
A virtual packaging test starts with assets. Virtual merchandising software needs either 3D pack files or 2D artwork that can be wrapped onto pack geometry. The competitive set also has to be defined: which adjacent SKUs the pack will be tested against, which retailer fixture context, and which planogram position.

Sample size depends on the question. Directional concept screening can produce useful signals on samples of 30 to 60 shoppers per variant. Pre-production validation, where the question is whether one design outperforms another at statistical confidence, typically needs samples in the low hundreds per cell. Category-defining launches often run larger studies because the cost of getting the answer wrong dwarfs the cost of additional sample.
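To make the sample-size point concrete, the sketch below applies the standard two-proportion normal approximation to a hypothetical pick-rate comparison. The baseline and target pick rates, alpha, and power are illustrative assumptions, not recommendations for any particular study.

```python
# Illustrative sketch only: rough per-cell sample size for comparing pick rates
# between two pack variants, using the two-proportion normal approximation.
# The pick rates below are hypothetical.
from math import ceil
from statistics import NormalDist

def n_per_cell(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Shoppers needed per variant to detect p1 vs p2 at the given alpha and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

print(n_per_cell(0.20, 0.32))  # 206  — a large lift lands in the low hundreds per cell
print(n_per_cell(0.20, 0.25))  # 1091 — a smaller lift pushes the requirement far higher
```

The second call is the arithmetic behind the larger studies: the smaller the lift that has to be detected, the faster the per-cell requirement grows.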
Storelab’s Research product handles this category of work end-to-end, from fixture build through to behavioural data analysis. Turnaround for a pack-design study typically runs in days rather than weeks, which sits inside the timeline most stage-gate processes can absorb.
Where packaging testing is heading
Stage-gate approvals in larger FMCG organisations are starting to incorporate virtual test data as a formal input rather than a supplementary one. The number of physical mock shelf rounds in a design programme is dropping as virtual fidelity improves. The direction of travel is integrated testing across the full design cycle, with virtual environments doing the work that physical builds used to do at every stage except the final tactile check.
For a brand team weighing a pilot, the practical step is a benchmark study on an upcoming launch. A single virtual test on a pack already in development, run alongside the existing physical and focus group rounds, calibrates how the data fits the team’s stage-gate process before the method is committed across the full pipeline. Storelab can scope a benchmark study against a current pack programme; the demo request page on the site is the starting point.


