Towards replication-robust analytics markets
2026 | Article | In press/Available online | Journal paper
T. Falconer, P. Pinson, J. Kazempour
INFORMS Journal on Data Science, in press/available online
Publication year: 2026
Despite recent advancements in machine learning, in practice, relevant datasets are often distributed among market competitors who are reluctant to share. To incentivize data sharing, recent works propose analytics markets, where multiple agents share features and are rewarded for improving the predictions of others. These rewards can be computed by treating features as players in a coalitional game, with solution concepts that yield desirable market properties. However, this setup incites agents to strategically replicate their data and act under multiple false identities to increase their own revenue and diminish that of others, limiting the viability of such markets in practice. In this work, we develop an analytics market robust to such strategic replication for supervised learning problems. We adopt Pearl’s do-calculus from causal inference to refine the coalitional game by differentiating between observational and interventional conditional probabilities. As a result, we derive rewards that are replication-robust by design.
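To make the replication incentive described in this abstract concrete, the following is a minimal sketch and not the paper's method. It computes exact Shapley values for a toy coalitional game in which features are the players. The characteristic function `redundant_value` and the player names `a`, `a_copy`, and `b` are assumptions introduced here for illustration: every feature is taken to carry the same information, so any non-empty coalition attains the full value of 1.

```python
from fractions import Fraction
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: Fraction(0) for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += Fraction(value(with_p)) - Fraction(value(coalition))
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def redundant_value(coalition):
    # Toy assumption: all features are perfectly redundant, so any
    # non-empty coalition achieves the full predictive value of 1.
    return 1 if coalition else 0


# Honest market: agents "a" and "b" each submit one feature.
honest = shapley_values(["a", "b"], redundant_value)

# Agent "a" also submits a copy of its feature under a false identity.
replicated = shapley_values(["a", "a_copy", "b"], redundant_value)

revenue_a_honest = honest["a"]
revenue_a_replicated = replicated["a"] + replicated["a_copy"]
```

In this toy game, the replicating agent's combined share rises from 1/2 to 2/3 while the honest agent's share falls from 1/2 to 1/3. This is the kind of behaviour that the interventional refinement described in the abstract is designed to remove.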
How to purchase labels? A cost-effective approach using active learning markets
2026 | Article | In press/Available online | Journal paper
X. Huang, P. Pinson
INFORMS Journal on Data Science, in press/available online
Publication year: 2026
We introduce and analyse active learning markets as a way to purchase labels, in situations where analysts aim to acquire additional data to improve model fitting, or to better train models for predictive analytics applications. This comes in contrast to the many proposals that already exist to purchase features and examples. By originally formalizing the market clearing as an optimization problem, we integrate budget constraints and improvement thresholds into the label acquisition process. We focus on a single-buyer-multiple-seller setup and propose the use of two active learning strategies (variance based and query-by-committee based), paired with distinct pricing mechanisms. They are compared to benchmark baselines including random sampling and a greedy knapsack heuristic. The proposed strategies are validated on real-world datasets from two critical application domains: real estate pricing and energy forecasting. Results demonstrate the robustness of our approach, consistently achieving superior performance with fewer labels acquired compared to conventional methods. Our proposal comprises an easy-to-implement practical solution for optimizing data acquisition in resource-constrained environments.
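As a rough illustration of variance-based label acquisition under a budget constraint, the sketch below ranks candidate labels by predictive variance and buys them in that order while the budget allows. It does not reproduce the paper's market clearing or pricing mechanisms. The function `select_labels`, the candidate tuples, and all variances and prices are hypothetical values chosen for illustration.

```python
def select_labels(candidates, budget):
    """Buy labels in order of decreasing predictive variance while affordable.

    candidates: list of (label_id, predictive_variance, price) tuples.
    All values are hypothetical; a real market would obtain variances from
    the buyer's model and prices from the sellers.
    """
    chosen, spent = [], 0.0
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for label_id, _variance, price in ranked:
        if spent + price <= budget:  # skip any label that would exceed the budget
            chosen.append(label_id)
            spent += price
    return chosen, spent


candidates = [
    ("x1", 0.9, 4.0),
    ("x2", 0.5, 1.0),
    ("x3", 0.7, 3.0),
    ("x4", 0.1, 1.0),
]
chosen, spent = select_labels(candidates, budget=5.0)
```

With these numbers the buyer acquires `x1` and `x2` and spends the full budget of 5.0; `x3` is skipped because it would push spending past the budget. A query-by-committee variant would replace the variance score with a measure of disagreement between models.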