The Hidden Transformation of Public Data: Algorithmic Conversion by SSDEs into Proprietary Market Power
From Reviews to Market Power – Public Data and Platform Dominance
Imagine a shopping mall owner who rents spaces to various outlets and has exclusive access to customers’ movements, preferences, and purchases. Using this exclusive data, the owner then launches a competing in-house brand. The outlets and customers see only the usual footfall; the mall owner alone can see the patterns across all the outlets. Systemically Significant Digital Enterprises [“SSDEs”][1] occupy a comparable position. Notably, they have a “dual role”: they operate as an online marketplace facilitating purchases and transactions between consumers and third-party sellers, while also acting as an online retailer that sells products directly to consumers.
Every product review, star rating, or user comment online tells a story. These publicly available reviews serve a three-fold function: enhancing shopping experiences for consumers, providing behavioural insights to the business users, and enabling the core digital service providers to improve their platforms. Although product reviews are publicly available data, the issue emerges when this public data is transformed into non-public, proprietary intelligence using advanced generative tools and algorithms. Amazon, for instance, has acknowledged using “public and aggregated data” of its stores to identify consumer demand and preferences on various parameters. It further stated that its private label products, on average, enjoy higher customer review ratings and higher repeat purchase rates than other brands on its platform, both of which are factors considered by its algorithm. This shows how publicly available consumer data, once processed, can be transformed into valuable proprietary intelligence, effectively creating what may be called derived data. While this new information expands the data ecosystem, it simultaneously obscures its origin, leaving individuals unaware that this public data forms the foundation of proprietary rights.
This transformation of public data into non-public data poses serious concerns. India’s Digital Competition Bill, 2024 [“DCB”] and the EU’s Digital Markets Act [“DMA”], in their respective formulations, fail to address this subtle form of potential anti-competitive conduct, thereby creating a regulatory blind spot. Although the DCB has recently been withdrawn, the evolving regulatory landscape presents a proactive opportunity for policymakers to address this lacuna in any future legislative framework.
Against this backdrop, this article examines how existing legislative frameworks, including India’s now-withdrawn DCB, fail to adequately regulate the transformation of publicly available data into non-public information. While the DCB does not constitute binding law, it is analysed as a proposed framework to illustrate persisting regulatory gaps. The article argues that the existing regime governing data usage by SSDEs does not sufficiently capture this form of conduct. To this end, it proceeds in four parts. First, it examines the prohibition on data usage by SSDEs under Section 12 and its failure to address data creation and transformation; second, it considers information asymmetry and the resultant data-use conduct of digital platforms; third, it analyses the algorithmic transformation of public reviews into non-public proprietary intelligence; and finally, it makes the case for revisiting the regulatory boundaries to address derived data and the associated anti-competitive conduct.
Analysing Section 12(1): The Unregulated Transformation of Public Data by SSDEs
Section 12 of the DCB prescribes the manner in which SSDEs may use data. Sub-section (1) prohibits SSDEs from using non-public data of business users operating on their core digital service to compete with such users. Additionally, the use of the words “directly or indirectly” and “use or rely” emphasises the legislative intent to ensure fair competition while maintaining transparency.
Further, in its explanatory clause, non-public data is defined as:
“‘non-public data’ shall include any aggregated and non-aggregated data generated by business users that can be collected through the commercial activities of business users or their end users, on the identified Core Digital Service of the Systemically Significant Digital Enterprise.”
It is evident from a bare reading of the aforementioned provision that SSDEs are barred from using only the data generated by business users, and not any data that may be generated by the SSDE itself. Likewise, the provision does not address the possibility that SSDEs might transform publicly available data, such as product reviews and ratings, into non-public data through exclusive data processing mechanisms in order to distort competition in the market. The exclusion of this form of transformed data from the definition of non-public data can be detrimental. It allows SSDEs to claim that they possess only “public” data and not non-public data within the meaning of the aforementioned section.
This contention becomes particularly problematic when one considers that the competitive advantage arises not merely from access to data, but from the structural conditions under which it is processed. This often leads to a ‘winner-takes-most’ outcome, where a dominant player is able to curtail market contestability. Such conditions include unmatched scale, the capital-intensive infrastructure required for the algorithmic processing of these public datasets, and access to real-time consumer behavioural metrics that third-party sellers cannot replicate. These factors enable the SSDEs to benefit from the publicly available datasets in a manner that smaller competitors, despite having access to the same data, cannot meaningfully match.
This lacuna also remains unaddressed by the DCB’s European counterpart, the DMA. Article 6(2) of the DMA prohibits gatekeepers (functionally analogous to SSDEs) from using any data generated by business users or their customers which is not publicly available. While textually different, Article 6(2) of the DMA and Section 12 of the DCB seek to regulate the same category of competitively sensitive information.
Admittedly, an alternative interpretation of Section 12 is possible; however, this article proceeds on the view that the legislative choice to specifically define the prohibited category of ‘non-public data’ was intended to confine, rather than expand, its scope. A similar approach can also be seen in the drafting of the DMA, where the protected category is expressly identified. On this reading, the provision governs the use of data already classified as non-public data but remains silent on situations where SSDEs transform publicly available data into proprietary, competitively valuable information. This regulatory blind spot enables SSDEs/gatekeepers to leverage the very kind of data that the Bill and the Act wish to restrict. Consequently, the legislative frameworks govern the use of non-public data but not its creation, and the silence of the said provisions allows SSDEs to implicitly stifle competition.
Information Asymmetry: The Data-use Conduct of Digital Platforms
In digital marketplaces, competition is not driven by access to consumers but rather by access to information. This struggle for gaining access to information results in what may be termed as information asymmetry, which arises when one party possesses more or better information than the others. In economic parlance, this phenomenon is identified as a distortion which creates an imbalance of power.
Today, information asymmetry transcends the mere usage of data. It largely depends on who has the ability to analyse and interpret it. Two photographers may capture the same scene, yet the one with a higher-resolution lens can uncover hidden textures and details. Similarly, these SSDEs, equipped with advanced generative tools and algorithms, perceive patterns invisible to others. Through these tools and algorithms, they transform publicly visible data into exclusive competitive insight. Herein, such insight refers to the access to and use of non-publicly available data exclusively for one’s own operations in competition with others.
This assertion is supported by two significant observations. First, the European Commission, in its Amazon Marketplace and Amazon Buy Box decisions, found that Amazon relied on automated software systems and artificial intelligence applications on vast datasets. The Commission preliminarily observed that, unlike Amazon Retail, which has full access to and uses real-time, individual-level data of all third-party sellers to calibrate its retail decisions, sellers themselves have access only to their own listings and sales data (para. 111). Admittedly, these findings primarily concern the use of non-public data of third-party sellers, such as transaction-level (para. 109) and performance-related (para. 112) information. However, they remain instructive in demonstrating how access to large-scale datasets, combined with advanced analytical capabilities, enables platforms to monitor real-time market conditions and convert raw information into aggregated, actionable insights. This analytical advantage, although identified in the context of non-public data, risks being extended to the processing of publicly available data, where similar techniques may be deployed to derive proprietary intelligence, i.e., information that is privately owned and not publicly available.
Second, the asymmetry manifests in the panoramic visibility that SSDEs enjoy by virtue of their structural capabilities.
Thus, even if public data is available to other market players, SSDEs, by virtue of their algorithmic capabilities, combine such public data with their platform-specific behavioural and transactional datasets, creating a form of proprietary intelligence that competing players cannot realistically generate. It is this structural inability to replicate the combined informational environment of SSDEs that generates the anti-competitive concern.
Algorithmic Transformation of Public Reviews into Non-Public Proprietary Intelligence
In the context of customer reviews and feedback that are inherently public data, the conduct of the SSDEs raises a deeper concern. These publicly available reviews, when used to extract insights unavailable to others, pose a threat to digital competitive practices by blurring the line of distinction between publicly available data and proprietary intelligence.
According to the U.S. Federal Trade Commission [“FTC”], the use of AI tools by large digital platforms is not limited to analysing non-public data of third-party sellers. It now encompasses publicly available user-generated data such as customer reviews and testimonials, which has been characterised as a form of “illicit review and endorsement practice.” While the FTC has used this specifically to address the issue of fake reviews, this indirectly shows that AI programs are being trained on customer reviews. The AI-driven analyses enable the platforms to derive invisible layers of information, including but not limited to sentiment clustering, feature-level preferences, and engagement activity. This derivation relies on sophisticated machine-learning methods. For instance, sentiment clustering employs Natural Language Processing [“NLP”] to group text data, primarily customer reviews and feedback, based on shared sentiments to understand the underlying patterns. This allows the SSDEs to perceive collective consumer experiences, behavioural insights, and predict purchasing trends in a manner and form that is anti-competitive.
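To make the idea of sentiment clustering concrete, the following is a minimal sketch in Python. It is purely illustrative and does not describe the method of any platform: the word lists, the feature keywords, and the sample reviews are all hypothetical, and a simple rule-based grouping stands in for the machine-learning methods that real systems would likely use. It shows only how individually public reviews can be reduced to feature-level, aggregated signals.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicons (assumptions for illustration only);
# a real system would rely on trained NLP models rather than word lists.
POSITIVE = {"great", "good", "love", "excellent", "sturdy"}
NEGATIVE = {"bad", "poor", "broke", "flimsy", "slow"}

# Hypothetical mapping from review keywords to product features.
FEATURES = {"battery": "battery", "charge": "battery",
            "cable": "cable", "price": "price"}


def tokenize(text):
    """Lower-case the review and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def sentiment(tokens):
    """Label a review by counting positive and negative words."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def cluster_reviews(reviews):
    """Group reviews by (feature, sentiment) pair."""
    clusters = defaultdict(list)
    for review in reviews:
        tokens = tokenize(review)
        label = sentiment(tokens)
        # Reviews that mention no known feature fall under "general".
        features = {FEATURES[t] for t in tokens if t in FEATURES} or {"general"}
        for feature in sorted(features):
            clusters[(feature, label)].append(review)
    return dict(clusters)


# Hypothetical public reviews of a single product.
REVIEWS = [
    "Great battery, lasts all day",
    "The cable broke within a week",
    "Battery is slow to charge and feels poor",
    "Good price",
]

if __name__ == "__main__":
    for key, members in sorted(cluster_reviews(REVIEWS).items()):
        print(key, len(members))
```

Each review in this sketch is public, yet the grouped output (for example, that complaints concentrate on one feature) is visible only to whoever runs the analysis. At platform scale, and combined with transactional data, such output is the kind of derived data this article is concerned with.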
The European Commission’s investigation into the abovementioned Amazon Marketplace and Buy Box cases highlights the implications of this anti-competitive behaviour. Although these cases do not directly address the transformation of public data into non-public intelligence, they underscore how non-public data can be used as a means of gaining an unfair competitive advantage. Applied to customer reviews and feedback, such practices would enable platforms like Amazon and Google to leverage such data to improve their private-label products, such as AmazonBasics. Similarly, in the FTC’s lawsuit against Amazon, the platform was accused of manipulating product reviews using algorithms to promote its own offerings (self-preferencing), highlighting how control over non-publicly available data can distort competition. The inferences drawn from these cases add to the growing case for rethinking the regulatory gaps that allow such data transformation to take place while escaping scrutiny.
Rethinking Boundaries: Addressing Derived Data and the Anti-Competitive Conduct
Additionally, there is a need to maintain algorithmic transparency, where SSDEs must be obligated to disclose to the regulatory authorities the different dimensions of data inputs and inferences they use while undertaking commercial activities and decisions. Disclosing algorithmic information does not mean the disclosure of proprietary or in-house algorithms, but rather regulatory oversight of how the data is being collected, processed, and transformed. For such disclosure to be meaningful, it must allow regulators to examine whether publicly available data is being integrated with consumer profiling datasets in order to favour the SSDE’s own products or services. This integration lies at the heart of the competitive concern, because non-dominant players or business users cannot replicate it due to differences in their structural conditions. Accordingly, a narrowly tailored disclosure obligation, requiring the SSDEs to disclose the categories of data they use and how those categories are used for product recommendation and search engine optimisation, would help regulators to assess whether such practices promote self-preferencing.
This is similar to the transparency obligations provided under Article 15 of the DMA, which requires gatekeepers to provide an independently audited description of their consumer profiling techniques to the European Commission. India’s framework requires a similar model, mandating the disclosure of input categories and derived data types to ensure that public data is not transformed into non-public proprietary information with a motive to stifle competition. The lacuna in the drafting of Section 12(1), if not adequately addressed, risks freely allowing such SSDEs to distort and curb competition, thus defeating the very spirit that the digital competition law aims to uphold.
Sayed Kirdar Husain is a Year III student, and Kritvee Sharma is a Year II student at the Rajiv Gandhi National University of Law, Punjab.
[1] It is to be noted that the term SSDEs was introduced under the Digital Competition Bill, 2024 [“DCB”], which was subsequently withdrawn before enactment. SSDEs are large digital platforms classified under the DCB by the Competition Commission of India [“CCI”] based on their financial strength, scale and significant market influence, akin to the concept of “gatekeepers” under the European Union’s Digital Markets Act [“DMA”]. The term, SSDEs, is used throughout this article purely as a convenient reference to large digital platforms, and does not connote any other formal regulatory terminology.