LLMs and Copyright

The quest to build the best commercial Large Language Model (LLM) has brought with it wealth, fame, technological progress, and a cornucopia of ethical questions that need to be explored before any serious integration occurs. However, being first to market with new models could make the difference between success and failure for a company, so these early integrations have arrived with a swath of legal questions and lawsuits as well as massive opportunities for organizations to create value. While the impact of these lawsuits seems limited to companies, they could have a greater effect on the data privacy of individuals. Core components of contemporary litigation surrounding data scraping could extend to social media accounts, personal websites, and other forms of personal data. Such changes would severely limit not only current scraping methods but also the sale of scraped data to third parties.

As some of these more problematic questions loom in the background, the building blocks for their answers are being laid in the world's legal systems. While laws like the European Union's AI Act (Hickman, 2024) and the California Consumer Privacy Act (CCPA and CPRA, n.d.) show an appetite for legislation around AI and data, major legal precedents are currently being set in lawsuits around the world. With the very real possibility of a major court decision reshaping the way companies collect, retain, and store data, it becomes not only cost effective but good business to prepare for change. This rings especially true in the large legal gray area in which data and AI companies currently operate.

LLMs require an almost obscene amount of data for training, with GPT-3 alone trained on an estimated 570 GB of text data (Lammertyn, 2024). Training datasets are expected to grow as the race for the best LLM continues and models require ever more data. Companies readily leverage any and all public sources of information, and they are already meeting legal challenges over that usage. GitHub, Microsoft, and OpenAI were sued over claims that GitHub Copilot often failed to properly attribute the code its suggestions were based on. While the judge ultimately determined that Copilot's output wasn't similar enough to the original developers' code, the ruling left open claims for breach of contract and violation of open-source licenses (Roth, 2024). The case marked a tepid win for the three companies, but it left a door open for the right lawsuit to change the data collection landscape. Just a month after the GitHub decision, another judge moved a lawsuit forward, stating that Stable Diffusion, among other industry-leading AI art generators, might have been “built to a significant extent on copyright works” (Cho, 2024). By analogy to the fruit-of-the-poisonous-tree doctrine, any asset created using Stable Diffusion's service would itself infringe on copyrighted works if the model was trained on them. This could present a major challenge for organizations that have incorporated “poisonous” data into their products, both in legal costs and in developer hours spent redesigning those products. Such changes could restrict the unparalleled access to data across the internet that AI/ML companies have enjoyed, but they would also open opportunities for other companies to emerge in the new environment.

The GitHub lawsuit is reminiscent of issues faced by early cartographers when it came to the copying of laboriously charted maps. Cartographers would often add “trap streets” or “paper towns,” made-up details that would be a dead giveaway that their work had been copied. Given the relative immaturity of LLMs in the grander scheme (Leibowitz, 2023), it stands to reason that these models could easily fall for trap streets. If a model or product could be consistently shown to reference a trap street, one would assume a copyright claim would follow quickly. Motivated organizations or individuals could offer a paid probing service that tests model responses for known trap streets, and especially litigious groups could become extremely proactive in such research. If the tech space wanted to avoid these issues, it could champion the practice of placing code comments or documentation in repositories that spell out the data scraping rules. Packages already exist to parse code comments; one could extend them to look for specific phrases and pass a yes-or-no flag on whether to ingest the data, as sketched below. Complexity could be added to handle differences between educational, personal, and enterprise usage. The idea could also be expanded into a package configuration that is read when referenced by other code or triggered by an event or action.
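As a minimal sketch of that idea: the snippet below scans a file's comments for a hypothetical directive (the "ai-scrape:" phrase and the tier names are invented here, not an existing standard) and returns whether a scraper may ingest it.

import re
from pathlib import Path

# Hypothetical directive, e.g. "# ai-scrape: deny" or
# "# ai-scrape: allow education, personal". Not an existing standard.
DIRECTIVE = re.compile(r"#\s*ai-scrape:\s*(allow|deny)\s*(.*)", re.IGNORECASE)

def may_ingest(path: Path, usage: str = "enterprise") -> bool:
    """Return True if the file's comments permit scraping for `usage`."""
    for line in path.read_text(errors="ignore").splitlines():
        match = DIRECTIVE.search(line)
        if not match:
            continue
        verdict, tiers = match.group(1).lower(), match.group(2)
        allowed = {t.strip().lower() for t in tiers.split(",") if t.strip()}
        if verdict == "deny":
            return False
        # "allow" with no tiers listed means allow everyone.
        return not allowed or usage.lower() in allowed
    return True  # no directive found: fall back to the scraper's default policy

# Example: skip files whose authors opted out of enterprise scraping.
for file in Path("repo").rglob("*.py"):
    if not may_ingest(file, usage="enterprise"):
        print(f"skipping {file}")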

The trap street concept could also be taken to a more nuclear option: a “poison pill.” Much like the individual who got a new car for a dollar from a dealership's AI chatbot (Leibowitz, 2023), or naming your child after Bobby Tables (Munroe, n.d.), the idea of poisoning a data model is not new. A great introduction to poison pills and AI came when Microsoft introduced its chatbot, Tay, to the world. Once the Washington State-based tech giant fed Twitter data to Tay, it took a turn for the worse and was quickly shut down (Microsoft, 2016). The lesson from 2016 still holds: ingested data can be used to create an environment for errors. While going scorched earth on minor public code or data sources might be overkill, the same cannot be said for confidential or sensitive data. The ability to kill a process or its response upon reading or interpreting certain data would be a huge benefit to a plethora of sectors. Whether the data in question is personal information or highly classified government intelligence, the chance to critically wound someone's attempt to peek into it is worthwhile.
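A minimal sketch of that kill switch, assuming a planted sentinel string of our own choosing (the CANARY value below is entirely hypothetical), might look like this:

import sys

# Hypothetical sentinel planted inside confidential documents; a real
# deployment would use unique, unguessable markers per document set.
CANARY = "7f3a-DO-NOT-INGEST-9c1e"

class PoisonPillError(RuntimeError):
    """Raised when planted sentinel data shows up in an ingestion stream."""

def ingest(records):
    cleaned = []
    for record in records:
        if CANARY in record:
            # Nuclear option: abort the whole run rather than skip the record.
            raise PoisonPillError("sentinel detected; halting ingestion")
        cleaned.append(record.strip())
    return cleaned

if __name__ == "__main__":
    docs = ["public blog post", f"internal memo {CANARY} do not share"]
    try:
        ingest(docs)
    except PoisonPillError as err:
        print(f"ingestion killed: {err}")
        sys.exit(1)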

While data audits are a good internal practice, it is foreseeable that, in the ever more integrated, data-driven world we live in, more legal and invasive forms of data audit will soon exist. With lawsuits like Doe v. GitHub mentioned earlier (Roth, 2024), an audit of training data or model responses for legal discovery would be a major hurdle at every level of an organization. Given the difficulty and costs involved, audits create an environment for specialized organizations to emerge and corner markets. Value could be generated both upstream and downstream of data ingestion. Companies providing third-party auditing services that flag suspect data or licensing violations could find themselves overworked in the near future. These data usage and licensing issues affect large enterprises, small to mid-sized businesses, and individuals alike.
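As a sketch of what such a flagging service might check first, assuming files carry SPDX license headers and using an illustrative (not legally vetted) list of restrictive licenses:

import re
from pathlib import Path

# Illustrative mapping only; a real audit would rely on full SPDX metadata
# and legal counsel, not a hard-coded list.
RESTRICTIVE = {"GPL-3.0", "AGPL-3.0", "CC-BY-SA-4.0"}
LICENSE_RE = re.compile(r"SPDX-License-Identifier:\s*([\w.\-+]+)")

def audit(dataset_dir: str):
    """Yield (file, license) pairs whose license likely restricts reuse."""
    for path in Path(dataset_dir).rglob("*"):
        if not path.is_file():
            continue
        match = LICENSE_RE.search(path.read_text(errors="ignore"))
        if match and match.group(1) in RESTRICTIVE:
            yield path, match.group(1)

for file, license_id in audit("training_data"):
    print(f"FLAG {file}: {license_id} may require attribution or share-alike")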

AI will inevitably have negative effects on data privacy and copyright ownership; that much is abundantly clear from current 2024 trends. However, organizations and their artificial intelligence models are going to run into hurdles, and those hurdles will create opportunities for new enterprises to mitigate or take advantage of these issues. While the legal questions surrounding copyright and citation of scraped data are certainly a major issue for AI companies, any major court decision could drastically change the way everyone interacts with data. This could easily be compounded by courts or judges without a full grasp of the technical complexities of not only AI and LLMs but also the Extract, Transform, Load (ETL) cycle of data ingestion. Given that a full rewrite of current production pipelines or models would not be out of the question, it would be good business to stay on top of citation and data scraping best practices. It could be as easy as checking READMEs for any mention of citation, as in the sketch below. This would be a great area for new packages, or perhaps a new business, to emerge with a technical solution for better tracking of required citations and ownership. Artificial intelligence is here to stay, regardless of the depth of the ethical complaints and qualms it has raised. While AI weaves itself into the regular human experience and poses a very real threat of decimating certain jobs and industries, it also brings opportunities for new professions and money-making ventures to rise from the ashes of others.
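A minimal sketch of that README check, with an invented keyword list (real tooling would also parse standardized files like CITATION.cff):

from pathlib import Path

# Illustrative keywords only; not a standard or exhaustive list.
CITATION_HINTS = ("citation", "cite this", "attribution", "how to cite")

def readme_requires_citation(repo: str) -> bool:
    """True if any README in the repo mentions citation requirements."""
    for readme in Path(repo).glob("README*"):
        text = readme.read_text(errors="ignore").lower()
        if any(hint in text for hint in CITATION_HINTS):
            return True
    return False

if readme_requires_citation("scraped_repo"):
    print("flag for review: repository asks to be cited")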

References

CCPA and CPRA. (n.d.). International Association of Privacy Professionals. Retrieved September 17, 2024, from https://iapp.org/resources/topics/ccpa-and-cpra/

Cho, W. (2024, August 13). Artists Score Major Win in Copyright Case Against AI Art Generators. The Hollywood Reporter. Retrieved September 18, 2024, from https://www.hollywoodreporter.com/business/business-news/artists-score-major-win-copyright-case-against-ai-art-generators-1235973601/

Hickman, T. (2024, July 16). Long awaited EU AI Act becomes law after publication in the EU’s Official Journal. White & Case LLP. Retrieved September 17, 2024, from https://www.whitecase.com/insight-alert/long-awaited-eu-ai-act-becomes-law-after-publication-eus-official-journal

Lammertyn, M. (2024). 60+ ChatGPT Statistics and Facts You Need to Know in 2024. Invgate. https://blog.invgate.com/chatgpt-statistics

Leibowitz, D. (2023, December 21). Man Tricks Chatbot To Sell Car For $1. Action Bias. Medium. Retrieved September 18, 2024, from https://medium.com/action-bias/chevy-tahoe-1-chatgpt-4896b5dfc32c

Microsoft. (2016, March 25). Learning from Tay's introduction. The Official Microsoft Blog. Retrieved September 18, 2024, from https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/

Munroe, R. (n.d.). Exploits of a Mom. xkcd. Retrieved September 18, 2024, from https://xkcd.com/327/

Roth, E. (2024, July 9). The developers suing over GitHub Copilot got dealt a major blow in court. The Verge. Retrieved September 17, 2024, from https://www.theverge.com/2024/7/9/24195233/github-ai-copyright-coding-lawsuit-microsoft-openai
