The quest to build the best commercial Large Language Model (LLM) has brought
with it wealth, fame, technological progress, and a cornucopia of ethical
questions that need to be explored before any serious integration occurs.
However, being the first to market with new models could make the difference
between success and failure for a company. These early integrations have arrived
with a swath of legal questions and lawsuits as well as massive opportunities
for organizations to create value. While the impact of these lawsuits seems
limited to companies, they could have a greater effect on the data privacy of
individuals. Core components of contemporary litigation surrounding data
scraping could extend to social media accounts, personal websites, or other
forms of personal data. Such changes would severely limit not only current
scraping methods but also the sale of scraped data to third parties.
As some of these thornier questions loom in the background, the building blocks
for answering them are being decided in the world's legal systems. While
legislation like the European Union's AI Act (Hickman, 2024) or California's
Consumer Privacy Act (CCPA and CPRA, n.d.) shows an appetite for regulating AI
and data, major legal precedents are currently being set in lawsuits around the
world. With the very real possibility of a major court decision reshaping the
way companies collect, retain, and store data, it becomes not only
cost-effective but good business to prepare for change. This rings especially
true in the large legal gray area in which data and AI companies currently
operate.

LLMs require an almost obscene amount of data for training, with GPT-3 alone
using an estimated 570 GB of training data (Lammertyn, 2024). Training datasets
are expected to keep growing as the race for the best LLM continues and models
demand ever more data. Companies readily leverage any and all public sources of
information, and
are already meeting legal challenges regarding their usage. GitHub, Microsoft,
and OpenAI were involved in a lawsuit claiming that GitHub Copilot often did not
properly attribute the code its suggestions were based on. While the judge
ultimately determined that the suggested code was not similar enough to the
original developers' code, the ruling left the door open for breach-of-contract
and open-source license violation claims (Roth, 2024). The case marked a tepid
win for the three companies, but the right lawsuit could still change the data
collection landscape. Just a month after the GitHub decision, another judge
allowed a separate lawsuit to move forward, stating that Stability AI's Stable
Diffusion, among other industry-leading models, might have been "built to a
significant extent
on copyrighted works" (Cho, 2024). By analogy to the fruit of the poisonous
tree doctrine, any asset created using Stable Diffusion's service would itself
infringe on copyrighted works if the model was trained on them. This could
present a major challenge, in both legal costs and developer hours spent
redesigning products, for organizations that may have incorporated "poisonous"
data into their offerings. These changes could restrict the unparalleled access
to data across the internet that AI/ML companies have enjoyed, but they would
also open up opportunities for other companies to emerge in the new environment.

The GitHub lawsuit is reminiscent of issues faced by early cartographers when it
came to the copying of laboriously charted maps. Cartographers would often add
"Trap Streets" or "Paper Towns," made-up details that would be a dead giveaway
that their work had been copied. Given how immature LLMs still are in the
grander scheme of things (Leibowitz, 2023), it stands to reason that they could
easily fall for "Trap Streets." If a model or product could be consistently
proven to reference a "Trap Street," one would assume that a copyright claim
would quickly follow. Especially motivated organizations or individuals could
offer a paid probing service that tests responses for known "Trap Streets," and
especially litigious groups could become extremely proactive in such research.
If the tech space wanted to avoid these issues, it could champion the practice
of putting code comments or documentation in repositories that spell out the
data scraping rules. Packages already exist to read code comments; they could be
extended to look for specific phrases and then pass a yes-or-no flag that
governs whether the data is ingested, as sketched below. Complexity could also
be added where the rules differ for education, personal, or enterprise usage.
The idea could also be expanded to a package configuration that is read when
referenced by other code or triggered by an event or action.
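As a rough illustration, the sketch below checks a source file for a hypothetical "ai-training:" comment directive before allowing ingestion. The directive name, the allow/deny syntax, and the usage categories are assumptions made for illustration only; no such standard currently exists.

```python
import re
from pathlib import Path

# Hypothetical comment directive, e.g. "# ai-training: deny" or
# "# ai-training: allow (education, personal)". The phrase and the usage
# categories are placeholders, not an established convention.
DIRECTIVE = re.compile(r"ai-training:\s*(allow|deny)(?:\s*\(([^)]*)\))?", re.IGNORECASE)


def may_ingest(source_file: Path, usage: str = "enterprise") -> bool:
    """Return True only if the file's comments permit ingestion for this usage."""
    for line in source_file.read_text(errors="ignore").splitlines():
        match = DIRECTIVE.search(line)
        if not match:
            continue
        decision, scopes = match.group(1).lower(), match.group(2)
        if decision == "deny":
            return False
        if scopes:  # an "allow" may be limited to certain usage classes
            return usage in {s.strip().lower() for s in scopes.split(",")}
        return True
    return False  # no directive found: fall back to a conservative default
```

Note that when no directive is found the sketch refuses the file, on the theory that silence should not be read as consent; a production version might instead fall back to the repository's README or license.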
Roughly the same idea as a "Trap Street" could also be applied as a more
nuclear option: a "Poison Pill." Much like the individual who talked a
dealership's AI chatbot into selling a new car for a dollar (Leibowitz, 2023),
or naming your child after Bobby Tables (Munroe, n.d.), the idea of poisoning a
data model is not new. A great
introduction to the idea of a poison pill in AI came when Microsoft introduced
its chatbot, Tay, to the world. Once the Washington State based tech giant
exposed Tay to Twitter data, the bot took a turn for the worse and was quickly
shut down (Microsoft, 2016). The idea that a certain dataset can tarnish a model
has not changed since Microsoft's 2016 lesson: ingested data can be used to
create an environment for errors. While going scorched earth on some minor
public code or data sources might be overkill, the same cannot be said for
confidential or sensitive data. The ability to kill a process or its response
when reading or interpreting certain data would be a huge benefit to a plethora
of sectors. Whether the data in question is personal information or highly
classified government intelligence, the chance to critically wound someone's
attempt to peek into it is worthwhile.
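As a minimal sketch of the idea, assuming a sentinel marker that a data owner plants in sensitive records, an ingestion loop could refuse to continue the moment it encounters one. The marker string and the pipeline shape below are invented for illustration, not an existing mechanism.

```python
# Minimal "poison pill" check during ingestion. The sentinel string is an
# invented placeholder; in practice the marker (or a fingerprint of it) would
# be planted by, or agreed on with, the data owner.
POISON_SENTINEL = "DO-NOT-INGEST-7f3a9c"


class PoisonPillError(RuntimeError):
    """Raised to kill the pipeline the moment tainted data is encountered."""


def ingest(records):
    """Collect records, aborting the whole run if any record is poisoned."""
    cleaned = []
    for record in records:
        if POISON_SENTINEL in record:
            # Fail loudly rather than silently absorbing data marked as off-limits.
            raise PoisonPillError("Poison pill detected; aborting ingestion.")
        cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    print(ingest(["public blog post", "open-source snippet"]))  # fine
    # ingest(["classified memo DO-NOT-INGEST-7f3a9c"])  # would raise PoisonPillError
```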

While data audits are a good internal practice, it is foreseeable that, in the
ever more integrated, data-driven world we live in, more legally driven and
invasive forms of data audits will soon exist. With lawsuits like the Doe v.
GitHub case mentioned earlier (Roth, 2024), an audit of training data or model
responses as part of legal discovery would be a major hurdle at every level of
an organization. Given the difficulty and costs involved, such audits create an
environment for specialized organizations to emerge and corner markets. Value
could easily be generated both upstream and downstream of data ingestion.
Companies providing a third-party auditing service that flags suspect data or
licensing violations, as sketched below, could find themselves overworked in
the near future. These data usage and licensing issues affect large
enterprises, small to mid-sized businesses, and individuals alike.
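One hedged sketch of where such an auditing service might start, assuming a simple CSV manifest that lists each data source and its license, is a scan that flags anything missing from a permissive allow-list. The manifest columns and the allow-list below are assumptions for illustration, not an established format or legal advice.

```python
import csv

# Hypothetical manifest format: one row per data source with "name", "license",
# and "origin_url" columns. The allow-list is illustrative only.
PERMISSIVE = {"mit", "apache-2.0", "bsd-3-clause", "cc0-1.0"}


def audit_manifest(path: str):
    """Yield every source whose license is missing or not on the allow-list."""
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            license_id = (row.get("license") or "").strip().lower()
            if license_id not in PERMISSIVE:
                yield {
                    "source": row.get("name"),
                    "license": license_id or "unknown",
                    "origin": row.get("origin_url"),
                }


# Example: print each flagged source for manual (or legal) review.
# for finding in audit_manifest("training_manifest.csv"):
#     print(finding)
```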
AI will inevitably have negative effects on data privacy and copyright
ownership; that much is abundantly clear from current 2024 trends. However,
organizations and their artificial intelligence models are going to run into
hurdles, and those hurdles will create opportunities for new enterprises to
emerge to mitigate or take advantage of these issues. While the legal issues
surrounding copyright and citation of scraped data are certainly a major
concern for AI companies, any major court decision could drastically change the
way everyone interacts with data. This could easily be compounded by courts or
judges that do not fully grasp the technical complexities of not only AI and
LLMs but also the Extract, Transform, Load (ETL) cycle of data ingestion.
Given that a full rewrite of current production pipelines or models would not
be out of the question, it would be good business to stay on top of current
citation and data scraping best practices. It could be as easy as checking
READMEs for any mention of citation, as in the sketch at the end of this
section. This would be a great area for new packages, or perhaps a new
business, to emerge with a technical solution for better tracking of required
citations and ownership. Artificial intelligence is here to stay, regardless of
the depth of the ethical complaints or qualms that have arisen. While AI
quickly weaves itself into the regular human experience and poses the very real
threat of decimating certain jobs and industries, it also brings with it an
opportunity for new professions and money-making ventures to arise from the
ashes of others.
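As a trivial starting point for the README idea, assuming nothing more than a local checkout of a repository, one could flag any README that mentions citation or attribution before its contents are scraped; the keyword list is an assumption rather than a standard.

```python
import re
from pathlib import Path

# Rough sketch: flag repositories whose README asks for citation or attribution
# before their contents are scraped. The keyword list is an assumption.
CITATION_HINTS = re.compile(r"\b(cite|citation|attribution|bibtex)\b", re.IGNORECASE)


def requires_citation(repo_path: str) -> bool:
    """Return True if any README in the repository mentions citation terms."""
    return any(
        CITATION_HINTS.search(readme.read_text(errors="ignore"))
        for readme in Path(repo_path).glob("README*")
    )
```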
References
CCPA and CPRA. (n.d.). International Association of Privacy Professionals.
Retrieved September 17, 2024, from
https://iapp.org/resources/topics/ccpa-and-cpra/
Cho, W. (2024, August 13). Artists Score Major Win in Copyright Case Against
AI Art Generators. The Hollywood Reporter. Retrieved September 18, 2024, from
https://www.hollywoodreporter.com/business/business-news/artists-score-major-win-copyright-case-against-ai-art-generators-1235973601/
Hickman, T. (2024, July 16). Long awaited EU AI Act becomes law after
publication in the EU’s Official Journal. White & Case LLP. Retrieved
September 17, 2024, from
https://www.whitecase.com/insight-alert/long-awaited-eu-ai-act-becomes-law-after-publication-eus-official-journal
Lammertyn, M. (2024). 60+ ChatGPT Statistics and Facts You Need to Know in
2024. Invgate. https://blog.invgate.com/chatgpt-statistics
Leibowitz, D. (2023, December 21). Man Tricks Chatbot To Sell Car For $1.
Action Bias, Medium. Retrieved September 18, 2024, from
https://medium.com/action-bias/chevy-tahoe-1-chatgpt-4896b5dfc32c
Microsoft. (2016, March 25). Learning from Tay's introduction - The Official
Microsoft Blog. The Official Microsoft Blog. Retrieved September 18, 2024,
from https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/
Munroe, R. (n.d.). xkcd: Exploits of a Mom. XKCD. Retrieved September 18,
2024, from https://xkcd.com/327/
Roth, E. (2024, July 9). The developers suing over GitHub Copilot got dealt a
major blow in court. The Verge. Retrieved September 17, 2024, from
https://www.theverge.com/2024/7/9/24195233/github-ai-copyright-coding-lawsuit-microsoft-openai