A recent study shows that AI systems such as GPT-4 can use anonymous posts from online forums and social media to deduce very precise information about people’s lives. This easy, low-cost profiling could pose a new threat to privacy, especially in the relationship between employers and employees.
On the floor lies a receipt bearing a date and a few items consumed: “4 October: room, 8 shillings; breakfast, 2 shillings 6 pence; cocktail, 1 shilling; lunch, 2 shillings 6 pence; glass of sherry, 8 pence.” No logo, no address. When Inspector Lestrade discovers the receipt, he sees nothing promising in it. But thanks to his elephantine memory, Sherlock Holmes knows that a cocktail and a room at those prices could only be found in one of the hotels on Northumberland Avenue; so he heads there to comb the registers for the mysterious client. This method – so brilliantly applied by the hero of The Adventures of Sherlock Holmes – is called inference. By collecting pieces of apparently negligible information – rolled-up sleeves, the wear on a hat, the use of a particular vocabulary – Holmes is able to infer extremely precise facts about a person’s age, origin, profession, or even their state of health.
Researchers in the computer science department of ETH Zurich asked to what extent the highest-performing AI language-processing systems of the day (known as “large language models”, or LLMs) – such as OpenAI’s GPT-4 or Meta’s Llama 2 – could pull off this type of inference. By feeding them texts posted on social media and online forums, they found that these AIs could indeed infer precise information about users’ private lives with an accuracy close to 100%.
‘Potentially sensitive information about anonymous people can be obtained without their being aware of it’
A low-cost profiling AI
To reach such results, the AI will use anything: the mention of a building’s architecture, of certain animals or plants; someone’s description of their symptoms; a TV show that happens to be available in only one country; a piece of jargon or a turn of phrase characteristic of a certain town or profession… All of this can say a lot about our place of birth, the time at which we write, our neighbourhood, our ethnic origin, etc. Try competing with the AI on the website set up to accompany the publication of the study: you’ll see that from a trivial text describing a view of the Alps from a tram, GPT-4 can deduce that it was written by an American tourist in Switzerland, in the district of Oerlikon, just north of Zurich. The researchers stress that this is no mere whodunit game but a potentially massive encroachment on our privacy: sensitive information can be gleaned from anonymous people without their knowing. “We demonstrate that by scraping the entirety of a user’s online posts and feeding them to a pre-trained LLM, malicious actors can infer private information never intended to be disclosed by the users,” reads the study.
‘In half an hour and for twenty dollars, we were able to profile 520 individuals using their Reddit posts’
—Robin Staab
“With the appropriate human resources, companies and government agencies could already achieve a similar result,” one of the authors of the study, Robin Staab, told Philonomist. But hiring a profiler for several days comes at a considerable cost. AI allows us to do it en masse, virtually for free. “In half an hour and for twenty dollars, basically the cost of a GPT-4 subscription,” Staab explains, “we were able to profile 520 individuals using their Reddit posts.”
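To make the scale of this concrete, here is a minimal sketch of the kind of pipeline the study describes: gather a user’s public posts and ask an LLM to infer personal attributes from them. It assumes the openai Python package and an API key; the prompt wording and the attribute list are illustrative inventions, not the researchers’ actual setup.

```python
# Minimal sketch (illustrative, not the study's own code): feed a user's
# collected posts to an LLM and ask it to infer personal attributes.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

def profile_user(posts: list[str]) -> str:
    """Ask the model to infer location, age range, occupation, etc."""
    joined = "\n---\n".join(posts)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "From the posts below, infer the author's likely "
                        "city, age range and occupation. Explain each clue."},
            {"role": "user", "content": joined},
        ],
    )
    return response.choices[0].message.content

# A handful of seemingly innocuous posts (invented for illustration).
posts = [
    "The view of the Alps on my morning tram commute never gets old.",
    "Waiting at that nasty intersection near the depot again...",
]
print(profile_user(posts))
```

Run over hundreds of scraped accounts, a loop like this is exactly the kind of cheap, mass profiling the researchers are warning about.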
Employees under surveillance?
Depending on one’s perspective, this can be seen as very useful for companies or very dangerous for employees. “For example, when you apply for a job,” the researcher says, “it’s common for the recruiter to quickly look up your name online.” But for lack of time and skill, this quick attempt at profiling doesn’t go very far. Now imagine that a candidate’s every online trace – Facebook posts, Twitter messages, LinkedIn updates and so on – could be handed over to an artificial Sherlock Holmes. If the candidate posts a lot online, then even if those posts seem innocent and devoid of private information, the AI could still provide the employer with specific facts: their state of health, how many children they have, where they live, even problems encountered in a previous job…
Ultimately, by putting this kind of artificial Sherlock Holmes in anyone’s hands, GPT-4 and similar LLMs once again raise the question of what we give away through our online activity. It may seem obvious that we shouldn’t reveal private or sensitive information about ourselves online; the Swiss study suggests that in the era of AI, we will probably have to raise our caution to a whole new level, verging on paranoia. One solution could simply be to say nothing at all. Another, more amusing one, would be to trick the machine through what’s known as “data poisoning”: between two truths, slip in details that have nothing to do with you, presented as if they did... The AI will then infer information that no longer has any value. But if we all start spreading fake clues online, what will the internet of tomorrow look like?
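For the curious, here is what such decoy-planting might look like in practice – a purely illustrative sketch, not something from the study: genuine posts are interleaved at random with invented sentences pointing to a different city, job or family situation.

```python
# Purely illustrative sketch of "data poisoning": slip decoy sentences
# (false clues about location, job, family, etc.) in between genuine
# posts so that an inference model picks up contradictory signals.
# The decoys and the mixing strategy are invented for illustration.
import random

DECOYS = [
    "Nothing beats the humidity back home in Singapore.",
    "Night shifts at the bakery are finally over for the week.",
    "My third grandchild was born yesterday!",
]

def poison(posts: list[str], rate: float = 0.5) -> list[str]:
    """Return posts with decoy sentences randomly slipped in between."""
    out = []
    for post in posts:
        out.append(post)
        if random.random() < rate:
            out.append(random.choice(DECOYS))
    return out

print("\n".join(poison([
    "The view of the Alps on my morning tram commute never gets old.",
    "Grabbing coffee before the tram comes.",
])))
```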