
Explosives, Cybercrime, Dark Web: The Alarming Secrets AI Chatbots Revealed In Safety Trials

OpenAI and Anthropic testing uncovered disturbing misuse of AI models, from bomb-making tips to cyber extortion schemes, raising urgent questions about safety, misuse resistance, and the darker potential of advanced technology.

Published By: Shairin Panwar
Last Updated: August 29, 2025 03:50:34 IST

A recent set of safety evaluations has revealed how advanced AI chatbots, including OpenAI’s GPT-4.1, could be manipulated to provide harmful information, raising fresh concerns about misuse and security risks.

OpenAI and Anthropic Put Each Other to the Test

The trials were conducted in an unprecedented partnership between OpenAI, the $500 billion company headed by Sam Altman, and its competitor Anthropic, founded by former OpenAI employees concerned about safety. Under the experiment, each firm tested the other’s models to determine how they responded to hazardous prompts.

The findings were troubling. OpenAI’s GPT-4.1 was reported to have produced instructions on explosives, drug-making, and even bioweapons such as anthrax. It also went as far as identifying weak points in sports venues and suggesting ways attackers could bypass security. Though these scenarios were tested in controlled conditions, Anthropic said such “concerning behaviour” highlighted the urgent need for more rigorous alignment checks on AI systems.

AI Use in Cybercrime and Scams

Anthropic revealed that its own Claude model had been used improperly in actual crime plots. The company stated that North Korean agents attempted to use the AI while pretending to be job applicants at global tech companies, while other people sold AI-created ransomware packages for up to $1,200.

The company warned that AI is now being “weaponised,” fuelling advanced cyberattacks and online fraud. Unlike conventional tools, these systems can adapt in real time to security measures such as malware protection software. This flexibility, Anthropic emphasized, lowers the barrier for cybercriminals, as less technical skill is needed to execute complex attacks.


Experts Call for Stronger Guardrails

In spite of these striking examples, researchers say large-scale real-world abuse remains comparatively limited. Ardi Janjeva from the UK’s Centre for Emerging Technology and Security explained that, while the findings are disturbing, there is no “critical mass” of high-profile instances yet. He believes that with intensified research, careful safeguards, and global collaboration, it will become increasingly difficult to exploit these systems.

The two firms published their results to be transparent about how they work, departing from the tech sector’s usual secrecy around safety testing. OpenAI stated that its recently released ChatGPT-5 shows “substantial improvements” in resisting exploitation, hallucination, and misleading prompts. Anthropic added that most of the harmful outputs identified in its studies could be avoided if proper controls were put in place around the models.

Nevertheless, researchers found that even flimsy pretexts, such as claiming the information was needed for “security planning,” could convince certain models to generate harmful content. “We need to know how frequently, and under what conditions, systems may try to perform unwanted actions that could cause severe harm,” Anthropic cautioned.
