Horeszko's Weblog


Phishing Detection Heuristic: Prototype Achieves 60% Detection Rate

February 17, 2026

Project Source Code

Those of you who have ever advertised online through Facebook know that any email address you advertise will soon be inundated with phishing emails trying to gain access to your business's Facebook account.

Surprisingly, these phishing emails mimic Meta's emails and support pages almost exactly. Many of the emails and their corresponding malicious websites are of very high quality.

After receiving two to three of these phishing emails per week in my wife's small business inbox, and seeing email spam and phishing filters fail to flag them, I decided to try my hand at developing a solution.

I set out to develop a phishing email detection tool.

While researching existing approaches and phishing email datasets, I noticed that many developers take a machine learning (ML) approach using outdated phishing emails from the 2000s. They try to train an ML model that identifies phishing from whole phishing emails in aggregate.

I realized that a stochastic solution trained on phishing email datasets would inevitably become out of date as phishing methods evolved. Lists of malicious senders are flawed for the same reason: phishers keep creating or hijacking new email accounts, so the lists go stale too.

Inspired by the Sherlockian method of deduction and logic, I instead thought deeply about how to differentiate phishing emails from legitimate ones. I reasoned that for the solution not to become obsolete as phishing techniques and email addresses evolved, it could not rely on any information external to the email. That is, only the email itself could be used as the basis for determining its legitimacy.

To achieve this, I considered the human thought process we use to identify phishing emails. Most simply, we compare the actual sending domain to the claimed sender. For example, a phishing email might claim to be from Meta but is actually from ns.tanmmo.com.
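The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: it takes the "claimed sender" from the display name of the From header, uses a hypothetical GENERIC_WORDS set to ignore filler words, and treats a simple substring match against the sending domain as "matching". The function names are made up for this example.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Hypothetical filler words that say nothing about who the sender claims to be.
GENERIC_WORDS = {"support", "team", "security", "billing", "notifications", "the"}


def claimed_and_actual(raw_email: str):
    """Return (claimed-name tokens, actual sending domain) from the From header."""
    msg = message_from_string(raw_email)
    display_name, address = parseaddr(msg.get("From", ""))
    domain = address.rpartition("@")[2].lower()
    tokens = [
        t for t in re.findall(r"[a-z0-9]+", display_name.lower())
        if t not in GENERIC_WORDS
    ]
    return tokens, domain


def looks_mismatched(raw_email: str) -> bool:
    """True when no claimed-name token appears anywhere in the sending domain."""
    tokens, domain = claimed_and_actual(raw_email)
    if not tokens or not domain:
        return False  # nothing to compare, so make no claim either way
    # Substring matching is deliberately naive: "meta" would also match
    # a lookalike domain such as "metadata-alerts.example".
    return not any(token in domain for token in tokens)


phish = 'From: "Meta Support" <noreply@ns.tanmmo.com>\nSubject: Action required\n\nHello'
print(looks_mismatched(phish))
```

A check this simple will also flag legitimate senders whose mail domain does not contain their brand name, so it is best read as a starting point for the idea rather than a usable filter.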

Using this heuristic as the basis of my design, I developed a solution that uses natural language processing (NLP) to identify the claimed sender and compare it to the actual sending domain. The prototype has a detection rate of 60% on a sample of 32 phishing emails, which shows that the heuristic has potential. I believe refinements could raise the detection rate to 95%. For now, however, I am satisfied with the success of this project.
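To make the "identify the claimed sender" step concrete, here is a rough sketch that stands in for the NLP stage with a much cruder technique: counting capitalized words in the message text and taking the most frequent one as the claimed sender. This is an assumption made purely for illustration and is not how the prototype works; the STOPWORDS set and both function names are hypothetical.

```python
import re
from collections import Counter
from typing import Optional

# Hypothetical list of capitalized words that are not organization names.
STOPWORDS = {
    "your", "the", "dear", "hello", "please", "thanks",
    "support", "team", "account", "business",
}


def guess_claimed_sender(text: str) -> Optional[str]:
    """Guess who the email claims to be from: the most frequent capitalized word."""
    candidates = re.findall(r"\b[A-Z][a-zA-Z]{2,}\b", text)
    counts = Counter(w.lower() for w in candidates if w.lower() not in STOPWORDS)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def is_flagged(text: str, sending_domain: str) -> bool:
    """Flag the email when the guessed sender name is absent from the sending domain."""
    claimed = guess_claimed_sender(text)
    if claimed is None:
        return False  # no claimed sender found, so nothing to compare
    return claimed not in sending_domain.lower()


body = (
    "Your Meta Business account has been restricted. "
    "Meta requires you to verify your page. Thanks, Meta Support Team"
)
print(guess_claimed_sender(body), is_flagged(body, "ns.tanmmo.com"))
```

Like the heuristic itself, this uses only the contents of the email. A real NLP pipeline, for example one built on named-entity recognition, would presumably pick out organization names far more reliably than a word count.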