Fraud Detection with Semi-supervised Learning

41626 2019-02-13 23:57


Leveraging user and device data during user login to fight against

  1. ATO (account takeovers)
  2. Botnet attacks

ATOs ranking from easy to hard to detect

  1. from single IP
  2. from IPs on the same device
  3. from IPs across the world
  4. from 100k IPs
  5. attacks on specific accounts
  6. phishing and malware


Semi-supervised learning = unlabeled data + small amount of labeled data

Why? better learning accuracy than unsupervised learning + less time and costs than supervised learning

  • K-means: not good
  • DBSCAN: better. Use labels to
    1. Tune hyperparameter
    2. Constrain


  • Manual feature selection
  • Feature evolution in adversarial environment
  • Scalability
  • No online DBSCAN


Anti-fraud Query

Anti-fraud Featuring

Production Setup

  • Batch: 7 days worth of data, run DBSCAN hourly
  • Streaming: 60 minutes moving window, run streaming k-means
  • Used feedback signal success ratios to mark clusters as good, bad or unknown
  • Bad clusters: Always throw
  • Good clusters: Small % of attempts
  • Unknown clusters: X% of attempts


在 Github 上 Follow 我 :)

下载 硅谷io App

关注互联网创业,用移动 App ,随时随地收藏和复习好文章