某学校选课系统GIF验证码数据集 + Baseline模型 + 上下游相关工具

Overview

elective-dataset-2021spring

某学校2021春季选课系统GIF验证码数据集(29338张) + 准确率98.4%的Baseline模型 + 上下游相关工具。

数据集采用 知识共享署名-非商业性使用 4.0 国际许可协议 进行许可。

Baseline模型和上下游相关工具采用 MIT License 进行许可。

数据集

dataset/ 目录包含了收集到的所有带标签验证码数据,共29338张。

  • dataset/manual: 人工标注的带标签验证码GIF数据集,标签经过了elective验证因此都是正确的。共5471张。
  • dataset/auto-corrdataset/auto-fail-tagged: 模型自动标注的带标签验证码GIF数据集,其中 auto-corr 是识别正确(通过了elective验证)的部分,auto-fail-tagged 是识别错误然后手工重新标注的部分(此部分不保证正确性)。共22931(正确)+936(错误)张。

使用时请注意,由于 GitHub 的限制

  • auto-fail-tagged 在仓库中存储为7-zip压缩包;
  • manual 在仓库中存储为7个不超过48MB的7-zip分卷;
  • auto-corr 没有存储在仓库中,而是压缩为14个不超过95MB的7-zip分卷放在了 Release页面

Baseline 模型

baseline/ 目录包含一个简易的验证码识别模型。

此模型进行了提取关键帧、基于OpenCV的图像增强以及基于CNN的分类器等一系列工作以完成识别。

将训练集和测试集图片分别放入 set-trainset-test 后运行 train.py 进行训练,用一块TITAN RTX训练需要几分钟的时间。

用大约一万张图片训练好的 checkpoints/model_29.pth 能达到 98.4% 的整体精确度。

predict_bootstrap.py 在elective系统上测试当前模型,将检验正确的带标签图片放入 bootstrap_img_succ 目录,错误的图片放入 bootstrap_img_fail 目录。

上下游相关工具

  • crawl/: 验证码众包标注平台,可以从elective爬取验证码、辅助多名用户同时标注、检验正确性后将正确的数据放入 img_correct 目录。检验错误的验证码将被抛弃,这是初期的一个设计失误,这样将使得数据集的分布与真实分布有偏差。
  • retag/: 手工标注模型识别错误数据的工具。从 bootstrap_img_fail 读取标注错误图片,人工输入正确标注后移动到 bootstrap_img_fail_tagged
  • serve/: 提供在线验证码识别服务的 HTTP RPC 服务器。POST /fire 并传入base64编码的验证码GIF来进行识别。

数据处理过程

首先,我们设立了众包标注平台,多名志愿者累计标注了超过五千张验证码。

有了这些数据后,我们利用OpenCV进行了简单的图片增强、二值化、分字、裁切,然后随手糊了一个简单的CNN网络来识别。在随意调参之后,模型的整体(四个字)准确率接近95%。

然后,我们利用此模型来对数据集进行自举:爬取验证码后调用模型识别然后检验正确性,其中识别错误的部分手工标注。这样我们可以轻易地扩大数据集的规模,从而提升模型效果。

经过了更多的随意调参,模型的整体准确率可以达到98.4%。因为继续提升准确率意义不大,就没有继续优化。考虑到 PyTorch 安装比较麻烦,模型不易于部署到用户的设备上,我们实现了一个 HTTP API 可以用于云端识别。

相关工作

by Elector Quartet (按字典序的倒序 @xmcp, @SpiritedAwayCN, @Rabbit, @gzz)

You might also like...
Owner
xmcp
叶氏筛法第 NaN 代传人
xmcp
Softlearning is a reinforcement learning framework for training maximum entropy policies in continuous domains. Includes the official implementation of the Soft Actor-Critic algorithm.

Softlearning Softlearning is a deep reinforcement learning toolbox for training maximum entropy policies in continuous domains. The implementation is

Robotic AI & Learning Lab Berkeley 997 Dec 30, 2022
World Models with TensorFlow 2

World Models This repo reproduces the original implementation of World Models. This implementation uses TensorFlow 2.2. Docker The easiest way to hand

Zac Wellmer 234 Nov 30, 2022
SwinIR: Image Restoration Using Swin Transformer

SwinIR: Image Restoration Using Swin Transformer This repository is the official PyTorch implementation of SwinIR: Image Restoration Using Shifted Win

Jingyun Liang 2.4k Jan 05, 2023
Normalization Calibration (NorCal) for Long-Tailed Object Detection and Instance Segmentation

NorCal Normalization Calibration (NorCal) for Long-Tailed Object Detection and Instance Segmentation On Model Calibration for Long-Tailed Object Detec

Tai-Yu (Daniel) Pan 24 Dec 25, 2022
Personalized Transfer of User Preferences for Cross-domain Recommendation (PTUPCDR)

Personalized Transfer of User Preferences for Cross-domain Recommendation (PTUPCDR) This is the official implementation of our paper Personalized Tran

Yongchun Zhu 81 Dec 29, 2022
CyTran: Cycle-Consistent Transformers for Non-Contrast to Contrast CT Translation

CyTran: Cycle-Consistent Transformers for Non-Contrast to Contrast CT Translation We propose a novel approach to translate unpaired contrast computed

Nicolae Catalin Ristea 13 Jan 02, 2023
AdvStyle - Official PyTorch Implementation

AdvStyle - Official PyTorch Implementation Paper | Supp Discovering Interpretable Latent Space Directions of GANs Beyond Binary Attributes. Huiting Ya

Beryl 37 Oct 21, 2022
Management Dashboard for Torchserve

Torchserve Dashboard Torchserve Dashboard using Streamlit Related blog post Usage Additional Requirement: torchserve (recommended:v0.5.2) Simply run:

Ceyda Cinarel 103 Dec 10, 2022
Chinese license plate recognition

AgentCLPR 简介 一个基于 ONNXRuntime、AgentOCR 和 License-Plate-Detector 项目开发的中国车牌检测识别系统。 车牌识别效果 支持多种车牌的检测和识别(其中单层车牌识别效果较好): 单层车牌: [[[[373, 282], [69, 284],

AgentMaker 26 Dec 25, 2022
Pywonderland - A tour in the wonderland of math with python.

A Tour in the Wonderland of Math with Python A collection of python scripts for drawing beautiful figures and animating interesting algorithms in math

Zhao Liang 4.1k Jan 03, 2023
[ICLR 2021] Is Attention Better Than Matrix Decomposition?

Enjoy-Hamburger 🍔 Official implementation of Hamburger, Is Attention Better Than Matrix Decomposition? (ICLR 2021) Under construction. Introduction T

Gsunshine 271 Dec 29, 2022
pytorch implementation for PointNet

PointNet.pytorch This repo is implementation for PointNet in pytorch. The model is in pointnet/model.py. It is teste

Fei Xia 1.7k Dec 30, 2022
Brain tumor detection using CNN (InceptionResNetV2 Model)

Brain-Tumor-Detection Building a detection model using a convolutional neural network in Tensorflow & Keras. Used brain MRI images. InceptionResNetV2

1 Feb 13, 2022
I created My own Virtual Artificial Intelligence named genesis, He can assist with my Tasks and also perform some analysis,,

Virtual-Artificial-Intelligence-genesis- I created My own Virtual Artificial Intelligence named genesis, He can assist with my Tasks and also perform

AKASH M 1 Nov 05, 2021
Code for our WACV 2022 paper "Hyper-Convolution Networks for Biomedical Image Segmentation"

Hyper-Convolution Networks for Biomedical Image Segmentation Code for our WACV 2022 paper "Hyper-Convolution Networks for Biomedical Image Segmentatio

Tianyu Ma 17 Nov 02, 2022
Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations

TopClus The source code used for Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations, published in WWW 2022. Requ

Yu Meng 63 Dec 18, 2022
Tracking Pipeline helps you to solve the tracking problem more easily

Tracking_Pipeline Tracking_Pipeline helps you to solve the tracking problem more easily I integrate detection algorithms like: Yolov5, Yolov4, YoloX,

VNOpenAI 32 Dec 21, 2022
Unofficial implementation of One-Shot Free-View Neural Talking Head Synthesis

face-vid2vid Usage Dataset Preparation cd datasets wget https://yt-dl.org/downloads/latest/youtube-dl -O youtube-dl chmod a+rx youtube-dl python load_

worstcoder 68 Dec 30, 2022
Open standard for machine learning interoperability

Open Neural Network Exchange (ONNX) is an open ecosystem that empowers AI developers to choose the right tools as their project evolves. ONNX provides

Open Neural Network Exchange 13.9k Dec 30, 2022
You Only Look Once for Panopitic Driving Perception

You Only 👀 Once for Panoptic 🚗 Perception You Only Look at Once for Panoptic driving Perception by Dong Wu, Manwen Liao, Weitian Zhang, Xinggang Wan

Hust Visual Learning Team 1.4k Jan 04, 2023