Methodology - StegVision

TECHNICAL OVERVIEW

Multisteganalysis Using CNN And Transformer

The final implementation is a CNN-Transformer evidence pipeline with one branch per media type. Images are scored by Stegformer (ONNX) together with spatial and frequency forensics. Audio is scored by a multisegment AudioStegNet with SPA/RS (Sample Pair Analysis and Regular-Singular analysis) discrimination and codec-aware calibration (v3). Video uses adaptive sampling of 32 to 128 frames, with H.264/DCT and pixel-domain paths (v4b calibrated).

CNN-TRANSFORMER EVIDENCE PIPELINE

INPUT

Image, audio, or video upload routed by file type

PREPROCESSING

Image: grayscale, 256×256. Audio: mel, LSB, and residual tensors. Video: 32 to 128 adaptively sampled frames.

MODEL

Transformer image inference, CNN audio inference, spatial LSB, JPEG-frequency, residual, and temporal support engines

OUTPUT

P(Clean), P(Stego), confidence, reliability, decision-engine result, media statistics, and a downloadable JSON report

The report is evidence-based: the final result is not just a label. It includes transformer probability, LSB evidence, JPEG-frequency evidence, residual texture support, audio forensic score, video temporal score, and a reliability label.
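As a rough illustration of what an evidence-based report can look like, the sketch below bundles the probabilities, a reliability label, and per-module evidence into one JSON document. The class name, field names, and the rule that confidence equals the larger of the two probabilities are assumptions made for this example, not the actual StegVision schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvidenceReport:
    """Hypothetical report container; field names are illustrative only."""
    media_type: str
    p_clean: float
    p_stego: float
    reliability: str
    evidence: dict = field(default_factory=dict)

    @property
    def label(self) -> str:
        # Assumed rule: pick whichever class has the higher probability.
        return "stego" if self.p_stego >= self.p_clean else "clean"

    def to_json(self) -> str:
        payload = asdict(self)
        payload["label"] = self.label
        # Assumed definition: confidence is the winning class probability.
        payload["confidence"] = max(self.p_clean, self.p_stego)
        return json.dumps(payload, indent=2, sort_keys=True)


report = EvidenceReport(
    media_type="image",
    p_clean=0.2,
    p_stego=0.8,
    reliability="high",
    evidence={"transformer_probability": 0.8, "lsb_evidence": 0.7},
)
print(report.to_json())
```

Keeping the label alongside the evidence that produced it lets a reader check whether the supporting modules agree with the model probability.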

PIPELINE

Six-stage Forensic Flow

01 - ROUTE

Media Router

The API validates size and extension, then routes the file to the image, audio, or video analysis branch.
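The routing step can be sketched as a size check followed by an extension lookup. The size limit, the extension table, and the function name below are hypothetical; the source does not list the accepted formats or the upload cap.

```python
from pathlib import Path

# Hypothetical upload limit; the real cap is not stated in the source.
MAX_BYTES = 50 * 1024 * 1024

# Hypothetical extension table mapping each suffix to an analysis branch.
BRANCH_BY_EXTENSION = {
    ".png": "image", ".jpg": "image", ".jpeg": "image", ".bmp": "image",
    ".wav": "audio", ".mp3": "audio", ".flac": "audio",
    ".mp4": "video", ".avi": "video", ".mkv": "video",
}


def route_media(filename: str, size_bytes: int) -> str:
    """Validate size and extension, then return the branch name."""
    if size_bytes <= 0:
        raise ValueError("empty upload")
    if size_bytes > MAX_BYTES:
        raise ValueError(f"file exceeds {MAX_BYTES} bytes")
    extension = Path(filename).suffix.lower()
    branch = BRANCH_BY_EXTENSION.get(extension)
    if branch is None:
        raise ValueError(f"unsupported extension: {extension!r}")
    return branch


print(route_media("holiday.JPG", 120_000))
print(route_media("clip.mp4", 4_000_000))
```

An extension check alone is easy to spoof, so a production router would probably also inspect the file signature before dispatching.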

02 - IMAGE

Transformer Branch

The image branch uses Stegformer ONNX for transformer-based clean/stego probability estimation.

03 - SUPPORT

Forensic Evidence

Spatial LSB, JPEG-frequency, and residual texture modules add support evidence and reliability context.
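One classic spatial LSB statistic is the pairs-of-values chi-square test, sketched below to show the kind of evidence such a module can contribute. The source does not say which statistics the spatial LSB module actually computes, so this is an illustrative example, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def pov_chi_square(pixels: Iterable[int]) -> float:
    """Chi-square statistic over the value pairs (2k, 2k+1) of 8-bit pixels.

    LSB replacement tends to equalise the counts inside each pair, so a
    statistic that is small relative to the number of occupied pairs is
    consistent with embedding. It is supporting evidence, not proof.
    """
    histogram = Counter(pixels)
    chi2 = 0.0
    for even_value in range(0, 256, 2):
        even_count = histogram.get(even_value, 0)
        odd_count = histogram.get(even_value + 1, 0)
        expected = (even_count + odd_count) / 2
        if expected > 0:
            chi2 += (even_count - expected) ** 2 / expected
    return chi2


balanced = [10] * 50 + [11] * 50   # pair counts are equal
skewed = [10] * 90 + [11] * 10     # pair counts are far apart
print(pov_chi_square(balanced), pov_chi_square(skewed))
```

Turning the statistic into a probability requires the chi-square distribution with the appropriate degrees of freedom, which is left out here to keep the sketch dependency-free.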

04 - AUDIO

CNN Audio Branch

The audio branch uses mel, PCM-LSB, and residual tensors, then runs SPA/RS forensics, codec profiling, and calibrated CNN fusion.
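To make the PCM-LSB idea concrete, the sketch below splits integer PCM samples into fixed-length segments and measures the fraction of set least-significant bits in each. The segment length, the function name, and the choice of this particular statistic are assumptions; the SPA/RS forensics, codec profiling, and CNN fusion are not reproduced here.

```python
from typing import List, Sequence


def segment_lsb_ratios(samples: Sequence[int], segment_len: int) -> List[float]:
    """Fraction of samples with LSB set, per non-overlapping segment.

    Trailing samples that do not fill a whole segment are dropped.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    ratios = []
    for start in range(0, len(samples) - segment_len + 1, segment_len):
        segment = samples[start:start + segment_len]
        ones = sum(sample & 1 for sample in segment)
        ratios.append(ones / segment_len)
    return ratios


# Signed 16-bit style samples; `& 1` reads the two's-complement LSB.
pcm = [0, 1, -2, 3, 4, 5, 6, 7, 100]
print(segment_lsb_ratios(pcm, 4))
```

Scoring several segments separately, rather than the whole file at once, helps localise a payload that occupies only part of a recording.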

05 - VIDEO

Temporal Fusion

Videos are adaptively sampled into 32 to 128 frames, scored per frame, and fused using the mean, 90th-percentile (P90), and maximum frame scores together with support and temporal artifact metrics.
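The sampling and fusion steps can be sketched as follows. The 32 and 128 frame bounds come from the text; the stride heuristic, the 0.5 support threshold, the reading of "support" as the fraction of frames above that threshold, and all function names are assumptions. The temporal artifact metrics are omitted because the source does not define them.

```python
import math
from statistics import fmean
from typing import Dict, List, Sequence

MIN_FRAMES, MAX_FRAMES = 32, 128


def sample_frame_indices(total_frames: int, stride_hint: int = 10) -> List[int]:
    """Evenly spaced frame indices, with the count clamped to 32..128."""
    if total_frames <= 0:
        return []
    target = max(MIN_FRAMES, min(MAX_FRAMES, total_frames // stride_hint))
    target = min(target, total_frames)  # short clips: use every frame
    step = total_frames / target
    return [int(i * step) for i in range(target)]


def nearest_rank_percentile(ordered: Sequence[float], q: int) -> float:
    """Nearest-rank percentile of an ascending sequence."""
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def fuse_frame_scores(scores: Sequence[float],
                      support_threshold: float = 0.5) -> Dict[str, float]:
    """Summarise per-frame stego scores into clip-level statistics."""
    if not scores:
        raise ValueError("no frame scores to fuse")
    ordered = sorted(scores)
    return {
        "mean": fmean(ordered),
        "p90": nearest_rank_percentile(ordered, 90),
        "max": ordered[-1],
        # Assumed meaning of "support": share of frames at or above threshold.
        "support": sum(s >= support_threshold for s in ordered) / len(ordered),
    }


indices = sample_frame_indices(5000)
print(len(indices), indices[:3])
print(fuse_frame_scores([0.1] * 9 + [0.9]))
```

Reporting P90 and the maximum next to the mean keeps a payload confined to a few frames visible even when most frames look clean.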

06 - REPORT

Explainable Report

The website renders probability bars, evidence cards, visualisations, technical findings, and the raw JSON report.