Dex Scanner

The Dex scanner is ShipThatApp's flagship AI feature: point the camera at anything — a plant, a fish, a rock, even a Pokémon card — and get back a rich, structured "Dex entry" identifying it. Every discovery is collected into a persistent, browseable Pokédex.

Under the hood it is a hybrid, two-pass identifier. Apple's on-device Vision framework gives an instant, offline, private coarse guess; the secure cloud Vision endpoint then returns a fine-grained, structured result. The two passes complement each other: the local guess seeds the UI immediately and acts as a graceful offline fallback, while the cloud pass produces the detailed entry the user saves.

Heads Up!

The Dex scanner reuses the same secure vision endpoint as the rest of the app. Every cloud call is an HMAC-signed request through your backend proxy, so your OpenAI keys never ship inside the app binary. See In-App Purchases and the API client for the shared request machinery.

Architecture at a Glance

The feature lives in three places:

  • Views: Views/AI/Pokedex/ (the grid, the scan flow, the entry views, and the DexCategory / ScanResult / DexEntry models that back them).
  • Services: Services/Vision/ImageClassifier.swift (on-device pass), Services/OpenAI/Vision/DexVisionService.swift (cloud pass), and Services/Pokedex/DexStore.swift (persistence).
  • Backend: the existing signed Endpoints.vision endpoint, reused unchanged.

ScannerViewModel is the orchestrator. It runs the on-device pass, then the cloud pass, then builds (but does not insert) a DexEntry. Persistence is left to the view, which keeps the view model fully testable.

The Two-Pass Hybrid Flow

When the user captures or picks an image, ScannerViewModel.analyze(_:) drives the whole sequence. It moves through a small state machine — Phase.capturing → analyzing → result (or failed) — that the ScannerView renders.
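
The state machine can be pictured as a small enum with a guard on its transitions. This is a minimal sketch, not the app's actual Phase type: the canTransition(to:) helper and the two "back to capturing" transitions are assumptions added for illustration.

```swift
// Hypothetical sketch of the scan phases. The real Phase type may carry
// different cases or associated values.
enum Phase: Equatable {
  case capturing, analyzing, result, failed

  /// Returns true when moving from `self` to `next` follows the flow
  /// capturing → analyzing → result (or failed).
  func canTransition(to next: Phase) -> Bool {
    switch (self, next) {
    case (.capturing, .analyzing),
         (.analyzing, .result),
         (.analyzing, .failed),
         (.result, .capturing),  // assumed: "scan again" restarts the flow
         (.failed, .capturing):  // assumed: retry after a failure
      return true
    default:
      return false
    }
  }
}
```

Keeping the legal transitions in one place makes it easy to assert in tests that the view model never jumps straight from capturing to result.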

Pass 0: Encode once, off the main actor

Before any classification happens, the captured UIImage is downscaled and JPEG-encoded once, off the main actor, and the resulting bytes are reused for both passes and for persistence. This avoids three separate main-thread encodes.

// Resize + JPEG-encode ONCE, off the main actor, then reuse the bytes for
// both passes and for persistence — avoids three main-thread encodes.
let maxDimension = Config.imageMaxDimension
let jpeg = await Task.detached(priority: .userInitiated) {
  ScannerViewModel.downscaledJPEG(from: image, maxDimension: maxDimension, quality: 0.8)
}.value

Downscaling uses ImageIO thumbnailing (CGImageSourceCreateThumbnailAtIndex), capped at Config.imageMaxDimension. The resulting Data is held on imageData for the rest of the scan.
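
For orientation, here is a rough sketch of ImageIO thumbnailing followed by a single JPEG encode. It is not the app's downscaledJPEG(from:maxDimension:quality:): this version takes already-encoded Data rather than a UIImage, and its name and signature are assumptions. It needs an Apple platform, since ImageIO is not available elsewhere.

```swift
import Foundation
import ImageIO
import UniformTypeIdentifiers

/// Hypothetical helper: downscale encoded image bytes and re-encode as JPEG.
/// Returns nil if the input cannot be decoded or the encode fails.
func downscaledJPEGSketch(from data: Data, maxDimension: Int, quality: Double) -> Data? {
  guard let source = CGImageSourceCreateWithData(data as CFData, nil) else { return nil }

  let options: [CFString: Any] = [
    // Always build the thumbnail from the full image, never an embedded one.
    kCGImageSourceCreateThumbnailFromImageAlways: true,
    // Apply the EXIF orientation so the result is upright.
    kCGImageSourceCreateThumbnailWithTransform: true,
    // Cap the longest side, in pixels.
    kCGImageSourceThumbnailMaxPixelSize: maxDimension,
  ]
  guard let thumbnail = CGImageSourceCreateThumbnailAtIndex(source, 0, options as CFDictionary) else {
    return nil
  }

  let output = NSMutableData()
  guard let destination = CGImageDestinationCreateWithData(
    output, UTType.jpeg.identifier as CFString, 1, nil
  ) else { return nil }

  let properties = [kCGImageDestinationLossyCompressionQuality: quality] as CFDictionary
  CGImageDestinationAddImage(destination, thumbnail, properties)
  guard CGImageDestinationFinalize(destination) else { return nil }
  return output as Data
}
```

Thumbnailing through ImageIO decodes directly at the reduced size, which keeps peak memory lower than decoding the full image and scaling it afterwards.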

Pass 1: Instant on-device classification (offline, private, free)

The first pass is Apple's Vision framework via VNClassifyImageRequest, wrapped in ImageClassifier. It runs entirely on-device — no network, no cost, instant — and produces a coarse category and a confidence:

// 1. Instant, offline on-device pass.
let guess = await classifier.classify(imageData: jpeg)
withAnimation { localGuess = guess }

ImageClassifier takes pre-encoded JPEG Data (not a UIImage) precisely so it can reuse the bytes encoded in Pass 0. It runs off the main thread and never throws: a failed classification simply yields nil. The private helper below does the actual work once the data has been decoded to a CGImage:

nonisolated private static func classify(cgImage: CGImage) -> LocalGuess? {
  let request = VNClassifyImageRequest()
  let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
  do {
    try handler.perform([request])
  } catch {
    // Best-effort, offline hint — a failure just means no local guess.
    return nil
  }

  guard let observations = request.results, !observations.isEmpty else { return nil }
  let ranked = observations.sorted { $0.confidence > $1.confidence }
  guard let best = ranked.first else { return nil }

  let topLabels = ranked.prefix(8).map { $0.identifier }
  let category = DexCategory.matching(fromLabels: topLabels)
  let label = best.identifier.replacingOccurrences(of: "_", with: " ")
  return LocalGuess(category: category, label: label, confidence: Double(best.confidence))
}

The result is a LocalGuess:

nonisolated struct LocalGuess: Equatable, Sendable {
  let category: DexCategory
  /// The top raw label, lightly cleaned for display (e.g. "tree frog").
  let label: String
  /// Confidence in `[0, 1]`.
  let confidence: Double
}

As soon as the local guess lands, the analyzing screen upgrades its headline from a generic "Analyzing…" to "Looks like a Plant…" — so the user sees a meaningful response before the cloud has even replied.

Pass 2: Rich cloud identification

The second pass calls DexVisionService.identify(imageData:categoryHint:). The on-device guess is passed through as a weak categoryHint, nudging the model without overriding it:

// 2. Rich cloud pass.
do {
  let scan = try await visionService.identify(imageData: jpeg, categoryHint: guess?.category.title)
  result = scan
  phase = .result
} catch {
  // graceful fallback — see below
}

DexVisionService reuses the existing VisionRequest / VisionResponse types and the signed Endpoints.vision endpoint — no backend change. It sends the base64 image plus a prompt that instructs the model to return strict JSON, then parses a ScanResult out of the response:

struct DexVisionService: DexVisionServicing {
  func identify(imageData: Data, categoryHint: String?) async throws -> ScanResult {
    let request = VisionRequest(
      image: imageData.base64EncodedString(),
      prompt: Self.prompt(categoryHint: categoryHint)
    )
    let body = try JSONEncoder().encode(request)

    let result = await ApiClient.shared.sendRequest(
      endpoint: Endpoints.vision,
      body: body,
      responseModel: VisionResponse.self
    )

    switch result {
    case .success(let response):
      return try ScanResult(jsonString: response.content)
    case .failure(let error):
      Logger.dex.error("Cloud identification request failed: \(error.localizedDescription)")
      throw error
    }
  }
}

Because the request goes through ApiClient.shared.sendRequest, it is HMAC-signed with the per-session token before it leaves the device. Your OpenAI keys stay on the backend proxy and never reach the app.

The Identification Prompt

The model is told it is "the identification engine for a Pokédex-style app" and asked to identify the single main subject as specifically as possible — including special handling for Pokémon (cards, figures, plush, or artwork), where it uses the Pokémon's type(s) as the "scientific name". The on-device categoryHint is injected only as a weak hint.

Crucially, the model is asked to respond with only a JSON object matching the ScanResult shape:

Respond with ONLY a JSON object — no markdown, no commentary — exactly matching:
{
  "commonName": "string, the everyday name, e.g. Monstera Deliciosa / Clownfish / Pikachu",
  "category": "one of: plant, animal, fish, bird, insect, reptile, mineral, fungus, food, pokemon, object, unknown",
  "scientificName": "latin/binomial name, or Pokémon type(s), or empty string if unknown",
  "summary": "one or two engaging sentences describing it",
  "facts": ["3 to 5 short, fun, Pokédex-style facts"],
  "rarity": "one of: Common, Uncommon, Rare, Legendary",
  "confidence": 0.0
}

To customize the entries the Dex produces — different tone, extra fields, stricter rarity rules — edit DexVisionService.prompt(categoryHint:). It is the single source of truth for what the cloud model is asked to return.
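
As a starting point for customization, the sketch below shows one way a prompt builder might inject the on-device guess as a weak hint. The function name makePrompt and all of the wording are illustrative assumptions, not the text that DexVisionService.prompt(categoryHint:) actually sends.

```swift
import Foundation

// Hypothetical sketch: the wording below is illustrative, not the app's real prompt.
func makePrompt(categoryHint: String?) -> String {
  var lines = [
    "You are the identification engine for a Pokédex-style app.",
    "Identify the single main subject of the image as specifically as possible.",
  ]
  // Inject the on-device guess only when one exists, and phrase it as a hint
  // the model is free to override.
  if let hint = categoryHint?.trimmingCharacters(in: .whitespacesAndNewlines), !hint.isEmpty {
    lines.append(
      "An on-device classifier guessed \"\(hint)\". Treat this as a weak hint and ignore it if the image disagrees."
    )
  }
  lines.append("Respond with ONLY a JSON object, with no markdown and no commentary.")
  return lines.joined(separator: "\n")
}
```

Omitting the hint line entirely when there is no guess avoids sending the model an empty or misleading category.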

Parsing the Result Tolerantly

Models don't always cooperate: they wrap JSON in markdown fences, add prose, or omit fields. ScanResult is built to survive all of that.

nonisolated struct ScanResult: Codable, Equatable {
  var commonName: String
  var category: String
  var scientificName: String
  var summary: String
  var facts: [String]
  var rarity: String
  var confidence: Double

  /// The matched, stable category for this result.
  var dexCategory: DexCategory { DexCategory.matching(category) }
}

Three layers of tolerance keep a single bad response from blowing up the scan:

  • Fence + brace extraction. ScanResult.extractJSONObject(from:) strips markdown code fences and slices out the first balanced { … } object, tracking string literals so a brace inside a value (or stray trailing prose) doesn't throw off the extraction.
  • Per-field tolerant decoding. The custom init(from:) decodes every field with try? and a sensible default, so a missing field never fails the whole scan.
  • Refusal rejection. init(jsonString:) rejects all-default or refusal payloads (valid JSON that contains no real identification), so the flow surfaces a failure (or offline fallback) instead of letting the user save a bogus "Unknown" entry.

The refusal check inside init(jsonString:) looks like this:

// Reject all-default / refusal payloads (valid JSON, but no real ID) so the
// scan flow surfaces a failure or offline fallback instead of a bogus
// "Unknown" success that the user could save.
let name = decoded.commonName.trimmingCharacters(in: .whitespacesAndNewlines)
let isEmptyOrRefusal = name.isEmpty
  || (name.caseInsensitiveCompare("Unknown") == .orderedSame && decoded.summary.isEmpty)
if isEmptyOrRefusal {
  throw ParseError.invalidJSON
}
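
The first two layers are not shown above, so here is a minimal sketch of how they might work. The names TolerantScan and extractJSONObjectSketch are hypothetical, the field set is cut down to three, and the defaults are assumptions. The real ScanResult.extractJSONObject(from:) and init(from:) may differ in detail.

```swift
import Foundation

/// Slices out the first balanced `{ … }` object. Starting at the first "{"
/// skips any leading fence or prose. String literals are tracked so braces
/// inside values do not affect the depth count.
func extractJSONObjectSketch(from text: String) -> String? {
  guard let start = text.firstIndex(of: "{") else { return nil }
  var depth = 0
  var inString = false
  var escaped = false
  var index = start
  while index < text.endIndex {
    let ch = text[index]
    if inString {
      if escaped { escaped = false }
      else if ch == "\\" { escaped = true }
      else if ch == "\"" { inString = false }
    } else if ch == "\"" {
      inString = true
    } else if ch == "{" {
      depth += 1
    } else if ch == "}" {
      depth -= 1
      if depth == 0 { return String(text[start...index]) }
    }
    index = text.index(after: index)
  }
  return nil  // unbalanced: no complete object found
}

/// Hypothetical cut-down model showing per-field tolerant decoding.
struct TolerantScan: Decodable {
  var commonName: String
  var facts: [String]
  var confidence: Double

  enum CodingKeys: String, CodingKey { case commonName, facts, confidence }

  init(from decoder: Decoder) throws {
    let container = try decoder.container(keyedBy: CodingKeys.self)
    // `try?` turns a missing or mistyped field into nil, then a default.
    commonName = (try? container.decode(String.self, forKey: .commonName)) ?? ""
    facts = (try? container.decode([String].self, forKey: .facts)) ?? []
    confidence = (try? container.decode(Double.self, forKey: .confidence)) ?? 0
  }
}
```

With this shape, a reply that wraps the object in prose and omits facts still decodes: the extractor returns only the object, and facts falls back to an empty array.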

Graceful Offline Fallback

If the cloud pass throws — no network, a parse failure, a refusal — the scan does not simply fail when an on-device guess exists. Instead it degrades to an offline-only entry built purely from the local guess, and tells the user via a Toast:

} catch {
  Logger.dex.error("Identification failed: \(error.localizedDescription)")
  if let guess {
    // Degrade gracefully: an offline-only entry beats nothing.
    result = ScanResult.offlineFallback(for: guess)
    phase = .result
    Toast.shared.present(
      title: "Identified offline — reconnect for full details",
      symbol: "wifi.slash",
      tint: .orange,
      timing: .short
    )
  } else {
    errorMessage = error.localizedDescription
    phase = .failed
    // ...
  }
}

The fallback entry carries the on-device category and confidence plus a clear note that the rich details require reconnecting:

static func offlineFallback(for guess: LocalGuess) -> ScanResult {
  ScanResult(
    commonName: guess.label.capitalized,
    category: guess.category.rawValue,
    scientificName: "",
    summary: "Identified on-device while offline. Reconnect and scan again for the full Dex entry.",
    facts: [],
    rarity: "Common",
    confidence: guess.confidence
  )
}

Only when there is no local guess and the cloud fails does the flow move to Phase.failed. Errors are always logged with Logger.dex and surfaced via Toast — never silent.
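
To unit-test this decision in isolation, it can help to express it as a pure function. The sketch below does that with simplified stand-in types. Guess, Entry, Outcome, CloudDown, and resolve(guess:cloud:) are all hypothetical names, not part of the app.

```swift
import Foundation

// Hypothetical, simplified stand-ins for the app's types.
struct Guess { let label: String; let confidence: Double }
struct Entry: Equatable { let name: String; let confidence: Double; let isOfflineOnly: Bool }
struct CloudDown: Error {}

enum Outcome: Equatable {
  case result(Entry)
  case failed(String)
}

/// Resolves the final outcome from the optional local guess and the cloud result.
func resolve(guess: Guess?, cloud: Result<Entry, Error>) -> Outcome {
  switch cloud {
  case .success(let entry):
    return .result(entry)
  case .failure(let error):
    if let guess {
      // Degrade to an offline-only entry built from the local guess.
      return .result(
        Entry(name: guess.label.capitalized, confidence: guess.confidence, isOfflineOnly: true)
      )
    }
    // No local guess and no cloud result: the scan has genuinely failed.
    return .failed(error.localizedDescription)
  }
}
```

The three cases to cover are cloud success, cloud failure with a guess, and cloud failure without one. Only the last should produce a failure.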

Categories

DexCategory is the stable vocabulary the whole feature speaks. It's a nonisolated, Codable, Sendable enum (so it can cross actor boundaries from the off-main classifier and @Model inits), and each case carries its display title, SF Symbol, and tint color used for badges throughout the UI.

nonisolated enum DexCategory: String, CaseIterable, Codable, Sendable {
  case plant, animal, fish, bird, insect, reptile
  case mineral, fungus, food, pokemon, object, unknown
}

Both the cloud category string and the on-device Vision labels are free-form, so DexCategory maps them onto the stable set. Exact rawValue matches win; otherwise ordered keyword heuristics decide, scanning the highest-confidence labels first:

/// Maps a ranked list of labels (e.g. on-device Vision identifiers) to the
/// best-fitting category. Exact `rawValue` matches win; otherwise keyword
/// heuristics decide, scanning highest-confidence labels first.
static func matching(fromLabels labels: [String]) -> DexCategory {
  let normalized = labels.map { $0.lowercased() }

  for label in normalized {
    if let exact = DexCategory(rawValue: label) { return exact }
  }
  for label in normalized {
    if let matched = keywordMatch(label) { return matched }
  }
  return .unknown
}

The keyword matcher is deliberately careful. Short needles (like "ant") match whole words only, after tokenizing the label, so "plant" and "elephant" don't resolve to .insect. Needles of four or more characters still match as substrings, so "goldfish" and "houseplant" resolve via "fish" and "plant".
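
A minimal sketch of that rule is shown here. The Category enum, the keyword table, and the function name are hypothetical and much smaller than whatever the app's keywordMatch(_:) uses.

```swift
import Foundation

enum Category: String { case plant, fish, insect, unknown }

// Hypothetical keyword groups, checked in order.
let keywordGroups: [(Category, [String])] = [
  (.insect, ["ant", "bee", "beetle"]),
  (.fish, ["fish", "shark"]),
  (.plant, ["plant", "fern", "flower"]),
]

func keywordMatchSketch(_ label: String) -> Category? {
  let lowered = label.lowercased()
  // Tokenize on anything that is not a letter, so "tree_frog" becomes ["tree", "frog"].
  let words = Set(lowered.split(whereSeparator: { !$0.isLetter }).map(String.init))
  for (category, needles) in keywordGroups {
    for needle in needles {
      // Short needles must equal a whole word. Longer ones may match anywhere.
      let hit = needle.count < 4 ? words.contains(needle) : lowered.contains(needle)
      if hit { return category }
    }
  }
  return nil
}
```

With this table, "fire ant" resolves to insect because "ant" is a whole word, while "elephant" matches nothing and would fall through to unknown in the caller.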

Persistence: A Dedicated SwiftData Store

Every saved discovery is a SwiftData @Model:

@Model
final class DexEntry {
  @Attribute(.unique) var id: String
  var commonName: String
  var categoryRaw: String
  var scientificName: String
  var summary: String
  var facts: [String]
  var rarity: String
  var confidence: Double
  /// What the on-device classifier guessed, kept for transparency in the UI.
  var localGuessLabel: String
  @Attribute(.externalStorage) var imageData: Data
  var dateDiscovered: Date

  /// The stable category, derived from the stored raw value.
  var category: DexCategory { DexCategory(rawValue: categoryRaw) ?? .unknown }
}

A few details worth noting: the captured photo is stored with @Attribute(.externalStorage) so large image blobs live outside the main store; localGuessLabel is persisted for transparency (the detail page can show what the on-device pass thought); and a convenience initializer builds an entry straight from a ScanResult.

Its own isolated store

The Dex does not share the app's default SwiftData store. DexStore vends a ModelContainer backed by its own file so the Dex schema can never collide with other models (such as ChatMessage from the chat feature):

@MainActor
final class DexStore {
  static let shared = DexStore()

  let container: ModelContainer

  private init() {
    do {
      let supportURL = URL.applicationSupportDirectory
      try FileManager.default.createDirectory(at: supportURL, withIntermediateDirectories: true)
      let storeURL = supportURL.appending(path: "Dex.store")
      let configuration = ModelConfiguration(url: storeURL)
      container = try ModelContainer(for: DexEntry.self, configurations: configuration)
    } catch {
      Logger.dex.error("Failed to initialize Dex ModelContainer: \(error.localizedDescription)")
      fatalError("Failed to initialize Dex ModelContainer: \(error.localizedDescription)")
    }
  }
}

This container is attached once at the Dex root, so the grid's @Query and the scanner's inserts share a single context:

struct PokedexHomeView: View {
  var body: some View {
    // Attach the dedicated, isolated Dex store so the grid's @Query and the
    // scanner's inserts share one context.
    PokedexGridScreen()
      .modelContainer(DexStore.shared.container)
  }
}

The Grid and the Entry Views

PokedexGridScreen is the home of the Dex. It uses @Query to fetch entries newest-first and renders them in an adaptive LazyVGrid, with a live discovery count and a persistent Scan button pinned to the bottom safe area:

@Query(sort: \DexEntry.dateDiscovered, order: .reverse) private var entries: [DexEntry]

When the collection is empty, an EmptyDexView invites the user to "Scan a plant, a fish, a rock — or even a Pokémon — to start your collection." Tapping Scan presents ScannerView as a fullScreenCover. Each grid cell shows a downsampled thumbnail with a category badge; tapping one navigates to DexEntryDetailView.

The entry presentation is shared. DexEntryContentView renders the body of an entry — hero image, category badge, rarity pill, confidence meter, summary, and a "Field Notes" list — and the hero image is injected so the same view powers both:

  • ScanResultView: the live reveal card after a scan (hero is the captured UIImage).
  • DexEntryDetailView: a stored entry (hero is a DownsampledImageView, decoded off the main thread from the persisted Data).

Images are always downsampled off-main for the grid and detail views via DownsampledImageView, which keys its load task on the entry id (cheaper than hashing the full Data, and correct when SwiftUI recycles a cell for a different entry).

Saving a Discovery

The view model deliberately builds an entry without inserting it — persistence belongs to the view:

/// Builds — but does not insert — a `DexEntry` for the current result.
func makeEntry() -> DexEntry? {
  guard let result, let imageData else { return nil }
  return DexEntry(scan: result, localGuess: localGuess, imageData: imageData, discoveredAt: Date())
}

ScannerView.save() inserts and saves into the Dex context, then confirms with a Toast and dismisses — or logs and surfaces an error if the save fails.

Accessibility & Motion

The scan flow is built to be inclusive:

  • The scanning sweep animation (ScanningOverlay) is only shown when Reduce Motion is off.
  • Because the analysis is silent and asynchronous, the view posts a VoiceOver announcement when a result or failure appears (AccessibilityNotification.Announcement).
  • The analyzing headline carries the .updatesFrequently trait, and grid cells, badges, the confidence meter, and the scan buttons all have explicit accessibility labels and hints.
  • All text respects Dynamic Type.

Extending the Dex

The scanner is designed to be customized. The most common changes:

  • Add or change a category. Add a case to DexCategory, give it a title, symbol, and tint, and (optionally) add keyword groups in keywordMatch(_:) so on-device and cloud labels resolve to it. The badge styling, grid, and filters pick it up automatically.
  • Change what the model returns. Edit the prompt in DexVisionService.prompt(categoryHint:). If you add fields, mirror them on ScanResult (with tolerant decoding) and on DexEntry for persistence.
  • Swap the classification or cloud engine. Both passes sit behind protocols (ImageClassifying and DexVisionServicing) and are injected into ScannerViewModel. Provide your own conforming type (for example, a custom Core ML model on-device) without touching the view model or the UI.

ScannerViewModel's initializer takes both as defaulted parameters:

init(
  classifier: ImageClassifying = ImageClassifier(),
  visionService: DexVisionServicing = DexVisionService()
) {
  self.classifier = classifier
  self.visionService = visionService
}
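
The same injection points are handy for previews and unit tests. The sketch below shows a canned classifier. The protocol requirement is inferred from how classify(imageData:) is called earlier on this page, and both the simplified LocalGuess and the StubClassifier name are assumptions rather than the app's real declarations.

```swift
import Foundation

// Assumed, simplified shapes. The app's real protocol and model may differ.
struct LocalGuess: Equatable, Sendable {
  let label: String
  let confidence: Double
}

protocol ImageClassifying: Sendable {
  func classify(imageData: Data) async -> LocalGuess?
}

/// A canned classifier for previews and unit tests: it ignores the image
/// and returns whatever guess it was created with.
struct StubClassifier: ImageClassifying {
  let guess: LocalGuess?

  func classify(imageData: Data) async -> LocalGuess? { guess }
}

// Usage (not compiled here, since ScannerViewModel is not part of this sketch):
//   let viewModel = ScannerViewModel(classifier: StubClassifier(guess: nil))
```

Passing a stub that returns nil is a quick way to exercise the path where the cloud pass has no local guess to fall back on.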

Don't share the default store

The Dex keeps its own Dex.store on purpose. If you point DexStore at the app's default SwiftData container — or attach a second container for DexEntry elsewhere — you risk schema collisions and duplicated contexts. Keep all Dex reads and writes flowing through DexStore.shared.container.
