AI Vision - Docs

The Vision feature lets users pick or capture a photo and get a natural-language description of what's in it. The image is sent to OpenAI's vision model, but never directly from the device — every call is routed through ShipThatApp's secure backend proxy with an HMAC-signed request, so your OpenAI key never ships inside the app.

This is also the foundation the Dex Scanner is built on: the same cloud /vision endpoint powers Dex's rich structured identification, layered on top of Apple's on-device Vision framework.

Overview

The feature follows the project's standard MVVM structure:

VisionView — the SwiftUI screen that lets the user pick or capture an image, add an optional prompt, and trigger analysis.
VisionViewModel — an @Observable, @MainActor view model that drives loading state and holds the result.
VisionService — the service that builds the request and sends it through ApiClient to the secure /vision endpoint.

The end-to-end flow is: pick or capture an image → encode it → analyze via the /vision endpoint → present the result.

Picking or Capturing an Image

VisionView starts in an empty state with a "Nothing selected." prompt. Tapping the toolbar button presents a sheet offering two sources — Camera and Library — each of which presents an ImagePicker bound to vm.selectedImage:

.sheet(isPresented: $isCameraPresented) {
    ImagePicker(selectedImage: $vm.selectedImage, sourceType: .camera)
}
.sheet(isPresented: $isLibraryPresented) {
    ImagePicker(selectedImage: $vm.selectedImage, sourceType: .photoLibrary)
}

Once a UIImage is selected, the view shows a preview, an optional prompt TextField, and an Analyze Picture button.

Encoding and Analyzing

Before the image leaves the device it is downscaled with resized() and converted to a Base64 string with toBase64(). The result, along with the optional prompt, is handed to the view model:

Button(action: {
    vm.analyzeImage(image: selectedImage.resized().toBase64() ?? "", prompt: prompt)
}) {
    Text("Analyze Picture")
}
.disabled(vm.isAnalyzing)

Heads Up!

Resizing before upload keeps the request payload small and the analysis fast. The prompt is optional — pass an empty string (or omit it) to get a general description, or supply a question to focus the analysis (for example, "What ingredients are in this dish?").

VisionViewModel

VisionViewModel is a @MainActor @Observable class. Its analyzeImage(image:prompt:) method flips the loading flags, launches a Task, and calls the service. On completion it stores the VisionResponse and toggles isFinished so the view can present the result sheet:

@MainActor
@Observable
final class VisionViewModel {
    var visionResponse: VisionResponse?
    var isLoading = false
    var errorMessage: String?
    var isFinished = false
    var selectedImage: UIImage?

    @ObservationIgnored
    private let visionService = VisionService()

    func analyzeImage(image: String, prompt: String? = nil) {
        isLoading = true
        isFinished = false

        Task {
            do {
                let response = try await visionService.analyzeImage(image: image, prompt: prompt)
                self.visionResponse = response
                self.errorMessage = nil
            } catch {
                self.errorMessage = error.localizedDescription
                self.visionResponse = nil
            }
            self.isLoading = false
            self.isFinished = true
        }
    }
}

While the request is in flight, isLoading drives a ProgressView overlay on the screen.

VisionService and the Secure Proxy

VisionService packs the inputs into a VisionRequest and sends it through ApiClient.shared.sendRequest to Endpoints.vision (the vision path). The call returns a Result, which the service unwraps into a VisionResponse or rethrows the failure:

struct VisionRequest: Codable {
    let image: String
    let prompt: String?
}

struct VisionResponse: Decodable, Equatable {
    let role: String
    let content: String
}

final class VisionService {
    func analyzeImage(image: String, prompt: String? = nil) async throws -> VisionResponse {
        let requestModel = VisionRequest(image: image, prompt: prompt)
        let result = await ApiClient.shared.sendRequest(
            endpoint: Endpoints.vision,
            body: try JSONEncoder().encode(requestModel),
            responseModel: VisionResponse.self
        )

        switch result {
        case .success(let response):
            return response
        case .failure(let failure):
            throw failure
        }
    }
}

The request never talks to OpenAI directly. ApiClient signs every call with HMAC before it leaves the device: the signature covers METHOD\nPATH\nTIMESTAMP\nSHA256(body), and an x-signature plus x-timestamp header are attached. The /vision endpoint signs with the per-session token stored in the keychain (issued by the /auth_token endpoint), so a captured request cannot be replayed or tampered with — and your OpenAI API key lives only on the backend.

Keys stay on the server

Because all vision traffic flows through the proxy, no OpenAI credentials are bundled into the app. Configure the backend URL and bootstrap auth key in your gitignored Config.xcconfig — never hardcode secrets in the client.

Presenting the Result

When isFinished becomes true, VisionView presents a VisionResultView sheet that displays the analyzed image alongside the model's response text. The view reads vm.visionResponse?.content, falling back to an error string if the response is missing:

.sheet(isPresented: $vm.isFinished) {
    if let selectedImage = vm.selectedImage {
        VisionResultView(
            result: vm.visionResponse?.content ?? "Error fetching response",
            selectedImage: selectedImage
        )
    }
}

VisionResultView is a simple ScrollView showing an "Analysis Result" title, the image, and the returned content.

Where to Go Next

The Vision feature is intentionally a focused, single-call template. If you want a richer, app-style experience built on the same secure endpoint — instant on-device classification, structured identification (name, scientific name, facts, rarity), and a persistent SwiftData collection — see the Dex Scanner.