Dartantic supports multi-modal input, including text, images, PDFs and other binary attachments. You can attach local files, download files from URLs, attach raw bytes, attach links, or mix and match all of the above.

Local Files

You can attach local files to prompts you send to your agent:
import 'dart:io';

// Using cross_file for cross-platform support
import 'package:cross_file/cross_file.dart';
import 'package:dartantic_ai/dartantic_ai.dart';

final agent = Agent('openai');

// Text file
final bioFile = XFile.fromData(
  await File('bio.txt').readAsBytes(),
  path: 'bio.txt',
);
final bioResult = await agent.send(
  'Can you summarize the attached file?',
  attachments: [await DataPart.fromFile(bioFile)],
);

// Image file (the moment of truth)
final fridgeFile = XFile.fromData(
  await File('fridge.png').readAsBytes(),
  path: 'fridge.png',
);
final fridgeResult = await agent.send(
  'What food do I have on hand?',
  attachments: [await DataPart.fromFile(fridgeFile)],
);
// "I see leftover pizza, expired milk, and... is that a science experiment?"

// Responses API can request richer vision detail
final responsesAgent = Agent(
  'openai-responses',
  chatModelOptions: const OpenAIResponsesChatModelOptions(
    imageDetail: ImageDetail.high,
  ),
);
await responsesAgent.send(
  'Describe the fridge image with extra detail',
  attachments: [await DataPart.fromFile(fridgeFile)],
);
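
The read-the-file, wrap-in-an-XFile, convert-to-a-DataPart pattern above repeats for every local attachment, so a small convenience helper can tidy it up. This is just a sketch built from the calls already shown; the helper name is ours, not part of the package:
// Hypothetical helper: local path -> DataPart attachment
Future<DataPart> dataPartFromPath(String path) async {
  final file = XFile.fromData(
    await File(path).readAsBytes(),
    path: path,
  );
  return DataPart.fromFile(file);
}

// Usage
final result = await agent.send(
  'Can you summarize the attached file?',
  attachments: [await dataPartFromPath('bio.txt')],
);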

Download from URL

You can download the contents of a URL and attach it:
// Download and include file from URL
final urlData = await DataPart.url(
  Uri.parse('https://example.com/document.pdf'),
);
final result = await agent.send(
  'Summarize this document',
  attachments: [urlData],
);
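
DataPart.url performs a network fetch, so it can fail for unreachable or oversized resources. Here's a minimal sketch of guarding that call; the exact exception type depends on the underlying HTTP client, so it catches broadly:
DataPart? downloaded;
try {
  downloaded = await DataPart.url(
    Uri.parse('https://example.com/document.pdf'),
  );
} catch (e) {
  // Network errors, bad status codes, etc. surface here.
  print('Could not download attachment: $e');
}

if (downloaded != null) {
  final result = await agent.send(
    'Summarize this document',
    attachments: [downloaded],
  );
  print(result.output);
}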

Raw Bytes

You can attach bytes you’ve already got in memory:
// Include raw bytes with mime type
final bytes = Uint8List.fromList([/* your data */]);
final rawData = DataPart(
  bytes,
  mimeType: 'application/pdf',
);
final result = await agent.send(
  'Process this data',
  attachments: [rawData],
);
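
If the bytes arrive already encoded, for example as a base64 string returned by another API, decode them first and attach the result the same way. A sketch; the payload below is an illustrative placeholder:
import 'dart:convert';

// Illustrative base64 payload received from elsewhere (truncated placeholder)
const encodedPdf = 'JVBERi0xLjcK...';
final decodedBytes = base64Decode(encodedPdf);

final result = await agent.send(
  'Process this data',
  attachments: [
    DataPart(decodedBytes, mimeType: 'application/pdf'),
  ],
);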

Web URLs

You can attach links without downloading the content yourself:
// Direct URL reference (OpenAI)
final result = await agent.send(
  'Describe this image',
  attachments: [
    LinkPart(Uri.parse('https://example.com/image.jpg')),
  ],
);

Mix and Match Attachments

You can mix and match multiple attachments in a single request:
// Mix text and images
final result = await agent.send(
  'Based on the bio and fridge contents, suggest a meal',
  attachments: [
    await DataPart.fromFile(bioFile),
    await DataPart.fromFile(fridgeFile),
  ],
);
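
Downloaded data, raw bytes, and link parts can travel together in the same call as well; a sketch reusing the parts built above:
final mixedResult = await agent.send(
  'Compare the attached bio with the linked image',
  attachments: [
    await DataPart.fromFile(bioFile),
    LinkPart(Uri.parse('https://example.com/image.jpg')),
  ],
);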

Audio Transcription

Google Gemini models support audio transcription natively through the chat interface. Simply attach an audio file and request transcription in your prompt.

Text Transcription

For simple text transcription, attach an audio file and request the transcription:
import 'dart:io';
import 'package:dartantic_ai/dartantic_ai.dart';

final agent = Agent('google');
final audioBytes = await File('audio.m4a').readAsBytes();

final result = await agent.send(
  'Transcribe this audio file word for word.',
  attachments: [
    DataPart(audioBytes, mimeType: 'audio/mp4', name: 'audio.m4a'),
  ],
);

// Get transcription text
final transcription = result.messages
  .expand((m) => m.parts)
  .whereType<TextPart>()
  .map((p) => p.text)
  .join('\n');

print(transcription);
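
For long recordings you may prefer to stream the transcription as it arrives rather than waiting for the full result. This sketch assumes your version of Agent exposes a streaming counterpart, sendStream, whose chunks carry incremental text in output; check the API of the version you're on before relying on it:
final buffer = StringBuffer();
await for (final chunk in agent.sendStream(
  'Transcribe this audio file word for word.',
  attachments: [
    DataPart(audioBytes, mimeType: 'audio/mp4', name: 'audio.m4a'),
  ],
)) {
  // Assumes each chunk's output holds the newly generated text delta.
  buffer.write(chunk.output);
  stdout.write(chunk.output);
}
print('\nFull transcription: $buffer');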

Transcription with Timestamps

For word-level timestamps and structured output, use typed responses with a JSON schema:
import 'dart:io';
import 'package:dartantic_ai/dartantic_ai.dart';

final agent = Agent('google');
final audioBytes = await File('audio.m4a').readAsBytes();

final schema = Schema.fromMap({
  'type': 'object',
  'properties': {
    'transcript': {'type': 'string'},
    'words': {
      'type': 'array',
      'items': {
        'type': 'object',
        'properties': {
          'word': {'type': 'string'},
          'start_time': {'type': 'number'},
          'end_time': {'type': 'number'},
        },
      },
    },
  },
});

final result = await agent.sendFor<Map<String, dynamic>>(
  'Transcribe this audio with word-level timestamps (in seconds).',
  outputSchema: schema,
  attachments: [
    DataPart(audioBytes, mimeType: 'audio/mp4', name: 'audio.m4a'),
  ],
);

final transcription = result.output;
print('Transcript: ${transcription['transcript']}');

// Access word-level timestamps
for (final word in transcription['words'] as List) {
  final w = word as Map<String, dynamic>;
  print('${w['start_time']}s - ${w['end_time']}s: ${w['word']}');
}
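
The word-level timestamps can then be post-processed into whatever format you need. For example, a small sketch that prints each word as a subtitle-style HH:MM:SS,mmm range, using only the structured result above:
String formatTimestamp(num seconds) {
  final d = Duration(milliseconds: (seconds * 1000).round());
  final h = d.inHours.toString().padLeft(2, '0');
  final m = (d.inMinutes % 60).toString().padLeft(2, '0');
  final s = (d.inSeconds % 60).toString().padLeft(2, '0');
  final ms = (d.inMilliseconds % 1000).toString().padLeft(3, '0');
  return '$h:$m:$s,$ms';
}

for (final word in transcription['words'] as List) {
  final w = word as Map<String, dynamic>;
  final start = formatTimestamp(w['start_time'] as num);
  final end = formatTimestamp(w['end_time'] as num);
  print('$start --> $end  ${w['word']}');
}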

Provider Support

| Provider         | Audio Transcription | Timestamps    |
|------------------|---------------------|---------------|
| Google           | ✅ Native support   | ✅ Word-level |
| OpenAI Responses | ❌ Not supported    | —             |
| Anthropic        | ❌ Not supported    | —             |
Note: Only Google Gemini models currently support audio transcription through the chat interface.

OCR (Optical Character Recognition)

Google Gemini models support OCR for extracting text from images. Simply attach an image containing text and request extraction in your prompt.

Text Extraction

Extract text from images while preserving formatting and structure:
import 'dart:io';
import 'package:cross_file/cross_file.dart';
import 'package:dartantic_ai/dartantic_ai.dart';

final agent = Agent('google');

const imagePath = 'document.png';
final imageFile = XFile.fromData(
  await File(imagePath).readAsBytes(),
  path: imagePath,
);

final result = await agent.send(
  'Extract all text from this image. Preserve the formatting and structure.',
  attachments: [await DataPart.fromFile(imageFile)],
  history: [ChatMessage.system('Be precise and preserve formatting.')],
);

print(result.output);
// Extracted text with formatting preserved

Use Cases

OCR is useful for:
  • Extracting text from scanned documents
  • Reading text from screenshots
  • Processing forms and receipts (see the sketch after this list)
  • Analyzing documents with complex layouts
  • Converting images of text to editable format
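
Because OCR here goes through the same chat interface, it combines naturally with typed output. Below is a sketch that parses a receipt image into structured fields, reusing the Schema and sendFor APIs shown earlier; the schema fields are illustrative, not a fixed format:
final receiptFile = XFile.fromData(
  await File('receipt.png').readAsBytes(),
  path: 'receipt.png',
);

final receiptSchema = Schema.fromMap({
  'type': 'object',
  'properties': {
    'merchant': {'type': 'string'},
    'total': {'type': 'number'},
    'items': {
      'type': 'array',
      'items': {
        'type': 'object',
        'properties': {
          'name': {'type': 'string'},
          'price': {'type': 'number'},
        },
      },
    },
  },
});

final receipt = await agent.sendFor<Map<String, dynamic>>(
  'Extract the merchant, line items, and total from this receipt.',
  outputSchema: receiptSchema,
  attachments: [await DataPart.fromFile(receiptFile)],
);

print('Merchant: ${receipt.output['merchant']}');
print('Total: ${receipt.output['total']}');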

Provider Support

| Provider         | OCR Support       | Complex Layouts         |
|------------------|-------------------|-------------------------|
| Google           | ✅ Native support | ✅ Tables, multi-column |
| OpenAI Responses | ✅ Vision models  | ✅ General layout       |
| Anthropic        | ✅ Vision models  | ✅ General layout       |
Note: For specialized OCR tasks requiring extremely high accuracy or specific document types, consider using dedicated OCR services. Mistral also offers a specialized OCR model (mistral-ocr-3-25-12) for document processing, which will be supported once the SDK adds vision capabilities.
