AI Media Skills

CLI-JAW bundles 13 skills for generating, editing, and processing images, video, audio, and design assets. These skills wrap external AI services -- DALL-E, Sora, fal.ai, Hugging Face, and more -- behind natural-language commands so you can produce media without leaving the terminal.

Skill Catalog

Skill	Category	Description
`imagegen`	Image Generation	Generate images via DALL-E 3 / gpt-image-1. Supports prompt, size, quality, and style parameters.
`nano-banana-pro`	Image Generation	Fast image generation through the Nano Banana Pro pipeline on fal.ai. Optimized for speed over quality.
`fal-image-edit`	Image Editing	Edit existing images using fal.ai models -- inpainting, outpainting, style transfer, and background removal.
`sora`	Video Generation	Generate and edit video clips using OpenAI Sora. Supports text-to-video and image-to-video workflows.
`speech`	Audio Generation	Text-to-speech synthesis via OpenAI TTS. Supports multiple voices, speeds, and output formats.
`transcribe`	Audio Processing	Transcribe audio and video via Whisper. Outputs timestamped SRT, VTT, or JSON, or plain text.
`hugging-face-cli`	ML Pipeline	Run Hugging Face model inference from the CLI. Supports text, image, and audio tasks.
`hugging-face-evaluation`	ML Pipeline	Evaluate Hugging Face models with standard benchmarks and metrics.
`hugging-face-model-trainer`	ML Pipeline	Fine-tune Hugging Face models on custom datasets with LoRA/QLoRA support.
`algorithmic-art`	Generative Art	Create algorithmic and generative art using code-driven patterns, fractals, and mathematical visualizations.
`canvas-design`	Design	Design canvas-based graphics -- layouts, banners, social media posts, and composited visuals.
`atlas`	Design	Generate and manipulate texture atlases and sprite sheets for game and UI assets.
`theme-factory`	Design	Generate color themes, palettes, and design tokens for apps and websites from a seed color or image.

Image Generation

The imagegen skill is the primary entry point for creating images. It delegates to DALL-E 3 or gpt-image-1, depending on the --model flag or the default_model setting in the configuration.

Natural language examples
"Generate an image -- Namsan Tower in Seoul at sunset"
"Generate a watercolor painting of a mountain lake at dawn"
"Make a logo -- a minimalist cat silhouette on a blue background"

# Basic generation
/imagegen a cyberpunk cityscape at night, neon lights reflecting on wet streets

# With parameters
/imagegen --size 1792x1024 --quality hd a photo-realistic coral reef

# Using nano-banana-pro for fast drafts
/nano-banana-pro quick sketch of a robot barista

imagegen Parameters

Parameter	Default	Description
`--size`	1024x1024	Output size: `1024x1024`, `1792x1024`, `1024x1792`
`--quality`	standard	`standard` or `hd`
`--style`	vivid	`vivid` or `natural`
`--model`	dall-e-3	`dall-e-3` or `gpt-image-1`
`--output`	./output	Output directory for the generated file

Image Editing

The fal-image-edit skill handles post-generation edits: inpainting regions, extending canvases, transferring styles, and removing backgrounds.

Natural language examples
"Remove the background from this image"
"Replace the sky in this photo with a sunset"
"Extend this image to the right with more forest"

# Remove background
/fal-image-edit --task remove-bg input.png

# Inpaint a region (mask auto-detected from prompt)
/fal-image-edit --task inpaint --prompt "replace the car with a bicycle" photo.jpg

# Style transfer
/fal-image-edit --task style-transfer --style "oil painting" photo.jpg

Video Generation

The sora skill generates short video clips from text or image prompts using OpenAI Sora.

Natural language examples
"Make a video -- a puppy playing on the beach"
"Create a 5-second video of clouds forming over a mountain"
"Turn this photo into a video"

# Text-to-video
/sora a timelapse of flowers blooming in a meadow --duration 5s

# Image-to-video (animate a still image)
/sora --input cover.png --prompt "gentle camera zoom out" --duration 3s

sora Parameters

Parameter	Default	Description
`--duration`	5s	Clip duration: `3s`, `5s`, `10s`
`--resolution`	720p	`480p`, `720p`, `1080p`
`--input`	-	Source image for image-to-video
`--output`	./output	Output directory

Audio: Speech and Transcription

Two complementary skills handle the audio pipeline: speech converts text to spoken audio, and transcribe converts audio/video to text with timestamps.

Natural language examples
"Read this text aloud -- here is today's news summary"
"Create subtitles for this video"
"Convert this meeting recording to subtitles"
"Convert this to an audio file -- using the alloy voice"

# Text-to-speech
/speech "Welcome to CLI-JAW. Your daily briefing is ready." --voice alloy

# Speech with custom speed and format
/speech --voice nova --speed 1.2 --format mp3 "Here are your tasks for today."

# Transcribe audio
/transcribe meeting-recording.m4a --format srt

# Transcribe video with language hint
/transcribe presentation.mp4 --language ko --format vtt

speech Parameters

Parameter	Default	Description
`--voice`	alloy	`alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
`--speed`	1.0	Playback speed: 0.25 to 4.0
`--format`	mp3	`mp3`, `opus`, `aac`, `flac`, `wav`

transcribe Parameters

Parameter	Default	Description
`--format`	srt	`srt`, `vtt`, `json`, `text`
`--language`	auto	ISO 639-1 language hint (e.g. `ko`, `en`, `ja`)
`--model`	whisper-1	Whisper model variant
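
The srt and vtt formats carry the same cues and differ mainly in their header and timestamp syntax: SRT numbers each cue and uses a comma before the milliseconds, while WebVTT starts with a WEBVTT line and uses a period. The Python sketch below illustrates that difference only. It is not CLI-JAW's own code, and the `segments` structure (start, end, text) and the function names are assumptions made for the example.

```python
def format_timestamp(seconds: float, separator: str) -> str:
    """Render a time offset in seconds as HH:MM:SS<separator>mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


def to_srt(segments):
    """SRT: numbered cues, comma before the milliseconds."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}\n"
            f"{text}\n"
        )
    return "\n".join(blocks)


def to_vtt(segments):
    """WebVTT: 'WEBVTT' header, period before the milliseconds."""
    blocks = ["WEBVTT\n"]
    for start, end, text in segments:
        blocks.append(
            f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}\n"
            f"{text}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical segments: (start seconds, end seconds, text)
    segments = [(0.0, 2.5, "Welcome to the meeting."), (2.5, 6.0, "Let's begin.")]
    print(to_srt(segments))
    print(to_vtt(segments))
```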

Hugging Face Pipeline

Three skills wrap the Hugging Face ecosystem for inference, evaluation, and training directly from the CLI.

Natural language examples
"Classify this image -- with a Hugging Face model"
"Fine-tune a model -- train with LoRA"
"Evaluate this model on the GLUE benchmark"

# Run inference with a specific model
/hugging-face-cli --model stabilityai/stable-diffusion-xl-base-1.0 \
  --task text-to-image "a serene japanese garden"

# Evaluate a model
/hugging-face-evaluation --model bert-base-uncased \
  --benchmark glue --split validation

# Fine-tune with LoRA
/hugging-face-model-trainer --base meta-llama/Meta-Llama-3-8B \
  --dataset ./training-data.jsonl \
  --method lora --epochs 3 --lr 2e-4

Supported Task Types

Skill	Supported tasks, benchmarks, or methods
`hugging-face-cli`	text-generation, text-to-image, image-classification, summarization, translation, fill-mask, question-answering
`hugging-face-evaluation`	GLUE, SuperGLUE, SQuAD, custom metric evaluation
`hugging-face-model-trainer`	LoRA, QLoRA, full fine-tuning, DPO, RLHF

Generative Art and Design

Four skills cover design workflows -- from algorithmic patterns to full design-token systems.

algorithmic-art

Generates code-driven visual art: fractals, Voronoi diagrams, L-systems, flow fields, and mathematical surfaces.

# Generate a fractal
/algorithmic-art --type mandelbrot --palette ocean --size 2048x2048

# Flow field visualization
/algorithmic-art --type flowfield --seed 42 --particles 5000
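
To show what a code-driven fractal involves, here is a minimal Python sketch of the standard escape-time algorithm for the Mandelbrot set. It is an illustration under stated assumptions, not the skill's implementation: the function names, the viewport bounds, and the ASCII output are all hypothetical, and a real renderer would map the iteration count to a color palette rather than to a single character.

```python
def mandelbrot_iterations(c: complex, max_iter: int = 100) -> int:
    """Return how many iterations of z -> z*z + c run before |z| exceeds 2.

    Points that never escape within max_iter are treated as inside the set.
    """
    z = 0j
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iter


def render_ascii(width: int = 60, height: int = 24, max_iter: int = 50):
    """Sample the region [-2.2, 1.0] x [-1.2, 1.2] and mark points inside the set."""
    rows = []
    for row in range(height):
        imag = 1.2 - 2.4 * row / (height - 1)
        line = ""
        for col in range(width):
            real = -2.2 + 3.2 * col / (width - 1)
            n = mandelbrot_iterations(complex(real, imag), max_iter)
            line += "#" if n == max_iter else " "
        rows.append(line)
    return rows


if __name__ == "__main__":
    print("\n".join(render_ascii()))
```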

canvas-design

Composites text, shapes, and images onto a canvas. Useful for social media graphics, banners, and thumbnails.

Natural language examples
"Make a banner -- 1200x630, titled 'New Product Launch'"
"Create an Instagram story template with gradient background"

# Create a social media banner
/canvas-design --size 1200x630 \
  --background "linear-gradient(135deg, #667eea, #764ba2)" \
  --text "Product Launch" --font-size 64

atlas

Packs multiple images into optimized sprite sheets and texture atlases with accompanying JSON metadata.

# Pack icons into a sprite sheet
/atlas --input ./icons/ --output spritesheet.png --padding 2

# Generate with metadata
/atlas --input ./frames/ --output atlas.png --meta atlas.json
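
Packing with padding and metadata can be pictured with a simple "shelf" packer: sprites are placed left to right in rows, and each placement is recorded so a renderer can look up its rectangle. The Python sketch below is an assumption-laden illustration, not the atlas skill's algorithm or metadata schema. The function name, the `x`/`y`/`w`/`h` keys, and the fixed atlas width are hypothetical, and production packers typically use tighter strategies.

```python
import json


def pack_shelves(sizes, atlas_width, padding=2):
    """Place rectangles left to right in rows ("shelves").

    sizes maps a sprite name to (width, height) in pixels.
    Returns (placements, atlas_height).
    """
    placements = {}
    x = y = padding
    shelf_height = 0
    # Tallest sprites first, so each shelf wastes less vertical space.
    for name, (w, h) in sorted(sizes.items(), key=lambda item: -item[1][1]):
        if w + 2 * padding > atlas_width:
            raise ValueError(f"{name} is wider than the atlas")
        if x + w + padding > atlas_width:
            # Current shelf is full: start a new one below it.
            x = padding
            y += shelf_height + padding
            shelf_height = 0
        placements[name] = {"x": x, "y": y, "w": w, "h": h}
        x += w + padding
        shelf_height = max(shelf_height, h)
    return placements, y + shelf_height + padding


if __name__ == "__main__":
    # Hypothetical sprite sizes; a real tool would read them from image files.
    sizes = {"play": (32, 32), "pause": (32, 32), "logo": (64, 48)}
    placements, height = pack_shelves(sizes, atlas_width=128)
    print(json.dumps({"height": height, "frames": placements}, indent=2))
```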

theme-factory

Generates complete color systems from a seed color, image, or concept. Outputs CSS custom properties, Tailwind configs, and design tokens.

Natural language examples
"Make a theme -- warm autumn feel, with dark mode included"
"Generate a color palette from this brand logo"

# From a seed color
/theme-factory --seed "#4F46E5" --mode both --format css

# From an image
/theme-factory --from-image hero.jpg --format tailwind

# From a concept
/theme-factory --concept "warm autumn forest" --format tokens
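
One common way to grow a seed color into a scale is to hold its hue and saturation fixed and vary the lightness. The Python sketch below shows that idea with the standard-library `colorsys` module and prints the result as CSS custom properties. How theme-factory actually derives its palettes is not documented here, so treat the lightness steps, the `--primary-100` style naming, and the function names as assumptions.

```python
import colorsys


def hex_to_rgb(hex_color):
    """Convert '#RRGGBB' to an (r, g, b) tuple of floats in [0, 1]."""
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of floats in [0, 1] to '#rrggbb'."""
    return "#" + "".join(f"{round(channel * 255):02x}" for channel in rgb)


def shade_scale(seed, lightness_steps=(0.95, 0.8, 0.65, 0.5, 0.35, 0.2)):
    """Keep the seed's hue and saturation; vary lightness from light to dark."""
    hue, _, saturation = colorsys.rgb_to_hls(*hex_to_rgb(seed))
    return [
        rgb_to_hex(colorsys.hls_to_rgb(hue, lightness, saturation))
        for lightness in lightness_steps
    ]


def to_css(name, shades):
    """Emit the shades as CSS custom properties (hypothetical naming scheme)."""
    lines = [f"  --{name}-{(i + 1) * 100}: {color};" for i, color in enumerate(shades)]
    return ":root {\n" + "\n".join(lines) + "\n}"


if __name__ == "__main__":
    print(to_css("primary", shade_scale("#4F46E5")))
```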

Output Handling

All media skills follow a consistent output pattern:

File output -- Generated files are saved to the --output directory (default: ./output)
Inline preview -- When running in the Electron desktop app or Web UI, images are displayed inline
Clipboard -- Pass --copy to copy the output file path to the system clipboard
Pipe-friendly -- All skills print the output file path to stdout for chaining

# Chain generation into editing
/imagegen "a forest cabin" | xargs -I {} /fal-image-edit --task style-transfer --style "watercolor" {}

# Generate and open immediately (`open` is macOS; use `xdg-open` on Linux)
/imagegen "sunset over the ocean" && open ./output/latest.png

Configuration

API keys and defaults are configured in ~/.cli-jaw/config.yaml or via environment variables:

# config.yaml
skills:
  imagegen:
    default_model: gpt-image-1
    default_quality: hd
    output_dir: ~/Pictures/cli-jaw
  sora:
    default_duration: 5s
    default_resolution: 1080p
  speech:
    default_voice: nova
  transcribe:
    default_format: srt
    default_language: ko

# Environment variables
export OPENAI_API_KEY="sk-..."       # imagegen, sora, speech, transcribe
export FAL_KEY="fal-..."             # nano-banana-pro, fal-image-edit
export HF_TOKEN="hf_..."             # hugging-face-* skills
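
Because each API key unlocks a specific group of skills, a quick pre-flight check can report which skills will fail before you run them. The Python sketch below encodes the mapping from the comments above. The helper itself is hypothetical and not part of CLI-JAW, and it assumes keys are supplied only through environment variables.

```python
import os

# Mapping taken from the environment-variable comments above.
REQUIRED_KEYS = {
    "OPENAI_API_KEY": ["imagegen", "sora", "speech", "transcribe"],
    "FAL_KEY": ["nano-banana-pro", "fal-image-edit"],
    "HF_TOKEN": [
        "hugging-face-cli",
        "hugging-face-evaluation",
        "hugging-face-model-trainer",
    ],
}


def unavailable_skills(environ=None):
    """Return {skill: missing_variable} for skills whose key is unset or empty."""
    if environ is None:
        environ = os.environ
    missing = {}
    for variable, skills in REQUIRED_KEYS.items():
        if not environ.get(variable):
            for skill in skills:
                missing[skill] = variable
    return missing


if __name__ == "__main__":
    for skill, variable in unavailable_skills().items():
        print(f"{skill}: set {variable} to enable this skill")
```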
