Distributed Translation System for Dark Thoughts Dataset

Your Name or Team

Overview

This project implements a distributed translation system that uses RunPod and Ollama to translate the DataTonic/dark_thoughts_case_study_merged dataset into multiple languages. The system parses the thinking content out of each response and translates the two components separately.

Architecture

The system consists of several components:

  1. RunPod API Client (runpodapi.py): Handles communication with the RunPod API for creating, managing, and monitoring pods.
  2. RunPod Command Executor (runcommandsrunpod.py): Executes commands on RunPod instances and checks their readiness.
  3. RunPod Launcher (runpodlauncher.py): Manages the launching and coordination of multiple RunPod instances.
  4. RunPod Manager (runpodmanager.py): High-level manager for RunPod instances used for distributed translation.
  5. Ollama Client (ollamaclient.py): Async client for interacting with Ollama API and distributing translation tasks.
  6. Translation Coordinator (translationcoordinator.py): Orchestrates the translation process across dataset splits and languages.
  7. Data Processor (dataprocessor.py): Handles loading, processing, and saving the translated dataset.
  8. Main Script (translate.py): Entry point for running the distributed translation process.
  9. Test Scripts (test_translation.py, test_parsing.py): Test the functionality of the distributed translation system.

Requirements

  • Python 3.8+
  • RunPod API key
  • Access to RunPod GPU instances
  • The following Python packages: aiohttp, datasets, pandas, tqdm, requests, pydantic (asyncio ships with the Python 3.8+ standard library and needs no separate install)

Installation

  1. Clone the repository:
    git clone https://github.com/yourusername/distributed-translation.git
    cd distributed-translation
  2. Install the required packages:
    pip install -r requirements.txt
  3. Set up your RunPod API key:
    export RUNPOD_API_KEY=your_runpod_api_key

Dataset Structure

The system works with the DataTonic/dark_thoughts_case_study_merged dataset, which contains:

  • English split: 20,711 examples
  • Chinese split: 20,204 examples

The system parses the thinking content (the text before the </think> marker) out of each response and translates the two components separately.
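
A minimal sketch of this split, assuming the thinking content is everything before a literal </think> marker (the helper name is illustrative, not the actual dataprocessor.py API):

from typing import Tuple

def split_thinking(raw_response: str) -> Tuple[str, str]:
    """Split a raw response into (thinking, response) parts."""
    thinking, sep, response = raw_response.partition("</think>")
    if sep:  # marker found: the text before it is the thinking content
        return thinking.strip(), response.strip()
    return "", raw_response.strip()  # no thinking block in this response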

The final dataset structure follows this model:

from pydantic import BaseModel

class Feature(BaseModel):
    id: int
    thinking: str
    response: str
    thinking_translated: str
    response_translated: str
    query: str
    source_data: str
    category: str
    endpoint: str
    source: str
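
For example, one merged record can be validated against this model (the field values here are illustrative):

record = Feature(
    id=0,
    thinking="Let me consider the stakeholders involved...",
    response="The key risk factors are...",
    thinking_translated="Permettez-moi de considérer...",
    response_translated="Les principaux facteurs de risque sont...",
    query="What are the risk factors in this case?",
    source_data="case_study",
    category="risk_analysis",
    endpoint="ollama",
    source="english",
)
print(record.id, len(record.response_translated))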

Usage

Running the Translation Process

To run the full translation process:

python translate.py --pod-count 40 --batch-size 16 --max-tokens 100

Additional options:

--api-key TEXT            RunPod API key (defaults to RUNPOD_API_KEY environment variable)
--pod-count INTEGER       Number of RunPod instances to launch (default: 40)
--dataset TEXT            Dataset name or path (default: DataTonic/dark_thoughts_case_study_merged)
--output-dir TEXT         Output directory for translated data (default: translated_dataset)
--batch-size INTEGER      Batch size for translation (default: 16)
--max-tokens INTEGER      Maximum number of tokens to generate (default: 100)
--gpu-type TEXT           GPU type ID for RunPod instances (default: NVIDIA RTX A5000)
--image TEXT              Docker image name (default: tonic01/ollama-gemmax2)
--model TEXT              Model name for translation (default: gemmax2)
--cleanup                 Terminate all pods after completion
--prepare-only            Only prepare the dataset without translating
--process-only            Only process the translated dataset
--validate                Validate dataset structure after processing
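
For reference, a condensed sketch of how translate.py might wire a subset of these options together with argparse (the actual script may differ):

import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring a subset of the documented options."""
    parser = argparse.ArgumentParser(
        description="Distributed translation with RunPod and Ollama"
    )
    parser.add_argument("--api-key", default=os.environ.get("RUNPOD_API_KEY"),
                        help="RunPod API key")
    parser.add_argument("--pod-count", type=int, default=40)
    parser.add_argument("--dataset", default="DataTonic/dark_thoughts_case_study_merged")
    parser.add_argument("--output-dir", default="translated_dataset")
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--max-tokens", type=int, default=100)
    parser.add_argument("--cleanup", action="store_true")
    parser.add_argument("--prepare-only", action="store_true")
    parser.add_argument("--process-only", action="store_true")
    parser.add_argument("--validate", action="store_true")
    return parser

args = build_parser().parse_args()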

Testing the System

To test the system components:

python test_translation.py --test all

To test the parsing functionality:

python test_parsing.py --test all

Translation Process

The translation process follows these steps:

  1. Preparation: Parse the dataset to separate thinking content from responses.
  2. Setup: Launch the requested number of RunPod instances (40 by default) with the tonic01/ollama-gemmax2 Docker image.
  3. Readiness Check: Wait for all pods to be ready and for Ollama to be initialized with the required model.
  4. Translation: For each dataset split (English and Chinese), the coordinator (see the sketch after this list):
    • Translates the thinking and response fields separately into all target languages.
    • Skips empty thinking content to avoid unnecessary requests.
    • Saves intermediate results periodically.
  5. Processing: Merge translations and create a Hugging Face dataset structure.
  6. Validation: Ensure the dataset structure matches the required Feature model.
  7. Cleanup: Terminate all pods if requested.
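
A minimal sketch of the per-batch translation call against one pod's Ollama endpoint, assuming Ollama's standard /api/generate route; the prompt format and helper names are illustrative rather than the exact ollamaclient.py implementation:

import asyncio
import aiohttp

async def translate_text(session: aiohttp.ClientSession, base_url: str,
                         text: str, target_lang: str, model: str = "gemmax2",
                         max_tokens: int = 100) -> str:
    """Translate one string via a pod's Ollama /api/generate endpoint."""
    if not text.strip():  # skip empty thinking content
        return ""
    payload = {
        "model": model,
        "prompt": f"Translate the following text to {target_lang}:\n\n{text}",
        "stream": False,
        "options": {"num_predict": max_tokens},
    }
    async with session.post(f"{base_url}/api/generate", json=payload) as resp:
        resp.raise_for_status()
        data = await resp.json()
        return data["response"].strip()

async def translate_batch(base_url: str, texts: list, target_lang: str) -> list:
    """Fan a batch of strings out as concurrent requests to one pod."""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(translate_text(session, base_url, t, target_lang) for t in texts)
        )

For example, asyncio.run(translate_batch("http://<pod-host>:11434", texts, "French")) translates one batch against a single pod; the real coordinator additionally spreads batches across all pods.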

Supported Languages

The system supports translation between the following 28 languages (those covered by the GemmaX2-28 model):

Arabic, Bengali, Czech, German, English, Spanish, Persian, French, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Burmese, Dutch, Polish, Portuguese, Russian, Thai, Tagalog, Turkish, Urdu, Vietnamese, Chinese.

Error Handling and Recovery

The system includes several error handling and recovery mechanisms:

  • Retry Logic: Failed translations are automatically retried (see the sketch after this list).
  • Checkpointing: Intermediate results are saved periodically to allow resuming from failures.
  • Health Checks: Pod and Ollama health are checked before starting translation.
  • Empty Content Handling: Empty thinking content is handled efficiently to avoid unnecessary translations.
  • Graceful Termination: Resources are properly cleaned up on completion or failure.
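
A minimal sketch of the retry mechanism with exponential backoff (the real client may implement retries differently):

import asyncio
import logging

async def with_retries(coro_factory, max_attempts: int = 3, base_delay: float = 1.0):
    """Run an async callable, retrying failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            logging.warning("Attempt %d failed (%s); retrying in %.1fs",
                            attempt, exc, delay)
            await asyncio.sleep(delay)

Usage would look like: result = await with_retries(lambda: translate_text(session, url, text, "German")).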

Docker Image Requirements

The tonic01/ollama-gemmax2 Docker image should have:

  1. Ollama installed and configured to run on port 11434
  2. The GemmaX2-28-2B-v0.1 model pre-loaded or configured to load automatically
  3. Sufficient GPU memory (at least 24GB recommended)
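
A readiness probe against such an image can be sketched with Ollama's standard /api/tags endpoint, which lists the locally available models (this probe is illustrative, not the exact runcommandsrunpod.py logic):

import requests

def ollama_ready(host: str, model: str = "gemmax2", timeout: float = 5.0) -> bool:
    """Return True if Ollama answers on port 11434 and the model is available."""
    try:
        resp = requests.get(f"http://{host}:11434/api/tags", timeout=timeout)
        resp.raise_for_status()
        models = [m.get("name", "") for m in resp.json().get("models", [])]
        return any(name.startswith(model) for name in models)
    except requests.RequestException:
        return False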

Example Workflow

  1. Prepare Dataset:
    python translate.py --prepare-only
  2. Run Translation:
    python translate.py --pod-count 40
  3. Process Results Only:
    python translate.py --process-only --validate
  4. Cleanup:
    python test_translation.py --test termination

Troubleshooting

  • API Key Issues: Ensure your RunPod API key is correctly set in the environment variable or passed as a parameter.
  • GPU Availability: Check RunPod for GPU availability if pod creation fails.
  • Model Loading: If Ollama readiness check times out, the model may be too large for the selected GPU type.
  • Translation Errors: Check the logs for specific error messages. Most translation errors are automatically retried.
  • Dataset Structure: Run with the --validate flag to ensure the dataset structure matches the required Feature model.

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.