Distributed Translation System for Dark Thoughts Dataset

Your Name or Team

Overview

This project implements a distributed translation system that uses RunPod and Ollama to translate the DataTonic/dark_thoughts_case_study_merged dataset into multiple languages. The system parses the thinking content out of each response and translates the two components separately.

Architecture

The system consists of several components:

  1. RunPod API Client (runpodapi.py): Handles communication with the RunPod API for creating, managing, and monitoring pods.
  2. RunPod Command Executor (runcommandsrunpod.py): Executes commands on RunPod instances and checks their readiness.
  3. RunPod Launcher (runpodlauncher.py): Manages the launching and coordination of multiple RunPod instances.
  4. RunPod Manager (runpodmanager.py): High-level manager for RunPod instances used for distributed translation.
  5. Ollama Client (ollamaclient.py): Async client for interacting with Ollama API and distributing translation tasks.
  6. Translation Coordinator (translationcoordinator.py): Orchestrates the translation process across dataset splits and languages.
  7. Data Processor (dataprocessor.py): Handles loading, processing, and saving the translated dataset.
  8. Main Script (translate.py): Entry point for running the distributed translation process.
  9. Test Scripts (test_translation.py, test_parsing.py): Test the functionality of the distributed translation system.

Requirements

  • Python 3.8+
  • RunPod API key
  • Access to RunPod GPU instances
  • The following Python packages: aiohttp, datasets, pandas, tqdm, requests, pydantic (asyncio ships with the Python 3.8+ standard library and needs no separate install)

Installation

  1. Clone the repository:
    git clone https://github.com/yourusername/distributed-translation.git
    cd distributed-translation
  2. Install the required packages:
    pip install -r requirements.txt
  3. Set up your RunPod API key:
    export RUNPOD_API_KEY=your_runpod_api_key

Dataset Structure

The system works with the DataTonic/dark_thoughts_case_study_merged dataset, which contains:

  • English split: 20,711 examples
  • Chinese split: 20,204 examples

The system parses the thinking content (the text before the </think> marker) out of each response and translates the two components separately.
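
A minimal sketch of this split, assuming the thinking content is everything before a literal </think> marker (the helper name is illustrative, not the actual dataprocessor.py API):

from typing import Tuple

def split_thinking(raw_response: str) -> Tuple[str, str]:
    """Split a raw response into (thinking, response) parts."""
    thinking, sep, response = raw_response.partition("</think>")
    if sep:  # marker found: the text before it is the thinking content
        return thinking.strip(), response.strip()
    return "", raw_response.strip()  # no thinking block in this response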

The final dataset structure follows this model:

from pydantic import BaseModel

class Feature(BaseModel):
    id: int
    thinking: str
    response: str
    thinking_translated: str
    response_translated: str
    query: str
    source_data: str
    category: str
    endpoint: str
    source: str
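
For example, one merged record can be validated against this model (the field values here are illustrative):

record = Feature(
    id=0,
    thinking="Let me consider the stakeholders involved...",
    response="The key risk factors are...",
    thinking_translated="Permettez-moi de considérer...",
    response_translated="Les principaux facteurs de risque sont...",
    query="What are the risk factors in this case?",
    source_data="case_study",
    category="risk_analysis",
    endpoint="ollama",
    source="english",
)
print(record.id, len(record.response_translated))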

Usage

Running the Translation Process

To run the full translation process:

python translate.py --pod-count 40 --batch-size 16 --max-tokens 100

Additional options:

--api-key TEXT            RunPod API key (defaults to RUNPOD_API_KEY environment variable)
--pod-count INTEGER       Number of RunPod instances to launch (default: 40)
--dataset TEXT            Dataset name or path (default: DataTonic/dark_thoughts_case_study_merged)
--output-dir TEXT         Output directory for translated data (default: translated_dataset)
--batch-size INTEGER      Batch size for translation (default: 16)
--max-tokens INTEGER      Maximum number of tokens to generate (default: 100)
--gpu-type TEXT           GPU type ID for RunPod instances (default: NVIDIA RTX A5000)
--image TEXT              Docker image name (default: tonic01/ollama-gemmax2)
--model TEXT              Model name for translation (default: gemmax2)
--cleanup                 Terminate all pods after completion
--prepare-only            Only prepare the dataset without translating
--process-only            Only process the translated dataset
--validate                Validate dataset structure after processing
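
For reference, a condensed sketch of how translate.py might wire a subset of these options together with argparse (the actual script may differ):

import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring a subset of the documented options."""
    parser = argparse.ArgumentParser(
        description="Distributed translation with RunPod and Ollama"
    )
    parser.add_argument("--api-key", default=os.environ.get("RUNPOD_API_KEY"),
                        help="RunPod API key")
    parser.add_argument("--pod-count", type=int, default=40)
    parser.add_argument("--dataset", default="DataTonic/dark_thoughts_case_study_merged")
    parser.add_argument("--output-dir", default="translated_dataset")
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--max-tokens", type=int, default=100)
    parser.add_argument("--cleanup", action="store_true")
    parser.add_argument("--prepare-only", action="store_true")
    parser.add_argument("--process-only", action="store_true")
    parser.add_argument("--validate", action="store_true")
    return parser

args = build_parser().parse_args()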

Testing the System

To test the system components:

python test_translation.py --test all

To test the parsing functionality:

python test_parsing.py --test all

Translation Process

The translation process follows these steps:

  1. Preparation: Parse the dataset to separate thinking content from responses.
  2. Setup: Launch the requested number of RunPod instances (40 by default) with the tonic01/ollama-gemmax2 Docker image.
  3. Readiness Check: Wait for all pods to be ready and for Ollama to be initialized with the required model.
  4. Translation: For each dataset split (English and Chinese), the coordinator (see the sketch after this list):
    • Translates the thinking and response fields separately into all target languages.
    • Skips empty thinking content to avoid unnecessary requests.
    • Saves intermediate results periodically.
  5. Processing: Merge translations and create a Hugging Face dataset structure.
  6. Validation: Ensure the dataset structure matches the required Feature model.
  7. Cleanup: Terminate all pods if requested.
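
A minimal sketch of the per-batch translation call against one pod's Ollama endpoint, assuming Ollama's standard /api/generate route; the prompt format and helper names are illustrative rather than the exact ollamaclient.py implementation:

import asyncio
import aiohttp

async def translate_text(session: aiohttp.ClientSession, base_url: str,
                         text: str, target_lang: str, model: str = "gemmax2",
                         max_tokens: int = 100) -> str:
    """Translate one string via a pod's Ollama /api/generate endpoint."""
    if not text.strip():  # skip empty thinking content
        return ""
    payload = {
        "model": model,
        "prompt": f"Translate the following text to {target_lang}:\n\n{text}",
        "stream": False,
        "options": {"num_predict": max_tokens},
    }
    async with session.post(f"{base_url}/api/generate", json=payload) as resp:
        resp.raise_for_status()
        data = await resp.json()
        return data["response"].strip()

async def translate_batch(base_url: str, texts: list, target_lang: str) -> list:
    """Fan a batch of strings out as concurrent requests to one pod."""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(translate_text(session, base_url, t, target_lang) for t in texts)
        )

For example, asyncio.run(translate_batch("http://<pod-host>:11434", texts, "French")) translates one batch against a single pod; the real coordinator additionally spreads batches across all pods.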

Supported Languages

The system supports translation between the following 28 languages (those covered by the GemmaX2-28 model):

Arabic, Bengali, Czech, German, English, Spanish, Persian, French, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Burmese, Dutch, Polish, Portuguese, Russian, Thai, Tagalog, Turkish, Urdu, Vietnamese, Chinese.

Error Handling and Recovery

The system includes several error handling and recovery mechanisms:

  • Retry Logic: Failed translations are automatically retried (see the sketch after this list).
  • Checkpointing: Intermediate results are saved periodically to allow resuming from failures.
  • Health Checks: Pod and Ollama health are checked before starting translation.
  • Empty Content Handling: Empty thinking content is handled efficiently to avoid unnecessary translations.
  • Graceful Termination: Resources are properly cleaned up on completion or failure.
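
A minimal sketch of the retry mechanism with exponential backoff (the real client may implement retries differently):

import asyncio
import logging

async def with_retries(coro_factory, max_attempts: int = 3, base_delay: float = 1.0):
    """Run an async callable, retrying failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            logging.warning("Attempt %d failed (%s); retrying in %.1fs",
                            attempt, exc, delay)
            await asyncio.sleep(delay)

Usage would look like: result = await with_retries(lambda: translate_text(session, url, text, "German")).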

Docker Image Requirements

The tonic01/ollama-gemmax2 Docker image should have:

  1. Ollama installed and configured to run on port 11434
  2. The GemmaX2-28-2B-v0.1 model pre-loaded or configured to load automatically
  3. Sufficient GPU memory (at least 24GB recommended)
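
A readiness probe against such an image can be sketched with Ollama's standard /api/tags endpoint, which lists the locally available models (this probe is illustrative, not the exact runcommandsrunpod.py logic):

import requests

def ollama_ready(host: str, model: str = "gemmax2", timeout: float = 5.0) -> bool:
    """Return True if Ollama answers on port 11434 and the model is available."""
    try:
        resp = requests.get(f"http://{host}:11434/api/tags", timeout=timeout)
        resp.raise_for_status()
        models = [m.get("name", "") for m in resp.json().get("models", [])]
        return any(name.startswith(model) for name in models)
    except requests.RequestException:
        return False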

Example Workflow

  1. Prepare Dataset:
    python translate.py --prepare-only
  2. Run Translation:
    python translate.py --pod-count 40
  3. Process Results Only:
    python translate.py --process-only --validate
  4. Cleanup:
    python test_translation.py --test termination

Troubleshooting

  • API Key Issues: Ensure your RunPod API key is correctly set in the environment variable or passed as a parameter.
  • GPU Availability: Check RunPod for GPU availability if pod creation fails.
  • Model Loading: If Ollama readiness check times out, the model may be too large for the selected GPU type.
  • Translation Errors: Check the logs for specific error messages. Most translation errors are automatically retried.
  • Dataset Structure: Run with the --validate flag to ensure the dataset structure matches the required Feature model.

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.