This project implements a distributed translation system that uses RunPod and Ollama to translate the `DataTonic/dark_thoughts_case_study_merged` dataset into multiple languages. The system parses thinking content from responses and translates the thinking and response components separately.
The system consists of several components:
- `runpodapi.py`: Handles communication with the RunPod API for creating, managing, and monitoring pods.
- `runcommandsrunpod.py`: Executes commands on RunPod instances and checks their readiness.
- `runpodlauncher.py`: Manages the launching and coordination of multiple RunPod instances.
- `runpodmanager.py`: High-level manager for the RunPod instances used in distributed translation.
- `ollamaclient.py`: Async client for interacting with the Ollama API and distributing translation tasks (see the sketch below).
- `translationcoordinator.py`: Orchestrates the translation process across dataset splits and languages.
- `dataprocessor.py`: Handles loading, processing, and saving the translated dataset.
- `translate.py`: Entry point for running the distributed translation process.
- `test_translation.py`, `test_parsing.py`: Test the functionality of the distributed translation system.

The project depends on the following Python packages: `aiohttp`, `asyncio`, `datasets`, `pandas`, `tqdm`, `requests`, `pydantic`.
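Each pod exposes an Ollama server, and `ollamaclient.py` fans translation requests out to them asynchronously. The snippet below is a minimal sketch of that pattern, assuming Ollama's standard `/api/generate` endpoint on its default port; the function name and URLs are illustrative, not taken from the actual client.

```python
import asyncio
import aiohttp

async def generate(session: aiohttp.ClientSession, pod_url: str,
                   prompt: str, model: str = "gemmax2") -> str:
    # With "stream": False, Ollama's /api/generate returns a single
    # JSON object whose "response" field holds the full generation.
    payload = {"model": model, "prompt": prompt, "stream": False}
    async with session.post(f"{pod_url}/api/generate", json=payload) as resp:
        resp.raise_for_status()
        return (await resp.json())["response"]

async def main() -> None:
    prompts = ["Translate this from English to French:\nEnglish: Hello\nFrench:"]
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(generate(session, "http://localhost:11434", p) for p in prompts)
        )
    print(results)

asyncio.run(main())
```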
To install and configure the project:

```bash
git clone https://github.com/yourusername/distributed-translation.git
cd distributed-translation
pip install -r requirements.txt
export RUNPOD_API_KEY=your_runpod_api_key
```
The system works with the `DataTonic/dark_thoughts_case_study_merged` dataset.
The system parses thinking content (the text before the `</think>` tag) from responses and translates both components separately.
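A minimal sketch of that split, assuming the thinking text sits before a closing `</think>` tag (the helper name is hypothetical):

```python
def split_thinking(raw: str) -> tuple[str, str]:
    """Split a raw response into (thinking, response) components."""
    # Everything before </think> is reasoning; everything after it is
    # the user-facing response.
    thinking, tag, response = raw.partition("</think>")
    if not tag:  # no closing tag: treat the whole text as the response
        return "", raw.strip()
    return thinking.removeprefix("<think>").strip(), response.strip()

# split_thinking("<think>step by step</think>Final answer")
# -> ("step by step", "Final answer")
```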
The final dataset structure follows this model:
```python
from pydantic import BaseModel

class Feature(BaseModel):
    id: int
    thinking: str
    response: str
    thinking_translated: str
    response_translated: str
    query: str
    source_data: str
    category: str
    endpoint: str
    source: str
```
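As an illustration, a single translated record could be validated like this (the field values are made up; `model_dump` is the pydantic v2 spelling, use `dict()` on v1):

```python
record = Feature(
    id=0,
    thinking="reasoning extracted before the </think> tag",
    response="final answer text",
    thinking_translated="raisonnement extrait avant la balise </think>",
    response_translated="texte de la réponse finale",
    query="original query",
    source_data="case study",
    category="example",
    endpoint="chat",
    source="dark_thoughts_case_study_merged",
)
print(record.model_dump())
```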
To run the full translation process:
```bash
python translate.py --pod-count 40 --batch-size 16 --max-tokens 100
```
Additional options:
```
--api-key TEXT        RunPod API key (defaults to the RUNPOD_API_KEY environment variable)
--pod-count INTEGER   Number of RunPod instances to launch (default: 40)
--dataset TEXT        Dataset name or path (default: DataTonic/dark_thoughts_case_study_merged)
--output-dir TEXT     Output directory for translated data (default: translated_dataset)
--batch-size INTEGER  Batch size for translation (default: 16)
--max-tokens INTEGER  Maximum number of tokens to generate (default: 100)
--gpu-type TEXT       GPU type ID for RunPod instances (default: NVIDIA RTX A5000)
--image TEXT          Docker image name (default: tonic01/ollama-gemmax2)
--model TEXT          Model name for translation (default: gemmax2)
--cleanup             Terminate all pods after completion
--prepare-only        Only prepare the dataset without translating
--process-only        Only process the translated dataset
--validate            Validate the dataset structure after processing
```
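For example, a smaller run on a different GPU type that tears its pods down afterwards (the flag values here are arbitrary):

```bash
python translate.py --pod-count 8 --gpu-type "NVIDIA RTX A4000" --batch-size 8 --cleanup
```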
To test the system components:
```bash
python test_translation.py --test all
```
To test the parsing functionality:
```bash
python test_parsing.py --test all
```
The translation process follows these steps:

1. Prepare the dataset, splitting each response into its thinking and response components.
2. Launch RunPod instances running the `tonic01/ollama-gemmax2` Docker image.
3. Distribute translation tasks across the pods and collect the results.
4. Process, validate, and save the translated dataset.

The system supports translation between the following languages:
Arabic, Bengali, Czech, German, English, Spanish, Persian, French, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Burmese, Dutch, Polish, Portuguese, Russian, Thai, Tagalog, Turkish, Urdu, Vietnamese, Chinese.
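To illustrate step 3 above, one simple way to fan the work out is to pair every (split, language) unit with a pod round-robin; this sketch is hypothetical and not taken from `translationcoordinator.py`:

```python
from itertools import cycle

def assign_tasks(splits: list[str], languages: list[str],
                 pod_urls: list[str]) -> list[tuple[str, str, str]]:
    # Pair each (split, language) unit with a pod, cycling through
    # the available pods so the load is spread evenly.
    pods = cycle(pod_urls)
    return [(split, lang, next(pods))
            for split in splits for lang in languages]

tasks = assign_tasks(["train"], ["French", "German", "Japanese"],
                     ["http://pod-a:11434", "http://pod-b:11434"])
# [('train', 'French',   'http://pod-a:11434'),
#  ('train', 'German',   'http://pod-b:11434'),
#  ('train', 'Japanese', 'http://pod-a:11434')]
```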
The system includes several error handling and recovery mechanisms.
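One common pattern for this kind of system, shown here as a generic sketch rather than the project's exact implementation, is retrying failed requests with exponential backoff:

```python
import asyncio
import aiohttp

async def with_retries(make_call, attempts: int = 3, base_delay: float = 1.0):
    # Retry an async operation, doubling the wait after each failure.
    for attempt in range(attempts):
        try:
            return await make_call()
        except aiohttp.ClientError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            await asyncio.sleep(base_delay * 2 ** attempt)
```

Combined with the earlier client sketch, this could be used as `await with_retries(lambda: generate(session, pod_url, prompt))`.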
The `tonic01/ollama-gemmax2` Docker image should have Ollama installed and serving the `gemmax2` translation model.
A typical end-to-end workflow:

```bash
# 1. Prepare the dataset
python translate.py --prepare-only

# 2. Launch pods and run the translation
python translate.py --pod-count 40

# 3. Process the translated dataset and validate its structure
python translate.py --process-only --validate
```
To test pod termination:

```bash
python test_translation.py --test termination
```
Use the `--validate` flag to ensure the dataset structure matches the required `Feature` model.

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.