Work Experience
- Performed large-scale data categorization using Lilac clustering and benchmarked its scalability, identifying high time costs and false positives. Developed a rate-limited Google Classification API pipeline to produce reliable data–category pairs for downstream model training.
- Built a Streamlit app using Hugging Face APIs to load data, tokenize, filter, ingest metadata, generate recipes, and perform regex-based search.
- Researched open-source data tools for LLMs. Generated hundreds of billions of tokens of Indic data by translating open-source English datasets with an internal CPU-based translation model, using multi-threading to speed up processing.
HuggingFace Data Engineering LLM Lilac Streamlit
- Integrated an FST-based Inverse Text Normalization (ITN) system into ASR post-processing using NVIDIA NeMo and added 4 custom grammar rules using Pynini. Compiled the grammars into FAR, reducing processing time by 140× and memory use by 80%.
- Generated 5M+ synthetic spoken-written pairs for capitalization using OpenAI APIs and led a 4-member team for ITN data annotation.
- Benchmarked F1 scores for internal punct-cap models; refactored and retrained the punct-cap transformer encoder in PyTorch Lightning.
- Generated transcriptions using Whisper to fine-tune an internal ASR model. Built & deployed a Flask inference demo as a Linux service.
- Preprocessed ASR datasets, increasing training data by 16%. Enhanced ASR functionality with a torchaudio-based audio I/O module.
- Built internal tools (pystratus, dvc-stratus) for dataset and model management, enabling secure cross-team adoption with OAuth and resource policies. Maintained ZWAF for OneAuth token validation in cross-team Zoho API communication.
- Integrated ZWAF into FastAPI-based ASR web server middleware, securing cross-team usage.
ITN Pynini PyTorch NVIDIA NeMo Flask FastAPI DVC
- Collected and preprocessed datasets for the ASR model and assisted peer ML engineers with data requirements, increasing the benchmark data suite by 83% (400 hours of audio). Added PyTorch IterableDataset classes for each dataset, with unit tests written in pytest.
- Processed open-source datasets, using youtube-dl for YouTube data and Google ASR for synthetic transcriptions, and developed a Streamlit tool for audio recording. Organized team sessions to create a limited benchmark with real-time recordings.
Python PyTorch Dataset Pytest Streamlit Google ASR
- Developed an issue-tracking system on ONGC's intranet, enabling issue creation, retrieval by ID, and resolution status tracking. Users could accept or reject resolutions to flag whether further assistance was needed.
- Gained hands-on experience collaborating with a team of software engineers on a real-world product.
Python Web Development Intranet
