Data Engineer

2 weeks ago


Ukraine, Ukraine «Київстар» Full-time 90 000 $ - 120 000 $ per year

We are looking for a Data Engineer (NLP-Focused) to build and optimize the data pipelines that fuel our Ukrainian LLM and Kyivstar's NLP initiatives. In this role, you will design robust ETL/ELT processes to collect, process, and manage large-scale text and metadata, enabling our data scientists and ML engineers to develop cutting-edge language models. You will work at the intersection of data engineering and machine learning, ensuring that our datasets and infrastructure are reliable, scalable, and tailored to the needs of training and evaluating NLP models in a Ukrainian language context. This is a unique opportunity to shape the data foundation of a pioneering AI project in Ukraine, working alongside NLP experts and leveraging modern big data technologies.

Responsibilities:

  • Design, develop, and maintain ETL/ELT pipelines for gathering, transforming, and storing large volumes of text data and related information. Ensure pipelines are efficient and can handle data from diverse sources (e.g., web crawls, public datasets, internal databases) while maintaining data integrity.
  • Implement web scraping and data collection services to automate the ingestion of text and linguistic data from the web and other external sources. This includes writing crawlers or using APIs to continuously collect data relevant to our language modeling efforts.
  • Implement NLP/LLM-specific data processing: text cleaning and normalization (e.g., filtering of toxic content, de-duplication, de-noising), and detection and removal of personal data.
  • Build task-specific SFT/RLHF datasets from existing data, including data augmentation and labeling with an LLM as the teacher.
  • Set up and manage cloud-based data infrastructure for the project. Configure and maintain data storage solutions (data lakes, warehouses) and processing frameworks (e.g., distributed compute on AWS/GCP/Azure) that can scale with growing data needs.
  • Automate data processing workflows and ensure their scalability and reliability. Use workflow orchestration tools like Apache Airflow to schedule and monitor data pipelines, enabling continuous and repeatable model training and evaluation cycles.
  • Maintain and optimize analytical databases and data access layers for both ad-hoc analysis and model training needs. Work with relational databases (e.g., PostgreSQL) and other storage systems to ensure fast query performance and well-structured data schemas.
  • Collaborate with Data Scientists and NLP Engineers to build data features and datasets for machine learning models. Provide data subsets, aggregations, or preprocessing as needed for tasks such as language model training, embedding generation, and evaluation.
  • Implement data quality checks, monitoring, and alerting. Develop scripts or use tools to validate data completeness and correctness (e.g., ensuring no critical data gaps or anomalies in the text corpora), and promptly address any pipeline failures or data issues. Implement data version control.
  • Manage data security, access, and compliance. Control permissions to datasets and ensure adherence to data privacy policies and security standards, especially when dealing with user data or proprietary text sources.
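
As an illustration of the text cleaning, de-duplication, and personal-data removal tasks listed above, here is a minimal Python sketch. It is not Kyivstar's pipeline: the function names, the placeholder tokens, and the two regular expressions are assumptions made for this example. The patterns are deliberately simple, and a production system would likely add near-duplicate detection and a dedicated PII detector.

```python
import hashlib
import re
import unicodedata

# Hypothetical patterns: deliberately simple, not a complete PII detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{8,}\d")


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def redact_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


def clean_corpus(docs):
    """Yield normalized, redacted documents, skipping exact duplicates."""
    seen = set()
    for doc in docs:
        cleaned = redact_pii(normalize(doc))
        if not cleaned:
            continue  # drop documents that are empty after cleaning
        # Case-insensitive exact de-duplication via a content hash.
        digest = hashlib.sha256(cleaned.casefold().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield cleaned
```

Calling `list(clean_corpus(docs))` on an iterable of raw strings returns the cleaned documents in their original order. Hashing the cleaned text keeps memory use proportional to the number of unique documents rather than to their total size.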

Required Qualifications:

  • Education & Experience: 3+ years of experience as a Data Engineer or in a similar role, building data-intensive pipelines or platforms. A Bachelor's or Master's degree in Computer Science, Engineering, or related field is preferred. Experience supporting machine learning or analytics teams with data pipelines is a strong advantage.
  • NLP Domain Experience: Prior experience handling linguistic data or supporting NLP projects (e.g., text normalization, handling different encodings, tokenization strategies). Knowledge of Ukrainian text sources and datasets, or experience with multilingual data processing, can be an advantage given our project's focus. Understanding of the approach used by FineWeb2 or similar processing pipelines.
  • Data Pipeline Expertise: Hands-on experience designing ETL/ELT processes, including extracting data from various sources, using transformation tools, and loading into storage systems. Proficiency with orchestration frameworks like Apache Airflow for scheduling workflows. Familiarity with building pipelines for unstructured data (text, logs) as well as structured data.
  • Programming & Scripting: Strong programming skills in Python for data manipulation and pipeline development. Experience with NLP packages (spaCy, NLTK, langdetect, fasttext, etc.). Experience with SQL for querying and transforming data in relational databases. Knowledge of Bash or other scripting for automation tasks. Writing clean, maintainable code and using version control (Git) for collaborative development.
  • Databases & Storage: Experience working with relational databases (e.g., PostgreSQL, MySQL) including schema design and query optimization. Familiarity with NoSQL or document stores (e.g., MongoDB) and big data technologies (HDFS, Hive, Spark) for large-scale data is a plus. Understanding of or experience with vector databases (e.g., Pinecone, FAISS) is beneficial, as our NLP applications may require embedding storage and fast similarity search.
  • Cloud Infrastructure: Practical experience with cloud platforms (AWS, GCP, or Azure) for data storage and processing. Ability to set up services such as S3/Cloud Storage, data warehouses (e.g., BigQuery, Redshift), and use cloud-based ETL tools or serverless functions. Understanding of infrastructure-as-code (Terraform, CloudFormation) to manage resources is a plus.
  • Data Quality & Monitoring: Knowledge of data quality assurance practices. Experience implementing monitoring for data pipelines (logs, alerts) and using CI/CD tools to automate pipeline deployment and testing. An analytical mindset to troubleshoot data discrepancies and optimize performance bottlenecks.
  • Collaboration & Domain Knowledge: Ability to work closely with data scientists and understand the requirements of machine learning projects. Basic understanding of NLP concepts and the data needs for training language models, so you can anticipate and accommodate the specific forms of text data and preprocessing they require. Good communication skills to document data workflows and to coordinate with team members across different functions.

Preferred Qualifications:

  • Advanced Tools & Frameworks: Experience with distributed data processing frameworks (such as Apache Spark or Databricks) for large-scale data transformation, and with message streaming systems (Kafka, Pub/Sub) for real-time data pipelines. Familiarity with data serialization formats (JSON, Parquet) and handling of large text corpora.
  • Web Scraping Expertise: Deep experience in web scraping, using tools like Scrapy, Selenium, or Beautiful Soup, and handling anti-scraping challenges (rotating proxies, rate limiting). Ability to parse and clean raw text data from HTML, PDFs, or scanned documents.
  • CI/CD & DevOps: Knowledge of setting up CI/CD pipelines for data engineering (using GitHub Actions, Jenkins, or GitLab CI) to test and deploy changes to data workflows. Experience with containerization (Docker) to package data jobs and with Kubernetes for scaling them is a plus.
  • Big Data & Analytics: Experience with analytics platforms and BI tools (e.g., Tableau, Looker) used to examine the data prepared by the pipelines. Understanding of how to create and manage data warehouses or data marts for analytical consumption.
  • Problem-Solving: Demonstrated ability to work independently in solving complex data engineering problems, optimising existing pipelines, and implementing new ones under time constraints. A proactive attitude to explore new data tools or techniques that could improve our workflows.

What we offer:

  • Office or remote – it's up to you. You can work from anywhere, and we will arrange your workplace.
  • Remote onboarding.
  • Performance bonuses for everyone (annual or quarterly — depends on the role).
  • We train employees, with opportunities to learn through the company's library, internal resources, and programs from partners.
  • Health and life insurance.
  • Wellbeing program and corporate psychologist.
  • Reimbursement of expenses for Kyivstar mobile communication.

  • Data Engineer

    2 weeks ago


    Ukraine, Ukraine Bringg Full-time 90 000 $ - 120 000 $ per year

    Position: Data Engineer. We are seeking a forward-thinking Data Engineer to join our team during a pivotal time of transformation. As we redesign our data pipeline solutions, you will play a key role in shaping and executing our next-generation data infrastructure. This is a hands-on, architecture-driven role ideal for someone ready to help lead this...

  • QL: Sn Data Engineer

    2 weeks ago


    Ukraine, Ukraine Adaptiq Full-time 900 000 ₴ - 1 200 000 ₴ per year

    Title: Senior Data Engineer. Who we are: Adaptiq is a technology hub specializing in building, scaling, and supporting R&D teams for high-end, fast-growing product companies in a wide range of industries. About the Product: Our client is a leading SaaS company offering pricing optimization solutions for e-commerce businesses. Its advanced technology...

  • Middle Data Engineer

    2 weeks ago


    Ukraine, Ukraine Intellias Full-time 90 000 $ - 120 000 $ per year

    Meet your recruiter: Yuliia Serednia. Vacancy details: Data Engineering, Data Engineer, Middle, Ukraine, Remote. Drivers of change, it's your time to pave new ways. Intellias, a leading software provider in the automotive industry, invites you to develop the future of driving. Join the team and create products used by 2 billion people in the...

  • Finaloop: Data Engineer

    2 weeks ago


    Ukraine, Ukraine Adaptiq Full-time 90 000 $ - 120 000 $ per year

    Title: Senior Data Platform Engineer (Python). Who we are: Adaptiq is a technology hub specialising in building, scaling and supporting R&D teams for high-end, fast-growing product companies in a wide range of industries. About the Product: Our client, Finaloop, reshapes bookkeeping to fit e-commerce needs, building a fully automated, real-time...

  • Strong Middle Data Engineer

    4 days ago


    Ukraine, Ukraine Intellias Full-time 70 000 $ - 120 000 $ per year

    Meet your recruiter: Yaryna Holynska. Vacancy details: Data Engineering, Data Engineer, Strong Middle, Ukraine, Remote. Drivers of change, it's your time to pave new ways. Intellias, a leading software provider in the automotive industry, invites you to develop the future of driving. Join the team and create products used by 2 billion people in...

  • Senior Backend Engineer, Data Platform

    2 weeks ago


    Ukraine, Ukraine ClickUp Full-time 90 000 $ - 120 000 $ per year

    ClickUp is revolutionizing the way the world works. As the only all-in-one productivity platform built from day one for true convergence, ClickUp unifies tasks, docs, chat, calendar, enterprise search, and more—supercharged by context-driven AI. While others scramble to bundle fragmented tools or bolt on AI, we anticipated this future and made it our...

  • Senior Data Scientist

    2 weeks ago


    Ukraine, Ukraine ELEKS Full-time 600 000 ₴ - 1 000 000 ₴ per year

    ELEKS Artificial Intelligence Office is looking for a Senior Data Scientist in Ukraine. ABOUT PROJECT: Our Client is Europe's leading health, sport, and leisure group. The organization operates 116 clubs, including 100 in the UK and additional locations across the Republic of Ireland, mainland Europe, and Asia. With billions in annual turnover, the group offers...

  • Senior DevOps Engineer

    2 days ago


    Ukraine, Ukraine CommIT Full-time 90 000 ₴ - 120 000 ₴ per year

    The company empowers businesses to gain complete supply chain transparency through real-time data and actionable insights. Our innovative cloud-based platform, Sensos Sync, in tandem with our sensor-based tracking labels, provides pin-point location data and critical alerts for temperature, movement, and shock. This gives logistics teams total control and...

  • Analytics Engineer

    2 weeks ago


    Ukraine, Ukraine Zoolatech Full-time 90 000 $ - 120 000 $ per year

    We are seeking an experienced Analytics Engineer with proven hands-on experience in Atlassian Services. The successful candidate will play a key role in improving, standardizing, and scaling reporting for the engineering teams using Atlassian products (Analytics and Atlassian Data...

  • Data Scientist

    2 weeks ago


    Ukraine, Ukraine Influ2 Full-time 90 000 $ - 120 000 $ per year

    Influ2 is a new kind of B2B advertising that actually works. With Influ2, GTM teams can focus on the same buyers throughout the entire buying journey—from cold prospecting to closing and upselling—driving sales revenue. Buyers who engage with Influ2's contact-level ads are twice as likely to convert. Over 100 enterprises and mid-market companies...