Job Description
Company Description
FirstIgnite develops AI-powered tools designed to support universities in key areas such as Tech Transfer, Corporate Relations, Advancement, Career Development, and Research Development. These tools empower universities to expand their initiatives and streamline operations effectively. FirstIgnite is dedicated to helping academic institutions grow and thrive through innovative technological solutions, making a meaningful impact in the education sector.
Role Description
This is a full-time remote role for a Data Engineer. The Data Engineer will lead the design, development, and maintenance of scalable data systems focused on acquiring, structuring, and integrating diverse datasets from external and internal sources. The role is responsible for building reliable data acquisition pipelines, including crawling, scraping, API ingestion, data mining, enrichment, and ETL workflows that turn messy public or semi-public sources into accurate, usable data for FirstIgnite’s platform.
Requirements
- Lead the architecture, development, and optimization of scalable data pipelines for patents, grants, clinical trials, publications, labs, and firmographic sources.
- Build and maintain ETL workflows on AWS using Glue (PySpark), Lambda, S3, and RDS PostgreSQL.
- Explore and implement complementary data storage solutions as needed.
- Apply machine learning and LLM-based techniques to extract, structure, and enrich data at scale.
- Ensure data integrity, security, accuracy, and governance across all systems.
- Implement tools for data exploration, visualization, and actionable insights.
- Collaborate with Product, Engineering, and Customer Success teams to drive AI-driven innovation.
- Evaluate emerging AI/ML techniques and tools to enhance platform capabilities.
- Maintain version control, CI/CD pipelines, and reproducible workflows for the data engineering team.
Qualifications
- Strong programming expertise in Python, with experience building production data pipelines.
- Hands-on experience with AWS data services (Glue/PySpark, Lambda, S3, RDS PostgreSQL, SSM) or equivalent cloud data stacks.
- Strong experience with external data acquisition, including building and maintaining crawlers, scrapers, API integrations, and data mining workflows that turn messy public or semi-public sources into reliable, structured datasets. This includes handling rate limits, source changes, deduplication, validation, enrichment, and pipeline monitoring.
- Familiarity with RESTful and GraphQL APIs, ETL processes, and data integration strategies.
- Understanding of Retrieval-Augmented Generation (RAG) and AI-enhanced data applications.
- Strong analytical skills, including evaluating trade-offs of data ingestion and storage methods.
- Leadership experience managing data engineering or AI teams in SaaS or tech environments.
- Solid knowledge of version control (Git) and CI/CD practices.
- Excellent problem-solving, collaboration, and communication skills in cross-functional teams.
- Ability to work independently, prioritize tasks, and drive projects from conception to deployment.
Preferred Qualifications
- Experience with web scraping at scale (Firecrawl or similar).
- LLM-based data extraction and prompt engineering for structured output.
- Data quality instrumentation and observability for pipelines.
As part of your LinkedIn application submission, please use Loom to record a video and email it to us within the next 48 hours. The recording should include both your camera (showing yourself) and your screen as you walk us through a project or piece of code you’re proud of.
Please email the video to [email protected] with the title of the position you are applying for in the subject line. We look forward to your submission!