Data Engineer (PySpark, AWS)

Kathmandu, Nepal
Full Time
Job ID: 2427486630
Mid Level

About Fusemachines

Fusemachines is a leading AI strategy, talent, and education services provider. Founded by Sameer Maskey, Ph.D., Adjunct Associate Professor at Columbia University, Fusemachines has a core mission of democratizing AI. With a presence in four countries (Nepal, the United States, Canada, and the Dominican Republic) and more than 450 full-time employees, Fusemachines seeks to bring its global expertise in AI to transform companies around the world.

About the role

This is a full-time position responsible for designing, building, testing, optimizing, and maintaining the infrastructure and code required for data integration, storage, processing, pipelines, and analytics (BI, visualization, and advanced analytics), from ingestion to consumption. This includes implementing data flow controls and ensuring high data quality and accessibility for analytics and business intelligence purposes. The role requires a strong foundation in programming and a keen understanding of how to integrate and manage data effectively across various storage systems and technologies.

We are looking for a skilled Data Engineer with a strong background in Python, SQL, PySpark, and large-scale AWS cloud-based data solutions, along with a passion for data quality, performance, and cost optimization. The ideal candidate will develop in an Agile environment.

This role is perfect for an individual passionate about leveraging data to drive insights, improve decision-making, and support the strategic goals of the organization through innovative data engineering solutions. 

Qualification & Experience

  • Must have a full-time Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field.
  • At least 2 years of experience as a data engineer with strong expertise in Python, SQL, PySpark, and AWS in an Agile environment, including a proven track record of building and optimizing data pipelines, architectures, and datasets, and proven experience in data storage, modeling, management, lakes, warehousing, processing/transformation, integration, cleansing, validation, and analytics.
  • 2+ years of experience with DevOps tools and technologies: GitHub or AWS DevOps.
  • Proven experience delivering large scale projects and products for Data and Analytics, as a data engineer within AWS.
  • Previous experience working with retail or similar data models is preferred.
  • The following certifications:
    • AWS Certified Cloud Practitioner
    • AWS Certified Data Engineer - Associate
    • Nice to have:
      • Databricks Certified Associate Developer for Apache Spark
      • Databricks Certified Data Engineer Associate

Required skills/Competencies

  • Strong programming skills in one or more object-oriented languages such as Python (must have), Scala, or Java, and proficiency in writing high-quality, scalable, maintainable, efficient, and optimized code for data integration, storage, processing, manipulation, and analytics solutions.
  • Strong SQL skills and experience working with complex data sets, enterprise data warehouses, and writing advanced SQL queries. Proficient with relational databases (RDS, MySQL, Postgres, or similar) and NoSQL databases (Cassandra, MongoDB, Neo4j, etc.).
  • Strong analytic skills related to working with structured and unstructured datasets.
  • Thorough understanding of big data principles, techniques, and best practices.
  • Experience with scalable and distributed data processing technologies such as Spark/PySpark (must have, including Spark SQL) and Kafka, in order to handle large volumes of data.
  • Experience with stream-processing systems: Storm, Spark-Streaming, etc. is a plus.
  • Experience implementing efficient batch and real-time data pipelines and ELT/ETL processes in AWS, using open-source solutions and developing custom integration solutions as needed. This includes data integration from sources such as APIs (PoS integrations are a plus), ERPs (Oracle and Allegra are a plus), databases, flat files, Apache Parquet, and event streams, as well as cleansing, transformation, and validation of the data (see the sketch after this list).
  • Experience in data cleansing, transformation, and validation.
  • Understanding of data modeling and database design principles, with the ability to implement efficient database schemas that meet the requirements of data solutions, and a good understanding of dimensional data modeling.
  • Knowledge of cloud computing, specifically AWS services related to data and analytics, such as S3, EMR, Glue, SageMaker, RDS, Redshift, Lambda, Kinesis, Lake Formation, EC2, ECS/ECR, EKS, IAM, CloudWatch, etc., and experience implementing data warehouse, data lake, and data lakehouse solutions in AWS.
  • Experience in orchestration using technologies like Azkaban, Luigi, Airflow, etc.
  • Good understanding of BI solutions, including Looker and LookML (Looker Modeling Language).
  • Familiarity with advanced analytics and AI/ML services and tools, and the ability to integrate advanced analytics, machine learning, and AI capabilities into data solutions (nice to have).
  • Strong understanding of the software development lifecycle (SDLC), especially Agile methodologies.
  • Knowledge of SDLC tools and technologies, including project management software (Jira or similar), source code management (GitHub, AWS CodeCommit, or similar), CI/CD systems (GitHub Actions, Jenkins, AWS CodePipeline, or similar), and binary repository managers (Sonatype Nexus, AWS CodeArtifact, or similar).
  • Knowledge and hands-on experience of DevOps principles, tools, and technologies (GitHub and AWS DevOps), including continuous integration and continuous delivery (CI/CD), infrastructure as code (IaC with Terraform), configuration management, automated testing, performance tuning, and cost management and optimization.
  • Knowledge of data structures and algorithms and good software engineering practices.
  • Strong analytical skills to identify and address technical issues, performance bottlenecks, and system failures.
  • Proficiency in debugging and troubleshooting issues in complex data and analytics environments and pipelines.
  • Understanding of Data Quality and Governance, including implementation of data quality and integrity checks and monitoring processes to ensure that data is accurate, complete, and consistent. 
  • Good Problem-Solving skills: being able to troubleshoot data processing pipelines and identify performance bottlenecks and other issues. 
  • Strong interpersonal skills and ability to work with a wide range of stakeholders.
  • Excellent communication skills to collaborate with cross-functional teams, including business users, data architects, DevOps/DataOps/MLOps engineers, data analysts, data scientists, developers, and operations teams. It is essential to convey complex technical concepts and insights to non-technical stakeholders effectively.
  • Ability to document processes, procedures, and deployment configurations.
  • Understanding of security practices, including network security groups, encryption, and compliance standards, and the ability to implement security controls and best practices within data and analytics solutions, including working knowledge of common cloud security vulnerabilities and ways to mitigate them.
  • Self-motivated with the ability to work well in a team.
  • Strong project management and organizational skills.
  • A willingness to stay updated with the latest services, data engineering trends, and best practices in the field.
  • Comfortable with picking up new technologies independently and working in a rapidly changing environment with ambiguous requirements.
  • Care about architecture, observability, testing, and building reliable infrastructure and data pipelines.
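
To illustrate the kind of pipeline work described in the list above, here is a minimal PySpark sketch of a batch ETL job that ingests a raw feed, cleanses and transforms it, applies a simple validation gate, runs a Spark SQL aggregation, and publishes Parquet. All paths, column names, and rules are hypothetical assumptions for illustration, not part of any actual Fusemachines system.

```python
# Minimal PySpark ETL sketch. All paths, columns, and thresholds are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-etl-sketch").getOrCreate()

# Extract: read a hypothetical raw orders feed from S3.
raw = spark.read.option("header", True).csv("s3://example-bucket/raw/orders/")

# Cleanse: drop exact duplicates and rows missing required keys.
clean = raw.dropDuplicates().dropna(subset=["order_id", "order_ts"])

# Transform: cast types and derive a date column for partitioning.
orders = (
    clean
    .withColumn("amount", F.col("amount").cast("double"))
    .withColumn("order_date", F.to_date("order_ts"))
)

# Validate: a simple quality gate before publishing downstream.
bad = orders.filter(F.col("amount").isNull() | (F.col("amount") < 0)).count()
if bad > 0:
    raise ValueError(f"{bad} rows failed the amount check")

# Spark SQL: an example aggregation over the curated data.
orders.createOrReplaceTempView("orders")
daily = spark.sql("SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date")
daily.show()

# Load: write partitioned Parquet to the curated zone.
orders.write.mode("overwrite").partitionBy("order_date").parquet("s3://example-bucket/curated/orders/")
```

In production, a job like this would typically run on EMR or Glue and be scheduled by an orchestrator such as Airflow, with the validation step failing the run before bad data reaches consumers.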

Responsibilities:

  • Design, implement, deploy, test and maintain highly scalable and efficient data architectures, defining and maintaining standards and best practices for data management independently with minimal guidance.
  • Ensure systems meet business requirements and industry practices for data integrity, performance, and reliability.
  • Integrate new data management technologies and software engineering tools into existing structures.
  • Create custom software components and analytics applications.
  • Employ a variety of languages and tools to integrate systems and hunt down opportunities to improve current processes.
  • Evaluate and advise on technical aspects of open work requests in the data pipeline with the project team.
  • Handle ELT/ETL processes, including data extraction, loading, and transformation from different sources, ensuring consistency and quality.
  • Transform and clean data for further analysis and storage.
  • Design and optimize data models and schemas to support business requirements and analysis.
  • Implement monitoring tools and systems to ensure the availability and performance of data systems. 
  • Manage data security and access, ensuring confidentiality and integrity.
  • Automate repetitive tasks and processes to improve operational efficiency.
  • Collaborate with data science teams to establish pipelines and workflows for training, validation, deployment, and monitoring of machine learning models. Automate deployment and management of machine learning models in production environments.
  • Contribute to data quality assurance efforts, such as implementing data validation checks and tests to ensure the reliability, efficiency, accuracy, completeness, and consistency of data (see the sketch following this list).
  • Test software solutions and meet product quality standards prior to release to QA.
  • Ensure the reliability, scalability, and efficiency of data systems are maintained at all times. Identify and resolve performance bottlenecks in pipelines caused by data, queries, and processing workflows to ensure efficient and timely data delivery.
  • Work with DevOps teams to optimize resources.
  • Assist in the configuration and management of data warehousing and data lake solutions.
  • Collaborate closely with cross-functional teams including Product, Engineering, Data Scientists, and Analysts to thoroughly understand data requirements and provide data engineering support and extend the company’s data with third-party sources of information when needed.
  • Take ownership of the storage layer and database management tasks, including schema design, indexing, and performance tuning.
  • Evaluate and implement cutting-edge technologies and methodologies, and continue learning and expanding skills in data engineering and cloud platforms, to improve and modernize existing data systems.
  • Develop, design, and execute data governance strategies encompassing cataloging, lineage tracking, quality control, and governance frameworks that align with current analytics demands and industry best practices, working closely with the Data Architect.
  • Ensure technology solutions support the needs of the customer and/or organization.
  • Define and document data engineering architectures, processes and data flows.
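
As a concrete sketch of the data quality responsibilities above, the following shows how simple validation checks might be packaged as a reusable PySpark function. The table, columns, and rules are hypothetical assumptions for illustration only.

```python
# Minimal data quality check sketch; table, columns, and rules are hypothetical.
from pyspark.sql import DataFrame, SparkSession, functions as F

def run_quality_checks(df: DataFrame) -> list:
    """Return a list of human-readable failures; an empty list means the data passed."""
    failures = []

    # Completeness: required columns must not contain nulls.
    for col in ("customer_id", "order_id"):
        nulls = df.filter(F.col(col).isNull()).count()
        if nulls:
            failures.append(f"{nulls} null values in {col}")

    # Uniqueness: order_id should behave as a primary key.
    if df.count() != df.select("order_id").distinct().count():
        failures.append("duplicate order_id values found")

    # Consistency: order amounts must be non-negative.
    negatives = df.filter(F.col("amount") < 0).count()
    if negatives:
        failures.append(f"{negatives} negative amounts")

    return failures

if __name__ == "__main__":
    spark = SparkSession.builder.appName("dq-checks-sketch").getOrCreate()
    orders = spark.read.parquet("s3://example-bucket/curated/orders/")  # hypothetical path
    problems = run_quality_checks(orders)
    if problems:
        raise RuntimeError("; ".join(problems))
```

Checks like these are usually wired into the pipeline as a gate (for example, an orchestrator task that fails the run) and fed into monitoring so that quality regressions surface before they reach analytics consumers.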

Fusemachines is an Equal Opportunities Employer, committed to diversity and inclusion. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or any other characteristic protected by applicable federal, state, or local laws.
