Shariar Kabir

AI Research Engineer · Dhaka, Bangladesh · shariar1405076@gmail.com

I lead the AI Research and Engineering team at Celloscope Ltd, where we develop production-grade AI-based solutions that are safe, reliable, and trustworthy. I completed my BSc in Computer Science and Engineering at Bangladesh University of Engineering and Technology (BUET).

My work focuses on interpretability and behavioral evaluation of large language models, along with related questions of safety and robustness. Currently, I am exploring how LLM behavior evolves over longer contexts, such as multi-turn interactions, how their internal mechanisms can be made interpretable, and how fairness can be ensured through principled interventions.

In Spring 2026, I joined SPAR to work on real-time automated mechanistic interpretability methods for AI safety, under the mentorship of Sriram Balasubramanian.

Previously, I was a research intern at the NLP Lab at UC Riverside, advised by Prof. Yue Dong, where I was also fortunate to work with Prof. Kevin Esterling. I worked on behavioral evaluation of LLMs and mechanistic interpretability, and also explored how psychometric and Bayesian modeling techniques can quantify and explain complex social behaviors in LLMs.

Prior to that, I worked on inclusive AI systems for low-resource languages, including Bengali medical ASR and document understanding tools. My long-term goal is to build methods that make AI systems not only capable but also transparent, stable, and socially aligned. My detailed CV can be found here.

News


Interests

Large foundation models (e.g., LLMs and VLMs) remain a black box whose inner reasoning and long-term behavior are still poorly understood. Despite this, they are increasingly deployed for sensitive tasks such as mass persuasion and student education. My goal is to contribute to research that makes such models more reliable, transparent, and socially aligned. Specifically, I am interested in working on three intertwined directions:

  1. Robustness: Understanding and improving model behavior over extended interactions.
  2. Interpretability: Developing methods to explain the internal mechanisms and decision-making processes of large models.
  3. Fairness: Designing principled interventions to mitigate biases and ensure equitable treatment across diverse user groups.

Outside of my professional pursuits, I am curious by nature and enjoy learning in general. I am an avid reader with a keen interest in classic thrillers and philosophical novels. I enjoy listening to Bengali folk music and classic rock, and occasionally try my hand at playing Bengali folk melodies on the ukulele.

I love animals and have a soft spot for cats due to their elegance, independent nature and curious spirits. My wife and I take care of two lovely cats, and we warmly invite you to meet them through some of their photos here.

Experience

Research Intern

UCR NLP Lab (Prof. Yue Dong)

Working on methods to combine interpretability tools with fairness diagnostics from social science for designing an intervention that targets emergent activation circuits in LLMs responsible for particular behavioral tendencies.

  • Understanding LLMs’ response instability over longer context.
  • Mechanistic Interpretability of LLMs in Socio-Political Reasoning.
  • LLMs’ Social Epistemology using Bayesian Statistics.
January 2025 - December 2025

Lead AI Research Engineer

Celloscope Limited

I led a team of six research engineers developing production-grade NLP and computer vision systems deployed across multiple industrial domains. Key projects I directed include:

  • Exercise Monitoring System for LG Nova, which used multimodal pose-estimation and language models to provide real-time feedback on workout form.
  • Resume Shortlister, a RAG-based ranking engine that matched the requirements from RFPs or job descriptions with candidate resumes using a hybrid approach combining rule-based filtering with semantic retrieval.
  • Drawing Checker, a vision system to automate design-error detection in engineering drawings through deep-learning-based object detection and geometric analysis.
September 2020 - Present

NLP and Data Scientist

MedAI Pvt. Limited (Part Time)

Extracting data-driven insights from medical data of Bangladesh and developing a smart healthcare platform that uses AI to deliver personalised healthcare services in local languages. Major contributions are:

  • Empowering Mental Health Support for Bengali Speakers through a Conversational AI chatbot.
  • Synthetic patient generator reflecting local demography.
  • Classifier for grouping patients' diseases using symptoms and other demographic data.
  • Training and serving a voice-based collector of patients' symptoms.
  • Design and development of an audio data collection portal.
August 2021 - November 2024

DevOps

GRP, ICT Division

Automating the deployment process and monitoring of numerous microservices. Major contributions include:

  • Automation scripts for deploying web apps and micro-services in Docker
  • Gateway configuration using NGINX reverse proxy
  • Document generation scripts from Google Sheets
May 2019 - August 2020

Education

Bangladesh University of Engineering and Technology (BUET)

Master of Science (part time)
Computer Science and Engineering

GPA (coursework): 3.54

Thesis: Dynamic Resource Allocation for Workloads in Serverless Architecture using Collaborative Filtering. Under the supervision of Professor Muhammad Abdullah Adnan.

Coursework: Bioinformatics Algorithms · Distributed Computing Systems · Data Mining · Data Management in the Cloud · Advanced Database Systems · Advanced Artificial Intelligence

April 2019 - October 2022

Bangladesh University of Engineering and Technology (BUET)

Bachelor of Science
Computer Science and Engineering

GPA: 3.53

Major: Artificial Intelligence

Thesis: Active Learning on Big Data, a study of how active learning can be applied to big data in a distributed cloud computing system. Under the supervision of Professor Muhammad Abdullah Adnan.

Coursework: Machine Learning · Pattern Recognition · Computer Graphics · Artificial Intelligence · Digital Image Processing · Data Structures · Database · Operating Systems · Software Development · Computer Architecture · Microprocessors and Microcontrollers · Computer Networks · Concrete Mathematics · Discrete Mathematics · Numerical Methods · Software Engineering and Information System Design · Compiler · Data Communication · Digital Logic Design · Structured Programming Language · Object Oriented Programming Language · Theory of Computation

February 2015 - April 2019


Publications

Please refer to my Google Scholar profile for a complete list of my publications.

PReSS: A Black-Box Framework for Evaluating Political Stance Stability in LLMs via Argumentative Pressure. [PDF]
Shariar Kabir, Kevin Esterling, Yue Dong
Under Review
Abstract

Existing evaluations of political bias in large language models (LLMs) typically classify outputs as left- or right-leaning. We extend this perspective by examining how ideological tendencies vary across topics and how consistently models maintain their positions, a property we refer to as stability. To capture this dimension, we propose PReSS (Political Response Stability under Stress), a black-box framework that evaluates LLMs by jointly considering model and topic context, categorizing responses into four stance types: stable-left, unstable-left, stable-right, and unstable-right. Applying PReSS to 12 widely used LLMs across 19 political topics reveals substantial variation in stance stability; for instance, a model that is left-leaning overall can exhibit stable-right behavior on certain topics. This highlights the importance of topic-aware and fine-grained evaluation of political ideologies of LLMs. Moreover, stability has practical implications for controlled generation and model alignment: interventions such as debiasing or ideology reversal should explicitly account for stance stability. Our empirical analyses reveal that when models are prompted or fine-tuned to adopt the opposite ideology, unstable topic stances are more likely to change, whereas stable ones resist modification. Thus, treating stability as a moderating factor provides a principled foundation for understanding, evaluating, and guiding interventions in politically sensitive model behavior.

AgnoSVD: Dynamic resource allocation for serverless workloads using collaborative filtering. [PDF]
Shariar Kabir, Muhammad Abdullah Adnan
Array, Volume 29, 2026.
Abstract

In serverless computing, determining the optimal resource configurations for workloads poses significant challenges, particularly due to the cloud provider's limited visibility into workload specifics. This complexity is amplified when dealing with diverse workloads that vary in their characteristics. In this paper, we present AgnoSVD, an approach for predicting the optimum resource configuration for an incoming workload using Singular Value Decomposition (SVD). The proposed model uses collaborative filtering to extract the latent factors of the workloads and resource profiles. Therefore, the model remains agnostic to the specific details of the functions and the resource configurations. We tested our approach on well-known serverless systems like AWS lambda and Apache OpenWhisk and evaluated the system using 99 functional workloads. These workloads encompass both individual functions and chains of …

Beyond the Surface: Probing the Ideological Depth of Large Language Models. [PDF]
Shariar Kabir, Kevin Esterling, Yue Dong
arXiv preprint arXiv:2508.21448 (2025)
Abstract

Large language models (LLMs) display recognizable political leanings, yet they vary significantly in their ability to represent a political orientation consistently. In this paper, we define ideological depth as (i) a model's ability to follow political instructions without failure (steerability), and (ii) the feature richness of its internal political representations measured with sparse autoencoders (SAEs), an unsupervised sparse dictionary learning (SDL) approach. Using Llama-3.1-8B-Instruct and Gemma-2-9B-IT as candidates, we compare prompt-based and activation-steering interventions and probe political features with publicly available SAEs. We find large, systematic differences: Gemma is more steerable in both directions and activates approximately 7.3x more distinct political features than Llama. Furthermore, causal ablations of a small targeted set of Gemma's political features to create a similar feature-poor setting induce consistent shifts in its behavior, with increased rates of refusals across topics. Together, these results indicate that refusals on benign political instructions or prompts can arise from capability deficits rather than safety guardrails. Ideological depth thus emerges as a measurable property of LLMs, and steerability serves as a window into their latent political architecture.

AmarDoctor: An AI-Driven, Multilingual, Voice-Interactive Digital Health Application. [PDF]
Nazmun Nahar, Ritesh Harshad Ruparel, Shariar Kabir, Sumaiya Tasnia Khan, Shyamasree Saha, Mamunur Rashid
arXiv preprint arXiv:2510.24724 (2025)
Abstract

This study presents AmarDoctor, a multilingual voice-interactive digital health app designed to provide comprehensive patient triage and AI-driven clinical decision support for Bengali speakers, a population largely underserved in access to digital healthcare. AmarDoctor adopts a data-driven approach to strengthen primary care delivery and enable personalized health management. While platforms such as AdaHealth, WebMD, Symptomate, and K-Health have become popular in recent years, they mainly serve European demographics and languages. AmarDoctor addresses this gap with a dual-interface system for both patients and healthcare providers, supporting three major Bengali dialects. At its core, the patient module uses an adaptive questioning algorithm to assess symptoms and guide users toward the appropriate specialist. To overcome digital literacy barriers, it integrates a voice-interactive AI assistant that navigates users through the app services. Complementing this, the clinician-facing interface incorporates AI-powered decision support that enhances workflow efficiency by generating structured provisional diagnoses and treatment recommendations. These outputs inform key services such as e-prescriptions, video consultations, and medical record management. To validate clinical accuracy, the system was evaluated against a gold-standard set of 185 clinical vignettes developed by experienced physicians. Effectiveness was further assessed by comparing AmarDoctor performance with five independent physicians using the same vignette set. Results showed AmarDoctor achieved a top-1 diagnostic precision of 81.08 percent (versus physicians average of 50.27 percent) and a top specialty recommendation precision of 91.35 percent (versus physicians average of 62.6 percent).

Automatic Speech Recognition for Biomedical Data in Bengali Language. [PDF]
Shariar Kabir, Nazmun Nahar, Shyamasree Saha, Mamunur Rashid
arXiv preprint arXiv:2406.12931 (2024)
Abstract

This paper presents the development of a prototype Automatic Speech Recognition (ASR) system specifically designed for Bengali biomedical data. Recent advancements in Bengali ASR are encouraging, but a lack of domain-specific data limits the creation of practical healthcare ASR models. This project bridges this gap by developing an ASR system tailored for Bengali medical terms like symptoms, severity levels, and diseases, encompassing two major dialects: Bengali and Sylheti. We train and evaluate two popular ASR frameworks on a comprehensive 46-hour Bengali medical corpus. Our core objective is to create deployable health-domain ASR systems for digital health applications, ultimately increasing accessibility for non-technical users in the healthcare sector.

SynthNID: Synthetic Data to Improve End-to-end Bangla Document Key Information Extraction. [PDF]
Syed Monsur, Shariar Kabir, Sakib Chowdhury
BLP workshop at EMNLP, pages 117–123, Singapore.
Abstract

End-to-end Document Key Information Extraction models require a lot of compute and labeled data to perform well on real datasets. This is particularly challenging for low-resource languages like Bangla where domain-specific multimodal document datasets are scarcely available. In this paper, we have introduced SynthNID, a system to generate domain-specific document image data for training OCR-less end-to-end Key Information Extraction systems. We show the generated data improves the performance of the extraction model on real datasets and the system is easily extendable to generate other types of scanned documents for a wide range of document understanding tasks.


Awards & Achievements

Industry Coding Assessment

CodeSignal Industry Coding Assessment (ICA): 510/600 (≈ 722/850 on the equivalent General Coding Assessment (GCA) scale, top 15%)
2025

Global Health Equity Challenge Award

MIT Solve

AmarDoctor by MedAI was selected as one of six solvers out of 2200+ participants worldwide for its innovative approach to accessible healthcare.

2024


Research

NER From Chatbot User Messages

Extraction of named entities (NEs) such as beneficiary names, transfer amounts, account types, and account numbers. Instead of applying a single language model like BERT to all of these, we employed a mix of approaches: BERT, RegEx, and lookup tables. This minimized the training and fine-tuning effort, which is challenging for a low-resource language like Bengali. We used a BERT-based model for beneficiary name extraction, RegEx for transfer amount and account number extraction, and lookup tables for account type extraction.
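The division of labor described above can be sketched roughly as follows. This is a minimal illustration, not the deployed system: the patterns, the `ACCOUNT_TYPES` table, the `extract_entities` function, and the `name_extractor` hook (standing in for the BERT-based model) are all hypothetical, and the example uses English text for readability.

```python
import re

# Hypothetical lookup table: surface form -> canonical account type.
ACCOUNT_TYPES = {
    "savings": "SAVINGS",
    "current": "CURRENT",
    "salary": "SALARY",
}

# Illustrative patterns only; real (Bengali) messages would need richer rules.
AMOUNT_RE = re.compile(r"(\d+(?:,\d{3})*(?:\.\d+)?)\s*(?:taka|tk|bdt)\b", re.IGNORECASE)
ACCOUNT_NO_RE = re.compile(r"\b\d{10,16}\b")


def extract_entities(message, name_extractor=None):
    """Combine RegEx, a lookup table, and an optional model-based name extractor."""
    entities = {}

    # RegEx: transfer amount (a number followed by a currency word).
    amount = AMOUNT_RE.search(message)
    if amount:
        entities["transfer_amount"] = float(amount.group(1).replace(",", ""))

    # RegEx: account number (assumed here to be a run of 10 to 16 digits).
    account_no = ACCOUNT_NO_RE.search(message)
    if account_no:
        entities["account_no"] = account_no.group(0)

    # Lookup table: account type.
    lowered = message.lower()
    for surface, canonical in ACCOUNT_TYPES.items():
        if re.search(rf"\b{re.escape(surface)}\b", lowered):
            entities["account_type"] = canonical
            break

    # Model-based step: any callable that maps a message to a name or None.
    if name_extractor is not None:
        name = name_extractor(message)
        if name:
            entities["beneficiary_name"] = name

    return entities


message = "Send 5,000 taka to Rahim, savings account 1234567890123"
print(extract_entities(message))
```

Keeping the model-based step behind a plain callable means only the beneficiary-name component needs training data, while the other entities rely on rules that can be edited without retraining.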

Finetuning LLMs for Mental Health Counsel

Recognizing the inherent bias of most LLMs towards European languages and ethnicities and the scarcity of structured Bengali data, I initially focused on refining open-source models like LLaMA using different parameter-efficient fine-tuning (PEFT) methods (e.g., adapter injection and LoRA). Ultimately, I fine-tuned LLaMA for Bengali mental health consultation using QLoRA, which yielded a model that can be served with low GPU memory. This work supports more equitable access to healthcare technologies across diverse linguistic communities.

ASR System for Patient Symptoms [PPT]

An ASR system for understanding medical symptoms spoken by patients in Bengali. We trained the DeepSpeech model from scratch on audio collected from consenting users through our audio data collection portal, then fine-tuned it for noisy environments using the 13 domain augmentations provided by DeepSpeech. This model performed poorly whenever the user said an out-of-vocabulary word. We therefore fine-tuned a Whisper (tiny) model, specifically the BanglaASR model, which was trained on the Bangla Mozilla Common Voice dataset. The resulting model achieves a WER of only 8%, which is largely due to the limited vocabulary of symptoms.
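For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The sketch below is a generic, illustrative implementation; the function name and the example sentences are assumptions and are not taken from the project's evaluation code.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words (rolling one-row dynamic programme).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of four reference words.
print(word_error_rate("fever and dry cough", "fever and cough"))
```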

SynthCases Creator and Disease Classifier [PPT]

A recommendation system based on ensemble classifiers that predicts diseases from patients' symptoms. The classifier is trained on synthetic data generated to reflect real-world demography. The generator takes into account patients' risk factors, family history, and medical history. The classifier makes predictions through a multi-stage pipeline: first, it predicts the probability of each disease from the symptoms; next, it uses a prevalence look-up table to filter the most probable diseases by ethnicity; finally, it makes the prediction using the filtered diseases and the patient's risk factors.
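The three stages of the pipeline can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the disease and group names are placeholders, every number is made up and carries no clinical meaning, and simple weighted scoring stands in for the ensemble classifier.

```python
# All tables contain made-up toy values; they carry no clinical meaning.
SYMPTOM_WEIGHTS = {
    "disease_a": {"fever": 0.5, "rash": 0.3, "joint pain": 0.2},
    "disease_b": {"fever": 0.4, "cough": 0.4, "fatigue": 0.2},
    "disease_c": {"fever": 0.5, "chills": 0.5},
}
PREVALENCE = {  # (ethnicity group, disease) -> assumed prevalence
    ("group_x", "disease_a"): 0.20,
    ("group_x", "disease_b"): 0.30,
    ("group_x", "disease_c"): 0.01,
}
RISK_MULTIPLIERS = {  # (disease, risk factor) -> assumed score multiplier
    ("disease_a", "risk_1"): 1.5,
    ("disease_b", "risk_2"): 1.2,
}


def symptom_scores(symptoms):
    """Stage 1: score each disease from the symptoms, normalized to sum to 1."""
    raw = {
        disease: sum(w for s, w in weights.items() if s in symptoms)
        for disease, weights in SYMPTOM_WEIGHTS.items()
    }
    total = sum(raw.values())
    return {d: v / total for d, v in raw.items()} if total else raw


def filter_by_prevalence(scores, ethnicity, min_prevalence=0.05):
    """Stage 2: drop diseases that are rare in the patient's group."""
    return {
        d: p for d, p in scores.items()
        if PREVALENCE.get((ethnicity, d), 0.0) >= min_prevalence
    }


def predict(symptoms, ethnicity, risk_factors):
    """Stage 3: re-weight the remaining candidates by risk factors and rank them."""
    candidates = filter_by_prevalence(symptom_scores(set(symptoms)), ethnicity)
    adjusted = {}
    for disease, score in candidates.items():
        for factor in risk_factors:
            score *= RISK_MULTIPLIERS.get((disease, factor), 1.0)
        adjusted[disease] = score
    return sorted(adjusted, key=adjusted.get, reverse=True)


print(predict(["fever", "rash"], "group_x", ["risk_1"]))
```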

Licence Plate Detection in CCTV Frames using YOLOv5 [PPT]

We fine-tuned the YOLOv5 model to detect licence plates of different vehicles in Bangladesh in CCTV footage. Colab Notebook

Key Information Extraction (KIE) From NID using Donut [PPT]

We used the data generated by SynthNID to fine-tune the pretrained document transformer model (Donut) for the Key Information Extraction (KIE) task, training on a mix of real and synthetic data. Adding the synthetic data yielded a significant improvement in performance, especially in the Bengali fields.

Projects

Medical Code Classification via Linear Probing of LLM Activations

This project investigates multi-label medical code classification by training linear probes on Large Language Model (LLM) activations. We extract layer-wise attention head activations from medical-domain LLMs and use Ridge regression classifiers to predict relevant medical disciplines from clinical descriptions. The approach enables interpretable analysis of which model components are most informative for medical domain classification tasks.

Exercise Monitoring System

A system leveraging Vision-Language Models (VLMs) to assist users in performing exercises correctly by comparing their execution against reference videos of expert demonstrations. The system uses frame-level visual and motion comparison, integrated with language-based feedback, to generate natural language guidance that helps users improve their form and reduce the risk of injury.

Agrani Voice Banking Chatbot

Bangladesh's pioneering voice-based AI chatbot for seamless banking activities, serving hundreds of thousands of real users. Agrani Bank is one of the largest state-owned banks in Bangladesh, with a huge number of customers who have very little access to information. Agrani Voice Banking makes banking services accessible to everyone. It is powered by Bengali ASR and a fine-tuned NLU engine for natural-language-driven fund transfers and inquiries, and it adapts its responses dynamically to the user's input messages.

Realtime Liveness Check

Ensures the presence of a live person during eKYC authentication by analyzing real-time facial movements and blinking and by requiring the user to perform specific facial actions. Developed for use on mobile devices such as smartphones.

Audio Data Collection Portal

An audio data collection portal for a large user base, built with a React frontend and a Python-Flask backend. Metadata is stored in PostgreSQL, while audio objects are stored in S3. User authentication and authorization are handled by AWS Cognito. Data can be collected based on priority or user specifics, which is useful for collecting medical recordings by filtering symptoms based on age, gender, or audio counts.

AI Service Gateway

A portal for showcasing AI services. Clients can use a demo version of each service. Authentication and authorization are built using Keycloak and the Google identity provider. New clients can sign up using their email and receive a limited credit for using the services.

Don't Drop The Bomb

This microcontroller project was built as a multiplayer game. The game features two player-controlled bars on either side of two connected dot matrices. At its core is a single ATmega32 microcontroller. The controllers were built using MPU-6050 accelerometer and gyro sensors.

Ray Tracing

Ray Tracing is a rendering technique that can produce incredibly realistic lighting effects. It works by tracing the path of light through pixels in an image plane and simulating the effects of its encounters with virtual objects. In this project, I implemented a ray tracer that can render spheres, planes, and triangles with textures and shadows. Phong Lighting Model and Recursive Reflection are employed in this implementation.
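As a rough illustration of the core steps named above, the sketch below shows a ray-sphere intersection test, the reflection formula used for recursive reflection, and Phong shading for a single light. It is not the project's code: the function names, the coefficients `ka`, `kd`, `ks`, and the shininess value are arbitrary choices made for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def scale(a, k):
    return tuple(x * k for x in a)


def normalize(a):
    return scale(a, 1.0 / math.sqrt(dot(a, a)))


def intersect_sphere(origin, direction, center, radius):
    """Nearest t > 0 with origin + t * direction on the sphere, else None.

    Assumes `direction` is a unit vector, so the quadratic's leading term is 1.
    """
    oc = sub(origin, center)
    b = 2.0 * dot(oc, direction)
    c = dot(oc, oc) - radius * radius
    disc = b * b - 4.0 * c
    if disc < 0:
        return None  # the ray misses the sphere
    root = math.sqrt(disc)
    for t in ((-b - root) / 2.0, (-b + root) / 2.0):
        if t > 1e-6:  # skip hits behind or at the ray origin
            return t
    return None


def reflect(v, normal):
    """Mirror the incoming vector v about a unit normal."""
    return sub(v, scale(normal, 2.0 * dot(v, normal)))


def phong(point, normal, eye, light_pos, ka=0.1, kd=0.6, ks=0.3, shininess=16):
    """Phong intensity (ambient + diffuse + specular) for one white light."""
    to_light = normalize(sub(light_pos, point))
    to_eye = normalize(sub(eye, point))
    diffuse = max(dot(normal, to_light), 0.0)
    mirrored = reflect(scale(to_light, -1.0), normal)
    specular = max(dot(mirrored, to_eye), 0.0) ** shininess if diffuse > 0 else 0.0
    return ka + kd * diffuse + ks * specular


# A ray from the origin looking down -z at a unit sphere centered at z = -5.
t = intersect_sphere((0, 0, 0), (0, 0, -1), (0, 0, -5), 1.0)
hit = scale((0, 0, -1), t)
normal = normalize(sub(hit, (0, 0, -5)))
print(t, phong(hit, normal, eye=(0, 0, 0), light_pos=(0, 0, 0)))
```

In a recursive ray tracer, the direction returned by `reflect` would seed a secondary ray from the hit point, and its shaded color would be blended into the result up to a fixed recursion depth.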

Curriculum Vitae