HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY ---🙠🕮🙢--- REPORT PROJECT: Introduction to Business Analytics Instructor: PhD. Nguyen Binh Minh Group: 3 Member: Nguyen Hai Long 20204920 Bui Thanh Tung 20204931 Nguyen Huy Hai 20200194 Duong Vu Tuan Minh 20209705 Index I. II. III. IV. V. VI. Introduction Problem description 3 3 About the dataset Designing System Classification Models Conclusion 4 9 10 1. Introduction Fraudulent activities in loan applications pose a significant threat to financial institutions, necessitating the development of robust fraud detection systems. In this study, we investigate a comprehensive set of applicant information, loan details, credit history, property information, document verification, and social surroundings using the Kaggle Credit Card Fraud dataset. The dataset, available at https://www.kaggle.com/datasets/mishra5001/credit-card, provides a diverse and realistic representation of credit-related transactions and allows us to enhance the accuracy and effectiveness of fraud detection in loan applications. The applicant information section includes demographic details, family-related information, and employment details, providing insights into the applicant's background and financial stability. Loan details encompass key factors such as loan type, credit amount, annuity, and goods price, shedding light on the financial aspects of the loan request. Credit history, evaluated through external scores and late payment records, offers a historical perspective on the applicant's creditworthiness. Property information, focusing on property type, ownership, and building features, contributes to understanding the collateral and associated risks. Document verification flags aid in assessing the authenticity of information provided by the applicant. The social surroundings section captures observations of the client's social environment and defaults over time, offering a dynamic perspective on risk. Additionally, factors such as the duration since the last phone change and the presence of mobile phones and emails are considered in the analysis. This multifaceted approach aims to create a comprehensive fraud detection model capable of identifying irregularities and potential fraudulent activities across diverse dimensions of loan applications. By integrating various indicators and leveraging advanced analytics techniques on the Kaggle Credit Card Fraud dataset, our study aims to contribute to the development of a more resilient fraud detection system, ultimately safeguarding financial institutions from fraudulent loan applications. The findings of this research have the potential to enhance risk assessment methodologies and promote a more secure lending environment. The work can be found at: https://www.kaggle.com/code/dngvminh/fraud-detection. 2. Problem description In today's dynamic and interconnected business landscape, fraud has emerged as a pervasive challenge across various industries, including banking, sales, and insurance. While credit card fraud remains one of the most prevalent forms of illicit activities, the spectrum of fraudulent practices has expanded to include identity theft and cyber-attacks. The inherent complexity of fraud, coupled with the constantly evolving and sophisticated strategies employed by fraudsters, necessitates innovative and adaptive solutions for detection. Addressing the multifaceted nature of fraud requires a nuanced understanding of the underlying patterns and anomalies within datasets. Among the diverse industries grappling with this issue, the financial sector is particularly vulnerable, as fraudulent transactions can have severe financial repercussions and erode trust in financial systems. In response to this pervasive challenge, this project aims to contribute to the arsenal of tools available for fraud detection. By focusing on the domain of online transactions, specifically utilizing the "Credit Card Fraud Detection" dataset available on Kaggle (https://www.kaggle.com/datasets/mishra5001/credit-card), we endeavor to employ statistical methods to discern patterns indicative of suspicious operations. This project is not merely an academic exercise but a pragmatic endeavor aimed at empowering business analysts to make informed decisions that safeguard their organizations against financial losses and reputational damage. The proactive identification of features or combinations of features that are common to fraudulent transactions holds the key to timely detection and mitigation. Armed with such insights, companies can strengthen their fraud prevention mechanisms, enhancing their ability to thwart fraudulent actions before they inflict substantial harm. In essence, this project serves as a valuable asset for business analysts, providing them with practical experience in leveraging data analytics for fraud detection and prevention – a skill set increasingly indispensable in today's dynamic business environment. 3. About the dataset The "Credit Card Fraud Detection" dataset, available on Kaggle (https://www.kaggle.com/datasets/mishra5001/credit-card), serves as a valuable resource for exploring and understanding fraudulent activities in online credit card transactions. The dataset contains a wealth of information related to credit card transactions, encompassing diverse features such as applicant demographics, loan details, credit history, property information, document verification, and social surroundings. The variety of features enables a comprehensive analysis of different aspects related to fraud detection. Imbalanced Class Distribution: One of the notable characteristics of the dataset is its highly imbalanced class distribution. Specifically, only a small percentage (7%) of the transactions are labeled as fraudulent, while the majority belong to the non-fraudulent class. Imbalanced datasets present a common challenge in fraud detection, as models might exhibit a bias toward the majority class. Special attention and techniques are required to address this imbalance. Real-World Implications: The imbalanced nature of the dataset reflects the real-world scenario where fraudulent transactions are infrequent but have significant consequences. Analyzing and addressing class imbalance contributes to the creation of robust fraud detection models capable of identifying and preventing illicit activities. 4. Designing system 4.1. Distribution of Fraud and Non-Fraud Instances In the real-world landscape of financial transactions, fraudulent activities consistently represent a small fraction of the entire data space. This characteristic is mirrored in the "Credit Card Fraud Detection" dataset, where instances of fraud transactions account for approximately 7% of the entire dataset. The scarcity of fraudulent cases, while reflective of the genuine nature of these occurrences, adds a layer of complexity to the task of building effective fraud detection models. The imbalanced distribution of classes, with the majority of transactions being non-fraudulent, underscores the need for sophisticated analytical approaches. As fraud detection algorithms are trained to discern patterns and anomalies, the inherent rarity of fraudulent instances necessitates a heightened sensitivity to subtle signals that might indicate potential illicit activities. 4.2. Feature selection In the pursuit of refining the dataset for robust model training and analysis, a comprehensive feature selection process was undertaken, guided by the overarching objective of enhancing the quality and interpretability of the dataset. One critical facet of this process involved addressing the presence of missing values within the dataset. Columns with a significant proportion of missing data, exceeding a predetermined threshold of 60%, were identified and subsequently removed. This strategic decision aimed to alleviate potential biases stemming from incomplete information, thereby contributing to a more reliable and comprehensive dataset for subsequent analyses. In addition to mitigating missing values, a deliberate effort was made to streamline the dataset's structure and reduce dimensionality through the consolidation of related columns. Specifically, a group of binary indicator columns, ranging from FLAG_DOCUMENT_2 to FLAG_DOCUMENT_21, were amalgamated into a singular variable named FLAG_DOCUMENT. This consolidation not only served to simplify the dataset but also encapsulated the collective information conveyed by the individual flags, providing a more cohesive representation of the underlying features. Recognizing the significance of addressing multicollinearity to ensure model stability and prevent redundancy, an in-depth analysis of attribute correlations was conducted. Highly correlated attributes, indicative of redundant or overlapping information, were systematically identified and pruned from the dataset. This meticulous curation of features aimed not only to enhance the dataset's discriminative power but also to foster a more efficient and streamlined input for subsequent model training and evaluation. In summary, the feature selection process undertaken in this phase of the analysis represents a judicious effort to refine the dataset, striking a balance between preserving valuable information and mitigating potential sources of bias or redundancy. These strategic decisions lay the foundation for subsequent stages of the project, where the curated dataset will be utilized for developing and fine-tuning machine learning models for fraud detection. 4.3. Positive targets on each values of categorical attributes In the pursuit of a granular understanding of our machine learning model's performance, a comprehensive analysis was conducted by calculating true positive targets for each distinct value within categorical attributes. This approach goes beyond the conventional evaluation metrics, offering a nuanced view of the model's efficacy in identifying positive instances across specific subgroups. By breaking down the true positive rates at the categorical level, we gained valuable insights into the discriminative power of our model, identifying categories that significantly contribute to positive outcomes. This method not only aids in highlighting the strengths of the model in accurately predicting positives but also sheds light on potential weaknesses, especially in scenarios with imbalanced datasets or where certain subcategories hold particular importance. The subsequent visualization of these true positive rates provides a clear and interpretable representation of the model's performance across diverse categories, facilitating informed decision-making and targeted improvements in our classification task. ● Findings: 1. Geographic Discrepancies: Registration vs. Work/Live Regions Customers with different city/region registrations and work/live regions show higher fraudulent tendencies. 2. Family Dynamics: Children and Loan Repayment Applicants with more children face increased challenges in loan repayment, leading to a higher likelihood of fraud. 3. Socioeconomic Influence: Standard of Living and Fraudulent Transactions Residents in higher standard areas/regions exhibit a decreased likelihood of engaging in fraudulent transactions. 4. Loan Type Disparities: Cash vs. Revolving Loans Applicants for Cash loans are more prone to fraud compared to those applying for Revolving loans. 5. Gender Disparity: Male vs. Female Fraud Likelihood Males are more likely to commit fraud than females among loan applicants. Meanwhile, business man and student, have no trouble paying the loans. 6. Property Ownership: Limited Impact on Fraud Incidence Ownership of cars or properties does not significantly affect the likelihood of committing fraud. 7. Employment Status: Unemployment and Maternity Leave Risks Unemployed or maternity leave applicants demonstrate a significantly higher rate of committing fraud. 8. Academic Background: Influence on Fraud Rates Lower academic qualifications correlate with higher fraud rates, with a notable decrease as educational levels rise. Academic degree holders display a mere 1.8% chance of fraud. 9. Occupational Disparities: High vs. Low-Skilled Jobs Low-skilled jobs, such as security staff, waiters/barmen, and laborers, are associated with almost double the chance of fraud compared to high-skilled IT workers. 10. Industry Sectors: Fraud Likelihood Variation Certain sectors, including transport and industry, pose a higher risk of fraudulent transactions compared to the educational sector, particularly in university settings. 4.4. Removing attributes that are highly correlated In order to enhance the robustness of our analytical model and mitigate the potential for multicollinearity, a careful examination of attribute correlations was undertaken, leading to the removal of highly correlated features. The identification and subsequent elimination of these correlated attributes were pivotal steps in refining the dataset for our analysis. Multicollinearity, where two or more attributes are strongly correlated, can introduce instability and challenges in accurately assessing the individual impact of each attribute on the outcome. By strategically removing these highly correlated attributes, we aimed to ensure that our model remains more interpretable and generalizable, ultimately improving its predictive performance. This process of attribute selection contributes to a more streamlined and efficient model, reducing redundancy and enhancing the model's capacity to discern meaningful patterns within the data. The resulting dataset, pruned of correlated attributes, sets the foundation for a more accurate and reliable analysis of the factors influencing the outcomes under consideration. 4.5. Correlation of Continuous Variables with TARGET. The calculation of the correlation between continuous variables and the target variable is a pivotal aspect of our analysis, shedding light on the relationships that exist between individual attributes and the target outcome. Through this process, we seek to quantify the degree and direction of the linear association between each continuous variable and the target variable. A positive correlation implies that as the continuous variable increases, the likelihood of a positive outcome in the target variable also increases, while a negative correlation suggests an inverse relationship. This correlation analysis serves as a crucial step in identifying influential features that may significantly impact the target variable. By understanding the strength and nature of these relationships, we gain valuable insights into the factors that play a role in determining the target outcome. These correlation coefficients not only guide feature selection but also contribute to the overall interpretability and predictive power of our model. This comprehensive examination of continuous variables in relation to the target variable lays the groundwork for a nuanced understanding of the underlying patterns within our dataset. 5. Classification Model In the pursuit of developing a robust and accurate predictive model, the preprocessed data underwent training using four distinct classification models: Logistic Regression, K-Nearest Neighbors, Decision Tree Classifier, and Gaussian Naive Bayes. Prior to model training, the dataset was subjected to oversampling through the Synthetic Minority Over-sampling Technique (SMOTE) following the data split, a measure aimed at addressing potential class imbalances. Additionally, missing value imputation was carried out post-data split to ensure the integrity of the training process. After comprehensive training and evaluation, it was observed that the Decision Tree Classifier exhibited the highest test accuracy among the models, reaching an impressive 0.82 accuracy. This outcome underscores the effectiveness of the Decision Tree model in capturing complex relationships within the data and highlights its potential as a strong candidate for further refinement and deployment in predictive tasks. Train accuracy Test accuracy Logistic Regression 0.56 0.47 Gaussian Naive Bayes 0.61 0.60 K-Nearest Neighbors 0.89 0.67 Decision Tree Classifier 1.00 0.82 6. Conclusion The pursuit of effective fraud detection involves navigating the intricate landscape of imbalanced data, comprehensive feature selection, and rigorous model evaluation. In our exploration of fraud detection using the "Credit Card Fraud Detection" dataset, several key insights have emerged, shaping a comprehensive understanding of the factors influencing fraudulent activities. The initial acknowledgment of the imbalanced distribution of fraud and non-fraud instances emphasizes the necessity for nuanced analytical approaches. With fraudulent transactions constituting a mere 7% of the dataset, the challenge lies in developing models attuned to the subtleties that characterize these rare occurrences. A meticulous feature selection process was undertaken to refine the dataset, addressing missing values, consolidating related columns, and pruning highly correlated attributes. This strategic curation not only enhances the quality and interpretability of the dataset but also lays a robust foundation for subsequent model training. Our granular analysis of positive targets within categorical attributes provided a nuanced view of the model's efficacy across diverse subgroups. The findings unveiled geographic discrepancies, family dynamics, socioeconomic influences, loan type disparities, gender disparities, and various socio-demographic factors influencing fraud likelihood. The exploration of continuous variables' correlation with the target variable further enriched our understanding, highlighting influential features that contribute to the target outcome. Factors such as academic background, employment status, and industry sectors demonstrated varying degrees of impact on fraud likelihood. The culmination of our efforts involved training four classification models, with the Decision Tree Classifier emerging as the standout performer with an impressive test accuracy of 0.82. The model's adeptness in capturing complex relationships within the data positions it as a strong candidate for further refinement and deployment in practical fraud detection scenarios. In summary, our comprehensive approach, from data preprocessing to model evaluation, has provided valuable insights into the dynamics of fraud within financial transactions. By addressing the challenges posed by imbalanced data and employing sophisticated analytical techniques, our findings pave the way for the development of robust fraud detection systems capable of navigating the intricacies of real-world financial landscapes.