Exploring the MovieLens 100k dataset with SGD, autograd, and the surprise package.

Photo by Charles Deluvio on Unsplash

By Gavin Smith and XuanKhanh Nguyen

This was the third project for my machine learning class this semester. The project aims to train a machine learning algorithm on the MovieLens 100k dataset for movie recommendation by optimizing the model's predictive power. We were given a clean, preprocessed version of the MovieLens 100k dataset with 943 users' ratings of 1682 movies. The input to our prediction system is a (user id, movie id) pair. Our predictor's output is a scalar rating y in the range 1 to 5, where 1 is the worst possible rating and 5 is the best. Our main task is to predict the ratings of all user-movie pairs. …
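To make the input and output of such a predictor concrete, here is a minimal sketch of one common approach, matrix factorization trained with SGD. It is not necessarily the model used in the project; the toy ratings, the hyperparameters (`lr`, `reg`, `n_factors`), and the function names are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy ratings as (user_id, movie_id, rating) triples; not MovieLens data.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_movies, n_factors = 3, 3, 2

rng = np.random.default_rng(0)
mu = np.mean([r for _, _, r in ratings])               # global average rating
U = 0.1 * rng.standard_normal((n_users, n_factors))    # user factor vectors
V = 0.1 * rng.standard_normal((n_movies, n_factors))   # movie factor vectors


def predict(user_id, movie_id):
    """Predict a rating for one (user id, movie id) pair, clipped to [1, 5]."""
    return float(np.clip(mu + U[user_id] @ V[movie_id], 1.0, 5.0))


def mse():
    """Mean squared error of the unclipped predictions on the toy ratings."""
    return float(np.mean([(r - (mu + U[u] @ V[m])) ** 2 for u, m, r in ratings]))


initial_error = mse()
lr, reg = 0.05, 0.01  # assumed learning rate and L2 regularization strength
for epoch in range(200):
    for u, m, r in ratings:
        err = r - (mu + U[u] @ V[m])
        u_old = U[u].copy()  # keep the old user vector for the movie update
        U[u] += lr * (err * V[m] - reg * U[u])
        V[m] += lr * (err * u_old - reg * V[m])
final_error = mse()

print(initial_error, final_error, predict(0, 0))
```

Each SGD step visits one observed rating and nudges only that user's and that movie's factor vectors, which is why the method scales to sparse rating data.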


Bag-of-words feature representations and word embedding feature representations.

By Gavin Smith and XuanKhanh Nguyen

This was the second project for my machine learning class this semester. We were given a dataset of several thousand single-sentence reviews collected from three domains: imdb.com, amazon.com, and yelp.com. Each review consists of a sentence and a binary label indicating the sentence's sentiment (1 for positive feelings; 0 for negative feelings). All the provided reviews in the training and test sets were scraped from websites whose assumed audience is primarily English speakers. There are 2400 (input, output) pairs in the training set with 4510 unique words, and 600 inputs in the test set with 1921 unique words. Our main task is to develop a binary classifier that can correctly identify a new sentence's sentiment. …
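The bag-of-words representation named in the title can be sketched in a few lines. This is only an illustration: the two sentences are made up, and the whitespace tokenizer is an assumption rather than the preprocessing used in the project.

```python
from collections import Counter

# Made-up example sentences; not taken from the review dataset.
sentences = ["the movie was great", "the food was not great"]


def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return sentence.lower().split()


# The vocabulary is the sorted set of unique words across all sentences.
vocabulary = sorted({word for s in sentences for word in tokenize(s)})


def bag_of_words(sentence):
    """Return one count per vocabulary word, ignoring word order."""
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocabulary]


vectors = [bag_of_words(s) for s in sentences]
print(vocabulary)
print(vectors)
```

Every sentence becomes a vector whose length equals the number of unique words, so a training set with 4510 unique words would produce 4510-dimensional feature vectors under this scheme.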


Exploring the MNIST and Fashion MNIST datasets with Logistic Regression and Random Forest

By Gavin Smith and XuanKhanh Nguyen

This semester, I took a Machine Learning class at Tufts University. It was one of my favorite data science courses so far. It taught me how to tell whether machine learning is actually solving a problem. Most importantly, it made me a better data scientist.

We were given three projects throughout the semester. Each project has a structured problem and an open-ended problem. The open-ended specification you could imagine. …


Choose the correct graph or chart style for the task you want your audience to accomplish

Photo by Morgan Housel on Unsplash

According to the World Economic Forum, the world produces 2.5 quintillion bytes of data every day. With so much data, it’s become increasingly difficult to manage and make sense of it all. It would be impossible for any person to wade through data line-by-line and see distinct patterns and make observations.

Data visualization is one step in the data science process, a framework for approaching data science tasks. After data is collected, processed, and modeled, the relationships need to be visualized so that conclusions can be drawn.

We use data visualization as a technique to communicate insights from data through visual representation. Our main goal is to distill large datasets into visual graphics to allow for a straightforward understanding of complex relationships within the data. …


Exploratory Data Analysis on the World Happiness Report.

Photo by Freddy Do on Unsplash

What is the purpose of life? Is it to be happy? Why do people go through all the pain and hardship? Is it to achieve happiness in some way?

I’m not the only person who believes the purpose of life is happiness. If you look around, most people are pursuing happiness in their lives.

On March 20th, the world celebrates the International Day of Happiness. The 2020 report ranked 156 countries by how happy their citizens perceive themselves to be, based on their evaluations of their own lives. The rankings of national happiness are based on a Cantril ladder survey. Nationally representative samples of respondents are asked to think of a ladder, with the best possible life for them being a 10 and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale. The report correlates the results with various life factors. …


A simplified explanation of the two traversal algorithms.

Photo by Christian Lambert on Unsplash

When it comes to learning, there are generally two approaches: we can go wide and try to cover as much of the spectrum of a field as possible, or we can go deep and try to get specific with the topic that we are learning. Most good learners know that, to some extent, everything we learn in life — from algorithms to necessary life skills — involves some combination of these two approaches. …


When to use supervised learning or unsupervised learning?

Photo by Julian O’hayon on Unsplash

If we don’t know the objective of a machine learning algorithm, we may fail to build an accurate model. Knowing the types of machine learning algorithms is essential: it helps us see the bigger picture of the field and its goals, and it puts us in a better position to break down a real problem and design a machine learning system.

The goal of most machine learning algorithms is to construct a model or a hypothesis. Machine learning models are commonly categorized as either supervised or unsupervised. …
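The difference between the two categories can be illustrated with a small sketch. The toy data, the nearest-centroid classifier, and the simple k-means loop below are assumptions chosen for brevity, not methods named in the article; the point is only that the supervised model is fitted with labels `y`, while the unsupervised one sees `X` alone.

```python
import numpy as np

# Toy data: two well-separated groups of 2-D points (made up for illustration).
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, used only by the supervised model


def fit_nearest_centroid(X, y):
    """Supervised: use the labels to compute one centroid per class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict_nearest_centroid(model, X_new):
    """Assign each new point the class of its closest centroid."""
    classes, centroids = model
    dists = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


def kmeans(X, k, n_iter=10):
    """Unsupervised: group the points without ever seeing labels."""
    # Deterministic start: pick k points spread evenly through the array.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assignment == j):  # leave a center alone if its cluster is empty
                centers[j] = X[assignment == j].mean(axis=0)
    return assignment


model = fit_nearest_centroid(X, y)
predicted = predict_nearest_centroid(model, np.array([[1.0, 1.0], [5.0, 5.0]]))
clusters = kmeans(X, k=2)
print(predicted, clusters)
```

The cluster ids returned by `kmeans` are arbitrary names for groups, whereas the supervised predictions are the actual class labels it was trained on.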


Basic plots, including code samples.

Photo by Giorgio Trovato on Unsplash

Matplotlib is a plotting library for the Python programming language. Its most used module is Pyplot, which provides a MATLAB-like interface but uses Python and is open source.
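As a rough sketch of what working with Pyplot looks like, the snippet below draws a single line plot. The data, labels, and the choice to render into an in-memory buffer are assumptions made for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

# Made-up data: the squares of a few integers.
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots()                 # one Figure containing one Axes
ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A first Pyplot figure")
ax.legend()

# Render to an in-memory PNG; pass a filename instead to save it to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

In an interactive session you would typically drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving.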

In this note, we will focus on basic Matplotlib to help visualize our data. This is not a comprehensive list but contains common types of data visualization formats. Let’s hop to it!

The structure of this note:

  1. Anatomy of Matplotlib Figure
  2. Start with Pyplot
  3. Chart Types

Anatomy of Matplotlib Figure


How to perform multiple linear regression in Python using sklearn?

Photo by Isaac Benhesed on Unsplash

Linear regression is a standard statistical data analysis technique. We use linear regression to determine the direct relationship between a dependent variable and one or more independent variables. The dependent variable must be measured on a continuous measurement scale, and the independent variable(s) can be measured on either a categorical or continuous measurement scale.

In linear regression, we want to draw a line that comes closest to the data by finding the slope and intercept, which define the line and minimize regression errors. There are two types of linear regression: simple linear regression and multiple linear regression. …
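The idea of finding the slopes and intercept that minimize the regression error can be sketched with ordinary least squares. To keep the example dependency-light it uses NumPy's `np.linalg.lstsq` rather than sklearn; the data is made up and generated exactly from y = 1 + 2·x1 + 3·x2, so the true coefficients are known.

```python
import numpy as np

# Made-up data with two independent variables (multiple linear regression).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]  # noiseless target, for illustration

# Prepend a column of ones so the first coefficient acts as the intercept.
A = np.column_stack([np.ones(len(X)), X])

# Solve min ||A @ coef - y||^2 for coef.
coef, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)
intercept, slopes = coef[0], coef[1:]

print(intercept, slopes)
```

With a single column in `X` the same code performs simple linear regression; with real, noisy data the recovered coefficients would only approximate the underlying relationship.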


A quick guide to the basics of the Python NumPy library, including code samples.

Photo by Charles Deluvio on Unsplash

NumPy is the library that gives Python its ability to work with data at speed. NumPy has several advantages for data cleaning and manipulation. It allows for efficient operations on the data structures often used in machine learning: vectors, matrices, and tensors.
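For a quick feel of those three data structures, here is a small sketch; the values are arbitrary and chosen only for illustration.

```python
import numpy as np

vector = np.array([1, 2, 3])                # 1-D array, shape (3,)
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # 2-D array, shape (2, 3)
tensor = np.zeros((2, 3, 4))                # 3-D array, shape (2, 3, 4)

# Operations apply to whole arrays at once, with no explicit Python loop.
doubled = vector * 2        # element-wise multiplication
product = matrix @ vector   # matrix-vector product

print(vector.shape, matrix.shape, tensor.shape)
print(doubled, product)
```

The same `ndarray` type backs all three; only the number of dimensions changes.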

When I first learned NumPy, I had trouble remembering all the functions and methods that I needed, so I put together the most frequently used NumPy operations. I sometimes come back to this note to refresh my memory, and I would be glad if it helps you on your journey too.

The structure of this note:

  1. N-dimensional arrays
  2. Array shape…

About

XuanKhanh Nguyen

Interests: Data Science, Machine Learning, AI, Stats, Python | Minimalist | A fan of odd things.
