{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Y7yjuUwxtdIg" }, "source": [ "# Predicting Social Media Virality with K-Means\n", "\n", "## Background\n", "\n", "Artificial intelligence is widely used across industries to automate processes, gather business insights, and speed up operations. You will use Python to study how artificial intelligence is applied in real-life scenarios and how it actually affects industries.\n", "\n", "Social media is now part of everyday life, and artificial intelligence can be used effectively to analyze social media trends.\n", "\n", "In this notebook, we will focus on how to use a K-Means model to predict the virality of social media posts.\n", "\n", "## Context\n", "\n", "We will be working with a dataset of articles published by Mashable (a popular social article sharing platform), available from [UCI](http://archive.ics.uci.edu/ml/datasets/Online+News+Popularity). We will divide the articles into clusters using a K-Means model so that articles within the same cluster are likely to have similar popularity.\n", "\n", "\n", "### Key concept: K-Means\n", "\n", "K-Means is a simple clustering algorithm that partitions a dataset into k groups (called clusters) by assigning each object to the cluster with the nearest centroid, so that objects in the same cluster are more similar to each other than to objects in other clusters.\n", "\n", "\n", "## Opening the CSV file\n", "\n", "We will use [scikit-learn](https://scikit-learn.org/stable/) and [pandas](https://pandas.pydata.org/) to work with our dataset. Scikit-learn is a machine learning library that provides efficient tools for predictive data analysis. Pandas is a popular Python library for data science. 
It offers powerful and flexible data structures to make data manipulation and analysis easier.\n", "\n", "\n", "## Importing modules\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 71 }, "colab_type": "code", "id": "c54ZY1leww-2", "outputId": "ef3ed5c1-e7f5-423b-d197-373d7dcbd3c2" }, "outputs": [], "source": [ "# Data handling and numerics\n", "import pandas as pd\n", "import numpy as np\n", "# Plotting\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "import seaborn as sns\n", "# Pass context and style by keyword so the call does not depend on argument order\n", "sns.set(context=\"talk\", style=\"darkgrid\", font_scale=1, font=\"sans-serif\", color_codes=True)\n", "# Clustering, dimensionality reduction, and evaluation metrics\n", "from sklearn import metrics\n", "from sklearn.decomposition import PCA\n", "from sklearn.cluster import KMeans" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ra_0mvhQxF41" }, "source": [ "### Loading the dataset\n", "\n", "The dataset contains a set of Mashable articles. Let us take a first look at it.\n", "\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 309 }, "colab_type": "code", "id": "483s82z9xFIw", "outputId": "c4c2dfe4-b80f-4f17-e718-71bca7ef0495" }, "outputs": [ { "data": { "text/html": [ "
\n",
"|  | url | timedelta | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | ... | min_positive_polarity | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | shares |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | http://mashable.com/2013/01/07/amazon-instant-... | 731.0 | 12.0 | 219.0 | 0.663594 | 1.0 | 0.815385 | 4.0 | 2.0 | 1.0 | ... | 0.100000 | 0.7 | -0.350000 | -0.600 | -0.200000 | 0.500000 | -0.187500 | 0.000000 | 0.187500 | 593 |\n",
"| 1 | http://mashable.com/2013/01/07/ap-samsung-spon... | 731.0 | 9.0 | 255.0 | 0.604743 | 1.0 | 0.791946 | 3.0 | 1.0 | 1.0 | ... | 0.033333 | 0.7 | -0.118750 | -0.125 | -0.100000 | 0.000000 | 0.000000 | 0.500000 | 0.000000 | 711 |\n",
"| 2 | http://mashable.com/2013/01/07/apple-40-billio... | 731.0 | 9.0 | 211.0 | 0.575130 | 1.0 | 0.663866 | 3.0 | 1.0 | 1.0 | ... | 0.100000 | 1.0 | -0.466667 | -0.800 | -0.133333 | 0.000000 | 0.000000 | 0.500000 | 0.000000 | 1500 |\n",
"| 3 | http://mashable.com/2013/01/07/astronaut-notre... | 731.0 | 9.0 | 531.0 | 0.503788 | 1.0 | 0.665635 | 9.0 | 0.0 | 1.0 | ... | 0.136364 | 0.8 | -0.369697 | -0.600 | -0.166667 | 0.000000 | 0.000000 | 0.500000 | 0.000000 | 1200 |\n",
"| 4 | http://mashable.com/2013/01/07/att-u-verse-apps/ | 731.0 | 13.0 | 1072.0 | 0.415646 | 1.0 | 0.540890 | 19.0 | 19.0 | 20.0 | ... | 0.033333 | 1.0 | -0.220192 | -0.500 | -0.050000 | 0.454545 | 0.136364 | 0.045455 | 0.136364 | 505 |\n",
"\n",
"5 rows × 61 columns\n", "