
Prerequisites:

The result is quite faithful. To follow along, you need:

  • An open Colab notebook
  • Basic Python knowledge
  • An internet connection
  • The URL of the API you want to use

1 - Importing the libraries:

Two libraries are needed here: pandas and NumPy.

import pandas as pd
import numpy as np

Here I open the CSV obtained from Kaggle and rank the rows by song popularity.

df = pd.read_csv('spotify.csv', index_col=0)
df.sort_values('song_popularity', ascending=False, inplace=True)
df.head(5)
song_name song_popularity song_duration_ms acousticness danceability energy instrumentalness key liveness loudness audio_mode speechiness tempo time_signature audio_valence
1757 Party In The U.S.A. nao_sei 0.8220000000000001kg 0.519mol/L 0.36 0.0 10 0.177 -8.575 0 0.105 97.42 4 0.7 NaN
7574 I Love It (& Lil Pump) 99 127946 0.0114kg 0.901mol/L 0.522 0.0 2.000 0.259 -8.304 1 0.33 104.053 4 0.329
11777 I Love It (& Lil Pump) 99 127946 0.0114kg 0.901mol/L 0.522 0.0 2.000 0.259 -8.304 1 0.33 104.053 4 0.329
4301 I Love It (& Lil Pump) 99 127946 0.0114kg 0.901mol/L 0.522 0.0 2.000 0.259 -8.304 1 0.33 104.053 4 0.329
14444 I Love It (& Lil Pump) 99 127946 0.0114kg 0.901mol/L 0.522 0.0 2.000 0.259 -8.304 1 0.33 104.053 4 0.329
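
One thing the output above hints at: at this point song_popularity is still stored as text, so the sort is lexicographic (which is why 'nao_sei' ranks first and '99' comes before any '100'). The sketch below reproduces that effect on a small made-up frame; the values and the helper column popularity_num are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy data (hypothetical values): popularity stored as text, as in the raw CSV.
toy = pd.DataFrame({
    "song_name": ["a", "b", "c", "d"],
    "song_popularity": ["99", "100", "7", "nao_sei"],
})

# Sorting strings compares character by character.
as_text = toy.sort_values("song_popularity", ascending=False)

# Converting first (unparseable entries become NaN) gives a numeric ranking.
toy["popularity_num"] = pd.to_numeric(toy["song_popularity"], errors="coerce")
as_number = toy.sort_values("popularity_num", ascending=False)

print(as_text["song_popularity"].tolist())    # ['nao_sei', '99', '7', '100']
print(as_number["song_popularity"].tolist())  # ['100', '99', '7', 'nao_sei']
```

If a numeric ranking is what you want, it is probably safer to sort again after the type conversions in section 5.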

2 - Inspecting the dataset:

print(df.shape)
(18835, 15)
df.head(5)
song_name song_popularity song_duration_ms acousticness danceability energy instrumentalness key liveness loudness audio_mode speechiness tempo time_signature audio_valence
1757 Party In The U.S.A. nao_sei 0.8220000000000001kg 0.519mol/L 0.36 0.0 10 0.177 -8.575 0 0.105 97.42 4 0.7 NaN
7574 I Love It (& Lil Pump) 99 127946 0.0114kg 0.901mol/L 0.522 0.0 2.000 0.259 -8.304 1 0.33 104.053 4 0.329
11777 I Love It (& Lil Pump) 99 127946 0.0114kg 0.901mol/L 0.522 0.0 2.000 0.259 -8.304 1 0.33 104.053 4 0.329
4301 I Love It (& Lil Pump) 99 127946 0.0114kg 0.901mol/L 0.522 0.0 2.000 0.259 -8.304 1 0.33 104.053 4 0.329
14444 I Love It (& Lil Pump) 99 127946 0.0114kg 0.901mol/L 0.522 0.0 2.000 0.259 -8.304 1 0.33 104.053 4 0.329
## Doc - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html
## Checking data types and non-null values
## Initially we have almost no null data: only audio_valence has one missing value
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18835 entries, 1757 to 9956
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   song_name         18835 non-null  object 
 1   song_popularity   18835 non-null  object 
 2   song_duration_ms  18835 non-null  object 
 3   acousticness      18835 non-null  object 
 4   danceability      18835 non-null  object 
 5   energy            18835 non-null  object 
 6   instrumentalness  18835 non-null  object 
 7   key               18835 non-null  float64
 8   liveness          18835 non-null  object 
 9   loudness          18835 non-null  object 
 10  audio_mode        18835 non-null  object 
 11  speechiness       18835 non-null  object 
 12  tempo             18835 non-null  object 
 13  time_signature    18835 non-null  object 
 14  audio_valence     18834 non-null  float64
dtypes: float64(2), object(13)
memory usage: 2.3+ MB
## Doc - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html?highlight=describe#pandas.DataFrame.describe
## Only two columns show up here because the others are stored as object, so describe() cannot compute the numeric aggregations for them.
df.describe()
key audio_valence
count 18835.000000 18834.000000
mean 5.288674 0.527958
std 3.614624 0.244635
min 0.000000 0.000000
25% 2.000000 0.335000
50% 5.000000 0.526500
75% 8.000000 0.725000
max 11.000000 0.984000
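
To make the behaviour of describe() concrete, here is a minimal sketch on made-up data (the toy frame is an assumption): by default only numeric columns are summarised, while include="object" summarises the text columns with count, unique, top and freq.

```python
import pandas as pd

# Toy frame: one real float column and one numeric-looking column stored as text.
toy = pd.DataFrame({
    "key": [2.0, 5.0, 11.0],
    "tempo": pd.Series(["104.053", "95.948", "nao_sei"], dtype="object"),
})

numeric_summary = toy.describe()               # only the float column appears
text_summary = toy.describe(include="object")  # summary for object columns

print(numeric_summary.columns.tolist())  # ['key']
print(text_summary.index.tolist())       # ['count', 'unique', 'top', 'freq']
```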

3 - Removing duplicates:

duplicados = df[df.duplicated()]
print(duplicados)
                               song_name  ... audio_valence
11777             I Love It (& Lil Pump)  ...         0.329
4301              I Love It (& Lil Pump)  ...         0.329
14444             I Love It (& Lil Pump)  ...         0.329
1229              I Love It (& Lil Pump)  ...         0.329
3443              I Love It (& Lil Pump)  ...         0.329
...                                  ...  ...           ...
14292  Get Dripped (feat. Playboi Carti)  ...         0.904
7273                       John Madden 2  ...         0.409
6514                        THIS OLE BOY  ...         0.764
14312    Transformer (feat. Nicki Minaj)  ...         0.287
7275                     Prince Charming  ...         0.605

[3903 rows x 15 columns]
## Example for a scenario where many values may repeat, but the combination that must not repeat is defined by two specific keys.
print(df[df.duplicated(subset=['song_name','audio_valence'])])
                             song_name  ... audio_valence
11777           I Love It (& Lil Pump)  ...        0.3290
4301            I Love It (& Lil Pump)  ...        0.3290
14444           I Love It (& Lil Pump)  ...        0.3290
1229            I Love It (& Lil Pump)  ...        0.3290
3443            I Love It (& Lil Pump)  ...        0.3290
...                                ...  ...           ...
7273                     John Madden 2  ...        0.4090
6514                      THIS OLE BOY  ...        0.7640
14312  Transformer (feat. Nicki Minaj)  ...        0.2870
7275                   Prince Charming  ...        0.6050
7939                           99 Pace  ...        0.0689

[4161 rows x 15 columns]
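
The difference between the two counts above (3903 full-row duplicates versus 4161 duplicates on the two keys) can be seen on a small made-up frame. This is only a sketch; the toy values are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "song_name": ["A", "A", "A", "B"],
    "audio_valence": [0.3, 0.3, 0.3, 0.9],
    "tempo": [100.0, 100.0, 120.0, 90.0],
})

# A row is flagged only if every column matches an earlier row.
full_row_dupes = toy.duplicated()

# With subset, only the listed columns have to match, so more rows are flagged.
key_dupes = toy.duplicated(subset=["song_name", "audio_valence"])

# keep="first" (the default) retains the first occurrence of each key pair.
deduped = toy.drop_duplicates(subset=["song_name", "audio_valence"], keep="first")

print(full_row_dupes.tolist())  # [False, True, False, False]
print(key_dupes.tolist())       # [False, True, True, False]
print(deduped.index.tolist())   # [0, 3]
```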
## Doc - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
df.drop_duplicates(inplace=True) 
print(df.shape)
df.head(5)
(14932, 15)
song_name song_popularity song_duration_ms acousticness danceability energy instrumentalness key liveness loudness audio_mode speechiness tempo time_signature audio_valence
1757 Party In The U.S.A. nao_sei 0.8220000000000001kg 0.519mol/L 0.36 0.0 10 0.177 -8.575 0 0.105 97.42 4 0.7 NaN
7574 I Love It (& Lil Pump) 99 127946 0.0114kg 0.901mol/L 0.522 0.0 2.000 0.259 -8.304 1 0.33 104.053 4 0.329
17588 Taki Taki (with Selena Gomez, Ozuna & Cardi B) 98 212500 0.153kg 0.841mol/L 0.7979999999999999 3.33e-06 1.000 0.0618 -4.206 0 0.229 95.948 4 0.591
17394 Promises (with Sam Smith) 98 213309 0.0119kg 0.7809999999999999mol/L 0.768 4.91e-06 11.000 0.325 -5.9910000000000005 1 0.0394 123.07 4 0.486
12665 Eastside (with Halsey & Khalid) 98 173799 0.555kg 0.56mol/L 0.68 0.0 6.000 0.116 -7.648 0 0.321 89.391 4 0.319

4 - Validating consistency:

As we saw earlier, some fields that should be numeric contain text, and the text does not even match the column name: here we find the units kg and mol/L.

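def remove_text(df, columns, text):
    # Remove the literal unit suffix (e.g. 'kg') from each column.
    # str.replace with regex=False matches the exact substring, whereas
    # str.strip(text) would treat text as a set of characters to trim.
    for col in columns:
        df[col] = df[col].str.replace(text, '', regex=False)
remove_text(df, ['acousticness', 'danceability'], 'mol/L')
remove_text(df, ['song_duration_ms', 'acousticness'], 'kg')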
df.head(5)
song_name song_popularity song_duration_ms acousticness danceability energy instrumentalness key liveness loudness audio_mode speechiness tempo time_signature audio_valence
1757 Party In The U.S.A. nao_sei 0.8220000000000001 0.519 0.36 0.0 10 0.177 -8.575 0 0.105 97.42 4 0.7 NaN
7574 I Love It (& Lil Pump) 99 127946 0.0114 0.901 0.522 0.0 2.000 0.259 -8.304 1 0.33 104.053 4 0.329
17588 Taki Taki (with Selena Gomez, Ozuna & Cardi B) 98 212500 0.153 0.841 0.7979999999999999 3.33e-06 1.000 0.0618 -4.206 0 0.229 95.948 4 0.591
17394 Promises (with Sam Smith) 98 213309 0.0119 0.7809999999999999 0.768 4.91e-06 11.000 0.325 -5.9910000000000005 1 0.0394 123.07 4 0.486
12665 Eastside (with Halsey & Khalid) 98 173799 0.555 0.56 0.68 0.0 6.000 0.116 -7.648 0 0.321 89.391 4 0.319
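
One detail worth keeping in mind when removing unit suffixes: Series.str.strip(text) treats its argument as a set of characters to trim from both ends, not as a literal substring. For numeric strings the result is the same, but the sketch below (made-up values) shows where the two approaches diverge.

```python
import pandas as pd

values = pd.Series(["0.519mol/L", "0.0114kg", "lol"])

# str.strip removes any of the characters m, o, l, /, L from both ends.
stripped = values.str.strip("mol/L")

# str.replace with regex=False removes only the literal substring "mol/L".
replaced = values.str.replace("mol/L", "", regex=False)

print(stripped.tolist())  # ['0.519', '0.0114kg', '']
print(replaced.tolist())  # ['0.519', '0.0114kg', 'lol']
```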

5 - Data type conversions:

## Doc - https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html?highlight=astype#pandas.Series.astype
def to_type(df, columns, type):
    for col in columns:
        print(col)
        df[col] = df[col].astype(type)

numerical_cols = ['song_duration_ms', 'acousticness', 'danceability',
                  'energy', 'instrumentalness', 'liveness', 'loudness',
                  'speechiness', 'tempo', 'audio_valence']
 
categorical_cols = ['song_popularity', 'key', 'audio_mode', 'time_signature']

to_type(df, numerical_cols, 'float')
to_type(df, categorical_cols, 'category')
song_duration_ms
acousticness
danceability
energy
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-8fa9f2911c24> in <module>()
     12 categorical_cols = ['song_popularity', 'key', 'audio_mode', 'time_signature']
     13 
---> 14 to_type(df, numerical_cols, 'float')
     15 to_type(df, categorical_cols, 'category')

<ipython-input-15-8fa9f2911c24> in to_type(df, columns, type)
      4     for col in columns:
      5         print(col)
----> 6         df[col] = df[col].astype(type)
      7 
      8 numerical_cols = ['song_duration_ms', 'acousticness', 'danceability',

/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5546         else:
   5547             # else, only a single dtype is given
-> 5548             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
   5549             return self._constructor(new_data).__finalize__(self, method="astype")
   5550 

/usr/local/lib/python3.7/dist-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    602         self, dtype, copy: bool = False, errors: str = "raise"
    603     ) -> "BlockManager":
--> 604         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    605 
    606     def convert(

/usr/local/lib/python3.7/dist-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, **kwargs)
    407                 applied = b.apply(f, **kwargs)
    408             else:
--> 409                 applied = getattr(b, f)(**kwargs)
    410             result_blocks = _extend_blocks(applied, result_blocks)
    411 

/usr/local/lib/python3.7/dist-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    593             vals1d = values.ravel()
    594             try:
--> 595                 values = astype_nansafe(vals1d, dtype, copy=True)
    596             except (ValueError, TypeError):
    597                 # e.g. astype_nansafe can fail on object-dtype of strings

/usr/local/lib/python3.7/dist-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    995     if copy or is_object_dtype(arr) or is_object_dtype(dtype):
    996         # Explicit copy, or required since NumPy can't view from / to object.
--> 997         return arr.astype(dtype, copy=True)
    998 
    999     return arr.view(dtype)

ValueError: could not convert string to float: 'nao_sei'
df = df.replace(['nao_sei'], np.nan)
to_type(df, numerical_cols, 'float')
to_type(df, categorical_cols, 'category')
song_duration_ms
acousticness
danceability
energy
instrumentalness
liveness
loudness
speechiness
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-60fe30a932ff> in <module>()
----> 1 to_type(df, numerical_cols, 'float')
      2 to_type(df, categorical_cols, 'category')

<ipython-input-15-8fa9f2911c24> in to_type(df, columns, type)
      4     for col in columns:
      5         print(col)
----> 6         df[col] = df[col].astype(type)
      7 
      8 numerical_cols = ['song_duration_ms', 'acousticness', 'danceability',

/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5546         else:
   5547             # else, only a single dtype is given
-> 5548             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
   5549             return self._constructor(new_data).__finalize__(self, method="astype")
   5550 

/usr/local/lib/python3.7/dist-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    602         self, dtype, copy: bool = False, errors: str = "raise"
    603     ) -> "BlockManager":
--> 604         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    605 
    606     def convert(

/usr/local/lib/python3.7/dist-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, **kwargs)
    407                 applied = b.apply(f, **kwargs)
    408             else:
--> 409                 applied = getattr(b, f)(**kwargs)
    410             result_blocks = _extend_blocks(applied, result_blocks)
    411 

/usr/local/lib/python3.7/dist-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    593             vals1d = values.ravel()
    594             try:
--> 595                 values = astype_nansafe(vals1d, dtype, copy=True)
    596             except (ValueError, TypeError):
    597                 # e.g. astype_nansafe can fail on object-dtype of strings

/usr/local/lib/python3.7/dist-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    995     if copy or is_object_dtype(arr) or is_object_dtype(dtype):
    996         # Explicit copy, or required since NumPy can't view from / to object.
--> 997         return arr.astype(dtype, copy=True)
    998 
    999     return arr.view(dtype)

ValueError: could not convert string to float: '0.nao_sei'
df['speechiness'] = df['speechiness'].replace(['0.nao_sei'], np.nan)
to_type(df, numerical_cols, 'float')
to_type(df, categorical_cols, 'category')
song_duration_ms
acousticness
danceability
energy
instrumentalness
liveness
loudness
speechiness
tempo
audio_valence
song_popularity
key
audio_mode
time_signature
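
The conversion above was reached by trial and error, one ValueError at a time. As an alternative, here is a minimal sketch that lists every unparseable value up front using pd.to_numeric with errors="coerce". The helper name find_unparseable and the toy frame are assumptions, not part of the original notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "energy": ["0.522", "nao_sei", "0.68"],
    "speechiness": ["0.33", "0.229", "0.nao_sei"],
})

def find_unparseable(frame, columns):
    """Return, per column, the distinct non-null values that fail numeric parsing."""
    bad = {}
    for col in columns:
        parsed = pd.to_numeric(frame[col], errors="coerce")
        # Values that were present but became NaN after parsing are the offenders.
        mask = parsed.isna() & frame[col].notna()
        if mask.any():
            bad[col] = sorted(frame.loc[mask, col].unique())
    return bad

bad_values = find_unparseable(toy, ["energy", "speechiness"])
print(bad_values)  # {'energy': ['nao_sei'], 'speechiness': ['0.nao_sei']}

# Once the offenders are known, the same call converts them to NaN in one pass.
for col in ["energy", "speechiness"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")
```

Inspecting the offending values before coercing them keeps the cleaning step explicit: you still decide what counts as invalid instead of silently turning everything unparseable into NaN.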
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14932 entries, 1757 to 9956
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   song_name         14932 non-null  object  
 1   song_popularity   14931 non-null  category
 2   song_duration_ms  14932 non-null  float64 
 3   acousticness      14932 non-null  float64 
 4   danceability      14932 non-null  float64 
 5   energy            14931 non-null  float64 
 6   instrumentalness  14930 non-null  float64 
 7   key               14932 non-null  category
 8   liveness          14928 non-null  float64 
 9   loudness          14931 non-null  float64 
 10  audio_mode        14931 non-null  category
 11  speechiness       14931 non-null  float64 
 12  tempo             14931 non-null  float64 
 13  time_signature    14931 non-null  category
 14  audio_valence     14931 non-null  float64 
dtypes: category(4), float64(10), object(1)
memory usage: 1.4+ MB
## One way to validate is to check the number of elements in each category.
for col in categorical_cols:
  print(f'{col}')
  print(df[col].value_counts().sort_values())
song_popularity
99       1
100      1
98       4
97       4
96       5
      ... 
54     324
53     325
55     345
58     347
52     355
Name: song_popularity, Length: 101, dtype: int64
key
0.177       1
3.0       433
10.0     1045
8.0      1047
6.0      1048
4.0      1084
11.0     1223
5.0      1257
2.0      1399
9.0      1410
1.0      1596
7.0      1654
0.0      1735
Name: key, dtype: int64
audio_mode
0.105       1
0        5496
1        9434
Name: audio_mode, dtype: int64
time_signature
2800000000        1
0.7               1
0                 3
1                67
5               195
3               684
4             13980
Name: time_signature, dtype: int64
df['key'] = df['key'].replace([0.177], np.nan)
df['audio_mode'] = df['audio_mode'].replace(['0.105'], np.nan)
df['time_signature'] = df['time_signature'].replace(['0.7', '2800000000'], np.nan)
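
Instead of listing each invalid value by hand, another option is to keep only an allowed set of values and turn everything else into NaN. The sketch below assumes the valid domain for audio_mode is '0' and '1', which is a guess based on the counts above rather than something the dataset documents here.

```python
import pandas as pd

audio_mode = pd.Series(["0", "1", "0.105", "1"])
valid_modes = ["0", "1"]  # assumed domain, adjust to the real specification

# where() keeps values for which the condition is True and sets the rest to NaN.
cleaned = audio_mode.where(audio_mode.isin(valid_modes)).astype("category")

print(cleaned.isna().sum())              # 1
print(cleaned.cat.categories.tolist())   # ['0', '1']
```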

From this point on, we have a dataset with a minimum of consistency and no duplicate rows.

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14932 entries, 1757 to 9956
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   song_name         14932 non-null  object  
 1   song_popularity   14931 non-null  category
 2   song_duration_ms  14932 non-null  float64 
 3   acousticness      14932 non-null  float64 
 4   danceability      14932 non-null  float64 
 5   energy            14931 non-null  float64 
 6   instrumentalness  14930 non-null  float64 
 7   key               14931 non-null  category
 8   liveness          14928 non-null  float64 
 9   loudness          14931 non-null  float64 
 10  audio_mode        14930 non-null  category
 11  speechiness       14931 non-null  float64 
 12  tempo             14931 non-null  float64 
 13  time_signature    14929 non-null  category
 14  audio_valence     14931 non-null  float64 
dtypes: category(4), float64(10), object(1)
memory usage: 1.4+ MB
df.isna().sum()
song_name           0
song_popularity     1
song_duration_ms    0
acousticness        0
danceability        0
energy              1
instrumentalness    2
key                 1
liveness            4
loudness            1
audio_mode          2
speechiness         1
tempo               1
time_signature      3
audio_valence       1
dtype: int64
df[df[numerical_cols]<0].count()
song_name               0
song_popularity         0
song_duration_ms        1
acousticness            0
danceability            0
energy                  0
instrumentalness        0
key                     0
liveness                1
loudness            14923
audio_mode              0
speechiness             0
tempo                   0
time_signature          0
audio_valence           0
dtype: int64
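
The same check can be written as a boolean comparison followed by sum(), which counts the True values per column. In the output above, loudness is negative in almost every row, so negative values are evidently normal there, while the single negative song_duration_ms and liveness look suspicious. A small sketch with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "song_duration_ms": [127946.0, -1.0, 212500.0],
    "loudness": [-8.304, -4.206, -5.991],
    "tempo": [104.053, 95.948, 123.07],
})

# Comparing the frame to 0 yields booleans; summing counts the True values.
negatives = (toy < 0).sum()

print(negatives.to_dict())  # {'song_duration_ms': 1, 'loudness': 3, 'tempo': 0}
```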

6 - Removing columns:

Some columns may be unnecessary for our analysis, either because they carry no relevant information about what we want to find out, or because they have so many missing values that they hinder more than they help. In those cases, a quick and easy solution is to drop them.

Here we drop just one, as an experiment.

## Doc - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
df.drop(['liveness'], axis=1)
song_name song_popularity song_duration_ms acousticness danceability energy instrumentalness key loudness audio_mode speechiness tempo time_signature audio_valence
1757 Party In The U.S.A. NaN 0.822 0.51900 0.360 0.000 10.000000 NaN 0.000 NaN 97.4200 4.000 NaN NaN
7574 I Love It (& Lil Pump) 99 127946.000 0.01140 0.901 0.522 0.000000 2.0 -8.304 1 0.3300 104.053 4 0.329
17588 Taki Taki (with Selena Gomez, Ozuna & Cardi B) 98 212500.000 0.15300 0.841 0.798 0.000003 1.0 -4.206 0 0.2290 95.948 4 0.591
17394 Promises (with Sam Smith) 98 213309.000 0.01190 0.781 0.768 0.000005 11.0 -5.991 1 0.0394 123.070 4 0.486
12665 Eastside (with Halsey & Khalid) 98 173799.000 0.55500 0.560 0.680 0.000000 6.0 -7.648 0 0.3210 89.391 4 0.319
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11278 María 0 161986.000 0.90600 0.843 0.483 0.005230 3.0 -14.776 1 0.0638 141.295 4 0.964
12923 Unfuck The World 0 250213.000 0.00142 0.574 0.831 0.010800 7.0 -5.576 0 0.0325 101.988 4 0.518
11282 Kimbya (feat. Manny Roman) 0 261590.000 0.49600 0.418 0.958 0.058300 7.0 -5.678 1 0.0728 123.639 4 0.676
12905 Mad World 0 174253.000 0.00002 0.298 0.931 0.404000 2.0 -6.185 1 0.1300 135.970 4 0.404
9956 All in My Feelings 0 187123.000 0.51100 0.459 0.476 0.000000 2.0 -5.277 1 0.0467 139.624 4 0.247

14932 rows × 14 columns

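Running the in-place version a second time raises a KeyError, because 'liveness' is no longer in df. That appears to be what happened in the run recorded below.

## Or we can drop the column directly by passing the columns parameter
df.drop(columns=['liveness'], inplace=True)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-28-9052eefc4426> in <module>()
      1 ## Or we can drop the column directly by passing the columns parameter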
----> 2 df.drop(columns=['liveness'], inplace=True)

/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   4172             level=level,
   4173             inplace=inplace,
-> 4174             errors=errors,
   4175         )
   4176 

/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   3887         for axis, labels in axes.items():
   3888             if labels is not None:
-> 3889                 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   3890 
   3891         if inplace:

/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in _drop_axis(self, labels, axis, level, errors)
   3921                 new_axis = axis.drop(labels, level=level, errors=errors)
   3922             else:
-> 3923                 new_axis = axis.drop(labels, errors=errors)
   3924             result = self.reindex(**{axis_name: new_axis})
   3925 

/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in drop(self, labels, errors)
   5285         if mask.any():
   5286             if errors != "ignore":
-> 5287                 raise KeyError(f"{labels[mask]} not found in axis")
   5288             indexer = indexer[~mask]
   5289         return self.delete(indexer)

KeyError: "['liveness'] not found in axis"
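
The two behaviours seen in this section can be reproduced on a toy frame (made-up values): drop() without inplace returns a new DataFrame and leaves the original untouched, and errors="ignore" makes a repeated drop harmless.

```python
import pandas as pd

toy = pd.DataFrame({"liveness": [0.259, 0.0618], "tempo": [104.053, 95.948]})

# Without inplace=True the result is a new frame; toy still has the column.
preview = toy.drop(columns=["liveness"])
print("liveness" in toy.columns)      # True
print(preview.columns.tolist())       # ['tempo']

# Reassigning makes the removal stick.
toy = toy.drop(columns=["liveness"])

# A second drop would raise KeyError; errors="ignore" skips missing labels.
toy = toy.drop(columns=["liveness"], errors="ignore")
print(toy.columns.tolist())           # ['tempo']
```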

7 - Missing values:

In some situations our df may contain a lot of incomplete information. These missing values can harm the analysis and any later steps that depend on it and on the preprocessing, so we need to either remove them or replace them with other values. The flowchart below can help with the decision and suggests how to handle each case.

[Figure: flowchart for deciding how to handle missing values]

For data that is not a time series, our first option is to replace missing values with the column mean. However, the mean may be distorted by outliers in the column, so we can also use the mode or the median instead.
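
To illustrate why the median can be a safer fill value than the mean, here is a small sketch with one deliberately extreme value. The numbers are made up for the example.

```python
import numpy as np
import pandas as pd

# One missing value and one outlier (2000.0).
tempo = pd.Series([90.0, 95.0, 100.0, np.nan, 2000.0])

mean_value = tempo.mean()      # (90 + 95 + 100 + 2000) / 4, NaN is skipped
median_value = tempo.median()  # midpoint of 95 and 100

print(mean_value)    # 571.25
print(median_value)  # 97.5

# Filling with the median keeps the imputed value close to the typical ones.
filled = tempo.fillna(median_value)
print(filled.tolist())  # [90.0, 95.0, 100.0, 97.5, 2000.0]

# For discrete columns the mode (most frequent value) is often more meaningful.
time_signature = pd.Series([4, 4, 3, np.nan])
print(time_signature.mode()[0])  # 4.0
```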

We can do this with the .fillna method, which fills in the missing entries. Let's write a couple of loops as an example. The first goes through a few columns and replaces the missing values with the mode; the second does the same with the median:

df.isna().sum()
song_name           0
song_popularity     1
song_duration_ms    0
acousticness        0
danceability        0
energy              1
instrumentalness    2
key                 1
loudness            1
audio_mode          2
speechiness         1
tempo               1
time_signature      3
audio_valence       1
dtype: int64
## Doc .fillna - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html

# Assign the result back: calling fillna(inplace=True) on a selected column
# may not update df in recent pandas versions.
for column in ['acousticness', 'speechiness']:
    df[column] = df[column].fillna(df[column].mode()[0])
for column in ['song_duration_ms', 'danceability', 'energy',
               'loudness', 'audio_valence']:
    df[column] = df[column].fillna(df[column].median())
df.isna().sum()
song_name           0
song_popularity     1
song_duration_ms    0
acousticness        0
danceability        0
energy              0
instrumentalness    2
key                 1
loudness            0
audio_mode          2
speechiness         0
tempo               1
time_signature      3
audio_valence       0
dtype: int64
## Doc - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html?highlight=dropna#pandas.DataFrame.dropna
df.dropna(inplace=True)
df.isna().sum()
song_name           0
song_popularity     0
song_duration_ms    0
acousticness        0
danceability        0
energy              0
instrumentalness    0
key                 0
loudness            0
audio_mode          0
speechiness         0
tempo               0
time_signature      0
audio_valence       0
dtype: int64
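
dropna() with no arguments removes every row that has at least one missing value. When that is too aggressive, the subset and thresh parameters give finer control. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "song_name": ["a", "b", "c"],
    "key": [2.0, np.nan, 6.0],
    "tempo": [104.0, np.nan, np.nan],
})

all_complete = toy.dropna()                # rows with no missing value at all
key_required = toy.dropna(subset=["key"])  # only 'key' must be present
at_least_two = toy.dropna(thresh=2)        # rows with at least 2 non-null values

print(all_complete.index.tolist())  # [0]
print(key_required.index.tolist())  # [0, 2]
print(at_least_two.index.tolist())  # [0, 2]
```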

Conclusion

In the end we have a dataset ready for exploratory analysis. We have not treated outliers yet because, depending on the scenario, we may want to make use of them.

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14925 entries, 7574 to 9956
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   song_name         14925 non-null  object  
 1   song_popularity   14925 non-null  category
 2   song_duration_ms  14925 non-null  float64 
 3   acousticness      14925 non-null  float64 
 4   danceability      14925 non-null  float64 
 5   energy            14925 non-null  float64 
 6   instrumentalness  14925 non-null  float64 
 7   key               14925 non-null  category
 8   loudness          14925 non-null  float64 
 9   audio_mode        14925 non-null  category
 10  speechiness       14925 non-null  float64 
 11  tempo             14925 non-null  float64 
 12  time_signature    14925 non-null  category
 13  audio_valence     14925 non-null  float64 
dtypes: category(4), float64(9), object(1)
memory usage: 1.3+ MB
df.head()
song_name song_popularity song_duration_ms acousticness danceability energy instrumentalness key loudness audio_mode speechiness tempo time_signature audio_valence
7574 I Love It (& Lil Pump) 99 127946.0 0.0114 0.901 0.522 0.000000 2.0 -8.304 1 0.3300 104.053 4 0.329
17588 Taki Taki (with Selena Gomez, Ozuna & Cardi B) 98 212500.0 0.1530 0.841 0.798 0.000003 1.0 -4.206 0 0.2290 95.948 4 0.591
17394 Promises (with Sam Smith) 98 213309.0 0.0119 0.781 0.768 0.000005 11.0 -5.991 1 0.0394 123.070 4 0.486
12665 Eastside (with Halsey & Khalid) 98 173799.0 0.5550 0.560 0.680 0.000000 6.0 -7.648 0 0.3210 89.391 4 0.319
17618 In My Feelings 98 217925.0 0.0589 0.835 0.626 0.000060 1.0 -5.833 1 0.1250 91.030 4 0.350