Open In Colab

Exercícios de manipulação de dados - Parte 1

! git clone https://github.com/Mario-RJunior/data-manipulation-exercises
Cloning into 'data-manipulation-exercises'...
remote: Enumerating objects: 31, done.
remote: Total 31 (delta 0), reused 0 (delta 0), pack-reused 31
Unpacking objects: 100% (31/31), done.
cd data-manipulation-exercises/
/content/data-manipulation-exercises
ls
datasets/                         Manipulacao_de_Dados_Ex_02.ipynb  README.md
Manipulacao_de_Dados_Ex_01.ipynb  Manipulacao_de_Dados_Ex_03.ipynb
import pandas as pd

Passo 1. Importando os dados

Carregue os dados salvos no arquivo datasets/users_dataset.csv.

Esse arquivo possui um conjunto de dados de trabalhadores com 5 colunas separadas pelo símbolo "|" (pipe) e 943 linhas.

Dica: não se esqueça do argumento sep quando importar os dados.

df = pd.read_csv('datasets/users_dataset.csv', sep='|')

Passo 2. Mostre as 15 primeiras linhas do dataset.

df.head(15)
user_id age gender occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213
5 6 42 M executive 98101
6 7 57 M administrator 91344
7 8 36 M administrator 05201
8 9 29 M student 01002
9 10 53 M lawyer 90703
10 11 39 F other 30329
11 12 28 F other 06405
12 13 47 M educator 29206
13 14 45 M scientist 55106
14 15 49 F educator 97301

Passo 3. Mostre as 10 últimas linhas

users.tail(10)
user_id age gender occupation zip_code
933 934 61 M engineer 22902
934 935 42 M doctor 66221
935 936 24 M other 32789
936 937 48 M educator 98072
937 938 38 F technician 55038
938 939 26 F student 33319
939 940 32 M administrator 02215
940 941 20 M student 97229
941 942 48 F librarian 78209
942 943 22 M student 77841

Passo 4. Qual o número de linhas e colunas do DataFrame?

df.shape
(943, 5)

Passo 5. Mostre o nome de todas as colunas.

users.columns
Index(['user_id', 'age', 'gender', 'occupation', 'zip_code'], dtype='object')

Passo 6. Qual o tipo de dado de cada coluna?

df.dtypes
user_id        int64
age            int64
gender        object
occupation    object
zip_code      object
dtype: object

Passo 7. Mostre os dados da coluna occupation.

df['occupation']
0         technician
1              other
2             writer
3         technician
4              other
           ...      
938          student
939    administrator
940          student
941        librarian
942          student
Name: occupation, Length: 943, dtype: object

Passo 8. Quantas ocupações diferentes existem neste dataset?

print(df['occupation'].unique())
len(df['occupation'].unique())
['technician' 'other' 'writer' 'executive' 'administrator' 'student'
 'lawyer' 'educator' 'scientist' 'entertainment' 'programmer' 'librarian'
 'homemaker' 'artist' 'engineer' 'marketing' 'none' 'healthcare' 'retired'
 'salesman' 'doctor']
21

Passo 9. Qual a ocupação mais frequente?

df['occupation'].value_counts().head(1)
student    196
Name: occupation, dtype: int64

Passo 10. Qual a idade média dos usuários?

df['age'].mean()
34.05196182396607

Passo 11. Utilize o método describe() para obter diversas informações a respeito do DataFrame.

users.describe()
user_id age
count 943.000000 943.000000
mean 472.000000 34.051962
std 272.364951 12.192740
min 1.000000 7.000000
25% 236.500000 25.000000
50% 472.000000 31.000000
75% 707.500000 43.000000
max 943.000000 73.000000

Exercícios de manipulação de dados - Parte 2

! git clone https://github.com/Mario-RJunior/data-manipulation-exercises
Cloning into 'data-manipulation-exercises'...
remote: Enumerating objects: 31, done.
remote: Total 31 (delta 0), reused 0 (delta 0), pack-reused 31
Unpacking objects: 100% (31/31), done.
cd data-manipulation-exercises/
/content/data-manipulation-exercises/data-manipulation-exercises
ls
datasets/                         Manipulacao_de_Dados_Ex_02.ipynb  README.md
Manipulacao_de_Dados_Ex_01.ipynb  Manipulacao_de_Dados_Ex_03.ipynb
import pandas as pd

Tarefa 1. Importe o dataset e salve os dados em um dataframe

crime = pd.read_csv('datasets/US_Crime_Rates_1960_2014.csv')

Tarefa 2. Qual o tipo de dados em cada coluna?

crime.dtypes
Year                  int64
Population            int64
Total                 int64
Violent               int64
Property              int64
Murder                int64
Forcible_Rape         int64
Robbery               int64
Aggravated_assault    int64
Burglary              int64
Larceny_Theft         int64
Vehicle_Theft         int64
dtype: object

Tarefa 3. Converta o tipo de dado da coluna Year para o tipo datetime

print(crime['Year'].head())
crime['Year'] = pd.to_datetime(crime['Year'], format='%M')

crime['Year'].dtypes
0   1960-01-01
1   1961-01-01
2   1962-01-01
3   1963-01-01
4   1964-01-01
Name: Year, dtype: datetime64[ns]
dtype('<M8[ns]')

Tarefa 4. Configure a coluna Year como index do DataFrame.

crime.set_index('Year', inplace=True)
crime.head()
Population Total Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1960-01-01 179323175 3384200 288460 3095700 9110 17190 107840 154320 912100 1855400 328200
1961-01-01 182992000 3488000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000
1962-01-01 185771000 3752200 301510 3450700 8530 17550 110860 164570 994300 2089600 366800
1963-01-01 188483000 4109500 316970 3792500 8640 17650 116470 174210 1086400 2297800 408300
1964-01-01 191141000 4564600 364220 4200400 9360 21420 130390 203050 1213200 2514400 472800

Tarefa 5. Remova a coluna Total do DataFrame.

crime.drop(columns='Total', inplace=True)
crime.columns
Index(['Population', 'Violent', 'Property', 'Murder', 'Forcible_Rape',
       'Robbery', 'Aggravated_assault', 'Burglary', 'Larceny_Theft',
       'Vehicle_Theft'],
      dtype='object')

Tarefa 6. Encontre o número de roubos de carro do ano de 1978.

crime[crime.index == '1978']['Vehicle_Theft']
Year
1978-01-01    1004100
Name: Vehicle_Theft, dtype: int64

Tarefa 7. Retorne a linha do ano em que houve o maior número de assasinatos.

crime[crime['Murder'] == crime['Murder'].max()]
Population Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1991-01-01 252177000 1911770 12961100 24700 106590 687730 1092740 3157200 8142200 1661700

Tarefa 8. Retorne o número de assassinatos do ano em que foi registrado o menor número de roubo de carros.

crime[crime['Vehicle_Theft'] == crime['Vehicle_Theft'].min()]['Murder']
Year
1960-01-01    9110
Name: Murder, dtype: int64

Exercícios de manipulação de dados - Parte 3

! git clone https://github.com/Mario-RJunior/data-manipulation-exercises
Cloning into 'data-manipulation-exercises'...
remote: Enumerating objects: 31, done.
remote: Total 31 (delta 0), reused 0 (delta 0), pack-reused 31
Unpacking objects: 100% (31/31), done.
ls
data-manipulation-exercises/      Manipulacao_de_Dados_Ex_02.ipynb
datasets/                         Manipulacao_de_Dados_Ex_03.ipynb
Manipulacao_de_Dados_Ex_01.ipynb  README.md
cd data-manipulation-exercises/
/content/data-manipulation-exercises/data-manipulation-exercises/data-manipulation-exercises
ls
datasets/                         Manipulacao_de_Dados_Ex_02.ipynb  README.md
Manipulacao_de_Dados_Ex_01.ipynb  Manipulacao_de_Dados_Ex_03.ipynb
import pandas as pd

Tarefa 1. Importe o dataset e salve os dados em um dataframe

users = pd.read_csv('datasets/users_dataset.csv', sep='|')

Tarefa 2. Descubra qual a idade média por ocupação.

users.groupby('occupation')['age'].mean()
occupation
administrator    38.746835
artist           31.392857
doctor           43.571429
educator         42.010526
engineer         36.388060
entertainment    29.222222
executive        38.718750
healthcare       41.562500
homemaker        32.571429
lawyer           36.750000
librarian        40.000000
marketing        37.615385
none             26.555556
other            34.523810
programmer       33.121212
retired          63.071429
salesman         35.666667
scientist        35.548387
student          22.081633
technician       33.148148
writer           36.311111
Name: age, dtype: float64

Tarefa 3. Para cada ocupação descubra a idade mínima e máxima.

users.groupby('occupation').agg({'age': ['min', 'max']})
age
min max
occupation
administrator 21 70
artist 19 48
doctor 28 64
educator 23 63
engineer 22 70
entertainment 15 50
executive 22 69
healthcare 22 62
homemaker 20 50
lawyer 21 53
librarian 23 69
marketing 24 55
none 11 55
other 13 64
programmer 20 63
retired 51 73
salesman 18 66
scientist 23 55
student 7 42
technician 21 55
writer 18 60

Tarefa 4. Para cada combinação de ocupação e sexo, calcule a idade média.

users.groupby(['occupation', 'gender'])['age'].mean()
occupation     gender
administrator  F         40.638889
               M         37.162791
artist         F         30.307692
               M         32.333333
doctor         M         43.571429
educator       F         39.115385
               M         43.101449
engineer       F         29.500000
               M         36.600000
entertainment  F         31.000000
               M         29.000000
executive      F         44.000000
               M         38.172414
healthcare     F         39.818182
               M         45.400000
homemaker      F         34.166667
               M         23.000000
lawyer         F         39.500000
               M         36.200000
librarian      F         40.000000
               M         40.000000
marketing      F         37.200000
               M         37.875000
none           F         36.500000
               M         18.600000
other          F         35.472222
               M         34.028986
programmer     F         32.166667
               M         33.216667
retired        F         70.000000
               M         62.538462
salesman       F         27.000000
               M         38.555556
scientist      F         28.333333
               M         36.321429
student        F         20.750000
               M         22.669118
technician     F         38.000000
               M         32.961538
writer         F         37.631579
               M         35.346154
Name: age, dtype: float64

Tarefa 5. Para cada ocupação calcule a porcentagem de homens e mulheres.

df2 = users.groupby('occupation')['gender'].value_counts()
porcentagem = df2.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))
porcentagem
occupation     gender
administrator  M          54.430380
               F          45.569620
artist         M          53.571429
               F          46.428571
doctor         M         100.000000
educator       M          72.631579
               F          27.368421
engineer       M          97.014925
               F           2.985075
entertainment  M          88.888889
               F          11.111111
executive      M          90.625000
               F           9.375000
healthcare     F          68.750000
               M          31.250000
homemaker      F          85.714286
               M          14.285714
lawyer         M          83.333333
               F          16.666667
librarian      F          56.862745
               M          43.137255
marketing      M          61.538462
               F          38.461538
none           M          55.555556
               F          44.444444
other          M          65.714286
               F          34.285714
programmer     M          90.909091
               F           9.090909
retired        M          92.857143
               F           7.142857
salesman       M          75.000000
               F          25.000000
scientist      M          90.322581
               F           9.677419
student        M          69.387755
               F          30.612245
technician     M          96.296296
               F           3.703704
writer         M          57.777778
               F          42.222222
Name: gender, dtype: float64