Exercícios de manipulação de dados - Parte 1

! git clone https://github.com/Mario-RJunior/data-manipulation-exercises

Cloning into 'data-manipulation-exercises'...
remote: Enumerating objects: 31, done.
remote: Total 31 (delta 0), reused 0 (delta 0), pack-reused 31
Unpacking objects: 100% (31/31), done.

cd data-manipulation-exercises/

/content/data-manipulation-exercises

ls

datasets/                         Manipulacao_de_Dados_Ex_02.ipynb  README.md
Manipulacao_de_Dados_Ex_01.ipynb  Manipulacao_de_Dados_Ex_03.ipynb

import pandas as pd

Passo 1. Importando os dados

Carregue os dados salvos no arquivo datasets/users_dataset.csv.

Esse arquivo possui um conjunto de dados de trabalhadores com 5 colunas separadas pelo símbolo "|" (pipe) e 943 linhas.

Dica: não se esqueça do argumento sep quando importar os dados.

df = pd.read_csv('datasets/users_dataset.csv', sep='|')

Passo 2. Mostre as 15 primeiras linhas do dataset.

df.head(15)

Passo 3. Mostre as 10 últimas linhas

users.tail(10)

Passo 4. Qual o número de linhas e colunas do DataFrame?

df.shape

(943, 5)

Passo 5. Mostre o nome de todas as colunas.

users.columns

Index(['user_id', 'age', 'gender', 'occupation', 'zip_code'], dtype='object')

Passo 6. Qual o tipo de dado de cada coluna?

df.dtypes

user_id        int64
age            int64
gender        object
occupation    object
zip_code      object
dtype: object

Passo 7. Mostre os dados da coluna occupation.

df['occupation']

0         technician
1              other
2             writer
3         technician
4              other
           ...      
938          student
939    administrator
940          student
941        librarian
942          student
Name: occupation, Length: 943, dtype: object

Passo 8. Quantas ocupações diferentes existem neste dataset?

print(df['occupation'].unique())
len(df['occupation'].unique())

['technician' 'other' 'writer' 'executive' 'administrator' 'student'
 'lawyer' 'educator' 'scientist' 'entertainment' 'programmer' 'librarian'
 'homemaker' 'artist' 'engineer' 'marketing' 'none' 'healthcare' 'retired'
 'salesman' 'doctor']

21

Passo 9. Qual a ocupação mais frequente?

df['occupation'].value_counts().head(1)

student    196
Name: occupation, dtype: int64

Passo 10. Qual a idade média dos usuários?

df['age'].mean()

34.05196182396607

Passo 11. Utilize o método describe() para obter diversas informações a respeito do DataFrame.

users.describe()

Exercícios de manipulação de dados - Parte 2

! git clone https://github.com/Mario-RJunior/data-manipulation-exercises

Cloning into 'data-manipulation-exercises'...
remote: Enumerating objects: 31, done.
remote: Total 31 (delta 0), reused 0 (delta 0), pack-reused 31
Unpacking objects: 100% (31/31), done.

cd data-manipulation-exercises/

/content/data-manipulation-exercises/data-manipulation-exercises

ls

datasets/                         Manipulacao_de_Dados_Ex_02.ipynb  README.md
Manipulacao_de_Dados_Ex_01.ipynb  Manipulacao_de_Dados_Ex_03.ipynb

import pandas as pd

Tarefa 1. Importe o dataset e salve os dados em um dataframe

crime = pd.read_csv('datasets/US_Crime_Rates_1960_2014.csv')

Tarefa 2. Qual o tipo de dados em cada coluna?

crime.dtypes

Year                  int64
Population            int64
Total                 int64
Violent               int64
Property              int64
Murder                int64
Forcible_Rape         int64
Robbery               int64
Aggravated_assault    int64
Burglary              int64
Larceny_Theft         int64
Vehicle_Theft         int64
dtype: object

Tarefa 3. Converta o tipo de dado da coluna Year para o tipo datetime

print(crime['Year'].head())
crime['Year'] = pd.to_datetime(crime['Year'], format='%M')

crime['Year'].dtypes

0   1960-01-01
1   1961-01-01
2   1962-01-01
3   1963-01-01
4   1964-01-01
Name: Year, dtype: datetime64[ns]

dtype('<M8[ns]')

Tarefa 4. Configure a coluna Year como index do DataFrame.

crime.set_index('Year', inplace=True)
crime.head()

Tarefa 5. Remova a coluna Total do DataFrame.

crime.drop(columns='Total', inplace=True)
crime.columns

Index(['Population', 'Violent', 'Property', 'Murder', 'Forcible_Rape',
       'Robbery', 'Aggravated_assault', 'Burglary', 'Larceny_Theft',
       'Vehicle_Theft'],
      dtype='object')

Tarefa 6. Encontre o número de roubos de carro do ano de 1978.

crime[crime.index == '1978']['Vehicle_Theft']

Year
1978-01-01    1004100
Name: Vehicle_Theft, dtype: int64

Tarefa 7. Retorne a linha do ano em que houve o maior número de assasinatos.

crime[crime['Murder'] == crime['Murder'].max()]

Tarefa 8. Retorne o número de assassinatos do ano em que foi registrado o menor número de roubo de carros.

crime[crime['Vehicle_Theft'] == crime['Vehicle_Theft'].min()]['Murder']

Year
1960-01-01    9110
Name: Murder, dtype: int64

Exercícios de manipulação de dados - Parte 3

! git clone https://github.com/Mario-RJunior/data-manipulation-exercises

Cloning into 'data-manipulation-exercises'...
remote: Enumerating objects: 31, done.
remote: Total 31 (delta 0), reused 0 (delta 0), pack-reused 31
Unpacking objects: 100% (31/31), done.

ls

data-manipulation-exercises/      Manipulacao_de_Dados_Ex_02.ipynb
datasets/                         Manipulacao_de_Dados_Ex_03.ipynb
Manipulacao_de_Dados_Ex_01.ipynb  README.md

cd data-manipulation-exercises/

/content/data-manipulation-exercises/data-manipulation-exercises/data-manipulation-exercises

ls

datasets/                         Manipulacao_de_Dados_Ex_02.ipynb  README.md
Manipulacao_de_Dados_Ex_01.ipynb  Manipulacao_de_Dados_Ex_03.ipynb

import pandas as pd

Tarefa 1. Importe o dataset e salve os dados em um dataframe

users = pd.read_csv('datasets/users_dataset.csv', sep='|')

Tarefa 2. Descubra qual a idade média por ocupação.

users.groupby('occupation')['age'].mean()

occupation
administrator    38.746835
artist           31.392857
doctor           43.571429
educator         42.010526
engineer         36.388060
entertainment    29.222222
executive        38.718750
healthcare       41.562500
homemaker        32.571429
lawyer           36.750000
librarian        40.000000
marketing        37.615385
none             26.555556
other            34.523810
programmer       33.121212
retired          63.071429
salesman         35.666667
scientist        35.548387
student          22.081633
technician       33.148148
writer           36.311111
Name: age, dtype: float64

Tarefa 3. Para cada ocupação descubra a idade mínima e máxima.

users.groupby('occupation').agg({'age': ['min', 'max']})

Tarefa 4. Para cada combinação de ocupação e sexo, calcule a idade média.

users.groupby(['occupation', 'gender'])['age'].mean()

occupation     gender
administrator  F         40.638889
               M         37.162791
artist         F         30.307692
               M         32.333333
doctor         M         43.571429
educator       F         39.115385
               M         43.101449
engineer       F         29.500000
               M         36.600000
entertainment  F         31.000000
               M         29.000000
executive      F         44.000000
               M         38.172414
healthcare     F         39.818182
               M         45.400000
homemaker      F         34.166667
               M         23.000000
lawyer         F         39.500000
               M         36.200000
librarian      F         40.000000
               M         40.000000
marketing      F         37.200000
               M         37.875000
none           F         36.500000
               M         18.600000
other          F         35.472222
               M         34.028986
programmer     F         32.166667
               M         33.216667
retired        F         70.000000
               M         62.538462
salesman       F         27.000000
               M         38.555556
scientist      F         28.333333
               M         36.321429
student        F         20.750000
               M         22.669118
technician     F         38.000000
               M         32.961538
writer         F         37.631579
               M         35.346154
Name: age, dtype: float64

Tarefa 5. Para cada ocupação calcule a porcentagem de homens e mulheres.

df2 = users.groupby('occupation')['gender'].value_counts()
porcentagem = df2.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))
porcentagem

occupation     gender
administrator  M          54.430380
               F          45.569620
artist         M          53.571429
               F          46.428571
doctor         M         100.000000
educator       M          72.631579
               F          27.368421
engineer       M          97.014925
               F           2.985075
entertainment  M          88.888889
               F          11.111111
executive      M          90.625000
               F           9.375000
healthcare     F          68.750000
               M          31.250000
homemaker      F          85.714286
               M          14.285714
lawyer         M          83.333333
               F          16.666667
librarian      F          56.862745
               M          43.137255
marketing      M          61.538462
               F          38.461538
none           M          55.555556
               F          44.444444
other          M          65.714286
               F          34.285714
programmer     M          90.909091
               F           9.090909
retired        M          92.857143
               F           7.142857
salesman       M          75.000000
               F          25.000000
scientist      M          90.322581
               F           9.677419
student        M          69.387755
               F          30.612245
technician     M          96.296296
               F           3.703704
writer         M          57.777778
               F          42.222222
Name: gender, dtype: float64

	user_id	age
count	943.000000	943.000000
mean	472.000000	34.051962
std	272.364951	12.192740
min	1.000000	7.000000
25%	236.500000	25.000000
50%	472.000000	31.000000
75%	707.500000	43.000000
max	943.000000	73.000000

	Population	Total	Violent	Property	Murder	Forcible_Rape	Robbery	Aggravated_assault	Burglary	Larceny_Theft	Vehicle_Theft
Year
1960-01-01	179323175	3384200	288460	3095700	9110	17190	107840	154320	912100	1855400	328200
1961-01-01	182992000	3488000	289390	3198600	8740	17220	106670	156760	949600	1913000	336000
1962-01-01	185771000	3752200	301510	3450700	8530	17550	110860	164570	994300	2089600	366800
1963-01-01	188483000	4109500	316970	3792500	8640	17650	116470	174210	1086400	2297800	408300
1964-01-01	191141000	4564600	364220	4200400	9360	21420	130390	203050	1213200	2514400	472800

	user_id	age	gender	occupation	zip_code
0	1	24	M	technician	85711
1	2	53	F	other	94043
2	3	23	M	writer	32067
3	4	24	M	technician	43537
4	5	33	F	other	15213
5	6	42	M	executive	98101
6	7	57	M	administrator	91344
7	8	36	M	administrator	05201
8	9	29	M	student	01002
9	10	53	M	lawyer	90703
10	11	39	F	other	30329
11	12	28	F	other	06405
12	13	47	M	educator	29206
13	14	45	M	scientist	55106
14	15	49	F	educator	97301

	user_id	age	gender	occupation	zip_code
933	934	61	M	engineer	22902
934	935	42	M	doctor	66221
935	936	24	M	other	32789
936	937	48	M	educator	98072
937	938	38	F	technician	55038
938	939	26	F	student	33319
939	940	32	M	administrator	02215
940	941	20	M	student	97229
941	942	48	F	librarian	78209
942	943	22	M	student	77841

	Population	Violent	Property	Murder	Forcible_Rape	Robbery	Aggravated_assault	Burglary	Larceny_Theft	Vehicle_Theft
Year
1991-01-01	252177000	1911770	12961100	24700	106590	687730	1092740	3157200	8142200	1661700

	age
	min	max
occupation
administrator	21	70
artist	19	48
doctor	28	64
educator	23	63
engineer	22	70
entertainment	15	50
executive	22	69
healthcare	22	62
homemaker	20	50
lawyer	21	53
librarian	23	69
marketing	24	55
none	11	55
other	13	64
programmer	20	63
retired	51	73
salesman	18	66
scientist	23	55
student	7	42
technician	21	55
writer	18	60