10 minutes to pandas¶
In [1]:
import numpy as np
import pandas as pd
In [10]:
dates = pd.date_range("20130101", periods=6)
dates
Out[10]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [11]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
Out[11]:
| | A | B | C | D |
---|---|---|---|---|
2013-01-01 | -0.463106 | -0.950446 | 0.995642 | 0.336713 |
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 0.578512 |
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | -2.387156 |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 | 0.623988 |
2013-01-05 | 0.355739 | 0.931375 | -0.164607 | -0.597242 |
2013-01-06 | 0.128236 | 1.032796 | -0.148499 | 0.213431 |
In [5]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
In [6]:
df2
Out[6]:
| | A | B | C | D | E | F |
---|---|---|---|---|---|---|
0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
Checking data types¶
In [7]:
df2.dtypes
Out[7]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
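Once the dtypes are known, they can also drive column selection. The following is a minimal sketch, assuming the same mixed-type frame as above; `select_dtypes` is a standard pandas method, and the variable name `numeric_cols` is only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the mixed-type frame from the cell above
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

# Keep only the numeric columns; datetime, category and string columns are skipped
numeric_cols = df2.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Here only the float and integer columns should remain, since the timestamp, categorical, and string columns are not numeric.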
In [12]:
df.head()
Out[12]:
| | A | B | C | D |
---|---|---|---|---|
2013-01-01 | -0.463106 | -0.950446 | 0.995642 | 0.336713 |
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 0.578512 |
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | -2.387156 |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 | 0.623988 |
2013-01-05 | 0.355739 | 0.931375 | -0.164607 | -0.597242 |
In [13]:
df.tail()
Out[13]:
| | A | B | C | D |
---|---|---|---|---|
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 0.578512 |
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | -2.387156 |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 | 0.623988 |
2013-01-05 | 0.355739 | 0.931375 | -0.164607 | -0.597242 |
2013-01-06 | 0.128236 | 1.032796 | -0.148499 | 0.213431 |
Displaying the index and columns¶
In [14]:
df.index
Out[14]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [15]:
df.columns
Out[15]:
Index(['A', 'B', 'C', 'D'], dtype='object')
Function for a quick statistical summary¶
In [16]:
df.describe()
Out[16]:
| | A | B | C | D |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 0.526543 | 0.020171 | 0.284042 | -0.205292 |
std | 1.098828 | 1.252759 | 0.998350 | 1.156314 |
min | -0.463106 | -1.943062 | -1.152986 | -2.387156 |
25% | -0.276496 | -0.727833 | -0.160580 | -0.394574 |
50% | 0.241987 | 0.435690 | 0.166708 | 0.275072 |
75% | 0.924498 | 1.007441 | 0.867210 | 0.518062 |
max | 2.435713 | 1.110361 | 1.692784 | 0.623988 |
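The quartile rows in this summary are only the defaults. As a rough sketch, assuming a frame shaped like `df` but filled from a seeded generator (so the numbers differ from the output above), `describe` can be asked for other percentiles through its `percentiles` parameter:

```python
import numpy as np
import pandas as pd

# Seeded random data, so the values differ from the table above
rng = np.random.default_rng(0)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

# Request the 10th, 50th and 90th percentiles instead of the quartiles
summary = df.describe(percentiles=[0.1, 0.5, 0.9])
print(summary.index.tolist())

# The "mean" row matches the column means computed directly
print(np.allclose(summary.loc["mean"], df.mean()))
```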
Transposing the data¶
In [17]:
df.T
Out[17]:
| | 2013-01-01 | 2013-01-02 | 2013-01-03 | 2013-01-04 | 2013-01-05 | 2013-01-06 |
---|---|---|---|---|---|---|
A | -0.463106 | -0.411407 | 1.114084 | 2.435713 | 0.355739 | 0.128236 |
B | -0.950446 | -0.059995 | -1.943062 | 1.110361 | 0.931375 | 1.032796 |
C | 0.995642 | 1.692784 | -1.152986 | 0.481915 | -0.164607 | -0.148499 |
D | 0.336713 | 0.578512 | -2.387156 | 0.623988 | -0.597242 | 0.213431 |
Sorting by an axis¶
In [18]:
df.sort_index(axis=1, ascending=False)
Out[18]:
| | D | C | B | A |
---|---|---|---|---|
2013-01-01 | 0.336713 | 0.995642 | -0.950446 | -0.463106 |
2013-01-02 | 0.578512 | 1.692784 | -0.059995 | -0.411407 |
2013-01-03 | -2.387156 | -1.152986 | -1.943062 | 1.114084 |
2013-01-04 | 0.623988 | 0.481915 | 1.110361 | 2.435713 |
2013-01-05 | -0.597242 | -0.164607 | 0.931375 | 0.355739 |
2013-01-06 | 0.213431 | -0.148499 | 1.032796 | 0.128236 |
Sorting by values¶
In [21]:
df.sort_values(by="B")
Out[21]:
| | A | B | C | D |
---|---|---|---|---|
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | -2.387156 |
2013-01-01 | -0.463106 | -0.950446 | 0.995642 | 0.336713 |
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 0.578512 |
2013-01-05 | 0.355739 | 0.931375 | -0.164607 | -0.597242 |
2013-01-06 | 0.128236 | 1.032796 | -0.148499 | 0.213431 |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 | 0.623988 |
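`sort_values` sorts in ascending order by default and returns a new frame rather than changing the original. A small sketch with a made-up three-row frame (the data here is illustrative, not the `df` above):

```python
import pandas as pd

# Small illustrative frame; not the random data used above
df = pd.DataFrame({"A": [3, 1, 2], "B": [0.5, -1.0, 2.0]})

# Sort by column B from largest to smallest
desc = df.sort_values(by="B", ascending=False)
print(desc)

# The original frame keeps its row order
print(df.index.tolist())
```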
In [22]:
df["A"]
Out[22]:
2013-01-01 -0.463106
2013-01-02 -0.411407
2013-01-03 1.114084
2013-01-04 2.435713
2013-01-05 0.355739
2013-01-06 0.128236
Freq: D, Name: A, dtype: float64
Slicing rows by selecting with [ ]¶
In [23]:
df[0:3]
Out[23]:
| | A | B | C | D |
---|---|---|---|---|
2013-01-01 | -0.463106 | -0.950446 | 0.995642 | 0.336713 |
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 0.578512 |
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | -2.387156 |
In [24]:
df["20130102":"20130104"]
Out[24]:
| | A | B | C | D |
---|---|---|---|---|
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 0.578512 |
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | -2.387156 |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 | 0.623988 |
Selecting by label¶
In [25]:
df.loc[dates[0]]
Out[25]:
A -0.463106
B -0.950446
C 0.995642
D 0.336713
Name: 2013-01-01 00:00:00, dtype: float64
Selecting on multiple axes by label¶
In [30]:
df.loc[:, ["A","B","C"]]
Out[30]:
| | A | B | C |
---|---|---|---|
2013-01-01 | -0.463106 | -0.950446 | 0.995642 |
2013-01-02 | -0.411407 | -0.059995 | 1.692784 |
2013-01-03 | 1.114084 | -1.943062 | -1.152986 |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 |
2013-01-05 | 0.355739 | 0.931375 | -0.164607 |
2013-01-06 | 0.128236 | 1.032796 | -0.148499 |
Label slicing includes both endpoints¶
In [31]:
df.loc["20130102":"20130104",["A","B"]]
Out[31]:
| | A | B |
---|---|---|
2013-01-02 | -0.411407 | -0.059995 |
2013-01-03 | 1.114084 | -1.943062 |
2013-01-04 | 2.435713 | 1.110361 |
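The endpoint rule is the main difference between label slices and position slices. A minimal sketch, assuming a frame with the same dates and columns as `df` but filled with consecutive integers so the result is easy to check:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Consecutive integers instead of random values, for readability
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label slice: both 2013-01-02 and 2013-01-04 are included (3 rows)
by_label = df.loc["20130102":"20130104"]

# Position slice: the stop position 3 is excluded (2 rows)
by_position = df.iloc[1:3]

print(len(by_label), len(by_position))
```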
Reducing the dimensions of the returned object¶
In [36]:
df.loc["20130102",["A","B"]]
Out[36]:
A -0.411407
B -0.059995
Name: 2013-01-02 00:00:00, dtype: float64
Selecting a scalar value¶
In [37]:
df.loc[dates[0], "A"]
Out[37]:
-0.463106090507621
Using at¶
In [38]:
df.at[dates[0], "A"]
Out[38]:
-0.463106090507621
Selecting by integer position¶
In [39]:
df.iloc[3]
Out[39]:
A 2.435713
B 1.110361
C 0.481915
D 0.623988
Name: 2013-01-04 00:00:00, dtype: float64
In [40]:
df.iloc[3:5, 0:2]
Out[40]:
| | A | B |
---|---|---|
2013-01-04 | 2.435713 | 1.110361 |
2013-01-05 | 0.355739 | 0.931375 |
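-> The end of each slice is excluded: rows at positions 3 and 4 (the 4th and 5th rows) and columns at positions 0 and 1 (the 1st and 2nd columns) are selected¶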
In [41]:
df.iloc[[1, 2, 4], [0, 2]]
Out[41]:
| | A | C |
---|---|---|
2013-01-02 | -0.411407 | 1.692784 |
2013-01-03 | 1.114084 | -1.152986 |
2013-01-05 | 0.355739 | -0.164607 |
-> Rows at positions 1, 2, 4 and columns at positions 0, 2¶
Slicing rows explicitly¶
In [42]:
df.iloc[1:3, :]
Out[42]:
| | A | B | C | D |
---|---|---|---|---|
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 0.578512 |
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | -2.387156 |
Slicing columns explicitly¶
In [43]:
df.iloc[:,1:3]
Out[43]:
| | B | C |
---|---|---|
2013-01-01 | -0.950446 | 0.995642 |
2013-01-02 | -0.059995 | 1.692784 |
2013-01-03 | -1.943062 | -1.152986 |
2013-01-04 | 1.110361 | 0.481915 |
2013-01-05 | 0.931375 | -0.164607 |
2013-01-06 | 1.032796 | -0.148499 |
Getting a value explicitly¶
In [44]:
df.iloc[1, 1]
Out[44]:
-0.059994644659198244
Using iat¶
In [45]:
df.iat[1,1]
Out[45]:
-0.059994644659198244
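The four scalar accessors shown so far (`loc`, `at`, `iloc`, `iat`) all reach the same cell; `at` and `iat` are the scalar-only variants. A short sketch, assuming an integer-filled frame with the same labels as `df` so the expected value is obvious:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Values 0.0 .. 23.0 laid out row by row; row 1, column "B" holds 5.0
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

by_loc = df.loc[dates[1], "B"]   # label-based, general purpose
by_at = df.at[dates[1], "B"]     # label-based, scalar only
by_iloc = df.iloc[1, 1]          # position-based, general purpose
by_iat = df.iat[1, 1]            # position-based, scalar only

print(by_loc, by_at, by_iloc, by_iat)
```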
In [46]:
df[df["A"]>0]
Out[46]:
| | A | B | C | D |
---|---|---|---|---|
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | -2.387156 |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 | 0.623988 |
2013-01-05 | 0.355739 | 0.931375 | -0.164607 | -0.597242 |
2013-01-06 | 0.128236 | 1.032796 | -0.148499 | 0.213431 |
Selecting values that meet a condition¶
In [47]:
df[df>0]
Out[47]:
| | A | B | C | D |
---|---|---|---|---|
2013-01-01 | NaN | NaN | 0.995642 | 0.336713 |
2013-01-02 | NaN | NaN | 1.692784 | 0.578512 |
2013-01-03 | 1.114084 | NaN | NaN | NaN |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 | 0.623988 |
2013-01-05 | 0.355739 | 0.931375 | NaN | NaN |
2013-01-06 | 0.128236 | 1.032796 | NaN | 0.213431 |
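The two forms above behave differently: a boolean Series filters whole rows, while a boolean DataFrame keeps the shape and fills non-matching cells with NaN. A minimal sketch on a made-up frame, which also shows that the DataFrame form matches `DataFrame.where` and that row conditions can be combined with `&`:

```python
import pandas as pd

# Small illustrative frame; not the random data used above
df = pd.DataFrame({"A": [-1.0, 2.0, 3.0], "B": [4.0, -5.0, 6.0]})

# Boolean DataFrame: same shape, non-matching cells become NaN
masked = df[df > 0]
same = df.where(df > 0)

# Boolean Series: keep only rows where both columns are positive
both_positive = df[(df["A"] > 0) & (df["B"] > 0)]

print(masked.equals(same))
print(both_positive)
```

The parentheses around each comparison are needed because `&` binds more tightly than `>` in Python.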
Filtering with isin()¶
In [48]:
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2
Out[48]:
| | A | B | C | D | E |
---|---|---|---|---|---|
2013-01-01 | -0.463106 | -0.950446 | 0.995642 | 0.336713 | one |
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 0.578512 | one |
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | -2.387156 | two |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 | 0.623988 | three |
2013-01-05 | 0.355739 | 0.931375 | -0.164607 | -0.597242 | four |
2013-01-06 | 0.128236 | 1.032796 | -0.148499 | 0.213431 | three |
In [49]:
df2[df2["E"].isin(["two","four"])]
Out[49]:
| | A | B | C | D | E |
---|---|---|---|---|---|
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | -2.387156 | two |
2013-01-05 | 0.355739 | 0.931375 | -0.164607 | -0.597242 | four |
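The mask returned by `isin()` is an ordinary boolean Series, so it can be inverted with `~` to keep everything except the listed values. A short sketch using only the "E" column from above:

```python
import pandas as pd

# Only the label column from the example above
df2 = pd.DataFrame({"E": ["one", "one", "two", "three", "four", "three"]})

mask = df2["E"].isin(["two", "four"])

# ~mask flips True and False, so the listed values are dropped
excluded = df2[~mask]
print(excluded["E"].tolist())
```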
In [50]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
In [51]:
s1
Out[51]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
In [52]:
df["F"]=s1
In [55]:
df
Out[55]:
| | A | B | C | D | F |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | -0.950446 | 0.995642 | 0.336713 | NaN |
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 0.578512 | 1.0 |
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | -2.387156 | 2.0 |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 | 0.623988 | 3.0 |
2013-01-05 | 0.355739 | 0.931375 | -0.164607 | -0.597242 | 4.0 |
2013-01-06 | 0.128236 | 1.032796 | -0.148499 | 0.213431 | 5.0 |
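The NaN in column F comes from index alignment: assigning a Series to a column matches values by label, not by position. `s1` starts one day later than `df`, so 2013-01-01 has no match and 2013-01-07 is dropped. A minimal sketch, assuming a one-column stand-in for `df`:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# One-column stand-in for df; only the index matters here
df = pd.DataFrame({"A": np.zeros(6)}, index=dates)

# s1 starts on 2013-01-02, one day after df's first row
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

df["F"] = s1  # aligned by date label, not by position
print(df)
```

Because of the missing value, the integer Series is stored as a float column, which is why the output above shows 1.0 rather than 1.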
Setting values by label¶
In [54]:
df.at[dates[0], "A"]=0
df
Out[54]:
| | A | B | C | D | F |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | -0.950446 | 0.995642 | 0.336713 | NaN |
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 0.578512 | 1.0 |
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | -2.387156 | 2.0 |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 | 0.623988 | 3.0 |
2013-01-05 | 0.355739 | 0.931375 | -0.164607 | -0.597242 | 4.0 |
2013-01-06 | 0.128236 | 1.032796 | -0.148499 | 0.213431 | 5.0 |
Setting values by position¶
In [58]:
df.iat[0,1]=0
df
Out[58]:
| | A | B | C | D | F |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | 0.995642 | 0.336713 | NaN |
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 0.578512 | 1.0 |
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | -2.387156 | 2.0 |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 | 0.623988 | 3.0 |
2013-01-05 | 0.355739 | 0.931375 | -0.164607 | -0.597242 | 4.0 |
2013-01-06 | 0.128236 | 1.032796 | -0.148499 | 0.213431 | 5.0 |
Setting values by assigning a NumPy array¶
In [60]:
df.loc[:, "D"] = np.array([5] * len(df))
df
Out[60]:
| | A | B | C | D | F |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | 0.995642 | 5 | NaN |
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 5 | 1.0 |
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | 5 | 2.0 |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 | 5 | 3.0 |
2013-01-05 | 0.355739 | 0.931375 | -0.164607 | 5 | 4.0 |
2013-01-06 | 0.128236 | 1.032796 | -0.148499 | 5 | 5.0 |
A where operation with setting¶
In [61]:
df2= df.copy()
In [62]:
df2[df2>0]=-df2
In [63]:
df2
Out[63]:
| | A | B | C | D | F |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -0.995642 | -5 | NaN |
2013-01-02 | -0.411407 | -0.059995 | -1.692784 | -5 | -1.0 |
2013-01-03 | -1.114084 | -1.943062 | -1.152986 | -5 | -2.0 |
2013-01-04 | -2.435713 | -1.110361 | -0.481915 | -5 | -3.0 |
2013-01-05 | -0.355739 | -0.931375 | -0.164607 | -5 | -4.0 |
2013-01-06 | -0.128236 | -1.032796 | -0.148499 | -5 | -5.0 |
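The in-place assignment above replaces every positive cell with its negative and leaves zeros, negatives, and NaN untouched. One way to get the same result without mutating the frame is `DataFrame.mask`, which replaces cells where the condition is True. A sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

# Small illustrative frame with a zero, a negative, positives and a NaN
df = pd.DataFrame({"A": [0.0, -0.4, 1.1], "F": [np.nan, 1.0, 2.0]})

# In-place version, as in the cell above
df2 = df.copy()
df2[df2 > 0] = -df2

# Non-mutating version: replace cells where df > 0 with the matching cell of -df
alt = df.mask(df > 0, -df)

print(df2.equals(alt))
```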
In [74]:
df1=df.reindex(index=dates[0:4], columns=list(df.columns)+["E"])
df1.loc[dates[0]:dates[1], "E"]=1
df1
Out[74]:
| | A | B | C | D | F | E |
---|---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | 0.995642 | 5 | NaN | 1.0 |
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 5 | 1.0 | 1.0 |
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | 5 | 2.0 | NaN |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 | 5 | 3.0 | NaN |
Dropping missing data¶
In [72]:
df1.dropna(how="any")
Out[72]:
| | A | B | C | D | F | E |
---|---|---|---|---|---|---|
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 5 | 1.0 | 1.0 |
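With `how="any"`, a row is dropped as soon as one of its cells is missing, which is why only one row survives. `dropna` returns a new frame, and its `subset` parameter limits the check to chosen columns. A minimal sketch, assuming a two-column stand-in that has the same missing-value pattern as `df1`:

```python
import numpy as np
import pandas as pd

# Stand-in with the same NaN pattern as columns F and E of df1
df1 = pd.DataFrame(
    {"F": [np.nan, 1.0, 2.0, 3.0], "E": [1.0, 1.0, np.nan, np.nan]}
)

# Drop rows with a missing value in any column
any_missing = df1.dropna(how="any")

# Drop rows only when column F is missing
subset_f = df1.dropna(subset=["F"])

print(any_missing.index.tolist(), subset_f.index.tolist())
print(len(df1))  # the original frame is unchanged
```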
In [73]:
df1
Out[73]:
| | A | B | C | D | F | E |
---|---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | 0.995642 | 5 | NaN | 1.0 |
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 5 | 1.0 | 1.0 |
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | 5 | 2.0 | NaN |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 | 5 | 3.0 | NaN |
Filling missing data¶
In [77]:
df1.fillna(value=5)
Out[77]:
| | A | B | C | D | F | E |
---|---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | 0.995642 | 5 | 5.0 | 1.0 |
2013-01-02 | -0.411407 | -0.059995 | 1.692784 | 5 | 1.0 | 1.0 |
2013-01-03 | 1.114084 | -1.943062 | -1.152986 | 5 | 2.0 | 5.0 |
2013-01-04 | 2.435713 | 1.110361 | 0.481915 | 5 | 3.0 | 5.0 |
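A single fill value is applied to every column. When columns need different replacements, `fillna` also accepts a dict keyed by column name. A short sketch on a two-column stand-in for `df1` (the fill values 0.0 and -1.0 are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in with the same NaN pattern as columns F and E of df1
df1 = pd.DataFrame(
    {"F": [np.nan, 1.0, 2.0, 3.0], "E": [1.0, 1.0, np.nan, np.nan]}
)

# Different fill value per column; returns a new frame
filled = df1.fillna(value={"F": 0.0, "E": -1.0})
print(filled)
```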
Checking for missing values as True/False¶
In [78]:
pd.isna(df1)
Out[78]:
| | A | B | C | D | F | E |
---|---|---|---|---|---|---|
2013-01-01 | False | False | False | False | True | False |
2013-01-02 | False | False | False | False | False | False |
2013-01-03 | False | False | False | False | False | True |
2013-01-04 | False | False | False | False | False | True |
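Since True counts as 1 when summed, the boolean mask can be reduced to a count of missing values per column. A minimal sketch, again using a two-column stand-in for `df1`:

```python
import numpy as np
import pandas as pd

# Stand-in with the same NaN pattern as columns F and E of df1
df1 = pd.DataFrame(
    {"F": [np.nan, 1.0, 2.0, 3.0], "E": [1.0, 1.0, np.nan, np.nan]}
)

# Sum the True values column by column
missing_per_column = pd.isna(df1).sum()
print(missing_per_column)
```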