
spaCy

김논리 2021. 5. 25. 14:59

en.wikipedia.org/wiki/SpaCy

 


spaCy is an open-source Python library for natural language processing. It provides the following features.

  • Tokenization
  • Part-of-speech (POS) Tagging
  • Dependency Parsing
  • Lemmatization
  • Sentence Boundary Detection (SBD)
  • Named Entity Recognition (NER)
  • Similarity
  • Text Classification
  • Rule-based Matching
  • Training
  • Serialization

Installation

Install spaCy with the following command.

$ pip install spacy

 

The spaCy package does not include any trained models, so the model you want has to be downloaded separately.

spacy.io/models

 


 

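At the time of writing (May 2021), spaCy did not offer an official trained pipeline for Korean (check spacy.io/models for the current list), so you need to either use a multi-language model or build a separate Korean model. The multi-language models can be downloaded as follows. (How to build a separate model will be covered in a later post.)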

$ python -m spacy download xx_ent_wiki_sm
$ python -m spacy download xx_sent_ud_sm
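
As a rough sketch of what sentence boundary detection looks like without any download, the snippet below uses a blank multi-language pipeline, spacy.blank("xx"), together with spaCy's rule-based sentencizer component. This is a stand-in chosen for illustration and assumes spaCy v3. It is not the trained xx_sent_ud_sm pipeline, which predicts boundaries statistically, and the sample text is made up.

```python
import spacy

# Blank multi-language pipeline: tokenizer only, no trained components.
nlp = spacy.blank("xx")
# Rule-based sentence splitter that breaks on sentence-final punctuation.
nlp.add_pipe("sentencizer")

# Hypothetical sample text, not taken from the original post.
doc = nlp("spaCy is a library. It is written in Python.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```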

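To cover the basics of spaCy, the rest of this guide uses an English model, en_core_web_md. It can be downloaded the same way: python -m spacy download en_core_web_md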

 

Tokenization

Tokenization is the process of splitting text into tokens such as words and punctuation marks.

import spacy

nlp = spacy.load('en_core_web_md')
text = 'Yuh-jung Youn won the Oscar for best supporting actress for her performance in "Minari" on Sunday and made history by becoming the first Korean actor to win an Academy Award.'
doc = nlp(text)
tokenized = list(doc)
print(tokenized)

Running the code above produces the following result. (Each element of the list is a spacy.tokens.token.Token object.)

[Yuh, -, jung, Youn, won, the, Oscar, for, best, supporting, actress, for, her, performance, in, ", Minari, ", on, Sunday, and, made, history, by, becoming, the, first, Korean, actor, to, win, an, Academy, Award, .]
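
Tokenization in spaCy is rule-based and does not rely on the trained components, so it can also be tried with a blank English pipeline that needs no model download. A minimal sketch, assuming spaCy v3. The shortened sentence is for illustration only.

```python
import spacy

# spacy.blank("en") creates an English pipeline that contains only the tokenizer.
nlp = spacy.blank("en")
doc = nlp("Yuh-jung Youn won the Oscar.")

# token.text is the string form; token.i is the token's index within the Doc.
tokens = [token.text for token in doc]
for token in doc:
    print(token.i, token.text)
```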

 

POS Tagging

Each token plays a particular role in the sentence. This role is called its part of speech (POS).

import spacy

nlp = spacy.load('en_core_web_md')
text = 'Yuh-jung Youn won the Oscar for best supporting actress for her performance in "Minari" on Sunday and made history by becoming the first Korean actor to win an Academy Award.'
doc = nlp(text)

str_format = "{:>10}"*8
print(str_format.format('Text', 'Lemma', 'POS', 'Tag', 'Dep', 'Shape', 'is alpha', 'is stop'))
print("=="*40)

for token in doc:
    print(str_format.format(token.text, token.lemma_, token.pos_, token.tag_, 
                            token.dep_, token.shape_, str(token.is_alpha), str(token.is_stop)))

Each token carries the following information.

  • Text: The original word text.
  • Lemma: The base form of the word.
  • POS: The simple part-of-speech tag.
  • Tag: The detailed part-of-speech tag.
  • Dep: Syntactic dependency, i.e. the relation between tokens.
  • Shape: The word shape – capitalisation, punctuation, digits.
  • is alpha: Is the token an alpha character?
  • is stop: Is the token part of a stop list, i.e. the most common words of the language?

Running the code prints the following table.

      Text     Lemma       POS       Tag       Dep     Shape  is alpha   is stop
================================================================================
       Yuh       yuh      NOUN        NN  compound       Xxx      True     False
         -         -     PUNCT      HYPH     punct         -     False     False
      jung      jung     PROPN       NNP  compound      xxxx      True     False
      Youn      Youn     PROPN       NNP     nsubj      Xxxx      True     False
       won       win      VERB       VBD      ROOT       xxx      True     False
       the       the       DET        DT       det       xxx      True      True
     Oscar     Oscar     PROPN       NNP      dobj     Xxxxx      True     False
       for       for       ADP        IN      prep       xxx      True      True
      best      good       ADJ       JJS    advmod      xxxx      True     False
supporting   support      VERB       VBG      amod      xxxx      True     False
   actress   actress      NOUN        NN      pobj      xxxx      True     False
       for       for       ADP        IN      prep       xxx      True      True
       her       her      PRON      PRP$      poss       xxx      True      True
performanceperformance      NOUN        NN      pobj      xxxx      True     False
        in        in       ADP        IN      prep        xx      True      True
         "         "     PUNCT        ``     punct         "     False     False
    Minari    Minari     PROPN       NNP      pobj     Xxxxx      True     False
         "         "     PUNCT        ''     punct         "     False     False
        on        on       ADP        IN      prep        xx      True      True
    Sunday    Sunday     PROPN       NNP      pobj     Xxxxx      True     False
       and       and     CCONJ        CC        cc       xxx      True      True
      made      make      VERB       VBD      conj      xxxx      True      True
   history   history      NOUN        NN      dobj      xxxx      True     False
        by        by       ADP        IN      prep        xx      True      True
  becoming    become      VERB       VBG     pcomp      xxxx      True      True
       the       the       DET        DT       det       xxx      True      True
     first     first       ADJ        JJ      amod      xxxx      True      True
    Korean    korean       ADJ        JJ      amod     Xxxxx      True     False
     actor     actor      NOUN        NN      attr      xxxx      True     False
        to        to      PART        TO       aux        xx      True      True
       win       win      VERB        VB     advcl       xxx      True     False
        an        an       DET        DT       det        xx      True      True
   Academy   Academy     PROPN       NNP  compound     Xxxxx      True     False
     Award     Award     PROPN       NNP      dobj     Xxxxx      True     False
         .         .     PUNCT         .     punct         .     False     False
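
The tag and dependency labels in the table are abbreviations. spacy.explain returns a short description for many of them (and None for labels it does not know), which may help when reading the output. A small sketch, using labels picked from the table above:

```python
import spacy

# Labels taken from the POS, Tag, and Dep columns of the table.
labels = ["PROPN", "VBD", "nsubj", "dobj", "pobj"]

# spacy.explain looks each label up in spaCy's built-in glossary.
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:>6}  {description}")
```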

Stop Words

spaCy can also be used to remove stop words such as it, is, a, some, any, and never from a sentence.

from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is a plain Python set of strings; no trained model has to be loaded.
stopwords = STOP_WORDS
print(stopwords)

The stop words are printed as follows.

{'never', 'see', 'himself', 'take', 'thereafter', 'thus', "'m", 'ourselves', 'had', 'to', 'upon', 'can', "'ve", 'herein', 'also', 'should', 'than', 'always', 'nobody', 'already', 'mine', 'thereby', 'under', 'whence', 'whither', 'latterly', '‘ve', 'did', 'itself', 'ours', 'top', 'another', 'while', 'nothing', 'although', "'s", 'her', 'up', 'whose', 'perhaps', 'fifty', 'otherwise', 'even', 'must', 'his', 'nine', 'seemed', 'more', 'twelve', 'somehow', 'with', 'do', '‘m', 'n‘t', 'therein', 'own', 'are', 'either', 'in', 'elsewhere', 'everything', 'get', 'from', 'same', 'into', "n't", '’m', 'anyhow', 'no', 'throughout', 'when', 'amongst', 'nowhere', 'of', 'they', 'beside', 'eleven', 'nor', 'eight', 'their', 'ca', 'us', 'within', 'except', 'used', 'me', 'our', 'side', 'a', 'being', 'by', 'everywhere', 'my', 'this', 'hereafter', 'made', '‘re', 'you', 'and', 'became', 'become', 'many', 'moreover', 'twenty', 'quite', 'bottom', 'without', 'using', 'may', 'via', 'hundred', 'someone', 'have', 'however', 'none', 'down', 'due', 'name', 'where', 'wherein', 'whereas', 'move', 'between', 'why', 'keep', 'mostly', 'then', 'seem', 'whereafter', 'toward', 'ever', 'sixty', 'sometime', 'out', 'full', 'does', 'we', 'nevertheless', 're', 'through', 'them', 'alone', 'above', 'whereby', 'fifteen', 'besides', 'he', 'too', 'beforehand', 'few', 'becomes', 'only', 'what', 'below', '‘ll', 'done', 'along', 'most', 'such', 'off', 'just', 'some', 'something', 'give', 'since', 'empty', 'hereupon', 'third', 'whereupon', 'thru', 'whole', 'n’t', 'is', 'latter', 'every', 'next', 'others', 'further', 'i', 'before', 'thereupon', 'yet', 'everyone', 'not', 'an', 'amount', 'be', 'regarding', 'neither', 'per', 'because', 'least', 'those', 'very', 'might', 'again', 'enough', 'whom', 'becoming', 'somewhere', 'various', 'two', 'anyone', 'seems', 'both', 'forty', 'back', 'go', 'first', 'now', 'yourselves', 'else', 'its', 'over', 'was', 'will', 'yourself', 'were', "'ll", 'former', 'call', 'here', 'if', 'the', "'d", 
'‘d', 'indeed', 'whether', 'once', 'part', 'ten', 'still', 'so', 'together', 'meanwhile', 'seeming', 'less', 'onto', 'could', 'make', 'anyway', 'whatever', 'would', 'formerly', 'yours', 'about', '’ve', 'themselves', 'thence', 'much', 'there', 'hereby', 'show', 'anywhere', 'whenever', 'across', 'well', 'wherever', 'these', 'namely', '‘s', 'often', 'cannot', 'until', 'whoever', '’d', 'been', 'one', 'herself', 'noone', 'doing', 'each', 'five', 'sometimes', 'almost', 'or', 'other', 'unless', 'how', 'several', 'three', 'though', 'last', 'any', 'four', 'please', '’ll', 'afterwards', 'during', 'all', 'at', 'has', 'who', 'that', 'front', 'myself', 'put', 'serious', 'your', 'towards', 'rather', 'which', 'she', 'among', 'it', 'against', 'after', 'as', 'hence', 'beyond', 'but', 'him', 'really', 'therefore', 'around', "'re", '’s', 'say', 'hers', 'on', 'am', 'anything', '’re', 'six', 'behind', 'for'}
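
Because STOP_WORDS is an ordinary Python set, its print order can vary between runs. A sketch of checking membership directly and viewing a sorted sample instead (the two test words are arbitrary choices):

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Membership tests are more practical than reading the whole set.
is_never_stop = "never" in STOP_WORDS
is_oscar_stop = "oscar" in STOP_WORDS
print(is_never_stop, is_oscar_stop)

# Sorting gives a stable order; show only the first ten entries.
first_ten = sorted(STOP_WORDS)[:10]
print(first_ten)
```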

Stop words can be removed using each token's is_stop attribute.

import spacy

nlp = spacy.load('en_core_web_md')
text = 'Yuh-jung Youn won the Oscar for best supporting actress for her performance in "Minari" on Sunday and made history by becoming the first Korean actor to win an Academy Award.'
doc = nlp(text)

filtered = []
for word in doc:
    if not word.is_stop:
        filtered.append(word)

print(filtered)

This produces the following list of tokens with the stop words removed.

[Yuh, -, jung, Youn, won, Oscar, best, supporting, actress, performance, ", Minari, ", Sunday, history, Korean, actor, win, Academy, Award, .]

Punctuation, special characters, and symbols have to be removed separately.
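
One possible way to drop punctuation as well is to check token.is_punct alongside token.is_stop. Both are lexical attributes, so the sketch below assumes a blank English pipeline (spaCy v3) is enough. Symbols other than punctuation are not handled here.

```python
import spacy

nlp = spacy.blank("en")
text = ('Yuh-jung Youn won the Oscar for best supporting actress for her '
        'performance in "Minari" on Sunday and made history by becoming '
        'the first Korean actor to win an Academy Award.')
doc = nlp(text)

# Keep a token only if it is neither a stop word nor punctuation.
filtered = [token.text for token in doc
            if not token.is_stop and not token.is_punct]
print(filtered)
```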

Dependency Parsing

Let's group related words together by taking the dependency relations between tokens into account.

Noun Chunking

Accessing the noun_chunks property of a Doc extracts noun phrases automatically, based on the dependency graph. (It returns a generator of Span objects.)

import spacy

nlp = spacy.load('en_core_web_md')
text = 'Yuh-jung Youn won the Oscar for best supporting actress for her performance in "Minari" on Sunday and made history by becoming the first Korean actor to win an Academy Award.'
doc = nlp(text)

noun_chunks = doc.noun_chunks

print("=="*40)
str_format = "{:>30}{:>15}{:>15}{:>20}"
print(str_format.format('Text', 'Root Text', 'Root Dep', 'Root Head Text'))
print("=="*40)

for chunk in noun_chunks:
    print(str_format.format(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text))

 

Each chunk carries the following information.

  • Text: The original noun chunk text.
  • Root Text: The original text of the word connecting the noun chunk to the rest of the parse.
  • Root Dep: Dependency relation connecting the root to its head.
  • Root Head Text: The text of the root token's head.

Running the code prints the following table.

================================================================================
                          Text      Root Text       Root Dep      Root Head Text
================================================================================
                 Yuh-jung Youn           Youn          nsubj                 won
                     the Oscar          Oscar           dobj                 won
       best supporting actress        actress           pobj                 for
               her performance    performance           pobj                 for
                        Minari         Minari           pobj                  in
                        Sunday         Sunday           pobj                  on
                       history        history           dobj                made
        the first Korean actor          actor           attr            becoming
              an Academy Award          Award           dobj                 win
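
To see how noun_chunks relies on the parse, the sketch below builds a tiny Doc by hand instead of running a trained model. It assumes the spaCy v3 Doc constructor, which accepts pos, heads (absolute token indices), and deps. The four-word sentence and its annotations are written by hand for illustration and follow the labels in the table above.

```python
import spacy
from spacy.tokens import Doc

# The blank English pipeline supplies the vocab and the English noun-chunk rules.
nlp = spacy.blank("en")

words = ["Youn", "won", "the", "Oscar"]
pos = ["PROPN", "VERB", "DET", "PROPN"]
heads = [1, 1, 3, 1]  # index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

# noun_chunks is a property that yields Span objects lazily.
chunks = [(chunk.text, chunk.root.dep_, chunk.root.head.text)
          for chunk in doc.noun_chunks]
print(chunks)
```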

Navigating Parse Tree

 

The dependency graph (tree) can also be traversed.

import spacy

nlp = spacy.load('en_core_web_md')
text = 'Yuh-jung Youn won the Oscar for best supporting actress for her performance in "Minari" on Sunday and made history by becoming the first Korean actor to win an Academy Award.'
doc = nlp(text)

for tok in doc:
    print(tok.text)
    children = list(tok.children)
    print('children:', children, 'head:', tok.head if tok.head != tok else "!this is root node")
    print("=" * 50)

 

In this case, the word won is the root node, and it has words such as Youn, Oscar, and made as its children.

Yuh
children: [] head: jung
==================================================
-
children: [] head: jung
==================================================
jung
children: [Yuh, -] head: Youn
==================================================
Youn
children: [jung] head: won
==================================================
won
children: [Youn, Oscar, for, on, and, made, .] head: !this is root node
==================================================
the
children: [] head: Oscar
==================================================
Oscar
children: [the] head: won
...
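
Besides head and children, a token also exposes its whole subtree and its ancestors. The sketch below annotates a short Doc by hand rather than running a trained model. It assumes the spaCy v3 Doc constructor with heads given as absolute token indices, and the sentence and labels are illustrative only.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["Youn", "won", "the", "Oscar"]
heads = [1, 1, 3, 1]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# The root is the only token that is its own head.
root = next(token for token in doc if token.head == token)
root_children = [child.text for child in root.children]

oscar = doc[3]
subtree = [token.text for token in oscar.subtree]      # the token plus its descendants
ancestors = [token.text for token in oscar.ancestors]  # heads on the path up to the root
print(root.text, root_children, subtree, ancestors)
```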

Named Entity Recognition (NER)

 

Part-of-speech tags and similar annotations indicate what role each token plays in a sentence, but they do not tell us whether a word refers to a place, a person's name, or a product. spaCy provides Named Entity Recognition for this purpose.

import spacy

nlp = spacy.load('en_core_web_md')
text = 'Yuh-jung Youn won the Oscar for best supporting actress for her performance in "Minari" on Sunday and made history by becoming the first Korean actor to win an Academy Award.'
doc = nlp(text)

print("="*40)
str_format = "{:>20}"*2
print(str_format.format('Text', 'NER'))
print("="*40)
for ent in doc.ents:
    print(str_format.format(ent.text, ent.label_))

 

The entities are extracted as follows.

========================================
                Text                 NER
========================================
       Yuh-jung Youn              PERSON
               Oscar         WORK_OF_ART
              Minari              PERSON
              Sunday                DATE
               first             ORDINAL
              Korean                NORP
    an Academy Award                 ORG
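
The labels come from a statistical model, so they can be wrong: in the output above, Minari, a film title, is tagged PERSON. As one possible remedy for known names, the sketch below uses spaCy's rule-based entity_ruler component in a blank English pipeline (assuming spaCy v3). The single pattern, its WORK_OF_ART label, and the shortened sentence are illustrative choices, not part of the original post.

```python
import spacy

nlp = spacy.blank("en")
# entity_ruler assigns entities from explicit patterns rather than predictions.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "WORK_OF_ART", "pattern": "Minari"}])

doc = nlp('She won the Oscar for her performance in "Minari" on Sunday.')
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

In a trained pipeline, the same component could be added with nlp.add_pipe("entity_ruler", before="ner") so that its entities are set before the statistical ner component runs.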

 

 
