Natural Language Processing in Power BI
It worked: it is relatively easy to convert the code to run within Power BI. The main change is moving away from using print() or exporting data. In some ways things are easier, since you just need to update or create data frames, so 🐼s are key... Pandas.
As a step-by-step guide, the first thing I did was get the code working in Jupyter as proof that it would execute successfully.
This follows on from using spaCy.
When you add a Python script to a Power BI query, it takes the data as it stands at the previous step and converts it to a data frame called "dataset". This means that in the code I had for spaCy the name "df" needed to be changed to "dataset".
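If you would rather not edit every reference in notebook code, a one-line alias at the top maps Power BI's name onto the one the notebook expects. This is a minimal sketch, assuming the notebook code referred to "df"; the sample frame here just stands in for the data Power BI would supply:

```python
import pandas as pd

# stand-in for the data frame Power BI passes into a Python script step
dataset = pd.DataFrame({'id': [1], 'text': ['hello']})

# alias so that notebook code written against "df" runs unchanged
df = dataset
```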
The other thing that needs to change is what you do with the outputs of your code. Previously I was printing out the values to test that everything was working, which does not suit what we need for Power BI. Another common approach in Python is to output to a CSV or summarise in a visual, but again that is not something we want to do when using Python scripts with Power BI queries (Python with visuals is a whole other conversation). The solution is either to update the "dataset" data frame or to create a new data frame that stores your outputs. In the case of NLP we are taking text and splitting it up into component parts as rows, so outputting to a new data frame is the easiest option. The advantage with Power BI is that you can always load your data in as another table and join your Python-powered query outputs back to it, which is really useful.
I created a list called "data" to store the outputs of the for loop, and then a last step creates a new data frame called "new_df". Hopefully the following comparison makes the differences clear. The other thing to mention is that I have added the imports above the code, as you would in Jupyter, but a Power BI step runs like a standard Python script, so all the code needs to be in one step.
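The list-then-new-data-frame pattern, plus the join back to the original table, can be sketched without spaCy at all. In this sketch a simple str.split() stands in for the NLP step, and the merge mirrors the join you would do between tables in Power BI:

```python
import pandas as pd

# stand-in for the "dataset" data frame Power BI supplies
dataset = pd.DataFrame({'id': [1, 2], 'text': ['Power BI rocks', None]})

data = []
for idx, row in dataset.iterrows():
    # skip rows where text is not a string (e.g. blank or missing)
    if not isinstance(row['text'], str):
        continue
    # str.split() stands in for the spaCy tokeniser here
    for token in row['text'].split():
        data.append((row['id'], token))

# one output row per token, keyed by the original row id
new_df = pd.DataFrame(data, columns=['id', 'token_text'])

# joining the outputs back to the source table, as you would in Power BI
joined = new_df.merge(dataset, on='id')
```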
Previous Part of Speech Code
#loop over the rows in the df data frame
for idx, row in df.iterrows():
    #skip the row if the text value is not a string (i.e. blank or missing)
    if not isinstance(row['text'], str):
        continue
    #doc holds the nlp results for the current text value
    doc = nlp(row['text'])
    #loop over each token in the output
    for token in doc:
        #print the row id, token text and part-of-speech tag
        print(row["id"], token.text, token.pos_)
Part of Speech Power BI Code
#pandas helps create the new data frame
import pandas as pd
#spaCy is the NLP tool
import spacy
#load the language model used by the nlp pipeline
nlp = spacy.load("en_core_web_lg")
#create a new list called data to collect the outputs
data = []
#loop over the rows in the dataset data frame
for idx, row in dataset.iterrows():
    #skip the row if the text value is not a string (i.e. blank or missing)
    if not isinstance(row['text'], str):
        continue
    #doc holds the nlp results for the current text value
    doc = nlp(row['text'])
    #loop over each token in the output
    for token in doc:
        #build a tuple of row id, token text, part-of-speech tag and lemma
        row_data = (row["id"], token.text, token.pos_, token.lemma_)
        #append the row data to the data list
        data.append(row_data)
#create a new data frame from the data list with the column headings listed
new_df = pd.DataFrame(data, columns=['id','token_text','token_postag','token_lemma'])
Putting this in Power BI
The source I am using is a CSV on SharePoint, but the advantage of using Power Query here is that it is very easy to pull in data from wherever you want. The data has been processed and converted, including filtering out blanks (even though the code also checks for this).
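The blank filtering I did in Power Query could equally be done in pandas at the top of the script. A minimal sketch, assuming the same "text" column name as the code above; missing values and empty strings are handled separately because dropna() alone leaves empty strings in place:

```python
import pandas as pd

# stand-in for the CSV loaded via Power Query
dataset = pd.DataFrame({'id': [1, 2, 3], 'text': ['hello world', None, '']})

# drop missing values, then drop empty strings, mirroring the Power Query filter
dataset = dataset.dropna(subset=['text'])
dataset = dataset[dataset['text'] != '']
```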
Named Entity Recognition Power BI Code
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_lg")
data = []
for idx, row in dataset.iterrows():
    #skip the row if the text value is not a string
    if not isinstance(row['text'], str):
        continue
    doc = nlp(row['text'])
    #loop over the named entities rather than the tokens
    for e in doc.ents:
        row_data = (row["id"], e.text, e.label_)
        data.append(row_data)
new_df = pd.DataFrame(data, columns=['id','entity_text','entity_label'])





