SpacyDotNet

SpacyDotNet is a .NET wrapper for spaCy, the natural language processing library.

Project scope and limitations

This project is not meant to be a complete and exhaustive implementation of all spaCy features and APIs. Although it should be enough for basic tasks, think of it as a starting point if you need to build a complex project using spaCy in .NET.

Most of the basic features covered in the spaCy 101 section of the docs are available. All container classes are present (Doc, DocBin, Token, Span and Lexeme), with their basic properties and methods working. Vocab and StringStore are also available in a limited form.

Nevertheless, any developer should be able to add the missing properties or classes in a straightforward manner.

Requirements

This project relies on Python.NET to interop with spaCy, which is written in Python/Cython.

It’s been tested under Windows 11 and Ubuntu Linux 20.04, using the following environment:

  • .NET 9.0 / .NET Core 3.1
  • spaCy 3.8.5
  • Python 3.12
  • Python.NET: Latest official NuGet: 3.0.5

but it might work under different conditions:

  • It was previously tested on
  • It should work with spaCy 2.3.5 and any other spaCy version that changes only its minor/patch version number

Python.NET has been tested with Python releases 3.7 to 3.13.
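The version constraints above can be checked from inside the target virtual environment before .NET is involved at all. The following is only a minimal sketch and not part of SpacyDotNet: the helper names `python_supported` and `same_major` are hypothetical, and the bounds are simply the ones quoted in this section.

```python
import sys

# Bounds quoted in this README for Python.NET (an assumption for other releases).
MIN_PYTHON = (3, 7)
MAX_PYTHON = (3, 13)


def python_supported(version=None):
    """Return True if (major, minor) lies within the range quoted above."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_PYTHON <= (major, minor) <= MAX_PYTHON


def same_major(installed, tested):
    """Return True if two dotted version strings share the same major number."""
    return installed.split(".")[0] == tested.split(".")[0]


print("Python in tested range:", python_supported())
```

With spaCy installed, `same_major(spacy.__version__, "3.8.5")` would compare the installed release against the one listed above.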

Setup

1) Create a Python virtual environment and install spaCy

It’s advisable to install spaCy in a virtual environment. How this is done depends on the host system; the official spaCy installation guide covers the details.

To run the examples, we’ll also need to install the corresponding language package (es_core_news_sm) as shown in the guide.
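For readers who prefer to script this step, the sketch below creates the environment with the standard `venv` module and then runs the usual `pip install` and `spacy download` commands. It is an illustration under assumptions rather than the project's own tooling: the function names are hypothetical, and the official guide may recommend extra steps for your platform.

```python
import os
import subprocess
import venv
from pathlib import Path


def venv_python(venv_dir):
    """Path of the interpreter inside a virtual environment."""
    sub = Path("Scripts") / "python.exe" if os.name == "nt" else Path("bin") / "python"
    return Path(venv_dir) / sub


def setup_commands(venv_dir, model="es_core_news_sm"):
    """Commands that install spaCy and one language package into venv_dir."""
    py = str(venv_python(venv_dir))
    return [
        [py, "-m", "pip", "install", "-U", "spacy"],
        [py, "-m", "spacy", "download", model],
    ]


def create_and_install(venv_dir, model="es_core_news_sm"):
    """Create the environment, then run each install command in order."""
    venv.create(venv_dir, with_pip=True)
    for cmd in setup_commands(venv_dir, model):
        subprocess.run(cmd, check=True)  # raises if a command fails
```

Calling `create_and_install("venvSpaCyPy312")` would perform the whole step; nothing runs until that call is made.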

2) Check for Python shared library

Python.NET makes use of Python as a shared library. Unfortunately, it seems that the shared library is not copied by recent versions of virtualenv, and it’s not even distributed with some flavours of Linux/Python >= 3.8.

While I don’t understand the rationale behind those changes, we should check the following:

Windows

Check whether python312.dll is located under the <venv_root>\Scripts folder. If not, go to your main Python folder and copy all DLLs into it. In my case: python3.dll, python312.dll and vcruntime140.dll.

Linux

Check whether a libpython shared object is located under the <venv_root>/bin folder.

If not, we first need to check whether the shared object is present on our system. find_libpython can help with this task.

If the library is nowhere to be found, installing the python-dev package with your distribution’s package manager will likely place the file on your system.

Once we locate the library, copy it to the bin folder. In my case, the file is named libpython3.12.so.1.0
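As a rough alternative to searching by hand, the check can be sketched with the standard library alone. This is an assumption-laden illustration, not what find_libpython does internally: the function names are hypothetical, the search covers only the virtual environment, the base installation and the directories reported by `sysconfig`, and unusual layouts may be missed.

```python
import sys
import sysconfig
from pathlib import Path


def expected_library_name(platform=sys.platform, version=sys.version_info):
    """File name of the Python shared library for a platform and version."""
    major, minor = version[:2]
    if platform.startswith("win"):
        return f"python{major}{minor}.dll"
    if platform == "darwin":
        return f"libpython{major}.{minor}.dylib"
    return f"libpython{major}.{minor}.so"


def candidate_directories():
    """Existing directories to search: the venv first, then the base install."""
    dirs = [Path(sys.prefix) / "Scripts", Path(sys.prefix) / "bin", Path(sys.base_prefix)]
    libdir = sysconfig.get_config_var("LIBDIR")  # None on Windows
    multiarch = sysconfig.get_config_var("MULTIARCH")  # set on some Linux builds
    if libdir:
        dirs.append(Path(libdir))
        if multiarch:
            dirs.append(Path(libdir) / multiarch)
    return [d for d in dirs if d.is_dir()]


def find_shared_library():
    """Return files whose name starts with the expected library name."""
    name = expected_library_name()
    found = set()
    for directory in candidate_directories():
        # Prefix match also catches versioned names such as libpython3.12.so.1.0
        found.update(p for p in directory.iterdir() if p.name.startswith(name))
    return sorted(found)


print(find_shared_library())
```

Run it with the interpreter of the virtual environment; an empty list suggests the library has to be installed or copied as described above.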

Usage

SpacyDotNet is built to be used as a library. However, I provide an example project as a CLI program.

1) Compile and Build

If using the CLI to run .NET (Linux), we should simply browse to the Test/cs folder and compile the project with dotnet build. Under Visual Studio, just load the Test.sln solution.

2) Run the project

The program expects two parameters:

  • interpreter: Name of the Python shared library file. Usually python312.dll on Windows, libpython3.12.so on Linux and libpython3.12.dylib on Mac
  • venv: Location of the virtual environment created with compatible Python and spaCy versions

Run the example with dotnet run --interpreter <name_of_interpreter> --venv <path_to_virtualenv> or, if using Visual Studio, set the command line in Project => Properties => Debug => Application arguments.

In my case:

Linux

dotnet run --interpreter libpython3.12.so.1.0 --venv /home/user/Dev/venvSpaCyPy312

Windows

dotnet run --interpreter python312.dll --venv C:\Users\user\Dev\venvSpaCyPy312
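Before launching the example, it can help to confirm that the two arguments point at something real. The sketch below assumes the layout described in the setup section (library under Scripts on Windows, under bin on Linux); the helper names `library_location` and `preflight` are hypothetical and are not part of SpacyDotNet.

```python
import os
from pathlib import Path


def library_location(venv_dir, interpreter, windows=(os.name == "nt")):
    """Where the shared library is expected inside the virtual environment."""
    folder = "Scripts" if windows else "bin"
    return Path(venv_dir) / folder / interpreter


def preflight(venv_dir, interpreter, windows=(os.name == "nt")):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not Path(venv_dir).is_dir():
        problems.append(f"virtual environment not found: {venv_dir}")
        return problems
    location = library_location(venv_dir, interpreter, windows)
    if not location.is_file():
        problems.append(f"{interpreter} not found under {location.parent}")
    return problems


for problem in preflight("/home/user/Dev/venvSpaCyPy312", "libpython3.12.so.1.0"):
    print("warning:", problem)
```

The path and file name in the last lines are the ones from the Linux example above; substitute your own values.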

Code comparison

I’ve tried to mimic the spaCy API as closely as possible, considering the different nature of the C# and Python languages.

C# SpacyDotNet code

var nlp = spacy.Load("en_core_web_sm");
var doc = nlp.GetDocument("Apple is looking at buying U.K. startup for $1 billion");

foreach (Token token in doc.Tokens)
    Console.WriteLine($"{token.Text} {token.Lemma} {token.PoS} {token.Tag} {token.Dep} {token.Shape} {token.IsAlpha} {token.IsStop}");

Console.WriteLine("");
foreach (Span ent in doc.Ents)
    Console.WriteLine($"{ent.Text} {ent.StartChar} {ent.EndChar} {ent.Label}");

nlp = spacy.Load("en_core_web_md");
var tokens = nlp.GetDocument("dog cat banana afskfsd");

Console.WriteLine("");
foreach (Token token in tokens.Tokens)
    Console.WriteLine($"{token.Text} {token.HasVector} {token.VectorNorm}, {token.IsOov}");

tokens = nlp.GetDocument("dog cat banana");
Console.WriteLine("");
foreach (Token token1 in tokens.Tokens)
{
    foreach (Token token2 in tokens.Tokens)
        Console.WriteLine($"{token1.Text} {token2.Text} {token1.Similarity(token2)}");
}

doc = nlp.GetDocument("I love coffee");
Console.WriteLine("");
Console.WriteLine(doc.Vocab.Strings["coffee"]);
Console.WriteLine(doc.Vocab.Strings[3197928453018144401]);

Console.WriteLine("");
foreach (Token word in doc.Tokens)
{
    var lexeme = doc.Vocab[word.Text];
    Console.WriteLine($@"{lexeme.Text} {lexeme.Orth} {lexeme.Shape} {lexeme.Prefix} {lexeme.Suffix} 
{lexeme.IsAlpha} {lexeme.IsDigit} {lexeme.IsTitle} {lexeme.Lang}");
}

Python spaCy code

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

print("")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

nlp = spacy.load("en_core_web_md")
tokens = nlp("dog cat banana afskfsd")

print("")
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

tokens = nlp("dog cat banana")
print("")
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

doc = nlp("I love coffee")
print("")
print(doc.vocab.strings["coffee"])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'

print("")
for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
            lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)
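Comparing the two listings suggests a simple naming rule: a Python attribute loses its trailing underscore and its snake_case words become PascalCase in C#. The sketch below encodes that rule as inferred from the examples only; `to_csharp_name` and `SPECIAL_CASES` are hypothetical names, and the real library may contain exceptions beyond the one visible here.

```python
# Exception visible in the listings above: pos_ maps to PoS, not Pos.
SPECIAL_CASES = {"pos_": "PoS"}


def to_csharp_name(python_attr):
    """Guess the C# member name that corresponds to a spaCy attribute."""
    if python_attr in SPECIAL_CASES:
        return SPECIAL_CASES[python_attr]
    # Drop the trailing underscore, then join capitalized words.
    words = python_attr.rstrip("_").split("_")
    return "".join(word.capitalize() for word in words)


for attr in ["text", "lemma_", "pos_", "is_alpha", "start_char", "vector_norm", "is_oov"]:
    print(f"{attr:12} -> {to_csharp_name(attr)}")
```

The rule covers attribute names only. Calls that have no direct C# counterpart are spelled differently in the listings: `nlp(text)` becomes `nlp.GetDocument(text)`, and iterating over a document goes through `doc.Tokens`.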

Output

[Image: Output]

Visit original content creator repository https://github.com/AMArostegui/SpacyDotNet
