The data structure for unstructured multimodal data
DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer multimodal data with a Pythonic API.
Read more on why you should use DocArray and how it compares to alternatives.
DocArray was released under the open-source Apache License 2.0 in January 2022. It is currently a sandbox project under LF AI & Data Foundation.
Documentation
Install
Requires Python 3.7+
pip install docarray
or via Conda:
conda install -c conda-forge docarray
Commonly used features can be enabled via pip install "docarray[common]".
Get Started
DocArray consists of three simple concepts:
- Document: a data structure for easily representing nested, unstructured data.
- DocumentArray: a container for efficiently accessing, manipulating, and understanding multiple Documents.
- Dataclass: a high-level API for intuitively representing multimodal data.
Let's see DocArray in action with some examples.
Example 1: represent multimodal data in dataclass
A news article card, for example, can be easily represented via docarray.dataclass and type annotations.
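To give a feel for the pattern, here is a minimal sketch that uses Python's standard dataclasses module as a stand-in. It is not the docarray.dataclass API itself; the ArticleCard class, its field names and the sample values are all hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Dict


@dataclass
class ArticleCard:
    """Hypothetical card mixing three kinds of data in one record."""

    banner: str  # URI of the banner image
    headline: str  # plain text
    meta: Dict[str, str]  # free-form metadata


# Sample values are invented for illustration only.
card = ArticleCard(
    banner='banner.png',
    headline='Flying with pets',
    meta={'author': 'Jane Doe'},
)

# A dataclass can be turned into a nested dict, one key per annotated field.
fields = asdict(card)
```

The point is that each field carries its own type annotation, so one object can hold an image reference, text and free-form metadata side by side.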
Example 2: text matching in 10 lines
Let's search for the top-5 sentences most similar to "she smiled too much" in "Pride and Prejudice".
from docarray import Document, DocumentArray
d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').load_uri_to_text()
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())
da.apply(Document.embed_feature_hashing, backend='process')
q = (
    Document(text='she smiled too much')
    .embed_feature_hashing()
    .match(da, metric='jaccard', use_scipy=True)
)
print(q.matches[:5, ('text', 'scores__jaccard__value')])
[['but she smiled too much.',
'_little_, she might have fancied too _much_.',
'She perfectly remembered everything that had passed in',
'tolerably detached tone. While she spoke, an involuntary glance',
'much as she chooses.”'],
[0.3333333333333333, 0.6666666666666666, 0.7, 0.7272727272727273, 0.75]]
Here the embedding is computed by simple feature hashing and the distance metric is Jaccard distance. You have better embeddings? Of course you do! We look forward to seeing your results!
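For intuition, here is a minimal sketch of feature hashing followed by Jaccard distance in plain Python. It is not DocArray's implementation, so its scores will not reproduce the numbers above; the helper names hashed_buckets and jaccard_distance, the MD5-based hash and the bucket count of 256 are all assumptions made for illustration.

```python
import hashlib
import re


def hashed_buckets(text, n_dim=256):
    """Hash each lowercase word into one of n_dim buckets; return the occupied buckets."""
    tokens = re.findall(r'\w+', text.lower())
    return {
        int(hashlib.md5(tok.encode('utf-8')).hexdigest(), 16) % n_dim
        for tok in tokens
    }


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|: 0.0 for identical sets, 1.0 for disjoint sets."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


query = hashed_buckets('she smiled too much')
candidate = hashed_buckets('but she smiled too much.')
distance = jaccard_distance(query, candidate)
```

Because every word of the query also appears in the candidate, the query's buckets are a subset of the candidate's, and the distance stays small.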
Example 3: external storage for out-of-memory data
When your data is too big, storing it in memory is probably not a good idea. DocArray supports multiple storage backends such as SQLite, Weaviate, Qdrant and ANNLite, all unified under the same user experience and API. Taking the above snippet as an example, you only need to change one line to use SQLite:
da = DocumentArray(
    (Document(text=s.strip()) for s in d.text.split('\n') if s.strip()),
    storage='sqlite',
)
The rest of the snippet still runs as-is. All APIs remain the same, and the code that follows now runs in an "in-database" manner.
Besides saving memory, storage backends also give you persistence and faster retrieval (e.g. on nearest-neighbour queries).
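To make "in-database" a bit more concrete, here is a rough sketch using Python's built-in sqlite3 module. It only illustrates the general idea of keeping records in a database and fetching them on demand, and says nothing about how DocArray's SQLite backend is actually implemented. The docs table, the get_text helper and the sample lines are hypothetical.

```python
import sqlite3

# ':memory:' keeps the demo self-contained; pass a file path for real persistence.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT NOT NULL)')

lines = ['It is a truth universally acknowledged,', '', '  that a single man  ']

# Stream the non-empty lines into the table instead of building a list in memory.
with conn:
    conn.executemany(
        'INSERT INTO docs (text) VALUES (?)',
        ((s.strip(),) for s in lines if s.strip()),
    )


def get_text(doc_id):
    """Fetch a single record by id; only that row is loaded into memory."""
    row = conn.execute('SELECT text FROM docs WHERE id = ?', (doc_id,)).fetchone()
    return None if row is None else row[0]


count = conn.execute('SELECT COUNT(*) FROM docs').fetchone()[0]
```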
Example 4: a complete workflow of visual search
Let's use DocArray and the Totally Looks Like dataset to build a simple meme image search. The dataset contains 6,016 image pairs stored in /left and /right. Images that share the same filename are perceptually similar. For example:
left/00018.jpg pairs with right/00018.jpg, and left/00131.jpg pairs with right/00131.jpg.
Our problem is: given an image from /left, can we find its most similar image in /right (without looking at the filename, of course)?
Load images
First we load the images. You can download the dataset from the Totally Looks Like website, unzip it and load the images as below:
from docarray import DocumentArray
left_da = DocumentArray.from_files('left/*.jpg')
Or you can simply pull it from Jina Cloud:
left_da = DocumentArray.pull('demo-leftda', show_progress=True)
Note: If you have more than 15GB of RAM and want to try using the whole dataset instead of just the first 1000 images, remove [:1000] when loading the files into the DocumentArrays left_da and right_da.
You will see a running progress bar to indicate the downloading process.
To get a feel for the data you will handle, plot it as one sprite image. You will need matplotlib and torch installed to run this snippet:
left_da.plot_image_sprites()
Apply preprocessing
Let's do some standard computer vision pre-processing:
from docarray import Document
def preproc(d: Document):
    return (
        d.load_uri_to_image_tensor()  # load
        .set_image_tensor_normalization()  # normalize color
        .set_image_tensor_channel_axis(-1, 0)  # switch color axis for the PyTorch model later
    )
left_da.apply(preproc)
Did I mention that apply works in parallel?
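If you are curious what the normalization and axis switch amount to, here is a minimal plain-Python sketch on a tiny nested-list image. The ImageNet mean and standard deviation used as constants are an assumption, as are the helper names normalize and channels_first; real code would work on arrays rather than lists.

```python
# ImageNet channel statistics, assumed here as the normalization constants.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize(image):
    """Scale 0-255 pixels to [0, 1], then standardize each channel (image is H x W x C)."""
    return [
        [[(v / 255 - MEAN[c]) / STD[c] for c, v in enumerate(pixel)] for pixel in row]
        for row in image
    ]


def channels_first(image):
    """Rearrange an H x W x C image into C x H x W, the layout PyTorch models expect."""
    n_channels = len(image[0][0])
    return [[[pixel[c] for pixel in row] for row in image] for c in range(n_channels)]


image = [[[255, 0, 0], [0, 255, 0]]]  # hypothetical 1 x 2 RGB image
chw = channels_first(normalize(image))  # now 3 x 1 x 2
```

Undoing the preprocessing later is the same two steps in reverse: move the channel axis back to the end, then multiply by the standard deviation and add the mean.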
Embed images
Now convert images into embeddings using a pretrained ResNet50:
import torchvision
model = torchvision.models.resnet50(pretrained=True) # load ResNet50
left_da.embed(model, device='cuda') # embed via GPU to speed up
This step takes ~30 seconds on GPU. Besides PyTorch, you can also use TensorFlow, PaddlePaddle, or ONNX models in .embed(...).
Visualize embeddings
You can visualize the embeddings via t-SNE in an interactive embedding projector. You will need pydantic, uvicorn and fastapi installed to run this snippet:
left_da.plot_embeddings(image_sprites=True)
Fun is fun, but recall our goal is to match left images against right images and so far we have only handled the left. Let's repeat the same procedure for the right:
Pull from Cloud:
right_da = (
    DocumentArray.pull('demo-rightda', show_progress=True)
    .apply(preproc)
    .embed(model, device='cuda')[:1000]
)
Or download, unzip and load from local files:
right_da = (
    DocumentArray.from_files('right/*.jpg')[:1000]
    .apply(preproc)
    .embed(model, device='cuda')
)
Match nearest neighbours
We can now match the left to the right and take the top-9 results.
left_da.match(right_da, limit=9)
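Conceptually, matching ranks every right embedding by its distance to a left embedding and keeps the closest few. Here is a minimal brute-force sketch in plain Python, assuming cosine distance and non-zero vectors; the helper names cosine_distance and top_k and the toy 2-D vectors are made up for illustration, and this is not how DocArray computes matches internally.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def top_k(query, candidates, k):
    """Return (index, distance) pairs for the k candidates closest to query."""
    scored = [(i, cosine_distance(query, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy 2-D "embeddings"; real ResNet50 embeddings have many more dimensions.
right = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nearest = top_k([2.0, 0.0], right, k=2)
```

A smaller cosine distance means a closer match, so the first pair in nearest is the best candidate.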
Let's inspect the matches of the first Document in left_da:
d = left_da[0]
for m in d.matches:
    print(d.uri, m.uri, m.scores['cosine'].value)
left/02262.jpg right/03459.jpg 0.21102
left/02262.jpg right/02964.jpg 0.13871843
left/02262.jpg right/02103.jpg 0.18265384
left/02262.jpg right/04520.jpg 0.16477376
...
Or shorten the loop to a one-liner using the element and attribute selector:
print(left_da['@m', ('uri', 'scores__cosine__value')])
Better yet, let's see the matches:
(
    DocumentArray(left_da[8].matches, copy=True)
    .apply(
        lambda d: d.set_image_tensor_channel_axis(
            0, -1
        ).set_image_tensor_inv_normalization()
    )
    .plot_image_sprites()
)
What we did here is revert the preprocessing steps (i.e. switching axis and normalizing) on the copied matches, so that you can visualize them using image sprites.
Quantitative evaluation
Serious as you are, visual inspection is surely not enough. Let's calculate the recall@K. First we construct the groundtruth matches:
groundtruth = DocumentArray(
    Document(uri=d.uri, matches=[Document(uri=d.uri.replace('left', 'right'))])
    for d in left_da
)
Here we create a new DocumentArray with the real matches by simply replacing left with right in each uri, e.g. left/00001.jpg becomes right/00001.jpg. That's all we need: if the predicted match has the same uri as the groundtruth match, then it is correct.
Now let's check the recall rate for K from 1 to 5 over the full dataset:
for k in range(1, 6):
    print(
        f'recall@{k}',
        left_da.evaluate(
            groundtruth, hash_fn=lambda d: d.uri, metric='recall_at_k', k=k, max_rel=1
        ),
    )
recall@1 0.02726063829787234
recall@2 0.03873005319148936
recall@3 0.04670877659574468
recall@4 0.052194148936170214
recall@5 0.0573470744680851
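As a rough sketch of what recall@K measures in this setup, where each left image has exactly one relevant right image, the plain-Python function below counts how many relevant URIs show up in the first k predictions. The function names, the sample URIs and the predictions are made up for illustration, and this is not DocArray's evaluate implementation.

```python
def recall_at_k(predicted_uris, relevant_uris, k):
    """Fraction of the relevant URIs that appear among the first k predictions."""
    if not relevant_uris:
        return 0.0
    hits = len(set(predicted_uris[:k]) & relevant_uris)
    return hits / len(relevant_uris)


# Hypothetical ranked matches per left image (best match first).
predictions = {
    'left/00001.jpg': ['right/00007.jpg', 'right/00001.jpg', 'right/00003.jpg'],
    'left/00002.jpg': ['right/00009.jpg', 'right/00004.jpg', 'right/00005.jpg'],
}


def mean_recall(k):
    """Average recall@k over all left images; the groundtruth is the same filename under right/."""
    scores = [
        recall_at_k(matches, {uri.replace('left', 'right')}, k)
        for uri, matches in predictions.items()
    ]
    return sum(scores) / len(scores)
```

With these sample predictions, only the first left image ever finds its partner, and only from rank 2 onwards, so the average recall is 0.0 at k=1 and 0.5 at k=2 and k=3.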
More metrics can be used, such as precision_at_k, ndcg_at_k and hit_at_k.
If you think a pretrained ResNet50 is good enough, let me tell you that with Finetuner you could do much better with just 10 extra lines of code. Here is how.
Save results
You can save a DocumentArray to binary, JSON, dict, DataFrame, CSV or Protobuf message with/without compression. In its simplest form,
left_da.save('left_da.bin')
To reuse it, do left_da = DocumentArray.load('left_da.bin').
If you want to transfer a DocumentArray from one machine to another or share it with your colleagues, you can do:
left_da.push('my_shared_da')
Now anyone who knows the token my_shared_da can pull it and work on it:
left_da = DocumentArray.pull('my_shared_da')
Intrigued? That's only scratching the surface of what DocArray is capable of. Read our docs to learn more.
Support
- Join our Slack community and chat with other community members about ideas.
DocArray is a trademark of LF AI Projects, LLC