File format
In short, it is a binary-encoded protobuf message prefixed with its size (a varint). For a longer description, see Length-prefix framing for protocol buffers.
In most languages the file is read as a stream. If you need to open and parse a file that holds length-prefixed protobuf messages in Python, you search the internet and find code that reads the whole file into memory. That is a huge bottleneck. The performance difference between reading the whole file and then parsing it byte by byte, and using BufferedReader, was about 1000x. So enjoy my little piece of code:
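To make the framing concrete, here is a minimal sketch of the writing side in plain Python. The helper names encode_varint and write_delimited are mine, not part of the protobuf library; in real use the payload would be the bytes returned by a message's SerializeToString().

```python
import io

def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)

def write_delimited(stream, payload):
    """Write one frame: the varint-encoded length, then the payload bytes."""
    stream.write(encode_varint(len(payload)))
    stream.write(payload)

# Two frames back to back; placeholder bytes stand in for serialized messages.
buf = io.BytesIO()
write_delimited(buf, b"abc")
write_delimited(buf, b"x" * 300)
framed = buf.getvalue()
```

A payload shorter than 128 bytes gets a one-byte prefix; the 300-byte payload needs two bytes, which is why the reader cannot assume a fixed prefix width.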
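import gzip
from io import BufferedReader
from google.protobuf.internal import decoder

def ReadItm(fname, constructor, size_limit=0):
    ''' Reads and parses length-prefixed protobuf messages from a file.
    The file MUST NOT be corrupted. The parsing is equivalent to parseDelimitedFrom.
    '''
    # Open gzip-compressed files transparently, based on the extension.
    if fname.endswith('.gzip'):
        f = gzip.open(fname, 'rb')
    else:
        f = open(fname, 'rb')
    try:
        reader = BufferedReader(f)
        bytes_read = 0
        # size_limit <= 0 means "read until the end of the file".
        while size_limit <= 0 or bytes_read < size_limit:
            # A varint is at most 10 bytes long; peek does not advance the stream.
            # Note: peek may return fewer bytes than requested.
            buffer = reader.peek(10)
            if len(buffer) == 0:
                break
            # _DecodeVarint is a private protobuf helper: returns (value, new position).
            (size, position) = decoder._DecodeVarint(buffer, 0)
            reader.read(position)  # skip the length prefix
            itm = constructor()
            itm.ParseFromString(reader.read(size))
            bytes_read = bytes_read + position + size
            yield itm
    finally:
        # Close the file even if the caller stops iterating early.
        f.close()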
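The function above relies on peek and on a private protobuf helper. Here is an alternative sketch that decodes the length prefix one byte at a time and needs only the standard library. The names read_varint and read_frames are my own. It yields raw payload bytes, so you would still call ParseFromString on each one. I have not benchmarked this variant, so treat it as an illustration of the framing rather than a drop-in replacement.

```python
import io

def read_varint(stream):
    """Read one base-128 varint; return None at a clean end of stream."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            if shift == 0:
                return None  # no more frames
            raise EOFError("stream ended inside a length prefix")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear: this was the last byte
            return result
        shift += 7

def read_frames(stream):
    """Yield the raw payload bytes of each length-prefixed frame."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("stream ended inside a message")
        yield payload

# Demo on an in-memory stream holding a 3-byte and a 300-byte frame.
demo = io.BytesIO(b"\x03abc" + b"\xac\x02" + b"x" * 300)
frames = list(read_frames(demo))
```

With a real file you would pass the object returned by open(fname, 'rb') or gzip.open(fname, 'rb') instead of the BytesIO, and parse each payload with itm.ParseFromString(payload).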