Overview
Since my recent work involves SyntaxNet, I'm starting this post to record the learning process. :)
GitHub repository
Continuing the study
Since I need to keep working with SyntaxNet, I'm continuing to update this post.
The model code that took me quite a while to find: 171 qrxia@amax:~/TensorFlow/models/syntaxnet/
Task 1: figure out how Google's two hidden layers are implemented
If pre-trained embeddings are available, they are used; otherwise embeddings are randomly initialized with stddev 1/sqrt(embedding_size). embedding_size: [64, 32, 32]
ReLU weight initialization: roughly -1e-4 to +1e-4, normal distribution
ReLU bias initialization: roughly -0.2 to +0.2, normal distribution
softmax (final layer) weight initialization: 1e-4, normal distribution
softmax bias initialization: 0!
```python
relu_init=1e-4,  # weight initialization, slightly different
```
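Putting the notes above together, here is a minimal sketch of this initialization scheme in (old-style) TensorFlow. This is my own paraphrase, not the actual graph_builder.py code, and the function names are made up:

```python
import math
import tensorflow as tf

def embedding_matrix(vocab_size, dim, pretrained=None):
    # Pre-trained embeddings win; otherwise random normal with stddev 1/sqrt(dim).
    if pretrained is not None:
        return tf.Variable(pretrained)
    return tf.Variable(tf.random_normal([vocab_size, dim], stddev=1.0 / math.sqrt(dim)))

def relu_layer(x, in_dim, out_dim, relu_init=1e-4, bias_init=0.2):
    # Hidden layer: weights and biases both drawn from narrow normal distributions.
    w = tf.Variable(tf.random_normal([in_dim, out_dim], stddev=relu_init))
    b = tf.Variable(tf.random_normal([out_dim], stddev=bias_init))
    return tf.nn.relu(tf.matmul(x, w) + b)

def softmax_layer(x, in_dim, num_actions):
    # Final layer: tiny random weights, biases start at exactly 0.
    w = tf.Variable(tf.random_normal([in_dim, num_actions], stddev=1e-4))
    b = tf.Variable(tf.zeros([num_actions]))
    return tf.matmul(x, w) + b  # logits
```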
Task 2: the execution flow
Skip the train-pos and train-local parts and look directly at train-global.
Project entry point: bazel-bin/syntaxnet/parser_eval (a Python program, just without the .py suffix)
- Execution starts from the Main() function
bazel-bin/syntaxnet/parser_eval.runfiles/ # there are more files inside this one?
bazel-bin/syntaxnet/parser_trainer.runfiles/__main__/syntaxnet # location of the parser trainer
- function Train() # entry point of the Train function
parser = structured_graph_builder.StructuredGraphBuilder
_beam_size = 10
_max_steps = 25
_AddLearningRate(…) # Returns a learning rate that decays by 0.96 every decay_steps.
learning_rate = 0.1, decay_steps = 4000
decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
- bazel-bin/syntaxnet/parser_trainer.runfiles/__main__/syntaxnet/ops/gen_parser_ops.py (machine generated)
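Back to the learning rate: a minimal sketch of the decay above using tf.train.exponential_decay (the staircase flag here is my assumption):

```python
import tensorflow as tf

global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(
    learning_rate=0.1,    # initial rate from the notes above
    global_step=global_step,
    decay_steps=4000,
    decay_rate=0.96,
    staircase=True)
# i.e. decayed_learning_rate = 0.1 * 0.96 ** (global_step / 4000)
```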
The beam parse reader is implemented in C++, but it appears in gen_parser_ops in the form of a Python function. How features, state, etc. are obtained through the beam parse reader is still unclear to me (skipping the C++ part for now)!
Task 3: relevant API manuals?
_op_def_lib.apply_op(…) # Python calling into C++ code?
_op_def_lib = _InitOpDefLibrary() # line 468; comes from the TensorFlow core: tensorflow.core.framework import op_def_pb2
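The machine-generated wrappers in gen_parser_ops.py follow roughly this pattern (schematic only; "SomeReader" and its parameter are hypothetical, the real op names and signatures are in the generated file):

```python
def some_reader(task_context, name=None):
    # apply_op looks up the registered C++ kernel by op name and
    # wires its inputs/attrs into the TensorFlow graph.
    return _op_def_lib.apply_op("SomeReader", task_context=task_context, name=name)
```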
```python
tf.constant(value, dtype=None, shape=None, name='Const')  # Creates a constant tensor.
tf.train.exponential_decay(...)  # Applies exponential decay to the learning rate. global step? @kiro
tf.logical_and(x, y, name=None)  # Returns the truth value of x AND y element-wise.
tf.reduce_any(input_tensor, ...)  # Computes the "logical or" of elements across dimensions of a tensor.
tf.while_loop(cond, body, loop_vars, ...)  # Repeat body while the condition cond is true.
tf.nn.softmax_cross_entropy_with_logits(logits, labels, ...)  # Computes softmax cross entropy between logits and labels.
tf.train.MomentumOptimizer  # class; Optimizer that implements the Momentum algorithm.
tf.train.Optimizer.get_slot(var, name)  # Return a slot named `name` created for `var` by the Optimizer.
tf.reduce_sum(input_tensor, axis=None)  # Computes the sum of elements across dimensions of a tensor. Equivalent to np.sum.
tf.div(x, y, name=None)  # Returns x / y element-wise.
tf.nn.l2_loss(t, name=None)  # Computes half the L2 norm of a tensor without the sqrt: output = sum(t ** 2) / 2
tf.add_n(inputs, name=None)  # Adds all input tensors element-wise.
```
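Several of these show up together in the beam training loop. A toy, self-contained example of tf.while_loop plus tf.logical_and (unrelated to the SyntaxNet graph itself):

```python
import tensorflow as tf

i = tf.constant(0)
total = tf.constant(0)

def cond(i, total):
    # keep looping while i < 10 AND total < 100
    return tf.logical_and(tf.less(i, 10), tf.less(total, 100))

def body(i, total):
    return i + 1, total + i

final_i, final_total = tf.while_loop(cond, body, [i, total])
with tf.Session() as sess:
    print(sess.run([final_i, final_total]))  # [10, 45]
```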
Key point: cross entropy
1. tf.nn.softmax_cross_entropy_with_logits
Roughly equivalent to applying softmax first and then cross entropy.
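A quick numerical sanity check of that equivalence (the fused op is the numerically stabler one):

```python
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])
labels = tf.constant([[1.0, 0.0, 0.0]])  # one-hot gold label

fused = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels)
manual = -tf.reduce_sum(labels * tf.log(tf.nn.softmax(logits)), axis=1)

with tf.Session() as sess:
    print(sess.run([fused, manual]))  # both roughly [0.417]
```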
Processing pipeline
Some important excerpts:
break the text into words, run the POS tagger, run the parser, and then generate an ASCII version of the parse tree.
- Training the SyntaxNet POS Tagger
We process the sentences left-to-right. For any given word, we extract features of that word and a window around it, and use these as inputs to a feed-forward neural network classifier, which predicts a probability distribution over POS tags. Because we make decisions in left-to-right order, we also use prior decisions as features in subsequent ones.
run the trained model over our training, tuning, and dev (evaluation) sets.
- Local Pretraining
- Global Training
Training the model
A few things to note:
- The context file is missing the char-map
- Comments must be removed, otherwise the shell script won't run
- POS tags come from column 4 of the CoNLL format
Data used:
- training-corpus: /home/qrxia/data/ptb-data-wsj/wsj_02_21.train.conll07
- tuning-corpus: /home/qrxia/data/ptb-data-wsj/wsj_24.dev.conll07
- dev-corpus: /home/qrxia/data/ptb-data-wsj/wsj_22.dev.conll07cp3to4.conll
To check GPU usage, refreshing the display every 10 s:

```bash
watch -n 10 nvidia-smi
```
Following the tutorial on GitHub, training a parsing model with SyntaxNet takes three steps:
1. Train a POS tagger
Following the tutorial's instructions, it is very easy to train a POS tagger. Evaluating it on the three corpora above:

| training | tuning | dev |
| --- | --- | --- |
| 98.25% | 96.84% | 96.74% |
2. Train a local model, to be used for pre-training
The eval metrics:

| training | tuning | dev |
| --- | --- | --- |
| 95.32% | 90.01% | 91.54% |
3. Train a global model
The eval metrics:

| training | tuning | dev |
| --- | --- | --- |
| 95.44% | 91.03% | 92.67% |
Code reading:
At first glance, the main SyntaxNet code is concentrated in models/syntaxnet/syntaxnet.
BUILD: presumably specifies how Bazel should compile the files
PS: while reading the code, I'll jot down some TensorFlow syntax along the way :)
parser_trainer.py
Mainly covers the command-line flags and some default configuration.

```python
tf.app.flags  # arg parser, which implements a subset of the functionality in python-gflags
```
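A minimal sketch of the tf.app.flags pattern (the flag names here are illustrative, not the actual parser_trainer.py flags):

```python
import tensorflow as tf

flags = tf.app.flags
FLAGS = flags.FLAGS

flags.DEFINE_string('task_context', '', 'Path to a task context file.')   # illustrative
flags.DEFINE_integer('batch_size', 32, 'Number of sentences per batch.')  # illustrative

def main(unused_argv):
    print(FLAGS.task_context, FLAGS.batch_size)

if __name__ == '__main__':
    tf.app.run()
```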
main():
1. compute_lexicon (default: false), and load the lexicon using "FeatureSize"
projectivize_training_set (default: false)
2. Train
lexicon_builder.cc
Classes included: LexiconBuilder, FeatureSize
TermFrequencyMaps to extract: words, lcwords, tags, categories, labels, chars
Other resources to extract: prefixes, suffixes, tag_to_category
embedding_feature_extractor.*
Class: ParserEmbeddingFeatureExtractor
Files related to feature extraction.
feature_extractor.*
Generic feature extractor for extracting features from objects.
term_frequency_map.*
A mapping from strings to frequencies with save and load functionality.
Class: TermFrequencyMap, TagToCategoryMap
TagToCategoryMap: judging from the output file, tags and categories correspond one-to-one?
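A toy Python analogue of TermFrequencyMap (the real class is C++, and the file format below is a guess rather than the actual on-disk format):

```python
from collections import Counter

class ToyTermFrequencyMap(object):
    """A mapping from strings to frequencies, with save/load."""

    def __init__(self):
        self.freq = Counter()

    def increment(self, term):
        self.freq[term] += 1

    def save(self, path):
        # one "term count" pair per line, most frequent first
        with open(path, 'w') as f:
            for term, count in self.freq.most_common():
                f.write('%s %d\n' % (term, count))

    def load(self, path):
        with open(path) as f:
            for line in f:
                term, count = line.rsplit(' ', 1)
                self.freq[term] = int(count)
```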
parser_transitions.*
Nothing?
sentence.proto
A Sentence consists of a docid, text, and tokens, with a maximum length of 1000.
token: word, start, end, head, tag, category, label
proto_io.h
Class: TextReader()
document_format.*
Class: DocumentFormat - a document format component converts a key/value pair from a record to one or more documents
key/value pair? Nothing?
graph_builder.py
Builds parser models.
Class: GreedyParser
```python
tf.name_scope()  # groups ops under a common name prefix (variable sharing proper is tf.variable_scope)
```
```python
dict.update(dict2)  # update() adds dict2's key/value pairs to dict (plain Python)
```
feature_endpoints is a series of arrays like the one below (presumably serialized protos; cf. unpack_sparse_features.cc further down). In the local case, feature_endpoints has shape=(?,) and dtype=int32:

```
['\x08\x8a\x06' '\x08\x8f\x01' '\x08\xf7\r' '\x08\xe0\x11'
```
Interesting: syntaxnet/bazel-syntaxnet/bazel-out/local-opt/genfiles/syntaxnet/ops/gen_parser_ops.py
reader_ops.cc
Class: GoldParseReader
```cpp
OP_REQUIRES_OK(context, status)  // To check whether the Status returned by a function is an error,
                                 // use OP_REQUIRES_OK; if the macro detects an error, it returns from
                                 // the enclosing function immediately, aborting its execution.
```
sentence_batch.*
Helper class to manage generating batches of preprocessed ParserState objects by reading in multiple sentences in parallel.
parser_state.*
Parser state for the transition-based dependency parser.
affix.*
Class: Affix, AffixTable
affix: a word prefix or suffix
text_formats.cc
Defines the CoNLL file format.
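For reference, this is how the CoNLL columns line up with the Token fields above; a toy parser for one token line, not the actual text_formats.cc logic (column layout per CoNLL-X):

```python
def parse_conll_line(line):
    # CoNLL-X columns (1-indexed): 1 ID, 2 FORM, 3 LEMMA, 4 CPOSTAG,
    # 5 POSTAG, 6 FEATS, 7 HEAD, 8 DEPREL
    cols = line.rstrip('\n').split('\t')
    return {
        'word':  cols[1],
        'tag':   cols[3],       # column 4, as noted in the training section above
        'head':  int(cols[6]),  # 0 denotes the root
        'label': cols[7],
    }

print(parse_conll_line('1\tThe\tthe\tDT\tDT\t_\t2\tNMOD'))
```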
structured_graph_builder.py
Builds structured parser models.

```python
tf.NoGradient(op_type)  # Specifies that ops of type op_type do not have a defined gradient.
```
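Presumably this is how structured_graph_builder.py marks the custom beam ops as non-differentiable; a schematic usage (the op type name is hypothetical):

```python
import tensorflow as tf

tf.NoGradient('SomeBeamOp')  # hypothetical op type; no gradient will be defined for it
```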
parser_transitions.*
Transition system for the transition-based dependency parser.
Questions:
Q1: Why is num_actions = 45 when training POS, rather than 12?
Because the FeatureSize code contains this very telling line:

```cpp
num_actions->scalar<int32>()() = transition_system->NumActions(label_map_->Size());
```
unpack_sparse_features.cc
Operator to unpack ids and weights stored in SparseFeatures proto.
Q2: A big discovery
Some .h files I didn't know about before turn out (found via egrep) to live under the bazel-genfiles/syntaxnet/ directory.