// Protocol Buffers - Google's data interchange format
// Copyright 2008 Google Inc. All rights reserved.
// https://developers.google.com/protocol-buffers/
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
//     * Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//     * Redistributions in binary form must reproduce the above
// copyright notice, this list of conditions and the following disclaimer
// in the documentation and/or other materials provided with the
// distribution.
//     * Neither the name of Google Inc. nor the names of its
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
// OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
// LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
// DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

// Author: kenton@google.com (Kenton Varda)
//  Based on original Protocol Buffers design by
//  Sanjay Ghemawat, Jeff Dean, and others.
//
// Here we have a hand-written lexer. At first you might ask yourself,
// "Hand-written text processing? Is Kenton crazy?!" Well, first of all,
// yes I am crazy, but that's beside the point. There are actually reasons
// why I ended up writing this this way.
//
// The traditional approach to lexing is to use lex to generate a lexer for
// you. Unfortunately, lex's output is ridiculously ugly and difficult to
// integrate cleanly with C++ code, especially abstract code or code meant
// as a library. Better parser-generators exist but would add dependencies
// which most users won't already have, which we'd like to avoid. (GNU flex
// has a C++ output option, but it's still ridiculously ugly, non-abstract,
// and not library-friendly.)
//
// The next approach that any good software engineer should look at is to
// use regular expressions. And, indeed, I did. I have code which
// implements this same class using regular expressions. It's about 200
// lines shorter. However:
// - Rather than error messages telling you "This string has an invalid
//   escape sequence at line 5, column 45", you get error messages like
//   "Parse error on line 5". Giving more precise errors requires adding
//   a lot of code that ends up basically as complex as the hand-coded
//   version anyway.
// - The regular expression to match a string literal looks like this:
//     kString = new RE("(\"([^\"\\\\]|"              // non-escaped
//                      "\\\\[abfnrtv?\"'\\\\0-7]|"   // normal escape
//                      "\\\\x[0-9a-fA-F])*\"|"       // hex escape
//                      "\'([^\'\\\\]|"               // Also support single-quotes.
//                      "\\\\[abfnrtv?\"'\\\\0-7]|"
//                      "\\\\x[0-9a-fA-F])*\')");
//   Verifying the correctness of this line noise is actually harder than
//   verifying the correctness of ConsumeString(), defined below. I'm not
//   even confident that the above is correct, after staring at it for some
//   time.
// - PCRE is fast, but there's still more overhead involved than the code
//   below.
// - Sadly, regular expressions are not part of the C standard library, so
//   using them would require depending on some other library. For the
//   open source release, this could be really annoying. Nobody likes
//   downloading one piece of software just to find that they need to
//   download something else to make it work, and in all likelihood
//   people downloading Protocol Buffers will already be doing so just
//   to make something else work. We could include a copy of PCRE with
//   our code, but that obligates us to keep it up-to-date and just seems
//   like a big waste just to save 200 lines of code.
//
// On a similar but unrelated note, I'm even scared to use ctype.h.
// Apparently functions like isalpha() are locale-dependent. So, if we used
// that, then if this code is being called from some program that doesn't
// have its locale set to "C", it would behave strangely. We can't just set
// the locale to "C" ourselves since we might break the calling program that
// way, particularly if it is multi-threaded. WTF? Someone please let me
// (Kenton) know if I'm missing something here...
//
// I'd love to hear about other alternatives, though, as this code isn't
// exactly pretty.
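//
// For orientation, a minimal sketch of how this class is typically driven.
// This sketch is not part of the original file; the ArrayInputStream source,
// the StderrErrorCollector subclass, and the exact override signature are
// illustrative assumptions rather than the only way to wire it up:
//
//   class StderrErrorCollector : public io::ErrorCollector {
//    public:
//     // Tokenizer line/column numbers are zero-based, hence the +1 for display.
//     void AddError(int line, int column, const std::string& message) override {
//       std::cerr << (line + 1) << ":" << (column + 1) << ": " << message << "\n";
//     }
//   };
//
//   io::ArrayInputStream input(text.data(), text.size());
//   StderrErrorCollector errors;
//   io::Tokenizer tokenizer(&input, &errors);
//   while (tokenizer.Next()) {
//     const io::Tokenizer::Token& token = tokenizer.current();
//     // token.type, token.text, token.line and token.column describe the
//     // token that was just read.
//   }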
#include <google/protobuf/io/tokenizer.h>

#include <google/protobuf/stubs/common.h>
#include <google/protobuf/stubs/logging.h>
#include <google/protobuf/stubs/stringprintf.h>
#include <google/protobuf/stubs/strutil.h>
#include <google/protobuf/io/strtod.h>
#include <google/protobuf/io/zero_copy_stream.h>
#include <google/protobuf/stubs/stl_util.h>

namespace google {
namespace protobuf {
namespace io {
namespace {

// As mentioned above, I don't trust ctype.h due to the presence of "locales".
// So, I have written replacement functions here. Someone please smack me if
// this is a bad idea or if there is some way around this.
//
// These "character classes" are designed to be used in template methods.
// For instance, Tokenizer::ConsumeZeroOrMore<Whitespace>() will eat
// whitespace.

// Note: No class is allowed to contain '\0', since this is used to mark end-
// of-input and is handled specially.
#define CHARACTER_CLASS(NAME, EXPRESSION)                     \
  class NAME {                                                \
   public:                                                    \
    static inline bool InClass(char c) { return EXPRESSION; } \
  }
CHARACTER_CLASS(Whitespace, c == ' ' || c == '\n' || c == '\t' || c == '\r' ||
                                c == '\v' || c == '\f');

CHARACTER_CLASS(WhitespaceNoNewline,
                c == ' ' || c == '\t' || c == '\r' || c == '\v' || c == '\f');

CHARACTER_CLASS(Unprintable, c < ' ' && c > '\0');

CHARACTER_CLASS(Digit, '0' <= c && c <= '9');
CHARACTER_CLASS(OctalDigit, '0' <= c && c <= '7');
CHARACTER_CLASS(HexDigit, ('0' <= c && c <= '9') || ('a' <= c && c <= 'f') ||
                              ('A' <= c && c <= 'F'));

CHARACTER_CLASS(Letter,
                ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') || (c == '_'));

CHARACTER_CLASS(Alphanumeric, ('a' <= c && c <= 'z') ||
                                  ('A' <= c && c <= 'Z') ||
                                  ('0' <= c && c <= '9') || (c == '_'));

CHARACTER_CLASS(Escape, c == 'a' || c == 'b' || c == 'f' || c == 'n' ||
                            c == 'r' || c == 't' || c == 'v' || c == '\\' ||
                            c == '?' || c == '\'' || c == '\"');
#undef CHARACTER_CLASS

// Given a char, interpret it as a numeric digit and return its value.
// This supports any number base up to 36.
inline int DigitValue(char digit) {
  if ('0' <= digit && digit <= '9') return digit - '0';
  if ('a' <= digit && digit <= 'z') return digit - 'a' + 10;
  if ('A' <= digit && digit <= 'Z') return digit - 'A' + 10;
  return -1;
}
// Inline because it's only used in one place.
inline char TranslateEscape(char c) {
  switch (c) {
    case 'a':
      return '\a';
    case 'b':
      return '\b';
    case 'f':
      return '\f';
    case 'n':
      return '\n';
    case 'r':
      return '\r';
    case 't':
      return '\t';
    case 'v':
      return '\v';
    case '\\':
      return '\\';
    case '?':
      return '\?';  // Trigraphs = :(
    case '\'':
      return '\'';
    case '"':
      return '\"';
    // We expect escape sequences to have been validated separately.
    default:
      return '?';
  }
}

}  // anonymous namespace
ErrorCollector::~ErrorCollector() {}

// ===================================================================

Tokenizer::Tokenizer(ZeroCopyInputStream* input,
                     ErrorCollector* error_collector)
    : input_(input),
      error_collector_(error_collector),
      buffer_(NULL),
      buffer_size_(0),
      buffer_pos_(0),
      read_error_(false),
      line_(0),
      column_(0),
      record_target_(NULL),
      record_start_(-1),
      allow_f_after_float_(false),
      comment_style_(CPP_COMMENT_STYLE),
      require_space_after_number_(true),
      allow_multiline_strings_(false) {
  current_.line = 0;
  current_.column = 0;
  current_.end_column = 0;
  current_.type = TYPE_START;

  Refresh();
}

Tokenizer::~Tokenizer() {
  // If we had any buffer left unread, return it to the underlying stream
  // so that someone else can read it.
  if (buffer_size_ > buffer_pos_) {
    input_->BackUp(buffer_size_ - buffer_pos_);
  }
}
// -------------------------------------------------------------------
// Internal helpers.

void Tokenizer::NextChar() {
  // Update our line and column counters based on the character being
  // consumed.
  if (current_char_ == '\n') {
    ++line_;
    column_ = 0;
  } else if (current_char_ == '\t') {
    column_ += kTabWidth - column_ % kTabWidth;
  } else {
    ++column_;
  }

  // Advance to the next character.
  ++buffer_pos_;
  if (buffer_pos_ < buffer_size_) {
    current_char_ = buffer_[buffer_pos_];
  } else {
    Refresh();
  }
}

void Tokenizer::Refresh() {
  if (read_error_) {
    current_char_ = '\0';
    return;
  }

  // If we're in a token, append the rest of the buffer to it.
  if (record_target_ != NULL && record_start_ < buffer_size_) {
    record_target_->append(buffer_ + record_start_,
                           buffer_size_ - record_start_);
    record_start_ = 0;
  }

  const void* data = NULL;
  buffer_ = NULL;
  buffer_pos_ = 0;
  do {
    if (!input_->Next(&data, &buffer_size_)) {
      // end of stream (or read error)
      buffer_size_ = 0;
      read_error_ = true;
      current_char_ = '\0';
      return;
    }
  } while (buffer_size_ == 0);

  buffer_ = static_cast<const char*>(data);
  current_char_ = buffer_[0];
}

inline void Tokenizer::RecordTo(std::string* target) {
  record_target_ = target;
  record_start_ = buffer_pos_;
}

inline void Tokenizer::StopRecording() {
  // Note: The if() is necessary because some STL implementations crash when
  // you call string::append(NULL, 0), presumably because they are trying to
  // be helpful by detecting the NULL pointer, even though there's nothing
  // wrong with reading zero bytes from NULL.
  if (buffer_pos_ != record_start_) {
    record_target_->append(buffer_ + record_start_,
                           buffer_pos_ - record_start_);
  }
  record_target_ = NULL;
  record_start_ = -1;
}

inline void Tokenizer::StartToken() {
  current_.type = TYPE_START;  // Just for the sake of initializing it.
  current_.text.clear();
  current_.line = line_;
  current_.column = column_;
  RecordTo(&current_.text);
}

inline void Tokenizer::EndToken() {
  StopRecording();
  current_.end_column = column_;
}
// -------------------------------------------------------------------
// Helper methods that consume characters.

template <typename CharacterClass>
inline bool Tokenizer::LookingAt() {
  return CharacterClass::InClass(current_char_);
}

template <typename CharacterClass>
inline bool Tokenizer::TryConsumeOne() {
  if (CharacterClass::InClass(current_char_)) {
    NextChar();
    return true;
  } else {
    return false;
  }
}

inline bool Tokenizer::TryConsume(char c) {
  if (current_char_ == c) {
    NextChar();
    return true;
  } else {
    return false;
  }
}

template <typename CharacterClass>
inline void Tokenizer::ConsumeZeroOrMore() {
  while (CharacterClass::InClass(current_char_)) {
    NextChar();
  }
}

template <typename CharacterClass>
inline void Tokenizer::ConsumeOneOrMore(const char* error) {
  if (!CharacterClass::InClass(current_char_)) {
    AddError(error);
  } else {
    do {
      NextChar();
    } while (CharacterClass::InClass(current_char_));
  }
}
// -------------------------------------------------------------------
// Methods that read whole patterns matching certain kinds of tokens
// or comments.

void Tokenizer::ConsumeString(char delimiter) {
  while (true) {
    switch (current_char_) {
      case '\0':
        AddError("Unexpected end of string.");
        return;

      case '\n': {
        if (!allow_multiline_strings_) {
          AddError("String literals cannot cross line boundaries.");
          return;
        }
        NextChar();
        break;
      }

      case '\\': {
        // An escape sequence.
        NextChar();
        if (TryConsumeOne<Escape>()) {
          // Valid escape sequence.
        } else if (TryConsumeOne<OctalDigit>()) {
          // Possibly followed by two more octal digits, but these will
          // just be consumed by the main loop anyway so we don't need
          // to do so explicitly here.
        } else if (TryConsume('x')) {
          if (!TryConsumeOne<HexDigit>()) {
            AddError("Expected hex digits for escape sequence.");
          }
          // Possibly followed by another hex digit, but again we don't care.
        } else if (TryConsume('u')) {
          if (!TryConsumeOne<HexDigit>() || !TryConsumeOne<HexDigit>() ||
              !TryConsumeOne<HexDigit>() || !TryConsumeOne<HexDigit>()) {
            AddError("Expected four hex digits for \\u escape sequence.");
          }
        } else if (TryConsume('U')) {
          // We expect 8 hex digits; but only the range up to 0x10ffff is
          // legal.
          if (!TryConsume('0') || !TryConsume('0') ||
              !(TryConsume('0') || TryConsume('1')) ||
              !TryConsumeOne<HexDigit>() || !TryConsumeOne<HexDigit>() ||
              !TryConsumeOne<HexDigit>() || !TryConsumeOne<HexDigit>() ||
              !TryConsumeOne<HexDigit>()) {
            AddError(
                "Expected eight hex digits up to 10ffff for \\U escape "
                "sequence");
          }
        } else {
          AddError("Invalid escape sequence in string literal.");
        }
        break;
      }

      default: {
        if (current_char_ == delimiter) {
          NextChar();
          return;
        }
        NextChar();
        break;
      }
    }
  }
}
Tokenizer::TokenType Tokenizer::ConsumeNumber(bool started_with_zero,
                                              bool started_with_dot) {
  bool is_float = false;

  if (started_with_zero && (TryConsume('x') || TryConsume('X'))) {
    // A hex number (started with "0x").
    ConsumeOneOrMore<HexDigit>("\"0x\" must be followed by hex digits.");

  } else if (started_with_zero && LookingAt<Digit>()) {
    // An octal number (had a leading zero).
    ConsumeZeroOrMore<OctalDigit>();
    if (LookingAt<Digit>()) {
      AddError("Numbers starting with leading zero must be in octal.");
      ConsumeZeroOrMore<Digit>();
    }

  } else {
    // A decimal number.
    if (started_with_dot) {
      is_float = true;
      ConsumeZeroOrMore<Digit>();
    } else {
      ConsumeZeroOrMore<Digit>();

      if (TryConsume('.')) {
        is_float = true;
        ConsumeZeroOrMore<Digit>();
      }
    }

    if (TryConsume('e') || TryConsume('E')) {
      is_float = true;
      TryConsume('-') || TryConsume('+');
      ConsumeOneOrMore<Digit>("\"e\" must be followed by exponent.");
    }

    if (allow_f_after_float_ && (TryConsume('f') || TryConsume('F'))) {
      is_float = true;
    }
  }

  if (LookingAt<Letter>() && require_space_after_number_) {
    AddError("Need space between number and identifier.");
  } else if (current_char_ == '.') {
    if (is_float) {
      AddError(
          "Already saw decimal point or exponent; can't have another one.");
    } else {
      AddError("Hex and octal numbers must be integers.");
    }
  }

  return is_float ? TYPE_FLOAT : TYPE_INTEGER;
}
void Tokenizer::ConsumeLineComment(std::string* content) {
  if (content != NULL) RecordTo(content);

  while (current_char_ != '\0' && current_char_ != '\n') {
    NextChar();
  }
  TryConsume('\n');

  if (content != NULL) StopRecording();
}

void Tokenizer::ConsumeBlockComment(std::string* content) {
  int start_line = line_;
  int start_column = column_ - 2;

  if (content != NULL) RecordTo(content);

  while (true) {
    while (current_char_ != '\0' && current_char_ != '*' &&
           current_char_ != '/' && current_char_ != '\n') {
      NextChar();
    }

    if (TryConsume('\n')) {
      if (content != NULL) StopRecording();

      // Consume leading whitespace and asterisk.
      ConsumeZeroOrMore<WhitespaceNoNewline>();
      if (TryConsume('*')) {
        if (TryConsume('/')) {
          // End of comment.
          break;
        }
      }

      if (content != NULL) RecordTo(content);
    } else if (TryConsume('*') && TryConsume('/')) {
      // End of comment.
      if (content != NULL) {
        StopRecording();
        // Strip trailing "*/".
        content->erase(content->size() - 2);
      }
      break;
    } else if (TryConsume('/') && current_char_ == '*') {
      // Note: We didn't consume the '*' because if there is a '/' after it
      // we want to interpret that as the end of the comment.
      AddError(
          "\"/*\" inside block comment. Block comments cannot be nested.");
    } else if (current_char_ == '\0') {
      AddError("End-of-file inside block comment.");
      error_collector_->AddError(start_line, start_column,
                                 " Comment started here.");
      if (content != NULL) StopRecording();
      break;
    }
  }
}

Tokenizer::NextCommentStatus Tokenizer::TryConsumeCommentStart() {
  if (comment_style_ == CPP_COMMENT_STYLE && TryConsume('/')) {
    if (TryConsume('/')) {
      return LINE_COMMENT;
    } else if (TryConsume('*')) {
      return BLOCK_COMMENT;
    } else {
      // Oops, it was just a slash. Return it.
      current_.type = TYPE_SYMBOL;
      current_.text = "/";
      current_.line = line_;
      current_.column = column_ - 1;
      current_.end_column = column_;
      return SLASH_NOT_COMMENT;
    }
  } else if (comment_style_ == SH_COMMENT_STYLE && TryConsume('#')) {
    return LINE_COMMENT;
  } else {
    return NO_COMMENT;
  }
}
// -------------------------------------------------------------------

bool Tokenizer::Next() {
  previous_ = current_;

  while (!read_error_) {
    ConsumeZeroOrMore<Whitespace>();

    switch (TryConsumeCommentStart()) {
      case LINE_COMMENT:
        ConsumeLineComment(NULL);
        continue;
      case BLOCK_COMMENT:
        ConsumeBlockComment(NULL);
        continue;
      case SLASH_NOT_COMMENT:
        return true;
      case NO_COMMENT:
        break;
    }

    // Check for EOF before continuing.
    if (read_error_) break;

    if (LookingAt<Unprintable>() || current_char_ == '\0') {
      AddError("Invalid control characters encountered in text.");
      NextChar();
      // Skip more unprintable characters, too. But, remember that '\0' is
      // also what current_char_ is set to after EOF / read error. We have
      // to be careful not to go into an infinite loop of trying to consume
      // it, so make sure to check read_error_ explicitly before consuming
      // '\0'.
      while (TryConsumeOne<Unprintable>() ||
             (!read_error_ && TryConsume('\0'))) {
        // Ignore.
      }

    } else {
      // Reading some sort of token.
      StartToken();

      if (TryConsumeOne<Letter>()) {
        ConsumeZeroOrMore<Alphanumeric>();
        current_.type = TYPE_IDENTIFIER;
      } else if (TryConsume('0')) {
        current_.type = ConsumeNumber(true, false);
      } else if (TryConsume('.')) {
        // This could be the beginning of a floating-point number, or it could
        // just be a '.' symbol.

        if (TryConsumeOne<Digit>()) {
          // It's a floating-point number.
          if (previous_.type == TYPE_IDENTIFIER &&
              current_.line == previous_.line &&
              current_.column == previous_.end_column) {
            // We don't accept syntax like "blah.123".
            error_collector_->AddError(
                line_, column_ - 2,
                "Need space between identifier and decimal point.");
          }
          current_.type = ConsumeNumber(false, true);
        } else {
          current_.type = TYPE_SYMBOL;
        }
      } else if (TryConsumeOne<Digit>()) {
        current_.type = ConsumeNumber(false, false);
      } else if (TryConsume('\"')) {
        ConsumeString('\"');
        current_.type = TYPE_STRING;
      } else if (TryConsume('\'')) {
        ConsumeString('\'');
        current_.type = TYPE_STRING;
      } else {
        // Check if the high order bit is set.
        if (current_char_ & 0x80) {
          error_collector_->AddError(
              line_, column_,
              StringPrintf("Interpreting non ascii codepoint %d.",
                           static_cast<unsigned char>(current_char_)));
        }
        NextChar();
        current_.type = TYPE_SYMBOL;
      }

      EndToken();
      return true;
    }
  }

  // EOF
  current_.type = TYPE_END;
  current_.text.clear();
  current_.line = line_;
  current_.column = column_;
  current_.end_column = column_;
  return false;
}
namespace {

// Helper class for collecting comments and putting them in the right places.
//
// This basically just buffers the most recent comment until it can be decided
// exactly where that comment should be placed. When Flush() is called, the
// current comment goes into either prev_trailing_comments or detached_comments.
// When the CommentCollector is destroyed, the last buffered comment goes into
// next_leading_comments.
class CommentCollector {
 public:
  CommentCollector(std::string* prev_trailing_comments,
                   std::vector<std::string>* detached_comments,
                   std::string* next_leading_comments)
      : prev_trailing_comments_(prev_trailing_comments),
        detached_comments_(detached_comments),
        next_leading_comments_(next_leading_comments),
        has_comment_(false),
        is_line_comment_(false),
        can_attach_to_prev_(true) {
    if (prev_trailing_comments != NULL) prev_trailing_comments->clear();
    if (detached_comments != NULL) detached_comments->clear();
    if (next_leading_comments != NULL) next_leading_comments->clear();
  }

  ~CommentCollector() {
    // Whatever is in the buffer is a leading comment.
    if (next_leading_comments_ != NULL && has_comment_) {
      comment_buffer_.swap(*next_leading_comments_);
    }
  }

  // About to read a line comment. Get the comment buffer pointer in order to
  // read into it.
  std::string* GetBufferForLineComment() {
    // We want to combine with previous line comments, but not block comments.
    if (has_comment_ && !is_line_comment_) {
      Flush();
    }
    has_comment_ = true;
    is_line_comment_ = true;
    return &comment_buffer_;
  }

  // About to read a block comment. Get the comment buffer pointer in order to
  // read into it.
  std::string* GetBufferForBlockComment() {
    if (has_comment_) {
      Flush();
    }
    has_comment_ = true;
    is_line_comment_ = false;
    return &comment_buffer_;
  }

  void ClearBuffer() {
    comment_buffer_.clear();
    has_comment_ = false;
  }

  // Called once we know that the comment buffer is complete and is *not*
  // connected to the next token.
  void Flush() {
    if (has_comment_) {
      if (can_attach_to_prev_) {
        if (prev_trailing_comments_ != NULL) {
          prev_trailing_comments_->append(comment_buffer_);
        }
        can_attach_to_prev_ = false;
      } else {
        if (detached_comments_ != NULL) {
          detached_comments_->push_back(comment_buffer_);
        }
      }
      ClearBuffer();
    }
  }

  void DetachFromPrev() { can_attach_to_prev_ = false; }

 private:
  std::string* prev_trailing_comments_;
  std::vector<std::string>* detached_comments_;
  std::string* next_leading_comments_;
  std::string comment_buffer_;

  // True if any comments were read into comment_buffer_. This can be true even
  // if comment_buffer_ is empty, namely if the comment was "/**/".
  bool has_comment_;

  // Is the comment in the comment buffer a line comment?
  bool is_line_comment_;

  // Is it still possible that we could be reading a comment attached to the
  // previous token?
  bool can_attach_to_prev_;
};

}  // namespace
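
// A worked example of the attachment rules implemented below. The snippet is
// hypothetical (not taken from this file); it only illustrates how the
// CommentCollector above classifies comments. For the call to
// NextWithComments() that moves past the end of the "foo" line and returns
// the first token of the "bar" line:
//
//   optional int32 foo = 1;  // Trailing comment for foo.
//
//   // Detached comment.
//
//   // Leading comment for bar.
//   optional int32 bar = 2;
//
// "Trailing comment for foo." ends up in prev_trailing_comments (same line as
// the previous token), "Detached comment." ends up in detached_comments
// (separated by blank lines from both sides), and "Leading comment for bar."
// ends up in next_leading_comments (immediately precedes the next token).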
bool Tokenizer::NextWithComments(std::string* prev_trailing_comments,
                                 std::vector<std::string>* detached_comments,
                                 std::string* next_leading_comments) {
  CommentCollector collector(prev_trailing_comments, detached_comments,
                             next_leading_comments);

  if (current_.type == TYPE_START) {
    // Ignore unicode byte order mark (BOM) if it appears at the file
    // beginning. Only UTF-8 BOM (0xEF 0xBB 0xBF) is accepted.
    if (TryConsume((char)0xEF)) {
      if (!TryConsume((char)0xBB) || !TryConsume((char)0xBF)) {
        AddError(
            "Proto file starts with 0xEF but not UTF-8 BOM. "
            "Only UTF-8 is accepted for proto file.");
        return false;
      }
    }
    collector.DetachFromPrev();
  } else {
    // A comment appearing on the same line must be attached to the previous
    // declaration.
    ConsumeZeroOrMore<WhitespaceNoNewline>();
    switch (TryConsumeCommentStart()) {
      case LINE_COMMENT:
        ConsumeLineComment(collector.GetBufferForLineComment());

        // Don't allow comments on subsequent lines to be attached to a
        // trailing comment.
        collector.Flush();
        break;
      case BLOCK_COMMENT:
        ConsumeBlockComment(collector.GetBufferForBlockComment());

        ConsumeZeroOrMore<WhitespaceNoNewline>();
        if (!TryConsume('\n')) {
          // Oops, the next token is on the same line. If we recorded a comment
          // we really have no idea which token it should be attached to.
          collector.ClearBuffer();
          return Next();
        }

        // Don't allow comments on subsequent lines to be attached to a
        // trailing comment.
        collector.Flush();
        break;
      case SLASH_NOT_COMMENT:
        return true;
      case NO_COMMENT:
        if (!TryConsume('\n')) {
          // The next token is on the same line. There are no comments.
          return Next();
        }
        break;
    }
  }

  // OK, we are now on the line *after* the previous token.
  while (true) {
    ConsumeZeroOrMore<WhitespaceNoNewline>();
    switch (TryConsumeCommentStart()) {
      case LINE_COMMENT:
        ConsumeLineComment(collector.GetBufferForLineComment());
        break;
      case BLOCK_COMMENT:
        ConsumeBlockComment(collector.GetBufferForBlockComment());

        // Consume the rest of the line so that we don't interpret it as a
        // blank line the next time around the loop.
        ConsumeZeroOrMore<WhitespaceNoNewline>();
        TryConsume('\n');
        break;
      case SLASH_NOT_COMMENT:
        return true;
      case NO_COMMENT:
        if (TryConsume('\n')) {
          // Completely blank line.
          collector.Flush();
          collector.DetachFromPrev();
        } else {
          bool result = Next();
          if (!result || current_.text == "}" || current_.text == "]" ||
              current_.text == ")") {
            // It looks like we're at the end of a scope. In this case it
            // makes no sense to attach a comment to the following token.
            collector.Flush();
          }
          return result;
        }
        break;
    }
  }
}
// -------------------------------------------------------------------
// Token-parsing helpers. Remember that these don't need to report
// errors since any errors should already have been reported while
// tokenizing. Also, these can assume that whatever text they
// are given is text that the tokenizer actually parsed as a token
// of the given type.

bool Tokenizer::ParseInteger(const std::string& text, uint64 max_value,
                             uint64* output) {
  // Sadly, we can't just use strtoul() since it is only 32-bit and strtoull()
  // is non-standard. I hate the C standard library. :(
  //
  // return strtoull(text.c_str(), NULL, 0);

  const char* ptr = text.c_str();
  int base = 10;
  if (ptr[0] == '0') {
    if (ptr[1] == 'x' || ptr[1] == 'X') {
      // This is hex.
      base = 16;
      ptr += 2;
    } else {
      // This is octal.
      base = 8;
    }
  }

  uint64 result = 0;
  for (; *ptr != '\0'; ptr++) {
    int digit = DigitValue(*ptr);
    if (digit < 0 || digit >= base) {
      // The token provided by the Tokenizer is invalid, i.e. 099 is an invalid
      // token, but the Tokenizer still thinks it is an integer.
      return false;
    }
    if (digit > max_value || result > (max_value - digit) / base) {
      // Overflow.
      return false;
    }
    result = result * base + digit;
  }
  *output = result;
  return true;
}
double Tokenizer::ParseFloat(const std::string& text) {
  const char* start = text.c_str();
  char* end;
  double result = NoLocaleStrtod(start, &end);

  // "1e" is not a valid float, but if the tokenizer reads it, it will
  // report an error but still return it as a valid token. We need to
  // accept anything the tokenizer could possibly return, error or not.
  if (*end == 'e' || *end == 'E') {
    ++end;
    if (*end == '-' || *end == '+') ++end;
  }

  // If the Tokenizer had allow_f_after_float_ enabled, the float may be
  // suffixed with the letter 'f'.
  if (*end == 'f' || *end == 'F') {
    ++end;
  }

  GOOGLE_LOG_IF(DFATAL, end - start != text.size() || *start == '-')
      << " Tokenizer::ParseFloat() passed text that could not have been"
         " tokenized as a float: "
      << CEscape(text);
  return result;
}
// Helper to append a Unicode code point to a string as UTF8, without bringing
// in any external dependencies.
static void AppendUTF8(uint32 code_point, std::string* output) {
  uint32 tmp = 0;
  int len = 0;
  if (code_point <= 0x7f) {
    tmp = code_point;
    len = 1;
  } else if (code_point <= 0x07ff) {
    tmp = 0x0000c080 | ((code_point & 0x07c0) << 2) | (code_point & 0x003f);
    len = 2;
  } else if (code_point <= 0xffff) {
    tmp = 0x00e08080 | ((code_point & 0xf000) << 4) |
          ((code_point & 0x0fc0) << 2) | (code_point & 0x003f);
    len = 3;
  } else if (code_point <= 0x1fffff) {
    tmp = 0xf0808080 | ((code_point & 0x1c0000) << 6) |
          ((code_point & 0x03f000) << 4) | ((code_point & 0x000fc0) << 2) |
          (code_point & 0x003f);
    len = 4;
  } else {
    // UTF-16 is only defined for code points up to 0x10FFFF, and UTF-8 is
    // normally only defined up to there as well.
    StringAppendF(output, "\\U%08x", code_point);
    return;
  }
  tmp = ghtonl(tmp);
  output->append(reinterpret_cast<const char*>(&tmp) + sizeof(tmp) - len, len);
}
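
// A worked check of the two-byte branch above (this example is not in the
// original source): for U+00E9 the code computes
//   tmp = 0x0000c080 | ((0xE9 & 0x07c0) << 2) | (0xE9 & 0x003f)
//       = 0x0000c080 | 0x0300 | 0x0029
//       = 0x0000c3a9,
// and after ghtonl() the last len = 2 bytes appended are 0xC3 0xA9, which is
// exactly the UTF-8 encoding of U+00E9.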
// Try to read <len> hex digits from ptr, and stuff the numeric result into
// *result. Returns true if that many digits were successfully consumed.
static bool ReadHexDigits(const char* ptr, int len, uint32* result) {
  *result = 0;
  if (len == 0) return false;
  for (const char* end = ptr + len; ptr < end; ++ptr) {
    if (*ptr == '\0') return false;
    *result = (*result << 4) + DigitValue(*ptr);
  }
  return true;
}

// Handling UTF-16 surrogate pairs. UTF-16 encodes code points in the range
// 0x10000...0x10ffff as a pair of numbers, a head surrogate followed by a
// trail surrogate. These numbers are in a reserved range of Unicode code
// points, so if we encounter such a pair we know how to parse it and convert
// it into a single code point.
static const uint32 kMinHeadSurrogate = 0xd800;
static const uint32 kMaxHeadSurrogate = 0xdc00;
static const uint32 kMinTrailSurrogate = 0xdc00;
static const uint32 kMaxTrailSurrogate = 0xe000;

static inline bool IsHeadSurrogate(uint32 code_point) {
  return (code_point >= kMinHeadSurrogate) && (code_point < kMaxHeadSurrogate);
}

static inline bool IsTrailSurrogate(uint32 code_point) {
  return (code_point >= kMinTrailSurrogate) &&
         (code_point < kMaxTrailSurrogate);
}

// Combine a head and trail surrogate into a single Unicode code point.
static uint32 AssembleUTF16(uint32 head_surrogate, uint32 trail_surrogate) {
  GOOGLE_DCHECK(IsHeadSurrogate(head_surrogate));
  GOOGLE_DCHECK(IsTrailSurrogate(trail_surrogate));
  return 0x10000 + (((head_surrogate - kMinHeadSurrogate) << 10) |
                    (trail_surrogate - kMinTrailSurrogate));
}
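
// A worked check (not in the original source): the surrogate pair
// 0xD83D, 0xDE00 yields
//   0x10000 + (((0xD83D - 0xD800) << 10) | (0xDE00 - 0xDC00))
//     = 0x10000 + ((0x3D << 10) | 0x200)
//     = 0x10000 + 0xF600
//     = 0x1F600,
// which is the single code point that this pair encodes in UTF-16.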
// Convert the escape sequence parameter to a number of expected hex digits.
static inline int UnicodeLength(char key) {
  if (key == 'u') return 4;
  if (key == 'U') return 8;
  return 0;
}

// Given a pointer to the 'u' or 'U' starting a Unicode escape sequence,
// attempt to parse that sequence. On success, returns a pointer to the first
// char beyond that sequence, and fills in *code_point. On failure, returns
// ptr itself.
static const char* FetchUnicodePoint(const char* ptr, uint32* code_point) {
  const char* p = ptr;
  // Fetch the code point.
  const int len = UnicodeLength(*p++);
  if (!ReadHexDigits(p, len, code_point)) return ptr;
  p += len;

  // Check if the code point we read is a "head surrogate." If so, then we
  // expect it to be immediately followed by another code point which is a
  // valid "trail surrogate," and together they form a UTF-16 pair which
  // decodes into a single Unicode point. Trail surrogates may only use \u,
  // not \U.
  if (IsHeadSurrogate(*code_point) && *p == '\\' && *(p + 1) == 'u') {
    uint32 trail_surrogate;
    if (ReadHexDigits(p + 2, 4, &trail_surrogate) &&
        IsTrailSurrogate(trail_surrogate)) {
      *code_point = AssembleUTF16(*code_point, trail_surrogate);
      p += 6;
    }
    // If this failed, then we just emit the head surrogate as a code point.
    // It's bogus, but so is the string.
  }

  return p;
}
// The text string must begin and end with single or double quote
// characters.
void Tokenizer::ParseStringAppend(const std::string& text,
                                  std::string* output) {
  // Reminder: text[0] is always a quote character. (If text is
  // empty, it's invalid, so we'll just return).
  const size_t text_size = text.size();
  if (text_size == 0) {
    GOOGLE_LOG(DFATAL) << " Tokenizer::ParseStringAppend() passed text that could not"
                          " have been tokenized as a string: "
                       << CEscape(text);
    return;
  }

  // Reserve room for new string. The branch is necessary because if
  // there is already space available the reserve() call might
  // downsize the output.
  const size_t new_len = text_size + output->size();
  if (new_len > output->capacity()) {
    output->reserve(new_len);
  }

  // Loop through the string copying characters to "output" and
  // interpreting escape sequences. Note that any invalid escape
  // sequences or other errors were already reported while tokenizing.
  // In this case we do not need to produce valid results.
  for (const char* ptr = text.c_str() + 1; *ptr != '\0'; ptr++) {
    if (*ptr == '\\' && ptr[1] != '\0') {
      // An escape sequence.
      ++ptr;

      if (OctalDigit::InClass(*ptr)) {
        // An octal escape. May be one, two, or three digits.
        int code = DigitValue(*ptr);
        if (OctalDigit::InClass(ptr[1])) {
          ++ptr;
          code = code * 8 + DigitValue(*ptr);
        }
        if (OctalDigit::InClass(ptr[1])) {
          ++ptr;
          code = code * 8 + DigitValue(*ptr);
        }
        output->push_back(static_cast<char>(code));

      } else if (*ptr == 'x') {
        // A hex escape. May be zero, one, or two digits. (The zero case
        // will have been caught as an error earlier.)
        int code = 0;
        if (HexDigit::InClass(ptr[1])) {
          ++ptr;
          code = DigitValue(*ptr);
        }
        if (HexDigit::InClass(ptr[1])) {
          ++ptr;
          code = code * 16 + DigitValue(*ptr);
        }
        output->push_back(static_cast<char>(code));

      } else if (*ptr == 'u' || *ptr == 'U') {
        uint32 unicode;
        const char* end = FetchUnicodePoint(ptr, &unicode);
        if (end == ptr) {
          // Failure: Just dump out what we saw, don't try to parse it.
          output->push_back(*ptr);
        } else {
          AppendUTF8(unicode, output);
          ptr = end - 1;  // Because we're about to ++ptr.
        }

      } else {
        // Some other escape code.
        output->push_back(TranslateEscape(*ptr));
      }

    } else if (*ptr == text[0] && ptr[1] == '\0') {
      // Ignore final quote matching the starting quote.
    } else {
      output->push_back(*ptr);
    }
  }
}
template <typename CharacterClass>
static bool AllInClass(const std::string& s) {
  for (int i = 0; i < s.size(); ++i) {
    if (!CharacterClass::InClass(s[i])) return false;
  }
  return true;
}

bool Tokenizer::IsIdentifier(const std::string& text) {
  // Mirrors IDENTIFIER definition in Tokenizer::Next() above.
  if (text.size() == 0) return false;
  if (!Letter::InClass(text.at(0))) return false;
  if (!AllInClass<Alphanumeric>(text.substr(1))) return false;
  return true;
}

}  // namespace io
}  // namespace protobuf
}  // namespace google